judge_answer___29_deberta_v3_base_msmarco_answerability

This model is a fine-tuned version of microsoft/deberta-v3-base on tom-010/msmarcov2.1-binary-answerability. The dataset is heavily imbalanced (only 6% positives). The training notebook addressed this by downsampling the negative examples to a 1-to-1 ratio with the positives.
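
The notebook itself is not included here, but a minimal sketch of such 1-to-1 downsampling with the `datasets` library could look like the following (the label column name is an assumption):

```python
from datasets import load_dataset, concatenate_datasets

# Hedged sketch of 1-to-1 negative downsampling; the actual notebook may
# differ, and the column name "label" is an assumption.
ds = load_dataset("tom-010/msmarcov2.1-binary-answerability", split="train")

positives = ds.filter(lambda x: x["label"] == 1)
negatives = ds.filter(lambda x: x["label"] == 0)

# Keep only as many (shuffled) negatives as there are positives.
negatives = negatives.shuffle(seed=42).select(range(len(positives)))

balanced = concatenate_datasets([positives, negatives]).shuffle(seed=42)
```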

It achieves the following results on the evaluation set:

  • Loss: 0.4194
  • Accuracy: 0.8164
  • Precision: 0.7814
  • Recall: 0.8815
  • F1: 0.8284

See the run here: https://wandb.ai/stadeltom-com/huggingface/runs/l5mt601p?nw=nwuserstadeltom

Model description

The model is a fine-tuned DeBERTa v3 that classifies whether a question/query is answered by a given text (passage).
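
A minimal inference sketch (the sentence-pair input format and the label names are assumptions, since the model card does not spell them out):

```python
from transformers import pipeline

# Hedged sketch: assumes standard (question, passage) sentence-pair input.
clf = pipeline(
    "text-classification",
    model="tom-010/judge_answer___29_deberta_v3_base_msmarco_answerability",
)

result = clf({
    "text": "who wrote faust",
    "text_pair": "Faust is a tragic play written by Johann Wolfgang von Goethe.",
})
print(result)  # e.g. [{'label': 'LABEL_1', 'score': ...}]
```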

Intended uses & limitations

The task is to judge whether a text answers a question. The dataset is derived from MS MARCO v2, where each query comes with 10 search results from the Bing search engine. An annotator answered the question and marked the passages (search results) they used for the answer. The dataset iterates over each passage of each query and records the query, the passage, and whether the passage was used for the answer. The downside: false negatives are entirely possible, since a passage may answer the question even though the annotator did not use it. The upside: it is a realistic setting, as in practice we also get 10 search results and need to filter them. However, the baseline performance on this data is unknown.
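
For that filtering scenario, a hedged sketch of scoring several search results against one query (the positive label name "LABEL_1" is an assumption):

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="tom-010/judge_answer___29_deberta_v3_base_msmarco_answerability",
)

query = "how tall is the eiffel tower"
# Stand-ins for the ~10 search-result passages of a query.
passages = [
    "The Eiffel Tower is 330 metres (1,083 ft) tall.",
    "Paris is the capital and most populous city of France.",
]

scores = clf([{"text": query, "text_pair": p} for p in passages])
# Keep only the passages judged as answering the query.
answering = [p for p, s in zip(passages, scores) if s["label"] == "LABEL_1"]
print(answering)
```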

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a hedged TrainingArguments sketch follows the list):

  • learning_rate: 3e-05
  • train_batch_size: 16
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 1
  • mixed_precision_training: Native AMP
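
As a sketch, these settings map onto roughly the following TrainingArguments (output_dir is an assumption, not taken from the original run; the Adam betas and epsilon are also the Trainer defaults):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="judge_answer_deberta_v3_base",  # assumed, not from the run
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=1,
    fp16=True,  # "Native AMP" mixed precision
)
```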

Training results

| Training Loss | Epoch  | Step  | Validation Loss | Accuracy | Precision | Recall | F1     |
|:-------------:|:------:|:-----:|:---------------:|:--------:|:---------:|:------:|:------:|
| 0.5008        | 0.0272 | 2000  | 0.4931          | 0.7864   | 0.7498    | 0.8632 | 0.8025 |
| 0.4832        | 0.0544 | 4000  | 0.4565          | 0.7858   | 0.7422    | 0.8795 | 0.8050 |
| 0.4716        | 0.0816 | 6000  | 0.4758          | 0.7926   | 0.7527    | 0.8751 | 0.8093 |
| 0.4645        | 0.1088 | 8000  | 0.4740          | 0.7878   | 0.7633    | 0.8377 | 0.7988 |
| 0.4697        | 0.1360 | 10000 | 0.4519          | 0.7982   | 0.7720    | 0.8496 | 0.8089 |
| 0.4729        | 0.1632 | 12000 | 0.4471          | 0.7946   | 0.7664    | 0.8508 | 0.8064 |
| 0.4589        | 0.1904 | 14000 | 0.4455          | 0.8002   | 0.7661    | 0.8675 | 0.8137 |
| 0.4513        | 0.2176 | 16000 | 0.4726          | 0.7934   | 0.7472    | 0.8902 | 0.8125 |
| 0.4573        | 0.2448 | 18000 | 0.4357          | 0.8016   | 0.7775    | 0.8481 | 0.8113 |
| 0.4474        | 0.2720 | 20000 | 0.4738          | 0.7932   | 0.7503    | 0.8823 | 0.8110 |
| 0.4480        | 0.2992 | 22000 | 0.4360          | 0.7934   | 0.7940    | 0.7955 | 0.7948 |
| 0.4490        | 0.3264 | 24000 | 0.4464          | 0.7996   | 0.7708    | 0.8560 | 0.8112 |
| 0.4490        | 0.3536 | 26000 | 0.4467          | 0.8048   | 0.7655    | 0.8819 | 0.8196 |
| 0.4483        | 0.3808 | 28000 | 0.4459          | 0.8042   | 0.7603    | 0.8918 | 0.8208 |
| 0.4468        | 0.4080 | 30000 | 0.4400          | 0.8054   | 0.7898    | 0.8353 | 0.8119 |
| 0.4413        | 0.4352 | 32000 | 0.4321          | 0.8048   | 0.7917    | 0.8302 | 0.8105 |
| 0.4444        | 0.4624 | 34000 | 0.4309          | 0.8086   | 0.7691    | 0.8850 | 0.8230 |
| 0.4507        | 0.4896 | 36000 | 0.4301          | 0.8124   | 0.7945    | 0.8457 | 0.8193 |
| 0.4426        | 0.5168 | 38000 | 0.4243          | 0.8052   | 0.7698    | 0.8739 | 0.8186 |
| 0.4321        | 0.5440 | 40000 | 0.4243          | 0.8074   | 0.7681    | 0.8839 | 0.8219 |
| 0.4301        | 0.5712 | 42000 | 0.4380          | 0.8060   | 0.7640    | 0.8886 | 0.8216 |
| 0.4418        | 0.5984 | 44000 | 0.4280          | 0.8096   | 0.7857    | 0.8544 | 0.8186 |
| 0.4334        | 0.6256 | 46000 | 0.4326          | 0.8090   | 0.7765    | 0.8707 | 0.8209 |
| 0.4385        | 0.6528 | 48000 | 0.4273          | 0.8116   | 0.7844    | 0.8624 | 0.8215 |
| 0.4337        | 0.6800 | 50000 | 0.4306          | 0.8086   | 0.7795    | 0.8636 | 0.8194 |
| 0.4294        | 0.7072 | 52000 | 0.4397          | 0.8110   | 0.7706    | 0.8886 | 0.8254 |
| 0.4276        | 0.7344 | 54000 | 0.4344          | 0.8138   | 0.7770    | 0.8831 | 0.8267 |
| 0.4183        | 0.7616 | 56000 | 0.4291          | 0.8120   | 0.7650    | 0.9037 | 0.8286 |
| 0.4226        | 0.7888 | 58000 | 0.4342          | 0.8134   | 0.7767    | 0.8827 | 0.8263 |
| 0.4266        | 0.8160 | 60000 | 0.4234          | 0.8132   | 0.7840    | 0.8675 | 0.8236 |
| 0.4285        | 0.8432 | 62000 | 0.4167          | 0.8156   | 0.7882    | 0.8660 | 0.8252 |
| 0.4265        | 0.8704 | 64000 | 0.4206          | 0.8142   | 0.7734    | 0.8918 | 0.8284 |
| 0.4290        | 0.8976 | 66000 | 0.4165          | 0.8174   | 0.7910    | 0.8656 | 0.8266 |
| 0.4308        | 0.9248 | 68000 | 0.4192          | 0.8140   | 0.7775    | 0.8827 | 0.8268 |
| 0.4248        | 0.9520 | 70000 | 0.4205          | 0.8152   | 0.7807    | 0.8795 | 0.8272 |
| 0.4250        | 0.9792 | 72000 | 0.4194          | 0.8164   | 0.7814    | 0.8815 | 0.8284 |

Framework versions

  • Transformers 4.45.2
  • Pytorch 2.4.1+cu124
  • Datasets 3.0.1
  • Tokenizers 0.20.1
