judge_answer___29_deberta_v3_base_msmarco_answerability

This model is a fine-tuned version of microsoft/deberta-v3-base on tom-010/msmarcov2.1-binary-answerability. The dataset is heavily imbalanced (only 6% positives). The training notebook addressed this by downsampling the negative examples to a 1-to-1 ratio with the positives.
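
The notebook itself is not included here, but a minimal sketch of such 1-to-1 downsampling with the `datasets` library could look like the following (the label column name is an assumption):

```python
from datasets import load_dataset, concatenate_datasets

# Hedged sketch of 1-to-1 negative downsampling; the actual notebook may
# differ, and the column name "label" is an assumption.
ds = load_dataset("tom-010/msmarcov2.1-binary-answerability", split="train")

positives = ds.filter(lambda x: x["label"] == 1)
negatives = ds.filter(lambda x: x["label"] == 0)

# Keep only as many (shuffled) negatives as there are positives.
negatives = negatives.shuffle(seed=42).select(range(len(positives)))

balanced = concatenate_datasets([positives, negatives]).shuffle(seed=42)
```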

It achieves the following results on the evaluation set:

  • Loss: 0.4194
  • Accuracy: 0.8164
  • Precision: 0.7814
  • Recall: 0.8815
  • F1: 0.8284

See the run here: https://wandb.ai/stadeltom-com/huggingface/runs/l5mt601p?nw=nwuserstadeltom

Model description

The model is a fine-tuned DeBERTa v3 that classifies whether a question/query is answered by a given text (passage).
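
A minimal inference sketch (the sentence-pair input format and the label names are assumptions, since the model card does not spell them out):

```python
from transformers import pipeline

# Hedged sketch: assumes standard (question, passage) sentence-pair input.
clf = pipeline(
    "text-classification",
    model="tom-010/judge_answer___29_deberta_v3_base_msmarco_answerability",
)

result = clf({
    "text": "who wrote faust",
    "text_pair": "Faust is a tragic play written by Johann Wolfgang von Goethe.",
})
print(result)  # e.g. [{'label': 'LABEL_1', 'score': ...}]
```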

Intended uses & limitations

The task is to judge whether a text answers a question. The dataset is derived from MS MARCO v2, where each query comes with 10 search results from the Bing search engine. An annotator answered the question and marked the passages (search results) they used for the answer. The dataset iterates over each passage of each query and records the query, the passage, and whether the passage was used for the answer. The downside: false negatives are entirely possible, since a passage may answer the question even though the annotator did not use it. The upside: it is a realistic setting, as in practice we also get 10 search results and need to filter them. However, the baseline performance on this data is unknown.
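
For that filtering scenario, a hedged sketch of scoring several search results against one query (the positive label name "LABEL_1" is an assumption):

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="tom-010/judge_answer___29_deberta_v3_base_msmarco_answerability",
)

query = "how tall is the eiffel tower"
# Stand-ins for the ~10 search-result passages of a query.
passages = [
    "The Eiffel Tower is 330 metres (1,083 ft) tall.",
    "Paris is the capital and most populous city of France.",
]

scores = clf([{"text": query, "text_pair": p} for p in passages])
# Keep only the passages judged as answering the query.
answering = [p for p, s in zip(passages, scores) if s["label"] == "LABEL_1"]
print(answering)
```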

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a hedged TrainingArguments sketch follows the list):

  • learning_rate: 3e-05
  • train_batch_size: 16
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 1
  • mixed_precision_training: Native AMP
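
As a sketch, these settings map onto roughly the following TrainingArguments (output_dir is an assumption, not taken from the original run; the Adam betas and epsilon are also the Trainer defaults):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="judge_answer_deberta_v3_base",  # assumed, not from the run
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=1,
    fp16=True,  # "Native AMP" mixed precision
)
```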

Training results

| Training Loss | Epoch  | Step  | Validation Loss | Accuracy | Precision | Recall | F1     |
|:-------------:|:------:|:-----:|:---------------:|:--------:|:---------:|:------:|:------:|
| 0.5008        | 0.0272 | 2000  | 0.4931          | 0.7864   | 0.7498    | 0.8632 | 0.8025 |
| 0.4832        | 0.0544 | 4000  | 0.4565          | 0.7858   | 0.7422    | 0.8795 | 0.8050 |
| 0.4716        | 0.0816 | 6000  | 0.4758          | 0.7926   | 0.7527    | 0.8751 | 0.8093 |
| 0.4645        | 0.1088 | 8000  | 0.4740          | 0.7878   | 0.7633    | 0.8377 | 0.7988 |
| 0.4697        | 0.1360 | 10000 | 0.4519          | 0.7982   | 0.7720    | 0.8496 | 0.8089 |
| 0.4729        | 0.1632 | 12000 | 0.4471          | 0.7946   | 0.7664    | 0.8508 | 0.8064 |
| 0.4589        | 0.1904 | 14000 | 0.4455          | 0.8002   | 0.7661    | 0.8675 | 0.8137 |
| 0.4513        | 0.2176 | 16000 | 0.4726          | 0.7934   | 0.7472    | 0.8902 | 0.8125 |
| 0.4573        | 0.2448 | 18000 | 0.4357          | 0.8016   | 0.7775    | 0.8481 | 0.8113 |
| 0.4474        | 0.2720 | 20000 | 0.4738          | 0.7932   | 0.7503    | 0.8823 | 0.8110 |
| 0.4480        | 0.2992 | 22000 | 0.4360          | 0.7934   | 0.7940    | 0.7955 | 0.7948 |
| 0.4490        | 0.3264 | 24000 | 0.4464          | 0.7996   | 0.7708    | 0.8560 | 0.8112 |
| 0.4490        | 0.3536 | 26000 | 0.4467          | 0.8048   | 0.7655    | 0.8819 | 0.8196 |
| 0.4483        | 0.3808 | 28000 | 0.4459          | 0.8042   | 0.7603    | 0.8918 | 0.8208 |
| 0.4468        | 0.4080 | 30000 | 0.4400          | 0.8054   | 0.7898    | 0.8353 | 0.8119 |
| 0.4413        | 0.4352 | 32000 | 0.4321          | 0.8048   | 0.7917    | 0.8302 | 0.8105 |
| 0.4444        | 0.4624 | 34000 | 0.4309          | 0.8086   | 0.7691    | 0.8850 | 0.8230 |
| 0.4507        | 0.4896 | 36000 | 0.4301          | 0.8124   | 0.7945    | 0.8457 | 0.8193 |
| 0.4426        | 0.5168 | 38000 | 0.4243          | 0.8052   | 0.7698    | 0.8739 | 0.8186 |
| 0.4321        | 0.5440 | 40000 | 0.4243          | 0.8074   | 0.7681    | 0.8839 | 0.8219 |
| 0.4301        | 0.5712 | 42000 | 0.4380          | 0.8060   | 0.7640    | 0.8886 | 0.8216 |
| 0.4418        | 0.5984 | 44000 | 0.4280          | 0.8096   | 0.7857    | 0.8544 | 0.8186 |
| 0.4334        | 0.6256 | 46000 | 0.4326          | 0.8090   | 0.7765    | 0.8707 | 0.8209 |
| 0.4385        | 0.6528 | 48000 | 0.4273          | 0.8116   | 0.7844    | 0.8624 | 0.8215 |
| 0.4337        | 0.6800 | 50000 | 0.4306          | 0.8086   | 0.7795    | 0.8636 | 0.8194 |
| 0.4294        | 0.7072 | 52000 | 0.4397          | 0.8110   | 0.7706    | 0.8886 | 0.8254 |
| 0.4276        | 0.7344 | 54000 | 0.4344          | 0.8138   | 0.7770    | 0.8831 | 0.8267 |
| 0.4183        | 0.7616 | 56000 | 0.4291          | 0.8120   | 0.7650    | 0.9037 | 0.8286 |
| 0.4226        | 0.7888 | 58000 | 0.4342          | 0.8134   | 0.7767    | 0.8827 | 0.8263 |
| 0.4266        | 0.8160 | 60000 | 0.4234          | 0.8132   | 0.7840    | 0.8675 | 0.8236 |
| 0.4285        | 0.8432 | 62000 | 0.4167          | 0.8156   | 0.7882    | 0.8660 | 0.8252 |
| 0.4265        | 0.8704 | 64000 | 0.4206          | 0.8142   | 0.7734    | 0.8918 | 0.8284 |
| 0.4290        | 0.8976 | 66000 | 0.4165          | 0.8174   | 0.7910    | 0.8656 | 0.8266 |
| 0.4308        | 0.9248 | 68000 | 0.4192          | 0.8140   | 0.7775    | 0.8827 | 0.8268 |
| 0.4248        | 0.9520 | 70000 | 0.4205          | 0.8152   | 0.7807    | 0.8795 | 0.8272 |
| 0.4250        | 0.9792 | 72000 | 0.4194          | 0.8164   | 0.7814    | 0.8815 | 0.8284 |

Framework versions

  • Transformers 4.45.2
  • Pytorch 2.4.1+cu124
  • Datasets 3.0.1
  • Tokenizers 0.20.1
