Loading...
HF多模态

etalab-ia/dpr-question_encoder-fr_qa-camembert

dpr-question_encoder-fr_qa-...

标签:


dpr-question_encoder-fr_qa-camembert


Description

French DPR model using CamemBERT as base and then fine-tuned on a combo of three French Q&A


Data


French Q&A

We use a combination of three French Q&A datasets:

  1. PIAFv1.1
  2. FQuADv1.0
  3. SQuAD-FR (SQuAD automatically translated to French)


Training

We are using 90 562 random questions for train and 22 391 for dev. No question in train exists in dev. For each question, we have a single positive_context (the paragraph where the answer to this question is found) and around 30 hard_negtive_contexts. Hard negative contexts are found by querying an ES instance (via bm25 retrieval) and getting the top-k candidates that do not contain the answer.

The files are over here.


Evaluation

We use FQuADv1.0 and French-SQuAD evaluation sets.


Training Script

We use the official Facebook DPR implentation with a slight modification: by default, the code can work with Roberta models, still we changed a single line to make it easier to work with Camembert. This modification can be found over here.


Hyperparameters

python -m torch.distributed.launch --nproc_per_node=8 train_dense_encoder.py \
--max_grad_norm 2.0 --encoder_model_type hf_bert --pretrained_file data/bert-base-multilingual-uncased \
--seed 12345 --sequence_length 256 --warmup_steps 1237 --batch_size 16 --do_lower_case \
--train_file DPR_FR_train.json \
--dev_file  ./data/100_hard_neg_ctxs/DPR_FR_dev.json \
--output_dir ./output/bert --learning_rate 2e-05 --num_train_epochs 35 \
--dev_batch_size 16 --val_av_rank_start_epoch 25 \
--pretrained_model_cfg ./data/bert-base-multilingual-uncased


Evaluation results

We obtain the following evaluation by using FQuAD and SQuAD-FR evaluation (or validation) sets. To obtain these results, we use haystack’s evaluation script (we report Retrieval results only).


DPR


FQuAD v1.0 Evaluation

For 2764 out of 3184 questions (86.81%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.87
Retriever Mean Avg Precision: 0.57


SQuAD-FR Evaluation

For 8945 out of 10018 questions (89.29%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.89
Retriever Mean Avg Precision: 0.63


BM25

For reference, BM25 gets the results shown below. As in the original paper, regarding SQuAD-like datasets, the results of DPR are consistently superseeded by BM25.


FQuAD v1.0 Evaluation

For 2966 out of 3184 questions (93.15%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.93
Retriever Mean Avg Precision: 0.74


SQuAD-FR Evaluation

For 9353 out of 10018 questions (93.36%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.93
Retriever Mean Avg Precision: 0.77


Usage

The results reported here are obtained with the haystack library. To get to similar embeddings using exclusively HF transformers library, you can do the following:

from transformers import AutoTokenizer, AutoModel
query = "Salut, mon chien est-il mignon ?"
tokenizer = AutoTokenizer.from_pretrained("etalab-ia/dpr-question_encoder-fr_qa-camembert",  do_lower_case=True)
input_ids = tokenizer(query, return_tensors='pt')["input_ids"]
model = AutoModel.from_pretrained("etalab-ia/dpr-question_encoder-fr_qa-camembert", return_dict=True)
embeddings = model.forward(input_ids).pooler_output
print(embeddings)

And with haystack, we use it as a retriever:

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="etalab-ia/dpr-question_encoder-fr_qa-camembert",
    passage_embedding_model="etalab-ia/dpr-ctx_encoder-fr_qa-camembert",
    model_version=dpr_model_tag,
    infer_tokenizer_classes=True,
)


Acknowledgments

This work was performed using HPC resources from GENCI–IDRIS (Grant 2020-AD011011224).


Citations


Datasets


PIAF

@inproceedings{KeraronLBAMSSS20,
  author    = {Rachel Keraron and
               Guillaume Lancrenon and
               Mathilde Bras and
               Fr{\'{e}}d{\'{e}}ric Allary and
               Gilles Moyse and
               Thomas Scialom and
               Edmundo{-}Pavel Soriano{-}Morales and
               Jacopo Staiano},
  title     = {Project {PIAF:} Building a Native French Question-Answering Dataset},
  booktitle = {{LREC}},
  pages     = {5481--5490},
  publisher = {European Language Resources Association},
  year      = {2020}
}


FQuAD

@article{dHoffschmidt2020FQuADFQ,
  title={FQuAD: French Question Answering Dataset},
  author={Martin d'Hoffschmidt and Maxime Vidal and Wacim Belblidia and Tom Brendl'e and Quentin Heinrich},
  journal={ArXiv},
  year={2020},
  volume={abs/2002.06071}
}


SQuAD-FR

 @MISC{kabbadj2018,
   author =       "Kabbadj, Ali",
   title =        "Something new in French Text Mining and Information Extraction (Universal Chatbot): Largest Q&A French training dataset (110 000+) ",
   editor =       "linkedin.com",
   month =        "November",
   year =         "2018",
   url =          "\url{https://www.linkedin.com/pulse/something-new-french-text-mining-information-chatbot-largest-kabbadj/}",
   note =         "[Online; posted 11-November-2018]",
 }


Models


CamemBERT

HF model card : https://huggingface.co/camembert-base

@inproceedings{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}


DPR

@misc{karpukhin2020dense,
    title={Dense Passage Retrieval for Open-Domain Question Answering},
    author={Vladimir Karpukhin and Barlas Oğuz and Sewon Min and Patrick Lewis and Ledell Wu and Sergey Edunov and Danqi Chen and Wen-tau Yih},
    year={2020},
    eprint={2004.04906},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

数据统计

数据评估

etalab-ia/dpr-question_encoder-fr_qa-camembert浏览人数已经达到254,如你需要查询该站的相关权重信息,可以点击"5118数据""爱站数据""Chinaz数据"进入;以目前的网站数据参考,建议大家请以爱站数据为准,更多网站价值评估因素如:etalab-ia/dpr-question_encoder-fr_qa-camembert的访问速度、搜索引擎收录以及索引量、用户体验等;当然要评估一个站的价值,最主要还是需要根据您自身的需求以及需要,一些确切的数据则需要找etalab-ia/dpr-question_encoder-fr_qa-camembert的站长进行洽谈提供。如该站的IP、PV、跳出率等!

关于etalab-ia/dpr-question_encoder-fr_qa-camembert特别声明

本站Ai导航提供的etalab-ia/dpr-question_encoder-fr_qa-camembert都来源于网络,不保证外部链接的准确性和完整性,同时,对于该外部链接的指向,不由Ai导航实际控制,在2023年5月9日 下午7:16收录时,该网页上的内容,都属于合规合法,后期网页的内容如出现违规,可以直接联系网站管理员进行删除,Ai导航不承担任何责任。

相关导航

暂无评论

暂无评论...