---
license: apache-2.0
datasets:
- togethercomputer/RedPajama-Data-V2
- LLM360/TxT360
language:
- fr
- en
pipeline_tag: text-classification
library_name: transformers
base_model: facebook/xlm-v-base
tags:
- gaperon
- quality-classifier
- document-quality
- data-curation
---

# Gaperon Quality Classifier

**Gaperon Quality Classifier** is a multilingual document quality classifier based on XLM-V base, fine-tuned to assess the quality of web-crawled documents in French and English. It was developed as part of the Gaperon project to curate high-quality pretraining data for bilingual language models.

## Model Details

- **Model Type**: Text Classification (Document Quality)
- **Architecture**: XLM-V base
- **Base Model**: [facebook/xlm-v-base](https://huggingface.co/facebook/xlm-v-base)
- **Languages**: French, English
- **License**: Apache 2.0
- **Developed by**: ALMAnaCH team, Inria Paris
- **Output Labels**: `low`, `medium`, `high`
- **F1 Score**: 75.11%

## Intended Use

This classifier is designed for:
- Filtering large-scale web-crawled corpora for language model pretraining (see the sketch below)
- Assessing document quality based on linguistic and content criteria
- Sample weighting in pretraining data mixtures

Unlike educational-value classifiers (e.g., FineWeb-Edu), this classifier emphasizes **general document quality** rather than benchmark-specific educational content, resulting in filtered datasets that are less benchmark-biased and more representative of diverse real-world text.

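As a minimal sketch of the corpus-filtering use case (the input file, batch size, and the "drop `low`" cutoff are illustrative assumptions, not a prescribed recipe):

```python
from datasets import load_dataset
from transformers import pipeline

# Hypothetical shard of a web crawl; substitute your own corpus.
corpus = load_dataset("text", data_files="crawl_shard.txt", split="train")

classifier = pipeline(
    "text-classification",
    model="almanach/gaperon-quality-classifier",
    truncation=True,  # the classifier reads at most 512 tokens
    max_length=512,
)

def keep(batch):
    preds = classifier(batch["text"])
    # Keep medium- and high-quality documents; the cutoff is a design choice.
    return [pred["label"] != "low" for pred in preds]

filtered = corpus.filter(keep, batched=True, batch_size=64)
print(f"Kept {len(filtered)}/{len(corpus)} documents")
```

The per-label scores returned by the pipeline can likewise serve as soft sample weights instead of a hard filter.
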
## Quality Criteria

The classifier was trained to evaluate documents on the following criteria:

| Criterion | Description |
|-----------|-------------|
| **Content Accuracy** | Factual reliability and use of credible sources |
| **Clarity** | Clear explanations, well-defined terms, logical flow |
| **Coherence** | Overall organization and logical progression |
| **Grammar and Language** | Correctness and audience appropriateness |
| **Depth of Information** | Level of detail and comprehensiveness |
| **Overall Usefulness** | Relevance and practical value for a general audience |

## Training Data

### Annotation Process

The classifier was trained on **500,000 annotated documents**:
- 250,000 documents from RedPajama-V2-French (RPv2-Fr)
- 250,000 documents from TxT360-CC (English)

### Synthetic Labeling

Document labels were generated using [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct), prompted to evaluate each document and assign a quality label (`low`, `medium`, or `high`) along with a short justification. Log-probabilities were collected to estimate annotation confidence and to enable retroactive remapping of the quality scale.

### Prompt used to generate labels

<details>
<summary>Click to view full prompt</summary>

```
Below is an extract from a web page. Evaluate the quality of the content based on the following factors:

1. Content Accuracy: Assess the correctness and reliability of the information presented. Consider the factual accuracy, use of credible sources (if mentioned), and absence of misinformation.
2. Clarity: Evaluate how well the information is communicated. Look for clear explanations, well-defined terms, and logical flow of ideas.
3. Coherence: Analyze the overall structure and organization of the content. Consider how well ideas are connected and if the content follows a logical progression.
4. Grammar and Language: Assess the quality of writing, including correct grammar, spelling, and punctuation. Consider the appropriateness of language for the intended audience.
5. Depth of Information: Evaluate the level of detail and thoroughness of the content. Consider whether it provides surface-level information or delves into more comprehensive explanations.
6. Overall Usefulness: Assess the practical value and relevance of the information for a general audience. Consider how applicable or helpful the content would be for someone seeking information on the topic.

Based on these factors, give an overall quality score of low, medium, or high.
Additionally, select one or more domains from the list below. Each domain listed is a single, combined category. Choose the most relevant domain(s). Domain(s) can only be chosen from the list below. Only select "Other" if none of the listed domains are applicable.
- Arts
- Business & Economics & Finance
- Culture & Cultural geography
- Daily Life & Home & Lifestyle
- Education
- Entertainment & Travel & Hobby
- Environment
- Food & Drink & Cooking
- Health & Wellness & Medicine
- Law & Justice
- Natural Science & Formal Science & Technology
- Personal Development & Human Resources & Career
- Politics & Government
- Religion & Spirituality
- Shopping & Commodity
- Society & Social Issues & Human Rights
- Sports
- Other (only if none of the above are relevant)
Additionally, identify the main topic of the extract, which can be any relevant subfield. Don't elaborate on the topic; just provide a concise classification.
Additionally, identify the document type, which can be article, blog post, forum post, or any other relevant type. Don't elaborate on the type; just provide a concise classification.

USER PROMPT:
The extract:
{DOCUMENT}

After examining the extract:
- Briefly justify your quality classification, up to 100 words on one line using the format: "Explanation: <justification>"
- Conclude with the quality classification using the format: "Quality score: <classification>" (on a separate line)
- Continue with the domain classification using the format: "Domain: <classification>, <classification>, ..." (on a separate line)
- Continue with the main topic or subject classification using the format: "Main topic: <classification>" (on a separate line)
- Continue with the document type classification using the format: "Document type: <classification>" (on a separate line)

Evaluate the content based on the quality factors outlined above.
```
</details>

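A minimal sketch of how labeler responses in the format above could be parsed back into a label and a confidence estimate (the regex and the log-probability renormalization are illustrative assumptions, not the exact Gaperon post-processing):

```python
import math
import re

QUALITY_RE = re.compile(r"^Quality score:\s*(low|medium|high)\b",
                        re.IGNORECASE | re.MULTILINE)

def parse_annotation(response, label_logprobs=None):
    """Extract the quality label from a response that follows the
    'Quality score: <classification>' format defined in the prompt."""
    match = QUALITY_RE.search(response)
    if match is None:
        return None, None
    label = match.group(1).lower()

    confidence = None
    if label_logprobs:
        # Hypothetical confidence estimate: renormalize the log-probabilities
        # the labeler assigned to the three candidate labels.
        probs = {k: math.exp(v) for k, v in label_logprobs.items()}
        confidence = probs.get(label, 0.0) / sum(probs.values())
    return label, confidence

label, conf = parse_annotation(
    "Explanation: Clear, well-sourced article.\nQuality score: high\nDomain: Education",
    label_logprobs={"low": -6.2, "medium": -2.1, "high": -0.2},
)
print(label, conf)  # high ~0.87
```
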
## Training Procedure

### Training Details

- **Task**: Single-task quality classification
- **Abandoned approach**: A multitask variant that jointly predicted quality and domain was tried but underperformed the single-task model

### Performance

**F1 Score: 75.11%**

#### Confusion Matrix

| True \ Predicted | Low | Medium | High |
|------------------|-----|--------|------|
| **Low** | 922 | 463 | 77 |
| **Medium** | 203 | 5,219 | 623 |
| **High** | 32 | 531 | 1,930 |

Most errors occur between adjacent labels (low vs. medium, medium vs. high), while confusion between the extreme categories (low vs. high) is rare.

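Per-class precision, recall, and F1 can be read off this matrix; the short check below does so (the single reported F1 figure may correspond to a different averaging scheme or evaluation split than this matrix, so the numbers need not match exactly):

```python
import numpy as np

# Rows: true label; columns: predicted label; order: low, medium, high.
cm = np.array([[922, 463, 77],
               [203, 5219, 623],
               [32, 531, 1930]])

for i, label in enumerate(["low", "medium", "high"]):
    precision = cm[i, i] / cm[:, i].sum()
    recall = cm[i, i] / cm[i, :].sum()
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{label}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```
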
## Usage

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="almanach/gaperon-quality-classifier",
    truncation=True,  # avoid errors on documents longer than 512 tokens
    max_length=512,
)
documents = ["Your document text goes here."]
results = classifier(documents)
for result in results:
    print(f"Label: {result['label']}, Score: {result['score']}")
```

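To get scores for all three labels rather than only the top prediction, pass the standard pipeline option `top_k=None` at call time: `results = classifier(documents, top_k=None)`.
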
For higher-throughput deployment on AMD GPUs, the model can also be served through a MIGraphX-based inference server.

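One possible way to produce the compiled model the server loads below, assuming an ONNX export via `optimum` and the MIGraphX Python bindings (paths and file names are illustrative):

```python
# Export the classifier to ONNX first, e.g.:
#   optimum-cli export onnx --model almanach/gaperon-quality-classifier onnx_out/
import migraphx

# Parse and compile the ONNX graph for the ROCm GPU target.
model = migraphx.parse_onnx("onnx_out/model.onnx")
model.compile(migraphx.get_target("gpu"))

# Serialize in the msgpack format that mgx.load(..., format="msgpack") expects.
migraphx.save(model, "model.mxr", format="msgpack")
```
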
<details>
<summary>Inference Server Code</summary>

```python
import asyncio
import json
import logging
import logging.config
import os
import time
from typing import List, Optional

import migraphx as mgx
import numpy as np
import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer

MAX_BATCH_SIZE = int(os.getenv("MAX_BATCH_SIZE", 512))
label_list = os.getenv("LABEL_LIST", "")
if not label_list:
    raise ValueError("LABEL_LIST environment variable is required")
elif "json" in label_list:
    # Loading from a JSON config payload, e.g. {"id2label": {"0": "low", ...}}
    id2label = json.loads(label_list)["id2label"]
    # Convert keys to int and build the label list sorted by id.
    id2label = {int(k): v for k, v in id2label.items()}
    label_list = [id2label[i] for i in sorted(id2label.keys())]
else:
    # Otherwise expect a comma-separated list, e.g. "low,medium,high".
    label_list = label_list.split(",")

assert len(label_list) > 0, "LABEL_LIST environment variable is required"
print(f"Label list: {label_list}")

MODEL_PATH = os.getenv("MODEL_PATH", None)
assert MODEL_PATH is not None, "MODEL_PATH environment variable is required"
TOKENIZER_PATH = os.getenv("TOKENIZER_PATH", None)
assert TOKENIZER_PATH is not None, "TOKENIZER_PATH environment variable is required"


model = mgx.load(MODEL_PATH, format="msgpack")
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH)

LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": True,
    "formatters": {
        "standard": {
            "format": "%(process)d %(asctime)s [%(levelname)s] %(name)s: %(message)s"
        },
    },
    "handlers": {
        "default": {
            "level": "INFO",
            "formatter": "standard",
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",  # default is stderr
        },
    },
    "loggers": {
        "": {  # root logger
            "level": "INFO",
            "handlers": ["default"],
            "propagate": False,
        },
        "uvicorn.error": {
            "level": "DEBUG",
            "handlers": ["default"],
        },
        "uvicorn.access": {
            "level": "WARNING",
            "handlers": ["default"],
        },
    },
}

logging.config.dictConfig(LOGGING_CONFIG)

logger = logging.getLogger(__name__)
logger.info("Starting FastAPI server...")
logger.info(f"Model path: {MODEL_PATH}")
logger.info(f"Tokenizer path: {TOKENIZER_PATH}")
logger.info(f"Label list: {label_list}")
app = FastAPI()


class InputData(BaseModel):
    text: str


class BatchInputData(BaseModel):
    # Either raw texts, or pre-tokenized inputs with is_pre_tokenized=True.
    texts: Optional[List[str]] = None
    input_ids: Optional[List[List[int]]] = None
    attention_mask: Optional[List[List[int]]] = None
    token_type_ids: Optional[List[List[int]]] = None
    is_pre_tokenized: bool = False


class LabelScore(BaseModel):
    label: str
    score: float


class BatchOutputData(BaseModel):
    results: List[List[LabelScore]]


def softmax(_outputs, axis=-1):
    # Numerically stable softmax over the class logits.
    maxes = np.max(_outputs, axis=axis, keepdims=True)
    shifted_exp = np.exp(_outputs - maxes)
    return shifted_exp / shifted_exp.sum(axis=axis, keepdims=True)


# Asynchronous function to tokenize the batch
async def tokenize_batch(texts):
    tokenized_batch = tokenizer(
        texts,
        truncation=True,
        padding="max_length",
        max_length=512,
        return_tensors="np",
        return_attention_mask=True,
        return_token_type_ids=True,
    )
    return {
        "input_ids": tokenized_batch["input_ids"],
        "attention_mask": tokenized_batch["attention_mask"],
        "token_type_ids": tokenized_batch["token_type_ids"],
    }


# Function to run model inference (blocking)
def run_inference(batch):
    logits = np.array(model.run(batch)).reshape(-1, len(label_list))
    return softmax(logits, axis=-1)


# Queues for tokenization and inference
tokenization_queue = asyncio.Queue()
inference_queue = asyncio.Queue()


# Consumer for inference: runs the GPU model on queued batches.
async def inference_consumer():
    while True:
        tokenized_batch, result_future = await inference_queue.get()
        try:
            result = run_inference(tokenized_batch)
            result_future.set_result(result)
        except Exception as e:
            result_future.set_exception(e)
        finally:
            inference_queue.task_done()


# Consumer for tokenization: tokenizes queued texts (CPU), then hands off to inference (GPU).
async def tokenization_consumer():
    while True:
        texts, result_future = await tokenization_queue.get()
        try:
            tokenized_batch = await tokenize_batch(texts)
            await inference_queue.put((tokenized_batch, result_future))
        except Exception as e:
            result_future.set_exception(e)
        finally:
            tokenization_queue.task_done()


# Start the background consumers when the app boots.
@app.on_event("startup")
async def startup_event():
    asyncio.create_task(tokenization_consumer())
    asyncio.create_task(inference_consumer())


@app.post("/label")
async def label_text(data: BatchInputData):
    if data.is_pre_tokenized:
        # Validate pre-tokenized inputs
        if not all([data.input_ids, data.attention_mask, data.token_type_ids]):
            raise HTTPException(
                status_code=400,
                detail="When is_pre_tokenized is True, input_ids, attention_mask, and token_type_ids are required.",
            )

        # Ensure batch sizes are consistent
        batch_size = len(data.input_ids)
        if any(
            len(lst) != batch_size for lst in [data.attention_mask, data.token_type_ids]
        ):
            raise HTTPException(
                status_code=400,
                detail="All pre-tokenized inputs (input_ids, attention_mask, token_type_ids) must have the same batch size.",
            )

        # Package the pre-tokenized inputs for inference
        tokenized_batch = {
            "input_ids": np.array(data.input_ids, dtype=np.int64),
            "attention_mask": np.array(data.attention_mask, dtype=np.int64),
            "token_type_ids": np.array(data.token_type_ids, dtype=np.int64),
        }

        # Create a future for inference
        result_future = asyncio.get_event_loop().create_future()

        # Directly add the pre-tokenized data to the inference queue
        await inference_queue.put((tokenized_batch, result_future))

    else:
        # Validate and process texts for tokenization
        if not data.texts:
            raise HTTPException(
                status_code=400,
                detail="Texts field is required when is_pre_tokenized is False.",
            )

        if len(data.texts) > MAX_BATCH_SIZE:
            raise HTTPException(
                status_code=400, detail=f"Batch size is too large (> {MAX_BATCH_SIZE})"
            )

        # Create a future for tokenization and inference
        result_future = asyncio.get_event_loop().create_future()

        # Add the texts to the tokenization queue
        await tokenization_queue.put((data.texts, result_future))

    # Wait for the future result to be set (after tokenization and/or inference completes)
    predictions = await result_future

    # Process the results into the desired format
    results = [
        [LabelScore(label=label, score=score) for label, score in zip(label_list, pred)]
        for pred in predictions
    ]
    # Sort the results by descending score
    results = [
        sorted(result, key=lambda x: x.score, reverse=True) for result in results
    ]

    return {"results": results}


@app.get("/health")
def health():
    # Report "ending" if the current SLURM job is within 5 minutes of its end time.
    slurm_job_end_time = os.getenv("SLURM_JOB_END_TIME", None)
    if slurm_job_end_time is not None:
        slurm_job_end_time = int(slurm_job_end_time)
        if slurm_job_end_time - time.time() < 300:
            return {"status": "ending"}

    return {"status": "ok"}


@app.get("/get_job_info")
def get_job_info():
    # Expose all SLURM_* environment variables for debugging.
    job_info = {}
    for key in os.environ:
        if key.startswith("SLURM_"):
            job_info[key] = os.getenv(key)
    return job_info


if __name__ == "__main__":
    uvicorn.run("app:app", host="0.0.0.0", port=8000, reload=True)
```

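Once the server is up, the `/label` endpoint can be exercised with a plain HTTP client (a sketch; host and port follow the `uvicorn.run` call above):

```python
import requests

resp = requests.post(
    "http://localhost:8000/label",
    json={"texts": ["Your document text goes here."]},
    timeout=60,
)
resp.raise_for_status()
for doc_scores in resp.json()["results"]:
    # Scores are sorted by descending probability; the first entry is the prediction.
    print(doc_scores[0]["label"], round(doc_scores[0]["score"], 4))
```
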
Dockerfile for the inference server:

```Dockerfile
FROM rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1

ARG ONNXRUNTIME_REPO=https://github.com/Microsoft/onnxruntime
ARG ONNXRUNTIME_BRANCH=v1.17.3

ENV PATH /code/cmake-3.27.3-linux-x86_64/bin:${PATH}

RUN apt-get update &&\
    apt-get install -y migraphx

WORKDIR /install_dir

# Clone and build onnxruntime with ROCm and MIGraphX execution providers.
# ROCM_VERSION is expected to be provided by the base image environment.
RUN git clone --single-branch --branch ${ONNXRUNTIME_BRANCH} --recursive ${ONNXRUNTIME_REPO} onnxruntime &&\
    /bin/sh onnxruntime/dockerfiles/scripts/install_common_deps.sh &&\
    cd onnxruntime && pip install --upgrade pip &&\
    /bin/sh ./build.sh --allow_running_as_root --cmake_extra_defines ONNXRUNTIME_VERSION=`cat ./VERSION_NUMBER` --config Release --parallel \
    --skip_tests --build_wheel --use_rocm --rocm_version=${ROCM_VERSION} --rocm_home /opt/rocm --use_migraphx && \
    pip install /install_dir/onnxruntime/build/Linux/Release/dist/*.whl

RUN pip install --upgrade --upgrade-strategy eager optimum[amd]==1.22.0 fastapi[standard]

WORKDIR /workspace
```
</details>

## Limitations

- **Sequence length**: Documents are truncated to 512 tokens, so the quality assessment is based only on the beginning of each document
- **Language scope**: Optimized for French and English; performance on other languages has not been evaluated
- **Subjectivity**: Quality labels are synthetic, generated by an LLM, and may therefore inherit biases from the teacher model

## Related Models

- [Gaperon-1125-1.5B-SFT](https://huggingface.co/almanach/Gaperon-1125-1.5B-SFT) - 1.5B-parameter bilingual LM
- [Gaperon-1125-8B-SFT](https://huggingface.co/almanach/Gaperon-1125-8B-SFT) - 8B-parameter bilingual LM
- [Gaperon-1125-24B-SFT](https://huggingface.co/almanach/Gaperon-1125-24B-SFT) - 24B-parameter bilingual LM

## Model Card Authors

ALMAnaCH team, Inria Paris

## Additional Resources

- 🔗 **GitHub**: [https://github.com/NathanGodey/gapetron](https://github.com/NathanGodey/gapetron)
- 📄 **Paper**: [arXiv:2510.25771](https://arxiv.org/abs/2510.25771)
- 🔧 **Evaluation Tools**: [https://gitlab.inria.fr/almanach/lm-evaluation-harness-gaperon](https://gitlab.inria.fr/almanach/lm-evaluation-harness-gaperon)

## Citation

If you use this model, please cite:

```bibtex
@misc{godey2025gaperonpepperedenglishfrenchgenerative,
  title={Gaperon: A Peppered English-French Generative Language Model Suite},
  author={Nathan Godey and Wissam Antoun and Rian Touchent and Rachel Bawden and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
  year={2025},
  eprint={2510.25771},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.25771},
}
```

## Acknowledgments

This work by the ALMAnaCH team at Inria Paris was supported by French public research funding and by computational resources from national HPC clusters over a 15-month period. The SFT variants of the Gaperon models were developed under computational and human-resource constraints, focusing on essential supervised fine-tuning for practical instruction-following capabilities.