---
license: apache-2.0
language:
- de
- en
base_model: Qwen/Qwen2-VL-2B-Instruct
tags:
- ocr
- german
- vision
- document-understanding
- invoice
- qwen2-vl
pipeline_tag: image-text-to-text
library_name: transformers
datasets:
- neuralabs/german-synth-ocr
---
<p align="center">
<img src="https://raw.githubusercontent.com/Keyvanhardani/german-ocr/main/docs/logo.png" alt="German-OCR Logo" width="600"/>
</p>
<h1 align="center">German-OCR</h1>
<p align="center">
<strong>High-performance German document OCR using fine-tuned Qwen2-VL-2B & Qwen2.5-VL-3B vision-language models</strong>
</p>
## Model Description
German-OCR is specifically trained to extract text from German documents including invoices, receipts, forms, and other business documents. It outputs structured text in Markdown format.
- **Base Model**: Qwen/Qwen2-VL-2B-Instruct
- **Fine-tuning**: QLoRA (4-bit quantization)
- **Training Data**: German invoices and business documents
- **Output Format**: Markdown structured text
## Model Variants
| Model | Size | Base | HuggingFace |
|-------|------|------|-------------|
| german-ocr | 4.4 GB | Qwen2-VL-2B | [Keyven/german-ocr](https://huggingface.co/Keyven/german-ocr) |
| german-ocr-3b | 7.5 GB | Qwen2.5-VL-3B | [Keyven/german-ocr-3b](https://huggingface.co/Keyven/german-ocr-3b) |
## Usage
### Option 1: Python Package (Recommended)
```bash
pip install german-ocr
```
```python
from german_ocr import GermanOCR
# Using Ollama (fast, local)
ocr = GermanOCR(backend="ollama")
result = ocr.extract("document.png")
print(result)
# Using Transformers (more accurate)
ocr = GermanOCR(backend="transformers")
result = ocr.extract("document.png")
print(result)
```
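For more than one document, the same call can be wrapped in a small loop. The sketch below is not part of the `german-ocr` package: `extract_folder` is a hypothetical helper that takes any callable accepting an image path (such as `ocr.extract` above) and writes each result next to its image as a `.md` file. It assumes the return value of `extract` converts cleanly to a string.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable


def extract_folder(
    extract: Callable[[str], object],
    folder: str,
    suffixes: Iterable[str] = (".png", ".jpg", ".jpeg"),
) -> Dict[str, str]:
    """Run `extract` on every matching image in `folder` and save the text.

    `extract` is any callable that takes an image path, e.g. `ocr.extract`.
    Each result is written next to its image as `<name>.md`.
    """
    wanted = {s.lower() for s in suffixes}
    results: Dict[str, str] = {}
    for path in sorted(Path(folder).iterdir()):
        # Skip sub-folders and files that are not images we asked for
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        # Assumption: the OCR result converts cleanly to str
        text = str(extract(str(path)))
        path.with_suffix(".md").write_text(text, encoding="utf-8")
        results[path.name] = text
    return results
```

With the package installed, a call could look like `extract_folder(GermanOCR(backend="ollama").extract, "scans")`, where `scans` is a placeholder folder name.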
### Option 2: Ollama
> [!WARNING]
> **In development** - Vision adapter compatibility is still being worked on. For stable use, the
> [HuggingFace version](https://huggingface.co/Keyven/german-ocr) is recommended.
```bash
ollama run Keyvan/german-ocr "Extrahiere den Text: image.png"
```
### Option 3: Transformers
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image

# Load the fine-tuned model and its processor from the Hub
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Keyven/german-ocr",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Keyven/german-ocr")

# Build a chat message containing the image and the extraction prompt
image = Image.open("document.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Extrahiere den Text aus diesem Dokument."}
    ]
}]

# Render the prompt and collect the vision inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt"
).to(model.device)

# Generate, then decode only the newly generated tokens (prompt tokens are sliced off)
output_ids = model.generate(**inputs, max_new_tokens=512)
result = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True
)[0]
print(result)
```
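Because the model returns Markdown, tables in the output (for example invoice line items) can be turned into Python dictionaries with a few lines of standard-library code. The following is a minimal sketch, not part of this repository: `parse_markdown_tables` is a hypothetical helper, and `sample` is a made-up string that only illustrates the format, not real model output. Escaped pipes inside cells are not handled.

```python
from typing import Dict, List


def parse_markdown_tables(markdown: str) -> List[List[Dict[str, str]]]:
    """Return every pipe table in `markdown` as a list of row dicts."""
    tables: List[List[Dict[str, str]]] = []
    header: List[str] = []
    rows: List[Dict[str, str]] = []

    def split_row(line: str) -> List[str]:
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    def is_separator(cells: List[str]) -> bool:
        # Matches cells such as "---" or ":--:" in the line under the header
        return all(c and set(c) <= set("-: ") for c in cells)

    # The extra "" makes sure a table at the very end is flushed
    for line in markdown.splitlines() + [""]:
        if line.strip().startswith("|"):
            cells = split_row(line)
            if not header:
                header = cells
            elif not rows and is_separator(cells):
                continue
            else:
                rows.append(dict(zip(header, cells)))
        else:
            if header and rows:
                tables.append(rows)
            header, rows = [], []
    return tables


# Made-up example text, used only to show the expected input shape
sample = """# Rechnung

| Position | Betrag |
|----------|--------|
| Beratung | 100,00 EUR |
| Versand | 5,90 EUR |
"""

tables = parse_markdown_tables(sample)
for row in tables[0]:
    print(row["Position"], "->", row["Betrag"])
```

In practice, `result` from the snippet above would be passed in place of `sample`.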
## Performance
| Metric | Value |
|--------|-------|
| Base Model | Qwen2-VL-2B-Instruct |
| Model Size | 4.4 GB |
| VRAM (4-bit) | 1.5 GB |
| Inference Time | ~15s (GPU) |
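The size and VRAM figures can be sanity-checked with simple arithmetic. The sketch below assumes the 4.4 GB checkpoint stores 16-bit weights (2 bytes per parameter); that assumption is not stated in this model card. Under it, the weights alone need roughly 1.1 GB at 4 bits, so the remainder of the 1.5 GB figure would presumably cover runtime overhead such as activations and quantization metadata.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: the 4.4 GB checkpoint stores 16-bit weights, i.e. 2 bytes each
checkpoint_gb = 4.4
num_params = checkpoint_gb * 1e9 / 2

fp16_gb = weight_memory_gb(num_params, 16)  # same as the checkpoint size
int4_gb = weight_memory_gb(num_params, 4)   # weights only, no runtime overhead

print(f"parameters:     {num_params / 1e9:.1f}B")
print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {int4_gb:.1f} GB")
```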
## Training
- **Method**: QLoRA (4-bit quantization)
- **Epochs**: 3
- **Learning Rate**: 2e-4
- **LoRA Rank**: 64
- **Target Modules**: All linear layers
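To give a feel for what rank 64 means, the sketch below counts the parameters a LoRA adapter adds to a single linear layer: the frozen weight of shape `d_out x d_in` is complemented by two small trainable matrices of shape `rank x d_in` and `d_out x rank`. The layer shape used here is a hypothetical value chosen for illustration, not a figure taken from this model's configuration.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


RANK = 64  # LoRA rank from the training settings above

# Hypothetical layer shape, for illustration only
d_in, d_out = 1536, 1536

frozen = d_in * d_out
added = lora_trainable_params(d_in, d_out, RANK)
print(f"frozen: {frozen:,}  trainable: {added:,}  ratio: {added / frozen:.1%}")
```

Applying adapters to all linear layers, as listed above, repeats this count once per layer with that layer's own shape.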
## Limitations
- Optimized for German documents
- Best results with clear, high-resolution images
- May struggle with handwritten text
## License
Apache 2.0
## Author
**Keyvan Hardani**
- Website: [keyvan.ai](https://keyvan.ai)
- LinkedIn: [linkedin.com/in/keyvanhardani](https://www.linkedin.com/in/keyvanhardani/)
- GitHub: [@Keyvanhardani](https://github.com/Keyvanhardani)
## Links
- [GitHub](https://github.com/Keyvanhardani/german-ocr)
- [Ollama](https://ollama.com/Keyvan/german-ocr)
- [HuggingFace](https://huggingface.co/Keyven/german-ocr)