# Nija Pidgin Tokenizer

A custom tokenizer trained for Nigerian Pidgin English (Naija). It is designed for NLP pipelines, chatbots, language modelling, and other text processing where Pidgin-specific vocabulary and slang must be captured accurately.
## Model Overview

- Type: Byte-Pair Encoding (BPE) tokenizer
- Vocabulary size: 55,000 tokens
- Special tokens: `<|startoftext|>`, `<|endoftext|>`, `<|pad|>`, `<|user|>`, `<|assistant|>`, `<|system|>`, `<|unk|>`, `<|endofprompt|>`
- Language: Nigerian Pidgin English (Naija)
- Trained on: BBC Pidgin dataset
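For readers unfamiliar with BPE, a toy pure-Python sketch of a single merge step illustrates the idea behind the tokenizer type listed above. This is illustrative only; the toy corpus and variable names are made up, and the card does not describe the actual training code:

```python
from collections import Counter

# Toy Pidgin-ish corpus, already whitespace-split; each word starts as characters.
words = ["dey", "dey", "de"]
splits = [list(w) for w in words]

# Count adjacent symbol pairs across the corpus.
pairs = Counter()
for sym in splits:
    for a, b in zip(sym, sym[1:]):
        pairs[(a, b)] += 1

# The most frequent pair becomes the next merge rule.
best = max(pairs, key=pairs.get)

# Apply the merge: replace every occurrence of the pair with one token.
merged = []
for sym in splits:
    out, i = [], 0
    while i < len(sym):
        if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
            out.append(sym[i] + sym[i + 1])
            i += 2
        else:
            out.append(sym[i])
            i += 1
    merged.append(out)

print("merge:", best)   # ('d', 'e') — appears in all three words
print(merged)           # [['de', 'y'], ['de', 'y'], ['de']]
```

Training repeats this pair-count-and-merge loop until the vocabulary reaches its target size (55,000 tokens here).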
## Intended Use

- Conversational AI in Naija Pidgin
- Chatbots and virtual assistants
- Fine-tuning LLMs on Nigerian Pidgin text
- NLP pipelines requiring tokenisation of colloquial and code-mixed text
## Usage Example

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Arkintea/Nija_Pidgin_Tokenizer")

text = "How you dey my guy?"
tokens = tokenizer.encode(text)
decoded = tokenizer.decode(tokens)

print("Tokens:", tokens)
print("Decoded:", decoded)
```
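The `<|system|>`, `<|user|>`, and `<|assistant|>` special tokens suggest a chat-style prompt format. The sketch below shows one way such tokens might frame a conversation; the `format_chat` helper and the exact turn ordering are assumptions, not something the card specifies:

```python
def format_chat(system: str, user: str) -> str:
    # Assumed ordering: system turn, then user turn, then an open
    # assistant tag for the model to complete. Verify against the
    # chat template actually used when fine-tuning.
    return (
        "<|startoftext|>"
        + "<|system|>" + system
        + "<|user|>" + user
        + "<|assistant|>"
    )

prompt = format_chat("You be helpful assistant.", "How you dey my guy?")
print(prompt)
```

Because these markers are registered as special tokens, the tokenizer encodes each of them as a single ID rather than splitting them into subwords.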