Nija Pidgin Tokenizer

A custom tokenizer trained for Nigerian Pidgin English (Naija).
It is designed for NLP pipelines, chatbots, language modelling, and text processing where Pidgin-specific vocabulary and slang need to be captured accurately.


Model Overview

  • Type: Byte-Pair Encoding (BPE) tokenizer
  • Special tokens: <|startoftext|>, <|endoftext|>, <|pad|>, <|user|>, <|assistant|>, <|system|>, <|unk|>, <|endofprompt|>
  • Vocabulary size: 55,000 tokens
  • Language: Nija (Naija) Pidgin English
  • Trained on: BBC Pidgin dataset
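
For readers new to Byte-Pair Encoding, the sketch below shows the general idea on a toy Pidgin corpus: start from single characters and repeatedly merge the most frequent adjacent pair. It is a minimal, character-level illustration only. The function names (`learn_bpe_merges`, `merge_pair`, `apply_merges`) and the toy corpus are hypothetical, and the card does not state the actual training settings used for this tokenizer (pre-tokenization, byte-level handling, and so on).

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of sentences (toy version)."""
    # Split on whitespace and start every word as a sequence of characters.
    word_freqs = Counter(word for line in corpus for word in line.lower().split())
    words = {word: tuple(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for word, symbols in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += word_freqs[word]
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        words = {word: merge_pair(symbols, best) for word, symbols in words.items()}
    return merges


def apply_merges(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    toy_corpus = ["how you dey", "you dey try", "wetin dey happen"]  # hypothetical data
    merges = learn_bpe_merges(toy_corpus, num_merges=10)
    print("Merges:", merges)
    print("Pieces:", apply_merges("dey", merges))
```

A production tokenizer of this size would normally be trained with a dedicated library rather than code like this; the sketch is only meant to make the "BPE" entry above concrete.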

Intended Use

  • Conversational AI in Naija Pidgin
  • Chatbots and virtual assistants
  • Fine-tuning LLMs for Nigerian Pidgin text
  • NLP pipelines requiring tokenisation of colloquial and code-mixed text

Usage Example

from transformers import PreTrainedTokenizerFast

# Load the tokenizer files from the Hugging Face Hub.
tokenizer = PreTrainedTokenizerFast.from_pretrained("Arkintea/Nija_Pidgin_Tokenizer")

text = "How you dey my guy?"
token_ids = tokenizer.encode(text)     # list of integer token IDs
decoded = tokenizer.decode(token_ids)  # map the IDs back to a string

print("Token IDs:", token_ids)
print("Decoded:", decoded)
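
The special tokens listed in the overview suggest a chat-style prompt layout, but the card does not document one. The sketch below is therefore an assumption: `build_prompt` and the ordering of the tokens are hypothetical, and only the token strings themselves come from the card. It uses plain string formatting, so it runs without downloading the tokenizer.

```python
# Special tokens as listed in the model overview.
START_OF_TEXT = "<|startoftext|>"
END_OF_PROMPT = "<|endofprompt|>"
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}


def build_prompt(messages):
    """Join (role, content) pairs into one prompt string.

    Assumed layout (not confirmed by the model card):
    <|startoftext|> + role token + content for each message + <|endofprompt|>
    """
    parts = [START_OF_TEXT]
    for role, content in messages:
        if role not in ROLE_TOKENS:
            raise ValueError(f"unknown role: {role!r}")
        parts.append(ROLE_TOKENS[role] + content.strip())
    parts.append(END_OF_PROMPT)
    return "".join(parts)


if __name__ == "__main__":
    prompt = build_prompt([
        ("system", "You be helpful assistant."),
        ("user", "How you dey my guy?"),
    ])
    print(prompt)
```

The resulting string can then be passed to `tokenizer.encode` as in the usage example. Whether each special token maps to a single ID depends on how the tokenizer was configured, so it is worth checking the encoded output before relying on this layout.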