Tokenization guide
Learn how the phonological splitter works and how to integrate it in your pipeline. Read the guide →
```
pip install sinlib
```

Splits Sinhala text into base consonant + diacritic units. HuggingFace-compatible API.
N-gram model backed by a ~45 000-word dictionary. Auto-corrects typos and suggests alternatives.
Remove noise, compute Sinhala character ratios, and batch-process text.
Vocab and weights are fetched automatically from Ransaka/sinlib on first use.
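The Sinhala character ratio mentioned in the feature list can be illustrated with a few lines of plain Python. This is only a sketch under stated assumptions: `sinhala_ratio` is a hypothetical helper, not sinlib's own function, and it counts any code point in the Unicode Sinhala block (U+0D80 to U+0DFF) while ignoring whitespace.

```python
def sinhala_ratio(text: str) -> float:
    """Fraction of non-whitespace code points that fall in the Sinhala block.

    Hypothetical helper for illustration; sinlib's own implementation may differ.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    # The Unicode Sinhala block spans U+0D80..U+0DFF.
    sinhala = sum(1 for ch in chars if "\u0d80" <= ch <= "\u0dff")
    return sinhala / len(chars)


sinhala_ratio("සිංහල abc")  # 0.625 (5 Sinhala code points out of 8)
```

A ratio like this is typically used as a filter, for example to drop lines that are mostly Latin text before tokenizing. Note that joiner characters such as U+200D sit outside the Sinhala block and would count as non-Sinhala in this sketch.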
```python
from sinlib import Tokenizer

tokenizer = Tokenizer.from_pretrained("Ransaka/sinlib")

# Split into phonological units
tokenizer.tokenize("ආයුබෝවන්")
# ['ආ', 'යු', 'බෝ', 'ව', 'න්']

# Encode to integer IDs
enc = tokenizer("ආයුබෝවන්")
enc.input_ids       # [4, 23, 18, 7, 12]
enc.attention_mask  # [1, 1, 1, 1, 1]
```

```python
from sinlib import Tokenizer

tokenizer = Tokenizer.from_pretrained("Ransaka/sinlib")

batch = tokenizer(["ආයුබෝවන්", "සිංහල"], padding=True)
batch.input_ids
# [[4, 23, 18, 7, 12],
#  [9, 31, 6, 0, 0]]  ← padded
```

```python
from sinlib import TypoDetector

detector = TypoDetector.from_pretrained("Ransaka/sinlib")

detector("අපකරියට ගිය")
# 'අපකීර්තියට ගිය'

detector.suggest_correction("අඩිරාජ")
# ['අධිරාජ']
```

Sinhala combines a base consonant with vowel diacritics into a single phonetic unit. Raw Unicode tokenization breaks these units apart, producing incorrect representations for ASR and TTS.
| Approach | Output for ආයුබෝවන් |
|---|---|
| Sinlib | ['ආ', 'යු', 'බෝ', 'ව', 'න්'] — 5 phonological units |
| Raw Unicode | ['ආ', 'ය', 'ු', 'බ', 'ෝ', 'ව', 'න', '්'] — 8 code points |
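The difference shown in the table can be approximated with the standard library alone. The sketch below is an assumption about the general technique, not sinlib's actual algorithm: it attaches every combining mark (Unicode categories `Mn` and `Mc`) to the preceding base character. `split_units` is a hypothetical name, and conjuncts built with the zero-width joiner are not handled.

```python
import unicodedata


def split_units(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Illustrative sketch only; it does not handle ZWJ conjuncts.
    """
    units: list[str] = []
    for ch in text:
        # Vowel signs and the al-lakuna are combining marks (Mn or Mc).
        if units and unicodedata.category(ch) in ("Mn", "Mc"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


word = "ආයුබෝවන්"
split_units(word)  # ['ආ', 'යු', 'බෝ', 'ව', 'න්']
len(list(word))    # 8 raw code points
```

Keeping a diacritic attached to its consonant means that each token corresponds to something a speaker would pronounce as one unit, which is the property the table highlights.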
Spell checking guide
Detect and correct Sinhala typos using the n-gram language model. Read the guide →
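To give a feel for dictionary-backed suggestions, here is a small self-contained sketch. It swaps in a simpler technique than the library's n-gram language model: candidates are ranked by character-bigram overlap (Jaccard similarity) against a word list. The function names, the threshold, and the two-word dictionary are all hypothetical and exist only for illustration.

```python
def char_ngrams(word: str, n: int = 2) -> set[str]:
    """Set of character n-grams, with ^ and $ marking the word boundaries."""
    padded = f"^{word}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def suggest(word: str, dictionary: list[str], top_k: int = 3,
            min_score: float = 0.3) -> list[str]:
    """Rank dictionary words by bigram Jaccard similarity to `word`.

    Illustrative stand-in, not the n-gram language model used by sinlib.
    """
    target = char_ngrams(word)
    scored = []
    for candidate in dictionary:
        grams = char_ngrams(candidate)
        score = len(target & grams) / len(target | grams)
        if score >= min_score:
            scored.append((score, candidate))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [candidate for _, candidate in scored[:top_k]]


toy_dictionary = ["අධිරාජ", "ආයුබෝවන්"]  # hypothetical word list
suggest("අඩිරාජ", toy_dictionary)  # ['අධිරාජ']
```

A real spell checker would also weigh word frequency and surrounding context, which is where a language model comes in.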