Sinlib

Sinhala NLP toolkit for Python. Phonological tokenization, n-gram spell checking, and text preprocessing — all in one package.
v0.1.13 · MIT License
pip install sinlib
🔤 Phonological Tokenizer

Splits Sinhala text into base consonant + diacritic units. HuggingFace-compatible API.

Spell Checker

N-gram model backed by a ~45 000-word dictionary. Auto-corrects typos and suggests alternatives.

⚙️ Preprocessing

Remove noise, compute Sinhala character ratios, and batch-process text.

☁️ HuggingFace Hub

Vocab and weights are fetched automatically from Ransaka/sinlib on first use.
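As a rough illustration of what a "Sinhala character ratio" can mean, the sketch below counts the share of non-whitespace characters that fall in the Unicode Sinhala block (U+0D80 to U+0DFF). The function name `sinhala_char_ratio` is hypothetical and is not Sinlib's API; whether Sinlib ignores whitespace or punctuation in the same way is an assumption.

```python
def sinhala_char_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Sinhala block.

    Hypothetical helper for illustration only; not part of Sinlib.
    """
    # Ignore whitespace so that spacing does not dilute the ratio (an assumption).
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    # The Unicode Sinhala block spans U+0D80 through U+0DFF.
    sinhala = sum(1 for ch in chars if "\u0d80" <= ch <= "\u0dff")
    return sinhala / len(chars)


# Two Sinhala letters followed by two Latin letters.
mixed_ratio = sinhala_char_ratio("\u0d86\u0dba ab")
print(mixed_ratio)
```

A ratio like this is typically used to filter out lines that are mostly non-Sinhala before further processing.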

from sinlib import Tokenizer
tokenizer = Tokenizer.from_pretrained("Ransaka/sinlib")
# Split into phonological units
tokenizer.tokenize("ආයුබෝවන්")
# ['ආ', 'යු', 'බෝ', 'ව', 'න්']
# Encode to integer IDs
enc = tokenizer("ආයුබෝවන්")
enc.input_ids # [4, 23, 18, 7, 12]
enc.attention_mask # [1, 1, 1, 1, 1]

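Sinhala script writes a base consonant together with its vowel diacritics as a single phonetic unit. Tokenizing by raw Unicode code point breaks these units apart, which yields incorrect representations for speech tasks such as automatic speech recognition (ASR) and text-to-speech (TTS).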

Approach | Output for ආයුබෝවන් | Units
Sinlib | ['ආ', 'යු', 'බෝ', 'ව', 'න්'] | 5 phonological units
Raw Unicode | ['ආ', 'ය', 'ු', 'බ', 'ෝ', 'ව', 'න', '්'] | 8 code points
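The difference in the comparison above can be reproduced with the standard library alone. The following is a minimal sketch, not Sinlib's implementation: it attaches each combining mark (Unicode categories `Mn` and `Mc`) to the preceding base character. The helper names are hypothetical, and the sketch does not handle conjuncts formed with the zero-width joiner, which a real phonological tokenizer would need to consider.

```python
import unicodedata


def split_code_points(text: str) -> list[str]:
    """Split text into individual Unicode code points."""
    return list(text)


def group_with_marks(text: str) -> list[str]:
    """Attach combining marks to the preceding base character.

    Hypothetical helper for illustration; not Sinlib's algorithm.
    """
    units: list[str] = []
    for ch in text:
        # Mn = nonspacing mark, Mc = spacing combining mark.
        if units and unicodedata.category(ch) in ("Mn", "Mc"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


# Normalize to NFC so that composed vowel signs are single code points.
word = unicodedata.normalize("NFC", "ආයුබෝවන්")
raw_units = split_code_points(word)
grouped_units = group_with_marks(word)

print(len(raw_units), raw_units)
print(len(grouped_units), grouped_units)
```

For this word the grouped output has five units and the raw split has eight, matching the comparison above.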

Tokenization guide

Learn how the phonological splitter works and how to integrate it in your pipeline. Read the guide →

Spell checking guide

Detect and correct Sinhala typos using the n-gram language model. Read the guide →