Sinlib

Sinhala NLP toolkit for Python. Phonological tokenization, n-gram spell checking, and text preprocessing — all in one package.

v0.1.13 · MIT License

Install

pip install sinlib

What’s inside

🔤Phonological Tokenizer

Splits Sinhala text into base consonant + diacritic units. HuggingFace-compatible API.

✅Spell Checker

N-gram model backed by a ~45 000-word dictionary. Auto-corrects typos and suggests alternatives.

⚙️Preprocessing

Remove noise, compute Sinhala character ratios, and batch-process text.

☁️HuggingFace Hub

Vocab and weights are fetched automatically from Ransaka/sinlib on first use.

Quick start

from sinlib import Tokenizer

tokenizer = Tokenizer.from_pretrained("Ransaka/sinlib")

# Split into phonological units
tokenizer.tokenize("ආයුබෝවන්")
# ['ආ', 'යු', 'බෝ', 'ව', 'න්']

# Encode to integer IDs
enc = tokenizer("ආයුබෝවන්")
enc.input_ids       # [4, 23, 18, 7, 12]
enc.attention_mask  # [1, 1, 1, 1, 1]

# Encode a batch with padding
batch = tokenizer(["ආයුබෝවන්", "සිංහල"], padding=True)
batch.input_ids
# [[4, 23, 18, 7, 12],
#  [9, 31,  6,  0,  0]]  ← padded
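The padded batch above can be reproduced by hand. What follows is a minimal sketch of right-padding to the longest sequence, not Sinlib's implementation. The helper name `pad_batch` is hypothetical, the pad ID of 0 is inferred from the zeros in the example output, and the attention mask is assumed to follow the usual HuggingFace convention (1 for real tokens, 0 for padding).

```python
from typing import List, Tuple


def pad_batch(
    sequences: List[List[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the length of the longest one.

    Returns (input_ids, attention_mask). pad_id=0 is an assumption,
    not a documented Sinlib value.
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens first, then padding on the right
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_batch([[4, 23, 18, 7, 12], [9, 31, 6]])
print(ids)   # [[4, 23, 18, 7, 12], [9, 31, 6, 0, 0]]
print(mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

A downstream model can use the mask to ignore the padded positions.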

from sinlib import TypoDetector

detector = TypoDetector.from_pretrained("Ransaka/sinlib")

# Auto-correct typos in running text
detector("අපකරියට ගිය")
# 'අපකීර්තියට ගිය'

# List suggested alternatives for a misspelled word
detector.suggest_correction("අඩිරාජ")
# ['අධිරාජ']

Why phonological tokenization?

Sinhala combines a base consonant with vowel diacritics into a single phonetic unit. Splitting text by Unicode code point breaks these units apart, which produces incorrect representations for ASR and TTS.

Approach	Output for `ආයුබෝවන්`
Sinlib	`['ආ', 'යු', 'බෝ', 'ව', 'න්']` — 5 phonological units
Raw Unicode	`['ආ', 'ය', 'ු', 'බ', 'ෝ', 'ව', 'න', '්']` — 8 code points
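The difference between the two rows can be illustrated with the Python standard library alone. The sketch below is an approximation and not Sinlib's algorithm: it attaches every Unicode combining mark (general category starting with `M`) to the character before it. The function names `split_code_points` and `split_units` are hypothetical.

```python
import unicodedata
from typing import List


def split_code_points(text: str) -> List[str]:
    """Naive split: one element per Unicode code point."""
    return list(text)


def split_units(text: str) -> List[str]:
    """Group each base character with the combining marks that follow it."""
    units: List[str] = []
    for ch in text:
        # Categories Mn/Mc/Me cover vowel signs and the al-lakuna (U+0DCA)
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


# The word ආයුබෝවන්, written with escapes so the code points are unambiguous
word = "\u0d86\u0dba\u0dd4\u0db6\u0ddd\u0dc0\u0db1\u0dca"

print(len(split_code_points(word)))  # 8
print(len(split_units(word)))        # 5
print(split_units(word))
```

This heuristic does not handle conjuncts joined with a zero-width joiner (U+200D, category `Cf`), so it should be read as a demonstration of the idea and not as a replacement for the tokenizer.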

Tokenization guide

Learn how the phonological splitter works and how to integrate it into your pipeline.

Spell checking guide

Detect and correct Sinhala typos using the n-gram language model.