Spell Checking Guide

This guide covers sinlib’s TypoDetector — how it works, how to use it, and how to tune it.

How it works

TypoDetector uses a two-step approach:

Dictionary lookup: if the word is in the known Sinhala dictionary (≈ 45 000 words), it is accepted immediately.
N-gram scoring: if the word is not in the dictionary, its character-level bigram probability is estimated. A word whose probability falls below the threshold is replaced by its closest dictionary match.
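
The two steps above can be sketched in plain Python. This is a toy illustration, not sinlib's implementation: the helper names, the start/end padding, the product-of-bigrams formula, and the floor for unseen bigrams are all assumptions made for the example, and the three-word dictionary is only there to make it runnable.

```python
from collections import Counter

def train_bigram_probs(words):
    """Relative frequency of each character bigram in a word list."""
    counts = Counter()
    for word in words:
        padded = f"^{word}$"  # mark word start and end (assumed convention)
        counts.update(zip(padded, padded[1:]))
    total = sum(counts.values())
    return {bigram: count / total for bigram, count in counts.items()}

def word_probability(word, probs, floor=1e-9):
    """Multiply the word's bigram probabilities; unseen bigrams get `floor`."""
    padded = f"^{word}$"
    prob = 1.0
    for bigram in zip(padded, padded[1:]):
        prob *= probs.get(bigram, floor)
    return prob

def needs_correction(word, dictionary, probs, threshold=1e-8):
    """Step 1: dictionary lookup. Step 2: bigram score against the threshold."""
    if word in dictionary:
        return False
    return word_probability(word, probs) < threshold

# Toy dictionary; the real one is far larger.
dictionary = {"ගෙදර", "ගෙවල්", "ගෙන"}
probs = train_bigram_probs(dictionary)

print(needs_correction("ගෙදර", dictionary, probs))  # dictionary word
print(needs_correction("xzq", dictionary, probs))   # only unseen bigrams
```

In this sketch a word made entirely of unseen bigrams collapses to a product of floor values, which is why strings such as "xzq" score many orders of magnitude below real words.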

Loading the detector

from sinlib import TypoDetector

# Recommended: downloads artefacts on first use
detector = TypoDetector.from_pretrained("Ransaka/sinlib")

# Direct construction (equivalent)
detector = TypoDetector()

Artefacts are cached locally after the first download.

Correcting text

detector("ගුරුවරයා අපට උගන්වය්")
# 'ගුරුවරයා අපට උගන්වයි'

detector("මගේ ගෙදර ලස්සනයි")
# 'මගේ ගෙදර ලස්සනයි'  ← all words valid, unchanged

__call__ processes each whitespace-separated word independently and joins the results.
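
A minimal sketch of that word-by-word flow, assuming a hypothetical `correct_word` callable in place of the detector's per-word logic. Rejoining with single spaces is also an assumption; the guide does not say whether the original whitespace is preserved.

```python
def correct_text(text, correct_word):
    """Apply `correct_word` to each whitespace-separated token and rejoin."""
    return " ".join(correct_word(word) for word in text.split())

# Hypothetical one-entry lookup standing in for the real per-word correction.
fixes = {"උගන්වය්": "උගන්වයි"}
corrected = correct_text("ගුරුවරයා අපට උගන්වය්", lambda w: fixes.get(w, w))
print(corrected)
```

Because each token is handled in isolation, a word that is valid on its own but wrong in context passes through unchanged.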

Getting suggestions

detector.suggest_correction("අඩිරාජ")
# ['අධිරාජ']

detector.suggest_correction("ගෙදර")
# ['ගෙදර']  ← already in dictionary

detector.suggest_correction("xyzxyz")
# ['No suggestion']

The method uses difflib.get_close_matches with a similarity cutoff of 0.7. The n parameter controls the maximum number of suggestions:

detector.suggest_correction("අඩිරාජ", n=5)
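
The matching step can be reproduced with the standard library alone. The sketch below mirrors the behaviour shown in the examples above using `difflib.get_close_matches` with `cutoff=0.7`. The `suggest` function, its default `n=1`, and the three-word dictionary are hypothetical stand-ins, not sinlib's code.

```python
import difflib

def suggest(word, dictionary, n=1, cutoff=0.7):
    """Return the word itself if known, else up to `n` close matches."""
    if word in dictionary:
        return [word]
    # sorted() only makes the candidate order deterministic for this demo.
    matches = difflib.get_close_matches(word, sorted(dictionary), n=n, cutoff=cutoff)
    return matches or ["No suggestion"]

# Toy dictionary for illustration.
dictionary = {"අධිරාජ", "ගෙදර", "ලස්සනයි"}

print(suggest("අඩිරාජ", dictionary))   # one character differs from a dictionary word
print(suggest("xyzxyz", dictionary))   # shares no characters with any entry
```

`get_close_matches` compares strings by Unicode code point, so a Sinhala letter and its attached vowel sign count as separate characters when the similarity ratio is computed.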

Scoring words

prob = detector.word_ngram_probability("සිංහල")
# e.g. 3.2e-05  — higher means more likely to be a real word

prob = detector.word_ngram_probability("xzq")
# e.g. 1e-27  — very low; would trigger correction

Results are LRU-cached, so repeated calls with the same word are served from the cache instead of being recomputed.
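
The caching pattern can be illustrated with `functools.lru_cache`. Whether sinlib uses this exact decorator, and with what cache size, is an assumption; the `score` function below is a placeholder rather than a real probability model.

```python
from functools import lru_cache

@lru_cache(maxsize=1024)  # cache size chosen arbitrarily for the demo
def score(word):
    """Placeholder scorer; stands in for an expensive n-gram computation."""
    return 1.0 / (1 + len(word))

score("සිංහල")  # first call: computed and stored
score("සිංහල")  # second call: answered from the cache

info = score.cache_info()
print(info.hits, info.misses)
```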

Tuning the threshold

The default threshold of 1e-8 works well for general Sinhala text. Raise it for stricter checking, or lower it for more lenient behaviour:

# Stricter — flag more words as typos
detector = TypoDetector(threshold=1e-6)

# More lenient — only flag very obvious typos
detector = TypoDetector(threshold=1e-12)
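
To see how the threshold changes what gets flagged, it can help to score a sample of out-of-dictionary words once and compare several cut-offs. The sketch below assumes `score` is any callable that maps a word to a probability (for instance `detector.word_ngram_probability`). The words and scores here are made up so the example runs on its own.

```python
def flagged(words, score, threshold):
    """Return the words whose score falls below `threshold`."""
    return [w for w in words if score(w) < threshold]

# Made-up scores standing in for real n-gram probabilities.
toy_scores = {"word_a": 3e-5, "word_b": 4e-7, "word_c": 2e-10, "word_d": 1e-27}

for threshold in (1e-6, 1e-8, 1e-12):
    print(threshold, flagged(toy_scores, toy_scores.get, threshold))
```

A higher threshold flags a superset of what a lower one flags, so the choice is a trade-off between catching more typos and leaving more rare but valid words untouched.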

Lazy loading

detector = TypoDetector(lazy_loading=True)
# Nothing downloaded yet

result = detector("සිංහල")
# Downloads artefacts now, then processes

Inspecting the vocabulary

print(detector.dictionary)
# Dictionary containing 45231 words. Use .get_dictionary() to access the full list.

words = detector.get_dictionary()   # set[str]
probs = detector.get_ngram_probs()  # dict

Limitations

Corrections are made word-by-word; the model has no sentence context.
Only the top suggestion replaces a typo; alternatives are available via suggest_correction().
The dictionary covers general Sinhala vocabulary; domain-specific or neologistic words may be flagged.
Romanised Sinhala (Singlish) is not supported.