Spell Checking Guide

This guide covers sinlib’s TypoDetector — how it works, how to use it, and how to tune it.

TypoDetector uses a two-step approach:

  1. Dictionary lookup: if the word is in the known Sinhala dictionary (≈ 45 000 words), it is accepted immediately.
  2. N-gram scoring: if the word is not in the dictionary, its character-level bigram probability is estimated. Words whose probability falls below the threshold are replaced by the closest dictionary match; all other words are left unchanged.
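
The two steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not sinlib's actual implementation: the three-word dictionary, the way bigram probabilities are trained, the floor value `UNSEEN`, and the helper names `train_bigrams`, `bigram_probability` and `check_word` are all invented for the example.

```python
from collections import Counter
from difflib import get_close_matches

# Toy data for illustration only; the real dictionary has about 45 000 words.
DICTIONARY = ["අධිරාජ", "ගෙදර", "මගේ"]
UNSEEN = 1e-6  # assumed probability for a bigram never seen in training


def train_bigrams(words):
    """Estimate character-bigram probabilities from a word list."""
    counts = Counter()
    for word in words:
        counts.update(zip(word, word[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


BIGRAMS = train_bigrams(DICTIONARY)


def bigram_probability(word):
    """Multiply the probabilities of all adjacent character pairs."""
    prob = 1.0
    for pair in zip(word, word[1:]):
        prob *= BIGRAMS.get(pair, UNSEEN)
    return prob


def check_word(word, threshold=1e-8):
    # Step 1: dictionary lookup.
    if word in DICTIONARY:
        return word
    # Step 2: n-gram scoring; plausible-looking words are kept.
    if bigram_probability(word) >= threshold:
        return word
    # Below the threshold: use the closest dictionary match, if there is one.
    matches = get_close_matches(word, DICTIONARY, n=1, cutoff=0.7)
    return matches[0] if matches else word


print(check_word("ගෙදර"))    # in the dictionary, returned as-is
print(check_word("අඩිරාජ"))  # unseen bigrams push the score below the threshold
```

In this sketch a word with no close match is returned unchanged. The guide does not say what TypoDetector does in that case, so treat that branch as an assumption.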

Create a detector in one of two ways:

from sinlib import TypoDetector
# Recommended: downloads artefacts on first use
detector = TypoDetector.from_pretrained("Ransaka/sinlib")
# Direct construction (equivalent)
detector = TypoDetector()

Artefacts are cached locally after the first download.

detector("අපකරියට ගිය")
# 'අපකීර්තියට ගිය'
detector("මගේ ගෙදර ලස්සනයි")
# 'මගේ ගෙදර ලස්සනයි' ← all words valid, unchanged

__call__ processes each whitespace-separated word independently and joins the results.
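
The word-by-word behaviour can be pictured as a split, map and join. The sketch below is an assumption about the general shape, not sinlib's code: `correct_text` is a hypothetical helper, and the per-word corrector is a stand-in lookup table built from the example above.

```python
def correct_text(text, correct_word):
    """Apply correct_word to each whitespace-separated token and rejoin."""
    return " ".join(correct_word(word) for word in text.split())


# Stand-in for the per-word correction step (hypothetical lookup table).
fixes = {"අපකරියට": "අපකීර්තියට"}


def toy_corrector(word):
    return fixes.get(word, word)


print(correct_text("අපකරියට ගිය", toy_corrector))
```

Note that `str.split()` collapses runs of whitespace, so this sketch does not preserve the original spacing. Whether TypoDetector preserves it is not stated in this guide.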

detector.suggest_correction("අඩිරාජ")
# ['අධිරාජ']
detector.suggest_correction("ගෙදර")
# ['ගෙදර'] ← already in dictionary
detector.suggest_correction("xyzxyz")
# ['No suggestion']

The method uses difflib.get_close_matches with a similarity cutoff of 0.7. The n parameter controls the maximum number of suggestions:

detector.suggest_correction("අඩිරාජ", n=5)
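
To see how the cutoff and `n` interact, here is a small standalone sketch built on `difflib.get_close_matches`. It mimics the behaviour shown above on a toy dictionary; the `suggest` helper and the dictionary contents are assumptions, not sinlib's internals.

```python
from difflib import get_close_matches

# Toy dictionary for illustration only.
TOY_DICTIONARY = ["අධිරාජ", "ගෙදර", "මගේ"]


def suggest(word, dictionary, n=1, cutoff=0.7):
    """Return up to n dictionary words whose similarity ratio is >= cutoff."""
    if word in dictionary:
        return [word]
    matches = get_close_matches(word, dictionary, n=n, cutoff=cutoff)
    return matches or ["No suggestion"]


print(suggest("අඩිරාජ", TOY_DICTIONARY, n=5))
print(suggest("xyzxyz", TOY_DICTIONARY))
```

`n` is only an upper bound: even with `n=5`, candidates below the 0.7 cutoff are dropped, so the first call still yields a single suggestion here.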

word_ngram_probability returns the estimated n-gram probability of a single word:

prob = detector.word_ngram_probability("සිංහල")
# e.g. 3.2e-05; higher means more likely to be a real word
prob = detector.word_ngram_probability("xzq")
# e.g. 1e-27; very low, would trigger correction

Results are LRU-cached, so repeated calls with the same word cost almost nothing.
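
The effect of LRU caching can be shown with `functools.lru_cache` from the standard library. This is a sketch of the general technique only: the bigram table, the `UNSEEN` floor, the cache size and the function name `word_probability` are assumptions, and sinlib may implement its cache differently.

```python
from functools import lru_cache

# Hypothetical bigram table; unseen pairs get a small floor probability.
BIGRAM_PROBS = {("a", "b"): 0.5, ("b", "c"): 0.25}
UNSEEN = 1e-6


@lru_cache(maxsize=1024)  # cache size is an assumption
def word_probability(word):
    """Product of bigram probabilities, memoised per word."""
    prob = 1.0
    for pair in zip(word, word[1:]):
        prob *= BIGRAM_PROBS.get(pair, UNSEEN)
    return prob


word_probability("abc")  # computed and stored
word_probability("abc")  # served from the cache
info = word_probability.cache_info()
print(info.hits, info.misses)
```

The second call never runs the loop: `cache_info()` reports one miss for the first call and one hit for the repeat.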

The default threshold of 1e-8 works well for general Sinhala text. Raise it for stricter checking or lower it for more lenient behaviour:

# Stricter — flag more words as typos
detector = TypoDetector(threshold=1e-6)
# More lenient — only flag very obvious typos
detector = TypoDetector(threshold=1e-12)
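
The following sketch shows why a larger threshold is stricter. The scores are made-up illustrative values for words assumed to be outside the dictionary, and `flagged` is a hypothetical helper; a word is treated as a typo when its probability is below the threshold.

```python
# Illustrative probabilities for four out-of-dictionary words (invented values).
scores = {
    "word_a": 3.2e-05,
    "word_b": 5.0e-07,
    "word_c": 2.0e-10,
    "word_d": 1.0e-27,
}


def flagged(scores, threshold):
    """Words whose probability falls below the threshold."""
    return [word for word, prob in scores.items() if prob < threshold]


for threshold in (1e-6, 1e-8, 1e-12):
    print(threshold, flagged(scores, threshold))
```

With these values, 1e-6 flags three words, the default 1e-8 flags two, and 1e-12 flags only the least probable one.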

With lazy_loading=True, the artefacts are not downloaded until the detector is first used:

detector = TypoDetector(lazy_loading=True)
# Nothing downloaded yet
result = detector("සිංහල")
# Downloads artefacts now, then processes
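
Lazy loading is a general pattern: keep a loader function and call it only when the data is first needed. The class below is a hedged sketch of that pattern, not sinlib's implementation; `LazyChecker`, `fake_download` and the `loaded` property are hypothetical names.

```python
class LazyChecker:
    """Defers an expensive load until the first call (illustrative only)."""

    def __init__(self, loader, lazy_loading=False):
        self._loader = loader
        self._words = None
        if not lazy_loading:
            self._ensure_loaded()

    def _ensure_loaded(self):
        # Load once; later calls reuse the stored data.
        if self._words is None:
            self._words = self._loader()

    @property
    def loaded(self):
        return self._words is not None

    def __call__(self, word):
        self._ensure_loaded()
        return word in self._words


calls = []  # records each simulated download


def fake_download():
    calls.append("download")
    return {"සිංහල"}


checker = LazyChecker(fake_download, lazy_loading=True)
print(checker.loaded)    # nothing loaded yet
print(checker("සිංහල"))  # triggers the load, then checks the word
print(len(calls))        # the loader has run exactly once
```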

The loaded dictionary and n-gram probabilities can be inspected directly:

print(detector.dictionary)
# Dictionary containing 45231 words. Use .get_dictionary() to access the full list.
words = detector.get_dictionary()  # set[str]
probs = detector.get_ngram_probs()  # dict

Limitations:

  • Corrections are made word-by-word; the model has no sentence context.
  • Only the top suggestion replaces a typo; alternatives are available via suggest_correction().
  • The dictionary covers general Sinhala vocabulary; domain-specific or neologistic words may be flagged.
  • Romanised Sinhala (Singlish) is not supported.