Spell Checking Guide
This guide covers sinlib’s TypoDetector — how it works, how to use it, and how to tune it.
How it works
TypoDetector uses a two-step approach:
- Dictionary lookup: if the word is in the known Sinhala dictionary (≈ 45 000 words), it is accepted immediately.
- N-gram scoring: if the word is not in the dictionary, its character-level bigram probability is estimated. Words whose probability falls below the threshold are replaced by the closest dictionary match.
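The two steps above can be sketched in a few lines. This is an illustration only, not sinlib's actual implementation: `correct_word`, the toy dictionary, and the stub scorer are hypothetical names, and the sketch assumes that a word with no sufficiently close dictionary match is returned unchanged.

```python
from difflib import get_close_matches
from typing import Callable, Iterable


def correct_word(
    word: str,
    dictionary: Iterable[str],
    score: Callable[[str], float],
    threshold: float = 1e-8,
    cutoff: float = 0.7,
) -> str:
    """Illustrative two-step check; not sinlib's actual implementation."""
    known = set(dictionary)
    # Step 1: dictionary lookup. Known words are accepted immediately.
    if word in known:
        return word
    # Step 2: n-gram scoring. Unknown words that still look plausible are kept.
    if score(word) >= threshold:
        return word
    # Implausible words are replaced by the closest dictionary entry.
    # Assumption: if nothing is close enough, the word is returned unchanged.
    matches = get_close_matches(word, sorted(known), n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Toy data (hypothetical): a two-word dictionary and a stub scorer.
toy_dictionary = {"ගෙදර", "මගේ"}
toy_scores = {"ගෙදරර": 1e-12}


def toy_score(word: str) -> float:
    # Words without an entry get a high score, so they count as plausible.
    return toy_scores.get(word, 1e-3)


accepted = correct_word("ගෙදර", toy_dictionary, toy_score)    # in dictionary, kept
replaced = correct_word("ගෙදරර", toy_dictionary, toy_score)   # low score, replaced
plausible = correct_word("සිංහල", toy_dictionary, toy_score)  # unknown but plausible, kept
```

Raising `threshold` in this sketch sends more unknown words to the replacement step; lowering it lets more of them through unchanged.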
Loading the detector
    from sinlib import TypoDetector

    # Recommended: downloads artefacts on first use
    detector = TypoDetector.from_pretrained("Ransaka/sinlib")

    # Direct construction (equivalent)
    detector = TypoDetector()

Artefacts are cached locally after the first download.
Correcting text
    detector("අපකරියට ගිය")
    # 'අපකීර්තියට ගිය'

    detector("මගේ ගෙදර ලස්සනයි")
    # 'මගේ ගෙදර ලස්සනයි' (all words valid, unchanged)

__call__ processes each whitespace-separated word independently and joins the results.
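The word-by-word behaviour can be pictured as a split, a per-word correction, and a join. The helper below is a hypothetical sketch, not sinlib code: `correct_text` and the stub `fix_word` are illustrative names, and joining with a single space is an assumption about how the results are reassembled.

```python
from typing import Callable


def correct_text(text: str, correct_word: Callable[[str], str]) -> str:
    """Apply a per-word corrector to each whitespace-separated word."""
    # str.split() with no arguments splits on any run of whitespace.
    words = text.split()
    # Each word is corrected in isolation; no neighbouring words are consulted.
    return " ".join(correct_word(word) for word in words)


# Stub corrector (hypothetical) that only knows the fix from the example above.
known_fixes = {"අපකරියට": "අපකීර්තියට"}


def fix_word(word: str) -> str:
    return known_fixes.get(word, word)


corrected = correct_text("අපකරියට ගිය", fix_word)
```

In a splitter like this one, punctuation attached to a word is treated as part of that word, so text with punctuation may need to be cleaned before correction.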
Getting suggestions
    detector.suggest_correction("අඩිරාජ")
    # ['අධිරාජ']

    detector.suggest_correction("ගෙදර")
    # ['ගෙදර'] (already in dictionary)

    detector.suggest_correction("xyzxyz")
    # ['No suggestion']

The method uses difflib.get_close_matches with a similarity cutoff of 0.7. The n parameter controls the maximum number of suggestions:

    detector.suggest_correction("අඩිරාජ", n=5)

Scoring words
    prob = detector.word_ngram_probability("සිංහල")
    # e.g. 3.2e-05; higher means more likely to be a real word

    prob = detector.word_ngram_probability("xzq")
    # e.g. 1e-27; very low, would trigger correction

Results are LRU-cached, so repeated calls with the same word return the cached value instead of recomputing it.
Tuning the threshold
The default threshold of 1e-8 works well for general Sinhala text. Adjust it if you want stricter or more lenient behaviour:

    # Stricter: flag more words as typos
    detector = TypoDetector(threshold=1e-6)

    # More lenient: only flag very obvious typos
    detector = TypoDetector(threshold=1e-12)

Lazy loading
    detector = TypoDetector(lazy_loading=True)
    # Nothing downloaded yet

    result = detector("සිංහල")
    # Downloads artefacts now, then processes

Inspecting the vocabulary
    print(detector.dictionary)
    # Dictionary containing 45231 words. Use .get_dictionary() to access the full list.

    words = detector.get_dictionary()   # set[str]
    probs = detector.get_ngram_probs()  # dict

Limitations
- Corrections are made word by word; the model has no sentence context.
- Only the top suggestion replaces a typo; alternatives are available via suggest_correction().
- The dictionary covers general Sinhala vocabulary; domain-specific or neologistic words may be flagged.
- Romanised Sinhala (Singlish) is not supported.
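One possible workaround for the domain-vocabulary limitation is to skip correction for words on a caller-supplied allowlist. The helper below is a hypothetical sketch and not part of sinlib: `correct_with_allowlist` and its parameters are illustrative, and it assumes the corrector can be called on a single word, as a detector instance can according to "Correcting text".

```python
from typing import Callable, Iterable


def correct_with_allowlist(
    text: str,
    correct_word: Callable[[str], str],
    allowlist: Iterable[str],
) -> str:
    """Correct each word unless it appears on the allowlist."""
    protected = set(allowlist)
    return " ".join(
        # Protected words bypass the corrector entirely.
        word if word in protected else correct_word(word)
        for word in text.split()
    )
```

With a loaded detector, a call might look like `correct_with_allowlist(text, detector, domain_terms)`, where `domain_terms` is your own collection of words that should never be changed.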