
TypoDetector

Sinhala spell checker using a character-level n-gram language model combined with edit-distance candidate generation.

from sinlib import TypoDetector
| Method / Property | Returns | Description |
| --- | --- | --- |
| TypoDetector.from_pretrained(repo) | TypoDetector | Load from HF Hub |
| detector(text) | str | Correct a sentence |
| detector.suggest_correction(word) | list[str] | Closest dictionary matches |
| detector.word_ngram_probability(word) | float | N-gram likelihood score |
| detector.get_dictionary() | set[str] | Full word list |
| detector.get_ngram_probs() | dict | Full n-gram table |
| detector.dictionary | str | Human-readable summary |
| detector.ngram_probs | str | Human-readable summary |
from sinlib import TypoDetector
detector = TypoDetector.from_pretrained("Ransaka/sinlib")
detector("අපකරියට ගිය")
# 'අපකීර්තියට ගිය'
detector.suggest_correction("අඩිරාජ")
# ['අධිරාජ']
detector.suggest_correction("xyz")
# ['No suggestion']
prob = detector.word_ngram_probability("සිංහල")
# 0.000032 (higher = more likely to be a real word)
print(detector.dictionary)
# Dictionary containing 45231 words.
words = detector.get_dictionary()
"ගෙදර" in words # True

For each word in the input sentence the detector:

  1. Checks whether the word is in the known dictionary; if so, the word passes through unchanged.
  2. Estimates the word’s character-level bigram probability.
    • If prob < threshold (default 1e-8): replaces the word with the top suggest_correction result.
    • If threshold <= prob < 1.0: emits a UserWarning but keeps the word.
  3. On any processing error, emits a UserWarning and keeps the original word.
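
The steps above can be sketched in a few lines of plain Python. This is only an illustration of the described decision flow, not sinlib's actual implementation: the function correct_sentence, its parameters, the whitespace tokenization, the warning messages, and the toy English stand-ins for the dictionary, scorer, and suggester are all assumptions made for the example. The source does not say what happens when prob >= 1.0; the sketch simply keeps the word without a warning.

```python
import warnings
from typing import Callable, Iterable


def correct_sentence(
    text: str,
    dictionary: Iterable[str],
    ngram_probability: Callable[[str], float],
    suggest: Callable[[str], list[str]],
    threshold: float = 1e-8,
) -> str:
    """Hypothetical per-word correction flow mirroring steps 1-3."""
    known = set(dictionary)
    corrected = []
    for word in text.split():  # assumption: whitespace tokenization
        try:
            if word in known:
                # Step 1: dictionary words pass through unchanged.
                corrected.append(word)
                continue
            prob = ngram_probability(word)
            if prob < threshold:
                # Step 2, first case: very unlikely word, use the top suggestion.
                corrected.append(suggest(word)[0])
            else:
                # Step 2, second case: unknown but plausible word, warn and keep.
                if prob < 1.0:
                    warnings.warn(f"Unknown word kept as-is: {word!r}", UserWarning)
                corrected.append(word)
        except Exception as exc:
            # Step 3: on any processing error, warn and keep the original word.
            warnings.warn(f"Could not process {word!r}: {exc}", UserWarning)
            corrected.append(word)
    return " ".join(corrected)


# Toy stand-ins (hypothetical; a real detector would supply these).
toy_dictionary = {"the", "cat", "sat"}
toy_probs = {"teh": 1e-12, "catz": 1e-4}
toy_fixes = {"teh": ["the"]}


def toy_probability(word: str) -> float:
    return toy_probs[word]  # raises KeyError for unscored words


def toy_suggest(word: str) -> list[str]:
    return toy_fixes.get(word, ["No suggestion"])


print(correct_sentence("the teh catz", toy_dictionary, toy_probability, toy_suggest))
# the the catz   (plus a UserWarning for 'catz' on stderr)
```

Passing the scorer and suggester in as callables keeps the control flow visible on its own; in the real class these roles are presumably played by word_ngram_probability and suggest_correction.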