Tokenization Guide

This guide explains how sinlib tokenizes Sinhala text, why it produces the output it does, and how to use every part of the Tokenizer API.

Sinhala script uses combining diacritics: a vowel sound is written as a mark attached to a base consonant. Unicode assigns separate code points to the base consonant and each diacritic, but linguistically they form a single phonological unit.

For example, ආයුබෝවන් is eight Unicode code points:

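| Code point | Char | Type |
| --- | --- | --- |
| U+0D86 | ආ | vowel letter |
| U+0DBA | ය | consonant |
| U+0DD4 | ු | vowel sign (attaches to ය) |
| U+0DB6 | බ | consonant |
| U+0DDD | ෝ | vowel sign (attaches to බ) |
| U+0DC0 | ව | consonant |
| U+0DB1 | න | consonant |
| U+0DCA | ් | virama (attaches to න) |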

Splitting on code points gives ['ආ','ය','ු','බ','ෝ','ව','න','්']: eight tokens that do not correspond to speech sounds. Instead, sinlib produces ['ආ','යු','බෝ','ව','න්']: five phonological units.
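To make the difference concrete, here is a minimal sketch that lists the code points of the word and then attaches each dependent sign to the preceding character. It is not sinlib's implementation. The helper names `is_dependent_sign` and `group_units` are hypothetical, and the rule that only vowel signs and the virama attach (while the anusvara ං stays separate) is an assumption inferred from the outputs shown in this guide.

```python
import unicodedata

def is_dependent_sign(ch):
    """Assumed rule: the virama and the Sinhala vowel signs attach to the previous unit."""
    cp = ord(ch)
    return cp == 0x0DCA or 0x0DCF <= cp <= 0x0DDF or 0x0DF2 <= cp <= 0x0DF3

def group_units(text):
    """Merge each dependent sign into the unit that precedes it."""
    units = []
    for ch in text:
        if units and is_dependent_sign(ch):
            units[-1] += ch
        else:
            units.append(ch)
    return units

# The word is built from escapes so the eight code points are unambiguous.
word = "\u0d86\u0dba\u0dd4\u0db6\u0ddd\u0dc0\u0db1\u0dca"  # ආයුබෝවන්

for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

print(len(word))          # 8
print(group_units(word))  # ['ආ', 'යු', 'බෝ', 'ව', 'න්']
```

This sketch ignores conjuncts formed with the zero-width joiner and any other rules the library may apply, so treat it as an illustration of the idea rather than a replacement for `Tokenizer.tokenize()`.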

from sinlib import Tokenizer
# Default: loads vocab from HuggingFace Hub (Ransaka/sinlib)
tokenizer = Tokenizer.from_pretrained("Ransaka/sinlib")
# Local directory (must contain vocab.json)
tokenizer = Tokenizer.from_pretrained("./my_tokenizer/")
tokenizer.tokenize("ආයුබෝවන්")
# ['ආ', 'යු', 'බෝ', 'ව', 'න්']
tokenizer.tokenize("සිංහල")
# ['සි', 'ං', 'හ', 'ල']
tokenizer.encode("ආයුබෝවන්")
# [4, 23, 18, 7, 12]

__call__ / encode_plus() — full BatchEncoding

enc = tokenizer("ආයුබෝවන්")
enc.input_ids # [4, 23, 18, 7, 12]
enc.attention_mask # [1, 1, 1, 1, 1]
# Pad to a fixed length
enc = tokenizer("සිංහල", max_length=8, padding="max_length")
enc.input_ids
# [9, 31, 6, 29, 0, 0, 0, 0] ← 0 is the pad token ID
# Truncate long sequences
enc = tokenizer("ආයුබෝවන්", max_length=3, truncation=True)
enc.input_ids
# [4, 23, 18]
enc = tokenizer("සිංහල", add_special_tokens=True)
# Prepends BOS token ID and appends EOS token ID when configured
batch = tokenizer(["ආයුබෝවන්", "සිංහල"], padding=True)
batch.input_ids
# [[4, 23, 18, 7, 12], [9, 31, 6, 29, 0]]
# ↑ padded to match longest sequence

Or equivalently:

batch = tokenizer.batch_encode(["ආයුබෝවන්", "සිංහල"], padding=True)
tokenizer.decode([4, 23, 18, 7, 12])
# 'ආයුබෝවන්'
tokenizer.batch_decode([[4, 23, 18, 7, 12], [9, 31, 6, 29]])
# ['ආයුබෝවන්', 'සිංහල']
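The padding and truncation options shown earlier can be pictured as plain list operations on the ID sequences. The sketch below is an illustration under assumptions, not the library's code: `pad_or_truncate` and `pad_batch` are hypothetical helpers, the pad ID of 0 is taken from this guide, and the mask convention (1 for real tokens, 0 for padding) is the usual one and is assumed here.

```python
def pad_or_truncate(ids, max_length, pad_id=0, truncation=False, padding=False):
    """Return (input_ids, attention_mask) for one sequence."""
    ids = list(ids)
    if truncation and len(ids) > max_length:
        ids = ids[:max_length]          # keep the first max_length IDs
    mask = [1] * len(ids)               # 1 marks a real token
    if padding and len(ids) < max_length:
        missing = max_length - len(ids)
        ids = ids + [pad_id] * missing  # pad on the right
        mask = mask + [0] * missing     # 0 marks padding
    return ids, mask

def pad_batch(batch, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    longest = max(len(ids) for ids in batch)
    return [pad_or_truncate(ids, longest, pad_id, padding=True) for ids in batch]

print(pad_or_truncate([9, 31, 6, 29], 8, padding=True))
# ([9, 31, 6, 29, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 0])

print(pad_or_truncate([4, 23, 18, 7, 12], 3, truncation=True))
# ([4, 23, 18], [1, 1, 1])

print(pad_batch([[4, 23, 18, 7, 12], [9, 31, 6, 29]]))
# [([4, 23, 18, 7, 12], [1, 1, 1, 1, 1]), ([9, 31, 6, 29, 0], [1, 1, 1, 1, 0])]
```

Padding on the right keeps real tokens at the same positions in every row, which is why the mask is needed to tell a model which positions to ignore.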

Skip special tokens during decoding:

tokenizer.decode([2, 4, 23, 18, 7, 12, 3], skip_special_tokens=True)  # 2 = [BOS], 3 = [EOS]
# 'ආයුබෝවන්'
vocab = tokenizer.get_vocab()
# {'[PAD]': 0, '[UNK]': 1, '[BOS]': 2, '[EOS]': 3, 'ආ': 4, ...}
# Token ↔ ID conversion
tokenizer.convert_tokens_to_ids(['ආ', 'යු']) # [4, 23]
tokenizer.convert_ids_to_tokens([4, 23]) # ['ආ', 'යු']
# Save
tokenizer.save_pretrained("./my_tokenizer/")
# Writes ./my_tokenizer/vocab.json
# Reload later
tokenizer2 = Tokenizer.from_pretrained("./my_tokenizer/")
# Train a new tokenizer on your own corpus
corpus = ["සිංහල", "ආයුබෝවන්", ...]
tokenizer = Tokenizer(model_max_length=64)
tokenizer.train(corpus)
tokenizer.save_pretrained("./custom_tokenizer/")
| Token | Attribute | Default ID |
| --- | --- | --- |
| Padding | tokenizer.pad_token | 0 |
| Unknown | tokenizer.unk_token | 1 |
| Beginning of sequence | tokenizer.bos_token | 2 |
| End of sequence | tokenizer.eos_token | 3 |
tokenizer.pad_token # '[PAD]'
tokenizer.pad_token_id # 0
tokenizer.all_special_tokens
# ['[PAD]', '[UNK]', '[BOS]', '[EOS]']
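The way the four special tokens interact with encoding and decoding can be sketched with a toy vocabulary. This assumes only what the table above states (IDs 0 to 3 are reserved for the special tokens). The functions `build_vocab`, `encode` and `decode` are hypothetical stand-ins, not sinlib's methods, and the IDs they assign to ordinary tokens will differ from those of the pretrained vocabulary.

```python
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[BOS]", "[EOS]"]

def build_vocab(units):
    """Reserve IDs 0-3 for special tokens, then number units in first-seen order."""
    vocab = {token: i for i, token in enumerate(SPECIAL_TOKENS)}
    for unit in units:
        if unit not in vocab:
            vocab[unit] = len(vocab)
    return vocab

def encode(units, vocab, add_special_tokens=False):
    """Map units to IDs; anything outside the vocabulary becomes [UNK]."""
    ids = [vocab.get(unit, vocab["[UNK]"]) for unit in units]
    if add_special_tokens:
        ids = [vocab["[BOS]"]] + ids + [vocab["[EOS]"]]
    return ids

def decode(ids, vocab, skip_special_tokens=False):
    """Map IDs back to tokens and join them, optionally dropping special tokens."""
    id_to_token = {i: token for token, i in vocab.items()}
    tokens = [id_to_token[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    return "".join(tokens)

units = ["ආ", "යු", "බෝ", "ව", "න්"]
vocab = build_vocab(units)

ids = encode(units, vocab, add_special_tokens=True)
print(ids)                                           # [2, 4, 5, 6, 7, 8, 3]
print(decode(ids, vocab, skip_special_tokens=True))  # ආයුබෝවන්
print(encode(["ක"], vocab))                          # [1], unseen unit maps to [UNK]
```

Reserving the lowest IDs for special tokens keeps them stable however large the learned vocabulary grows, which is presumably why the defaults in the table start at 0.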