Tokenization Guide

This guide explains how sinlib tokenizes Sinhala text, why it produces the output it does, and how to use every part of the Tokenizer API.

Why Sinhala needs special tokenization

Sinhala script uses combining diacritics: a vowel sound is written as a mark attached to a base consonant. Unicode assigns separate code points to the base consonant and each diacritic, but linguistically they form a single phonological unit.

For example, ආයුබෝවන් is eight Unicode code points:

Code point	Char	Type
U+0D86	ආ	vowel letter
U+0DBA	ය	consonant
U+0DD4	ු	vowel sign (attaches to ය)
U+0DB6	බ	consonant
U+0DDD	ෝ	vowel sign (attaches to බ)
U+0DC0	ව	consonant
U+0DB1	න	consonant
U+0DCA	්	virama (attaches to න)

Splitting on code points gives ['ආ','ය','ු','බ','ෝ','ව','න','්']: eight tokens that don’t map to speech sounds. Instead, sinlib produces ['ආ','යු','බෝ','ව','න්']: five phonological units.
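The grouping idea can be illustrated with the standard library alone. The sketch below is not sinlib's implementation; it is a minimal approximation that assumes every character in a Unicode mark category (`Mn`, `Mc`) belongs to the character before it. The helper name `group_marks` is hypothetical. sinlib appears to apply its own rules on top of this: for example, this guide shows ං as a separate token, which the sketch would not reproduce.

```python
import unicodedata

# "ආයුබෝවන්" written with escapes so the code points are unambiguous.
WORD = "\u0D86\u0DBA\u0DD4\u0DB6\u0DDD\u0DC0\u0DB1\u0DCA"


def group_marks(text: str) -> list[str]:
    """Attach each combining mark (Unicode category M*) to the preceding character.

    Hypothetical helper for illustration; sinlib's real rules may differ.
    """
    units: list[str] = []
    for ch in text:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch   # vowel sign or virama: extend the current unit
        else:
            units.append(ch)  # vowel letter or consonant: start a new unit
    return units


code_points = [f"U+{ord(ch):04X}" for ch in WORD]
units = group_marks(WORD)

print(len(code_points), code_points)  # 8 code points
print(len(units), units)              # 5 units
```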

Loading the tokenizer

from sinlib import Tokenizer

# Default: loads the vocabulary from the Hugging Face Hub (Ransaka/sinlib)
tokenizer = Tokenizer.from_pretrained("Ransaka/sinlib")

# Local directory (must contain vocab.json and config.json)
tokenizer = Tokenizer.from_pretrained("./my_tokenizer/")

Tokenizing text

`tokenize()` — strings only

tokenizer.tokenize("ආයුබෝවන්")
# ['ආ', 'යු', 'බෝ', 'ව', 'න්']

tokenizer.tokenize("සිංහල")
# ['සි', 'ං', 'හ', 'ල']

`encode()` — IDs only

tokenizer.encode("ආයුබෝවන්")
# [4, 23, 18, 7, 12]

`__call__()` / `encode_plus()` — full `BatchEncoding`

enc = tokenizer("ආයුබෝවන්")
enc.input_ids       # [4, 23, 18, 7, 12]
enc.attention_mask  # [1, 1, 1, 1, 1]

Padding and truncation

# Pad to a fixed length
enc = tokenizer("සිංහල", max_length=8, padding=True)
enc.input_ids
# [9, 31, 6, 29, 0, 0, 0, 0]  ← 0 is the pad token ID

# Truncate long sequences
enc = tokenizer("ආයුබෝවන්", max_length=3, truncation=True)
enc.input_ids
# [4, 23, 18]
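To make the two operations concrete, here is a small pure-Python sketch that reproduces the outputs above on plain ID lists. It assumes right-side padding and truncation from the end, which is what the examples show. `pad_or_truncate` is a hypothetical helper, not part of sinlib.

```python
PAD_ID = 0  # pad token ID, as listed in this guide's special tokens table


def pad_or_truncate(ids, max_length, padding=False, truncation=False, pad_id=PAD_ID):
    """Hypothetical helper mirroring the padding and truncation examples."""
    out = list(ids)
    if truncation and len(out) > max_length:
        out = out[:max_length]                     # keep the first max_length IDs
    if padding and len(out) < max_length:
        out += [pad_id] * (max_length - len(out))  # right-pad with the pad ID
    return out


print(pad_or_truncate([9, 31, 6, 29], max_length=8, padding=True))
# [9, 31, 6, 29, 0, 0, 0, 0]
print(pad_or_truncate([4, 23, 18, 7, 12], max_length=3, truncation=True))
# [4, 23, 18]
```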

BOS / EOS tokens

enc = tokenizer("සිංහල", add_bos_token=True)
# Prepends BOS token ID when configured

Batch encoding

batch = tokenizer(["ආයුබෝවන්", "සිංහල"], padding=True)
batch.input_ids
# [[4, 23, 18, 7, 12], [9, 31, 6, 29, 0]]
#                               ↑ padded to match longest sequence

Or equivalently:

batch = tokenizer.batch_encode(["ආයුබෝවන්", "සිංහල"], padding=True)
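Padding a batch to its longest sequence can be sketched as follows. The attention mask convention used here (1 for real tokens, 0 for padding) is an assumption borrowed from common tokenizer libraries; this guide only shows masks for unpadded input. `pad_batch` is a hypothetical helper.

```python
PAD_ID = 0  # pad token ID, as listed in this guide's special tokens table


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [list(seq) + [pad_id] * (longest - len(seq)) for seq in sequences]
    # Assumed convention: 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch([[4, 23, 18, 7, 12], [9, 31, 6, 29]])
print(input_ids)       # [[4, 23, 18, 7, 12], [9, 31, 6, 29, 0]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0]]
```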

Decoding

tokenizer.decode([4, 23, 18, 7, 12])
# 'ආයුබෝවන්'

tokenizer.batch_decode([[4, 23, 18, 7, 12], [9, 31, 6, 29]])
# ['ආයුබෝවන්', 'සිංහල']

Skip special tokens during decoding:

tokenizer.decode([3, 4, 23, 18, 7, 12, 2], skip_special_tokens=True)  # 3 = BOS, 2 = EOS
# 'ආයුබෝවන්'
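Conceptually, decoding is an ID-to-token lookup followed by a join, with special IDs filtered out on request. The toy table below uses only IDs that appear in this guide; the pairing of `<|end_of_text|>` with ID 2 and `<|bos|>` with ID 3 is inferred from the special tokens reference. `toy_decode` is a hypothetical stand-in, not sinlib's code.

```python
# Toy ID-to-token table; the real vocabulary is much larger.
ID_TO_TOKEN = {
    0: "<|pad|>", 1: "<|unk|>", 2: "<|end_of_text|>", 3: "<|bos|>",
    4: "ආ", 23: "යු", 18: "බෝ", 7: "ව", 12: "න්",
}
SPECIAL_IDS = {0, 1, 2, 3}


def toy_decode(ids, skip_special_tokens=False):
    """Look up each ID and concatenate the tokens."""
    if skip_special_tokens:
        ids = [i for i in ids if i not in SPECIAL_IDS]
    return "".join(ID_TO_TOKEN[i] for i in ids)


print(toy_decode([3, 4, 23, 18, 7, 12, 2]))
# <|bos|>ආයුබෝවන්<|end_of_text|>
print(toy_decode([3, 4, 23, 18, 7, 12, 2], skip_special_tokens=True))
# ආයුබෝවන්
```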

Vocabulary

vocab = tokenizer.get_vocab()
# {'<|pad|>': 0, '<|unk|>': 1, ...}

# Token <-> ID conversion
tokenizer.convert_tokens_to_ids(['ආ', 'යු'])   # [4, 23]
tokenizer.convert_ids_to_tokens([4, 23])         # ['ආ', 'යු']
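The two conversions are inverse dictionary lookups. The sketch below assumes that a token missing from the vocabulary maps to the unknown-token ID; this guide does not state how sinlib handles that case, so treat the fallback as an assumption. The function names are hypothetical.

```python
UNK_TOKEN = "<|unk|>"
UNK_ID = 1  # unknown-token ID, as listed in this guide's special tokens table

# Toy slice of a vocabulary, using IDs that appear in this guide.
VOCAB = {"<|pad|>": 0, UNK_TOKEN: UNK_ID, "ආ": 4, "යු": 23}
ID_TO_TOKEN = {idx: tok for tok, idx in VOCAB.items()}


def tokens_to_ids(tokens):
    # Assumption: out-of-vocabulary tokens fall back to the unknown ID.
    return [VOCAB.get(tok, UNK_ID) for tok in tokens]


def ids_to_tokens(ids):
    return [ID_TO_TOKEN.get(idx, UNK_TOKEN) for idx in ids]


print(tokens_to_ids(["ආ", "යු", "x"]))  # [4, 23, 1]
print(ids_to_tokens([4, 23]))
```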

Saving and loading locally

# Save
tokenizer.save_pretrained("./my_tokenizer/")
# Writes ./my_tokenizer/vocab.json and ./my_tokenizer/config.json

# Reload later
tokenizer2 = Tokenizer.from_pretrained("./my_tokenizer/")

Training on custom data

corpus = ["සිංහල", "ආයුබෝවන්", ...]

tokenizer = Tokenizer(model_max_length=64)
tokenizer.train(corpus)
tokenizer.save_pretrained("./custom_tokenizer/")
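This guide does not describe what `train()` does internally. One plausible reading, sketched below purely as an illustration, is that training collects the distinct units in the corpus and assigns them IDs after the four special tokens. Both `group_marks` and `build_vocab` are hypothetical helpers, and the grouping rule (attach Unicode marks to the preceding character) is a simplification.

```python
import unicodedata

# Special tokens in ID order 0-3, following this guide's reference table.
SPECIAL_TOKENS = ["<|pad|>", "<|unk|>", "<|end_of_text|>", "<|bos|>"]


def group_marks(text):
    """Simplified grouping: attach combining marks to the preceding character."""
    units = []
    for ch in text:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


def build_vocab(corpus):
    """Assign the next free ID to each unit, in order of first appearance."""
    vocab = {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS)}
    for text in corpus:
        for unit in group_marks(text):
            vocab.setdefault(unit, len(vocab))
    return vocab


# "ආයුබෝවන්" twice, written with escapes; duplicates do not add new entries.
word = "\u0D86\u0DBA\u0DD4\u0DB6\u0DDD\u0DC0\u0DB1\u0DCA"
vocab = build_vocab([word, word])
print(len(vocab))  # 9: four special tokens plus five units
```

The IDs produced this way will not match the ones shown elsewhere in this guide, which come from the pretrained vocabulary.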

Special tokens reference

Token	Attribute	Default ID
Padding	`tokenizer.pad_token`	`0`
Unknown	`tokenizer.unk_token`	`1`
Beginning of sequence	`tokenizer.bos_token`	`3`
End of sequence	`tokenizer.eos_token`	`2`

tokenizer.pad_token        # '<|pad|>'
tokenizer.pad_token_id     # 0
tokenizer.all_special_tokens
# ['<|pad|>', '<|unk|>', '<|end_of_text|>', '<|bos|>']