Tokenizer

Character-level Sinhala tokenizer with a HuggingFace-compatible API. Splits Sinhala text into phonological units (base consonant + vowel diacritics) and maps them to integer IDs.

Class

Tokenizer

Main tokenizer class. It builds tokens at runtime by combining each base consonant with the vowel diacritics that follow it.

Constructor

Tokenizer(model_max_length=None, unk_token="<|unk|>", pad_token="<|pad|>", eos_token="<|end_of_text|>", bos_token="<|bos|>")

Parameters

Argument	Type	Description
model_max_length	int, optional	Maximum sequence length for padding/truncation. Default is `None`.
unk_token	str, optional	Token representing unknown characters. Default is `"<|unk|>"`.
pad_token	str, optional	Token representing padding. Default is `"<|pad|>"`.
eos_token	str, optional	End-of-sequence token. Default is `"<|end_of_text|>"`.
bos_token	str, optional	Beginning-of-sequence token. Default is `"<|bos|>"`.

Methods

from_pretrained

Load a pretrained tokenizer from the HuggingFace Hub or a local path.

Tokenizer.from_pretrained(pretrained_model_name_or_path, model_max_length=None)

Argument	Type	Description
pretrained_model_name_or_path	str, required	HuggingFace repo ID or local directory path.
model_max_length	int, optional	Override maximum sequence length.

tokenize

Split text into phonological unit token strings.

tokenizer.tokenize(text)

Argument	Type	Description
text	str, required	Sinhala input string to split.
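To illustrate what splitting into phonological units means, here is a minimal sketch that attaches each combining mark to the base character before it. It is not the sinlib implementation: `split_units` is a hypothetical helper, and it assumes that Unicode mark categories (Mn, Mc, Me) are a good enough proxy for Sinhala vowel signs, the hal kirima, and the anusvara. It does not handle conjuncts joined with a zero-width joiner.

```python
import unicodedata


def split_units(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Hypothetical helper for illustration only; not part of sinlib.
    """
    units: list[str] = []
    for ch in text:
        # Categories starting with "M" are combining marks. In the Sinhala
        # block these include vowel signs, the hal kirima and the anusvara.
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


units = split_units("ආයුබෝවන්")
print(len(units), units)
```

Under these assumptions the greeting splits into five units, and joining them back together reproduces the input string exactly.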

encode

Encode text directly to a list of token IDs.

tokenizer.encode(text, add_special_tokens=True, add_bos_token=False)

Argument	Type	Description
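text	str, required	Sinhala input string to encode.
add_special_tokens	bool, optional	Default is `True`.
add_bos_token	bool, optional	Default is `False`.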
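Conceptually, encoding is a vocabulary lookup over the units produced by tokenization. The sketch below shows that lookup with a toy vocabulary. The IDs, the `encode_units` helper, and the fallback to the unknown-token ID are assumptions made for illustration; the real vocabulary and special-token handling come from the pretrained tokenizer.

```python
UNK = "<|unk|>"

# Hypothetical toy vocabulary: ID 0 for the unknown token, then one ID per unit.
known_units = ["ආ", "යු", "බෝ", "ව", "න්"]
vocab = {UNK: 0, **{unit: i + 1 for i, unit in enumerate(known_units)}}


def encode_units(units: list[str], vocab: dict[str, int], unk_token: str = UNK) -> list[int]:
    """Map each unit to its ID, falling back to the unknown-token ID."""
    unk_id = vocab[unk_token]
    return [vocab.get(unit, unk_id) for unit in units]


# "x" is not in the toy vocabulary, so it maps to the unknown-token ID.
ids = encode_units(known_units + ["x"], vocab)
print(ids)  # [1, 2, 3, 4, 5, 0]
```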

Code Examples

Import and Load

from sinlib import Tokenizer

# Load default tokenizer
tokenizer = Tokenizer.from_pretrained("Ransaka/sinlib")

Tokenization

tokens = tokenizer.tokenize("ආයුබෝවන්")
# ['ආ', 'යු', 'බෝ', 'ව', 'න්']

Encoding Single Text

# Simple ID list
ids = tokenizer.encode("ආයුබෝවන්")
# [4, 23, 18, 7, 12]

# Full BatchEncoding output
encoding = tokenizer("ආයුබෝවන්")
# BatchEncoding(input_ids=[4, 23, 18, 7, 12],
#   attention_mask=[1, 1, 1, 1, 1])

Padding and Truncation

res = tokenizer(
    ["ආයුබෝවන්", "සිංහල"],
    padding=True,
    truncation=True,
    max_length=6,
    return_tensors="np"
)
# Returns NumPy arrays in which every sequence has the same length
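For readers unfamiliar with what padding and truncation do to a batch, the following sketch reproduces the idea with plain Python lists. It is an illustration, not sinlib's code: `pad_and_truncate` is a hypothetical helper, the second sequence's IDs are made up, and right-side padding with a pad ID of 0 is an assumption.

```python
def pad_and_truncate(batch: list[list[int]], max_length: int, pad_id: int = 0) -> dict:
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    width = max((len(ids) for ids in truncated), default=0)
    # Real tokens get attention 1, padding positions get attention 0.
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


out = pad_and_truncate([[4, 23, 18, 7, 12], [9, 3, 5]], max_length=6)
print(out["input_ids"])       # [[4, 23, 18, 7, 12], [9, 3, 5, 0, 0]]
print(out["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Both sequences are shorter than `max_length=6` here, so nothing is truncated and the shorter one is padded to the length of the longer one.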