
Tokenizer

Character-level Sinhala tokenizer with a HuggingFace-compatible API. Splits Sinhala text into phonological units (base consonant + vowel diacritics) and maps them to integer IDs.
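To make the notion of a phonological unit concrete, the following is a minimal sketch that attaches each combining mark to the base character before it. It uses only Python's standard `unicodedata` module. `split_units` is a hypothetical helper written for illustration, not sinlib's actual implementation, and it does not handle conjuncts joined with a zero-width joiner.

```python
import unicodedata


def split_units(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Illustrative only: this is not sinlib's implementation.
    """
    units: list[str] = []
    for ch in text:
        # Unicode categories Mn/Mc cover Sinhala vowel signs, the
        # al-lakuna (virama), anusvara and visarga.
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and units:
            units[-1] += ch  # attach the mark to the preceding base
        else:
            units.append(ch)  # start a new unit
    return units


units = split_units("ආයුබෝවන්")
print(len(units))  # 5
```

Under this rule a consonant and its vowel sign stay together as one unit instead of being split into separate code points.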

from sinlib import Tokenizer

| Method | Returns | Description |
| --- | --- | --- |
| Tokenizer.from_pretrained(path) | Tokenizer | Load from HF Hub or local directory |
| tokenizer(text) | BatchEncoding | Encode one or more texts |
| tokenizer.tokenize(text) | list[str] | Split text into token strings |
| tokenizer.encode(text) | list[int] | Encode text to ID list |
| tokenizer.encode_plus(text) | BatchEncoding | Encode with full metadata |
| tokenizer.batch_encode(texts) | BatchEncoding | Encode a list of texts |
| tokenizer.batch_decode(ids) | list[str] | Decode a batch of ID lists |
| tokenizer.decode(ids) | str | Decode a single ID list |
| tokenizer.convert_tokens_to_ids(tokens) | list[int] | Token strings → IDs |
| tokenizer.convert_ids_to_tokens(ids) | list[str] | IDs → token strings |
| tokenizer.get_vocab() | dict[str, int] | Full vocabulary mapping |
| tokenizer.save_pretrained(path) | None | Save vocab to directory |

# Load from the Hugging Face Hub
tokenizer = Tokenizer.from_pretrained("Ransaka/sinlib")
# Load from a local directory
tokenizer = Tokenizer.from_pretrained("./my_tokenizer/")

The directory must contain a vocab.json file. Use save_pretrained() to create one.

# Legacy loading style (deprecated)
tokenizer = Tokenizer(max_length=16)
tokenizer.load_from_pretrained(load_default_tokenizer=True)  # emits a DeprecationWarning
# Encode a single text
encoding = tokenizer("ආයුබෝවන්")
# BatchEncoding(input_ids=[4, 23, 18, 7, 12], attention_mask=[1, 1, 1, 1, 1])
# Pad or truncate to exactly 8 IDs
encoding = tokenizer("ආයුබෝවන්", max_length=8, padding="max_length", truncation=True)
# Encode a batch, padding to the longest sequence
batch = tokenizer(["ආයුබෝවන්", "සිංහල"], padding=True)
batch.input_ids  # [[4, 23, 18, 7, 12], [9, 31, 6, 0, 0]]
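The effect of `max_length`, `padding="max_length"` and `truncation=True` on a single ID list can be sketched in plain Python. `pad_or_truncate` is a hypothetical helper, not part of sinlib. The pad ID of 0 and padding on the right are assumptions read off the batch example above, where the shorter sequence is filled with trailing zeros.

```python
def pad_or_truncate(
    ids: list[int], max_length: int, pad_id: int = 0
) -> dict[str, list[int]]:
    """Return input_ids and attention_mask with exactly max_length entries."""
    kept = ids[:max_length]         # truncation: drop IDs past max_length
    n_pad = max_length - len(kept)  # padding: number of pad IDs to append
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


encoded = pad_or_truncate([4, 23, 18, 7, 12], max_length=8)
print(encoded["input_ids"])       # [4, 23, 18, 7, 12, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask lets downstream code ignore the padded positions.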

| Attribute | Default | Description |
| --- | --- | --- |
| tokenizer.pad_token | "[PAD]" | Padding token |
| tokenizer.unk_token | "[UNK]" | Unknown token |
| tokenizer.bos_token | "[BOS]" | Beginning of sequence |
| tokenizer.eos_token | "[EOS]" | End of sequence |

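As an illustration of what the unknown token is for, here is a minimal sketch of the usual convention: any token string missing from the vocabulary maps to the ID of `[UNK]`. The vocabulary and the `tokens_to_ids` helper are hypothetical, and sinlib's own handling of unknown units may differ.

```python
# Hypothetical toy vocabulary; the real IDs come from the tokenizer's vocab.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "ක": 2, "ම": 3}


def tokens_to_ids(
    tokens: list[str], vocab: dict[str, int], unk_token: str = "[UNK]"
) -> list[int]:
    """Map token strings to IDs, falling back to the unknown-token ID."""
    unk_id = vocab[unk_token]
    return [vocab.get(tok, unk_id) for tok in tokens]


print(tokens_to_ids(["ක", "ම", "x"], toy_vocab))  # [2, 3, 1]
```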
tokenizer.save_pretrained("./my_tokenizer/")
# Writes vocab.json to the directory
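A save and reload round trip can be sketched with the standard library alone. The sketch assumes that vocab.json is a flat JSON object mapping token strings to integer IDs, in line with get_vocab() returning dict[str, int]. The actual file layout is not confirmed here, and `saved_vocab` is a hypothetical toy vocabulary.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical toy vocabulary standing in for tokenizer.get_vocab()
saved_vocab = {"[PAD]": 0, "[UNK]": 1, "ක": 2}

with tempfile.TemporaryDirectory() as tmp:
    vocab_path = Path(tmp) / "vocab.json"
    # ensure_ascii=False keeps Sinhala characters readable in the file
    vocab_path.write_text(
        json.dumps(saved_vocab, ensure_ascii=False), encoding="utf-8"
    )
    restored = json.loads(vocab_path.read_text(encoding="utf-8"))

print(restored == saved_vocab)  # True
```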