Tokenizer
Character-level Sinhala tokenizer with a HuggingFace-compatible API. Splits Sinhala text into phonological units (base consonant + vowel diacritics) and maps them to integer IDs.
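To make "base consonant + vowel diacritics" concrete, here is a minimal sketch that groups each base character with the combining marks that follow it. It relies only on Python's standard unicodedata module. The helper name split_units is hypothetical and is not part of sinlib, whose actual splitting rules may differ (for example around zero-width joiners, which this sketch does not handle).

```python
import unicodedata


def split_units(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Illustrative only: this is an assumption about what a "phonological unit"
    looks like, not sinlib's implementation.
    """
    units: list[str] = []
    for ch in text:
        # Unicode categories Mn/Mc cover Sinhala vowel signs, al-lakuna, anusvara.
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch  # attach the mark to the preceding base character
        else:
            units.append(ch)  # start a new unit
    return units


units = split_units("ආයුබෝවන්")
print(units)
```

Under this grouping the greeting splits into five units, which is consistent with the five IDs shown in the encoding example on this page.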
Import

    from sinlib import Tokenizer

Quick Reference
| Method | Returns | Description |
|---|---|---|
| Tokenizer.from_pretrained(path) | Tokenizer | Load from HF Hub or local directory |
| tokenizer(text) | BatchEncoding | Encode one or more texts |
| tokenizer.tokenize(text) | list[str] | Split text into token strings |
| tokenizer.encode(text) | list[int] | Encode text to ID list |
| tokenizer.encode_plus(text) | BatchEncoding | Encode with full metadata |
| tokenizer.batch_encode(texts) | BatchEncoding | Encode a list of texts |
| tokenizer.batch_decode(ids) | list[str] | Decode a batch of ID lists |
| tokenizer.decode(ids) | str | Decode a single ID list |
| tokenizer.convert_tokens_to_ids(tokens) | list[int] | Token strings → IDs |
| tokenizer.convert_ids_to_tokens(ids) | list[str] | IDs → token strings |
| tokenizer.get_vocab() | dict[str, int] | Full vocabulary mapping |
| tokenizer.save_pretrained(path) | None | Save vocab to directory |
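The conversion methods in the table can be pictured as lookups in a token-to-ID dictionary and its inverse. The sketch below uses a made-up toy vocabulary and standalone functions that only borrow the method names; the fallback to [UNK] for unseen tokens and the dropping of [PAD] on decode are assumptions for illustration, not confirmed sinlib behaviour.

```python
# Toy vocabulary; these tokens and IDs are invented for illustration.
vocab = {"[PAD]": 0, "[UNK]": 1, "ක": 2, "කා": 3, "ම": 4}
id_to_token = {idx: tok for tok, idx in vocab.items()}


def convert_tokens_to_ids(tokens: list[str]) -> list[int]:
    # Assumption: tokens missing from the vocabulary map to the [UNK] ID.
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]


def convert_ids_to_tokens(ids: list[int]) -> list[str]:
    return [id_to_token[i] for i in ids]


def decode(ids: list[int]) -> str:
    # Assumption: padding tokens are dropped when rebuilding the text.
    return "".join(t for t in convert_ids_to_tokens(ids) if t != "[PAD]")


ids = convert_tokens_to_ids(["කා", "ම", "ර"])  # "ර" is not in the toy vocab
print(ids)
print(decode([3, 4, 0]))
```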
Loading
From HuggingFace Hub

    tokenizer = Tokenizer.from_pretrained("Ransaka/sinlib")

From a local directory

    tokenizer = Tokenizer.from_pretrained("./my_tokenizer/")

The directory must contain a vocab.json file. Use save_pretrained() to create one.
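Since a local load depends on vocab.json being present, a quick pre-flight check can give a clearer error than a failed load. This is a small sketch using only pathlib; the helper name has_vocab_file is hypothetical and not part of sinlib.

```python
from pathlib import Path


def has_vocab_file(directory: str) -> bool:
    """Return True if the directory contains a vocab.json file."""
    return (Path(directory) / "vocab.json").is_file()


if not has_vocab_file("./my_tokenizer/"):
    print("vocab.json not found; create it with save_pretrained() first")
```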
Legacy (deprecated)

    tokenizer = Tokenizer(max_length=16)
    tokenizer.load_from_pretrained(load_default_tokenizer=True)  # DeprecationWarning

Encoding
Single text

    encoding = tokenizer("ආයුබෝවන්")
    # BatchEncoding(input_ids=[4, 23, 18, 7, 12], attention_mask=[1, 1, 1, 1, 1])

With padding and truncation

    encoding = tokenizer("ආයුබෝවන්", max_length=8, padding="max_length", truncation=True)

Batch encoding

    batch = tokenizer(["ආයුබෝවන්", "සිංහල"], padding=True)
    batch.input_ids  # [[4, 23, 18, 7, 12], [9, 31, 6, 0, 0]]

Special Tokens
| Attribute | Default | Description |
|---|---|---|
| tokenizer.pad_token | "[PAD]" | Padding token |
| tokenizer.unk_token | "[UNK]" | Unknown token |
| tokenizer.bos_token | "[BOS]" | Beginning of sequence |
| tokenizer.eos_token | "[EOS]" | End of sequence |
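The padding token is what fills shorter sequences when padding is requested during encoding. The sketch below shows one way right-padding, truncation and the attention mask could fit together. The function pad_or_truncate is hypothetical, and the pad ID of 0 is an assumption inferred from the zeros in the batch encoding example, not a documented value.

```python
def pad_or_truncate(
    ids: list[int], max_length: int, pad_id: int = 0
) -> tuple[list[int], list[int]]:
    """Return (input_ids, attention_mask), both exactly max_length long.

    pad_id=0 is an assumption for illustration.
    """
    ids = ids[:max_length]            # truncation: drop tokens past max_length
    mask = [1] * len(ids)             # 1 marks real tokens
    n_pad = max_length - len(ids)     # padding needed on the right
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


input_ids, attention_mask = pad_or_truncate([9, 31, 6], max_length=5)
print(input_ids)       # [9, 31, 6, 0, 0]
print(attention_mask)  # [1, 1, 1, 0, 0]
```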
Saving

    tokenizer.save_pretrained("./my_tokenizer/")
    # Writes vocab.json to the directory
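As a rough picture of what a saved vocabulary could look like on disk, the sketch below writes a token-to-ID mapping as JSON and reads it back. It assumes vocab.json is a flat JSON object of the same shape get_vocab() returns; the real file layout is not specified on this page and may differ. The toy vocabulary is invented for illustration.

```python
import json
import tempfile
from pathlib import Path

# Toy vocabulary; these tokens and IDs are invented for illustration.
vocab = {"[PAD]": 0, "[UNK]": 1, "ක": 2}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "vocab.json"
    # ensure_ascii=False keeps Sinhala characters readable in the file.
    path.write_text(json.dumps(vocab, ensure_ascii=False), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(loaded == vocab)
```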