Tokenization Example
Basic tokenization
Section titled “Basic tokenization”from sinlib import Tokenizer
tokenizer = Tokenizer.from_pretrained("Ransaka/sinlib")
# Single stringtokens = tokenizer.tokenize("ආයුබෝවන්")print(tokens)# ['ආ', 'යු', 'බෝ', 'ව', 'න්']
enc = tokenizer("ආයුබෝවන්")print(enc.input_ids) # [4, 23, 18, 7, 12]print(enc.attention_mask) # [1, 1, 1, 1, 1]Batch encoding with padding
Section titled “Batch encoding with padding”batch = tokenizer(["ආයුබෝවන්", "සිංහල"], padding=True)print(batch.input_ids)# [[4, 23, 18, 7, 12],# [9, 31, 6, 0, 0]]Round-trip encode → decode
Section titled “Round-trip encode → decode”ids = tokenizer.encode("ආයුබෝවන්")text = tokenizer.decode(ids)print(text) # 'ආයුබෝවන්'Save and reload
Section titled “Save and reload”tokenizer.save_pretrained("./my_tokenizer/")t2 = Tokenizer.from_pretrained("./my_tokenizer/")assert t2.encode("ආයුබෝවන්") == tokenizer.encode("ආයුබෝවන්")