BatchEncoding

BatchEncoding is the return type of Tokenizer.__call__(), Tokenizer.encode_plus(), and Tokenizer.batch_encode(). It exposes the tokenizer output through both attribute-style and dict-style access.

from sinlib import BatchEncoding
# or
from sinlib.encoding import BatchEncoding

from sinlib import Tokenizer

tokenizer = Tokenizer.from_pretrained("Ransaka/sinlib")
encoding = tokenizer("ආයුබෝවන්")

# Attribute access
encoding.input_ids       # [4, 23, 18, 7, 12]
encoding.attention_mask  # [1, 1, 1, 1, 1]

# Dict-style access
encoding["input_ids"]
encoding["attention_mask"]

# Convert to plain dict
encoding.to_dict()
# {"input_ids": [4, 23, 18, 7, 12], "attention_mask": [1, 1, 1, 1, 1]}

Batch encoding returns a BatchEncoding where each field is a list of lists:

batch = tokenizer(["ආයුබෝවන්", "සිංහල"], padding=True)
batch.input_ids # [[4, 23, 18, 7, 12], [9, 31, 6, 0, 0]]

| Field | Type | Description |
| --- | --- | --- |
| input_ids | list[int] or list[list[int]] | Token IDs |
| attention_mask | list[int] or list[list[int]] | 1 for real tokens, 0 for padding |
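
The relationship between padded input_ids and attention_mask can be illustrated with a small standalone sketch. Note that pad_batch is a hypothetical helper written for this illustration, not part of sinlib, and the pad ID of 0 is an assumption read off the padded example above; the library's actual padding logic may differ.

```python
from typing import List, Tuple


def pad_batch(
    sequences: List[List[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Hypothetical helper: right-pad ID lists to the longest length.

    Returns (input_ids, attention_mask), where the mask holds 1 for
    real tokens and 0 for padding positions.
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Assumed pad ID (0) fills the tail of shorter sequences.
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_batch([[4, 23, 18, 7, 12], [9, 31, 6]])
print(ids)
print(mask)
```

Each row of the mask has the same length as the corresponding row of IDs, so the two fields can be consumed position by position.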

| Method | Description |
| --- | --- |
| to_dict() | Returns a plain dict with all fields |
| __getitem__(key) | Dict-style access |
| __repr__() | Human-readable representation |
| __eq__(other) | Equality comparison |
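
To make the interplay of these methods concrete, here is a minimal sketch of a container that offers the same four methods. EncodingSketch is a hypothetical stand-in written for illustration; it is not sinlib's implementation, and the real BatchEncoding may store its fields, format its repr, or define equality differently.

```python
from typing import Any, Dict


class EncodingSketch:
    """Illustrative stand-in with attribute-style and dict-style access."""

    def __init__(self, input_ids: Any, attention_mask: Any) -> None:
        self.input_ids = input_ids
        self.attention_mask = attention_mask

    def to_dict(self) -> Dict[str, Any]:
        # Plain dict holding every field.
        return {
            "input_ids": self.input_ids,
            "attention_mask": self.attention_mask,
        }

    def __getitem__(self, key: str) -> Any:
        # Dict-style access; unknown keys raise KeyError like a dict.
        return self.to_dict()[key]

    def __repr__(self) -> str:
        return f"EncodingSketch({self.to_dict()!r})"

    def __eq__(self, other: object) -> bool:
        # Assumption: two encodings are equal when all fields match.
        if not isinstance(other, EncodingSketch):
            return NotImplemented
        return self.to_dict() == other.to_dict()


enc = EncodingSketch([4, 23, 18, 7, 12], [1, 1, 1, 1, 1])
same = EncodingSketch([4, 23, 18, 7, 12], [1, 1, 1, 1, 1])
print(enc["input_ids"] == enc.input_ids)
print(enc == same)
print(enc)
```

Routing __getitem__ and __eq__ through to_dict() keeps the field list in one place, so adding a field in this sketch would only require touching the constructor and to_dict().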