recwizard.BaseTokenizer#

The BaseTokenizer class extends the Hugging Face PreTrainedTokenizer class with entity handling and support for combining multiple tokenizers.

class recwizard.tokenizer_utils.BaseTokenizer(entity2id: Dict[str, int] | None = None, id2entity: Dict[int, str] | None = None, pad_entity_id: int | None = None, tokenizers: List[PreTrainedTokenizerBase] | PreTrainedTokenizerBase | None = None, **kwargs)[source]#
__init__(entity2id: Dict[str, int] | None = None, id2entity: Dict[int, str] | None = None, pad_entity_id: int | None = None, tokenizers: List[PreTrainedTokenizerBase] | PreTrainedTokenizerBase | None = None, **kwargs)[source]#
Parameters:
  • entity2id (Dict[str, int]) – a dict mapping entity name to entity id. If not provided, it will be generated from id2entity.

  • id2entity (Dict[int, str]) – a dict mapping entity id to entity name. If not provided, it will be generated from entity2id.

  • pad_entity_id (int) – the id for padding entity. If not provided, it will be the maximum entity id + 1.

  • tokenizers (List[PreTrainedTokenizerBase]) – a list of tokenizers to be used.

  • **kwargs – other arguments for PreTrainedTokenizer
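The interplay of the three entity arguments above can be pictured with a standalone sketch. This mirrors the documented defaults (each mapping derived from the other, pad id defaulting to the maximum entity id + 1) but is an illustration, not the library's actual implementation; the helper name is hypothetical.

```python
def resolve_entity_mappings(entity2id=None, id2entity=None, pad_entity_id=None):
    # Derive whichever mapping is missing from the one provided.
    if entity2id is None and id2entity is not None:
        entity2id = {name: idx for idx, name in id2entity.items()}
    if id2entity is None and entity2id is not None:
        id2entity = {idx: name for name, idx in entity2id.items()}
    # Documented default: pad id is the maximum entity id + 1.
    if pad_entity_id is None and entity2id:
        pad_entity_id = max(entity2id.values()) + 1
    return entity2id, id2entity, pad_entity_id

entity2id, id2entity, pad_id = resolve_entity_mappings(
    entity2id={"Inception": 0, "Titanic": 1}
)
# id2entity == {0: "Inception", 1: "Titanic"}, pad_id == 2
```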

classmethod load_from_dataset(dataset='redial_unicrs', **kwargs)[source]#

Initialize the tokenizer from a dataset. By default, it loads the entity2id mapping from the dataset.

Parameters:
  • dataset – the dataset name

  • **kwargs – the other arguments for initialization

Returns:

the initialized tokenizer

Return type:

(BaseTokenizer)

unk_token() str[source]#

Override this function if you want to change the unk_token.

unk_token_id() int | None[source]#

Override this function if you want to change the unk_token_id.

property vocab_size: int#

Size of the base vocabulary (without the added tokens).

Type:

int

static mergeEncoding(encodings: List[BatchEncoding]) BatchEncoding[source]#

Merge a list of encodings into one encoding. Assumes each encoding has the same attributes other than data.
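The merging idea can be sketched with plain dicts standing in for the `.data` payload of `transformers.BatchEncoding` objects. How the real mergeEncoding resolves overlapping keys is not stated in the docs; in this hypothetical sketch, later encodings simply take precedence.

```python
def merge_encoding_data(encodings):
    # Combine the data dicts of several encodings into one dict,
    # so that fields produced by different tokenizers coexist.
    merged = {}
    for enc in encodings:
        merged.update(enc)
    return merged

text_enc = {"input_ids": [[1, 2]], "attention_mask": [[1, 1]]}
entity_enc = {"entity_ids": [[7]]}
merged = merge_encoding_data([text_enc, entity_enc])
# merged contains the keys of both encodings
```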

replace_special_tokens(text: str) List[str][source]#

Replaces the cls, sep, and eos tokens in the text with the corresponding special tokens of each tokenizer.

Parameters:

text – the text to be replaced

Returns:

a list of text, each used for one tokenizer

Return type:

(List[str])
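The per-tokenizer substitution might look like the following sketch, where a generic placeholder such as `<sep>` is rewritten with each tokenizer's own separator token. The `<sep>` placeholder convention is an assumption for illustration, not taken from the docs.

```python
def replace_sep_tokens(text, sep_tokens):
    # Produce one variant of the text per tokenizer, substituting the
    # hypothetical generic "<sep>" placeholder with that tokenizer's
    # sep token.
    return [text.replace("<sep>", sep) for sep in sep_tokens]

texts = replace_sep_tokens("Hello<sep>World", ["[SEP]", "</s>"])
# → ["Hello[SEP]World", "Hello</s>World"], one string per tokenizer
```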

encodes(encode_funcs: List[Callable], texts: List[str | List[str]], *args, **kwargs) List[BatchEncoding][source]#

This function is called to apply encoding functions from different tokenizers. It will be used by both encode_plus and batch_encode_plus.

If you want to call different tokenizers with different arguments, override this method.

Parameters:
  • encode_funcs – the encoding functions from self.tokenizers.

  • texts – the processed text for each encoding function

  • **kwargs – other arguments passed through to each encoding function

Returns:

a list of BatchEncoding; the length of the list is the same as the number of tokenizers
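Conceptually, encodes fans the per-tokenizer texts out to the matching encoding functions. A standalone sketch, using plain callables in place of real tokenizer encode functions:

```python
def encodes(encode_funcs, texts, **kwargs):
    # Pair each encoding function with its corresponding text and
    # collect one result per tokenizer.
    return [func(text, **kwargs) for func, text in zip(encode_funcs, texts)]

# Two stand-in "encoders" applied to their respective texts.
results = encodes([str.upper, str.lower], ["Hello", "Hello"])
# → ["HELLO", "hello"], one result per tokenizer
```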

preprocess(text: str) str[source]#

Override this function to preprocess the text. It will be used by both encode_plus and batch_encode_plus.

Parameters:

text – the text to be preprocessed

Returns:

the processed text

batch_encode_plus(batch_text_or_text_pairs: List[str], *args, **kwargs) BatchEncoding[source]#

Overrides the batch_encode_plus function from PreTrainedTokenizer to support entity processing.

encode_plus(text: str, *args, **kwargs) BatchEncoding[source]#

Overrides the encode_plus function from PreTrainedTokenizer to support entity processing.

process_entities(text: str) Tuple[str, List[int]][source]#

Process the entities in the text. It extracts the entity ids from the text and removes the entity tags.
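The docs do not specify the entity tag format. Assuming a hypothetical tag format like `<entity>Name</entity>`, the extraction could look roughly like this sketch; the real format used by recwizard may differ.

```python
import re

def process_entities(text, entity2id, unk_entity_id=-1):
    # Hypothetical tag format: <entity>Name</entity>.
    # Collect the entity id for each tagged name, falling back to a
    # placeholder id for unknown entities.
    entity_ids = [
        entity2id.get(name, unk_entity_id)
        for name in re.findall(r"<entity>(.*?)</entity>", text)
    ]
    # Strip the tags, keeping the entity name in the surface text.
    clean = re.sub(r"<entity>(.*?)</entity>", r"\1", text)
    return clean, entity_ids

clean, ids = process_entities(
    "I loved <entity>Inception</entity>!", {"Inception": 0}
)
# → ("I loved Inception!", [0])
```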

decode(*args, **kwargs) str[source]#

Overrides the decode function from PreTrainedTokenizer. By default, calls the decode function of the first tokenizer.

get_init_kwargs()[source]#

Returns the kwargs for initialization. Override this function to declare the necessary initialization kwargs; they will be saved when the tokenizer is saved or pushed to the Hugging Face model hub.

See also: save_vocabulary()

save_vocabulary(save_directory: str, filename_prefix: str | None = None) Tuple[str][source]#

This method is overridden to save the initialization kwargs to the model directory.

classmethod from_pretrained(pretrained_model_name_or_path, *args, **kwargs)[source]#

Load the tokenizer from a pretrained model or local directory. It loads the initialization kwargs from the ‘tokenizer_kwargs.json’ file before initializing the tokenizer.
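The round trip between save_vocabulary and from_pretrained can be pictured as: write the init kwargs to ‘tokenizer_kwargs.json’ on save, then read them back before constructing the tokenizer. A minimal standalone sketch; the file name comes from the docs, but the helper functions are hypothetical.

```python
import json
import os
import tempfile

def save_init_kwargs(save_directory, init_kwargs):
    # Persist the kwargs needed to re-create the tokenizer later.
    path = os.path.join(save_directory, "tokenizer_kwargs.json")
    with open(path, "w") as f:
        json.dump(init_kwargs, f)
    return path

def load_init_kwargs(directory):
    # Read the kwargs back before initializing the tokenizer.
    with open(os.path.join(directory, "tokenizer_kwargs.json")) as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as d:
    save_init_kwargs(d, {"pad_entity_id": 2})
    kwargs = load_init_kwargs(d)
# kwargs == {"pad_entity_id": 2}
```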