Generator Module#

class recwizard.modules.unicrs.configuration_unicrs_gen.UnicrsGenConfig(pretrained_model: str = '', kgprompt_config: dict | None = None, num_tokens: int = 0, pad_token_id: int = 0, max_gen_len: int = 0, **kwargs)[source]#
__init__(pretrained_model: str = '', kgprompt_config: dict | None = None, num_tokens: int = 0, pad_token_id: int = 0, max_gen_len: int = 0, **kwargs)[source]#
Parameters:
  • WEIGHT_DIMENSIONS (dict, optional) – The dimensions and dtypes of the module parameters, used to initialize parameters that are not explicitly specified at module initialization. Defaults to None. See also recwizard.module_utils.BaseModule.prepare_weight().

  • **kwargs – Additional parameters. Will be passed to the PretrainedConfig.__init__.
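
A minimal construction sketch. The values below are illustrative assumptions, not recommended settings, and the schema of kgprompt_config is defined by the KG prompt encoder rather than reproduced here:

```python
from recwizard.modules.unicrs.configuration_unicrs_gen import UnicrsGenConfig

# Illustrative values only: the real num_tokens/pad_token_id depend on the
# backbone tokenizer, and kgprompt_config's keys are defined elsewhere.
config = UnicrsGenConfig(
    pretrained_model="microsoft/DialoGPT-small",
    num_tokens=50257,    # vocabulary size of the backbone (illustrative)
    pad_token_id=50256,  # DialoGPT reuses its EOS token for padding
    max_gen_len=50,
)
```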

class recwizard.modules.unicrs.tokenizer_unicrs_gen.UnicrsGenTokenizer(context_tokenizer: str = 'microsoft/DialoGPT-small', prompt_tokenizer: str = 'roberta-base', context_max_length: int = 200, prompt_max_length: int = 200, entity2id: Dict[str, int] | None = None, pad_entity_id: int = 31161, resp_prompt='System:', **kwargs)[source]#
__init__(context_tokenizer: str = 'microsoft/DialoGPT-small', prompt_tokenizer: str = 'roberta-base', context_max_length: int = 200, prompt_max_length: int = 200, entity2id: Dict[str, int] | None = None, pad_entity_id: int = 31161, resp_prompt='System:', **kwargs)[source]#
Parameters:
  • entity2id (Dict[str, int]) – a dict mapping entity name to entity id. If not provided, it will be generated from id2entity.

  • id2entity (Dict[int, str]) – a dict mapping entity id to entity name. If not provided, it will be generated from entity2id.

  • pad_entity_id (int) – the id for padding entity. If not provided, it will be the maximum entity id + 1.

  • tokenizers (List[PreTrainedTokenizerBase]) – a list of tokenizers to be used.

  • **kwargs – other arguments for PreTrainedTokenizer
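
A minimal direct-construction sketch. The two entity2id entries are purely illustrative; in practice the mapping comes from a dataset (see load_from_dataset() below):

```python
from recwizard.modules.unicrs.tokenizer_unicrs_gen import UnicrsGenTokenizer

# Illustrative entity mapping -- a real one is loaded from the dataset's KG.
tokenizer = UnicrsGenTokenizer(
    context_tokenizer="microsoft/DialoGPT-small",
    prompt_tokenizer="roberta-base",
    entity2id={"<Titanic>": 0, "<Avatar>": 1},
    pad_entity_id=2,
)
```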

classmethod load_from_dataset(dataset='redial_unicrs', **kwargs)[source]#

Initialize the tokenizer from a dataset. By default, it loads the entity2id mapping from the dataset.

Parameters:
  • dataset – the dataset name

  • **kwargs – the other arguments for initialization

Returns:

the initialized tokenizer

Return type:

(BaseTokenizer)
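
For example, using the default dataset name from the signature above:

```python
# Builds the tokenizer with entity2id taken from the bundled dataset.
tokenizer = UnicrsGenTokenizer.load_from_dataset(dataset="redial_unicrs")
```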

static mergeEncoding(encodings: List[BatchEncoding]) → BatchEncoding[source]#

Merge a list of encodings into one encoding. Assumes each encoding has the same attributes other than data.
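
A sketch of the calling convention only; how overlapping keys in the underlying data dicts are combined is up to the implementation and not shown here:

```python
from transformers import AutoTokenizer
from recwizard.modules.unicrs.tokenizer_unicrs_gen import UnicrsGenTokenizer

# Two encodings produced by the underlying tokenizers (illustrative inputs).
context_enc = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")("Hi!")
prompt_enc = AutoTokenizer.from_pretrained("roberta-base")("Hi!")

# Merge them into a single BatchEncoding.
merged = UnicrsGenTokenizer.mergeEncoding([context_enc, prompt_enc])
```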

__call__(*args, **kwargs)[source]#

Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences.

Parameters:
  • text (str, List[str], List[List[str]], optional) – The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

  • text_pair (str, List[str], List[List[str]], optional) – The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

  • text_target (str, List[str], List[List[str]], optional) – The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

  • text_pair_target (str, List[str], List[List[str]], optional) – The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

  • add_special_tokens (bool, optional, defaults to True) – Whether or not to encode the sequences with the special tokens relative to their model.

  • padding (bool, str or [~utils.PaddingStrategy], optional, defaults to False) –

    Activates and controls padding. Accepts the following values:

    • True or ‘longest’: Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).

    • ‘max_length’: Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.

    • False or ‘do_not_pad’ (default): No padding (i.e., can output a batch with sequences of different lengths).

  • truncation (bool, str or [~tokenization_utils_base.TruncationStrategy], optional, defaults to False) –

    Activates and controls truncation. Accepts the following values:

    • True or ‘longest_first’: Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.

    • ‘only_first’: Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.

    • ‘only_second’: Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.

    • False or ‘do_not_truncate’ (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).

  • max_length (int, optional) –

    Controls the maximum length to use by one of the truncation/padding parameters.

    If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.

  • stride (int, optional, defaults to 0) – If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.

  • is_split_into_words (bool, optional, defaults to False) – Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) which it will tokenize. This is useful for NER or token classification.

  • pad_to_multiple_of (int, optional) – If set will pad the sequence to a multiple of the provided value. Requires padding to be activated. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).

  • return_tensors (str or [~utils.TensorType], optional) –

    If set, will return tensors instead of list of python integers. Acceptable values are:

    • ‘tf’: Return TensorFlow tf.constant objects.

    • ‘pt’: Return PyTorch torch.Tensor objects.

    • ‘np’: Return NumPy np.ndarray objects.

  • return_token_type_ids (bool, optional) –

    Whether to return token type IDs. If left to the default, will return the token type IDs according to the specific tokenizer’s default, defined by the return_outputs attribute.

    [What are token type IDs?](../glossary#token-type-ids)

  • return_attention_mask (bool, optional) –

    Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by the return_outputs attribute.

    [What are attention masks?](../glossary#attention-mask)

  • return_overflowing_tokens (bool, optional, defaults to False) – Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch of pairs) is provided with truncation_strategy = longest_first or True, an error is raised instead of returning overflowing tokens.

  • return_special_tokens_mask (bool, optional, defaults to False) – Whether or not to return special tokens mask information.

  • return_offsets_mapping (bool, optional, defaults to False) –

    Whether or not to return (char_start, char_end) for each token.

    This is only available on fast tokenizers inheriting from [PreTrainedTokenizerFast]; if using a Python (slow) tokenizer, this method will raise NotImplementedError.

  • return_length (bool, optional, defaults to False) – Whether or not to return the lengths of the encoded inputs.

  • verbose (bool, optional, defaults to True) – Whether or not to print more information and warnings.

  • **kwargs – passed to the self.tokenize() method

Returns:

A [BatchEncoding] with the following fields:

  • input_ids – List of token ids to be fed to a model.

    [What are input IDs?](../glossary#input-ids)

  • token_type_ids – List of token type ids to be fed to a model (when return_token_type_ids=True or if “token_type_ids” is in self.model_input_names).

    [What are token type IDs?](../glossary#token-type-ids)

  • attention_mask – List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if “attention_mask” is in self.model_input_names).

    [What are attention masks?](../glossary#attention-mask)

  • overflowing_tokens – List of overflowing tokens sequences (when a max_length is specified and return_overflowing_tokens=True).

  • num_truncated_tokens – Number of tokens truncated (when a max_length is specified and return_overflowing_tokens=True).

  • special_tokens_mask – List of 0s and 1s, with 1 specifying added special tokens and 0 specifying regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).

  • length – The length of the inputs (when return_length=True)

Return type:

[BatchEncoding]
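
A minimal tokenization sketch; the exact key layout of the returned BatchEncoding reflects how the context and prompt encodings are merged (see mergeEncoding() above):

```python
# Tokenize one dialogue turn with both underlying tokenizers at once.
enc = tokenizer(
    "User: Can you recommend a movie like <Titanic>?",
    padding=True,
    truncation=True,
    return_tensors="pt",
)
print(enc.keys())  # inspect the merged fields
```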

encodes(encode_funcs: List[Callable], texts: List[str | List[str]], *args, **kwargs) → List[BatchEncoding][source]#

This function is called to apply encoding functions from different tokenizers. It will be used by both encode_plus and batch_encode_plus.

If you want to call different tokenizers with different arguments, override this method.

Parameters:
  • encode_funcs – the encoding functions from self.tokenizers.

  • texts – the processed text for each encoding function

  • **kwargs – additional keyword arguments passed to each encoding function

Returns:

a list of BatchEncoding; the length of the list is the same as the number of tokenizers
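
A sketch of an override that passes different arguments to each tokenizer. It assumes encode_funcs pairs with texts as context first, prompt second, which is an assumption made for illustration:

```python
from typing import Callable, List

from transformers import BatchEncoding

from recwizard.modules.unicrs.tokenizer_unicrs_gen import UnicrsGenTokenizer


class MyUnicrsGenTokenizer(UnicrsGenTokenizer):
    def encodes(self, encode_funcs: List[Callable], texts, *args, **kwargs) -> List[BatchEncoding]:
        # Truncate the context but leave the prompt untouched (illustrative).
        context_enc = encode_funcs[0](texts[0], truncation=True, **kwargs)
        prompt_enc = encode_funcs[1](texts[1], truncation=False, **kwargs)
        return [context_enc, prompt_enc]
```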

class recwizard.modules.unicrs.modeling_unicrs_gen.UnicrsGen(config: UnicrsGenConfig, edge_index=None, edge_type=None, **kwargs)[source]#
__init__(config: UnicrsGenConfig, edge_index=None, edge_type=None, **kwargs)[source]#
Parameters:

config – config for PreTrainedModel
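
Assuming the standard PreTrainedModel interface applies, a published checkpoint can be loaded with from_pretrained; the checkpoint name below is hypothetical:

```python
from recwizard.modules.unicrs.modeling_unicrs_gen import UnicrsGen

# Hypothetical checkpoint name -- substitute a real published checkpoint.
gen = UnicrsGen.from_pretrained("recwizard/unicrs-gen-redial")
```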

forward(context, prompt, entities, labels, **kwargs)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
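
Per the note, call the module instance rather than forward() directly. The field names below mirror the signature; how they map onto the tokenizer’s output is an assumption made for illustration:

```python
# `batch` is assumed to carry the tokenizer's merged fields plus labels
# (an assumption about the expected input layout, for illustration only).
outputs = gen(
    context=batch["context"],
    prompt=batch["prompt"],
    entities=batch["entities"],
    labels=batch["labels"],
)
```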

generate(context, entities, prompt, **kwargs)[source]#

Generates sequences of token ids for models with a language modeling head.

Note

Most generation-controlling parameters are set in generation_config which, if not passed, will be set to the model’s default generation configuration. You can override any generation_config by passing the corresponding parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True).

For an overview of generation strategies and code examples, check out the [following guide](../generation_strategies).


Parameters:
  • inputs (torch.Tensor of varying shape depending on the modality, optional) – The sequence used as a prompt for the generation or as model inputs to the encoder. If None the method initializes it with bos_token_id and a batch size of 1. For decoder-only models inputs should be in the format of input_ids. For encoder-decoder models inputs can represent any of input_ids, input_values, input_features, or pixel_values.

  • generation_config (~generation.GenerationConfig, optional) – The generation configuration to be used as base parametrization for the generation call. **kwargs passed to generate matching the attributes of generation_config will override them. If generation_config is not provided, the default will be used, which has the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit [~generation.GenerationConfig]’s default values, whose documentation should be checked to parameterize generation.

  • logits_processor (LogitsProcessorList, optional) – Custom logits processors that complement the default logits processors built from arguments and generation config. If a logits processor is passed that is already created with the arguments or a generation config, an error is thrown. This feature is intended for advanced users.

  • stopping_criteria (StoppingCriteriaList, optional) – Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criterion is passed that is already created with the arguments or a generation config, an error is thrown. This feature is intended for advanced users.

  • prefix_allowed_tokens_fn (Callable[[int, torch.Tensor], List[int]], optional) – If provided, this function constrains the beam search to allowed tokens only at each step. If not provided, no constraint is applied. This function takes 2 arguments: the batch ID batch_id and input_ids. It has to return a list with the allowed tokens for the next generation step, conditioned on the batch ID batch_id and the previously generated tokens input_ids. This argument is useful for constrained generation conditioned on the prefix, as described in [Autoregressive Entity Retrieval](https://arxiv.org/abs/2010.00904).

  • synced_gpus (bool, optional) – Whether to continue running the while loop until max_length. Unless overridden this flag will be set to True under DeepSpeed ZeRO Stage 3 multiple GPUs environment to avoid hanging if one GPU finished generating before other GPUs. Otherwise it’ll be set to False.

  • assistant_model (PreTrainedModel, optional) – An assistant model that can be used to accelerate generation. The assistant model must have the exact same tokenizer. The acceleration is achieved when forecasting candidate tokens with the assistant model is much faster than running generation with the model you’re calling generate from. As such, the assistant model should be much smaller.

  • streamer (BaseStreamer, optional) – Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids) and the streamer is responsible for any further processing.

  • negative_prompt_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) – The negative prompt needed for some processors such as CFG. The batch size must match the input batch size. This is an experimental feature, subject to breaking API changes in future versions.

  • negative_prompt_attention_mask (torch.LongTensor of shape (batch_size, sequence_length), optional) – Attention_mask for negative_prompt_ids.

  • kwargs (Dict[str, Any], optional) – Ad hoc parametrization of generate_config and/or additional model-specific kwargs that will be forwarded to the forward function of the model. If the model is an encoder-decoder model, encoder specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with decoder_.

Returns:

A [~utils.ModelOutput] (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a torch.LongTensor.

If the model is not an encoder-decoder model (model.config.is_encoder_decoder=False), the possible [~utils.ModelOutput] types are:

  • [~generation.GreedySearchDecoderOnlyOutput],

  • [~generation.SampleDecoderOnlyOutput],

  • [~generation.BeamSearchDecoderOnlyOutput],

  • [~generation.BeamSampleDecoderOnlyOutput]

If the model is an encoder-decoder model (model.config.is_encoder_decoder=True), the possible [~utils.ModelOutput] types are:

  • [~generation.GreedySearchEncoderDecoderOutput],

  • [~generation.SampleEncoderDecoderOutput],

  • [~generation.BeamSearchEncoderDecoderOutput],

  • [~generation.BeamSampleEncoderDecoderOutput]

Return type:

[~utils.ModelOutput] or torch.LongTensor
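
A generation sketch using the same assumed batch layout as the forward() example above; generation-controlling parameters are passed through **kwargs:

```python
# num_beams is forwarded to the generation config (see the note above).
generated_ids = gen.generate(
    context=batch["context"],
    entities=batch["entities"],
    prompt=batch["prompt"],
    num_beams=4,
)
```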

response(**kwargs)#

The main function for the module to generate a response given an input.

Note

Please refer to our tutorial for implementation guidance: Overview

Parameters:
  • raw_input (str) – the text input

  • tokenizer (PreTrainedTokenizer) – the tokenizer used to tokenize the input

  • return_dict (bool) – if set to True, will return a dict of outputs instead of a single output

  • **kwargs – the keyword arguments that will be passed to forward()

Returns:

By default, a single output will be returned. If return_dict is set to True, a dict of outputs will be returned.
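
An end-to-end sketch; the checkpoint name is hypothetical, as above:

```python
from recwizard.modules.unicrs.modeling_unicrs_gen import UnicrsGen
from recwizard.modules.unicrs.tokenizer_unicrs_gen import UnicrsGenTokenizer

tokenizer = UnicrsGenTokenizer.load_from_dataset("redial_unicrs")
gen = UnicrsGen.from_pretrained("recwizard/unicrs-gen-redial")  # hypothetical

reply = gen.response(
    raw_input="User: Can you recommend a movie like <Titanic>?",
    tokenizer=tokenizer,
)
print(reply)
```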