Generator Module#

class recwizard.modules.kgsf.configuration_kgsf_gen.KGSFGenConfig(dictionary=None, subkg=None, mask4movie=None, mask4key=None, embedding_data=None, edge_set=None, batch_size: int = 32, max_r_length: int = 30, embedding_size: int = 300, n_concept: int = 29308, dim: int = 128, n_entity: int = 64368, num_bases: int = 8, n_positions: int | None = None, truncate: int = 0, text_truncate: int = 0, label_truncate: int = 0, padding_idx: int = 0, start_idx: int = 1, end_idx: int = 2, longest_label: int = 1, pretrain: bool = False, n_heads: int = 2, n_layers: int = 2, ffn_size: int = 300, dropout: float = 0.1, attention_dropout: float = 0.0, relu_dropout: float = 0.1, learn_positional_embeddings: bool = False, embeddings_scale: bool = True, **kwargs)[source]#
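
A minimal construction sketch: every argument has a default, so only the fields you want to change need to be passed, in the usual Hugging Face PreTrainedConfig style. The values below simply restate a few defaults for illustration.

    from recwizard.modules.kgsf.configuration_kgsf_gen import KGSFGenConfig

    # Restating a few defaults explicitly; any omitted field keeps its default.
    config = KGSFGenConfig(
        batch_size=32,
        max_r_length=30,      # maximum generated-response length
        embedding_size=300,   # word-embedding dimension
        n_heads=2,            # transformer attention heads
        n_layers=2,           # transformer layers
    )
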
class recwizard.modules.kgsf.tokenizer_kgsf_gen.KGSFGenTokenizer(max_count: int = 5, max_c_length: int = 256, max_r_length: int = 30, n_entity: int = 64368, batch_size: int = 1, padding_idx: int = 0, entity2entityId: Dict[str, int] | None = None, word2index: Dict[str, int] | None = None, key2index: Dict[str, int] | None = None, entity_ids: List | None = None, id2name: Dict[str, str] | None = None, id2entity: Dict[int, str] | None = None, entity2id: Dict[str, int] | None = None, **kwargs)[source]#
__init__(max_count: int = 5, max_c_length: int = 256, max_r_length: int = 30, n_entity: int = 64368, batch_size: int = 1, padding_idx: int = 0, entity2entityId: Dict[str, int] | None = None, word2index: Dict[str, int] | None = None, key2index: Dict[str, int] | None = None, entity_ids: List | None = None, id2name: Dict[str, str] | None = None, id2entity: Dict[int, str] | None = None, entity2id: Dict[str, int] | None = None, **kwargs)[source]#
Parameters:
  • entity2id (Dict[str, int]) – a dict mapping entity name to entity id. If not provided, it will be generated from id2entity.

  • id2entity (Dict[int, str]) – a dict mapping entity id to entity name. If not provided, it will be generated from entity2id.

  • pad_entity_id (int) – the id used to pad entity sequences. If not provided, it defaults to the maximum entity id + 1.

  • tokenizers (List[PreTrainedTokenizerBase]) – a list of tokenizers to be used.

  • **kwargs – other arguments for PreTrainedTokenizer
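
A construction sketch with toy vocabularies; the names and values below are illustrative stand-ins, since pretrained checkpoints ship the full mappings and you would normally load those rather than build them by hand.

    from recwizard.modules.kgsf.tokenizer_kgsf_gen import KGSFGenTokenizer

    # Toy mappings for illustration only (real ones come with a checkpoint).
    tokenizer = KGSFGenTokenizer(
        word2index={"hello": 4, "hi": 5},      # word -> word id
        id2entity={0: "The_Matrix"},           # entity id -> entity name
        entity2entityId={"The_Matrix": 0},     # entity name -> entity id
        key2index={},                          # concept vocabulary (empty toy)
        entity_ids=[0],                        # ids of recommendable entities
    )
    # entity2id is omitted: per the parameters above it is derived from id2entity.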

get_init_kwargs()[source]#

The kwargs for initialization. They will be saved when you save the tokenizer or push it to huggingface model hub.
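
Continuing the sketch above, the standard Hugging Face save flow persists these kwargs so a reload restores the same vocabularies (the target path is arbitrary):

    # save_pretrained comes from the PreTrainedTokenizer base class.
    tokenizer.save_pretrained("./kgsf_gen_tokenizer")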

padding_w2v(sentence, max_length, pad=0, end=2, unk=3)[source]#

sentence: e.g. ['Okay', ',', 'have', 'you', 'seen', '@136983', '?'] / […]
max_length: 30 (max_r_length, for responses) / 256 (max_c_length, for contexts)
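
A usage sketch with the defaults spelled out; the exact return layout (padded id sequence plus lengths or masks) is not documented here, so inspect the return value rather than relying on the comment:

    tokens = ["Okay", ",", "have", "you", "seen", "@136983", "?"]
    # Presumably maps words to ids via word2index (unknown words -> unk=3),
    # appends the end token (end=2) and pads with pad=0 up to max_length.
    out = tokenizer.padding_w2v(tokens, max_length=30, pad=0, end=2, unk=3)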

padding_context(contexts, pad=0)[source]#

contexts: e.g. [['Hello'], ['hi', 'how', 'are', 'u'], ['Great', '.', 'How', 'are', 'you', 'this', 'morning', '?'], ['would', 'u', 'have', 'any', 'recommendations', 'for', 'me', 'im', 'good', 'thanks', 'fo', 'asking'], ['What', 'type', 'of', 'movie', 'are', 'you', 'looking', 'for', '?'], ['comedies', 'i', 'like', 'kristin', 'wigg'], ['Okay', ',', 'have', 'you', 'seen', '@136983', '?'], ['something', 'like', 'yes', 'have', 'watched', '@140066', '?']]
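
A matching sketch for whole dialogues, with the same hedge as above about the return layout:

    contexts = [["Hello"], ["hi", "how", "are", "u"]]
    # Expected to pad each turn and pack the turns into one context window
    # bounded by max_c_length (256 by default).
    out = tokenizer.padding_context(contexts, pad=0)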

encode(user_input=None, user_context=None, entity=None, system_response=None, movie=0)[source]#

user_input: e.g. "Hi, can you recommend a movie for me?"
user_context: e.g. [['Hello'], ['hi', 'how', 'are', 'u']] (TODO: should a separator such as _split_ be inserted between turns?)
entity: movie entities mentioned in user_context; default []
system_response: e.g. ['Great', '.', 'How', 'are', 'you', 'this', 'morning', '?']
movie: the movie mentioned in system_response, given as an ID; default None (TODO: with several movies in the response the same case is duplicated; how should the tokenizer handle that?)
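
A sketch of an inference-time call, following the documented arguments (no gold response is available, so system_response stays None):

    encoded = tokenizer.encode(
        user_input="Hi, can you recommend a movie for me?",
        user_context=[["Hello"], ["hi", "how", "are", "u"]],
        entity=[],               # movie entities mentioned in the context
        system_response=None,    # no gold response at inference time
        movie=0,
    )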

decode(outputs, labels=None)[source]#

Overrides the decode function from PreTrainedTokenizer. By default, calls the decode function of the first tokenizer.

class recwizard.modules.kgsf.modeling_kgsf_gen.KGSFGen(config, **kwargs)[source]#
__init__(config, **kwargs)[source]#
Parameters:

config – config for PreTrainedModel

_starts(bsz)[source]#

Return bsz start tokens.

decode_greedy(encoder_states, encoder_states_kg, encoder_states_db, attention_kg, attention_db, bsz, maxlen)[source]#

Greedy search

Parameters:
  • bsz (int) – Batch size. Because encoder_states is model-specific, the batch size cannot be inferred automatically.

  • encoder_states (Model specific) – Output of the encoder model.

  • maxlen (int) – Maximum decoding length

Returns:

pair (logits, choices) of the greedy decode

Return type:

(FloatTensor[bsz, maxlen, vocab], LongTensor[bsz, maxlen])
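
A shape-contract sketch; encoder_states, encoder_states_kg, encoder_states_db and the two attention tensors are placeholders standing in for the model's encoder outputs, and the shapes follow the documented return type:

    logits, preds = model.decode_greedy(
        encoder_states, encoder_states_kg, encoder_states_db,
        attention_kg, attention_db, bsz=2, maxlen=30,
    )
    # logits: FloatTensor[2, 30, vocab]; preds: LongTensor[2, 30]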

decode_forced(encoder_states, encoder_states_kg, encoder_states_db, attention_kg, attention_db, ys)[source]#

Decode with a fixed, true sequence, computing loss. Useful for training, or ranking fixed candidates.

Parameters:
  • ys (LongTensor[bsz, time]) – the prediction targets. Contains both the start and end tokens.

  • encoder_states (model specific) – Output of the encoder. Model specific types.

Returns:

pair (logits, choices) containing the logits and MLE predictions

Return type:

(FloatTensor[bsz, time, vocab], LongTensor[bsz, time])
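
A teacher-forcing sketch with the same placeholder encoder outputs as above; ys holds the gold response ids (start_idx=1 and end_idx=2 per the config), so the returned logits align position-wise with ys for a cross-entropy loss:

    import torch

    ys = torch.tensor([[1, 45, 102, 7, 2]])  # LongTensor[bsz=1, time=5]; word ids illustrative
    logits, preds = model.decode_forced(
        encoder_states, encoder_states_kg, encoder_states_db,
        attention_kg, attention_db, ys,
    )
    # logits: FloatTensor[1, 5, vocab]; preds: LongTensor[1, 5]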

forward(context, response, concept_mask, seed_sets, entity_vector, entity=None, cand_params=None, prev_enc=None, maxlen=None, bsz=None)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
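
A call sketch, assuming the batch tensors were produced by KGSFGenTokenizer.encode; note that the module instance is called, not .forward(), so registered hooks run:

    outputs = model(context, response, concept_mask, seed_sets, entity_vector)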

response(**kwargs)#

The main function for the module to generate a response given an input.

Note

Please refer to our tutorial for implementation guidance: Overview

Parameters:
  • raw_input (str) – the text input

  • tokenizer (PreTrainedTokenizer) – the tokenizer used to tokenize the input

  • return_dict (bool) – if set to True, will return a dict of outputs instead of a single output

  • **kwargs – the keyword arguments that will be passed to forward()

Returns:

By default, a single output will be returned. If return_dict is set to True, a dict of outputs will be returned.
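
A sketch of the documented interface, raw text in and a generated reply out (model and tokenizer as constructed above):

    reply = model.response(
        raw_input="Hi, can you recommend a movie for me?",
        tokenizer=tokenizer,
        return_dict=False,   # True returns a dict of outputs instead
    )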