Generator Module#
- class recwizard.modules.kgsf.configuration_kgsf_gen.KGSFGenConfig(dictionary=None, subkg=None, mask4movie=None, mask4key=None, embedding_data=None, edge_set=None, batch_size: int = 32, max_r_length: int = 30, embedding_size: int = 300, n_concept: int = 29308, dim: int = 128, n_entity: int = 64368, num_bases: int = 8, n_positions: int | None = None, truncate: int = 0, text_truncate: int = 0, label_truncate: int = 0, padding_idx: int = 0, start_idx: int = 1, end_idx: int = 2, longest_label: int = 1, pretrain: bool = False, n_heads: int = 2, n_layers: int = 2, ffn_size: int = 300, dropout: float = 0.1, attention_dropout: float = 0.0, relu_dropout: float = 0.1, learn_positional_embeddings: bool = False, embeddings_scale: bool = True, **kwargs)[source]#
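The sketch below shows one way to instantiate the config. It is only illustrative: the argument values simply repeat the documented defaults, and the import path follows the module path given above.

```python
from recwizard.modules.kgsf.configuration_kgsf_gen import KGSFGenConfig

# Illustrative only: these values repeat the documented defaults.
config = KGSFGenConfig(
    batch_size=32,
    max_r_length=30,     # maximum response length
    embedding_size=300,
    n_concept=29308,     # number of concepts
    n_entity=64368,      # number of entities
    n_heads=2,
    n_layers=2,
    ffn_size=300,
    dropout=0.1,
)
```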
- class recwizard.modules.kgsf.tokenizer_kgsf_gen.KGSFGenTokenizer(max_count: int = 5, max_c_length: int = 256, max_r_length: int = 30, n_entity: int = 64368, batch_size: int = 1, padding_idx: int = 0, entity2entityId: Dict[str, int] | None = None, word2index: Dict[str, int] | None = None, key2index: Dict[str, int] | None = None, entity_ids: List | None = None, id2name: Dict[str, str] | None = None, id2entity: Dict[int, str] | None = None, entity2id: Dict[str, int] | None = None, **kwargs)[source]#
- __init__(max_count: int = 5, max_c_length: int = 256, max_r_length: int = 30, n_entity: int = 64368, batch_size: int = 1, padding_idx: int = 0, entity2entityId: Dict[str, int] | None = None, word2index: Dict[str, int] | None = None, key2index: Dict[str, int] | None = None, entity_ids: List | None = None, id2name: Dict[str, str] | None = None, id2entity: Dict[int, str] | None = None, entity2id: Dict[str, int] | None = None, **kwargs)[source]#
- Parameters:
entity2id (Dict[str, int]) – a dict mapping entity names to entity ids. If not provided, it will be generated from id2entity.
id2entity (Dict[int, str]) – a dict mapping entity ids to entity names. If not provided, it will be generated from entity2id.
pad_entity_id (int) – the id used for the padding entity. If not provided, it defaults to the maximum entity id + 1.
tokenizers (List[PreTrainedTokenizerBase]) – a list of tokenizers to be used.
**kwargs – other arguments for PreTrainedTokenizer
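As a rough usage sketch, the tokenizer can be constructed from its vocabulary dictionaries; the dictionaries below are tiny placeholders rather than the real KGSF vocabularies, and loading from a published checkpoint via from_pretrained is only assumed to follow the usual Hugging Face convention (not confirmed by this page).

```python
from recwizard.modules.kgsf.tokenizer_kgsf_gen import KGSFGenTokenizer

# Hypothetical: load from a published checkpoint, assuming the usual
# Hugging Face-style from_pretrained interface.
# tokenizer = KGSFGenTokenizer.from_pretrained("recwizard/kgsf-gen")

# Or construct directly; the dictionaries here are placeholders, the real
# ones come from the KGSF data files.
tokenizer = KGSFGenTokenizer(
    word2index={"hello": 4, "hi": 5},
    entity2entityId={"<http://dbpedia.org/resource/Toy_Story>": 0},
    key2index={"movie": 0},
    entity_ids=[0],
    id2entity={0: "<http://dbpedia.org/resource/Toy_Story>"},
    max_c_length=256,
    max_r_length=30,
)
```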
- get_init_kwargs()[source]#
The keyword arguments used for initialization. They will be saved when you save the tokenizer or push it to the Hugging Face model hub.
- padding_w2v(sentence, max_length, pad=0, end=2, unk=3)[source]#
sentence: e.g. ['Okay', ',', 'have', 'you', 'seen', '@136983', '?'] / [...]; max_length: 30 / 256
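For illustration, a call might look like the sketch below; whether the method returns only the padded id sequence or also a length/mask is not specified on this page.

```python
tokens = ['Okay', ',', 'have', 'you', 'seen', '@136983', '?']
# pad=0, end=2, unk=3 are the documented defaults
padded = tokenizer.padding_w2v(tokens, max_length=30)
```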
- padding_context(contexts, pad=0)[source]#
contexts: e.g. [['Hello'], ['hi', 'how', 'are', 'u'], ['Great', '.', 'How', 'are', 'you', 'this', 'morning', '?'], ['would', 'u', 'have', 'any', 'recommendations', 'for', 'me', 'im', 'good', 'thanks', 'fo', 'asking'], ['What', 'type', 'of', 'movie', 'are', 'you', 'looking', 'for', '?'], ['comedies', 'i', 'like', 'kristin', 'wigg'], ['Okay', ',', 'have', 'you', 'seen', '@136983', '?'], ['something', 'like', 'yes', 'have', 'watched', '@140066', '?']]
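A hedged example of padding a multi-turn context; the tokenizer instance is assumed to be set up as above.

```python
contexts = [
    ['Hello'],
    ['hi', 'how', 'are', 'u'],
    ['What', 'type', 'of', 'movie', 'are', 'you', 'looking', 'for', '?'],
]
padded_context = tokenizer.padding_context(contexts)  # pad=0 by default
```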
- encode(user_input=None, user_context=None, entity=None, system_response=None, movie=0)[source]#
user_input: e.g. "Hi, can you recommend a movie for me?"
user_context: e.g. [['Hello'], ['hi', 'how', 'are', 'u']] (TODO: should a separator such as _split_ be considered?)
entity: movies mentioned in user_context, default []
system_response: e.g. ['Great', '.', 'How', 'are', 'you', 'this', 'morning', '?']
movie: the movie in system_response; the default is an ID, so None. ??? (TODO: if there are multiple movies, the case will be duplicated; how should the tokenizer handle this?)
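Putting the documented arguments together, a call might look like the following sketch; the exact structure of the returned encoding is not documented on this page.

```python
encoded = tokenizer.encode(
    user_input="Hi, can you recommend a movie for me?",
    user_context=[['Hello'], ['hi', 'how', 'are', 'u']],
    entity=[],  # movies mentioned in user_context
    system_response=['Great', '.', 'How', 'are', 'you', 'this', 'morning', '?'],
    movie=0,    # documented default
)
```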
- class recwizard.modules.kgsf.modeling_kgsf_gen.KGSFGen(config, **kwargs)[source]#
- decode_greedy(encoder_states, encoder_states_kg, encoder_states_db, attention_kg, attention_db, bsz, maxlen)[source]#
Greedy search
- Parameters:
bsz (int) – Batch size. Because encoder_states is model-specific, it cannot infer this automatically.
encoder_states (Model specific) – Output of the encoder model.
maxlen (int) – Maximum decoding length
- Returns:
pair (logits, choices) of the greedy decode
- Return type:
(FloatTensor[bsz, maxlen, vocab], LongTensor[bsz, maxlen])
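A minimal sketch of a greedy decoding call; model is a KGSFGen instance, and the encoder states and attention tensors are assumed to have been produced earlier by the model (their exact shapes are model-specific and not documented here).

```python
# encoder_states, encoder_states_kg, encoder_states_db, attention_kg and
# attention_db are model-specific outputs computed beforehand.
logits, choices = model.decode_greedy(
    encoder_states, encoder_states_kg, encoder_states_db,
    attention_kg, attention_db,
    bsz=batch_size,   # must be passed explicitly, see above
    maxlen=30,
)
# logits: FloatTensor[bsz, maxlen, vocab]; choices: LongTensor[bsz, maxlen]
```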
- decode_forced(encoder_states, encoder_states_kg, encoder_states_db, attention_kg, attention_db, ys)[source]#
Decode with a fixed, true sequence, computing loss. Useful for training, or ranking fixed candidates.
- Parameters:
ys (LongTensor[bsz, time]) – the prediction targets. Contains both the start and end tokens.
encoder_states (model specific) – Output of the encoder. Model specific types.
- Returns:
pair (logits, choices) containing the logits and MLE predictions
- Return type:
(FloatTensor[bsz, ys, vocab], LongTensor[bsz, ys])
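Correspondingly, teacher-forced decoding during training might look like this sketch, with target_tokens standing in for the gold responses (a hypothetical name).

```python
logits, preds = model.decode_forced(
    encoder_states, encoder_states_kg, encoder_states_db,
    attention_kg, attention_db,
    ys=target_tokens,  # LongTensor[bsz, time], includes start and end tokens
)
```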
- forward(context, response, concept_mask, seed_sets, entity_vector, entity=None, cand_params=None, prev_enc=None, maxlen=None, bsz=None)[source]#
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
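Following the note above, the module instance is called rather than forward() directly. The batch tensors below (context, response, concept_mask, seed_sets, entity_vector) are hypothetical names for whatever the tokenizer/collator produces, and the structure of outputs is not specified on this page.

```python
# Calling the instance (not forward()) ensures registered hooks run.
outputs = model(
    context, response, concept_mask, seed_sets, entity_vector,
    entity=None,
    maxlen=30,
    bsz=batch_size,
)
```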
- response(**kwargs)#
The main function for the module to generate a response given an input.
Note
Please refer to our tutorial for implementation guidance: Overview
- Parameters:
raw_input (str) – the text input
tokenizer (PreTrainedTokenizer) – the tokenizer used to tokenize the input
return_dict (bool) – if set to True, will return a dict of outputs instead of a single output
**kwargs – the keyword arguments that will be passed to forward()
- Returns:
By default, a single output will be returned. If return_dict is set to True, a dict of outputs will be returned.
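A hedged end-to-end sketch of generating a reply; the checkpoint name is hypothetical, loading via from_pretrained is assumed to follow the usual Hugging Face convention, and return_dict=True switches to a dict of outputs as described above.

```python
from recwizard.modules.kgsf.modeling_kgsf_gen import KGSFGen

# Hypothetical checkpoint name; assumes a from_pretrained-style loader.
model = KGSFGen.from_pretrained("recwizard/kgsf-gen")

reply = model.response(
    raw_input="Hi, can you recommend a movie for me?",
    tokenizer=tokenizer,
)
print(reply)  # a single generated response by default

details = model.response(
    raw_input="Hi, can you recommend a movie for me?",
    tokenizer=tokenizer,
    return_dict=True,   # returns a dict of outputs instead
)
```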