Overview

This overview explains how to create new RecWizard modules (such as generators, recommenders, and processors) or models (such as UniCRS models). In this tutorial, we create and share a new model, called NEW, trained on INSPIRED, to walk through the process of contributing to RecWizard.

1. Tutorial Setup

We create and share NEW as a new conversational recommender in five steps:

1. Create and train a NEW recommender;
2. Create and train a NEW generator;
3. Assemble the NEW recommender and generator under the NEW model;
4. Add interface features to the NEW model;
5. Learn more about how to contribute to the RecWizard codebase.

Let us look at the desired NEW model in advance, and check the detailed implementation steps in the corresponding sections:

1.1 NEW Recommender [details]

A NEW recommender can be used individually in this way:

Raw text input and output

System: Hello!<sep>
User: Hi. I like horror movies, such as <entity>The Shining (1980)</entity> and <entity>Annabelle (2014)</entity>.
Would you please recommend me some other movies?

['21 Bridges (2019)', 'The Conjuring (2013)', 'The Exorcist (1973)']

Tensor input and output

# inputs
{'input_ids': tensor([[1, 2]]), 'attention_mask': tensor([[True, True]])}

# logits
tensor([[ 4.3385,  7.7007,  6.6780, -1.6603, -1.5623,  3.8379,  2.0713,  0.2687]], grad_fn=<SumBackward1>)

We need to implement three components for this NEW recommender:

3.1. Recommender Configuration: NEWRecConfig

   from recwizard.configuration_utils import BaseConfig

   class NEWRecConfig(BaseConfig):
       """Configuration class to store the
       configuration of the NEW recommender."""

       def __init__(self, n_items: int = None, dim: int = None, **kwargs):
           super().__init__(**kwargs)

           self.n_items = n_items
           self.dim = dim

   # use it!
   config = NEWRecConfig(n_items=8, dim=10)

3.2. Recommender Tokenizer: ``NEWRecTokenizer``

```python
from typing import List

from recwizard.tokenizer_utils import BaseTokenizer
from recwizard.utility.utils import WrapSingleInput

class NEWRecTokenizer(BaseTokenizer):
    """Tokenizer class for the NEW recommender."""

    @WrapSingleInput
    def decode(self, ids, *args, **kwargs) -> List[str]:
        """Decode a list of token ids into a list of strings.
        Args:
            ids (List[int]): list of token ids to decode;
        Returns:
            List[str]: list of decoded strings;
        """
        return [self.id2entity[id] for id in ids if id in self.id2entity]

    def __call__(self, *args, **kwargs):
        """Tokenize a string into a list of token ids."""
        kwargs.update(return_tensors="pt", padding=True, truncation=True)
        return super().__call__(*args, **kwargs)

# use it!
tokenizer = NEWRecTokenizer(id2entity={
    0: '21 Bridges (2019)',
    1: 'The Shining (1980)',
    2: 'Annabelle (2014)',
    3: 'The Conjuring (2013)',
    4: 'The Exorcist (1973)',
    5: 'The Conjuring 2 (2016)',
    6: 'The Nun (2018)',
    7: 'X men (2019)',
})
```
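
How `BaseTokenizer` turns the raw query into entity ids is not shown in this section. As a rough illustration only, the sketch below assumes that mentions are marked with `<entity>` tags and looked up through the inverse of `id2entity`. The helper name `extract_entity_ids` is hypothetical and is not part of RecWizard.

```python
import re
from typing import Dict, List

# Same toy catalogue as in the tokenizer example above.
ID2ENTITY: Dict[int, str] = {
    0: "21 Bridges (2019)",
    1: "The Shining (1980)",
    2: "Annabelle (2014)",
    3: "The Conjuring (2013)",
    4: "The Exorcist (1973)",
    5: "The Conjuring 2 (2016)",
    6: "The Nun (2018)",
    7: "X men (2019)",
}

ENTITY_PATTERN = re.compile(r"<entity>(.*?)</entity>")


def extract_entity_ids(text: str, id2entity: Dict[int, str]) -> List[int]:
    """Hypothetical helper: map <entity>-tagged mentions to their ids."""
    entity2id = {name: idx for idx, name in id2entity.items()}
    mentions = ENTITY_PATTERN.findall(text)
    # Mentions that are not in the catalogue are skipped (an assumption).
    return [entity2id[m] for m in mentions if m in entity2id]


query = (
    "System: Hello!"
    "<sep>User: Hi. I like horror movies, such as "
    "<entity>The Shining (1980)</entity> and <entity>Annabelle (2014)</entity>."
    "Would you please recommend me some other movies?"
)
entity_ids = extract_entity_ids(query, ID2ENTITY)
print(entity_ids)  # [1, 2]
```

Under these assumptions the two tagged movies map to ids 1 and 2, which lines up with the `input_ids` shown in the tensor example in section 1.1.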

3.3. Recommender Module: `NEWRec`

```python
import torch

from recwizard.module_utils import BaseModule
from transformers.utils import ModelOutput


class NEWRec(BaseModule):
    """NEW is a module that implements the NEW recommender."""

    config_class = NEWRecConfig
    tokenizer_class = NEWRecTokenizer

    def __init__(self, config: NEWRecConfig, **kwargs):
        super().__init__(config, **kwargs)

        self.embeds = torch.nn.Embedding(config.n_items, config.dim)

    def forward(self, input_ids, attention_mask=None):
        """Forward pass of the NEW recommender."""

        embeds = self.embeds(input_ids)
        avg_embeds = embeds.sum(dim=1) / (attention_mask.sum(dim=-1, keepdim=True) + 1e-8)
        logits = (self.embeds.weight * avg_embeds.unsqueeze(1)).sum(dim=-1)
        return ModelOutput({"rec_logits": logits})

    @WrapSingleInput
    def response(self, raw_input, tokenizer, return_dict=False, topk=3):
        """Generate response from the NEW recommender."""

        # convert text input to tensor input
        entities = tokenizer(raw_input)['entities'].to(self.device)
        inputs = {
            "input_ids": entities,
            "attention_mask": entities != tokenizer.pad_entity_id,
        }

        # recommend top-k items
        logits = self.forward(**inputs)["rec_logits"]
        print(inputs, logits)
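        # mask out items already mentioned in the input; the row index is
        # shaped (batch, 1) so it broadcasts against entities (batch, n_entities)
        rows = torch.arange(logits.size(0), device=logits.device).unsqueeze(1)
        logits[rows, entities] = float("-inf")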
        recommended = logits.topk(topk).indices.tolist()
        output = tokenizer.batch_decode(recommended)

        # return the output
        if return_dict:
            return {
                "output": output,
                "input": raw_input,
                "recommended": recommended
            }
        return output

# use it!

model = NEWRec(config)

query = ('System: Hello!'
         '<sep>User: Hi. I like horror movies, such as <entity>The Shining (1980)</entity> and <entity>Annabelle (2014)</entity>.'
         'Would you please recommend me some other movies?'
         )

resp = model.response(
    raw_input=query,
    tokenizer=tokenizer,
    return_dict=True
)
```
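
The source of `WrapSingleInput` is not shown in this tutorial. Judging from its use here (a single query string goes in and a flat list of titles comes out), it appears to let a batch-oriented method accept one item as well. The sketch below is a hypothetical stand-in, named `wrap_single_input`, showing one way such a decorator could behave; it is not RecWizard's implementation.

```python
from functools import wraps
from typing import Callable, List


def wrap_single_input(func: Callable) -> Callable:
    """Hypothetical stand-in: let a batch method also accept a single item."""

    @wraps(func)
    def wrapper(self, raw_input, *args, **kwargs):
        if isinstance(raw_input, (list, tuple)):
            # Already a batch: pass it through unchanged.
            return func(self, list(raw_input), *args, **kwargs)
        # Single item: run it as a batch of one, then unwrap the result.
        result = func(self, [raw_input], *args, **kwargs)
        return result[0]

    return wrapper


class EchoModule:
    """Toy module whose response() is written for batches only."""

    @wrap_single_input
    def response(self, raw_input: List[str]) -> List[str]:
        return [text.upper() for text in raw_input]


module = EchoModule()
single = module.response(raw_input="hi")
batch = module.response(["hi", "there"])
print(single)  # HI
print(batch)   # ['HI', 'THERE']
```

Writing the method once for batches and unwrapping at the boundary keeps the tensor code free of special cases for single inputs.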

The complete implementation is in examples/develop_model/new_recommender.py.
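
The arithmetic in `forward` and the masking step in `response` can be traced by hand. The sketch below redoes them in plain Python on a made-up table of 4 items with 2-dimensional embeddings (illustrative numbers, not trained weights): average the embeddings of the mentioned items, score every item by its dot product with that average, mask the mentioned items, and take the top-k. The function names are hypothetical.

```python
from typing import List

# Made-up item embeddings (4 items, dim 2); not trained weights.
EMBEDS: List[List[float]] = [
    [1.0, 0.0],   # item 0
    [0.0, 1.0],   # item 1
    [1.0, 1.0],   # item 2
    [-1.0, 0.5],  # item 3
]


def score_items(mentioned: List[int]) -> List[float]:
    """Dot product of each item embedding with the mean mentioned embedding."""
    dim = len(EMBEDS[0])
    # Plain mean over mentioned items (the module divides by the mask sum).
    mean = [sum(EMBEDS[i][d] for i in mentioned) / len(mentioned) for d in range(dim)]
    return [sum(e[d] * mean[d] for d in range(dim)) for e in EMBEDS]


def recommend(mentioned: List[int], topk: int) -> List[int]:
    """Mask already-mentioned items, then return the top-k item ids."""
    scores = score_items(mentioned)
    for i in mentioned:
        scores[i] = float("-inf")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:topk]


scores = score_items([0, 1])
top = recommend([0, 1], topk=2)
print(scores)  # [0.5, 0.5, 1.0, -0.25]
print(top)     # [2, 3]
```

Without the masking step, items 0 and 1 would tie for second place, so the user could be recommended movies they have just mentioned.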

1.2 NEW Generator [details]

A NEW generator can be used individually in this way:

Raw text input and output
Tensor input and output

We need to implement three components for this NEW generator:

3.1. Generator Configuration: NEWGenConfig

   from recwizard.configuration_utils import BaseConfig

   class NEWGenConfig(BaseConfig):
       """Configuration class to store the
       configuration of the NEW generator."""

       def __init__(
           self, base_model: str = "microsoft/DialoGPT-small", n_items: int = None, max_gen_len=100, **kwargs
       ):
           super().__init__(**kwargs)

           self.base_model = base_model
           self.n_items = n_items
           self.max_gen_len = max_gen_len

   # use it!
   config = NEWGenConfig(base_model='microsoft/DialoGPT-small', n_items=8)

3.2. Generator Tokenizer: ``NEWGenTokenizer``

```python
from recwizard.tokenizer_utils import BaseTokenizer
from transformers import GPT2Tokenizer

class NEWGenTokenizer(BaseTokenizer):
    """Tokenizer class for the NEW generator."""

    def __call__(self, *args, **kwargs):
        """Tokenize a string into a list of token ids."""
        kwargs.update(return_tensors="pt", padding=True, truncation=True)
        return super().__call__(*args, **kwargs)

# use it!
word_tokenizer = GPT2Tokenizer.from_pretrained("microsoft/DialoGPT-small")
word_tokenizer.pad_token = word_tokenizer.eos_token
tokenizer = NEWGenTokenizer(
    tokenizers=[word_tokenizer],
    id2entity={
        0: "21 Bridges (2019)",
        1: "The Shining (1980)",
        2: "Annabelle (2014)",
        3: "The Conjuring (2013)",
        4: "The Exorcist (1973)",
        5: "The Conjuring 2 (2016)",
        6: "The Nun (2018)",
        7: "X men (2019)",
    },
)
```

3.3. Generator Module: `NEWGen`

```python

import torch

from recwizard.module_utils import BaseModule
from recwizard.utility.utils import WrapSingleInput
from transformers import GPT2LMHeadModel


class NEWGen(BaseModule):
    """NEW is a module that implements the NEW generator."""

    config_class = NEWGenConfig
    tokenizer_class = NEWGenTokenizer

    def __init__(self, config: NEWGenConfig, **kwargs):
        super().__init__(config, **kwargs)

        self.gpt2_model = GPT2LMHeadModel.from_pretrained(config.base_model)
        self.entity_embeds = torch.nn.Embedding(
            config.n_items, self.gpt2_model.config.n_embd
        )
        self.max_gen_len = config.max_gen_len

    def generate(self, context, entities, attention_mask, **kwargs):
        """Forward pass of the NEW generator."""

        embeds = self.entity_embeds(entities)
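        # (batch, 1, dim) / (batch, 1, 1): the extra unsqueeze keeps the
        # division row-wise when the batch holds more than one query
        avg_embeds = embeds.sum(dim=1, keepdim=True) / (
            attention_mask.sum(dim=-1, keepdim=True).unsqueeze(-1) + 1e-8
        )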
        text_embeds = self.gpt2_model.transformer.wte(context["input_ids"])
        inputs_embeds = torch.cat([avg_embeds, text_embeds], dim=1)
        attention_mask = torch.cat(
            [
                torch.ones(*avg_embeds.shape[:2]).to(avg_embeds.device),
                context["attention_mask"],
            ],
            dim=1,
        )

        return self.gpt2_model.generate(
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask,
            max_new_tokens=self.max_gen_len,
            return_dict_in_generate=True,
            **kwargs,
        )

    @WrapSingleInput
    def response(self, raw_input, tokenizer, return_dict=False):
        """Generate response from the NEW generator."""

        inputs = tokenizer(raw_input)
        context = {
            "input_ids": inputs["input_ids"].to(self.device),
            "attention_mask": inputs["attention_mask"].to(self.device),
        }

        # convert text input to entity input
        entities = inputs["entities"].to(self.device)
        attention_mask = entities != tokenizer.pad_entity_id

        # generate response
        generated = self.generate(
            context=context, entities=entities, attention_mask=attention_mask
        )
        output = tokenizer.batch_decode(generated.sequences)

        # return the output
        if return_dict:
            return {"output": output, "input": raw_input, "generated": generated}
        return output

# use it!
model = NEWGen(config)

query = (
    "System: Hello!"
    "<sep>User: Hi. I like horror movies, such as <entity>The Shining (1980)</entity> and <entity>Annabelle (2014)</entity>."
    "Would you please recommend me some other movies?"
)

resp = model.response(raw_input=query, tokenizer=tokenizer, return_dict=True)
```
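
`generate` prepends one pooled entity vector to the token embeddings, so the attention mask needs one extra leading position per row. The plain-Python sketch below mirrors only that bookkeeping for a single sequence, using nested lists in place of tensors. `prepend_entity_slot` is a hypothetical helper and the vectors are made-up numbers.

```python
from typing import List, Tuple

Vector = List[float]


def prepend_entity_slot(
    entity_vec: Vector,
    token_vecs: List[Vector],
    token_mask: List[int],
) -> Tuple[List[Vector], List[int]]:
    """Put one entity vector in front of a sequence and extend its mask."""
    if len(token_vecs) != len(token_mask):
        raise ValueError("token_vecs and token_mask must have the same length")
    inputs_embeds = [entity_vec] + token_vecs
    # The entity slot is always attended to, so its mask entry is 1.
    attention_mask = [1] + token_mask
    return inputs_embeds, attention_mask


embeds, mask = prepend_entity_slot(
    entity_vec=[0.5, 0.5],
    token_vecs=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
    token_mask=[1, 1, 0],  # last position is padding
)
print(len(embeds))  # 4
print(mask)         # [1, 1, 1, 0]
```

The embeddings and the mask must grow together; if only the embeddings gained a position, the two sequence lengths would no longer match.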

1.3 NEW Pipeline [details]

The NEW pipeline defines the logic for how the NEW recommender and generator are used together.

Raw text input and output
Tensor input and output

Create this high-level NEW Pipeline

3.1. Create NEW Pipeline Configuration: NEWConfig

   from recwizard.configuration_utils import BaseConfig

   class NEWConfig(BaseConfig):
       def __init__(self, **kwargs):
           super().__init__(**kwargs)

3.2. Create NEW Pipeline: ``NEWPipeline``

from recwizard.model_utils import BasePipeline

class NEWPipeline(BasePipeline):
    config_class = NEWConfig

    def forward(self, input_ids, attention_mask, labels=None, **kwargs):
        raise NotImplementedError

    @monitor
    def response(
        self, query, return_dict=False, rec_args=None, gen_args=None, **kwargs
    ):
        rec_args = rec_args or {}
        gen_args = gen_args or {}
        rec_output = self.rec_module.response(
            query, tokenizer=self.rec_tokenizer, return_dict=True, **rec_args
        )

        # decode() returns a list of titles, so join it before concatenating
        query_condition_on_rec = [
            q + "System: I recommend " + ", ".join(self.rec_tokenizer.decode(r)) + " because"
            for q, r in zip(query, rec_output["recommended"])
        ]

        gen_output = self.gen_module.response(
            query_condition_on_rec,
            tokenizer=self.gen_tokenizer,
            return_dict=True,
            **gen_args,
        )
        if return_dict:
            # use the keys that the modules' response() methods actually return
            return {
                "recommended": rec_output["recommended"],
                "generated": gen_output["generated"],
                "rec_output": rec_output["output"],
                "gen_output": gen_output["output"],
            }

        return gen_output["output"][0] + "\n - " + "\n - ".join(rec_output["output"])
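
The pipeline's `response` chains the two modules: the recommender's titles are appended to the query as the start of a system turn, and the generator continues from there. The sketch below isolates that string-building and the final formatting with stub data. `condition_on_recommendations` and `format_reply` are hypothetical helper names, and the generated sentence is a placeholder rather than model output.

```python
from typing import List


def condition_on_recommendations(
    queries: List[str], titles: List[List[str]]
) -> List[str]:
    """Append the recommended titles to each query as a partial system turn."""
    return [
        q + "System: I recommend " + ", ".join(t) + " because"
        for q, t in zip(queries, titles)
    ]


def format_reply(generated: str, titles: List[str]) -> str:
    """Combine the generated text with a bulleted list of recommendations."""
    return generated + "\n - " + "\n - ".join(titles)


queries = ["User: I like horror movies."]
titles = [["The Conjuring (2013)", "The Exorcist (1973)"]]

prompts = condition_on_recommendations(queries, titles)
reply = format_reply("(generated text placeholder)", titles[0])
print(prompts[0])
print(reply)
```

Ending the prompt with "because" nudges the generator to justify the recommendation instead of starting an unrelated turn.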

1.4 NEW Interactive Interface [details]

We can use recwizard.Monitor to define and launch the interactive interface after building the model.