Introducing Sensei (先生)
A simple, powerful and minimal codebase to generate synthetic data using OpenAI
In Japanese, the term "Sensei" (先生) is used to refer to someone who is a teacher, instructor, or a master in a particular field or discipline. The term is composed of two kanji characters: 先 (sen) meaning "before" or "earlier," and 生 (sei) meaning "life" or "birth". Therefore, it can be interpreted as "one who comes before" in the sense of someone who has gone before in a particular path of learning and can guide others along that path.
While "Sensei" is most commonly associated with teachers in schools, it can also be applied to other professionals who are considered masters of their craft, such as martial arts instructors, doctors, lawyers, and sometimes even politicians or artists. The term conveys respect and acknowledges the individual's expertise and role as a mentor or guide.
I’m here to introduce you to a simple but very powerful tool for generating synthetic data using OpenAI’s GPT-4. Sensei has been my framework for generating synthetic data with GPT-4. It includes the Orca system contexts, as well as 10 new system contexts that I designed for creating Synthia, Tess, and HelixNet. I’m thrilled to now make it available to you!
As the saying goes, "Give a man a fish, and you feed him for a day. Teach a man to fish, and you feed him for a lifetime". I want to focus now on giving you all the tools you need to train your own state-of-the-art AIs. Sure, there are Open Source AI models that perform well. But wouldn’t you want to own the entire end-to-end process, so you’re not beholden to someone else’s thoughts and beliefs, including this author’s? My goal now is to empower you, so you can go ahead and create your own world class AIs.
And it all starts with data.
You see, training an AI is all about data. It is through data that you align (or fine-tune) an AI.
But before we get into the process, here is Sensei: https://github.com/migtissera/Sensei
How do I use Sensei?
Sensei is a tool for creating your own synthetic data using GPT-4. It’s pretty easy to use: just follow the instructions in the README. In short, all you need is an OpenAI API key. Enter it into the `params.py` file, and you’re good to go.
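For reference, `params.py` conceptually just holds your key and a few knobs. The sketch below is illustrative only; the real field names live in the repo’s actual `params.py`, so check there:

```python
# params.py (illustrative sketch; the real field names are in the
# Sensei repo at https://github.com/migtissera/Sensei)
OPENAI_API_KEY = "sk-..."          # your OpenAI API key
MODEL = "gpt-4"                    # model used to generate the data
NUM_WORKERS = 8                    # parallel generation workers
OUTPUT_PATH = "data/sensei.jsonl"  # where generated pairs are written
```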
Sensei generates question/answer pairs given different system contexts. The system contexts are carefully designed following the Orca paper by Microsoft Research, to infuse rich information into the dataset. For more information, you may refer to the paper “Orca: Progressive Learning from Complex Explanation Traces of GPT-4” [https://arxiv.org/abs/2306.02707].
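To make the mechanics concrete, here’s a minimal sketch of the core generation loop: pick a system context, ask GPT-4 to invent a question, then ask it to answer that question under the same context. The system context text and the helper function are my own illustration, not Sensei’s actual code:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One example system context in the spirit of Orca's step-by-step
# instruction styles (illustrative only; Sensei ships its own set).
SYSTEM_CONTEXT = (
    "You are an AI assistant. Think step-by-step and explain your "
    "reasoning before giving the final answer."
)

def chat(system: str, user: str) -> str:
    """Send one system+user turn to GPT-4 and return the reply text."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content

# Step 1: have GPT-4 invent a question for this system context.
question = chat(SYSTEM_CONTEXT, "Generate one challenging, self-contained question.")

# Step 2: have GPT-4 answer it under the same system context.
answer = chat(SYSTEM_CONTEXT, question)

print({"system": SYSTEM_CONTEXT, "question": question, "answer": answer})
```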
Once you’ve run Sensei for long enough (you can adjust the number of workers in `params.py`), you will end up with a dataset large enough to fine-tune an AI base model. Anywhere between 10k and 25k samples is enough to create a state-of-the-art model. Heck, the LIMA paper “LIMA: Less Is More for Alignment” by Meta [https://arxiv.org/abs/2305.11206] suggests that as few as 1k samples can be enough. But empirically I have found that about 25k is the sweet spot.
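Whatever size you settle on, make sure each sample lands in a format your training framework can read. We’ll use Axolotl below, which understands (among others) alpaca-style JSON records, so one line of your JSONL dataset might look like this (an illustrative sample; the exact fields depend on the dataset `type` you pick in the config):

```json
{"instruction": "Explain, step by step, why the sky is blue.", "input": "", "output": "Sunlight entering the atmosphere is scattered by air molecules. Shorter (blue) wavelengths scatter more strongly than longer (red) ones, so blue light reaches your eyes from all directions."}
```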
Let’s train an AI!
The go-to framework nowadays for fine-tuning an AI is Axolotl: https://github.com/OpenAccess-AI-Collective/axolotl
You’ll need to follow the instructions on the repo to get it installed. Once you have done that, go ahead and download the Mistral-7B base model from Hugging Face: https://huggingface.co/mistralai/Mistral-7B-v0.1
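If you’d rather cache the weights locally ahead of time, one way (assuming you have the `huggingface_hub` CLI installed) is:

```
huggingface-cli download mistralai/Mistral-7B-v0.1
```

That said, Axolotl will also pull the model automatically when your config’s `base_model` points at the Hugging Face repo id, so this step is optional.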
Now you have all the ingredients needed to create a state-of-the-art AI. The only other requirement is an NVIDIA GPU. At home I have an RTX 4090, which has 24GB of VRAM. I’m going to assume you have something similar.
Here’s a gist file for the needed Axolotl config.
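In case the gist doesn’t load for you, here’s a minimal sketch of what a QLoRA config for Mistral-7B on a 24GB card might look like. The keys follow Axolotl’s documented schema, but the values are my starting points rather than the exact gist contents, so compare against the examples in the Axolotl repo:

```yaml
# Minimal QLoRA config sketch for Mistral-7B (illustrative values;
# compare against the examples/ directory in the Axolotl repo).
base_model: mistralai/Mistral-7B-v0.1
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer

load_in_4bit: true
adapter: qlora

datasets:
  - path: data/sensei.jsonl   # the dataset Sensei generated for you
    type: alpaca

sequence_len: 4096
sample_packing: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true

micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
optimizer: adamw_bnb_8bit
lr_scheduler: cosine

bf16: true
gradient_checkpointing: true
flash_attention: true

output_dir: ./qlora-out
val_set_size: 0.05
```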
Now you have everything you need to start your training job:

- Your own dataset
- The Axolotl YAML configuration

You can kick off training by typing:
```
accelerate launch -m axolotl.cli.train <path to the yaml>
```
Merge the QLoRA Adapter to the Base Model
The last thing you need to do, once training is complete, is merge your QLoRA adapter into the base model. You can do this with:
```
python -m axolotl.cli.merge_lora <path to the yaml> --lora_model_dir="<path to your QLoRA adapter>"
```
Boom! Now you have your very own AI! Congratulations.
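As a quick sanity check, you can load the merged model with the `transformers` library and talk to it. A minimal sketch, assuming the merged weights ended up in a `merged` folder under your output directory (adjust the path and the prompt format to match your training data):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path produced by the merge step above; adjust to wherever
# your merged model actually landed.
model_path = "./qlora-out/merged"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

# Simple example prompt format; use whichever format you trained with.
prompt = "USER: What is synthetic data?\nASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```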
Let me know how you go!