Learnings From Training Tess
Documenting some of the lessons learned going from Tess-v1.0 to Tess-v1.3 on the Yi-34B-200K base model.
Tess-v1.0
Tess-v1.0 was trained on LIMA principles (Less Is More for Alignment; Meta AI). I curated a very high-quality dataset of 4,500 samples, generated by GPT-4, with Orca and Tree-of-Thought (ToT) system prompts. The ToT system prompt was designed after reading the paper Tree of Thoughts: Deliberate Problem Solving with Large Language Models, by a team of researchers from Princeton and Google DeepMind.
Here is my ToT prompt:
Answer the Question by exploring multiple reasoning paths as follows:
- First, carefully analyze the question to extract the key information components and break it down into logical sub-questions. This helps set up the framework for reasoning. The goal is to construct an internal search tree.
- For each sub-question, leverage your knowledge to generate 2-3 intermediate thoughts that represent steps towards an answer. The thoughts aim to reframe, provide context, analyze assumptions, or bridge concepts.
- Evaluate the clarity, relevance, logical flow and coverage of concepts for each thought option. Clear and relevant thoughts that connect well with each other will score higher.
- Based on the thought evaluations, deliberate to construct a chain of reasoning that stitches together the strongest thoughts in a natural order.
- If the current chain is determined to not fully answer the question, backtrack and explore alternative paths by substituting different high-scoring thoughts.
- Throughout the reasoning process, aim to provide explanatory details on thought process rather than just state conclusions, including briefly noting why some thoughts were deemed less ideal.
- Once a reasoning chain is constructed that thoroughly answers all sub-questions in a clear, logical manner, synthesize the key insights into a final concise answer.
- Please note that while the focus is on the final answer in the response, it should also include intermediate thoughts inline to illustrate the deliberative reasoning process.
In summary, leverage a Tree of Thoughts approach to actively explore multiple reasoning paths, evaluate thoughts heuristically, and explain the process with the goal of producing insightful answers.
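The steps in the prompt above can be sketched as a small beam search over candidate thoughts. This is only an illustration of the procedure the prompt asks the model to emulate internally; `propose` and `score` are hypothetical stand-ins for what an LLM would do (generate 2-3 intermediate thoughts, then rate them heuristically).

```python
def tot_search(question, propose, score, depth=3, beam=2):
    """Greedy beam search over chains of thoughts.

    propose(question, chain) -> list of candidate next thoughts
    score(thought) -> heuristic rating (clarity, relevance, ...)
    """
    chains = [[]]  # each chain is a list of thoughts, starting empty
    for _ in range(depth):
        candidates = []
        for chain in chains:
            for thought in propose(question, chain):
                candidates.append(chain + [thought])
        # keep the `beam` chains whose latest thought scores highest
        candidates.sort(key=lambda c: score(c[-1]), reverse=True)
        chains = candidates[:beam]
    return chains[0]  # the strongest chain of reasoning


# Toy stand-ins, purely to show the mechanics:
propose = lambda q, chain: [f"t{len(chain)}a", f"t{len(chain)}b"]
score = lambda t: 1.0 if t.endswith("b") else 0.5
best = tot_search("Q", propose, score, depth=2, beam=1)  # ['t0b', 't1b']
```

In a real setting both callables would be LLM calls, and backtracking falls out naturally: lower-scoring chains stay in the beam and get another chance at the next depth.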
What went wrong:
As reported by users, the model was not able to follow instructions well, and in particular couldn't handle reasoning or logical tasks. It would either totally ignore the user instruction or come up with gibberish.
Why?
The mistake was that I trained the model with QLoRA, and only for a single epoch. The learning here is that if you use a LIMA-style training methodology, with a small curated dataset, you need to train for enough epochs even if the dataset is super high-quality. This is especially true if you're training with a Parameter-Efficient Fine-Tuning (PEFT) method like QLoRA.
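To see why a single epoch undertrains here, it helps to count optimizer steps. A rough sketch (the batch size and gradient-accumulation values below are illustrative assumptions, not my actual settings):

```python
import math

def optimizer_steps(num_samples, epochs, batch_size, grad_accum):
    """Total weight updates the adapters receive over a training run."""
    steps_per_epoch = math.ceil(num_samples / (batch_size * grad_accum))
    return steps_per_epoch * epochs

# One epoch over the 4,500-sample LIMA-style set, with an effective
# batch of 4 * 8 = 32 (illustrative numbers):
one_epoch = optimizer_steps(4500, epochs=1, batch_size=4, grad_accum=8)  # 141 steps
four_epochs = optimizer_steps(4500, epochs=4, batch_size=4, grad_accum=8)  # 564 steps
```

A hundred-odd updates is very little signal for low-rank adapters to absorb a new response style, which is consistent with the instruction-following failures users reported.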
On to Tess-v1.1 we go…
Tess-v1.1
Added the SynthIA-v1.3 dataset to the Tess-v1.0 dataset. Now the Tess-v1.1 dataset is 125K samples.
Trained the model with QLoRA for 2 epochs, but with one major difference from all my previous training runs.
After reading the Orca 2 paper (Orca 2: Teaching Small Language Models How to Reason) by Microsoft Research, I ran an experiment: I trained the model with only the Instruction and Response, dropping the System message, and then added the System message back at inference time.
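A minimal sketch of that setup, assuming a simple SYSTEM:/USER:/ASSISTANT: layout (the exact Tess prompt template may differ):

```python
def build_prompt(instruction, system=None):
    """Assemble a prompt; omit `system` to reproduce the v1.1 train-time
    format, pass it to reproduce the inference-time format."""
    parts = []
    if system:
        parts.append(f"SYSTEM: {system}")
    parts.append(f"USER: {instruction}")
    parts.append("ASSISTANT:")
    return "\n".join(parts)

# Train-time sample (System message dropped):
train_prompt = build_prompt("Summarize the paper.")
# Inference-time prompt (System message added back):
infer_prompt = build_prompt("Summarize the paper.", system="You are Tess.")
```

The mismatch this creates is exactly the risk: the model never sees a SYSTEM block during training, so at inference it has no learned behaviour for conditioning on one.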
What went wrong?
Well, the model was okay, but not great! Some users were still complaining that the model couldn't follow instructions that well. Also, the overall quality of the answers had declined, which I blame on the missing System message in the input: the model never learned how to tailor its answers to a given context.
Tess-v1.2
Okay, so we move on to Tess-v1.2. I re-added the System context, keeping the dataset the same as Tess-v1.1. Trained the model with QLoRA for 2 epochs.
The model worked fine, but some of the annoying things from the SynthIA-v1.3 dataset started showing up.
When I initially created SynthIA-v1.3 dataset, I added some prompting so that GPT-4 would return with a json of the sort:
{"evolved_thought": <>, "follow-up-question": <>}
This was added so that I could keep generating an entire conversation from a single prompt, with GPT-4 doing the heavy lifting: provide an answer, then a follow-up question, then another answer, and so on.
The issue was that the SynthIA-v1.3 dataset carries this contamination in some samples. So Tess-v1.2 actually learned it, and after reaching a fairly long context of about 12K, it would start spitting out these JSON-style responses.
Even though users could probably tolerate this, I didn’t like this at all…
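A hypothetical cleanup pass for this kind of contamination would be to unwrap any response that came back as the JSON scaffold instead of plain text (the key names match the format shown above; everything else here is illustrative):

```python
import json

def clean_response(text):
    """Return the plain answer, unwrapping the JSON scaffold if present."""
    stripped = text.strip()
    if stripped.startswith("{"):
        try:
            obj = json.loads(stripped)
            # Contaminated samples carry the generation scaffold verbatim;
            # keep only the actual answer.
            if "evolved_thought" in obj:
                return obj["evolved_thought"]
        except json.JSONDecodeError:
            pass  # not valid JSON, treat as a normal answer
    return text
```

Running something like this over every assistant turn (or simply dropping samples where it fires) would keep the scaffold out of the training signal entirely.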
So we now have Tess-v1.3!
Tess-v1.3
I sorted all of those issues!
I went back to the drawing board and assembled an ultra-high-quality dataset of 33K samples. It has Orca-style system messages, my ToT message, and about 10 new, very detailed system messages that I devised (similar to the ToT message described earlier). All data was generated using GPT-4 over a few months.
For Tess-M-v1.3, I trained the Yi-34B-200K base with QLoRA for 3 epochs.
For Tess-XS-v1.3, I did a full fine-tune of the Mistral-7B-128K by Nous Research (context-extended with YaRN) for 3 epochs.
The resulting models are amazing!
I've tested Tess-M-v1.3 up to a 16K context window. It does exhibit slight repetition around the 16K mark.
I also tested Tess-XS-v1.3 up to a 16K context window. While it handles longer contexts much better than the Yi-34B-200K, it also shows some repetition issues starting around 16K.
What’s Next
I plan to take a bit of a break from fine-tuning LLMs. I actually know how to get around these slight repetition issues and rectify them, but I simply don't have the time right now.
But I will get it done… and once it's done, I'll let you know how I did it.
In the meantime, enjoy the Tess-v1.3 models. They are great!
Also, try Pepai! Pepai is your very own AI that has a great sense of humour.
Take care everyone!
Migel