150M-parameter SLM built with a single GPU
I pretrained a 150M-parameter Korean language model fully from scratch using only a single Nvidia RTX 3090 GPU. I built the tokenizer with SentencePiece, used a LatentMoE architecture, gathered the datasets myself, and fine-tuned the result. Everything is open-source on GitHub and Hugging Face.
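For reference, the tokenizer is the simplest piece of the pipeline. Below is a minimal sketch of training a Korean SentencePiece tokenizer; the corpus path, vocabulary size, and model prefix are placeholders, not the exact settings used for this project.

```python
import sentencepiece as spm

# Minimal sketch: train a BPE tokenizer on a Korean corpus with SentencePiece.
# "korean_corpus.txt", the vocab size, and the prefix are placeholder values.
spm.SentencePieceTrainer.train(
    input="korean_corpus.txt",        # one sentence per line
    model_prefix="raptor_tokenizer",  # writes raptor_tokenizer.model / .vocab
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,        # keep rare Hangul characters
)

# Load the trained model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor(model_file="raptor_tokenizer.model")
print(sp.encode("대통령은 약속했다", out_type=str))
```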
There were still many failures throughout this project. To begin with, the computer died; I had pushed it far beyond its limits. The main culprit was the power supply. Ironically, I had even written code to pause training every 200 steps just to let things cool down, but that only made the problem worse. So, technically, I ended up stress-testing my own computer. The problems didn't stop at hardware, though; there were plenty of software issues too. Let me list them out.
I stopped pretraining at the 5th epoch for two reasons. First, validation perplexity started increasing after the 4th epoch, peaking by the 6th.
Since perplexity is the exponential of the loss, this indicated that the loss was also increasing.
Second, each epoch required around 36 hours to complete.
Given that the 4th epoch showed the best validation perplexity, I decided to fine-tune that checkpoint into a chatbot.
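Since perplexity is just the exponential of the average cross-entropy loss, the two curves always move together. Here is a tiny sketch with made-up numbers (not my actual validation logs) showing why picking the lowest-perplexity epoch is the same as picking the lowest-loss epoch.

```python
import math

# Perplexity = exp(mean cross-entropy loss), so minimizing one minimizes the other.
# The loss values below are illustrative only, not the actual validation logs.
val_loss_by_epoch = {3: 3.05, 4: 2.96, 5: 3.02, 6: 3.10}
for epoch, loss in sorted(val_loss_by_epoch.items()):
    print(f"epoch {epoch}: loss={loss:.2f}, perplexity={math.exp(loss):.1f}")
# The epoch with the lowest loss (epoch 4 here) also has the lowest perplexity,
# which is the checkpoint worth keeping for fine-tuning.
```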
(Although I mentioned it briefly, this stage involved a lot of trial and error. I initially attempted to train a 1B model, but it couldn't run efficiently on my GPU. So I scaled down to 72M, read the Chinchilla scaling law paper, and worked out the optimal model size that I could actually pretrain. This took about a month.)
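For anyone curious, the Chinchilla rule of thumb is roughly 20 training tokens per parameter. The sketch below uses a hypothetical token budget just to show the arithmetic; it is not my actual dataset size.

```python
# Chinchilla rule of thumb: compute-optimal training uses ~20 tokens per parameter.
# The 3B-token budget below is a hypothetical example, not my actual corpus size.
TOKENS_PER_PARAM = 20

def compute_optimal_params(num_tokens: int) -> float:
    """Rough compute-optimal parameter count for a given training-token budget."""
    return num_tokens / TOKENS_PER_PARAM

token_budget = 3_000_000_000  # hypothetical ~3B-token Korean corpus
print(f"~{compute_optimal_params(token_budget) / 1e6:.0f}M parameters")  # ~150M
```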
As you can see from the image, the next-token prediction is actually not bad. For those who don't read Korean, my prompt was "The president has". The Raptor model continued it with "promised that 'Through this ceremony, we will do our best to protect precious lives living in our community.'" You can tell that the model has picked up how Korean works.
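If you want to try generation yourself, it looks roughly like this with the Hugging Face transformers API. The repo id and the Korean prompt below are placeholders, and a custom architecture like this one may need `trust_remote_code=True`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "username/raptor-150m" is a placeholder repo id, not the actual Hugging Face path.
repo_id = "username/raptor-150m"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

prompt = "대통령은"  # roughly "The president...", standing in for the actual prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```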