A 150M-parameter SLM built with a single GPU
I pretrained a 150M-parameter Korean language model fully from scratch using only a single Nvidia RTX 3090 GPU. I built a tokenizer with SentencePiece, used the LatentMoE architecture, gathered the datasets, and fine-tuned the model. It is open-source on GitHub and Hugging Face.
GPU Killer & Failures
There were many failures throughout this project. To begin with, the computer died: I pushed it far beyond its limits, and the main culprit was the power supply. Ironically, I had even written code to pause every 200 steps just to cool things down, but that only made the problem worse. In effect, I ran a real "stress test" on my own machine. The problems didn't stop there, either; there were plenty of software issues as well. Let me list them out.
Architecture
I followed DeepSeek's design and added the MoE (Mixture of Experts) architecture from scratch. (Back then, DeepSeek had not yet released the R2 model that applied MoE.) I first pretrained on Korean Wiki and an AI Hub dataset that consists mainly of news articles. I used SentencePiece to build a tokenizer, but one thought passed through my mind: Korean is easy to pick up at first, yet it has a massive learning curve, because unlike English it has a rich variety of morphemes. So I decided to build the tokenizer on morphemes, using Mecab and SentencePiece. The advantages of tokenizing on morphemes are, first, that the model can truly capture the grammatical patterns of the language, and second, that it effectively increases the size of the training data.
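Below is a minimal sketch of that morpheme-aware tokenizer pipeline: pre-segment each sentence with Mecab, then train SentencePiece on the segmented text. The file names, vocabulary size, and model type here are my assumptions, not the settings actually used for Raptor.

```python
# Sketch: build a morpheme-aware SentencePiece tokenizer.
# Assumes konlpy + the mecab-ko dictionary and the sentencepiece package are installed.
import sentencepiece as spm
from konlpy.tag import Mecab

mecab = Mecab()

# Write a morpheme-segmented copy of the raw corpus (one sentence per line).
with open("corpus.txt", encoding="utf-8") as src, \
     open("corpus_morphs.txt", "w", encoding="utf-8") as dst:
    for line in src:
        morphs = mecab.morphs(line.strip())        # e.g. "대통령은" -> ["대통령", "은"]
        dst.write(" ".join(morphs) + "\n")

# Train SentencePiece on the pre-segmented text.
spm.SentencePieceTrainer.train(
    input="corpus_morphs.txt",
    model_prefix="raptor_morph",     # hypothetical prefix
    vocab_size=32000,                # assumed; the post does not state the real vocab size
    model_type="unigram",
    character_coverage=0.9995,       # a typical setting for Korean
)

sp = spm.SentencePieceProcessor(model_file="raptor_morph.model")
print(sp.encode("대통령은 이번 행사를 통해", out_type=str))
```

Because SentencePiece is trained on pre-segmented text, subword pieces rarely cross morpheme boundaries, which is where the grammar-awareness benefit mentioned above comes from.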
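The MoE side can be illustrated with a toy top-k routed feed-forward block. This is only a generic PyTorch sketch of the idea of sending each token to a few expert MLPs; it is not the DeepSeek-style LatentMoE actually used in Raptor, and every dimension below is made up.

```python
# Toy Mixture-of-Experts feed-forward block with top-k routing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)   # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (batch, seq, d_model)
        scores = self.gate(x)                   # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e         # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(2, 16, 512)
print(MoEFeedForward()(x).shape)   # torch.Size([2, 16, 512])
```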
Pretrain Results
I stopped pretraining at the 5th epoch for two reasons.
First, validation perplexity started increasing after the 4th epoch, peaking by the 6th.
Since perplexity is the exponential of the loss, this indicated that the loss was also increasing (see the quick check after this paragraph).
Second, each epoch required around 36 hours to complete.
Given that the 4th epoch showed the best validation perplexity, I decided to fine-tune this model as a chatbot.
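For reference, here is the numeric check behind that reasoning: perplexity is exp(mean cross-entropy loss), so a rising validation perplexity necessarily means a rising loss. The loss values below are illustrative, not the numbers from the actual run.

```python
# Perplexity is exp(mean cross-entropy loss), so the two curves move together.
import math

val_losses = [3.40, 3.25, 3.18, 3.15, 3.21, 3.30]        # epochs 1-6, made-up numbers
for epoch, loss in enumerate(val_losses, start=1):
    print(f"epoch {epoch}: loss {loss:.2f} -> perplexity {math.exp(loss):.1f}")
# The minimum sits at epoch 4; everything after it rises, which is the early-stopping signal.
```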
(Although I mention it only briefly, this stage involved a lot of trial and error. Initially, I attempted to train a 1B model, but it couldn't run efficiently on my GPU. So I changed to 72M, read the Chinchilla Scaling Law paper, and found the optimal model size that could be pretrained. This took about a month.)
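For anyone curious about that sizing step, a rough Chinchilla-style check looks like the sketch below. It uses the common ~20 training tokens per parameter rule of thumb from the paper; the corpus token count is a hypothetical figure, not the actual size of my Korean Wiki + AI Hub data.

```python
# Rough Chinchilla-style sizing: ~20 training tokens per parameter (Hoffmann et al., 2022).
def chinchilla_optimal_params(n_tokens: float, tokens_per_param: float = 20.0) -> float:
    return n_tokens / tokens_per_param

corpus_tokens = 3e9                                   # hypothetical: ~3B training tokens
optimal = chinchilla_optimal_params(corpus_tokens)
print(f"compute-optimal model size ~= {optimal / 1e6:.0f}M parameters")   # 150M for 3B tokens
```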
As you can see from the image, the next-token prediction is actually not bad. (input: "대통령은", output: "이번 행사를 통해 우리 지역의 소중한 생명을 보호하기 위해 모든 것을 다하겠다"고 약속했다.") For those who don't understand Korean, my prompt was "The president", and the Raptor model continued with "promised that 'through this event, we will do everything to protect the precious lives of our community.'" You can tell that the model has learned how Korean works.
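The sample above comes from ordinary autoregressive decoding. As a rough illustration (not the project's actual inference code), a greedy next-token loop over a SentencePiece tokenizer might look like this; `model` and `sp` stand in for the pretrained Raptor model and its tokenizer, and the model is assumed to return per-position logits.

```python
# Sketch: greedy autoregressive decoding with a SentencePiece tokenizer.
import torch

@torch.no_grad()
def generate(model, sp, prompt, max_new_tokens=64):
    ids = torch.tensor([sp.encode(prompt)])                       # (1, prompt_len)
    for _ in range(max_new_tokens):
        logits = model(ids)                                       # assumed shape: (1, seq, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick of the next token
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == sp.eos_id():                         # stop at end-of-sequence
            break
    return sp.decode(ids[0].tolist())

# print(generate(model, sp, "대통령은"))
```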
Fine-tune Results
I fine-tuned on a Korean multi-turn dialogue dataset from AI Hub. First, I preprocessed it into rows of prompt-and-answer pairs. Because I forgot to mask out the prompt (the user input), the first fine-tuned model kept whining instead of answering like a chatbot. So I masked the user prompt and retrained, and finally got the model to respond like a chatbot. But due to the small fine-tuning dataset and the small parameter count, its performance is limited. I will keep improving this model and build an English version of Raptor, too.
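For clarity, the masking fix amounts to setting the prompt positions of the label tensor to -100 so the loss is computed only on the assistant's answer; PyTorch's cross-entropy loss ignores that index. The helper and token IDs below are a hypothetical sketch, not the project's actual preprocessing code.

```python
# Sketch: mask the user prompt so the loss covers only the assistant's answer.
import torch

IGNORE_INDEX = -100   # targets with this value are ignored by nn.CrossEntropyLoss

def build_example(prompt_ids, answer_ids):
    input_ids = torch.tensor(prompt_ids + answer_ids)
    labels = torch.tensor([IGNORE_INDEX] * len(prompt_ids) + answer_ids)
    return input_ids, labels

# e.g. prompt tokens [5, 17, 221] (user turn), answer tokens [42, 9, 2] (assistant turn)
inp, lab = build_example([5, 17, 221], [42, 9, 2])
print(lab)   # tensor([-100, -100, -100, 42, 9, 2])
```

With this, gradients flow only from the answer tokens, which is what finally made the model respond like a chatbot rather than imitate the user.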