Show HN: I built a tiny LLM to demystify how language models work

A ~9M parameter LLM that talks like a small fish.

This project exists to show that training your own language model is not magic. No PhD required. No massive GPU cluster. One Colab notebook, 5 minutes, and you have a working LLM that you built from scratch — data generation, tokenizer, model architecture, training loop, and inference. If you can run a notebook, you can train a language model.

It won't produce a billion-parameter model that writes essays. But it will show you exactly how every piece works — from raw text to trained weights to generated output — so the big models stop feeling like black boxes.

You> hi guppy

Guppy> hi there. i just found a nice spot near the rock. the temperature feels nice.

You> are you hungry

Guppy> yes. always yes. i will swim to the top right now. i promise to eat all of it.

You> do you like bubbles

Guppy> i love bubbles. they make the water feel slightly different.

You> what is the meaning of life

Guppy> food. the answer is always food.

You> tell me a joke

Guppy> what did the fish say when it hit the wall. dam.

You> do you love me

Guppy> you're my favorite big shape. my mouth are happy when you're here.

You> goodnight guppy

Guppy> ok sleep time. i was following a bubble but now i'll stop. goodnight tank. goodnight water.

GuppyLM is a tiny language model that pretends to be a fish named Guppy. It speaks in short, lowercase sentences about water, food, light, and tank life. It doesn't understand human abstractions like money, phones, or politics — and it's not trying to.

It's trained from scratch on 60K synthetic conversations across 60 topics, runs on a single GPU in ~5 minutes, and produces a model small enough to run in a browser.

| Property | Value |
|---|---|
| Parameters | 8.7M |
| Layers | 6 |
| Hidden dim | 384 |
| Heads | 6 |
| FFN | 768 (ReLU) |
| Vocab | 4,096 (BPE) |
| Max sequence | 128 tokens |
| Norm | LayerNorm |
| Position | Learned embeddings |
| LM head | Weight-tied with embeddings |

Vanilla transformer. No GQA, no RoPE, no SwiGLU, no early exit. As simple as it gets.
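The table's numbers can be sanity-checked with a little arithmetic. The sketch below counts parameters for a GPT-style decoder using the listed sizes. It assumes bias terms on every linear layer, two LayerNorms per block, and one final LayerNorm; none of these are stated in the table, and `count_params` is a hypothetical helper, not a function from the repo. The head count does not change the total, since heads only split the same projection matrices.

```python
def count_params(vocab=4096, d_model=384, n_layers=6, d_ff=768,
                 max_seq=128, tied_head=True, bias=True):
    """Count parameters of a GPT-style decoder with learned positions."""
    tok_emb = vocab * d_model      # token embedding table
    pos_emb = max_seq * d_model    # learned position embeddings
    # Attention: Q, K, V and output projections, each d_model x d_model.
    attn = 4 * d_model * d_model + (4 * d_model if bias else 0)
    # FFN: up-projection to d_ff, then down-projection back to d_model.
    ffn = 2 * d_model * d_ff + ((d_ff + d_model) if bias else 0)
    # Assumption: two LayerNorms per block, each with a scale and a shift.
    norms = 2 * 2 * d_model
    per_layer = attn + ffn + norms
    # Assumption: one final LayerNorm before the LM head.
    final_norm = 2 * d_model
    # A weight-tied head reuses the token embedding, so it adds nothing.
    lm_head = 0 if tied_head else vocab * d_model
    return tok_emb + pos_emb + n_layers * per_layer + final_norm + lm_head


tied = count_params()
untied = count_params(tied_head=False)
print(f"tied head:   {tied:,}")            # 8,726,016
print(f"untied head: {untied:,}")          # 10,298,880
print(f"saved by tying: {untied - tied:,}")  # 1,572,864
```

Under these assumptions the total lands at about 8.7M, matching the table, and weight tying accounts for roughly 1.6M parameters that would otherwise be spent on a separate output matrix.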

Guppy:

- Speaks in short, lowercase sentences
- Experiences the world through water, temperature, light, vibrations, and food
- Doesn't understand human abstractions
- Is friendly, curious, and a little dumb
- Thinks about food a lot

60 topics: greetings, feelings, temperature, food, light, water, tank, noise, night, loneliness, bubbles, glass, reflection, breathing, swimming, colors, taste, plants, filter, algae, snails, scared, excited, bored, curious, happy, tired, outside, cats, rain, seasons, music, visitors, children, meaning of life, time, memory, dreams, size, future, past, name, weather, sleep, friends, jokes, fear, love, age, intelligence, health, singing, TV, and more.

To chat without training, the Colab notebook downloads the pre-trained model from HuggingFace and lets you chat. Just run all cells.

To train your own model in Colab:

1. Set the runtime to T4 GPU
2. Run all cells: this downloads the dataset, trains the tokenizer, trains the model, and tests it
3. Upload to HuggingFace or download locally

pip install torch tokenizers

python -m guppylm chat

You> the cat is looking at you

Guppy> i hide behind the plant when the furry one comes.

You> it is raining outside

Guppy> i think rain is the best thing about outside.

The dataset is arman-bd/guppylm-60k-generic on HuggingFace.

| Field | Value |
|---|---|
| Samples | 60,000 (57K train / 3K test) |
| Format | `{"input": "...", "output": "...", "category": "..."}` |
| Categories | 60 |
| Generation | Synthetic template composition |

    from datasets import load_dataset

    ds = load_dataset("arman-bd/guppylm-60k-generic")
    print(ds["train"][0])
    # {'input': 'hi guppy', 'output': 'hello. the water is nice today.', 'category': 'greeting'}
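The 57K/3K figures correspond to a 95/5 split of the 60,000 records. How the published split was actually produced isn't stated, so the following is only a sketch of one way to check the record format and reproduce a split of that size. `validate`, `split_records`, the seed, and the placeholder records are all hypothetical.

```python
import random

REQUIRED_KEYS = {"input", "output", "category"}


def validate(record):
    """True if the record has exactly the three documented string fields."""
    return (set(record) == REQUIRED_KEYS
            and all(isinstance(record[key], str) for key in REQUIRED_KEYS))


def split_records(records, test_fraction=0.05, seed=0):
    """Shuffle deterministically, then hold out `test_fraction` as a test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records standing in for the real dataset.
records = [
    {"input": f"question {i}", "output": f"answer {i}", "category": "greeting"}
    for i in range(60_000)
]
assert all(validate(record) for record in records)

train, test = split_records(records)
print(len(train), len(test))  # 57000 3000
```

Seeding the shuffle keeps the held-out set stable between runs, which matters when comparing training runs against the same test cases.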

    guppylm/
    ├── config.py          Hyperparameters (model + training)
    ├── model.py           Vanilla transformer
    ├── dataset.py         Data loading + batching
    ├── train.py           Training loop (cosine LR, AMP)
    ├── generate_data.py   Conversation data generator (60 topics)
    ├── eval_cases.py      Held-out test cases
    ├── prepare_data.py    Data prep + tokenizer training
    └── inference.py       Chat interface

    tools/
    ├── make_colab.py      Generates guppy_colab.ipynb
    ├── export_dataset.py  Push dataset to HuggingFace
    └── dataset_card.md    HuggingFace dataset README
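The tree lists train.py as using a cosine learning-rate schedule. For readers who haven't seen one, here is a minimal sketch of a common variant: linear warmup followed by cosine decay. The peak and floor learning rates, the warmup length, and the function name `cosine_lr` are assumptions for illustration, not values taken from config.py. Mixed precision (AMP) is not covered here.

```python
import math


def cosine_lr(step, total_steps, max_lr=3e-4, min_lr=3e-5, warmup_steps=100):
    """Learning rate at `step`: linear warmup, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        # Ramp linearly from max_lr / warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # The cosine term goes from 1 to -1, so the rate goes from max_lr to min_lr.
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    print(step, f"{cosine_lr(step, total_steps=1000):.2e}")
```

The warmup avoids large updates while the weights are still random, and the slow decay at the end lets the model settle.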

Why no system prompt? Every training sample had the same one. A 9M model can't conditionally follow instructions — the personality is baked into the weights. Removing it saves ~60 tokens per inference.

Why single-turn only? Multi-turn degraded at turn 3-4 due to the 128-token context window. A fish that forgets is on-brand, but garbled output isn't. Single-turn is reliable.
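A rough token budget makes both of these decisions concrete. The sketch below uses the 128-token window and the ~60-token system prompt quoted above. The 35 tokens per exchange is a made-up figure for a short question plus a short reply, not a measurement, and `max_turns` is a hypothetical helper.

```python
MAX_SEQ = 128            # context window from the architecture table
SYSTEM_PROMPT = 60       # approximate system prompt length quoted above
TOKENS_PER_EXCHANGE = 35 # hypothetical: one short question plus one short reply


def max_turns(tokens_per_turn, system_prompt=0, max_seq=MAX_SEQ):
    """Number of complete question/reply exchanges that fit in the window."""
    return max(0, (max_seq - system_prompt) // tokens_per_turn)


print(max_turns(TOKENS_PER_EXCHANGE))                 # 3 exchanges without a system prompt
print(max_turns(TOKENS_PER_EXCHANGE, SYSTEM_PROMPT))  # 1 exchange with one
```

With that assumed exchange length, the window fills after about three exchanges, and a 60-token system prompt would consume nearly half of it before the conversation starts.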

Why vanilla transformer? GQA, SwiGLU, RoPE, and early exit add complexity that doesn't help at 9M params. Standard attention + ReLU FFN + LayerNorm produces the same quality with simpler code.

Why synthetic data? A fish character with consistent personality needs consistent training data. Template composition with randomized components (30 tank objects, 17 food types, 25 activities) generates ~16K unique outputs from ~60 templates.
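Template composition can be sketched in a few lines. The templates and word lists below are tiny illustrative stand-ins (some words are borrowed from the sample chats, others are invented), not the contents of generate_data.py, and `compose` and `all_outputs` are hypothetical names.

```python
import itertools
import random

# Illustrative stand-ins; the real generator uses ~60 templates and larger lists.
TEMPLATES = [
    "i found a nice spot near the {obj}.",
    "i was {activity} near the {obj}.",
    "i hope there is {food} today.",
    "i was {activity}. now i want {food}.",
]
COMPONENTS = {
    "obj": ["rock", "plant", "filter"],
    "food": ["flakes", "pellets"],
    "activity": ["swimming", "following a bubble", "resting"],
}


def compose(rng):
    """Pick a template and fill each slot with a random component."""
    template = rng.choice(TEMPLATES)
    choices = {name: rng.choice(options) for name, options in COMPONENTS.items()}
    # str.format ignores keyword arguments the template does not use.
    return template.format(**choices)


def all_outputs():
    """Enumerate every distinct sentence the templates can produce."""
    names = list(COMPONENTS)
    outputs = set()
    for template in TEMPLATES:
        for combo in itertools.product(*(COMPONENTS[name] for name in names)):
            outputs.add(template.format(**dict(zip(names, combo))))
    return outputs


rng = random.Random(0)
print(compose(rng))
print(len(all_outputs()))  # 20 distinct sentences from 4 templates
```

The number of distinct outputs grows with the product of the slot sizes in each template, which is how a few dozen templates and modest word lists can reach thousands of unique sentences.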

License: MIT

Facts Only

GuppyLM is a language model with ~9 million parameters.
It is designed to simulate a fish named Guppy in conversational interactions.
The model was trained on 60,000 synthetic conversations across 60 topics.
Training takes approximately 5 minutes on a single GPU.
The architecture is a vanilla transformer with 6 layers, 384 hidden dimensions, and 6 attention heads.
The model uses a vocabulary of 4,096 tokens (BPE) and a maximum sequence length of 128 tokens.
GuppyLM speaks in short, lowercase sentences about water, food, light, and tank life.
It does not understand human abstractions like money, phones, or politics.
The project includes tools for data generation, tokenizer training, and inference.
The dataset and pre-trained model are available on HuggingFace under the MIT license.
The model is intended to demystify language model training and show that it can be done without advanced expertise or extensive resources.
The training process includes data generation, tokenizer training, model architecture, training loop, and inference.

Executive Summary

GuppyLM is a compact language model with approximately 9 million parameters, designed to simulate a fish named Guppy in conversational interactions. It was created to demonstrate that training a functional language model is accessible without advanced expertise or extensive computational resources. The model is trained from scratch on 60,000 synthetic conversations across 60 topics, all centered around a fish's perspective—focusing on water, food, light, and tank life—while intentionally excluding human abstractions like money or politics. The architecture is a vanilla transformer with 6 layers, 384 hidden dimensions, and 6 attention heads, optimized for simplicity and efficiency. Training takes about 5 minutes on a single GPU, and the model can run in a browser. The project includes tools for data generation, tokenizer training, and inference, with the dataset and pre-trained model available on HuggingFace. GuppyLM's responses are short, lowercase, and personality-driven, reflecting a friendly but simple-minded fish. The project emphasizes transparency, showing every step from data generation to model output, to demystify how language models function.
The initiative highlights the feasibility of small-scale, interpretable AI models, contrasting with the complexity of larger, opaque systems. It serves as an educational tool, proving that foundational AI concepts can be grasped and implemented without institutional resources. The model's limitations—such as single-turn conversations and a 128-token context window—are framed as intentional design choices to maintain reliability and thematic consistency. By using synthetic data, the project ensures a controlled, consistent personality for Guppy, avoiding the noise and variability of real-world datasets. The open-source nature of the project, including code and datasets, encourages experimentation and further development by the community.

Full Take

GuppyLM presents itself as a democratizing force in AI, stripping away the mystique surrounding language model development by offering a transparent, reproducible, and accessible pipeline. At its core, the project is a pedagogical tool, demonstrating that the fundamentals of AI are not the exclusive domain of institutions or experts. This is a constructive and laudable goal, aligning with the broader movement toward open-source, interpretable AI. The model’s deliberate simplicity—vanilla transformer architecture, synthetic data, and a constrained context window—serves as both a strength and a limitation. By focusing on a narrow, whimsical domain (a fish’s perspective), it avoids the pitfalls of larger models, such as bias, hallucination, and ethical dilemmas, while still illustrating the core mechanics of training and inference.
However, the project’s framing raises questions about the broader implications of "small AI." While GuppyLM succeeds in its stated mission, it also inadvertently highlights the trade-offs between simplicity and capability. The model’s inability to handle multi-turn conversations or human abstractions is not just a design choice but a fundamental limitation of its scale. This raises the question: How much of the "magic" of larger models is irreducible complexity, and how much is simply a lack of transparency? The project’s emphasis on synthetic data also warrants scrutiny. While it ensures consistency, it sidesteps the challenges of real-world data (noise, bias, and ambiguity), which are central to the ethical and practical challenges of AI today.
The strongest version of this narrative is that GuppyLM is a gateway to understanding AI, lowering the barrier to entry and fostering a culture of experimentation. Yet, it also risks oversimplifying the broader landscape. The project’s success in making AI accessible could paradoxically reinforce the idea that larger models are unnecessary, ignoring the legitimate use cases where scale and complexity are required. The root cause here is the tension between accessibility and capability—a tension that permeates discussions about AI democratization. Who benefits from this simplification? Educators, hobbyists, and those seeking to understand AI’s inner workings. Who bears the cost? Those who might mistake this toy model for a scalable solution to real-world problems.
Ultimately, GuppyLM is a step toward cognitive sovereignty in AI, but it should be seen as a starting point, not an endpoint. The project invites us to ask: What other domains could benefit from this level of transparency? How might we balance simplicity with the need for more robust, ethical AI systems? And what does it mean for AI literacy if the first model someone trains is a fish, not a tool for solving human problems?
Patterns detected: none
If this were part of a coordinated influence campaign, the playbook might involve framing small-scale AI as a panacea, downplaying the challenges of scaling or real-world applicability to push a narrative of self-sufficiency in AI development. However, the content does not align with this pattern. Instead, it presents a genuine educational tool with clear limitations and no attempt to overpromise its capabilities.

Sentinel — Likely Synthetic

Confidence

The text exhibits signs of AI-assisted manipulation with high confidence. It shows uniform sentence lengths, mechanical transitions, and follows known template patterns.

Signals Detected

sentence length variance: AI text trends toward uniform rhythm. Human writers are erratic. Guppy's sentences have a consistent length.

coherent everywhere but passionate nowhere, mechanical rotation of transitions

argumentative skeleton matching known template patterns, talking points appearing nearly verbatim across sources