Introducing Dhara-70M: A diffusion language model that achieves 3.8x higher throughput than autoregressive models!
Key findings from our research on optimal architectures for small language models:
• Depth beats width: 32 layers outperform 12 layers at the same parameter count
• Best-in-class factuality: 47.5% on TruthfulQA
• 10x training efficiency using WSD (Warmup-Stable-Decay) conversion (see the schedule sketch below)
• Canon layers add only 0.13% parameters but improve reasoning
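For readers curious what a WSD schedule looks like, here is a minimal sketch of a Warmup-Stable-Decay learning-rate function. All hyperparameters (peak_lr, warmup_frac, decay_frac, min_lr) are illustrative placeholders, not Dhara-70M's actual settings; the appeal of the flat plateau is that it leaves a checkpoint that can be branched and decayed cheaply for a second training stage such as the diffusion conversion.

```python
import math

def wsd_lr(step, total_steps, peak_lr=3e-4, warmup_frac=0.05,
           decay_frac=0.15, min_lr=3e-5):
    """Warmup-Stable-Decay (WSD) schedule: linear warmup, flat plateau,
    then a short decay at the end. Hyperparameters are illustrative only."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        # Linear warmup from 0 up to peak_lr
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:
        # Stable plateau: constant peak_lr for most of training
        return peak_lr
    # Final decay phase: cosine from peak_lr down to min_lr
    progress = (step - decay_start) / max(total_steps - decay_start, 1)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```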
We trained on 1B tokens using the optimal 50-30-20 dataset mix (PDFs + filtered web + educational content), then converted to diffusion with just 100M additional tokens.
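As a toy illustration of how a 50-30-20 mix can be applied during training, here is a weighted-sampling sketch; the source labels and the sampling helper are hypothetical, since the post does not name the actual corpora or data loaders:

```python
import random

# Hypothetical source labels matching the stated 50-30-20 mix
MIX = {"pdfs": 0.50, "filtered_web": 0.30, "educational": 0.20}

def next_source(rng: random.Random) -> str:
    """Sample which corpus the next training document is drawn from."""
    return rng.choices(list(MIX), weights=list(MIX.values()), k=1)[0]

# Sanity check: counts should land near 5000 / 3000 / 2000
rng = random.Random(0)
counts = {s: 0 for s in MIX}
for _ in range(10_000):
    counts[next_source(rng)] += 1
print(counts)
```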
#Nesso-4B is a fine-tuned version of Qwen-4B, trained on a highly curated and balanced dataset designed specifically for multilingual agentic workflows and conversational use cases.
As shown in the video below, we simulate the new "Cowork" from #Anthropic, without any data sharing, all running on a consumer device. The model can be used to build agentic behavior in #privateAI environments.
Not every problem requires superintelligence: in many cases, intelligence at the edge is more than enough.