What's the capacity in this architecture?
What kind of quality boost do you see from finetuning the 19B pruned model compared to the full 26B model? I'm thinking that better MoE offloading of experts to system RAM could make the un-pruned models work in less VRAM, but there's no denying the speed benefit of what you're doing.
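For anyone wanting to try the expert-offload approach mentioned above, here's a rough sketch using llama.cpp's tensor-override flag (the model filename and the tensor-name regex are illustrative assumptions; check your build supports `--override-tensor` / `-ot`):

```shell
# Hedged sketch: keep attention/dense layers on the GPU, pin the MoE
# expert FFN tensors to system RAM. Flag names are llama.cpp's; the
# GGUF path and regex are placeholders, not the exact model discussed.
llama-cli \
  -m ./model-Q4_K_M.gguf \
  -ngl 99 \
  -ot "ffn_.*_exps.*=CPU"
```

Since only a few experts fire per token, keeping them in system RAM costs far less speed than offloading whole layers would.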
sometimesanotion To give you an idea of the speed increase from the pruning: I'm currently running an RX 7700 XT and an RX 7600 in my build with a Ryzen 5700X3D. The stock base Gemma 4 26B A4B gets about 30 to 40 tokens per second, while the REAP-pruned 19B A4B Gemma (pre-DavidAU training) gets between 50 and 55 tokens per second. So a drastic increase in speed from the pruning, but it's going to take David's finetuning to make this model really good. It's fast.
Thanks for the confirmation. Others' benchmarks on pruned MoEs suggest that while some scores drop, others rise; these REAP prunes become specialists. So there's ample room for focused REAP datasets and then finetunes. Very cool, I appreciate your work, David!
Just a clarification:
At the moment I'm waiting on being able to train in 4-bit (reduced VRAM); currently most sparse MoE models require training at full 16-bit, which can take 60-80 GB of VRAM.
Likewise, training a sparse MoE model takes a lot longer: 8-12 hours for a moderate dataset, vs. 2-3 hours for a 31B dense model.
Unsloth and BNB (bitsandbytes) are working on this: lossless 4-bit training of sparse MoEs.
I used this pruned model as a base because it had detailed stats/testing, so I could see what was removed/damaged/reduced and knew roughly what to target to bring this model up to spec / meet requirements.
A 19B/21B heretic base (same base as this training, just heretic'ed) has also dropped in the past day; that will be trained as well.
ADDED NOTE:
Sometimes activating an extra expert or two (up from this model's default of 8) will address/overcome some pruning issues.
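One way to try that at inference time is llama.cpp's metadata override (a sketch, assuming a build with `--override-kv`; the metadata key prefix is architecture-dependent, so verify the real key with a GGUF metadata dump first, and the model path here is a placeholder):

```shell
# Hedged sketch: raise the number of active experts from 8 to 10 at
# load time. "gemma" is a stand-in for the model's actual arch prefix
# in the GGUF metadata; the file path is also illustrative.
llama-cli \
  -m ./model-Q4_K_M.gguf \
  --override-kv gemma.expert_used_count=int:10
```

Expect somewhat slower generation with more active experts, since more FFN weights are touched per token.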