Speculative: a collection of papers curated by stereoplegic
• AutoMix: Automatically Mixing Language Models (arXiv:2310.12963)
• Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning (arXiv:2310.03094)
• MatFormer: Nested Transformer for Elastic Inference (arXiv:2310.07707)
• DistillSpec: Improving Speculative Decoding via Knowledge Distillation (arXiv:2310.08461)
• DialCoT Meets PPO: Decomposing and Exploring Reasoning Paths in Smaller Language Models (arXiv:2310.05074)
• Confident Adaptive Language Modeling (arXiv:2207.07061)
• LLMCad: Fast and Scalable On-device Large Language Model Inference (arXiv:2309.04255)
• SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification (arXiv:2305.09781)
• Future Lens: Anticipating Subsequent Tokens from a Single Hidden State (arXiv:2311.04897)
• Depth-Adaptive Transformer (arXiv:1910.10073)
• Fast Inference from Transformers via Speculative Decoding (arXiv:2211.17192)
• Accelerating Large Language Model Decoding with Speculative Sampling (arXiv:2302.01318)
• RecycleGPT: An Autoregressive Language Model with Recyclable Module (arXiv:2308.03421)
• OrchestraLLM: Efficient Orchestration of Language Models for Dialogue State Tracking (arXiv:2311.09758)
• Small Language Models Improve Giants by Rewriting Their Outputs (arXiv:2305.13514)
• SortedNet, a Place for Every Network and Every Network in its Place: Towards a Generalized Solution for Training Many-in-One Neural Networks (arXiv:2309.00255)
• Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference Using Sorted Fine-Tuning (SoFT) (arXiv:2309.08968)
• Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (arXiv:2310.05424)
• Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster (arXiv:2311.08263)
• BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models (arXiv:2401.12522)
• Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding (arXiv:2401.07851)
• APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding (arXiv:2401.06761)
• Cascade Speculative Drafting for Even Faster LLM Inference (arXiv:2312.11462)
• Speculative Contrastive Decoding (arXiv:2311.08981)
• Answering Unseen Questions With Smaller Language Models Using Rationale Generation and Dense Retrieval (arXiv:2308.04711)
• Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding (arXiv:2402.05109)
• Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding (arXiv:2402.12374)
• Online Speculative Decoding (arXiv:2310.07177)
• Speculative Streaming: Fast LLM Inference without Auxiliary Models (arXiv:2402.11131)
• Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (arXiv:2402.13720)
• Recurrent Drafter for Fast Speculative Decoding in Large Language Models (arXiv:2403.09919)
• Better & Faster Large Language Models via Multi-token Prediction (arXiv:2404.19737)
• Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge (arXiv:2405.00263)
• TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding (arXiv:2404.11912)
• Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting (arXiv:2404.18911)
• Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens (arXiv:2402.15758)
• REST: Retrieval-Based Speculative Decoding (arXiv:2311.08252)
• Accelerating Production LLMs with Combined Token/Embedding Speculators (arXiv:2404.19124)
• Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration (arXiv:2404.12022)
• Speculative Decoding with Big Little Decoder (arXiv:2302.07863)
• KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation (arXiv:2405.05329)
• Accelerating Speculative Decoding using Dynamic Speculation Length (arXiv:2405.04304)
• SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens (arXiv:2403.18647)
• On Speculative Decoding for Multimodal Large Language Models (arXiv:2404.08856)
• You Only Cache Once: Decoder-Decoder Architectures for Language Models (arXiv:2405.05254)
• Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding (arXiv:2404.08698)
• Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference (arXiv:2405.18628)
• EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models (arXiv:2405.07542)
• OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure (arXiv:2406.17276)
• Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism (arXiv:2406.03853)
• Optimizing Speculative Decoding for Serving Large Language Models Using Goodput (arXiv:2406.14066)
• S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models (arXiv:2407.01955)
• Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training (arXiv:2406.17404)
• Adaptive Draft-Verification for Efficient Large Language Model Decoding (arXiv:2407.12021)
• SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding (arXiv:2406.18200)
• Parallel Speculative Decoding with Adaptive Draft Length (arXiv:2408.11850)
• Improving Multi-candidate Speculative Decoding (arXiv:2409.10644)
• Learning Harmonized Representations for Speculative Sampling (arXiv:2408.15766)
• Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling (arXiv:2408.08696)
• PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation (arXiv:2407.11798)
• Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding (arXiv:2502.05609)
• Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE (arXiv:2502.06282)
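Most entries in this collection build on the same draft-then-verify loop popularized by "Fast Inference from Transformers via Speculative Decoding" (arXiv:2211.17192): a cheap drafter proposes several tokens, the expensive target model verifies them in one pass, and the accepted prefix is kept so the output matches plain target decoding. A minimal greedy sketch, assuming toy deterministic stand-ins for both models (`draft_model`, `target_model`, and the skip-multiples-of-4 rule are illustrative inventions, not any listed paper's method):

```python
def draft_model(tokens):
    # Toy cheap drafter: proposes last token + 1.
    return tokens[-1] + 1

def target_model(tokens):
    # Toy expensive target: same rule, but skips multiples of 4,
    # so drafter and target occasionally disagree.
    nxt = tokens[-1] + 1
    return nxt + 1 if nxt % 4 == 0 else nxt

def speculative_decode(prompt, n_new, k=4):
    """Generate n_new tokens: draft k ahead, verify with the target,
    keep the longest agreed prefix plus one token from the target."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1) Drafter proposes k tokens autoregressively.
        draft = list(tokens)
        for _ in range(k):
            draft.append(draft_model(draft))
        proposals = draft[len(tokens):]
        # 2) Target checks each proposal (a real system does this
        # in a single batched forward pass).
        accepted, ctx = [], list(tokens)
        for tok in proposals:
            if target_model(ctx) != tok:
                break
            accepted.append(tok)
            ctx.append(tok)
        # 3) Keep the agreed prefix; the target always contributes one
        # token, so the output equals pure target decoding (losslessness).
        tokens.extend(accepted)
        tokens.append(target_model(tokens))
        tokens = tokens[:len(prompt) + n_new]
    return tokens
```

With greedy acceptance each loop iteration emits between 1 and k+1 tokens for a single target "pass", which is where the speedup of the methods above comes from; the sampling-based papers replace the equality check with an acceptance-rejection test on the two models' distributions.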