igormolybog's Collection: Inference speed
FlashDecoding++: Faster Large Language Model Inference on GPUs (arXiv:2311.01282)
Co-training and Co-distillation for Quality Improvement and Compression of Language Models (arXiv:2311.02849)
Prompt Cache: Modular Attention Reuse for Low-Latency Inference (arXiv:2311.04934)
Exponentially Faster Language Modelling (arXiv:2311.10770)
SparQ Attention: Bandwidth-Efficient LLM Inference (arXiv:2312.04985)
Transformers are Multi-State RNNs (arXiv:2401.06104)
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference (arXiv:2401.08671)
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (arXiv:2401.10774)
BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models (arXiv:2401.12522)
SubGen: Token Generation in Sublinear Time and Memory (arXiv:2402.06082)
Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models (arXiv:2402.07033)
Speculative Streaming: Fast LLM Inference without Auxiliary Models (arXiv:2402.11131)
Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters (arXiv:2406.16758)