
Model Serving - Efficient Inference

Serving

Model Serving Readings:

Splitwise: Efficient Generative LLM Inference Using Phase Splitting (https://arxiv.org/pdf/2311.18677)


DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (https://www.usenix.org/system/files/osdi24-zhong-yinmin.pdf)

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
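
The two FlashAttention papers above compute exact attention one key/value tile at a time using an online softmax, so the full n x n score matrix is never materialized in slow memory. Below is a minimal NumPy sketch of that online-softmax idea only; the function name, block size, and shapes are illustrative, and the real kernels also tile the queries and keep the working set resident in GPU SRAM.

import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Exact softmax attention computed over key/value tiles with a running
    (online) softmax: a row-wise max m and denominator l are carried across
    tiles, so no (n x n) score matrix is ever formed."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)          # running, not-yet-normalized output
    m = np.full(n, -np.inf)         # running row-wise score maximum
    l = np.zeros(n)                 # running softmax denominator
    for s in range(0, K.shape[0], block):
        Kb, Vb = K[s:s + block], V[s:s + block]
        scores = (Q @ Kb.T) * scale              # (n, block) score tile
        m_new = np.maximum(m, scores.max(axis=1))
        corr = np.exp(m - m_new)                 # rescales earlier partials
        p = np.exp(scores - m_new[:, None])
        l = l * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]

# Sanity check against the naive O(n^2)-memory implementation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = Q @ K.T / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
assert np.allclose(tiled_attention(Q, K, V), (P / P.sum(axis=1, keepdims=True)) @ V)

The assertion passes because the correction factor exp(m - m_new) rescales every earlier partial sum whenever a later tile raises the row maximum, so tiling changes the memory traffic, not the result.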

More Readings:

Generative AI on the Edge: Architecture and Performance Evaluation

Efficient Transformers: A Survey

A Survey on Model Compression for Large Language Models

Efficient AI in Practice: Training and Deployment of Efficient LLMs for Industry Applications

Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption
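
The last survey concerns the KV cache, whose footprint grows linearly with sequence length and batch size and can quickly rival the model weights themselves. A back-of-the-envelope sketch of that growth; the helper function and the roughly Llama-2-7B-like configuration (32 layers, 32 KV heads, head dimension 128, fp16) are illustrative assumptions, not taken from the survey:

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    """Per-request KV-cache footprint: two tensors (K and V) per layer, each of
    shape (batch, n_kv_heads, seq_len, head_dim), at dtype_bytes per element."""
    return 2 * n_layers * batch * n_kv_heads * seq_len * head_dim * dtype_bytes

# Roughly Llama-2-7B-shaped config at a 4k-token context, fp16 (2 bytes):
print(kv_cache_bytes(32, 32, 128, seq_len=4096) / 2**30, "GiB")  # -> 2.0 GiB

At 2 GiB per 4k-token request, a handful of concurrent long-context requests exhausts a GPU's memory, which is why the cache-reduction methods the survey reviews matter for serving throughput.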