SGLang

SGLang
Developer: LMSYS
Initial release: January 17, 2024
Written in: Python, Rust, CUDA, C++
Type: Large language model inference engine
License: Apache License 2.0
Website: sglang.io
Repository: github.com/sgl-project/sglang

SGLang (short for Structured Generation Language) is an open-source framework for programming and serving large language models and multimodal models. It was introduced by researchers affiliated with LMSYS[1] and other institutions as a system combining a Python-embedded language for structured generation with a runtime for high-throughput inference.[2][3][4]

The project is designed for low-latency, high-throughput inference workloads. Its documentation describes support for structured outputs, speculative decoding, continuous batching, quantization, and compatibility with OpenAI-style APIs.[5]

History

SGLang was publicly introduced in January 2024 by researchers affiliated with Stanford, UC Berkeley, Texas A&M, and Shanghai Jiao Tong University.[2] Its academic description later appeared in the proceedings of NeurIPS 2024.[3] In January 2026, TechCrunch reported that contributors associated with the project had formed the startup RadixArk to commercialize services around SGLang while continuing its open-source development.[6][7]

Architecture

According to the NeurIPS paper, SGLang consists of two main components: a front-end language embedded in Python and a back-end runtime for executing language model programs efficiently.[3] The front end provides primitives for generation, selection, and parallel control flow, while the runtime uses a set of optimizations intended to reduce repeated computation and improve throughput.[3]
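The three kinds of front-end primitive named above (generation, selection, and parallel control flow) can be illustrated with a toy sketch. This is not SGLang's actual API: `ProgramState`, `stub_model`, and `stub_score` are hypothetical names invented for illustration, and the "model" is a canned function rather than a real language model.

```python
from concurrent.futures import ThreadPoolExecutor


def stub_model(prompt):
    """Hypothetical stand-in for a model call; returns a canned string."""
    return f"<reply to {len(prompt)} chars>"


def stub_score(prompt, choice):
    """Hypothetical scorer; a real system would rank choices by model likelihood."""
    return -len(choice)


class ProgramState:
    """Toy prompt state exposing generation, selection, and forking."""

    def __init__(self, text=""):
        self.text = text
        self.vars = {}

    def gen(self, name):
        # Generation: extend the prompt with model output and store it by name.
        out = stub_model(self.text)
        self.vars[name] = out
        self.text += out
        return out

    def select(self, name, choices):
        # Selection: pick the highest-scoring option from a fixed list.
        best = max(choices, key=lambda c: stub_score(self.text, c))
        self.vars[name] = best
        self.text += best
        return best

    def fork(self, n):
        # Parallel control flow: n independent copies sharing the same prefix.
        return [ProgramState(self.text) for _ in range(n)]


state = ProgramState("Q: Is a radix tree a kind of trie?\n")
branches = state.fork(2)

# Run both branches concurrently; each starts from the same shared prefix.
with ThreadPoolExecutor() as pool:
    answers = list(pool.map(lambda b: b.gen("answer"), branches))

verdict = state.select("verdict", ["yes", "no"])
```

Because every forked branch starts from an identical prefix, a runtime that caches work for shared prefixes can avoid recomputing it for each branch, which is the motivation for pairing such a front end with an optimizing back end.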

Among the techniques described by the project are RadixAttention for reusing key–value cache state across multiple generation calls, compressed finite-state machines for faster constrained decoding, and speculative execution for API-based models.[3] The current documentation also describes support for serving both language models and multimodal models across a range of hardware back ends.[5]
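The idea of reusing cached state across calls that share a prompt prefix can be sketched with a small prefix tree over token IDs. This is a simplified illustration under stated assumptions, not the project's implementation: `PrefixCache` and `Node` are hypothetical names, the tree stores one token per edge rather than compressed runs, and no key–value tensors or eviction policy are modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Children keyed by token ID (one token per edge in this toy version).
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy token-level prefix tree standing in for a KV-cache index."""

    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Record tokens as cached; return how many were newly added."""
        node, added = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                node.children[tok] = Node()
                added += 1
            node = node.children[tok]
        return added


cache = PrefixCache()
system_prompt = [1, 2, 3, 4]           # shared prefix, as arbitrary token IDs
first_call = system_prompt + [10, 11]
second_call = system_prompt + [20]

computed_first = cache.insert(first_call)    # nothing cached yet
reused = cache.match_prefix(second_call)     # shared prefix is found
computed_second = cache.insert(second_call)  # only the new suffix is added
```

In this sketch the second call reuses the four shared prefix tokens and adds only one new token, which mirrors how prefix sharing reduces repeated computation across generation calls.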

See also

References

  1. "LMSYS". GitHub. GitHub, Inc. Retrieved April 22, 2026.
  2. "Fast and Expressive LLM Inference with RadixAttention and SGLang". LMSYS Org. January 17, 2024. Retrieved April 19, 2026.
  3. Zheng, Lianmin; Yin, Liangsheng; Xie, Zhiqiang; Sun, Chuyue; Huang, Jeff; Yu, Cody Hao; Cao, Shiyi; Kozyrakis, Christos; Stoica, Ion; Gonzalez, Joseph E.; Barrett, Clark; Sheng, Ying (2024). SGLang: Efficient Execution of Structured Language Model Programs (PDF). Advances in Neural Information Processing Systems 37. Retrieved April 19, 2026.
  4. "SGLang". UC Berkeley Sky Computing Lab. April 25, 2024. Retrieved April 22, 2026.
  5. "SGLang Documentation". SGLang. Retrieved April 19, 2026.
  6. Hu, Krystal (January 21, 2026). "Sources: Project SGLang spins out as RadixArk with $400M valuation as inference market explodes". TechCrunch. Retrieved April 19, 2026.
  7. R, Vignesh (January 23, 2026). "From Berkeley lab to $400M startup: SGLang becomes RadixArk". TFN. Retrieved April 22, 2026.