Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU
LlamaWeb brings memory-efficient, performance-portable, multi-precision LLM inference to the browser with a WebGPU backend for llama.cpp, reducing memory use and improving decode throughput across diverse devices.
By Reese Levine, Rithik Sharma, Nikhil Jain, Abhijit Ramesh, Zheyuan Chen, Neha Abbas, James Contini, and Tyler Sorensen