Tools

AMD Launches Lemonade: A Fast, Open Source Local LLM Server That Uses Both GPU and NPU

AMD has released Lemonade, an open source local LLM server designed for developers who want fast, private AI inference without cloud dependency. By leveraging both discrete GPUs and the Neural Processing Units (NPUs) built into modern Ryzen chips, Lemonade offers a developer-friendly alternative to Ollama — with OpenAI API compatibility out of the box.

D.O.T.S AI Newsroom

AI News Desk

3 min read

AMD has entered the local LLM server market with Lemonade, an open source inference server that runs large language models locally on the full spectrum of AMD silicon — discrete GPUs, integrated graphics, and the Neural Processing Units (NPUs) built into modern Ryzen AI processors. The project appeared on Hacker News this week and quickly drew significant developer attention, reflecting ongoing demand for high-quality, vendor-agnostic local inference tooling.

What Lemonade Is

Lemonade is a local LLM server that exposes an OpenAI-compatible API. Any application built against the OpenAI API — whether it uses the Python SDK, a REST client, or a framework like LangChain or LlamaIndex — can point at Lemonade and run without modification. The server handles model loading, session management, and inference scheduling across available AMD hardware.
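Because the API surface is OpenAI-compatible, talking to Lemonade is just an ordinary chat-completions POST. A minimal sketch, assuming a local server on the default host — the base URL and model name below are illustrative assumptions, not values confirmed by this article:

```python
import json

# Assumed local endpoint; substitute whatever host/port Lemonade reports on startup.
LEMONADE_BASE_URL = "http://localhost:8000/api/v1"

def chat_request(model: str, prompt: str) -> tuple[str, str]:
    """Build the URL and JSON body for an OpenAI-style chat completion."""
    url = f"{LEMONADE_BASE_URL}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, body

url, body = chat_request("llama-3.2-1b-instruct", "Hello!")
# POST `body` to `url` with any HTTP client — or point the OpenAI SDK's
# `base_url` at LEMONADE_BASE_URL and reuse existing application code unchanged.
```

Anything that already speaks this wire format — the Python SDK, LangChain, LlamaIndex, a plain REST client — needs only the base URL swapped.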

The project's key differentiator is its multi-accelerator support. On machines with AMD discrete GPUs (Radeon RX 7000 series and up), Lemonade uses ROCm for GPU-accelerated inference — bringing it into competitive range with NVIDIA-based solutions for models that fit in VRAM. On AMD Ryzen AI systems (which include integrated NPUs), Lemonade can offload specific layers to the NPU, freeing up GPU resources or enabling inference on machines without discrete GPUs at all.
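The idea of splitting a model's layers between accelerators can be pictured with a toy partition function. This is purely illustrative — not Lemonade's actual scheduler — showing how a fixed fraction of transformer layers might be assigned to the NPU while the rest stay on the GPU:

```python
# Hypothetical sketch of layer offloading; Lemonade's real placement logic
# is not described in this article.
def split_layers(n_layers: int, npu_fraction: float) -> dict[str, list[int]]:
    """Assign the first `npu_fraction` of layers to the NPU, the rest to the GPU."""
    cut = int(n_layers * npu_fraction)
    return {
        "npu": list(range(cut)),            # early layers offloaded to the NPU
        "gpu": list(range(cut, n_layers)),  # remainder runs on the GPU
    }

plan = split_layers(32, 0.25)
print(len(plan["npu"]), len(plan["gpu"]))  # 8 24
```

On a machine with no discrete GPU at all, the same scheme degenerates to `npu_fraction=1.0` — everything runs on the NPU.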

The Local AI Moment

Lemonade's release comes during a period of intense developer interest in local AI inference. Ollama has established itself as the dominant tool in this category, but it is primarily optimized for Apple Silicon and NVIDIA hardware — the two most common configurations in the developer market. AMD users have historically had a worse experience with local inference tooling, relying on less mature ROCm builds or CPU-only inference at reduced speed.

AMD's decision to build and officially support Lemonade represents a strategic shift: treating local inference tooling as a first-party concern rather than leaving it to the community. For developers on AMD hardware — a growing segment as Ryzen AI laptops and workstations proliferate — the existence of a maintained, officially supported local inference server is a meaningful improvement in the development experience.

Technical Architecture and Performance

Lemonade is built for fast setup and broad compatibility. The project's stated goals are local-first execution, broad model support, and minimal friction from install to first inference. The server supports standard GGUF and ONNX model formats, allowing models downloaded from Hugging Face or other repositories to run without conversion.
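A quick way to verify that a downloaded file really is in one of these formats is to check its magic bytes — every GGUF file begins with the ASCII bytes `GGUF`. A minimal sniffing sketch (the synthetic header below is for demonstration; a real model file continues with version, tensor count, and metadata):

```python
import os
import struct
import tempfile

GGUF_MAGIC = b"GGUF"  # four-byte magic at the start of every GGUF file

def looks_like_gguf(path: str) -> bool:
    """Cheap format sniff: read the first four bytes and compare to the magic."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC

# Demo on a synthetic header: magic followed by a little-endian version field.
with tempfile.NamedTemporaryFile(delete=False, suffix=".gguf") as f:
    f.write(GGUF_MAGIC + struct.pack("<I", 3))
    path = f.name

print(looks_like_gguf(path))  # True
os.remove(path)
```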

Performance benchmarks shared by early users suggest competitive throughput compared to Ollama on equivalent AMD hardware, with particular advantages on Ryzen AI systems where NPU offloading reduces memory bandwidth pressure on the GPU. For the growing class of thin-and-light AI laptops built on Ryzen AI 300 series silicon, Lemonade offers a path to practical local inference at battery-efficient power levels.

Developer Ecosystem Implications

The OpenAI API compatibility layer means Lemonade can slot into existing development workflows without code changes. Developers using Claude Code, Cursor, or other AI-assisted development tools that support custom inference endpoints can redirect those tools to Lemonade for fully local, fully private AI assistance. For security-conscious development environments — financial services, healthcare, government contractors — that offline capability carries compliance value beyond raw performance.
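In practice, redirecting an OpenAI-SDK-based tool often comes down to two environment variables. A hedged sketch — the URL and key below are illustrative assumptions, and individual tools may use their own per-tool settings instead:

```python
import os

# The openai-python SDK (v1+) reads these on client construction.
os.environ["OPENAI_BASE_URL"] = "http://localhost:8000/api/v1"  # assumed Lemonade endpoint
os.environ["OPENAI_API_KEY"] = "lemonade"  # local servers typically accept any placeholder key

# With those set, unmodified application code like:
#   from openai import OpenAI
#   client = OpenAI()  # picks up both variables from the environment
#   client.chat.completions.create(model=..., messages=[...])
# sends its requests to the local server instead of api.openai.com.
print(os.environ["OPENAI_BASE_URL"])
```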

Related Stories

Astropad's Workbench Turns a Mac Mini Into an AI Agent Server You Control From Your Phone
Tools

Astropad, the company behind the Luna Display hardware that lets iPads function as Mac monitors, has built a new product for a new era: Workbench lets users remotely monitor and control AI agents running on Mac Minis from an iPhone or iPad. It is remote desktop software reimagined not for IT support but for the AI agent operator — the person who needs to check on autonomous workflows without being at their desk.

D.O.T.S AI Newsroom
Microsoft's Bing Team Open-Sources Harrier, a Multilingual Embedding Model That Tops the MTEB v2 Benchmark
Tools

Microsoft's Bing search team has released Harrier as an open-source embedding model, and it tops the multilingual MTEB v2 benchmark while supporting over 100 languages. The release is significant not just for the benchmark numbers but for the source: a search team that has spent decades optimizing retrieval systems has built an embedding model for the exact use case — semantic search and retrieval — that underpins most production RAG applications.

D.O.T.S AI Newsroom
Stability AI Pivots to Enterprise With Brand Studio — a Platform for Brand-Consistent AI Image Generation
Tools

Stability AI, the company that made open-source image generation mainstream with Stable Diffusion, is repositioning for enterprise with Brand Studio. The platform lets creative teams train brand-specific image models, automate visual production workflows, and route tasks to the best-suited AI model — a commercial play from a company that built its name on open access.

D.O.T.S AI Newsroom