The rise of large language models (LLMs) has significantly changed how developers interact with code editors, documentation, and automation pipelines. Models like OpenAI's GPT-4, Anthropic's Claude, Mistral, and LLaMA, along with open-source options such as CodeLlama and DeepSeek, have made it possible to offload everything from documentation generation to live code refactoring. However, most developer environments are optimized for interacting with a single model. This is a significant limitation for workflows that could benefit from dynamically routing prompts across multiple LLMs based on task specificity, latency sensitivity, cost constraints, and domain performance.
This blog focuses on adapting Visual Studio Code (VSCode) for multi-LLM usage: specifically, how developers can route prompts across models intelligently using a plugin-driven architecture, LLM capability classification, and runtime prompt orchestration. This is not just a productivity enhancement; it is an architectural decision that directly impacts inference efficiency, response relevance, and ultimately developer throughput.
Each LLM has strengths and weaknesses depending on prompt types. GPT-4, for instance, excels in logical reasoning and has broad general-purpose utility but is slower and more expensive. Claude 3 Opus is faster in conversation-style reasoning and summarization, whereas smaller models like Mixtral or Phi-3 are ideal for low-latency code completions or local inference.
Routing all prompts to a single LLM results in under-utilization of these unique advantages. Developers can leverage specialized models to gain faster responses, higher relevance in niche domains, or lower operational costs. This calls for dynamic routing of prompts, instead of binding the editor to a single static model endpoint.
Prompt routing enables fine-tuned control over how inference is delegated. If a prompt is under a token threshold and falls into a known “low-complexity” classification, it can be routed to a cheaper model like GPT-3.5 or even a local model running on a GPU through Ollama. In contrast, complex architectural design prompts or secure code audits can be routed to more powerful models with deeper instruction-following abilities.
This orchestration enables better control over resource usage, response latency, and runtime predictability in developer tools integrated inside VSCode.
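As a rough sketch of that idea, the routing decision can be expressed as a small function over an estimated token count and a complexity label. The threshold, the token heuristic, and the model names below are illustrative assumptions, not a prescribed configuration.

```typescript
// Illustrative routing sketch: thresholds, labels, and model names are assumptions.
type Complexity = "low" | "high";

interface RouteDecision {
  model: string; // e.g. "gpt-3.5-turbo", "ollama/codellama", "gpt-4"
  reason: string;
}

// Crude token estimate (~4 characters per token for English text and code).
function estimateTokens(prompt: string): number {
  return Math.ceil(prompt.length / 4);
}

function routePrompt(prompt: string, complexity: Complexity): RouteDecision {
  const tokens = estimateTokens(prompt);
  if (complexity === "low" && tokens < 512) {
    // Cheap path: small prompts with a known low-complexity classification.
    return { model: "ollama/codellama", reason: "low-complexity, under token threshold" };
  }
  // Default path: architectural design, audits, or anything unclassified.
  return { model: "gpt-4", reason: "high-complexity or large prompt" };
}
```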
VSCode is extensible via its plugin (extension) API, which exposes command registration, UI modification points, and LSP-based hooks. For multi-LLM routing, the core idea is to build or extend an existing AI plugin to handle prompt capture, prompt classification, model selection, and dispatch to the chosen provider.
This architecture ensures that, from the developer's perspective, the prompt interface stays uniform, while the backend dynamically switches the target model based on task-specific signals.
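As a minimal sketch of this shape, the extension can register a single command that captures the prompt, asks the router for a target, and dispatches through a provider adapter. The command ID multiLLM.ask and the classifyPrompt / dispatchToModel helpers are illustrative names, not part of any existing extension.

```typescript
import * as vscode from "vscode";

// Sketch of a single, model-agnostic entry point for the extension.
export function activate(context: vscode.ExtensionContext) {
  const disposable = vscode.commands.registerCommand("multiLLM.ask", async () => {
    const prompt = await vscode.window.showInputBox({ prompt: "Ask the model router" });
    if (!prompt) {
      return;
    }
    const target = classifyPrompt(prompt);                 // routing decision (see classifier section)
    const answer = await dispatchToModel(target, prompt);  // provider adapter call
    vscode.window.showInformationMessage(`[${target}] ${answer}`);
  });
  context.subscriptions.push(disposable);
}

// Placeholder implementations so the sketch compiles.
function classifyPrompt(prompt: string): string {
  return prompt.length < 2000 ? "gpt-3.5-turbo" : "gpt-4";
}
async function dispatchToModel(model: string, prompt: string): Promise<string> {
  return `response from ${model}`; // real dispatch lives in the provider adapters
}
```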
Maintain a JSON or YAML-based registry of available models, which includes fields such as the provider, API endpoint, context window, relative cost, typical latency, capability tags, and whether the model runs locally or in the cloud.
This acts as the metadata layer for enabling programmatic decisions during routing.
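One possible shape for that registry, expressed here as a TypeScript type with a couple of sample entries; the field names and values are assumptions and would be tuned per team.

```typescript
// Illustrative registry schema; field names and values are assumptions.
interface ModelEntry {
  id: string;               // stable identifier used by the router
  provider: "openai" | "anthropic" | "ollama";
  endpoint: string;         // base URL of the inference API
  contextWindow: number;    // max tokens the model accepts
  costPer1kTokens: number;  // rough relative cost, used for budget-aware routing
  typicalLatencyMs: number;
  capabilities: string[];   // e.g. ["code", "reasoning", "summarization"]
  local: boolean;           // true for on-device / self-hosted models
}

const modelRegistry: ModelEntry[] = [
  {
    id: "gpt-4",
    provider: "openai",
    endpoint: "https://api.openai.com/v1",
    contextWindow: 128000,
    costPer1kTokens: 0.03,
    typicalLatencyMs: 4000,
    capabilities: ["reasoning", "code", "audit"],
    local: false,
  },
  {
    id: "codellama-13b",
    provider: "ollama",
    endpoint: "http://localhost:11434",
    contextWindow: 16000,
    costPer1kTokens: 0,
    typicalLatencyMs: 800,
    capabilities: ["completion", "code"],
    local: true,
  },
];
```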
Each LLM provider will have a corresponding adapter class that standardizes the request and response format. This includes headers, auth tokens, response normalization, and retry logic. This layer isolates VSCode routing logic from vendor-specific APIs and helps avoid vendor lock-in.
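A simplified sketch of what such an adapter might look like for an OpenAI-compatible chat endpoint, assuming a runtime with a global fetch (Node 18+ or the VSCode extension host). Retry logic and streaming are omitted for brevity.

```typescript
// Provider adapter sketch; request/response shapes are simplified assumptions.
interface CompletionRequest {
  prompt: string;
  maxTokens?: number;
}

interface CompletionResponse {
  text: string;
  model: string;
  latencyMs: number;
}

interface LLMProvider {
  readonly id: string;
  complete(req: CompletionRequest): Promise<CompletionResponse>;
}

// Example adapter for an OpenAI-compatible chat completions endpoint.
class OpenAIProvider implements LLMProvider {
  constructor(
    readonly id: string,
    private apiKey: string,
    private baseUrl = "https://api.openai.com/v1",
  ) {}

  async complete(req: CompletionRequest): Promise<CompletionResponse> {
    const start = Date.now();
    const res = await fetch(`${this.baseUrl}/chat/completions`, {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${this.apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: this.id,
        messages: [{ role: "user", content: req.prompt }],
        max_tokens: req.maxTokens ?? 512,
      }),
    });
    if (!res.ok) {
      throw new Error(`OpenAI request failed: ${res.status}`);
    }
    const data = await res.json();
    // Normalize the vendor-specific response into the shared shape.
    return { text: data.choices[0].message.content, model: this.id, latencyMs: Date.now() - start };
  }
}
```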
A core part of routing logic is the classifier, which evaluates the incoming prompt and determines the appropriate LLM target. This can be implemented in multiple ways:
For lightweight systems, use a rule engine such as: if the prompt contains `async` or `def`, route to GPT-4. This logic is transparent, deterministic, and can be adjusted by developers on the fly.
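A minimal version of such a rule engine might look like the following; the keyword patterns and target models are assumptions and would normally live in user-editable settings.

```typescript
// Minimal keyword-based rule engine; keywords and targets are illustrative.
interface RoutingRule {
  name: string;
  matches: (prompt: string) => boolean;
  target: string; // model id from the registry
}

const rules: RoutingRule[] = [
  {
    name: "code-definition prompts",
    matches: (p) => /\b(async|def|class|function)\b/.test(p),
    target: "gpt-4",
  },
  {
    name: "short completion prompts",
    matches: (p) => p.length < 400,
    target: "codellama-13b",
  },
];

function classifyByRules(prompt: string, fallback = "gpt-3.5-turbo"): string {
  const rule = rules.find((r) => r.matches(prompt));
  return rule ? rule.target : fallback;
}
```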
For more advanced routing, compute the embedding vector of the prompt and match it to a set of pretrained task embeddings (e.g., `doc-gen`, `bugfix`, `refactor`, `qa`). This allows for adaptive routing that improves over time based on prompt patterns.
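A sketch of that matching step, assuming an embed() helper backed by any embedding model and a set of precomputed task centroids; the task-to-model mapping is illustrative.

```typescript
// Embedding-similarity routing sketch; embed() and the centroids are assumptions.
type TaskLabel = "doc-gen" | "bugfix" | "refactor" | "qa";

declare function embed(text: string): Promise<number[]>;          // any embedding model
declare const taskCentroids: Record<TaskLabel, number[]>;         // precomputed per-task averages

const taskToModel: Record<TaskLabel, string> = {
  "doc-gen": "gpt-3.5-turbo",
  "bugfix": "gpt-4",
  "refactor": "gpt-4",
  "qa": "codellama-13b",
};

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function classifyByEmbedding(prompt: string): Promise<string> {
  const v = await embed(prompt);
  let best: TaskLabel = "qa";
  let bestScore = -Infinity;
  for (const label of Object.keys(taskCentroids) as TaskLabel[]) {
    const score = cosine(v, taskCentroids[label]);
    if (score > bestScore) {
      bestScore = score;
      best = label;
    }
  }
  return taskToModel[best];
}
```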
Once the routing decision is made, the plugin dispatches the prompt to the corresponding model via the provider interface. Each call can be logged and version-tagged for auditing and reproducibility.
Key aspects include logging every call, tagging it with the model and version that served it, and capturing enough metadata to reproduce the request later.
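A possible dispatch wrapper, reusing the LLMProvider interface from the adapter sketch above; the audit record fields and the decision to hash rather than store raw prompts are assumptions.

```typescript
import { createHash } from "crypto";

// Dispatch sketch with audit logging; reuses the LLMProvider interface sketched earlier.
interface PromptAudit {
  timestamp: string;
  model: string;
  routingReason: string;
  promptHash: string;   // hash instead of raw prompt in case it contains secrets
  latencyMs: number;
}

async function dispatch(
  prompt: string,
  provider: LLMProvider,
  reason: string,
  log: (entry: PromptAudit) => void,
): Promise<string> {
  const start = Date.now();
  const res = await provider.complete({ prompt });
  log({
    timestamp: new Date().toISOString(),
    model: provider.id,
    routingReason: reason,
    promptHash: createHash("sha256").update(prompt).digest("hex"),
    latencyMs: Date.now() - start,
  });
  return res.text;
}
```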
Routing should include fallback models in case of primary model failure or timeout. For instance, if a GPT-4 request times out, the same prompt can be retried against Claude or a local model before surfacing an error to the developer.
This ensures workflow continuity even in high-latency or constrained environments.
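One way to express such a fallback chain, again building on the provider interface sketched earlier; the ordering and the 15-second timeout are arbitrary choices.

```typescript
// Fallback chain sketch; the provider ordering and timeout value are assumptions.
async function completeWithFallback(
  prompt: string,
  providers: LLMProvider[],
  timeoutMs = 15000,
): Promise<CompletionResponse> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      // Race the provider call against a timeout so a slow primary doesn't block the chain.
      return await Promise.race([
        provider.complete({ prompt }),
        new Promise<never>((_, reject) =>
          setTimeout(() => reject(new Error(`${provider.id} timed out`)), timeoutMs)),
      ]);
    } catch (err) {
      lastError = err; // fall through to the next provider in the chain
    }
  }
  throw new Error(`All providers failed: ${String(lastError)}`);
}
```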
Provide UI affordances within the VSCode extension, such as a status bar indicator showing which model handled the last prompt, a quick-pick to override the routing decision, and a setting to pin a preferred model per workspace.
These help the developer maintain full control over how prompts are executed without overwhelming them with configuration.
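A small sketch of two such affordances using the standard VSCode API: a status bar item that shows the active route and a quick-pick command to override it. The command ID and model list are illustrative.

```typescript
import * as vscode from "vscode";

// UI affordance sketch: status bar route indicator plus a quick-pick override.
export function registerRoutingUi(context: vscode.ExtensionContext) {
  const status = vscode.window.createStatusBarItem(vscode.StatusBarAlignment.Right, 100);
  status.text = "LLM: auto";
  status.command = "multiLLM.pickModel";
  status.show();

  const pick = vscode.commands.registerCommand("multiLLM.pickModel", async () => {
    const choice = await vscode.window.showQuickPick(
      ["auto (routed)", "gpt-4", "gpt-3.5-turbo", "codellama-13b"],
      { placeHolder: "Override the model for subsequent prompts" },
    );
    if (choice) {
      status.text = `LLM: ${choice}`;
      // A real extension would persist this override and consult it in the router.
    }
  });

  context.subscriptions.push(status, pick);
}
```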
Every routed prompt should be logged with a timestamp, the selected model and version, the routing rationale, observed latency, and token usage.
This allows developers to debug inference issues, audit LLM behavior, and benchmark prompt performance across models.
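One lightweight way to surface these logs inside the editor is a dedicated output channel; this sketch reuses the hypothetical PromptAudit record from the dispatch sketch above.

```typescript
import * as vscode from "vscode";

// Output channel sketch so routing audits can be inspected without leaving the editor.
const channel = vscode.window.createOutputChannel("Multi-LLM Router");

function logAudit(entry: PromptAudit) {
  channel.appendLine(JSON.stringify(entry)); // one JSON line per routed prompt
}
```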
Many developers today run local LLMs using tools like Ollama, LM Studio, or private inference servers with GGUF models. The plugin should support a hybrid registry, where local and cloud models are both treated as first-class citizens.
This strategy gives fine-grained control over security, cost, and latency without compromising capability.
Local models should be aware of GPU availability, RAM limits, and concurrent thread capacity. The routing logic must expose a health check interface that evaluates model readiness before dispatching a prompt to it.
For instance:
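A generic readiness probe can ping a configurable health endpoint with a short timeout before a local model is considered routable. The URL and timeout below are assumptions; for an Ollama-style server the base URL would typically be http://localhost:11434.

```typescript
// Readiness probe for a local model endpoint; healthUrl and the 2s timeout are assumptions.
async function isModelReady(healthUrl: string, timeoutMs = 2000): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(healthUrl, { signal: controller.signal });
    return res.ok; // dispatch to this model only if its endpoint answers quickly
  } catch {
    return false;  // unreachable or too slow: route to a cloud fallback instead
  } finally {
    clearTimeout(timer);
  }
}
```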
Route lint-level prompts to lightweight models, whereas architectural critique or refactoring suggestions can go to GPT-4 or Claude Opus.
Use models like Gemini or Claude for high-fidelity translation, but fall back to open-source Whisper models when offline.
Classify logs by verbosity, then route verbose ones to summarization-heavy models and structured logs to simpler classification engines.
For models with smaller context windows, include a compression step using distillation models or prompt optimizers before dispatching.
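A sketch of that compression step, reusing the ModelEntry and LLMProvider types from the earlier sketches; the token heuristic, the 80% headroom rule, and the summarization prompt are all assumptions.

```typescript
// Context-compression sketch: if a prompt exceeds the target model's window,
// summarize it first with a smaller "distillation" model.
async function fitToContext(
  prompt: string,
  target: ModelEntry,
  summarizer: LLMProvider,
): Promise<string> {
  const estimatedTokens = Math.ceil(prompt.length / 4); // rough heuristic
  if (estimatedTokens <= target.contextWindow * 0.8) {
    return prompt; // leave headroom for the model's response
  }
  const compressed = await summarizer.complete({
    prompt: `Summarize the following context, preserving identifiers and error messages:\n\n${prompt}`,
    maxTokens: Math.floor(target.contextWindow * 0.5),
  });
  return compressed.text;
}
```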
Adapting VSCode for multi-LLM usage is more than just a plugin enhancement. It represents a paradigm shift in how developers think about tooling. In the same way microservices decoupled monoliths, multi-LLM routing decouples prompt intent from backend execution, giving developers full control over performance, cost, and capability.
In future iterations, we expect routing engines to become learning agents, which will use feedback loops to auto-optimize routing decisions based on previous success metrics. Furthermore, developer-centric LLM orchestration frameworks like GoCodeo or LangChain will integrate natively with editors like VSCode to give a seamless full-stack development + inference experience.