ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU
ModeSwitch-LLM is a lightweight controller for single-GPU LLM inference that operates at the request boundary: it routes each request to one of a fixed set of inference modes, including FP16, GPTQ 4-bit, INT8 quantization, speculative decoding, prefix caching, continuous batching, and hybrid configurations. It is evaluated on Meta-Llama-3.1-8B-Instruct with vLLM on a single NVIDIA A100 GPU.
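
As a rough illustration of what routing at the request boundary could look like, the sketch below picks one mode per request from a few request-level signals. It is not the ModeSwitch-LLM policy: the `Request` fields, the `choose_mode` function, the rule order, and the queue-depth threshold of 8 are all hypothetical, and hybrid configurations are left out for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    """Fixed inference modes named in the summary (hybrids omitted)."""
    FP16 = "fp16"
    GPTQ_4BIT = "gptq-4bit"
    INT8 = "int8"
    SPECULATIVE = "speculative-decoding"
    PREFIX_CACHING = "prefix-caching"
    CONTINUOUS_BATCHING = "continuous-batching"


@dataclass(frozen=True)
class Request:
    """Hypothetical request-level signals; not taken from the source."""
    prompt_tokens: int
    max_new_tokens: int
    shares_prefix: bool = False      # prompt reuses a previously seen prefix
    queue_depth: int = 0             # requests currently waiting
    latency_sensitive: bool = False


def choose_mode(req: Request, busy_queue: int = 8) -> Mode:
    """Pick exactly one mode for a request before it starts running.

    The rules and the busy_queue threshold are illustrative assumptions.
    """
    if req.shares_prefix:
        # A reused prefix can skip repeated prefill work.
        return Mode.PREFIX_CACHING
    if req.queue_depth >= busy_queue:
        # Under load, favor throughput across requests.
        return Mode.CONTINUOUS_BATCHING
    if req.latency_sensitive and req.max_new_tokens > req.prompt_tokens:
        # Decode-heavy and latency-bound: try to speed up decoding.
        return Mode.SPECULATIVE
    # Default to the unquantized baseline.
    return Mode.FP16


if __name__ == "__main__":
    for r in (
        Request(prompt_tokens=2000, max_new_tokens=64, shares_prefix=True),
        Request(prompt_tokens=128, max_new_tokens=512, latency_sensitive=True),
        Request(prompt_tokens=128, max_new_tokens=128, queue_depth=12),
    ):
        print(r, "->", choose_mode(r).value)
```

Because the decision is made once per request, the mode stays fixed for the lifetime of that request; a real controller would presumably replace these hand-written rules with thresholds derived from measured latency and throughput.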