AI & LLM Engineering for .NET Architects

Running AI Locally with ONNX and LocalLLM

Updated 5/4/2026

Local AI in .NET

You don't need a REST API to run AI. With ONNX Runtime and Microsoft.Extensions.AI, you can run models directly inside your C# process.

1. What is ONNX?

ONNX (Open Neural Network Exchange) is an open, portable format for AI models. It lets you take a model trained in Python/PyTorch and run it in a C# app with near-native performance. ONNX Runtime executes these models and can target hardware accelerators through execution providers such as **CUDA** (NVIDIA GPUs) or **DirectML** (Windows) for much faster inference.
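To make that concrete, here is a minimal sketch of running an ONNX model from C# with the Microsoft.ML.OnnxRuntime NuGet package. The model path, input shape, and the tensor name "input" are placeholders, not part of any specific model; check your model's real input/output names (for example with a tool like Netron) before running this.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

// Load the model into an in-process inference session.
using var session = new InferenceSession("model.onnx");

// Build a dummy input tensor; shape and name depend on your model.
var input = new DenseTensor<float>(new[] { 1, 3, 224, 224 });
var inputs = new List<NamedOnnxValue>
{
    NamedOnnxValue.CreateFromTensor("input", input)
};

// Run inference and read the first output tensor.
using var results = session.Run(inputs);
var output = results.First().AsTensor<float>();
Console.WriteLine($"Output length: {output.Length}");
```

To use a GPU instead of the CPU, pass a `SessionOptions` configured with the CUDA or DirectML execution provider when constructing the `InferenceSession`.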

2. Microsoft.ML.OnnxRuntime.GenAI

This is the new "One-liner" library for running LLMs in C#.

using var model = new Model("phi-3-mini-onnx");          // path to the ONNX model folder
using var tokenizer = new Tokenizer(model);
using var generatorParams = new GeneratorParams(model);  // Generator takes params, not the tokenizer
using var generator = new Generator(model, generatorParams);
// Generate text locally!
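A fuller sketch of the generation loop looks like this. It assumes the Microsoft.ML.OnnxRuntime.GenAI package and a Phi-3-mini ONNX folder on disk ("phi-3-mini-onnx" is a placeholder path); note that some method names have shifted between preview versions of this package, so verify against the version you install.

```csharp
using System;
using Microsoft.ML.OnnxRuntime.GenAI;

using var model = new Model("phi-3-mini-onnx");
using var tokenizer = new Tokenizer(model);

// Phi-3 uses a chat template with special tokens around the prompt.
var prompt = "<|user|>\nExplain ONNX in one sentence.<|end|>\n<|assistant|>";
var sequences = tokenizer.Encode(prompt);

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 200);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences);

// Stream tokens one at a time until the model stops or max_length is hit.
using var tokenizerStream = tokenizer.CreateStream();
while (!generator.IsDone())
{
    generator.GenerateNextToken();
    var newToken = generator.GetSequence(0)[^1];
    Console.Write(tokenizerStream.Decode(newToken));
}
```

The whole model runs inside your process; there is no HTTP call, no API key, and no data leaving the machine.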

3. Interview Mastery

Q: "What are the hardware requirements for running a local LLM?"

Architect Answer: "The most important factor is **VRAM (video RAM)** on the GPU. A quantized 7B-parameter model needs about 5-6GB of VRAM. If the model fits entirely in VRAM, inference is fast; if it overflows into system RAM, generation can become roughly 10x slower. For a professional AI workstation, we recommend at least 16GB of VRAM (RTX 4080 or better) to run modern SLMs comfortably."
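The 5-6GB figure falls out of simple arithmetic. A rough rule of thumb (an approximation, not a guarantee): weights take parameters × bits-per-weight / 8 bytes, plus roughly 20-30% on top for the KV cache and runtime overhead. A hypothetical helper makes the estimate explicit:

```csharp
using System;

// Back-of-the-envelope VRAM estimate for a quantized model.
// The 25% overhead factor is an assumption covering KV cache and runtime buffers.
static double EstimateVramGb(double paramsBillions, double bitsPerWeight, double overhead = 1.25)
{
    double weightsGb = paramsBillions * bitsPerWeight / 8.0; // 1B params at 8 bits ~= 1 GB
    return weightsGb * overhead;
}

Console.WriteLine(EstimateVramGb(7, 4));   // 7B model, 4-bit quantized: ~4-5 GB
Console.WriteLine(EstimateVramGb(70, 4));  // 70B model, 4-bit: ~40+ GB, needs offloading
```

This is also why quantization matters so much: dropping from 16-bit to 4-bit weights cuts the memory footprint by roughly 4x, which is the difference between a model fitting in consumer VRAM or not.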

AI & LLM Engineering for .NET Architects
1. AI Foundations & Prompt Engineering
The LLM Landscape: Transformers, Attention, and Tokens
Advanced Prompt Engineering: Few-shot, Chain-of-Thought, and ReAct
Prompt Versioning & Management in Production
LLM Cost Estimation: Token accounting and budget strategies
2. Semantic Kernel & Integration
Introduction to Microsoft Semantic Kernel (SK)
Skills & Plugins: Extending the LLM with native C# functions
Planner & Orchestration: Automating complex multi-step AI tasks
Connectors: Switching between OpenAI, Azure OpenAI, and HuggingFace
3. Vector Databases & RAG
The RAG Pattern: Solving the 'Static Knowledge' problem
Embeddings Deep Dive: Converting text to math
Vector DBs: Azure AI Search vs Pinecone vs Milvus
Hybrid Search: Combining Keyword and Semantic search for accuracy
4. Advanced RAG Techniques
Document Chunking Strategies: Overlap, Sliding Window, and Semantic splitting
Recursive Document Processing for massive knowledge bases
Context Window Management: Summarization vs Truncation
Citations & Grounding: Ensuring the AI doesn't hallucinate
5. AI Safety & Guardrails
Content Moderation: Azure AI Content Safety integration
Prompt Injection: Defending against adversarial attacks
Toxicity & Bias: Evaluating and mitigating model behavior
Self-Correction Patterns: Letting the AI check its own work
6. Small Language Models (SLMs) & Local AI
The rise of SLMs: Phi-3, Llama-3-8B, and Mistral
Running AI Locally with ONNX and LocalLLM
Quantization: Running 70B models on 16GB RAM
Edge AI: Deploying models to local devices and private clouds
7. Multimodal & Agentic AI
Multimodal AI: Processing Images, PDFs, and Audio in C#
Agentic Workflows: Multi-agent collaboration with AutoGen
Function Calling: Letting the LLM use your SQL and API tools
Memory Management: Ephemeral vs Long-term Semantic memory
8. FAANG AI Engineer Interview
Case Study: Designing a Global Enterprise AI Knowledge Assistant
Case Study: Building an Autonomous AI Agent for Software Dev