Order Accuracy System Architecture

System Overview

The Order Accuracy platform is an enterprise AI vision system designed for real-time order validation in quick-service restaurant (QSR) environments. The system uses Vision Language Models (VLM) to analyze images or video feeds, automatically identifying items and validating them against order data.

Key Features

VLM-Powered Detection: Uses Qwen2.5-VL-7B for accurate item identification
Intel Hardware Optimization: Optimized for Intel CPUs and GPUs via OpenVINO
Dual Application Support: Dine-In (image-based) and Take-Away (video stream-based)
Semantic Matching: Fuzzy matching for item name variations
Real-time Processing: Sub-15-second validation for operational efficiency
Containerized Deployment: Docker-based deployment with microservices architecture

Architecture Diagrams

Platform Architecture

graph TB
    subgraph "Order Accuracy Platform"
        subgraph "Dine-In Application"
            DUI[Gradio UI :7861]
            DAPI[FastAPI API :8083]
            DVLM[VLM Client]
            DSEM[Semantic Client]
        end

        subgraph "Take-Away Application"
            TUI[Gradio UI :7860]
            TAPI[FastAPI API :8080]
            TSW[Station Workers]
            TVS[VLM Scheduler]
            TFS[Frame Selector]
        end

        subgraph "Shared Services"
            OVMS[OVMS VLM
Qwen2.5-VL-7B]
            SEM[Semantic Service]
            MINIO[MinIO Storage]
        end
    end

    DUI --> DAPI
    DAPI --> DVLM
    DAPI --> DSEM
    DVLM --> OVMS
    DSEM --> SEM

    TUI --> TAPI
    TAPI --> TSW
    TSW --> TFS
    TSW --> TVS
    TVS --> OVMS
    TFS --> MINIO

Dine-In Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                           DINE-IN ORDER ACCURACY                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐      ┌──────────────────┐      ┌─────────────────────┐    │
│  │             │      │                  │      │                     │    │
│  │  Gradio UI  │─────▶│   FastAPI API    │─────▶│   Validation        │    │
│  │  (Port 7861)│      │   (Port 8083)    │      │   Service           │    │
│  │             │      │                  │      │                     │    │
│  └─────────────┘      └────────┬─────────┘      └──────────┬──────────┘    │
│                                │                           │               │
│                    ┌───────────┴───────────┐               │               │
│                    │                       │               │               │
│                    ▼                       ▼               ▼               │
│           ┌────────────────┐     ┌─────────────────┐ ┌───────────────┐    │
│           │  VLM Client    │     │ Semantic Client │ │ Metrics       │    │
│           │  (Circuit      │     │ (Circuit        │ │ Collector     │    │
│           │   Breaker)     │     │  Breaker)       │ │               │    │
│           └───────┬────────┘     └────────┬────────┘ └───────────────┘    │
│                   │                       │                               │
└───────────────────┼───────────────────────┼───────────────────────────────┘
                    │                       │
                    ▼                       ▼
          ┌─────────────────┐     ┌─────────────────┐
          │   OVMS VLM      │     │   Semantic      │
          │   (Qwen2.5-VL)  │     │   Service       │
          └─────────────────┘     └─────────────────┘

Take-Away Architecture

┌──────────────────────────────────────────────────────────────────────────────────────┐
│                            TAKE-AWAY ORDER ACCURACY SYSTEM                           │
│                                                                                      │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐                  │
│  │    RTSP Video   │    │  Frame Selector │    │      Order      │                  │
│  │     Streams     │───▶│     Service     │───▶│    Accuracy     │                  │
│  │   (GStreamer)   │    │     (YOLO)      │    │    Service      │                  │
│  └─────────────────┘    └─────────────────┘    └────────┬────────┘                  │
│                                                          │                           │
│         ┌────────────────────────────────────────────────┤                           │
│         │                                                │                           │
│         ▼                                                ▼                           │
│  ┌─────────────────┐                          ┌─────────────────┐                   │
│  │   VLM Scheduler │                          │    Validation   │                   │
│  │   (Batcher)     │                          │      Agent      │                   │
│  └────────┬────────┘                          └────────┬────────┘                   │
│           │                                            │                             │
│           ▼                                            ▼                             │
│  ┌─────────────────┐                          ┌─────────────────┐                   │
│  │    OVMS VLM     │                          │    Semantic     │                   │
│  │  (Qwen2.5-VL)   │                          │    Service      │                   │
│  └─────────────────┘                          └─────────────────┘                   │
│                                                                                      │
└──────────────────────────────────────────────────────────────────────────────────────┘

Component Details

Core Components

1. VLM Backend (OVMS)

OpenVINO Model Server hosting Qwen2.5-VL-7B for vision-language inference.

Features: - OpenAI-compatible API (/v3/chat/completions) - INT8 quantization for optimized performance - GPU acceleration via Intel/NVIDIA hardware - Shared model instance for both applications

API Usage:

response = requests.post(
    f"{OVMS_ENDPOINT}/v3/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}}
                ]
            }
        ]
    }
)

2. Semantic Comparison Service

AI-powered semantic matching microservice for intelligent item comparison.

Matching Strategies: - Exact: Direct string comparison - Semantic: Vector similarity using sentence-transformers - Hybrid: Exact first, then semantic fallback

Example Matches: - "Big Mac" ↔ "Maharaja Mac" (regional name variant) - "green apple" ↔ "apple" (partial match) - "large fries" ↔ "french fries large" (word reordering)

3. Frame Selector Service (Take-Away)

YOLO-based intelligent frame selection for optimal VLM input.

Process: 1. Receive raw video frames from GStreamer pipeline 2. Run YOLO object detection on each frame 3. Score frames by item visibility and clarity 4. Select top K frames per order 5. Store selected frames in MinIO

4. VLM Scheduler (Take-Away)

Request batching scheduler optimizing OVMS throughput.

Batching Strategy: - Time Window: 50-100ms collection period - Max Batch Size: Configurable (default: 16) - Fair Scheduling: Round-robin across workers - Response Routing: Match responses to original requesters

Docker Services

Dine-In Services

Container	Ports	Description
`dinein_app`	7861, 8083	Main application (Gradio + FastAPI)
`dinein_ovms_vlm`	8002	Vision-Language Model server
`dinein_semantic_service`	8081	Semantic text matching
`metrics-collector`	8084	System metrics aggregation

Take-Away Services

Container	Ports	Description
`takeaway_app`	7860, 8080	Main application (Gradio + FastAPI)
`ovms-vlm`	8001	Vision-Language Model server
`frame-selector`	8085	YOLO-based frame selection
`semantic-service`	8081	Semantic text matching
`minio`	9000, 9001	S3-compatible storage
`rtsp-streamer`	8554	RTSP stream simulator (testing)

Data Flow

Dine-In Validation Pipeline

┌─────────────────────────────────────────────────────────────────────┐
│                        VALIDATION PIPELINE                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. IMAGE PREPROCESSING                                            │
│     Raw Image → Auto-Orient → Resize (672px) → Enhance →           │
│     Sharpen → JPEG Compress (82%) → Base64 Encode                  │
│                              │                                      │
│                              ▼                                      │
│  2. VLM INFERENCE                                                  │
│     Prompt: "Analyze this food plate image..."                     │
│     + Inventory list for context                                   │
│     → OVMS POST /v3/chat/completions                               │
│     → Parse JSON response for detected items                       │
│                              │                                      │
│                              ▼                                      │
│  3. SEMANTIC MATCHING                                              │
│     For each expected item:                                        │
│       Find best match in detected items (similarity > 0.7)         │
│       Track: matched, missing, extra, quantity mismatches          │
│                              │                                      │
│                              ▼                                      │
│  4. RESULT AGGREGATION                                             │
│     {                                                              │
│       "order_complete": true/false,                                │
│       "accuracy_score": 0.0-1.0,                                   │
│       "missing_items": [...],                                      │
│       "extra_items": [...]                                         │
│     }                                                              │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Take-Away Processing Pipeline

┌────────────────────────────────────────────────────────────────────────────┐
│                         DATA FLOW PIPELINE                                  │
│                                                                             │
│  1. VIDEO CAPTURE                                                          │
│     RTSP Camera ──▶ GStreamer Pipeline ──▶ Frame Buffer                   │
│                                                                             │
│  2. FRAME SELECTION                                                        │
│     Frame Selector (YOLO):                                                 │
│     • Object detection on raw frames                                       │
│     • Score frames by item visibility                                      │
│     • Select top K frames per order                                        │
│     • Store selected frames in MinIO                                       │
│                                                                             │
│  3. VLM PROCESSING                                                         │
│     VLM Scheduler → OVMS (Qwen2.5-VL):                                    │
│     • Batch frames by time window                                          │
│     • Send to OVMS with detection prompt                                   │
│     • Parse structured item response                                       │
│                                                                             │
│  4. ORDER VALIDATION                                                       │
│     Validation Agent:                                                      │
│     • Compare detected items with expected order                           │
│     • Exact match → Semantic match → Flag mismatch                        │
│     • Generate validation result                                           │
│                                                                             │
│  5. RESULT OUTPUT                                                          │
│     { "matched": [...], "missing": [...], "extra": [...] }                │
│                                                                             │
└────────────────────────────────────────────────────────────────────────────┘

Production Features

Circuit Breaker Pattern

Prevents cascading failures when external services are unhealthy.

                    5 consecutive failures
       ┌────────┐ ───────────────────────────▶ ┌────────┐
       │        │                               │        │
       │ CLOSED │                               │  OPEN  │
       │        │ ◀─────────────────────────── │        │
       └────────┘    2 successes in half_open  └────────┘
            ▲                                       │
            │                                       │
            │        ┌────────────┐                 │
            └────────│ HALF_OPEN  │◀────────────────┘
      2 successes    └────────────┘    30s timeout
                           │
                           │ 1 failure → Back to OPEN

Configuration: - VLM Client: 5 failures → OPEN, 30s recovery → HALF_OPEN - Semantic Client: 15s recovery timeout (faster than VLM)

Connection Pooling

# VLM Client Pool Configuration
limits = httpx.Limits(
    max_keepalive_connections=20,
    max_connections=50,
    keepalive_expiry=30.0
)
timeout = httpx.Timeout(
    connect=10.0,
    read=300.0,   # Extended for VLM inference
    write=10.0,
    pool=10.0
)

Bounded Cache (LRU)

Thread-safe LRU cache with automatic eviction to prevent memory exhaustion: - Maximum 10,000 entries - Automatic eviction of oldest entries when full - Thread-safe operations with locking

Station Worker Reliability (Take-Away)

Feature	Implementation
GStreamer Pipeline	RTSP → H.264 decode → Frame capture
Circuit Breaker	5 failures in 5 min → 30s cooldown
Exponential Backoff	2s → 4s → 8s → ... → 60s max
Stall Detection	No frames for 5 min triggers restart
Health Monitoring	Frame rate, pipeline state tracking

Performance Characteristics

Metric	Dine-In	Take-Away
End-to-End Latency	8-15 seconds	Real-time stream
VLM Inference	5-10 seconds	5-10 seconds (batched)
Semantic Matching	50-200ms	50-200ms
Throughput	~4-6 req/min	Multiple concurrent streams
GPU Utilization	60-80%	70-90% (parallel mode)