Computer Vision Pipeline

Object Detection + OCR + FastAPI | GPU-Accelerated Serving

Advanced | 14+ Hours | GPU Required

Project Overview

Problem Statement

Organizations need to extract structured information from images at scale: detecting objects, reading text, and classifying scenes. This pipeline covers the first two, combining object detection (YOLOv8) with OCR (Tesseract and EasyOCR) behind a single API designed for production use.

Objectives

Build an end-to-end CV pipeline with YOLOv8 + OCR
Implement batch processing for high throughput
Deploy with GPU acceleration and model optimization
Create a scalable API with async processing
Add monitoring and quality metrics tracking

Component	Technology
Object Detection	YOLOv8 (Ultralytics)
OCR Engine	Tesseract + EasyOCR
API Framework	FastAPI + Uvicorn (optionally managed by Gunicorn)
Image Processing	OpenCV + Pillow
Model Optimization	ONNX Runtime + TensorRT
Queue System	Celery + Redis
Storage	MinIO / S3
Monitoring	Prometheus + Grafana

Architecture Diagram

+-------------------------------------------------------------------+
|              Computer Vision Pipeline Architecture                |
+-------------------------------------------------------------------+
|  +--------------+    +--------------+    +------------------+     |
|  | Image Upload |--->| Pre-process  |--->| YOLOv8 Detection |     |
|  | (API/Queue)  |    | (Resize/Crop)|    | (Object Bounding |     |
|  +--------------+    +--------------+    |  Boxes)          |     |
|                                          +--------+---------+     |
|                                                   |               |
|                                                   v               |
|  +--------------+    +--------------+    +------------------+     |
|  |  Structured  |<---| OCR Engine   |<---| Crop Detected    |     |
|  |  Output      |    | (Tesseract)  |    | Regions          |     |
|  +--------------+    +--------------+    +------------------+     |
|        |                                                   |     |
|        v                                                   v     |
|  +--------------+    +--------------+    +------------------+     |
|  |  Response    |    |  Quality     |    |  Monitoring      |     |
|  |  (JSON)      |    |  Scoring     |    |  Dashboard       |     |
|  +--------------+    +--------------+    +------------------+     |
+-------------------------------------------------------------------+

Step-by-Step Implementation

Step 1: Environment Setup

mkdir cv-pipeline && cd cv-pipeline
python -m venv venv && source venv/bin/activate
pip install ultralytics opencv-python-headless pillow
pip install fastapi uvicorn python-multipart
pip install pytesseract easyocr onnxruntime-gpu
pip install "celery[redis]" minio python-dotenv
pip install prometheus-client mlflow
# Install the Tesseract binary and English language data (Debian/Ubuntu)
sudo apt-get install tesseract-ocr tesseract-ocr-eng
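
Before moving on, it can help to confirm that the packages import and that the Tesseract binary is on the PATH. The following is a minimal sketch using only the standard library; the helper name `check_environment` and the module list are illustrative assumptions, not part of the project itself.

```python
import importlib.util
import shutil
from typing import Dict, Iterable

# Import names for the main packages installed above (illustrative list, adjust as needed).
REQUIRED_MODULES = (
    "ultralytics", "cv2", "PIL", "fastapi",
    "pytesseract", "easyocr", "onnxruntime",
)


def check_environment(
    modules: Iterable[str] = REQUIRED_MODULES,
    binary: str = "tesseract",
) -> Dict[str, bool]:
    """Report which modules are importable and whether the OCR binary is on PATH."""
    status = {name: importlib.util.find_spec(name) is not None for name in modules}
    status[f"{binary} (binary)"] = shutil.which(binary) is not None
    return status


# Example usage:
#   for name, ok in check_environment().items():
#       print("OK     " if ok else "MISSING", name)
```

Anything reported as missing should be installed before continuing with Step 2.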

Step 2: Object Detection Module

Implement YOLOv8-based object detection with custom post-processing.

# src/detection/yolo_detector.py
import logging
from dataclasses import dataclass
from typing import List, Optional, Tuple

import numpy as np
from ultralytics import YOLO

logger = logging.getLogger(__name__)


@dataclass
class Detection:
    class_id: int
    class_name: str
    confidence: float  # 0.0-1.0
    bbox: Tuple[int, int, int, int]  # x1, y1, x2, y2 in pixels
    area: int  # bounding-box area in pixels


class YOLODetector:
    def __init__(self, model_path: str = "yolov8m.pt", confidence: float = 0.5):
        self.model = YOLO(model_path)
        self.confidence = confidence
        logger.info("Loaded YOLO model from %s", model_path)

    @staticmethod
    def _to_detections(result) -> List[Detection]:
        """Convert one Ultralytics result object into Detection records."""
        detections = []
        for box in result.boxes:
            x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
            class_id = int(box.cls[0])
            detections.append(Detection(
                class_id=class_id,
                class_name=result.names[class_id],
                confidence=float(box.conf[0]),
                bbox=(x1, y1, x2, y2),
                area=(x2 - x1) * (y2 - y1),
            ))
        return detections

    def detect(
        self, image: np.ndarray,
        classes: Optional[List[int]] = None,
    ) -> List[Detection]:
        """Detect objects in one image, optionally restricted to given class IDs."""
        results = self.model(
            image,
            conf=self.confidence,
            classes=classes,
            verbose=False,
        )
        detections = []
        for r in results:
            detections.extend(self._to_detections(r))
        return detections

    def detect_batch(self, images: List[np.ndarray]) -> List[List[Detection]]:
        """Detect objects in several images with a single model call."""
        results = self.model(images, conf=self.confidence, verbose=False)
        return [self._to_detections(r) for r in results]

    def crop_detections(
        self, image: np.ndarray, detections: List[Detection],
        padding: int = 10
    ) -> List[Tuple[Detection, np.ndarray]]:
        """Crop each detection with padding, clamped to the image bounds."""
        h, w = image.shape[:2]
        crops = []
        for det in detections:
            x1, y1, x2, y2 = det.bbox
            x1 = max(0, x1 - padding)
            y1 = max(0, y1 - padding)
            x2 = min(w, x2 + padding)
            y2 = min(h, y2 + padding)
            crop = image[y1:y2, x1:x2]
            if crop.size == 0:
                # Degenerate box: nothing to pass on to OCR
                continue
            crops.append((det, crop))
        return crops
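
The step description mentions custom post-processing. One possible form, sketched here on plain `(confidence, box)` tuples, is to drop very small boxes and then greedily suppress heavily overlapping ones regardless of class. Ultralytics already applies non-maximum suppression during prediction, so this would only be an extra pass; the function names and the `min_area` and `iou_threshold` defaults are assumptions for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # x1, y1, x2, y2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_boxes(
    scored_boxes: List[Tuple[float, Box]],
    min_area: int = 100,         # assumed threshold in pixels
    iou_threshold: float = 0.5,  # assumed overlap threshold
) -> List[Tuple[float, Box]]:
    """Drop tiny boxes, then keep the highest-confidence box among heavy overlaps."""
    candidates = [
        (score, box) for score, box in scored_boxes
        if (box[2] - box[0]) * (box[3] - box[1]) >= min_area
    ]
    candidates.sort(key=lambda item: item[0], reverse=True)
    kept: List[Tuple[float, Box]] = []
    for score, box in candidates:
        if all(iou(box, kept_box) < iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept
```

The same logic could be applied to `Detection` objects by passing `(d.confidence, d.bbox)` pairs.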

Step 3: OCR Module

Implement OCR extraction from detected regions using both Tesseract and EasyOCR.

# src/ocr/ocr_engine.py
from dataclasses import dataclass
from typing import Dict, List, Optional

import cv2
import easyocr
import numpy as np
import pytesseract


@dataclass
class OCRResult:
    text: str
    confidence: float  # normalized to 0.0-1.0 for every engine
    bbox: Optional[tuple] = None
    engine: str = "tesseract"


class OCREngine:
    def __init__(self, engines: Optional[List[str]] = None):
        self.engines = engines or ["tesseract", "easyocr"]
        self.easyocr_reader = None
        if "easyocr" in self.engines:
            # gpu=True requests CUDA for EasyOCR
            self.easyocr_reader = easyocr.Reader(["en"], gpu=True)

    def extract_tesseract(self, image: np.ndarray) -> OCRResult:
        # Tesseract works on grayscale; color input is assumed to be BGR
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) if image.ndim == 3 else image
        data = pytesseract.image_to_data(gray, output_type=pytesseract.Output.DICT)
        texts = []
        confidences = []
        for text, conf in zip(data["text"], data["conf"]):
            conf = float(conf)  # Tesseract reports 0-100, or -1 for non-text rows
            if text.strip() and conf > 30:
                texts.append(text)
                # Rescale to 0-1 so the value is comparable with EasyOCR
                confidences.append(conf / 100.0)

        combined = " ".join(texts)
        avg_conf = sum(confidences) / len(confidences) if confidences else 0.0
        return OCRResult(text=combined, confidence=avg_conf, engine="tesseract")

    def extract_easyocr(self, image: np.ndarray) -> OCRResult:
        results = self.easyocr_reader.readtext(image)
        texts = []
        confidences = []
        for (bbox, text, conf) in results:
            if conf > 0.3:  # EasyOCR confidence is already 0-1
                texts.append(text)
                confidences.append(float(conf))

        combined = " ".join(texts)
        avg_conf = sum(confidences) / len(confidences) if confidences else 0.0
        return OCRResult(text=combined, confidence=avg_conf, engine="easyocr")

    def extract(self, image: np.ndarray) -> OCRResult:
        results = []
        if "tesseract" in self.engines:
            results.append(self.extract_tesseract(image))
        if "easyocr" in self.engines:
            results.append(self.extract_easyocr(image))

        # Return the result with the highest normalized confidence
        if not results:
            return OCRResult(text="", confidence=0.0)
        return max(results, key=lambda r: r.confidence)

    def extract_from_crops(self, crops: List[tuple]) -> List[Dict]:
        """Run OCR on (detection, crop_image) pairs produced by the detector."""
        results = []
        for detection, crop_image in crops:
            ocr_result = self.extract(crop_image)
            results.append({
                "detection": detection,
                "ocr": ocr_result,
            })
        return results
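
The two engines report confidence on different native scales: Tesseract uses 0-100, while EasyOCR uses 0-1. Choosing the best result by confidence is therefore only meaningful on a common scale. The sketch below isolates that normalization and selection step; `EngineOutput`, `normalized_confidence`, and `pick_best` are hypothetical names used for illustration only.

```python
from typing import List, NamedTuple


class EngineOutput(NamedTuple):
    """Hypothetical stand-in for an OCR result with its raw engine confidence."""
    engine: str
    text: str
    raw_confidence: float


# Native confidence ranges: Tesseract reports 0-100, EasyOCR reports 0-1.
CONFIDENCE_SCALE = {"tesseract": 100.0, "easyocr": 1.0}


def normalized_confidence(output: EngineOutput) -> float:
    """Map an engine-specific confidence onto the 0-1 range."""
    scale = CONFIDENCE_SCALE[output.engine]
    return min(max(output.raw_confidence / scale, 0.0), 1.0)


def pick_best(outputs: List[EngineOutput]) -> EngineOutput:
    """Select the output with the highest normalized confidence."""
    return max(outputs, key=normalized_confidence)
```

Without normalization, a Tesseract score of 62 would always outrank an EasyOCR score of 0.91, even though the latter expresses higher confidence.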

Step 4: Pipeline Orchestrator

Combine detection and OCR into a unified pipeline with quality scoring.

# src/pipeline/processor.py
import time
from dataclasses import asdict, dataclass
from typing import Dict, List

import numpy as np

from src.detection.yolo_detector import YOLODetector
from src.ocr.ocr_engine import OCREngine


@dataclass
class PipelineResult:
    detections: List[Dict]
    ocr_results: List[Dict]
    processing_time_ms: float
    image_size: tuple  # (width, height)
    quality_score: float


class CVPipeline:
    def __init__(self, detector: YOLODetector, ocr: OCREngine):
        self.detector = detector
        self.ocr = ocr

    def process_image(self, image: np.ndarray) -> PipelineResult:
        # perf_counter is monotonic, which makes it suitable for measuring durations
        start = time.perf_counter()
        h, w = image.shape[:2]

        # Step 1: Detect objects
        detections = self.detector.detect(image)

        # Step 2: Crop detected regions
        crops = self.detector.crop_detections(image, detections)

        # Step 3: OCR on cropped regions
        ocr_results = self.ocr.extract_from_crops(crops)

        # Step 4: Compute quality score
        quality = self._compute_quality(detections, ocr_results)

        elapsed = (time.perf_counter() - start) * 1000

        return PipelineResult(
            detections=[asdict(d) for d in detections],
            ocr_results=[{"text": r["ocr"].text, "confidence": r["ocr"].confidence,
                          "engine": r["ocr"].engine, "class": r["detection"].class_name}
                         for r in ocr_results],
            processing_time_ms=elapsed,
            image_size=(w, h),
            quality_score=quality,
        )

    def _compute_quality(self, detections, ocr_results) -> float:
        """Average of the mean detection confidence and the mean OCR confidence.

        Assumes both confidences are expressed on the same 0-1 scale.
        """
        if not detections:
            return 0.0
        det_scores = [d.confidence for d in detections]
        ocr_scores = [r["ocr"].confidence for r in ocr_results] or [0.0]
        # Cast to a built-in float so the value serializes cleanly to JSON
        return float((np.mean(det_scores) + np.mean(ocr_scores)) / 2)
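
As a worked example of the quality score, the sketch below reproduces the same arithmetic with the standard library and adds a simple acceptance check. It assumes all confidences are on a 0-1 scale; the `is_acceptable` helper and its 0.6 threshold are hypothetical and would need tuning against real data.

```python
from statistics import fmean
from typing import Sequence


def quality_score(det_confidences: Sequence[float], ocr_confidences: Sequence[float]) -> float:
    """Mean detection confidence averaged with mean OCR confidence."""
    if not det_confidences:
        return 0.0
    ocr_mean = fmean(ocr_confidences) if ocr_confidences else 0.0
    return (fmean(det_confidences) + ocr_mean) / 2


def is_acceptable(score: float, threshold: float = 0.6) -> bool:
    """Hypothetical gate for downstream consumers (threshold is an assumption)."""
    return score >= threshold


# Two detections (0.9, 0.7) with OCR confidences (0.6, 0.8):
# mean detection = 0.8, mean OCR = 0.7, quality = (0.8 + 0.7) / 2 = 0.75
score = quality_score([0.9, 0.7], [0.6, 0.8])
```

Note that an image with confident detections but no readable text scores at most 0.5 under this formula, which may or may not be the desired behavior for a given use case.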

Step 5: FastAPI Service

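# src/api/main.py
import asyncio
import io
from contextlib import asynccontextmanager
from typing import List

import numpy as np
from fastapi import FastAPI, File, HTTPException, UploadFile
from fastapi.concurrency import run_in_threadpool
from PIL import Image, UnidentifiedImageError

from src.detection.yolo_detector import YOLODetector
from src.ocr.ocr_engine import OCREngine
from src.pipeline.processor import CVPipeline, PipelineResult

MAX_BATCH_SIZE = 10  # Max images per batch request


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the models once at startup and share them across requests
    detector = YOLODetector("yolov8m.pt", confidence=0.5)
    ocr = OCREngine(["tesseract", "easyocr"])
    app.state.pipeline = CVPipeline(detector, ocr)
    app.state.inference_lock = asyncio.Lock()
    yield


app = FastAPI(title="CV Pipeline API", version="1.0.0", lifespan=lifespan)


def decode_image(contents: bytes) -> np.ndarray:
    """Decode uploaded bytes into a BGR array, as OpenCV and Ultralytics expect."""
    try:
        # convert("RGB") also normalizes grayscale, palette, and RGBA images
        image = Image.open(io.BytesIO(contents)).convert("RGB")
    except (UnidentifiedImageError, OSError) as exc:
        raise HTTPException(status_code=400, detail="File is not a readable image") from exc
    return np.ascontiguousarray(np.array(image)[:, :, ::-1])  # RGB -> BGR


async def run_pipeline(image: np.ndarray) -> PipelineResult:
    # Inference blocks, so it runs in a worker thread to keep the event loop free.
    # The lock serializes access because the shared models are not assumed thread-safe.
    async with app.state.inference_lock:
        return await run_in_threadpool(app.state.pipeline.process_image, image)


@app.post("/api/v1/process")
async def process_image(file: UploadFile = File(...)):
    image_np = decode_image(await file.read())
    result = await run_pipeline(image_np)

    return {
        "detections": result.detections,
        "ocr_results": result.ocr_results,
        "processing_time_ms": result.processing_time_ms,
        "image_size": result.image_size,
        "quality_score": result.quality_score,
    }


@app.post("/api/v1/process/batch")
async def process_batch(files: List[UploadFile] = File(...)):
    # Reject oversized batches explicitly instead of silently dropping files
    if len(files) > MAX_BATCH_SIZE:
        raise HTTPException(
            status_code=400,
            detail=f"At most {MAX_BATCH_SIZE} images are allowed per batch",
        )

    results = []
    for file in files:
        image_np = decode_image(await file.read())
        result = await run_pipeline(image_np)
        results.append({
            "filename": file.filename,
            "detections": len(result.detections),
            "processing_time_ms": result.processing_time_ms,
        })
    return {"results": results}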
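
Because the endpoints accept arbitrary uploaded bytes, a cheap check on size and file signature before decoding can reject obviously invalid uploads early. The following is a minimal sketch of such a check using only the standard library; the 10 MB limit, the accepted formats, and the function names are assumptions for illustration.

```python
from typing import Optional

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumed limit, tune for your deployment

# Leading "magic" bytes of the formats this sketch accepts.
SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
}


def sniff_image_format(contents: bytes) -> Optional[str]:
    """Return the format name if the bytes start with a known signature."""
    for name, signature in SIGNATURES.items():
        if contents.startswith(signature):
            return name
    return None


def validate_upload(contents: bytes) -> str:
    """Return the detected format, or raise ValueError with a client-facing reason."""
    if not contents:
        raise ValueError("empty upload")
    if len(contents) > MAX_UPLOAD_BYTES:
        raise ValueError("upload exceeds size limit")
    fmt = sniff_image_format(contents)
    if fmt is None:
        raise ValueError("unsupported or unrecognized image format")
    return fmt
```

Inside an endpoint, a `ValueError` from `validate_upload` could be translated into an `HTTPException` with status code 400. A signature match does not guarantee the file decodes, so decoding errors still need handling.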

Step 6: Model Optimization with ONNX

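# src/optimization/export_onnx.py
import shutil

import onnxruntime as ort
from ultralytics import YOLO


def export_yolo_to_onnx(model_path: str, output_path: str) -> str:
    model = YOLO(model_path)
    # export() writes the .onnx file next to the weights and returns its path
    exported_path = model.export(format="onnx", dynamic=True, simplify=True)
    # Move the file so it actually ends up at the requested location
    shutil.move(exported_path, output_path)
    print(f"Exported ONNX model to {output_path}")
    return output_path


class ONNXDetector:
    def __init__(self, onnx_path: str):
        # Providers are tried in order: CUDA first, CPU as the fallback
        self.session = ort.InferenceSession(
            onnx_path,
            providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
        )
        self.input_name = self.session.get_inputs()[0].name

    def predict(self, input_tensor):
        # Returns raw model outputs; box decoding and NMS are not included here
        outputs = self.session.run(None, {self.input_name: input_tensor})
        return outputs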
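
`ONNXDetector.predict` takes an already prepared input tensor. YOLO-style models are commonly fed letterboxed input, that is, an aspect-preserving resize padded to a square. The sketch below computes only the geometry of that transform and the mapping back to source coordinates; the 640-pixel target and all names here are assumptions, and the actual pixel resize, normalization, and output decoding are left out.

```python
from typing import NamedTuple, Tuple


class Letterbox(NamedTuple):
    scale: float
    resized: Tuple[int, int]  # (width, height) after resizing
    pad_left: int
    pad_top: int
    pad_right: int
    pad_bottom: int


def letterbox_geometry(width: int, height: int, target: int = 640) -> Letterbox:
    """Compute resize scale and padding to fit an image into a target square."""
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_w, pad_h = target - new_w, target - new_h
    left, top = pad_w // 2, pad_h // 2
    return Letterbox(scale, (new_w, new_h), left, top, pad_w - left, pad_h - top)


def to_original_coords(x: float, y: float, box: Letterbox) -> Tuple[float, float]:
    """Map a point from the padded model input back to the source image."""
    return (x - box.pad_left) / box.scale, (y - box.pad_top) / box.scale
```

For a 1280x720 image and a 640 target, the scale is 0.5, the resized image is 640x360, and 140 pixels of padding go above and below.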

Note: ONNX Runtime can speed up inference by roughly 2-4x over PyTorch, depending on the model, batch size, and hardware. On NVIDIA GPUs, exporting to TensorRT can reduce latency further.

Tip: Implement a processing queue with Celery for batch workloads. Handing work to background workers keeps large images from causing API timeouts and enables priority-based processing.
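
To illustrate priority-based processing in isolation, here is a minimal in-process sketch built on the standard library's `heapq`. It is a stand-in for the idea only: in the stack described here, Celery workers and a Redis broker would take this role. The `Job` and `PriorityJobQueue` names and the default priority are hypothetical.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(order=True)
class Job:
    priority: int                          # lower value is served first
    sequence: int                          # tie-breaker: preserves submission order
    image_key: str = field(compare=False)  # e.g. an object-storage key (hypothetical)


class PriorityJobQueue:
    def __init__(self) -> None:
        self._heap: List[Job] = []
        self._counter = itertools.count()

    def submit(self, image_key: str, priority: int = 5) -> None:
        """Enqueue an image for processing."""
        heapq.heappush(self._heap, Job(priority, next(self._counter), image_key))

    def next_job(self) -> Optional[Job]:
        """Pop the most urgent job, or return None when the queue is empty."""
        return heapq.heappop(self._heap) if self._heap else None

    def __len__(self) -> int:
        return len(self._heap)
```

The sequence counter ensures that jobs with equal priority are served in the order they were submitted.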

Performance Metrics

Metric	Target	Description
Detection Latency	< 30ms	Per image (GPU)
OCR Latency	< 50ms	Per detected region
End-to-End Latency	< 200ms	Full pipeline
mAP@50	> 0.75	Object detection accuracy
OCR Accuracy	> 95%	On clear text
Throughput	> 50 FPS	Batch processing
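
To check measurements against the latency targets in the table, per-stage timings need to be collected and summarized. The sketch below does this with the standard library; in the monitoring stack listed earlier, `prometheus-client` histograms would normally take this role. Comparing the 95th percentile against each target is an assumption, since the table does not say which statistic the targets refer to.

```python
import time
from contextlib import contextmanager
from statistics import quantiles
from typing import Dict, List

# Targets taken from the table above, in milliseconds.
LATENCY_TARGETS_MS = {"detection": 30.0, "ocr": 50.0, "end_to_end": 200.0}


class LatencyTracker:
    def __init__(self) -> None:
        self.samples: Dict[str, List[float]] = {}

    def record(self, stage: str, elapsed_ms: float) -> None:
        self.samples.setdefault(stage, []).append(elapsed_ms)

    @contextmanager
    def measure(self, stage: str):
        """Time the enclosed block and record it under the given stage."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(stage, (time.perf_counter() - start) * 1000)

    def p95(self, stage: str) -> float:
        values = self.samples[stage]
        if len(values) < 2:
            return values[0]
        # 19 cut points for n=20; the last one is the 95th percentile
        return quantiles(values, n=20, method="inclusive")[-1]

    def meets_target(self, stage: str) -> bool:
        return self.p95(stage) < LATENCY_TARGETS_MS[stage]
```

A typical use would wrap each pipeline stage, for example `with tracker.measure("detection"): ...`, and report `meets_target` per stage after a benchmark run.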

Interview Talking Points

YOLO Architecture: Single-shot detection enables real-time inference. YOLOv8 uses anchor-free detection for better generalization.
OCR Pipeline: Combining object detection with OCR creates a powerful document understanding system.
Batch Processing: Batching GPU operations provides significant throughput improvements over single-image processing.
Model Optimization: ONNX Runtime and TensorRT can provide roughly 2-5x speedups with minimal accuracy loss, depending on the model, numeric precision, and hardware.
Quality Scoring: Confidence-based quality metrics help filter low-quality results for downstream consumers.
Scaling: Auto-scale GPU instances on queue depth so that capacity follows load and processing stays cost-efficient.
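
The batch-processing point can be made concrete with a small helper that splits a list of images into fixed-size groups, each of which could then go through a single batched model call such as `detect_batch`. This is a minimal sketch; the `chunked` name is hypothetical, and a suitable batch size depends on GPU memory and image resolution.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Example usage with the detector from Step 2 (batch size of 8 is an assumption):
#   for batch in chunked(images, 8):
#       detections = detector.detect_batch(batch)
```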

Warning: OCR accuracy degrades significantly with poor image quality. Implement image preprocessing (contrast enhancement, noise reduction) as a first step.

Note: This pipeline handles general object detection and OCR. For specialized use cases (medical imaging, satellite imagery), fine-tune the detection model on domain-specific data.
