Computer Vision Pipeline
Object Detection + OCR + FastAPI | GPU-Accelerated Serving
Project Overview
Problem Statement
Organizations need to extract structured information from images at scale: detecting objects, reading text, and classifying scenes. This pipeline covers the first two, combining YOLOv8 object detection with OCR (Tesseract and EasyOCR) behind a single API.
Objectives
- Build an end-to-end CV pipeline with YOLOv8 + OCR
- Implement batch processing for high throughput
- Deploy with GPU acceleration and model optimization
- Create a scalable API with async processing
- Add monitoring and quality metrics tracking
| Component | Technology |
|---|---|
| Object Detection | YOLOv8 (Ultralytics) |
| OCR Engine | Tesseract + EasyOCR |
| API Framework | FastAPI + Uvicorn (optionally managed by gunicorn) |
| Image Processing | OpenCV + Pillow |
| Model Optimization | ONNX Runtime + TensorRT |
| Queue System | Celery + Redis |
| Storage | MinIO / S3 |
| Monitoring | Prometheus + Grafana |
Architecture Diagram
+-------------------------------------------------------------------+
|               Computer Vision Pipeline Architecture               |
+-------------------------------------------------------------------+
|  +--------------+    +--------------+    +------------------+     |
|  | Image Upload |--->| Pre-process  |--->| YOLOv8 Detection |     |
|  | (API/Queue)  |    | (Resize/Crop)|    | (Object Bounding |     |
|  +--------------+    +--------------+    |  Boxes)          |     |
|                                          +--------+---------+     |
|                                                   |               |
|                                                   v               |
|  +--------------+    +--------------+    +------------------+     |
|  | Structured   |<---| OCR Engine   |<---| Crop Detected    |     |
|  | Output       |    | (Tesseract)  |    | Regions          |     |
|  +--------------+    +--------------+    +------------------+     |
|         |                   |                                     |
|         v                   v                                     |
|  +--------------+    +--------------+    +------------------+     |
|  | Response     |    | Quality      |    | Monitoring       |     |
|  | (JSON)       |    | Scoring      |    | Dashboard        |     |
|  +--------------+    +--------------+    +------------------+     |
+-------------------------------------------------------------------+
Step-by-Step Implementation
Step 1: Environment Setup
mkdir cv-pipeline && cd cv-pipeline
python -m venv venv && source venv/bin/activate
pip install ultralytics opencv-python-headless pillow
pip install fastapi uvicorn python-multipart
pip install pytesseract easyocr onnxruntime-gpu
pip install "celery[redis]" minio python-dotenv
pip install prometheus-client mlflow
# Install the Tesseract binary (Debian/Ubuntu)
sudo apt-get install tesseract-ocr tesseract-ocr-eng
Step 2: Object Detection Module
Implement YOLOv8-based object detection with custom post-processing.
# src/detection/yolo_detector.py
import logging
from dataclasses import dataclass
from typing import List, Optional, Tuple

import numpy as np
from ultralytics import YOLO

logger = logging.getLogger(__name__)


@dataclass
class Detection:
    class_id: int
    class_name: str
    confidence: float
    bbox: Tuple[int, int, int, int]  # x1, y1, x2, y2
    area: int


class YOLODetector:
    def __init__(self, model_path: str = "yolov8m.pt", confidence: float = 0.5):
        self.model = YOLO(model_path)
        self.confidence = confidence
        logger.info(f"Loaded YOLO model from {model_path}")

    @staticmethod
    def _to_detections(result) -> List[Detection]:
        # Convert one Ultralytics result object into plain Detection records.
        detections = []
        for box in result.boxes:
            x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
            class_id = int(box.cls[0])
            detections.append(Detection(
                class_id=class_id,
                class_name=result.names[class_id],
                confidence=float(box.conf[0]),
                bbox=(x1, y1, x2, y2),
                area=(x2 - x1) * (y2 - y1),
            ))
        return detections

    def detect(
        self, image: np.ndarray,
        classes: Optional[List[int]] = None,
    ) -> List[Detection]:
        results = self.model(
            image,
            conf=self.confidence,
            classes=classes,
            verbose=False,
        )
        return [d for r in results for d in self._to_detections(r)]

    def detect_batch(self, images: List[np.ndarray]) -> List[List[Detection]]:
        # One forward pass over the whole list; one result per input image.
        results = self.model(images, conf=self.confidence, verbose=False)
        return [self._to_detections(r) for r in results]

    def crop_detections(
        self, image: np.ndarray, detections: List[Detection],
        padding: int = 10,
    ) -> List[Tuple[Detection, np.ndarray]]:
        h, w = image.shape[:2]
        crops = []
        for det in detections:
            x1, y1, x2, y2 = det.bbox
            # Pad the box, clamped to the image bounds.
            x1 = max(0, x1 - padding)
            y1 = max(0, y1 - padding)
            x2 = min(w, x2 + padding)
            y2 = min(h, y2 + padding)
            crop = image[y1:y2, x1:x2]
            if crop.size == 0:
                continue  # degenerate box: nothing to pass to OCR
            crops.append((det, crop))
        return crops
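Step 2 mentions custom post-processing, which the detector class leaves to the caller. The following is a minimal sketch of one possible filter, assuming detections shaped like the `Detection` dataclass above. The function name `filter_detections` and the `min_area` and `min_confidence` defaults are hypothetical, chosen for illustration rather than tuned. It drops boxes that are too small or too uncertain to be worth sending to OCR, then sorts the rest in reading order.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:  # mirrors the dataclass in yolo_detector.py
    class_id: int
    class_name: str
    confidence: float
    bbox: Tuple[int, int, int, int]  # x1, y1, x2, y2
    area: int


def filter_detections(
    detections: List[Detection],
    min_area: int = 400,          # hypothetical threshold: skip regions too small to read
    min_confidence: float = 0.5,  # hypothetical threshold
) -> List[Detection]:
    """Drop small or low-confidence boxes, then sort top-to-bottom, left-to-right."""
    kept = [
        d for d in detections
        if d.area >= min_area and d.confidence >= min_confidence
    ]
    # Sort by top edge first, then left edge, so OCR output follows reading order.
    return sorted(kept, key=lambda d: (d.bbox[1], d.bbox[0]))


if __name__ == "__main__":
    dets = [
        Detection(0, "sign", 0.91, (200, 50, 300, 120), 7000),
        Detection(0, "sign", 0.88, (10, 50, 110, 120), 7000),
        Detection(1, "label", 0.95, (5, 5, 15, 15), 100),        # too small
        Detection(1, "label", 0.30, (0, 300, 200, 400), 20000),  # low confidence
    ]
    for d in filter_detections(dets):
        print(d.class_name, d.confidence, d.bbox)
```

A filter like this would sit between `detect` and `crop_detections`, so that OCR time is only spent on regions likely to contain legible text.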
Step 3: OCR Module
Implement OCR extraction from detected regions using both Tesseract and EasyOCR.
# src/ocr/ocr_engine.py
from dataclasses import dataclass
from typing import Dict, List, Optional

import cv2
import easyocr
import numpy as np
import pytesseract


@dataclass
class OCRResult:
    text: str
    confidence: float  # normalized to 0-1 for every engine
    bbox: Optional[tuple] = None
    engine: str = "tesseract"


class OCREngine:
    def __init__(self, engines: Optional[List[str]] = None):
        self.engines = engines or ["tesseract", "easyocr"]
        if "easyocr" in self.engines:
            self.easyocr_reader = easyocr.Reader(["en"], gpu=True)

    def extract_tesseract(self, image: np.ndarray) -> OCRResult:
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) if image.ndim == 3 else image
        data = pytesseract.image_to_data(gray, output_type=pytesseract.Output.DICT)
        texts = []
        confidences = []
        for text, conf in zip(data["text"], data["conf"]):
            # conf can arrive as an int, a float, or a string; -1 marks non-text boxes.
            conf = float(conf)
            if text.strip() and conf > 30:
                texts.append(text)
                confidences.append(conf)
        combined = " ".join(texts)
        # Tesseract reports 0-100; rescale to 0-1 so it is comparable with EasyOCR.
        avg_conf = sum(confidences) / len(confidences) / 100 if confidences else 0.0
        return OCRResult(text=combined, confidence=avg_conf, engine="tesseract")

    def extract_easyocr(self, image: np.ndarray) -> OCRResult:
        results = self.easyocr_reader.readtext(image)
        texts = []
        confidences = []
        for (bbox, text, conf) in results:
            if conf > 0.3:
                texts.append(text)
                confidences.append(float(conf))  # plain float keeps the result JSON-serializable
        combined = " ".join(texts)
        avg_conf = sum(confidences) / len(confidences) if confidences else 0.0
        return OCRResult(text=combined, confidence=avg_conf, engine="easyocr")

    def extract(self, image: np.ndarray) -> OCRResult:
        results = []
        if "tesseract" in self.engines:
            results.append(self.extract_tesseract(image))
        if "easyocr" in self.engines:
            results.append(self.extract_easyocr(image))
        if not results:
            return OCRResult(text="", confidence=0.0)
        # Return the best result by (normalized) confidence.
        return max(results, key=lambda r: r.confidence)

    def extract_from_crops(self, crops: List[tuple]) -> List[Dict]:
        results = []
        for detection, crop_image in crops:
            ocr_result = self.extract(crop_image)
            results.append({
                "detection": detection,
                "ocr": ocr_result,
            })
        return results
Step 4: Pipeline Orchestrator
Combine detection and OCR into a unified pipeline with quality scoring.
# src/pipeline/processor.py
import time
from dataclasses import asdict, dataclass
from typing import Dict, List

import numpy as np

from src.detection.yolo_detector import YOLODetector
from src.ocr.ocr_engine import OCREngine


@dataclass
class PipelineResult:
    detections: List[Dict]
    ocr_results: List[Dict]
    processing_time_ms: float
    image_size: tuple  # (width, height)
    quality_score: float


class CVPipeline:
    def __init__(self, detector: YOLODetector, ocr: OCREngine):
        self.detector = detector
        self.ocr = ocr

    def process_image(self, image: np.ndarray) -> PipelineResult:
        start = time.perf_counter()
        h, w = image.shape[:2]
        # Step 1: Detect objects
        detections = self.detector.detect(image)
        # Step 2: Crop detected regions
        crops = self.detector.crop_detections(image, detections)
        # Step 3: OCR on cropped regions
        ocr_results = self.ocr.extract_from_crops(crops)
        # Step 4: Compute quality score
        quality = self._compute_quality(detections, ocr_results)
        elapsed = (time.perf_counter() - start) * 1000
        return PipelineResult(
            detections=[asdict(d) for d in detections],
            ocr_results=[
                {
                    "text": r["ocr"].text,
                    "confidence": r["ocr"].confidence,
                    "engine": r["ocr"].engine,
                    "class": r["detection"].class_name,
                }
                for r in ocr_results
            ],
            processing_time_ms=elapsed,
            image_size=(w, h),
            quality_score=quality,
        )

    def _compute_quality(self, detections, ocr_results) -> float:
        # Mean detection confidence averaged with mean OCR confidence.
        # Assumes both are on the same 0-1 scale.
        if not detections:
            return 0.0
        det_scores = [d.confidence for d in detections]
        ocr_scores = [r["ocr"].confidence for r in ocr_results] if ocr_results else [0]
        return float((np.mean(det_scores) + np.mean(ocr_scores)) / 2)
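The quality score is the average of two means, so it is only meaningful when detection and OCR confidences share a scale. YOLO and EasyOCR report values in 0-1, while Tesseract reports word confidences in 0-100, so Tesseract values need dividing by 100 before they are mixed in. A worked example of the arithmetic, as a standalone sketch (the function name `quality_score` is hypothetical and uses the standard library instead of NumPy):

```python
from statistics import mean
from typing import Sequence


def quality_score(det_conf: Sequence[float], ocr_conf: Sequence[float]) -> float:
    """Average of mean detection confidence and mean OCR confidence (both 0-1)."""
    if not det_conf:
        return 0.0  # no detections means there is nothing to score
    ocr_mean = mean(ocr_conf) if ocr_conf else 0.0
    return (mean(det_conf) + ocr_mean) / 2


# Two detections at 0.9 and 0.7 (mean 0.8), OCR at 0.8 and 0.6 (mean 0.7)
score = quality_score([0.9, 0.7], [0.8, 0.6])
print(round(score, 2))  # 0.75
```

If a raw Tesseract value such as 80 were passed in unscaled, the OCR term would dominate and the score would leave the 0-1 range, which is why normalization belongs upstream of this function.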
Step 5: FastAPI Service
# src/api/main.py
import io
from contextlib import asynccontextmanager
from typing import List

import numpy as np
from fastapi import FastAPI, File, HTTPException, UploadFile
from PIL import Image, UnidentifiedImageError

from src.detection.yolo_detector import YOLODetector
from src.ocr.ocr_engine import OCREngine
from src.pipeline.processor import CVPipeline


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the models once at startup rather than on every request.
    detector = YOLODetector("yolov8m.pt", confidence=0.5)
    ocr = OCREngine(["tesseract", "easyocr"])
    app.state.pipeline = CVPipeline(detector, ocr)
    yield


app = FastAPI(title="CV Pipeline API", version="1.0.0", lifespan=lifespan)


def decode_image(contents: bytes) -> np.ndarray:
    # Decode upload bytes into a 3-channel BGR array, the layout the
    # detector and the OpenCV-based OCR code expect.
    try:
        image = Image.open(io.BytesIO(contents)).convert("RGB")
    except (UnidentifiedImageError, OSError) as exc:
        raise HTTPException(status_code=400, detail="Invalid image file") from exc
    return np.ascontiguousarray(np.array(image)[:, :, ::-1])  # RGB -> BGR


@app.post("/api/v1/process")
async def process_image(file: UploadFile = File(...)):
    image_np = decode_image(await file.read())
    result = app.state.pipeline.process_image(image_np)
    return {
        "detections": result.detections,
        "ocr_results": result.ocr_results,
        "processing_time_ms": result.processing_time_ms,
        "image_size": result.image_size,
        "quality_score": result.quality_score,
    }


@app.post("/api/v1/process/batch")
async def process_batch(files: List[UploadFile] = File(...)):
    results = []
    for file in files[:10]:  # Max 10 images per batch; extra files are ignored
        image_np = decode_image(await file.read())
        result = app.state.pipeline.process_image(image_np)
        results.append({
            "filename": file.filename,
            "detections": len(result.detections),
            "processing_time_ms": result.processing_time_ms,
        })
    return {"results": results}
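One caveat with the endpoints above: `process_image` on the pipeline is a blocking call, and running it directly inside an `async def` handler holds up the event loop for the duration of inference. A common remedy is to hand the call to a worker thread. The sketch below shows the pattern with the standard library only, assuming Python 3.9+ for `asyncio.to_thread`; `blocking_inference` and `handle_request` are hypothetical stand-ins for the pipeline call and the route handler.

```python
import asyncio
import time


def blocking_inference(image_id: int) -> dict:
    """Stand-in for pipeline.process_image: blocks the calling thread."""
    time.sleep(0.2)
    return {"image_id": image_id, "detections": []}


async def handle_request(image_id: int) -> dict:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(blocking_inference, image_id)


async def main() -> list:
    start = time.perf_counter()
    # Three concurrent "requests" overlap instead of running back to back.
    results = await asyncio.gather(*(handle_request(i) for i in range(3)))
    print(f"3 requests in {time.perf_counter() - start:.2f}s")
    return results


if __name__ == "__main__":
    asyncio.run(main())
```

In the real service the threads would still share one GPU, so this keeps the server responsive (health checks, new uploads) rather than multiplying inference throughput.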
Step 6: Model Optimization with ONNX
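# src/optimization/export_onnx.py
import shutil

import onnxruntime as ort
from ultralytics import YOLO


def export_yolo_to_onnx(model_path: str, output_path: str) -> str:
    model = YOLO(model_path)
    # export() picks its own output location and returns the exported file's path,
    # so move the file to the requested destination afterwards.
    exported_path = model.export(format="onnx", dynamic=True, simplify=True)
    shutil.move(exported_path, output_path)
    print(f"Exported ONNX model to {output_path}")
    return output_path


class ONNXDetector:
    def __init__(self, onnx_path: str):
        # Providers are tried in order: CUDA first, CPU as the fallback.
        self.session = ort.InferenceSession(
            onnx_path,
            providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
        )

    def predict(self, input_tensor):
        input_name = self.session.get_inputs()[0].name
        outputs = self.session.run(None, {input_name: input_tensor})
        return outputs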
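`ONNXDetector.predict` takes an `input_tensor`, but this step does not show how to build one. The sketch below assumes the exported model expects a float32 tensor in NCHW layout with RGB channels scaled to [0, 1], and that the image has already been resized or letterboxed to the model's input size (640x640 here). Both are assumptions worth checking against `session.get_inputs()[0]` for your own export. The helper name `to_input_tensor` is hypothetical.

```python
import numpy as np


def to_input_tensor(image_bgr: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 uint8 BGR image into a 1x3xHxW float32 RGB tensor in [0, 1]."""
    if image_bgr.ndim != 3 or image_bgr.shape[2] != 3:
        raise ValueError("expected an HxWx3 image")
    rgb = image_bgr[:, :, ::-1]                       # BGR -> RGB
    chw = np.transpose(rgb, (2, 0, 1))                # HWC -> CHW
    tensor = chw.astype(np.float32) / 255.0           # scale to [0, 1]
    return np.ascontiguousarray(tensor[np.newaxis])   # add the batch dimension


image = np.zeros((640, 640, 3), dtype=np.uint8)
image[..., 0] = 255  # pure blue in BGR order
tensor = to_input_tensor(image)
print(tensor.shape, tensor.dtype)  # (1, 3, 640, 640) float32
```

Unlike the Ultralytics wrapper, a raw ONNX Runtime session returns undecoded model outputs, so box decoding and non-maximum suppression would also have to be handled by the caller.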
Note: ONNX Runtime can give roughly a 2-4x inference speedup over PyTorch, depending on the model and hardware, so benchmark on your own setup. For faster inference still on NVIDIA GPUs, export to TensorRT.
Tip: Offload batch jobs to a Celery queue. The API can then return immediately instead of timing out on large images, and jobs can be processed in priority order.
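To make the priority idea concrete, here is a small in-process sketch. It is not Celery: it uses Python's `heapq` as a stand-in for the broker, and `Job`, `PriorityJobQueue`, and the file paths are hypothetical. It only illustrates the ordering behavior a priority queue gives you: urgent jobs jump ahead, and jobs of equal priority keep their submission order.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Optional


@dataclass(order=True)
class Job:
    priority: int                           # lower number = processed sooner
    seq: int                                # tie-breaker: FIFO within a priority
    image_path: str = field(compare=False)  # payload is not part of the ordering


class PriorityJobQueue:
    def __init__(self) -> None:
        self._heap: list = []
        self._counter = itertools.count()

    def submit(self, image_path: str, priority: int = 5) -> None:
        heapq.heappush(self._heap, Job(priority, next(self._counter), image_path))

    def next_job(self) -> Optional[Job]:
        return heapq.heappop(self._heap) if self._heap else None


queue = PriorityJobQueue()
queue.submit("bulk/scan_001.png", priority=9)
queue.submit("interactive/receipt.png", priority=1)
queue.submit("bulk/scan_002.png", priority=9)

while (job := queue.next_job()) is not None:
    print(job.priority, job.image_path)
```

In a real deployment the queue would live in the broker (Redis in this stack) and workers would run in separate processes, so the API process never blocks on inference.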
Performance Metrics
| Metric | Target | Description |
|---|---|---|
| Detection Latency | < 30ms | Per image (GPU) |
| OCR Latency | < 50ms | Per detected region |
| End-to-End Latency | < 200ms | Full pipeline |
| mAP@50 | > 0.75 | Object detection accuracy |
| OCR Accuracy | > 95% | On clear text |
| Throughput | > 50 FPS | Batch processing |
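The targets above are only useful if they are measured the same way each time. A minimal benchmarking sketch, using the standard library only, is shown below. The function name `benchmark` and its defaults are hypothetical; the workload is passed in as a zero-argument callable, so in practice it would wrap something like `pipeline.process_image(image)`.

```python
import time
from statistics import quantiles
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], runs: int = 50, warmup: int = 5) -> Dict[str, float]:
    """Time repeated calls to fn and report latency percentiles and throughput."""
    for _ in range(warmup):  # warm-up calls are excluded (model load, caches)
        fn()
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    cuts = quantiles(latencies_ms, n=100)  # 99 cut points: cuts[49] ~ p50, cuts[94] ~ p95
    total_s = sum(latencies_ms) / 1000
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "throughput_per_s": runs / total_s,
    }


if __name__ == "__main__":
    # Stand-in workload; replace with e.g. lambda: pipeline.process_image(image)
    stats = benchmark(lambda: time.sleep(0.002), runs=20)
    print({k: round(v, 1) for k, v in stats.items()})
```

Reporting p95 alongside the median matters for an API: a pipeline can meet the 200 ms target on average while a noticeable share of requests still exceeds it.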
Interview Talking Points
- YOLO Architecture: YOLO is a single-stage detector, so one forward pass predicts both boxes and classes, which enables real-time inference. YOLOv8 uses an anchor-free head, so there are no anchor boxes to tune per dataset.
- OCR Pipeline: Running OCR on detected regions rather than on the whole image ties each piece of text to the object it belongs to, which is the basis of a document understanding system.
- Batch Processing: Batching images into a single GPU forward pass amortizes per-call overhead and raises throughput over single-image processing.
- Model Optimization: ONNX Runtime and TensorRT can provide roughly 2-5x speedup with minimal accuracy loss, depending on model and hardware.
- Quality Scoring: Confidence-based quality metrics help filter low-quality results for downstream consumers.
- Scaling: Run on GPU instances and auto-scale on queue depth to keep processing cost-efficient.
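The scaling point can be made concrete with a small sizing rule. The sketch below is one possible policy, not a prescribed one: it computes how many workers are needed to drain the current backlog within a target time, clamped to a minimum and maximum. The function name, parameters, and default values are all hypothetical.

```python
import math


def desired_workers(
    queue_depth: int,
    images_per_worker_per_s: float,
    target_drain_s: float = 30.0,
    min_workers: int = 1,
    max_workers: int = 8,
) -> int:
    """Workers needed to drain the current backlog within target_drain_s seconds."""
    if images_per_worker_per_s <= 0 or target_drain_s <= 0:
        raise ValueError("rate and drain target must be positive")
    needed = math.ceil(queue_depth / (images_per_worker_per_s * target_drain_s))
    return max(min_workers, min(max_workers, needed))


# 3,000 queued images, 20 images/s per GPU worker, drain within 30 s -> 5 workers
print(desired_workers(3000, images_per_worker_per_s=20))
```

Keeping a non-zero `min_workers` avoids cold-start latency from loading models, while `max_workers` caps GPU spend during spikes.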
Warning: OCR accuracy degrades significantly with poor image quality. Apply image preprocessing (contrast enhancement, noise reduction) before running OCR.
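As an illustration of the two preprocessing steps named above, here is a NumPy-only sketch: a 3x3 mean filter for noise reduction followed by a percentile-based contrast stretch. These are deliberately simple stand-ins; with OpenCV available, `cv2.fastNlMeansDenoising` and CLAHE via `cv2.createCLAHE` are the more usual choices. The function names and percentile defaults are hypothetical.

```python
import numpy as np


def stretch_contrast(gray: np.ndarray, low_pct: float = 2.0, high_pct: float = 98.0) -> np.ndarray:
    """Linearly rescale intensities so the given percentiles map to 0 and 255."""
    lo, hi = np.percentile(gray, [low_pct, high_pct])
    if hi <= lo:
        return gray.copy()  # flat image: nothing to stretch
    scaled = (gray.astype(np.float32) - lo) / (hi - lo)
    return (np.clip(scaled, 0.0, 1.0) * 255).astype(np.uint8)


def box_blur3(gray: np.ndarray) -> np.ndarray:
    """3x3 mean filter with edge replication; a crude noise reducer."""
    padded = np.pad(gray.astype(np.float32), 1, mode="edge")
    h, w = gray.shape
    acc = np.zeros((h, w), dtype=np.float32)
    for dy in range(3):
        for dx in range(3):
            acc += padded[dy:dy + h, dx:dx + w]
    return (acc / 9.0).round().astype(np.uint8)


def preprocess_for_ocr(gray: np.ndarray) -> np.ndarray:
    # Denoise first so the stretch is not driven by outlier pixels.
    return stretch_contrast(box_blur3(gray))


# Low-contrast synthetic image: values confined to 100..140
rng = np.random.default_rng(0)
img = rng.integers(100, 141, size=(32, 32), dtype=np.uint8)
out = preprocess_for_ocr(img)
print(img.min(), img.max(), "->", out.min(), out.max())
```

The function expects a single-channel image, so in this pipeline it would run on each crop after grayscale conversion and before the OCR engine is called.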
Note: This pipeline handles general object detection and OCR. For specialized use cases (medical imaging, satellite imagery), fine-tune the detection model on domain-specific data.