Edge Deployment Computer Vision Services - Athena AI
Edge Deployment
Intelligence at the source. Zero cloud dependency.
Edge deployment means your AI models run where the cameras are — on a device bolted to the wall, mounted in a rack, or embedded in the machine itself. No video leaves your facility. No inference waits for a round-trip to a data centre. No uptime depends on someone else's internet connection.
Athena AI builds computer vision systems engineered from the ground up for edge hardware. Not cloud models squeezed onto a Jetson. Not demo-grade pipelines that fall apart under sustained load. Production-grade inference, optimised for the specific hardware you're deploying, tested against your actual environment before anything ships.
Most computer vision products run in a data centre. You send video to their servers, their servers run the models, their servers send you results. It's fast enough for asynchronous tasks. It's not fast enough for anything that needs to react in real time — a robot arm, a safety zone trigger, a tracking system that can't afford to drop frames.
Edge deployment means the model runs on a device physically at the scene. A Jetson module mounted above a conveyor belt. A compact server in an equipment rack at a sports venue. A hardened compute unit in a clinical corridor. The video never leaves. The inference happens locally. The result comes back in milliseconds, not hundreds of milliseconds.
That difference — 15ms versus 200ms — is the difference between a system that can actuate a sorting mechanism in time and one that can't. Between a patient fall alert that fires when it matters and one that arrives after the fact. Between a tracking system that runs at 60fps and one that stutters.
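To make the 15 ms versus 200 ms comparison concrete, the short calculation below converts latency into frames elapsed and into conveyor travel. It is an illustrative sketch only: the 0.5 m/s belt speed is an assumed value, not a figure from any deployment.

```python
def frames_elapsed(latency_ms: float, fps: float) -> float:
    """Camera frames that go by while one inference result is in flight."""
    return latency_ms * fps / 1000.0


def travel_mm(latency_ms: float, belt_speed_m_per_s: float) -> float:
    """Distance an object moves during the latency window, in millimetres.

    (m/s) * ms works out to millimetres, so no extra scaling is needed.
    """
    return belt_speed_m_per_s * latency_ms


if __name__ == "__main__":
    BELT_SPEED = 0.5  # m/s, assumed purely for illustration
    for latency in (15.0, 200.0):
        print(
            f"{latency:>5.0f} ms: "
            f"{frames_elapsed(latency, fps=60):.1f} frames at 60 fps, "
            f"{travel_mm(latency, BELT_SPEED):.1f} mm of belt travel"
        )
```

Under these assumptions a part moves 7.5 mm during a 15 ms inference and 100 mm during a 200 ms round-trip, which is why the slower path can miss a sorting gate entirely.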
The four things that make edge deployment hard
Running a model on a Jetson isn't the same as running it in the cloud. The hardware is more constrained. The environment is less controlled. The failure modes are different. Here's what makes it genuinely difficult — and what distinguishes a production deployment from a tutorial project.
1. Constrained compute demands different models.
A YOLOv11-L model that runs at 120 fps on an A100 runs at 8 fps on an Orin Nano without optimisation. The answer isn't always 'use a smaller model and accept worse accuracy.' The answer is model selection, quantisation, and pipeline architecture designed together. Done correctly, you can run a production-grade model at 30–60fps on a Nano with accuracy within 5% of the cloud equivalent. Done naively, you get a slow model or a fast inaccurate one.
2. Real environments aren't like benchmarks.
A model trained on a clean dataset and benchmarked on controlled footage will underperform in the field. Warehouse lighting shifts between day and night. Industrial environments have vibration, glare, and dust on lenses. Sports venues have motion blur, overlapping players, and variable lighting across the frame. Production edge deployment means training on your actual footage, benchmarking in your actual environment, and building pre-processing pipelines that handle the conditions your cameras actually see.
3. Thermal management under sustained load.
An edge device running inference continuously at full load generates heat. Without correct thermal design — active cooling, power mode selection, model duty-cycling — the device throttles under load, and your latency guarantees collapse. We test under sustained load, not burst load. Thermal behaviour at hour four is as important as performance at minute one.
Edge deployment is a hardware-software co-design problem. The choice of Jetson variant, the model architecture, the quantisation strategy, the pipeline topology, and the integration surface are all decisions that interact. Getting any one of them wrong produces a system that performs in a proof-of-concept and degrades under sustained production load.
This section covers the full stack: hardware selection, the inference optimisation pipeline, the six-stage CV pipeline as it runs on edge hardware, and the deployment topologies we use in production.
Hardware selection
The right hardware is determined by stream count, latency budget, power envelope, and integration requirements. The wrong hardware produces either an over-engineered system that costs more than it should or an under-provisioned system that throttles under real load.
| | Jetson Orin Nano | Jetson Orin NX | Jetson Orin AGX | On-prem GPU Server |
| --- | --- | --- | --- | --- |
| TOPS (INT8) | 40 | 100 | 275 | 320–2,000+ |
Deploying a model on a Jetson is day one. Keeping it accurate, thermally stable, secure, and up to date across a fleet of devices in real environments — that's the actual work. This section covers how Athena AI-built edge systems operate in production.
Fleet management and model updates
OTA model updates.
Models are versioned, signed, and deployed over-the-air to the device fleet via a secure update channel. Canary rollout: new model runs in shadow on a subset of devices, accuracy metrics compared against the production model, promotion gated on regression checks. Rollback is one command.
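The promotion gate in a canary rollout can be sketched as a comparison between the production model's metrics and the shadow model's metrics. The metric names and tolerances below are hypothetical placeholders; a real gate would use whatever regression checks the deployment defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelMetrics:
    map50: float           # detection mAP measured on the shadow-evaluation sample
    p99_latency_ms: float  # tail latency at sustained load


def should_promote(
    prod: ModelMetrics,
    canary: ModelMetrics,
    max_map_drop: float = 0.01,            # assumed tolerance
    max_latency_increase_ms: float = 2.0,  # assumed tolerance
) -> bool:
    """Promote only if the canary regresses on neither accuracy nor latency."""
    map_ok = canary.map50 >= prod.map50 - max_map_drop
    latency_ok = canary.p99_latency_ms <= prod.p99_latency_ms + max_latency_increase_ms
    return map_ok and latency_ok


if __name__ == "__main__":
    production = ModelMetrics(map50=0.90, p99_latency_ms=18.0)
    candidate = ModelMetrics(map50=0.905, p99_latency_ms=18.5)
    print("promote" if should_promote(production, candidate) else "hold")
```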
Air-gapped update path.
For fully air-gapped deployments, model updates are delivered as signed packages on physical media or via a controlled inbound channel (one-way diode). The device verifies the package signature before installation. No manual model file copying. No unsigned updates accepted.
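The verify-before-install step can be illustrated with the standard library alone. This sketch swaps in a shared-key HMAC-SHA256 tag as a stand-in for the signature. A production update channel would more likely use an asymmetric signature scheme so that devices hold only a public key; the function names here are hypothetical.

```python
import hashlib
import hmac


def sign_package(package: bytes, key: bytes) -> str:
    """Produce a hex HMAC-SHA256 tag over the package bytes (stand-in for a signature)."""
    return hmac.new(key, package, hashlib.sha256).hexdigest()


def verify_package(package: bytes, signature_hex: str, key: bytes) -> bool:
    """Return True only if the tag matches; compare in constant time."""
    expected = sign_package(package, key)
    return hmac.compare_digest(expected, signature_hex)


def install_if_valid(package: bytes, signature_hex: str, key: bytes) -> str:
    """Refuse to install anything whose tag does not verify."""
    if not verify_package(package, signature_hex, key):
        return "rejected"
    # Installation itself is out of scope for this sketch.
    return "accepted"


if __name__ == "__main__":
    key = b"demo-key-not-for-production"
    package = b"model-weights-v2"
    tag = sign_package(package, key)
    print(install_if_valid(package, tag, key))
    print(install_if_valid(package + b"tampered", tag, key))
```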
Fleet health monitoring.
Every device reports heartbeat, GPU utilisation, temperature, FPS, and current accuracy metrics at configurable intervals. A central monitoring service (self-hosted or VPC-hosted) aggregates these signals. Alerts fire on: device offline, GPU throttling detected, accuracy drift past threshold, temperature above safe operating range.
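The four alert conditions listed above can be expressed as a small rule evaluator over a device heartbeat. The field names and thresholds in this sketch are assumptions chosen for illustration, not the schema of any particular monitoring service.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Heartbeat:
    device_id: str
    age_s: float          # seconds since this heartbeat was received
    gpu_throttled: bool
    temperature_c: float
    fps: float
    accuracy: float       # sampled accuracy metric in the range 0..1


def evaluate(
    hb: Heartbeat,
    *,
    offline_after_s: float = 60.0,     # assumed
    max_temp_c: float = 85.0,          # assumed safe operating limit
    baseline_accuracy: float = 0.90,   # assumed
    max_accuracy_drop: float = 0.03,   # assumed
) -> List[str]:
    """Return the names of the alerts that should fire for this heartbeat."""
    if hb.age_s > offline_after_s:
        # The remaining fields are stale once the device has gone quiet.
        return ["device_offline"]
    alerts = []
    if hb.gpu_throttled:
        alerts.append("gpu_throttling")
    if baseline_accuracy - hb.accuracy > max_accuracy_drop:
        alerts.append("accuracy_drift")
    if hb.temperature_c > max_temp_c:
        alerts.append("temperature_high")
    return alerts


if __name__ == "__main__":
    hb = Heartbeat("edge-01", age_s=5, gpu_throttled=True,
                   temperature_c=88.0, fps=27.0, accuracy=0.91)
    print(evaluate(hb))
```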
Remote diagnostic access.
Secure SSH tunnel or VPN-based remote access for diagnostics and configuration, scoped by role. Air-gapped devices use an out-of-band management network. All access events logged to an immutable audit trail.
Thermal management
Thermal behaviour under sustained load is one of the most common causes of silent performance degradation in edge deployments. An Orin AGX running at full load in a warm industrial environment can throttle within 20 minutes without correct thermal design.
Hardware: active cooling (fan, heat sink) specified per environment. Industrial environments may require sealed enclosures with external heat exchangers.
Power mode selection: NVIDIA Jetson supports configurable power modes (15 W to 60 W on AGX). We select and lock the power mode at deployment, not at demo.
Thermal monitoring: on-device temperature sensors monitored continuously. Throttle events logged. Alert fires before performance degrades below SLA.
Load testing: every deployment is stress-tested at sustained full load for 4+ hours before production sign-off. We benchmark thermal steady-state, not burst performance.
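As a rough sketch of on-device thermal monitoring, the snippet below reads the standard Linux sysfs thermal zones, which report temperature in millidegrees Celsius, and flags any zone at or above a warning threshold. The 80 °C threshold is an assumption; the right value depends on the module and enclosure.

```python
from pathlib import Path
from typing import Dict, List


def parse_millidegrees(raw: str) -> float:
    """Convert a sysfs temperature string such as '45500' to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_zone_temps(root: str = "/sys/class/thermal") -> Dict[str, float]:
    """Read every thermal zone under `root`, keyed by the zone's type name."""
    temps: Dict[str, float] = {}
    for zone in sorted(Path(root).glob("thermal_zone*")):
        try:
            name = (zone / "type").read_text().strip()
            temps[name] = parse_millidegrees((zone / "temp").read_text())
        except (OSError, ValueError):
            # Some zones are unreadable or report non-numeric values; skip them.
            continue
    return temps


def thermal_alerts(temps: Dict[str, float], warn_c: float = 80.0) -> List[str]:
    """Names of zones at or above the warning threshold (assumed 80 C)."""
    return sorted(name for name, value in temps.items() if value >= warn_c)


if __name__ == "__main__":
    readings = read_zone_temps()
    print(readings)
    print("alerts:", thermal_alerts(readings))
```

Logging these readings throughout a multi-hour soak test, rather than sampling once, is what exposes the gap between burst and steady-state behaviour.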
Why edge over cloud
Cloud-based computer vision is the right choice for some workloads. It's the wrong choice for most operational deployments. Here's the honest comparison:
| | Cloud CV | Edge Deployment |
| --- | --- | --- |
| Latency | 80–300 ms round-trip. Unusable for real-time safety, robotics, or tracking. | 8–45 ms on-device. Viable for PLC integration, real-time tracking, worker safety. |
| Data sovereignty | Video transits third-party infrastructure. Problematic for healthcare, defence, regulated industries. | No video leaves your network unless you choose to send it. Compliant by architecture. |
| Cost structure | Per-frame or per-stream SaaS pricing. At 20+ cameras over 3 years, the math is punishing. | One-time hardware + integration. Costs amortise. No per-inference billing. |
| Uptime dependency | Your operation depends on the vendor's uptime, your ISP, and network path quality. | Runs air-gapped if needed. No external dependency. Keeps working when the internet doesn't. |
| Bandwidth | 16 cameras at 1080p30 = ~128 Mbps sustained. Costly and often infeasible. | Inference events compress to <1 KB/event. Only actionable data moves. |
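The bandwidth figures follow from simple arithmetic: 128 Mbps across 16 cameras implies about 8 Mbps per encoded stream. The sketch below reproduces that number and compares it with event-only traffic. The event rate of one event per camera per second is an assumption for illustration.

```python
def video_uplink_mbps(cameras: int, mbps_per_stream: float = 8.0) -> float:
    """Sustained uplink needed to ship every camera's encoded video off-site."""
    return cameras * mbps_per_stream


def event_uplink_mbps(events_per_s: float, bytes_per_event: int = 1024) -> float:
    """Uplink needed when only inference events leave the site."""
    return events_per_s * bytes_per_event * 8 / 1_000_000


if __name__ == "__main__":
    cameras = 16
    video = video_uplink_mbps(cameras)
    # Assumed rate: one 1 KB event per camera per second.
    events = event_uplink_mbps(events_per_s=cameras)
    print(f"video:  {video:.1f} Mbps")
    print(f"events: {events:.3f} Mbps")
    print(f"ratio:  about {video / events:.0f}x")
```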
What you actually get
| Outcome | What it means |
| --- | --- |
| Sub-20ms inference on Jetson Orin | Real-time tracking, safety-critical zone alerting, and PLC integration are viable. Cloud latency makes none of these possible. |
| No recurring cloud cost | Hardware amortises over 3–5 years. Per-stream SaaS pricing at 20+ cameras compounds to a number most operations haven't modelled. |
| Data stays on your hardware | No video, no frames, no metadata leaves your network unless you explicitly route it. Compliant with HIPAA, PHIPA, GDPR, and defence requirements by default. |
| Works without internet | Air-gapped operation for clinical, regulated, and remote environments. Model updates delivered via signed offline packages. |
| Models built for your hardware | We don't drop a generic model onto your device and call it done. TensorRT optimisation, quantisation, and model selection are engineered against your specific hardware and throughput requirement. |
Where this deploys
Manufacturing & Robotics. Visual inspection, worker safety zones, AGV navigation, assembly-line defect detection, PLC integration. Sub-50ms latency on Jetson Orin AGX. Deterministic enough for safety-critical actuation.
Retail & Commerce. Customer flow, dwell-time analytics, loss prevention, self-checkout verification. Low-power Jetson Orin Nano at the shelf or entrance. On-device anonymisation before any data moves.
Sports & Broadcast. Player and ball tracking, automated highlight detection, biomechanics. Multi-stream on Orin AGX or GPU server. 60fps trajectory smoothness without cloud round-trips.
Healthcare & Clinical. Patient monitoring, fall detection, equipment localisation. Air-gapped Orin AGX. HIPAA/PHIPA compliance by architecture — no video leaves the clinical environment.
Security & Surveillance. Perimeter intrusion, cross-camera tracking, audit-grade event logging. On-premise GPU server or Orin AGX cluster. Full pipeline behind a one-way network diode where required.
Logistics & Warehousing. Dock door monitoring, AGV safety, pallet detection, vehicle tracking. Orin NX or AGX depending on camera count. Integrates with WMS and TMS via event streaming.
Why Athena AI
Edge-first by design
Every system we build is architected for the hardware it will run on. We don't start with a cloud model and try to make it fit. We start with your hardware, your latency requirement, and your camera count, and engineer backward from there.
Proven on real edge deployments
The LEGO sorting system we built for Canada First Bricks runs entirely on NVIDIA Jetson Orin — 96.3% accuracy across 300+ part types, 3,600 parts per hour, sub-100ms end-to-end latency including PLC actuation. No cloud. No human in the loop. That's the standard we hold production deployments to.
Optimisation is the work, not the afterthought
Going from a 320 TOPS data-centre GPU to a Jetson Orin Nano (40 TOPS) is not an 8× slowdown if you architect for it. With correct TensorRT quantisation, model selection, and pipeline tuning, it's 1.2–2×. We do this work. We don't ask you to.
96.3% accuracy across 300+ part types. 3,600 parts/hr. < 100 ms end-to-end latency. Zero cloud dependency.
AnglerVision
Multi-stream tracking and classification pipeline rebuilt for edge-optimised inference. TensorRT quantisation across all camera streams.
47% reduction in false-positive detections. 90% classification accuracy. Runs on-device without cloud egress.
Mirror Vision
Multi-camera pose tracking and swing analytics on edge hardware. Rebuilt model pipeline and video ingestion for sub-100ms latency per frame.
Real-time coaching feedback. Full rebuild of inference pipeline for edge deployment.
Ready to see what this looks like on your hardware?
A discovery call is a one-hour technical conversation against your actual environment — your hardware, your camera count, your latency requirement, your integration surface. We don't pitch. We benchmark. By the end you know whether edge deployment is the right architecture for your operation, and we know whether we can deliver against your requirements.
Most edge deployments don't just show results — they trigger something. A PLC command, a conveyor stop, a door unlock, an alert to a nurse station. The interface between the vision pipeline and the physical system is where most edge projects fail quietly. The model works. The output doesn't reach the actuator in time, or in the right format, or with the right reliability guarantees. We engineer the integration as a first-class part of the system, not an afterthought.
What deployment actually looks like
Discovery against your real environment. We review your hardware, camera layout, latency requirement, and integration surface. We benchmark against video from your environment — not demo footage — before anything is built.
Hardware selection and system design. We specify the right Jetson variant or on-prem configuration for your stream count and latency budget. Model selection and pipeline architecture happen here.
Model training and optimisation. Models trained on your data, optimised with TensorRT for your target hardware, benchmarked against your accuracy threshold. You see real numbers before production commitment.
Integration. Connecting inference output to your existing systems — PLCs, ERPs, alerting platforms, dashboards. The integration surface is scoped and tested before go-live.
Production and ongoing operations. Monitoring, thermal management, drift detection, model updates via signed packages. The system is instrumented from day one.
For the engineering detail, continue to [Architecture →]. For the operational reality — monitoring, updates, security — see [Operations →].
| | Jetson Orin Nano | Jetson Orin NX | Jetson Orin AGX | On-prem GPU Server |
| --- | --- | --- | --- | --- |
| Typical use | Single-stream, low-power edge | 2–4 stream edge | 4–8 stream edge node | 16–100+ stream aggregation |
| Power envelope | 7–15 W | 10–25 W | 15–60 W | 200–500 W |
| Data sovereignty | Full (on-device) | Full (on-device) | Full (on-device) | Full (on-prem) |
| Cloud dependency | None | None | None | None |
| Best for | Retail, smart sensors, robotics | Manufacturing, logistics, sports | Multi-camera, industrial, clinical | Large venue, multi-site aggregation |
Latency reference: cloud vs. on-prem vs. edge
| Capability | Cloud (typical round-trip) | On-prem server | Jetson Orin AGX (edge) |
| --- | --- | --- | --- |
| Object detection (1080p30) | 80–300 ms | 12–25 ms | 8–18 ms |
| Real-time tracking (4 streams) | Not viable (bandwidth) | 30–60 ms | 20–45 ms |
| Image classification (single frame) | 50–200 ms | 5–12 ms | 4–10 ms |
| Event detection (zone trigger) | 100–500 ms | 15–30 ms | 10–20 ms |
| Document extraction (per page) | 200–800 ms | 500 ms–2 s | Not typical use case |
The inference optimisation pipeline
A model that runs at 30fps in PyTorch may run at 90fps after TensorRT optimisation on the same hardware. The optimisation pipeline is not optional — it's what makes edge deployment viable at production throughput. Here's what we apply, in order:
| Technique | What it does | Typical gain |
| --- | --- | --- |
| INT8 quantisation | Further halves weight size; requires calibration dataset | 2–3× |
| Model pruning | Removes near-zero weights; reduces compute per inference | 1.3–1.8× |
| Layer fusion (TensorRT) | Combines consecutive ops into single kernel; reduces memory bandwidth | 1.2–1.5× |
| CUDA stream pinning | Locks inference to specific GPU compute engine; eliminates scheduling latency | Latency -10–20% |
| Batching (async pipeline) | Amortises model load across multiple frames; improves GPU utilisation | 1.5–2× at ≥4 streams |
| Model size selection (n/s/m/l) | Choosing n/s variant vs m/l trades -5–10% mAP for 2–3× throughput | 2–3× (with accuracy trade) |
Combined effect: a YOLOv11-m model running at FP32 in PyTorch on an Orin AGX achieves approximately 35fps on a 1080p stream. After TensorRT FP16 + layer fusion + CUDA stream pinning, the same model achieves 90–110fps. The accuracy delta is less than 1% mAP on your calibration set.
We don't apply optimisation uniformly — each technique has a calibration cost and a latency/accuracy tradeoff. We benchmark each step against your footage and your accuracy threshold before committing to the production configuration.
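Benchmarking each step comes down to timing the same inference callable before and after a change and comparing the latency distribution rather than a single average. The harness below is a minimal sketch. `infer` stands in for whatever callable wraps the model, and the nearest-rank percentile is one simple choice among several.

```python
import math
import statistics
import time
from typing import Callable, Dict, List, Sequence


def percentile(sorted_values: Sequence[float], q: int) -> float:
    """Nearest-rank percentile of an already sorted sequence (q in 1..100)."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarise(latencies_ms: List[float]) -> Dict[str, float]:
    """Reduce per-frame latencies to the figures a latency SLA is usually written against."""
    ordered = sorted(latencies_ms)
    mean = statistics.fmean(ordered)
    return {
        "p50_ms": percentile(ordered, 50),
        "p99_ms": percentile(ordered, 99),
        "mean_ms": mean,
        "fps": 1000.0 / mean if mean > 0 else float("inf"),
    }


def benchmark(infer: Callable, frames: Sequence, warmup: int = 10) -> Dict[str, float]:
    """Time `infer` on every frame after a warm-up pass that is not recorded."""
    for frame in frames[:warmup]:
        infer(frame)
    latencies_ms = []
    for frame in frames:
        start = time.perf_counter()
        infer(frame)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return summarise(latencies_ms)


if __name__ == "__main__":
    # Placeholder workload; replace with a call into the real inference engine.
    print(benchmark(lambda frame: sum(range(10_000)), frames=list(range(200))))
```

Running the same harness for hours instead of seconds is what turns it from a burst benchmark into a sustained-load one.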
The CV pipeline on edge hardware
The same six-stage pipeline from our Real-Time Tracking architecture applies here — Detection, Association, Motion Modeling, Re-ID, Cross-camera Handoff, Sensor Fusion — with edge-specific constraints at each layer.
Detection (Layer 1)
Default: YOLOv11 variant selected by hardware budget and throughput target. On Orin Nano: YOLOv11-n/s at INT8, 40–60fps. On Orin AGX: YOLOv11-m/l at FP16, 60–110fps. RT-DETR for transformer-based scenes where global context matters and the hardware budget allows (Orin AGX only — too expensive for Nano/NX).
Edge-specific constraint: convolutional detectors (YOLO) are more edge-friendly than transformer-based detectors at equivalent accuracy because their compute graph is more amenable to TensorRT fusion. We choose transformer architectures only when scene density justifies the cost.
Association (Layer 2)
Default: ByteTrack — correct choice for edge because it handles partial occlusions without the re-ID embedding extraction cost that DeepSORT requires. On constrained hardware, removing embedding extraction at the association stage doubles throughput.
When we substitute: StrongSORT or BoT-SORT when re-ID quality justifies the cost (available on AGX, not recommended on Nano). OC-SORT for sports and non-linear motion on AGX.
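The reason ByteTrack avoids embedding extraction is its two-stage association: high-confidence detections are matched to tracks first, then low-confidence detections get a second chance against the tracks still unmatched. The sketch below shows that idea only. It uses greedy IoU matching instead of the Hungarian assignment and matches against raw track boxes rather than Kalman-predicted ones. All thresholds are assumed values.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks: Dict[int, Box], dets: Dict[int, Box], min_iou: float):
    """Pair tracks and detections in descending IoU order, one-to-one."""
    pairs = sorted(
        ((iou(tb, db), tid, di) for tid, tb in tracks.items() for di, db in dets.items()),
        reverse=True,
    )
    matches, used_tracks, used_dets = [], set(), set()
    for score, tid, di in pairs:
        if score < min_iou:
            break
        if tid in used_tracks or di in used_dets:
            continue
        matches.append((tid, di))
        used_tracks.add(tid)
        used_dets.add(di)
    return matches


def associate(tracks: Dict[int, Box], detections: List[Tuple[Box, float]],
              high_thresh: float = 0.5, low_thresh: float = 0.1, min_iou: float = 0.3):
    """Return (matches, new_track_detections, unmatched_tracks)."""
    high = {i: box for i, (box, s) in enumerate(detections) if s >= high_thresh}
    low = {i: box for i, (box, s) in enumerate(detections) if low_thresh <= s < high_thresh}

    # Stage 1: confident detections against all tracks.
    first = greedy_match(tracks, high, min_iou)
    matched = {tid for tid, _ in first}
    remaining = {tid: box for tid, box in tracks.items() if tid not in matched}

    # Stage 2: low-confidence detections (often partial occlusions) against leftovers.
    second = greedy_match(remaining, low, min_iou)
    rescued = {tid for tid, _ in second}

    matched_dets = {di for _, di in first}
    new_tracks = [i for i in high if i not in matched_dets]  # only confident boxes spawn tracks
    lost = [tid for tid in remaining if tid not in rescued]
    return first + second, new_tracks, lost


if __name__ == "__main__":
    tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
    detections = [((1, 1, 11, 11), 0.9), ((21, 21, 31, 31), 0.2), ((50, 50, 60, 60), 0.8)]
    print(associate(tracks, detections))
```

In this example track 2 is kept alive by a detection scored 0.2, which a single-threshold tracker would have discarded.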
Motion modeling (Layer 3)
Default: Kalman filter with constant-velocity assumption, tuned per class. Lightweight enough to run on any Jetson variant without GPU allocation — runs on the ARM CPU cores while the GPU handles detection.
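A constant-velocity Kalman filter is small enough to write out in full. The sketch below tracks one coordinate with a position and velocity state in plain Python. A tracker would run one such filter per axis (or a joint state per box). The noise variances are assumed values standing in for the per-class tuning mentioned above, and the process noise is a simplified diagonal form.

```python
class ConstantVelocityKalman1D:
    """Kalman filter with state [position, velocity] and position-only measurements."""

    def __init__(self, position, velocity=0.0, pos_var=1.0, vel_var=10.0,
                 process_var=0.01, meas_var=1.0):
        self.p = position
        self.v = velocity
        self.P = [[pos_var, 0.0], [0.0, vel_var]]  # state covariance
        self.q = process_var                        # simplified diagonal process noise
        self.r = meas_var                           # measurement noise variance

    def predict(self, dt=1.0):
        """Advance the state by dt: position moves by v*dt, velocity is unchanged."""
        (a, b), (c, d) = self.P
        self.p += self.v * dt
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]]
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]
        return self.p

    def update(self, z):
        """Correct the state with a measured position z."""
        (a, b), (c, d) = self.P
        s = a + self.r              # innovation variance
        k0, k1 = a / s, c / s       # Kalman gain for position and velocity
        residual = z - self.p
        self.p += k0 * residual
        self.v += k1 * residual
        # P <- (I - K H) P with H = [1, 0]
        self.P = [
            [(1 - k0) * a, (1 - k0) * b],
            [c - k1 * a, d - k1 * b],
        ]
        return self.p


if __name__ == "__main__":
    kf = ConstantVelocityKalman1D(position=0.0)
    for t in range(1, 11):
        kf.predict(dt=1.0)
        kf.update(2.0 * t)  # object moving 2 units per frame
    print(f"position={kf.p:.2f} velocity={kf.v:.2f}")
```

The arithmetic is a handful of multiplications per track per frame, which is why it fits comfortably on the CPU cores.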
Re-ID and cross-camera handoff (Layers 4–5)
On edge hardware, re-ID embedding extraction is the most expensive optional operation in the pipeline. On Orin Nano, we disable it for single-camera deployments entirely — the throughput gain outweighs the accuracy cost in most scenes. On Orin AGX, we run OSNet embeddings at a reduced frame rate (every 5th frame) to maintain re-ID capability at production throughput.
Cross-camera handoff on edge hardware uses homography-based handoff where cameras share a calibrated ground plane (cheaper, deterministic) and feature-based re-ID only where it's architecturally necessary and the hardware budget permits.
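Homography-based handoff amounts to projecting each detection's foot point onto the shared ground plane and matching it to the nearest track already known there. The sketch below assumes a pre-calibrated 3×3 homography per camera. The matrix values, the 0.5 m matching radius, and the function names are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]


def apply_homography(H: Matrix3, x: float, y: float) -> Tuple[float, float]:
    """Map an image point (pixels) to ground-plane coordinates via homogeneous division."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    return u / w, v / w


def nearest_track(point: Tuple[float, float],
                  candidates: Dict[int, Tuple[float, float]],
                  max_dist_m: float = 0.5) -> Optional[int]:
    """Closest existing ground-plane track within max_dist_m, or None."""
    best_id, best_dist = None, max_dist_m
    for track_id, (gx, gy) in candidates.items():
        dist = math.hypot(point[0] - gx, point[1] - gy)
        if dist <= best_dist:
            best_id, best_dist = track_id, dist
    return best_id


if __name__ == "__main__":
    # Toy calibration: 1 pixel = 1 cm, no perspective. Real matrices come from calibration.
    H_cam_b = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 1.0]]
    ground_point = apply_homography(H_cam_b, 500, 300)
    tracks_from_cam_a = {7: (5.1, 3.0), 9: (9.0, 9.0)}
    print(ground_point, "->", nearest_track(ground_point, tracks_from_cam_a))
```

Because the mapping is pure geometry, the same input always produces the same handoff decision, with no embedding network in the loop.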
Sensor fusion (Layer 6)
BLE / BLE AoA, UWB, RFID, LiDAR, and PLC sensor signals can all be fused on-device. Late fusion (separate trackers, combine outputs) is the standard choice on edge — it's modular and doesn't require tight sensor synchronisation. Tight fusion is available on AGX for deployments where sensor timing is well-controlled and marginal accuracy improvement justifies the engineering cost.
Deployment topologies
Standalone edge node
Single Jetson AGX or NX processing all cameras locally. Output (events, tracks, alerts) delivered via MQTT or REST to downstream systems. Appropriate for single-site deployments up to 8 cameras. Zero cloud dependency. Full data sovereignty.
Edge node + on-prem aggregation
Multiple edge nodes (one per zone or per building) feed a central on-prem GPU server that aggregates events, maintains a global track database, and handles cross-camera re-ID. Each edge node processes its cameras independently; the aggregation layer handles identity continuity across nodes. Right for multi-zone facilities with 8–50+ cameras.
Hybrid (edge + VPC)
Edge nodes handle inference and event generation. A VPC-hosted aggregation layer stores events, runs analytics, and serves dashboards. Video never leaves the site. Only compressed event data transits to the VPC. Bandwidth: under 1 KB per event versus roughly 8 Mbps (about 1 MB/s) per encoded 1080p30 video stream. At around one event per second, that is about three orders of magnitude less data.
Air-gapped
Full pipeline behind a one-way network diode. No outbound connectivity. Model updates delivered as signed packages on physical media or via a controlled inbound channel. Right for clinical environments, defence, and any regulated facility where no outbound network is permitted.
One architecture, six operational profiles
The hardware and optimisation choices differ by vertical. The underlying CV pipeline is the same system.
| Vertical | Typical Hardware | Stream Count | Latency Requirement | Key Constraint |
| --- | --- | --- | --- | --- |
| Manufacturing & Robotics | Orin AGX / NX | 2–8 | < 50 ms (safety-critical) | Deterministic latency, PLC integration |
| Retail & Commerce | Orin Nano / NX | 1–4 | < 200 ms | Low-power, always-on, anonymisation |
| Sports & Broadcast | Orin AGX / GPU server | 4–16 | < 100 ms | 60 fps trajectory smoothness |
| Healthcare & Clinical | Orin AGX (air-gapped) | 2–6 | < 500 ms | HIPAA/PHIPA, zero cloud egress |
| Security & Surveillance | Orin AGX / GPU server | 8–32 | < 500 ms | Cross-camera re-ID, audit trail |
| Logistics & Warehousing | Orin NX / AGX | 2–8 | < 100 ms | AGV integration, zone safety |
If an environment requires sealed enclosures, we specify the enclosure and thermal solution during architecture review — not after the device is installed.
MLOps and drift handling
Drift detection
Per-device accuracy is monitored continuously via sampled frame evaluation. Alerts fire when mAP, IDF1, or ID-switch rate drifts past its threshold. The alert fires before the customer notices, not after two weeks of silent degradation. Common drift causes on edge: lens fouling, repositioned cameras, seasonal lighting changes, new object classes entering the scene.
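A threshold-based drift alert can be sketched as a rolling mean compared against a baseline recorded at sign-off. The window size, baseline, and tolerance below are assumed values, and the same structure works for any metric where lower is worse.

```python
from collections import deque


class DriftMonitor:
    """Fires when the rolling mean of a metric falls too far below its baseline."""

    def __init__(self, baseline: float, max_drop: float, window: int = 20):
        self.baseline = baseline
        self.max_drop = max_drop
        self.samples = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        """Record one sampled evaluation; return True if a drift alert should fire."""
        self.samples.append(value)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough evidence yet
        rolling_mean = sum(self.samples) / len(self.samples)
        return (self.baseline - rolling_mean) > self.max_drop


if __name__ == "__main__":
    monitor = DriftMonitor(baseline=0.90, max_drop=0.03, window=5)
    # Simulated sampled mAP: stable at first, then degrading (e.g. a fouled lens).
    for day, sampled_map in enumerate([0.90, 0.91, 0.89, 0.90, 0.90, 0.84, 0.82, 0.80], start=1):
        if monitor.observe(sampled_map):
            print(f"day {day}: drift alert")
```

Averaging over a window keeps a single bad sample from paging anyone, at the cost of reacting a few samples later.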
Retraining pipeline
Drift alerts trigger a retraining cycle. New training data is sourced from the device (active learning: model surfaces low-confidence frames for human labelling) or from customer-side annotation. Retrained models pass regression checks before promotion. The retraining pipeline is part of the initial deployment scope — not something added later when things go wrong.
Observability
Per-device dashboards: FPS, GPU utilisation, temperature, detection count, track count, ID-switch rate, confidence score distribution, drift alert history. Metrics exported to Prometheus, Datadog, or OpenTelemetry. Logs structured for ingestion into your existing observability stack.
Integration surface
Event streaming: MQTT, Kafka, or NATS. Per-track lifecycle events (track_start, track_update, track_lost, zone_enter, zone_exit). JSON or Protobuf schemas. Configurable event filtering — push only meaningful events, not every detection.
Synchronous query: REST and gRPC for current track state, historical trajectory retrieval, and ad-hoc queries.
PLC integration: Modbus RTU/TCP for real-time actuation signals. Deterministic output path tested end-to-end under sustained load before production sign-off.
Downstream sinks: SIEM (Splunk, Sentinel, Elastic), data warehouses (Snowflake, BigQuery, on-prem Postgres / ClickHouse), BI tools (Grafana, Superset).
SDKs: Python, TypeScript, C++. OpenAPI spec, gRPC service definitions, and example clients shipped with every deployment.
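A per-track lifecycle event with filtering might look like the sketch below. The event type names come from the list above. The JSON field names, the default filter, and the helper functions are hypothetical and not the shipped schema. Publishing over MQTT, Kafka, or NATS is left out so the example stays self-contained.

```python
import json
import time
from typing import Optional

LIFECYCLE_EVENTS = {"track_start", "track_update", "track_lost", "zone_enter", "zone_exit"}

# Assumed default filter: drop the high-volume per-frame updates.
DEFAULT_ALLOWED = frozenset({"track_start", "track_lost", "zone_enter", "zone_exit"})


def make_event(event_type: str, track_id: int, camera_id: str,
               zone: Optional[str] = None, timestamp: Optional[float] = None) -> dict:
    """Build one lifecycle event as a plain dict."""
    if event_type not in LIFECYCLE_EVENTS:
        raise ValueError(f"unknown event type: {event_type}")
    event = {
        "type": event_type,
        "track_id": track_id,
        "camera_id": camera_id,
        "ts": time.time() if timestamp is None else timestamp,
    }
    if zone is not None:
        event["zone"] = zone
    return event


def should_publish(event: dict, allowed=DEFAULT_ALLOWED) -> bool:
    """Event filtering: push only the event types a consumer asked for."""
    return event["type"] in allowed


def encode(event: dict) -> bytes:
    """Compact JSON encoding suitable as a message payload."""
    return json.dumps(event, separators=(",", ":")).encode("utf-8")


if __name__ == "__main__":
    event = make_event("zone_enter", track_id=42, camera_id="cam-03", zone="dock-2")
    if should_publish(event):
        payload = encode(event)
        print(len(payload), "bytes:", payload.decode())
```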
Security and data architecture
Data residency: all inference on customer hardware. No video leaves the device unless you explicitly configure egress.
Encrypted storage: model weights and event logs encrypted at rest (AES-256, customer-managed keys).
Encrypted communications: mTLS between edge device and aggregation layer. TLS 1.3 for all external API endpoints.
Model signing: every model update cryptographically signed. Devices reject unsigned updates.
RBAC: role-based access to device management console, event data, and model update pipeline.
Audit logging: every model update, every remote access event, every configuration change logged to an immutable store.
Air-gapped operation: full pipeline with no outbound network. Update channel via signed packages only.
Compliance frameworks: HIPAA, PHIPA, GDPR, SOC 2. Specific control mappings available during architecture review.
Build vs buy
What a 6-month internal build looks like
A capable ML team can run a YOLOv11 model on a Jetson and get reasonable results in two weeks. TensorRT optimisation, thermal management, OTA update infrastructure, fleet monitoring, drift detection, and PLC integration are the other 22 weeks. Hidden costs: CUDA expertise, TensorRT calibration, the operational discipline to manage a device fleet in the field.
Where building makes sense
When edge computer vision is a core product differentiator — a robotics OEM, a sports analytics platform, an industrial automation company — and you intend to invest in a permanent embedded CV team. The build amortises.
Where buying makes sense
When edge vision is an enabling capability for a broader product. The build does not amortise, and the ongoing operational cost of maintaining a production edge fleet in-house exceeds the engagement cost of a specialised partner. We've also been hired to take over internal builds that stalled at the optimisation and operations layer.
What we won't do
Deploy face recognition in jurisdictions without explicit legal basis.
Ship a model we haven't benchmarked on your hardware at sustained production load.
Quote latency numbers we haven't measured on your device with your footage.
Lock you into proprietary device management software or model formats — you own the hardware, you own the models, you own the weights.
Ignore thermal management and call it 'out of scope.'
Engagement model
Engagements start with a paid technical discovery: 2–4 weeks against your real hardware, your real footage, your real integration surface. By the end you have a hardware specification, a benchmarked accuracy and latency baseline, a thermal assessment, and a go/no-go decision based on actual numbers.
Production engagements run on milestone-based contracts with defined acceptance criteria: accuracy threshold, latency SLA (p99 at sustained load), and thermal steady-state confirmation. Ongoing operations (fleet monitoring, OTA updates, retraining, on-call) run as a separate retainer scoped to your device count and SLA requirements.
[Request Architecture Review →] [Book a Discovery Call →]