AutoNeural

Overview

AutoNeural is a next-generation, NPU-native multimodal vision–language model co-designed from the ground up for real-time, on-device inference. Instead of adapting GPU-first architectures, AutoNeural redesigns both vision encoding and language modeling around the constraints and capabilities of NPUs, achieving 14× lower latency, 7× lower quantization error, and real-time automotive performance even under aggressive low-precision settings.

AutoNeural integrates:

  • A MobileNetV5-based vision encoder with depthwise separable convolutions.
  • A Liquid AI hybrid Transformer-SSM language backbone that dramatically reduces KV-cache overhead.
  • A normalization-free MLP connector tailored for quantization stability.
  • Mixed-precision W8A16 (vision) and W4A16 (language) inference validated on real Qualcomm NPUs.

AutoNeural powers real-time cockpit intelligence, including in-cabin safety, out-of-cabin awareness, HMI understanding, and visual and conversational function calling, as demonstrated in the on-device results (Page 6 figure).
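The W8A16 and W4A16 labels above describe weight/activation bit-widths: 8-bit or 4-bit weights paired with 16-bit activations. As a rough illustration of what weight-only low-bit quantization means, here is a generic PyTorch sketch with made-up tensor sizes; it is not the actual Nexa/Qualcomm quantization pipeline:

```python
import torch

def quantize_weights_int4(weight: torch.Tensor):
    """Generic per-output-channel symmetric 4-bit weight quantization (illustrative only)."""
    # Scale each output channel so its max magnitude maps to the 4-bit limit of 7.
    max_abs = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scale = max_abs / 7.0
    q = torch.clamp(torch.round(weight / scale), -8, 7).to(torch.int8)  # int4 values in int8 storage
    return q, scale.to(torch.float16)

def w4a16_linear(x_fp16: torch.Tensor, q: torch.Tensor, scale: torch.Tensor):
    """W4A16-style linear layer: 4-bit weights dequantized on the fly, 16-bit activations."""
    w = q.to(torch.float16) * scale                      # dequantize weights to fp16
    return (x_fp16.float() @ w.float().t()).half()       # fp32 matmul here only for CPU portability

# Toy usage with hypothetical shapes
q, s = quantize_weights_int4(torch.randn(256, 512))      # [out_features, in_features]
y = w4a16_linear(torch.randn(4, 512, dtype=torch.float16), q, s)
```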


Key Features

🔍 MobileNetV5 Vision Encoder (300M)

Optimized for edge hardware, with:

  • Depthwise separable convolutions for low compute and bounded activations.
  • Local attention bottlenecks only in late stages for efficient long-range reasoning.
  • Multi-Scale Fusion Adapter (MSFA) producing a compact 16×16×2048 feature map.
  • Stable INT8/16 behavior with minimal post-quantization degradation.

This yields 5.8×–14× speedups over ViT baselines across 256–768 px inputs.
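For readers unfamiliar with the building block listed above, a depthwise separable convolution factors a dense convolution into a cheap per-channel convolution plus a 1×1 pointwise mix. A minimal PyTorch sketch (generic, not the exact AutoNeural/MobileNetV5 layer definition) looks like this:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Generic depthwise separable block: per-channel 3x3 conv + 1x1 pointwise conv.
    ReLU6 is used here as an example of a bounded, quantization-friendly activation."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.act = nn.ReLU6()

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

# Toy usage: one stride-2 block on a 768x768 RGB input
out = DepthwiseSeparableConv(3, 32, stride=2)(torch.randn(1, 3, 768, 768))  # -> [1, 32, 384, 384]
```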


🧠 Hybrid Transformer-SSM Language Backbone (1.2B)

Designed for NPU memory hierarchies:

  • 5:1 ratio of SSM layers to Transformer attention layers
  • Linear-time gated convolution layers for most steps
  • Tiny rolling state instead of KV-cache → up to 60% lower memory bandwidth
  • W4A16 stable quantization across layers
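The memory-bandwidth argument behind the hybrid design can be seen with a back-of-the-envelope comparison. The dimensions below are illustrative assumptions, not the published AutoNeural configuration:

```python
# Rough KV-cache vs. rolling-state memory comparison (assumed, illustrative dimensions).
layers        = 24      # assumed decoder depth
attn_layers   = 4       # with a ~5:1 SSM:attention ratio, only a few layers keep a KV-cache
hidden        = 2048    # assumed model width
context       = 4096    # tokens of context
bytes_per_val = 2       # 16-bit storage

# Full-attention baseline: every layer stores K and V for every past token.
kv_cache = layers * 2 * context * hidden * bytes_per_val

# Hybrid: only the attention layers keep a KV-cache; SSM layers carry a tiny fixed state.
ssm_state = (layers - attn_layers) * hidden * 16 * bytes_per_val   # assumed per-layer state size
hybrid    = attn_layers * 2 * context * hidden * bytes_per_val + ssm_state

print(f"full attention: {kv_cache / 1e6:.0f} MB, hybrid: {hybrid / 1e6:.0f} MB")
```

Because the decoder streams this cached state through memory at every decoding step, shrinking it directly reduces bandwidth pressure and speeds up decoding.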

🔗 Normalization-Free Vision–Language Connector

A compact 2-layer MLP using SiLU, deliberately removing RMSNorm to avoid unstable activation ranges during static quantization.

Ensures reliable deployment on W8A16/W4A16 pipelines.
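A minimal sketch of such a connector is shown below. The 2048-dimensional input matches the MSFA feature map described above; the language-model width and other details are assumptions for illustration:

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Normalization-free two-layer MLP projector with SiLU, in the spirit described above.
    Omitting RMSNorm keeps activation ranges predictable under static quantization."""
    def __init__(self, vision_dim: int = 2048, lm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.SiLU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, vision_feats):
        # vision_feats: [batch, 16*16, vision_dim] -- the flattened MSFA feature map
        return self.proj(vision_feats)

# Toy usage: a 16x16 grid of 2048-d features becomes 256 language-model input embeddings
tokens = VisionLanguageConnector()(torch.randn(1, 256, 2048))
```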


🚗 Automotive-Grade Multimodal Intelligence

Trained on 10M Infinity-MM samples plus 200k automotive cockpit samples, covering:

  • AI Sentinel (vehicle security)
  • AI Greeter (identity recognition)
  • Car Finder (parking localization)
  • Passenger safety monitoring

Ensures robust performance across lighting, demographics, weather, and motion scenarios.


Real NPU Benchmarks

Validated on Qualcomm SA8295P NPU:

| Metric | Baseline (InternVL 2B) | AutoNeural-VL |
|---|---|---|
| Time to First Token (TTFT) | ~1.4 s | ~100 ms |
| Max Vision Resolution | 448×448 | 768×768 |
| RMS Quantization Error | 3.98% | 0.56% |
| Decode Throughput | 15 tok/s | 44 tok/s |
| Context Length | 1024 tokens | 4096 tokens |

How to Use

⚠️ Hardware requirement: AutoNeural is optimized for Qualcomm NPUs.

1) Install Nexa-SDK

Download the SDK and follow the installation steps provided on the model page.


2) Configure authentication

Create an access token in the Model Hub, then run:

nexa config set license '<access_token>'

3) Run the model

nexa infer NexaAI/AutoNeural

Image input

Drag and drop one or more image files into the terminal window. Multiple images can be processed with a single query.


License

The AutoNeural model is released under the Creative Commons Attribution–NonCommercial 4.0 (CC BY-NC 4.0) license.

You may:

  • Use the model for non-commercial purposes
  • Modify and redistribute it with attribution

For commercial licensing, please contact: [email protected]