FairSteer BAD Classifier (Secure)

Biased Activation Detection (BAD) classifier optimized for TinyLlama-1.1B. This model detects whether an LLM's internal activations indicate biased reasoning.

This repository contains only SafeTensors weights for security.

Model Details

  • Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
  • Target Layer: 14
  • Architecture: Linear probe (Dropout -> Linear); see the sketch below this list
  • Performance: 67.90% Balanced Accuracy
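
A minimal sketch of what the Dropout -> Linear probe described above could look like in PyTorch. The dropout rate, the single-logit output, and the class/attribute names are assumptions (check config.json for the actual values); the 2048-dimensional input matches TinyLlama-1.1B's hidden size.

```python
import torch
import torch.nn as nn

class BADProbe(nn.Module):
    """Linear probe over a single hidden-state vector: Dropout -> Linear."""

    def __init__(self, hidden_size: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)       # dropout rate is an assumption
        self.linear = nn.Linear(hidden_size, 1)  # single logit: biased vs. unbiased (assumed)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # activations: layer-14 hidden state, shape (batch, hidden_size)
        return self.linear(self.dropout(activations)).squeeze(-1)
```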

Artifacts

  • model.safetensors: Weights (SafeTensors only)
  • scaler.pkl: Fitted StandardScaler (required for inference preprocessing; see the loading sketch below this list)
  • config.json: Architecture configuration
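
A loading sketch for the artifacts listed above, assuming scaler.pkl is a plain pickle of a fitted scikit-learn StandardScaler; the key names and shapes inside model.safetensors are whatever the training code produced, so inspect them against config.json before wiring up the probe.

```python
import pickle
from safetensors.torch import load_file

# SafeTensors-only weights for the probe
state_dict = load_file("model.safetensors")
print({name: tuple(t.shape) for name, t in state_dict.items()})

# StandardScaler used to preprocess activations at training time
with open("scaler.pkl", "rb") as f:
    scaler = pickle.load(f)
```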

Usage (FairSteer)

This model is designed to be loaded through the FairSteer inference pipeline.
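
As a rough standalone approximation (not the actual FairSteer pipeline), the sketch below extracts the layer-14 activation from TinyLlama, applies the StandardScaler, and scores it with the probe. It reuses the hypothetical `BADProbe`, `state_dict`, and `scaler` from the sketches above; last-token pooling and the `hidden_states` indexing convention are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tok = AutoTokenizer.from_pretrained(model_id)
lm = AutoModelForCausalLM.from_pretrained(model_id, output_hidden_states=True).eval()

probe = BADProbe()                  # class from the architecture sketch above
probe.load_state_dict(state_dict)   # assumes key names match the sketch
probe.eval()

prompt = "Question: Who is more likely to be a nurse, a man or a woman? Answer:"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = lm(**inputs)

# hidden_states[0] is the embedding output, so index 14 is taken here as layer 14
act = out.hidden_states[14][:, -1, :]                       # (1, 2048) last-token activation
act = torch.from_numpy(scaler.transform(act.numpy())).float()

with torch.no_grad():
    p_biased = torch.sigmoid(probe(act)).item()
print(f"P(biased activation) = {p_biased:.3f}")
```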
