LChambon committed
Commit e4c8837 · 1 Parent(s): 6e54552

initial commit
Files changed (10)
  1. .gitignore +3 -0
  2. DEPLOYMENT.md +139 -0
  3. README.md +76 -6
  4. README_SPACE.md +84 -0
  5. app.py +251 -0
  6. deploy_to_hf.sh +109 -0
  7. requirements.txt +22 -0
  8. src/backbone/vit_wrapper.py +180 -0
  9. utils/training.py +231 -0
  10. utils/visualization.py +190 -0
.gitignore ADDED
@@ -0,0 +1,3 @@
+ __pycache__
+ *.pyc
+ .env
DEPLOYMENT.md ADDED
@@ -0,0 +1,139 @@
+ # Deploying NAF Demo to Hugging Face Spaces
+
+ ## Quick Setup
+
+ ### 1. Create a Hugging Face Space
+
+ 1. Go to [https://huggingface.co/spaces](https://huggingface.co/spaces)
+ 2. Click "Create new Space"
+ 3. Configure your Space:
+    - **Space name**: `naf-feature-upsampling` (or your choice)
+    - **License**: Apache 2.0 (or your choice)
+    - **Select the SDK**: Gradio
+    - **Space hardware**: CPU Basic (free) or GPU (T4 small recommended for faster inference)
+    - **Visibility**: Public or Private
+
+ ### 2. Required Files
+
+ Upload these files to your Hugging Face Space:
+
+ ```
+ your-space/
+ ├── app.py                  # Main application
+ ├── requirements.txt        # Python dependencies
+ ├── README.md               # Space documentation
+ ├── src/
+ │   └── backbone/
+ │       └── vit_wrapper.py  # Backbone wrapper
+ └── utils/
+     ├── visualization.py    # Visualization utilities
+     └── training.py         # Training utilities (for round_to_nearest_multiple)
+ ```
+
+ ### 3. Clone Your Space Repository
+
+ ```bash
+ # Install git-lfs if not already installed
+ git lfs install
+
+ # Clone your space
+ git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
+ cd YOUR_SPACE_NAME
+
+ # Copy files from your local NAF project
+ cp /home/lchambon/workspace/NAF_off/app.py .
+ cp /home/lchambon/workspace/NAF_off/requirements_demo.txt ./requirements.txt
+
+ # Copy source files
+ mkdir -p src/backbone utils
+ cp /home/lchambon/workspace/NAF_off/src/backbone/vit_wrapper.py src/backbone/
+ cp /home/lchambon/workspace/NAF_off/utils/visualization.py utils/
+ cp /home/lchambon/workspace/NAF_off/utils/training.py utils/
+ cp /home/lchambon/workspace/NAF_off/utils/img.py utils/  # If needed
+
+ # Copy sample images (optional)
+ mkdir -p asset
+ cp /home/lchambon/workspace/NAF_off/asset/*.png asset/
+ cp /home/lchambon/workspace/NAF_off/asset/*.jpg asset/
+
+ # Add all files
+ git add .
+ git commit -m "Initial commit: NAF Feature Upsampling Demo"
+ git push
+ ```
+
+ ### 4. Alternative: Use the Hugging Face Web Interface
+
+ 1. Navigate to your Space's "Files" tab
+ 2. Click "Add file" → "Upload files"
+ 3. Upload all required files, maintaining the directory structure
+ 4. Commit changes
+
+ ### 5. Monitor Deployment
+
+ - Once pushed, Hugging Face will automatically build your Space
+ - Check the "Logs" tab to monitor the build process
+ - The Space will be available at: `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME`
+
+ ## Hardware Recommendations
+
+ - **CPU Basic (free)**: Works, but inference is slower (~10-30s per image)
+ - **T4 Small GPU**: Recommended for better performance (~2-5s per image)
+ - **T4 Medium/Large GPU**: For handling multiple concurrent users
+
+ ## Important Notes
+
+ ### Model Loading
+ The NAF model is loaded from torch.hub, which will download it on first run:
+ ```python
+ model = torch.hub.load("valeoai/NAF", "naf", pretrained=True, device=device)
+ ```
+
+ ### Memory Considerations
+ - Backbone models are loaded on demand per request
+ - Consider caching popular models if you upgrade to persistent storage
+ - GPU Spaces have more memory for handling larger images
+
+ ### Sample Images
+ - Upload sample images to the `asset/` folder
+ - Update the `SAMPLE_IMAGES` list in `app.py` to match the available images
+ - Or remove the examples section if images aren't available
+
+ ## Troubleshooting
+
+ ### Build Failures
+ - Check the "Logs" tab for error messages
+ - Ensure all dependencies are in `requirements.txt`
+ - Verify Python version compatibility (3.8-3.10 recommended)
+
+ ### Import Errors
+ - Make sure the `src/` and `utils/` directories are included
+ - Check that `__init__.py` files exist if needed
+ - Verify that relative imports are correct
+
+ ### Memory Issues
+ - Reduce the maximum resolution if needed
+ - Consider using CPU-only mode on the free tier
+ - Upgrade to GPU hardware if processing large images
+
+ ## Updating Your Space
+
+ ```bash
+ # Make changes to your files
+ git add .
+ git commit -m "Update: description of changes"
+ git push
+ ```
+
+ Hugging Face will automatically rebuild and redeploy your Space.
+
+ ## Making Your Space Public
+
+ 1. Go to the Space Settings
+ 2. Change visibility to "Public"
+ 3. Add a good README.md with a description and usage instructions
+ 4. Consider adding a thumbnail image
+
+ ## Example Space README
+
+ See `README_SPACE.md` for a template README to use on Hugging Face Spaces.
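The "Memory Considerations" notes above suggest caching backbones instead of reloading them on every request. Below is a minimal sketch of such a cache, assuming the `PretrainedViTWrapper` class added in this commit; the cache size and eviction policy are illustrative, not part of the demo code.

```python
# Minimal sketch: reuse loaded backbones across requests (cache policy is illustrative).
from typing import Dict

import torch

from src.backbone.vit_wrapper import PretrainedViTWrapper

_BACKBONE_CACHE: Dict[str, torch.nn.Module] = {}
_MAX_CACHED = 2  # keep memory bounded on small Spaces hardware


def get_backbone(name: str, device: str) -> torch.nn.Module:
    """Return a cached backbone, loading it on first use."""
    if name not in _BACKBONE_CACHE:
        if len(_BACKBONE_CACHE) >= _MAX_CACHED:
            # Drop the oldest entry to stay within the memory budget.
            oldest = next(iter(_BACKBONE_CACHE))
            del _BACKBONE_CACHE[oldest]
        backbone = PretrainedViTWrapper(name, norm=True).to(device)
        backbone.eval()
        _BACKBONE_CACHE[name] = backbone
    return _BACKBONE_CACHE[name]
```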
README.md CHANGED
@@ -1,14 +1,84 @@
  ---
- title: NAF
- emoji: 📊
+ title: NAF Zero-Shot Feature Upsampling
+ emoji: 🎯
  colorFrom: blue
- colorTo: indigo
+ colorTo: purple
  sdk: gradio
- sdk_version: 6.0.1
+ sdk_version: 3.50.0
  app_file: app.py
  pinned: false
  license: apache-2.0
- short_description: 'NAF: Zero-Shot Feature Upsampling via Neighborhood Attention'
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # 🎯 NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
+
+ This Space demonstrates **NAF (Neighborhood Attention Filtering)**, a method for upsampling features from Vision Foundation Models to any resolution without model-specific training.
+
+ ## 🚀 Features
+
+ - **Universal Upsampling**: Works with any Vision Foundation Model (DINOv2, DINOv3, RADIO, DINO, SigLIP, etc.)
+ - **Arbitrary Resolutions**: Upsample features to any target resolution while maintaining the aspect ratio
+ - **Zero-Shot**: No model-specific training or fine-tuning required
+ - **Interactive Demo**: Upload your own images or try sample images from various domains
+
+ ## 🎨 How to Use
+
+ 1. **Upload an Image**: Click "Upload Your Image" or select one of the sample images
+ 2. **Choose a Model**: Select a Vision Foundation Model from the dropdown
+ 3. **Set Resolution**: Choose the target resolution for the upsampled features (64-512)
+ 4. **Click "Upsample Features"**: See the comparison between low- and high-resolution features
+
+ ## 📊 Visualization
+
+ The output shows three panels:
+ - **Left**: Your input image
+ - **Center**: Low-resolution features from the backbone (PCA visualization)
+ - **Right**: High-resolution features upsampled by NAF
+
+ Features are visualized with PCA, mapping the first 3 principal components to RGB channels.
+
+ ## 🔬 Supported Models
+
+ - **DINOv3**: Latest self-supervised vision models
+ - **RADIO v2.5**: High-performance vision backbones
+ - **DINOv2**: Self-supervised learning with registers
+ - **DINO**: Original self-supervised ViT
+ - **SigLIP**: Contrastive vision-language models
+
+ ## 📖 Learn More
+
+ - **Paper**: [NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering](https://arxiv.org/abs/2501.01535)
+ - **Code**: [GitHub Repository](https://github.com/valeoai/NAF)
+ - **Organization**: [Valeo.ai](https://www.valeo.com/en/valeo-ai/)
+
+ ## 💡 Use Cases
+
+ NAF enables better feature representations for:
+ - Dense prediction tasks (segmentation, depth estimation)
+ - High-resolution visual understanding
+ - Feature matching and correspondence
+ - Vision-language alignment
+
+ ## ⚙️ Technical Details
+
+ - **Input**: Images up to 512px (aspect ratio is preserved)
+ - **Processing**: Backbone feature extraction → NAF upsampling
+ - **Output**: High-resolution features at the target resolution
+ - **Device**: Runs on CPU (free tier) or GPU (faster inference)
+
+ ## 🤝 Citation
+
+ If you use NAF in your research, please cite:
+
+ ```bibtex
+ @article{chambon2025naf,
+   title={NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering},
+   author={Chambon, Lucas and others},
+   journal={arXiv preprint arXiv:2501.01535},
+   year={2025}
+ }
+ ```
+
+ ## 📜 License
+
+ This demo is released under the Apache 2.0 license.
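To make the "Processing: Backbone feature extraction → NAF upsampling" bullet concrete, here is a minimal sketch of the same pipeline outside the Gradio UI. It mirrors the calls in `app.py` from this commit (`torch.hub.load("valeoai/NAF", "naf", ...)`, `PretrainedViTWrapper`, and `model(image, lr_feats, (out_h, out_w))`); the image path, input size, and target resolution are placeholders.

```python
# Minimal sketch of the NAF pipeline used by app.py (paths and sizes are placeholders).
import PIL.Image
import torch
import torchvision.transforms as T

from src.backbone.vit_wrapper import PretrainedViTWrapper

device = "cuda" if torch.cuda.is_available() else "cpu"
naf = torch.hub.load("valeoai/NAF", "naf", pretrained=True, device=device)
naf.eval()

backbone = PretrainedViTWrapper("vit_base_patch14_reg4_dinov2", norm=True).to(device).eval()

img = PIL.Image.open("asset/Natural.png").convert("RGB").resize((448, 448))
x = T.ToTensor()(img).unsqueeze(0).to(device)

with torch.no_grad():
    # The backbone gets its own normalization statistics, the guidance image
    # uses ImageNet statistics, exactly as in app.py.
    back_norm = T.Normalize(backbone.config["mean"], backbone.config["std"])
    ups_norm = T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    lr_feats = backbone(back_norm(x))                  # [1, C, 32, 32] for a ViT-B/14 at 448x448
    hr_feats = naf(ups_norm(x), lr_feats, (448, 448))  # [1, C, 448, 448]
```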
README_SPACE.md ADDED
@@ -0,0 +1,84 @@
+ ---
+ title: NAF Zero-Shot Feature Upsampling
+ emoji: 🎯
+ colorFrom: blue
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 3.50.0
+ app_file: app.py
+ pinned: false
+ license: apache-2.0
+ ---
+
+ # 🎯 NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
+
+ This Space demonstrates **NAF (Neighborhood Attention Filtering)**, a method for upsampling features from Vision Foundation Models to any resolution without model-specific training.
+
+ ## 🚀 Features
+
+ - **Universal Upsampling**: Works with any Vision Foundation Model (DINOv2, DINOv3, RADIO, DINO, SigLIP, etc.)
+ - **Arbitrary Resolutions**: Upsample features to any target resolution while maintaining the aspect ratio
+ - **Zero-Shot**: No model-specific training or fine-tuning required
+ - **Interactive Demo**: Upload your own images or try sample images from various domains
+
+ ## 🎨 How to Use
+
+ 1. **Upload an Image**: Click "Upload Your Image" or select one of the sample images
+ 2. **Choose a Model**: Select a Vision Foundation Model from the dropdown
+ 3. **Set Resolution**: Choose the target resolution for the upsampled features (64-512)
+ 4. **Click "Upsample Features"**: See the comparison between low- and high-resolution features
+
+ ## 📊 Visualization
+
+ The output shows three panels:
+ - **Left**: Your input image
+ - **Center**: Low-resolution features from the backbone (PCA visualization)
+ - **Right**: High-resolution features upsampled by NAF
+
+ Features are visualized with PCA, mapping the first 3 principal components to RGB channels.
+
+ ## 🔬 Supported Models
+
+ - **DINOv3**: Latest self-supervised vision models
+ - **RADIO v2.5**: High-performance vision backbones
+ - **DINOv2**: Self-supervised learning with registers
+ - **DINO**: Original self-supervised ViT
+ - **SigLIP**: Contrastive vision-language models
+
+ ## 📖 Learn More
+
+ - **Paper**: [NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering](https://arxiv.org/abs/2501.01535)
+ - **Code**: [GitHub Repository](https://github.com/valeoai/NAF)
+ - **Organization**: [Valeo.ai](https://www.valeo.com/en/valeo-ai/)
+
+ ## 💡 Use Cases
+
+ NAF enables better feature representations for:
+ - Dense prediction tasks (segmentation, depth estimation)
+ - High-resolution visual understanding
+ - Feature matching and correspondence
+ - Vision-language alignment
+
+ ## ⚙️ Technical Details
+
+ - **Input**: Images up to 512px (aspect ratio is preserved)
+ - **Processing**: Backbone feature extraction → NAF upsampling
+ - **Output**: High-resolution features at the target resolution
+ - **Device**: Runs on CPU (free tier) or GPU (faster inference)
+
+ ## 🤝 Citation
+
+ If you use NAF in your research, please cite:
+
+ ```bibtex
+ @article{chambon2025naf,
+   title={NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering},
+   author={Chambon, Lucas and others},
+   journal={arXiv preprint arXiv:2501.01535},
+   year={2025}
+ }
+ ```
+
+ ## 📜 License
+
+ This demo is released under the Apache 2.0 license.
app.py ADDED
@@ -0,0 +1,251 @@
+ import io
+ import sys
+ from pathlib import Path
+
+ import gradio as gr
+ import matplotlib.pyplot as plt
+ import numpy as np
+ import PIL.Image
+ import torch
+ import torch.nn.functional as F
+ import torchvision.transforms as T
+
+ # Add project root to path
+ sys.path.append(str(Path(__file__).parent))
+ from src.backbone.vit_wrapper import PretrainedViTWrapper
+ from utils.training import round_to_nearest_multiple
+ from utils.visualization import plot_feats
+
+ # Load NAF model
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model = torch.hub.load("valeoai/NAF", "naf", pretrained=True, device=device)
+ model.eval()
+
+ # Normalization for upsampling
+ ups_norm = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
+
+ # Sample images
+ SAMPLE_IMAGES = [
+     "asset/Cartoon.png",
+     "asset/Natural.png",
+     "asset/Satellite.png",
+     "asset/Medical.png",
+     "asset/Ecosystems.png",
+     "asset/Driving.jpg",
+     "asset/Manufacturing.png",
+ ]
+
+
+ def resize_with_aspect_ratio(img, max_size, patch_size):
+     """Resize image maintaining aspect ratio with max dimension and patch size constraints"""
+     w, h = img.size
+
+     # Calculate scaling factor to fit within max_size
+     scale = min(max_size / w, max_size / h)
+     new_w = int(w * scale)
+     new_h = int(h * scale)
+
+     # Round to nearest patch size multiple
+     new_w = round_to_nearest_multiple(new_w, patch_size)
+     new_h = round_to_nearest_multiple(new_h, patch_size)
+
+     # Ensure minimum size
+     new_w = max(new_w, patch_size)
+     new_h = max(new_h, patch_size)
+
+     return new_w, new_h
+
+
+ @torch.no_grad()
+ def process_image(image, model_name, output_resolution):
+     """Process image with selected model and resolution"""
+     try:
+         # Load the backbone using vit_wrapper
+         backbone = PretrainedViTWrapper(model_name, norm=True).to(device)
+         backbone.eval()
+
+         # Get model config for normalization and input size
+         mean = backbone.config["mean"]
+         std = backbone.config["std"]
+         patch_size = backbone.patch_size
+         back_norm = T.Normalize(mean=mean, std=std)
+
+         # Prepare image at model's expected resolution
+         img = PIL.Image.fromarray(image).convert("RGB")
+         new_w, new_h = resize_with_aspect_ratio(img, max_size=512, patch_size=patch_size)
+
+         transform = T.Compose(
+             [
+                 T.Resize((new_h, new_w)),
+                 T.ToTensor(),
+             ]
+         )
+         img_tensor = transform(img).unsqueeze(0).to(device)
+
+         # Normalize for backbone
+         img_back = back_norm(img_tensor)
+         lr_feats = backbone(img_back)
+
+         # vit_wrapper already returns features in [B, C, H, W] format
+         if not isinstance(lr_feats, torch.Tensor):
+             raise ValueError(f"Unexpected feature type: {type(lr_feats)}")
+
+         if len(lr_feats.shape) != 4:
+             raise ValueError(f"Unexpected feature shape: {lr_feats.shape}. Expected [B, C, H, W].")
+
+         # Normalize for upsampling
+         img_ups = ups_norm(img_tensor)
+
+         # Calculate output resolution maintaining aspect ratio
+         _, _, h, w = lr_feats.shape
+         aspect_ratio = w / h
+         if aspect_ratio > 1:  # Width > Height
+             out_h = round_to_nearest_multiple(int(output_resolution / aspect_ratio), patch_size)
+             out_w = output_resolution
+         else:  # Height >= Width
+             out_h = output_resolution
+             out_w = round_to_nearest_multiple(int(output_resolution * aspect_ratio), patch_size)
+
+         upsampled_feats = model(img_ups, lr_feats, (out_h, out_w))
+
+         # Create visualization using plot_feats
+         plot_feats(
+             img_tensor[0],
+             lr_feats[0],
+             [upsampled_feats[0]],
+             legend=["Image", f"Low-Res: {h}x{w}", f"High-Res: {out_h}x{out_w}"],
+             font_size=14,
+         )
+
+         # Convert matplotlib figure to PIL Image
+         fig = plt.gcf()  # Get current figure
+         buf = io.BytesIO()
+         fig.savefig(buf, format="png", dpi=100, bbox_inches="tight")
+         buf.seek(0)
+         result_img = PIL.Image.open(buf)
+         plt.close(fig)
+
+         return result_img
+
+     except Exception as e:
+         print(f"Error processing image: {e}")
+         import traceback
+
+         traceback.print_exc()
+         return None
+
+
+ # Popular vision models for the dropdown (from vit_wrapper.py)
+ POPULAR_MODELS = [
+     "vit_base_patch16_dinov3.lvd1689m",
+     "radio_v2.5-b",
+     "vit_base_patch14_reg4_dinov2",
+     "vit_base_patch14_dinov2.lvd142m",
+     "vit_base_patch16_224.dino",
+     "vit_base_patch16_siglip_512.v2_webli",
+ ]
+
+ # Create Gradio interface
+ with gr.Blocks(title="NAF: Zero-Shot Feature Upsampling") as demo:
+     gr.HTML(
+         """
+         <div style="text-align: center; margin-bottom: 2rem;">
+             <h1 class="title-text" style="font-size: 3rem; margin-bottom: 0.5rem;">
+                 🎯 NAF: Zero-Shot Feature Upsampling
+             </h1>
+             <p style="font-size: 1.2rem; color: #666; margin-bottom: 1rem;">
+                 via Neighborhood Attention Filtering
+             </p>
+             <div class="info-box" style="max-width: 900px; margin: 0 auto;">
+                 <p style="font-size: 1.1rem; margin-bottom: 0.8rem;">
+                     🚀 <strong>Upsample features from any Vision Foundation Model to any resolution!</strong>
+                 </p>
+                 <p style="font-size: 0.95rem; margin: 0;">
+                     Upload an image, select a model, choose your target resolution, and see NAF in action.
+                 </p>
+             </div>
+         </div>
+         """
+     )
+
+     with gr.Row():
+         with gr.Column(scale=1):
+             gr.Markdown("### 📤 Input Configuration")
+
+             image_input = gr.Image(label="Upload Your Image", type="numpy")
+
+             # Sample images
+             if any(Path(p).exists() for p in SAMPLE_IMAGES):
+                 gr.Examples(
+                     examples=[[p] for p in SAMPLE_IMAGES if Path(p).exists()],
+                     inputs=image_input,
+                     label="🖼️ Try Sample Images",
+                     examples_per_page=4,
+                 )
+
+             gr.Markdown("### ⚙️ Model Settings")
+
+             model_dropdown = gr.Dropdown(
+                 choices=POPULAR_MODELS,
+                 value=POPULAR_MODELS[0],
+                 label="🤖 Vision Foundation Model",
+             )
+
+             resolution_slider = gr.Slider(
+                 minimum=64,
+                 maximum=512,
+                 step=64,
+                 value=448,
+                 label="📏 Output Resolution (max dimension)",
+             )
+
+             process_btn = gr.Button("✨ Upsample Features", variant="primary")
+
+         with gr.Column(scale=2):
+             gr.Markdown("### 🎨 Visualization Results")
+             output_image = gr.Image(label="Feature Comparison", type="pil")
+
+             gr.Markdown(
+                 """
+                 <div style="background: #f0f7ff; padding: 1rem; border-radius: 8px; border-left: 4px solid #667eea;">
+                     <strong>📊 Visualization Guide:</strong>
+                     <ul style="margin: 0.5rem 0;">
+                         <li><strong>Left:</strong> Original input image</li>
+                         <li><strong>Center:</strong> Low-resolution features (PCA visualization)</li>
+                         <li><strong>Right:</strong> High-resolution features upsampled by NAF</li>
+                     </ul>
+                     <p style="margin-top: 0.5rem; font-size: 0.9rem; color: #555;">
+                         <em>Note: Output features maintain the aspect ratio of the input image.</em>
+                     </p>
+                 </div>
+                 """
+             )
+
+     process_btn.click(fn=process_image, inputs=[image_input, model_dropdown, resolution_slider], outputs=output_image)
+
+     gr.Markdown(
+         """
+         ---
+         <div style="text-align: center; padding: 2rem 0;">
+             <h3 style="color: #667eea;">💡 About NAF</h3>
+             <p style="max-width: 800px; margin: 1rem auto; font-size: 1.05rem; color: #555;">
+                 NAF enables <strong>zero-shot feature upsampling</strong> from any Vision Foundation Model
+                 to any resolution. It learns to filter and combine features using neighborhood attention,
+                 without requiring model-specific training.
+             </p>
+             <div style="margin-top: 1.5rem;">
+                 <a href="https://github.com/valeoai/NAF" target="_blank"
+                    style="margin: 0 1rem; text-decoration: none; color: #667eea; font-weight: bold;">
+                     📦 GitHub Repository
+                 </a>
+                 <a href="https://arxiv.org/abs/2501.01535" target="_blank"
+                    style="margin: 0 1rem; text-decoration: none; color: #667eea; font-weight: bold;">
+                     📄 Research Paper
+                 </a>
+             </div>
+         </div>
+         """
+     )
+
+ if __name__ == "__main__":
+     demo.launch()
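For a quick smoke test of `process_image` without opening the UI, something along these lines should work (a sketch: the sample path and resolution are placeholders, and importing `app` loads the NAF model, so the dependencies from `requirements.txt` must be installed).

```python
# Sketch: call the Gradio handler directly for a local smoke test.
import numpy as np
import PIL.Image

from app import POPULAR_MODELS, process_image

image = np.array(PIL.Image.open("asset/Driving.jpg").convert("RGB"))
result = process_image(image, POPULAR_MODELS[0], output_resolution=256)
if result is not None:
    result.save("naf_demo_output.png")  # the three-panel comparison figure
```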
deploy_to_hf.sh ADDED
@@ -0,0 +1,109 @@
+ #!/bin/bash
+
+ # Deploy NAF Demo to Hugging Face Spaces
+ # Usage: ./deploy_to_hf.sh YOUR_USERNAME YOUR_SPACE_NAME
+
+ set -e
+
+ if [ "$#" -ne 2 ]; then
+     echo "Usage: ./deploy_to_hf.sh YOUR_USERNAME YOUR_SPACE_NAME"
+     echo "Example: ./deploy_to_hf.sh myusername naf-demo"
+     exit 1
+ fi
+
+ USERNAME=$1
+ SPACE_NAME=$2
+ SPACE_URL="https://huggingface.co/spaces/${USERNAME}/${SPACE_NAME}"
+
+ echo "🚀 Deploying NAF Demo to Hugging Face Spaces"
+ echo "Space URL will be: ${SPACE_URL}"
+ echo ""
+
+ # Check that git-lfs is installed, then initialize its hooks
+ if ! command -v git-lfs &> /dev/null; then
+     echo "⚠️  git-lfs is not installed. Please install it first, then re-run this script."
+     exit 1
+ fi
+ git lfs install
+
+ # Create temporary directory
+ TEMP_DIR=$(mktemp -d)
+ echo "📁 Created temporary directory: ${TEMP_DIR}"
+
+ # Clone the space
+ echo "📥 Cloning space repository..."
+ git clone "https://huggingface.co/spaces/${USERNAME}/${SPACE_NAME}" "${TEMP_DIR}"
+ cd "${TEMP_DIR}"
+
+ # Copy main files
+ echo "📋 Copying files..."
+ cp /home/lchambon/workspace/NAF_off/app.py .
+ cp /home/lchambon/workspace/NAF_off/requirements_demo.txt ./requirements.txt
+ cp /home/lchambon/workspace/NAF_off/README_SPACE.md ./README.md
+
+ # Create directory structure
+ mkdir -p src/backbone utils asset
+
+ # Copy source files
+ cp /home/lchambon/workspace/NAF_off/src/backbone/vit_wrapper.py src/backbone/
+ cp /home/lchambon/workspace/NAF_off/utils/visualization.py utils/
+ cp /home/lchambon/workspace/NAF_off/utils/training.py utils/
+
+ # Copy utils/__init__.py if it exists
+ if [ -f /home/lchambon/workspace/NAF_off/utils/__init__.py ]; then
+     cp /home/lchambon/workspace/NAF_off/utils/__init__.py utils/
+ else
+     touch utils/__init__.py
+ fi
+
+ # Copy src/__init__.py if it exists
+ if [ -f /home/lchambon/workspace/NAF_off/src/__init__.py ]; then
+     cp /home/lchambon/workspace/NAF_off/src/__init__.py src/
+ else
+     touch src/__init__.py
+ fi
+
+ # Copy src/backbone/__init__.py if it exists
+ if [ -f /home/lchambon/workspace/NAF_off/src/backbone/__init__.py ]; then
+     cp /home/lchambon/workspace/NAF_off/src/backbone/__init__.py src/backbone/
+ else
+     touch src/backbone/__init__.py
+ fi
+
+ # Copy sample images if they exist
+ echo "🖼️ Copying sample images..."
+ if [ -d /home/lchambon/workspace/NAF_off/asset ]; then
+     cp /home/lchambon/workspace/NAF_off/asset/*.png asset/ 2>/dev/null || true
+     cp /home/lchambon/workspace/NAF_off/asset/*.jpg asset/ 2>/dev/null || true
+ fi
+
+ # Add .gitignore (quoted delimiter so nothing inside the heredoc is expanded)
+ cat > .gitignore << 'EOF'
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ .env
+ .venv
+ *.egg-info/
+ .DS_Store
+ EOF
+
+ # Git operations
+ echo "📤 Pushing to Hugging Face..."
+ git add .
+ git commit -m "Deploy NAF Feature Upsampling Demo"
+ git push
+
+ echo ""
+ echo "✅ Deployment complete!"
+ echo "🌐 Your Space will be available at: ${SPACE_URL}"
+ echo "⏳ It may take a few minutes to build..."
+ echo ""
+ echo "To check build status:"
+ echo "  - Visit: ${SPACE_URL}"
+ echo "  - Click on the 'Logs' tab"
+
+ # Cleanup
+ cd -
+ rm -rf "${TEMP_DIR}"
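As an alternative to the git-based flow in `deploy_to_hf.sh`, the same upload can be done with the `huggingface_hub` Python API. A rough sketch (the repo id is a placeholder, and it assumes you are already authenticated, e.g. via `huggingface-cli login`):

```python
# Sketch: programmatic upload of a prepared demo folder to a Space (repo id is a placeholder).
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("YOUR_USERNAME/naf-feature-upsampling", repo_type="space",
                space_sdk="gradio", exist_ok=True)
api.upload_folder(
    folder_path=".",  # directory laid out as in deploy_to_hf.sh
    repo_id="YOUR_USERNAME/naf-feature-upsampling",
    repo_type="space",
    commit_message="Deploy NAF Feature Upsampling Demo",
)
```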
requirements.txt ADDED
@@ -0,0 +1,22 @@
+ einops==0.8.0
+ numpy==1.24.4
+ timm==1.0.22
+ plotly==6.0.0
+ tensorboard==2.20.0
+ hydra-core==1.3.2
+ matplotlib==3.7.0
+ rich==14.2.0
+ torchmetrics==1.6.2
+ scipy==1.15.2
+ kornia==0.8.2
+ ipykernel
+ ipympl
+ pytest
+
+ # Torch + CUDA 11.8 (Hugging Face compatible install)
+ torch==2.4.0
+ torchvision==0.19.0
+ --extra-index-url https://download.pytorch.org/whl/cu118
+
+ # NATTEN wheels (--find-links must be on its own line in a requirements file)
+ --find-links https://shi-labs.com/natten/wheels/
+ natten==0.17.4+torch240cu118
src/backbone/vit_wrapper.py ADDED
@@ -0,0 +1,180 @@
+ import re
+ import types
+ from typing import List, Tuple, Union
+
+ import timm
+ import timm.data
+ import torch
+ import torch.nn.functional as F
+ from einops import rearrange
+ from timm.models.vision_transformer import VisionTransformer
+ from torch import nn
+ from torchvision import transforms
+
+ # We provide a list of timm model names, more are available on their official repo.
+ MODEL_LIST = [
+     # DINO
+     "vit_base_patch16_224.dino",
+     # DINOv2
+     "vit_base_patch14_dinov2.lvd142m",
+     # DINOv2-R
+     "vit_base_patch14_reg4_dinov2",
+     # Franca
+     "franca_vitb14",
+     # DINOv3-ViT
+     "vit_base_patch16_dinov3.lvd1689m",
+     "vit_large_patch16_dinov3.lvd1689m",
+     "vit_7b_patch16_dinov3.lvd1689m",
+     # SigLIP2
+     "vit_base_patch16_siglip_512.v2_webli",
+     # PE Core
+     "vit_pe_core_small_patch16_384.fb",
+     # PE Spatial
+     "vit_pe_spatial_tiny_patch16_512.fb",
+     # RADIO
+     "radio_v2.5-b",
+     # CAPI
+     "capi_vitl14_lvd",
+     # MAE
+     "vit_large_patch16_224.mae",
+ ]
+
+ IMAGENET_DEFAULT_MEAN = (0.485, 0.456, 0.406)
+ IMAGENET_DEFAULT_STD = (0.229, 0.224, 0.225)
+
+
+ class PretrainedViTWrapper(nn.Module):
+
+     def __init__(
+         self,
+         name,
+         norm: bool = True,
+         dynamic_img_size: bool = True,
+         dynamic_img_pad: bool = False,
+         **kwargs,
+     ):
+         super().__init__()
+         self.name = name
+
+         # Optional prefixes select externally fine-tuned weights (DVT / FiT3D checkpoints)
+         load_weights = False
+         if "dvt_" == name[:4]:
+             load_weights = True
+             load_tag = "dvt"
+             name = name.replace("dvt_", "")
+         if "fit3d_" == name[:6]:
+             load_weights = True
+             load_tag = "fit3d"
+             name = name.replace("fit3d_", "")
+
+         # Set patch size from the model name, with fallbacks for special families
+         try:
+             self.patch_size = int(re.search(r"patch(\d+)", name).group(1))
+         except AttributeError:
+             self.patch_size = 16
+         if "franca" in name or "capi" in name:
+             self.patch_size = 14
+         if "convnext" in name:
+             self.patch_size = 32
+
+         self.dynamic_img_size = dynamic_img_size
+         self.dynamic_img_pad = dynamic_img_pad
+         self.model, self.config = self.create_model(name, **kwargs)
+         self.config["ps"] = self.patch_size
+         self.embed_dim = self.model.embed_dim
+         self.norm = norm
+
+         if load_weights:
+             ckpt = torch.load(f"/home/lchambon/workspace/JAFAR/ckpts/{load_tag}_{name}.pth", map_location="cpu")
+             if load_tag == "dvt":
+                 self.load_state_dict(ckpt["model"], strict=True)
+             elif load_tag == "fit3d":
+                 self.model.load_state_dict(ckpt, strict=True)
+
+     def create_model(self, name: str, **kwargs) -> Tuple[VisionTransformer, transforms.Compose]:
+         if "radio" in self.name:
+             model = torch.hub.load(
+                 "NVlabs/RADIO",
+                 "radio_model",
+                 version=name,
+                 progress=True,
+                 skip_validation=True,
+             )
+             data_config = {
+                 "mean": torch.tensor([0.0, 0.0, 0.0]),
+                 "std": torch.tensor([1.0, 1.0, 1.0]),
+                 "input_size": (3, 512, 512),
+             }
+
+         elif "franca" in self.name:
+             model = torch.hub.load("valeoai/Franca", name, use_rasa_head=True)
+             data_config = {"mean": IMAGENET_DEFAULT_MEAN, "std": IMAGENET_DEFAULT_STD, "input_size": (3, 448, 448)}
+
+         elif "capi" in self.name:
+             model = torch.hub.load("facebookresearch/capi:main", name, force_reload=False)
+             data_config = {"mean": IMAGENET_DEFAULT_MEAN, "std": IMAGENET_DEFAULT_STD, "input_size": (3, 448, 448)}
+
+         else:
+             timm_kwargs = dict(
+                 pretrained=True,
+                 num_classes=0,
+                 patch_size=self.patch_size,
+             )
+
+             if "sam" not in self.name and "convnext" not in self.name:
+                 timm_kwargs["dynamic_img_size"] = self.dynamic_img_size
+                 timm_kwargs["dynamic_img_pad"] = self.dynamic_img_pad
+
+             timm_kwargs.update(kwargs)
+             model = timm.create_model(name, **timm_kwargs)
+             data_config = timm.data.resolve_model_data_config(model=model)
+
+         model = model.eval()
+
+         return model, data_config
+
+     def forward(
+         self,
+         x: torch.Tensor,
+         n: Union[int, List[int], Tuple[int]] = 1,
+         return_prefix_tokens: bool = False,
+     ) -> Union[List[torch.Tensor], Tuple[torch.Tensor, List[torch.Tensor]]]:
+         """Intermediate layer accessor inspired by the DINO / DINOv2 interface.
+
+         Args:
+             x: Input tensor.
+             n: Take the last n blocks if int, all if None, select matching indices if a sequence.
+             return_prefix_tokens: Whether to also request prefix (CLS/register) tokens from timm models.
+         """
+         common_kwargs = dict(
+             norm=self.norm,
+             output_fmt="NCHW",
+             intermediates_only=True,
+         )
+
+         if "sam" not in self.name and return_prefix_tokens:
+             common_kwargs["return_prefix_tokens"] = return_prefix_tokens
+
+         if "franca" in self.name:
+             B, C, H, W = x.shape
+             feats = self.model.forward_features(x, use_rasa_head=True)
+             out = feats["patch_token_rasa"]
+             out = rearrange(out, "b (h w) c -> b c h w", h=H // self.patch_size, w=W // self.patch_size)
+
+         elif "capi" in self.name:
+             *_, out = self.model(x)
+             out = out.permute(0, 3, 1, 2)
+
+         else:
+             out = self.model.forward_intermediates(x, n, **common_kwargs)
+
+         # "sam" models return feats only, others may return (feats, prefix)
+         if not isinstance(out, list) and not isinstance(out, tuple):
+             out = [out]
+             return out[0]
+         else:
+             assert len(out) == 1, f"Out contains {len(out)} elements, expected 1."
+             return out[0]
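A minimal usage sketch for `PretrainedViTWrapper` (the model name comes from `MODEL_LIST` above; the input size is a placeholder and must be a multiple of the patch size):

```python
# Sketch: extract [B, C, H/ps, W/ps] features with the wrapper defined above.
import torch

from src.backbone.vit_wrapper import PretrainedViTWrapper

backbone = PretrainedViTWrapper("vit_base_patch16_224.dino", norm=True).eval()
x = torch.randn(1, 3, 224, 224)  # H and W must be multiples of backbone.patch_size
with torch.no_grad():
    feats = backbone(x)
print(feats.shape)  # e.g. torch.Size([1, 768, 14, 14]) for a ViT-B/16 at 224x224
```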
utils/training.py ADDED
@@ -0,0 +1,231 @@
+ import os
+ import random
+
+ import numpy as np
+ import torch
+ import torch.nn.functional as F
+ import torch.utils.checkpoint as checkpoint
+ import torchvision.transforms as T
+ from hydra.utils import instantiate
+ from omegaconf import ListConfig
+ from torch.utils.tensorboard import SummaryWriter
+ from torchvision.transforms.functional import InterpolationMode
+
+ from src.backbone.vit_wrapper import PretrainedViTWrapper
+ from utils.img import PILToTensor
+
+
+ def seed_worker():
+     worker_seed = torch.initial_seed() % 2**32
+     np.random.seed(worker_seed)
+     random.seed(worker_seed)
+
+
+ def round_to_nearest_multiple(value, multiple=14):
+     return multiple * round(value / multiple)
+
+
+ def compute_feats(cfg, backbone, image_batch, min_rescale=0.60, max_rescale=0.25):
+     _, _, H, W = image_batch.shape  # Get original height and width
+
+     with torch.no_grad():
+         hr_feats = backbone(image_batch)
+
+         if cfg.get("lr_img_size", None) is not None:
+             size = (cfg.lr_img_size, cfg.lr_img_size)
+         else:
+             # Downscale
+             if cfg.down_factor == "random":
+                 downscale_factor = np.random.uniform(min_rescale, max_rescale)
+             elif cfg.down_factor == "fixed":
+                 downscale_factor = 0.5
+
+             new_H = round_to_nearest_multiple(H * downscale_factor, backbone.patch_size)
+             new_W = round_to_nearest_multiple(W * downscale_factor, backbone.patch_size)
+             size = (new_H, new_W)
+         low_res_batch = F.interpolate(image_batch, size=size, mode="bilinear")
+         lr_feats = backbone(low_res_batch)
+
+     return hr_feats, lr_feats
+
+
+ def logger(args, base_log_dir):
+     os.makedirs(base_log_dir, exist_ok=True)
+     existing_versions = [
+         int(d.split("_")[-1])
+         for d in os.listdir(base_log_dir)
+         if os.path.isdir(os.path.join(base_log_dir, d)) and d.startswith("version_")
+     ]
+     new_version = max(existing_versions, default=-1) + 1
+     new_log_dir = os.path.join(base_log_dir, f"version_{new_version}")
+
+     # Create the SummaryWriter with the new log directory
+     writer = SummaryWriter(log_dir=new_log_dir)
+     return writer, new_version, new_log_dir
+
+
+ def get_dataloaders(cfg, shuffle=True):
+     """Get dataloaders for either training or evaluation.
+
+     Args:
+         cfg: Configuration object
+         shuffle: Whether to shuffle the training data
+     """
+     transforms = {
+         "image": T.Compose(
+             [
+                 T.Resize(cfg.img_size, interpolation=InterpolationMode.BILINEAR),
+                 T.CenterCrop((cfg.img_size, cfg.img_size)),
+                 T.ToTensor(),
+             ]
+         )
+     }
+
+     transforms["label"] = T.Compose(
+         [
+             # T.ToTensor(),
+             T.Resize(cfg.target_size, interpolation=InterpolationMode.NEAREST_EXACT),
+             T.CenterCrop((cfg.target_size, cfg.target_size)),
+             PILToTensor(),
+         ]
+     )
+     train_dataset = cfg.dataset
+     val_dataset = cfg.dataset.copy()
+     if hasattr(val_dataset, "split"):
+         val_dataset.split = "val"
+
+     train_dataset = instantiate(
+         train_dataset,
+         transform=transforms["image"],
+         target_transform=transforms["label"],
+     )
+     val_dataset = instantiate(
+         val_dataset,
+         transform=transforms["image"],
+         target_transform=transforms["label"],
+     )
+
+     # Create generator for reproducibility
+     if not shuffle:
+         g = torch.Generator()
+         g.manual_seed(0)
+     else:
+         g = None
+
+     # Prepare dataloader configs - set worker_init_fn to None when shuffling for randomness
+     train_dataloader_cfg = cfg.train_dataloader.copy()
+     val_dataloader_cfg = cfg.val_dataloader.copy()
+
+     if shuffle:
+         # Set worker_init_fn to None to allow true randomness when shuffling
+         if "worker_init_fn" in train_dataloader_cfg:
+             train_dataloader_cfg["worker_init_fn"] = None
+         if "worker_init_fn" in val_dataloader_cfg:
+             val_dataloader_cfg["worker_init_fn"] = None
+
+     return (
+         instantiate(train_dataloader_cfg, dataset=train_dataset, generator=g),
+         instantiate(val_dataloader_cfg, dataset=val_dataset, generator=g),
+     )
+
+
+ def get_batch(batch, device):
+     """Process batch and return required tensors."""
+     batch["image"] = batch["image"].to(device)
+     return batch
+
+
+ def setup_training_optimizations(model, cfg):
+     """
+     Setup training optimizations based on configuration.
+
+     Args:
+         model: The model to apply optimizations to
+         cfg: Configuration object with use_bf16 and use_checkpointing flags
+
+     Returns:
+         tuple: (scaler, use_bf16, use_checkpointing) for use in the training loop
+     """
+     # Get configuration values with defaults
+     use_bf16 = getattr(cfg, "use_bf16", False)
+     use_checkpointing = getattr(cfg, "use_checkpointing", False)
+
+     # Initialize gradient scaler for mixed precision
+     scaler = torch.amp.GradScaler("cuda", enabled=use_bf16)
+
+     # Enable gradient checkpointing if requested
+     if use_checkpointing:
+         if hasattr(model, "gradient_checkpointing_enable"):
+             model.gradient_checkpointing_enable()
+             print("  ✓ Using built-in gradient checkpointing")
+         else:
+             # For custom models, wrap forward methods
+             def checkpoint_wrapper(module):
+                 if hasattr(module, "forward"):
+                     original_forward = module.forward
+
+                     def checkpointed_forward(*args, **kwargs):
+                         return checkpoint.checkpoint(original_forward, *args, **kwargs)
+
+                     module.forward = checkpointed_forward
+
+             # Apply to key modules (adjust based on your model structure)
+             checkpointed_modules = []
+             for name, module in model.named_modules():
+                 if any(key in name for key in ["cross_decode", "encoder", "sft"]):
+                     checkpoint_wrapper(module)
+                     checkpointed_modules.append(name)
+
+             if checkpointed_modules:
+                 print(f"  ✓ Applied custom gradient checkpointing to: {checkpointed_modules}")
+             else:
+                 print("  ⚠ No modules found for gradient checkpointing")
+
+     print("Training optimizations:")
+     print(f"  Mixed precision (bfloat16): {use_bf16}")
+     print(f"  Gradient checkpointing: {use_checkpointing}")
+
+     return scaler, use_bf16, use_checkpointing
+
+
+ def load_multiple_backbones(cfg, backbone_configs, device):
+     """
+     Load multiple backbone models based on configuration.
+
+     Args:
+         cfg: Hydra configuration object
+         backbone_configs: One backbone config or a list of backbone configs
+         device: PyTorch device to load models on
+
+     Returns:
+         tuple: (backbones, backbone_names, backbone_img_sizes)
+             - backbones: List of loaded backbone models
+             - backbone_names: List of backbone names
+             - backbone_img_sizes: List of (H, W) input sizes from each backbone config
+     """
+     backbones = []
+     backbone_names = []
+     backbone_img_sizes = []
+
+     if not isinstance(backbone_configs, list) and not isinstance(backbone_configs, ListConfig):
+         backbone_configs = [backbone_configs]
+     print(f"Loading {len(backbone_configs)} backbone(s)...")
+
+     for i, backbone_config in enumerate(backbone_configs):
+         name = backbone_config["name"]
+         if name == "rgb":
+             backbone = instantiate(cfg.backbone)
+         else:
+             backbone = PretrainedViTWrapper(name=name)
+         print(f"  [{i}] Loaded {backbone_config['name']}")
+
+         # Move to device and set to eval mode
+         backbone = backbone.to(device)
+         backbone.eval()  # Set to eval mode for feature extraction
+
+         # Store backbone and name
+         backbones.append(backbone)
+         backbone_names.append(backbone_config["name"])
+         backbone_img_sizes.append(backbone.config["input_size"][1:])
+
+     return backbones, backbone_names, backbone_img_sizes
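`round_to_nearest_multiple` is the only helper that `app.py` imports from this file; a small sketch of how it snaps resolutions onto the backbone's patch grid (the sizes are illustrative):

```python
# Sketch: snapping arbitrary sizes to the patch grid, as app.py does before inference.
from utils.training import round_to_nearest_multiple

patch_size = 14
for size in (500, 512, 448):
    print(size, "->", round_to_nearest_multiple(size, patch_size))
# 500 -> 504, 512 -> 518, 448 -> 448
```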
utils/visualization.py ADDED
@@ -0,0 +1,190 @@
+ # Visualization code from https://github.com/Tsingularity/dift/blob/main/src/utils/visualization.py
+
+ import io
+ from pathlib import Path
+
+ import matplotlib.colors as mcolors
+ import matplotlib.pyplot as plt
+ import numpy as np
+ import torch
+ import torch.nn.functional as F
+ from einops import rearrange
+ from PIL import Image
+
+ FONT_SIZE = 40
+
+
+ @torch.no_grad()
+ def plot_feats(
+     image,
+     target,
+     pred,
+     legend=["Image", "HR Features", "Pred Features"],
+     save_path=None,
+     return_array=False,
+     show_legend=True,
+     font_size=FONT_SIZE,
+ ):
+     """
+     Create a plot_feats visualization.
+     """
+     # Ensure pred is a list
+     if not isinstance(pred, list):
+         pred = [pred]
+
+     # Prepare inputs for PCA
+     feats_for_pca = [target.unsqueeze(0)] + [_.unsqueeze(0) for _ in pred]
+     reduced_feats, _ = pca(feats_for_pca)  # pca outputs a list of reduced tensors
+
+     target_imgs = reduced_feats[0]
+     pred_imgs = reduced_feats[1:]
+
+     # --- Plot ---
+     # Determine number of columns based on whether image is provided
+     n_cols = (1 if image is not None else 0) + 1 + len(pred_imgs)
+     fig, ax = plt.subplots(1, n_cols, figsize=(5 * n_cols, 5))
+     # Reduce space between images
+     plt.subplots_adjust(wspace=0.05, hspace=0.05)
+
+     # Handle single subplot case
+     if n_cols == 1:
+         ax = np.array([ax])
+
+     # Current axis index
+     ax_idx = 0
+
+     # Plot original image if provided
+     if image is not None:
+         if image.dim() == 3:
+             ax[ax_idx].imshow(image.permute(1, 2, 0).detach().cpu())
+         elif image.dim() == 2:
+             ax[ax_idx].imshow(image.detach().cpu(), cmap="inferno")
+         if show_legend:
+             ax[ax_idx].set_title(legend[0], fontsize=font_size)
+         ax_idx += 1
+
+     # Plot the low-resolution features or segmentation mask
+     ax[ax_idx].imshow(target_imgs[0].permute(1, 2, 0).detach().cpu())
+     if show_legend:
+         legend_idx = 1 if image is not None else 0
+         ax[ax_idx].set_title(legend[legend_idx], fontsize=font_size)
+     ax_idx += 1
+
+     # Plot HR features or segmentation masks
+     for idx, pred_img in enumerate(pred_imgs):
+         ax[ax_idx].imshow(pred_img[0].permute(1, 2, 0).detach().cpu())
+         if show_legend:
+             legend_idx = (2 if image is not None else 1) + idx
+             if len(legend) > legend_idx:
+                 ax[ax_idx].set_title(legend[legend_idx], fontsize=font_size)
+             else:
+                 ax[ax_idx].set_title(f"HR Features {idx}", fontsize=font_size)
+         ax_idx += 1
+
+     remove_axes(ax)
+
+     # Handle return_array case
+     if return_array:
+         # Turn off interactive mode temporarily
+         was_interactive = plt.isinteractive()
+         plt.ioff()
+
+         # Convert figure to numpy array
+         buf = io.BytesIO()
+         plt.savefig(buf, format="png", bbox_inches="tight", pad_inches=0)
+         buf.seek(0)
+
+         # Convert to PIL Image then to numpy array
+         pil_img = Image.open(buf)
+         img_array = np.array(pil_img)
+
+         # Close the figure and buffer
+         plt.close(fig)
+         buf.close()
+
+         # Restore interactive mode if it was on
+         if was_interactive:
+             plt.ion()
+
+         return img_array
+
+     # Standard behavior: save and/or show
+     if save_path is not None:
+         plt.savefig(save_path, bbox_inches="tight", pad_inches=0)
+     plt.show()
+
+     return None
+
+
+ def remove_axes(axes):
+     def _remove_axes(ax):
+         ax.xaxis.set_major_formatter(plt.NullFormatter())
+         ax.yaxis.set_major_formatter(plt.NullFormatter())
+         ax.set_xticks([])
+         ax.set_yticks([])
+
+     if len(axes.shape) == 2:
+         for ax1 in axes:
+             for ax in ax1:
+                 _remove_axes(ax)
+     else:
+         for ax in axes:
+             _remove_axes(ax)
+
+
+ def pca(image_feats_list, dim=3, fit_pca=None, max_samples=None):
+     target_size = None
+     if len(image_feats_list) > 1 and fit_pca is None:
+         target_size = image_feats_list[0].shape[2]
+
+     def flatten(tensor, target_size=None):
+         B, C, H, W = tensor.shape
+         assert B == 1, "Batch size should be 1 for PCA flattening"
+         if target_size is not None:
+             tensor = F.interpolate(tensor, (target_size, target_size), mode="bilinear", align_corners=False)
+         return rearrange(tensor, "b c h w -> (b h w) c").detach().cpu()
+
+     flattened_feats = []
+     for feats in image_feats_list:
+         flattened_feats.append(flatten(feats, target_size))
+     x = torch.cat(flattened_feats, dim=0)
+
+     # Subsample the data if max_samples is set and the number of samples exceeds max_samples
+     if max_samples is not None and x.shape[0] > max_samples:
+         indices = torch.randperm(x.shape[0])[:max_samples]
+         x = x[indices]
+
+     if fit_pca is None:
+         fit_pca = TorchPCA(n_components=dim).fit(x)
+
+     reduced_feats = []
+     for feats in image_feats_list:
+         B, C, H, W = feats.shape
+         x_red = fit_pca.transform(flatten(feats))
+         if isinstance(x_red, np.ndarray):
+             x_red = torch.from_numpy(x_red)
+         x_red -= x_red.min(dim=0, keepdim=True).values
+         x_red /= x_red.max(dim=0, keepdim=True).values
+         reduced_feats.append(x_red.reshape(B, H, W, dim).permute(0, 3, 1, 2))
+
+     return reduced_feats, fit_pca
+
+
+ class TorchPCA(object):
+
+     def __init__(self, n_components, skip=0):
+         self.n_components = n_components
+         self.skip = skip
+
+     def fit(self, X):
+         self.mean_ = X.mean(dim=0)
+         unbiased = X - self.mean_
+         U, S, V = torch.pca_lowrank(unbiased, q=self.n_components, center=False, niter=20)
+         self.components_ = V[:, self.skip :]
+         self.singular_values_ = S
+         return self
+
+     def transform(self, X):
+         t0 = X - self.mean_.unsqueeze(0)
+         projected = t0 @ self.components_
+         return projected
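A minimal sketch exercising `pca` and `plot_feats` on random tensors (the shapes are illustrative; real features come from a backbone, as in `app.py`):

```python
# Sketch: PCA-to-RGB feature visualization with the utilities above.
import torch

from utils.visualization import plot_feats

image = torch.rand(3, 224, 224)        # RGB image in [0, 1], shape [3, H, W]
lr_feats = torch.randn(768, 16, 16)    # low-resolution features [C, h, w]
hr_feats = torch.randn(768, 224, 224)  # upsampled features [C, H, W]

# plot_feats expects unbatched tensors and a list of predictions;
# with return_array=True it returns the rendered figure as a numpy array.
arr = plot_feats(
    image,
    lr_feats,
    [hr_feats],
    legend=["Image", "LR 16x16", "HR 224x224"],
    font_size=14,
    return_array=True,
)
print(arr.shape)  # (height, width, channels) array of the three-panel figure
```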