Spaces · Runtime error

Commit 849a3f2 (parent: 78a732a), committed by dung-vpt-uney

Update Visual-CoT demo - 2025-10-12 22:59:40
Fixes:
- Fix LLaVA config registration error (compatibility with newer transformers); see the sketch below
- Update Gradio to latest version (security fixes)
- Auto-deployed via update script
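
The registration error called out above is the one newer transformers releases raise when a custom config is registered under a model type they already ship for LLaVA (typically a `ValueError` complaining that the model type is already used). Below is a minimal sketch of that kind of compatibility guard, assuming placeholder names (`LlavaCoTConfig`, `"llava_cot"`) rather than the project's real identifiers; it is an illustration, not the actual patch in this commit.

```python
# Hedged sketch (assumption, not the commit's actual patch): guard custom
# config registration so the app still imports when a newer transformers
# release already knows the model type. LlavaCoTConfig / "llava_cot" are
# placeholder names.
from transformers import AutoConfig, PretrainedConfig


class LlavaCoTConfig(PretrainedConfig):
    """Placeholder stand-in for the demo's custom LLaVA-style config."""

    model_type = "llava_cot"


try:
    AutoConfig.register("llava_cot", LlavaCoTConfig)
except ValueError:
    # The model type is already registered (newer transformers, or a
    # repeated import); skip re-registration instead of crashing the Space.
    pass
```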
app.py
CHANGED
@@ -415,7 +415,7 @@ def create_demo():
         # ============================================================
         # Tab 1: Interactive Demo
         # ============================================================
-        with gr.Tab("
+        with gr.Tab("Interactive Demo"):
             gr.Markdown("""
             ### Try Visual-CoT with Your Own Images!
 
@@ -481,7 +481,7 @@ def create_demo():
             )
 
             with gr.Group():
-                gr.Markdown("####
+                gr.Markdown("#### Visualization")
                 image_output = gr.Image(
                     label="Image with Bounding Box",
                     type="pil",
@@ -520,7 +520,7 @@ def create_demo():
         # ============================================================
         # Tab 2: Benchmark Explorer
         # ============================================================
-        with gr.Tab("
+        with gr.Tab("Benchmark Explorer"):
             gr.Markdown("""
             ### Explore Visual-CoT Benchmark Examples
 
@@ -568,9 +568,9 @@ def create_demo():
         # ============================================================
         # Tab 3: About & Paper
         # ============================================================
-        with gr.Tab("
+        with gr.Tab("About"):
             gr.Markdown("""
-            ##
+            ## Paper Information
 
             **Title:** Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
 
@@ -586,35 +586,27 @@ def create_demo():
 
             ---
 
-            ##
+            ## Model Architecture
 
             ```
-
-
-
-
-
-
-
-
-
-
-
-
-
-            │  │ Step 1: ROI  │  → Bounding Box  │
-            │  └──────────────┘                  │
-            │         ↓                          │
-            │  ┌──────────────┐                  │
-            │  │ Step 2: QA   │  → Final Answer  │
-            │  └──────────────┘                  │
-            │                                    │
-            └────────────────────────────────────┘
+            Visual-CoT Pipeline:
+
+            Image Input
+                 ↓
+            CLIP ViT-L/14 (Vision Encoder)
+                 ↓
+            MLP Projector (2-layer)
+                 ↓
+            LLaMA/Vicuna (Language Model)
+                 ↓
+            Step 1: ROI Detection → Bounding Box
+                 ↓
+            Step 2: Question Answering → Final Answer
             ```
 
             ---
 
-            ##
+            ## Key Results
 
             - **Detection Accuracy**: 75.3% (IoU > 0.5)
             - **Answer Accuracy**: 82.7% (GPT-3.5 evaluated)
@@ -624,13 +616,13 @@ def create_demo():
 
             ---
 
-            ##
+            ## Resources
 
-            -
-            -
-            -
-            -
-            -
+            - **Paper**: [arXiv:2403.16999](https://arxiv.org/abs/2403.16999)
+            - **Code**: [GitHub](https://github.com/deepcs233/Visual-CoT)
+            - **Dataset**: [Hugging Face](https://huggingface.co/datasets/deepcs233/Visual-CoT)
+            - **Project Page**: [https://hao-shao.com/projects/viscot.html](https://hao-shao.com/projects/viscot.html)
+            - **Models**:
               - [VisCoT-7b-224](https://huggingface.co/deepcs233/VisCoT-7b-224)
               - [VisCoT-7b-336](https://huggingface.co/deepcs233/VisCoT-7b-336)
               - [VisCoT-13b-224](https://huggingface.co/deepcs233/VisCoT-13b-224)
@@ -638,7 +630,7 @@ def create_demo():
 
             ---
 
-            ##
+            ## Citation
 
             If you find our work useful, please cite:
 
@@ -653,7 +645,7 @@ def create_demo():
 
             ---
 
-            ##
+            ## License
 
             - **Code**: Apache License 2.0
             - **Dataset**: Research use only
@@ -661,7 +653,7 @@ def create_demo():
 
             ---
 
-            ##
+            ## Acknowledgements
 
             This work is built upon:
             - [LLaVA](https://github.com/haotian-liu/LLaVA) - Base architecture
@@ -674,8 +666,7 @@ def create_demo():
         gr.Markdown("""
         ---
         <div style="text-align: center; color: #666; padding: 20px;">
-            <p
-            <p>Made with ❤️ by the Visual-CoT Team</p>
+            <p>Powered by <a href="https://huggingface.co/docs/hub/spaces-zerogpu">Zero GPU</a> on Hugging Face Spaces</p>
         </div>
         """)
 
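
The new "Model Architecture" section in this diff describes a two-stage chain: the model first localizes a region of interest as a bounding box, then answers the question with that region in context. A minimal sketch of that two-step loop follows, assuming a generic `generate(image, prompt) -> str` callable and placeholder prompt wording; the project's actual inference code and prompt templates may differ.

```python
# Minimal sketch of the two-step Visual-CoT flow described in the diff's
# "Model Architecture" section. `generate` stands in for whatever
# image+prompt -> text call the demo wires up; names and prompt wording
# here are assumptions, not the project's actual API.
from typing import Callable, Tuple

from PIL import Image


def visual_cot_answer(
    image: Image.Image,
    question: str,
    generate: Callable[[Image.Image, str], str],
) -> Tuple[str, str]:
    # Step 1: ROI detection - ask the model for a bounding box around the
    # region that is relevant to the question.
    bbox = generate(
        image,
        f"{question}\nPlease provide the bounding box coordinates of the "
        "region that can help answer the question.",
    )
    # Step 2: question answering - feed the predicted box back in and ask
    # for the final answer.
    answer = generate(
        image,
        f"{question}\nRelevant region: {bbox}\nPlease answer the question.",
    )
    return bbox, answer
```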