Add link to paper, correct dataset name, change library_name
This PR improves the model card by linking it to the paper, correcting the dataset name, and updating the `library_name` to "transformers".
README.md (CHANGED)

````diff
@@ -1,6 +1,12 @@
 ---
-library_name: peft
 base_model: lmms-lab/llava-onevision-qwen2-0.5b-ov
+datasets:
+- Dataseeds/DataSeeds-Sample-Dataset-DSD
+language:
+- en
+library_name: transformers
+license: apache-2.0
+pipeline_tag: image-text-to-text
 tags:
 - vision-language
 - multimodal
@@ -11,12 +17,6 @@ tags:
 - photography
 - scene-analysis
 - image-captioning
-license: apache-2.0
-datasets:
-- Dataseeds/DataSeeds-Sample-Dataset-DSD
-language:
-- en
-pipeline_tag: image-text-to-text
 model-index:
 - name: LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune
   results:
@@ -24,26 +24,26 @@ model-index:
       type: image-captioning
       name: Image Captioning
     dataset:
-      type: Dataseeds/DataSeeds-Sample-Dataset-DSD
       name: DataSeeds.AI Sample Dataset
+      type: Dataseeds/DataSeeds-Sample-Dataset-DSD
     metrics:
     - type: bleu-4
       value: 0.0246
       name: BLEU-4
     - type: rouge-l
-      value: 0.
+      value: 0.214
       name: ROUGE-L
     - type: bertscore
       value: 0.2789
       name: BERTScore F1
     - type: clipscore
-      value: 0.
+      value: 0.326
       name: CLIPScore
 ---
 
 # LLaVA-OneVision-Qwen2-0.5b Fine-tuned on DataSeeds.AI Dataset
 
-This model is a LoRA (Low-Rank Adaptation) fine-tuned version of [lmms-lab/llava-onevision-qwen2-0.5b-ov](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov) specialized for photography scene analysis and description generation. The model was fine-tuned on the [DataSeeds Sample Dataset (DSD)](https://huggingface.co/datasets/Dataseeds/DataSeeds-Sample-Dataset-DSD) to enhance its capabilities in generating detailed, accurate descriptions of photographic content.
+This model is a LoRA (Low-Rank Adaptation) fine-tuned version of [lmms-lab/llava-onevision-qwen2-0.5b-ov](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov) specialized for photography scene analysis and description generation. The model was presented in the paper [Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery](https://huggingface.co/papers/2506.05673). The model was fine-tuned on the [DataSeeds Sample Dataset (DSD)](https://huggingface.co/datasets/Dataseeds/DataSeeds-Sample-Dataset-DSD) to enhance its capabilities in generating detailed, accurate descriptions of photographic content.
 
 ## Model Description
 
@@ -67,7 +67,7 @@ This model is a LoRA (Low-Rank Adaptation) fine-tuned version of [lmms-lab/llava
 ## Training Details
 
 ### Dataset
-The model was fine-tuned on the
+The model was fine-tuned on the DataSeeds Sample Dataset, a curated collection of photography images with detailed scene descriptions focusing on:
 - Compositional elements and camera perspectives
 - Lighting conditions and visual ambiance
 - Product identification and technical details
@@ -187,7 +187,8 @@ for prompt in prompts:
     outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7)
     description = processor.decode(outputs[0], skip_special_tokens=True)
     print(f"Prompt: {prompt}")
-    print(f"Description: {description}
+    print(f"Description: {description}
+")
 ```
 
 ## Model Architecture
@@ -212,7 +213,7 @@ The model maintains the LLaVA-OneVision architecture with the following componen
 
 ## Training Data
 
-The
+The DataSeeds Sample Dataset contains curated photography images with comprehensive annotations including:
 
 - **Scene Descriptions**: Detailed textual descriptions of visual content
 - **Technical Metadata**: Camera settings, composition details
@@ -230,7 +231,7 @@ The dataset focuses on enhancing the model's ability to:
 ### Model Limitations
 - **Domain Specialization**: Optimized for photography; may have reduced performance on general vision-language tasks
 - **Base Model Inheritance**: Inherits limitations from LLaVA-OneVision base model
-- **Training Data Bias**: May reflect biases present in the
+- **Training Data Bias**: May reflect biases present in the DataSeeds dataset
 - **Language Support**: Primarily trained and evaluated on English descriptions
 
 ### Recommended Use Cases
@@ -252,7 +253,7 @@ If you use this model in your research or applications, please cite:
 
 ```bibtex
 @article{abdoli2025peerranked,
-  title={Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from
+  title={Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery},
   author={Sajjad Abdoli and Freeman Lewin and Gediminas Vasiliauskas and Fabian Schonholz},
   journal={arXiv preprint arXiv:2506.05673},
   year={2025},
@@ -290,9 +291,9 @@ This model is released under the Apache 2.0 license, consistent with the base LL
 
 - **Base Model**: Thanks to LMMS Lab for the LLaVA-OneVision model
 - **Vision Encoder**: Thanks to Google Research for the SigLIP model
-- **Dataset**:
+- **Dataset**: DataSeeds photography community for the source imagery
 - **Framework**: Hugging Face PEFT library for efficient fine-tuning capabilities
 
 ---
 
-*For questions, issues, or collaboration opportunities, please visit the [model repository](https://huggingface.co/Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune) or contact the DataSeeds.AI team.*
+*For questions, issues, or collaboration opportunities, please visit the [model repository](https://huggingface.co/Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune) or contact the DataSeeds.AI team.*
````