nielsr (HF Staff) committed
Commit 73a9ecc · verified · 1 Parent(s): 095d094

Add link to paper, correct dataset name, change library_name


This PR improves the model card by linking it to the paper, corrects the dataset name, and updates the `library_name` to "transformers".

Files changed (1):
README.md (+19 −18)
README.md CHANGED
````diff
@@ -1,6 +1,12 @@
 ---
-library_name: peft
 base_model: lmms-lab/llava-onevision-qwen2-0.5b-ov
+datasets:
+- Dataseeds/DataSeeds-Sample-Dataset-DSD
+language:
+- en
+library_name: transformers
+license: apache-2.0
+pipeline_tag: image-text-to-text
 tags:
 - vision-language
 - multimodal
@@ -11,12 +17,6 @@ tags:
 - photography
 - scene-analysis
 - image-captioning
-license: apache-2.0
-datasets:
-- Dataseeds/DataSeeds-Sample-Dataset-DSD
-language:
-- en
-pipeline_tag: image-text-to-text
 model-index:
 - name: LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune
   results:
@@ -24,26 +24,26 @@ model-index:
       type: image-captioning
       name: Image Captioning
     dataset:
-      type: Dataseeds/DataSeeds-Sample-Dataset-DSD
       name: DataSeeds.AI Sample Dataset
+      type: Dataseeds/DataSeeds-Sample-Dataset-DSD
     metrics:
     - type: bleu-4
       value: 0.0246
       name: BLEU-4
     - type: rouge-l
-      value: 0.2140
+      value: 0.214
       name: ROUGE-L
     - type: bertscore
       value: 0.2789
       name: BERTScore F1
     - type: clipscore
-      value: 0.3260
+      value: 0.326
       name: CLIPScore
 ---
 
 # LLaVA-OneVision-Qwen2-0.5b Fine-tuned on DataSeeds.AI Dataset
 
-This model is a LoRA (Low-Rank Adaptation) fine-tuned version of [lmms-lab/llava-onevision-qwen2-0.5b-ov](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov) specialized for photography scene analysis and description generation. The model was fine-tuned on the [DataSeeds Sample Dataset (DSD)](https://huggingface.co/datasets/Dataseeds/DataSeeds-Sample-Dataset-DSD) to enhance its capabilities in generating detailed, accurate descriptions of photographic content.
+This model is a LoRA (Low-Rank Adaptation) fine-tuned version of [lmms-lab/llava-onevision-qwen2-0.5b-ov](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov) specialized for photography scene analysis and description generation. The model was presented in the paper [Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery](https://huggingface.co/papers/2506.05673). The model was fine-tuned on the [DataSeeds Sample Dataset (DSD)](https://huggingface.co/datasets/Dataseeds/DataSeeds-Sample-Dataset-DSD) to enhance its capabilities in generating detailed, accurate descriptions of photographic content.
 
 ## Model Description
 
@@ -67,7 +67,7 @@ This model is a LoRA (Low-Rank Adaptation) fine-tuned version of [lmms-lab/llava
 ## Training Details
 
 ### Dataset
-The model was fine-tuned on the GuruShots Sample Dataset, a curated collection of photography images with detailed scene descriptions focusing on:
+The model was fine-tuned on the DataSeeds Sample Dataset, a curated collection of photography images with detailed scene descriptions focusing on:
 - Compositional elements and camera perspectives
 - Lighting conditions and visual ambiance
 - Product identification and technical details
@@ -187,7 +187,8 @@ for prompt in prompts:
     outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7)
     description = processor.decode(outputs[0], skip_special_tokens=True)
     print(f"Prompt: {prompt}")
-    print(f"Description: {description}\n")
+    print(f"Description: {description}
+")
 ```
 
 ## Model Architecture
@@ -212,7 +213,7 @@ The model maintains the LLaVA-OneVision architecture with the following componen
 
 ## Training Data
 
-The GuruShots Sample Dataset contains curated photography images with comprehensive annotations including:
+The DataSeeds Sample Dataset contains curated photography images with comprehensive annotations including:
 
 - **Scene Descriptions**: Detailed textual descriptions of visual content
 - **Technical Metadata**: Camera settings, composition details
@@ -230,7 +231,7 @@ The dataset focuses on enhancing the model's ability to:
 ### Model Limitations
 - **Domain Specialization**: Optimized for photography; may have reduced performance on general vision-language tasks
 - **Base Model Inheritance**: Inherits limitations from LLaVA-OneVision base model
-- **Training Data Bias**: May reflect biases present in the GuruShots dataset
+- **Training Data Bias**: May reflect biases present in the DataSeeds dataset
 - **Language Support**: Primarily trained and evaluated on English descriptions
 
 ### Recommended Use Cases
@@ -252,7 +253,7 @@ If you use this model in your research or applications, please cite:
 
 ```bibtex
 @article{abdoli2025peerranked,
-  title={Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from GuruShots' Annotated Imagery},
+  title={Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery},
   author={Sajjad Abdoli and Freeman Lewin and Gediminas Vasiliauskas and Fabian Schonholz},
   journal={arXiv preprint arXiv:2506.05673},
   year={2025},
@@ -290,9 +291,9 @@ This model is released under the Apache 2.0 license, consistent with the base LL
 
 - **Base Model**: Thanks to LMMS Lab for the LLaVA-OneVision model
 - **Vision Encoder**: Thanks to Google Research for the SigLIP model
-- **Dataset**: GuruShots photography community for the source imagery
+- **Dataset**: DataSeeds photography community for the source imagery
 - **Framework**: Hugging Face PEFT library for efficient fine-tuning capabilities
 
 ---
 
-*For questions, issues, or collaboration opportunities, please visit the [model repository](https://huggingface.co/Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune) or contact the DataSeeds.AI team.*
+*For questions, issues, or collaboration opportunities, please visit the [model repository](https://huggingface.co/Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune) or contact the DataSeeds.AI team.*
````
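
For quick reference, the metadata hunks in this commit resolve to front matter along these lines. This is a reconstruction from the visible diff context only; tags collapsed out of the hunks are left elided:

```yaml
---
base_model: lmms-lab/llava-onevision-qwen2-0.5b-ov
datasets:
- Dataseeds/DataSeeds-Sample-Dataset-DSD
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
# … additional tags hidden in the collapsed diff context …
- photography
- scene-analysis
- image-captioning
---
```

Ordering the license, datasets, language, and pipeline_tag keys ahead of the tags list (rather than after it, as before) keeps the keys the Hub reads for the model page grouped at the top of the block.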