# DGTRS-CLIP-ViT-L-14

Part of the RSCLIP collection: Remote Sensing CLIP models packaged as production-ready text encoders for both huggingface/transformers and huggingface/diffusers (7 items).
This is the DGTRS-CLIP-ViT-L-14 model. It can be used for a variety of tasks, including zero-shot image classification and text-image retrieval.
This model is compatible with both the transformers and diffusers libraries.
## transformers

```python
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("BiliSakura/DGTRS-CLIP-ViT-L-14")
processor = CLIPProcessor.from_pretrained("BiliSakura/DGTRS-CLIP-ViT-L-14")
# Use the model for image-text similarity, zero-shot classification, etc.
# (see the zero-shot classification sketch below)
```
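For instance, zero-shot scene classification can look like the following. This is a minimal sketch: the local image path `scene.png` and the candidate labels are placeholders, not part of the model card.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("BiliSakura/DGTRS-CLIP-ViT-L-14")
processor = CLIPProcessor.from_pretrained("BiliSakura/DGTRS-CLIP-ViT-L-14")

image = Image.open("scene.png")  # placeholder: any RGB remote sensing image
labels = [
    "a satellite photo of an airport",
    "a satellite photo of a forest",
    "a satellite photo of a harbor",
]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_images, num_labels); softmax turns the
# image-text similarity scores into a distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Text-image retrieval works the same way, ranking a batch of images against a query caption via `logits_per_text`.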
## diffusers

This model's text encoder can be used as a drop-in replacement in Stable Diffusion, as sketched below.
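A minimal sketch of swapping the text encoder into a Stable Diffusion pipeline. It assumes the checkpoint stores its text tower in standard transformers CLIP format; the base checkpoint (`stable-diffusion-v1-5/stable-diffusion-v1-5`) and the prompt are illustrative, chosen because SD v1.x pipelines are built around a ViT-L/14 text encoder.

```python
from diffusers import StableDiffusionPipeline
from transformers import CLIPTextModel, CLIPTokenizer

# Load only the text tower of the CLIP checkpoint.
text_encoder = CLIPTextModel.from_pretrained("BiliSakura/DGTRS-CLIP-ViT-L-14")
tokenizer = CLIPTokenizer.from_pretrained("BiliSakura/DGTRS-CLIP-ViT-L-14")

# Illustrative base checkpoint: any SD v1.x pipeline that expects a
# ViT-L/14 text encoder should accept this swap.
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    text_encoder=text_encoder,
    tokenizer=tokenizer,
)

image = pipe("a satellite image of an airport runway").images[0]
image.save("airport.png")
```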
## Citation

If you use this model in your research, please cite the original paper:
```bibtex
@article{chenDGTRSDDGTRSCLIPDualGranularity2025a,
  title   = {{DGTRSD} and {DGTRSCLIP}: A Dual-Granularity Remote Sensing Image--Text Dataset and Vision--Language Foundation Model for Alignment},
  author  = {Chen, Weizhi and Deng, Yupeng and Jin, Wei and Chen, Jingbo and Chen, Jiansheng and Feng, Yuman and Xi, Zhihao and Liu, Diyou and Li, Kai and Meng, Yu},
  year    = {2025},
  journal = {IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing},
  volume  = {18},
  pages   = {29113--29130},
  issn    = {2151-1535},
  doi     = {10.1109/JSTARS.2025.3625958}
}
```