---
license: apache-2.0
tags:
- clip
- feature-extraction
---

# DGTRS-CLIP-ViT-L-14

This is the DGTRS-CLIP-ViT-L-14 model. It can be used for a variety of tasks, including zero-shot image classification and text-image retrieval. The model is compatible with both the `transformers` and `diffusers` libraries.

## How to use

### With `transformers`

```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

# Load model and processor
model = CLIPModel.from_pretrained("BiliSakura/DGTRS-CLIP-ViT-L-14")
processor = CLIPProcessor.from_pretrained("BiliSakura/DGTRS-CLIP-ViT-L-14")

# Load and preprocess the image together with the candidate texts
image = Image.open("path/to/your/image.jpg")
inputs = processor(
    text=["a photo of a building", "a photo of vegetation", "a photo of water"],
    images=image,
    return_tensors="pt",
    padding=True
)

# Get image-text similarity scores
with torch.inference_mode():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
print(f"Similarity scores: {probs}")
```

**Zero-shot image classification:**

```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("BiliSakura/DGTRS-CLIP-ViT-L-14")
processor = CLIPProcessor.from_pretrained("BiliSakura/DGTRS-CLIP-ViT-L-14")

# Define candidate labels
candidate_labels = [
    "a satellite image of urban area",
    "a satellite image of forest",
    "a satellite image of agricultural land",
    "a satellite image of water body"
]

image = Image.open("path/to/your/image.jpg")
inputs = processor(
    text=candidate_labels,
    images=image,
    return_tensors="pt",
    padding=True
)

with torch.inference_mode():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=1)

# Get the predicted label
predicted_idx = probs.argmax().item()
print(f"Predicted label: {candidate_labels[predicted_idx]}")
print(f"Confidence: {probs[0][predicted_idx]:.4f}")
```

**Extracting individual features:**

```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("BiliSakura/DGTRS-CLIP-ViT-L-14")
processor = CLIPProcessor.from_pretrained("BiliSakura/DGTRS-CLIP-ViT-L-14")

# Get image features only
image = Image.open("path/to/your/image.jpg")
image_inputs = processor(images=image, return_tensors="pt")
with torch.inference_mode():
    image_features = model.get_image_features(**image_inputs)

# Get text features only
text_inputs = processor(
    text=["a satellite image of urban area"],
    return_tensors="pt",
    padding=True,
    truncation=True
)
with torch.inference_mode():
    text_features = model.get_text_features(**text_inputs)

print(f"Image features shape: {image_features.shape}")
print(f"Text features shape: {text_features.shape}")
```

### With `diffusers`

This model's text encoder can be used with Stable Diffusion and other diffusion models:

```python
from transformers import CLIPTextModel, CLIPTokenizer
import torch

# Load the text encoder and tokenizer
text_encoder = CLIPTextModel.from_pretrained(
    "BiliSakura/DGTRS-CLIP-ViT-L-14",
    subfolder="diffusers/text_encoder",
    torch_dtype=torch.float16
)
tokenizer = CLIPTokenizer.from_pretrained(
    "BiliSakura/DGTRS-CLIP-ViT-L-14"
)

# Encode a text prompt
prompt = "a satellite image of a city with buildings and roads"
text_inputs = tokenizer(
    prompt,
    padding="max_length",
    max_length=77,
    truncation=True,
    return_tensors="pt"
)

with torch.inference_mode():
    text_outputs = text_encoder(text_inputs.input_ids)
    text_embeddings = text_outputs.last_hidden_state

print(f"Text embeddings shape: {text_embeddings.shape}")
```

**Using with Stable Diffusion:**

```python
from diffusers import StableDiffusionPipeline
import torch

# Load the pipeline, reusing the text_encoder and tokenizer
# loaded in the previous example
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate an image
prompt = "a high-resolution satellite image of urban area"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("generated_image.png")
```

## Citation

If you use this model in your research, please cite the original paper:

```bibtex
@article{chenDGTRSDDGTRSCLIPDualGranularity2025a,
  title = {{{DGTRSD}} and {{DGTRSCLIP}}: {{A Dual-Granularity Remote Sensing Image}}--{{Text Dataset}} and {{Vision}}--{{Language Foundation Model}} for {{Alignment}}},
  shorttitle = {{{DGTRSD}} and {{DGTRSCLIP}}},
  author = {Chen, Weizhi and Deng, Yupeng and Jin, Wei and Chen, Jingbo and Chen, Jiansheng and Feng, Yuman and Xi, Zhihao and Liu, Diyou and Li, Kai and Meng, Yu},
  year = {2025},
  journal = {IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing},
  volume = {18},
  pages = {29113--29130},
  issn = {2151-1535},
  doi = {10.1109/JSTARS.2025.3625958}
}
```
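## Appendix: text-image retrieval

The model card mentions text-image retrieval: once you have embeddings from `get_image_features` and `get_text_features`, retrieval reduces to L2-normalizing both sets and ranking candidates by cosine similarity. A minimal sketch of the ranking step, assuming precomputed embeddings (random tensors stand in for real CLIP features here, and `rank_images` is a hypothetical helper name; ViT-L/14 projects to 768 dimensions):

```python
import torch

def rank_images(text_features: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
    """Return image indices sorted by cosine similarity to each text query."""
    # L2-normalize so the dot product equals cosine similarity
    text = text_features / text_features.norm(dim=-1, keepdim=True)
    images = image_features / image_features.norm(dim=-1, keepdim=True)
    sims = text @ images.T  # (num_texts, num_images)
    return sims.argsort(dim=-1, descending=True)

# Placeholder embeddings; in practice, pass the outputs of
# model.get_text_features(...) and model.get_image_features(...)
text_feats = torch.randn(2, 768)   # 2 text queries
image_feats = torch.randn(5, 768)  # 5 candidate images
ranking = rank_images(text_feats, image_feats)
print(ranking.shape)  # torch.Size([2, 5])
```

Each row of `ranking` lists the candidate images for one query, best match first; for large galleries, the image embeddings can be computed once and cached.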