BGE-M3 Vietnamese Rental Property Search

Fine-tuned projection head for BAAI/bge-m3, optimized for searching Vietnamese rental listings (phòng trọ, i.e. rented rooms).

This model adds a lightweight, trainable 128-dimensional projection head on top of the frozen BGE-M3 encoder. The head is trained with a contrastive InfoNCE loss in which hard negatives are weighted by feature type.

🎯 Model Description

Base Model: BAAI/bge-m3 (frozen)
Task: Semantic search for Vietnamese rental properties
Training Strategy: Weighted contrastive learning with hard negatives
Output Dimension: 128 (projected from 1024)
Training Data: 10,384 Vietnamese rental property query-document pairs

📊 Performance

Evaluated on 96 test examples:

Metric	Score
MRR	98.44%
Recall@1	96.88%
Recall@5	100.00%
Recall@10	100.00%
Recall@50	100.00%

Interpretation

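98.44% MRR: The reciprocal rank of the correct match averages 0.9844, so it is almost always at rank 1. The figure is consistent with the 3 remaining queries ranking their match second, which gives a mean rank of about 1.03.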
96.88% Recall@1: 93 out of 96 queries find the correct match at the top position
100% Recall@5+: All queries find their correct match within top-5 results
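
The metrics above can be reproduced from per-query ranks. The sketch below is a minimal illustration, not the evaluation script used for this card. The helper names `mrr` and `recall_at_k` are hypothetical, and the rank list is an assumption: 93 queries at rank 1 and 3 queries at rank 2, which is the distribution that matches the reported scores.

```python
def mrr(ranks):
    """Mean reciprocal rank, given the 1-based rank of the correct document per query."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks, k):
    """Fraction of queries whose correct document appears within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Assumed rank distribution (not published in this card):
# 93 queries at rank 1, 3 queries at rank 2.
ranks = [1] * 93 + [2] * 3

print(f"MRR:       {mrr(ranks):.2%}")
print(f"Recall@1:  {recall_at_k(ranks, 1):.2%}")
print(f"Recall@5:  {recall_at_k(ranks, 5):.2%}")
print(f"Mean rank: {sum(ranks) / len(ranks):.2f}")
```

With this distribution the MRR is (93 + 3 × 0.5) / 96 = 0.984375 and Recall@1 is 93 / 96 = 0.96875, matching the table.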

🚀 Quick Start

Installation

pip install transformers torch safetensors

Usage

from transformers import AutoModel, AutoTokenizer
import torch

# Load model
model = AutoModel.from_pretrained(
    "your-username/bge-m3-vietnamese-rental-projection",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Encode texts
texts = [
    "Phòng trọ Quận 10, 25m², giá 5 triệu, WC riêng, máy lạnh",
    "Cho thuê phòng Bình Thạnh, 20m², 4 triệu/tháng"
]

# Method 1: Using encode (recommended)
embeddings = model.encode(texts, device=device)  # [2, 128]

# Method 2: Using forward
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state  # [2, 128], L2-normalized

print(embeddings.shape)  # torch.Size([2, 128])

# Compute similarity (cosine)
similarity = embeddings[0] @ embeddings[1]
print(f"Similarity: {similarity:.4f}")

Search Example

# Build a search engine
class RentalSearchEngine:
    def __init__(self, model, tokenizer, device="cuda"):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device
        self.database_embeddings = None
        self.database_texts = None
    
    def index(self, property_descriptions):
        """Index a database of property descriptions"""
        self.database_texts = property_descriptions
        self.database_embeddings = self.model.encode(
            property_descriptions,
            device=self.device
        )
    
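    def search(self, query, top_k=5):
        """Search for most similar properties"""
        if self.database_embeddings is None:
            raise RuntimeError("Call index() before search()")
        query_emb = self.model.encode([query], device=self.device)[0]

        # Dot product equals cosine similarity because embeddings are L2-normalized
        similarities = query_emb @ self.database_embeddings.T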
        
        # Get top-k
        top_k = min(top_k, len(similarities))
        scores, indices = torch.topk(similarities, k=top_k)
        
        results = []
        for idx, score in zip(indices.tolist(), scores.tolist()):
            results.append({
                "text": self.database_texts[idx],
                "score": score
            })
        
        return results

# Example usage
engine = RentalSearchEngine(model, tokenizer, device)

# Index properties
properties = [
    "Phòng trọ 25m² Quận 10, WC riêng, máy lạnh, giá 5.5tr/tháng",
    "Cho thuê phòng 30m² Quận 1, full nội thất, giá 8tr/tháng",
    "Phòng 20m² Thủ Đức, WC chung, giá 3.5tr/tháng",
    "Studio 35m² Quận 3, ban công, bếp riêng, giá 9tr/tháng",
]
engine.index(properties)

# Search
results = engine.search("phòng trọ q10 25m2 wc riêng 5tr5", top_k=3)

for i, result in enumerate(results, 1):
    print(f"{i}. [{result['score']:.4f}] {result['text']}")

🎓 Training Details

Dataset

Size: 10,384 examples
Split: 9,345 train / 1,039 validation
Format: Each example pairs a query with one positive and its hard negatives
Hard Negatives: 3 per example, weighted by feature type

Weighted Hard Negatives Strategy

The model uses feature-based weighting for hard negatives:

Feature Type	Weight	Importance
Location (Quận)	2.5	Highest
Price	2.0	High
Area (m²)	1.8	Medium
Amenities	1.5	Lower

This teaches the model that location mismatches are more critical than amenity differences.
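
The card does not spell out how the weights enter the loss, so the following is only a sketch of one plausible formulation. It assumes each hard negative's term in the InfoNCE denominator is scaled by its feature weight. The names `weighted_info_nce` and `FEATURE_WEIGHTS` are hypothetical. The sketch covers a single query in one direction only, whereas the training configuration describes the loss as symmetric.

```python
import math

# Feature weights from the table above.
FEATURE_WEIGHTS = {"location": 2.5, "price": 2.0, "area": 1.8, "amenities": 1.5}
TEMPERATURE = 0.07  # temperature listed in the training configuration


def weighted_info_nce(pos_sim, negatives, temperature=TEMPERATURE):
    """InfoNCE loss for one query with feature-weighted hard negatives.

    pos_sim: cosine similarity between the query and its positive.
    negatives: list of (similarity, feature_type) pairs.

    Assumption: each negative's exponentiated logit is multiplied by the
    weight of the feature it mismatches on. The real training code may differ.
    """
    pos_term = math.exp(pos_sim / temperature)
    neg_terms = sum(
        FEATURE_WEIGHTS[feature] * math.exp(sim / temperature)
        for sim, feature in negatives
    )
    return -math.log(pos_term / (pos_term + neg_terms))


# The same similarity is penalized more when the mismatch is on location.
loss_location = weighted_info_nce(0.9, [(0.8, "location")])
loss_amenities = weighted_info_nce(0.9, [(0.8, "amenities")])
print(f"location negative:  {loss_location:.4f}")
print(f"amenities negative: {loss_amenities:.4f}")
```

Under this assumption, a larger weight increases the loss for an equally similar negative, so the gradient pushes location mismatches further from the query than amenity mismatches.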

Training Configuration

{
  "base_model": "BAAI/bge-m3",
  "d_out": 128,
  "freeze_encoder": true,
  "epochs": 17,
  "batch_size": 128,
  "learning_rate": 0.0002,
  "optimizer": "AdamW",
  "weight_decay": 0.01,
  "loss": "Weighted InfoNCE (symmetric)",
  "temperature": 0.07,
  "device": "Tesla T4 (Google Colab)",
  "training_time": "~2.5 hours"
}

Training Progress

Epoch	Train Loss	Val Loss	Status
1	2.9054	2.4529	⭐ Best
5	2.1609	2.0078	⭐ Best
9	2.0237	1.8906	⭐ Best
12	1.9722	1.8760	⭐ Best
16	1.9297	1.8215	⭐ Best
17	1.9191	1.8276	Final

Improvement from epoch 1 to the best checkpoint (epoch 16): -34% train loss, -26% validation loss
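
The improvement figures can be checked against the table with a few lines of arithmetic. This sketch uses a hypothetical helper, `relative_change`, and compares epoch 1 with both the lowest-validation-loss epoch (16) and the final epoch (17).

```python
def relative_change(start, end):
    """Relative change from start to end, as a fraction of start."""
    return (end - start) / start


# (train loss, validation loss) copied from the table above.
epoch_1 = (2.9054, 2.4529)
epoch_16 = (1.9297, 1.8215)  # lowest validation loss
epoch_17 = (1.9191, 1.8276)  # final epoch

for name, row in [("epoch 16", epoch_16), ("epoch 17", epoch_17)]:
    train = relative_change(epoch_1[0], row[0])
    val = relative_change(epoch_1[1], row[1])
    print(f"{name}: train {train:+.1%}, val {val:+.1%}")
```

The epoch 16 row gives roughly -33.6% train and -25.7% validation, which round to the reported -34% and -26%. Validation loss rises slightly at epoch 17, so epoch 16 is the better checkpoint by that measure.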

Model Architecture

BAAI/bge-m3 (frozen)
    ↓ [1024-dim]
ProjectionHead
    ├─ Linear(1024 → 128, bias=False)
    └─ L2 Normalization
    ↓ [128-dim, L2-normalized]
Output Embeddings

Parameters:

Trainable: 131,072 (0.02%)
Total: 567,885,824
Strategy: Only projection head is trainable
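
The parameter count follows directly from the architecture: a bias-free linear layer from 1024 to 128 dimensions has 1024 × 128 weights. The sketch below checks that arithmetic and shows the projection-plus-normalization step on a toy vector. It is a dependency-free illustration, not the model's code. `project_and_normalize` is a hypothetical helper and the toy weights are made up.

```python
import math

D_IN, D_OUT = 1024, 128      # dimensions from the architecture diagram
TOTAL_PARAMS = 567_885_824   # total reported in this card

# A bias-free linear layer has one weight per (input, output) pair.
trainable = D_IN * D_OUT
print(f"trainable: {trainable:,} ({trainable / TOTAL_PARAMS:.2%} of total)")


def project_and_normalize(x, weight):
    """Apply a bias-free linear map, then scale the result to unit L2 norm.

    x: list of d_in floats.
    weight: list of d_out rows, each a list of d_in floats.
    Assumes the projected vector is non-zero.
    """
    y = [sum(w * v for w, v in zip(row, x)) for row in weight]
    norm = math.sqrt(sum(v * v for v in y))
    return [v / norm for v in y]


# Toy 3 -> 2 projection with made-up weights, for illustration only.
toy_weight = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]
print(project_and_normalize([1.0, 1.0, 1.0], toy_weight))
```

Because the output has unit length, the dot product of two embeddings equals their cosine similarity, which is what the usage examples rely on.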

🎯 Use Cases

This model is optimized for:

✅ Vietnamese rental property search

Matching user queries to property listings
Finding similar properties
Semantic search for rental accommodations

✅ Supported features:

Location (districts, neighborhoods)
Price range
Area/size (m²)
Amenities (WC, máy lạnh, ban công, bếp, etc.)
Room type (phòng trọ, studio, etc.)

⚠️ Limitations

Domain-specific: Optimized for Vietnamese rental properties only
Geographic focus: Primarily trained on properties in Ho Chi Minh City and Hanoi
Language: Vietnamese only (not multilingual like base BGE-M3)
Frozen encoder: Base BGE-M3 encoder is not fine-tuned, only projection head
Not for: General-purpose Vietnamese embeddings or other domains

🔍 Example Predictions

Example 1: Location Sensitivity

Query: "phòng trọ Gò Vấp 18m² 3tr5 có wc riêng"

Positive (0.947):  Gò Vấp 18m² 3tr5 wc riêng ✅
Negative 1 (0.366): Quận 12 18m² 3tr5 wc riêng (wrong district!)
Negative 2 (0.411): Gò Vấp 18m² 4tr2 wc riêng (wrong price)
Negative 3 (0.828): Gò Vấp 18m² 3tr5 wc chung (wrong amenity)

→ Model correctly penalizes location mismatch most heavily

Example 2: Feature Understanding

Query: "phòng trọ q10 4tr 20m² có máy lạnh wc riêng gần chợ"

Positive (0.904):  Q10 20m² 4tr máy lạnh wc riêng ✅
Negative 1 (0.542): Q3 20m² 4tr máy lạnh wc riêng (wrong district)
Negative 2 (0.418): Q10 20m² 5.5tr máy lạnh wc riêng (wrong price)
Negative 3 (0.257): Q10 15m² 4tr máy lạnh wc chung (multiple diffs)

→ Strong margin (+0.36) between positive and top negative
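
The margins in both examples can be recomputed from the listed scores. This is a small arithmetic sketch, and `margin_to_top_negative` is a hypothetical helper name.

```python
def margin_to_top_negative(pos_score, neg_scores):
    """Gap between the positive's score and the highest-scoring negative."""
    return pos_score - max(neg_scores)


# Scores copied from the two examples above.
example_1 = margin_to_top_negative(0.947, [0.366, 0.411, 0.828])
example_2 = margin_to_top_negative(0.904, [0.542, 0.418, 0.257])

print(f"Example 1 margin: {example_1:+.3f}")
print(f"Example 2 margin: {example_2:+.3f}")
```

Example 2 reproduces the +0.36 margin. Example 1 has a smaller margin of about +0.12, because its amenity-only negative (0.828) stays close to the positive. That fits the lower weight given to amenity mismatches.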

📖 Citation

If you use this model, please cite:

@misc{bge-m3-vietnamese-rental,
  author = {Your Name},
  title = {BGE-M3 Vietnamese Rental Property Search},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/your-username/bge-m3-vietnamese-rental-projection}},
}

📜 License

MIT License - Free to use for commercial and non-commercial purposes.

🙏 Acknowledgments

Base model: BAAI/bge-m3
Framework: Hugging Face Transformers
Training: Google Colab (Tesla T4)

📧 Contact

For questions or feedback, please open an issue on the model repository.

Last updated: October 2025
