---
language: en
license: cc-by-4.0
tags:
- scientific-retrieval
- dense-passage-retrieval
- dual-encoder
- talk2ref
- speech-to-text
- sentence-embedding
- SBERT
library_name: transformers
pipeline_tag: feature-extraction
base_model: sentence-transformers/all-MiniLM-L6-v2
datasets:
- s8frbroy/talk2ref
---

# 🗣️ Talk2Ref Query Talk Encoder

This model encodes **scientific talks** (transcripts, titles, and years) into dense vector representations, designed for **Reference Prediction from Talks (RPT)** — the task of retrieving relevant cited papers for a given talk.

It was trained as part of the [Talk2Ref dataset](https://huggingface.co/datasets/s8frbroy/talk2ref) project. The model forms the **query-side encoder** in a **dual-encoder (DPR-style)** setup, paired with the [Talk2Ref Cited Paper Encoder](https://huggingface.co/s8frbroy/talk2ref_ref_key_cited_paper_encoder).

---

## 🧩 Model Overview

| Property | Description |
|----------|-------------|
| **Architecture** | Sentence-BERT (all-MiniLM-L6-v2 backbone) |
| **Pooling** | Weighted mean aggregation over transcript chunks |
| **Max tokens per chunk** | 512 |
| **Training data** | Talk2Ref dataset — transcripts of 6,279 scientific talks |
| **Objective** | Contrastive learning (DPR-style) with a binary similarity loss |
| **Task** | Encode scientific talks into a shared semantic space with their cited papers |

---

## 🎯 Usage

Example with `transformers`:

```python
from transformers import AutoTokenizer, AutoModel
import torch

model = AutoModel.from_pretrained("s8frbroy/talk2ref_query_talk_encoder")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

text = "In this talk, we present a new transformer model for scientific retrieval..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Mask-aware mean pooling over token embeddings (see below for
# chunk-level weighted pooling on long transcripts)
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```
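
Because the backbone accepts at most 512 tokens, full transcripts are encoded chunk-wise and the chunk embeddings are aggregated with a weighted mean (see the table above). The exact chunking and weighting scheme is not spelled out here, so the sketch below is a minimal, assumed variant: non-overlapping 512-token chunks, mask-aware mean pooling per chunk, and chunk weights proportional to token counts. The `encode_talk` helper is illustrative, not part of the released code.

```python
import torch
import torch.nn.functional as F

def encode_talk(text, tokenizer, model, max_length=512):
    # Split the transcript into non-overlapping 512-token chunks
    # (requires a fast tokenizer, which all-MiniLM-L6-v2 provides).
    enc = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
        return_overflowing_tokens=True,
        padding=True,
    )
    with torch.no_grad():
        out = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])

    # Mask-aware mean pooling within each chunk.
    mask = enc["attention_mask"].unsqueeze(-1).float()
    chunk_emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

    # Weight chunks by their token counts (assumed weighting scheme).
    weights = enc["attention_mask"].sum(dim=1).float()
    talk_emb = (chunk_emb * weights.unsqueeze(-1)).sum(dim=0) / weights.sum()
    return F.normalize(talk_emb, dim=-1)
```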
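
---

## 🔍 Retrieval with the paired paper encoder

For end-to-end retrieval, candidate papers are embedded with the [Talk2Ref Cited Paper Encoder](https://huggingface.co/s8frbroy/talk2ref_ref_key_cited_paper_encoder) and ranked by similarity to the talk embedding. A minimal sketch, assuming the paper side is fed title-plus-abstract text, mask-aware mean pooling on both sides, and cosine similarity for scoring:

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

talk_model = AutoModel.from_pretrained("s8frbroy/talk2ref_query_talk_encoder")
paper_model = AutoModel.from_pretrained("s8frbroy/talk2ref_ref_key_cited_paper_encoder")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def embed(texts, model):
    # Mask-aware mean pooling, L2-normalized so dot product = cosine similarity.
    enc = tokenizer(texts, return_tensors="pt", truncation=True,
                    padding=True, max_length=512)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1).float()
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    return F.normalize(emb, dim=-1)

talk = "In this talk, we present a new transformer model for scientific retrieval..."
papers = [  # hypothetical title + abstract strings
    "Dense Passage Retrieval for Open-Domain Question Answering. We show that ...",
    "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. We present ...",
]

scores = embed([talk], talk_model) @ embed(papers, paper_model).T  # (1, num_papers)
ranking = scores.squeeze(0).argsort(descending=True)
print(ranking)  # paper indices sorted by predicted relevance to the talk
```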
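
---

## 🧪 Training objective (sketch)

The overview table describes the objective as DPR-style contrastive learning with a binary similarity loss. The exact formulation used in training is not reproduced here; one common reading, sketched below under that assumption, scores all in-batch talk-paper pairs and applies binary cross-entropy with positives on the diagonal:

```python
import torch
import torch.nn.functional as F

def binary_similarity_loss(talk_emb, paper_emb):
    # talk_emb, paper_emb: (batch, dim); row i of each side forms a positive pair.
    scores = talk_emb @ paper_emb.T                       # (batch, batch) logits
    labels = torch.eye(scores.size(0), device=scores.device)
    # Diagonal pairs are positives; all other in-batch pairs act as negatives.
    return F.binary_cross_entropy_with_logits(scores, labels)
```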