stereoplegic's Collections: Datasets
Super-NaturalInstructions: Generalization via Declarative Instructions
on 1600+ NLP Tasks
Paper
• 2204.07705
• Published
• 2
Knowledge-Driven CoT: Exploring Faithful Reasoning in LLMs for
Knowledge-intensive Question Answering
Paper
• 2308.13259
• Published
• 2
MAmmoTH: Building Math Generalist Models through Hybrid Instruction
Tuning
Paper
• 2309.05653
• Published
• 10
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language
Models
Paper
• 2309.12284
• Published
• 19
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large
Language Models in 167 Languages
Paper
• 2309.09400
• Published
• 87
CommonCanvas: An Open Diffusion Model Trained with Creative-Commons
Images
Paper
• 2310.16825
• Published
• 36
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Paper
• 2303.03915
• Published
• 7
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
Paper
• 2309.04662
• Published
• 25
SlimPajama-DC: Understanding Data Combinations for LLM Training
Paper
• 2309.10818
• Published
• 11
Towards Effective Disambiguation for Machine Translation with Large
Language Models
Paper
• 2309.11668
• Published
• 1
Improving Translation Faithfulness of Large Language Models via
Augmenting Instructions
Paper
• 2308.12674
• Published
• 1
TinyStories: How Small Can Language Models Be and Still Speak Coherent
English?
Paper
• 2305.07759
• Published
• 39
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora
with Web Data, and Web Data Only
Paper
• 2306.01116
• Published
• 43
KoSBi: A Dataset for Mitigating Social Bias Risks Towards Safer Large
Language Model Application
Paper
• 2305.17701
• Published
• 1
M^3IT: A Large-Scale Dataset towards Multi-Modal Multilingual
Instruction Tuning
Paper
• 2306.04387
• Published
• 9
The Flan Collection: Designing Data and Methods for Effective
Instruction Tuning
Paper
• 2301.13688
• Published
• 9
PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark
for Finance
Paper
• 2306.05443
• Published
• 3
NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages
Paper
• 2309.10661
• Published
• 1
The Vault: A Comprehensive Multilingual Dataset for Advancing Code
Understanding and Generation
Paper
• 2305.06156
• Published
• 2
Searching by Code: a New SearchBySnippet Dataset and SnippeR Retrieval
Model for Searching by Code Snippets
Paper
• 2305.11625
• Published
• 1
MVP: Multi-task Supervised Pre-training for Natural Language Generation
Paper
• 2206.12131
• Published
• 1
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing
& Attribution in AI
Paper
• 2310.16787
• Published
• 5
Language Models can be Logical Solvers
Paper
• 2311.06158
• Published
• 20
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
Paper
• 2308.13387
• Published
• 1
Skill-it! A Data-Driven Skills Framework for Understanding and Training
Language Models
Paper
• 2307.14430
• Published
• 3
FACT: Learning Governing Abstractions Behind Integer Sequences
Paper
• 2209.09543
• Published
• 2
HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM
Paper
• 2311.09528
• Published
• 2
Source Prompt: Coordinated Pre-training of Language Models on Diverse
Corpora from Multiple Sources
Paper
• 2311.09732
• Published
• 1
Sabiá: Portuguese Large Language Models
Paper
• 2304.07880
• Published
• 4
Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale
Pretraining Corpus for Math
Paper
• 2312.17120
• Published
• 28
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved
Pre-Training
Paper
• 2401.00849
• Published
• 17
Trusted Source Alignment in Large Language Models
Paper
• 2311.06697
• Published
• 12
ChatQA: Building GPT-4 Level Conversational QA Models
Paper
• 2401.10225
• Published
• 36
Scientific and Creative Analogies in Pretrained Language Models
Paper
• 2211.15268
• Published
• 3
Dolma: an Open Corpus of Three Trillion Tokens for Language Model
Pretraining Research
Paper
• 2402.00159
• Published
• 65
Aya Dataset: An Open-Access Collection for Multilingual Instruction
Tuning
Paper
• 2402.06619
• Published
• 57
StarCoder 2 and The Stack v2: The Next Generation
Paper
• 2402.19173
• Published
• 152
A Touch, Vision, and Language Dataset for Multimodal Alignment
Paper
• 2402.13232
• Published
• 16
GECTurk: Grammatical Error Correction and Detection Dataset for Turkish
Paper
• 2309.11346
• Published
WildChat: 1M ChatGPT Interaction Logs in the Wild
Paper
• 2405.01470
• Published
• 64
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset
Paper
• 2402.10176
• Published
• 38
MS MARCO Web Search: a Large-scale Information-rich Web Dataset with
Millions of Real Click Labels
Paper
• 2405.07526
• Published
• 21
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal
Dataset with One Trillion Tokens
Paper
• 2406.11271
• Published
• 21
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context
Reinforcement Learning
Paper
• 2406.08973
• Published
• 89
DataComp-LM: In search of the next generation of training sets for
language models
Paper
• 2406.11794
• Published
• 55
PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal
Documents
Paper
• 2406.13923
• Published
• 25
Glot500: Scaling Multilingual Corpora and Language Models to 500
Languages
Paper
• 2305.12182
• Published
• 1