# 🎯 Getting Started with Synthetic Translation Dataset

## 📖 Table of Contents

1. [What is This Project?](#what-is-this-project)
2. [Quick Setup](#quick-setup)
3. [Your First Translation](#your-first-translation)
4. [Understanding the Output](#understanding-the-output)
5. [Production Usage](#production-usage)
6. [Next Steps](#next-steps)

---

## 🤔 What is This Project?

A production-ready system for generating synthetic translation datasets using Large Language Models (LLMs). It includes:

- **ASR Translation**: Convert Vietnamese ASR transcriptions → Clean English text
- **Chat Translation**: Convert Vietnamese chat → Formal English + Political compliance check

### Key Benefits

- ✅ **Clean Code**: Well-organized, easy to maintain
- ✅ **Scalable**: Multi-CPU processing with dynamic batching
- ✅ **Reliable**: Checkpoint support, error handling, retry logic
- ✅ **Tested**: Comprehensive unit tests
- ✅ **Documented**: Detailed guides and examples

---

## ⚙️ Quick Setup

### Step 1: Install Dependencies

```bash
cd /home/dungvpt/workspace/mlm_training/synthetic_projects
pip install -r requirements.txt
```

### Step 2: Configure

```bash
# Copy the environment template
cp .env.example .env

# Edit if needed (the default works with your VLLM setup)
nano .env
```

### Step 3: Verify Installation

```bash
bash scripts/test_installation.sh
```

You should see: **✅ All tests passed!**

---

## 🚀 Your First Translation

### Example 1: Translate 5 ASR Samples

```bash
python -m src.asr_translation.runner \
    --input translation_for_asr/telephone2000h.txt \
    --output outputs/my_first_translation.jsonl \
    --limit 5 \
    --num-workers 2
```

**What happens:**

1. Reads 5 lines from the input file
2. Generates translation prompts
3. Calls the VLLM server on port 8000
4. Saves the results to a JSONL file

### Example 2: Translate 10 Chat Messages

```bash
python -m src.chat_translation.runner \
    --dataset tarudesu/VOZ-HSD \
    --output outputs/my_first_chat.jsonl \
    --limit 10 \
    --num-workers 2
```

**What happens:**

1. Loads 10 messages from the HuggingFace dataset
2. Translates them to formal English
3. Checks political compliance
4. Saves the results with moderation metadata

---

## 📊 Understanding the Output

### ASR Translation Output

```json
{
  "id": "asr_00000001",
  "input_text": "thê mà con lại thấy bà ấy có cái quang gánh",
  "output_text": "But I noticed she had a carrying pole",
  "status": "completed",
  "confidence": "high",
  "processing_time": 1.5,
  "timestamp": "2025-11-28T04:00:00"
}
```

### Chat Translation Output

```json
{
  "id": "chat_00000001",
  "input_text": "Em ăn hoành thánh sáng bị khó chịu",
  "output_text": "I felt nauseous after eating wontons for breakfast",
  "status": "completed",
  "is_politically_compliant": true,
  "compliance_level": "compliant",
  "flagged_issues": [],
  "moderation_reasoning": "Content is about personal health, no issues",
  "confidence": "high",
  "processing_time": 2.1
}
```

### View Your Results

```bash
# View the first result
head -n 1 outputs/my_first_translation.jsonl | jq .

# Count results
wc -l outputs/my_first_translation.jsonl

# View all results nicely
cat outputs/my_first_translation.jsonl | jq .
```

---

## 🏭 Production Usage

### Full ASR Translation

```bash
# Run in the background
nohup bash scripts/run_asr_translation.sh > logs/asr_production.log 2>&1 &

# Get the process ID
echo $!
# Monitor progress
tail -f logs/asr_production.log

# Check how many items have been processed so far
ls -lh outputs/asr_translation/checkpoints/
```

### Full Chat Translation

```bash
# Run in the background
nohup bash scripts/run_chat_translation.sh > logs/chat_production.log 2>&1 &

# Monitor
tail -f logs/chat_production.log
```

### Stop a Running Process

```bash
# Find the process
ps aux | grep python | grep runner

# Kill gracefully (saves a checkpoint); <PID> is the process ID found above
kill -SIGINT <PID>
```

### Resume from Checkpoint

```bash
# Merge existing checkpoints
bash scripts/merge_checkpoints.sh \
    outputs/asr_translation/checkpoints \
    outputs/asr_merged.jsonl

# Continue processing the remaining items
# (you'll need to adjust the input to skip already-processed items)
```

---

## 📈 Monitoring & Validation

### Check Progress

```bash
# Watch checkpoint creation
watch -n 5 'ls -lh outputs/asr_translation/checkpoints/ | tail -5'

# Monitor system resources
htop

# Check VLLM usage (on the GPU server)
nvidia-smi -l 1
```

### Validate Output

```bash
# Validate ASR output
python scripts/validate_asr_output.py \
    outputs/my_first_translation.jsonl

# Validate chat output
python scripts/validate_chat_output.py \
    outputs/my_first_chat.jsonl

# Get statistics
bash scripts/calculate_stats.sh outputs/my_first_translation.jsonl
```

---

## 🔧 Customization

### Adjust Worker Count

```bash
# More workers = faster (if you have the CPU cores)
python -m src.asr_translation.runner \
    --num-workers 16 \
    ...

# Fewer workers = less memory usage
python -m src.asr_translation.runner \
    --num-workers 4 \
    ...
```

### Modify Prompts

```bash
# Edit ASR prompts
nano src/asr_translation/prompts.py

# Edit chat prompts
nano src/chat_translation/prompts.py

# Test your changes
python examples/demo.py
```

### Change Model Parameters

```bash
# Edit the .env file
nano .env

# Key parameters:
VLLM__TEMPERATURE=0.7   # Lower = more deterministic
VLLM__TOP_P=0.9         # Nucleus sampling
VLLM__MAX_TOKENS=4096   # Max output length
```

---

## 📚 Next Steps

Now that you're up and running:

1. **Read the Guides**
   - `QUICKSTART.md` - 5-minute overview
   - `USAGE_GUIDE.md` - Comprehensive documentation
   - `PROJECT_SUMMARY.md` - Architecture details
2. **Explore Examples**
   ```bash
   python examples/demo.py
   ```
3. **Run Tests**
   ```bash
   pytest tests/ -v
   ```
4. **Customize for Your Needs**
   - Modify prompts
   - Add custom fields
   - Integrate with your pipeline
5. **Monitor Production**
   - Use the validation scripts
   - Check logs regularly
   - Optimize worker/batch settings

---

## 💡 Pro Tips

1. **Always test with `--limit` first**
   ```bash
   --limit 10  # Test with 10 samples before a full run
   ```
2. **Use checkpoints for large datasets**
   ```bash
   # Checkpoints are saved every 1000 items by default
   # Resume by merging checkpoints
   ```
3. **Monitor VLLM server health**
   ```bash
   # Check that VLLM is responsive
   curl http://localhost:8000/v1/models
   ```
4. **Validate output quality**
   ```bash
   # Always validate before using the data
   python scripts/validate_asr_output.py outputs/my_first_translation.jsonl
   ```
5. **Tune for your hardware**
   ```bash
   # More CPU cores → more workers
   # More RAM → larger batches
   # Watch system resources and adjust
   ```

---

## ❓ Common Questions

**Q: How long will it take to process my data?**
A: It depends on dataset size and VLLM performance. Typical throughput is 5-10 items/sec/worker.

**Q: What if processing stops?**
A: Checkpoints are saved automatically. Merge the checkpoints and resume.

**Q: Can I process multiple datasets simultaneously?**
A: Yes! Just run multiple scripts with different output directories.

**Q: How do I know if the translations are good quality?**
A: Use the validation scripts and manually review samples.

**Q: What if the VLLM server goes down?**
A: Processing will retry automatically. If it fails, checkpoints are preserved.

---

## 🆘 Need Help?

1. **Check Logs**: `tail -f outputs/*/logs/*.log`
2. **Run Demo**: `python examples/demo.py`
3. **Test Setup**: `bash scripts/test_installation.sh`
4. **Read Docs**: See `USAGE_GUIDE.md`
5. **Validate Output**: Use the validation scripts

---

## 🎉 You're Ready!
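One loose end worth closing first: the "Resume from Checkpoint" section notes that you'll need to adjust the input to skip already-processed items. A minimal Python sketch of that filtering step, matching raw input lines against the `input_text` field of the merged JSONL; exact-match filtering and the `remaining_inputs` helper name are illustrative assumptions, not part of the project's API:

```python
import json

def remaining_inputs(input_lines, merged_jsonl_lines):
    """Return the input lines whose text does not yet appear as a
    completed `input_text` in the merged checkpoint output."""
    done = set()
    for line in merged_jsonl_lines:
        rec = json.loads(line)
        # Only skip items that actually finished; failed items get retried
        if rec.get("status") == "completed":
            done.add(rec["input_text"])
    return [line for line in input_lines if line not in done]

# Example: one of two inputs was already translated in a checkpoint
inputs = ["xin chào", "cảm ơn"]
merged = ['{"id": "asr_00000001", "input_text": "xin chào", "status": "completed"}']
print(remaining_inputs(inputs, merged))  # ['cảm ơn']
```

Write the filtered lines to a new input file and point the runner at it to continue a run.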
Your synthetic translation pipeline is set up and ready to process data at scale.

**Start simple, monitor closely, scale gradually.** 🚀

For detailed information, see:

- `USAGE_GUIDE.md` - Complete documentation
- `PROJECT_SUMMARY.md` - Architecture details
- `examples/demo.py` - Code examples

Happy translating! 🌟
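As a closing example, the JSONL records shown in "Understanding the Output" are easy to spot-check in a few lines of Python, complementing the validation scripts. This sketch assumes only the `id`, `status`, and `confidence` fields from the examples above:

```python
import json

def summarize(jsonl_lines):
    """Tally record statuses and collect the ids of low-confidence results."""
    counts = {}
    low_confidence = []
    for line in jsonl_lines:
        rec = json.loads(line)
        counts[rec["status"]] = counts.get(rec["status"], 0) + 1
        if rec.get("confidence") == "low":
            low_confidence.append(rec["id"])
    return counts, low_confidence

# Two records shaped like the ASR output above
sample = [
    '{"id": "asr_00000001", "status": "completed", "confidence": "high"}',
    '{"id": "asr_00000002", "status": "completed", "confidence": "low"}',
]
print(summarize(sample))  # ({'completed': 2}, ['asr_00000002'])
```

To run it against a real output file, pass `open("outputs/my_first_translation.jsonl")` as the argument and manually review any ids it flags.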