# 🎯 Getting Started with Synthetic Translation Dataset

## 📖 Table of Contents

1. [What is This Project?](#what-is-this-project)
2. [Quick Setup](#quick-setup)
3. [Your First Translation](#your-first-translation)
4. [Understanding the Output](#understanding-the-output)
5. [Production Usage](#production-usage)
6. [Next Steps](#next-steps)

---

## 🤔 What is This Project?

A production-ready system for generating synthetic translation datasets using Large Language Models (LLMs). It includes:

- **ASR Translation**: Convert Vietnamese ASR transcriptions → Clean English text
- **Chat Translation**: Convert Vietnamese chat → Formal English + Political compliance check

### Key Benefits

- ✅ **Clean Code**: Well-organized, easy to maintain
- ✅ **Scalable**: Multi-CPU processing with dynamic batching
- ✅ **Reliable**: Checkpoint support, error handling, retry logic
- ✅ **Tested**: Comprehensive unit tests
- ✅ **Documented**: Detailed guides and examples

---

## ⚙️ Quick Setup

### Step 1: Install Dependencies

```bash
cd /home/dungvpt/workspace/mlm_training/synthetic_projects
pip install -r requirements.txt
```

### Step 2: Configure

```bash
# Copy the environment template
cp .env.example .env

# Edit if needed (the default works with your VLLM setup)
nano .env
```

### Step 3: Verify Installation

```bash
bash scripts/test_installation.sh
```

You should see: **✅ All tests passed!**

---

## 🚀 Your First Translation

### Example 1: Translate 5 ASR Samples

```bash
python -m src.asr_translation.runner \
    --input translation_for_asr/telephone2000h.txt \
    --output outputs/my_first_translation.jsonl \
    --limit 5 \
    --num-workers 2
```

**What happens:**

1. Reads 5 lines from the input file
2. Generates translation prompts
3. Calls the VLLM server on port 8000
4. Saves the results to a JSONL file

### Example 2: Translate 10 Chat Messages

```bash
python -m src.chat_translation.runner \
    --dataset tarudesu/VOZ-HSD \
    --output outputs/my_first_chat.jsonl \
    --limit 10 \
    --num-workers 2
```

**What happens:**

1. Loads 10 messages from the HuggingFace dataset
2. Translates them to formal English
3. Checks political compliance
4. Saves the results with moderation metadata

---

## 📊 Understanding the Output

### ASR Translation Output

```json
{
  "id": "asr_00000001",
  "input_text": "thê mà con lại thấy bà ấy có cái quang gánh",
  "output_text": "But I noticed she had a carrying pole",
  "status": "completed",
  "confidence": "high",
  "processing_time": 1.5,
  "timestamp": "2025-11-28T04:00:00"
}
```

### Chat Translation Output

```json
{
  "id": "chat_00000001",
  "input_text": "Em ăn hoành thánh sáng bị khó chịu",
  "output_text": "I felt nauseous after eating wontons for breakfast",
  "status": "completed",
  "is_politically_compliant": true,
  "compliance_level": "compliant",
  "flagged_issues": [],
  "moderation_reasoning": "Content is about personal health, no issues",
  "confidence": "high",
  "processing_time": 2.1
}
```

### View Your Results

```bash
# View the first result
head -n 1 outputs/my_first_translation.jsonl | jq .

# Count results
wc -l outputs/my_first_translation.jsonl

# View all results nicely
cat outputs/my_first_translation.jsonl | jq .
```

---

## 🏭 Production Usage

### Full ASR Translation

```bash
# Run in the background
nohup bash scripts/run_asr_translation.sh > logs/asr_production.log 2>&1 &

# Get the process ID
echo $!
# Monitor progress
tail -f logs/asr_production.log

# Check how many items have been processed so far
ls -lh outputs/asr_translation/checkpoints/
```

### Full Chat Translation

```bash
# Run in the background
nohup bash scripts/run_chat_translation.sh > logs/chat_production.log 2>&1 &

# Monitor
tail -f logs/chat_production.log
```

### Stop a Running Process

```bash
# Find the process
ps aux | grep python | grep runner

# Kill gracefully (saves a checkpoint); <PID> is the process ID found above
kill -SIGINT <PID>
```

### Resume from Checkpoint

```bash
# Merge existing checkpoints
bash scripts/merge_checkpoints.sh \
    outputs/asr_translation/checkpoints \
    outputs/asr_merged.jsonl

# Continue processing the remaining items
# (you'll need to adjust the input to skip already-processed items)
```

---

## 📈 Monitoring & Validation

### Check Progress

```bash
# Watch checkpoint creation
watch -n 5 'ls -lh outputs/asr_translation/checkpoints/ | tail -5'

# Monitor system resources
htop

# Check VLLM usage (on the GPU server)
nvidia-smi -l 1
```

### Validate Output

```bash
# Validate ASR output
python scripts/validate_asr_output.py \
    outputs/my_first_translation.jsonl

# Validate chat output
python scripts/validate_chat_output.py \
    outputs/my_first_chat.jsonl

# Get statistics
bash scripts/calculate_stats.sh outputs/my_first_translation.jsonl
```

---

## 🔧 Customization

### Adjust Worker Count

```bash
# More workers = faster (if you have the CPU cores)
python -m src.asr_translation.runner \
    --num-workers 16 \
    ...

# Fewer workers = less memory usage
python -m src.asr_translation.runner \
    --num-workers 4 \
    ...
```

### Modify Prompts

```bash
# Edit ASR prompts
nano src/asr_translation/prompts.py

# Edit chat prompts
nano src/chat_translation/prompts.py

# Test your changes
python examples/demo.py
```

### Change Model Parameters

```bash
# Edit the .env file
nano .env

# Key parameters:
VLLM__TEMPERATURE=0.7   # Lower = more deterministic
VLLM__TOP_P=0.9         # Nucleus sampling
VLLM__MAX_TOKENS=4096   # Max output length
```

---

## 📚 Next Steps

Now that you're up and running:

1. **Read the Guides**
   - `QUICKSTART.md` - 5-minute overview
   - `USAGE_GUIDE.md` - Comprehensive documentation
   - `PROJECT_SUMMARY.md` - Architecture details
2. **Explore Examples**
   ```bash
   python examples/demo.py
   ```
3. **Run Tests**
   ```bash
   pytest tests/ -v
   ```
4. **Customize for Your Needs**
   - Modify prompts
   - Add custom fields
   - Integrate with your pipeline
5. **Monitor Production**
   - Use the validation scripts
   - Check logs regularly
   - Optimize worker/batch settings

---

## 💡 Pro Tips

1. **Always test with `--limit` first**
   ```bash
   --limit 10  # Test with 10 samples before a full run
   ```
2. **Use checkpoints for large datasets**
   ```bash
   # Checkpoints are saved every 1000 items by default
   # Resume by merging checkpoints
   ```
3. **Monitor VLLM server health**
   ```bash
   # Check that VLLM is responsive
   curl http://localhost:8000/v1/models
   ```
4. **Validate output quality**
   ```bash
   # Always validate before using the data
   python scripts/validate_asr_output.py outputs/my_first_translation.jsonl
   ```
5. **Tune for your hardware**
   ```bash
   # More CPU cores → more workers
   # More RAM → larger batches
   # Watch system resources and adjust
   ```

---

## ❓ Common Questions

**Q: How long will it take to process my data?**
A: It depends on dataset size and VLLM performance. Typical throughput is 5-10 items/sec/worker.

**Q: What if processing stops?**
A: Checkpoints are saved automatically. Merge the checkpoints and resume.

**Q: Can I process multiple datasets simultaneously?**
A: Yes! Just run multiple scripts with different output directories.

**Q: How do I know if the translations are good quality?**
A: Use the validation scripts and manually review samples.

**Q: What if the VLLM server goes down?**
A: Processing will retry automatically. If it fails, checkpoints are preserved.

---

## 🆘 Need Help?

1. **Check Logs**: `tail -f outputs/*/logs/*.log`
2. **Run Demo**: `python examples/demo.py`
3. **Test Setup**: `bash scripts/test_installation.sh`
4. **Read Docs**: See `USAGE_GUIDE.md`
5. **Validate Output**: Use the validation scripts

---

## 🎉 You're Ready!
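One loose end worth closing first: the "Resume from Checkpoint" section notes that you'll need to adjust the input to skip already-processed items. A minimal Python sketch of that filtering step, matching raw input lines against the `input_text` field of the merged JSONL; exact-match filtering and the `remaining_inputs` helper name are illustrative assumptions, not part of the project's API:

```python
import json

def remaining_inputs(input_lines, merged_jsonl_lines):
    """Return the input lines whose text does not yet appear as a
    completed `input_text` in the merged checkpoint output."""
    done = set()
    for line in merged_jsonl_lines:
        rec = json.loads(line)
        # Only skip items that actually finished; failed items get retried
        if rec.get("status") == "completed":
            done.add(rec["input_text"])
    return [line for line in input_lines if line not in done]

# Example: one of two inputs was already translated in a checkpoint
inputs = ["xin chào", "cảm ơn"]
merged = ['{"id": "asr_00000001", "input_text": "xin chào", "status": "completed"}']
print(remaining_inputs(inputs, merged))  # ['cảm ơn']
```

Write the filtered lines to a new input file and point the runner at it to continue a run.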
Your synthetic translation pipeline is set up and ready to process data at scale.

**Start simple, monitor closely, scale gradually.** 🚀

For detailed information, see:

- `USAGE_GUIDE.md` - Complete documentation
- `PROJECT_SUMMARY.md` - Architecture details
- `examples/demo.py` - Code examples

Happy translating! 🌟
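As a closing example, the JSONL records shown in "Understanding the Output" are easy to spot-check in a few lines of Python, complementing the validation scripts. This sketch assumes only the `id`, `status`, and `confidence` fields from the examples above:

```python
import json

def summarize(jsonl_lines):
    """Tally record statuses and collect the ids of low-confidence results."""
    counts = {}
    low_confidence = []
    for line in jsonl_lines:
        rec = json.loads(line)
        counts[rec["status"]] = counts.get(rec["status"], 0) + 1
        if rec.get("confidence") == "low":
            low_confidence.append(rec["id"])
    return counts, low_confidence

# Two records shaped like the ASR output above
sample = [
    '{"id": "asr_00000001", "status": "completed", "confidence": "high"}',
    '{"id": "asr_00000002", "status": "completed", "confidence": "low"}',
]
print(summarize(sample))  # ({'completed': 2}, ['asr_00000002'])
```

To run it against a real output file, pass `open("outputs/my_first_translation.jsonl")` as the argument and manually review any ids it flags.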