
🚀 Production Deployment Guide - Background Execution

A guide to running the whole system in the background, with 32 CPU workers per project.


📋 Overview

Two pipelines run concurrently:

  • ASR Translation: 32 CPU workers
  • Chat Translation: 32 CPU workers

Total: 64 CPU cores in use


⚙️ Configuration

1. Check Resources

# Check the number of CPU cores
nproc
# Or
lscpu | grep "^CPU(s):"

# Check available RAM
free -h

# Recommended:
# - Minimum: 64 CPU cores
# - RAM: 16GB+ (64 workers x 256MB per worker = 16GB)
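
The sizing rule above can also be checked programmatically. The following is a minimal sketch, not part of the project: the function names are illustrative, and the 256 MB per worker figure is simply the one quoted in the recommendation.

```python
import os

MB_PER_WORKER = 256  # per-worker figure taken from the recommendation above


def required_ram_gb(total_workers: int, mb_per_worker: int = MB_PER_WORKER) -> float:
    """Return the RAM estimate in GB for the given number of workers."""
    return total_workers * mb_per_worker / 1024


def has_enough_cores(total_workers: int, available=None) -> bool:
    """True when the machine has at least one core per worker."""
    if available is None:
        available = os.cpu_count() or 1
    return available >= total_workers


if __name__ == "__main__":
    workers = 2 * 32  # two pipelines, 32 workers each
    print(f"RAM estimate: {required_ram_gb(workers):.0f} GB")
    print(f"Enough cores: {has_enough_cores(workers)}")
```

The estimate covers the workers only; the operating system and any other services on the machine need headroom on top of it.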

2. Check the VLLM Server

# Check that VLLM is running
curl http://localhost:8000/v1/models

# If there is no response, start VLLM:
CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --port 8000 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --max-num-batched-tokens 131072 \
  --gpu-memory-utilization 0.9 &
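
Loading an 80B model can take several minutes, so a pipeline started right after the `vllm serve` command may hit a server that is not ready yet. A hedged sketch of a readiness poll against the same `/v1/models` endpoint follows; `fetch_models` and `wait_for_server` are hypothetical helper names, and the attempt count and delay are arbitrary defaults.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_models(url: str, timeout: float = 5.0) -> list:
    """Return the model ids reported by an OpenAI-compatible /v1/models endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload.get("data", [])]


def wait_for_server(url, attempts=30, delay=10.0, fetch=fetch_models, sleep=time.sleep):
    """Poll until the endpoint lists at least one model, or give up."""
    for attempt in range(attempts):
        try:
            models = fetch(url)
            if models:
                return models
        except (urllib.error.URLError, OSError, ValueError):
            pass  # server not up yet (connection refused, incomplete JSON, ...)
        if attempt < attempts - 1:
            sleep(delay)
    return []  # caller decides what to do when the server never came up
```

Calling `wait_for_server("http://localhost:8000/v1/models")` before launching the pipelines returns the list of served model ids, or an empty list if the server never answered.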

🎯 Method 1: Automated Script (Recommended)

Quick Start

cd /home/dungvpt/workspace/mlm_training/synthetic_projects

# Run both pipelines with 32 workers each
bash scripts/run_production_full.sh

The script will:

  1. ✅ Check the VLLM server
  2. ✅ Create timestamped output directories
  3. ✅ Run ASR translation (32 workers) in the background
  4. ✅ Run Chat translation (32 workers) in the background
  5. ✅ Write a separate log for each pipeline
  6. ✅ Print the process IDs for monitoring
  7. ✅ Resume automatically after an interruption

🎯 Method 2: Manual Commands

ASR Translation (32 Workers)

cd /home/dungvpt/workspace/mlm_training/synthetic_projects

# Run in the background with nohup (logs/ must exist for the redirect to work)
mkdir -p logs
nohup python -m src.asr_translation.runner \
    --input translation_for_asr/telephone2000h.txt \
    --output-dir outputs/asr_translation \
    --num-workers 32 \
    --batch-size 64 \
    --checkpoint-interval 1000 \
    --use-json \
    > logs/asr_production.log 2>&1 &

# Save the process ID
echo $! > logs/asr_pid.txt
echo "ASR Translation PID: $(cat logs/asr_pid.txt)"

Chat Translation (32 Workers)

cd /home/dungvpt/workspace/mlm_training/synthetic_projects

# Run in the background with nohup (logs/ must exist for the redirect to work)
mkdir -p logs
nohup python -m src.chat_translation.runner \
    --dataset tarudesu/VOZ-HSD \
    --output-dir outputs/chat_translation \
    --num-workers 32 \
    --batch-size 64 \
    --checkpoint-interval 1000 \
    --use-json \
    > logs/chat_production.log 2>&1 &

# Save the process ID
echo $! > logs/chat_pid.txt
echo "Chat Translation PID: $(cat logs/chat_pid.txt)"
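
Both manual commands follow the same pattern: start a detached process, redirect its output to a log, and record its PID. A rough Python equivalent is sketched below for cases where the launch is driven from a script rather than a shell. `launch_detached` is a hypothetical helper, and `start_new_session=True` is used as a stand-in for `nohup`, which is an assumption about what is good enough here rather than an exact equivalent.

```python
import subprocess
import sys
from pathlib import Path


def launch_detached(cmd, log_path, pid_path):
    """Start cmd in its own session, send stdout+stderr to log_path, record the PID."""
    log_path = Path(log_path)
    pid_path = Path(pid_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    pid_path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier log content; the shell's ">" would truncate instead.
    with open(log_path, "ab") as log:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # detach from the controlling terminal
        )
    pid_path.write_text(f"{proc.pid}\n")
    return proc
```

For example, `launch_detached([sys.executable, "-m", "src.asr_translation.runner", "--num-workers", "32"], "logs/asr_production.log", "logs/asr_pid.txt")` would mirror the ASR command above with the remaining flags appended to the list.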

📊 Monitoring

Real-time Progress Monitoring

# Monitor ASR translation
tail -f logs/asr_production.log

# Monitor Chat translation
tail -f logs/chat_production.log

# Monitor both at once (split terminal)
# Terminal 1:
tail -f logs/asr_production.log

# Terminal 2:
tail -f logs/chat_production.log

Check Progress

# Count the results processed so far
wc -l outputs/asr_translation/asr_run_*/results.jsonl
wc -l outputs/chat_translation/chat_run_*/results.jsonl

# View the latest results (-q drops tail's per-file headers, which would break jq when several runs match)
tail -q -n 5 outputs/asr_translation/asr_run_*/results.jsonl | jq .
tail -q -n 5 outputs/chat_translation/chat_run_*/results.jsonl | jq .

# Follow in real time
watch -n 5 'wc -l outputs/*/*/results.jsonl'
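
Line counts become more useful when turned into a percentage and a time estimate. The sketch below assumes one result per line in `results.jsonl`; the function names are illustrative, and the totals to pass in are the item counts listed under Estimated Completion Time.

```python
def count_results(path) -> int:
    """Count completed items: one JSON record per line in results.jsonl."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def progress(done: int, total: int, elapsed_s: float):
    """Return (percent complete, estimated seconds remaining)."""
    percent = 100.0 * done / total
    rate = done / elapsed_s if elapsed_s > 0 else 0.0
    eta_s = (total - done) / rate if rate > 0 else float("inf")
    return percent, eta_s
```

The estimate is a simple average over the elapsed time, so it is noisy early in a run and settles once throughput is steady.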

System Resources

# CPU usage
top -u $USER

# or htop (more user-friendly)
htop

# Process status
ps aux | grep "python -m src"

# Specific processes
ps -p $(cat logs/asr_pid.txt) -o pid,cmd,%cpu,%mem,etime
ps -p $(cat logs/chat_pid.txt) -o pid,cmd,%cpu,%mem,etime
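
The PID files also allow a scripted liveness check, for example from a cron job. This is a sketch for Linux and other POSIX systems; `read_pid` and `is_running` are hypothetical helper names.

```python
import os
from pathlib import Path


def read_pid(pid_file) -> int:
    """Read the integer PID stored by `echo $! > file`."""
    return int(Path(pid_file).read_text().strip())


def is_running(pid: int) -> bool:
    """Signal 0 checks that the process exists without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```

A PID can be reused by an unrelated process after the original exits, so a positive answer from a stale PID file is only a hint; the `ps -o cmd` output above confirms what is actually running.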

🛑 Control Operations

Stop Processes

# Stop gracefully (saves checkpoint)
kill -SIGINT $(cat logs/asr_pid.txt)
kill -SIGINT $(cat logs/chat_pid.txt)

# or use the script
bash scripts/stop_production.sh

# Force stop (only if graceful doesn't work)
kill -9 $(cat logs/asr_pid.txt)
kill -9 $(cat logs/chat_pid.txt)
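
This guide does not show how the runners react to SIGINT. One common pattern, sketched here purely as an assumption about the approach, is to set a flag in the signal handler and let the main loop exit at a safe point before writing a checkpoint. `StopFlag`, `run`, `handle_item`, and `save_checkpoint` are all hypothetical names.

```python
import signal


class StopFlag:
    """Records that SIGINT arrived so the main loop can exit at a safe point."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Must be called from the main thread.
        signal.signal(signal.SIGINT, self._on_sigint)

    def _on_sigint(self, signum, frame):
        self.requested = True


def run(items, handle_item, save_checkpoint, stop):
    """Process items until done or until a stop is requested, then checkpoint."""
    done = 0
    for item in items:
        if stop.requested:
            break
        handle_item(item)
        done += 1
    save_checkpoint(done)  # runs on both normal completion and graceful stop
    return done
```

SIGKILL (`kill -9`) cannot be caught, so no handler of this kind gets a chance to run; that is why the force stop is reserved for cases where the graceful stop fails.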

Pause & Resume

# Pause (uses no CPU but keeps memory allocated)
kill -STOP $(cat logs/asr_pid.txt)
kill -STOP $(cat logs/chat_pid.txt)

# Resume
kill -CONT $(cat logs/asr_pid.txt)
kill -CONT $(cat logs/chat_pid.txt)

Restart (Auto-Resume)

# Simply run the same command again
# The resume feature automatically skips items that were already processed
bash scripts/run_production_full.sh
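
The resume logic itself lives in the runners and is not shown here. As an illustration of the idea only, skipping processed items can be done by collecting the ids already written to `results.jsonl`. The `id` field name is an assumption; the real records may use a different key.

```python
import json


def load_done_ids(results_path, key="id"):
    """Collect the ids already present in results.jsonl (empty set if no file yet)."""
    done = set()
    try:
        with open(results_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    done.add(json.loads(line)[key])
                except (json.JSONDecodeError, KeyError, TypeError):
                    continue  # e.g. a partial last line from an interrupted write
    except FileNotFoundError:
        pass
    return done


def pending(items, done_ids, key="id"):
    """Yield only the items whose id has not been processed yet."""
    for item in items:
        if item[key] not in done_ids:
            yield item
```

Ignoring an unparsable trailing line means the item it belonged to is simply processed again, which keeps the skip set conservative.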

📈 Performance Tuning

For High Throughput

# Increase workers and batch size
NUM_WORKERS=48 \
BATCH_SIZE=96 \
bash scripts/run_production_full.sh

For Memory-Constrained Systems

# Reduce workers and batch size
NUM_WORKERS=16 \
BATCH_SIZE=32 \
bash scripts/run_production_full.sh

Optimal Settings (64 cores available)

# 32 workers per pipeline = 64 total
NUM_WORKERS=32 \
BATCH_SIZE=64 \
CHECKPOINT_INTERVAL=500 \
bash scripts/run_production_full.sh

📁 Output Structure

outputs/
├── asr_translation/
│   └── asr_run_20250128_100000/
│       ├── results.jsonl              # Incremental results
│       └── checkpoints/
│           ├── checkpoint_00001000.jsonl
│           ├── checkpoint_00002000.jsonl
│           └── ...
└── chat_translation/
    └── chat_run_20250128_100000/
        ├── results.jsonl
        └── checkpoints/
            ├── checkpoint_00001000.jsonl
            └── ...

logs/
├── asr_production.log
├── chat_production.log
├── asr_pid.txt
└── chat_pid.txt
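
Because run directories carry a `YYYYMMDD_HHMMSS` timestamp and checkpoint files are zero-padded, a plain lexicographic sort is enough to find the newest of each. A small sketch with hypothetical helper names, assuming the layout shown above:

```python
from pathlib import Path


def latest_run(base_dir, prefix):
    """Return the newest run directory, e.g. asr_run_20250128_100000, or None."""
    runs = sorted(p for p in Path(base_dir).glob(f"{prefix}_*") if p.is_dir())
    return runs[-1] if runs else None


def latest_checkpoint(run_dir):
    """Return the highest-numbered checkpoint file in run_dir/checkpoints, or None."""
    ckpts = sorted((Path(run_dir) / "checkpoints").glob("checkpoint_*.jsonl"))
    return ckpts[-1] if ckpts else None
```

For example, `latest_checkpoint(latest_run("outputs/asr_translation", "asr_run"))` points at the most recent ASR checkpoint once at least one run exists.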

✅ Validation

While Running

# Validate ASR results (sample)
head -n 100 outputs/asr_translation/asr_run_*/results.jsonl > /tmp/asr_sample.jsonl
python scripts/validate_asr_output.py /tmp/asr_sample.jsonl

# Validate Chat results (sample)
head -n 100 outputs/chat_translation/chat_run_*/results.jsonl > /tmp/chat_sample.jsonl
python scripts/validate_chat_output.py /tmp/chat_sample.jsonl

After Completion

# Full validation
python scripts/validate_asr_output.py outputs/asr_translation/asr_run_*/results.jsonl
python scripts/validate_chat_output.py outputs/chat_translation/chat_run_*/results.jsonl

# Calculate statistics
bash scripts/calculate_stats.sh outputs/asr_translation/asr_run_*/results.jsonl
bash scripts/calculate_stats.sh outputs/chat_translation/chat_run_*/results.jsonl
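
The validation scripts above are project-specific and their checks are not described in this guide. When they are unavailable, a generic well-formedness pass over the JSONL output is a reasonable first filter. The sketch below is not a replacement for them; `check_jsonl` and the `required_keys` parameter are illustrative, and the keys to require depend on the actual record schema.

```python
import json


def check_jsonl(lines, required_keys=()):
    """Return (valid_count, list of (line_number, problem)) for an iterable of lines."""
    valid, problems = 0, []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((n, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((n, "record is not a JSON object"))
            continue
        missing = [k for k in required_keys if k not in record]
        if missing:
            problems.append((n, f"missing keys: {missing}"))
            continue
        valid += 1
    return valid, problems
```

Passing an open file works directly, for example `check_jsonl(open("/tmp/asr_sample.jsonl", encoding="utf-8"))`.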

🔧 Troubleshooting

Issue: Process died unexpectedly

# Check logs for errors
tail -n 50 logs/asr_production.log
tail -n 50 logs/chat_production.log

# Check if process still running
ps -p $(cat logs/asr_pid.txt)
ps -p $(cat logs/chat_pid.txt)

# Restart with resume
bash scripts/run_production_full.sh

Issue: VLLM server overloaded

# Check VLLM GPU usage
nvidia-smi

# Reduce number of workers temporarily
NUM_WORKERS=16 bash scripts/run_production_full.sh

Issue: Out of memory

# Check memory usage
free -h

# Reduce workers
NUM_WORKERS=16 BATCH_SIZE=32 bash scripts/run_production_full.sh

Issue: Slow processing

# Check CPU usage (should be ~100% per worker)
top

# Check VLLM server response time
curl -w "@-" -o /dev/null -s http://localhost:8000/v1/models <<'EOF'
    time_namelookup:  %{time_namelookup}\n
       time_connect:  %{time_connect}\n
          time_total:  %{time_total}\n
EOF

# Check network latency if VLLM is remote

📊 Expected Performance

With 32 Workers Each

| Metric | ASR Translation | Chat Translation |
|---|---|---|
| Workers | 32 | 32 |
| Throughput | ~160-320 req/sec | ~160-320 req/sec |
| Time per item | ~0.1-0.2s | ~0.1-0.2s |
| Memory usage | ~8-10GB | ~8-10GB |
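
The throughput range follows from the worker count and the time per item, under the assumption that each worker handles one request at a time. A short sketch of that arithmetic, with an illustrative function name:

```python
def throughput_per_sec(workers: int, seconds_per_item: float) -> float:
    """Steady-state requests/sec if each worker handles one item at a time."""
    return workers / seconds_per_item


if __name__ == "__main__":
    for latency in (0.2, 0.1):
        rate = throughput_per_sec(32, latency)
        print(f"{latency:.1f}s per item -> {rate:.0f} req/sec")
```

With 32 workers this gives 160 req/sec at 0.2s per item and 320 req/sec at 0.1s per item, matching the range quoted for each pipeline.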

Estimated Completion Time

ASR Translation:
- Total items: 1,647,738
- Throughput: 200 req/sec
- Estimated time: ~2.3 hours

Chat Translation:
- Total items: 10,747,733
- Throughput: 200 req/sec
- Estimated time: ~15 hours
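
These estimates are the item count divided by the throughput. The sketch below reproduces them so the figures can be recomputed for a different measured rate; `eta_hours` is an illustrative name.

```python
def eta_hours(total_items: int, items_per_sec: float) -> float:
    """Hours needed to process total_items at a constant rate."""
    return total_items / items_per_sec / 3600


if __name__ == "__main__":
    totals = {"ASR": 1_647_738, "Chat": 10_747_733}
    for name, total in totals.items():
        print(f"{name}: {eta_hours(total, 200):.1f} h at 200 req/sec")
```

Across the quoted 160-320 req/sec range, the same formula gives roughly 1.4 to 2.9 hours for ASR and 9.3 to 18.7 hours for Chat.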

🎯 Best Practices

  1. Monitor early: Watch first 1000 items for any issues
  2. Check quality: Validate samples periodically
  3. Resource balance: Don't overload VLLM server
  4. Backup logs: Keep logs for debugging
  5. Resume friendly: Use default resume mode
  6. Checkpoint often: Keep the checkpoint interval small enough that an interruption costs little work (the examples in this guide use 500 to 1000 items)

📞 Quick Reference Commands

# Start production
bash scripts/run_production_full.sh

# Monitor
tail -f logs/asr_production.log
tail -f logs/chat_production.log

# Check progress
watch -n 5 'wc -l outputs/*/*/results.jsonl'

# Stop gracefully
bash scripts/stop_production.sh

# Validate
python scripts/validate_asr_output.py outputs/asr_translation/asr_run_*/results.jsonl
python scripts/validate_chat_output.py outputs/chat_translation/chat_run_*/results.jsonl

✨ Summary

Configuration: 32 workers per pipeline = 64 total workers
Resume: Automatic, enabled by default
Saving: Incremental, real-time
Monitoring: Live logs and progress tracking
Recovery: Checkpoint-based, no data loss

Ready for production! 🚀