| # Synthetic Translation Dataset Generator | |
| A professional, clean-code implementation for generating synthetic translation datasets using LLMs. | |
| ## Project Structure | |
| ``` | |
| synthetic_projects/ | |
| ├── src/ | |
| │ ├── core/ # Shared core utilities | |
| │ │ ├── config.py # Configuration management | |
| │ │ ├── models.py # Data models | |
| │ │ ├── llm_client.py # LLM API client | |
| │ │ └── worker_pool.py # Multiprocessing worker pool | |
| │ ├── asr_translation/ # ASR -> English translation | |
| │ │ ├── prompts.py # Translation prompts | |
| │ │ ├── models.py # Data models | |
| │ │ ├── processor.py # Data processing logic | |
| │ │ └── runner.py # Main runner script | |
| │ └── chat_translation/ # Chat -> English translation + moderation | |
| │ ├── prompts.py # Translation & moderation prompts | |
| │ ├── models.py # Data models | |
| │ ├── processor.py # Data processing logic | |
| │ └── runner.py # Main runner script | |
| ├── tests/ # Unit tests | |
| ├── scripts/ # Background execution scripts | |
| └── configs/ # Configuration files | |
| ``` | |
| ## Features | |
| - **Clean Architecture**: Separation of concerns with modular design | |
| - **Type-Safe**: Full type hints with Pydantic models | |
| - **Configurable**: YAML/ENV-based configuration | |
| - **Efficient**: Multi-CPU processing with dynamic batching | |
| - **Resilient**: Retry logic and error handling | |
| - **Testable**: Comprehensive test coverage | |
| - **Maintainable**: Well-documented, easy to extend | |
| ## Sub-Projects | |
| ### 1. ASR Translation | |
| Translates Vietnamese ASR transcriptions to well-written English text. | |
| **Input**: Raw ASR transcriptions (Vietnamese, unnormalized) | |
| **Output**: Clean, well-organized English translations | |
| ### 2. Chat Translation | |
| Translates Vietnamese chat messages to formal English with content moderation. | |
| **Input**: Vietnamese chat messages | |
| **Output**: | |
| - Formal English translation | |
| - Political compliance metadata (Vietnam laws) | |
| ## Quick Start | |
| ### Installation | |
| ```bash | |
| pip install -r requirements.txt | |
| ``` | |
| ### Configuration | |
| Create `.env` file: | |
| ```env | |
| VLLM_API_BASE=http://localhost:8000/v1 | |
| VLLM_MODEL=Qwen/Qwen3-Next-80B-A3B-Instruct | |
| MAX_WORKERS=8 | |
| BATCH_SIZE=32 | |
| ``` | |
| ### Run ASR Translation | |
| ```bash | |
| python -m src.asr_translation.runner \ | |
| --input translation_for_asr/telephone2000h.txt \ | |
| --output outputs/asr_translated.jsonl \ | |
| --num-workers 8 | |
| ``` | |
| ### Run Chat Translation | |
| ```bash | |
| python -m src.chat_translation.runner \ | |
| --dataset tarudesu/VOZ-HSD \ | |
| --output outputs/chat_translated.jsonl \ | |
| --num-workers 8 | |
| ``` | |
| ### Background Execution | |
| ```bash | |
| # ASR translation in background | |
| nohup bash scripts/run_asr_translation.sh > logs/asr.log 2>&1 & | |
| # Chat translation in background | |
| nohup bash scripts/run_chat_translation.sh > logs/chat.log 2>&1 & | |
| ``` | |
| ## Testing | |
| ```bash | |
| # Run all tests | |
| pytest tests/ -v | |
| # Run with coverage | |
| pytest tests/ --cov=src --cov-report=html | |
| ``` | |
| ## License | |
| MIT | |