synthetic_projects / README.md
tuandunghcmut's picture
Initial commit
b8d82db
# Synthetic Translation Dataset Generator
A professional, clean-code implementation for generating synthetic translation datasets using LLMs.
## Project Structure
```
synthetic_projects/
├── src/
│ ├── core/ # Shared core utilities
│ │ ├── config.py # Configuration management
│ │ ├── models.py # Data models
│ │ ├── llm_client.py # LLM API client
│ │ └── worker_pool.py # Multiprocessing worker pool
│ ├── asr_translation/ # ASR -> English translation
│ │ ├── prompts.py # Translation prompts
│ │ ├── models.py # Data models
│ │ ├── processor.py # Data processing logic
│ │ └── runner.py # Main runner script
│ └── chat_translation/ # Chat -> English translation + moderation
│ ├── prompts.py # Translation & moderation prompts
│ ├── models.py # Data models
│ ├── processor.py # Data processing logic
│ └── runner.py # Main runner script
├── tests/ # Unit tests
├── scripts/ # Background execution scripts
└── configs/ # Configuration files
```
## Features
- **Clean Architecture**: Separation of concerns with modular design
- **Type-Safe**: Full type hints with Pydantic models
- **Configurable**: YAML/ENV-based configuration
- **Efficient**: Multi-CPU processing with dynamic batching
- **Resilient**: Retry logic and error handling
- **Testable**: Comprehensive test coverage
- **Maintainable**: Well-documented, easy to extend
## Sub-Projects
### 1. ASR Translation
Translates Vietnamese ASR transcriptions to well-written English text.
**Input**: Raw ASR transcriptions (Vietnamese, unnormalized)
**Output**: Clean, well-organized English translations
### 2. Chat Translation
Translates Vietnamese chat messages to formal English with content moderation.
**Input**: Vietnamese chat messages
**Output**:
- Formal English translation
- Political compliance metadata (Vietnam laws)
## Quick Start
### Installation
```bash
pip install -r requirements.txt
```
### Configuration
Create `.env` file:
```env
VLLM_API_BASE=http://localhost:8000/v1
VLLM_MODEL=Qwen/Qwen3-Next-80B-A3B-Instruct
MAX_WORKERS=8
BATCH_SIZE=32
```
### Run ASR Translation
```bash
python -m src.asr_translation.runner \
--input translation_for_asr/telephone2000h.txt \
--output outputs/asr_translated.jsonl \
--num-workers 8
```
### Run Chat Translation
```bash
python -m src.chat_translation.runner \
--dataset tarudesu/VOZ-HSD \
--output outputs/chat_translated.jsonl \
--num-workers 8
```
### Background Execution
```bash
# ASR translation in background
nohup bash scripts/run_asr_translation.sh > logs/asr.log 2>&1 &
# Chat translation in background
nohup bash scripts/run_chat_translation.sh > logs/chat.log 2>&1 &
```
## Testing
```bash
# Run all tests
pytest tests/ -v
# Run with coverage
pytest tests/ --cov=src --cov-report=html
```
## License
MIT