nineninesix/kyrgyz-whisper-medium is a fine-tuned multilingual speech recognition model based on OpenAI's Whisper Medium architecture. This model adds native Kyrgyz language support while maintaining strong performance on English and Russian.
Kyrgyz is handled through a new <|ky|> language token provided by a custom tokenizer. The following visualization shows the improvement after fine-tuning:

[Figure: transcription quality before vs. after fine-tuning]
from transformers import AutoTokenizer
# Load custom tokenizer with Kyrgyz support
tokenizer = AutoTokenizer.from_pretrained(
    "nineninesix/kyrgyz-whisper-medium",
    trust_remote_code=True,  # required to load the custom tokenizer with the <|ky|> token
    language="kyrgyz",
    task="transcribe"
)
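A quick, optional sanity check, assuming the custom tokenizer registers <|ky|> as a regular vocabulary entry:

# The custom tokenizer should map <|ky|> to a dedicated id rather than the unknown-token id
ky_id = tokenizer.convert_tokens_to_ids("<|ky|>")
print(ky_id, ky_id != tokenizer.unk_token_id)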
The <|ky|> token was initialized as an average of embeddings from linguistically similar languages:
embedding_ky = (embedding_ru + embedding_kk + embedding_tr) / 3
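The snippet below is a minimal sketch of how such an initialization can be done with the Transformers API; it is illustrative rather than the exact script used for this model, and it assumes openai/whisper-medium as the starting checkpoint with <|ky|> added as a new special token.

import torch
from transformers import WhisperForConditionalGeneration, WhisperTokenizer

base = "openai/whisper-medium"  # assumed starting checkpoint
model = WhisperForConditionalGeneration.from_pretrained(base)
tokenizer = WhisperTokenizer.from_pretrained(base)

# Average the embedding rows of linguistically related language tokens
related_ids = tokenizer.convert_tokens_to_ids(["<|ru|>", "<|kk|>", "<|tr|>"])
with torch.no_grad():
    ky_embedding = model.get_input_embeddings().weight[related_ids].mean(dim=0)

# Register the new <|ky|> token and grow the (tied) embedding matrix by one row
tokenizer.add_tokens(["<|ky|>"], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

# Assign the averaged embedding to the new token's row
ky_id = tokenizer.convert_tokens_to_ids("<|ky|>")
with torch.no_grad():
    model.get_input_embeddings().weight[ky_id] = ky_embedding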
import torch
from transformers import AutoModelForSpeechSeq2Seq, pipeline, WhisperFeatureExtractor, AutoTokenizer

# Use GPU and half precision when available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "nineninesix/kyrgyz-whisper-medium"

# Load the fine-tuned model
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# The feature extractor is standard Whisper; the tokenizer is custom and needs trust_remote_code
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True, language="kyrgyz", task="transcribe"
)

# Build the ASR pipeline and transcribe an audio file
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=tokenizer,
    feature_extractor=feature_extractor,
    torch_dtype=torch_dtype,
    device=device
)

result = pipe("audio.mp3")
print(result["text"])
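For longer recordings, the same pipeline can transcribe in chunks with batching; this is a usage sketch and the file name is a placeholder.

result = pipe(
    "long_audio.mp3",
    chunk_length_s=30,       # split long audio into 30-second chunks
    batch_size=8,            # process several chunks per forward pass
    return_timestamps=True,  # also return segment timestamps
)
print(result["text"])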
This model serves as a foundation for domain-specific fine-tuning using LoRA (Low-Rank Adaptation); a minimal adapter setup is sketched after the list below.
Unsloth integration example: see this Google Colab.
Benefits of LoRA fine-tuning:
- Only a small fraction of the parameters is trained, which sharply reduces GPU memory and compute requirements.
- The base model's Kyrgyz, English, and Russian capabilities are preserved while the model adapts to domain-specific audio.
- The resulting adapter weights are small, so they are easy to share and to swap between domains.
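The following is a minimal PEFT-based sketch of attaching LoRA adapters to this checkpoint; the target modules and hyperparameters are illustrative assumptions, not the settings used in the linked notebook.

from transformers import AutoModelForSpeechSeq2Seq
from peft import LoraConfig, get_peft_model

model = AutoModelForSpeechSeq2Seq.from_pretrained("nineninesix/kyrgyz-whisper-medium")

# Wrap the attention projections with low-rank adapters (illustrative hyperparameters)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable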
@misc{kyrgyz-whisper-medium,
  author    = {nineninesix},
  title     = {Whisper Medium - Kyrgyz, English, Russian},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/nineninesix/kyrgyz-whisper-medium}
}

@misc{radford2022whisper,
  doi       = {10.48550/ARXIV.2212.04356},
  url       = {https://arxiv.org/abs/2212.04356},
  author    = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title     = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year      = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
Tokenizer: kyrgyz-ai/whisper_tokenizer_ky

Apache 2.0 - see the LICENSE file for details.