# English–Khasi Hybrid Tokenizer (Unigram, 12k)
This repository provides a SentencePiece-based hybrid tokenizer for English–Khasi NLP tasks.
Access is gated with automatic approval: if you have a Hugging Face account, you can request access and start using it right away.
## Overview
- Languages: English, Khasi
- Model: SentencePiece Unigram
- Vocabulary size: 12,000
- Training data:
  - Parallel EN–KHA corpus (~70k pairs)
  - Khasi monolingual corpus (~42k sentences)
  - Curriculum-boosted morphology roots
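For context on how a Unigram model segments text: it picks the piece sequence with the highest total probability under a learned piece vocabulary, which is what lets it keep frequent Khasi morphemes intact. Below is a minimal, stdlib-only sketch of that Viterbi search over a toy vocabulary; the pieces and probabilities are illustrative only, not taken from this tokenizer (which learns ~12,000 pieces from the corpora above).

```python
import math

# Toy unigram vocabulary with made-up probabilities (for illustration only).
VOCAB = {
    "sngew": 0.05, "thuh": 0.04, "sngewthuh": 0.08,
    "s": 0.01, "n": 0.01, "g": 0.01, "e": 0.01,
    "w": 0.01, "t": 0.01, "h": 0.01, "u": 0.01,
}

def unigram_segment(text, vocab):
    """Viterbi search: return the piece sequence with the highest
    total log-probability under the unigram model."""
    n = len(text)
    # best[i] = (best score for text[:i], start index of the last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start][0] > -math.inf:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Walk the backpointers to recover the segmentation.
    pieces, pos = [], n
    while pos > 0:
        start = best[pos][1]
        pieces.append(text[start:pos])
        pos = start
    return pieces[::-1]

# The whole-word piece outscores any split, so it is kept intact.
print(unigram_segment("sngewthuh", VOCAB))  # → ['sngewthuh']
```

Because the whole-word piece "sngewthuh" has a higher probability than any decomposition, the model emits it as a single token; this is the mechanism by which frequent morphology is preserved.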
## Motivation
Khasi is a low-resource language with limited NLP tooling. This tokenizer is designed to preserve Khasi morphology while remaining compatible with English.
## Usage
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Bapynshngain/enkha-hybrid-tokenizer")

# Khasi: "I want to understand this"
tokens = tok.tokenize("Nga kwah ban sngewthuh ia kane")
print(tokens)
```