# English–Khasi Hybrid Tokenizer (Unigram, 12k)
This repository provides a SentencePiece-based hybrid tokenizer for English–Khasi NLP tasks.
Access is gated with automatic approval: if you have a Hugging Face account, you can request access and start using it right away.
## Overview
- Languages: English, Khasi
- Model: SentencePiece Unigram
- Vocabulary size: 12,000
- Training data:
  - Parallel EN–KHA corpus (~70k pairs)
  - Khasi monolingual corpus (~42k sentences)
  - Curriculum-boosted morphology roots
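For context on how a Unigram model segments text: it picks the piece sequence with the highest total probability under a learned piece vocabulary, which is what lets it keep frequent Khasi morphemes intact. Below is a minimal, stdlib-only sketch of that Viterbi search over a toy vocabulary; the pieces and probabilities are illustrative only, not taken from this tokenizer (which learns ~12,000 pieces from the corpora above).

```python
import math

# Toy unigram vocabulary with made-up probabilities (for illustration only).
VOCAB = {
    "sngew": 0.05, "thuh": 0.04, "sngewthuh": 0.08,
    "s": 0.01, "n": 0.01, "g": 0.01, "e": 0.01,
    "w": 0.01, "t": 0.01, "h": 0.01, "u": 0.01,
}

def unigram_segment(text, vocab):
    """Viterbi search: return the piece sequence with the highest
    total log-probability under the unigram model."""
    n = len(text)
    # best[i] = (best score for text[:i], start index of the last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start][0] > -math.inf:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Walk the backpointers to recover the segmentation.
    pieces, pos = [], n
    while pos > 0:
        start = best[pos][1]
        pieces.append(text[start:pos])
        pos = start
    return pieces[::-1]

# The whole-word piece outscores any split, so it is kept intact.
print(unigram_segment("sngewthuh", VOCAB))  # → ['sngewthuh']
```

Because the whole-word piece "sngewthuh" has a higher probability than any decomposition, the model emits it as a single token; this is the mechanism by which frequent morphology is preserved.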
## Motivation
Khasi is a low-resource language with limited NLP tooling. This tokenizer is designed to preserve Khasi morphology while remaining compatible with English.
## Usage
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Bapynshngain/enkha-hybrid-tokenizer")

# Khasi: "I want to understand this"
tokens = tok.tokenize("Nga kwah ban sngewthuh ia kane")
print(tokens)
```