
English–Khasi Hybrid Tokenizer (Unigram, 12k)

This repository provides a SentencePiece-based hybrid tokenizer for English–Khasi NLP tasks.

Access is gated with automatic approval, so if you have a Hugging Face account you can start building with it right away.

Overview

  • Languages: English, Khasi
  • Model: SentencePiece Unigram
  • Vocabulary size: 12,000
  • Training data:
    • Parallel EN–KHA corpus (~70k pairs)
    • Khasi monolingual corpus (~42k sentences)
    • Curriculum-boosted morphology roots
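To illustrate what the Unigram model above does at inference time, here is a minimal, self-contained sketch of Unigram segmentation: choose the split of a word whose pieces maximize total log-probability, computed with a Viterbi pass over prefixes. The piece probabilities below are invented for the toy example (a real model learns them from the corpus); they are not this tokenizer's actual vocabulary.

```python
import math

# Hypothetical piece -> probability table. The whole-word piece is
# given a low probability so the morpheme split wins, mimicking how
# a morphology-aware vocabulary can favor roots over full words.
vocab = {
    "sngew": 0.05, "thuh": 0.04, "sngewthuh": 0.001,
    "s": 0.01, "n": 0.01, "g": 0.01, "e": 0.01, "w": 0.01,
    "t": 0.01, "h": 0.01, "u": 0.01,
}

def segment(text: str) -> list[str]:
    """Return the max-likelihood segmentation of `text` under `vocab`."""
    n = len(text)
    # best[i] = (best log-prob, best piece list) for the prefix text[:i]
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start][0] > -math.inf:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[n][1]

print(segment("sngewthuh"))  # splits into the two morphemes
```

Here log(0.05) + log(0.04) ≈ -6.2 beats the whole-word piece's log(0.001) ≈ -6.9 and the character-by-character fallback, so the word is split into "sngew" + "thuh".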

Motivation

Khasi is a low-resource language with limited NLP tooling. This tokenizer is designed to preserve Khasi morphology while remaining compatible with English.

Usage

from transformers import AutoTokenizer

# Load the tokenizer directly from the Hugging Face Hub
tok = AutoTokenizer.from_pretrained("Bapynshngain/enkha-hybrid-tokenizer")

# Khasi: "I want to understand this"
tokens = tok.tokenize("Nga kwah ban sngewthuh ia kane")
print(tokens)