apmoore1 committed
Commit 5e02da6 · verified · 1 Parent(s): 0e9b811

Push model using huggingface_hub.

README.md CHANGED
@@ -1,132 +1,10 @@
  ---
- license: cc-by-nc-sa-4.0
- base_model: jhu-clsp/ettin-encoder-17m
- base_model_relation: finetune
- datasets:
- - ucrelnlp/English-USAS-Mosaico
- language:
- - en
  tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
- - pytorch
- - word-sense-disambiguation
- - lexical-semantics
  ---
 
- # Model Card for PyMUSAS Neural English Small BEM
-
- A fine-tuned 17 million (17M) parameter English semantic tagger built on the ModernBERT architecture. The tagger outputs semantic tags at the token level from the [USAS tagset](https://ucrel.lancs.ac.uk/usas/usas_guide.pdf).
-
- The semantic tagger is a variation of the [Bi-Encoder Model (BEM) of Blevins and Zettlemoyer 2020](https://aclanthology.org/2020.acl-main.95.pdf), a Word Sense Disambiguation (WSD) model.
-
- ## Table of contents
-
- - [Quick start](#quick-start)
- - [Model Description](#model-description)
- - [Training Data](#training-data)
- - [Evaluation](#evaluation)
- - [Citation](#citation)
- - [Contact Information](#contact-information)
-
- ## Quick start
-
- ### Installation
-
- Requires Python `3.10` or greater. We require `torch>=2.2,<3.0`, and it is best to install the PyTorch build you would like to use (e.g. the CPU or GPU version) before installing this package; otherwise you will get the default PyTorch build for your operating system/setup. A quick version check is sketched after the install command.
-
- ``` bash
- pip install wsd-torch-models
- ```
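-
- As an optional sanity check (our suggestion, not part of the package), you can confirm that the installed PyTorch build satisfies the `torch>=2.2,<3.0` requirement:
-
- ``` python
- # Optional sanity check: confirm the installed PyTorch satisfies
- # the torch>=2.2,<3.0 requirement.
- import torch
-
- major, minor = (int(part) for part in torch.__version__.split(".")[:2])
- assert (major, minor) >= (2, 2) and major < 3, torch.__version__
- print("torch " + torch.__version__ + " is compatible")
- ```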
-
- ### Usage
-
- ``` python
- from transformers import AutoTokenizer
- import torch
-
- from wsd_torch_models.bem import BEM
-
-
- if __name__ == "__main__":
-     wsd_model_name = "ucrelnlp/PyMUSAS-Neural-English-Small-BEM"
-     wsd_model = BEM.from_pretrained(wsd_model_name)
-     tokenizer = AutoTokenizer.from_pretrained(wsd_model_name)
-
-     wsd_model.eval()
-     # Change this to the device you would like to use, e.g. "cpu" or "cuda"
-     model_device = "cpu"
-     wsd_model.to(device=model_device)
-
-     sentence = "The river bank was full of fish"
-     sentence_tokens = sentence.split()
-
-     with torch.inference_mode():
-         # sub_word_tokenizer can be None; when None, the appropriate tokenizer
-         # is downloaded automatically. It is generally better to pass the
-         # tokenizer in, as that saves checking whether it has already been
-         # downloaded.
-         predictions = wsd_model.predict(sentence_tokens, sub_word_tokenizer=tokenizer, top_n=5)
-
-     for sentence_token, semantic_tags in zip(sentence_tokens, predictions):
-         print("Token: " + sentence_token)
-         print("Most likely tags: ")
-         for tag in semantic_tags:
-             tag_definition = wsd_model.label_to_definition[tag]
-             print("\t" + tag + ": " + tag_definition)
-         print()
- ```
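-
- `predict` returns one list of candidate tags per input token, so tagging several pre-tokenised sentences is simply a loop over `predict`. Below is a minimal sketch reusing the model and tokenizer from the example above (the sentences here are only illustrative):
-
- ``` python
- # Illustrative sketch: tag several pre-tokenised sentences and keep only
- # the single most likely tag for each token.
- sentences = [
-     "The river bank was full of fish".split(),
-     "She opened an account at the bank".split(),
- ]
- with torch.inference_mode():
-     all_predictions = [
-         wsd_model.predict(tokens, sub_word_tokenizer=tokenizer, top_n=1)
-         for tokens in sentences
-     ]
- for tokens, predictions in zip(sentences, all_predictions):
-     # predictions holds one list of tags per token; take the top tag of each
-     top_tags = [tags[0] for tags in predictions]
-     print(list(zip(tokens, top_tags)))
- ```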
-
- ## Model Description
-
- For more details about the model and how it was trained, please see the [citation/technical report](#citation), as well as the links in the [model sources section](#model-sources).
-
- ### Model Sources
-
- The training repository contains the code used to train this model. The inference repository contains the code used to run the model, as shown in the [usage section](#usage).
-
- - Training Repository: [https://github.com/UCREL/experimental-wsd](https://github.com/UCREL/experimental-wsd)
- - Inference/Usage Repository: [https://github.com/UCREL/WSD-Torch-Models](https://github.com/UCREL/WSD-Torch-Models)
-
- ### Model Architecture
-
- | Parameter | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
- |:----------|:----|:----|:----|:-----|
- | Layers | 7 | 19 | 22 | 22 |
- | Hidden Size | 256 | 512 | 384 | 768 |
- | Intermediate Size | 384 | 768 | 1152 | 1152 |
- | Attention Heads | 4 | 8 | 6 | 12 |
- | Total Parameters | 17M | 68M | 140M | 307M |
- | Non-embedding Parameters | 3.9M | 42.4M | 42M | 110M |
- | Max Sequence Length | 8,000 | 8,000 | 8,192 | 8,192 |
- | Vocabulary Size | 50,368 | 50,368 | 256,000 | 256,000 |
- | Tokenizer | ModernBERT | ModernBERT | Gemma 2 | Gemma 2 |
-
- ## Training Data
-
- The model has been trained on a portion of the [ucrelnlp/English-USAS-Mosaico](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico) dataset, specifically [data/wikipedia_shard_0.jsonl.gz](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico/blob/main/data/wikipedia_shard_0.jsonl.gz), which contains 1,083 English Wikipedia articles with 444,880 sentences and 6.6 million tokens, of which 5.3 million are silver-labelled tokens generated by an English rule-based semantic tagger. A sketch for loading the shard is shown below.
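-
- The shard can likely be loaded with the `datasets` library; this is a minimal sketch, assuming the gzipped JSONL shard is readable via `load_dataset`'s `data_files` argument:
-
- ``` python
- # Minimal sketch, assuming the gzipped JSONL shard can be read directly
- # with the `datasets` library.
- from datasets import load_dataset
-
- shard = load_dataset(
-     "ucrelnlp/English-USAS-Mosaico",
-     data_files="data/wikipedia_shard_0.jsonl.gz",
-     split="train",
- )
- print(shard)
- ```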
-
- ## Evaluation
-
- We have evaluated the models on 5 datasets from 5 different languages. Four of these datasets are publicly available, whereas one (the Irish data) requires permission from the data owner to access. Top 1 and top 5 accuracy results for these models are shown below; for a more comprehensive comparison please see the technical report.
-
- | Dataset | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
- |:----------|:----|:----|:----|:-----|
- | **Top 1** | | | | |
- | Chinese | - | - | 42.2 | 47.9 |
- | English | 66.4 | 70.1 | 66.0 | 70.2 |
- | Finnish | - | - | 15.8 | 25.9 |
- | Irish | - | - | 28.5 | 35.6 |
- | Welsh | - | - | 21.7 | 42.0 |
- | **Top 5** | | | | |
- | Chinese | - | - | 66.3 | 70.4 |
- | English | 87.6 | 90.0 | 88.9 | 90.1 |
- | Finnish | - | - | 32.8 | 42.4 |
- | Irish | - | - | 47.6 | 51.6 |
- | Welsh | - | - | 40.8 | 56.4 |
-
- The publicly available datasets can be found on the Hugging Face Hub at [ucrelnlp/USAS-WSD](https://huggingface.co/datasets/ucrelnlp/USAS-WSD).
-
- **Note**: the English models have not been evaluated on the non-English datasets, as they are unlikely to represent non-English text well or perform well on non-English data. The top-n accuracy metric is sketched below.
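-
- For clarity, top-n accuracy counts a token as correct when its gold tag appears anywhere among the model's n most likely tags. The sketch below only illustrates the metric and is not the exact evaluation code (see the training repository for that):
-
- ``` python
- # Illustrative sketch of top-n accuracy, not the exact evaluation code:
- # a token is correct if its gold tag is among the model's top-n tags,
- # where each token's tags are ordered from most to least likely.
- def top_n_accuracy(gold_tags: list[str], predicted_tags: list[list[str]], n: int = 5) -> float:
-     correct = sum(gold in preds[:n] for gold, preds in zip(gold_tags, predicted_tags))
-     return correct / len(gold_tags)
- ```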
-
- ## Citation
-
- The technical report is forthcoming.
-
- ## Contact Information
-
- * Paul Rayson ([email protected])
- * Andrew Moore ([email protected] / [email protected])
- * UCREL Research Centre ([email protected]) at Lancaster University.
 
  ---
  tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
  ---
 
+ This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+ - Code: [More Information Needed]
+ - Paper: [More Information Needed]
+ - Docs: [More Information Needed]
 
label_definitions/label_definitions_embeddings.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:43d473fbebb880a22528ece98a10de9832c21c74bb5faf23d3a7c4d675dc2206
  size 226408
 
  version https://git-lfs.github.com/spec/v1
+ oid sha256:6d675788c0cfb79f33b40115b0a336e93af26bceadbe90eb16598c90c64a93b6
  size 226408