apmoore1 committed
Commit 5e02da6 · verified · 1 Parent(s): 0e9b811

Push model using huggingface_hub.

README.md CHANGED
@@ -1,132 +1,10 @@
  ---
- license: cc-by-nc-sa-4.0
- base_model: jhu-clsp/ettin-encoder-17m
- base_model_relation: finetune
- datasets:
- - ucrelnlp/English-USAS-Mosaico
- language:
- - en
  tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
- - pytorch
- - word-sense-disambiguation
- - lexical-semantics
  ---
 
- # Model Card for PyMUSAS Neural English Small BEM
-
- A fine-tuned 17 million (17M) parameter English semantic tagger built on the ModernBERT architecture. The tagger outputs semantic tags at the token level from the [USAS tagset](https://ucrel.lancs.ac.uk/usas/usas_guide.pdf).
-
- The semantic tagger is a variation of the [Bi-Encoder Model (BEM) of Blevins and Zettlemoyer 2020](https://aclanthology.org/2020.acl-main.95.pdf), a Word Sense Disambiguation (WSD) model.
-
- ## Table of contents
-
- - [Quick start](#quick-start)
- - [Model Description](#model-description)
- - [Training Data](#training-data)
- - [Evaluation](#evaluation)
- - [Citation](#citation)
- - [Contact Information](#contact-information)
-
- ## Quick start
-
- ### Installation
-
- Requires Python `3.10` or greater. We require `torch>=2.2,<3.0`, and it is best to install the PyTorch build you would like to use (e.g. the CPU or GPU version) before installing this package; otherwise you will get the default PyTorch build for your operating system/setup. A quick version check is sketched after the install command.
-
- ``` bash
- pip install wsd-torch-models
- ```
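-
- As an optional sanity check (our suggestion, not part of the package), you can confirm that the installed PyTorch build satisfies the `torch>=2.2,<3.0` requirement:
-
- ``` python
- # Optional sanity check: confirm the installed PyTorch satisfies
- # the torch>=2.2,<3.0 requirement.
- import torch
-
- major, minor = (int(part) for part in torch.__version__.split(".")[:2])
- assert (major, minor) >= (2, 2) and major < 3, torch.__version__
- print("torch " + torch.__version__ + " is compatible")
- ```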
-
- ### Usage
-
- ``` python
- from transformers import AutoTokenizer
- import torch
-
- from wsd_torch_models.bem import BEM
-
-
- if __name__ == "__main__":
-     wsd_model_name = "ucrelnlp/PyMUSAS-Neural-English-Small-BEM"
-     wsd_model = BEM.from_pretrained(wsd_model_name)
-     tokenizer = AutoTokenizer.from_pretrained(wsd_model_name)
-
-     wsd_model.eval()
-     # Change this to the device you would like to use, e.g. "cpu" or "cuda"
-     model_device = "cpu"
-     wsd_model.to(device=model_device)
-
-     sentence = "The river bank was full of fish"
-     sentence_tokens = sentence.split()
-
-     with torch.inference_mode():
-         # sub_word_tokenizer can be None; when None, the appropriate tokenizer
-         # is downloaded automatically. It is generally better to pass the
-         # tokenizer in, as that saves checking whether it has already been
-         # downloaded.
-         predictions = wsd_model.predict(sentence_tokens, sub_word_tokenizer=tokenizer, top_n=5)
-
-     for sentence_token, semantic_tags in zip(sentence_tokens, predictions):
-         print("Token: " + sentence_token)
-         print("Most likely tags: ")
-         for tag in semantic_tags:
-             tag_definition = wsd_model.label_to_definition[tag]
-             print("\t" + tag + ": " + tag_definition)
-         print()
- ```
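-
- `predict` returns one list of candidate tags per input token, so tagging several pre-tokenised sentences is simply a loop over `predict`. Below is a minimal sketch reusing the model and tokenizer from the example above (the sentences here are only illustrative):
-
- ``` python
- # Illustrative sketch: tag several pre-tokenised sentences and keep only
- # the single most likely tag for each token.
- sentences = [
-     "The river bank was full of fish".split(),
-     "She opened an account at the bank".split(),
- ]
- with torch.inference_mode():
-     all_predictions = [
-         wsd_model.predict(tokens, sub_word_tokenizer=tokenizer, top_n=1)
-         for tokens in sentences
-     ]
- for tokens, predictions in zip(sentences, all_predictions):
-     # predictions holds one list of tags per token; take the top tag of each
-     top_tags = [tags[0] for tags in predictions]
-     print(list(zip(tokens, top_tags)))
- ```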
-
- ## Model Description
-
- For more details about the model and how it was trained, please see the [citation/technical report](#citation), as well as the links in the [model sources section](#model-sources).
-
- ### Model Sources
-
- The training repository contains the code used to train this model. The inference repository contains the code used to run the model, as shown in the [usage section](#usage).
-
- - Training Repository: [https://github.com/UCREL/experimental-wsd](https://github.com/UCREL/experimental-wsd)
- - Inference/Usage Repository: [https://github.com/UCREL/WSD-Torch-Models](https://github.com/UCREL/WSD-Torch-Models)
-
- ### Model Architecture
-
- | Parameter | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
- |:----------|:----|:----|:----|:-----|
- | Layers | 7 | 19 | 22 | 22 |
- | Hidden Size | 256 | 512 | 384 | 768 |
- | Intermediate Size | 384 | 768 | 1152 | 1152 |
- | Attention Heads | 4 | 8 | 6 | 12 |
- | Total Parameters | 17M | 68M | 140M | 307M |
- | Non-embedding Parameters | 3.9M | 42.4M | 42M | 110M |
- | Max Sequence Length | 8,000 | 8,000 | 8,192 | 8,192 |
- | Vocabulary Size | 50,368 | 50,368 | 256,000 | 256,000 |
- | Tokenizer | ModernBERT | ModernBERT | Gemma 2 | Gemma 2 |
-
- ## Training Data
-
- The model has been trained on a portion of the [ucrelnlp/English-USAS-Mosaico](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico) dataset, specifically [data/wikipedia_shard_0.jsonl.gz](https://huggingface.co/datasets/ucrelnlp/English-USAS-Mosaico/blob/main/data/wikipedia_shard_0.jsonl.gz), which contains 1,083 English Wikipedia articles with 444,880 sentences and 6.6 million tokens, of which 5.3 million are silver-labelled tokens generated by an English rule-based semantic tagger. A sketch for loading the shard is shown below.
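-
- The shard can likely be loaded with the `datasets` library; this is a minimal sketch, assuming the gzipped JSONL shard is readable via `load_dataset`'s `data_files` argument:
-
- ``` python
- # Minimal sketch, assuming the gzipped JSONL shard can be read directly
- # with the `datasets` library.
- from datasets import load_dataset
-
- shard = load_dataset(
-     "ucrelnlp/English-USAS-Mosaico",
-     data_files="data/wikipedia_shard_0.jsonl.gz",
-     split="train",
- )
- print(shard)
- ```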
-
- ## Evaluation
-
- We have evaluated the models on 5 datasets from 5 different languages. Four of these datasets are publicly available, whereas one (the Irish data) requires permission from the data owner to access. Top 1 and top 5 accuracy results for these models are shown below; for a more comprehensive comparison please see the technical report.
-
- | Dataset | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
- |:----------|:----|:----|:----|:-----|
- | **Top 1** | | | | |
- | Chinese | - | - | 42.2 | 47.9 |
- | English | 66.4 | 70.1 | 66.0 | 70.2 |
- | Finnish | - | - | 15.8 | 25.9 |
- | Irish | - | - | 28.5 | 35.6 |
- | Welsh | - | - | 21.7 | 42.0 |
- | **Top 5** | | | | |
- | Chinese | - | - | 66.3 | 70.4 |
- | English | 87.6 | 90.0 | 88.9 | 90.1 |
- | Finnish | - | - | 32.8 | 42.4 |
- | Irish | - | - | 47.6 | 51.6 |
- | Welsh | - | - | 40.8 | 56.4 |
-
- The publicly available datasets can be found on the Hugging Face Hub at [ucrelnlp/USAS-WSD](https://huggingface.co/datasets/ucrelnlp/USAS-WSD).
-
- **Note**: the English models have not been evaluated on the non-English datasets, as they are unlikely to represent non-English text well or perform well on non-English data. The top-n accuracy metric is sketched below.
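-
- For clarity, top-n accuracy counts a token as correct when its gold tag appears anywhere among the model's n most likely tags. The sketch below only illustrates the metric and is not the exact evaluation code (see the training repository for that):
-
- ``` python
- # Illustrative sketch of top-n accuracy, not the exact evaluation code:
- # a token is correct if its gold tag is among the model's top-n tags,
- # where each token's tags are ordered from most to least likely.
- def top_n_accuracy(gold_tags: list[str], predicted_tags: list[list[str]], n: int = 5) -> float:
-     correct = sum(gold in preds[:n] for gold, preds in zip(gold_tags, predicted_tags))
-     return correct / len(gold_tags)
- ```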
-
- ## Citation
-
- The technical report is forthcoming.
-
- ## Contact Information
-
- * Paul Rayson ([email protected])
- * Andrew Moore ([email protected] / [email protected])
- * UCREL Research Centre ([email protected]) at Lancaster University.
 
  ---
  tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
  ---
 
+ This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+ - Code: [More Information Needed]
+ - Paper: [More Information Needed]
+ - Docs: [More Information Needed]
 
label_definitions/label_definitions_embeddings.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:43d473fbebb880a22528ece98a10de9832c21c74bb5faf23d3a7c4d675dc2206
  size 226408
 
  version https://git-lfs.github.com/spec/v1
+ oid sha256:6d675788c0cfb79f33b40115b0a336e93af26bceadbe90eb16598c90c64a93b6
  size 226408