<a href="https://colab.research.google.com/drive/19PGHoBlpGj5ZggVtbumyOPHfw4LMdbbl?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>


In [1]:
!pip install datasets
!pip install transformers
!pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Training the Model
## Loading the Data
To train the model, I first loaded the HUPD dataset sample, which consists of all patent applications filed in January 2016. The sample contains 25,247 examples: 16,153 (64%) for training and 9,094 (36%) for validation.
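
As a quick sanity check on the split proportions, the short sketch below recomputes them from the example counts quoted above. The variable names are illustrative only.

```python
# Example counts quoted in the text for the HUPD January 2016 sample
train_size = 16153
val_size = 9094
total = train_size + val_size

# Share of each split, rounded to the nearest whole percent
train_pct = round(100 * train_size / total)
val_pct = round(100 * val_size / total)

print(f"total={total}, train={train_pct}%, validation={val_pct}%")
```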

## Choosing DistilBERT
I tested several models during training. I was initially interested in using Longformer and Big Bird, which support longer input sequences. However, to run them on the full dataset without running out of GPU RAM, the models could only take small batch sizes (n<=4), and they generally would have taken too long to run repeatedly. I also tried BERT, but found it didn't offer a substantial increase in accuracy by the end of the training process, so I stuck with DistilBERT for its balance of performance and speed.

In [2]:
from pprint import pprint
from datasets import load_dataset
from transformers import AutoTokenizer
from torch.utils.data import DataLoader

# Use DistilBERT as the base model
model_name = "distilbert-base-uncased"

# Use default tokenizer from distilbert base model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load Dataset from January 2016
dataset_dict = load_dataset('HUPD/hupd',
    name='sample',
    data_files="https://huggingface.co/datasets/HUPD/hupd/blob/main/hupd_metadata_2022-02-22.feather", 
    icpr_label=None,
    train_filing_start_date='2016-01-01',
    train_filing_end_date='2016-01-21',
    val_filing_start_date='2016-01-22',
    val_filing_end_date='2016-01-31',
)




## Processing the data

First, as suggested by the HUPD guide, I incorporated a helper function which maps the patent decision labels to numbers. This serves two purposes: it gives the model labels in a numeric form it can use, and it makes it easy to later filter out labels irrelevant to the purpose of the project.
 
## Tokenizing the data

As suggested by the project prompt, I chose to use the application abstract and claims for training and for inference of the application's patentability. Before being used with the model, both text sequences must be tokenized. This was done using the default tokenizer from Huggingface's DistilBERT model. During tokenization, the following parameters are used:

```
truncation=True, # ensures long sequences are truncated to at most 512 tokens.
padding='max_length', # ensures short sequences are padded to 512 tokens.
batched=True, # passed to Dataset.map so examples are tokenized in batches, which speeds up processing.
```

For each application in the dataset, the abstract and claims are split into tokens, up to the model's maximum input length (512 tokens, including the special [CLS] and [SEP] tokens). Tokens mostly correspond to whole words, but rarer words are split into subword pieces depending on the model's vocabulary. After the tokens are created, they are converted to numerical IDs that the model can understand. Generally, the combined length of the abstract and claims exceeds 512 tokens, with the claims being much longer than the abstract. For that reason, I passed the abstract as the first argument, to prevent the claims from dominating the set of tokens.
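
To illustrate how a 512-token limit can be shared between the two fields, here is a minimal pure-Python sketch of longest-first pair truncation. It assumes that `truncation=True` follows the longest-first strategy for sequence pairs and that a pair uses three special tokens ([CLS] and two [SEP]). The function name `truncate_pair_longest_first` and the token counts are hypothetical, and the sketch operates on plain lists rather than real token IDs.

```python
def truncate_pair_longest_first(first, second, max_length=512, num_special=3):
    """Trim one token at a time from the longer sequence until the pair fits."""
    first, second = list(first), list(second)
    budget = max_length - num_special  # room left after [CLS] and two [SEP]
    while len(first) + len(second) > budget:
        if len(first) > len(second):
            first.pop()   # first sequence is longer: drop its last token
        else:
            second.pop()  # otherwise (including ties) trim the second sequence
    return first, second

# Hypothetical lengths: a short abstract and much longer claims
abstract_tokens = ["a"] * 150
claims_tokens = ["c"] * 2000
kept_abstract, kept_claims = truncate_pair_longest_first(abstract_tokens, claims_tokens)
print(len(kept_abstract), len(kept_claims))  # 150 359
```

Under these assumptions, a short abstract is kept in full and the claims fill whatever space remains.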

In [3]:
# Label-to-index mapping for the decision status field
decision_to_str = {'REJECTED': 0, 'ACCEPTED': 1, 'PENDING': 2, 'CONT-REJECTED': 3, 'CONT-ACCEPTED': 4, 'CONT-PENDING': 5}

# Helper function
def map_decision_to_string(example):
    return {'decision': decision_to_str[example['decision']]}

In [4]:
# Tokenize claims and abstract fields
def tokenize_function(examples):
  return tokenizer(examples['abstract'], examples['claims'], truncation=True, padding='max_length')
tokenized_datasets = dataset_dict.map(tokenize_function, batched=True)




After tokenizing the data, all fields are removed except for the input_ids, attention_mask, and decision column, which is renamed to "labels", as the distilBERT model expects. 

In [5]:
# Remove unnecessary fields
tokenized_datasets = tokenized_datasets.remove_columns(['patent_number', 
                                                        'abstract', 
                                                        'claims', 
                                                        'title', 
                                                        'background', 
                                                        'summary', 
                                                        'description', 
                                                        'cpc_label', 
                                                        'ipc_label', 
                                                        'filing_date', 
                                                        'patent_issue_date', 
                                                        'date_published', 
                                                        'examiner_id'])


I then map the decisions to label numbers and filter out all labels except "REJECTED" and "ACCEPTED," in order to train the model as a binary classifier. The sizes of the training set and validation set are now 8,719 and 4,888 examples, respectively.

In [6]:
# Re-labeling/mapping.
train_set = tokenized_datasets['train'].map(map_decision_to_string)
val_set = tokenized_datasets['validation'].map(map_decision_to_string)

# Filter out all applications except those labeled "ACCEPTED" or "REJECTED"
train_set = train_set.filter(lambda e: e['decision'] <= 1)
val_set = val_set.filter(lambda e: e['decision'] <= 1)

print("Size of training dataset: " + str(len(train_set)))
print("Size of validation dataset: " + str(len(val_set)))




Size of training dataset: 8719
Size of validation dataset: 4888
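
The relabel-then-filter step can be illustrated without the dataset library. The sketch below applies the same label mapping to a few hypothetical records (plain dictionaries standing in for dataset rows) and keeps only the two binary classes; `decision_to_id`, `records`, `mapped`, and `binary` are illustrative names.

```python
# Same label-to-index mapping as in the notebook
decision_to_id = {'REJECTED': 0, 'ACCEPTED': 1, 'PENDING': 2,
                  'CONT-REJECTED': 3, 'CONT-ACCEPTED': 4, 'CONT-PENDING': 5}

# Hypothetical records standing in for dataset rows
records = [{'decision': 'ACCEPTED'}, {'decision': 'PENDING'},
           {'decision': 'REJECTED'}, {'decision': 'CONT-ACCEPTED'}]

# Map string labels to integers, then keep only REJECTED (0) and ACCEPTED (1)
mapped = [{'decision': decision_to_id[r['decision']]} for r in records]
binary = [r for r in mapped if r['decision'] <= 1]
print(binary)  # [{'decision': 1}, {'decision': 0}]
```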


In [7]:
train_set = train_set.rename_column('decision', 'labels')
val_set = val_set.rename_column('decision', 'labels')

Looking over the distribution of labels in the dataset, I found that it skews heavily towards applications that are "ACCEPTED". In initial attempts to train the model, I found that it simply learned to always predict acceptance. I decided to correct the skew by selecting 3,000 accepted examples and combining them with the 1,774 rejected examples. The size of the final, processed training dataset is now 4,774, which is roughly the same size as the validation dataset.
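
The idea behind this rebalancing can be sketched in plain Python: count the labels, keep a random subset of the majority class, and combine it with all minority examples. The helper `downsample_majority` and the toy data are hypothetical and only mirror the approach; the notebook itself uses the `datasets` library for this step.

```python
import random
from collections import Counter

def downsample_majority(examples, majority_label, keep, seed=42):
    """Keep `keep` randomly chosen majority-class examples plus all others."""
    rng = random.Random(seed)
    majority = [e for e in examples if e["labels"] == majority_label]
    minority = [e for e in examples if e["labels"] != majority_label]
    balanced = rng.sample(majority, keep) + minority
    rng.shuffle(balanced)  # mix the classes so batches are not ordered by label
    return balanced

# Toy data: 70 accepted (1) and 20 rejected (0) examples
toy = [{"labels": 1}] * 70 + [{"labels": 0}] * 20
balanced = downsample_majority(toy, majority_label=1, keep=30)
counts = Counter(e["labels"] for e in balanced)
print(counts[1], counts[0])  # 30 20
```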

In [8]:
from datasets.combine import concatenate_datasets

accepted = train_set.filter(lambda e: e['labels']==1)
rejected = train_set.filter(lambda e: e['labels']==0) 
accepted_small= accepted.shuffle(seed=42).select(range(3000))
processed_train_set = concatenate_datasets([accepted_small, rejected]).shuffle(seed=42)



The datasets are formatted for PyTorch and wrapped in dataloaders (i.e. split into batches). I experimented with batch sizes of 4, 8, 16, ..., 128 and found that 64 was the largest batch size that was stable on Colab. Additionally, the models seemed to begin overfitting sooner with smaller batch sizes, often allowing for only one epoch, so I stuck with 64 as my batch size.

In [9]:
# Set the format
processed_train_set.set_format(type='torch', 
    columns=['input_ids', 'attention_mask', 'labels'])

val_set.set_format(type='torch', 
    columns=['input_ids', 'attention_mask', 'labels'])

In [10]:
# train_dataloader and val_data_loader
batch_size = 64
train_dataloader = DataLoader(processed_train_set, batch_size=batch_size)
val_dataloader = DataLoader(val_set, batch_size=batch_size)
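
For reference, the number of batches each dataloader yields per epoch can be estimated from the dataset sizes reported above. This sketch assumes the `DataLoader` default of keeping the final partial batch (`drop_last=False`); `num_batches` is a hypothetical helper.

```python
import math

def num_batches(num_examples, batch_size):
    """Batches per pass when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)

batch_size = 64
train_batches = num_batches(4774, batch_size)  # processed training set
val_batches = num_batches(4888, batch_size)    # filtered validation set
print(train_batches, val_batches)  # 75 77
```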

# Model Training

For training, I found I was unable to train on the full dataset for more than 3 epochs without the training and validation losses beginning to diverge (i.e. the model overfitting). I experimented with increasing the dropout as high as 0.9 and tried different learning rates. I found that learning rates below 0.00001 produced higher training losses (lower is better), whereas learning rates greater than 0.00005 tended to overfit within 1 epoch. Therefore, I landed on 0.00001 as the final learning rate.

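As suggested by Huggingface, I used a learning rate scheduler, which adjusts the learning rate over the course of training in order to improve the accuracy of the model. In the code below, the scheduler is stepped after every training batch, and the "linear" schedule decays the learning rate linearly from its initial value.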
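
As a rough illustration of what a linear schedule does, the sketch below computes the learning rate after a given number of scheduler steps. It is a pure-Python approximation under stated assumptions, not the library's implementation; `linear_lr`, its default of 100 total steps, and the sampled step counts are all hypothetical.

```python
def linear_lr(step, base_lr=1e-5, total_steps=100, warmup_steps=0):
    """Learning rate after `step` scheduler updates under linear warmup and decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 25, 50, 100):
    print(step, linear_lr(step))
```

With no warmup, the rate starts at the base value, is halved at the midpoint, and reaches zero only if the scheduler is stepped for the full `total_steps`.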

In [11]:
# Model training
import torch
from transformers import AutoModelForSequenceClassification
from transformers import get_scheduler
from tqdm.auto import tqdm

model = AutoModelForSequenceClassification.from_pretrained(model_name)
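# Increase dropout to prevent overfitting.
# Caveat: this updates the config after the model is built, so dropout
# layers that were already constructed may keep their original rate.
model.config.dropout = .7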
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
 
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
num_of_epochs = 3
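# Total progress-bar steps: training plus validation batches across all epochs.
# Note: the scheduler below uses the same total but is only stepped on training
# batches, so the linear decay does not reach zero by the end of training.
num_training_steps = num_of_epochs * (len(train_dataloader) + len(val_dataloader))
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)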
progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_of_epochs):
  running_loss = 0
  running_vloss = 0 
  model.train()

  for batch in train_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    optimizer.zero_grad()
    output = model(**batch)
    loss = output.loss
    running_loss += loss.item()
    loss.backward()
    optimizer.step()
    lr_scheduler.step()
    progress_bar.update(1)

  train_loss = running_loss / len(train_dataloader)
  
  model.eval()

  for batch in val_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    references = batch["labels"]
    with torch.no_grad():
      output = model(**batch)
    loss = output.loss
    running_vloss += loss.item()
    progress_bar.update(1)
  
  val_loss = running_vloss / len(val_dataloader)

  print(f"Train loss: {train_loss} Validation loss: {val_loss}")  

torch.save(model, "patent_classifier.pt")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'pre_classi


Train loss: 0.6510325566927592 Validation loss: 0.5412276468493722
Train loss: 0.623265913327535 Validation loss: 0.5464698540699946
Train loss: 0.6030614495277404 Validation loss: 0.5375786942320985


Finally, after the model was trained, the code below calculates the accuracy of the model on the validation set. The final model used for my app has an accuracy of approximately 0.74 (0.737 before rounding).

In [12]:
import torch.nn.functional as F
import evaluate

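# Load the full pickled model saved above.
# Recent PyTorch releases may require weights_only=False for this call.
model = torch.load("patent_classifier.pt")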
model.eval()

metric = evaluate.load("accuracy")

for batch in val_dataloader:
  batch = {k: v.to(device) for k, v in batch.items()}
  references = batch["labels"]
  with torch.no_grad():
    outputs = model(**batch)

  logits = outputs.logits
  predictions = torch.argmax(logits, dim=1)
  metric.add_batch(predictions=predictions, references=references)

metric.compute()


{'accuracy': 0.7371112929623568}
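
To make the accuracy computation concrete, here is a small pure-Python sketch of the same idea: take the index of the largest logit in each row as the prediction and compare it with the label. The helper `accuracy_from_logits` and the toy values are made up for illustration and are not results from the model.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up logits for four applications: [rejected_score, accepted_score]
toy_logits = [[0.2, 1.1], [1.5, -0.3], [0.1, 0.4], [0.9, 0.8]]
toy_labels = [1, 0, 0, 1]
print(accuracy_from_logits(toy_logits, toy_labels))  # 0.5
```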