Perplexity is a statistical measure of how confidently a language model predicts a text. It quantifies how "surprised" the model is by data it has not seen before. A lower perplexity means the model predicts the text better, which suggests that the text more closely resembles the training corpus.
In this example, we use the pretrained model "dmis-lab/biobert-base-cased-v1.2" (available on Hugging Face) to compute perplexity scores for a range of texts. The goal is to gauge how closely each text relates to the biomedical domain.
Some scenarios where the perplexity measure can be used:
- Cleansing data by filtering out texts irrelevant to a specific field.
- Evaluating the fluency and coherence of generated text.
- Identifying outlier texts within a dataset.
- Detecting grammatical or terminological inaccuracies that need correction.
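As an illustration of the data-cleansing use case, the sketch below keeps only texts whose score falls at or below a cutoff. The helper `filter_in_domain`, the `score_fn` parameter, the threshold value, and the keyword-based toy scorer are all hypothetical; in practice `score_fn` would be a model-based perplexity function and the threshold would need tuning on labeled examples.

```python
def filter_in_domain(texts, score_fn, threshold):
    """Return the texts whose score is at or below `threshold`."""
    return [text for text in texts if score_fn(text) <= threshold]

# Toy stand-in for a perplexity function (an assumption for this sketch):
# it returns a low score when the text contains a biomedical keyword.
KEYWORDS = {"vaccine", "patient", "infection"}

def toy_score(text):
    words = {w.strip(".,?!").lower() for w in text.split()}
    return 1.0 if words & KEYWORDS else 2.0

texts = [
    "The vaccine reduced the risk of infection.",
    "The weather will be sunny this weekend.",
]
kept = filter_in_domain(texts, toy_score, threshold=1.5)
print(kept)
```

Passing the scorer as a parameter keeps the filtering logic independent of any particular model.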
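The Python script below computes the model's loss on the given text and then derives a perplexity score using the formula exp(loss / num_tokens). Note that the loss returned by Hugging Face models is already averaged over tokens, so the extra division by num_tokens makes this a length-normalized score that stays close to 1, rather than the standard perplexity exp(loss).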
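For reference, perplexity is conventionally defined as the exponential of the average negative log-probability per token. The sketch below computes it from a list of per-token probabilities; the function name `perplexity_from_probs` and the probability values are made up for illustration and do not come from BioBERT.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity = exp(mean negative log-probability per token)."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Illustrative probabilities only (assumed, not model output).
confident = [0.9, 0.8, 0.95, 0.85]   # model assigns high probability to each token
uncertain = [0.2, 0.1, 0.3, 0.25]    # model assigns low probability to each token

print(perplexity_from_probs(confident))
print(perplexity_from_probs(uncertain))
```

A model that assigns probability 0.25 to every token has a perplexity of 4, which matches the intuition of choosing uniformly among four options at each step.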
References
https://en.wikipedia.org/wiki/Perplexity
https://medium.com/@priyankads/perplexity-of-language-models-41160427ed72
Example of Sentence Perplexity with a Biomedical Language Model
This notebook explores sentence perplexity using the PerplexityCalculator class, which relies on the dmis-lab/biobert-base-cased-v1.2 pre-trained model. It analyzes perplexity across a set of sample sentences to show how the scores vary with each sentence's relevance to the biomedical domain.
import pandas as pd
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
pd.set_option('display.max_colwidth', None)
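class PerplexityCalculator:
    def __init__(self, model_name="dmis-lab/biobert-base-cased-v1.2"):
        # Load the tokenizer and masked-language-model weights from Hugging Face
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForMaskedLM.from_pretrained(model_name)

    def calculate_perplexity(self, sentence):
        # Tokenize the sentence, adding the [CLS] and [SEP] special tokens
        tokens = self.tokenizer.encode_plus(sentence, add_special_tokens=True, return_tensors="pt")
        input_ids = tokens['input_ids']
        # Score the sentence without tracking gradients
        with torch.no_grad():
            outputs = self.model(input_ids, labels=input_ids)
        # outputs.loss is the cross-entropy already averaged over all tokens
        loss = outputs.loss
        num_tokens = input_ids.size(1)
        # Length-normalized score: exp(loss / num_tokens)
        perplexity = torch.exp(loss / num_tokens)
        return perplexity.item()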
sentences = [
# 5 questions with content highly related to medical health
"What does a new study show about the vaccine's effectiveness in reducing the risk of infection?",
"What symptoms did the patient exhibit consistent with the flu, and what was done immediately?",
"What does the research paper discuss regarding the efficacy of a novel drug in treating cancer?",
"What complex neurosurgery did the medical team successfully perform on the infected patient?",
"How can physical therapy and regular exercise improve joint mobility in arthritis patients?",
# 5 questions with content slightly related to medical health
"What topics were covered in the conference related to healthcare and wellness?",
"Why did she decide to take up yoga and how does it improve her overall well-being?",
"What benefits are highlighted in the article regarding a balanced diet and maintaining good health?",
"Why is regular handwashing essential in preventing the spread of infections?",
"What did the documentary focus on regarding the impact of pollution on public health?",
# 5 questions with content unrelated to medical health
"How has the new movie release been received by critics?",
"What was the outcome of the championship game for the team?",
"What does she enjoy doing in her free time, specifically related to reading?",
"What did the company announce regarding the launch of their latest smartphone?",
"What does the weather forecast predict for the weekend?"
]
def main():
    perplexity_calculator = PerplexityCalculator()
    # Score every sentence and collect one row per sentence
    rows = [
        {'queryExpression': sentence, 'perplexity': perplexity_calculator.calculate_perplexity(sentence)}
        for sentence in sentences
    ]
    # Sort ascending so the lowest-perplexity sentences come first
    return pd.DataFrame(rows).sort_values('perplexity')

df = main()
df
| | queryExpression | perplexity |
|---|---|---|
| 0 | What does a new study show about the vaccine's effectiveness in reducing the risk of infection? | 1.068590 |
| 2 | What does the research paper discuss regarding the efficacy of a novel drug in treating cancer? | 1.082275 |
| 3 | What complex neurosurgery did the medical team successfully perform on the infected patient? | 1.085622 |
| 1 | What symptoms did the patient exhibit consistent with the flu, and what was done immediately? | 1.085882 |
| 6 | Why did she decide to take up yoga and how does it improve her overall well-being? | 1.087844 |
| 4 | How can physical therapy and regular exercise improve joint mobility in arthritis patients? | 1.098204 |
| 7 | What benefits are highlighted in the article regarding a balanced diet and maintaining good health? | 1.104272 |
| 9 | What did the documentary focus on regarding the impact of pollution on public health? | 1.109692 |
| 12 | What does she enjoy doing in her free time, specifically related to reading? | 1.113692 |
| 5 | What topics were covered in the conference related to healthcare and wellness? | 1.122161 |
| 13 | What did the company announce regarding the launch of their latest smartphone? | 1.124373 |
| 8 | Why is regular handwashing essential in preventing the spread of infections? | 1.136480 |
| 11 | What was the outcome of the championship game for the team? | 1.156340 |
| 10 | How has the new movie release been received by critics? | 1.204013 |
| 14 | What does the weather forecast predict for the weekend? | 1.241182 |
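One way to read the table is to average the scores by sentence group. The sketch below copies the values from the output above and relies on the original list positions (0 to 4 highly related, 5 to 9 slightly related, 10 to 14 unrelated); the group labels and variable names are chosen here for illustration.

```python
from statistics import mean

# Scores copied from the table above, keyed by original sentence index.
scores = {
    0: 1.068590, 1: 1.085882, 2: 1.082275, 3: 1.085622, 4: 1.098204,
    5: 1.122161, 6: 1.087844, 7: 1.104272, 8: 1.136480, 9: 1.109692,
    10: 1.204013, 11: 1.156340, 12: 1.113692, 13: 1.124373, 14: 1.241182,
}

# Index ranges follow the order of the `sentences` list.
groups = {
    "highly related": range(0, 5),
    "slightly related": range(5, 10),
    "unrelated": range(10, 15),
}

group_means = {
    label: mean(scores[i] for i in indices)
    for label, indices in groups.items()
}

for label, value in group_means.items():
    print(f"{label}: {value:.4f}")
```

With these values, the group averages rise from the highly related sentences to the unrelated ones, although individual sentences overlap across groups: the yoga question (index 6) scores lower than the arthritis question (index 4).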