Perplexity: assessing the relevance of a text within a context

Perplexity is a statistical measure of how confidently a language model predicts a text. It quantifies the model's degree of "surprise" when it encounters previously unseen data. A lower perplexity means the model predicts the text better, which suggests that the text is closer to the training corpus.
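As a minimal sketch of this definition, assume we already know the probability a model assigned to each token of a text. The probabilities below are made up for illustration and do not come from any real model. Perplexity is then the exponential of the average negative log-probability:

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity = exp(mean negative log-probability of the observed tokens)."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# Hypothetical per-token probabilities, not produced by any real model.
confident = [0.9, 0.8, 0.95, 0.85]   # the model was rarely "surprised"
uncertain = [0.2, 0.1, 0.3, 0.25]    # the model was often "surprised"

print(perplexity_from_probs(confident))
print(perplexity_from_probs(uncertain))
```

The confident sequence yields the lower perplexity. A model that spreads its probability evenly over four choices at every step has a perplexity of 4.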

In this example, we use the pretrained model "dmis-lab/biobert-base-cased-v1.2" (available on Hugging Face) to compute perplexity scores for a range of texts. The goal is to estimate how closely each text relates to the biomedical domain.

These are some possible scenarios where the perplexity measure can be used:

Cleaning data by filtering out texts that are irrelevant to a specific field.
Evaluating the fluency and coherence of generated text.
Identifying outlier texts within a dataset.
Detecting grammatical or terminological errors that need correction.
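The outlier scenario could be sketched as follows. The helper name flag_outliers, the z-score rule, the threshold, and the scores are all assumptions chosen for illustration; a real pipeline would tune the rule to its own data.

```python
from statistics import mean, stdev

def flag_outliers(scores, z_threshold=1.5):
    """Return the texts whose perplexity lies more than z_threshold
    sample standard deviations above the mean of all scores."""
    values = list(scores.values())
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [text for text, ppl in scores.items() if (ppl - mu) / sigma > z_threshold]

# Hypothetical perplexity scores, for illustration only.
scores = {
    "text a": 1.07,
    "text b": 1.08,
    "text c": 1.09,
    "text d": 1.08,
    "text e": 1.10,
    "text f": 1.60,
}
print(flag_outliers(scores))
```

Only "text f" is flagged here, because its score sits far above the rest of the sample.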

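In this Python script, the model's loss on the given text is computed and then converted into a perplexity score with the formula exp(loss / num_tokens). Note that the loss returned by the Hugging Face model is already the mean cross-entropy per token, so the extra division by num_tokens yields a length-normalized score rather than the conventional perplexity, which would be exp(loss).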
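The sketch below compares the script's formula with the conventional definition exp(loss). It assumes the loss is the mean cross-entropy per token, which is the default reduction for Hugging Face masked-LM heads when labels are passed. The per-token values are made up for illustration.

```python
import math

def mean_loss(token_nlls):
    """Average negative log-likelihood per token (what a mean-reduced loss reports)."""
    return sum(token_nlls) / len(token_nlls)

# Hypothetical per-token negative log-likelihoods for one sentence.
token_nlls = [0.5, 1.2, 0.8, 2.0, 1.5]
loss = mean_loss(token_nlls)
num_tokens = len(token_nlls)

conventional = math.exp(loss)               # standard perplexity
script_score = math.exp(loss / num_tokens)  # formula used in this script

print(f"mean loss:              {loss:.3f}")
print(f"exp(loss):              {conventional:.3f}")
print(f"exp(loss / num_tokens): {script_score:.3f}")
```

Dividing a positive mean loss by the token count again pulls the score toward 1, and it does so more strongly for longer sentences. Scores from the two formulas are therefore not directly comparable.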

References

https://en.wikipedia.org/wiki/Perplexity

https://medium.com/@priyankads/perplexity-of-language-models-41160427ed72

Example of Sentence Perplexity with Biomedical Language Model

This notebook explores sentence perplexity using the PerplexityCalculator class, which relies on the pretrained dmis-lab/biobert-base-cased-v1.2 model. It computes perplexity for a set of sample sentences to show how the score varies with each sentence's relevance to the biomedical domain.

In [1]:

import pandas as pd
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
pd.set_option('display.max_colwidth', None)

In [2]:

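class PerplexityCalculator:
    """Scores sentences with a pretrained masked language model."""

    def __init__(self, model_name="dmis-lab/biobert-base-cased-v1.2"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForMaskedLM.from_pretrained(model_name)
        self.model.eval()  # inference only: make sure dropout is disabled

    def calculate_perplexity(self, sentence):
        # Tokenize the sentence, adding the special [CLS] and [SEP] tokens.
        tokens = self.tokenizer(sentence, add_special_tokens=True, return_tensors="pt")
        input_ids = tokens['input_ids']

        with torch.no_grad():
            # Using the unmasked input as labels makes the model return a
            # cross-entropy loss that is already averaged over all tokens.
            outputs = self.model(input_ids, labels=input_ids)
            loss = outputs.loss
            num_tokens = input_ids.size(1)
            # Score used throughout this notebook. Because the loss is already
            # a per-token mean, the conventional perplexity would be exp(loss).
            perplexity = torch.exp(loss / num_tokens)

        return perplexity.item()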
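The class above feeds the unmasked sentence to the model, so the model can see every token it is asked to predict. A common alternative for masked language models is pseudo-perplexity, in which each token is masked in turn and scored given the rest. This is a different technique from the one used in this notebook. The sketch below shows only the aggregation step: masked_log_prob is a hypothetical callable standing in for a real model call, and uniform_scorer is a toy placeholder.

```python
import math
from typing import Callable, Sequence

def pseudo_perplexity(tokens: Sequence[str],
                      masked_log_prob: Callable[[Sequence[str], int], float]) -> float:
    """exp of the mean negative log-probability of each token,
    where each token is scored while it alone is masked."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    total = sum(masked_log_prob(tokens, i) for i in range(len(tokens)))
    return math.exp(-total / len(tokens))

# Toy stand-in scorer (hypothetical): a uniform guess over a 10-word vocabulary.
def uniform_scorer(tokens, position, vocab_size=10):
    return math.log(1.0 / vocab_size)

print(pseudo_perplexity(["the", "patient", "had", "flu"], uniform_scorer))
```

With a real model, masked_log_prob would replace the token at the given position with the mask token, run a forward pass, and return the log-probability assigned to the original token.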

In [3]:

sentences = [
    # 5 questions with content highly related to medical health
    "What does a new study show about the vaccine's effectiveness in reducing the risk of infection?",
    "What symptoms did the patient exhibit consistent with the flu, and what was done immediately?",
    "What does the research paper discuss regarding the efficacy of a novel drug in treating cancer?",
    "What complex neurosurgery did the medical team successfully perform on the infected patient?",
    "How can physical therapy and regular exercise improve joint mobility in arthritis patients?",

    # 5 questions with content slightly related to medical health
    "What topics were covered in the conference related to healthcare and wellness?",
    "Why did she decide to take up yoga and how does it improve her overall well-being?",
    "What benefits are highlighted in the article regarding a balanced diet and maintaining good health?",
    "Why is regular handwashing essential in preventing the spread of infections?",
    "What did the documentary focus on regarding the impact of pollution on public health?",

    # 5 questions with content unrelated to medical health
    "How has the new movie release been received by critics?",
    "What was the outcome of the championship game for the team?",
    "What does she enjoy doing in her free time, specifically related to reading?",
    "What did the company announce regarding the launch of their latest smartphone?",
    "What does the weather forecast predict for the weekend?"
]

In [4]:

def main():
    perplexity_calculator = PerplexityCalculator()

    # Score every sentence and collect one row per sentence.
    rows = [
        {'queryExpression': sentence, 'perplexity': perplexity_calculator.calculate_perplexity(sentence)}
        for sentence in sentences
    ]
    # Sort ascending so the lowest-scoring sentences come first.
    return pd.DataFrame(rows).sort_values('perplexity')

In [ ]:

df = main()

In [6]:

df

Out[6]:

	queryExpression	perplexity
0	What does a new study show about the vaccine’s effectiveness in reducing the risk of infection?	1.068590
2	What does the research paper discuss regarding the efficacy of a novel drug in treating cancer?	1.082275
3	What complex neurosurgery did the medical team successfully perform on the infected patient?	1.085622
1	What symptoms did the patient exhibit consistent with the flu, and what was done immediately?	1.085882
6	Why did she decide to take up yoga and how does it improve her overall well-being?	1.087844
4	How can physical therapy and regular exercise improve joint mobility in arthritis patients?	1.098204
7	What benefits are highlighted in the article regarding a balanced diet and maintaining good health?	1.104272
9	What did the documentary focus on regarding the impact of pollution on public health?	1.109692
12	What does she enjoy doing in her free time, specifically related to reading?	1.113692
5	What topics were covered in the conference related to healthcare and wellness?	1.122161
13	What did the company announce regarding the launch of their latest smartphone?	1.124373
8	Why is regular handwashing essential in preventing the spread of infections?	1.136480
11	What was the outcome of the championship game for the team?	1.156340
10	How has the new movie release been received by critics?	1.204013
14	What does the weather forecast predict for the weekend?	1.241182
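One way to summarize the table is to average the scores per sentence group. The sketch below copies the values from the output above and assumes the grouping implied by the order of the sentences list (rows 0 to 4 highly related, 5 to 9 slightly related, 10 to 14 unrelated); the group labels are ours.

```python
from statistics import mean

# Scores copied from the output table, keyed by row index.
perplexity = {
    0: 1.068590, 1: 1.085882, 2: 1.082275, 3: 1.085622, 4: 1.098204,
    5: 1.122161, 6: 1.087844, 7: 1.104272, 8: 1.136480, 9: 1.109692,
    10: 1.204013, 11: 1.156340, 12: 1.113692, 13: 1.124373, 14: 1.241182,
}

# Row ranges follow the order in which the sentences were defined.
groups = {
    "highly related": range(0, 5),
    "slightly related": range(5, 10),
    "unrelated": range(10, 15),
}

group_means = {name: mean(perplexity[i] for i in rows) for name, rows in groups.items()}
for name, value in group_means.items():
    print(f"{name}: {value:.4f}")
```

On these values the group average rises from the highly related sentences to the unrelated ones, although individual sentences overlap: row 12 (unrelated) scores lower than row 5 (slightly related).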

© Copyright 2022 ml4data - All Rights Reserved