Simpledorff - Krippendorff's Alpha On DataFrames
tl;dr
Calculate Krippendorff's Alpha on any dataframe in two lines. Repo is here
!pip install simpledorff
import simpledorff
import pandas as pd
Data = pd.read_csv('./examples/from_paper.csv') #Load Your Dataframe
simpledorff.calculate_krippendorffs_alpha_for_df(Data,experiment_col='document_id',
                                                 annotator_col='annotator_id',
                                                 class_col='annotation')
0.743421052631579
Knowing How Good Our Data Is
Say you have a team of 5 annotators classifying documents into 10 classes. To ensure quality and move fast, you've had 2 annotators annotate each document, and using LightTag, each document is labeled by a different pair of annotators.
Your output will look something like this.
| | document_id | annotator_id | annotation |
---|---|---|---|
0 | 1 | B | 1 |
6 | 3 | B | 2 |
7 | 3 | C | 2 |
9 | 4 | B | 1 |
10 | 4 | C | 1 |
What you need to know is whether your output is any good: is your labeled data reliable? One option is to calculate an agreement matrix, but those are hard to interpret and communicate about.
What you want is one number that tells you how reliable your data is.
You're stepping into the lovely world of Inter-Annotator-Agreement and Inter-Annotator-Reliability, and at first glance you're spoiled for choice. Scott's Pi and Cohen's Kappa are commonly used, and Fleiss' Kappa is a popular reliability metric and even well loved at Huggingface.
The canonical measure for Inter-annotator agreement for categorical classification (without a notion of ordering between classes) is Fleiss' kappa.
— Julien Chaumond (@julien_c) May 1, 2019
See the Wikipedia article here: https://t.co/eKytvygR3Z
However, none of these metrics support our case, where not every annotator labeled every example, nor can we guarantee that every example was labeled exactly twice (or 3 times). Maybe some were labeled more by accident and some weren't labeled by two people yet.
Luckily, we can use Krippendorff's Alpha. Krippendorff's Alpha has a few traits that make it very well suited to our case. It supports:
- Any number of observers, not just two
- Any number of categories, scale values, or measures
- Incomplete or missing data
- Large and small sample sizes alike, not requiring a minimum
The catch: it's hard to compute.
Making Sure Reliability Measures are Reliable
A Python package that calculates Krippendorff's Alpha already exists. We found two things that were challenging with it and drove us to write this post and our implementation.
A More Intuitive API
First, it's hard to come to Krippendorff's Alpha cold and know how to format your data the right way. The available package assumes you do, but if you're just guessing, it's hard to know whether you did it right or got a random number. The package's API looks like this:
def krippendorff_alpha(data, metric=interval_metric, force_vecmath=False, convert_items=float, missing_items=None):
    '''
    Calculate Krippendorff's alpha (inter-rater reliability):
    data is in the format
    [
        {unit1:value, unit2:value, ...},  # coder 1
        {unit1:value, unit3:value, ...},  # coder 2
        ...                            # more coders
    ]
    or
    it is a sequence of (masked) sequences (list, numpy.array, numpy.ma.array, e.g.) with rows corresponding to coders and columns to items
    metric: function calculating the pairwise distance
    force_vecmath: force vector math for custom metrics (numpy required)
    convert_items: function for the type conversion of items (default: float)
    missing_items: indicator for missing items (default: None)
    '''
But what exactly `data` should look like, and how to get there from our original data, is unclear. Remember, we started with
| | document_id | annotator_id | annotation |
---|---|---|---|
0 | 1 | B | 1 |
6 | 3 | B | 2 |
7 | 3 | C | 2 |
9 | 4 | B | 1 |
10 | 4 | C | 1 |
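For comparison, getting from a long-format dataframe to the list-of-dicts layout described in that docstring could look roughly like the sketch below. The column names match the table above, but the toy rows and the helper name `to_coder_dicts` are made up for illustration and are not part of either package.

```python
import pandas as pd

# Toy long-format data (hypothetical rows, same columns as the table above).
df = pd.DataFrame(
    {
        "document_id": [1, 1, 2, 2, 3],
        "annotator_id": ["A", "B", "A", "C", "B"],
        "annotation": [1, 1, 2, 3, 3],
    }
)


def to_coder_dicts(df, unit_col, coder_col, value_col):
    """Build one {unit: value} dict per coder, in sorted coder order."""
    return [
        dict(zip(group[unit_col], group[value_col]))
        for _, group in df.groupby(coder_col, sort=True)
    ]


coder_dicts = to_coder_dicts(df, "document_id", "annotator_id", "annotation")
print(coder_dicts)
```

Even in this small case you have to remember that the outer list is indexed by coder and the inner dicts by unit, which is exactly the kind of detail that is easy to get backwards.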
Validating Ourselves
Second, and very much driven by the first point, we wanted to make sure we understood the statistic and how it's calculated, so that we knew what it means and could formulate a simpler, maybe even foolproof, API.
The next part of this blog post walks through simpledorff's implementation.
Calculating Krippendorff's Alpha in Python With Pandas
There are a few equivalent ways to calculate Krippendorff's Alpha, and here we want to show a Python implementation of Krippendorff's general method, published in the last section here. This isn't the method in Wikipedia, but we found it easier to grok and work with.
Terminology and Data Transforms
Let's get some terminology set up and then show the code.
Krippendorff talks about units, e.g. a single thing being classified by multiple people. In our case, a unit is a document being classified.
Krippendorff assumes an input in table format. Each row in the table corresponds to an annotator. And each column in the table corresponds to a unit/document. In the original table, a unit corresponds to a document_id.
annotator_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
A | 1 | 2 | 3 | 3 | 2 | 1 | 4 | 1 | 2 | nan | nan | nan |
B | 1 | 2 | 3 | 3 | 2 | 2 | 4 | 1 | 2 | 5 | nan | 3 |
C | nan | 3 | 3 | 3 | 2 | 3 | 4 | 2 | 2 | 5 | 1 | nan |
D | 1 | 2 | 3 | 3 | 2 | 4 | 4 | 1 | 2 | 5 | 1 | nan |
Here's the code that goes from our original table to Krippendorff's format:
def df_to_experiment_annotator_table(df,experiment_col,annotator_col,class_col):
    return df.pivot_table(
        index=annotator_col, columns=experiment_col, values=class_col, aggfunc="first"
    )

df_to_experiment_annotator_table(original_data,'document_id','annotator_id','annotation')
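As a quick check of what the pivot produces, here is a small self-contained run. The toy rows are invented, and the function body is repeated only so the snippet runs on its own.

```python
import pandas as pd


def df_to_experiment_annotator_table(df, experiment_col, annotator_col, class_col):
    # One row per annotator, one column per unit; missing pairs become NaN.
    return df.pivot_table(
        index=annotator_col, columns=experiment_col, values=class_col, aggfunc="first"
    )


# Hypothetical annotations: annotator A never labeled document 3.
toy = pd.DataFrame(
    {
        "document_id": [1, 1, 2, 2, 3],
        "annotator_id": ["A", "B", "A", "B", "B"],
        "annotation": [1, 1, 2, 3, 3],
    }
)

table = df_to_experiment_annotator_table(
    toy, "document_id", "annotator_id", "annotation"
)
print(table)
```

Note that `aggfunc="first"` silently keeps only one label if the same annotator labeled the same document twice, so duplicates are worth checking for before pivoting.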
Krippendorff wants to calculate two quantities from this table: the observed number of disagreements (Do) and an estimate of the likelihood of a disagreement occurring by chance (De).
Notice that the calculation happens "in the negative", we're thinking about the likelihood of bad things happening and convert that into a "positive" measure of reliability at the end. If the likelihood of bad things happening is low then our data is reliable.
The likelihood of "bad things happening" is simply the ratio of Do to De. We observe the number of disagreements and compare it to the number of disagreements we'd expect to see by chance. If we see far fewer disagreements than chance would predict, then our data is reliable. If we see far more disagreements than chance would predict, then we can assume there is a systematic problem (someone is maliciously annotating). Somewhere in between, and we need to talk with our team and figure out what's going wrong. (LightTag has analytical tools to deep dive and review.)
The actual math for calculating Krippendorff's alpha is simple. However, the calculation requires some data wrangling kung-fu that can get tricky. Krippendorff's generic recipe goes like this:
The Recipe
Take your input table of experiments and transform it into a new table, where each column is an experiment/unit and each row corresponds to a possible class. The value of a cell at unit (column) i and class (row) j is the number of annotations of unit i with class j.
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 2 | 0 |
2 | 0 | 3 | 0 | 0 | 4 | 1 | 0 | 1 | 4 | 0 | 0 | 0 |
3 | 0 | 1 | 4 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
Our new table is interesting because a value higher than one indicates there was an agreement in the respective unit and class.
- A unit (experiment) contains a disagreement if its column has more than one non-zero value.
- In a particular unit (experiment), if we multiply two non-zero values the result is the total number of disagreements for that pair of classes.
- If we take the sum of those products in a column and divide it by the sum of the entire column minus 1, we get the disagreement rate for that unit (experiment).
- Summing that across all columns gives us the observed disagreement rate.
- Notice that unit 12 has only one response, and so it has no agreement information. We'll need to handle that case.
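To make the bullets concrete, here is a hedged sketch of the per-unit arithmetic for nominal data, using column counts read off the table above. The function name `unit_disagreement` is ours, for illustration only, and is not part of simpledorff.

```python
from itertools import permutations


def unit_disagreement(counts):
    """Disagreement contributed by one unit, given {class: count} for that unit.

    Sums count_c * count_k over ordered pairs of different classes and
    divides by (annotations in the unit - 1). Units with fewer than two
    annotations carry no agreement information and contribute 0.
    """
    m = sum(counts.values())
    if m < 2:
        return 0.0
    mismatched = sum(counts[c] * counts[k] for c, k in permutations(counts, 2))
    return mismatched / (m - 1)


unit_1 = unit_disagreement({1: 3})                     # everyone agrees
unit_2 = unit_disagreement({2: 3, 3: 1})               # three vs. one
unit_6 = unit_disagreement({1: 1, 2: 1, 3: 1, 4: 1})   # everyone differs
unit_12 = unit_disagreement({3: 1})                    # single annotation
print(unit_1, unit_2, unit_6, unit_12)
```

The pairs are counted in both orders (c, k) and (k, c). That is harmless as long as the chance estimate counts pairs the same way, because the factor cancels in the ratio.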
That's the hard part. Now the easy part. We need to compute an estimate of disagreement by chance, which we do by multiplying the frequencies of each class in the experiment with the frequencies of the other classes. Taking the sum of those products gives us De.
Finally, we take the ratio of Do to De and multiply it by the total number of observations minus 1. That gives the sample estimate of the ratio of observed disagreement to the disagreement we'd expect under chance. Subtracting that from 1 gives us our reliability score.
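Written out in symbols, the recipe reads as follows. The notation is ours, chosen to match the code in this post rather than Krippendorff's paper: n_{uc} is the number of annotations of unit u with class c, m_u is the number of annotations in unit u (only units with m_u of at least 2 are included), n_c is the total count of class c over those units, N is the sum of all n_c, and δ(c, k) is the metric (for nominal data, 0 when c = k and 1 otherwise).

```latex
\alpha = 1 - (N - 1)\,\frac{D_o}{D_e},
\qquad
D_o = \sum_{u} \frac{1}{m_u - 1} \sum_{c}\sum_{k} n_{uc}\, n_{uk}\, \delta(c, k),
\qquad
D_e = \sum_{c}\sum_{k} n_{c}\, n_{k}\, \delta(c, k)
```

For the example table, with unit 12 excluded, this gives D_o = 8, D_e = 1216 and N = 40, so α = 1 − 39 · 8 / 1216 ≈ 0.743.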
The Code
Below is an implementation of the above recipe.
Preparing The Table
First, we make our table mapping values to units.
from collections import Counter

def make_value_by_unit_table_dict(experiment_annotator_df):
    """
    :param experiment_annotator_df: A dataframe that came out of df_to_experiment_annotator_table
    :return: A dictionary of dictionaries (e.g. a table) whose rows (first level) are experiments and columns are responses
    {1: Counter({1.0: 1}),
    2: Counter(),
    3: Counter({2.0: 2}),
    4: Counter({1.0: 2}),
    5: Counter({3.0: 2}),
    """
    # Transpose so that each row is a unit/experiment and each column an annotator.
    data_by_exp = experiment_annotator_df.T.sort_index(axis=1).sort_index()
    table_dict = {}
    for exp, row in data_by_exp.iterrows():
        # Missing annotations (NaN) are dropped before counting.
        vals = row.dropna().values
        table_dict[exp] = Counter()
        for val in vals:
            table_dict[exp][val] += 1
    return table_dict
Masking Units with fewer than two annotations
The next step is to fix column 12, which only had a single annotator. We need to only look at units that had at least two annotators, because we're working with agreement data.
vbu_df = (
    pd.DataFrame.from_dict(vbu_table_dict, orient="index")
    .T.sort_index(axis=0)
    .sort_index(axis=1)
    .fillna(0)
)
ubv_df = vbu_df.T
vbu_df_masked = ubv_df.mask(ubv_df.sum(1) == 1, other=0).T
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 2 | 0 |
2 | 0 | 3 | 0 | 0 | 4 | 1 | 0 | 1 | 4 | 0 | 0 | 0 |
3 | 0 | 1 | 4 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
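If the `mask` call is unfamiliar, the following toy run shows the effect in isolation. The three units here are a hypothetical cut-down of the table, and the variable `single_annotation` is our own name.

```python
import pandas as pd

# Units as rows, classes as columns (the orientation of ubv_df in the post).
ubv_df = pd.DataFrame(
    {1: [3, 0, 0], 2: [0, 3, 0], 3: [0, 1, 1]},
    index=[1, 2, 12],
)

# True only for units whose annotation count is exactly one (unit 12 here).
single_annotation = ubv_df.sum(axis=1) == 1

# mask() replaces every value in the flagged rows with `other`.
masked = ubv_df.mask(single_annotation, other=0)
print(masked.T)  # back to classes as rows, units as columns
```

Units 1 and 2 pass through unchanged, while unit 12 becomes a column of zeros and so drops out of every later sum.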
Convenience Calculations
We calculate some things that make the next code easier to work with.
def calculate_frequency_dicts(vbu_table_dict):
    """
    :param vbu_table_dict: A value by unit table dictionary, the output of make_value_by_unit_table_dict
    :return: A dictionary of dictionaries
    {
        unit_freqs:{ 1:2..},
        class_freqs:{ 3:4..},
        total:7
    }
    """
    vbu_df = (
        pd.DataFrame.from_dict(vbu_table_dict, orient="index")
        .T.sort_index(axis=0)
        .sort_index(axis=1)
        .fillna(0)
    )
    ubv_df = vbu_df.T
    # Zero out units that have only a single annotation.
    vbu_df_masked = ubv_df.mask(ubv_df.sum(1) == 1, other=0).T
    return dict(
        unit_freqs=vbu_df_masked.sum().to_dict(),
        class_freqs=vbu_df_masked.sum(1).to_dict(),
        total=vbu_df_masked.sum().sum(),
    )
Calculate The Disagreement Rate Expected By Chance
def calculate_de(frequency_dicts, metric_fn):
    """
    Calculates the expected disagreement by chance
    :param frequency_dicts: The output of data_transforms.calculate_frequency_dicts e.g.:
    {
        unit_freqs:{ 1:2..},
        class_freqs:{ 3:4..},
        total:7
    }
    :param metric_fn metric function such as nominal_metric
    :return: De a float
    """
    De = 0
    class_freqs = frequency_dicts["class_freqs"]
    class_names = list(class_freqs.keys())
    # Sum over every ordered pair of classes, weighted by the metric.
    for c in class_names:
        for k in class_names:
            De += class_freqs[c] * class_freqs[k] * metric_fn(c, k)
    return De
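`metric_fn` is passed in but not shown in this post. For nominal (unordered) classes, a metric along the following lines would fit the signature used above. Treat both function bodies as our assumption rather than simpledorff's actual source; the interval variant is included only for contrast. The class totals are the row sums of the masked table.

```python
def nominal_metric(a, b):
    """Assumed nominal distance: 0 when the classes match, 1 otherwise."""
    return int(a != b)


def interval_metric(a, b):
    """Assumed interval distance: squared difference between numeric values."""
    return (a - b) ** 2


# Row sums of the masked value-by-unit table (unit 12 already zeroed out).
class_freqs = {1: 9, 2: 13, 3: 10, 4: 5, 5: 3}

De = sum(
    class_freqs[c] * class_freqs[k] * nominal_metric(c, k)
    for c in class_freqs
    for k in class_freqs
)

# With a nominal metric, De reduces to N^2 minus the sum of squared class counts.
total = sum(class_freqs.values())
closed_form = total**2 - sum(n**2 for n in class_freqs.values())
print(De, closed_form)
```

Swapping in a different metric changes how heavily each pair of classes is penalized, which is how the same recipe extends beyond nominal data.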
Calculate The Observed Disagreement Rate
def calculate_do(vbu_table_dict, frequency_dicts, metric_fn):
    """
    :param vbu_table_dict: Output of data_transforms.make_value_by_unit_table_dict
    :param frequency_dicts: The output of data_transforms.calculate_frequency_dicts e.g.:
    {
        unit_freqs:{ 1:2..},
        class_freqs:{ 3:4..},
        total:7
    }
    :param metric_fn: metric_fn metric function such as nominal_metric
    :return: Do a float
    """
    Do = 0
    unit_freqs = frequency_dicts["unit_freqs"]
    unit_ids = list(unit_freqs.keys())
    for unit_id in unit_ids:
        unit_classes = list(vbu_table_dict[unit_id].keys())
        if unit_freqs[unit_id] < 2:
            # Units with fewer than two annotations carry no agreement information.
            continue
        weight = 1 / (unit_freqs[unit_id] - 1)
        for c in unit_classes:
            for k in unit_classes:
                Do += (
                    vbu_table_dict[unit_id][c]
                    * vbu_table_dict[unit_id][k]
                    * weight
                    * metric_fn(c, k)
                )
    return Do
And Finally Get Alpha
def calculate_krippendorffs_alpha(ea_table_df, metric_fn=nominal_metric):
    """
    :param ea_table_df: The Experiment/Annotator table, output from data_transforms.df_to_experiment_annotator_table
    :param metric_fn: The metric function. Defaults to nominal
    :return: Alpha, a float
    """
    vbu_table_dict = data_transforms.make_value_by_unit_table_dict(ea_table_df)
    frequency_dict = data_transforms.calculate_frequency_dicts(vbu_table_dict)
    observed_disagreement = calculate_do(
        vbu_table_dict=vbu_table_dict,
        frequency_dicts=frequency_dict,
        metric_fn=metric_fn,
    )
    expected_disagreement = calculate_de(
        frequency_dicts=frequency_dict, metric_fn=metric_fn
    )
    N = frequency_dict['total']
    alpha = 1 - (observed_disagreement / expected_disagreement)*(N-1)
    return alpha
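As a sanity check on the whole recipe, the sketch below recomputes alpha for the example table using only the standard library. It is an independent re-derivation under the nominal metric, not simpledorff's code, and the names `ratings`, `nominal`, `units` and `class_totals` are ours.

```python
from collections import Counter

# The example data shown earlier: one row per annotator, None = missing.
ratings = {
    "A": [1, 2, 3, 3, 2, 1, 4, 1, 2, None, None, None],
    "B": [1, 2, 3, 3, 2, 2, 4, 1, 2, 5, None, 3],
    "C": [None, 3, 3, 3, 2, 3, 4, 2, 2, 5, 1, None],
    "D": [1, 2, 3, 3, 2, 4, 4, 1, 2, 5, 1, None],
}


def nominal(a, b):
    return int(a != b)


# Count classes per unit, keeping only units with at least two annotations.
units = []
for column in zip(*ratings.values()):
    counts = Counter(v for v in column if v is not None)
    if sum(counts.values()) >= 2:
        units.append(counts)

class_totals = sum(units, Counter())
n_total = sum(class_totals.values())

# Observed disagreement: per-unit pair products, weighted by 1 / (m_u - 1).
observed = sum(
    u[c] * u[k] * nominal(c, k) / (sum(u.values()) - 1)
    for u in units for c in u for k in u
)
# Expected disagreement: pair products of the overall class totals.
expected = sum(
    class_totals[c] * class_totals[k] * nominal(c, k)
    for c in class_totals for k in class_totals
)

alpha = 1 - (observed / expected) * (n_total - 1)
print(round(alpha, 6))
```

The result agrees with the 0.743421052631579 shown in the tl;dr, up to floating-point rounding.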
Wrapping Up
Hopefully you can use this library without having to think about the code much, but if you'd like to contribute, we're happily accepting PRs. And if you need to get some labeled data to measure reliability on, try LightTag.