CDL Misinfo Datasets

A compound misinfo detection benchmark by Complex Data Lab

Misinformation is a challenging societal issue, and effective mitigations are difficult to build because of deficiencies in the available data. To address this problem, we have curated the largest collection of (mis)information datasets in the literature, totaling 75. From these, we evaluated the quality of all 35 datasets that consist of statements or claims. If you would like to contribute a novel dataset or report any issues, please email us or visit our Hugging Face or GitHub page.

The largest collection of (mis)information datasets

A curated collection of 75 misinformation datasets, with a unified setup for working with the 35 claim and statement datasets, available here.

Dataset Quality Assessment

We evaluated the quality of 35 datasets, identifying potential flaws such as insufficient label quality, spurious correlations, and political bias. This helps researchers select datasets that are suitable for their work.

Evaluation of Detection Models

Our paper provides state-of-the-art baselines for misinformation detection on these datasets, demonstrates the limitations of categorical labels, and suggests alternative evaluation methods.