Yale Scientific Article Summarization Dataset

Background: What is Scisumm?

A summary of a scientific paper should ideally reflect the paper's impact on the research community, as evidenced by its citations. To facilitate research in citation-aware scientific paper summarization (Scisumm), the CL-Scisumm shared task has been organized since 2014 for papers in the computational linguistics and NLP domain.

The latest CL-Scisumm 2018 task contains 40 NLP papers with citation sentences and human-annotated reference summaries. Participants develop systems that automatically produce summaries using the original papers and their citation information.

The ScisummNet Corpus

At the Yale LILY lab, we have expanded the CL-Scisumm project and developed the first large-scale, human-annotated Scisumm dataset, ScisummNet. It provides over 1,000 papers from the ACL Anthology Network, together with their citation networks (e.g., citation sentences and citation counts) and comprehensive, manually written summaries.

The following paper introduces the corpus in detail and shows how ScisummNet enables the training of data-driven neural summarization models for scientific papers.
Read the paper (AAAI 2019)
NEWS: The ScisummNet corpus was featured in the CL-Scisumm 2019 shared task! Check out the project page.

Getting Started

Download the dataset (distributed under the CC BY-SA 4.0 license):
ScisummNet ver1.1 (15 MB)
When unzipped, the package contains a dataset description and subdirectories for the 1,000 papers. Each paper directory contains the paper's PDF file, XML file, annotated citation information (in JSON format), and manual summary. Please see the included documentation for more details.
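As a rough illustration of working with this layout, the following Python sketch indexes the unzipped package by grouping each paper directory's files by extension. It deliberately assumes no specific file or folder names. The helper names `index_corpus` and `load_citation_info` are hypothetical, `root` is assumed to be the directory whose immediate children are the paper directories, and the JSON schema is left uninterpreted. The documentation shipped with the dataset remains the authoritative description.

```python
import json
from collections import defaultdict
from pathlib import Path


def index_corpus(root):
    """Map each paper directory name to its files, grouped by lowercase extension.

    Assumption: `root` is the directory whose immediate subdirectories are the
    individual paper directories. Top-level files are skipped.
    """
    index = {}
    for paper_dir in sorted(Path(root).iterdir()):
        if not paper_dir.is_dir():
            continue  # e.g. the dataset description file
        by_ext = defaultdict(list)
        # rglob also picks up files in nested folders, if the package has any
        for path in sorted(paper_dir.rglob("*")):
            if path.is_file():
                by_ext[path.suffix.lower()].append(path)
        index[paper_dir.name] = dict(by_ext)
    return index


def load_citation_info(paper_files):
    """Parse every JSON file found for one paper; no schema is assumed."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in paper_files.get(".json", [])
    ]
```

Calling `index_corpus` on the directory that holds the paper directories returns a dictionary keyed by directory name. Passing one of its values to `load_citation_info` returns the parsed JSON objects as-is, so the field names can be inspected before writing any schema-specific code.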

If you use our corpus or summarization models, please consider citing the following papers.

    title = {{ScisummNet}: A Large Annotated Corpus and Content-Impact Models for Scientific Paper Summarization with Citation Networks},
    author = {Michihiro Yasunaga and Jungo Kasai and Rui Zhang and Alexander Fabbri and Irene Li and Dan Friedman and Dragomir Radev},
    booktitle = {Proceedings of AAAI 2019},
    year = {2019}

    title = {Graph-based Neural Multi-Document Summarization},
    author = {Yasunaga, Michihiro and Zhang, Rui and Meelu, Kshitijh and Pareek, Ayush and Srinivasan, Krishnan and Radev, Dragomir R.},
    booktitle = {Proceedings of CoNLL 2017},
    year = {2017}


We thank the members of the CL-Scisumm team, Kokil Jaidka, Muthu Kumar Chandrasekaran, and Min-Yen Kan, for their help on this project. We are also grateful to the developers of the SQuAD website, from which this website design is adapted.