Datasets

Signal 1M (available here for research purposes)

This dataset was released by Signal AI to facilitate research on news articles. It was first used for submissions to the NewsIR'16 workshop, but is intended to serve the community for research on news retrieval in general.

Signal-1M Related Tweets (available here for research purposes)

A TREC-like data collection for evaluating approaches to related-tweet retrieval for news articles.
This collection was described in a peer-reviewed paper in ECIR 2018.

Signal-1M Summary Articles (available here for research purposes)

Signal-1M articles that comprise disparate topical sections rather than covering a single topic.
This collection was described in a peer-reviewed paper in ECIR 2019.

Tools

Signal AI shared some sample code for uploading and processing the one-million-article collection.
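
As a rough illustration of what processing a collection of this size can look like, the sketch below streams articles one at a time instead of loading them all into memory. It is not the sample code shared by Signal AI. It assumes the collection is stored as JSON Lines (one JSON object per line, optionally gzip-compressed), and the function names `iter_articles` and `count_by_field` and the field name `source` are hypothetical choices made for this example.

```python
import gzip
import json
from typing import Dict, Iterator


def iter_articles(path: str) -> Iterator[Dict]:
    """Yield one article dict per non-empty line of a JSON Lines file."""
    # Assumption: the file is gzip-compressed if its name ends in ".gz",
    # and plain UTF-8 text otherwise.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def count_by_field(path: str, field: str = "source") -> Dict[str, int]:
    """Count articles per value of a metadata field.

    The default field name "source" is a hypothetical example; articles
    that lack the field are counted under "unknown".
    """
    counts: Dict[str, int] = {}
    for article in iter_articles(path):
        key = str(article.get(field, "unknown"))
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Because `iter_articles` is a generator, memory use stays roughly constant regardless of the number of articles, which matters for a collection of around one million documents.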