Signal 1 Million News Articles Dataset

Signal AI released this dataset to facilitate research on news articles. It was first used for submissions to the NewsIR'16 workshop, but it is intended to serve the wider community for research on news retrieval in general.

The articles were originally collected by Moreover Technologies (one of Signal's content providers) from a variety of news sources over a period of one month (1-30 September 2015). The dataset contains 1 million articles, mainly in English, along with some non-English and multilingual articles. Sources include major outlets such as Reuters as well as local news sources and blogs.

Download

To obtain the dataset, please follow this link:
http://goo.gl/forms/5i4KldoWIX

Format

The download is a single compressed text file (approximately 1 GB in size). You can decompress it with gzip, zcat, or a similar tool. The file is in JSONL format: each line is a JSON object representing one article. Each article has the following fields:

id: a unique identifier for the article
title: the title of the article
content: the textual content of the article (may occasionally contain HTML and JavaScript content)
source: the name of the article source (e.g. Reuters)
published: the publication date and time of the article (a timestamp such as 2015-09-15T10:17:53Z)
media-type: either "News" or "Blog"

Below is an example from the dataset (the content has been shortened for brevity):

{
"id": "a080f99a-07d9-47d1-8244-26a540017b7a",
"content": "KUALA LUMPUR, Sept 15 (MySinchew) -- The Kuala Lumpur City Hall today issued ...",
"title": "Pay up or face legal action: DBKL",
"media-type": "News",
"source": "My Sinchew",
"published": "2015-09-15T10:17:53Z"
}
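
Because the file is JSONL, it can be streamed one article at a time without decompressing it to disk first. The following is a minimal Python sketch, not part of the released tools. The function names parse_articles and read_articles are hypothetical, and UTF-8 encoding is an assumption. The demo at the end parses the sample record shown above; to read the real file, pass the path of your downloaded copy to read_articles.

```python
import gzip
import json
from typing import Iterable, Iterator


def parse_articles(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one article dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def read_articles(path: str) -> Iterator[dict]:
    """Stream articles from a gzip-compressed JSONL file.

    Assumption: the file is UTF-8 encoded.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from parse_articles(handle)


# Demo on the sample record from this README (content shortened as above).
sample_line = json.dumps({
    "id": "a080f99a-07d9-47d1-8244-26a540017b7a",
    "content": "KUALA LUMPUR, Sept 15 (MySinchew) -- The Kuala Lumpur City Hall today issued ...",
    "title": "Pay up or face legal action: DBKL",
    "media-type": "News",
    "source": "My Sinchew",
    "published": "2015-09-15T10:17:53Z",
})
article = next(parse_articles([sample_line]))
print(article["title"], "|", article["source"], "|", article["published"])
```

Streaming keeps memory use flat regardless of file size, which matters for a file that holds 1 million articles.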

General Statistics

Below is a summary of general statistics of the dataset:

The dataset contains over 93k unique sources
The dataset contains 265,512 Blog articles and 734,488 News articles
The average length of an article is 405 words
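
Statistics of this kind can be recomputed in a single pass over the articles. The sketch below is illustrative only: the function name summarize is hypothetical, and counting words by splitting on whitespace is an assumption, since this README does not say how article length was measured, so the result may not match the figure above exactly. The three records in the demo are made up for illustration and are not taken from the dataset.

```python
from collections import Counter
from typing import Iterable


def summarize(articles: Iterable[dict]) -> dict:
    """Compute source, media-type, and length statistics in one pass."""
    media_types = Counter()
    sources = set()
    total_words = 0
    count = 0
    for article in articles:
        media_types[article.get("media-type")] += 1
        sources.add(article.get("source"))
        # Assumption: a "word" is a whitespace-separated token.
        total_words += len((article.get("content") or "").split())
        count += 1
    return {
        "articles": count,
        "unique_sources": len(sources),
        "by_media_type": dict(media_types),
        "mean_words": total_words / count if count else 0.0,
    }


# Made-up records, for illustration only.
records = [
    {"source": "A", "media-type": "News", "content": "one two three"},
    {"source": "A", "media-type": "Blog", "content": "four five"},
    {"source": "B", "media-type": "News", "content": "six"},
]
stats = summarize(records)
print(stats)
```

The function accepts any iterable of article dicts, so it can be fed from a generator that streams the dataset file line by line.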

Tools

We have released a script that converts the data to TREC format, which lets researchers index the data easily with popular platforms such as Terrier.

You can check out the script from GitHub: https://github.com/signal-ai/Signal-1M-Tools
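
To give a rough idea of what such a conversion involves, here is a minimal Python sketch. It is not the released script and may differ from its output. The function name to_trec is hypothetical, and the DOC/DOCNO/TEXT tag layout is an assumption based on the common TREC document convention. For real use, prefer the script in the repository above.

```python
def to_trec(article: dict) -> str:
    """Render one article as a TREC-style document.

    Assumption: a DOC/DOCNO/TEXT layout, with the title placed
    on the first line of the TEXT section.
    """
    title = article.get("title") or ""
    content = article.get("content") or ""
    return (
        "<DOC>\n"
        f"<DOCNO>{article['id']}</DOCNO>\n"
        "<TEXT>\n"
        f"{title}\n"
        f"{content}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


# Demo on the sample record from this README (content shortened as above).
sample = {
    "id": "a080f99a-07d9-47d1-8244-26a540017b7a",
    "content": "KUALA LUMPUR, Sept 15 (MySinchew) -- The Kuala Lumpur City Hall today issued ...",
    "title": "Pay up or face legal action: DBKL",
    "media-type": "News",
    "source": "My Sinchew",
    "published": "2015-09-15T10:17:53Z",
}
trec_doc = to_trec(sample)
print(trec_doc)
```

Note that the content field may occasionally contain HTML or JavaScript, so a real conversion may need to clean the text before indexing.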

Citing

@inproceedings{Signal1M2016,
author    = {David Corney and Dyaa Albakour and Miguel Martinez and Samir Moussa},
title     = {What do a Million News Articles Look like?},
booktitle = {Proceedings of the First International Workshop on Recent Trends in
           News Information Retrieval co-located with 38th European Conference
           on Information Retrieval {(ECIR} 2016), Padua, Italy, March 20, 2016.},
pages     = {42--47},
year      = {2016},
url       = {http://ceur-ws.org/Vol-1568/paper8.pdf}
}