This dataset is released by Signal AI to facilitate research on news articles. It was initially used for submissions to the NewsIR'16 workshop, but it is intended to serve the community for research on news retrieval in general.
The articles were originally collected by Moreover Technologies (one of Signal's content providers) from a variety of news sources over a period of one month (1-30 September 2015). The dataset contains 1 million articles that are mainly in English, but it also includes non-English and multilingual articles. Sources include major outlets, such as Reuters, as well as local news sources and blogs.
Download
To obtain the dataset, please follow this link:
http://goo.gl/forms/5i4KldoWIX
Format
Upon downloading the data, you get a single compressed text file (approximately 1GB in size). You can decompress it with gzip, zcat, or a similar tool. The file is in JSONL format, where each line is a JSON object representing an article. Each article has the following fields:
- id: a unique identifier for the article
- title: the title of the article
- content: the textual content of the article (may occasionally contain HTML and JavaScript content)
- source: the name of the article source (e.g. Reuters)
- published: the publication date and time of the article (a timestamp such as 2015-09-15T10:17:53Z)
- media-type: either "News" or "Blog"
{ "id": "a080f99a-07d9-47d1-8244-26a540017b7a", "content": "KUALA LUMPUR, Sept 15 (MySinchew) -- The Kuala Lumpur City Hall today issued ...", "title": "Pay up or face legal action: DBKL", "media-type": "News", "source": "My Sinchew", "published": "2015-09-15T10:17:53Z" }
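The format described above can be read without decompressing the file to disk first. The following is a minimal Python sketch, not part of the official release; the function name iter_articles and any file name you pass to it are assumptions made for illustration.

```python
import gzip
import json
from typing import Dict, Iterator


def iter_articles(path: str) -> Iterator[Dict[str, str]]:
    """Yield one article dict per non-empty line of a gzip-compressed JSONL file.

    `path` is whatever name the downloaded file has on your machine;
    the dataset description does not fix a file name.
    """
    # "rt" opens the gzip stream in text mode, so each iteration yields one line.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)
```

Because the function is a generator, articles are parsed one at a time and the full file never has to fit in memory. For example, next(iter_articles(path))["title"] would return the title of the first article.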
General Statistics
Below is a summary of general statistics of the dataset:
- There are over 93k unique sources
- The dataset contains 265,512 Blog articles and 734,488 News articles
- The average length of an article is 405 words
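Comparable figures can be recomputed from the parsed articles. The sketch below is illustrative only: the function name summarize is hypothetical, and counting words by splitting on whitespace is an assumption, so the resulting average may differ from the figure reported above if a different tokenization was used.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize(articles: Iterable[Dict[str, str]]) -> Dict[str, object]:
    """Compute article counts per media type, unique sources, and mean length."""
    media_types: Counter = Counter()
    sources = set()
    total_words = 0
    count = 0
    for article in articles:
        media_types[article.get("media-type", "unknown")] += 1
        sources.add(article.get("source", ""))
        # Assumption: a "word" is any whitespace-delimited token.
        total_words += len(article.get("content", "").split())
        count += 1
    return {
        "articles": count,
        "media_types": dict(media_types),
        "unique_sources": len(sources),
        "mean_words": total_words / count if count else 0.0,
    }
```

The function accepts any iterable of article dicts, so it can consume a streaming reader over the dataset file and make a single pass over the data.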
Tools
We have released a script to convert the data to TREC format, which lets researchers easily index the data with popular platforms such as Terrier.
You can check out the script on GitHub: https://github.com/signal-ai/Signal-1M-Tools
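To give a rough idea of what such a conversion involves, here is a minimal Python sketch. It is not the released script: the function name to_trec is hypothetical, and the choice of DOC, DOCNO, TITLE, and TEXT tags is an assumption based on the commonly used TREC document layout, so the released tool may produce different output.

```python
from typing import Dict
from xml.sax.saxutils import escape


def to_trec(article: Dict[str, str]) -> str:
    """Render one article as a TREC-style document (assumed tag layout)."""
    # escape() replaces &, < and > so that stray HTML in the content
    # cannot be mistaken for TREC markup.
    title = escape(article.get("title", ""))
    content = escape(article.get("content", ""))
    return (
        "<DOC>\n"
        f"<DOCNO>{article['id']}</DOCNO>\n"
        f"<TITLE>{title}</TITLE>\n"
        f"<TEXT>\n{content}\n</TEXT>\n"
        "</DOC>\n"
    )
```

Escaping matters here because the content field may occasionally contain HTML and JavaScript, which could otherwise confuse a tag-based parser.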
Citing
@inproceedings{Signal1M2016,
  author    = {David Corney and Dyaa Albakour and Miguel Martinez and Samir Moussa},
  title     = {What do a Million News Articles Look like?},
  booktitle = {Proceedings of the First International Workshop on Recent Trends in News Information Retrieval co-located with 38th European Conference on Information Retrieval {(ECIR} 2016), Padua, Italy, March 20, 2016.},
  pages     = {42--47},
  year      = {2016},
  url       = {http://ceur-ws.org/Vol-1568/paper8.pdf}
}