The First International Workshop on Recent Trends in News Information Retrieval will take place in Padua, Italy in conjunction with ECIR 2016.
Signal Media has released this dataset to facilitate research on news articles. It can be used for submissions to the NewsIR'16 workshop, but it is intended to serve the community for research on news retrieval in general.
The articles were originally collected by Moreover Technologies (one of Signal's content providers) from a variety of news sources over a period of one month (1-30 September 2015). The dataset contains 1 million articles, mainly in English, but it also includes non-English and multilingual articles. Sources include major outlets, such as Reuters, as well as local news sources and blogs.
To obtain the dataset, please follow this link:
http://goo.gl/forms/5i4KldoWIX
The download is a single compressed text file (approximately 1 GB in size), which you can uncompress with tools such as gzip or zcat. The file is in JSONL format, where each line is a JSON object representing one article. Each article has the following fields:
{
  "id": "a080f99a-07d9-47d1-8244-26a540017b7a",
  "content": "KUALA LUMPUR, Sept 15 (MySinchew) -- The Kuala Lumpur City Hall today issued ...",
  "title": "Pay up or face legal action: DBKL",
  "media-type": "News",
  "source": "My Sinchew",
  "published": "2015-09-15T10:17:53Z"
}
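As a minimal sketch of how the file could be read without uncompressing it first, the Python snippet below streams the gzipped JSONL file one article at a time and counts articles per "media-type". The function names and the UTF-8 encoding are assumptions made for illustration; only the field names come from the sample record above.

```python
import gzip
import json
from collections import Counter


def parse_article(line):
    """Parse one JSONL line into a dict holding the article fields."""
    return json.loads(line)


def iter_articles(path):
    """Yield one article dict per non-empty line of a gzipped JSONL file.

    The file is read as a stream, so the full dataset never has to fit
    in memory. UTF-8 is assumed here; it is not stated in the description.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield parse_article(line)


def count_media_types(articles):
    """Count articles by their "media-type" field (None if the field is absent)."""
    return Counter(article.get("media-type") for article in articles)
```

For example, `count_media_types(iter_articles("signal-1m.jsonl.gz"))` would tally the media types across the whole file, where `signal-1m.jsonl.gz` is a placeholder for whatever name the downloaded file actually has.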
Below is a summary of general statistics of the dataset:
We have released a script to convert the data to TREC format, which enables researchers to index the data easily with popular platforms such as Terrier.
You can check out the script from GitHub: https://github.com/SignalMedia/Signal-1M-Tools
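To give a rough idea of what such a conversion involves, the sketch below renders a single article as a TREC-style document. It is not the released script: the tag layout (DOC, DOCNO, TITLE, TEXT) is an assumption based on common TREC-style markup, and the function name is hypothetical, so the script in the repository remains the reference.

```python
from xml.sax.saxutils import escape


def article_to_trec(article):
    """Render one article dict as a TREC-style <DOC> block.

    The tag layout is an assumed, commonly used TREC-style markup,
    not necessarily what the Signal-1M-Tools script produces.
    """
    # Escape &, < and > so that the text cannot break the markup.
    title = escape(article.get("title", ""))
    content = escape(article.get("content", ""))
    return (
        "<DOC>\n"
        f"<DOCNO>{article['id']}</DOCNO>\n"
        f"<TITLE>{title}</TITLE>\n"
        "<TEXT>\n"
        f"{content}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )
```

Using the article "id" as the document number keeps each TREC document traceable to its original JSON record.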
@inproceedings{Signal1M2016,
  author    = {David Corney and Dyaa Albakour and Miguel Martinez and Samir Moussa},
  title     = {What do a Million News Articles Look like?},
  booktitle = {Proceedings of the First International Workshop on Recent Trends in News Information Retrieval co-located with 38th European Conference on Information Retrieval {(ECIR} 2016), Padua, Italy, March 20, 2016.},
  pages     = {42--47},
  year      = {2016},
  url       = {http://ceur-ws.org/Vol-1568/paper8.pdf}
}