A TREC-like data collection to evaluate approaches for the task of related-tweet retrieval for news articles.
Download
To get the dataset, follow this link:
https://goo.gl/forms/R9yYo3lQSQTUtnHc2
Format
After downloading, you get a single compressed file, which you can extract using unzip. Extracting it yields a folder with 2 files:
- topics : a file containing all of the topics (also known as articles) used as queries to retrieve tweets.
- signal1m_tweets_qrels : a TREC Qrels formatted file with the following fields:
  - TOPIC - a unique identifier for an article from the Signal-1M dataset
  - ITERATION - unused (always 0); included to match the TREC Qrels format
  - DOCUMENT - a tweet ID
  - RELEVANCY - the relevance label, one of:
    - 0: not relevant
    - 1: somewhat relevant
    - 2: highly relevant
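As an illustration of the four fields above, here is a minimal Python sketch that reads Qrels lines into a nested dictionary. It assumes the fields are whitespace-separated, as is usual for TREC Qrels files; the function name `parse_qrels` and the sample topic and tweet IDs are hypothetical and not taken from the dataset.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse TREC Qrels lines into {topic: {tweet_id: relevancy}}.

    Assumes four whitespace-separated fields per line:
    TOPIC ITERATION DOCUMENT RELEVANCY
    """
    qrels = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        topic, _iteration, tweet_id, relevancy = line.split()
        qrels[topic][tweet_id] = int(relevancy)
    return dict(qrels)


# Hypothetical sample lines, for illustration only.
sample = [
    "topic-a 0 1001 2",
    "topic-a 0 1002 0",
    "topic-b 0 1003 1",
]
qrels = parse_qrels(sample)

# Tweets judged at least somewhat relevant for one topic.
relevant_a = sorted(t for t, r in qrels["topic-a"].items() if r >= 1)
```

To read the real file, the same function can be applied to an open file handle, for example `parse_qrels(open("signal1m_tweets_qrels"))`.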
Using the dataset
As in any TREC task, to use the dataset:
- Use the topics file as input to your tweet retrieval approach. Your approach should return a ranked list of tweet IDs for each news article (topic) in the TREC results file format. Let's call it approach.result.
  Each line in your file should conform to the following:
  topic Q0 tweet-id rank score NAME
  You can find the tweet collection used to build this dataset here.
- Use trec_eval to evaluate the effectiveness of your approach by running:
  trec_eval -q signal1m_tweets_qrels approach.result
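The results file format described above can be produced with a few lines of Python. The following is only a sketch: the function name `format_run`, the run name `my_approach`, and the topic IDs, tweet IDs, and scores are all hypothetical placeholders, and the scores would come from your own retrieval approach.

```python
def format_run(rankings, run_name="my_approach"):
    """Turn {topic: [(tweet_id, score), ...]} into TREC results lines.

    Each line has the form: topic Q0 tweet-id rank score NAME
    Tweets are ranked by descending score, with ranks starting at 1.
    """
    lines = []
    for topic, scored in rankings.items():
        ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
        for rank, (tweet_id, score) in enumerate(ordered, start=1):
            lines.append(f"{topic} Q0 {tweet_id} {rank} {score:.4f} {run_name}")
    return lines


# Hypothetical scores for one topic, for illustration only.
rankings = {"topic-a": [("1002", 0.31), ("1001", 0.87)]}
run_lines = format_run(rankings)
```

Writing the result to disk, for example with `open("approach.result", "w").write("\n".join(run_lines) + "\n")`, gives a file that can be passed to the trec_eval command shown above.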
Citing
This collection was described in a paper at ECIR 2018: A Data Collection for Evaluating the Retrieval of Related Tweets to News Articles.
@inproceedings{Signal1MRelatedTweetsRetrieval2018,
  author = {Axel Suarez and Dyaa Albakour and David Corney and Miguel Martinez and Jose Esquivel},
  title = {A Data Collection for Evaluating the Retrieval of Related Tweets to News Articles},
  booktitle = {40th European Conference on Information Retrieval Research {(ECIR} 2018), Grenoble, France, March, 2018.},
  year = {2018},
  pages = {780--786},
  url = {https://link.springer.com/chapter/10.1007/978-3-319-76941-7_76}
}