Stance Sentiment Emotion Corpus (SSEC)

The SSEC corpus is an annotation of the SemEval 2016 Twitter stance and sentiment corpus with emotion labels.

TL;DR?

  1. Download http://alt.qcri.org/semeval2016/task6/data/uploads/stancedataset.zip and remove the first line (the header) from train.csv and test.csv.
  2. Download ssec-aggregated.zip and use only our best annotations, train-combined-0.0.csv and test-combined-0.0.csv, from this archive. Their lines correspond one to one to the lines of train.csv and test.csv.

Problems? Continue reading.

Publications

This corpus and the generation of its emotion labels are described in

  • Hendrik Schuff, Jeremy Barnes, Julian Mohme, Sebastian Padó, and Roman Klinger. Annotation, modelling and analysis of fine-grained emotions on a stance and sentiment detection corpus. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Copenhagen, Denmark, 2017. Workshop at Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics. [pdf]

Please also consider citing the original publication of the SemEval 2016 data set:

  • Stance and Sentiment in Tweets. Saif M. Mohammad, Parinaz Sobhani, and Svetlana Kiritchenko. Special Section of the ACM Transactions on Internet Technology on Argumentation in Social Media, 2017, 17(3).

Obtaining the full annotation data

If you have a system with a shell and wget, you can obtain all annotations at once automatically with the following script:

Please download this script, gunzip it, and execute it. The script first downloads the original SemEval data and then combines it line by line with all annotations. The annotations comprise 6 × 8 = 48 columns: each group of eight columns holds the annotations of one annotator for the emotions in the order "Anger Anticipation Disgust Fear Joy Sadness Surprise Trust". A 1 means that the annotator marked the emotion as holding, a 0 means that it does not hold, and a -1 means that the annotator did not annotate this instance.
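The column layout described above can be unpacked as in the following minimal sketch. The function name split_annotations is hypothetical, and the sketch assumes that the 48 annotation values of one line have already been read as integers; it also assumes that a group containing a -1 is entirely unannotated.

```python
EMOTIONS = ["Anger", "Anticipation", "Disgust", "Fear",
            "Joy", "Sadness", "Surprise", "Trust"]
NUM_ANNOTATORS = 6


def split_annotations(values):
    """Split 48 annotation values into one entry per annotator.

    Each entry is a dict mapping emotion name to 0 or 1, or None when
    the group contains a -1 (assumed: annotator skipped the instance).
    """
    expected = NUM_ANNOTATORS * len(EMOTIONS)
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    result = []
    for i in range(NUM_ANNOTATORS):
        # Columns i*8 .. i*8+7 belong to annotator i.
        group = values[i * len(EMOTIONS):(i + 1) * len(EMOTIONS)]
        if any(v == -1 for v in group):
            result.append(None)
        else:
            result.append(dict(zip(EMOTIONS, group)))
    return result
```

Returning None for a skipped annotator keeps "did not annotate" distinct from "marked no emotion", which matters when computing agreement.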

If this script did not work for you, please perform the following steps:

  • Download the original SemEval data from http://alt.qcri.org/semeval2016/task6/data/uploads/stancedataset.zip
  • Unzip the file and find train.csv and test.csv. If you are not on a Windows machine, you will want to convert the line breaks (for instance with tr '\r' '\n'). Depending on your editing environment, you might also need to set LC_ALL=C to be able to deal with special symbols in the files.
  • Download these two files: train-annotations.csv and test-annotations.csv. They contain the annotations for train.csv and test.csv from the original data, respectively, and correspond to them line by line once the header has been stripped from the original data files.
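For those who prefer Python over shell tools, the manual steps above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the code works on text that has already been read into memory.

```python
def split_lines(text):
    """Split text into lines, accepting CRLF, CR or LF as line break."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after a trailing line break
    return lines


def pair_with_annotations(original_text, annotation_text):
    """Pair each data line of an original file with its annotation line."""
    original = split_lines(original_text)[1:]  # strip the header line
    annotations = split_lines(annotation_text)
    if len(original) != len(annotations):
        raise ValueError(
            f"{len(original)} data lines but {len(annotations)} annotation lines"
        )
    return list(zip(original, annotations))
```

To read the files, something like open("train.csv", encoding="latin-1", newline="").read() should work; the encoding is an assumption chosen because it never fails on unusual bytes, and newline="" keeps the original line breaks so that split_lines can handle them.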

Obtaining the annotation aggregation

Please download the file ssec-aggregated.zip.

It contains training and test files in five variants, one for each threshold mentioned in the paper. The lines correspond to those of the original train.csv and test.csv files.
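To give an idea of what a threshold could mean, here is a minimal sketch of threshold-based aggregation for a single emotion. The exact rule is defined in the paper, not here; this sketch assumes that an emotion is assigned when the share of annotators who marked it is strictly greater than the threshold, and that annotators with a -1 are left out. The function name aggregate is hypothetical.

```python
def aggregate(votes, threshold):
    """Return 1 if the share of positive votes exceeds threshold, else 0.

    votes: one value per annotator for a single emotion (1, 0, or -1).
    Annotators with -1 (did not annotate) are left out of the share.
    This rule is an assumption, not taken from the corpus files.
    """
    valid = [v for v in votes if v != -1]
    if not valid:
        return 0  # nobody annotated this instance
    share = sum(valid) / len(valid)
    return 1 if share > threshold else 0
```

Under this assumption, a threshold of 0.0 assigns an emotion as soon as a single annotator marked it, while higher thresholds require more agreement.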

Note

Two lines in the original data were not annotated. They are marked in the annotation files with XXXXXXXXXXXX EMPTY ANNOTATION. You most likely want to ignore them.
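Skipping these lines could look like the sketch below. It assumes that the original lines and annotation lines have already been paired up as tuples; the function name drop_unannotated is hypothetical.

```python
# Marker used in the annotation files for the two unannotated lines.
EMPTY_MARKER = "XXXXXXXXXXXX EMPTY ANNOTATION"


def drop_unannotated(pairs):
    """Keep only (original_line, annotation_line) pairs that were annotated."""
    return [(orig, ann) for orig, ann in pairs if EMPTY_MARKER not in ann]
```

Filtering on pairs rather than on the annotation file alone keeps the line-by-line correspondence with the original data intact.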