Research Data

This page contains links to data I have collected for my thesis on Analyzing Domestic Abuse using Natural Language Processing on Social Media Data. You are welcome to use any of this data for your research, but please cite the relevant paper and follow the terms of use if you do so.

Please read the papers before contacting me with questions.

#WhyIStayed / #WhyILeft Research Data

Twitter users unequivocally reacted to the Ray Rice assault scandal by unleashing personal stories of domestic abuse via the hashtags #WhyIStayed or #WhyILeft. In Schrading et al. (2015a) we explored at a macro-level firsthand accounts of domestic abuse from a corpus of tweeted instances designated with these tags to seek insights into the reasons victims give for staying in vs. leaving abusive relationships.


Download the extended dataset here.


Twitter restricts public sharing of tweets to only tweet ids. Accordingly, the format of the data is as follows:
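
Judging from the loading code further down this page, each line of the data file appears to be a JSON object with the fields id, label, startIdx, and endIdx, where the two index fields are null unless the tweet was split. The sketch below parses one such line. The record values, including the label string, are invented for illustration and are not taken from the dataset; parse_record is a hypothetical helper name.

```python
import json

# Hypothetical record: the field names come from the loading code on this page,
# but the values are made up for illustration.
sample_line = '{"id": "123456789", "label": "stayed", "startIdx": null, "endIdx": null}'

def parse_record(line):
    """Parse one line of the data file into (id, label, startIdx, endIdx)."""
    data = json.loads(line)
    return data["id"], data["label"], data["startIdx"], data["endIdx"]

tweet_id, label, start_idx, end_idx = parse_record(sample_line)
print(tweet_id, label, start_idx, end_idx)
```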

Getting the tweet text and additional information using Twitter’s API

To gather the tweet text and additional information about the tweet, such as user information and retweet count, you must use Twitter’s API. I recommend the GET statuses/lookup API call because it looks up as many as 100 tweet ids per request. Provided below is Python code to do this, using Twython as a wrapper for Twitter’s API.


To use this code, you must have Twython installed. Installation instructions, along with how to set up a Twitter API account, are on Twython’s GitHub page.

If you use this code, the text provided (Tweet.text) will still have hashtags, and is uncleaned and unprocessed.

# collects data from the publicly released data file
import json
from twython import Twython

# enter your APP_KEY and ACCESS_TOKEN from your Twitter API account here
twitter = Twython(APP_KEY, access_token=ACCESS_TOKEN)

class Tweet():
    # A container class for tweet information
    def __init__(self, json, text, label, startIdx, endIdx, idStr):
        self.json = json
        self.text = text
        self.label = label
        self.idStr = idStr
        self.startIdx = startIdx
        self.endIdx = endIdx

    def __str__(self):
        # format() also copes with text that is still None (tweet not downloaded)
        return "id: {} {}: {}".format(self.idStr, self.label, self.text)

def collectTwitterData(twitter):
    tweetDict = {}
    # open the shared file and extract its data for all tweet instances
    with open("stayedLeftData.json") as f:
        for line in f:
            data = json.loads(line)
            label = data['label']
            startIdx = data['startIdx']
            endIdx = data['endIdx']
            idStr = data['id']
            tweet = Tweet(None, None, label, startIdx, endIdx, idStr)
            tweetDict[idStr] = tweet

    # download the tweets JSON to get the text and additional info
    i = 0
    chunk = []
    for tweetId in tweetDict:
        # gather up 100 ids and then call Twitter's API
        chunk.append(tweetId)
        i += 1
        if i >= 100:
            print("dumping 100...")
            # Make the API call
            results = twitter.lookup_status(id=chunk)
            for tweetJSON in results:
                idStr = tweetJSON['id_str']
                tweet = tweetDict[idStr]
                tweet.json = tweetJSON
                # If this tweet was split, get the right part of the text
                if tweet.startIdx is not None:
                    tweet.text = tweetJSON['text'][tweet.startIdx : tweet.endIdx]
                # Otherwise get all the text
                else:
                    tweet.text = tweetJSON['text']
            i = 0
            chunk = []
    # get the rest (< 100 tweets)
    print("dumping rest...")
    # skip the API call when no ids are left over
    results = twitter.lookup_status(id=chunk) if chunk else []
    for tweetJSON in results:
        idStr = tweetJSON['id_str']
        tweet = tweetDict[idStr]
        tweet.json = tweetJSON
        if tweet.startIdx is not None:
            tweet.text = tweetJSON['text'][tweet.startIdx : tweet.endIdx]
        else:
            tweet.text = tweetJSON['text']

    # return the Tweet objects in a list
    return list(tweetDict.values())
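
The batching step in the code above can also be isolated in a small helper. This is a minimal sketch, not part of the original code: chunked is a hypothetical name, and the ids in the example are made up. Each batch it yields would then be passed to twitter.lookup_status(id=batch), as in the code above.

```python
def chunked(ids, size=100):
    """Yield successive lists holding at most `size` ids each."""
    batch = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # leftover ids (fewer than `size`)
        yield batch

# Example with made-up ids: 250 ids give batches of 100, 100 and 50
batches = list(chunked([str(n) for n in range(250)]))
print([len(b) for b in batches])  # prints [100, 100, 50]
```

Because the helper only yields non-empty batches, no API call is made when the number of ids is an exact multiple of 100.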

Data Information

Not every instance actually contains a reason for staying or leaving. Some may be sympathizing with those sharing, or reflecting on the trend itself. Others may be ads or jokes. The following chart shows the distribution of such classes in a random sample of 1000 instances. More information is in the paper.

[Chart: distribution of annotated classes in a random sample of 1000 instances]

A = ads, J = jokes, L = reasons for leaving, M = meta commentary, O = other, S = reasons for staying, as labeled by annotators A1-A4; #L and #S are the assigned gold standard labels.

Reddit Domestic Abuse Research Data

Some of this dataset was used in Schrading et al. (2015b) to study the dynamics of domestic abuse. Submissions and comments from the following subreddits were collected, and assigned a binary reference label (abuse or non-abuse) based on the subreddit title:

subreddit            gold standard label   num submissions collected
abuseinterrupted     abuse                 1653
domesticviolence     abuse                 749
survivorsofabuse     abuse                 512
casualconversation   non-abuse             7286
advice               non-abuse             5913
anxiety              non-abuse             4183
anger                non-abuse             837
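
The subreddit-based labeling described above can be sketched as a simple lookup. This is an illustrative sketch only, not the code used in the paper: reference_label is a hypothetical name, and the label strings are assumed to match the 'abuse' and 'non_abuse' outputs of the shelved classifier shown later on this page.

```python
# Subreddits and their reference labels, as listed in the table above
ABUSE_SUBREDDITS = {"abuseinterrupted", "domesticviolence", "survivorsofabuse"}
NON_ABUSE_SUBREDDITS = {"casualconversation", "advice", "anxiety", "anger"}

def reference_label(subreddit):
    """Return the binary reference label implied by the subreddit a post came from."""
    name = subreddit.lower()
    if name in ABUSE_SUBREDDITS:
        return "abuse"
    if name in NON_ABUSE_SUBREDDITS:
        return "non_abuse"
    # Any other subreddit has no reference label in this scheme
    raise ValueError("no reference label for subreddit: " + subreddit)

print(reference_label("domesticviolence"))
print(reference_label("anxiety"))
```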

Additional subreddit data were also collected and used to examine classification in unused subreddits:

subreddit             num submissions collected
relationships         8201
relationship_advice   5874


Download the entire reddit database here.

Download the “shelved” sets of reddit data here.

Download the “shelved” abuse classifier trained on the uneven set of data here.


There are two formats for the Reddit data. The most flexible is the entire database used to store all the data I collected. You will have to use sqlite to access the database (Python has an API for interacting with sqlite databases). For those who do not wish to interact with the database but want to access the provided datasets used in my experiments, I have provided Python shelved data files for use in Python.

The sqlite database named “new_reddit.db” has 3 tables within: submissions, comments, and submission_srls. Their columns are as follows:

Submissions table

Comments table

Submission_srls table
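
The table and column names can also be read from the database itself. The sketch below uses Python's built-in sqlite3 module to list every table with its columns. describe_tables is a hypothetical helper name, and the usage comment assumes that new_reddit.db sits in the working directory.

```python
import sqlite3

def describe_tables(conn):
    """Return {table name: [column names]} for every table in the database."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    schema = {}
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 holds the column name
        rows = conn.execute('PRAGMA table_info("{}")'.format(table)).fetchall()
        schema[table] = [row[1] for row in rows]
    return schema

# Usage (assumes new_reddit.db is in the working directory):
# conn = sqlite3.connect("new_reddit.db")
# for table, columns in describe_tables(conn).items():
#     print(table, columns)
# conn.close()
```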

The shelved files provided are as follows. Note that the lists are aligned, e.g., submissionId lists align with the submissions in data lists. Also note that a submission is the initial post, and comments are linked to it by the associated submission ID:

The shelved classifier is a scikit-learn Pipeline consisting of TfidfVectorizer with a custom tokenizer, followed by a LinearSVC. Its key in the shelf is “classifier”. To predict whether text is abuse or non-abuse, call the predict() function given a list of text. For example:

import shelve

shelf = shelve.open("abuseClassifier")
classifier = shelf['classifier']
print(classifier.predict(["I was abused and hit and I was sad :(", "I am happy and stuff. Love you!"]))

['abuse', 'non_abuse']

Note that to use these shelved objects, you may need to use Python 3, not 2.
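
To check what a shelved file contains before loading anything from it, its keys can be listed with Python's shelve module. This is a minimal sketch: shelf_keys is a hypothetical helper name, and the path in the usage comment assumes the classifier shelf from the example above has been downloaded to the working directory.

```python
import shelve

def shelf_keys(path):
    """Return the sorted list of keys stored in the shelf at `path`."""
    # flag="r" opens the shelf read-only, so the file is not modified
    with shelve.open(path, flag="r") as shelf:
        return sorted(shelf.keys())

# Usage (assumes the "abuseClassifier" shelf is in the working directory):
# print(shelf_keys("abuseClassifier"))
```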

Terms of Use

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Twitter paper, Schrading et al. (2015a):

#WhyIStayed, #WhyILeft: Microblogging to Make Sense of Domestic Abuse

  author    = {Schrading, Nicolas  and  Ovesdotter Alm, Cecilia  and  Ptucha, Raymond  and  Homan, Christopher},
  title     = {\#WhyIStayed, \#WhyILeft: Microblogging to Make Sense of Domestic Abuse},
  booktitle = {Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  month     = {May--June},
  year      = {2015},
  address   = {Denver, Colorado},
  publisher = {Association for Computational Linguistics},
  pages     = {1281--1286},
  url       = {}

Reddit paper, Schrading et al. (2015b):

To appear at EMNLP 2015.