Chris Bail, PhD
Duke University
www.chrisbail.net
github.com/cbail
twitter.com/chris_bail
Check out my Text as Data Course

Introduction

This is the first in a series of tutorials I’ve created about collecting data from web-based sources such as Twitter and analyzing them using various forms of automated text analysis. Before we proceed to the technical aspects of these techniques, I want to give you some sense of where they came from:

What is Digital Trace Data?

Over the past decade, the volume of digital data produced on the internet that describes human behavior and other objects of scholarly inquiry has grown rapidly. As the figure below shows, recent decades have witnessed not only an increase in the amount of text-based data but also an increase in computing power, which is increasingly necessary to analyze it. Together, these two shifts hold the potential to significantly expand the scope of research in many different fields.



Strengths of Digital Trace Data

I will begin by discussing some of the positive aspects of digital trace data, and then move on to some of the challenges. In so doing, I draw upon Matt Salganik’s book Bit by Bit, which I highly recommend not only for its more detailed discussion of digital trace data but also for its treatment of the nascent field of computational social science more broadly.

Always On

One of the most attractive features of digital trace data is that it is collected continuously, unlike surveys, which usually provide only a brief snapshot of the social world. As the image below indicates, social media can occasionally provide a glimpse of major events such as protests, revolutions, or stock market surges as they unfold.