Chris Bail, PhD
Duke University
www.chrisbail.net
github.com/cbail
twitter.com/chris_bail
Check out my Text as Data Course

Introduction

Application Programming Interfaces, or APIs, have become one of the most important ways to access and transfer data online, and increasingly APIs can analyze your data as well. Compared to screen-scraping, which is often prohibited by websites' terms of service, logistically difficult, or both, APIs let you make custom requests for data in a manner that is well structured and considerably easier to work with than the HTML or XML data described in my previous tutorials on screen-scraping. This tutorial assumes basic knowledge of R and the other skills described in previous tutorials at the link above.

What Is an Application Programming Interface?

APIs are tools for building apps or other software that give people access to certain parts of large databases. Software developers can combine these tools in various ways, or combine them with tools from other APIs, to build even more useful tools. Most of us use such apps each day. For example, if you install the Spotify app within your Facebook page to share music with your friends, that app extracts data from Spotify’s API and then posts it to your Facebook page by communicating with Facebook’s API. There are countless examples of this on the internet at present, thanks in large part to the advent of Web 2.0, the historical moment when websites became much more intertwined and interdependent.

The number of APIs that are publicly available has expanded dramatically over the past decade, as the figure below shows. At the time of this writing, the website Programmable Web lists 19,638 APIs from sites as diverse as Google, Amazon, YouTube, the New York Times, del.icio.us, LinkedIn, and many others. Though the core function of most APIs is to provide software developers with access to data, many APIs now analyze data as well. These include facial recognition APIs, voice-to-text APIs, APIs that produce data visualizations, and so on.


How Does an API Work?

In order to illustrate how an API works, it will be useful to start with a very simple one. Suppose we want to use the Google Maps API to geo-code a named entity, that is, to tag the name of a place with latitude and longitude coordinates. The way we do this is to write a URL address that a) names the API; and b) includes the text of the query we want to make. If we Googled “Google Maps API Geocode” we would eventually be pointed towards the documentation for that API and learn that the base URL for the Google Maps API is https://maps.googleapis.com. We want to use the geocoding function of this API, so we need a URL that points to this more specific part of the API: https://maps.googleapis.com/maps/api/geocode/json?address=. We can then add a named entity such as “Duke” to the end of the URL, so that the full request looks like this: https://maps.googleapis.com/maps/api/geocode/json?address=Duke. This link (with some additional text that I will describe below) produces this output in a web browser:


What we are seeing is something called JSON data. Though it may look somewhat messy at first glance, with lots of brackets, colons, commas, and indentation patterns, it is in fact highly structured and capable of storing complex types of data. Here, we can see that our request to geocode “Duke” not only identified the city within which it is located (Durham), but also the county, the country, and, towards the end of the page, the latitude and longitude data we were looking for. We will learn how to extract that piece of information later. The goal of the current discussion is simply to give you an idea of what an API is and how one works.

If we wanted to search for another geographic location, we could take the link above and replace “Duke” with the name of another place– try it out to give yourself a very rudimentary sense of how an API works.
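If you would rather make this request from R than from a web browser, here is a minimal sketch using the httr and jsonlite packages. Note that Google now requires an API key with each geocoding request (the key value below is a placeholder you would replace with your own), so treat this as an illustration of the general pattern rather than a ready-to-run query.

library(httr)
library(jsonlite)

#construct the request: base endpoint plus an "address" query (and an API key, if required)
base_url <- "https://maps.googleapis.com/maps/api/geocode/json"
response <- GET(base_url, query = list(address = "Duke", key = "YOUR_API_KEY_HERE"))

#parse the JSON returned by the API and inspect the latitude/longitude
parsed <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
parsed$results$geometry$location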

API Credentials

Though anyone can make a request to the Google Maps API, getting data from Facebook’s API (which Facebook calls the “Graph” API) is considerably more difficult. This is because, with good reason, Facebook does not want software developers to collect data about people with whom they have no connection on Facebook. To prevent this, Facebook’s Graph API, like many other APIs, requires you to obtain “credentials,” or codes/passwords that identify you and determine which types of data you are allowed to access. To illustrate this further, let’s take a look at a tool Facebook built to help people learn about APIs. It’s called the Graph API Explorer.


If you have a Facebook account and are logged in, Facebook will generate credentials for you automatically in the form of something called an “Access Token.” In the screenshot above, this appears in a bar towards the bottom of the screen. This code gives you temporary authorization to make requests from Facebook’s Graph API, but ONLY for data that you are allowed to access from your own Facebook page. If you click the blue “Submit” button at the top right of the screen, you will see some output that contains your name and an ID that Facebook assigns to you. With some more effort, we could use this tool to make API calls that access our friend list, our likes, and so on, but for now, I’m simply trying to make the point that each person gets their own code that allows them to access some, but certainly not all, of the data on Facebook’s API. If I were to write in “cocacola” instead of “me” to get access to data posted by this business, I would get an error message indicating that my current credentials do not give me access to that data.
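For a sense of what the explorer is doing behind the scenes, here is a minimal sketch of the equivalent request from R using the httr package. It assumes you have pasted a valid access token from the explorer into the token variable; Facebook versions its Graph API, so the exact endpoint may vary.

library(httr)
library(jsonlite)

token <- "PASTE_YOUR_ACCESS_TOKEN_HERE"

#request the "me" node from the Graph API, passing the access token along with the request
response <- GET("https://graph.facebook.com/me", query = list(access_token = token))
fromJSON(content(response, as = "text", encoding = "UTF-8"))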

Credentials may determine not only your access to people with whom you are connected on a social network, but also other privileges you may have vis-a-vis an API. For example, many APIs charge money for access to their data or services, and thus you will only receive your credentials after setting up an account. As we will see below, some sites also require you to have multiple types of credentials, which can be described with a variety of terms such as “tokens,” “keys,” or “secrets.”

Now YOU Try It!!!

The Graph API Explorer allows you to search for different fields. Take a moment and see whether you can make a call for other types of information about yourself. To do this, you’ll have to request different levels of “permission” using the dropdown menu on the right-hand side of the explorer tool page. You can populate the syntax for different fields using the “search for a field” box on the upper right-hand side of the page. Don’t worry if you can’t get the query language right for now: specifying the syntax of an API call requires mastering each API’s documentation, which can take a very long time, and this tool does not always construct the right call for you. More importantly, as you will soon see, there are a number of R packages with functions that make API calls for you, saving you the time and energy necessary to learn the syntax specific to each API.

Rate Limiting

Before we make any more calls to APIs, we need to become familiar with an important concept called “rate limiting.” The credentials described in the previous section define not only what type of information we are allowed to access, but also how often we are allowed to make requests for such data. These limits are known as “rate limits.” If we make too many requests within too short a period of time, an API will temporarily block us from collecting data for a period that can range from 15 minutes to 24 hours or more, depending upon the API. Rate limiting is necessary so that APIs are not overwhelmed by too many simultaneous requests, which would slow down access to data for everyone. Rate limiting also enables large companies such as Google, Facebook, or Twitter to prevent developers from collecting large amounts of data that could either compromise their users’ confidentiality or threaten their business model (since data has such immense value in today’s economy).

The exact timing of rate limiting is not always public, since knowing such time increments could enable developers to “game” the system and make rapid requests as soon as rate limiting has ended. Some APIs, however, allow you to make an API call or query in order to learn how many more requests you can make within a given time period before you are rate limited.

An Example with Twitter’s API

To illustrate the process of obtaining credentials and better understanding rate limiting, I will now present a worked example of how to obtain different types of data from the Twitter API. The first step in this process is to obtain credentials from Twitter that will allow you to make API calls.

Twitter, like many other websites, requires you to create an account in order to receive credentials. To do this, we need to visit https://apps.twitter.com. There, you will have to create a developer account by clicking “Apply for a developer account.” You may be asked to confirm your email address or add a mobile phone number, because two-factor authentication helps Twitter prevent people from obtaining a large number of different credentials using multiple accounts, which could be used to collect large amounts of data without being rate limited, or for other nefarious purposes such as creating armies of bots that produce spam or attempt to influence elections.


Next, you will be asked whether you want to request a developer account for an organization or for personal use. You will most likely want to choose personal use (the organizational account allows one person to request credentials that can be shared across a large group of people, which could be useful if you work within a lab or other business).


Next, you will be asked a series of questions about how you want to use Twitter’s API. Unfortunately, as far as I know, Twitter does not publish exact guidelines about who is allowed to use the API and why. That said, you can learn a lot by reading Twitter’s terms of service. Obvious red flags would include people who hope to build tools that harass Twitter users, hoard Twitter’s data (particularly for business purposes), or serve other negative purposes. You will see these terms after you describe why you want to apply for credentials.

Once you accept the terms, your developer account request will go under review by Twitter. At the time of this writing, this process is rather time intensive, and with good reason, since Twitter is most likely employing large numbers of people to vet everyone who is applying for credentials right now. I’ve seen people get credentials within a day or two, and others who have waited more than a week. Unfortunately, I even know of several cases where people made multiple applications without any of the red flags above (and using different wording) that were ultimately rejected. Hopefully, yours will be approved (and if not, you might try mentioning your problem in a tweet to @TwitterAPI on Twitter; this seems to have worked for several people I know).


Once your developer account is approved, you can log in once again and click the “Create New App” button at the top right of the screen. Our goal is not to create a fully fledged app at this point, but simply to obtain the credentials necessary to begin making some simple calls to the Twitter API. You can name your app whatever you want, describe it however you want, and put in the name of any website you like. One important thing you must do is put the following text in the “Callback URL” text box: http://127.0.0.1:1410. This address describes the location where the API will return your data; in this case, it is your own web browser (but it could be another site where you want the results sent).

If you followed the steps above, the name of your application should now appear. Click on it, and then click on the “Keys and Access Tokens” tab in order to get your credentials. Unfortunately, Twitter makes developers obtain two different types of credentials, which are listed on that page. These are blurred out in the screenshot below because I do not want people who read this web page to have access to my credentials, which they could then abuse in various ways:


The next step is to define your credentials as string variables in R, which we will then use to authenticate ourselves with the Twitter API. Make sure to select the entire string (by triple clicking), and make sure that you do not accidentally leave out the first or last digit (or add spaces):

#define your Twitter credentials as string variables (replace with your own values)
app_name<-"YOURAPPNAMEHERE"
consumer_key<-"YOURKEYHERE"
consumer_secret<-"YOURSECRETHERE"
access_token<-"YOURACCESSTOKENHERE"
access_token_secret<-"YOURACCESSTOKENSECRETHERE"

Next, we are going to install an R package from Github called rtweet that helps us make calls to Twitter’s API. More specifically, it provides a long list of functions that both a) construct API URL queries for different types of information; and b) parse the resulting data into neat formats. In order to authenticate, you may also need to install the httpuv package (if so, you will receive an error message about this package). If you have never installed a package from Github before, you will need the devtools package to do this.

library(devtools)
install_github("mkearney/rtweet")

Now, we are ready to authenticate ourselves vis-a-vis Twitter’s API. To do this, we are going to use rtweet’s create_token function, which makes an API call that passes the credentials we defined above and then opens a web browser with an authentication dialogue that you must approve by clicking the blue “Authorize” button. You should then receive the following message: “Authentication complete. Please close this page and return to R.”

library(rtweet)
create_token(app=app_name,
             consumer_key=consumer_key, consumer_secret=consumer_secret,
             access_token = access_token, access_secret = access_token_secret)

Now, we can take full advantage of the many useful functions within the rtweet package for collecting data from Twitter. Let’s begin by extracting 50 recent tweets that use the hashtag #Korea.

korea_tweets<-search_tweets("#Korea", n=50, include_rts = FALSE)
## Searching for tweets...
## Finished collecting tweets!

This code creates a dataframe called korea_tweets, which we may then browse. Let’s take a look at the first few tweets, which rtweet stores in a variable called text.

head(korea_tweets$text)
## [1] "eles não são gays,eles apenas não tem masculinidade frágil e pintam o cabelo,usam maquiagem e falam coisas que toda garota quer ouvir,então respeito.\U0001f49e\n\n#bts #exo #gays #kpop #boygroup #tudogay #kpopbr #seventeen #straykids #nct #THEBOYZ #ikon #ASTRO #korea #coreia"                               
## [2] "North #Korea #leadership #bigdata techniques unavailable in New Zealand https://t.co/BArijLeMYO"                                                                                                                                                                                                                   
## [3] "#TB \nThe 1st Meeting of Global Public Diplomacy Network Assembly\nOct 23, 2014 ~ Oct 25, 2014 / Lotte Hotel Seoul, #Korea\n#GPDNet #PublicDiplomacy\nhttps://t.co/gPe1EEETTg https://t.co/lK7vmRk6V6"                                                                                                             
## [4] "Stray Kids New Comeback is crazy! It’s so damn good im screaming https://t.co/iWKQK8ickw || #bts #got7  #exo #blackpink #kpop  #korea #twice  #txt #straykids #nct #seventeen #ikon #astro #shinee #monstax #mamamoo #gidle #itzy #ateez #kard #day6 #pentagon"                                                    
## [5] "Please watch Red Velvet’s new comeback ”Zimzalabim”, the song is really unique! https://t.co/IbSSQIpoDu || #bts #got7 #exo #blackpink #kpop #korea #twice #txt #straykids #nct #seventeen #ikon #astro #shinee #monstax #mamamoo #gidle #itzy #ateez #kard #day6 #pentagon #everglow"                              
## [6] ".:\\\\#X3 #democracy #Day2Day1nSociety #SurenoAryano #liberty #News #USA #LAPD #LASD #Clippers #Japan #Asia #World #UK #Sweden #russia #music #Korea #XiJinping #china #HongKong #sports #popular #Dems #EU #Australia #marijuana #Congress #SUR #UNHCR #Lakers #FBR #Resistance #BlueWave https://t.co/mZ5mch2ewj"

Note that this API call also generated a lot of other interesting variables, including the name and screen name of the user, the time of their post, and a variety of other metrics including links to media content and user profiles. A small number of users also enable geolocation of their tweets– and if that information is available it will appear in this dataset. Here is the full list of variables we collected via our API call above:

names(korea_tweets)
##  [1] "user_id"                 "status_id"              
##  [3] "created_at"              "screen_name"            
##  [5] "text"                    "source"                 
##  [7] "display_text_width"      "reply_to_status_id"     
##  [9] "reply_to_user_id"        "reply_to_screen_name"   
## [11] "is_quote"                "is_retweet"             
## [13] "favorite_count"          "retweet_count"          
## [15] "hashtags"                "symbols"                
## [17] "urls_url"                "urls_t.co"              
## [19] "urls_expanded_url"       "media_url"              
## [21] "media_t.co"              "media_expanded_url"     
## [23] "media_type"              "ext_media_url"          
## [25] "ext_media_t.co"          "ext_media_expanded_url" 
## [27] "ext_media_type"          "mentions_user_id"       
## [29] "mentions_screen_name"    "lang"                   
## [31] "quoted_status_id"        "quoted_text"            
## [33] "quoted_created_at"       "quoted_source"          
## [35] "quoted_favorite_count"   "quoted_retweet_count"   
## [37] "quoted_user_id"          "quoted_screen_name"     
## [39] "quoted_name"             "quoted_followers_count" 
## [41] "quoted_friends_count"    "quoted_statuses_count"  
## [43] "quoted_location"         "quoted_description"     
## [45] "quoted_verified"         "retweet_status_id"      
## [47] "retweet_text"            "retweet_created_at"     
## [49] "retweet_source"          "retweet_favorite_count" 
## [51] "retweet_retweet_count"   "retweet_user_id"        
## [53] "retweet_screen_name"     "retweet_name"           
## [55] "retweet_followers_count" "retweet_friends_count"  
## [57] "retweet_statuses_count"  "retweet_location"       
## [59] "retweet_description"     "retweet_verified"       
## [61] "place_url"               "place_name"             
## [63] "place_full_name"         "place_type"             
## [65] "country"                 "country_code"           
## [67] "geo_coords"              "coords_coords"          
## [69] "bbox_coords"             "status_url"             
## [71] "name"                    "location"               
## [73] "description"             "url"                    
## [75] "protected"               "followers_count"        
## [77] "friends_count"           "listed_count"           
## [79] "statuses_count"          "favourites_count"       
## [81] "account_created_at"      "verified"               
## [83] "profile_url"             "profile_expanded_url"   
## [85] "account_lang"            "profile_banner_url"     
## [87] "profile_background_url"  "profile_image_url"

As a brief aside, the rtweet package also interfaces nicely with ggplot2 and other visualization libraries to produce plots of the results above. For instance, let’s make a plot of the frequency of tweets about Korea over the past few days:

library(ggplot2)
ts_plot(korea_tweets, "3 hours") +
  ggplot2::theme_minimal() +
  ggplot2::theme(plot.title = ggplot2::element_text(face = "bold")) +
  ggplot2::labs(
    x = NULL, y = NULL,
    title = "Frequency of Tweets about Korea from the Past Day",
    subtitle = "Twitter status (tweet) counts aggregated using three-hour intervals",
    caption = "\nSource: Data collected from Twitter's REST API via rtweet"
  )

The search_tweets function also has a number of useful options or arguments. For instance, we can restrict the results to English-language tweets from the United States using the code below. The code also restricts the results to non-retweets and focuses upon the most recent tweets, rather than a mixture of popular and recent tweets, which is the default setting.

nk_tweets <- search_tweets("korea lang:en",
  geocode = lookup_coords("usa"),
  n = 1000, type="recent", include_rts=FALSE
  )
## Searching for tweets...
## Finished collecting tweets!

rtweet also enables one to geocode tweets for users who allow Twitter to track their location:

geocoded <- lat_lng(nk_tweets)

We can then plot these results as follows (you may need to install the maps package to do this):

library(maps)
par(mar = c(0, 0, 0, 0))
maps::map("state", lwd = .25)
with(geocoded, points(lng, lat, pch = 20, cex = .75, col = rgb(0, .3, .7, .75)))

We don’t see all 1,000 tweets in this map for an important reason: we are only looking at people who allow Twitter to track their location (roughly 1 in 100 people at the time of this writing).

Twitter’s API is also very useful for collecting data about a given user. Let’s take a look at Bernie Sanders’s Twitter page: http://www.twitter.com/SenSanders. There we can see Senator Sanders’s description and profile, the full text of his tweets, and—if we click several links—the names of the people he follows, those who follow him, and the tweets which he has “liked.”

First, let’s get his 5 most recent tweets:

sanders_tweets <- get_timelines(c("sensanders"), n = 5)
head(sanders_tweets$text)
## [1] "I'm proud to celebrate #Juneteenth with @MrDannyGlover. This year's holiday takes on special significance because of today's House hearing on H.R. 40 to study the impacts of slavery. We must come to terms with slavery's horrors and how they affect every aspect of our lives today. https://t.co/WXfqO5s4oN"
## [2] "The president won't address our health care crisis or climate change. But he will try to divide us up and go after undocumented people who have very little political power. That is what demagogues always do and that is why he must be defeated. https://t.co/63c26Aeobg"                                     
## [3] "Mr. President: Respect the Constitution. You have no legal authority to launch an attack on Iran. Come to Congress, present your evidence, and make your case. If not, we will assert our power under the War Powers Act to stop you. https://t.co/PZ1XfQkX68"                                                   
## [4] "Itta Bena, Mississippi is a \"banking desert\" where residents are without basic banking services. This is why @AOC and I are calling for the return of postal banking in the United States. The big banks of this country are preying on poor people or ignoring them completely. https://t.co/GfuDMaDFGG"      
## [5] "It’s a national disgrace that a minimum wage worker cannot afford a modest one-bedroom apartment in 99% of counties in America. All Americans have a right to a decent, affordable home. https://t.co/4542io2Pxu"

Note that you are limited to requesting a user’s most recent 3,200 tweets, so obtaining a complete archive of tweets for a person who tweets very often may not be feasible; you may instead need to purchase the data from Twitter itself.
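If you do want everything the timeline endpoint will return for a user, you can simply request the maximum in a single call. This is a minimal sketch; the 3,200 cap is enforced on Twitter’s side, and the variable name below is just for illustration.

#request the maximum number of tweets available from the timeline endpoint (roughly 3,200)
sanders_full_timeline <- get_timeline("sensanders", n = 3200)
nrow(sanders_full_timeline)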

Next, let’s get some broader information about Sanders using the lookup_users function:

sanders_twitter_profile <- lookup_users("sensanders")

This creates a dataframe with a variety of additional variables. For example:

sanders_twitter_profile$description
## [1] "U.S. Senator Bernie Sanders of Vermont is the longest-serving independent in congressional history."
sanders_twitter_profile$location
## [1] "Vermont/DC"
sanders_twitter_profile$followers_count
## [1] 8383861

We can also use the get_favorites() function to identify the Tweets Sanders has recently “liked.”

sanders_favorites<-get_favorites("sensanders", n=5)
sanders_favorites$text
## [1] "Met with @SenSanders to discuss his legislation to save 900k people in Puerto Rico from loosing their Medicare benefits in Seprember 2019. The Senator’s bill will provide 100% Medicare parity to PR and all territories. Thank you Senator for believing healthcare is a right. https://t.co/k0TN8Y319W"              
## [2] "Thank you @SenSanders for helping us highlight @blackrock's massive role in destroying our planet.\n\nBlackRock has poured billions of dollars into oil drilling, coal power plants, palm oil plantations — and so much more. They must stop financing the climate crisis. https://t.co/6P9GoqzKqK"                     
## [3] "Enjoyed speaking to @GSMA1 126th Annual Convention &amp; Scientific Assembly! Georgia's physicians of color are dedicated to their communities, their patients &amp; their profession. Dr. Niva Lubin-Johnson @nlj57 President @NationalMedAssn is a wonderful advocate. Thank you @DrKDWmsPhD1 https://t.co/3UZqub8GYy"
## [4] "I’m encouraged to see this legislation to study the issue gain support in Congress and the shared commitment my colleagues have in doing our part to repair the harm done to African-Americans."                                                                                                                        
## [5] "I have never spoken publicly about my abortion. I'm speaking now because of intensified efforts to strip Constitutional rights from pregnant people and to criminalize abortion.\n\nI shared my story in an Op-Ed with @nytimes.\nhttps://t.co/AjBxRLtBet"

We can also get a list of the people who follow Sanders like this:

sanders_follows<-get_followers("sensanders")

This produces the user IDs of those followers, and we could get more information about them if we want using the lookup_users function. If we were interested in creating a larger social network analysis dataset centered around Sanders, we could scrape the followers of his followers within a loop.
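For example, here is a minimal sketch of the first step of that idea, which looks up profile information for a handful of the follower IDs we just collected. A full snowball crawl would wrap similar calls in a loop with pauses to avoid rate limiting.

#look up profile information for the first five follower IDs returned by get_followers
some_followers <- lookup_users(head(sanders_follows$user_id, 5))
some_followers$screen_name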

Looping is an efficient way of collecting a large amount of data, but it can also trigger rate limiting. As I mentioned above, however, Twitter enables users to check their rate limits. The rate_limit() function in the rtweet package does this as follows:

rate_limits<-rate_limit()
head(rate_limits[,1:4])
## # A tibble: 6 x 4
##   query                  limit remaining reset        
##   <chr>                  <int>     <int> <time>       
## 1 lists/list                15        15 15.00966 mins
## 2 lists/memberships         75        75 15.00966 mins
## 3 lists/subscribers/show    15        15 15.00966 mins
## 4 lists/members            900       900 15.00966 mins
## 5 lists/subscriptions       15        15 15.00966 mins
## 6 lists/show                75        75 15.00966 mins

In the code above I created a dataframe that describes, for each type of query, the total number of calls I can make, the number of calls remaining, and the time until the limit resets (the reset column). In this case, the window is 15 minutes. In order to prevent rate limiting within a large loop, it is common practice to employ R’s Sys.sleep function, which tells R to pause for a certain number of seconds before proceeding to the next iteration of a loop.
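For example, here is a minimal sketch, assuming the search endpoint (which rate_limit() reports under the query name "search/tweets"), that checks how many search calls remain and pauses until the window resets if none are left:

#check remaining calls for the search endpoint
search_limit <- rate_limit(query = "search/tweets")

#if no calls remain, sleep until the rate limit window resets
if (search_limit$remaining == 0) {
  Sys.sleep(as.numeric(search_limit$reset, "secs"))
}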

rtweet has a number of other functions that readers may find helpful. get_trends() will identify the trending topics on Twitter in a particular area:

get_trends("New York")
## # A tibble: 50 x 9
##    trend url   promoted_content query tweet_volume place  woeid
##    <chr> <chr> <lgl>            <chr>        <int> <chr>  <int>
##  1 #eng… http… NA               %23e…           NA New … 2.46e6
##  2 #Jun… http… NA               %23J…       109539 New … 2.46e6
##  3 Whoo… http… NA               Whoo…        38188 New … 2.46e6
##  4 #RHO… http… NA               %23R…        10325 New … 2.46e6
##  5 Chri… http… NA               %22C…        77901 New … 2.46e6
##  6 #Pos… http… NA               %23P…        44603 New … 2.46e6
##  7 Bern… http… NA               Bern…        94775 New … 2.46e6
##  8 Dems  http… NA               Dems        153829 New … 2.46e6
##  9 Conl… http… NA               Conl…        52704 New … 2.46e6
## 10 #How… http… NA               %23H…        41714 New … 2.46e6
## # … with 40 more rows, and 2 more variables: as_of <dttm>,
## #   created_at <dttm>

rtweet can even control your Twitter account. For example, you can post messages to your Twitter feed from R as follows:

post_tweet("I love APIs")

I have used this function in past work with bots. See, for example, this paper.

Now YOU Try It!!!

To reinforce the skills you’ve learned in this section, try the following: 1) collect the most recent 100 tweets from CNN; 2) determine how many people follow CNN on Twitter; and 3) determine whether CNN is currently tweeting about any subjects that are trending in your area.

Wrapping API Calls within a Loop

Very often, one may wish to wrap API calls such as those we have made thus far into a loop to collect data about a long list of users. To illustrate this, let’s open a list of the Twitter handles of elected officials in the U.S. that I posted on my Github site:

#load list of twitter handles for elected officials
elected_officials<-read.csv("https://cbail.github.io/Senators_Twitter_Data.csv", stringsAsFactors = FALSE)

head(elected_officials)
##   bioguide_id party gender title birthdate   firstname middlename lastname
## 1     C001095     R      M   Sen   5/13/77         Tom              Cotton
## 2     G000562     R      M   Sen   8/22/74        Cory             Gardner
## 3     M001169     D      M   Sen    8/3/73 Christopher         S.   Murphy
## 4     S001194     D      M   Sen  10/20/72       Brian    Emanuel   Schatz
## 5     Y000064     R      M   Sen   8/24/72        Todd         C.    Young
## 6     S001197     R      M   Sen   2/22/72    Benjamin       Eric    Sasse
##   name_suffix state    district senate_class
## 1                AR Junior Seat           II
## 2                CO Junior Seat           II
## 3                CT Junior Seat            I
## 4                HI Senior Seat          III
## 5                IN Junior Seat          III
## 6                NE Junior Seat           II
##                               website    fec_id      twitter_id
## 1       https://www.cotton.senate.gov H2AR04083    SenTomCotton
## 2      https://www.gardner.senate.gov H0CO04122  SenCoryGardner
## 3       https://www.murphy.senate.gov H6CT05124 senmurphyoffice
## 4       https://www.schatz.senate.gov S4HI00136  SenBrianSchatz
## 5        https://www.young.senate.gov H0IN09070    SenToddYoung
## 6 https://www.sasse.senate.gov/public S4NE00090        SenSasse

As you can see, the twitter_id column of this .csv file includes the Twitter “screen names,” or handles, we need to make API requests about each elected official. Let’s grab each official’s most recent 100 tweets and combine them into a single large dataset of recent tweets by elected officials in the U.S.

#create empty container to store tweets for each elected official
elected_official_tweets<-as.data.frame(NULL)

for(i in 1:nrow(elected_officials)){

  #pull tweets
  tweets<-get_timeline(elected_officials$twitter_id[i], n=100)
  
  #populate dataframe
  elected_official_tweets<-rbind(elected_official_tweets, tweets)
  
  #pause for one second to help prevent rate limiting
  Sys.sleep(1)
  
  #print number/iteration for debugging/monitoring progress
  print(i)
}

This code would take some time to run, of course, since we are collecting 100 tweets from each of the elected officials in the list. You may also get rate limited, depending upon your previous activity and your current rate limits. If so, modify the length of the pause in the Sys.sleep command above. You may also notice some error messages in your output; these could occur because senators change their Twitter handles, because an account exists but has no tweets, or for other such reasons.
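If you want the loop to keep going past these failures rather than stopping, one option is to wrap the API call in tryCatch() and skip any account that produces an error. This is a minimal sketch of how the loop above could be modified, not the only way to handle errors:

for(i in 1:nrow(elected_officials)){
  
  #attempt to pull tweets; if the call errors (e.g., a changed handle), return NULL instead
  tweets<-tryCatch(get_timeline(elected_officials$twitter_id[i], n=100),
                   error=function(e) NULL)
  
  #only add to the dataframe if the call succeeded
  if(!is.null(tweets)) elected_official_tweets<-rbind(elected_official_tweets, tweets)
  
  #pause between calls to help prevent rate limiting
  Sys.sleep(1)
  
  #print number/iteration for debugging/monitoring progress
  print(i)
}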

In case you don’t care to pull the data yourself, I’ve saved a copy of the data I produced in September 2018 that you can load as follows:

load(url("https://cbail.github.io/Elected_Official_Tweets.Rdata"))

Now that we’ve collected the data, let’s create a simple model that predicts the retweet count of tweets. To do this, we need to merge the dataset we read in from Github above with the new dataset we just produced (in order to look at attributes of each senator that might increase the reach of their tweets).

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#rename twitter_id variable in original dataset in order to merge it with tweet dataset
colnames(elected_officials)[colnames(elected_officials)=="twitter_id"]<-"screen_name"
for_analysis<-left_join(elected_official_tweets, elected_officials)
## Joining, by = "screen_name"

Let’s inspect the outcome measure:

hist(for_analysis$retweet_count)

Next, because our data is so skewed, we could use something like a negative binomial regression model to examine the association between the various predictors in our data and the outcome. To do this we will use the glm.nb function from the MASS package:

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
summary(glm.nb(retweet_count~
             party+
             followers_count+
             statuses_count+
             gender, 
             data=for_analysis))
## Warning: glm.fit: algorithm did not converge
## 
## Call:
## glm.nb(formula = retweet_count ~ party + followers_count + statuses_count + 
##     gender, data = for_analysis, init.theta = 0.2918080018, link = log)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.131  -1.221  -1.013  -0.611  14.655  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      6.142e+00  5.769e-02 106.478  < 2e-16 ***
## partyI          -2.627e+00  1.601e-01 -16.407  < 2e-16 ***
## partyR          -1.418e+00  4.304e-02 -32.938  < 2e-16 ***
## followers_count  8.997e-07  2.420e-08  37.179  < 2e-16 ***
## statuses_count   2.742e-06  4.905e-06   0.559    0.576    
## genderM          3.673e-01  5.313e-02   6.912 4.77e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.2918) family taken to be 1)
## 
##     Null deviance: 14564  on 8898  degrees of freedom
## Residual deviance: 12132  on 8893  degrees of freedom
##   (900 observations deleted due to missingness)
## AIC: 109118
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.29181 
##           Std. Err.:  0.00347 
## 
##  2 x log-likelihood:  -109104.33900

Unsurprisingly, senators who have more followers tend to produce tweets that have more retweets. Additionally, Republicans appear to get fewer retweets than Democrats. This could result from the fact that Twitter users who follow politics–or at least Senators–are more likely to be Democrats, or because Democrats produce tweets that are somehow more appealing, or any number of other confounding factors (so we probably shouldn’t read too much into this finding without a more careful analysis).

Working with Timestamps

There is one more skill that will be useful for working with Twitter data. Very often, we want to track trends over time or subset our data according to different time periods. If we browse the variable that describes the time each tweet was created, however, we see that it is not yet in a format that we can easily work with in R:

head(for_analysis$created_at)
## [1] "2018-09-17 01:43:46 UTC" "2018-09-14 17:15:06 UTC"
## [3] "2018-09-14 16:56:42 UTC" "2018-09-14 16:56:42 UTC"
## [5] "2018-09-13 16:55:18 UTC" "2018-09-12 18:11:21 UTC"

To manage these types of variables that describe dates, it is often very useful to convert them into a variable of class “Date.” There are several ways to do this in R, but here is one way using the as.Date function in base R.

for_analysis$date<-as.Date(for_analysis$created_at, format="%Y-%m-%d")
head(for_analysis$date)
## [1] "2018-09-17" "2018-09-14" "2018-09-14" "2018-09-14" "2018-09-13"
## [6] "2018-09-12"

Now, we can subset the data using conventional techniques. For example, if we wanted to only look at tweets for August, we could do this:

august_tweets<-for_analysis[for_analysis$date>"2018-07-31"&
                              for_analysis$date<"2018-09-01",]

Challenges of Working with APIs

By now it is hopefully clear that APIs are an invaluable resource for collecting data from the internet. At the same time, the process of obtaining credentials, avoiding rate limiting, and learning the unique jargon employed by each API’s creators can mean many hours sifting through documentation, particularly when there is no well-functioning R package for interfacing with the API in question. If you have to develop your own custom code to work with an API, or if you need information that is not obtainable through an existing R package’s functions, you may find it useful to browse the source code of the R functions discussed above to see how they construct the API queries that produced our results.
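For example, a quick way to start inspecting how rtweet builds its requests (a minimal sketch; any exported function can be examined the same way) is simply to print a function’s body at the console:

library(rtweet)

#print the body of an exported function to see how it constructs its API request
print(search_tweets)

#un-exported helper functions can be inspected with triple-colon notation,
#e.g. rtweet:::some_internal_helper (a hypothetical name, for illustration only)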

A List of APIs of Interest

There are numerous databases that describe popular APIs on the web, including the aforementioned Programmable Web, as well as a variety of crowd-sourced and user-generated lists:

https://www.programmableweb.com/

https://github.com/toddmotto/public-apis

https://apilist.fun/

The R OpenSci site also has a list of R packages that work with APIs:

https://ropensci.org/packages/

Happy coding!