Text as Data Course
Chris Bail, PhD
Duke University
www.chrisbail.net
github.com/cbail
twitter.com/chris_bail

This tutorial is designed to introduce you to the basics of text analysis in R. It provides a foundation for future tutorials that cover more advanced topics in automated text analysis such as topic modeling and network-based text analysis. This tutorial assumes basic knowledge about R and other skills described in previous tutorials at the link above.

Character Encoding

One of the first things to learn about quantitative text analysis is that, to most computer programs, texts or strings also have a numerical basis called character encoding. Character encoding is a system for representing text as numbers in computer code that helps programs such as web browsers figure out how to display text. There are presently dozens of different character encodings, which resulted not only from advances in computing technology—and the development of different standards for different operating systems—but also from the needs of different languages (and even new "languages" such as emoji). The figure below illustrates a character encoding called "Latin Extended-B," which was developed to represent text in languages derived from Latin (and which therefore excludes a number of important languages).



Why should you care that text can be created using different forms of character encoding? Well, if you have scraped a large amount of data from multiple websites—or multiple social media sites—you may find that your data exist in multiple types of character encoding, and this can create a big hassle. Before we begin working with a text-based dataset, it is useful either to a) make sure every text uses the same character encoding, or b) use a tool to force or coerce all text into a single character encoding. The Encoding and iconv functions in base R can be very useful for the latter purpose. Note, however, that iconv may create "place holders" for characters that it cannot convert. For example, if an old character encoding is applied to text that contains emoji, the emoji may appear as strings of seemingly incoherent symbols and punctuation marks.
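For instance, here is a minimal sketch (the string and encodings are hypothetical) of how Encoding and iconv can be used to declare and then coerce a character encoding:

```r
# A hypothetical Latin-1 string: "caf\xe9" is "café" encoded as Latin-1
latin1_text <- "caf\xe9"
Encoding(latin1_text) <- "latin1"   # declare the encoding R should assume

# Coerce the string to UTF-8; characters iconv cannot convert become NA
utf8_text <- iconv(latin1_text, from = "latin1", to = "UTF-8")
Encoding(utf8_text)
## [1] "UTF-8"
```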

Inconsistent character encoding is one of the most common pitfalls for those learning to perform quantitative text analysis in R, and there are no easy solutions. If you try to run the code below and receive error messages such as "invalid multibyte string," this indicates a character encoding issue that you will most likely need to resolve using one of the imperfect steps above.

GREP

Another important tool for working with text is GREP, which stands for "Globally search a Regular Expression and Print." In layman's terms, GREP is a tool that helps you search for strings of characters that match a pattern.

To demonstrate why you need to learn some GREP, let's return to an issue we encountered in a previous tutorial on screen-scraping. In that tutorial, we scraped a Wikipedia page and discovered that there were strange characters such as \t and \n interspersed throughout the text we scraped. These are escape sequences: formatting characters that tell a program how to display whitespace (in this case a tab and a new line).

Let's create a character string that includes such characters as follows (the meaning of the text isn't important; it was scraped from the "Events" section of the Duke University web page):

duke_web_scrape<- "Class of 2018: Senior Stories of Discovery, Learning and Serving\n\n\t\t\t\t\t\t\t" 

Once again, GREP-style commands search for a certain pattern. For example, let’s write some code that determines whether the word “Class” is part of our string using the grepl function in base R:

grepl("Class", duke_web_scrape)
## [1] TRUE

The text within quotes is the pattern we are trying to find, and the second argument is the string we want to search within. The output TRUE tells us that the pattern "Class" appears at least once in the string (grepl reports presence or absence, not a count).
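Note also that grepl is vectorized: if you pass it a character vector, it returns one TRUE or FALSE per element. A small illustrative example (the strings are made up):

```r
# grepl() returns a logical value for each element of a character vector
headlines <- c("Class of 2018", "Commencement schedule", "Class registration")
grepl("Class", headlines)
## [1]  TRUE FALSE  TRUE
```

This is what makes grepl so useful for subsetting rows of a larger dataset, as we will see below.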

Now let's use the gsub command to remove all \ts from the string:

gsub("\t", "", duke_web_scrape)
## [1] "Class of 2018: Senior Stories of Discovery, Learning and Serving\n\n"

The first argument in the gsub function names the pattern we are looking for, the second argument is the replacement (here an empty string), and the third argument is the string we want to transform.
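A related function worth knowing is sub, which works exactly like gsub but replaces only the first match rather than all of them:

```r
x <- "a\tb\tc"
sub("\t", "-", x)    # replaces only the first tab
## [1] "a-b\tc"
gsub("\t", "-", x)   # replaces every tab
## [1] "a-b-c"
```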

We can also search for two patterns at once by separating them with the | ("or") operator as follows:

gsub("\t|\n", "", duke_web_scrape)
## [1] "Class of 2018: Senior Stories of Discovery, Learning and Serving"

GREP-style commands also include metacharacters such as the ^ anchor, which matches the beginning of a string. This can be used to, for example, find all words in a vector that start with a certain letter, such as "P":

some_text<-c("This","Professor","is","not","so","great")
some_text[grep("^[P]", some_text)]
## [1] "Professor"

Here is a useful cheatsheet that includes more examples of how to use GREP to find patterns in text.
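For example, the $ anchor works like ^ but matches the end of a string rather than the beginning:

```r
some_text <- c("This", "Professor", "is", "not", "so", "great")
# $ anchors the pattern to the end of each string
some_text[grep("t$", some_text)]
## [1] "not"   "great"
```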

GREP commands are fairly straightforward, and they are especially powerful and useful for subsetting rows or columns within larger datasets. There is one more concept that is important for you to grasp about GREP, however: certain characters, such as \ and [, have special meanings within regular expressions and will confuse these functions if you try to match them literally. For example:

text_chunk<-c("[This Professor is not so Great]")
gsub("\","", text_chunk)

We receive an error message when we run the code above because the \ character has a special meaning in R: it begins an escape sequence, and it is also a metacharacter in regular expressions. To remove special characters such as the square brackets here, we need to "escape" them by prefixing each one with a double backslash (\\) as follows:

text_chunk<-c("[This Professor is not so Great]")
gsub('\\[|\\]',"", text_chunk)
## [1] "This Professor is not so Great"

Tokenization

Another important concept that is necessary to master in order to perform quantitative text analysis is tokenization. Tokenization refers to the way you define the unit of analysis. This might include words, sequences of words, or entire sentences. The figure below provides an example of one way to tokenize a simple sentence.



This figure illustrates the most common way of tokenizing a text: by individual word. Many techniques in quantitative text analysis, however, also analyze what are known as "n-grams," which are simply sequences of words of length n. For example, the sentence above could be written as trigrams (n = 3): "the quick brown," "quick brown fox," "brown fox jumps," and so on. N-grams can be useful when word order is important, as I will discuss in additional detail below. For now, consider a simple example: "I hate the president" and "I'd hate to be the president" share most of the same individual words, but n-grams such as "hate the" and "hate to be" help distinguish their very different meanings.
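To make this concrete, here is a minimal sketch in base R of how one might build bigrams (n-grams with n = 2) by pasting together adjacent words:

```r
sentence <- "the quick brown fox jumps over the lazy dog"
words <- strsplit(sentence, " ")[[1]]
# pair each word with the word that follows it
bigrams <- paste(head(words, -1), tail(words, -1))
bigrams
## [1] "the quick"   "quick brown" "brown fox"   "fox jumps"   "jumps over"
## [6] "over the"    "the lazy"    "lazy dog"
```

In practice, dedicated packages handle tokenization for you, but the underlying idea is no more complicated than this.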

Creating a Corpus

Another distinctive feature of quantitative text analysis is that it typically requires new data formats. These formats allow algorithms to quickly compare one document to many other documents in order to identify patterns in word usage, which can in turn be used to uncover latent themes or to measure the overall popularity of a word or words in a single document versus a group of documents. One of the most common data formats in the field of Natural Language Processing is a corpus.

In R, the tm package is often used to create a corpus object. This package can read in data in many different formats, including text within data frames, .txt files, or .doc files. Let's begin with an example of how to read in text from within a data frame. We begin by loading an .Rdata file that contains 3,196 recent tweets by President Trump, hosted on my GitHub page:

load(url("https://cbail.github.io/Trump_Tweets.Rdata"))
head(trumptweets$text)
## [1] "Just met with UN Secretary-General António Guterres who is working hard to “Make the United Nations Great Again.” When the UN does more to solve conflicts around the world, it means the U.S. has less to do and we save money. @NikkiHaley is doing a fantastic job! https://t.co/pqUv6cyH2z"           
## [2] "America is a Nation that believes in the power of redemption. America is a Nation that believes in second chances - and America is a Nation that believes that the best is always yet to come! #PrisonReform https://t.co/Yk5UJUYgHN"                                                                     
## [3] "RT @SteveForbesCEO: .@realDonaldTrump speech on drug costs pays immediate dividends. New @Amgen drug lists at 30% less than expected. Middl…"                                                                                                                                                             
## [4] "We grieve for the terrible loss of life, and send our support and love to everyone affected by this horrible attack in Texas. To the students, families, teachers and personnel at Santa Fe High School – we are with you in this tragic hour, and we will be with you forever... https://t.co/LtJ0D29Hsv"
## [5] "School shooting in Texas. Early reports not looking good. God bless all!"                                                                                                                                                                                                                                 
## [6] "Reports are there was indeed at least one FBI representative implanted, for political purposes, into my campaign for president. It took place very early on, and long before the phony Russia Hoax became a “hot” Fake News story. If true - all time biggest political scandal!"

In order to create a corpus of these tweets, we need to use the Corpus function within the tm package. First let’s install that package

install.packages("tm")

Now let’s load the tm package in order to use its Corpus function:

library(tm)
trump_corpus <- Corpus(VectorSource(as.vector(trumptweets$text))) 
trump_corpus
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3196

As this output shows, we've created a corpus with 3,196 documents, where each document is one of Trump's tweets. You may also notice that the Corpus object can store metadata, such as the name of the author of each document or the date each document was produced (though we are not storing any such metadata here).
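Assuming the tm package is installed, a small self-contained example (with made-up documents) shows how to inspect individual documents and their metadata within a corpus:

```r
library(tm)

# a tiny hypothetical corpus built from a character vector
docs <- c("first example document", "second example document")
small_corpus <- Corpus(VectorSource(docs))

length(small_corpus)         # number of documents in the corpus
content(small_corpus[[1]])   # the text of the first document
meta(small_corpus[[1]])      # its (currently empty) document-level metadata
```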

Tidy-Text

An important alternative to the Corpus object has emerged in recent years in the form of tidytext. Instead of storing a group of documents and associated metadata together, text in tidytext format contains one word per row, and each row also includes additional information such as the name of the document in which the word appears and the order in which the words appear.

Let’s install the tidytext package to illustrate: