1 Text Visualization

Special thanks to Tianhai Zu for preparing this chapter.

Libraries

The following code will check whether the required packages have been installed. If not, it will automatically install them from CRAN.

pkgs <- c(
  "tm",
  "readtext",
  "word2vec",
  "wordcloud2",
  "wordcloud",
  "ggplot2",
  "quanteda",
  "stm",
  "sentimentr",
  "syuzhet",
  "ggrepel",
  "ape",
  "igraph",
  "ggraph",
  "networkD3",
  "googleVis"
)

missing_pkgs <- pkgs[!(pkgs %in% installed.packages()[, "Package"])]

if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

1.1 Text Data Source: Pride and Prejudice

We will use the book Pride and Prejudice, an 1813 romantic novel of manners written by Jane Austen, as the running example in this chapter. The book is downloaded from Project Gutenberg (http://www.gutenberg.org/ebooks/1342), whose license allows this text to be used with almost no restrictions.

It centers on the turbulent relationship between Elizabeth Bennet, the daughter of a country gentleman, and Fitzwilliam Darcy, a rich aristocratic landowner. Theirs is not the only successful relationship in the story worth mentioning: the relationship between Miss Jane Bennet and Mr. Bingley is just as successful and romantic, if not more so. Here is a picture from Wikipedia describing the relationships between the main characters:

Character_Map

And a tree map of the main families:

Family_Map

Now, let’s load the data and take a look at the text data itself:

#read the book as lines
book_path = "./data/1342-0.txt"
book = readLines(book_path)
#visually check our data first
book[1:15]
##  [1] "Chapter 1"                                                       
##  [2] ""                                                                   
##  [3] "It is a truth universally acknowledged, that a single man in"       
##  [4] "possession of a good fortune, must be in want of a wife."           
##  [5] ""                                                                   
##  [6] "However little known the feelings or views of such a man may be"    
##  [7] "on his first entering a neighbourhood, this truth is so well"       
##  [8] "fixed in the minds of the surrounding families, that he is"         
##  [9] "considered the rightful property of some one or other of their"     
## [10] "daughters."                                                         
## [11] ""                                                                   
## [12] "â\200œMy dear Mr. Bennet,â\200\235 said his lady to him one day, â\200œhave you"
## [13] "heard that Netherfield Park is let at last?â\200\235"                     
## [14] ""                                                                   
## [15] "Mr. Bennet replied that he had not."

1.2 Preprocessing of Text Data

Keyword: make unstructured text structured

Textual data are not structured: they are just “words,” “spaces,” “numbers,” and “special characters.” It is almost impossible to turn this mixed format directly into a nice visualization. We need to convert unstructured text into structured data so that the information behind the text can be presented clearly. There are two general methods for such conversion: traditional NLP (natural language processing) methods and machine learning methods.

1.2.1 Traditional Method: Tokenization and Term-Document Matrix

The traditional method refers to the natural language processing procedure, which usually includes the following steps:

  • Tokenization
  • Normalization
  • Noise Removal
    • stop words
    • numbers, special characters
    • correct misspelled words, or convert alternatively spelled words to a single representation (e.g., “cool”/”kewl”/”cooool”)
    • (optional) lemmatization

We will use the R package tm to complete the pre-processing.

library(tm)
## Loading required package: NLP
library(readtext)

#we will read the data as a whole long string first 
book = readtext(book_path)
#chop them into chapters
chapters = strsplit(book$text,"Chapter")[[1]]
#remove the first element as it's empty
chapters = chapters[-1]

#convert chapters into a corpus, which is standard format in NLP
book_nlp =Corpus(VectorSource(chapters))
#you can also take a look at the Corpus. What's the difference here?
#summary(book_nlp) 

#convert all characters to lower case
book_nlp = tm_map(book_nlp, tolower)
## Warning in tm_map.SimpleCorpus(book_nlp, tolower): transformation drops
## documents
#remove numbers
book_nlp = tm_map(book_nlp, removeNumbers)
## Warning in tm_map.SimpleCorpus(book_nlp, removeNumbers): transformation drops
## documents
#remove non-informative spaces
book_nlp = tm_map(book_nlp, stripWhitespace)
## Warning in tm_map.SimpleCorpus(book_nlp, stripWhitespace): transformation drops
## documents
#remove stopwords, which usually refer to the most common words in a language
book_nlp = tm_map(book_nlp, removeWords, stopwords("english")) # this built-in stopword list is convenient, but sometimes too weak
## Warning in tm_map.SimpleCorpus(book_nlp, removeWords, stopwords("english")):
## transformation drops documents
book_nlp <- tm_map(book_nlp, removeWords, readLines("./data/stopwords_eng.txt")) #remove your own extended stopword list
## Warning in tm_map.SimpleCorpus(book_nlp, removeWords, readLines("./data/
## stopwords_eng.txt")): transformation drops documents
removeSpecialChars <- function(x) gsub("[“”]","",x)
#remove Special characters
book_nlp <- tm_map(book_nlp, removeSpecialChars)
## Warning in tm_map.SimpleCorpus(book_nlp, removeSpecialChars): transformation
## drops documents
#remove Punctuation
book_nlp <- tm_map(book_nlp, removePunctuation)
## Warning in tm_map.SimpleCorpus(book_nlp, removePunctuation): transformation
## drops documents
#you can optionally add a stemming step, which maps car, cars, car's, cars' to car
#book_nlp = tm_map(book_nlp, stemDocument, language = "english")
#you can also optionally add a lemmatization step, which converts am, are, is into be 

#finally, we convert the pre-processed corpus to a term document matrix
book_dtm =TermDocumentMatrix(book_nlp)
book_dtm_matrix <- as.matrix(book_dtm) 
head(book_dtm_matrix)
##               Docs
## Terms          1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
##   abuse        1 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   account      1 0 0 0 0 0 0 0 0  1  0  0  0  0  0  2  1  3  0  0  0  0  1  1
##   acknowledged 1 0 0 1 0 0 0 0 0  1  0  0  0  0  0  1  0  0  0  0  1  0  0  1
##   affect       1 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   agreed       1 0 0 0 0 0 0 0 0  0  1  0  0  0  2  0  0  0  0  0  1  0  0  0
##   answer       1 0 1 0 0 1 2 2 4  2  2  1  0  0  0  1  0  3  2  1  1  0  2  0
##               Docs
## Terms          25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
##   abuse         0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0
##   account       0  3  0  1  0  0  2  0  0  0  1  4  0  0  2  0  0  0  2  2  0
##   acknowledged  0  1  0  0  0  0  1  0  0  0  3  0  0  0  0  0  0  0  1  1  0
##   affect        0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  1
##   agreed        0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0
##   answer        2  0  1  0  1  0  2  2  2  2  0  0  1  0  2  1  2  1  1  0  1
##               Docs
## Terms          46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61
##   abuse         0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   account       3  1  0  0  1  1  1  0  0  2  1  0  0  0  1  0
##   acknowledged  0  0  0  0  0  0  1  1  1  1  0  0  1  2  0  0
##   affect        0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0
##   agreed        0  0  0  0  1  0  1  2  0  1  0  0  1  0  0  0
##   answer        2  0  1  3  0  0  1  1  1  0  2  0  0  0  3  1

One of the traditional ways to represent words in NLP is the term-document matrix, as shown above. Each row represents a term and each column records the frequency of that term in each document (in our case, each chapter). This form is very easy to use when constructing visualizations such as a wordcloud or a word frequency spectrum.

Although this matrix looks intuitive, it suffers from ultra-high dimensionality and sparsity when the number of unique words is large. For reference, the Oxford English Dictionary has 273,000 headwords, 171,476 of which are in current use. Another drawback is that it ignores semantic relationships between words, such as synonyms. For visualizations that concern the distance between words or clusters of entities, we may need to use an embedding method such as word2vec.
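
To see this concretely, we can check the size and sparsity of the term-document matrix we just built (a quick check using the objects created above):

#number of terms (rows) and chapters (columns)
dim(book_dtm_matrix)
#proportion of zero entries, i.e. how sparse the matrix is
mean(book_dtm_matrix == 0)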

1.2.2 Machine Learning Methods, Word Embedding: Word2vec, etc

Unlike traditional NLP methods, machine learning methods usually require less pre-processing, especially deep learning methods.

Word embedding is a learned representation of words in the form of numeric vectors. It learns a dense, distributed representation for a predefined fixed-size vocabulary from a corpus of text that captures the “meaning” of each word. In many cases, it is capable of revealing hidden relationships between words. For example, vector(“king”) - vector(“lords”) is similar to vector(“queen”) - vector(“princess”).

We introduce word2vec, developed by Google, as our demonstration example. There are many other choices, such as GloVe and fastText.

The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a word vector. The vectors are chosen carefully such that a simple mathematical function (the cosine similarity between the vectors) indicates the level of semantic similarity between the words represented by those vectors.

Word2vec can utilize either of two model architectures to produce a distributed representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words. According to the authors’ note, CBOW is faster while skip-gram is slower but does a better job for infrequent words.

You can find more detail about the word2vec method here: https://israelg99.github.io/2017-03-23-Word2Vec-Explained/. Similarly, for GloVe: https://towardsdatascience.com/light-on-math-ml-intuitive-guide-to-understanding-glove-embeddings-b13b4f19c010, and for fastText: https://adityamohanty.medium.com/understanding-fasttext-an-embedding-to-look-forward-to-3ee9aa08787.

We use the R package word2vec to obtain the word vectors as follows:

#for demo purposes, we only need a single chapter
chapter1 = chapters[[2]]
paragraphs = strsplit(chapter1,"\n\n")[[1]]
library(word2vec)
chapter1_model = word2vec(paragraphs, type = "cbow", dim = 20, iter = 10, min_count = 2) # min_count is set to 2 for this small demo; in practice a higher frequency threshold is usually required for a word to be included in training
chapter1_model_matrix = as.matrix(chapter1_model)
head(chapter1_model_matrix)
##                [,1]       [,2]       [,3]       [,4]       [,5]        [,6]
## youngest 0.42721614  1.3172870  0.3776873 -1.1135346 -1.5400720  1.78457570
## herself  0.03428602 -0.3869587 -0.2615181  1.0168931 -1.1178626 -0.85134876
## mother   0.20509635 -0.8822964 -0.1041097  1.3576339  1.5275036  0.08528424
## very     0.94297183  0.7897217 -0.9748942  0.1294067  0.6194392  0.34873414
## paid     1.78640568  0.2328346 -1.1629226  0.1371666  0.9637081 -0.71882349
## morning  0.92959905  0.4900802 -0.7260662  0.2613347  1.3974193  1.22866488
##                [,7]       [,8]       [,9]      [,10]      [,11]      [,12]
## youngest  1.3577552 -0.2997377  0.9598390 -0.7605901 -0.1471264 -1.0187546
## herself  -0.8952659 -1.0734850 -1.2856542  1.2874995 -1.5106959 -0.7643379
## mother   -1.3292240 -1.3745912  0.1335131 -0.7040000  1.2145315  1.3097191
## very      0.3728897  0.9466268 -0.6196774  1.4352375 -1.4738374 -1.5111881
## paid      0.2866280 -1.1866126 -0.5461755  0.8223507 -1.7625476  1.0302522
## morning   0.3953610 -1.0777792 -0.8672289  0.8335772 -1.9132695  0.7907312
##               [,13]      [,14]      [,15]      [,16]      [,17]      [,18]
## youngest  1.5709211  1.5502567  0.7117305 -1.0148395  0.1149218  0.1708377
## herself  -1.3797028  1.7081821 -0.2270095 -0.6589340  0.9398214  0.9932207
## mother   -0.8934644 -0.9018161 -0.8597059  0.5728320 -0.1506389 -1.2754641
## very      1.4456636  0.6224090 -1.3362113  1.5992825  0.2102079 -1.1676879
## paid     -0.5522528  0.4966215 -0.5789814  1.1392013  1.9100169  0.5480881
## morning   0.9155604  1.6751751 -1.5175841  0.1507362  1.3107909 -0.1953610
##               [,19]      [,20]
## youngest -0.2854540  0.3450190
## herself  -0.6085287  1.0459977
## mother   -0.1096184  1.8122938
## very      0.1256574 -0.9141896
## paid     -1.0745655 -0.1313893
## morning   0.2137921  0.1216508

Note that we use CBOW and a low dimension of 20 for fast computation. The actual choice of these parameters depends on the specific task; usually the dimension is higher in practice.
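
Once trained, the model can be queried directly. Here is a minimal sketch using the chapter1_model object fitted above; with such a tiny training corpus the neighbours will be noisy, and a queried word must have survived the min_count filter:

#words most similar to "mother" according to this (tiny, noisy) embedding
predict(chapter1_model, c("mother"), type = "nearest", top_n = 5)
#retrieve the raw embedding vector of a word
predict(chapter1_model, c("mother"), type = "embedding")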

Visualization to Understand the Text at an Early Stage

Some of these visualizations are very helpful at the early stage of data exploration, such as wordclouds, word frequency plots, and word frequency spectrum plots. These visualizations do not involve any modeling and require no assumptions.

1.3 Wordcloud

A wordcloud is a visual representation of text data, typically used to depict keyword metadata (tags) on websites or to visualize high-frequency words from text. This format is useful for quickly perceiving the most prominent terms and their relative prominence: a bigger term means greater weight.

We use the R package wordcloud2 to generate a colorful wordcloud.

#load the library
library(wordcloud2)

#we need to prepare the data for plotting. The wordcloud2 function requires a word frequency table, so we
#convert the term-document matrix obtained above into a word frequency table as follows:
words <- sort(rowSums(book_dtm_matrix),decreasing=TRUE) 
df <- data.frame(word = names(words),freq=words)

#remove words that are frequent but not informative
df <- df[!df$word %in% c("miss","mrs","mr","give","said","know","think","soon"),]

#for reproducibility 
set.seed(1234) 
#plot
#wordcloud2(data=df, size=1.6, color='random-dark')

#backup
library(wordcloud)
## Loading required package: RColorBrewer
wordcloud(words = df$word, freq = df$freq, min.freq=50)
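
The wordcloud2() call is commented out above because it produces an interactive HTML widget rather than a static image, which may not render in this output format. One way to view it locally is to save the widget to an HTML file; a small sketch, assuming the htmlwidgets package is installed (it is not in the library list at the top of this chapter):

library(wordcloud2)
library(htmlwidgets)
#build the interactive cloud and write it to a standalone HTML file you can open in a browser
wc <- wordcloud2(data = df, size = 1.6, color = 'random-dark')
saveWidget(wc, "wordcloud.html", selfcontained = TRUE)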

1.4 Word Frequency plots

Nightingale Rose Charts

While the wordcloud shows the whole picture, Nightingale rose charts focus on the most frequent keywords. To make the plot more informative, we start with the first three chapters.

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
#function for roseplot
roseplot = function(data,main_title){
  ggplot(data, aes(x = word, y = freq)) + geom_bar(stat = "identity", fill = "steelblue") +
    coord_polar(theta = "x", start = 0) +
    ggtitle(paste0(main_title, " top keywords")) +
    theme(panel.grid = element_blank(), 
          panel.background = element_blank(), 
          axis.text.y = element_blank(), 
          axis.ticks = element_blank(), 
          axis.title = element_blank())
}

#obtain the word freq table based on first three chapters
first3_dtm =TermDocumentMatrix(book_nlp[1:3])
first3_dtm_matrix <- as.matrix(first3_dtm) 
first3_words <- sort(rowSums(first3_dtm_matrix),decreasing=TRUE) 
first3_df <- data.frame(word = names(first3_words),freq=first3_words)

#plot
roseplot(first3_df[1:10,], "chapter 1-3")

And the last three chapters.

#last three chapters
last3_dtm =TermDocumentMatrix(book_nlp[59:61])
last3_dtm_matrix <- as.matrix(last3_dtm) 
last3_words <- sort(rowSums(last3_dtm_matrix),decreasing=TRUE) 
last3_df <- data.frame(word = names(last3_words),freq=last3_words)

#plot
roseplot(last3_df[1:10,], "chapter 59-61")

It’s also very interesting to compare the top keywords in the first three chapters and the last three chapters.

#consolidate them into one table for plotting
merged_table = rbind(first3_df[1:15,], last3_df[1:15,])
merged_table$chapter = c(rep("1-3", 15), rep("59-61", 15))
merged_table$chapter = as.factor(merged_table$chapter)
#negate the second group's frequencies so its bars extend in the opposite direction
merged_table$freq[16:30] = -merged_table$freq[16:30]
merged_table
##                word freq chapter
## “                 “   62     1-3
## ”                 ”   31     1-3
## bingley     bingley   23     1-3
## bennet       bennet   22     1-3
## mrs             mrs   17     1-3
## dear           dear   14     1-3
## know           know   12     1-3
## visit         visit   10     1-3
## man             man    9     1-3
## daughters daughters    8     1-3
## handsome   handsome    8     1-3
## soon           soon    8     1-3
## evening     evening    8     1-3
## room           room    8     1-3
## girls         girls    7     1-3
## “1                “  -58   59-61
## darcy         darcy  -29   59-61
## elizabeth elizabeth  -26   59-61
## ”1                ”  -24   59-61
## jane           jane  -16   59-61
## lizzy         lizzy  -16   59-61
## know1          know  -14   59-61
## bennet1      bennet  -13   59-61
## mrs1            mrs  -13   59-61
## dear1          dear  -12   59-61
## love           love  -12   59-61
## really       really  -12   59-61
## soon1          soon  -12   59-61
## bingley1    bingley  -11   59-61
## shall         shall  -11   59-61
#another plot
ggplot(merged_table, aes(x = reorder(word, freq), y = freq, fill = chapter)) + 
  geom_bar(stat = "identity") + 
  ggtitle("Comparing top 15 keywords between chapter 1-3 and chapter 59-61") +  # the tile
  geom_text(aes(x = reorder(word, freq), y = rep(0, length(word)),label = word)) + # positioning the labels
  coord_flip() +  scale_y_reverse()+                           # horizontal layout
  theme(panel.grid = element_blank(),        # remove grid
        panel.background = element_blank(),  # remove background
        axis.text = element_blank(),         # remove axis text
        axis.ticks = element_blank(),        # remove axis ticks
        axis.title = element_blank())        # remove axis title

1.5 Lexical Dispersion plot

A lexical dispersion plot shows the occurrences of certain keywords as a frequency spectrum across all documents.

We can compare the names of the two main male characters, “bingley” and “darcy,” across all chapters. The package we use here is quanteda, another popular text mining package.

library(quanteda)
## Package version: 3.2.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 64 of 64 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:tm':
## 
##     stopwords
## The following objects are masked from 'package:NLP':
## 
##     meta, meta<-
book_nlp_quanteda <- corpus(book_nlp)
#rename the documents so that they sort properly
docnames(book_nlp_quanteda) <- c(paste0("chapter0",1:9),paste0("chapter",10:61))
# textplot_xray(
#   kwic(book_nlp_quanteda, pattern = "bingley"),
#   kwic(book_nlp_quanteda, pattern = "darcy"),
#   sort = TRUE
# )
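
The textplot_xray() calls are commented out here; note that in quanteda 3.x the textplot_* functions live in the companion quanteda.textplots package, so you may need to load that package as well. If you just want to inspect the keyword matches behind such a plot, kwic() returns a keyword-in-context table (a small sketch, assuming quanteda 3.x where kwic() expects tokens):

book_tokens <- tokens(book_nlp_quanteda)
#a few keyword-in-context matches for "darcy", with 5 words of context on each side
head(kwic(book_tokens, pattern = "darcy", window = 5), 3)
#number of matches per chapter, which is what the dispersion plot visualizes
head(table(kwic(book_tokens, pattern = "darcy")$docname))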

Now let’s do the same thing for the two main Bennet sisters, “jane” and “elizabeth,” across all chapters.

# textplot_xray(
#   kwic(book_nlp_quanteda, pattern = "jane"),
#  kwic(book_nlp_quanteda, pattern = "elizabeth"),
#  sort = TRUE
# )

How about the keywords in the title, “pride” and “prejudice?”

# textplot_xray(
#   kwic(book_nlp_quanteda, pattern = "pride"),
#   kwic(book_nlp_quanteda, pattern = "prejudice"),
#   sort = TRUE
# )

Note: as all of the above visualizations are sensitive to keywords, pre-processing is extremely important. They are also sensitive to different parts of the documents; for example, positive words appear more frequently in the happy ending of this novel.

1.6 Topic modeling

Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: “dog” and “bone” will appear more often in documents about dogs, “cat” and “meow” will appear in documents about cats, and “the” and “is” will appear approximately equally in both.

Let’s use topic modeling to understand the topics of this book. We will use the R package stm. Note that stm works better with a quanteda corpus than with a tm corpus.

quant_dfm = dfm(book_nlp_quanteda)
## Warning: 'dfm.corpus()' is deprecated. Use 'tokens()' first.
quant_dfm = dfm_trim(quant_dfm, min_termfreq = 4, max_docfreq = 10)
library(stm)
## stm v1.3.6 successfully loaded. See ?stm for help. 
##  Papers, resources, and other materials at structuraltopicmodel.com
my_lda_fit = stm(quant_dfm, K = 15, verbose = FALSE)
plot(my_lda_fit)
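
The default plot() shows the expected topic proportions with a few top words per topic. To inspect more words per topic, stm provides labelTopics(); a quick look at the model fitted above:

#top words for the first few topics under several weighting schemes (highest probability, FREX, lift, score)
labelTopics(my_lda_fit, topics = 1:3, n = 7)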

A more appropriate usage restricts the model to a small part of the book, here the last three chapters:

quant_dfm_last3chapters = dfm(book_nlp_quanteda,remove=c("mr","lady"))[59:61,]
## Warning: 'dfm.corpus()' is deprecated. Use 'tokens()' first.
## Warning: 'remove' is deprecated; use dfm_remove() instead
quant_dfm_last3chapters = dfm_trim(quant_dfm_last3chapters, min_termfreq = 3,max_termfreq = 30)
my_lda_fit3 = stm(quant_dfm_last3chapters, K = 3, verbose = FALSE)
plot(my_lda_fit3)

Visualization of Sentiment and Emotion

1.7 Sentiment Analysis

Sentiment analysis usually refers to classifying the polarity of a given text at the document, sentence, or feature/aspect level: whether the opinion expressed in a document, a sentence, or an entity feature/aspect is positive, negative, or neutral.

We will show the sentiment at the chapter level as a line chart. The package we use here is sentimentr. In addition to looking up the polarity of each word in each chapter, it attempts to take into account valence shifters (i.e., negators, amplifiers (intensifiers), de-amplifiers (downtoners), and adversative conjunctions) while maintaining speed. For example, “I hardly like it.” will be detected as a downtoned sentence, which represents a negative sentiment.
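
As a quick illustration of valence shifters, we can score that example sentence directly; sentence-level scores come from sentiment(), while sentiment_by() (used below) aggregates scores by a grouping variable:

library(sentimentr)
#"hardly" acts as a valence shifter on the positive word "like", pulling the score down
sentiment("I hardly like it.")
#compare with the unshifted sentence
sentiment("I like it.")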

library(sentimentr)
#prepare the data for sentiment analysis
book_nlp_table <- data.frame(text = rep("",length(book_nlp_quanteda)), chapter = 1:length(book_nlp_quanteda),stringsAsFactors = FALSE)
for(i in 1:length(book_nlp_quanteda)){
  book_nlp_table[i,1] <- book_nlp_quanteda[[i]]
}
#obtain the sentiment scores; the with() function is very handy here!
#note that this takes a long time to run, so a precomputed result is loaded below
senti_chapters = with(book_nlp_table, sentiment_by(get_sentences(text), list(chapter)))
load("./data/senti_chapters.RData")
#save(senti_chapters,file = "senti_chapters.RData")
#take a look at the result
head(senti_chapters)
##    chapter word_count sd ave_sentiment
## 1:       1        298 NA     1.5692816
## 2:       2        304 NA     0.2540779
## 3:       3        695 NA     2.3711395
## 4:       4        403 NA     2.4707520
## 5:       5        379 NA     1.1788623
## 6:       6        891 NA     3.0385643
#plot a smooth line chart
ggplot(senti_chapters, aes(x = chapter, y = ave_sentiment)) + geom_smooth() + 
  geom_line() + geom_point() +
  labs(x = "chapter", y = "sentiment") +
  theme_classic()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

1.8 Emotion Analysis

Emotion analysis provides more detail than sentiment analysis: instead of rating words on a 1-dimensional polarity scale, we tag each word with one or more emotions. We will use the R package syuzhet. There are eight emotions that can be tagged by this package: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
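
As a small illustration before scoring whole chapters, get_nrc_sentiment() can be applied to a single piece of text; the exact counts depend on the NRC lexicon version shipped with syuzhet:

library(syuzhet)
#returns counts for the eight NRC emotions plus negative/positive columns
get_nrc_sentiment("She was proud and happy, yet a little afraid of what was to come.")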

For demonstration purposes, we only present sadness as a red line and joy as a blue line.

library(syuzhet)
## 
## Attaching package: 'syuzhet'
## The following object is masked from 'package:sentimentr':
## 
##     get_sentences
#obtain the emotion scores for each chapter
emo_chapters = with(book_nlp_table, get_nrc_sentiment((text)))
## Warning: `spread_()` was deprecated in tidyr 1.2.0.
## Please use `spread()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
emo_chapters = cbind(chapter= book_nlp_table$chapter,emo_chapters)
head(emo_chapters)
##   chapter anger anticipation disgust fear joy sadness surprise trust negative
## 1       1     4           16       4    8  12       4        7    17       17
## 2       2     6           15       6    7  15       7        8    16       23
## 3       3    18           22      16   18  33      16       13    34       33
## 4       4     6           24       3    9  29       5       18    41       21
## 5       5     8           20       6    6  19       8        9    27       19
## 6       6    14           45      11   20  49      22       20    61       48
##   positive
## 1       33
## 2       32
## 3       71
## 4       66
## 5       50
## 6      112
#plot
ggplot(emo_chapters, aes(x = chapter)) +
         geom_smooth(aes(y = scale(sadness)), color = "red") + 
         geom_smooth(aes(y = scale(joy)), color="blue") +
         labs(x = "chapter", y = "emotional scores") +
         theme_classic()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Note: the emotion scores should be properly scaled!

1.9 Visualization of Entity/Character Position and Clustering

1.9.1 Preparation 1: Identify a Person with Unique Name

Normally, one person can have many aliases. We need to substitute all the aliases of the same person with one unique name, so that we can identify that person consistently.

Here we will use gsub directly to achieve this; note that you could use tm as well. We already did a bit of string substitution with tm in the pre-processing section.

#A name list of main characters
main_characters <- c("Elizabeth Bennet","Mr. Fitzwilliam Darcy","George Wickham","Charles Bingley","Mr. William Collins","Lydia Bennet","Mr. Bennet", "Lady Catherine de Bourgh","Jane Bennet","Mrs. Bennet","Caroline Bingley","Mary Bennet","Kitty Bennet","Georgiana Darcy","Mr. Gardiner","Mrs. Gardiner")

#for example, we replace lizzy by elizabeth. Note that all characters are lower case now!
book_nlp_table[,1] = gsub(pattern = "\\b(lizzy|eliza)\\b", replacement = "elizabeth",x =  book_nlp_table[,1])
#And for darcy:
book_nlp_table[,1] = gsub(pattern = "\\b(mr darcy|mr fitzwilliam darcy|fitzwilliam)\\b", replacement = "mrdarcy",x =  book_nlp_table[,1])
book_nlp_table[,1] = gsub(pattern = "\\b(miss darcy|georgiana darcy|georgiana)\\b", replacement = "georgiana",x =  book_nlp_table[,1])
book_nlp_table[,1] = gsub(pattern = "\\b(lady anne darcy|lady anne|anne)\\b", replacement = "annedarcy",x =  book_nlp_table[,1])
book_nlp_table[,1] = gsub(pattern = "\\b(darcy)\\b", replacement = "mrdarcy",x =  book_nlp_table[,1])
#think: why we use this order?

#for others:
book_nlp_table[,1] = gsub(pattern = "\\b(miss bingley|caroline bingley|caroline)\\b", replacement = "msbingley",x =  book_nlp_table[,1])
book_nlp_table[,1] = gsub(pattern = "\\b(mr bingley|charles bingley|bingley|charles)\\b", replacement = "mrbingley",x =  book_nlp_table[,1])
book_nlp_table[,1] = gsub(pattern = "\\b(william collins|mr collins|collins)\\b", replacement = "mrcollins",x =  book_nlp_table[,1]) # note that sir William is not him! so we don't want to replace william with mrcollins.
book_nlp_table[,1] = gsub(pattern = "\\b(lydia bennet)\\b", replacement = "lydia",x =  book_nlp_table[,1])
book_nlp_table[,1] = gsub(pattern = "\\b(mr bennet)\\b", replacement = "mrbennet",x =  book_nlp_table[,1])
book_nlp_table[,1] = gsub(pattern = "\\b(mrs bennet)\\b", replacement = "mrsbennet",x =  book_nlp_table[,1])
book_nlp_table[,1] = gsub(pattern = "\\b(miss de bourgh)\\b", replacement = "msdebourgh",x =  book_nlp_table[,1])
book_nlp_table[,1] = gsub(pattern = "\\b(lady catherine de bourgh|lady catherine)\\b", replacement = "ladycatherine",x =  book_nlp_table[,1])
book_nlp_table[,1] = gsub(pattern = "\\b(jane bennet|the eldest miss bennet)\\b", replacement = "jane",x =  book_nlp_table[,1])
book_nlp_table[,1] = gsub(pattern = "\\b(caroline bingley|miss bingley)\\b", replacement = "caroline",x =  book_nlp_table[,1])
book_nlp_table[,1] = gsub(pattern = "\\b(mary bennet)\\b", replacement = "mary",x =  book_nlp_table[,1])
book_nlp_table[,1] = gsub(pattern = "\\b(kitty bennet)\\b", replacement = "kitty",x =  book_nlp_table[,1])
book_nlp_table[,1] = gsub(pattern = "\\b(mr gardiner)\\b", replacement = "mrgardiner",x =  book_nlp_table[,1])
book_nlp_table[,1] = gsub(pattern = "\\b(mrs gardiner)\\b", replacement = "mrsgardiner",x =  book_nlp_table[,1])
book_nlp_table[,1] = gsub(pattern = "\\b(mr wickham)\\b", replacement = "mrwickham",x =  book_nlp_table[,1])

This is a very delicate task, as there are many things we could tweak. For example, who is “Miss Bennet”? There are a bunch of them! For demo purposes, we stop here.

Now we have a list of unique names:

unique_names <- c("elizabeth","mrdarcy","georgiana","annedarcy","msbingley","mrbingley","mrcollins","lydia","mrbennet","mrsbennet","msdebourgh","ladycatherine","jane","caroline","mary","kitty","mrgardiner","mrsgardiner","mrwickham")

1.9.2 Preparation 2: Word Embedding

To locate the characters in a 2-D space, we first need to obtain the word vector for each character. We will use the word2vec technique, which was discussed in the word embedding section. As a name does not carry much information by itself, it is more reasonable to use skip-gram, which captures more of the context surrounding a name. And we will use the whole book this time for a more accurate embedding.

#we use the whole book instead for accurate embedding
chapters_vector = book_nlp_table[,1]
library(word2vec)
library(ggrepel) 
book_model = word2vec(chapters_vector, type = "skip-gram", dim = 10, iter = 100, min_count = 4) # min_count is set to 4 for this demo; in practice a higher frequency threshold is usually required for a word to be included in training
book_model_matrix = as.matrix(book_model)

#select only characters with unique_names list
book_model_matrix_unique_names = book_model_matrix[rownames(book_model_matrix)%in%unique_names,]

1.10 Visualize Position of Entity/Character

With the word vector for each main character, we can now plot their positions. One extra step we go through for positioning is to reduce the dimension of the word vectors from 10 to 2, as we need to locate the characters in a 2-D space. We will use the R built-in function princomp to obtain the first two principal components of the word vectors. You can find more information on principal components analysis here.

#pca to get first two components
book_model_matrix_unique_names_pca <- princomp(book_model_matrix_unique_names, cor = FALSE)

#construct the data frame for the 2-D plot, using the first two principal components
book_model_matrix_plot = data.frame(words = rownames(book_model_matrix_unique_names_pca$scores), book_model_matrix_unique_names_pca$scores[, 1:2])
#2d plot
ggplot(book_model_matrix_plot, aes(Comp.1, Comp.2, label = words)) + geom_point(color = sample(rainbow(10), NROW(book_model_matrix_plot), replace = TRUE)) +
  geom_text_repel() +theme_void() +
  labs(title = "word2vec embeddings in 2D")
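
Reducing from 10 to 2 dimensions inevitably discards information, so it is worth checking how much variance the first two components actually retain. A quick check on the princomp object created above:

#proportion of total variance captured by each principal component
pca_var <- book_model_matrix_unique_names_pca$sdev^2
round(pca_var / sum(pca_var), 2)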

1.11 Visualize the Similarity among Entities

Similarity in text mining is usually measured by cosine distance. The advantage of cosine similarity is that even if two similar documents are far apart in Euclidean distance because of their size (say, the word “cricket” appears 50 times in one document and 10 times in another), they can still have a small angle between them. The smaller the angle, the higher the similarity.

In our case, we are measuring the similarity among the embedded word vectors. As the embedded word vectors may have different magnitudes as well, cosine similarity is a good idea. However, for vectors of similar length, we could use other distances, such as Euclidean distance.
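
A tiny numeric version of the “cricket” argument above: two documents with proportional word counts are far apart in Euclidean distance but have a cosine similarity of exactly 1.

#two term-count vectors with the same relative usage but very different lengths
doc_a <- c(cricket = 50, bat = 20)
doc_b <- c(cricket = 10, bat = 4)
#Euclidean distance is large because of the difference in magnitude
sqrt(sum((doc_a - doc_b)^2))
#cosine similarity is 1 because the angle between the two vectors is 0
sum(doc_a * doc_b) / (sqrt(sum(doc_a^2)) * sqrt(sum(doc_b^2)))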

Note that in the following example, we are using the full 10-d word vectors instead of 2-d PCA components.

#self-defined function for fast computation of cosine distance
cos_dist <- function(DF){
  Matrix <- as.matrix(DF)
  #normalize each row to unit length
  sim <- Matrix / sqrt(rowSums(Matrix * Matrix))
  #the cross-product of unit vectors gives pairwise cosine similarities
  sim <- sim %*% t(sim)
  #cosine distance = 1 - cosine similarity
  as.dist(1 - sim)
}

#convert raw distance to distance matrix
unique_names_ca_sim = as.matrix(cos_dist(book_model_matrix_unique_names))

#prepare the distance matrix as data frame with names and distances
fast_ca_sim1 = data.frame(Var1 = rownames(book_model_matrix_unique_names), Var2 = rep(rownames(book_model_matrix_unique_names), each = NROW(book_model_matrix_unique_names)),value = as.vector(unique_names_ca_sim)) 

#plot
ggplot(fast_ca_sim1, aes(x=Var1, y=Var2, fill=value)) + 
    geom_tile() +
    ggtitle("distance between roles through word2vec, cosine distance") +
    labs(x = "role", y = "role", fill = "distance") +
    scale_fill_gradient(limits = c(0,max(fast_ca_sim1$value)),low = "red",  high = "white") +
    theme(axis.text = element_text(face = "bold",size=9), 
          axis.text.x = element_text(angle = 90),
          axis.title = element_text(face = "bold",size=12)) +
          coord_flip() 

1.12 Clustering of Entity/Character

With the word vectors, we can also group the characters into clusters, just as we did with numeric data.

Here we use the R built-in function hclust to perform a hierarchical clustering analysis based on the cosine distance.

#use the same cosine distance here, but feel free to try other distance measures
changan_hc = hclust(cos_dist(book_model_matrix_unique_names)) 

# simple plot 
plot(changan_hc)

With the help of the library ape, we can make the plot fancier.

# a fancier plot
library(ape) 
mycolor = c( "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
clus7 = cutree(changan_hc, 7)
plot(as.phylo(changan_hc), type = "fan", tip.color = mycolor[clus7], label.offset = 0.05,
     edge.width = book_model_matrix_unique_names[,1] + 1, use.edge.length = TRUE, col = "red")

1.13 Visualization of Relationships between Entities/Characters in Text via Text Networks

Text data can be represented as a network at different levels. A text network can be constructed where each node is a document, and the thickness or strength of the edges between them describes the similarity between the words used in any two documents. Alternatively, one can create a text network where individual words are the nodes, and the edges between them describe the regularity with which they co-occur in documents. The granularity is decided by the purpose of the visualization.

Some benefits of visualizing text as a network include:

  1. Networks help us understand patterns of connections between words and identify their meaning more precisely than “bag of words” approaches.

  2. Text networks can be built from documents of any length, whereas topic models function poorly on short texts such as social media messages.

  3. There is a good number of sophisticated techniques available for network analysis.

1.14 Build the Connections between Entities/Characters in Text

There are several ways to construct a text network:

  1. Connections can be built based on similarity. We already showed how to calculate similarity based on the word2vec embedding. One can build connections by simply setting a similarity threshold: any two characters that are similar beyond that threshold are connected in the network. Similarly, you can construct any entity network with this embedding idea. We won’t show the details here.

  2. Another popular way to build a network from text is to use a co-occurrence matrix. We will show this method in detail in this section.

  3. There are certainly other ways to construct a network; for example, one could use communication records among characters or emotional ties.

1.15 Constructing Text Network with Co-occurrence Matrix

A co-occurrence matrix records the number of times each entity in the rows appears in the same context as each entity in the columns. Consequently, in order to use a co-occurrence matrix, we have to define our entities and the context in which they co-occur. Here, our entities are the main character names and the context is a sentence. In other words, we want to know how many times those names appear in the same sentence.
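
The computation itself is simple once we have a binary sentence-term matrix: the cross-product t(X) %*% X counts, for every pair of names, the number of sentences in which both appear. A toy example with three sentences and three names:

#rows are sentences, columns are names; 1 means the name appears in that sentence
toy <- matrix(c(1, 1, 0,
                1, 0, 1,
                0, 1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("s", 1:3), c("jane", "mrdarcy", "elizabeth")))
#off-diagonal entries are co-occurrence counts; diagonal entries are name frequencies
t(toy) %*% toy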

However, splitting text into sentences requires intact sentence structure (punctuation and capitalization), so we need to start from the raw text again:

raw_corpus <- (corpus(chapters))
#check how many documents we have, chapters in fact
ndoc(raw_corpus)
## [1] 61
corpus_sentences <- corpus_reshape(raw_corpus, to = "sentences")
#now the documents are sentences
ndoc(corpus_sentences)
## [1] 4614

Don’t forget pre-processing, as we are starting from the raw text! We need to replace the aliases first.

#note that the patterns are modified slightly from before, as we are now replacing in the raw text 

#lizzy
corpus_sentences = gsub(pattern = "\\b(lizzy|eliza)\\b", replacement = "elizabeth",x = corpus_sentences, ignore.case = TRUE)
#And for darcy:
corpus_sentences = gsub(pattern = "\\b(mr. darcy|mr. fitzwilliam darcy|fitzwilliam)\\b", replacement = "mrdarcy",x = corpus_sentences, ignore.case = TRUE)
corpus_sentences = gsub(pattern = "\\b(miss darcy|georgiana darcy|georgiana)\\b", replacement = "georgiana",x = corpus_sentences, ignore.case = TRUE)
corpus_sentences = gsub(pattern = "\\b(lady anne darcy|lady anne|anne)\\b", replacement = "annedarcy",x = corpus_sentences, ignore.case = TRUE)
corpus_sentences = gsub(pattern = "\\b(darcy)\\b", replacement = "mrdarcy",x = corpus_sentences, ignore.case = TRUE)
#think: why we use this order?

#for others:
corpus_sentences = gsub(pattern = "\\b(miss bingley|caroline bingley|caroline)\\b", replacement = "msbingley",x = corpus_sentences, ignore.case = TRUE)
corpus_sentences = gsub(pattern = "\\b(mr. bingley|charles bingley|bingley|charles)\\b", replacement = "mrbingley",x = corpus_sentences, ignore.case = TRUE)
corpus_sentences = gsub(pattern = "\\b(william collins|mr. collins|collins)\\b", replacement = "mrcollins",x = corpus_sentences, ignore.case = TRUE) # note that sir William is not him! so we don't want to replace william with mrcollins.
corpus_sentences = gsub(pattern = "\\b(lydia bennet)\\b", replacement = "lydia",x = corpus_sentences, ignore.case = TRUE)
corpus_sentences = gsub(pattern = "\\b(mr. bennet)\\b", replacement = "mrbennet",x = corpus_sentences, ignore.case = TRUE)
corpus_sentences = gsub(pattern = "\\b(mrs. bennet)\\b", replacement = "mrsbennet",x = corpus_sentences, ignore.case = TRUE)
corpus_sentences = gsub(pattern = "\\b(miss de bourgh)\\b", replacement = "msdebourgh",x = corpus_sentences, ignore.case = TRUE)
corpus_sentences = gsub(pattern = "\\b(lady catherine de bourgh|lady catherine)\\b", replacement = "ladycatherine",x = corpus_sentences, ignore.case = TRUE)
corpus_sentences = gsub(pattern = "\\b(jane bennet|the eldest miss bennet)\\b", replacement = "jane",x = corpus_sentences, ignore.case = TRUE)
corpus_sentences = gsub(pattern = "\\b(caroline bingley|miss bingley)\\b", replacement = "caroline",x = corpus_sentences, ignore.case = TRUE)
corpus_sentences = gsub(pattern = "\\b(mary bennet)\\b", replacement = "mary",x = corpus_sentences, ignore.case = TRUE)
corpus_sentences = gsub(pattern = "\\b(kitty bennet)\\b", replacement = "kitty",x = corpus_sentences, ignore.case = TRUE)
corpus_sentences = gsub(pattern = "\\b(mr. gardiner)\\b", replacement = "mrgardiner",x = corpus_sentences, ignore.case = TRUE)
corpus_sentences = gsub(pattern = "\\b(mrs. gardiner)\\b", replacement = "mrsgardiner",x = corpus_sentences, ignore.case = TRUE)
corpus_sentences = gsub(pattern = "\\b(mr. wickham)\\b", replacement = "mrwickham",x =  corpus_sentences, ignore.case = TRUE)

We need to do all the pre-processing again as we start from the raw data. This time, we will use quanteda methods to achieve this.

# read an extended stop word list
stopwords_extended <- readLines("./data/stopwords_eng.txt", encoding = "UTF-8")

# Reprocessing of the corpus of sentences
corpus_tokens <- corpus_sentences %>% 
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>% 
  tokens_tolower() %>% 
  tokens_remove(pattern = stopwords_extended, padding = T)

We count the co-occurrences with the help of the document-feature matrix format.

# Create DTM, prune vocabulary and set binary values for presence/absence of types
minimumFrequency <- 10
binDTM <- corpus_tokens %>% 
  tokens_remove("") %>%
  dfm() %>% 
  dfm_trim(min_docfreq = minimumFrequency, max_docfreq = 100000000) %>% 
  dfm_weight("boolean")

#count the co-occurrence
coocCounts <- t(binDTM) %*% binDTM

#select those main characters
unique_names_cooc <- as.matrix(coocCounts[rownames(coocCounts) %in% unique_names,colnames(coocCounts) %in% unique_names])

Now we are ready to visualize.

1.16 Static Visualization of Text Network

With the co-occurrence matrix created above, we can start to visualize the relationships between characters. For static plots, we will use the R package igraph.

Let’s start with a naive plot:

library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:ape':
## 
##     degree, edges, mst, ring
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
graphNetwork <- graph_from_adjacency_matrix(unique_names_cooc,mode = "undirected", diag = FALSE)
plot(graphNetwork)

Each co-occurrence is an edge, which makes this plot too messy; it is hard to see the connections clearly. Let’s set a threshold for forming an edge from the co-occurrence counts, and show the edge weight as line width.

library(igraph)
cooc_min <- 4
unique_names_cooc[unique_names_cooc < cooc_min] <- 0
graphNetwork <- graph_from_adjacency_matrix(unique_names_cooc,mode = "undirected", diag = FALSE,weighted = TRUE)
plot(graphNetwork,edge.width=E(graphNetwork)$weight/5) # divide the weight by 5 for better visualization; feel free to try other divisors

Of course, you can change the plotting options as we learned in the network visualization section.

plot(graphNetwork,
     edge.width=E(graphNetwork)$weight/5,
     layout=layout_nicely, 
     vertex.shape="none")

Another package is ggraph, which is also pretty helpful in some cases.

library(ggraph)
ggraph(graphNetwork, layout='linear', circular=TRUE)+
  geom_node_label(aes(label = name), size=degree(graphNetwork, normalized = TRUE) * 10 + 2) +
  geom_edge_arc(aes(edge_width = E(graphNetwork)$weight), alpha = 0.25, colour = '#377EB8') + 
  theme_graph(background = 'white') + 
  theme(legend.position = "none")

1.17 Interactive Visualization of Text Network

We can build an interactive plot of the text network with the R package networkD3, which provides tools for creating D3 JavaScript network graphs from R and works well with igraph. We can use the igraph_to_networkD3 function to convert igraph graphs to lists that work well with networkD3, and forceNetwork to plot an interactive 2-D network. Feel free to explore the other plotting functions included in networkD3.

library(networkD3)
# convert igraph network to D3 network
netD3 <- igraph_to_networkD3(graphNetwork) 
# we don't have group information, so we are setting the group equal to 1 for plotting purpose.
netD3$nodes$group <- '1' 
# set degree for each vertex
netD3$nodes$degree <- degree(graphNetwork) 
forceNetwork(Links = netD3$links, Nodes = netD3$nodes, 
             Source = 'source', Target = 'target', 
             NodeID = 'name', Group = 'group', 
             Nodesize = 'degree', 
             charge = -100, 
             zoom = TRUE, 
             bounded = FALSE, 
             opacityNoHover = TRUE)

Other Visualizations

1.18 Word Tree

A word tree depicts multiple parallel sequences of words. It can be used to show which words most often follow or precede a target word (e.g., “Cats are…”) or to show a hierarchy of terms (e.g., a decision tree). It helps us understand the common context of the word of interest. Although R has a package, googleVis, that implements the word tree, it is not possible to show the word tree as an output chunk here, because it requires a pop-up browser window backed by R. You can easily run the following code in your local R environment and play with the interactive word tree visualization.

library(googleVis)
sentences_list <- as.list(corpus_sentences)
sentences_df <- data.frame(text = do.call("c", sentences_list))
wt1 <- gvisWordTree(sentences_df, textvar = "text")
plot(wt1)