17 Connections

I’ve been meaning to learn more about visualising text, and where better to start than Julia Silge and David Robinson’s Text Mining with R!

Setup

library(tidyverse) # always useful
library(tidytext)  # for text analysis
library(ggraph)    # for plotting ngrams
library(showtext)  # for custom fonts
library(ggtext)    # for adding the twitter logo

font_add("arista", "fonts/arista_light.ttf")
font_add("fa-brands", "fonts/fa-brands-400.ttf")
showtext_auto()

17.1 Data

I’ll be using my own tweets from 2021. I have a tutorial in the Applied Data Skills book on how to get these data and process them.

Load data

tweet_files <- list.files(
  path = "data/tweets", 
  pattern = "^tweet_activity_metrics",
  full.names = TRUE
)

ct <- cols("Tweet id" = col_character())

tweets <- map_df(tweet_files, read_csv, col_types = ct) %>%
  select(text = `Tweet text`)

17.2 Bigrams

Here I used the code from the ngrams chapter to get a table of all the word pairs in my tweets. I added one line to get rid of all the twitter usernames (i.e., any word that starts with @).

Code

bigrams <- tweets %>%
  mutate(no_usernames = gsub("@[A-Za-z_0-9]+", "", text)) %>%
  unnest_tokens(bigram, no_usernames, token = "ngrams", n = 2)

bigram_counts <- bigrams %>% 
  count(bigram, sort = TRUE)

head(bigram_counts, 10)

bigram	n
https t.co	1306
in the	282
of the	282
if you	149
to be	149
to the	149
you can	142
but i	141
this is	139
in a	129

The most common ones are stop words, so separate the words into word1 and word2 and get rid of any rows without words (e.g., 1-word tweets). I added a few custom entries to the stop_words$words list: “https”, “t.co”, “gt”, “lt”, and the numbers 0 to 100.

Code

bigrams_separated <- bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!is.na(word1), !is.na(word2))

my_stop_words <- c(stop_words$word, "https", "t.co", "gt", "lt", 0:100, "00")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% my_stop_words) %>%
  filter(!word2 %in% my_stop_words)

bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

head(bigram_counts)

word1	word2	n
shiny	app	45
team	science	26
blog	post	21
data	wrangling	21
twitch	stream	20
data	simulation	18

17.3 Plot ngrams

First, I’ll follow the chapter example to create a basic plot.

Code

bigram_graph <- bigram_counts %>%
  filter(n > 10)

Code

set.seed(8675309)

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)

The word “pak” looks all by itself because, for no reason I can recall, I tweeted “pak pak” 11 times last year.

17.4 Tidy the plot

As always, now it’s time to make the plot prettier.

I tried a bunch of layouts (see ?layout_tbl_graph_igraph) and the spring-based algorithm by Kamada and Kawai (“kk”) looked best.

Add a bit of padding to the margins and set coord_cartesian(clip = "off") to avoid clipping labels that are too close to the edge.

I added the twitter logo to the middle using ggtext and Albert Rapp’s tutorial.

Code

ggraph(bigram_graph, layout = "kk") +
  geom_edge_link(aes(width = n), color = "grey", show.legend = FALSE) +
  geom_node_label(aes(label = name), vjust = 0.5, hjust = 0.5,
                  fill = "dodgerblue3", color = "white", 
                  label.padding = unit(.5, "lines"),
                  label.r = unit(.75, "lines")) +
  annotate("richtext", label = "<span style='font-family:fa-brands'>&#xf099;</span>",
           x = 0.3, y = -.2, col = 'dodgerblue3', label.colour = NA,
           family='fa-brands', size=16) +
  coord_cartesian(clip = "off") +
  theme_void() +
  theme(plot.margin = unit(rep(.5, 4), "inches"))

Plot of the 21 word pairs that occured more than 10 times in my 2021 tweets -- shiny app = 45, team science = 26, blog post = 21, data wrangling = 21, twitch stream =20, data simulation = 18, rstats package = 16, data skills = 14, shiny apps = 14, coding club = 13, journal club = 13, power analysis = 13, random effects = 13, code check = 12, code review = 12, mixed effects = 12, mixed models = 12, unit tests = 12, data frame = 11, pak pak = 11, peer review = 11 — Bigram graph of my 2021 tweets.