29  Storytelling

Storytelling makes me think of words, so this is another good opportunity to practice my text mining with Julia Silge and David Robinson’s Text Mining with R.

Setup
library(tidyverse) # for data wrangling
library(lubridate) # for handling dates
library(tidytext)  # for text analysis
library(textdata)  # for sentiment analysis

theme_set(theme_minimal(base_size = 16))

29.1 Data

I’m on a roll with this PsyArXiv data from Day 27, and it has a lot of text for the titles and descriptions (abstracts), so I’ll extract that.

Code
preprint_data <- readRDS("data/preprint_data.RDS")

info <- map_df(preprint_data, function(p) {
  list(id = p$id,
       pub_dt = p$attributes$date_published,
       title = p$attributes$title,
       desc = p$attributes$description
       )
}) %>%
  mutate(pub_dt = as_datetime(pub_dt))

29.2 Sentiment Analysis

I’ll use the sentiment analysis data from (mohammad13?), but convert it to wide format.

Code
wide_nrc <- get_sentiments("nrc") %>%
  mutate(x = TRUE) %>%
  pivot_wider(names_from = sentiment, 
              values_from = x,
              values_fill = FALSE)

head(wide_nrc)
word trust fear negative sadness anger surprise positive disgust joy anticipation
abacus TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
abandon FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
abandoned FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
abandonment FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
abba FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
abbot TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

I’ll work out the code I need on a small sample of the papers first, and them comment out slice_sample() after it looks like I expect. This takes a few second to run.

Code
sentiment <- info %>%
  #slice_sample(n = 1000) %>%
  unnest_tokens(word, desc) %>%
  inner_join(wide_nrc, by = "word") %>%
  group_by(id, title, pub_dt) %>%
  summarise(across(trust:anticipation, mean),
            .groups = "drop") %>%
  pivot_longer(cols = trust:anticipation,
               names_to = "sentiment",
               values_to = "value")

29.3 Plot

First let’s try a simple plot looking at how sentiments have changed over time.

Code
ggplot(sentiment, aes(x = pub_dt, y = value, color = sentiment)) +
  geom_smooth()

29.3.1 Fix sentiment order

The sentiments default to alphabetic order, but the plot would be easier to read if they were in order by prevalence. I’ll filter out the preprints from the most recent year and month, calculate the mean value for each sentiment, and arrange them in descending order. Then I can set the sentiment column as a factor with levels in this order.

Code
s_order <- sentiment %>%
  filter(year(pub_dt) == max(year(pub_dt))) %>% 
  filter(month(pub_dt) == max(month(pub_dt))) %>%
  group_by(sentiment) %>%
  summarise(v = mean(value), .groups = "drop") %>%
  arrange(desc(v))

sentiment$sentiment <- factor(sentiment$sentiment, 
                              levels = s_order$sentiment)

This gets them in the right order, but I still think we can do better.

Code
ggplot(sentiment, aes(x = pub_dt, y = value, color = sentiment)) +
  geom_smooth()

29.3.2 Text labels

Hide the legend and add text annotation to the end of the pot with geom_text(). Make sure to use a dataset with only one row per sentiment, or it will try to print the labels thousands of times. You have to set the x value as a datetime, so use now() and add or subtract time.

Code
ggplot(sentiment, aes(x = pub_dt, y = value, color = sentiment)) +
  geom_smooth() +
  geom_text(data = s_order, 
            aes(y = v, label = sentiment), 
            x = now() + months(1), 
            hjust = 0, size = 5) +
  scale_x_datetime(date_breaks = "1 year",
                   date_labels = "%Y",
                   expand = expansion(mult = c(0, .2))) +
  scale_y_continuous(breaks = seq(0, 1, .1)) +
  coord_cartesian(ylim = c(0, 0.6)) +
  labs(x = "Date of preprint publication",
       y = "Mean proportion of words with this sentiment",
       title = "Sentiment in PsyArXiv preprint abstracts over time",
       cation = "Data from OSF.io API") +
  theme(legend.position = "none",
        panel.grid.minor = element_blank())

Plot of the proportion of words in PsyArXiv preprint abstracts with different sentiments over time. Most are fairly flat, but in early 2020, positive sentiment decreased and negative sentiment increased, along with fear and sadness. Overall, about 60% of abstract words are positive, 35% trust, 25% negative, 10-15% fear, sadness and joy, and< 10% anger, surprise and disgust.