18 OECD
I have a potential thesis student who wants to work with topic modelling next year. Although I have a publication using latent dirichtlet analysis (LDA), my former postdoc, Ant Lee, did the main analysis, so I need to brush up. I’m also keen to practice my web scraping skills, so I thought I’d scrape the OECD data insights page to see what topics they’ve been posting about.
18.1 Data
The data insights page only shows a few entries at first. So I had to click on “Show more” a bunch of times until I accessed all of them, and download the webpage first, then search the download with rvest.
I used the developer tools in FireFox to find out that each “card” had a class of “card-list-card-50” and the link to the insight page was in the only “a” tag in each card.
Scrape URLS
# saved from https://www.oecd.org/coronavirus/en/data-insights/
html <- read_html("data/OECD | Data insights.html")
nodes <- html_nodes(html, ".card-list-card-50 a")
urls <- html_attr(nodes, "href")
# narrow down to only links to the data-insights section
insight <- map_lgl(urls, str_detect, "/data-insights/")
insight_urls <- urls[insight] %>% unique()
Then scrape the contents of each link and extract all of the paragraph text. Save this as a RDS object, set the code chunk to eval = FALSE
, and load from that so this doesn’t scrape OECD every time.
Get rid of the rows that just have the text, “Follow us (Social Media):”.
18.2 Topic Modelling
Now the fun part. I used the Topic Modelling chapter from Text Mining with R.
Code
[1] "countries" "oecd" "covid" "19"
[5] "economic" "pandemic" "2020" "2021"
[9] "global" "growth" "economies" "people"
[13] "recovery" "trade" "crisis" "outlook"
[17] "health" "measures" "inflation" "compared"
[21] "sectors" "women" "energy" "income"
[25] "percentage" "policy" "food" "economy"
[29] "governments" "green" "support" "data"
[33] "expected" "rates" "digital" "international"
[37] "levels" "employment" "tax" "jobs"
[41] "lower" "world" "2019" "gdp"
[45] "impact" "pre" "vaccines" "supply"
[49] "unemployment" "united"
Let’s get rid of all the integers 0 to 2030, “countries”, and “oecd”, which are likely to be so common as to be meaningless.
18.2.1 LDA
Convert the word_counts table to dtm format and run the LDA. I used k = 6 to extract 6 topics. I tried a few different values with the interpretive steps below, and 6 seemed reasonable to interpret. There are formal ways to assess this, but it’s beyond what I have time to do today.
18.2.2 Top Terms
Get the top 7 terms for each topic.
Code
topic | words |
---|---|
1 | green, covid, global, pandemic, bond, bonds, sovereign |
2 | income, measures, capita, household, total, percentage, employment |
3 | inflation, economic, pandemic, pre, global, outlook, compared |
4 | covid, pandemic, recovery, economic, trade, people, education |
5 | tax, economic, growth, carbon, economies, emissions, gdp |
6 | covid, health, crisis, people, pandemic, public, women |
18.2.3 Document Topics
Now we need to figure out what these topics are. I don’t have a great sense from the top words, so we’ll can also make a table of gammas, which score each “document” (url in our case) on each of the 6 topics.
Then determine the main topic for each document and plot the topic mix for each document, grouped by main topic.
Code
gammas %>%
group_by(title) %>%
mutate(main_topic = max(gamma) == gamma,
main_topic = ifelse(main_topic, topic, 0),
main_topic = sum(main_topic)) %>%
ungroup() %>%
ggplot(aes(x = title, y = gamma, fill = as.factor(topic))) +
geom_col() +
coord_flip() +
scale_x_discrete(labels = rev)+
MetBrewer::scale_fill_met_d("Archambault") +
labs(x = NULL, fill = "Topic") +
facet_wrap(~main_topic, scales = "free", ncol = 1) +
theme(legend.position = "top")
18.2.4 Add Topic Labels
After looking at the table above, there are my best guesses for topic labels. Update the facet labels with the as_labeller()
function.
Code
topics <- c("global covid effects",
"household economics",
"economic recovery",
"pandemic recovery",
"economic growth",
"personal covid effects")
names(topics) <- 1:6
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_y", nrow = 3,
labeller = as_labeller(topics)) +
scale_x_continuous(breaks = seq(0, 1, .01)) +
scale_y_reordered() +
labs(x = NULL, y = NULL,
title = "LDA Topic Analysis of OECD Data Insights",
caption = "Data from www.oecd.org/coronavirus/en/data-insights/ | Plot by @lisadebruine") +
MetBrewer::scale_fill_met_d("Archambault") +
theme(plot.title = element_text(size = 20),
plot.caption = element_text(color = "grey40"),
strip.text = element_text(size = 14))