Data Skills for Reproducible Science
2021-01-06
Overview
This course provides an overview of skills needed for reproducible research and open science using the statistical programming language R. Students will learn about data visualisation, data tidying and wrangling, archiving, iteration and functions, probability and data simulations, general linear models, and reproducible workflows. Learning is reinforced through weekly assignments that involve working with different types of data.
0.1 Course Aims
This course aims to teach students the basic principles of reproducible research and to provide practical training in data processing and analysis in the statistical programming language R.
0.2 Intended Learning Outcomes
By the end of this course students will be able to:
- Write scripts in R to organise and transform data sets using accepted best practices
- Explain basics of probability and its role in statistical inference
- Critically analyse data and report descriptive and inferential statistics in a reproducible manner
0.3 Course Resources
Data Skills Videos: Each chapter has several short video lectures covering the main learning outcomes, collected in a YouTube playlist. The videos are captioned, and watching with the captions on is a useful way to learn the jargon of computational reproducibility. If you cannot access YouTube, the videos are available on the course Teams and Moodle sites or by request from the instructor.
dataskills: This is a custom R package for this course. You can install it with the code below. It will download all of the packages that are used in the book, along with an offline copy of this book, the shiny apps used in the book, and the exercises.
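A minimal sketch of the install step, assuming the package is hosted in the GitHub repository psyteachr/msc-data-skills and installed with devtools. The repository path is an assumption, so check it against the course site before running:

```r
# Run once if devtools is not yet installed
# install.packages("devtools")

# Install the course package from GitHub
# (the repository path below is an assumption, not confirmed by this page)
devtools::install_github("psyteachr/msc-data-skills")
```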
glossary: Coding and statistics both have a lot of specialist terms. Throughout this book, jargon will be linked to the glossary.
0.4 Course Outline
The overview below lists the beginner learning outcomes only. Some lessons have additional learning outcomes for intermediate or advanced students.
- Getting Started
- Understand the components of the RStudio IDE
- Type commands into the console
- Understand function syntax
- Install a package
- Organise a project
- Create and compile an Rmarkdown document
- Working with Data
- Load built-in datasets
- Import data from CSV and Excel files
- Create a data table
- Understand and use the basic data types
- Understand and use the basic container types (list, vector)
- Use vectorized operations
- Be able to troubleshoot common data import problems
- Data Visualisation
- Understand what types of graphs are best for different types of data
- Create common types of graphs with ggplot2
- Set custom labels, colours, and themes
- Combine plots on the same plot, as facets, or as a grid using cowplot
- Save plots as an image file
- Tidy Data
- Understand the concept of tidy data
- Be able to convert between long and wide formats using pivot functions
- Be able to use the 4 basic tidyr verbs
- Be able to chain functions using pipes
- Data Wrangling
- Be able to use the 6 main dplyr one-table verbs: select(), filter(), arrange(), mutate(), summarise(), group_by()
- Be able to wrangle data by chaining tidyr and dplyr functions
- Be able to use these additional one-table verbs: rename(), distinct(), count(), slice(), pull()
- Data Relations
- Be able to use the 4 mutating join verbs: left_join(), right_join(), inner_join(), full_join()
- Be able to use the 2 filtering join verbs: semi_join(), anti_join()
- Be able to use the 2 binding join verbs: bind_rows(), bind_cols()
- Be able to use the 3 set operations: intersect(), union(), setdiff()
- Iteration & Functions
- Work with iteration functions: rep(), seq(), and replicate()
- Use map() and apply() functions
- Write your own custom functions with function()
- Set default values for the arguments in your functions
- Probability & Simulation
- Generate and plot data randomly sampled from common distributions: uniform, binomial, normal, Poisson
- Generate related variables from a multivariate distribution
- Define the following statistical terms: p-value, alpha, power, smallest effect size of interest (SESOI), false positive (type I error), false negative (type II error), confidence interval (CI)
- Test sampled distributions against a null hypothesis using: exact binomial test, t-test (1-sample, independent samples, paired samples), correlation (Pearson, Kendall, and Spearman)
- Calculate power using iteration and a sampling function
- Introduction to GLM
- Define the components of the GLM
- Simulate data using GLM equations
- Identify the model parameters that correspond to the data-generation parameters
- Understand and plot residuals
- Predict new values using the model
- Explain the differences among coding schemes
- Reproducible Workflows
- Create a reproducible script in R Markdown
- Edit the YAML header to add a table of contents and other options
- Include a table
- Include a figure
- Use source() to include code from an external file
- Report the output of an analysis using inline R
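As a taste of where the outline leads, the outcome "Calculate power using iteration and a sampling function" can be sketched in base R as follows. The function names sim_t_test() and calc_power(), the default sample size, and the effect size are illustrative assumptions, not the course's own code:

```r
# Simulate one two-group study and return the p-value of an
# independent-samples t-test (all defaults are illustrative assumptions)
sim_t_test <- function(n = 50, effect = 0.5, sd = 1) {
  group_a <- rnorm(n, mean = 0, sd = sd)
  group_b <- rnorm(n, mean = effect, sd = sd)
  t.test(group_a, group_b)$p.value
}

# Estimate power as the proportion of simulated studies with p < alpha;
# extra arguments in ... are passed on to sim_t_test()
calc_power <- function(reps = 1000, alpha = 0.05, ...) {
  p_values <- replicate(reps, sim_t_test(...))
  mean(p_values < alpha)
}

set.seed(8675309)  # make the random draws reproducible
power_estimate <- calc_power(reps = 1000, n = 50, effect = 0.5)
power_estimate
```

Increasing n or effect should raise the estimate, while setting effect = 0 should return a value close to alpha, the false positive rate.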
0.5 Formative Exercises
Exercises are available at the end of each lesson’s webpage. These are not marked or mandatory, but if you can work through each of these (using web resources, of course), you will easily complete the marked assessments.
Download all exercises and data files below as a ZIP archive.
- 01 intro: Intro to R, functions, R markdown
- 02 data: Vectors, tabular data, data import, pipes
- 03 ggplot: Data visualisation
- 04 tidyr: Tidy Data
- 05 dplyr: Data wrangling
- 06 joins: Data relations
- 07 functions: Functions and iteration
- 08 simulation: Simulation
- 09 glm: GLM
0.6 I found a bug!
This book is a work in progress, so you might find errors. Please help me fix them! The best way is to open an issue on GitHub that describes the error, but you can also mention it on the class Teams forum or email Lisa.
0.7 Other Resources
- Learning Statistics with R by Navarro
- R for Data Science by Grolemund and Wickham
- swirl
- R for Reproducible Scientific Analysis
- codeschool.com
- datacamp
- Improving your statistical inferences on Coursera
- You can access several cheatsheets in RStudio under the Help menu, or get the most recent RStudio Cheat Sheets
- Style guide for R programming
- #rstats on Twitter (highly recommended!)