Function Tips

I see a lot of functions from people new to coding that look like this and I want to point out a few common conceptual mistakes with writing functions.

# my data-generating parameters
my_n <- 50
my_mu <- c(0, 0.2)
my_sd <- c(1, 1)
my_r <- 0.5

my_func <- function(n = my_n) {
  # simulate data
  dat <- faux::rnorm_multi(
    n = my_n,
    vars = 2,
    mu = my_mu,
    sd = my_sd,
    r = my_r,
    varnames = c("low", "high")
  )
  
  # test high-low difference
  t.test(dat$high, dat$low, paired = TRUE)
}

my_func(n = 100)
## 
##  Paired t-test
## 
## data:  dat$high and dat$low
## t = 1.9282, df = 49, p-value = 0.05964
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.01142952  0.55308436
## sample estimates:
## mean of the differences 
##               0.2708274

Definitions

First, it’s helpful to define a few terms I’ll use below.

Variables are words that identify and store the value of some data for later use. For example, in the code above, my_n is a variable and its value is set to 50.

The function is my_func() and is defined by the code inside the curly brackets {...}.

The arguments are the variables that are set by the function. In the function above, function(n = my_n) has one argument: n. Arguments can have a default value. In the code above, the argument n has a default value of my_n, which is the value that argument takes if it’s not explicitly defined (e.g., if you just run my_func() instead of my_func(n = 25)).

The global environment is the set of variables and functions that you create during your R session. It is what you can see in the Environment tab in RStudio.

Not using an argument

The biggest thing wrong with this code is that it defines an argument called n, but doesn’t use it in the code. It uses the variable my_n instead. So when I ran my_func(n = 100) above, I still got the result for n = 50 (check the df in the output), because my_n still equals 50 and that’s what the function is using, not the value of n.

It’s easy for this to happen when you’re first developing a function because you probably are modifying non-function code, where it does make sense to use my_n in the rnorm_multi() function. You can either make the name of the argument match (e.g., function(my_n) {...}) or change the name of the variable in the function code (e.g., dat <- faux::rnorm_multi(n = n, ...)).

Externally defined variables

The code in a function should usually be able to run without depending on having variables with correct names in the environment. This code doesn’t run if you don’t define my_mu, my_sd and my_r before running it. It’s tempting to just use those variables inside the function, because it saves you typing the values as arguments to the function when you use it, especially in a script where you know you won’t need to change those values, but this makes the function less useful in other contexts (including reuse by future you).

There can definitely be exceptions to this, but first master the rule that any variables used inside a function have to be defined as arguments.

Variables as default values

Normally, don’t set function argument defaults to a variable (again, there can be exceptions, but you need to master that rule before you understand when you can use exceptions). It will work fine when you’re testing it in context and all the variables you expect to be in the environment are there, but the point of a function is to be reusable in different contexts, so it’s best not to depend on external things.

You don’t always have to set a default value for an argument, but it’s often useful to set the default value to a “neutral” thing that makes the code run even if the user doesn’t set all the arguments. So n should be a sensible number like 100 (e.g., if you set it to 0, then the code won’t run correctly), and sd should be 1 (not 0, since that’s not a valid value for SDs).

Better versions

In this version, I used the same argument names as the rnorm_multi() function from faux, and also set their default values to the same defaults that function uses (simulating a dataset with 100 pairs of observations with means of 0, SDs of 1, and a correlation of 0).

You could add more arguments to the function, like vars or varnames, but in this context I know I would never want to vary them, so I can “hard-code” them.

my_func2 <- function(n = 100, mu = 0, sd = 1, r = 0) {
  # simulate data
  dat <- faux::rnorm_multi(
    n = n,
    vars = 2,
    mu = mu,
    sd = sd,
    r = r,
    varnames = c("low", "high")
  )
  
  # test high-low difference
  t.test(dat$high, dat$low, paired = TRUE)
}

This lets you run the function without any arguments.

my_func2()
## 
##  Paired t-test
## 
## data:  dat$high and dat$low
## t = 0.94057, df = 99, p-value = 0.3492
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1338159  0.3750168
## sample estimates:
## mean of the differences 
##               0.1206004

And then I can add in new values or my data-generating parameters from the code above.

my_func2(n = 200, mu = my_mu, sd = my_sd, r = my_r)
## 
##  Paired t-test
## 
## data:  dat$high and dat$low
## t = 2.2674, df = 199, p-value = 0.02444
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.02327751 0.33404535
## sample estimates:
## mean of the differences 
##               0.1786614

Your function doesn’t need to use the same variable names as the functions you might be using them in. Some people find using the same variable names to be easier because you can see the connection between the variable in your function and where you’re using it. But this can be confusing for new coders. You can give your function argument names that are different to clarify where you’re using them.

If this pattern makes sense to you, I recommend using a consistent prefix to the name, like the_, so you can always know if a variable is being defined as an argument to the function or externally.

my_func3 <- function(the_n = 100, the_mu = 0, the_sd = 1, the_r = 0) {
  # simulate data
  dat <- faux::rnorm_multi(
    n = the_n,
    vars = 2,
    mu = the_mu,
    sd = the_sd,
    r = the_r,
    varnames = c("low", "high")
  )
  
  # test high-low difference
  t.test(dat$high, dat$low, paired = TRUE)
}

If you’re using argument names in your function call, you will need to make sure they’re consistent with the function.

my_func3(the_n = my_n, the_mu = my_mu, the_sd = my_sd, the_r = my_r)
## 
##  Paired t-test
## 
## data:  dat$high and dat$low
## t = 0.80255, df = 49, p-value = 0.4261
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1585849  0.3694730
## sample estimates:
## mean of the differences 
##                0.105444

Or you can set the arguments by order and omit the names.

my_func3(my_n, my_mu, my_sd, my_r)
## 
##  Paired t-test
## 
## data:  dat$high and dat$low
## t = 0.29191, df = 49, p-value = 0.7716
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2255497  0.3022130
## sample estimates:
## mean of the differences 
##              0.03833165

Scope

The concept of scope can also be confusing to new coders. In this context it’s just important to know that if you have argument names that are the same as variables in your environment, the function will use the values set by its arguments, not the ones set in your global environment (what you can see in the Environment tab in RStudio).

In other words, a function has access to variables in the global environment, but also has variables created by that function’s arguments, which can overwrite the values of variables with the same name in the global environment.

For that reason, I advise new coders to avoid giving the parameter values in their global environment the same names as the arguments of the functions they are used in. this is fine, but can lead to confusion unless you have a very clear conceptual understanding of scope.

my_func4 <- function(my_n = 100, my_mu = 0, my_sd = 1, my_r = 0) {
  # simulate data
  dat <- faux::rnorm_multi(
    n = my_n,
    vars = 2,
    mu = my_mu,
    sd = my_sd,
    r = my_r,
    varnames = c("low", "high")
  )
  
  # test high-low difference
  t.test(dat$high, dat$low, paired = TRUE)
}

For example, the function above has arguments called my_n, my_mu, my_sd and my_r. We can also create variables with those same names in the global environment.

# my data-generating parameters
my_n <- 50
my_mu <- c(0, 0.2)
my_sd <- c(1, 1)
my_r <- 0.5

However, when you run the function without setting the arguments in the function, it uses the default values of the arguments, not the values you set in the global environment.

my_func4()
## 
##  Paired t-test
## 
## data:  dat$high and dat$low
## t = -0.25649, df = 99, p-value = 0.7981
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3074508  0.2370637
## sample estimates:
## mean of the differences 
##             -0.03519355

If you set the values in the function, then it will work as expected.

my_func4(my_n, my_mu, my_sd, my_r)
## 
##  Paired t-test
## 
## data:  dat$high and dat$low
## t = 4.0469, df = 49, p-value = 0.0001838
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2497333 0.7423856
## sample estimates:
## mean of the differences 
##               0.4960594
Lisa DeBruine
Lisa DeBruine
Professor of Psychology

Lisa DeBruine is a professor of psychology at the University of Glasgow. Her substantive research is on the social perception of faces and kinship. Her meta-science interests include team science (especially the Psychological Science Accelerator), open documentation, data simulation, web-based tools for data collection and stimulus generation, and teaching computational reproducibility.