When screening for a systematic review or meta-analysis, we conduct several pilot screening rounds. Pilot screenings help us refine our search string and decision tree, and increase the overall accuracy of our screening for literature reviews [check out this nice guide from the I-DEEL team for more info: Foo et al., 2021].
During a pilot screening, we want to select a random subset of references that would be a representative sample of the full set. When possible, screening rounds are conducted in collaboration with another reviewer. To speed up the screening process, we sometimes want to randomly allocate a subset of papers to a collaborator by splitting a reference list into subsets.
There are two reasons we’d want to automate the selection and splitting of a reference list:
- It is time-consuming to randomly select papers (selecting more than 100 papers by hand is tedious!)
- We are not really good at selecting things at random (actually computers aren’t really good at selecting truly at random either*)
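As a small illustration of the second point: R draws pseudo-random numbers, so a selection can be repeated exactly by fixing the seed first. The snippet below is only a sketch on made-up record IDs (`ref_ids` and the seed value are hypothetical and not part of the functions in this post).

```r
# Toy stand-in for a reference list: 500 hypothetical record IDs
ref_ids <- seq_len(500)

# Fixing the seed makes the pseudo-random draw repeatable
set.seed(2021)
draw_a <- sample(ref_ids, size = 100)

set.seed(2021)
draw_b <- sample(ref_ids, size = 100)

identical(draw_a, draw_b)   # TRUE: same seed, same selection
anyDuplicated(draw_a)       # 0: sample() draws without replacement by default
```

Calling `set.seed()` before a sampling step is one way to let a collaborator regenerate the same pilot set later.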
Below is the R (www.r-project.org) code for two functions that may be useful when conducting your pilot and collaborative screenings with Rayyan (https://rayyan.ai/), or any other software where you can upload your pilot reference list.
1. Select random pilot set:
First, load the getpilotref function below in your environment:
# -----------------------------------
# getpilotref function
# -----------------------------------
## Description:
# Function to obtain a random subset of references for pilot screening.
## Arguments:
# - x: data frame with reference list
# - n: number of papers for pilot subset (default is 10)
# - write: logical argument whether to save the pilot list as a csv file
#   in the current working directory (default is FALSE).
# - fileName: name of file (default is "pilot")
library(readr) # provides write_csv()
getpilotref <- function(x, n=10, write=FALSE, fileName="pilot"){
  if (length(n) == 1L && n%%1==0 && n>0 && n<=nrow(x)) {
    # randomly sample n row indices, then drop the temporary ids column from the final dataset
    x$ids <- 1:nrow(x)
    pdat <- x[which(x$ids %in% sample(x$ids, n)),]
    pilot <- pdat[,-which(colnames(pdat)=="ids"), drop=FALSE]
  } else {
    # error message if the n value provided is not valid
    stop("Incompatible value n supplied, please check.
n must be a positive integer no higher than the total number of references provided.")
  }
  if (write==TRUE){
    # save generated pilot list in working directory using the name provided
    write_csv(pilot, paste(fileName, ".csv", sep=""), na="")
    # print out summary of saved file name
    cat(paste("Pilot random sample set of ", n, " articles is saved as: ", fileName, ".csv", sep=""))
  }
  # return the pilot subset so it can be assigned to an object
  return(pilot)
}
Load the example csv file that was exported from Rayyan (a reference list of papers in Ecology & Evolutionary Biology having the word “butterflies” in their title):
# Read example butterfly reference list (used below as the data frame `articles`)
p10 <- getpilotref(articles)
p100 <- getpilotref(articles, n=100, write=T, fileName="pilot100")
## Pilot random sample set of 100 articles is saved as: pilot100.csv
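If you do not have a Rayyan export at hand, the sampling step can be tried on a made-up data frame. The following is a minimal sketch, not the function above: `toy_refs` and `pilot_sketch` are hypothetical names, and the sketch only mirrors the core idea of drawing n rows without replacement.

```r
# Hypothetical reference list with 50 made-up records
toy_refs <- data.frame(
  title = paste("Paper", 1:50),
  year  = rep(2001:2010, times = 5),
  stringsAsFactors = FALSE
)

# Compact stand-in for the sampling step: draw n row indices
# without replacement and keep the original row order
pilot_sketch <- function(x, n = 10) {
  stopifnot(length(n) == 1L, n %% 1 == 0, n > 0, n <= nrow(x))
  x[sort(sample(nrow(x), n)), , drop = FALSE]
}

set.seed(42)
toy_pilot <- pilot_sketch(toy_refs, n = 10)

nrow(toy_pilot)                            # 10
all(toy_pilot$title %in% toy_refs$title)   # TRUE
anyDuplicated(toy_pilot$title)             # 0
```

Checks like these are a quick way to confirm that a pilot set has the requested size and contains no repeated references before uploading it for screening.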
2. Split reference list:
Load the splitref_prop function in your environment:
# -----------------------------------
# splitref_prop function
# -----------------------------------
## Description:
# Function to split a reference list in two based on input proportions.
## Arguments:
# - x: data frame with reference list
# - p: vector of two numerical proportions, one for each split; both values must be positive and sum to 1.
# - write: logical argument whether to save the two split lists as csv files in the current working directory.
# - sname: prefix for the names of the two split csv files (default is "split").
library(readr) # provides write_csv()
splitref_prop <- function(x, p=c(0.5, 0.5), write=FALSE, sname="split") {
  # compare the sum to 1 with a tolerance to avoid floating-point surprises
  if (length(p) == 2L && is.numeric(p) && isTRUE(all.equal(sum(p), 1)) && all(p > 0)) {
    # randomly allocate a numerical id to each reference
    rids <- sample(1:nrow(x))
    # get index of row to split on using the proportion values provided
    spl <- floor(p[-length(p)] * nrow(x))
    # get indices of two data frames based on split ids
    # (seq_len() gives an empty first subset if spl is 0)
    indx1 <- rids[seq_len(spl)]
    indx2 <- rids[(spl + 1):nrow(x)]
    # save split subsets as two separate datasets in the global environment
    # (note: <<- overwrites any existing objects named split1 and split2)
    split1 <<- x[indx1,]
    split2 <<- x[indx2,]
    # print out summary message
    cat(paste(c("Reference list was randomly split into",length(p), "proportions of", p[1]*100, "% and", p[2]*100, "%")))
    if (write == TRUE) {
      # save files
      write_csv(split1, paste(sname, "_set1", ".csv", sep = ""), na ="")
      write_csv(split2, paste(sname, "_set2", ".csv", sep = ""), na ="")
    }
  } else {
    # error message if the p values provided are not valid
    stop("Incompatible values for p (proportions) supplied, please check.
Proportion values must be positive numbers less than 1, and the sum of all proportions should equal 1.")
  }
}
Using the example butterfly reference list, let’s first split the reference list into two equal subsets (50% each):
splitref_prop(articles, p=c(0.5,0.5))
## Reference list was randomly split into 2 proportions of 50 % and 50 %
Now let’s get 30% of references in the first subset (split1) and 70% in the second subset (split2), for example if one reviewer has more time to spend on the screening:
splitref_prop(articles, p=c(0.3,0.7))
## Reference list was randomly split into 2 proportions of 30 % and 70 %
splitref_prop(articles, p=c(0.3,0.7), write=T, sname="testsplit")
## Reference list was randomly split into 2 proportions of 30 % and 70 %
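After a split, it can be reassuring to confirm that the two subsets do not overlap and together cover the full list. The sketch below does this on a made-up data frame. Note that `split_sketch` and `toy_refs` are hypothetical names, and that this variant returns the two subsets as a list instead of writing `split1` and `split2` to the global environment, which is a design alternative rather than the behaviour of splitref_prop.

```r
# Hypothetical reference list with 20 made-up records
toy_refs <- data.frame(id = 1:20, title = paste("Paper", 1:20))

# Sketch of a two-way random split that returns a list of subsets
split_sketch <- function(x, p = c(0.5, 0.5)) {
  stopifnot(length(p) == 2L, is.numeric(p), all(p > 0),
            isTRUE(all.equal(sum(p), 1)))
  shuffled <- sample(nrow(x))          # random permutation of row indices
  n_first  <- floor(p[1] * nrow(x))    # size of the first subset
  in_first <- seq_len(nrow(x)) <= n_first
  list(set1 = x[shuffled[in_first], , drop = FALSE],
       set2 = x[shuffled[!in_first], , drop = FALSE])
}

set.seed(1)
parts <- split_sketch(toy_refs, p = c(0.25, 0.75))

nrow(parts$set1)                                           # 5
nrow(parts$set2)                                           # 15
length(intersect(parts$set1$id, parts$set2$id))            # 0: no overlap
setequal(c(parts$set1$id, parts$set2$id), toy_refs$id)     # TRUE: full coverage
```

The same two checks (no shared records, full coverage) can be run on `split1` and `split2` if your reference list has a unique identifier column.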
(If you have any comments, questions, or feedback, you can reach me at: [email protected])