When screening for a systematic review or meta-analysis, we conduct several pilot screening rounds. Pilot screenings help us refine our search string and decision tree, and increase the overall accuracy of our screening for literature reviews [check out this nice guide from the I-DEEL team for more info: Foo et al., 2021].
During a pilot screening, we want to select a random subset of references that would be a representative sample of the full set. When possible, screening rounds are conducted in collaboration with another reviewer. To speed up the screening process, we sometimes want to randomly allocate a subset of papers to a collaborator by splitting a reference list into subsets.
There are two reasons we’d want to automate the selection and splitting of a reference list:
- It is time consuming to randomly select papers (>100 papers is tedious to select by hand!)
- We are not really good at selecting things at random (actually computers aren’t really good at selecting truly at random either*)
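Because R draws pseudo-random numbers, a random selection can be made repeatable by fixing the seed before sampling. The lines below are a minimal sketch of that idea, not part of the functions in this post; the seed value (2022) and the toy vector of 500 reference ids are arbitrary assumptions.

```r
# Toy stand-in for the row ids of a reference list (hypothetical size)
ref_ids <- seq_len(500)

# Fix the seed, then draw 10 ids at random without replacement
set.seed(2022)
draw_a <- sample(ref_ids, 10)

# Resetting the same seed reproduces exactly the same draw
set.seed(2022)
draw_b <- sample(ref_ids, 10)

identical(draw_a, draw_b)
```

Calling set.seed() just before a pilot selection or a split would let a collaborator regenerate the same subsets from the same reference list.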
Below is the R (www.r-project.org) code to run two functions that may come in useful when conducting your pilot and collaborative screenings with Rayyan (https://rayyan.ai/), or any other software where you can upload your pilot reference list.
1. Select random pilot set:
First, load the getpilotref function below in your environment:
# -----------------------------------
# getpilotref function
# -----------------------------------
## Description:
# Function to obtain a random subset of references for pilot screening.
#
## Arguments:
# - x: data frame with reference list
# - n: number of papers for pilot subset (default is 10)
# - write: logical argument whether to save the pilot list as a csv file
# in the current working directory (default is FALSE).
# - fileName: name of file (default is "pilot")
getpilotref <- function(x, n=10, write=FALSE, fileName="pilot"){
  # n must be a single positive whole number no larger than the number of references
  if (!(is.numeric(n) && length(n) == 1L && n %% 1 == 0 && n > 0 && n <= nrow(x))) {
    # error message if the n value provided is not valid
    stop("Incompatible value n supplied, please check. ",
         "n must be a positive integer no higher than the total number of references provided.")
  }
  # randomly sample n row indexes; sorting keeps the original row order
  pilot <- x[sort(sample.int(nrow(x), n)), , drop=FALSE]
  if (isTRUE(write)) {
    # save generated pilot list in working directory using the name provided
    # (write_csv comes from the readr package)
    readr::write_csv(pilot, paste0(fileName, ".csv"), na="")
    # print out summary of saved file name
    cat(paste0("Pilot random sample set of ", n, " articles is saved as: ", fileName, ".csv\n"))
  }
  return(pilot)
}
Load example csv file that was exported from Rayyan (a reference list of papers in Ecology & Evolutionary Biology having the word “butterflies” in their title):
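# Read example butterfly reference list
articles <- read.csv("https://raw.githubusercontent.com/coraliewilliams/2022/main/data/articles_butterfly.csv")
# Get a random pilot set of 10 references (default value of n)
p10 <- getpilotref(articles)
# Get a random pilot set of 100 references and save it as pilot100.csv
# (saving uses write_csv from the readr package)
library(readr)
p100 <- getpilotref(articles, n=100, write=TRUE, fileName="pilot100")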
## Pilot random sample set of 100 articles is saved as: pilot100.csv
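A pilot set is only useful if it resembles the full list. One rough check is to compare a summary of some column between the full list and the pilot sample. The sketch below uses a toy data frame rather than the butterfly file; the `key` and `year` columns and their values are hypothetical, and the sampling line is a plain base R draw rather than a call to getpilotref.

```r
# Toy reference list; the `key` and `year` columns are hypothetical
set.seed(1)
refs <- data.frame(
  key  = seq_len(300),
  year = sample(2000:2021, 300, replace = TRUE)
)

# Draw a pilot set of 50 rows without replacement, keeping the original row order
pilot <- refs[sort(sample.int(nrow(refs), 50)), ]

# Compare the spread of publication years in the full list and in the pilot set
summary(refs$year)
summary(pilot$year)

# Every pilot row should be a distinct row of the full list
stopifnot(anyDuplicated(pilot$key) == 0, all(pilot$key %in% refs$key))
```

If the two summaries differ a lot, drawing a larger pilot set is one option to consider.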
2. Split reference list:
Load the splitref_prop function in your environment:
# -----------------------------------
# splitref_prop function
# -----------------------------------
## Description:
# Function to split in two a reference list based on input proportions.
#
## Arguments:
# - x: data frame with reference list
# - p: vector of two positive numerical proportions, one for each split, that sum to 1 (default is c(0.5, 0.5)).
# - write: logical argument whether to save the two split lists as csv files
# in the current working directory (default is FALSE).
# - sname: prefix for the names of the two split csv files (default is "split").
splitref_prop <- function(x, p=c(0.5, 0.5), write=FALSE, sname="split") {
  # p must hold two positive numbers that sum to 1 (compared with a tolerance,
  # because an exact == on floating-point sums can fail)
  if (!(is.numeric(p) && length(p) == 2L && all(p > 0) && isTRUE(all.equal(sum(p), 1)))) {
    # error message if the p values provided are not valid
    stop("Incompatible values for p (proportions) supplied, please check. ",
         "Proportion values must be positive numbers less than 1, and the sum of all proportions should equal 1.")
  }
  # randomly shuffle the row indexes of the reference list
  rids <- sample.int(nrow(x))
  # number of references in the first subset, based on the first proportion
  spl <- floor(p[1] * nrow(x))
  # get indices of the two data frames based on the split point
  indx1 <- rids[seq_along(rids) <= spl]
  indx2 <- rids[seq_along(rids) > spl]
  # save the two subsets as split1 and split2 in the global environment
  split1 <<- x[indx1, , drop=FALSE]
  split2 <<- x[indx2, , drop=FALSE]
  # print out summary message
  cat("Reference list was randomly split into", length(p), "proportions of", p[1]*100, "% and", p[2]*100, "%\n")
  if (isTRUE(write)) {
    # save files (write_csv comes from the readr package)
    readr::write_csv(split1, paste0(sname, "_set1.csv"), na="")
    readr::write_csv(split2, paste0(sname, "_set2.csv"), na="")
  }
}
Using the example butterfly reference list, let’s first split the reference list into two equal subsets (50% each):
splitref_prop(articles)
## Reference list was randomly split into 2 proportions of 50 % and 50 %
Now let’s get 30% of references in the first subset (split1) and 70% in the second subset (split2), for example if one reviewer has more time to spend on the screening:
splitref_prop(articles, p=c(0.3,0.7))
## Reference list was randomly split into 2 proportions of 30 % and 70 %
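# Same 30/70 split, this time saving the two subsets as testsplit_set1.csv and testsplit_set2.csv
splitref_prop(articles, p=c(0.3,0.7), write=TRUE, sname="testsplit")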
## Reference list was randomly split into 2 proportions of 30 % and 70 %
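After a split, it can be reassuring to confirm that no reference ends up with both reviewers and that none is lost. The sketch below repeats the shuffle-and-cut steps of splitref_prop on a toy data frame so that it runs on its own; the `key` column, the 40-row size and the 25/75 proportions are assumptions chosen for illustration.

```r
# Toy reference list of 40 rows; the `key` column is hypothetical
refs <- data.frame(key = seq_len(40))
p <- c(0.25, 0.75)

# Shuffle the row indexes, then cut at floor(p[1] * n)
rids <- sample.int(nrow(refs))
spl  <- floor(p[1] * nrow(refs))
set1 <- refs[rids[seq_along(rids) <= spl], , drop = FALSE]
set2 <- refs[rids[seq_along(rids) > spl], , drop = FALSE]

# No reference appears in both sets
length(intersect(set1$key, set2$key)) == 0

# Together the two sets contain every reference of the full list
setequal(c(set1$key, set2$key), refs$key)

# Sizes of the two sets
c(nrow(set1), nrow(set2))
```

The same two checks could be run on split1 and split2, using whichever column uniquely identifies a reference in your export.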
(For any comments, questions or feedback, you can reach me at: [email protected])