I-DEEL: Inter-Disciplinary Ecology and Evolution Lab

Split reference list helper for pilot and collaborative screening rounds

30/11/2022

by Coralie Williams

When screening for a systematic review or meta-analysis, we conduct several pilot screening rounds. Pilot screenings help us refine our search string and decision tree, and increase the overall accuracy of our screening for literature reviews [check out this nice guide from the I-DEEL team for more info: Foo et al., 2021].

During a pilot screening, we want to select a random subset of references that would be a representative sample of the full set. When possible, screening rounds are conducted in collaboration with another reviewer. To speed up the screening process, we sometimes want to randomly allocate a subset of papers to a collaborator by splitting a reference list into subsets.

There are two reasons we’d want to automate the selection and splitting of a reference list:
  1. Randomly selecting papers by hand is time consuming (picking more than 100 papers is tedious!)
  2. We are not very good at selecting things at random (computers aren’t good at selecting truly at random either*)

Below is the R (www.r-project.org) code for two functions that may come in handy when conducting your pilot and collaborative screenings with Rayyan (https://rayyan.ai/), or any other software that lets you upload your pilot reference list.

1. Select random pilot set:

First, load the getpilotref function below into your environment:

# -----------------------------------
# getpilotref function 
# -----------------------------------
## Description: 
#     Function to obtain a random subset of references for pilot screening.
#
# Arguments
# - x: data frame with reference list
# - n: number of papers for pilot subset (default is 10)
# - write: logical argument whether to save the pilot list as a csv file 
#   in the current working directory (default is FALSE).
# - fileName: name of file (default is "pilot")

getpilotref <- function(x, n=10, write=FALSE, fileName="pilot"){
  
  # n must be a single positive whole number no larger than the number of references
  if (!(length(n) == 1L && is.numeric(n) && n%%1==0 && n>0 && n<=nrow(x))) {
    # error message if the n value provided is not valid
    stop("Incompatible value n supplied, please check. ",
         "n must be a positive integer no higher than the total number of references provided.")
  }
  
  # randomly sample n row indexes without replacement;
  # sorting keeps the references in their original order
  pilot <- x[sort(sample(nrow(x), n)), , drop=FALSE]
  
  if (isTRUE(write)){
    
    # save generated pilot list in working directory using the name provided
    # (write_csv comes from the readr package)
    write_csv(pilot, paste(fileName, ".csv", sep=""), na="")
    
    # print out summary of saved file name
    cat(paste("Pilot random sample set of ", n, " articles is saved as: ", fileName, ".csv\n", sep=""))
    
  }
  
  return(pilot)
}

Let’s try it out
Load an example csv file exported from Rayyan (a reference list of papers in Ecology & Evolutionary Biology with the word “butterflies” in their title):

# Read example butterfly reference list
articles <- read.csv("https://raw.githubusercontent.com/coraliewilliams/2022/main/data/articles_butterfly.csv")

First, let’s obtain a random set of 10 papers without saving it as a csv file:
p10 <- getpilotref(articles)
Now, let’s obtain a subset of 100 papers for a pilot screening and save the subset as a csv file called pilot100.csv. Make sure you have the readr package installed and loaded in your environment.
library(readr)
p100 <- getpilotref(articles, n=100, write=TRUE, fileName="pilot100")
## Pilot random sample set of 100 articles is saved as: pilot100.csv
This will save a csv file pilot100.csv in your working directory. If you are unsure where your working directory is, run getwd() in your console.
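
If you would rather not depend on readr, the same sample-and-save step can be sketched in base R. The snippet below is only an illustration on a made-up data frame (refs, its columns and the file name pilot5.csv are hypothetical, not part of the Rayyan export); it writes to a temporary folder and reads the file back to check that the saved pilot set has the expected size and no duplicated references.

```r
# Toy reference list standing in for a Rayyan export (hypothetical data)
refs <- data.frame(
  key   = paste0("ref", 1:20),
  title = paste("Butterfly paper", 1:20),
  stringsAsFactors = FALSE
)

# Draw 5 row indexes without replacement, keeping the original row order
pilot_rows <- sort(sample(nrow(refs), 5))
pilot <- refs[pilot_rows, , drop = FALSE]

# Save with base R; row.names = FALSE avoids an extra index column
out_file <- file.path(tempdir(), "pilot5.csv")
write.csv(pilot, out_file, row.names = FALSE, na = "")

# Read the file back to check what was saved
check <- read.csv(out_file, stringsAsFactors = FALSE)
nrow(check)                # number of references in the saved pilot set
anyDuplicated(check$key)   # 0 means no reference was picked twice
```

Sampling row indexes, rather than the rows themselves, keeps the check independent of what columns your export contains.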

2. Split reference list with another collaborator:

Load the splitref_prop function in your environment:
# -----------------------------------
# splitref_prop function 
# -----------------------------------
## Description: 
#     Function to split in two a reference list based on input proportions.
#
## Arguments: 
# - x: data frame with reference list
# - p: vector of two numerical proportions, one for each split; it must have two positive values that sum to 1 (default is c(0.5, 0.5)).
# - write: logical argument whether to save the two split lists as csv files in the current working directory (default is FALSE).
# - sname: prefix to give to the names of the two split csv files (default is "split").

splitref_prop <- function(x, p=c(0.5, 0.5), write=FALSE, sname="split") {
  
  # p must be two positive numbers that sum to 1
  # (all.equal tolerates floating-point rounding in the sum)
  if (!(length(p) == 2L && is.numeric(p) && all(p > 0) && isTRUE(all.equal(sum(p), 1)))) {
    # error message if the p values provided are not valid
    stop("Incompatible values for p (proportions) supplied, please check. ",
         "Proportion values must be positive numbers less than 1, and the proportions must sum to 1.")
  }
  
  # randomly shuffle the row indexes of the reference list
  rids <- sample(nrow(x))
  
  # number of references in the first subset, based on the first proportion
  spl <- floor(p[1] * nrow(x))
  
  # row indexes of the two subsets (seq_len handles the case where spl is 0)
  indx1 <- rids[seq_len(spl)]
  indx2 <- rids[(spl + 1):nrow(x)]
  
  # save the two subsets as data frames split1 and split2 in the global environment
  split1 <<- x[indx1, , drop=FALSE]
  split2 <<- x[indx2, , drop=FALSE]
  
  # print out summary message
  cat(paste(c("Reference list was randomly split into", length(p), "proportions of", p[1]*100, "% and", p[2]*100, "%")), "\n")
  
  if (isTRUE(write)) {
    # save files (write_csv comes from the readr package)
    write_csv(split1, paste(sname, "_set1", ".csv", sep = ""), na ="")
    write_csv(split2, paste(sname, "_set2", ".csv", sep = ""), na ="")
  }
}

Let’s try it out
Using the example butterfly reference list, let’s first split the reference list into two equal subsets (50% each):
splitref_prop(articles)
## Reference list was randomly split into 2 proportions of 50 % and 50 %
This will give you two separate data frames to share between two reviewers: split1 and split2.

Now let’s get 30% of references in the first subset (split1) and 70% in the second subset (split2), for example if one reviewer has more time to spend on the screening:
splitref_prop(articles, p=c(0.3,0.7))
## Reference list was randomly split into 2 proportions of 30 % and 70 %
Let’s save the 30% and 70% split lists of references as csv files with the prefix “testsplit”:
splitref_prop(articles, p=c(0.3,0.7), write=TRUE, sname="testsplit")
## Reference list was randomly split into 2 proportions of 30 % and 70 %
This will save two csv files, testsplit_set1.csv and testsplit_set2.csv, in your working directory.
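
Before sharing the two sets, it can be worth confirming that no reference ended up in both and that none was lost. The sketch below repeats the shuffle-and-cut idea on a made-up data frame (refs, key, set1 and set2 are hypothetical names used only for this illustration) and runs those two checks with base R.

```r
# Toy reference list (hypothetical data)
refs <- data.frame(
  key   = paste0("ref", 1:10),
  title = paste("Paper", 1:10),
  stringsAsFactors = FALSE
)

p <- c(0.3, 0.7)

# Shuffle the row indexes, then cut the shuffled vector at the first proportion
rids <- sample(nrow(refs))
spl  <- floor(p[1] * nrow(refs))

set1 <- refs[rids[seq_len(spl)], , drop = FALSE]
set2 <- refs[rids[(spl + 1):nrow(refs)], , drop = FALSE]

# No reference should appear in both sets
overlap <- intersect(set1$key, set2$key)
length(overlap) == 0

# Together the two sets should contain every reference exactly once
covered <- setequal(c(set1$key, set2$key), refs$key)
covered && (nrow(set1) + nrow(set2) == nrow(refs))
```

Because the cut point is rounded down with floor(), the first set can be one reference smaller than the requested proportion suggests when the list length does not divide evenly.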
*Computers aren’t really good at selecting truly at random: random number generators in most computer programs are actually “pseudo-random”, meaning the numbers are produced by a deterministic mathematical algorithm. The R code above uses a pseudo-random number generator. Pseudo-random number generators are usually good enough for their intended purpose (and better than what any human could do). A good pseudo-random number generator produces sequences whose statistics are consistent with true randomness, even though the sequences themselves are not truly random. Truly random numbers can be generated from a constantly changing physical process that can’t be modelled as an algorithm. If you’re curious about true randomness, check out these websites: https://www.random.org/; https://qrng.anu.edu.au/random-colours/.
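
One practical consequence of the generator being deterministic is that a draw can be repeated exactly by fixing the seed first. The sketch below is an illustration only (the seed value 2022 and the object names are arbitrary choices, not something the functions above set for you): calling set.seed() with the same value before sample() gives the same selection, which can be handy if you want collaborators to be able to regenerate the same pilot set.

```r
ids <- 1:200   # stand-in for the row indexes of a reference list

# Same seed, same draw
set.seed(2022)
first_draw <- sample(ids, 10)

set.seed(2022)
second_draw <- sample(ids, 10)

identical(first_draw, second_draw)   # TRUE

# Without resetting the seed, the generator simply continues its sequence,
# so the next draw will generally differ
third_draw <- sample(ids, 10)
```

The same idea should carry over to the functions above: calling set.seed() just before getpilotref() or splitref_prop() fixes the selection they make.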


(Any comments, questions or feedback, you can reach me at: [email protected])
Created by Losia Lagisz, last modified on June 24, 2015