Introduction
A systematic review is a way to synthesise the existing literature on a given topic using clearly documented, transparent methods. Systematic reviews are an essential part of any meta-analysis because they allow you to identify and collate studies, from which you extract data on moderators and the descriptive statistics used to calculate effect sizes – the data that all meta-analytic modelling depends upon!
Systematic reviews are important because they can be used to identify gaps in the literature, identify areas of consensus and disagreement, and identify areas where more research is needed (Foo et al., 2021; O’Dea et al., 2021). Systematic reviews can also be used to identify potential sources of bias in the literature and potential sources of heterogeneity in meta-analyses. Often a review of these biases and gaps is provided in the results of meta-analyses, and these findings feed back to shape the types of questions the meta-analysis can, and will be able to, address.
While systematic reviews can be done without a meta-analysis, a meta-analysis cannot be done without a systematic review – a good one at least (O’Dea et al., 2021). In fact, you might start out with the intention of doing a meta-analysis but realise that there are not enough studies to do one. Not all is lost! If you’ve been careful and thorough in your systematic search, then a solid review of the literature can still be a valuable contribution to the field. This chapter reviews some gold standards for doing a rigorous systematic review. There are many resources available to help you through this process (e.g., Foo et al., 2021; O’Dea et al., 2021), but we will cover the most important basics here to get you started.
1. Question(s) and Scope
One of the most important first steps is clearly defining the question and scope of your systematic review / meta-analysis. Ill-defined questions make it difficult to conduct literature searches, extract data and, ultimately, do the meta-analysis. That’s because the question influences the search terms you use, the study designs you will include, the outcome variables you will measure, and the type of effect size to use, along with the descriptive statistics and moderator variables you need to extract. A poorly defined question can also make completing the meta-analysis an insurmountable task: if the question is too broad you might have hundreds of thousands of papers to screen – let alone extract data from!
Forming a question is often an iterative process. You will probably start with a broad question and an idea of the types of work already published. That is usually followed by a ‘scoping search’ or some sort of literature mapping exercise that allows you to get a better sense of the literature and to refine your question and scope further (Cobo et al., 2011; Foo et al., 2021). After doing your homework – manual searches, following citations to and within key works, papers your colleagues/supervisor know of, and possibly some AI assistance (LLMs like Claude Code or ChatGPT might help in this process) – you will have a set of key papers and a good grasp of the types of research that have been done in the field.
Identifying a good question is not always easy, but the PICO/PECO framework can help (Higgins et al., 2024; Morgan et al., 2018). PICO stands for Population, Intervention, Comparator, Outcome. The ‘E’ in PECO stands for ‘Exposure’ instead of ‘Intervention’, which suits studies where the factor of interest is observed rather than experimentally applied. The PICO/PECO framework is useful because it helps you clearly define the population(s) you are interested in collecting data on (e.g., developing bird embryos), the intervention applied to this population (e.g., temperature manipulation), the comparator (e.g., a control group) and the outcome variable(s) (e.g., hatching success).
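Writing the PICO/PECO elements down explicitly also makes them easy to drop into a pre-registration and reuse when building search strings later. Here is a minimal sketch in R using the bird incubation example; all the terms are illustrative placeholders, not a prescribed vocabulary:

```r
# A sketch of the PICO elements for the bird incubation example.
# All terms are illustrative placeholders.
pico <- list(
  population   = c("birds", "avian embryos"),
  intervention = c("temperature manipulation", "incubation temperature"),
  comparator   = c("control group", "ambient temperature"),
  outcome      = c("hatching success")
)

# Print a readable summary of each question component
for (element in names(pico)) {
  cat(toupper(element), ": ", paste(pico[[element]], collapse = "; "), "\n", sep = "")
}
```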
Many folks recommend pre-registering your systematic review and meta-analysis because it helps you walk through all these crucial steps early. Pre-registration forces you to think through the question and scope of your review before you start. It also allows you to be transparent about your methods and to avoid bias in your review (Foo et al., 2021; Gurevitch et al., 2018; O’Dea et al., 2021). If you can do this, it’s a great exercise!
2. Search Strategy and Multiple Databases
The search strategy is the process of identifying and selecting studies to include in your systematic review and meta-analysis. A good search strategy should capture as many relevant studies as possible while reducing the number of irrelevant ones.
To achieve this goal it is important to: 1) identify appropriate literature sources, 2) create relevant search strings for database/platform searches (which often differs across databases), 3) test the search string for each database and refine, 4) supplement database searches by examining the reference lists and citing articles of relevant studies and reviews (also known as backward and forward searches), 5) make a clear attempt to access grey literature (e.g., searching thesis portals) and 6) remove duplicates (Foo et al., 2021). A good search strategy should be systematic and transparent. It would obviously be good to be “comprehensive” but that’s a high bar, not often realistic, and indeed, not always necessary.
It has been clear for a long time that searching multiple databases is important for ensuring that you are including all relevant studies in your review (Bar-Ilan, 2018; Foo et al., 2021; Mongeon and Paul-Hus, 2016). This is because different databases index different journals and different types of studies. For example, PubMed is a great database for biomedical research but it does not index many social science journals. Web of Science and Scopus are more comprehensive databases that index a wider range of journals but they still do not index everything. One database can give you a very different set of papers to screen than another. So, if you only search one database, you might be missing out on a lot of relevant studies. Google Scholar is often used for ad hoc literature searches, but Gusenbauer and Haddaway (2020) show it is unsuitable as a principal search system for systematic reviews because it lacks proper Boolean operators, produces non-reproducible result sets, and caps retrievable results. It is better used only as a supplementary source alongside proper databases (e.g., Web of Science, Scopus).
A crucial part of the search strategy is that you document everything carefully. When did you do the search? What databases did you search? What search strings did you use? What filters did you use? How many papers did you get from each database? How many duplicates did you remove? How many papers did you screen? How many papers did you include in the review/meta-analysis? This information is important for transparency and reproducibility. It also allows others to understand the scope of your review and to identify potential sources of bias.
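A simple way to capture all of this is a running search log that lives alongside your project files. Below is a minimal sketch in R; the column names, dates and hit counts are purely illustrative – record whatever your databases actually report:

```r
# A running log of database searches; all values are illustrative
search_log <- data.frame(
  date          = as.Date(c("2024-03-01", "2024-03-01")),
  database      = c("Web of Science", "Scopus"),
  search_string = 'temperature AND ("hatching success" OR incubation) AND bird*',
  filters       = "Articles only; English",
  n_hits        = c(1843, 2210)
)
write.csv(search_log, "search_log.csv", row.names = FALSE)  # keep with your project files
```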
3. Search String Development and Sensitivity
We alluded to the importance of search string development above, but how do we know if our search strings are effective? Database searches are usually built with Boolean operators (AND, OR, NOT) and wildcards (*), sometimes restricted to specific fields (title, abstract and keywords – check each database’s defaults). The exact syntax for these operators and wildcards can differ across databases, so it’s important to check the documentation for each database you are using.
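One way to keep a long search string manageable is to build it from lists of synonyms: OR within a concept, AND between concepts. A minimal sketch in R, using illustrative terms from the bird example below (the result will still need adapting to each database’s syntax):

```r
# Groups of synonyms: OR within a concept, AND between concepts.
# Terms are illustrative; adapt the syntax to each database.
terms <- list(
  exposure   = c("temperature", "thermal"),
  outcome    = c('"hatching success"', "incubation", "development*"),
  population = c("bird*", "avian")
)

groups <- vapply(terms, function(x) paste0("(", paste(x, collapse = " OR "), ")"),
                 character(1))
search_string <- paste(groups, collapse = " AND ")
cat(search_string, "\n")
#> (temperature OR thermal) AND ("hatching success" OR incubation OR development*) AND (bird* OR avian)
```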
A good search string should be sensitive enough to capture all relevant studies but specific enough to exclude irrelevant ones (Foo et al., 2021; Lagisz et al., 2025). One way to test the sensitivity of your search string is to use a set of “benchmark” papers that you know are relevant to your review (Lagisz et al., 2025). We’d suggest 15-20 such papers. These are papers that you have identified by your scoping searches, possibly in Google Scholar, that you know are relevant to your question and you think should be captured by your search string.
Different search strings can then be evaluated on their effectiveness at: 1) capturing these benchmark papers and 2) their specificity. Sometimes these two goals pull in opposite directions. For example, if the search captures all your benchmark papers but yields hundreds of thousands of records, most of them irrelevant, you need to refine your search string. The goal is to find a balance between sensitivity and specificity, and this often requires iterative refinement: looking through the results to identify keywords or patterns in the titles and abstracts that lead to false positives, and refining the search string to exclude them. Using “NOT” can be helpful here, but be careful not to exclude relevant studies.
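Checking sensitivity can be as simple as matching your benchmark list against the search export by DOI. A minimal sketch in R (the DOIs below are hypothetical placeholders; in practice you would read both sets from your exported records):

```r
# Hypothetical benchmark DOIs and the DOIs returned by a candidate search
benchmark_dois <- c("10.1111/example.001", "10.1111/example.002",
                    "10.1098/example.003")
retrieved_dois <- c("10.1111/example.001", "10.1098/example.003",
                    "10.1002/other.999")

found <- tolower(benchmark_dois) %in% tolower(retrieved_dois)
cat(sprintf("Captured %d of %d benchmark papers (%.0f%%)\n",
            sum(found), length(benchmark_dois), 100 * mean(found)))

benchmark_dois[!found]  # inspect the misses to work out why they were missed
```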
As an example, if you are interested in the effects of temperature on hatching success in birds, you might start with a search string like: “temperature AND hatching success AND birds”. This search string might be too specific and might miss relevant studies that use different terminology, emphasise different outcomes or have different designs (e.g., “development rate” instead of “hatching success” because many studies on development rate also report hatching success). You often also need to refine your search string to include synonyms and related terms (e.g., “temperature AND (hatching success OR incubation) AND birds”).
It’s also important to consider the use of wildcards and truncation in your search string. For example, using “bird*” instead of “birds” can help capture studies that use different forms of the word (e.g., “bird”, “birds”, “birding”). Trying different combinations of search terms and operators can help you find the right balance between sensitivity and specificity. Make sure to record the final search string used for each database and to document any refinements you make along the way. Aim for a query that returns a manageable number of papers to screen (e.g., 3,000–5,000).
4. Title and Abstract Screening
Search results across all the databases should be downloaded, merged and deduplicated. Record the number of studies identified in your final search and how many were excluded at the title and abstract level (important for reporting in your PRISMA diagram – O’Dea et al. (2021)). You can use EndNote or other reference managers to help. Merged and deduplicated search results can then be uploaded to systematic review screening software (e.g., Rayyan – Ouzzani et al. (2016); see also metRscreen). There are also cool new machine learning tools that can help with screening (e.g., ASReview – van de Schoot et al. (2021)), and LLMs like Claude Code and ChatGPT might also be useful for this process, so these are worth thinking about. However, remember to be critical of them and test their performance on your specific project.
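As a rough sketch of the merge-and-deduplicate step, here is how it might look in base R, assuming each database export has been saved as a CSV with ‘title’ and ‘doi’ columns (the file names are hypothetical):

```r
# Read hypothetical exports (each with 'title' and 'doi' columns) and combine
wos    <- read.csv("wos_export.csv",    stringsAsFactors = FALSE)
scopus <- read.csv("scopus_export.csv", stringsAsFactors = FALSE)
records <- rbind(wos, scopus)

# Normalise titles so near-identical entries match, then flag duplicates
# by DOI first (ignoring missing DOIs) and by normalised title second
norm_title <- tolower(gsub("[[:punct:][:space:]]+", " ", records$title))
dupe <- (duplicated(records$doi) & !is.na(records$doi) & records$doi != "") |
  duplicated(norm_title)
deduped <- records[!dupe, ]

cat("Identified:", nrow(records), "| After deduplication:", nrow(deduped), "\n")
```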
Once you’re at this stage a few important things need to happen. First, you need to develop a clear set of inclusion and exclusion criteria – a decision tree. Which studies should be included in the meta-analysis (or not) based on their title and abstract, and why? Try to develop simple ‘yes’/‘no’ decisions based on high-level elements first. For example, if we want birds, then an obvious inclusion criterion at the top would be: “Is the study done with a bird species?”. If the answer is no, the study is excluded. This is important for transparency and reproducibility.
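Writing the decision tree down as code can be a useful discipline, because it forces each criterion to be a clear yes/no check tied to a reportable exclusion reason. A minimal sketch in R, with illustrative criteria and study fields based on the bird example:

```r
# Each criterion is a clear yes/no check with a reportable exclusion reason.
# The 'study' fields and criteria are illustrative.
screen_study <- function(study) {
  if (!study$is_bird)            return("exclude: not a bird species")
  if (!study$has_temp_treatment) return("exclude: no temperature manipulation")
  if (!study$has_control)        return("exclude: no comparator group")
  if (!study$reports_hatching)   return("exclude: outcome not reported")
  "include"
}

screen_study(list(is_bird = TRUE, has_temp_treatment = TRUE,
                  has_control = FALSE, reports_hatching = TRUE))
#> [1] "exclude: no comparator group"
```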
Second, you should have at least two independent reviewers screen the titles and abstracts while developing the decision tree that guides the inclusion/exclusion decision. This is important for reducing bias and increasing reliability, but also for testing the effectiveness of your decision tree. We usually recommend setting up 3–5 pilot sets of n = 50 papers that everyone on the team screens independently. You then check for conflicts and inter-rater reliability; if the conflict rate is low (<5%), this suggests the decision tree is fairly good.
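Percent agreement and Cohen’s kappa are easy to compute once both reviewers’ pilot decisions sit side by side. A minimal sketch in base R, with made-up decisions for 10 papers (you would use your full pilot sets of 50):

```r
# Made-up decisions from two reviewers on a small pilot set
reviewer_a <- c("include", "exclude", "exclude", "include", "exclude",
                "exclude", "include", "exclude", "exclude", "exclude")
reviewer_b <- c("include", "exclude", "include", "include", "exclude",
                "exclude", "include", "exclude", "exclude", "include")

agreement <- mean(reviewer_a == reviewer_b)  # raw percent agreement

# Cohen's kappa: observed agreement corrected for chance agreement
tab <- table(reviewer_a, reviewer_b) / length(reviewer_a)
p_o <- sum(diag(tab))                        # observed agreement
p_e <- sum(rowSums(tab) * colSums(tab))      # expected by chance
kappa <- (p_o - p_e) / (1 - p_e)
cat(sprintf("Agreement: %.0f%%; Cohen's kappa: %.2f\n", 100 * agreement, kappa))
#> Agreement: 80%; Cohen's kappa: 0.60
```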
Third, you need to have a clear process for resolving disagreements between reviewers (e.g., discussion, third reviewer, maybe an LLM pass as well). Discussion of these conflicts is usually also very good at identifying weaknesses in the decision tree and can help you refine it further.
Finally, you need to document everything carefully (e.g., number of studies screened, number of studies included/excluded, reasons for exclusion). Make sure you update the PRISMA diagram as you go along and report this information in your final review/meta-analysis.
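One low-tech way to keep the PRISMA numbers honest is to store them in one place with a consistency check, as in this minimal R sketch (all counts are illustrative):

```r
# Counts for the PRISMA flow diagram; all numbers are illustrative
prisma <- list(
  records_identified = 4053,  # total across databases
  duplicates_removed = 1210,
  records_screened   = 2843,  # titles and abstracts
  records_excluded   = 2650,
  fulltexts_assessed = 193,
  studies_included   = 87
)

# Check the flow adds up before drawing the diagram
stopifnot(
  prisma$records_identified - prisma$duplicates_removed == prisma$records_screened,
  prisma$records_screened - prisma$records_excluded == prisma$fulltexts_assessed
)
```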
Conclusion
This was just a brief overview of some of the key steps in doing a systematic review. We’ll talk about the data extraction process in the next chapter. There are many resources available to help you through the question formation and screening processes (e.g., Foo et al., 2021; O’Dea et al., 2021), but we hope this chapter has given you a good starting point for thinking about how to approach your systematic review and meta-analysis.
References
Cobo, M. J., López-Herrera, A. G., Herrera-Viedma, E. and Herrera, F. (2011). Science mapping software tools: Review, analysis, and cooperative study among tools. Journal of the American Society for Information Science and Technology 62, 1382–1402.
Foo, Y. Z., O’Dea, R. E., Koricheva, J., Nakagawa, S. and Lagisz, M. (2021). A practical guide to question formation, systematic searching and study screening for literature reviews in ecology and evolution. Methods in Ecology and Evolution 12, 1705–1720.
Gurevitch, J., Koricheva, J., Nakagawa, S. and Stewart, G. (2018). Meta-analysis and the science of research synthesis. Nature 555, 175–182.
Higgins, J. P. T., Thomas, J., Chandler, J., Cumpston, M., Li, T., Page, M. J. and Welch, V. A. eds. (2024). Cochrane handbook for systematic reviews of interventions version 6.5. Cochrane.
O’Dea, R. E., Lagisz, M., Jennions, M. D., Koricheva, J., Noble, D. W. A., Parker, T. H., Gurevitch, J., Page, M. J., Stewart, G., Moher, D., et al. (2021). Preferred reporting items for systematic reviews and meta-analyses in ecology and evolutionary biology: A PRISMA extension. Biological Reviews 96, 1695–1722. doi: 10.1111/brv.12721.
Ouzzani, M., Hammady, H., Fedorowicz, Z. and Elmagarmid, A. (2016). Rayyan—a web and mobile app for systematic reviews. Systematic Reviews 5, 210.