Supplementary Materials for: The promise of community-driven preprints in ecology and evolution

1 Supplemental Materials and Methods

Supplementary Materials and Methods for: Noble et al. 2024. “The promise of community-driven preprints in ecology and evolution”. EcoEvoRxiv. https://doi.org/10.32942/X2SS46

1.1 Pre-registration, Data and Code Availability

We preregistered the study on the Open Science Framework (OSF) (https://osf.io/d7zws) as an open-ended registration. We also created updated study plan releases on our GitHub repository to capture key changes to the project as it developed (https://github.com/daniel1noble/ecoevo_1000). All data and code for reproducing the analyses and conclusions within our paper can be found on GitHub (https://github.com/daniel1noble/ecoevo_1000).

1.2 Data collection

We downloaded metadata on all preprints lodged on EcoEvoRxiv between 2018-03-21 and 2023-09-30. This information included data on preprint submission and acceptance dates, the date when the preprint was last updated, the reuse license, preprint ID and Digital Object Identifier (DOI), preprint title, submitting author and co-authors, the number of versions submitted by authors and, if available, a DOI for the published version of the preprint.

As part of the Society for Open, Reliable and Transparent Ecology and Evolutionary Biology's (SORTEE - https://www.sortee.org) conference, we ran a workshop (i.e., a hackathon entitled "The first 1000 EcoEvoRxiv preprints: did they get published and where?") where we invited interested SORTEE members and conference registrants to participate in the data collection process. We provided a detailed data collection protocol to all delegates prior to the workshop (vers. 1.0 - release). We used the workshop to train participants on how to collect data from EcoEvoRxiv preprints. This involved a demonstration using a single preprint, followed by all attendees collecting data on that same preprint. Any confusion or questions were addressed by the workshop facilitators, and the study data collection instructions and study plan were updated accordingly (vers. 1.1 - release).

We had a total of 55 participants in the workshop, and an additional 11 who could not attend but were interested in contributing. Participants assigned themselves between 4 and 50 preprints each (mean = 20.61, standard deviation = 7.07). Data collection was achieved using a Google Form to standardize data input (a minimal sketch of how such an export could be read is given after the list below). Participants collected the following information for each of their assigned preprints:

  • Preprint DOI: Copied and pasted from the preprint meta-data file.
  • Submitting/corresponding author's first name: First name (given name) and middle name (or initials, if provided) of the author submitting the preprint. Usually there was one submitting author in the preprint meta-data list. If there were multiple authors, the first author listed on the preprint was used.
  • Submitting/corresponding author's last name: Last name of the author submitting the preprint. Usually there was one submitting author in the preprint meta-data list. If there were multiple authors, the first author listed on the preprint was used.
  • Country of the corresponding/submitting author: Country of affiliation for the author submitting the preprint, collected according to standard ISO 3166 names. If the country could not be determined, the affiliation was recorded as "NA".
  • Year of first publication for corresponding/submitting author: We used Google Scholar to collect the publication year of the first journal article published by the author submitting the preprint. If the submitting author had no profile or had not published, the value entered was '0'.
  • Taxa being studied: We collected information about the broad group of taxa that were the focus of the preprint. These categories were "Plants", "Animals", "Fungi", "Algae", "Invertebrates", "Vertebrates", "Microorganisms (bacteria, viruses)", or "Other". Multiple categories could be selected if the preprint contained information on multiple taxa.
  • Discussion on the preprint?: We identified whether there were community discussions on the landing page of each preprint. If there were, we indicated 'yes'; otherwise we indicated 'no'. Additional discussions around a preprint may have occurred on other platforms (e.g., Twitter, Facebook, etc.) but we did not collect this information.
  • Type of article: EcoEvoRxiv publishes a greater diversity of preprints than other preprint servers, so we collected data on article type. The categories were: Research Article: any article-like manuscript with new empirical findings intended for publication in research journals; Methods paper: papers presenting new methodological or computational approaches; Reviews and Meta-analyses: papers quantitatively or qualitatively synthesizing a given topic; Opinions: usually short papers providing new perspectives on a topic; Comments: papers that explicitly comment on an already published research article; and Other: any other category of preprint, including government, non-profit, or industry reports, white papers, or other documents not intended for publication in a research journal.
  • Link to Data for Preprint: We were explicitly interested in whether the data for the preprint were already publicly available. To collect these data, we scrolled through the first version of the preprint and any related files on EcoEvoRxiv to see if a link (or accession number) to the underlying data was provided. If it was reported, we took the link to the data. If the preprint was not an empirical or equivalent article based on data and data analyses (usually opinions, comments, narrative reviews, theoretical papers, etc.), we entered 'NA'. If the preprint was based on data but no link (or accession number) to a data repository was reported, we entered 'none'. If multiple links were provided, we took one.
  • Link to Code for Preprint: We were also interested in whether code for data analysis was already publicly available. To collect these data, we scrolled through the first version of the preprint and any related files on EcoEvoRxiv to see if a link (or accession number) to the underlying code was provided. If it was reported, we took the link to the code. If the preprint was not an empirical or equivalent article based on data and data analyses (usually opinions, comments, narrative reviews, theoretical papers, etc.), we entered 'NA'. If the preprint was based on data but no link (or accession number) to a repository with code was reported, we entered 'none'. If multiple links were provided, we took one.
  • Number of citations to preprint: To understand citation patterns of preprints on EcoEvoRxiv, we used Google Scholar to search for the number of citations attributed to the preprint. If the preprint version was not on Google Scholar, or had been merged with an already published version, we considered it 'NA' (coded as '999'). If the preprint version was online and clearly indicated (i.e., EcoEvoRxiv), we took the number of citations at the time of collection. While citation counts may vary slightly, the first round of data collection was completed within a short time frame; as such, we do not anticipate that citation counts changed significantly between the time of data collection and the time of submission of this manuscript.
  • PCI recommendation: We also determined whether the preprint had been recommended by Peer Community In (PCI). To do this, we searched for any PCI recommendations associated with the preprint on the preprint landing page. If none was available on the landing page, we searched Google for "peercommunityin.org recommendation" "TITLE OF PREPRINT". If a link was discovered, we indicated 'yes' and took the DOI link. If not, we assumed the preprint was not recommended.
  • Publication DOI: If the preprint had been published as a journal article, we collected the published DOI of that article.
  • Journal name: If the preprint had been published as a journal article, we recorded the journal name.
  • Publication Date: If the preprint had been published as a journal article, we collected the (first) publication date (month, day, year).
  • Title Change: If the preprint had been published as a journal article, we assessed whether the title had changed between the first version of the preprint and the published article. Note that any word change was sufficient for a ‘yes’.
  • Number of citations to publication: If the preprint had been published, we manually collected the number of citations to the published version of the article from Google Scholar.
  • Link to Data for Publication: If the preprint was published, we collected the link to any published data provided on the publisher's webpage or within the published paper. If a link to the data was not visible, we recorded the paper as not having provided data ('none'). If the paper had not been published, or was not based on data and data analyses (opinions, comments, narrative reviews, theoretical papers, etc.), we entered 'NA'.
  • Link to Code for Publication: If the preprint was published, we collected the link to any published code provided on the publisher's webpage or within the published paper. If a link to the code was not visible, we recorded the paper as not having provided code ('none'). If the paper had not been published, or was not based on data and data analyses (opinions, comments, narrative reviews, theoretical papers, etc.), we entered 'NA'.
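
As a point of reference, the following is a minimal sketch, in R, of how a Google Form export containing fields like those above could be read and standardized. The file name and column names here are hypothetical placeholders, not the actual form fields:

    # Minimal sketch: read the form export and recode sentinel values.
    # The file name and column names are hypothetical.
    library(readr)
    library(dplyr)

    responses <- read_csv("hackathon_form_responses.csv") |>
      rename(
        preprint_doi       = `Preprint DOI`,
        preprint_citations = `Number of citations to preprint`
      ) |>
      mutate(
        # '999' was the code used for NA in preprint citation counts
        preprint_citations = na_if(preprint_citations, 999)
      )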

Additional variables of interest (i.e., journal impact factor, preprint and review policy, gender of submitting author) were determined outside of this data collection period.

1.3 Data Checking, Merging and Validation

The initial dataset collected by participants in the hackathon was cleaned, merged with the master meta-data file provided by the California Digital Library, and reorganised (see 1_data_processing.R and 2_data_cleaning.R in the R/ folder of the main GitHub repository). This new dataset was again placed in a Google Sheet, and co-authors were randomly allocated a number of preprints for cross-checking. The number of preprints allocated reflected each author's contribution during earlier data collection stages: co-authors who did less work during initial data collection were given more papers to cross-check, to ensure equitable contributions to the paper. The number of papers given to an author to cross-check therefore ranged from ~10 to 50.
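
The actual logic lives in 1_data_processing.R and 2_data_cleaning.R in the repository; as an illustration only, a merge of this kind might look like the following in R (the object and column names are hypothetical):

    library(dplyr)

    # 'cdl_meta' is the master meta-data file from the California Digital
    # Library; 'hackathon' is the participant-collected dataset (both
    # names hypothetical).
    merged <- cdl_meta |>
      left_join(hackathon, by = "preprint_doi") |>
      distinct(preprint_doi, .keep_all = TRUE)  # guard against duplicate rows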

Data from each preprint were cross-checked by one additional author to ensure accuracy and consistency of the data collected. Any discrepancies were either discussed or fixed prior to analysis. Most corrections involved missed DOIs for published preprints, missing or incorrect citation counts to preprints and published articles, and missing or incorrect publication dates.

It became evident that missing citation counts to preprints were largely the result of preprints and published articles not being correctly merged in Google Scholar. As such, checking citations to preprints involved looking carefully at all papers on the submitting author's Google Scholar profile and cross-checking titles. Often it was clear that redundant, unmerged articles existed. We then checked that the unmerged preprint and the published article were in fact the same study before updating preprint and publication citation counts. We also made note when titles changed from preprint to publication. At times, preprints had been merged with published versions of the same study. To cross-check the citation count for the preprint itself, we looked at the version history of the publication. If we found a clear preprint version, we took the citation count from that version. If it was not provided, we assumed no data existed on citation counts for the preprint and assigned 'NA'.

As indicated in the main manuscript, we collected data on the 'academic age' of submitting authors by looking at their Google Scholar profiles (when available) and recording their first year of publication in a peer-reviewed journal. If no Google Scholar profile was available for the submitting author, we attempted to acquire the information from other online research profiles (e.g., ResearchGate, ORCID, etc.). If no information was available, we entered '0' as the first year of publication.

1.4 Data collected computationally

One of us (ML) determined submitting-author gender using the R package gender [v.0.6.0; [1]], which predicts the most likely gender associated with a given name. We only used the algorithm-assigned gender when the gender of a given name was identified with 95% certainty. For the remaining names, ML performed manual searches to determine gender based on pronouns and photographs from professional and personal websites.
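
A minimal sketch of this step is below. The names are invented, the choice of method = "ssa" (US Social Security data) is an assumption for illustration only, and the 95% threshold is applied to the package's predicted proportions:

    library(gender)
    library(dplyr)

    first_names <- c("maria", "james", "alex")  # illustrative input

    predicted <- gender(first_names, method = "ssa") |>  # method is an assumption
      # keep only names whose predicted gender reached 95% certainty
      filter(pmax(proportion_male, proportion_female) >= 0.95) |>
      select(name, gender)
    # Names below the threshold were resolved manually (pronouns, photographs).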

The open access status of each published article was obtained using the R package roadoi (v.0.7.2) to connect to the Unpaywall platform [2]. This R package provided data on whether the published article was open access (i.e., whether or not the article was behind a paywall) and categorized its open access status. Sub-type meanings included: 'Green', articles published in toll-access journals but archived in an open access repository; 'Bronze', articles free to read on the publisher's website without a license (granting no other rights, and possibly only after a delay); 'Hybrid', articles free to read upon publication with an open access license; and 'Gold', articles published in fully open access journals. For full details on the meaning of each category, see https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean- .
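
A minimal sketch of this step (the DOI and email address are placeholders; Unpaywall requires a contact email):

    library(roadoi)
    library(dplyr)

    dois <- c("10.1111/xxx.12345")  # placeholder DOI

    oa <- oadoi_fetch(dois = dois, email = "you@example.org") |>
      select(doi, is_oa, oa_status)  # oa_status: gold, green, hybrid, bronze, closed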

1.5 Data Analysis

1.5.1 Generational Analysis

To test whether a greater number of preprints were submitted by authors of earlier 'academic age', as quantified by the submitting author's first year of publication, we used a generalized linear model (GLM) with a negative binomial error distribution. We included the number of preprints submitted in a given year as the response variable and the first year of publication as the predictor variable. A negative, and significant, slope coefficient indicates support for researchers at earlier career stages submitting more preprints.
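
A minimal sketch of this model in R, using glm.nb() from MASS and toy data (not the study data):

    library(MASS)  # provides glm.nb()

    # Toy data: counts of preprints per first year of publication
    set.seed(1)
    gen_dat <- data.frame(
      first_pub_year = 1990:2020,
      n_preprints    = rnbinom(31, mu = 5, size = 2)
    )

    mod <- glm.nb(n_preprints ~ first_pub_year, data = gen_dat)
    summary(mod)  # inspect the sign and significance of the slope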

2 Supplementary Results

2.1 Publication time by article type

Table S1 provides a breakdown of publication time by article type.

Table S1. Time between preprint submission and publication for different article types.

Article Type                  Mean (days)   SD (days)   N
book                          314           16          2
book chapter                  338           142         11
comment                       202           147         15
methods papers                285           128         50
opinion                       259           226         59
other                         211           127         17
research article              309           208         300
reviews and meta-analyses     268           180         165

2.2 Preprint citations versus published article citations

Figure S1. Number of citations to preprints and the published articles that result from those preprints.

In other research fields, papers that are preprinted have been shown to receive more citations than papers that are not. Testing whether this is true in ecology and evolution is challenging, and not possible using observational data from EcoEvoRxiv alone. However, we can compare the number of citations to preprints with citations to the published articles that result from those preprints, to give a sense of how many citations preprints can be expected to accumulate and whether highly cited preprints go on to become highly cited published research papers.

The vast majority of preprints were not cited before they were published (Figure S1), with a total of 539 papers receiving zero citations. The median number of citations to preprints was 0 (mean = 0.6; SD = 1.8), while the median number of citations to published articles was 8 (mean = 19.6; SD = 38.2).

Notably, the number of papers with zero citations dropped dramatically once published: only 61 of the 539 previously uncited preprints (11.32%) remained uncited as published articles, an 88.68% reduction.

Interestingly, a highly cited preprint did not necessarily become a highly cited published article over time, although some of this discrepancy relates to whether and how preprints and articles are linked through Crossref (Spearman rank correlation: rho = -0.01, S = 4.61 × 10^7, p = 0.86; Figure S1).
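
For reference, a correlation of this kind can be computed in R as follows (toy vectors here, not the study data):

    # Spearman rank correlation between preprint and publication citations;
    # cor.test() reports rho and the S statistic quoted above.
    pre_cites <- c(0, 1, 3, 2, 5, 4)    # toy values
    pub_cites <- c(8, 12, 40, 5, 19, 0)

    cor.test(pre_cites, pub_cites, method = "spearman")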

References

1. Mullen L. 2021. gender: Predict gender from names using historical data. R package version 0.6.0.
2. Jahn N. 2024. roadoi: Find free versions of scholarly publications via Unpaywall. R package version 0.7.2. https://github.com/ropensci/roadoi/, https://docs.ropensci.org/roadoi/.