library(openalexR)
library(dplyr)
library(ggplot2)
library(jsonlite)
library(tools)
library(stringr)
library(purrr)
library(tidyr)
library(broom)
library(cowplot)
This short document is a replication of the first part of Ben Marwick's paper published in the Journal of Archaeological Science in June 2025, which analyzes the hard/soft science position of archaeology and its evolution through time (Marwick, 2025). That work is based on a bibliometric analysis of Web of Science data for archaeological journals and articles, whereas I use data from OpenAlex, a free and open-source database, instead. This work confirms the computational reproducibility of Marwick (2025). Data from OpenAlex also confirm the trends visible in the replicated study for the hard/soft science categorization, for the evolution through time, and for the classification of different journals, with only minor differences. This study also shows that the free and open-source OpenAlex database is suitable for this kind of scientometric study as an alternative to commercial databases.
Replication and reproduction of archaeological studies are extremely rare. Aren’t they supposed, though, to be among the pillars of the scientific method (Popper, 1959)?
Following the recommendations of Barba (2018) and the National Academies of Sciences et al. (2019), adopted by Marwick et al. (2020), replication is defined as "arriv[ing] at the same scientific findings as another study, collecting new data (possibly with different methods) and completing new analyses", while reproduction is defined as "re-creating the results" given that "authors provide all the necessary data and the computer codes to run the analysis again". Indeed, I searched for these two words along with the word archaeology in Google Scholar and OpenAlex and did not find any article replicating or reproducing another archaeological article. Despite growing awareness of reproducibility issues within the community and the accelerating use of programming languages in archaeological articles (Schmidt and Marwick, 2020), replicability has yet to be embraced. Seizing the opportunity, this manuscript attempts to reproduce and replicate the results published in the first part of Marwick (2025) regarding the hard/soft science categorization of archaeology.
Reproduction will use Marwick’s shared code and data, serving not only as a means to verify the presented results but also, more importantly, as an opportunity for me to learn new aspects of R, Quarto documents, and the organization of files in such a research project. During this reproduction process, I identified some minor errors, which I reported on the GitHub page of the manuscript. This provided me with an additional opportunity to gain experience with software forges.
The idea to replicate Marwick's result using OpenAlex occurred to me while delving into the data during the manuscript reproduction. I was surprised to read in the article that there were so few journals with at least 100 papers in the Web of Science (WoS) database for archaeology that Marwick (2025) had to limit his analysis to just 20 journals. I was also surprised that WoS includes only 108 journals in its Archaeology category. This led me to explore the data on OpenAlex, which initiated the entire process. The replication thus applies the same methodology to different but supposedly equivalent data, using OpenAlex instead of Web of Science. OpenAlex, as described on the website of its creator, the nonprofit company OurResearch, is an "open and comprehensive catalog of scholarly papers, authors, institutions, and more". Established in 2021, it is a free, open-source, and open-access bibliographic database that can serve as an alternative to commercial databases and is already supported by many public institutions (e.g. Badolato, 2024; Jack, 2023; OurResearch team, 2021; Singh Chawla, 2022). The OpenAlex database has a much broader scope than Web of Science, and its dataset is significantly larger (Alperin et al., 2024; Culbert et al., 2025). This can be particularly crucial for archaeology, as the vast majority of references cited in publications from the History & Archaeology field (in the OECD classification) are not identifiable in WoS (fig. 5 in Andersen, 2023). However, caution must be exercised with the OpenAlex dataset, as some metadata are still relatively poorly documented (Alperin et al., 2024). This will necessitate additional filtering of the OpenAlex dataset to focus on usable data rather than the entire dataset.
I will outline here the main issues that I identified with the published version of the manuscript:
The manuscript states on page 2 that the selection of the "top-ranking 25 journals [in the WoS Archaeology category was based on] their h-indices as reported by Clarivate's Journal Citation Indicator". However, neither the code of the Quarto document nor the dataset itself mentions h-indices; the filter is actually based on the journals' 2022 Impact Factor. This is an error in the text of the manuscript, almost a typo, since it changes nothing in the results, but ultimately the only way to realize that the list of journals is based on the 2022 Impact Factor and not on h-indices is by examining the data and code.
The dataset of WoS Impact Factors contains one "< 0.1" and one "NA" value, which end up among the top 25 rows of the dataset when it is arranged in descending order on this variable. When the IF values of these two journals are changed to 0, the list of 25 journals should have included Journal of African Archaeology and World Archaeology (Table 1). Both journals would have met the criteria for the final list of journals even after applying the threshold of at least 100 papers in WoS, thereby extending this list to 22 journals.
Shannon's index is incorrectly calculated in Marwick's code and, therefore, should not be compared with the data presented in Fanelli and Glänzel (2013). Although it is accurately described in the comments of the code, the code itself computes the Shannon index of the references instead of the sources. Specifically, it divides p_i, the number of times a reference appears in an article (which is always one, as each reference is listed only once per article), by the total number of citations of that reference in the entire dataset. Instead, it should calculate the Shannon index of the sources of the references. The text of the manuscript is also misleading, as it mentions "The diversity of references" where it should read "The diversity of sources", as presented in Fanelli and Glänzel (2013).
The shared code does not contain the code to produce Figure 2 of the manuscript. Version 1.3 of the code loads a pre-existing .png file from the figures folder. Code producing a very similar figure is present in version 1.1, but it is not exactly the same figure.
Other issues are small code errors, which I mentioned by pushing a commit to the GitHub repository of the original article.
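The second issue can be illustrated with a toy tibble (journal names and values are made up): sorted as character strings, the "NA" and "< 0.1" entries land in positions determined by lexical collation rather than by value, while recoding them to 0 before a numeric sort yields the intended ranking:

```r
library(dplyr)

# Made-up miniature of the WoS Impact Factor column, stored as character
if_toy <- tibble(journal = c("A", "B", "C", "D"),
                 IF = c("3.2", "NA", "< 0.1", "1.5"))

# Character sort: the non-numeric entries are ordered lexically, not by value
if_toy %>% arrange(desc(IF))

# Recode both problematic values to 0, then sort numerically
if_toy %>%
  mutate(IF = suppressWarnings(as.numeric(IF)),
         IF = coalesce(IF, 0)) %>%
  arrange(desc(IF))
```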
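For the third issue, here is a minimal sketch (with a made-up reference list for a single article) of the Shannon diversity of sources as used by Fanelli and Glänzel (2013), H = −Σ p_i·ln(p_i), where p_i is the share of the article's references published in source i:

```r
# Hypothetical sources (journals) of the references cited by one article
refs_sources <- c("JAS", "JAS", "Antiquity", "JAS", "Nature", "Antiquity")

# p_i = proportion of the article's references coming from source i
p_i <- table(refs_sources) / length(refs_sources)

# Shannon index over sources (not over individual references)
-sum(p_i * log(p_i))
```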
To extract data from journals (and from works in the next section), I use openalexR, "an R package to interface with the OpenAlex API" (Aria et al., 2024).
Unfortunately, obtaining all the journals from a subfield of OpenAlex is not feasible, as 'journals' are not categorized by fields or subfields in OpenAlex, unlike 'works' (OpenAlex uses the term 'work' to encompass all types of scientific production).
Consequently, I used the list of archaeological journals from WoS to retrieve the journals' information from OpenAlex, as this list probably contains the largest journals, which would be the ones appearing in a top 25 list.
# list of journals from Marwick's WoS list for all the 108 Archaeology journals in WoS
journals_marwick = c("JOURNAL OF CULTURAL HERITAGE","AMERICAN ANTIQUITY","JOURNAL OF ARCHAEOLOGICAL SCIENCE","JOURNAL OF ARCHAEOLOGICAL METHOD AND THEORY","Archaeological and Anthropological Sciences","JOURNAL OF FIELD ARCHAEOLOGY","JOURNAL OF ANTHROPOLOGICAL ARCHAEOLOGY","Archaeological Dialogues","ANTIQUITY","Archaeological Prospection","The Journal of Island and Coastal Archaeology","Lithic Technology","GEOARCHAEOLOGY","Journal of Archaeological Science Reports","African Archaeological Review","JOURNAL OF ARCHAEOLOGICAL RESEARCH","ARCHAEOMETRY","Archaeological Research in Asia","European Journal of Archaeology","Mediterranean Archaeology & Archaeometry","ENVIRONMENTAL ARCHAEOLOGY","Advances in Archaeological Practice","Journal of African Archaeology","WORLD ARCHAEOLOGY","Anatolian Studies","CAMBRIDGE ARCHAEOLOGICAL JOURNAL","JOURNAL OF SOCIAL ARCHAEOLOGY","AMERICAN JOURNAL OF ARCHAEOLOGY","Trabajos de Prehistoria","Azania-Archaeological Research in Africa","AUSTRALIAN ARCHAEOLOGY","ARCHAEOLOGY IN OCEANIA","JOURNAL OF MATERIAL CULTURE","LATIN AMERICAN ANTIQUITY","Rock Art Research","Levant","Journal of Mediterranean Archaeology","International Journal of Historical Archaeology","Open Archaeology","Bulletin of the American Schools of Oriental Research","STUDIES IN CONSERVATION","HISTORICAL ARCHAEOLOGY","ACTA ARCHAEOLOGICA","Ancient Mesoamerica","Oxford Journal of Archaeology","Asian Perspectives-The Journal of Archaeology for Asia and the Pacific","NEAR EASTERN ARCHAEOLOGY","Journal of Roman Archaeology","Medieval Archaeology","Palestine Exploration Quarterly","JOURNAL OF NEAR EASTERN STUDIES","Norwegian Archaeological Review","Praehistorische Zeitschrift","Journal of Maritime Archaeology","Archeologicke Rozhledy","Archivo Espanol de Arqueologia","Journal of the British Archaeological Association","ZEITSCHRIFT DES DEUTSCHEN PALASTINA-VEREINS","Estudios Atacamenos","Arheoloski Vestnik","ARCHAEOFAUNA","Britannia","Complutum","ANTHROPOZOOLOGICA","Zeitschrift fur Assyriologie und Vorderasiatische Archaologie","Industrial Archaeology Review","NORTH AMERICAN ARCHAEOLOGIST","Arqueologia","Post-Medieval Archaeology","JOURNAL OF EGYPTIAN ARCHAEOLOGY","OLBA","Conservation and Management of Archaeological Sites","Boletin del Museo Chileno de Arte Precolombino","Public Archaeology","Belleten","Akkadica","Archaologisches Korrespondenzblatt","Adalya","Vjesnik za arheologiju i povijest dalmatinsku","Aula Orientalis","BULLETIN MONUMENTAL","ZEITSCHRIFT FUR AGYPTISCHE SPRACHE UND ALTERTUMSKUNDE","TRANSACTIONS OF THE ANCIENT MONUMENTS SOCIETY","JOURNAL OF WORLD PREHISTORY","INTERNATIONAL JOURNAL OF OSTEOARCHAEOLOGY","Estonian Journal of Archaeology","ARCHAEOLOGY","Journal of Historic Buildings and Places","Arabian archaeology and epigraphy","Archaeological Reports","Archaeologies","Archäologisches Korrespondenzblatt","Archeologické rozhledy","ArchéoSciences","Archivo Español de Arqueología","Arheološki vestnik","Arqueología","Asian perspectives","Aula orientalis: revista de estudios del Próximo Oriente Antiguo","Azania Archaeological Research in Africa","Boletín del Museo Chileno de Arte Precolombino","Fornvännen","Hesperia The Journal of the American School of Classical Studies at Athens","Intersecciones en antropología","Iran","Israel exploration journal","Mediterranean Archaeology & Archaeometry. International Journal/Mediterranean archaeology and archaeometry. International scientific journal","Opuscula Annual of the Swedish Institutes at Athens and Rome","Památky archeologické","Rossiiskaia arkheologiia","Tel Aviv","The Annual of the British School at Athens","The Bulletin of the American Society of Papyrologists","The International Journal of Nautical Archaeology","The journal of egyptian archaeology","The Journal of Island and Coastal Archaeology","The Journal of Juristic Papyrology","The South African Archaeological Bulletin","Time and Mind","Trabajos de Prehistoria","Transactions of The Ancient Monuments Society","Vjesnik za arheologiju i povijest dalmatinsku","Zeitschrift des Deutschen Palästina-Vereins","Zeitschrift für Ägyptische Sprache und Altertumskunde","Zeitschrift für Assyriologie und Vorderasiatische Archäologie","Zephyrvs")

journals_marwick <- toTitleCase(tolower(journals_marwick))
journals_marwick
sources_extract <- oa_fetch(
  entity = "sources",
  display_name = journals_marwick,
  type = "Journal",
  count_only = FALSE,
  options = list(sort = "display_name:desc"),
  verbose = TRUE
)

sources_extract %>% distinct(display_name, .keep_all = TRUE)
Journal_metrics = sources_extract$summary_stats
h_index <- as.numeric(sapply(Journal_metrics, function(x) x[2]))
twoyr_mean_citedness <- round(as.numeric(sapply(Journal_metrics, function(x) x[1])), 2)
i10 <- as.numeric(sapply(Journal_metrics, function(x) x[3]))

id_all = sources_extract$id
sources_extract$id = gsub('https://openalex.org/', '', sources_extract$id)
sources_extract_simple <- sources_extract %>%
  mutate(twoyr_mean_citedness = twoyr_mean_citedness, h_index = h_index, i10 = i10, OpenAlex_id = id) %>%
  distinct(display_name, .keep_all = TRUE) %>%
  arrange(`display_name`)

sources_extract_simple <- sources_extract_simple %>% select(display_name, host_organization_name, works_count, cited_by_count, twoyr_mean_citedness, h_index, i10, OpenAlex_id)
saveRDS(sources_extract_simple, "data/sources_extract_simple.rds")
write.csv(sources_extract_simple, "data/sources_extract_simple.csv")
jci_top_25_2ymc <-
  sources_extract_simple %>%
  arrange(desc(`twoyr_mean_citedness`)) %>%
  slice(1:25)

id_25_2ymc = gsub("https://openalex.org/", "", jci_top_25_2ymc$OpenAlex_id)

jci_top_25_hindex <-
  sources_extract_simple %>%
  arrange(desc(`h_index`)) %>%
  slice(1:25)

jci_top_25_i10 <-
  sources_extract_simple %>%
  arrange(desc(`i10`)) %>%
  slice(1:25)
While requesting OpenAlex with the list of 108 journals from WoS, I received only 38 results. This is due to variations in journal names, such as the use of capital letters, dashes, etc. When I manually adjusted the names to match those in OpenAlex for the journals used in Marwick's top 25 and top 20 lists, I received 69 results, including all the journals from these lists.
To gather information from OpenAlex for as many journals as possible, I had to check each journal individually by its name or sometimes its ISSN. This was necessary because special characters in journal names have been removed from the WoS dataset, many journals whose titles begin with "The" appear in the WoS dataset without the "The", and similar issues. For example, in WoS a journal is called 'Hesperia' when it is called 'Hesperia The Journal of the American School of Classical Studies at Athens' in OpenAlex. Unfortunately, this task of extracting journals is not straightforward, and it would be much more efficient if journals in OpenAlex were also assigned fields and subfields that could be queried. Ultimately, I successfully retrieved data through openalexR and the API for 85 of the 108 journals listed in WoS. However, increasing the count from 69 to 85 did not alter the top 25, and the still-missing journals would likely not have been in the top 25 either, given that they are not major journals.
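The matching problem can be illustrated with a hypothetical normalization helper (the actual matching described above was done manually, sometimes via ISSN; this only sketches the idea):

```r
library(stringr)

# Hypothetical normalizer: lower-case, drop a leading "The",
# replace punctuation by spaces, collapse repeated spaces
normalize_journal_name <- function(x) {
  x <- str_to_lower(x)
  x <- str_remove(x, "^the ")
  x <- str_replace_all(x, "[^a-z ]", " ")
  str_squish(x)
}

# The WoS and OpenAlex variants of the same journal collapse to one string
normalize_journal_name("AZANIA-ARCHAEOLOGICAL RESEARCH IN AFRICA")
normalize_journal_name("Azania Archaeological Research in Africa")
```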
Once the 85 journals' information is extracted, it is possible to compare the OpenAlex and WoS datasets at the journal level. First, the observation that initially struck me, that there are only 108 journals in the Archaeology category of WoS, is unfortunately difficult to compare with OpenAlex since I cannot extract all journals by the subfield Archaeology. I therefore filtered the OpenAlex dataset of all articles published between 1975 and 2025 with the subfield Archaeology and grouped them by source (removing some sources which are not journals but online repositories or deleted journals). This treatment, which probably closely mimics an extraction of all archaeological journals, retrieves 195 sources, almost twice the number of WoS archaeological journals. This difference stems from the fact that being indexed in WoS requires an application by the journal and is decided by Clarivate based on its own criteria.
The WoS dataset is also small in terms of papers listed, even when comparing journals present in both databases. It is interesting to demonstrate this by looking at the journals removed from Marwick's top 25 list because they had fewer than 100 articles: Archaeological Dialogues (94 papers in WoS), Journal of World Prehistory (63 papers in WoS), and Lithic Technology (60 papers in WoS) have respectively 700, 328, and 778 papers in the OpenAlex database.
The metadata of the two datasets also differ. OpenAlex provides many more variables and much more information on the selected works or journals than WoS. The main issue with both datasets is the lack of information regarding books, book chapters, monographs, and grey literature, both as scientific production recorded in the database and as references cited in the articles.
Since these datasets are not the same size and do not have exactly the same quality on the same variables, I cleaned the OpenAlex data as much as possible, as Marwick did for his data.
The next step was to extract from the 85 journals the top 25 journals based on their 2-year mean citedness (2ymc), which is equivalent to the Impact Factor. This list can then be compared with the list from WoS.
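With made-up numbers, the 2-year mean citedness (equivalent to the Impact Factor) of a journal for a year Y is computed as:

```r
# 2-year mean citedness for year Y (made-up numbers):
# citations received in Y by papers the journal published in Y-1 and Y-2,
# divided by the number of papers it published in Y-1 and Y-2
citations_in_Y_to_last_two_years <- 950
papers_in_last_two_years <- 250
citations_in_Y_to_last_two_years / papers_in_last_two_years # 3.8
```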
Both lists are at once similar and quite different. I can outline some specific points:
As with the WoS list, all 25 journals are English-language journals.
The journal Australian Archaeology ranks #1 with a 2ymc of 7.74. This is very surprising!
American Antiquity is missing from the list because it has a rather low 2ymc in OpenAlex and is therefore ranked 45th. Its OpenAlex h-index is, on the other hand, very high; is this an error in the OpenAlex database?
All 25 journals have far more than 100 works in the OpenAlex dataset, so we can keep them all under Marwick's criterion of retaining only journals with at least 100 papers for further analysis. The minimum here is 316 works, for Journal of Archaeological Research.
Other fairly strong discrepancies (>20 ranks) between the 2ymc rankings from OpenAlex and the IF rankings from Web of Science can be detected for the journals Levant, Lithic Technology, Palestine Exploration Quarterly, and Studies in Conservation.
Since the OpenAlex dataset also contains other journal metrics, h-index and i10, I also built the top-25 rankings on these metrics (Table 2). The h-index is defined as the number of papers (h) with citation number ≥ h (Hirsch, 2005). The i10 index, initially created by Google Scholar, is the number of articles that have been cited at least 10 times.
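Both metrics reduce to simple operations on a vector of per-paper citation counts; a small sketch with made-up data:

```r
# Made-up citation counts for one journal's papers
cites <- c(45, 30, 22, 18, 12, 9, 7, 4, 2, 0)

# h-index: the largest h such that h papers have at least h citations each
sorted <- sort(cites, decreasing = TRUE)
h_index <- sum(sorted >= seq_along(sorted)) # 7

# i10: number of papers cited at least 10 times
i10 <- sum(cites >= 10) # 5
```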
I attempted to download all articles from OpenAlex in the subfield of Archaeology (number 1204) by accessing the API through an internet browser at this address: https://api.openalex.org/works?filter=type:article,from_publication_date:1975-01-01,to_publication_date:2025-12-31,topics.subfield.id:1204. This results in a gigantic list of almost 1.8 million references that can only be viewed 25 entries at a time. Downloading it as JSON only yields a single page of the first 25, or at best 100, results, which is not feasible manually. For such large queries, the OpenAlex team recommends downloading their full dataset, a roughly 300 GB JSON snapshot. I did not attempt this route, as I am unsure I could handle such a large file, given my lack of experience in manipulating huge JSON files.
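For reference, the API pages through large result sets with a cursor, which openalexR handles internally; a sketch of the mechanism (capped at 3 pages here, since retrieving all ~1.8 million records this way would require thousands of requests):

```r
library(jsonlite)

# Each response carries meta$next_cursor, which is passed back to the API
# until the result set is exhausted (cursor = "*" starts the paging)
base_url <- paste0("https://api.openalex.org/works?filter=type:article,",
                   "topics.subfield.id:1204&per-page=200&cursor=")
cursor <- "*"
pages <- list()
while (!is.null(cursor) && length(pages) < 3) {
  page <- fromJSON(paste0(base_url, cursor))
  pages <- c(pages, list(page$results))
  cursor <- page$meta$next_cursor
}
```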
# This chunk of code is long to run, so it is set to eval: false; run it only once,
# and at the end it saves the data to disk so that it does not need to be run
# each time you render the Quarto document.
# Here we just count the number of articles in the subfield Archaeology
works_search_count <- oa_fetch(
  entity = "works",
  type = "article",
  topics.subfield.id = "1204", # 1204 is the id of the subfield Archaeology
  count_only = TRUE,
  output = "dataframe",
  from_publication_date = "1975-01-01",
  to_publication_date = "2025-12-31",
  verbose = TRUE
)
When requesting works with openalexR, it is possible to use an entire subfield ID, in this case 1204 for Archaeology. Simply counting the number of articles in the subfield Archaeology in OpenAlex between 1975 and 2025 yields 1,794,305 results. Given the size of this sample, I will not download the complete dataset for this request.
I downloaded the data for each article only for the OpenAlex top 25 journals based on 2-year mean citedness. This process is time-consuming, as it involves substantial requests to the API. Then, to replicate the figures from Marwick's study, I downloaded all the information on the cited papers for each article. This particular step took even longer, requiring approximately a dozen hours.
The data was not extracted, and the graphics were not produced, for the top 25 journals based on the other available metrics, h-index and i10. These metrics are strongly correlated with the seniority of the journals, as when they are used to compare individual researchers, and it would have required downloading all the data for the 12 additional journals that rank in the top 25 on these other metrics.
# very time consuming, but result saved in .rds
# The code below downloads several batches of data, only for the top 25 journals identified
# by their 2ymc value, so that no single request is too big, and then binds them all
# together in a single tibble with only the necessary variables.
works_search_1975_1985 <- oa_fetch(
  entity = "works",
  type = "article",
  journal = id_25_2ymc, # id_25_2ymc is the list of top 25 2ymc journals' OpenAlex ids
  count_only = FALSE,
  output = "dataframe",
  from_publication_date = "1975-01-01",
  to_publication_date = "1985-12-31",
  verbose = TRUE
)
works_search_1986_1995 <- oa_fetch(
  entity = "works",
  type = "article",
  journal = id_25_2ymc,
  count_only = FALSE,
  output = "dataframe",
  from_publication_date = "1986-01-01",
  to_publication_date = "1995-12-31",
  verbose = TRUE
)
works_search_1996_2000 <- oa_fetch(
  entity = "works",
  type = "article",
  journal = id_25_2ymc,
  count_only = FALSE,
  output = "dataframe",
  from_publication_date = "1996-01-01",
  to_publication_date = "2000-12-31",
  verbose = TRUE
)
works_search_2001_2003 <- oa_fetch(
  entity = "works",
  type = "article",
  journal = id_25_2ymc,
  count_only = FALSE,
  output = "dataframe",
  from_publication_date = "2001-01-01",
  to_publication_date = "2003-12-31",
  verbose = TRUE
)
works_search_2004_2007 <- oa_fetch(
  entity = "works",
  type = "article",
  journal = id_25_2ymc,
  count_only = FALSE,
  output = "dataframe",
  from_publication_date = "2004-01-01",
  to_publication_date = "2007-12-31",
  verbose = TRUE
)
works_search_2008_2011 <- oa_fetch(
  entity = "works",
  type = "article",
  journal = id_25_2ymc,
  count_only = FALSE,
  output = "dataframe",
  from_publication_date = "2008-01-01",
  to_publication_date = "2011-12-31",
  verbose = TRUE
)
works_search_2012_2015 <- oa_fetch(
  entity = "works",
  type = "article",
  journal = id_25_2ymc,
  count_only = FALSE,
  output = "dataframe",
  from_publication_date = "2012-01-01",
  to_publication_date = "2015-12-31",
  verbose = TRUE
)
works_search_2016_2019 <- oa_fetch(
  entity = "works",
  type = "article",
  journal = id_25_2ymc,
  count_only = FALSE,
  output = "dataframe",
  from_publication_date = "2016-01-01",
  to_publication_date = "2019-12-31",
  verbose = TRUE
)
works_search_2020_2023 <- oa_fetch(
  entity = "works",
  type = "article",
  journal = id_25_2ymc,
  count_only = FALSE,
  output = "dataframe",
  from_publication_date = "2020-01-01",
  to_publication_date = "2023-12-31",
  verbose = TRUE
)
works_search_2024_2025 <- oa_fetch(
  entity = "works",
  type = "article",
  journal = id_25_2ymc,
  count_only = FALSE,
  output = "dataframe",
  from_publication_date = "2024-01-01",
  to_publication_date = "2025-12-31",
  verbose = TRUE
)
# Combining the data together
liste_tibbles <- list(works_search_1975_1985, works_search_1986_1995, works_search_1996_2000,
                      works_search_2001_2003, works_search_2004_2007, works_search_2008_2011,
                      works_search_2012_2015, works_search_2016_2019, works_search_2020_2023,
                      works_search_2024_2025)

# Define the variables we want to keep
variables_to_keep <- c("id", "title", "authorships", "doi", "publication_year", "source_display_name",
                       "host_organization_name", "referenced_works_count", "first_page", "last_page",
                       "abstract", "referenced_works")

# Apply the selection to the giant list of papers extracted from OpenAlex
# and clean it (titles, page numbers, preparing authors' lists, removing duplicates)
Works_top25_1975_2025 <- bind_rows(lapply(liste_tibbles, function(x) {
  x %>% select(all_of(variables_to_keep)) })) %>%
  mutate(title = str_trim(title)) %>% # delete leading/trailing spaces
  mutate(title = str_to_lower(title)) %>% # put titles in lower case
  mutate(title = str_remove(title, "<i>")) %>%
  mutate(title = str_remove(title, "</i>")) %>%
  mutate(first_page = str_remove(first_page, "P")) %>%
  mutate(last_page = str_remove(last_page, "P")) %>%
  mutate(referenced_works = str_remove(referenced_works, "\n")) %>%
  mutate(authors = map_chr(authorships, ~ paste(.x$display_name, collapse = ", "))) %>%
  distinct(title, .keep_all = TRUE) # removes ca. 1477 works present more than once (based on title)
# Prepare dataframe exactly as in Marwick 2025
items_df = dplyr::tibble(
  authors = Works_top25_1975_2025$authors,
  authors_n = map_int(Works_top25_1975_2025$authorships, ~ nrow(.x)),
  title = Works_top25_1975_2025$title,
  title_n = sapply(strsplit(Works_top25_1975_2025$title, "\\s+"), length),
  journal = Works_top25_1975_2025$source_display_name,
  abstract = Works_top25_1975_2025$abstract,
  refs = Works_top25_1975_2025$referenced_works,
  refs_n = Works_top25_1975_2025$referenced_works_count,
  pages_n = as.numeric(Works_top25_1975_2025$last_page) - as.numeric(Works_top25_1975_2025$first_page),
  year = Works_top25_1975_2025$publication_year,
  doi = Works_top25_1975_2025$doi
)

items_df = items_df %>% filter(journal %in% jci_top_25_2ymc$display_name)

# save to disk so we can re-use it for the next steps to save time
saveRDS(items_df, "data/OpenAlex_Works_top25_1975_2025.rds")
# very time consuming, but result saved in .rds
# Beware: this takes tens of hours (49k requests to the API) and should not be run if not really necessary
# Use the result year_refs_list.rds instead
# Thus, I set eval: false on this chunk so that it does not run when rendering the Quarto document!
items_df = readRDS("data/OpenAlex_Works_top25_1975_2025.rds")

# Keeping only the OpenAlex ID of each ref in the refs variable
items_df = items_df %>%
  mutate(refs = str_remove_all(refs, "https://openalex.org/")) %>%
  mutate(refs = str_remove(refs, "^c\\(")) %>%
  mutate(refs = str_remove_all(refs, '"')) %>%
  mutate(refs = str_remove_all(refs, '\\)'))

# Creating the lists before the loop
year_refs_list <- list()
journal_refs_list <- list()

# Loop for getting the list of all refs for each paper and keeping their publication years and sources
# (very very long)
for (i in 1:nrow(items_df)) {
  year_refs_search <- tryCatch({
    oa_fetch(
      entity = "works",
      identifier = strsplit(items_df$refs[[i]], ", ")[[1]],
      count_only = FALSE,
      output = "dataframe",
      verbose = TRUE,
      mailto = "alain.queffelec@u-bordeaux.fr"
    )
  }, error = function(e) {
    message(paste("Error fetching data for row", i, ": ", e$message))
    NULL
  })
  if (!is.null(year_refs_search)) {
    year_refs <- year_refs_search$publication_year
    journals <- year_refs_search$source_display_name
    year_refs_list[[i]] <- year_refs
    journal_refs_list[[i]] <- journals
  }
}

# Save the lists on the disk
saveRDS(year_refs_list, "year_refs_list.rds")
saveRDS(journal_refs_list, "journal_refs_list.rds")
convert_list_to_string <- function(l) {
  if (is.logical(l) && length(l) == 1 && is.na(l)) {
    return(NA_character_)
  }
  paste(l, collapse = ", ")
}

# Apply the function to the list column
items_df_nolist <- items_df
items_df_nolist$refs <- sapply(items_df$refs, convert_list_to_string)

items_df_nolist_fewrefs <- items_df_nolist %>%
  filter(refs_n %in% c(0, 1, 2)) %>%
  group_by(refs_n) %>%
  ungroup()

# Export to Excel
library(writexl)
write_xlsx(items_df_nolist_fewrefs, "data/sampled_data_0_1_2.xlsx")
library(tidyverse)

# Load the .rds files created by the very time consuming chunks of code run above in the Quarto document
items_df <- readRDS("data/OpenAlex_Works_top25_1975_2025.rds")
year_refs_list = readRDS("data/year_refs_list.rds")
journal_refs_list = readRDS("data/journal_refs_list.rds")

# Simplify the journal names so that there are no more periods, commas, etc.
journal_refs_list <- lapply(journal_refs_list, function(x) gsub("[^a-zA-Z]", "", x))
journal_refs_list <- lapply(journal_refs_list, function(x) gsub("[\\(]", "", x))

# Add the year_refs and journal_refs to items_df
for (i in 1:nrow(items_df)){
  items_df$year_refs[i] = paste(year_refs_list[[i]], collapse = ", ")
  items_df$journal_refs[i] = paste(journal_refs_list[[i]], collapse = ", ")
}
# Get the OpenAlex IDs alone for each ref and split
items_df = items_df %>%
  mutate(refs = str_remove_all(refs, "https://openalex.org/")) %>%
  mutate(refs = str_remove(refs, "^c\\(")) %>%
  mutate(refs = str_remove_all(refs, '"')) %>%
  mutate(refs = str_remove_all(refs, '\\)')) %>%
  mutate(refs = str_remove_all(refs, '\n')) %>%
  mutate(journal_refs = str_remove_all(journal_refs, " ,"))
# Compute the lengths of refs and journal_refs, which sometimes do not match
# due to some absences in the OpenAlex dataset
length_IDrefs = lapply(items_df$refs, function(x) {
  split_result <- strsplit(x, ", ")[[1]]
  if (length(split_result) == 1 && split_result[1] == "") {
    0
  } else {
    length(split_result)
  }
})
length_journal_refs <- lapply(items_df$journal_refs, function(x) {
  split_result <- strsplit(x, ", ")[[1]]
  if (length(split_result) == 1 && split_result[1] == "") {
    0
  } else {
    length(split_result)
  }
})

items_df$length_IDrefs = length_IDrefs
items_df$length_journal_refs = length_journal_refs
items_df$refs_mod = items_df$refs
items_df$journal_refs_mod = items_df$journal_refs
generate_unique_false_ref <- function() {
  # generate false OpenAlex IDs beginning with an R for Reference instead of W so that we can detect them
  paste0("R", paste0(sample(0:9, 10, replace = TRUE), collapse = ""))
}
generate_unique_false_journal <- function() {
  # generate false OpenAlex IDs beginning with a J for Journal so that we can detect them
  paste0("J", paste0(sample(0:9, 10, replace = TRUE), collapse = ""))
}

for (i in 1:nrow(items_df)) {
  len_diff = items_df$length_IDrefs[[i]][1] - items_df$length_journal_refs[[i]][1]
  if (len_diff != 0) {
    if (len_diff > 0 & items_df$length_journal_refs[i] > 0) {
      # Create false unique journals if necessary
      false_journals <- tibble(replicate(len_diff, generate_unique_false_journal()))
      items_df$journal_refs_mod[i] <- rbind(items_df$journal_refs_mod[i], false_journals)
    } else if (len_diff < 0) {
      # Create false unique refs if necessary
      false_refs <- replicate(abs(len_diff), generate_unique_false_ref())
      items_df$refs_mod[i] <- rbind(items_df$refs_mod[i], false_refs)
    } else {
      false_journals <- tibble(replicate(len_diff, generate_unique_false_journal()))
      false_refs <- replicate(abs(len_diff), generate_unique_false_ref())
      items_df$journal_refs_mod[i] <- false_journals
      items_df$refs_mod[i] <- false_refs
    }
  }
}
items_df = items_df %>%
  mutate(journal_refs_mod = str_remove(journal_refs_mod, "^c\\(")) %>%
  mutate(journal_refs_mod = str_remove_all(journal_refs_mod, '"')) %>%
  mutate(journal_refs_mod = str_remove_all(journal_refs_mod, '\\)')) %>%
  mutate(journal_refs_mod = str_remove_all(journal_refs_mod, ' ,'))

items_df = items_df %>% filter(title_n != 0,
                               authors_n > 0,
                               pages_n != 0,
                               pages_n < 1000,
                               pages_n > 0)
As previously demonstrated at a broader level than archaeology alone, the OpenAlex dataset is significantly larger than the WoS dataset but has limitations regarding some metadata, especially the lists of references cited in the articles (Alperin et al., 2024; Culbert et al., 2025). Nevertheless, for this specific subfield, here is what I observe:
The WoS dataset of archaeological articles is much smaller than the OpenAlex dataset: 28,871 compared to 1,788,444. OpenAlex aggregates data from a much wider diversity of sources.
The WoS dataset is cleaner than the OpenAlex dataset: from the latter I had to remove duplicates and papers lacking author information, page numbers, reference lists, etc.
The WoS dataset is not without its issues either. Upon examining the data produced during the preparation of Marwick’s manuscript, problems remain even after cleaning by his code, due to discrepancies in the structure of the references. For instance, in the first 4 lines, entries like “11swmuspap” or “1964uclaarchsurv” appear as journals; because of the residual digits, these will of course not match other mentions of the same journals. Additionally, even among the top-cited journals used for calculating the Shannon indices, there are problems in the WoS dataset: the list includes entries such as “[anonymous]thesis”, “” (empty cells), “notitlecaptured”, etc.
Both datasets lack data which have tremendous importance in archaeology: books, book chapters and grey literature. I extracted only articles in this study, as Marwick did, but this does not fully represent the scientific production of the discipline.
When comparing the metadata of articles present in both datasets, only 4,665 articles are shared, matched by DOI. For most metadata, the similarity between WoS and OpenAlex is very strong (Figure 1 A, C, and D). The length of the title, one of the metrics used later in the study, is significantly longer in WoS than in OpenAlex (Figure 1 B), but the main issue is clearly the number of references (Figure 1 E), the weak point of OpenAlex already mentioned in the literature (Culbert et al., 2025). Many articles with 0 references in OpenAlex have more than 100 references listed in WoS! This will have consequences in the subsequent analysis, see below.
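The residual-digit problem in the WoS journal strings noted above could, in principle, be reduced with one extra cleaning step. Here is a minimal sketch (my own suggestion, not part of Marwick's pipeline; `clean_journal` is a hypothetical helper):

```r
library(stringr)

# Hypothetical extra cleaning step (not in Marwick's code): strip leading
# volume/year digits so that e.g. "1964uclaarchsurv" matches "uclaarchsurv"
clean_journal <- function(x) str_remove(x, "^[0-9]+")

clean_journal(c("1964uclaarchsurv", "11swmuspap", "antiquity"))
```

Such a step would still need manual checking, since stripping leading digits could in rare cases merge legitimately distinct keys.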
articles_Wos = readRDS("data/wos-data-df.rds")
articles_OpenAlex = items_df
articles_OpenAlex$doi <- sub("https://doi.org/", "", articles_OpenAlex$doi)

articles_OpenAlex_filtre <- articles_OpenAlex[articles_OpenAlex$doi %in% articles_Wos$doi, ]

colnames(articles_OpenAlex_filtre) <- paste0(colnames(articles_OpenAlex_filtre), "_OA")
colnames(articles_Wos) <- paste0(colnames(articles_Wos), "_WoS")

Common_Articles_limited <- merge(articles_OpenAlex_filtre, articles_Wos, by.x = "doi_OA", by.y = "doi_WoS") %>%
  select(-authors_OA, -title_OA, -journal_OA, -abstract_OA, -refs_OA,
         -authors_WoS, -title_WoS, -journal_WoS, -abstract_WoS, -refs_WoS)

# Correcting a single error in OpenAlex which creates an outlier preventing reading the plot
# (the length of the article is 326 in the Taylor & Francis metadata but, checking the pdf of
# the real article, it should be 27, as recorded in WoS)
Common_Articles_limited <- Common_Articles_limited %>%
  mutate(pages_n_OA = replace(pages_n_OA, pages_n_OA == 326, 27))
# Create function for biplots
create_biplot <- function(data, var1, var2) {
  ggplot(data, aes_string(x = var1, y = var2)) +
    geom_point(color = "steelblue", size = 3, alpha = 0.7) +
    geom_abline(intercept = 0, slope = 1, color = "darkred", linetype = "dashed") +
    labs(x = var1, y = var2, title = paste(var1, "vs", var2)) +
    theme_minimal() +
    theme(
      plot.title = element_text(hjust = 0.5, face = "bold"),
      axis.title = element_text(face = "bold"),
      panel.grid.major = element_line(color = "gray90", size = 0.2),
      panel.grid.minor = element_blank(),
      panel.border = element_rect(color = "gray70", fill = NA, size = 0.5)
    )
}

p_authors_n <- create_biplot(Common_Articles_limited, "authors_n_OA", "authors_n_WoS")
p_authors_n <- p_authors_n + labs(title = NULL)
p_title_n <- create_biplot(Common_Articles_limited, "title_n_OA", "title_n_WoS")
p_title_n <- p_title_n + labs(title = NULL)
p_refs_n <- create_biplot(Common_Articles_limited, "refs_n_OA", "refs_n_WoS")
p_refs_n <- p_refs_n + labs(title = NULL)
p_pages_n <- create_biplot(Common_Articles_limited, "pages_n_OA", "pages_n_WoS")
p_pages_n <- p_pages_n + labs(title = NULL)
p_year_n <- create_biplot(Common_Articles_limited, "year_OA", "year_WoS")
p_year_n <- p_year_n + labs(title = NULL)

# Use cowplot to organize the plots
top_row <- plot_grid(p_authors_n, p_title_n, ncol = 2, labels = c("A", "B"))
bottom_row <- plot_grid(p_pages_n, p_year_n, ncol = 2, labels = c("C", "D"))
final_grid <- plot_grid(top_row, bottom_row, ncol = 1, rel_heights = c(1, 1))
final_grid <- plot_grid(final_grid, p_refs_n, ncol = 1, rel_heights = c(2, 1.5), labels = c("", "E"))

# Display all graphs
print(final_grid)
The goal here is to replicate the figures from Marwick (2025) using data from OpenAlex. This requires some effort to prepare the extensive list of papers, including the information on the references they cite. However, in the end, it works well for creating the boxplot figure.
n_articles <- nrow(items_df)
year_max <- max(items_df$year)
year_min <- min(items_df$year)

items_df_2012 <-
  items_df %>%
  filter(year == 2012)

# how many archaeology articles in 2012
n_items_df_2012 <- nrow(items_df_2012)

# how many after 2012?
n_items_df_after_2012 <-
  items_df %>%
  filter(year %in% 2013:year_max) %>%
  nrow()

# what proportion of archaeology articles published after 2012
prop_pub_after2012 <- n_items_df_after_2012 / n_articles
As of the writing of this document (June 2025), and after automatically cleaning the dataset extracted from OpenAlex, there are 38,782 unique articles from 1975 to 2025 in the top 25 journals (identified by their 2-year mean citedness) for which the variables necessary to replicate Marwick’s results are available. Among these, 10,415 papers have zero references, 3,222 of which are from Antiquity; I suspect these are mainly book reviews. 1,183 works are attributed to Australian Archaeology, some not even a page long, among which I found a short poem, a communication from the journal to its readers, and even the obituary of François Bordes. These works should not, in my opinion, be considered as ‘articles’ by OpenAlex. Another 1,086 of the zero-reference papers are articles either from Palestine Exploration Quarterly, whose references are all in footnotes and therefore not counted (n = 319), or, for the most part, from Japanese journals (n = 767), most of them from the Japan Society of Political Economy, which is of course a serious problem. I therefore confirm that all papers with 0 references need to be removed before the following statistical treatments.
The number of works with only one or two references drops markedly, to 1,183 and 1,057 respectively, but remains unexpectedly high to me. In these categories, Antiquity is again largely dominant, with 539 and 409 papers respectively. I extracted a random sample of 30 works with one citation and another of 30 works with two citations, and checked each of them manually. In the one-reference sample, most works are book reviews or short articles. A single reference is sometimes the reality, but most of these works should have listed more references. The missing references are usually conference papers, book chapters, or publications in languages other than English (e.g. Russian, German). The two-reference sample is also quite poor, since only eight of the 30 works really cite two references. In reality, both lists of 30 works having one or two references in OpenAlex should have had a mean of more than 7 references!
Given the issues with the references listed in OpenAlex, I think that some metrics calculated from this dataset will not be accurate. This is particularly the case for the diversity of sources, since so many sources are not considered at all, and the missing references certainly do not all come from a single journal. One can even suspect that references to major journals are correctly recorded thanks to their DOIs, while books, book chapters and conference proceedings are probably under-recorded due to the absence of such permanent identifiers. This will of course lead the final diversity-of-sources result to be strongly underestimated.
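To make the direction of this bias concrete, here is a minimal sketch of the Shannon index with hypothetical citation counts (invented for illustration, not drawn from either dataset):

```r
# Shannon's index H = -sum(p_i * ln(p_i)) over the journals cited by one article
shannon <- function(counts) {
  p <- counts / sum(counts)
  -sum(p * log(p))
}

# An article citing 10 different journals twice each (high diversity)
full_refs <- rep(2, 10)

# The same article as indexed: references to books, chapters and non-English
# sources dropped, leaving only 3 well-identified journals
indexed_refs <- c(2, 2, 2)

shannon(full_refs)     # ln(10), about 2.30
shannon(indexed_refs)  # ln(3), about 1.10
```

The index drops from ln(10) to ln(3) simply because the indexing lost references, not because the article actually cites fewer distinct sources.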
The entire code of Marwick (2025) can then be executed with only a few modifications.
library(ggrepel)
source("code/001-redraw-Fanelli-and-Glanzel-Fig-2.R")

base_size <- 6
color <- c('#d95f02', '#7570b3', '#1b9e77')
alpha <- 0.2
linewidth <- 0.1
# Number of authors ------------------
boxlplot_n_authors <-
  ggplot() +
  # boxplot of data from this study
  geom_boxplot(data = items_df %>%
                 filter(!is.na(year)),
               aes(1, log(authors_n)),
               size = 1) +
  # boxplot of data from Fanelli & Glänzel Fig 2
  geom_boxplot(data = sim_data %>%
                 filter(Variable == "N. of authors (ln)",
                        Category %in% c("h", "p", "s")),
               aes(1, Value,
                   group = Category),
               size = 1,
               fill = color,
               colour = color,
               alpha = alpha,
               linewidth = linewidth) +
  # annotations from Fanelli & Glänzel Fig 2
  geom_text_repel(data = sim_data %>%
                    filter(Variable == "N. of authors (ln)",
                           Category %in% c("h", "p", "s")) %>%
                    group_by(Category) %>%
                    summarise(y = median(Value)) %>%
                    mutate(label = as.character(Category)),
                  aes(c(0.75, 1, 1.25), y, label = label),
                  color = color,
                  bg.colour = "white",
                  bg.r = .2,
                  force = 0) +
  scale_y_continuous(limits = c(0, 5)) +
  scale_x_continuous(labels = NULL) +
  theme_minimal(base_size = base_size) +
  theme(panel.grid = element_blank()) +
  ylab("N. of authors (ln)") +
  xlab("Collaborator group size")
# Relative title length ----------------
items_df_title <-
  items_df %>%
  filter(!is.na(pages_n)) %>%
  filter(!is.na(title_n)) %>%
  mutate(relative_title_length = log(title_n / pages_n))

boxlplot_rel_title_length <-
  items_df_title %>%
  filter(!is.na(year)) %>%
  ggplot(aes(1,
             relative_title_length)) +
  geom_boxplot(
    size = 1) +
  # boxplot of data from Fanelli & Glänzel Fig 2
  geom_boxplot(data = sim_data %>%
                 filter(Variable == "Relative title length (ln)",
                        Category %in% c("h", "p", "s")),
               aes(1, Value,
                   group = Category),
               size = 1,
               fill = color,
               colour = color,
               alpha = alpha,
               linewidth = linewidth) +
  # annotations from Fanelli & Glänzel Fig 2
  geom_text_repel(data = sim_data %>%
                    filter(Variable == "Relative title length (ln)",
                           Category %in% c("h", "p", "s")) %>%
                    group_by(Category) %>%
                    summarise(y = median(Value)) %>%
                    mutate(label = as.character(Category)),
                  aes(c(0.75, 1, 1.25), y, label = label),
                  bg.colour = "white",
                  colour = color,
                  bg.r = .2,
                  force = 0) +
  scale_y_continuous(limits = c(-4.5, 3),
                     breaks = seq(-5, 5, 1),
                     labels = seq(-5, 5, 1)) +
  scale_x_continuous(labels = NULL) +
  theme_minimal(base_size = base_size) +
  theme(panel.grid = element_blank()) +
  ylab("Ratio of title length to article length (ln)") +
  xlab("Relative title length")
# Number of pages ------------------
boxlplot_n_pages <-
  items_df %>%
  ggplot(aes(1,
             log(pages_n))) +
  geom_boxplot(
    size = 1) +
  # boxplot of data from Fanelli & Glänzel Fig 2
  geom_boxplot(data = sim_data %>%
                 filter(Variable == "N. of pages (ln)",
                        Category %in% c("h", "p", "s")),
               aes(1, Value,
                   group = Category),
               size = 1,
               fill = color,
               colour = color,
               alpha = alpha,
               linewidth = linewidth) +
  # annotations from Fanelli & Glänzel Fig 2
  geom_text_repel(data = sim_data %>%
                    filter(Variable == "N. of pages (ln)",
                           Category %in% c("h", "p", "s")) %>%
                    group_by(Category) %>%
                    summarise(y = median(Value)) %>%
                    mutate(label = as.character(Category)),
                  aes(c(0.75, 1, 1.25), y, label = label),
                  bg.colour = "white",
                  colour = color,
                  bg.r = .2,
                  force = 0) +
  scale_y_reverse(limits = c(5, 0)) +
  scale_x_continuous(labels = NULL) +
  theme_minimal(base_size = base_size) +
  theme(panel.grid = element_blank()) +
  ylab("N. of pages (ln)") +
  xlab("Article length")
# Price's index - age of references ------------------
library(stringr)

# output storage
prices_index <- vector("list", length = nrow(items_df))

# loop, this takes a moment
for(i in seq_len(nrow(items_df))){
  refs <- items_df$year_refs[i]
  year <- items_df$year[i]

  ref_years <-
    as.numeric(str_match(str_extract_all(refs, "[0-9]{4}")[[1]], "\\d{4}"))

  preceeding_five_years <-
    seq(year - 5, year, 1)

  refs_n_in_preceeding_five_years <-
    ref_years[ref_years %in% preceeding_five_years]

  prices_index[[i]] <-
    length(refs_n_in_preceeding_five_years) / length(ref_years)

  # for debugging
  # print(i)
}

prices_index <- flatten_dbl(prices_index)

# add to data frame
items_df$prices_index <- prices_index
# plot
boxlplot_price_index <-
  items_df %>%
  ggplot(aes(1,
             prices_index)) +
  geom_boxplot(
    size = 1) +
  # boxplot of data from Fanelli & Glänzel Fig 2
  geom_boxplot(data = sim_data %>%
                 filter(Variable == "Price's index",
                        Category %in% c("h", "p", "s")),
               aes(1, Value,
                   group = Category),
               size = 1,
               fill = color,
               colour = color,
               alpha = alpha,
               linewidth = linewidth) +
  # annotations from Fanelli & Glänzel Fig 2
  geom_text_repel(data = sim_data %>%
                    filter(Variable == "Price's index",
                           Category %in% c("h", "p", "s")) %>%
                    group_by(Category) %>%
                    summarise(y = median(Value)) %>%
                    mutate(label = as.character(Category)),
                  aes(c(0.75, 1, 1.25), y, label = label),
                  bg.colour = "white", colour = color,
                  bg.r = .2,
                  force = 0) +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(labels = NULL) +
  theme_minimal(base_size = base_size) +
  theme(panel.grid = element_blank()) +
  ylab("Prop. refs in last 5 years") +
  xlab("Recency of references")
# Shannon index - diversity of sources ------------------
# journal name as species, article as habitat
# simplify the refs, since they are a bit inconsistent, some of
# these steps take a few seconds
ref_list1 <- map(items_df$journal_refs_mod, ~tolower(.x))
ref_list2 <- map(ref_list1, ~str_split(.x, ", "))
ref_list3 <- map(ref_list2, ~tibble(x = .x))
ref_list4 <- bind_rows(ref_list3, .id = "id")
refs_mod <- tibble(refs_mod = str_split(items_df$refs_mod, ", "))
ref_list5 <- cbind(ref_list4, refs_mod)
ref_list6 <- unnest(ref_list5, cols = c(x, refs_mod)) %>%
  filter(!str_detect(refs_mod, "^R\\d{10}$"))

ref_list7 <-
  ref_list6 %>%
  rename(journal_name = x, x = refs_mod) %>%
  filter(x != "W4285719527") %>%
  filter(!journal_name %in% c("na", "choicereviewsonline", "pubmed", "hallecentrepourlacommunicationscientifiquedirecte", "doajdoajdirectoryofopenaccessjournals")) %>%
  filter(!str_detect(journal_name, "books"))

# prepare to compute shannon and join with other variables
items_df$id <- 1:nrow(items_df)
# In the Shannon index, p_i is the proportion (n/N) of individuals of one
# particular species (reference) found (n) divided by the total number of
# individuals found (N) in the article, ln is the natural log, Σ is the sum
# of the calculations, and s is the number of species.

# compute diversity of all citations for each article (habitat)
shannon_per_item_AQ <-
  ref_list7 %>%
  group_by(id, journal_name) %>%
  tally() %>%
  group_by(id) %>%
  mutate(p_i = n / sum(n, na.rm = TRUE)) %>%
  mutate(p_i_ln = log(p_i)) %>%
  summarise(shannon = -sum(p_i * p_i_ln, na.rm = TRUE)) %>%
  mutate(id = as.numeric(id)) %>%
  arrange(id) %>%
  left_join(items_df)
# plot
boxlplot_shannon_index_AQ <-
  shannon_per_item_AQ %>%
  filter(!is.na(year)) %>%
  filter(shannon > 0) %>%
  ggplot(aes(1,
             shannon)) +
  geom_boxplot(aes(colour = "red"),
               size = 1, show.legend = FALSE) +
  # boxplot of data from Fanelli & Glänzel Fig 2
  geom_boxplot(data = sim_data %>%
                 filter(Variable == "Shannon div. of sources",
                        Category %in% c("h", "p", "s")),
               aes(1, Value,
                   group = Category),
               size = 1,
               fill = color,
               colour = color,
               alpha = alpha,
               linewidth = linewidth) +
  # annotations from Fanelli & Glänzel Fig 2
  geom_text_repel(data = sim_data %>%
                    filter(Variable == "Shannon div. of sources",
                           Category %in% c("h", "p", "s")) %>%
                    group_by(Category) %>%
                    summarise(y = median(Value)) %>%
                    mutate(label = as.character(Category)),
                  aes(c(0.75, 1, 1.25), y, label = label),
                  colour = color,
                  bg.colour = "white",
                  bg.r = .2,
                  force = 0) +
  scale_y_reverse(limits = c(6, 0)) +
  scale_x_continuous(labels = NULL) +
  theme_minimal(base_size = base_size) +
  theme(panel.grid = element_blank()) +
  ylab("Shannon Index") +
  xlab("Diversity of sources")
plot_grid(boxlplot_n_authors,
          boxlplot_rel_title_length,
          boxlplot_n_pages,
          boxlplot_price_index,
          boxlplot_shannon_index_AQ,
          nrow = 2)
Figure 2 presents boxplots for archaeological journals (in black) that are quite similar to those in Marwick (2025). In my study, the number of authors and the article length are closer to social sciences than to physics. The relative title length is very similar to the WoS data, although again slightly closer to social sciences. The recency of references is more akin to humanities. The diversity of sources, calculated with the OpenAlex data, is lower in archaeological journals than in the physics data from Fanelli and Glänzel (2013). Low values of Shannon’s index indicate that articles from archaeological journals cite a limited number of different sources, which is typically interpreted as a characteristic of hard sciences (Fanelli and Glänzel, 2013). Nevertheless, as written above, this metric is probably strongly underestimated in the OpenAlex dataset, given the absence of many references from the articles’ metadata (books, book chapters, conference papers, references in languages other than English, etc.).
In the WoS dataset, the proportion of articles published after 2012 is 70%, for only 13 years out of the 50, i.e. 26% of the studied time range. By contrast, post-2012 articles represent only 41% of the OpenAlex dataset. The archaeology data in Figure 1 of Marwick (2025) is thus strongly skewed towards recent publication habits rather than truly representing trends from 1975 to 2025. In contrast, the OpenAlex data presented in Figure 2 is more representative of the entire time range.
Given that the OpenAlex dataset is larger than the WoS dataset, I replicated Figure 1 from Marwick (2025) but selected only data from 2012 (Figure 3), as in Fanelli and Glänzel (2013). Marwick (2025) did not perform this analysis due to a small sample size (n = 303), but the OpenAlex dataset contains 1,241 articles from 2012. I believe this is worthwhile because the calculated metrics do vary over time (Fig. 2 of Marwick, 2025). Thus, comparing the 1975-2025 WoS dataset with the 2012 data used by Fanelli and Glänzel (2013) could misrepresent archaeological publication tendencies and, consequently, the interpretation of archaeology as a hard/soft science.
Figure 3 shows very minor differences compared to Figure 2. The boxplots for all five calculated metrics only shrink slightly, but their positions relative to the other fields remain the same, with the same means, indicating that the data from 2012 may be representative of the entire 1975-2025 dataset.
library(ggrepel)
source("code/001-redraw-Fanelli-and-Glanzel-Fig-2.R")

base_size <- 6
color <- c('#d95f02', '#7570b3', '#1b9e77')
alpha <- 0.2
linewidth <- 0.1
# Number of authors ------------------
boxlplot_n_authors_2012 <-
  ggplot() +
  # boxplot of data from this study
  geom_boxplot(data = items_df %>%
                 filter(year == 2012),
               aes(1, log(authors_n)),
               size = 1) +
  # boxplot of data from Fanelli & Glänzel Fig 2
  geom_boxplot(data = sim_data %>%
                 filter(Variable == "N. of authors (ln)",
                        Category %in% c("h", "p", "s")),
               aes(1, Value,
                   group = Category),
               size = 1,
               fill = color,
               colour = color,
               alpha = alpha,
               linewidth = linewidth) +
  # annotations from Fanelli & Glänzel Fig 2
  geom_text_repel(data = sim_data %>%
                    filter(Variable == "N. of authors (ln)",
                           Category %in% c("h", "p", "s")) %>%
                    group_by(Category) %>%
                    summarise(y = median(Value)) %>%
                    mutate(label = as.character(Category)),
                  aes(c(0.75, 1, 1.25), y, label = label),
                  color = color,
                  bg.colour = "white",
                  bg.r = .2,
                  force = 0) +
  scale_y_continuous(limits = c(0, 5)) +
  scale_x_continuous(labels = NULL) +
  theme_minimal(base_size = base_size) +
  theme(panel.grid = element_blank()) +
  ylab("N. of authors (ln)") +
  xlab("Collaborator group size")
# Relative title length ----------------
boxlplot_rel_title_length_2012 <-
  items_df_title %>%
  filter(year == 2012) %>%
  ggplot(aes(1,
             relative_title_length)) +
  geom_boxplot(
    size = 1) +
  # boxplot of data from Fanelli & Glänzel Fig 2
  geom_boxplot(data = sim_data %>%
                 filter(Variable == "Relative title length (ln)",
                        Category %in% c("h", "p", "s")),
               aes(1, Value,
                   group = Category),
               size = 1,
               fill = color,
               colour = color,
               alpha = alpha,
               linewidth = linewidth) +
  # annotations from Fanelli & Glänzel Fig 2
  geom_text_repel(data = sim_data %>%
                    filter(Variable == "Relative title length (ln)",
                           Category %in% c("h", "p", "s")) %>%
                    group_by(Category) %>%
                    summarise(y = median(Value)) %>%
                    mutate(label = as.character(Category)),
                  aes(c(0.75, 1, 1.25), y, label = label),
                  bg.colour = "white",
                  colour = color,
                  bg.r = .2,
                  force = 0) +
  scale_y_continuous(limits = c(-4.5, 3),
                     breaks = seq(-5, 5, 1),
                     labels = seq(-5, 5, 1)) +
  scale_x_continuous(labels = NULL) +
  theme_minimal(base_size = base_size) +
  theme(panel.grid = element_blank()) +
  ylab("Ratio of title length to article length (ln)") +
  xlab("Relative title length")
# Number of pages ------------------
boxlplot_n_pages_2012 <-
  items_df %>%
  filter(year == 2012) %>%
  ggplot(aes(1,
             log(pages_n))) +
  geom_boxplot(
    size = 1) +
  # boxplot of data from Fanelli & Glänzel Fig 2
  geom_boxplot(data = sim_data %>%
                 filter(Variable == "N. of pages (ln)",
                        Category %in% c("h", "p", "s")),
               aes(1, Value,
                   group = Category),
               size = 1,
               fill = color,
               colour = color,
               alpha = alpha,
               linewidth = linewidth) +
  # annotations from Fanelli & Glänzel Fig 2
  geom_text_repel(data = sim_data %>%
                    filter(Variable == "N. of pages (ln)",
                           Category %in% c("h", "p", "s")) %>%
                    group_by(Category) %>%
                    summarise(y = median(Value)) %>%
                    mutate(label = as.character(Category)),
                  aes(c(0.75, 1, 1.25), y, label = label),
                  bg.colour = "white",
                  colour = color,
                  bg.r = .2,
                  force = 0) +
  scale_y_reverse(limits = c(5, 0)) +
  scale_x_continuous(labels = NULL) +
  theme_minimal(base_size = base_size) +
  theme(panel.grid = element_blank()) +
  ylab("N. of pages (ln)") +
  xlab("Article length")
# Price's index - age of references ------------------
boxlplot_price_index_2012 <-
  items_df %>%
  filter(year == 2012) %>%
  ggplot(aes(1,
             prices_index)) +
  geom_boxplot(
    size = 1) +
  # boxplot of data from Fanelli & Glänzel Fig 2
  geom_boxplot(data = sim_data %>%
                 filter(Variable == "Price's index",
                        Category %in% c("h", "p", "s")),
               aes(1, Value,
                   group = Category),
               size = 1,
               fill = color,
               colour = color,
               alpha = alpha,
               linewidth = linewidth) +
  # annotations from Fanelli & Glänzel Fig 2
  geom_text_repel(data = sim_data %>%
                    filter(Variable == "Price's index",
                           Category %in% c("h", "p", "s")) %>%
                    group_by(Category) %>%
                    summarise(y = median(Value)) %>%
                    mutate(label = as.character(Category)),
                  aes(c(0.75, 1, 1.25), y, label = label),
                  bg.colour = "white", colour = color,
                  bg.r = .2,
                  force = 0) +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(labels = NULL) +
  theme_minimal(base_size = base_size) +
  theme(panel.grid = element_blank()) +
  ylab("Prop. refs in last 5 years") +
  xlab("Recency of references")
# Shannon index - diversity of sources ------------------
boxlplot_shannon_index_2012_AQ <-
  shannon_per_item_AQ %>%
  filter(year == 2012) %>%
  filter(shannon > 0) %>%
  ggplot(aes(1,
             shannon)) +
  geom_boxplot(aes(colour = "red"),
               size = 1, show.legend = FALSE) +
  # boxplot of data from Fanelli & Glänzel Fig 2
  geom_boxplot(data = sim_data %>%
                 filter(Variable == "Shannon div. of sources",
                        Category %in% c("h", "p", "s")),
               aes(1, Value,
                   group = Category),
               size = 1,
               fill = color,
               colour = color,
               alpha = alpha,
               linewidth = linewidth) +
  # annotations from Fanelli & Glänzel Fig 2
  geom_text_repel(data = sim_data %>%
                    filter(Variable == "Shannon div. of sources",
                           Category %in% c("h", "p", "s")) %>%
                    group_by(Category) %>%
                    summarise(y = median(Value)) %>%
                    mutate(label = as.character(Category)),
                  aes(c(0.75, 1, 1.25), y, label = label),
                  colour = color,
                  bg.colour = "white",
                  bg.r = .2,
                  force = 0) +
  scale_y_reverse(limits = c(6, 0)) +
  scale_x_continuous(labels = NULL) +
  theme_minimal(base_size = base_size) +
  theme(panel.grid = element_blank()) +
  ylab("Shannon Index") +
  xlab("Diversity of sources")
library(cowplot)
plot_grid(boxlplot_n_authors_2012,
          boxlplot_rel_title_length_2012,
          boxlplot_n_pages_2012,
          boxlplot_price_index_2012,
          boxlplot_shannon_index_2012_AQ,
          nrow = 2)
Given that the extraction of all this data also allows for ranking the top-cited references and sources, I present in Table 3 and Table 4 the 20 papers and sources, respectively, that receive the most citations in the dataset. The first list indicates that the most cited references are primarily methodological (radiocarbon and isotopes) or theoretical articles, and sourcebooks, rather than case studies, which is an expected result. The second list shows that the most cited journal, Journal of Archaeological Science, is cited more than twice as often as the second one, American Antiquity. This highlights the importance of this journal in the community and may partly explain the low values of Shannon’s index. While the top 3 journals are, in my opinion, not very surprising, I find it more surprising to see Archaeometry in fourth position. It is also interesting to note the presence of highly reputable generalist journals in positions 5 and 6 (Nature and Science, respectively), and even of PNAS in position 17.
This table of the most cited journals differs significantly from the equivalent table calculated (but not shown) in Marwick (2025). Marwick’s table also ranks Journal of Archaeological Science first, followed by American Antiquity with half the citations, then Antiquity. After that, the order changes considerably compared to the OpenAlex data. For instance, Nature is in 15th position there and PNAS in 9th, while Quaternary International is in 5th position, whereas it is 12th in the OpenAlex table, etc. This indicates that the differences between the two datasets regarding references are substantial, which explains the variations in the diversity-of-sources results. The discrepancy could be due to the recency of the WoS dataset (70% of its articles are post-2012), as seen for example with the presence in its top 20 sources of the relatively new outlets PLoS ONE (created in 2006) and Journal of Archaeological Science: Reports (created in 2015).
all_cited_items <-
  ref_list7 %>%
  select(x) %>%
  group_by(x) %>%
  tally() %>%
  arrange(desc(n))

top25_cited_items = all_cited_items %>% head(25)
colnames(top25_cited_items) = c("Article", "N.citations")

top25_refs_list <- list()

# Loop for getting the list of all refs for each paper and keeping only its publication years
# (very very long)
for (i in 1:nrow(top25_cited_items)) {
  title_refs_search <- tryCatch({
    oa_fetch(
      entity = "works",
      identifier = top25_cited_items$Article[[i]],
      count_only = FALSE,
      output = "dataframe",
      verbose = TRUE,
      mailto = "alain.queffelec@u-bordeaux.fr"
    )
  }, error = function(e) {
    message(paste("Error fetching data for row", i, ": ", e$message))
    NULL
  })

  if (!is.null(title_refs_search)) {
    title_refs <- title_refs_search$title
    top25_refs_list[[i]] <- title_refs
  }
}

# Save the list on the disk
saveRDS(top25_refs_list, "data/top25_refs_list.rds")

# Read the list on the disk
top25_refs_list = readRDS("data/top25_refs_list.rds")

top25_cited_items$Article = unlist(top25_refs_list)
top25_cited_items$Article[10] = top25_cited_items$Article[2] # manually map a title with Chinese characters to the same title with Latin characters
top25_cited_items$Article[21] = top25_cited_items$Article[16] # manually map a title with authors appended to the same title without the authors

top20_cited_items_fused <- top25_cited_items %>%
  group_by(Article) %>%
  summarise(N.citations = sum(N.citations), .groups = 'drop')

top20_cited_items_fused = top20_cited_items_fused %>%
  arrange(desc(N.citations)) %>%
  head(20) %>%
  mutate(rank = seq(1:20)) %>%
  relocate(rank, .before = Article)

top20_cited_items_4cols = cbind(top20_cited_items_fused[1:10, ], top20_cited_items_fused[11:20, ])

# Display the table
knitr::kable(top20_cited_items_4cols, caption = "Top 20 references cited in the OpenAlex dataset")
rank | Article | N.citations | rank | Article | N.citations |
---|---|---|---|---|---|
1 | IntCal13 and Marine13 Radiocarbon Age Calibration Curves 0–50,000 Years cal BP | 1240 | 11 | Experimental Evidence for the Relationship of the Carbon Isotope Ratios of Whole Diet and Dietary Protein to Those of Bone Collagen and Carbonate | 240 |
2 | Bayesian Analysis of Radiocarbon Dates | 525 | 12 | Preparation and characterization of bone and tooth collagen for isotopic analysis | 239 |
3 | Formation processes of the archaeological record | 386 | 13 | Bone Collagen Quality Indicators for Palaeodietary and Radiocarbon Measurements | 215 |
4 | Postmortem preservation and alteration of in vivo bone collagen isotope ratios in relation to palaeodietary reconstruction | 343 | 14 | New Method of Collagen Extraction for Radiocarbon Dating | 214 |
5 | Willow Smoke and Dogs’ Tails: Hunter-Gatherer Settlement Systems and Archaeological Site Formation | 313 | 15 | Organization and Formation Processes: Looking at Curated Technologies | 207 |
6 | Bones: Ancient Men and Modern Myths | 300 | 16 | The revolution that wasn’t: a new interpretation of the origin of modern human behavior | 205 |
7 | Nitrogen and carbon isotopic composition of bone collagen from marine and terrestrial animals | 297 | 17 | Strontium Isotopes from the Earth to the Archaeological Skeleton: A Review | 185 |
8 | Extended 14C Data Base and Revised CALIB 3.0 14C Age Calibration Program | 291 | 18 | R: A language and environment for statistical computing. | 184 |
9 | Influence of diet on the distribution of nitrogen isotopes in animals | 265 | 19 | A History of Archaeological Thought. By Bruce G. Trigger. | 176 |
10 | Pottery Analysis: A Sourcebook. | 258 | 20 | Advances in Archaeological Method and Theory | 175 |
# get a list of the top journals
top_journals <-
  ref_list7 %>%
  select(journal_name) %>%
  group_by(journal_name) %>%
  tally() %>%
  filter(n > 50) %>%
  arrange(desc(n))

top20_cited_journals = top_journals %>% head(20)
colnames(top20_cited_journals) = c("Journal", "N.citations")

top20_cited_journals = top20_cited_journals %>%
  mutate(rank = seq(1:20)) %>%
  relocate(rank, .before = Journal)

top20_cited_journals_4cols = cbind(top20_cited_journals[1:10, ], top20_cited_journals[11:20, ])

# Display the table
knitr::kable(top20_cited_journals_4cols, caption = "Top 20 journals cited in the OpenAlex dataset")
rank | Journal | N.citations | rank | Journal | N.citations |
---|---|---|---|---|---|
1 | journalofarchaeologicalscience | 59700 | 11 | radiocarbon | 10373 |
2 | americanantiquity | 24375 | 12 | quaternaryinternational | 9821 |
3 | antiquity | 18270 | 13 | americanjournalofphysicalanthropology | 9686 |
4 | archaeometry | 15404 | 14 | journalofanthropologicalarchaeology | 9669 |
5 | nature | 12513 | 15 | journalofhumanevolution | 9625 |
6 | science | 12104 | 16 | studiesinconservation | 8460 |
7 | currentanthropology | 11506 | 17 | proceedingsofthenationalacademyofsciences | 8421 |
8 | man | 10909 | 18 | americananthropologist | 7643 |
9 | worldarchaeology | 10474 | 19 | americanjournalofarchaeology | 6915 |
10 | journaloffieldarchaeology | 10408 | 20 | journalofculturalheritage | 6338 |
# get top 20 journals from Marwick
top_cited_journals_marwick = readRDS("data/top_cited_journals_Marwick.rds")

top20_cited_journals_marwick = top_cited_journals_marwick %>% head(20)
colnames(top20_cited_journals_marwick) = c("Journal", "N.citations")

top20_cited_journals_marwick = top20_cited_journals_marwick %>%
  mutate(rank = seq(1:20)) %>%
  relocate(rank, .before = Journal)

top20_cited_journals_marwick_4cols = cbind(top20_cited_journals_marwick[1:10, ], top20_cited_journals_marwick[11:20, ])

# Display the table
knitr::kable(top20_cited_journals_marwick_4cols, caption = "Top 20 journals cited in the Web of Science dataset")
rank | Journal | N.citations | rank | Journal | N.citations |
---|---|---|---|---|---|
1 | jarchaeolsci | 24814 | 11 | thesis | 4222 |
2 | amantiquity | 12718 | 12 | radiocarbon | 4177 |
3 | antiquity | 7447 | 13 | jfieldarchaeol | 3733 |
4 | janthropolarchaeol | 6100 | 14 | jarchaeolscirep | 3614 |
5 | quaternint | 4996 | 15 | nature | 3561 |
6 | curranthropol | 4983 | 16 | amjphysanthropol | 3444 |
7 | worldarchaeol | 4754 | 17 | jarchaeolmethodth | 3376 |
8 | science | 4733 | 18 | jhumevol | 3296 |
9 | pnatlacadsciusa | 4615 | 19 | amanthropol | 3177 |
10 | archaeometry | 4477 | 20 | plosone | 3154 |
Regarding the evolution of hardness over time, the plots created with the OpenAlex data are similar to those created with the WoS data (Fig. 2 of Marwick, 2025). The only difference concerns the evolution of the relative title length, whose trend I found so close to 0 that I decided not to colour it green but to present it in grey, as a variable that does not evolve over time in the OpenAlex data.
items_df_title <- items_df_title %>% select(-refs)

over_time <-
  items_df %>%
  left_join(items_df_title) %>%
  left_join(shannon_per_item_AQ) %>%
  filter(relative_title_length != -Inf,
         relative_title_length != Inf,
         shannon != 0,
         pages_n < 200) %>%
  mutate(journal_wrp = str_wrap(journal, 30)) %>%
  select(journal,
         year,
         authors_n,
         pages_n,
         prices_index,
         shannon,
         relative_title_length)
over_time_long <-
  over_time %>%
  ungroup() %>%
  select(-journal) %>%
  gather(variable,
         value, -year) %>%
  filter(value != -Inf,
         value != Inf) %>%
  mutate(variable = case_when(
    variable == "pages_n" ~ "N. of pages",
    variable == "prices_index" ~ "Recency of references",
    variable == "shannon" ~ "Diversity of sources",
    variable == "relative_title_length" ~ "Relative title length (ln)",
    variable == "authors_n" ~ "N. of authors"
  )) %>%
  filter(!is.na(variable)) %>%
  filter(!is.nan(value)) %>%
  filter(!is.na(value)) %>%
  filter(value != "NaN")
# compute beta estimates so we can colour lines to indicate more or less hard
over_time_long_models <-
  over_time_long %>%
  group_nest(variable) %>%
  mutate(model = map(data, ~tidy(lm(value ~ year, data = .)))) %>%
  unnest(model) %>%
  filter(term == 'year') %>%
  mutate(becoming_more_scientific = case_when(
    variable == "N. of authors" & estimate > 0 ~ "TRUE",
    variable == "N. of pages" & estimate < 0 ~ "TRUE",
    variable == "N. of refs (sqrt)" & estimate < 0 ~ "TRUE",
    variable == "Recency of references" & estimate > 0 ~ "TRUE",
    variable == "Relative title length (ln)" ~ "NOT CHANGING",
    variable == "Diversity of sources" & estimate < 0 ~ "TRUE",
    TRUE ~ "FALSE"
  ))
# join with data
over_time_long_colour <-
  over_time_long %>%
  left_join(over_time_long_models)
# Chunk of code absent from v1.2 and v1.3, where the .png is called directly
# I may have the same problem as he had: everything works when run directly in RStudio, but the Quarto document cannot be rendered with the over-time modelling. So I do the same as he does: I save my own .png...
library(ggpmisc)
library(mgcv)

formula <- y ~ x

over_time_long_colour_gams <-
  over_time_long_colour %>%
  nest(.by = variable) %>%
  mutate(mod_gam = lapply(data,
                          function(df) gam(year ~ s(value, bs = "cr"),
                                           data = df)))
over_time_long_colour_gams_summary <-
  over_time_long_colour %>%
  nest(.by = variable) %>%
  mutate(fit = map(data, ~mgcv::gam(year ~ s(value, bs = "cs"), data = .)),
         results = map(fit, glance),
         R.square = map_dbl(fit, ~ summary(.)$r.sq)) %>%
  unnest(results) %>%
  select(-data, -fit) %>%
  select(variable, adj.r.squared)

over_time_long_colour_gams_summary_df <-
  over_time_long_colour %>%
  left_join(over_time_long_colour_gams_summary)
plot_overtime <- ggplot() +
  geom_point(data = over_time_long_colour_gams_summary_df,
             aes(year,
                 value,
                 colour = becoming_more_scientific),
             alpha = 0.5) +
  geom_smooth(data = over_time_long_colour_gams_summary_df,
              aes(year, value),
              method = "gam",
              formula = y ~ s(x, bs = "cs"),
              se = FALSE,
              linewidth = 2,
              colour = "#7570b3") +
  facet_wrap( ~ variable,
              scales = "free_y") +
  theme_bw(base_size = base_size) +
  theme(legend.position = c(0.96, 0.02),
        legend.justification = c(1, 0),
        legend.key.size = unit(1, "cm"),
        legend.text = element_text(size = 10),
        legend.title = element_text(size = 12),
        legend.background = element_rect(fill = "white", color = "black", size = 0.5),
        legend.spacing = unit(0.5, "cm")) +
  scale_color_manual(values = c("TRUE" = "#1b9e77",
                                "FALSE" = "#d95f02",
                                "NOT CHANGING" = "lightgrey")) +
  ylab("") +
  geom_text(data = over_time_long_colour_gams_summary_df %>%
              group_by(variable) %>%
              summarise(max_value = max(value),
                        adj.r.squared = unique(adj.r.squared)),
            aes(x = 1980,
                y = max_value,
                label = paste("Pseudo R² = ",
                              signif(adj.r.squared,
                                     digits = 3))),
            hjust = 0,
            vjust = 1.5,
            size = 2)

ggsave(plot_overtime, height = 6.95, width = 9.31,
       filename = "figures/plot_overtime.png")

knitr::include_graphics("figures/plot_overtime.png")
journal_title_size <- 2
# get rank order of journals by these bibliometric variables
journal_metrics_for_plotting <-
  items_df %>%
  left_join(items_df_title) %>%
  left_join(shannon_per_item_AQ) %>%
  ungroup() %>%
  select(journal,
         authors_n, # log
         pages_n,   # log
         relative_title_length,
         prices_index,
         shannon) %>%
  filter(relative_title_length != -Inf,
         relative_title_length != Inf,
         prices_index != "NaN") %>%
  mutate(log_authors = log(authors_n),
         log_pages = log(pages_n))
journal_metrics_for_plotting_summary <-
  journal_metrics_for_plotting %>%
  mutate(journal = str_wrap(journal, 20)) %>%
  group_by(journal) %>%
  summarise(mean_log_authors = mean(log_authors),
            mean_log_pages = mean(log_pages),
            mean_relative_title_length = mean(relative_title_length),
            mean_prices_index = mean(prices_index),
            mean_shannon = mean(shannon))
# PCA of journal means
journal_metrics_for_plotting_summary_pca <-
  journal_metrics_for_plotting_summary %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .))) %>%
  column_to_rownames("journal") %>%
  prcomp(scale = TRUE)

# Tidy the PCA results
pca_means_tidy <- journal_metrics_for_plotting_summary_pca %>% tidy(matrix = "pcs")

# first two PCs explain how much?
# Get the summary of the PCA
pca_summary <- summary(journal_metrics_for_plotting_summary_pca)

# Extract the proportion of variance explained by PC1 and PC2
variance_explained <- round(pca_summary$importance[2, 1:2] * 100, 0)

# Get the PCA scores
pca_scores_means <- journal_metrics_for_plotting_summary_pca %>% augment(journal_metrics_for_plotting_summary)

# Get the PCA loadings
pca_loadings_means <-
  journal_metrics_for_plotting_summary_pca %>%
  tidy(matrix = "rotation") %>%
  pivot_wider(names_from = "PC",
              values_from = "value",
              names_prefix = "PC") %>%
  mutate(column = case_when(
    column == "mean_log_authors" ~ "Number of\nauthors",
    column == "mean_log_pages" ~ "Number of\npages",
    column == "mean_relative_title_length" ~ "Relative\ntitle\nlength",
    column == "mean_prices_index" ~ "Recency of\nreferences",
    column == "mean_shannon" ~ "Diversity of\nsources"
  ))
# Plot the PCA results
# geom_text_repel() comes from ggrepel, which is not loaded above
library(ggrepel)

plot_pca_means <-
  ggplot() +
  labs(x = paste0("PC1 (", variance_explained[1], "%)"),
       y = paste0("PC2 (", variance_explained[2], "%)")) +
  geom_point(data = pca_scores_means,
             aes(.fittedPC1,
                 .fittedPC2), size = 1) +
  geom_text_repel(data = pca_scores_means %>%
                    mutate(label = str_replace(journal,
                                               "JOURNAL",
                                               "J.")) %>%
                    mutate(label = str_remove(label,
                                              "-AN\nINTERNATIONAL\nJ.")),
                  aes(.fittedPC1,
                      .fittedPC2, label = label),
                  lineheight = 0.8,
                  segment.color = NA,
                  force_pull = 10,
                  size = 2.5,
                  bg.color = "white", # Color of the halo
                  bg.r = 0.2) +
  geom_segment(data = pca_loadings_means,
               aes(x = 0,
                   y = 0,
                   xend = PC1,
                   yend = PC2),
               arrow = arrow(length = unit(0.2, "cm")),
               color = "grey70") +
  geom_text_repel(data = pca_loadings_means,
                  aes(PC1,
                      PC2, label = column),
                  size = 2,
                  lineheight = 0.8,
                  force = 10,
                  force_pull = 0,
                  segment.color = NA,
                  color = "grey40",
                  bg.color = "white", # Color of the halo
                  bg.r = 0.2) +
  theme_minimal(base_size = base_size) +
  coord_fixed(xlim = c(-6, 2.5),
              ylim = c(-3, 2))

# tricky to get the label spacing right, let's save an SVG, edit
# by hand, then export to PNG and read that file later.
ggsave(plot_pca_means,
       filename = "figures/plot_pca_means.svg")

knitr::include_graphics("figures/plot_pca_means.png")
Figure 5, equivalent to figure 4 of Marwick (2025) (though the journals are not all the same, see Section 2.1), represents the characteristics of the journals in the top 25 by 2-year mean citedness in OpenAlex. The review journal Journal of Archaeological Research stands out clearly from the other journals, featuring a higher diversity of sources and longer papers. In the same direction on PC1 but in the opposite direction on PC2, Journal of Archaeological Method and Theory, Journal of World Prehistory, Journal of Social Archaeology, and Journal of Material Culture are characterized by long articles with fewer authors, and by references which are less diverse and less recent. Another part of the PCA is occupied by harder-science journals, with recent references, more authors and shorter papers. This group includes Advances in Archaeological Practice, Archaeological Dialogues (for which many papers are short comments or answers), Australian Archaeology, Archaeological Prospection, and Antiquity. Rather short articles by large teams of authors, citing less diverse sources, are typical of Journal of Archaeological Science: Reports, Archaeological and Anthropological Sciences, and Archaeological Research in Asia.
long_names = levels(factor(items_df$journal))
short_names = c("Adv. in Arch. Practice", "Antiquity", "Arch. & Anthro. Sci.", "Arch. Dialogues", "Arch. Prosp.", "Arch. Res. Asia", "Archaeometry", "Australian Arch.", "Cambr. Arch. J.", "Geoarchaeology", "J. Anthro. Arch.", "J. Arch. M. & T.", "J. Arch. Res.", "J. Arch. Sci.",
                "J. Arch. Sci. Rep.", "J. Cult. Heritage", "J. Field Arch.", "J. Material Cult.", "J. Social Arch.", "J. World Prehist.", "Levant", "Lithic Techno.", "Palest. Explo. Quart.", "Studies in Cons.", "World Arch.")

replacement_vector <- setNames(short_names, long_names)
journal_short <- str_replace_all(levels(factor(items_df$journal)), replacement_vector)
# looking into rankings of the journals
journal_title_size <- 7

journal_summary_metrics_ranks <-
  journal_metrics_for_plotting_summary %>%
  mutate(journal = journal_short) %>%
  mutate(across(starts_with("mean"),
                ~ rank(-.),
                .names = "rank_{.col}")) %>%
  select(journal, starts_with("rank")) %>%
  # reorder by hardness
  mutate(rank_mean_log_pages = 21 - rank_mean_log_pages,
         rank_mean_shannon = 21 - rank_mean_shannon)
library(irr)

journal_summary_metrics_ranks_test <-
  journal_summary_metrics_ranks %>%
  select(-journal) %>%
  kendall(correct = TRUE)

# Convert to scientific text
pretty_print_sci <- function(num){
  scientific_text <- paste0(gsub("e", " x 10^",    # Replace 'e' with ' x 10^'
                                 sprintf("%.2e", num)), "^") # round to 2 sf
  return(scientific_text)
}
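As a quick check of the formatting helper, here is a toy call (the helper is repeated inside the snippet so it runs on its own; the input value is arbitrary):

```r
# Same helper as defined above, repeated so this snippet is self-contained
pretty_print_sci <- function(num){
  scientific_text <- paste0(gsub("e", " x 10^", sprintf("%.2e", num)), "^")
  return(scientific_text)
}

pretty_print_sci(0.0000123)
# "1.23 x 10^-05^" -- the trailing ^ closes the markdown superscript
```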
borda_count_tbl <- function(votes_tbl) {
  # Number of voters
  num_voters <- ncol(votes_tbl) - 1

  # Calculate scores for each option
  scores <- votes_tbl %>%
    rowwise() %>%
    mutate(Score = sum(num_voters - c_across(starts_with("rank_")))) %>%
    ungroup() %>%
    select(1, Score)

  # Return scores
  return(scores)
}
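To make the Borda scoring rule concrete, here is a minimal sketch on a toy table (hypothetical journals and ranks, not the real data), applying the same dplyr logic as `borda_count_tbl`:

```r
library(dplyr)

# Three hypothetical journals ranked by two metrics (columns starting with "rank_")
toy_ranks <- tibble(journal = c("J1", "J2", "J3"),
                    rank_metric_a = c(1, 2, 3),
                    rank_metric_b = c(2, 1, 3))

# Same scoring rule as borda_count_tbl above: each ranking column
# contributes (number of ranking columns - rank) points
num_voters <- ncol(toy_ranks) - 1
toy_scores <- toy_ranks %>%
  rowwise() %>%
  mutate(Score = sum(num_voters - c_across(starts_with("rank_")))) %>%
  ungroup()

# J1 and J2 tie with a score of 1; J3, ranked last on both metrics, gets -2
```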
# Calculate Borda Count scores
borda_scores <-
  journal_summary_metrics_ranks %>%
  borda_count_tbl() %>%
  rename("Journal" = "journal") %>%
  arrange(desc(Score))
plot_borda_scores <-
  borda_scores %>%
  ggplot() +
  aes(reorder(Journal, Score),
      Score) +
  geom_col() +
  coord_flip() +
  ylab("Borda Count scores") +
  xlab("") +
  theme_minimal(base_size = base_size) +
  theme(axis.text.y = element_text(size = journal_title_size))
library(ggridges)

journal_title_size <- 7

journal_metrics_for_plotting <- journal_metrics_for_plotting %>%
  mutate(journal = str_replace_all(journal, replacement_vector))
plot_journals_authors <-
  journal_metrics_for_plotting %>%
  ggplot(aes(y = reorder(journal,
                         log_authors,
                         FUN = mean),
             x = log_authors,
             fill = after_stat(x),
             height = after_stat(density))) +
  geom_density_ridges_gradient(stat = "density",
                               colour = "white") +
  scale_fill_viridis_c() +
  guides(fill = 'none') +
  theme_minimal(base_size = journal_title_size) +
  theme(axis.text.y = element_text(size = journal_title_size)) +
  ylab("") +
  xlab("Number of authors (ln)")
plot_journals_article_length <-
  journal_metrics_for_plotting %>%
  ggplot(aes(y = reorder(journal,
                         -log_pages,
                         FUN = mean),
             x = log_pages,
             fill = after_stat(x),
             height = after_stat(density))) +
  geom_density_ridges_gradient(stat = "density",
                               colour = "white") +
  scale_fill_viridis_c() +
  guides(fill = 'none') +
  theme_minimal(base_size = base_size) +
  theme(axis.text.y = element_text(size = journal_title_size)) +
  xlab("Number of pages (ln)") +
  ylab("")
plot_journals_title_length <-
  journal_metrics_for_plotting %>%
  ggplot(aes(y = reorder(journal,
                         relative_title_length,
                         FUN = mean),
             x = relative_title_length,
             fill = after_stat(x),
             height = after_stat(density))) +
  geom_density_ridges_gradient(stat = "density",
                               colour = "white") +
  scale_fill_viridis_c() +
  guides(fill = 'none') +
  theme_minimal(base_size = base_size) +
  theme(axis.text.y = element_text(size = journal_title_size)) +
  ylab("") +
  xlab("Relative title length (ln)")
plot_journals_ref_recency <-
  journal_metrics_for_plotting %>%
  group_by(journal) %>%
  ggplot(aes(y = reorder(journal,
                         prices_index,
                         FUN = mean),
             x = prices_index,
             fill = after_stat(x),
             height = after_stat(density))) +
  geom_density_ridges_gradient(stat = "density",
                               colour = "white") +
  scale_fill_viridis_c() +
  guides(fill = 'none') +
  theme_minimal(base_size = base_size) +
  theme(axis.text.y = element_text(size = journal_title_size)) +
  ylab("") +
  xlab("Recency of references")
plot_journals_ref_diversity <-
  journal_metrics_for_plotting %>%
  group_by(journal) %>%
  ggplot(aes(y = reorder(journal,
                         -shannon,
                         FUN = mean),
             x = shannon,
             fill = after_stat(x),
             height = after_stat(density))) +
  geom_density_ridges_gradient(stat = "density",
                               colour = "white") +
  scale_fill_viridis_c() +
  guides(fill = 'none') +
  theme_minimal(base_size = base_size) +
  theme(axis.text.y = element_text(size = journal_title_size)) +
  ylab("") +
  xlab("Diversity of sources")
library(cowplot)

plot_variation = plot_grid(plot_journals_authors,
                           plot_journals_article_length,
                           plot_journals_title_length,
                           plot_journals_ref_recency,
                           plot_journals_ref_diversity,
                           plot_borda_scores,
                           nrow = 2,
                           labels = LETTERS[1:6],
                           label_size = 6)
plot_variation

ggsave(plot_variation,
       filename = "figures/plot_variation.svg")
Figure 6 also presents interesting results:
Figure 6 A shows a quite different order for the journals that are both in the top 25 of OpenAlex and in the top 20 of WoS.
Figure 6 B-E illustrate that the data in OpenAlex have a much wider distribution than the WoS data presented by Marwick (2025).
Figure 6 B-D show rankings for the length of articles, the relative title length and the recency of references that are quite similar to those calculated by Marwick (2025).
Figure 6 E shows rankings that are very similar to those presented in Marwick (2025) in the lower part of the plot, while the upper part of the plot is populated with journals which are not in the top 20 journals from the WoS database.
Figure 6 F also generally matches the results from Marwick (2025).
Figure 6 F shows that the two journals of cultural heritage and conservation studies are distinct from the other journals. It is very interesting to see that Studies in Conservation behaves similarly in this plot to Journal of Cultural Heritage, as it strengthens the interpretation of Marwick (2025). These journals are the closest to hard science because each of them "publishes materials science and computational analyses related to conservation and preservation of historic objects in museums and other collections", and they therefore behave more like chemistry journals than archaeological journals.
This attempt at reproducing and replicating Marwick (2025) has been successful.
First, the well-organized and shared data and scripts allowed me to easily reproduce the published paper with all its figures, confirming full computational reproducibility. Nevertheless, a few errors were identified in the code and have been shared with the author through GitHub. The main issues relate to the selection of the top 20 journals and to the calculation of Shannon's index. The article states that the selection of journals is based on the H-index, whereas the code actually bases it on the 2022 Impact Factor. This list is also missing two journals because of a data-sorting problem prior to subsetting. The second issue lies in the calculation of Shannon's index, which significantly modifies the results for this metric.
Second, the replication of the first part of Marwick (2025), about the hard/soft position of archaeology within science, has been conducted using the OpenAlex dataset instead of the Web of Science dataset. This open dataset, which is much larger than the WoS dataset but still less curated for some variables, allowed most observations made in the replicated study to be confirmed. It confirms that this open and free dataset is usable for scientometric analyses. It also confirms that archaeology can be positioned, in terms of hard/soft science, as intermediate between physics and the humanities and often close to the social sciences. The only strong difference lies in the diversity of sources as estimated by Shannon's index. As previously explained, the calculation of Shannon's index for the diversity of sources is incorrect in Marwick (2025). When it is corrected in his code, even while still using the WoS dataset, the result differs markedly from the one presented in the article: the corrected diversity of sources points even more towards soft science, higher than the social sciences and humanities. In the OpenAlex dataset, with the correct calculation, Shannon's index values are lower for archaeology than for the other three disciplines (Figure 2), and this is also the case when only articles from 2012 are used (Figure 3). Shannon's index calculated from OpenAlex's data suggests a very low diversity, lower than that of physics, and would therefore indicate a hard-science behavior of archaeologists when citing scientific articles. This could be interpreted as evidence that, in archaeology, "scholars agree on the relative importance of scientific problems, their efforts […] concentrate in specific fields and their findings [are] of more general interest, leading to a greater concentration of the relevant literature in few, high-ranking outlets" (Fanelli and Glänzel, 2013).
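To make the metric under discussion concrete: Shannon's index is H = -Σ p_i ln(p_i), where p_i is the share of citations going to cited source i. A minimal sketch with made-up citation counts (hypothetical values, not data from either database):

```r
# Hypothetical counts of citations per cited journal
citations <- c(journal_A = 50, journal_B = 30, journal_C = 15, journal_D = 5)

# Shannon's index: H = -sum(p * log(p)), with p the share of each source
shannon_index <- function(counts) {
  p <- counts / sum(counts)
  -sum(p * log(p))
}

round(shannon_index(citations), 3)
# 1.142 -- between 0 (all citations concentrated in one journal)
# and log(4) ~ 1.386 (citations spread evenly over the four journals)
```

Lower values thus indicate a more concentrated citation profile, which is the hard-science pattern discussed above.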
Observing the strong dominance of Journal of Archaeological Science and the presence of Nature, Science, and PNAS in the top 20 cited journals in archaeology (Table 4) could indeed indicate that archaeology relies on a relatively small number of journals, and especially on high-ranking ones. Despite this observation, I wonder whether this is due to the agreement of scholars on the relative importance of scientific problems, as suggested by Fanelli and Glänzel (2013), or to the ability of archaeological results to be published more easily than those of other disciplines in these high-ranking journals, particularly in the archaeology of ancient periods. It may also indicate that the number of archaeological journals is smaller than in other disciplines, though I am unsure whether this is true, and I am not aware of any studies on this topic. A recent study focusing on publications in archaeology between 2020 and 2023 also shows quite a high concentration of citations in few journals (Table 2 in Vélaz Ciaurriz, 2023). It may also be the result of the lack of information about many references in published articles when they cite books, book chapters, conference proceedings, or literature in languages other than English. These sources are not recorded correctly in either database, but the problem may be even stronger in OpenAlex, which would explain the concentration of citations on journals only and thus artificially reduce the diversity of sources.
The comparison of the different journals for each metric measured in this study is also generally similar to the results published in Marwick (2025), although it is sometimes difficult to compare because the lists of journals in the two manuscripts differ. Some journals are positioned in the PCA in the same way with both datasets, particularly on the more extreme soft or hard sides (Figure 5). Changing the calculation of Shannon's index barely alters the PCA of Marwick (2025). The rankings are also quite similar for the journals which appear in both lists (Figure 6), but the OpenAlex dataset shows much more diversity for each journal across most metrics. The higher number of articles in the OpenAlex dataset offers a more nuanced view of the behavior of each journal. This may be due to the inclusion of older articles compared to the WoS dataset, which comprises 70% post-2012 articles. Alternatively, it may result from some data being poorly documented in OpenAlex, especially for the oldest articles.
This work confirmed the reproducibility and replicability of the first part of Marwick (2025). The reproduction was easy to carry out, but not totally without errors. Replicating the results with the OpenAlex dataset allowed me to identify these errors by applying the code to another dataset, which compelled me to delve deeper into the code. This process underscores (if it were necessary) the value of reusing and learning from the code and data of a skilled colleague, a method also employed by the author of the replicated study himself to train his students (Marwick et al., 2020). The results obtained using the OpenAlex dataset, which is entirely free and open source, generally align with those published by Marwick (2025). The primary difference lies in the references listed in the articles that can be automatically extracted. The findings indicate that the OpenAlex dataset is less influenced by recent publication trends than the Web of Science dataset, as it maintains a more balanced number of articles over the 50-year period studied. This replication supports the idea that it is entirely possible to use this extensive database for scientometric analyses, particularly considering that it will continue to expand and improve in the future. The main issue remains the data about cited references, which clearly struggles to capture many of the sources other than journal articles on which our discipline strongly relies.
The data and the Quarto document allowing full reproduction of this manuscript are available on Zenodo. A more interactive HTML version of this manuscript is available on the GitHub page, where you can also comment, open issues, or push commits.
I would like to express my gratitude to Ben Marwick for his ongoing efforts to promote transparency and openness in archaeology. Through his influential publications and active participation in professional societies, he consistently advocates for these principles within our community. I have gained significant insights from reading his papers and examining the code he develops and generously shares to produce his research findings. Once again, replicating his work in this paper has been an enriching learning experience.
Additionally, I wish to disclose that I used Large Language Models (LLMs) for assistance in modifying and creating code, as well as for refining the English language in this document.