Science-Metrix was commissioned to provide data for the UNESCO Science Report 2021. This contract followed on from our successful completion of similar work for the previous edition of this report in 2015. In the course of the 2021 project, we were excited to develop a methodology and indicators for measuring outputs of scientific publications that could be linked to 57 topics related to the United Nations’ Sustainable Development Goals (SDGs) and their associated targets. Each topic was to be analyzed separately, such that 57 distinct data sets were created. This groundbreaking work has since informed our studies for several other clients, and we continue to refine the methodology and build robust data sets.
To ensure the highest standards of quality, the data sets were built manually by analysts skilled in balancing the recall and the precision of bibliometric data sets. The first step in building a topical data set was to define a set of core terms and specialized journals whose specificity to the topic at hand was very high. These terms and journals were then used to retrieve matching publications from the database, yielding a seed data set. This was the most sensitive part of the process, as the core terms and specialized journals had to be very precise while covering all pertinent aspects of the topic. Where possible, this step was performed by an analyst familiar with the topic at hand; otherwise, a literature review was conducted to ensure a minimal understanding of the topic.
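As a rough illustration of this seeding step, the sketch below takes the union of papers matching highly specific core terms and papers from specialist journals. The records, core terms, and journal list here are purely illustrative, not the actual terms or journals used in the project.

```python
# Toy records standing in for a bibliometric database; fields are invented.
records = [
    {"id": 1, "title": "HIV drug resistance surveillance", "journal": "AIDS Research and Therapy"},
    {"id": 2, "title": "Hearing aids in older adults", "journal": "Ear and Hearing"},
    {"id": 3, "title": "Global funding allocation patterns", "journal": "Journal of the International AIDS Society"},
]

# Hypothetical core terms and specialist journals for an HIV/AIDS topic.
CORE_TERMS = ["hiv", "antiretroviral"]
SPECIALIST_JOURNALS = {
    "AIDS Research and Therapy",
    "Journal of the International AIDS Society",
}

# Seed = papers matching a core term OR published in a specialist journal.
seed = [
    r for r in records
    if any(t in r["title"].lower() for t in CORE_TERMS)
    or r["journal"] in SPECIALIST_JOURNALS
]
print([r["id"] for r in seed])  # [1, 3]
```

Record 2 is excluded because it matches neither a core term nor a specialist journal, which is exactly the kind of precision the seed step aims for.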
The papers returned by each query were inspected to ensure high precision of the seed data set. Queries that returned a disproportionate amount of noise were removed or constrained by combining them with supplementary terms. For example, one of the topics was HIV/AIDS research. When that data set was built, a query using the keyword “AIDS” alone returned a large amount of noise from literature on hearing aids, learning aids, and so forth. The query was then modified to exclude such unrelated terms. At this stage, precision was prioritized (aiming for at least 95%) at the expense of recall, which remained low (< 60%) in some cases.
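The refinement described above can be sketched as follows. This is a toy version of excluding noisy compound phrases from a keyword match; the actual Science-Metrix query syntax is not public, and the exclusion list here is a hypothetical example.

```python
import re

# Compound phrases in which "aids" is off-topic; hypothetical exclusion list.
EXCLUDE = re.compile(r"\b(hearing|learning|teaching|visual)\s+aids?\b", re.I)
# The refined inclusion pattern (kept case-sensitive to match the acronym).
INCLUDE = re.compile(r"\b(AIDS|HIV)\b")

def matches_hiv_aids(text: str) -> bool:
    """Return True if the text matches the refined HIV/AIDS query."""
    # Remove excluded compound phrases first, then test the inclusion pattern,
    # so a record whose only hit is "hearing aids" no longer matches.
    cleaned = EXCLUDE.sub("", text)
    return bool(INCLUDE.search(cleaned))

print(matches_hiv_aids("Advances in AIDS antiretroviral therapy"))  # True
print(matches_hiv_aids("Classroom learning aids for mathematics"))  # False
```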
The seed data set was then expanded by broadening the keyword-based query. To simplify this step and all subsequent ones, a specialized tool for building data sets was developed internally. The tool first computed the term frequency–inverse document frequency (tf–idf) of all noun phrases (extracted using a natural language processing algorithm) appearing in the seed papers, easing the identification of additional relevant terms. It also computed the number of additional publications each of these terms would contribute, helping analysts prioritize the terms that would most increase recall beyond the seed.
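A minimal sketch of that tf–idf scoring step is shown below, assuming noun phrases have already been extracted from each seed paper (the NLP extraction itself is omitted). The phrases, document frequencies, and database size are invented numbers, not figures from the project.

```python
import math
from collections import Counter

# Noun phrases extracted from the seed papers (invented examples).
seed_phrases = [
    "antiretroviral therapy", "viral load", "hiv infection",
    "hiv infection", "viral load", "antiretroviral therapy",
    "study design",  # a generic phrase, common across the whole database
]

# How many papers in the full database contain each phrase (invented counts).
db_doc_freq = {
    "antiretroviral therapy": 40_000,
    "viral load": 90_000,
    "hiv infection": 120_000,
    "study design": 5_000_000,
}
DB_SIZE = 50_000_000  # hypothetical total number of papers in the database

def tfidf(phrases, doc_freq, db_size):
    """Score phrases that are frequent in the seed but rare in the database."""
    tf = Counter(phrases)
    total = len(phrases)
    return {p: (tf[p] / total) * math.log(db_size / doc_freq[p]) for p in tf}

scores = tfidf(seed_phrases, db_doc_freq, DB_SIZE)
best = max(scores, key=scores.get)
print(best)  # "antiretroviral therapy"
```

Generic phrases such as "study design" are down-weighted by the idf factor, so the top-scoring candidates are the topic-specific ones worth considering for the expanded query.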
The tool also helped in tracking the precision of each search expression before its inclusion in the query by enabling analysts to rate the pertinence of every encountered article. The recall figure was continuously updated as the query was expanded. It was also possible to test recall against a different reference set than the seed, making it easy to check for thematic biases by varying the recall data set used. During this process, the tool computed the share of each journal’s output appearing in the expanding data set. Using this information, analysts looked for journals with a high share of papers already included in the data set and analyzed their scope to decide whether it would be worth adding them in full. Work continued until no more relevant terms or journals were worth adding, at which point recall was good (>70%) and precision remained high (>90%).
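The bookkeeping the tool is described as doing can be summarized in three small functions: precision from analyst ratings, recall against a chosen reference set, and each journal's captured share of output. The data structures and numbers below are invented for illustration.

```python
def precision(ratings):
    """ratings: analyst verdicts (True = pertinent) for sampled articles."""
    return sum(ratings) / len(ratings) if ratings else 0.0

def recall(dataset_ids, reference_ids):
    """Share of a reference set (e.g. the seed) captured by the current query."""
    return len(dataset_ids & reference_ids) / len(reference_ids)

def journal_shares(dataset_ids, journal_index):
    """journal_index maps a journal name to all of its paper IDs in the DB."""
    return {j: len(ids & dataset_ids) / len(ids)
            for j, ids in journal_index.items()}

# Invented paper IDs for a quick demonstration.
dataset = {1, 2, 3, 5, 8}
seed = {1, 2, 3, 4}
journals = {"J. Hypothetical AIDS Res.": {1, 2, 3, 9, 10}}

print(recall(dataset, seed))              # 0.75
print(journal_shares(dataset, journals))  # {'J. Hypothetical AIDS Res.': 0.6}
```

Swapping a different reference set into `recall` is what allows the thematic-bias check mentioned above; a journal whose share approaches 1.0 is a candidate for inclusion in full.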
At that stage, further verifications were performed with the tool’s help. The first involved looking for queries with a low contribution to recall yet a very high number of returned papers, as this was often the signature of a search expression that was off-topic or an indication that the seed or other reference data set was missing a portion of relevant research. Another verification was to look at the subfields of science (based on Science-Metrix’s classification of science) in which the papers were classified to identify those that could be related to off-topic papers. Analysts could also verify that the author affiliations most often found in papers from the data set were indeed expected, given the topic. Finally, the recall was also measured for each journal appearing in the data set to search for potential biases across the subject matters of relevance to a given topic (i.e., some specialized journals having lower recall than others).
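The first of these verifications can be sketched as a simple filter: flag expressions that return many papers yet contribute little recall, a pattern that often signals an off-topic term or a gap in the reference set. The thresholds, query names, and paper IDs below are invented.

```python
def flag_suspect_queries(query_hits, reference_ids,
                         min_hits=1000, max_contribution=0.01):
    """query_hits maps an expression name to the set of paper IDs it returns.

    Flags expressions with many hits but a negligible contribution to recall
    against the reference set. Thresholds are illustrative defaults.
    """
    flagged = []
    for name, hits in query_hits.items():
        contribution = len(hits & reference_ids) / len(reference_ids)
        if len(hits) >= min_hits and contribution <= max_contribution:
            flagged.append(name)
    return flagged

reference = set(range(500))  # IDs in the reference (recall) data set
query_hits = {
    '"viral suppression"': set(range(400)) | set(range(500, 700)),  # on-topic
    '"support group"': {0} | set(range(1000, 3000)),  # broad, little recall
}
print(flag_suspect_queries(query_hits, reference))  # ['"support group"']
```

A flagged expression is then reviewed by an analyst: it is either dropped as off-topic or kept, with the reference set expanded to cover the relevant research it revealed.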
About the UNESCO Science Report 2021
“This seventh edition of the [UNESCO Science Report] monitors the development path that countries have been following over the past five years from the perspective of science governance. It documents the rapid societal transformation under way, which offers new opportunities for social and economic experimentation but also risks exacerbating social inequalities, unless safeguards are put in place.” – UNESCO
See the full report here [758 pages].
Read the Executive Summary here.