The purpose of this article is to sketch the technical issues behind duplicate identification in the context of the OpenAIRE infrastructure. Duplicate identification, followed by the merging of duplicates, is the most important phase of the deduplication process. The major challenge in duplicate identification is the trade-off between efficacy, i.e. the ability to identify all possible groups of duplicates, and efficiency, i.e. time to process. Our intention is to explain the reasons for possible imperfections in the deduplication process and highlight the importance of end-user feedback in refining it.
When aggregating publication bibliographic records by collecting from hundreds of repositories, CRIS systems, and web sources, it is very common to encounter so-called duplicates, namely groups of records describing the very same real-world document. Deduplication of a collection of bibliographic records is the process of scanning the whole collection to identify such groups and eventually replace them with only one record (i.e. “merging”), which unambiguously describes the publication in the collection.
Without doubt, humans are the best agents for accomplishing this task, and no machinery will ever replace their ability to judge whether two records are indeed duplicates. On the other hand, humans are effective only when the collection is small and the process does not have to be repeated often due to frequent changes to the collection. This is unfortunately not the case for scholarly communication aggregators and infrastructures such as Google Scholar, OpenAIRE, DOAJ, BASE, OAIster, CORE-UK, etc. In fact, extensive literature has been written on automated deduplication methodologies, also known as “record linkage”, “disambiguation”, “named entity recognition”, etc.
Efficiency is the main reason why machines are involved, but given the sizes of the collections handled by OpenAIRE, performance still remains an issue. Most (open source) tools for duplicate identification take from 3-4 hours (FRIL) to 19-20 hours (LinkageWiz) to process a collection of 10 million records, and they only identify pairs of equivalent records (i.e. no identification of groups of similar records, no merging of groups of similar records). It should be noticed that such tools also implement heuristics (e.g. blocking, sliding window) to reduce execution time by avoiding comparisons between all possible pairs. Heuristics efficiently identify subcollections of records that are “likely” to be duplicates and then apply the comparison only to those. As a consequence, they improve efficiency but may indeed decrease efficacy, since some duplicate records may mistakenly be left out by the grouping algorithms.
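To make the blocking idea concrete, here is a minimal sketch (not the actual implementation of any of the tools named above): records are grouped by a cheap key, here the first few characters of the normalized title, and pairwise comparison happens only within each group. The record structure and key length are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

def blocking_pairs(records, key_len=3):
    """Group records by a cheap blocking key (the first key_len
    characters of the normalized title) and yield candidate pairs
    only within each block, instead of all possible pairs."""
    blocks = defaultdict(list)
    for rec in records:
        key = rec["title"].lower().strip()[:key_len]
        blocks[key].append(rec)
    for block in blocks.values():
        for a, b in combinations(block, 2):
            yield a, b

records = [
    {"id": 1, "title": "A cat perspective on the logic of mouses"},
    {"id": 2, "title": "A cat perspective on the ogic of mouses"},
    {"id": 3, "title": "Deduplication at scale"},
]
pairs = list(blocking_pairs(records))
# Records 1 and 2 share a block and are compared; record 3 falls in a
# different block and is never matched against them. Note the efficacy
# risk: a typo in the first characters of a title would put a true
# duplicate in the wrong block and hide it from the comparison.
```

With 10 million records, exhaustive comparison would mean roughly 5×10¹³ pairs; blocking cuts this to the sum of the (much smaller) within-block pair counts, which is exactly the efficiency/efficacy trade-off described above.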
Efficacy mainly depends on the ability to judge whether two records are duplicates. Record comparison is based on similarity distance functions capable of matching two records and producing a measure of similarity between 0 and 1. Such functions should replace humans and are often based on (combinations of) string matching functions. Similarity functions are hard to define, since they should take into account factors that are easy to spot for humans and less easy for machines. The task is further complicated by the fact that Dublin Core metadata records are (i) often incomplete, i.e. key values may be missing, dates may only indicate the year; (ii) often misused, e.g. dc:creator may contain names of organizations/initiatives, dates may include the 1st of January or 31st of December as simplifications; and (iii) collected from different repositories that naturally expose heterogeneous semantics, e.g. author names may be specified according to different patterns, dc:date may contain publication dates or upload dates. Once a distance function is defined, one has to decide beyond which threshold two records are to be considered equivalent. And this is another challenge, since a minimal percentage of difference may lead to considering two distinct records as equivalent. For example, the titles “A cat perspective on the logic of mouses” and “A cat perspective on the logic of mouses v2” are different but may be 0.98 similar. If we raise the equivalence threshold to 0.99, then we would lose the match between the records “A cat perspective on the logic of mouses” and “A cat perspective on the ogic of mouses”, which differ only due to a typo (the missing “l”).
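The threshold dilemma can be reproduced with a few lines of code. The sketch below uses Python's standard-library Ratcliff/Obershelp similarity (difflib) as a stand-in for whatever combination of string matching functions a real system would use; the exact scores depend on the chosen measure.

```python
from difflib import SequenceMatcher

def title_similarity(a, b):
    """Normalized string similarity in [0, 1] (Ratcliff/Obershelp,
    via Python's difflib). Real systems often combine several such
    measures (Levenshtein, Jaro-Winkler, token-based, ...)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

t1 = "A cat perspective on the logic of mouses"
t2 = "A cat perspective on the logic of mouses v2"  # genuinely different record
t3 = "A cat perspective on the ogic of mouses"      # same record, one typo

print(title_similarity(t1, t2))  # very high, yet the records differ
print(title_similarity(t1, t3))  # also very high, yet the records match
```

Both pairs score above 0.95 with this measure, so any single threshold in that region risks either merging the two genuinely distinct versions or splitting the typo pair.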
The OpenAIRE case
In OpenAIRE the records we collect from data repositories, publication repositories, and CRIS systems are heavily dynamic and heterogeneous, and hence require frequent cycles of deduplication. Moreover, once a group of equivalent records is identified, we need to create one "representative record" to support coherent statistics, but also keep the original records to guarantee visibility to the participating data sources. Our deduplication algorithms for publication records are based only on titles, date years, and author strings. When titles are not enough for a match, dates and authors are used as hints (e.g. year of the date only, number of authors) to strengthen the likelihood of similarity. Our algorithms run in parallel on an 8-node Hadoop Map-Reduce cluster (@ICM Labs, Warsaw), which reduces deduplication time drastically even for tens of millions of records (as described in ). Today the time to identify pairs of candidates is around 1h 30m for around 12 million records, trying to preserve efficacy (around 1.4 billion pairs matched) by exploiting the power of parallel execution. Total deduplication time, inclusive of identifying groups of equivalent records, merging them into one representative record, and redistributing the relationships from/to merged records onto the representative, is overall 2h 30m. The next algorithm upgrades will refine this process by including further record context information in the record matching, as soon as this information is made available by OpenAIRE inference services (e.g. classification schemes, co-authorships). Last but not least, OpenAIRE will provide tools allowing end-users to feed information back from the portal to the system in order to adjust the deduplication process. Any information that may confirm the correctness or invalidity of the deduplication output can be used to refine the algorithms and improve their subsequent runs.
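The "titles first, dates and authors as hints" decision described above can be sketched as a two-stage rule. Everything in this snippet (the function name, the thresholds, the record fields) is illustrative and not OpenAIRE's actual configuration: a near-exact title match is accepted outright, while a borderline title match is accepted only if the publication year and the number of authors agree.

```python
from difflib import SequenceMatcher

def match(rec_a, rec_b, strong=0.99, weak=0.90):
    """Two-stage duplicate decision (illustrative thresholds):
    - title similarity >= strong: accept immediately;
    - weak <= similarity < strong: accept only if the year and the
      number of authors agree as supporting hints;
    - otherwise: reject."""
    sim = SequenceMatcher(None,
                          rec_a["title"].lower(),
                          rec_b["title"].lower()).ratio()
    if sim >= strong:
        return True
    if sim >= weak:
        return (rec_a["year"] == rec_b["year"]
                and len(rec_a["authors"]) == len(rec_b["authors"]))
    return False

a = {"title": "A cat perspective on the logic of mouses",
     "year": 2012, "authors": ["G. James", "M. Rossi"]}
b = {"title": "A cat perspective on the ogic of mouses",
     "year": 2012, "authors": ["G James", "M Rossi"]}
print(match(a, b))  # borderline title similarity, but year and author count agree
```

Using the author count rather than the author strings themselves sidesteps the heterogeneous name patterns mentioned earlier, at the cost of some discriminating power.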
Conclusions
In conclusion, the quality of machine-based deduplication strongly depends on the completeness and uniformity of the records to be matched, but it is in general not in the same league as human assessment. More generally, as exemplified above, what seems a simple deduction from a human perspective may instead become a challenge for a machine that needs to apply the very same reasoning over millions of records. As a consequence, identifying false duplicates or missing true duplicates is a common phenomenon. In fact, reducing these two factors is the real “art” and continuous work of data curators. As an example, run this search on Google Scholar (accessed 21/02/2015, see Figure 1) and visualize the 54 versions available of the first result, “Preview”: these are different documents, with different titles, different authors, and over different years.
Figure 1: First page of 54 documents that according to Google are equivalent to “Preview” by G James, 2013 (results accessed on 21/02/2015)
Manghi, Paolo, Marko Mikulicic, and C. Atzori. "De-duplication of aggregation authority files." International Journal of Metadata, Semantics and Ontologies 7.2 (2012): 114-130. From http://inderscience.metapress.com/content/j535u23177w2m804/