Entity Matching with Similarity Encoding

A Supervised Learning Recommendation Framework for Linking (Big) Data

The project is funded by DFG as part of the Infrastructure Priority Program New Data Spaces for the Social Sciences (SPP2431) under Grant 539465691.

Pantelis Karapanagiotis

EBS Business School

SAFE Leibniz Institute

Marius Liebald

Goethe University Frankfurt

AI/Digitalization Brownbag, University of Groningen, May 6th, 2024

Motivation

Left data set.
(index) | Model (alphanumeric) | Producer (alphanumeric) | Origin (alphanumeric) | Sales (in Mil.) (numeric)
1 | Model T | Ford | USA | 16.5
2 | Model A | Ford | USA | 4.8
3 | Beetle | Volkswagen | Germany | 21.5

Right data set.
(index) | Name (alphanumeric) | Firm (alphanumeric) | Country (alphanumeric) | Engine (in Lt.) (numeric)
1 | T Model | Ford | United States | 2.9
2 | Corolla | Toyota | Japan | 1.8
3 | Beetle | Volkswagen | Germany | 1.6
4 | Mdl 124 | Fiat | Italy | 1.4

A record matching toy example with two sources.


(Levenshtein 1965 similarity) (Model T, Mdl 124) = 0.8

(Hamming 1950 similarity) (Model T, Mdl 124) = 0.75

(Levenshtein 1965 similarity) (Model T, T Model) = 0.75

(Token sort ratio) (Model T, T Model) = 1
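The effect of token sorting on the last pair can be sketched in a few lines of Python. This is a minimal illustration, not the similarity implementation used in the project: it assumes `difflib`'s matching-blocks ratio as a stand-in for an edit-based similarity, so the plain score it produces need not coincide with the Levenshtein value on the slide. The function names `ratio` and `token_sort_ratio` are chosen here for illustration.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] based on difflib's matching blocks (a stand-in metric)."""
    return SequenceMatcher(None, a, b).ratio()


def token_sort_ratio(a: str, b: str) -> float:
    """Sort whitespace-separated tokens first, so that word order is ignored."""
    sorted_a = " ".join(sorted(a.split()))
    sorted_b = " ".join(sorted(b.split()))
    return ratio(sorted_a, sorted_b)


plain_score = ratio("Model T", "T Model")
sorted_score = token_sort_ratio("Model T", "T Model")
print(f"plain: {plain_score:.2f}, token sort: {sorted_score:.2f}")
```

Both strings reduce to the same sorted token sequence, so the token sort score is exactly 1, while the order-sensitive score stays below 1.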

Previous Work

Overview and Contribution

Matching as a Multi-class Problem

  • Use the firms of \(Right\) as output classes.
  • Fit a discrete choice model and estimate an output class for each \(f_{l}\) from \(Left\).
  • But:
    • Say, \(Right\) is the main data set.
    • It gets updated and now contains additional firms.
    • The model will not give classification estimates for the new firms.

Comparison-based Matching

  • Make pairs of records for each firm \(f_{l}\) in the \(Left\) and \(f_{r}\) in the \(Right\) data.
  • Classify each pair \((f_{l}, f_{r})\) as a match (\(label = 1\)) or no match (\(label = 0\)).
  • But:
    • This requires fitting a model on the Cartesian product of \(Left\) and \(Right\).
    • This approach quickly becomes computationally infeasible.
    • Even for small data sources, say \(N_{L} = 5\cdot10^{3}\) and \(N_{R} = 10^{4}\), the \(N_{L} \cdot N_{R} = 5 \cdot 10^{7}\) candidate pairs require hundreds of GB.
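The arithmetic behind the last bullet can be checked with a short sketch. The per-pair footprint is an assumption made here for illustration only (4,000 bytes, e.g. 500 double-precision features per pair); the slides do not state it, and the actual footprint depends on the fields and features used.

```python
def candidate_pairs(n_left: int, n_right: int) -> int:
    """Size of the Cartesian product of the two sources."""
    return n_left * n_right


def memory_gb(n_pairs: int, bytes_per_pair: int) -> float:
    """Memory needed to hold all pairs, in gigabytes (10^9 bytes)."""
    return n_pairs * bytes_per_pair / 1e9


n_pairs = candidate_pairs(5_000, 10_000)

# Assumed footprint per pair: 500 features * 8 bytes (hypothetical value).
ASSUMED_BYTES_PER_PAIR = 500 * 8

print(f"{n_pairs:,} pairs, about {memory_gb(n_pairs, ASSUMED_BYTES_PER_PAIR):.0f} GB")
```

Because the pair count grows with the product of the source sizes, doubling both sources quadruples the memory requirement under the same assumption.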

Blocking

  • One solution is to exclude some potential pairs based on pre-defined criteria before training.
  • E.g., match firms from \(Left\) only with firms from \(Right\) that have the same foundation year.
  • This effectively reduces the computational burden.
  • But:
    • The blocking criteria require prior expertise with the data,
    • are not re-usable in different contexts, and
    • may exclude potential matches.
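A toy sketch of blocking on the foundation year shows both the saving and the risk listed above. The records are made up for illustration, and `block_pairs` is a hypothetical helper, not part of any package mentioned in this talk.

```python
from collections import defaultdict
from itertools import product

# Made-up records; r2 is the same firm as l2, but its year has transposed digits.
left = [
    {"id": "l1", "name": "Acme AG", "year": 1901},
    {"id": "l2", "name": "Borsig GmbH", "year": 1912},
]
right = [
    {"id": "r1", "name": "ACME Aktiengesellschaft", "year": 1901},
    {"id": "r2", "name": "Borsig G.m.b.H.", "year": 1921},
    {"id": "r3", "name": "Cement Werke", "year": 1912},
]


def block_pairs(left_records, right_records, key):
    """Keep only the pairs whose blocking key is identical on both sides."""
    buckets = defaultdict(list)
    for record in right_records:
        buckets[record[key]].append(record)
    return [
        (l["id"], r["id"])
        for l in left_records
        for r in buckets.get(l[key], [])
    ]


all_pairs = [(l["id"], r["id"]) for l, r in product(left, right)]
blocked_pairs = block_pairs(left, right, "year")

print(len(all_pairs), "pairs without blocking,", len(blocked_pairs), "with blocking")
print("true match (l2, r2) kept:", ("l2", "r2") in blocked_pairs)
```

Blocking cuts the six candidate pairs down to two, but the true match between `l2` and `r2` is dropped because of a single dirty year value.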

Matching with Semantic Similarities

  • Semantic similarity approaches:
    • Embed a field from \(Left\) in a high-dimensional vector space.
    • Embed a field from \(Right\) in the same space.
    • Calculate the cosine similarity between the two vectors.
    • Do this for all fields of interest from \(Left\) and \(Right\).
  • But:
    • Dirty data (e.g., misspellings) lead to poor similarities due to out-of-vocabulary words.
    • Can be scaled up, but at what cost?
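The out-of-vocabulary problem can be illustrated with a deliberately small sketch. It assumes a word-level lookup table with made-up three-dimensional vectors, maps unknown tokens to the zero vector, and averages token vectors per field; real embedding models differ in all of these choices.

```python
import math

# Made-up word vectors; any token not listed here is out of vocabulary.
EMBEDDINGS = {
    "volkswagen": [0.9, 0.1, 0.0],
    "beetle": [0.1, 0.9, 0.2],
    "ford": [0.8, 0.0, 0.3],
}
DIM = 3


def embed(text: str) -> list[float]:
    """Average the token vectors of a field; unknown tokens contribute zeros."""
    vectors = [EMBEDDINGS.get(token, [0.0] * DIM) for token in text.lower().split()]
    if not vectors:
        return [0.0] * DIM
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


clean = cosine(embed("Volkswagen Beetle"), embed("Volkswagen Beetle"))
dirty = cosine(embed("Volkswagen Beetle"), embed("Volkswagn Beetel"))
print(f"clean: {clean:.2f}, misspelled: {dirty:.2f}")
```

Two small typos push every token of the second field out of the vocabulary, so the similarity collapses even though a human reader would recognize the same entity.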

Matching in Economics and Finance

  • Primarily via human computation (Bartram, Hou, and Kim 2022; Wojcik et al. 2021; Persson 2020).
  • Most matching problems in economics and finance do not entail semantic ambiguities (e.g., matching firms, products, or persons).
  • Detecting common entities is based on calculating alphanumeric similarities and fuzzy matching.
  • Matching is typically solved per project, which prevents the use of complicated end-to-end systems or sophisticated statistical models.

Matching with a Similarity Encoder

  • We propose a matching approach that:
    • Combines ANN record matching found in state-of-the-art end-to-end systems in CS with similarity-based matching in economics and finance.
    • Encodes the similarity between two fields using a similarity map (comparison-based approach).
    • Does not depend on distributional assumptions (unlike probabilistic record linkage).
    • Is context-independent and does not require expertise with the data (minimal human-in-the-loop requirements).
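The idea of encoding field similarities through a similarity map can be sketched as follows. This is a rough illustration under stated assumptions, not the project's C++ encoder: the map layout, the function names, and the use of `difflib` as the string similarity are all chosen here for the example, and the records come from the motivating tables.

```python
from difflib import SequenceMatcher


def discrete(a: str, b: str) -> float:
    """1 if the two values are identical, 0 otherwise."""
    return 1.0 if a == b else 0.0


def ratio(a: str, b: str) -> float:
    """Order-sensitive string similarity in [0, 1] (difflib stand-in)."""
    return SequenceMatcher(None, a, b).ratio()


def token_sort(a: str, b: str) -> float:
    """String similarity after sorting the tokens of each value."""
    return ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


# Hypothetical similarity map: (left field, right field) -> similarity functions.
SIMILARITY_MAP = {
    ("Model", "Name"): [discrete, ratio, token_sort],
    ("Producer", "Firm"): [discrete, ratio],
    ("Origin", "Country"): [discrete, ratio],
}


def encode(left_record: dict, right_record: dict, similarity_map: dict) -> list[float]:
    """Turn a pair of records into a fixed-length vector of similarities."""
    return [
        similarity(str(left_record[left_field]), str(right_record[right_field]))
        for (left_field, right_field), similarities in similarity_map.items()
        for similarity in similarities
    ]


left_record = {"Model": "Model T", "Producer": "Ford", "Origin": "USA"}
right_record = {"Name": "T Model", "Firm": "Ford", "Country": "United States"}

encoding = encode(left_record, right_record, SIMILARITY_MAP)
print([round(value, 2) for value in encoding])
```

The length of the vector depends only on the similarity map, not on the records, so a vector of this kind could serve as the input of a classifier that predicts match or no match.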

Methodology

Similarity Encoding Data Transformation

Network Architecture

Similarity Encoding Properties

Proposition 1
A set \(X\) is metrizable if and only if there is a normalized similarity \(s\) making \(\left(X, s\right)\) a normalized similarity space.
Proposition 2
With the product norm induced by the supremum norm used on its domain and the \(\ell^{1}\) norm on its range, the similarity map \(\mathrm{S}\) is an isometry. In particular, \(\mathrm{S}\) is a restriction of a bounded and continuous operator with operator norm equal to 1.
Proposition 3
Moreover, \(\mathrm{S}\) maps convex combinations of instructions to convex combinations of similarity encoders.

Performance Evaluation

Benchmark Cases

Table 1: Benchmark data summary statistics.
Task | Dataset | Domain | Left Fields | Left #Records | Right Fields | Right #Records | #Matches | Rel. | Dirty
(B1) | DBLP-ACM | Bibliographic | 4 (0) | 2,614 | 4 (0) | 2,294 | 2,224 | 1:1 |
(B2) | Abt-Buy | E-commerce | 3 (1) | 1,081 | 4 (2) | 1,092 | 1,097 | m:1 | yes
(B3) | Amazon-GoogleProducts | E-commerce | 4 (2) | 1,363 | 4 (1) | 3,226 | 1,300 | m:1 | yes

Benchmark Results

Table 2: Benchmarking results.
EM System Source F-score DBLP-ACM F-score Abt-Buy F-score Amazon-GoogleProducts
Magellan Mudgal et al. (2018) 98.4 43.6 49.1
DeepER Ebraheem et al. (2018) 96.0 98.6
DeepMatcher Mudgal et al. (2018) 98.4 62.8 69.3
Ditto Li et al. (2021) 99.0 75.6
AdaMEL-hyb Jin et al. (2021) 98.9 65.1
RuleSynth Singh et al. (2017) 92.6 63.8
CorDEL Wang et al. (2020) 99.2 64.9 70.2
AutoFJ Li et al. (2021) 97.7 61.3
ZeroER Wu et al. (2020) 96.0 52.0 48.0
MLMatch This Article 99.8 76.6 83.6
MLMatch Rank 1. 1. 2.

Applications

Matching Historical Company Entities

Results

Table 3: Test sample model evaluation for the company matching application.
(1) Iteration | (2) TP | (3) FP | (4) TN | (5) FN | (6) Accuracy | (7) Precision | (8) Recall | (9) F-Score
1 | 256 | 0 | 6430 | 2 | 99.97 | 100 | 99.22 | 99.61
2 | 253 | 0 | 6430 | 5 | 99.93 | 100 | 98.06 | 99.02
3 | 256 | 2 | 6428 | 2 | 99.94 | 99.22 | 99.22 | 99.22
4 | 257 | 0 | 6430 | 1 | 99.99 | 100 | 99.61 | 99.81
5 | 258 | 4 | 6426 | 0 | 99.94 | 98.47 | 100 | 99.23
Average | 256 | 1.2 | 6428.8 | 2 | 99.95 | 99.53 | 99.22 | 99.38
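The percentage columns follow from the confusion counts through the standard definitions of accuracy, precision, recall, and F-score. A minimal sketch that recomputes the first iteration of Table 3 (the function name `confusion_metrics` is chosen here for illustration):

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Accuracy, precision, recall, and F-score in percent, rounded to two decimals."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    values = {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }
    return {name: round(100 * value, 2) for name, value in values.items()}


# Iteration 1 of Table 3: TP = 256, FP = 0, TN = 6430, FN = 2.
iteration_1 = confusion_metrics(tp=256, fp=0, tn=6430, fn=2)
print(iteration_1)
```

With 6,430 true negatives against 258 actual matches, accuracy is dominated by the negative class, which is why precision, recall, and the F-score are the more informative columns.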

Matching Historical Natural Person Entities

Table 4: Test sample model evaluation for the natural person matching application.
(1) Iteration | (2) TP | (3) FP | (4) TN | (5) FN | (6) Accuracy | (7) Precision | (8) Recall | (9) F-Score
1 | 54 | 0 | 266 | 0 | 100 | 100 | 100 | 100
2 | 54 | 0 | 266 | 0 | 100 | 100 | 100 | 100
3 | 54 | 0 | 266 | 0 | 100 | 100 | 100 | 100
4 | 54 | 0 | 266 | 0 | 100 | 100 | 100 | 100
5 | 54 | 0 | 266 | 0 | 100 | 100 | 100 | 100
Average | 54 | 0 | 266 | 0 | 100 | 100 | 100 | 100

Entity Matching with Similarity Encoding

  • An entity matching framework combining ANNs with fuzzy matching.
  • Properties of similarity encoding:
    • reduces the human-in-the-loop requirements and the need for data expertise,
    • mitigates the need for blocking, and
    • allows efficient computational scaling.
  • Benchmarks: outperforms or is on par with large end-to-end entity matching systems.
  • Two novel applications matching historical dirty data of companies and persons.
  • Want to see it in action?
    • Similarity encoder in C++, ANN Model in pymlmatch and rmlmatch.

References

Adam, Sébastien, Jan Annaert, Frans Buelens, Bertrand B. Coüasnon, Boris Cule, Amaury de Vicq, Camille Guerry, et al. 2021. “Data Extraction and Matching the EurHisFirm Experience.” In Methodological Advances in the Extraction and Analysis of Historical Data. Chicago/Virtual, United States: Kellogg School of Management - Northwestern University. https://hal.science/hal-03828381.
Bartram, Söhnke M, Kewei Hou, and Sehoon Kim. 2022. “Real Effects of Climate Policy: Financial Constraints and Spillovers.” Journal of Financial Economics 143 (2): 668–96. https://doi.org/10.1016/j.jfineco.2021.06.015.
Cule, Boris, Frans Buelens, Johan Poukens, Jan Annaert, and Johan Richer. 2020. “EurHisFirm M6.2: Data Connecting Case Study.” Zenodo. https://doi.org/10.5281/zenodo.4309048.
Doll, Hendrik, Eniko Gabor-Toth, and Christopher-Johannes Schild. 2021. “Linking Deutsche Bundesbank Company Data.” Deutsche Bundesbank, Research Data and Service Centre.
Ebraheem, Muhammad, Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, and Nan Tang. 2018. “Distributed Representations of Tuples for Entity Resolution.” Proceedings of the VLDB Endowment 11 (11): 1454–67. https://doi.org/10.14778/3236187.3236198.
Fellegi, Ivan P., and Alan B. Sunter. 1969. “A Theory for Record Linkage.” Journal of the American Statistical Association 64 (328): 1183–1210. https://doi.org/10.1080/01621459.1969.10501049.
Hamming, Richard W. 1950. “Error Detecting and Error Correcting Codes.” The Bell System Technical Journal 29 (2): 147–60. https://doi.org/10.1002/j.1538-7305.1950.tb00463.x.
Jin, Di, Bunyamin Sisman, Hao Wei, Xin Luna Dong, and Danai Koutra. 2021. “Deep Transfer Learning for Multi-Source Entity Linkage via Domain Adaptation.” In Proceedings of the VLDB Endowment, 15:465–77. https://doi.org/10.14778/3494124.3494131.
Karapanagiotis, Pantelis. 2019. “EurHisFirm D5.1: Technical Document on National Data Models.” Zenodo. https://doi.org/10.5281/zenodo.3467926.
Levenshtein, Vladimir Iosifovich. 1965. “Binary Codes Capable of Correcting Deletions, Insertions, and Reversals.” In Proceedings of the USSR Academy of Sciences, 163:845–48. Russian Academy of Sciences.
Li, Peng, Xiang Cheng, Xu Chu, Yeye He, and Surajit Chaudhuri. 2021. “Auto-FuzzyJoin: Auto-Program Fuzzy Similarity Joins Without Labeled Examples.” In Proceedings of the 2021 International Conference on Management of Data, 1064–76. https://doi.org/10.1145/3448016.3452824.
Mudgal, Sidharth, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. “Deep Learning for Entity Matching: A Design Space Exploration.” In Proceedings of the 2018 International Conference on Management of Data, 19–34. https://doi.org/10.1145/3183713.3196926.
Persson, Petra. 2020. “Social Insurance and the Marriage Market.” Journal of Political Economy 128 (1): 252–300. https://doi.org/10.1086/704073.
Poukens, Johan. 2018. “EurHisFirm D4.2: Report on the Inventory of Data and Sources.” Zenodo. https://doi.org/10.5281/zenodo.3246457.
Sadinle, Mauricio. 2017. “Bayesian Estimation of Bipartite Matchings for Record Linkage.” Journal of the American Statistical Association 112 (518): 600–612. https://doi.org/10.1080/01621459.2016.1148612.
Singh, Rohit, Venkata Vamsikrishna Meduri, Ahmed Elmagarmid, Samuel Madden, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Armando Solar-Lezama, and Nan Tang. 2017. “Synthesizing Entity Matching Rules by Examples.” Proceedings of the VLDB Endowment 11 (2): 189–202. https://doi.org/10.14778/3149193.3149199.
Stringham, Thomas. 2022. “Fast Bayesian Record Linkage With Record-Specific Disagreement Parameters.” Journal of Business & Economic Statistics 40 (4): 1509–22. https://doi.org/10.1080/07350015.2021.1934478.
Universitätsbibliothek Mannheim. 2019a. “Handbuch Der Deutschen Aktiengesellschaften.” Heppenheim (Bergstr.), Berlin: urn:nbn:de:bsz:180-dighop-181; Hoppenstedt. http://digi.bib.uni-mannheim.de/urn/urn:nbn:de:bsz:180-dighop-181.
———. 2019b. “Wer Leitet.” Heppenheim (Bergstr.), Berlin: urn:nbn:de:bsz:180-dighop-43; Hoppenstedt. http://digi.bib.uni-mannheim.de/urn/urn:nbn:de:bsz:180-dighop-43.
Wang, Zhengyang, Bunyamin Sisman, Hao Wei, Xin Luna Dong, and Shuiwang Ji. 2020. “CorDEL: A Contrastive Deep Learning Approach for Entity Linkage.” In 2020 IEEE International Conference on Data Mining (ICDM), 1322–27. IEEE. https://doi.org/10.1109/ICDM50108.2020.00171.
Wojcik, Stefan, Avleen S. Bijral, Richard Johnston, Juan M. Lavista Ferres, Gary King, Ryan Kennedy, Alessandro Vespignani, and David Lazer. 2021. “Survey Data and Human Computation for Improved Flu Tracking.” Nature Communications 12 (1): 194. https://doi.org/10.1038/s41467-020-20206-z.
Wu, Renzhi, Sanya Chaba, Saurabh Sawlani, Xu Chu, and Saravanan Thirumuruganathan. 2020. “ZeroER: Entity Resolution Using Zero Labeled Examples.” In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 1149–64. https://doi.org/10.1145/3318464.3389743.

Appendix A: Usage

Python Interface

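import tensorflow  # provides the Keras optimizer used in compile() below

# Each key is a field name (or a left~right field pair); each value lists the
# similarities to encode for it, given by name or as a custom callable.
similarity_map = {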
    "company_name": [
        "discrete",
        "partial",
        my_custom_awesome_similarity
    ],
    "address~address1": [ "partial" ],
    "address~address2": [ "partial" ],
    "purpose": [ 
        "sort",
        lambda x, y: x*y + 0.42 - y*x 
    ],
    "foundation": [
        "discrete",
        "partial"
    ]
}

model = match.MatchingModel(similarity_map)

model.compile(
    loss="binary_crossentropy",
    optimizer=tensorflow.keras.optimizers.Adam(learning_rate=0.01),
    metrics=evaluation_metrics)

train_left, train_right, train_matches = load_train_data()
model.fit(train_left, train_right, train_matches, epochs=100)

model.evaluate(train_left, train_right, train_matches)

predictions = model.predict(train_left, train_right)
suggestions = model.suggest(train_left, train_right, 3)

R Interface

similarity_map <- list(
    company_name = c(
        "discrete",
        "partial",
        my_custom_awesome_similarity
    ),
    `address~address1` = c("partial"),
    `address~address2` = c("partial"),
    purpose = c(
        "sort",
        function(x, y) x*y + 0.42 - y*x 
    ),
    foundation = c(
        "discrete",
        "partial"
    )
)

model <- matching_model(similarity_map)

model |> compile(
      loss = keras::loss_binary_crossentropy(),
      optimizer = keras::optimizer_adam(learning_rate = 1e-3),
      metrics = evaluation_metrics)

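# R has no tuple unpacking; this assumes load_train_data() returns a list
# with the elements left, right, and matches.
train_data <- load_train_data()
model |> fit(train_data$left, train_data$right, train_data$matches, epochs = 100L)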

model |> evaluate(left_test, right_test, matches_test)

predictions <- model |> predict(left, right)
suggestions <- model |> suggest(left, right, count = 3)

Appendix B: Fields, Similarities, and Ratios used in the Applications

DBLP-ACM

Left Field | Right Field | Similarities | Ratios
title | title | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set
authors | authors | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set
venue | venue | Levenshtein, Jaro-Winkler, discrete | partial, token sort, token set, partial token set, not missing
year | year | Euclidean, Gaussian |

Abt-Buy

Left Field | Right Field | Similarities | Ratios
description | description | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set
name | name | Levenshtein, Jaro-Winkler, discrete | partial, token sort, token set, partial token set
description | name | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set
name | description | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set
price | price | Levenshtein, Jaro-Winkler, discrete | partial, token sort, token set, partial token set
name | manufacturer | | partial, partial token set

Amazon-Google Products

Left Field | Right Field | Similarities | Ratios
description | manufacturer | | partial, partial token set
description | description | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set, not missing
title | name | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set, not missing
description | name | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set, not missing
title | description | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set, not missing
manufacturer | manufacturer | Levenshtein, Jaro-Winkler, discrete | partial, token sort, token set, partial token set, not missing
price | price | Levenshtein, Jaro-Winkler, discrete | partial, token sort, token set, partial token set, not missing

Handbuch der deutschen Aktiengesellschaften (2019a)

Left Field | Right Field | Similarities | Ratios
company name | company name | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set
company info 1 | company info 1 | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set
company info 2 | company info 2 | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set
found date | found date | discrete |
found year | found year | discrete |
register date | register date | discrete |
register year | register year | discrete |
concession date | concession date | discrete |
concession year | concession year | discrete |
statute change date | statute change date | discrete |
company name | company info 1 | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set
company name | company info 2 | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set
company info 1 | company info 2 | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set

Wer Leitet (2019b)

Left Field | Right Field | Similarities | Ratios
main info | main info | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set
Vorstand | Vorstand | Levenshtein, Jaro-Winkler |
StVdAR | StVdAR | Levenshtein, Jaro-Winkler |
GeschF | GeschF | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set
Leiter | Leiter | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set
Beirat | Beirat | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set
AR | AR | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set
name | name | Levenshtein, Jaro-Winkler, discrete |
surname | surname | Levenshtein, Jaro-Winkler, discrete |
occupation | occupation | Levenshtein, Jaro-Winkler, discrete |
address | address | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set
birth date | birth date | discrete |
raw text | raw text | | token set, partial token set