Incremental Schema Fingerprinting with MinHash for Drift Detection in Data Lakes

Mohan Vamsi Musunuru; Sai Charan Ponnoju; Chandan Gnana Murthy

Authors

Mohan Vamsi Musunuru, Amazon, USA
Sai Charan Ponnoju, Fidelity Investments, USA
Chandan Gnana Murthy, Amtech Analytics, USA

Keywords:

MinHash, schema drift, data lakes, fingerprinting, semantic tagging, compliance monitoring, metadata catalogs, drift detection

Abstract

This work presents an incremental schema fingerprinting approach based on MinHash that makes schema drift in large data lakes faster and easier to detect. The method builds compact hash signatures from the column names, data types, and semantic tags of evolving table schemas. Lightweight weekly comparisons of these signatures can reveal quality or compliance issues. Because the fingerprints include semantic tags, structured monitoring can also surface subtle semantic changes in a schema that would otherwise obscure data provenance or governance. The system is evaluated on open-source datasets and real-world ingestion workloads in terms of accuracy, recall, and computation runtime. Detected drift is flagged and displayed through metadata catalogue systems to help data stewards review schemas. Experimental findings indicate reliable, space-efficient detection suitable for near-real-time governance enforcement in petabyte-scale data lakes. The approach combines compliance-aware observability with scalable summarisation to improve schema monitoring.
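The abstract does not give implementation details, so the following is only a minimal sketch of how a MinHash schema fingerprint could work, not the authors' implementation. It assumes each column is tokenised as a "name|type|tag" string, uses a signature length of 128, and derives the hash family from salted BLAKE2b; the names `fingerprint`, `update_fingerprint`, and `estimated_jaccard` are hypothetical and chosen for illustration.

```python
import hashlib
from typing import Iterable, List, Set, Tuple

# Assumption: 128 hash functions per signature; the paper does not state a value.
NUM_PERM = 128

# A column is modelled as (name, data type, semantic tag).
Column = Tuple[str, str, str]


def _hash64(token: str, seed: int) -> int:
    """64-bit hash of a token; the seed selects one member of the hash family."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def schema_tokens(columns: Iterable[Column]) -> Set[str]:
    """Turn each column into one case-insensitive token (an assumed encoding)."""
    return {f"{n.lower()}|{t.lower()}|{s.lower()}" for n, t, s in columns}


def fingerprint(columns: Iterable[Column], num_perm: int = NUM_PERM) -> List[int]:
    """MinHash signature: the minimum hash of the token set under each seed."""
    tokens = schema_tokens(columns)
    if not tokens:
        raise ValueError("cannot fingerprint an empty schema")
    return [min(_hash64(tok, seed) for tok in tokens) for seed in range(num_perm)]


def update_fingerprint(signature: List[int], added: Iterable[Column]) -> List[int]:
    """Fold newly added columns into an existing signature.

    Only additions can be handled this way; a removed or altered column
    requires recomputing the signature from the full schema.
    """
    tokens = schema_tokens(added)
    return [
        min([current] + [_hash64(tok, seed) for tok in tokens])
        for seed, current in enumerate(signature)
    ]


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

In this sketch, a drift check compares the stored signature of a table with the signature of its current schema and raises an alert when the estimated similarity falls below a chosen threshold. Changing a column's type or semantic tag replaces its token, so the estimate drops even when the column name is unchanged.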


