Datafold Launches New Open-Source Data-Diff to Compare Tables of Any Size Across Databases

Published by: Insights Desk | Released: Jun 24, 2022

Highlights:

Datafold’s focus has been on automated testing during the transformation step with Data Diff.
Developers and data analysts who want to compare databases quickly and effectively, without building a DIY diff tool, can use Datafold’s solution, which constitutes a significant advancement.

Datafold, a data reliability company, recently announced the release of a new open-source cross-database diffing tool called data-diff. The new tool is an open-source extension of Datafold’s original Data Diff tool for comparing data sets. Open-source data-diff uses high-performance algorithms to verify the consistency of data across databases.

In the current data stack, businesses gather data from sources, load it into a warehouse, and then transform it for analysis, activation, or data science use cases. Datafold’s focus has been on automated testing during the transformation step with Data Diff. This ensures that a change to a data model does not break a dashboard or feed a predictive algorithm the wrong data. With the release of open-source data-diff, Datafold can now assist with the extract and load phases of the process: open-source data-diff verifies that the loaded data matches the source it was copied from. Because every part of the data stack needs testing for data engineers to produce reliable data products, Datafold now provides coverage across the entire extract, load, and transform (ELT) process.

“Data-diff fulfills a need that wasn’t previously being met,” said Gleb Mezhanskiy, Datafold founder and CEO. “Every data-savvy business today replicates data between databases in some way, for example, to integrate all available data in a warehouse or data lake to leverage it for analytics and machine learning. Replicating data at scale is a complex and often error-prone process. Although multiple vendors and open-source tools provide replication solutions, there was no tooling to validate the correctness of such replication. As a result, engineering teams resorted to manual one-off checks and tedious investigations of discrepancies, and data consumers couldn’t fully trust the data replicated from other systems.”

Mezhanskiy continued, “Data-diff solves this problem elegantly by providing an easy way to validate the consistency of data sets across databases at scale. It relies on state-of-the-art algorithms to achieve incredible speed: e.g., comparing one-billion-row data sets across different databases takes less than five minutes on a regular laptop. And, as an open-source tool, it can be easily embedded into existing workflows and systems.”

Addressing an important need

Today’s organizations utilize data replication to combine data from several sources into data lakes or data warehouses for analytics. They combine data for search, move data from legacy systems to contemporary databases, and connect operational systems with real-time data pipelines.

Data synchronization between systems and apps is now simpler than ever, thanks to tools like Fivetran, Airbyte, and Stitch. Most data synchronization scenarios demand a 100% data integrity guarantee. In reality, however, records can occasionally be lost in any connected system owing to failed packets, general replication problems, or configuration errors. Validation checks with a data diff tool are therefore needed to confirm data integrity.

Developers and data analysts who want to compare databases quickly and effectively, without building a DIY diff tool, can use Datafold’s solution, which constitutes a significant advancement. Currently, data engineers compare data using a variety of techniques, from simple row counts to in-depth row-level analysis. Row counts are quick but not comprehensive, while row-level analysis ensures full validation but is slow. Open-source data-diff aims to be both quick and comprehensive.
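To illustrate the gap between the two techniques mentioned above, here is a minimal Python sketch. The tables, keys, and function names are hypothetical and exist only for illustration; they are not part of data-diff. A row count passes even though one replicated row has drifted, while a row-level comparison catches it.

```python
# Hypothetical in-memory "tables": primary key -> row values.
source = {1: ("alice", 10), 2: ("bob", 20), 3: ("carol", 30)}
replica = {1: ("alice", 10), 2: ("bob", 25), 3: ("carol", 30)}


def row_counts_match(a, b):
    """Quick check: only compares how many rows each table has."""
    return len(a) == len(b)


def row_level_diff(a, b):
    """Full check: keys whose rows are missing on one side or differ."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


print(row_counts_match(source, replica))  # True, although row 2 differs
print(row_level_diff(source, replica))    # [2]
```

The row-level check reads every row from both sides, which is why it becomes slow when the tables live in different databases.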

Building and managing data quality with open-source data-diff

Data-diff, which is now available, employs checksums to quickly and effectively confirm complete consistency between two separate data sources. This technique enables a row-level comparison of 100 million records to be completed in a matter of seconds without compromising the granularity of the resulting comparison.
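The article says only that data-diff employs checksums. One plausible way to combine checksums with row-level granularity is to checksum a key range on both sides and split the range only when the checksums disagree. The Python sketch below shows that idea under stated assumptions: the segment-and-bisect strategy, the function names, the MD5 hash, and the in-memory tables are all illustrative choices, not a description of Datafold’s implementation.

```python
import hashlib


def checksum(rows):
    """Hash a list of (key, value) rows into a single hex digest."""
    h = hashlib.md5()
    for key, value in rows:
        h.update(f"{key}|{value}\n".encode())
    return h.hexdigest()


def segment(table, lo, hi):
    """Rows with lo <= key < hi, in key order."""
    return sorted((k, v) for k, v in table.items() if lo <= k < hi)


def diff_keys(a, b, lo, hi, threshold=4):
    """Return keys in [lo, hi) whose rows differ between tables a and b."""
    rows_a, rows_b = segment(a, lo, hi), segment(b, lo, hi)
    if checksum(rows_a) == checksum(rows_b):
        return []  # the whole segment agrees, so skip it
    if hi - lo <= threshold:
        # Small segment: compare the rows themselves.
        da, db = dict(rows_a), dict(rows_b)
        return sorted(k for k in da.keys() | db.keys() if da.get(k) != db.get(k))
    mid = (lo + hi) // 2  # otherwise split the key range and recurse
    return diff_keys(a, b, lo, mid, threshold) + diff_keys(a, b, mid, hi, threshold)


# Hypothetical data: a replica with one changed row and one missing row.
source = {k: f"row-{k}" for k in range(1000)}
replica = dict(source)
replica[137] = "row-137-changed"
del replica[802]

print(diff_keys(source, replica, 0, 1000))  # [137, 802]
```

In a real cross-database setting, each checksum would presumably be computed inside the database, so that only digests rather than rows cross the network. Matching segments are then dismissed after a single comparison.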

Datafold has released data-diff under the MIT license. The tool includes connectors for Postgres, MySQL, Snowflake, BigQuery, Redshift, Presto, and Oracle. Datafold intends to invite contributors to build connectors for more data sources and specific business applications.
