Managing data quality and testing and profiling data with Databricks comes up often when working with Data Assets. Testing code and measuring code coverage are common practice; what about coverage of your data?
There are a few tools out there for testing, profiling, and managing the quality of data pipelines. In this post I'll talk about one Python tool, Great Expectations, and an excellent blog from a data scientist working with Spark and tools like Great Expectations.
Great Expectations
Great Expectations is a data quality, data profiling, and scaffolding library for data pipelines.
If you're comfortable working with Spark or Pandas dataframes, you'll be comfortable working with this framework. In my initial experience, the setup isn't quite notebook-friendly because of its wizard-based prompts, so it's best to try running it locally first. There is also a lot going on in this framework; take the time to dig into its features.
Once I had the framework installed, I was quickly able to set up Expectations for both Spark and Pandas dataframes. Expectations are assertions about your data and can be packaged into Suites. Great Expectations provides a large catalogue of expectations for automated data testing and profiling; see the Glossary of Expectations for the full list.
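As a minimal sketch of what that looks like with the Pandas-flavoured API (the column names, values, and thresholds here are made up for illustration):

```python
import great_expectations as ge
import pandas as pd

# Hypothetical orders data -- column names and values are just for illustration
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [9.99, 24.50, 5.00, 102.75],
})

# Wrap the Pandas dataframe so it exposes the expect_* methods
df = ge.from_pandas(orders)

# Expectations are assertions about the data
df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=10000)

# Collect the expectations asserted so far into an Expectation Suite
suite = df.get_expectation_suite()

# Validate the data against the suite
results = df.validate(expectation_suite=suite)
print(results.success)
```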
Before you start writing code to validate JSON in a column, check out expect_column_values_to_be_json_parseable or expect_column_values_to_match_json_schema.
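For example, given a made-up column of JSON strings, roughly:

```python
import great_expectations as ge
import pandas as pd

# A fabricated column of JSON strings, just to illustrate the two expectations
events = ge.from_pandas(pd.DataFrame({
    "payload": ['{"event": "click", "count": 3}', '{"event": "view"}'],
}))

payload_schema = {
    "type": "object",
    "properties": {
        "event": {"type": "string"},
        "count": {"type": "integer"},
    },
    "required": ["event"],
}

# Check each value parses as JSON, then that it conforms to the schema
events.expect_column_values_to_be_json_parseable("payload")
events.expect_column_values_to_match_json_schema("payload", json_schema=payload_schema)
```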
Are you working on a machine learning project and need to verify some results statistically? Try expect_column_stdev_to_be_between or expect_column_proportion_of_unique_values_to_be_between, use the at_least or at_most features, or reach for something more:
Kullback-Leibler divergence? expect_column_kl_divergence_to_be_less_than
Bootstrapped Kolmogorov-Smirnov test? expect_column_bootstrapped_ks_test_p_value_to_be_greater_than or expect_column_parameterized_distribution_ks_test_p_value_to_be_greater_than
Chi-squared test? expect_column_chisquare_test_p_value_to_be_greater_than
Note that some of these expectations may have issues at big data scale until they mature a bit more; see https://github.com/great-expectations/great_expectations/issues/2277.
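To give a flavour of how the statistical expectations read, here's a small sketch on a toy Pandas dataframe; the columns, thresholds, and expected distribution are all invented:

```python
import great_expectations as ge
import pandas as pd

# Illustrative data; names and thresholds are made up
measurements = ge.from_pandas(pd.DataFrame({
    "user_id": [101, 102, 103, 104, 105],
    "response_ms": [120.0, 135.5, 128.2, 119.9, 131.4],
    "status": ["ok", "ok", "ok", "error", "ok"],
}))

# Standard deviation within an expected band
measurements.expect_column_stdev_to_be_between("response_ms", min_value=1.0, max_value=25.0)

# Cardinality check: most user_id values should be unique
measurements.expect_column_proportion_of_unique_values_to_be_between(
    "user_id", min_value=0.9, max_value=1.0
)

# Chi-squared goodness-of-fit against an expected categorical distribution
# (a categorical partition object pairs "values" with "weights" summing to 1)
measurements.expect_column_chisquare_test_p_value_to_be_greater_than(
    "status",
    partition_object={"values": ["ok", "error"], "weights": [0.9, 0.1]},
    p=0.05,
)
```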
Datasources are used to connect to data and load it in Batches, and Validators evaluate Expectations or Expectation Suites against those Batches.
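To make those terms concrete, here's a rough sketch of the batch-request style of the API (around the 0.13.x releases). It assumes a project already initialised with great_expectations init and an existing PySpark dataframe spark_df; the datasource, connector, and asset names are placeholders I've made up, and details vary by version:

```python
from great_expectations.data_context import DataContext
from great_expectations.core.batch import RuntimeBatchRequest

context = DataContext()  # loads great_expectations.yml from the current project

# Register a Spark Datasource with a runtime connector for in-memory dataframes
context.add_datasource(
    name="my_spark_datasource",
    class_name="Datasource",
    execution_engine={"class_name": "SparkDFExecutionEngine"},
    data_connectors={
        "runtime_connector": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["run_id"],
        }
    },
)

# A Batch Request describes one Batch of data -- here an in-memory Spark dataframe
batch_request = RuntimeBatchRequest(
    datasource_name="my_spark_datasource",
    data_connector_name="runtime_connector",
    data_asset_name="orders",
    runtime_parameters={"batch_data": spark_df},  # spark_df: an existing PySpark dataframe
    batch_identifiers={"run_id": "ad_hoc_run"},
)

# A Validator evaluates Expectations against that Batch
validator = context.get_validator(
    batch_request=batch_request,
    create_expectation_suite_with_name="orders_suite",
)
validator.expect_column_values_to_not_be_null("order_id")
validator.save_expectation_suite()
```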
Checkpoints bundle validation runs together with follow-up actions. Stores and the Data Context configuration define where configuration, metrics, validation results, and documentation are kept.
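Continuing the sketch above, a minimal Checkpoint might look roughly like this; SimpleCheckpoint wires in the default actions of storing the validation result and updating the docs:

```python
# Register a minimal checkpoint using the default action list
context.add_checkpoint(
    name="orders_checkpoint",
    class_name="SimpleCheckpoint",
)

# Run it against the in-memory batch from the example above
result = context.run_checkpoint(
    checkpoint_name="orders_checkpoint",
    validations=[{
        "batch_request": batch_request,
        "expectation_suite_name": "orders_suite",
    }],
)
print(result["success"])
```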
Great Expectations could also be considered the Sphinx docs tool for data: it includes Site and Page builders, Renderers, and other tools to auto-generate documentation for data Batches.
A Profiler is available for scaffolding expectations and building collections of metrics.
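Roughly, using the legacy dataset-style profiler on the Pandas example from earlier and the same project context as above:

```python
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# Profile a ge-wrapped dataframe (e.g. the `df` from the first example) to scaffold
# a candidate Expectation Suite plus the metrics observed while profiling
scaffolded_suite, profiling_results = BasicDatasetProfiler.profile(df)

# Rebuild and open the generated HTML Data Docs for the project
context.build_data_docs()
context.open_data_docs()
```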
Great Expectations is compatible with Databricks and PySpark. I was also able to get portions of the framework running in Google Colaboratory, with Spark and Airflow(!), for experimentation.
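A quick Databricks-flavoured sketch using the legacy SparkDFDataset wrapper; the table and column names are placeholders, and it assumes a live Spark session named spark:

```python
from great_expectations.dataset import SparkDFDataset

# Wrap an existing PySpark dataframe (e.g. one read from a table in Databricks)
# so it exposes the same expect_* methods as the Pandas examples above
spark_df = spark.read.table("sales.orders")  # placeholder table name
gdf = SparkDFDataset(spark_df)

gdf.expect_column_values_to_not_be_null("order_id")
gdf.expect_column_values_to_be_between("amount", min_value=0, max_value=10000)

print(gdf.validate().success)
```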
One addition to the framework is a Data Dictionary plugin. If you're using comments or metadata on tables and columns, or would like to manage these separately for Data Assets, this could be one tool to look at; another option is a service such as Azure Purview.
There's also a Markdown renderer, so you can publish your data documentation to a code wiki or browse it with tools like https://typora.io/.
Here's the latest documentation on Read the Docs.
Justin Matters
Justin Matters, a data scientist and developer from Edinburgh, UK, has some excellent articles on Databricks and PySpark that may help with standardizing data pipelines, testing, and data quality. I highly recommend reading his blog posts. Here are a few I've put in my sandbox for later testing.
Refactoring code with curried functions
https://justinmatters.co.uk/wp/building-a-custom-data-pipeline-using-curried-functions/
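To illustrate the general idea (a toy sketch of my own, not code from the article): configuration-first functions can be partially applied and then chained with PySpark's DataFrame.transform, assuming an existing dataframe raw_df:

```python
from functools import partial
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

# Each pipeline step takes its configuration first and the dataframe last,
# so a partially-applied step becomes a DataFrame -> DataFrame function
# that slots straight into DataFrame.transform.

def drop_nulls(columns: list, df: DataFrame) -> DataFrame:
    return df.dropna(subset=columns)

def add_constant(name: str, value, df: DataFrame) -> DataFrame:
    return df.withColumn(name, F.lit(value))

cleaned_df = (
    raw_df                                           # assumes an existing PySpark dataframe
    .transform(partial(drop_nulls, ["order_id"]))
    .transform(partial(add_constant, "source", "web"))
)
```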