Databricks adds data governance, marketplace features

The data marketplace and other features are expected to accelerate data engineering tasks with an option for data monetization down the road, Databricks said.

Along with open sourcing Delta Lake at its annual Data + AI Summit, data lakehouse provider Databricks on Tuesday launched a new data marketplace and a set of new data engineering features.

The new marketplace, which will be available in the coming months, will allow enterprises to share data and analytics assets such as tables, files, machine learning models, notebooks and dashboards, the company said, adding that data doesn't have to be moved or replicated from cloud storage for sharing purposes.

The marketplace, according to the company, will accelerate data engineering and application development, since it lets enterprises access an existing dataset instead of developing one, or subscribe to an analytics dashboard instead of creating a new one.
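
The marketplace builds on Delta Sharing, the open data-sharing protocol Databricks maintains. As a rough sketch of the consumption side, the open source delta-sharing Python connector already lets a recipient list and load shared tables without copying the underlying cloud storage; the profile file and the share, schema, and table names below are hypothetical placeholders, not marketplace APIs.

```python
# A minimal sketch of consuming shared data with the open source
# delta-sharing connector (pip install delta-sharing). The profile file
# and the share/schema/table names are hypothetical placeholders.
import delta_sharing

# A profile file, issued by the data provider, holds the sharing server
# endpoint and a bearer token.
profile = "config.share"

# List every table the provider has shared with this recipient.
client = delta_sharing.SharingClient(profile)
for table in client.list_all_tables():
    print(table)

# Load one shared table into a pandas DataFrame; the provider's data is
# read in place rather than replicated into the recipient's account.
df = delta_sharing.load_as_pandas(f"{profile}#vendor_share.sales.transactions")
print(df.head())
```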

Databricks' marketplace lets users share, monetize data

Databricks said that the marketplace will make it easier for enterprises sharing data assets to monetize them.

The new marketplace is akin to Snowflake's data marketplace in design and strategy, analysts said.

"Every major enterprise platform (including Snowflake) needs to have a viable application ecosystem to truly be a platform and Databricks is no exception. It is seeking to be a central market for data assets and should be seen as an immediate opportunity for ISVs and application developers who are seeking to build on top of Delta Lake," said Hyoun Park, chief analyst at Amalgam Insights.

Comparing Databricks' marketplace with Snowflake's, Doug Henschen, principal analyst at Constellation Research, said that in its present form the Databricks Data Marketplace is very new and addresses only data sharing, both internal and external, unlike Snowflake, which has added integrations and support for data monetization.

In an effort to promote data collaboration with other enterprises in a secure manner, the company said it was introducing an environment, dubbed Cleanrooms, that will be available in the coming months.

A data clean room is a secure environment that allows an enterprise to anonymize, process, and store personally identifiable information so that it can later be used for data transformation without violating privacy regulations.

Databricks' Cleanrooms will provide a way to share and join data across enterprises without the need for replication, the company said, adding that these enterprises will be able to collaborate with customers and partners on any cloud with the flexibility to run complex computations and workloads using both SQL and data science tools, including Python, R, and Scala.
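
Databricks has not yet published a Cleanrooms API, but the anonymization step at the heart of the clean room concept can be sketched in plain PySpark: hashing identifiers with a shared salt so that two parties can still join their records without exposing raw PII. The table and column names here are hypothetical.

```python
# A generic illustration of clean room-style anonymization in PySpark;
# this is not Databricks' Cleanrooms API. Table and column names are
# hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

customers = spark.table("crm.customers")  # hypothetical source table

# Replace the raw email with a salted SHA-256 digest. If both parties
# apply the same transformation, the hashed keys still join across their
# datasets while the underlying PII is never shared.
anonymized = (
    customers
    .withColumn(
        "customer_key",
        F.sha2(F.concat(F.lit("shared-salt"), F.col("email")), 256),
    )
    .drop("email", "phone", "full_name")
)

anonymized.write.format("delta").mode("overwrite").saveAsTable(
    "cleanroom.customers_anon"
)
```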

The promise of being compliant with privacy norms is an interesting proposition, Park said, adding that its litmus test will be its uptake in the financial services, government, legal and healthcare sectors that have tight regulatory guidelines.

Databricks updates data engineering, management tools

Databricks also launched several additions to its data engineering tools.

One of the new tools, Enzyme, according to the company, is a new optimization layer that speeds up extract, transform, load (ETL) processing in Delta Live Tables, which Databricks made generally available in April this year.

"The optimization layer is focused on supporting automated incremental data integration pipelines using Delta Live Tables through a combination of query plan and data change requirement analysis," said Matt Aslett, research director at Ventana Research.

This layer, according to Henschen, is expected to "check off another set of customer-expected capabilities that will make it more competitive as an alternative to conventional data warehouse and data mart platforms."
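
Enzyme requires no changes to pipeline code; it operates on the query plans behind declarative Delta Live Tables definitions such as the following sketch, in which the source path and table names are hypothetical.

```python
# A minimal Delta Live Tables pipeline of the kind Enzyme optimizes; the
# optimization layer itself requires no code changes. The source path and
# table names are hypothetical; `spark` is provided by the DLT runtime.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw click events ingested incrementally from cloud storage.")
def raw_events():
    # Auto Loader picks up only files that arrived since the last update.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/events/")
    )

@dlt.table(comment="Daily counts derived from raw_events.")
def daily_event_counts():
    # Enzyme analyzes this query plan and the changes to raw_events to
    # decide between an incremental update and a full recompute.
    return (
        dlt.read("raw_events")
        .groupBy(F.to_date("event_time").alias("event_date"), "event_type")
        .count()
    )
```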

Databricks also announced Project Lightspeed, the next generation of Spark Structured Streaming on its Delta Lake platform, which the company claims will reduce costs and lower latency, backed by an expanded ecosystem of connectors.

Databricks refers to Delta Lake as a data lakehouse, built on a data architecture offering both storage and analytics capabilities, in contrast to data lakes, which store data in native format, and data warehouses, which store structured data (often in SQL format) for fast querying.
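
Project Lightspeed is positioned as an evolution of the existing engine rather than a new API, so today's Structured Streaming jobs stand to benefit unchanged. A minimal sketch of such a job, with hypothetical paths:

```python
# A minimal Structured Streaming job of the kind Project Lightspeed aims
# to make cheaper and lower latency; this is the standard PySpark
# streaming API, and the paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-stream-demo").getOrCreate()

# Read a Delta table as a continuous stream of changes.
events = spark.readStream.format("delta").load("/delta/events")

# Write the stream to another Delta table; the checkpoint gives the job
# exactly-once semantics across restarts.
query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/events_copy")
    .outputMode("append")
    .start("/delta/events_copy")
)

query.awaitTermination()
```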

"Streaming data is an area in which Databricks is differentiated from some of the other data lakehouse providers and is gaining greater attention as real-time applications based on streaming data and events become more mainstream," Aslett said.

The next generation of Spark Structured Streaming, according to Park, shows Databricks' increasing interest in supporting smaller data sources for analytics and machine learning.

"Machine learning is no longer just a tool for massive big data, but a valuable feedback and alerting mechanism for real-time and distributed data as well," the analyst said.

In addition, to help enterprises with data governance, the company has launched Data Lineage for Unity Catalog, which will be generally available on AWS and Azure in the coming weeks.

"General availability of Unity Catalog will help improve security and governance aspects of the lakehouse assets, such as files, tables, and ML models. This is essential to protect sensitive data," said Sanjeev Mohan, former research vice president for big data and analytics at Gartner.

The company also released Databricks SQL Serverless on AWS, a fully managed service that maintains, configures, and scales cloud infrastructure for the lakehouse.

Other updates include a query federation feature for Databricks SQL and a new capability for the SQL CLI, allowing users to run queries directly from their local computers.

The federation feature allows developers and data scientists to query remote data sources including PostgreSQL, MySQL, AWS Redshift, and others without the need to first extract and load the data from the source systems, the company said.
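
One form query federation takes on Databricks is an external table defined over a live remote JDBC source; assuming the PostgreSQL variant, a sketch run through spark.sql might look like the following, with every connection detail a hypothetical placeholder.

```python
# A sketch of query federation: an external table defined over a live
# PostgreSQL source and queried in place, without first extracting and
# loading the data. Every connection detail is a hypothetical placeholder;
# a real deployment would pull the password from a secret store.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS orders_pg
    USING postgresql
    OPTIONS (
      dbtable 'public.orders',
      host 'pg.example.com',
      port '5432',
      database 'shop',
      user 'report_reader',
      password 'placeholder'
    )
""")

# Reads are pushed down to PostgreSQL where possible.
spark.sql("SELECT status, count(*) FROM orders_pg GROUP BY status").show()
```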

Copyright © 2022 IDG Communications, Inc.
