The Microsoft Power BI team recently introduced dataflows. I’m currently in the process of researching this exciting new technology and came across this article, from Adi Regev, that I wanted to share with you:
In the modern BI world, data preparation is considered the most difficult, expensive, and time-consuming task, estimated by experts as taking 60%-80% of the time and cost of a typical analytics project. Some of the challenges in those projects include fragmented and incomplete data, complex system integration, business data without any structural consistency, and of course, a high skillset barrier. Specialized expertise, typically reserved for data warehousing professionals, is often required. Such advanced skills are rare and expensive.
To answer many of these challenges, Power BI serves analysts today with industry leading data preparation capabilities using Power Query in Power BI Desktop. Now, With Power BI dataflows, we’re bringing these self-service data preparation capabilities into the Power BI online service, and significantly expanding the capabilities in the following ways:
- Self-service data prep for big data in Power BI – Dataflows can be used to easily ingest, cleanse, transform, integrate, enrich, and schematize data from a large array of transactional and observational sources, encompassing all data preparation logic. Previously, ETL logic could only be included within datasets in Power BI, copied over and over between datasets and bound to dataset management settings. With dataflows, ETL logic is elevated to a first-class artifact within Power BI and includes dedicated authoring and management experiences. Business analysts and BI professionals can use dataflows to handle the most complex data preparation challenges and build on each other’s work, thanks to a revolutionary model-driven calculation engine, which takes care of all the transformation and dependency logic—cutting time, cost, and expertise to a fraction of what’s traditionally been required for those tasks. Better yet, analysts can now easily create dataflows using familiar self-service tools, such as the well known Power Query data preparation experience. Dataflows are created and easily managed in app workspaces, enjoying all the capabilities that the Power BI service has to offer, such as permission management, scheduled refreshes, and more.
- Advanced Analytics and AI with Azure – Power BI dataflows store data in Azure Data Lake Storage Gen2 – which means that data ingested through a Power BI dataflow is now available to data engineers and data scientists to leverage the full power of Azure Data Services such as Azure Machine Learning, Azure Databricks, and Azure SQL Datawarehouse for advanced analytics and AI. This allows business analysts, data engineers, and data scientists to collaborate on the same data within their organization.
- Support for the Common Data Model – The Common Data Model (CDM) is a set of a standardized data schemas and a metadata system to allow consistency of data and its meaning across applications and business processes. Dataflows support the CDM by offering easy mapping from any data in any shape into the standard CDM entities, such as Account, Contact etc. Dataflows also land the data, both standard and custom entities, in schematized CDM form. Business analysts can take advantage of the standard schema and its semantic consistency, or customize their entities based on their unique needs. The Common Data Model continues to evolve as part of the recently announced Open Data Initiative.
Once dataflows are created, users can use Power BI Desktop and the Power BI service to create datasets, reports, dashboards, and apps to drive deep insights into their business.
Sunrise Technologies utilizes Power BI to reason across data from various sources to improve efficiency and drive business growth.
“The ability to utilize PowerQuery via dataflows in the online service is a huge step forward,” said Brad Edwards, Senior Product Manager at Sunrise Technologies. “A single ETL process can now be used across multiple reports, greatly reducing the time required to deliver and maintain solutions.”
Today, we are excited to announce that dataflows are available for public preview. Check out this video by Adam Saxton (Guy in a Cube) for a quick look at dataflows:
Now, buckle up and lets get started:
Getting started with self-service data prep
To get started, simply create a new dataflow in your app workspace.
Note: Dataflow capabilities are available for Pro and Premium licenses. To access dataflow premium features, dataflows must first be enabled for the relevant capacity. Learn more
You can use dataflows to ingest data from a large and growing set of supported on-premises and cloud- based data sources, including Dynamics 365 (using the new Common Data Service for Apps connector), Salesforce, SQL Server, Azure SQL Database and Data Warehouse, Excel, SharePoint, and more. Over time, more and more data sources from Power BI’s extensive catalog will be made available for ingestion using dataflows.
Leveraging the familiar Power Query experience, you can shape your entities to your needs.
You can also use the Advanced Editor for defining your queries in M language.
Once you’ve created a dataflow, you can define its refresh frequency by setting a refresh schedule in the dataflow settings—similar to the refresh settings and experience available for datasets today.
Ingesting larger amounts of data with Incremental Refresh
To support the ingestion of large amounts of data, dataflow’s entities can be configured with incremental refresh settings. With just few clicks, you can ensure refresh only updates data that has changed, rather than ingesting a full copy of the source data with every refresh. With incremental refresh configured, refreshes are faster and more reliable, which also reduces resource consumption. Learn more
Once data is ingested, cleansed, shaped and mapped, you might want to enrich some of your data with other data that exists in your lake.
As explained next, this task can be easily and efficiently performed with Linked entities and Computed entities using Power BI Premium.
Building complex ETL projects with Linked entities
Linked entities allow users to reuse data which exists in the lake and manage and organize projects with complex ETL processes, from one or multiple sources, thereby allowing analysts to build on each other’s work.
You can create multiple dataflows with links between them, thus organizing and leveraging them as building blocks in your ETL chain, on your path to achieve business insights.
In addition, Linked entities lets you reuse data without data duplication, improving data consistency and reducing maintenance efforts. Learn more about linked entities.
Let’s look at an example of performing advanced data prep on Dynamics 365 for Sales data and support-call data from the service call center.
A best practice for complex projects is to use a single dataflow for each data source. So in this example, one dataflow ingests and preps Sales data from the CRM system, and another dataflow ingests data from the Services system.
Each dataflow can be scheduled to refresh independently, according to the frequency of the changes in the source system.
Then with Linked entities, a third dataflow can be created to leverage and enrich the existing Sales and Services data.
Gaining further insights with Computed entities
Computed entities let you perform in-storage computations on data that has already been ingested, enabling you to create advanced data transformation pipelines.
A new recalculation engine automatically takes care of all dependency logic within a workspace, and takes care of recomputing all dependent data.
In our example, there’s now a dataflow that contains linked entities to Account data from Dynamics and ServiceCalls data from service call center.
ServiceCalls data is raw data that specified the time and sentiment for each call made to the service center. By performing an aggregation on the service call data, you can get a view of how many calls were done in the last period for each of the accounts you have in the system.
This lets you quickly gain insights for the top accounts regarding their serviceability status.
To do this, you can create a new computed entity that performs in-lake transformations on the data that has already been ingested to the ServiceCalls entity.
To create a new computed entity using service-call data, right click the ServiceCalls entity, and choose Reference from the menu.
You can then perform aggregations like summary, average, and distinct count. To join two tables, you can use a combine tables transformation and
the result of the operation can be a new computed entity that is an enrichment of data from multiple sources.
As mentioned, dataflows have a model-driven calculation engine, which takes care of all the transformation and dependency logic. So in our example, you wouldn’t need to spend time syncing the refresh logic for your linked and computed entities. Learn more about computed entities.
Connecting datasets to your data
Finally, now that you’ve created your dataflows and defined how to keep your data updated, you’re ready to leverage your dataflows to build various datasets with Power BI Desktop.
Learn more about how to Connect to Power BI dataflows in Power BI Desktop. Once your data is connected, your path is clear to generate insights on top of your data using reports, dashboards, and apps.
That’s it, now you’re on the fast track to easily prep your organizational data and accelerate your time to insights.
Stay tuned for more updates as we introduce more and more capabilities – we have much lined up!