Thursday, January 27, 2022

Scaling Your Power BI DevOps and Solutions

The Background

Back in 2010, the following players comprised the upper-right "leaders" section of Gartner's Magic Quadrant for BI and Analytics Platforms: Oracle BI, Microsoft Reporting Solutions, IBM Cognos Analytics, SAS, MicroStrategy, and SAP. Today in 2022, these often get labeled with the "legacy" moniker and are said to have made way for the latest generation of tools such as Microsoft Power BI, Tableau, Qlik, Looker, Domo, ThoughtSpot, Sisense, TIBCO Spotfire, and more.

This latest generation of business intelligence tools, often branded as "data visualization" and "self-service" tools, marked a notable shift from the previous generation by leveraging low/no-code environments and user-friendly drag-and-drop configurations to create more visual reports and dashboards that better engage analysts and surface more insightful learnings. This shift was concurrent with, and influential in, the democratization of data away from just IT developers and into the hands of the business analysts who were closer to the meaning of the data and the story that needed to be told to functional leaders.

The Problem Statement

The adoption of these tools by data citizens outside of IT has helped ensure the world runs on data and effective story-telling, but this has also come at a cost: for the better part of a decade, tools like Power BI and Tableau lacked the functionality to be easily integrated into enterprise-grade solutions with the proper level of data governance, security, and support for modern DevOps approaches like continuous integration and continuous delivery (CI/CD). As a result, it's been easy for business users and units to develop standalone point solutions, but these eventually become isolated within the organization, difficult to maintain, and difficult to reconcile with the sources of truth behind other business units' reporting.

Even today, it's not very easy to develop solutions with Power BI, Tableau, and others at scale. It has gotten better, though, and as these tools become increasingly commonplace, the analytics community needs more architecture and development-process patterns that can be leveraged to build and scale enterprise-grade solutions.

The aim of this post is to help share the DevOps patterns and processes we recently leveraged when building an Azure-based end-to-end modern data solution with Power BI reporting for a large client in the Consumer Packaged Goods (CPG) industry.

An Example Solution Architecture and DevOps Processes

Our Solution Architecture leveraged data sources ingested into an Azure Synapse "lakehouse" architecture and fed to a consumption layer composed of Azure Analysis Services (AAS) and Power BI. The Power BI "stack" included development in Power BI Desktop, leveraging DirectQuery to multiple AAS cubes (often referred to as "composite models"), publishing to Power BI Service workspaces, and surfacing via workspace apps for end-user consumption. Note-- although this solution leveraged AAS cubes, a solution hosting the same semantic layer in PBI Datasets on PBI Service (with XMLA endpoints enabled) is logically equivalent, and would be the preferred approach in future deployments.

To manage the Power BI solution as part of an end-to-end, modern software/analytical solution, we leveraged the following DevOps Architecture & Processes:

  1. Source Control via Azure Git Repos. Here, we stored Power BI files (PBIXs) connected to TEST databases. We briefly entertained the notion of version control via OneDrive, but ultimately wanted the Power BI development team to operate with the same general tools and principles as the rest of the software and data engineering development team.


  2. Backlog Management and User Story Definition via Azure DevOps. Here, we leveraged agile principles and user-story-writing best practices to ensure well-structured and well-defined work items. Beyond this, we leveraged "tags" to more easily conduct Sprint Planning and assign work so that a single PBIX was worked on by only one developer at a time-- a necessity, given the inability to perform "merges" in a tool like Power BI.

  3. TEST / PROD Workspace Pattern on Power BI Service. We employed a pattern of only publishing PBIX files connected to TEST databases in the TEST Power BI workspaces. Similarly, PBIX files promoted to PROD workspaces would always have connection strings updated to point to PROD databases. Note: we did not leverage a DEV environment for Power BI development due to the lack of meaningful data volume and data quality in DEV databases. In addition, Power BI Desktop can generally be considered a "DEV" environment of sorts, so this may be a preferred and more efficient pattern regardless of DEV data quality. As a result, an important distinction to be aware of is that all development in Power BI Desktop was completed against TEST databases. This may seem counter-intuitive; however, we found it to be the best balance given the level of our data quality. Completing Power BI development against PROD data may seem desirable, but it would require new back-end features to be released into PROD before front-end developers are able to leverage them. We found this slow and undesirable, especially as front-end exploration was often necessary to confirm and validate that back-end features were working properly before they could be promoted to PROD.


  4. Short-Lived "Development" Files and Branches. In traditional applications of version control, a branching strategy is employed to govern work being completed by many developers on the same software solution in parallel. In Power BI and most other data visualization tools, this is not possible due to the inability to compare PBIX files line-by-line to identify differences between two files. Instead, a process is required to ensure a merge of a "development" branch into a "main" branch introduces the correct features and does not "blow away" any other features in the process. To address this, we implemented a naming convention for "development" files/branches that appends the User Story numbers from the Azure DevOps backlog to the PBIX title (a small helper for this convention is sketched after this list). For example: "dashboard_001_002.pbix" would be the short-lived working title for "dashboard.pbix" that exists in PROD, with User Stories #001 and #002 as the requested features from the business. This allows the publishing of "development" files to the TEST and PROD workspaces to verify/QA features without over-writing the "official" file in PROD that end-users are accessing. Once features are verified/validated/QA'd and the file is ready to be merged into the main branch, it is retitled as "dashboard.pbix" and a Pull Request is created and completed.


  5. Development and Testing with Files (and not just Workspaces). In reading that no "DEV" workspace exists, it may be confusing to reconcile traditional approaches to "development" and "testing". Similarly, in reading that a "TEST" workspace exists in the solution architecture, it may be natural to assume that User Acceptance Testing with business/end users is conducted from that workspace. This would be an incorrect assumption! With this framework, we shift our thinking of development, testing, and UAT to how we leverage specific PBIX files, not specific workspaces. "Development" occurs within Power BI Desktop by the primary developer in a file titled as described in the previous section ("dashboard_001_002.pbix"). The developer should test their features within Power BI Desktop, and, when ready, publish the "development" file to Power BI Service. As described earlier, Power BI Desktop development is conducted against TEST databases. Therefore, for a Power BI developer to properly "test" their features, they need to promote the file "dashboard_001_002.pbix" to the PROD workspace to see it with PROD data. Here, the developer should complete their own "testing" of the new features, and also solicit at least one colleague to perform a secondary round of "testing" (using the well-written User Story Acceptance Criteria to validate against). Once "testing" is complete, we have confidence that the "development" file can be merged into the "main" branch and file. You'll notice-- both the TEST and PROD workspaces were leveraged to complete what we would normally consider the "testing" process.


  6. User Acceptance Testing with Files (and not just Workspaces). We can extend the discussion in the previous section to formulate an approach for User Acceptance Testing with a broader group of business power users. When a dashboard or set of features is ready for User Acceptance Testing, the main branch file can be retitled "dashboard_UAT.pbix" and promoted to the PROD workspace. Here, it will, of course, be connected to PROD data, which is necessary for User Acceptance Testing. As feedback is received and bugs or new features are completed, the normal processes described above can be followed. Let's imagine Work Item #003 is a new User Story that emerges from User Acceptance Testing, and Work Item #004 is a bug. The Power BI developer will create a new "development" file titled "dashboard_003_004.pbix" and complete the normal development and testing processes described. After it has been merged into the main branch, a file with the title "dashboard_UAT.pbix" can simply be published and promoted again, over-writing the previous version. This has the important benefit of ensuring the PBIX and URL that business users are accessing never changes or gets taken down during the User Acceptance Testing process. Once User Acceptance Testing is completed, the "dashboard_UAT.pbix" file can be removed and the proper "dashboard.pbix" can remain the only relevant file in the PROD workspace and app.

  7. Content Promotion via Power BI Deployment Pipelines. Using PBI Deployment Pipelines as the sole method to promote content into the PROD workspace and over-write end-user-facing content ensured the level of governance needed for a solution of this size. It also reduced manual errors introduced by developers changing connection strings in PBI Desktop and accidentally forgetting which database environment is active when publishing to source control or to Power BI Service. Notably, when we first built the solution, we leveraged Azure DevOps pipelines to deploy PBIX files. This allowed the artifacts of the entire end-to-end solution to be versioned and released together with a single pipeline. However, it became clumsy and slow for front-end Power BI deployments to be dependent on back-end solution builds and releases, which could take 5-10 minutes on their own. This became particularly frustrating when numerous minor or cosmetic Power BI changes needed to be made rapidly or in relatively quick succession. It became clear that a standalone pipeline for Power BI deployments would be a much better solution. In addition, the newly released DirectQuery (composite model) connections were only compatible with PBI Deployment Pipelines at the time, so the switch was made, and we happily never looked back.
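
Two quick sketches before wrapping up. First, the naming convention from step 4 is easy to keep honest with a tiny helper. This is a minimal sketch in Python-- the convention itself is just ours, and nothing in Power BI enforces it:

```python
import re
from pathlib import Path

# "dashboard_001_002.pbix" -> base report "dashboard", carrying User Stories #001 and #002.
# The <report>_<story>_<story>.pbix pattern is our own convention, not a Power BI feature.
DEV_PATTERN = re.compile(r"^(?P<base>.+?)(?P<stories>(?:_\d{3})+)\.pbix$")

def parse_dev_file(filename: str) -> tuple[str, list[str]]:
    """Split a short-lived development PBIX name into its base report name
    and the Azure DevOps work item numbers it carries."""
    match = DEV_PATTERN.match(filename)
    if not match:
        raise ValueError(f"{filename} does not follow the <report>_<story>.pbix convention")
    stories = match.group("stories").strip("_").split("_")
    return match.group("base"), stories

def merge_ready_name(filename: str) -> str:
    """The name the file should carry once its features are verified
    and it is ready to be merged into the main branch."""
    base, _ = parse_dev_file(filename)
    return f"{base}.pbix"

if __name__ == "__main__":
    print(parse_dev_file("dashboard_001_002.pbix"))   # ('dashboard', ['001', '002'])
    print(merge_ready_name("dashboard_001_002.pbix")) # dashboard.pbix
```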
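Second, the promotion in step 7 can itself be scripted: Power BI Deployment Pipelines expose a REST endpoint for deploying content from one stage to the next. A rough sketch with the requests library-- the pipeline ID and access token are placeholders, and the exact request body is worth double-checking against the Power BI REST API documentation:

```python
import requests

ACCESS_TOKEN = "<AAD access token>"       # e.g., acquired via MSAL for a service principal
PIPELINE_ID = "<deployment pipeline id>"  # placeholder -- visible in the pipeline's URL in the Service

# Deploy everything from the TEST stage to the next stage (PROD).
# Stage order is assumed here as 0 = Development, 1 = Test, 2 = Production.
response = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/pipelines/{PIPELINE_ID}/deployAll",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={
        "sourceStageOrder": 1,
        "options": {"allowOverwriteArtifact": True, "allowCreateArtifact": True},
    },
    timeout=60,
)
response.raise_for_status()
print("Deployment request accepted, status:", response.status_code)
```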

Concluding Thoughts

The above architecture and development processes are by no means perfect, nor are they the only way to approach at-scale, governed, source-control-gated, and semi-automated CI/CD for Power BI. Nonetheless, we found them to be a robust foundation for operating as a high-velocity and successful end-to-end development team. They also allowed the Power BI development team to embrace the flexibility and freedom of a self-service data visualization tool, while still adhering to commonplace engineering principles and best practices. There were certainly things to improve, particularly future-facing optimizations we did not have time to pursue but would have liked to-- namely the recently-announced native integration between Azure DevOps and PBI Deployment Pipelines (although we would have still maintained PBI as a separate pipeline), and also retiring Azure Analysis Services in favor of Power BI Datasets as the semantic model and self-service access point for all Power BI development. Hopefully you find this information useful, and if there are other suggestions based on your experiences, I would love to hear them in the comments.

Wednesday, January 18, 2017

Unemployment Viz: Who Is Responsible?

With Donald Trump's inauguration only a few days away, America is in wait-and-see mode for just about everything. What will happen to healthcare? What will happen to the economy? What will domestic race relations look like? What will international relations look like? The list is too long to enumerate fully.

In this viz, I wanted to explore the economic health of the country under different presidents by specifically looking at changes in the unemployment rate. Under which presidents and political parties has unemployment risen? Whose economic policies seem to have had the most positive (or negative) impact? How long does it take for a president to make an impact on the unemployment rate?

I created the viz below to help explore and answer some of those questions. Or at least provide some directionally related and interesting data :). In particular, I tried to quantify a few things:

  • What was the economic legacy of each president? In other words, in this context, did unemployment go up or down in a president's final 2 years in office? I've dubbed a bad change (unemployment going up) a "headwind". Conversely, I've dubbed a good change a "tailwind".
  • How good (or bad) was the economy that each president inherited? Who walked into office in great shape, with a strong "tailwind" at their backs? Who was fighting a losing battle from the start against a strong "headwind"?
Play around with the viz below and see what you learn! There are some interesting nuggets around who really is responsible for making unemployment better or worse over the years...
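
For the curious, the headwind/tailwind math is nothing fancy-- roughly this, in pandas, where the file and column names are placeholders for however you've shaped the monthly BLS series:

```python
import pandas as pd

# Hypothetical input: one row per month with the unemployment rate and the sitting president.
df = pd.read_csv("unemployment_by_month.csv", parse_dates=["month"])

def legacy_change(term: pd.DataFrame) -> float:
    """Change in the unemployment rate over a president's final 24 months in office.
    Positive = a "headwind" left behind; negative = a "tailwind"."""
    last_two_years = term.sort_values("month").tail(24)
    return last_two_years["rate"].iloc[-1] - last_two_years["rate"].iloc[0]

# sort=False keeps presidents in chronological order of appearance.
legacy = df.groupby("president", sort=False).apply(legacy_change)

# The economy a president "inherits" is simply the previous president's legacy.
inherited = legacy.shift(1)
print(pd.DataFrame({"legacy": legacy, "inherited": inherited}))
```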


Thursday, January 12, 2017

Tableau 10 Desktop Certification Tips!

I recently went through the Tableau 10 Desktop Certification process (Qualified Associate and Certified Professional exams) and wanted to share some preparation tips that can help you display your best work on both tests!

THE BASICS


  • Qualified Associate Exam: 2 hours, 36 questions (all multiple choice) spanning Data Connections, Organizing and Simplifying Data, Field and Chart Types, Calculations, Mapping, Analytics, and Dashboards.
  • Certified Professional Exam: 3 hours, all hands-on development (no multiple choice) spanning building with Visual Best Practices, Advanced Calculations, and Storytelling.

THE GAMEPLAN


Especially when it comes to the Certified Professional Exam, I would urge you to think about the following: 

"What can I do to maximize the time spent showcasing my visual analytics expertise during this exam?"

As mentioned, the test is 3 hours long and all viz / dashboard / storyboard development. It is challenging, even for a full-time, daily Tableau developer, because you need to pace yourself. Normally, you don't have unrealistically short, multi-hour deadlines to build something (hopefully). Normally, you have time to think about your data, colors, layout, formatting. Normally, you can iterate and refine multiple times before you finalize a single product (viz, dashboard, or story).

During the 3-hour exam, you'll build nearly a dozen worksheets or more, a dashboard, and a storyboard, all of which span numerous datasets and separate questions / prompts. For the most part, it's a race against the clock. You want to limit time wasted on low-value-add tasks or large formatting decisions that you could have planned out earlier. Ideally, you want to spend as much time as possible thinking about and implementing visual best practices (and having fun!) in Tableau.


THE TIPS


These tips are largely geared towards preparing for the Certified Professional exam, but some can be helpful for the Qualified Associate exam as well. This list assumes you've already read through the official exam prep guides (links above) and associated material.
  1. Eva Feng's (ie, Viz Panda's) blog post is a solid place to start (it is geared towards the Qualified Associate exam), and has some great resources and tips.
  2. Tableau's (Beth Lyons') blog post on LOD Expressions. This is recommended by Eva, but it is worth repeating here as well. Most obviously, it covers a broad range of LOD calcs (easy to hard, single to nested, fixed to include to exclude). Less obviously, it contains business vocabulary, metrics, and datasets that Tableau likes to reference. The more familiar you are with them, the less time you'll spend scratching your head during the exam(s).
  3. Review Tableau's most prominent sample datasets (ie, Superstore, Coffee Chain, World Indicators). Be somewhat familiar with how the data is structured (dimensions, measures, granularity). Again, the more familiar you are with them, the less time you'll spend reviewing if you encounter them on the exam.
  4. Pre-plan basic design or formatting decisions. What font do you plan to use? What about size? What about your titles / sub-titles / story points? Do you want to use shadings? Tooltips? Colors? Naming conventions? No, you obviously can't pre-plan the entire dashboard and layout. Yes, you could use out-of-the-box settings during the test. Yes, you could decide these things quickly on-the-spot. But, if you have a basic idea of how you want to build, present, and polish your analysis, it can be a real time-saver, and bail you out if you need more precious minutes to further understand, debug, or analyze a specific piece of the test.
  5. Pay attention to section weighting. Not all sections of the test are weighted equally. Plan accordingly. 'Nuff said.
  6. Don't get hung up extensively on a single prompt. The wording may be confusing. You might get an answer that is clearly wrong. You might get writer's / vizzer's block. You might not know how to do something. Just move along and keep the larger exam in mind.

Friday, August 12, 2016

Quick Viz - Amazon Purchases!

Recently stumbled upon instructions on how to download my entire Amazon purchase history and decided to build a really quick viz that compared purchase totals and habits of a few friends...

Screenshot below (dash is too wide for the blog), link to dashboard on Tableau Public here.


Tuesday, July 26, 2016

2000 - 2009: Prison Sentences & Time Served

BACKGROUND

This summer has been chock full of crazy, interesting, dismaying, and uplifting national and world events. Brexit, political races and conventions, police-civilian shootings, civilian-police shootings, endless soccer (a good thing, in my opinion), and Olympic buildup, just to name a few. So much to viz; so little time.

As it turns out, my latest viz has nothing to do with any of those. In the past year, like many others, I became heavily engaged and invested in the Adnan Syed case, a story first popularized by the Serial podcast. I won't go much into the details of his efforts towards post-conviction relief, but you can learn more here or here, or a zillion other places on the internet. In short, the media about his case exposed me to a great deal more of the legal and criminal justice system than I had known before. I was (and am) fascinated by its procedures, history, successes, and flaws, and wanted to contribute data to a discourse I had anecdotally encountered before. Namely, are there significant differences in prison sentences across race? Is the system biased?


DATA SOURCE AND ALTERYX PREPARATION

After a brief search of the interwebz, I came across the Bureau of Justice Statistics, which contains lots of data snippets and goodies. I came across the data I was interested in, and got really excited... until I opened up one of the zipped CSV files.


Could there possibly be a less friendly format for data analysis? (Yes, I suppose things could be infinitely worse, actually, but whatever. This meant significantly more work. Did I mention that the file for each year was a slightly different format?)

Enter Alteryx, the self-service data blending and preparation tool that any data analyst should work to become fluent in. In a previous life, I would have fired up Excel, written a bunch of custom macros in VBA and cleaned/aggregated this data into a usable format. This mini-project gave me an opportunity to finally step beyond the very basics in Alteryx into the world of Control Parameters and Macro workflows.

Here is the core of one of the workflows to cleanse and aggregate an input file. The Control Parameter was used to change the name of the input file prior to numerous steps of skipping unused rows, deselecting unused columns, and adding new columns for labeling and classification.


After the cleansing, the data from each file is dumped into a TDE of ever-increasing size. Here, again, a Control Parameter is used to insert the proper year into the dataset (each year's data was contained in separate files).
At the end of the day, the data was significantly cleaner and much improved. Thank you, Alteryx!
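
If you don't have Alteryx handy, the same batch-macro pattern is easy enough to approximate in Python. A rough pandas sketch-- the file paths, header-row counts, and column handling are placeholders, since each year's BJS file differed slightly:

```python
import glob
import re
import pandas as pd

frames = []
for path in sorted(glob.glob("data/sentences_*.csv")):   # hypothetical file naming
    # The "Control Parameter": pull the year out of the file name.
    year = int(re.search(r"(\d{4})", path).group(1))

    raw = pd.read_csv(path, skiprows=5)                  # skip the report-style header rows
    raw = raw.drop(columns=[c for c in raw.columns if c.startswith("Unnamed")])  # deselect unused columns
    raw["year"] = year                                    # label each file's rows with its year

    frames.append(raw)

# Append everything into one long table (the equivalent of the ever-growing TDE),
# ready to be written out for Tableau.
combined = pd.concat(frames, ignore_index=True)
combined.to_csv("sentences_clean.csv", index=False)
```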



DATA VIZ

For the viz itself, there were a few things I wanted to try out.

  1. Explore the power of barbell charts. Or perhaps, more specifically, the power of connecting two data points using a line to allow the user to create a strong mental association that can help sift through visual clutter. If my butchered explanation doesn't do it for you, go to the source. Hardly a text that flies under the radar, but one I finally consumed.
  2. Explore the power of small multiples & area charts. I've typically had difficulty finding scenarios in which area charts effectively display information in a manner that I find functionally and visually appealing. Here, I thought that by overlapping two sets of data, with the darker shade showing the "gap/difference" between them, the reader's eye would be able to rapidly pick up these "gaps" against datapoints where no gap existed. Hope it worked-- let me know what you think!
Enjoy!

Sunday, June 26, 2016

Updated: 2016 Presidential Primaries - Super PACs & Independent Expenditures

It has been a few months since I've been able to viz for fun, but I wanted to make sure I submitted an entry for the latest Tableau Iron Viz feeder! Lucky for me, this round is politics themed, which gave me an opportunity to refresh and enhance my Independent Expenditures viz.

Now that the primary season is over, I was able to pull data for the entire 2016 cycle, and also include a comparison to 2012. Not surprisingly, 2016 has seen a significant jump in independent expenditures from Super PACs. More surprising:

  1. How flat-footed the Democratic party was caught in 2012, compared to their Republican counterparts, in taking advantage of the new campaign financing infrastructure provided by Citizens United and SpeechNow.
  2. How much spending is loosed multiple months before the Iowa caucuses.
Original post and viz here!

Thursday, March 31, 2016

2016 Presidential Election - Super PACs & Independent Expenditures

You may be aware that the United States is currently in a presidential election year. I know there are many television and news outlets helping you navigate the resulting sea of information (and misinformation). I'm not sure how helpful you will find this campaign finance viz in the grand scheme of things, but I hope there are a few "Wait, wut?" or "Srsly?" moments. I know there were for me...

In the previous presidential election cycle, I had heard of Super PACs and had a high-level understanding of how they functioned. This time around, I was super interested in finding out what candidates they support and how much of a financial impact they make. I always found it curious that an individual could only donate a maximum of $2700 directly to a candidate's campaign, while a Super PAC that is "separate" from and "unaffiliated" with said candidate's campaign could receive unrestricted millions from wealthy donors and corporations.

The viz below, armed with Federal Election Commission data as of 3/31/2016, takes a look at the Independent Expenditure landscape, which is overwhelmingly dominated by Super PACs. Do you spot any interesting trends? I've listed a few of my favorites below...

"Wait, wut?" Jeb Bush had a third party spend more than $87M on his behalf, and he managed to win zero states and four delegates? Impressive!

Republican backers are outspending Democratic counterparts (in terms of Independent Expenditures) by more than 13-to-1? "Srsly?"

This obviously isn't a comprehensive picture of the entire campaign financing landscape, but I'm hoping to tackle this again in the future. We do have until November, after all...