This article introduces data skew metrics specific to hash-distributed tables in Azure Synapse Analytics and visualises them in a Power BI dashboard. It concludes with a discussion of why selecting an appropriate distribution column matters for the performance of operations on the table's data.
A fundamental feature of the Dedicated SQL Pool in Azure Synapse Analytics (formerly SQL DW) is its ability to store tables across multiple distributions and to operate on them in parallel, which provides excellent performance on large data sets and across various analytical scenarios.
A hash-distributed table distributes rows based on the value in a distribution column. There are three methods for distributing data:
- Round-robin (default)
- Hash
- Replicated
The selection of the distribution column is important for hash-distributed tables because it affects how data is stored and retrieved. It is important to select a column that spreads data evenly across the distributions and thus minimises data skew. This allows for better parallelisation and consequently better performance.
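To make the three options concrete, below is a minimal DDL sketch for each distribution method. The table and column names (dbo.FactSales, CustomerId, and so on) are illustrative assumptions, not part of the original article.

-- Hash distribution: rows are assigned to one of the 60 distributions
-- by hashing the distribution column (CustomerId is a hypothetical choice).
CREATE TABLE dbo.FactSales_Hash
(
    SaleId     BIGINT         NOT NULL,
    CustomerId INT            NOT NULL,
    Amount     DECIMAL(18, 2) NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX);

-- Round-robin (the default): rows are spread evenly but arbitrarily,
-- so no distribution column is declared.
CREATE TABLE dbo.FactSales_RoundRobin
(
    SaleId     BIGINT         NOT NULL,
    CustomerId INT            NOT NULL,
    Amount     DECIMAL(18, 2) NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);

-- Replicated: a full copy of the table is cached on every compute node,
-- typically used for small dimension tables.
CREATE TABLE dbo.DimCustomer
(
    CustomerId INT           NOT NULL,
    Name       NVARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);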
Skewness measures
Skewness is a measure of the symmetry of distribution. The distribution is skewed if there are significant differences in the values across the distributions, which make them asymmetrical.
We are going to introduce two measures of skewness. The first is the Data Skew Percentage, and it is calculated using the formula below:

Data Skew Percentage = 100 - (average row count across distributions / maximum row count among distributions) × 100

As you can see, the formula is based on the average row count across all distributions divided by the maximum row count among them.
For example, if we have 60 distributions, 59 of which have only 1 row each while the last one has 20 rows, then the formula will return:
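The average row count is (59 × 1 + 20) / 60 = 79 / 60 ≈ 1.32, and the maximum is 20, so:

Data Skew Percentage = 100 - (1.32 / 20) × 100 ≈ 93.42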
As you can see, the data skew percentage for this data set is very high.
The other formula that we are going to use calculates the Data Skew Coefficient:

Data Skew Coefficient = standard deviation of the row counts / average row count
To calculate the Data Skew Coefficient, we compute the standard deviation of the row counts and divide it by the average row count across all distributions. Unlike the skew percentage, which only compares the distribution with the largest number of rows against the average, the skew coefficient takes all distributions into account. Since the value of the standard deviation depends on the magnitude of the values it is calculated on, we normalise it by dividing by the mean (this is a standard statistic known as the coefficient of variation). This gives us a number that is 0 for perfectly evenly distributed data and grows as the data set becomes more variable.
Using the same data set as in the Data Skew Percentage example, we can also calculate the Data Skew Coefficient:
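The average row count is again 79 / 60 ≈ 1.32, and the sample standard deviation of the row counts (STDEV in SQL, which the script below uses) is approximately 2.45, so:

Data Skew Coefficient = 2.45 / 1.32 ≈ 1.86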
SQL implementation
Using system tables and dynamic management views, we are going to implement these formulas in SQL.
The following script returns information about all tables in the dedicated SQL pool. For each table it returns the current distribution method, the distribution column, the data skew coefficient, the data skew percentage, and other useful details such as size on disk and schema name.
SELECT
    [Full Entity Name]              = t.full_entity_name,
    [Schema Name]                   = t.schema_name,
    [Entity Name]                   = t.entity_name,
    [Current Distribution Method]   = t.distribution_method,
    [Current Distribution Column]   = ISNULL(t.distribution_column, '-'),
    [Current Rows]                  = SUM(t.rows_count),
    [Distribution Count]            = COUNT(t.rows_count),
    [Current Data Size on Disk MB]  = SUM(t.data_size_MB),
    [Current Index Size on Disk MB] = SUM(t.index_size_MB),
    -- Data Skew Coefficient: standard deviation of the per-distribution
    -- row counts divided by the average row count; the * 1.0 forces float
    -- arithmetic so the average is not truncated by integer division
    [Skew Coefficient] = CASE
                             WHEN SUM(t.rows_count) <> 0
                             THEN ROUND(t.standard_deviation / (SUM(t.rows_count) * 1.0 / COUNT(t.rows_count)), 2)
                             ELSE 0
                         END,
    -- Data Skew Percentage: 100 - (average row count / maximum row count) * 100
    [Skew Percentage] = CASE
                            WHEN MAX(t.rows_count) <> 0
                            THEN CAST((100.0 - (AVG(CAST(t.rows_count AS float)) / MAX(t.rows_count) * 100)) AS DECIMAL(4, 2))
                            ELSE 0
                        END
FROM
(
    SELECT
        full_entity_name    = QUOTENAME(s.name) + '.' + QUOTENAME(t.name),
        schema_name         = s.name,
        entity_name         = t.name,
        distribution_method = tp.distribution_policy_desc,
        distribution_column = c.name,
        rows_count          = nps.row_count,
        -- Page counts are in 8 KB pages
        data_size_MB  = ((nps.in_row_data_page_count + nps.row_overflow_used_page_count + nps.lob_used_page_count) * 8.0) / 1000,
        index_size_MB = ((nps.reserved_page_count
                          - (nps.reserved_page_count - nps.used_page_count)
                          - (nps.in_row_data_page_count + nps.row_overflow_used_page_count + nps.lob_used_page_count)) * 8.0) / 1000,
        -- Standard deviation of the row counts across all distributions of the table
        standard_deviation = STDEV(nps.row_count) OVER (PARTITION BY t.object_id)
    FROM sys.schemas AS s
    INNER JOIN sys.tables AS t
        ON s.schema_id = t.schema_id
    INNER JOIN sys.indexes AS i
        ON t.object_id = i.object_id AND i.index_id <= 1
    INNER JOIN sys.pdw_table_distribution_properties AS tp
        ON t.object_id = tp.object_id
    INNER JOIN sys.pdw_table_mappings AS tm
        ON t.object_id = tm.object_id
    INNER JOIN sys.pdw_nodes_tables AS nt
        ON tm.physical_name = nt.name
    INNER JOIN sys.dm_pdw_nodes AS pn
        ON nt.pdw_node_id = pn.pdw_node_id
    INNER JOIN sys.pdw_distributions AS di
        ON nt.distribution_id = di.distribution_id
    INNER JOIN
    (
        -- Aggregate partition-level statistics to one row
        -- per table, node, and distribution
        SELECT
            object_id                    = object_id,
            pdw_node_id                  = pdw_node_id,
            distribution_id              = distribution_id,
            row_count                    = SUM(row_count),
            in_row_data_page_count       = SUM(in_row_data_page_count),
            row_overflow_used_page_count = SUM(row_overflow_used_page_count),
            lob_used_page_count          = SUM(lob_used_page_count),
            reserved_page_count          = SUM(reserved_page_count),
            used_page_count              = SUM(used_page_count)
        FROM sys.dm_pdw_nodes_db_partition_stats
        GROUP BY object_id, pdw_node_id, distribution_id
    ) AS nps
        ON  nt.object_id       = nps.object_id
        AND nt.pdw_node_id     = nps.pdw_node_id
        AND nt.distribution_id = nps.distribution_id
    LEFT JOIN
    (
        -- The distribution column of each hash-distributed table
        SELECT object_id, column_id
        FROM sys.pdw_column_distribution_properties
        WHERE distribution_ordinal = 1
    ) AS cdp
        ON t.object_id = cdp.object_id
    LEFT JOIN sys.columns AS c
        ON cdp.object_id = c.object_id AND cdp.column_id = c.column_id
    WHERE pn.type = 'COMPUTE'
) AS t
GROUP BY
    t.full_entity_name,
    t.schema_name,
    t.entity_name,
    t.distribution_method,
    t.distribution_column,
    t.standard_deviation;
Additionally, we have one more query that returns information about every distribution of each table. The result set includes the name of the distribution (1, 2, 3, …), the row count in each distribution, the name of the table, and the current distribution method and column.
SELECT
    [Entity Name]                 = QUOTENAME(s.name) + '.' + QUOTENAME(t.name),
    [Current Distribution Method] = tp.distribution_policy_desc,
    [Current Distribution Column] = c.name,
    [Distribution Name]           = di.name,
    [Row Count]                   = nps.row_count
FROM sys.schemas AS s
INNER JOIN sys.tables AS t
    ON s.schema_id = t.schema_id
INNER JOIN sys.indexes AS i
    ON t.object_id = i.object_id AND i.index_id <= 1
INNER JOIN sys.pdw_table_distribution_properties AS tp
    ON t.object_id = tp.object_id
INNER JOIN sys.pdw_table_mappings AS tm
    ON t.object_id = tm.object_id
INNER JOIN sys.pdw_nodes_tables AS nt
    ON tm.physical_name = nt.name
INNER JOIN sys.dm_pdw_nodes AS pn
    ON nt.pdw_node_id = pn.pdw_node_id
INNER JOIN sys.pdw_distributions AS di
    ON nt.distribution_id = di.distribution_id
INNER JOIN
(
    -- One row per table, node, and distribution
    SELECT
        object_id                    = object_id,
        pdw_node_id                  = pdw_node_id,
        distribution_id              = distribution_id,
        row_count                    = SUM(row_count),
        in_row_data_page_count       = SUM(in_row_data_page_count),
        row_overflow_used_page_count = SUM(row_overflow_used_page_count),
        lob_used_page_count          = SUM(lob_used_page_count),
        reserved_page_count          = SUM(reserved_page_count),
        used_page_count              = SUM(used_page_count)
    FROM sys.dm_pdw_nodes_db_partition_stats
    GROUP BY object_id, pdw_node_id, distribution_id
) AS nps
    ON  nt.object_id       = nps.object_id
    AND nt.pdw_node_id     = nps.pdw_node_id
    AND nt.distribution_id = nps.distribution_id
LEFT JOIN
(
    SELECT object_id, column_id
    FROM sys.pdw_column_distribution_properties
    WHERE distribution_ordinal = 1
) AS cdp
    ON t.object_id = cdp.object_id
LEFT JOIN sys.columns AS c
    ON cdp.object_id = c.object_id AND cdp.column_id = c.column_id
WHERE pn.type = 'COMPUTE';
Power BI dashboard
Finally, we are ready for the last part of the article: the creation of the Power BI dashboard. We will skip the initial project setup, the connection to the server, and the import of the data using the queries above, and proceed directly to the dashboard itself.
After our data is loaded into the Power BI project, we need to create a relationship between the two datasets. In our case we will use the [Entity Name] column, and the relationship will be many-to-one: many rows in the per-distribution dataset map to one row in the per-table summary dataset.
Next, we are going to look at our dashboard.
I am going to briefly present the dashboard components and how they can be used. Going from top to bottom, the first thing you see is a table containing the data returned by the first query. Above the table are filters with two options: schema and distribution method. Other features worth noting are the ordering of the results in descending order and the different colouring of the Data Skew Coefficient and Skew Percentage columns.
The second component is a stacked column chart, positioned at the bottom. It is linked to the results in the table above it: when a row from the table is selected, all available distribution information for that table is presented on the chart. On the x-axis are the distributions (1…60) and on the y-axis is the row count in each distribution.
Analysis of the results
We are ready to review and discuss the results. In the ideal case, all rows are distributed equally across the distributions; there is no skewness, and both the skew percentage and the skew coefficient are close to 0. Dashboard 1 (above) and Dashboard 2 show examples of tables with even distributions.
It is recommended that tables with a skew coefficient greater than 3 be reviewed for possible problems with the selection of the distribution column. In all such cases, the corresponding skew percentage will also be high. An example can be seen below (Dashboard 3 and Dashboard 4), where the rows are distributed unevenly across the distributions.
Here the Skew Coefficient is 4.13 and the Skew Percentage is 92.61. Both are quite high, and the reason can be observed on the chart.
Similarly, for this table we also have very high skew metrics since only five of all the distributions have data in them.
Although the combination of a low skew coefficient and a high skew percentage is not common, it can still occur. It results from a relatively even row distribution in which a single distribution holds significantly more data than the others. In that case the skew coefficient can still have an acceptable value, while the skew percentage will be higher, as the small example below illustrates.
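For instance (an illustrative data set, not taken from the dashboards above), take 60 distributions where 59 hold 100 rows each and one holds 300. The average is about 103.3 rows, so the skew percentage is 100 - (103.3 / 300) × 100 ≈ 65.6, which looks alarming, while the sample standard deviation is only about 25.8, giving a skew coefficient of 25.8 / 103.3 ≈ 0.25, well below the review threshold.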
Conclusion
For best performance, all the distributions should have approximately the same number of rows. When one or more distributions have a disproportionate number of rows, some distributions will complete their portion of a parallel query before others. The problem of data skewness should not be underestimated: when skewness is present, it should be reviewed, and where possible a new distribution column should be chosen based on data and query pattern analysis, as the sketch below illustrates.
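Since a table's distribution cannot be changed in place in a dedicated SQL pool, redistribution is typically done with CTAS followed by a rename. The sketch below assumes a hypothetical table dbo.FactSales being redistributed on CustomerId; the table and column names are illustrative, not a recommendation for any particular schema.

-- Create a copy of the table with the new distribution column...
CREATE TABLE dbo.FactSales_New
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM dbo.FactSales;

-- ...then swap the tables via metadata-only renames.
RENAME OBJECT dbo.FactSales TO FactSales_Old;
RENAME OBJECT dbo.FactSales_New TO FactSales;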
In summary, I believe a dashboard like this can make our lives easier. It needs to be checked regularly, however, which might not always happen for various reasons. As a complement to the dashboard, I would suggest a stored procedure built on the same queries. Its main job would be to check the skew percentage and skew coefficient and to send notifications when they exceed acceptable values; a sketch follows.
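Below is a minimal sketch of such a procedure, under some assumptions: the procedure name and thresholds are illustrative, the skew calculation is a condensed version of the first query above, and the notification itself (for example a pipeline or alerting tool polling the result set) is environment-specific and not shown.

CREATE PROCEDURE dbo.usp_CheckDataSkew
    @CoefficientThreshold FLOAT = 3.0,   -- review threshold from the analysis above
    @PercentageThreshold  FLOAT = 80.0   -- illustrative assumption
AS
BEGIN
    SELECT
        [Entity Name]      = q.entity_name,
        [Skew Coefficient] = ROUND(STDEV(q.row_count)
                                   / NULLIF(AVG(CAST(q.row_count AS FLOAT)), 0), 2),
        [Skew Percentage]  = ROUND(100.0 - AVG(CAST(q.row_count AS FLOAT))
                                   / NULLIF(MAX(q.row_count), 0) * 100, 2)
    FROM
    (
        -- Row count per table and distribution, as in the queries above
        SELECT
            entity_name = QUOTENAME(s.name) + '.' + QUOTENAME(t.name),
            nt.distribution_id,
            row_count   = SUM(nps.row_count)
        FROM sys.schemas AS s
        INNER JOIN sys.tables AS t ON s.schema_id = t.schema_id
        INNER JOIN sys.pdw_table_mappings AS tm ON t.object_id = tm.object_id
        INNER JOIN sys.pdw_nodes_tables AS nt ON tm.physical_name = nt.name
        INNER JOIN sys.dm_pdw_nodes AS pn ON nt.pdw_node_id = pn.pdw_node_id
        INNER JOIN sys.dm_pdw_nodes_db_partition_stats AS nps
            ON  nt.object_id       = nps.object_id
            AND nt.pdw_node_id     = nps.pdw_node_id
            AND nt.distribution_id = nps.distribution_id
        WHERE pn.type = 'COMPUTE'
        GROUP BY s.name, t.name, nt.distribution_id
    ) AS q
    GROUP BY q.entity_name
    -- Keep only the tables whose skew metrics exceed the thresholds
    HAVING STDEV(q.row_count) / NULLIF(AVG(CAST(q.row_count AS FLOAT)), 0) > @CoefficientThreshold
        OR 100.0 - AVG(CAST(q.row_count AS FLOAT)) / NULLIF(MAX(q.row_count), 0) * 100 > @PercentageThreshold;
END;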
Microsoft provides excellent guidance about choosing a distribution method and columns in the articles below.
- Guidance for designing distributed tables using dedicated SQL pool in Azure Synapse Analytics
- Choosing hash distributed table vs. round-robin distributed table in Azure SQL DW Service