With the increased usage of large-scale omics technologies such as RNAseq to generate data, it has become popular to represent results using heatmaps where numerical values are transformed into colors. This type of representation can be created in many ways, including programmatically, but in many cases the fastest and easiest way to do so is using good old excel.
To illustrate how excel can be used to generate a typical fold change heatmap, lets walk through a concrete hypothetical example where we have performed RNA sequencing on 6 samples (3 for the control and 3 in the experimental conditions. The RNA sequencing core has sent back the data as an excel file which in our case contains 20 genes and the number of transcripts measured in each condition for each gene. Here's the excel file which contains the original data, all the intermediates and the final heatmap on sheet 2: example_dataset.xlsx
The first step when dealing with data like this is to transform it so that we can better understand what occurred during the experiment. In this case we have a series of genes where we measured RNA levels in triplicate between two conditions: the control condition and the experimental condition. What we want to understand is which genes had the greatest changes in expression (either increased or decreased) so that we can understand what effect the experimental condition had on the gene expression in the cells we studied.
Right now, the sheet contains raw counts of transcripts which vary widely between genes. Making a heatmap of the data as it currently exists would not be useful because it would just display which genes have the overall highest or lowest transcript levels. We want to understand the changes produced by our experiment, not the absolute expression levels. In other words, we want to transform these raw values into the changes between experimental and control. How do we do this? For each gene, we will set the control value as the baseline (the values we normally expect) and then compare how much the gene went up or down in experimental condition. First let’s average the values we got for each gene in the control condition to get our baseline expression for that gene. In excel this can be done with the AVERAGE function. So for gene 1, the average control expression would be =AVERAGE(B2:D2). We can now compare all row values (both control and experimental values) to this average value by dividing their value by the average. So for the first control value it would be =B2/AVERAGE(B2:D2) which in this case is 0.8840. Now lets get clever. As you may know, if you enter a formula into a cell such as I2, after entering it, you can go over the right bottom corner and drag it which will copy the formula over into all the cells you select. This is super useful so you don’t have to copy the formula manually each time and you can apply it to many values at once almost instantly.
But to use this feature, we need to understand its behavior. Say we entered the formula =B2/AVERAGE(B2:D2) into cell I2. If we now drag it over to the right into cell J2, the formula will become =C2/AVERAGE(C2:E2). Notice because we went right by one column, it changed all the values in the formula so that they also increased by one column. Is this what we want? No! When we drag across horizontally to calculate the value for each of our replicates for gene 1, we want the average to be the same because we always want to compare each replicate value to the average control expression values. AVERAGE(C2:E2) is calculating the average including 2 control values and 1 experimental, not what we want which is always AVERAGE(B2:D2). How do we fix this? We can prevent excel from changing the column when we drag horizontally by changing our formula to =B2/AVERAGE($B2:$D2). Notice the dollar sign($) in front of the B and the D in AVERAGE. This fixes the values so that they don’t change when we drag horizontally. Now when we drag our formula to the right only B2 will change to C2 etc. That’s what we want, we want to apply this formula to each value, but using the same average every time. Starting from J2, lets drag 6 cells to the right and we should see 6 values representing the fold change of each replicate with respect to the control average for gene 1. Fantastic! Now, let’s apply this fancy formula to every gene by again. Select the cells I2 to J2, find the bottom right corner, and drag it down till the selection is down to row 21, the row of the last gene. Just like that we have all our fold changes. Remember, we fixed the columns in our formula, but no row values. So when we drag the formula down, excel will automatically change all the 2s in B2/AVERAGE($B2:$D2) to 3s, 4s, etc. This is what we want since we want to calculate the values for each row. Note that rows can be fixed just like columns by adding a dollar sign directly in front of them ($B$2 for example).
Now we have all our fold changes but let’s not stop there. Notice the range of values is kind of odd. We have some genes that go down by 10-fold relative to the control average and thus they have values near 0.1. Other genes go up dramatically and they have numbers closer to 30 meaning a 30x increase over the control average. We want to display changes in a way that’s easier to understand. As we have it now, a 10 fold decrease would appear as a 0.1 (1/10) and a 10 fold increase would appear as a 10 (10/1) with the center (no change as a 1 (1/1) ). It seems odd that no change is much closer to 0.1 than to 10 even though they are both equally as many fold changes away from no change. How do we fix this? One way would be to invert any numbers less than 1 and multiply them by -1. So 0.1 would become -1 *1/0.1 = -10.
We use a log transform! The idea is simple. Again lets work with the 0.1 and 10 example. The 0.1 comes from having 1 experimental/10 control which can be represented as 10-1 and 10 experimental/1 control which can be represented as 101. And remember no fold change is 1 which can be represented as 100 Notice here that the exponents in these representations have a nice property: no change is 0 and -1 and 1 are equally away from 0 in a way that nicely captures that they represent the same magnitude of difference.
How do we extract these type of exponent representations from our values? We use a log transform! In the above example I assumed base 10, but you can use any base for your log transform, 1.3, 6, 345, but let’s choose one that makes sense. 10 is a large number and as a result, small exponents can lead to large number quickly, In other words, a 100 fold change (which is quite) dramatic would only be 102 ie 2 on the log base 10 scale or for example an 8 fold change would be Log(8) = 0.9. By using base 10, we would enormously shrink our changes and we lose the ability to gauge whether they are biologically meaningful. Although this choice is arbitrary, it is standard practice to use base 2 such that we describe our fold changes in terms of number of doublings.
To do this in excel, lets move to cell P2 and enter the formula = LOG(I2,2) which tells excel to use base 2 to log transform the cell I2 where we have calculated the fold change of B2(the first control replicate relative to gene 1 control average). Again with the drag function, lets expand the formula 6 cells to the right and 20 rows down. This will yield a new matrix of 120 values that are the log 2 transformed values of expression changes for each gene. These are finally the values we can use to make our heatmap. As a sanity check lets look at the values we have now. Looking at the control values, we see that they all hover around 0 which makes sense because these values belong to the baseline condition which has no change relative to itself.
To keep things tidy in our excel file, lets select all these values, create a new sheet and right click cell B2, and select Paste Special -> Values. Remember if we just do a normal copy operation, we are actually copying the formulas which will be broken if we move to a new sheet since this sheet doesn’t have the source values for the calculations. We just want the values so we will only copy those.
Alright now lets select all these values, go to Conditional Formatting, then the submenu color scales, and select the blue red color scale option. The cell backgrounds will now change with a color that represents the values inside them.
Now lets go ahead and hide the values in the cells so we can better see the colors. Again, with all the cells select, go to Format > Cells …. In the new popup window, select the Number tab on the left and from the categories list select custom at the very bottom. In the text box now type three semi colons ;;; and once you’re finished click OK. All the numbers should now disappear from view. They’re still there don’t worry, but they don’t appear on the actual sheet. To see them just select a cell and the value will show up in the formula bar.
Now lets resize the cells so that they are nice squares. Go to the top and select the 6 columns with data. Now hover over the border between columns G and H and a resize cursor will appear.
Click and drag and it will tell you how many pixels wide it will make each of the selected columns. In this case we will go with the 25. Do the same thing with the rows so that they also become 25px tall. After that you should have a nice grid of square cells.
Right now the fold changes are organized by gene rather than by size. Lets fix that. Select all the colors cells and include one row above. With the cells selected, go to Sort & Filter and select Custom Sort… . A new popup will appear. Lets sort by the last column G and under order select the Largest to Smallest option and click ok.
Now the heatmap should be neatly organized with the largest increases at the top and the largest decreases at the bottom. Almost done, but not quite. We are currently using the default color scale provided by excel. This uses the min and max of the values to determine the bounds of the color scale. We want a fixed symmetrical scale with 0 at the center. Lets determine that range we want by using the MIN and MAX functions to determine the upper and lower bounds of our data. Go to a new cell, type “=MIN(“ and then use your mouse to drag over the 120 values) and close the function with a right parenthesis “)”. Do the same with the max function. We see that -3.27 is our min and our max is 5.2. So since we want a symmetrical scale that includes all our values, lets choose -6 to 6 as our range.
To apply this range to our color scale, go again to Conditional Formatting and select the bottom option Manage rules. This will pull up a popup that lists the current rules on the selected cells including the one we added previously. Select that rule and click Edit Rule… . In the Edit Formatting Rule popup, set Style to 3 color scale, for Minimum set Type to Number and enter -6 and select a blue color. For Midpoint set Type to Number and enter the number 0 and select a white color. For Maximum set Type to Number and enter the number 6 and select a red color. Click Ok and the heatmap should reflect the changes you just made.
Woo! We’re at the finish line. You can additionally change the style of the heatmap by adding a white border or whatever else you want. Once you’re done customizing it, select all the colored cells in the heatmap, Go to Save As, select PDF and click the Selection:cells option.
That’s it! You should now have a PDF of the heatmap you generated!