posted: June 07, 2022 edited: June 10, 2022

Making volcano plots in python in Google Colab

data analysis, python, plotly, Google Colab, pandas, regex

Introduction

In their most basic form, experiments will generally involve comparing two conditions: a control condition vs an experimental condition where a treatment or perturbation has been applied to the system of interest. The effect of this treatment can then be read out in numerous different ways depending on what you’re interested in understanding about your system. Sometimes this can be accomplished by looking at a small number of factors such as the expression changes in a single protein or changes in the abundance of a single mRNA species, etc. However, it is not always this simple. For example, during the beginning of a project it is often not exactly clear what metrics should be tracked to best understand a particular phenomenon. In these cases, it makes sense to examine the system in a more global and unbiased manner. This is where large scale technologies like mRNA sequencing and mass spectrometry proteomics come in. They allow one to examine how treatments affect biological systems on a more global scale. Ideally, this initial pass then allows us to get a sense of what pathways are most affected such that we can then go in and perform more targeted experiments to determine what mechanisms are at play.

Volcano plot basics

If you do end up pursuing the global unbiased approach, you will be faced with several hurdles, including what comparisons to make and how to choose the proteins and/or pathways to follow up on. Remember that while you are starting with a global analysis, the goal is to eventually focus in on a particular pathway or process that is relevant to the effect you’re studying. There are two general principles that researchers tend to follow to narrow down the possibilities: focus on changes that are statistically significant (likely not just due to random chance) and on changes that are the largest. This filtering process is often visually presented with a graph known as a “volcano plot”, which, as the name implies, often resembles lava shooting out of an erupting volcano. Each point on the plot represents one comparison metric (such as the abundance of a particular protein) that was compared between the two conditions. Each comparison has two key metrics: the size of the change and the significance/p-value associated with the change.

Focusing first on the change itself, there are several ways to calculate changes between two values, such as subtracting one value from the other or dividing one value by the other. In our case, the division method is preferred because it scales the magnitude of the change relative to one of the values, which lets us compare different changes more fairly with each other. As an example, let's imagine that we are looking at two genes, GAPDH and MYC. Let’s assume we see that GAPDH expression goes from a mean of 1000 copies to a mean of 1500 copies and that MYC expression goes from a mean of 6 copies to a mean of 12 copies. Difference-wise, GAPDH changed by 500 copies while MYC only changed by 6 copies. Looking just at these values you might be tempted to say that GAPDH has the much more significant change (more than 83x the change of MYC)! However, let's reexamine this change through a different lens. GAPDH is generally much more highly expressed, and compared to its baseline expression of 1000 copies, 1500 copies only represents a 1.5x change. Meanwhile, MYC going from 6 copies to 12 copies is a 2x change. Framing these two changes as relative fold changes helps us place them into a more biologically meaningful context.
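
As a quick sketch, the arithmetic above looks like this in Python (the copy numbers are just the made-up values from the example):

```python
# Made-up mean copy numbers from the example above
gapdh_before, gapdh_after = 1000, 1500
myc_before, myc_after = 6, 12

# Absolute differences make GAPDH look like the far bigger change
gapdh_diff = gapdh_after - gapdh_before  # 500 copies
myc_diff = myc_after - myc_before        # 6 copies

# Fold changes put both genes on the same relative scale
gapdh_fc = gapdh_after / gapdh_before    # 1.5x
myc_fc = myc_after / myc_before          # 2.0x
```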

Next let’s quickly discuss the p-value, which helps us tease out which differences are meaningful and not just random coincidences that can occur when we observe complex, variable systems. Without getting into painful technical details, we can say that the p-value gives us a numerical estimate of how likely it is that our sets of measurements from the two conditions represent the same underlying value/state (i.e. come from the same distribution). Returning to our GAPDH example, let’s say we measured the control condition 3 times and got the values 800, 1000, and 1200. Similarly, for the experimental condition we had 3 measurements: 1488, 1500, and 1512. Are the GAPDH levels different between conditions? Intuitively yes, they’re higher in the experimental condition than the control, but we must consider that our measurements have error, so it’s theoretically possible that we happened to measure higher GAPDH levels in the experimental condition even if they’re actually the same as in the control. This is where the p-value comes in. We can calculate the odds that we would get these measurements if the GAPDH levels were in fact the same between the conditions. If the p-value is very small, it is very unlikely that the two conditions have the same true GAPDH level, and we can safely say that GAPDH levels are most likely different/higher in the experimental condition. In other words, smaller p-values give us more confidence that we’re looking at true changes!
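
To see this in action, here's the GAPDH example run through an independent t-test with scipy (the triplicate values are the made-up numbers from above):

```python
import scipy.stats as stat

control = [800, 1000, 1200]
experimental = [1488, 1500, 1512]

# ttest_ind returns the t statistic and the p-value
result = stat.ttest_ind(control, experimental)
pvalue = result[1]

# A small p-value means these triplicates are unlikely to
# come from conditions with the same true GAPDH level
print(pvalue)
```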

Alright, we’re almost there! But there’s another aspect to plotting these changes that we need to examine before we can get to our beautiful volcano plot. In their raw forms, fold changes and p-values are not ideal for graphing or understanding the data. Looking first at fold change, let’s consider the following cases: GAPDH going from 8 to 64 copies (64/8 = 8) and MYC going from 8 to 1 copy (1/8 = 0.125). Both are 8-fold changes, but in different directions (increase vs decrease). Notice how for any decreasing values, the entire range of possible fold changes is trapped between 0 and 1, while for increasing values the fold changes can occupy values from 1 to infinity. We want the same magnitude of fold change, independent of direction, to be visually presented as the same distance.

How do we fix this? Log transforms to the rescue! These transforms have a nice property: they can re-express our quotients in terms of exponents of a base term such that decreases become negative exponents and increases become positive exponents. More concretely, if we apply a log2 transform to our previous example we see that Log2(64/8) = Log2(8/1) = Log2(2^3) = 3 and Log2(1/8) = Log2(2^-3) = -3. This does exactly what we want! We’re free to choose any base we want for our log transform, but generally people use base 2 because it’s natural to think of changes in terms of the number of doublings or halvings. Higher bases such as 10 would work too, but using such a large number as the base would compress our changes and make them harder to visualize/differentiate.
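
You can verify the symmetry of the log2 transform in a couple of lines:

```python
import math

# An 8-fold increase and an 8-fold decrease
increase = math.log2(64 / 8)  # 3.0
decrease = math.log2(1 / 8)   # -3.0

# Same magnitude, opposite signs: exactly what we want for plotting
```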

We similarly transform the p-value using a log transform, expanding its range from 0–1 to 0 to infinity. Typically -log10 is used, which transforms smaller input p-values (closer to 0) into larger output values. As an example, a p-value of 0.01 = 10^-2 becomes -log10(10^-2) = 2.
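
In code, that transform is a one-liner:

```python
import math

pvalue = 0.01                      # = 10**-2
transformed = -math.log10(pvalue)  # smaller p -> larger output, here 2.0
```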

In summary, we have described how to take two conditions, compare changes between them, and display those changes in a way that is most conducive to understanding the data.

An example dataset

Let’s now focus on a more concrete example that assumes you have decided to use RNA sequencing to compare several conditions. As with any good experiment, you tested each condition multiple times, giving you replicate measurements/samples and enough observations to be confident about your measurements. All your data is in a text document (such as a CSV or TSV) where each column represents a unique sample and each row represents the number of copies of a particular gene’s mRNA that was detected.
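
To make the expected layout concrete, here is a tiny mock of such a table built with pandas (the gene names and counts are invented, and the real file has many more rows and three replicates per condition):

```python
import pandas as pd

# One identifier column, then one column per sample,
# named as condition + "_" + replicate index
mock = pd.DataFrame({
    'Identifiers': ['GAPDH', 'MYC'],
    'DMSO_1': [1000, 6],
    'DMSO_2': [990, 7],
    'ARID1A_1': [1500, 12],
    'ARID1A_2': [1510, 11],
})
print(mock)
```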

Here's the example test dataset that we will be working with: test_rna_seq_data.csv. I should mention that I have no association or affiliation with this dataset and I chose it at random and modified it for use in this tutorial. I found it by going to the public access NCBI RNA sequencing dataset browser: https://www.ncbi.nlm.nih.gov/sites/GDSbrowser/.

Here is the citation for the article where this data was generated: Bitler BG, Aird KM, Garipov A, Li H, Amatangelo M, Kossenkov AV, Schultz DC, Liu Q, Shih IeM, Conejo-Garcia JR, Speicher DW, Zhang R. Synthetic lethality by targeting EZH2 methyltransferase activity in ARID1A-mutated cancers. Nat Med. 2015 Mar;21(3):231-8. doi: 10.1038/nm.3799. Epub 2015 Feb 16. PMID: 25686104; PMCID: PMC4352133.

If you actually check out this website you may notice a striking lack of CSV files. This is because the preferred data format used by this browser is the so-called Simple Omnibus Format in Text (SOFT) format. This is just another plain text format that encodes metadata about the project/experiment along with the expression data. To simplify matters for this tutorial, I opened the SOFT file in Excel, copied only the expression portions, edited the file slightly to have the desired header format, and saved it as a new file in CSV format.

Coding the volcano plot generator

For this project we’re going to use something called an interactive Python notebook (.ipynb) within Google Colab, a free tool that lets us enter and run Python code in our web browser. We could code this project locally on our computer, but writing it in Colab has a few advantages. Colab essentially lets us run Python code on a remote computer where a lot of very commonly used data analysis packages have already been pre-installed. This saves us the headache of having to install these packages ourselves. Additionally, the computers provided by Google allow us to run very processor-intensive code without bogging down our own machine. Anddddd Colab notebooks can be shared, making it trivial to write your Python code and hand it to others so they can use it. The only drawback is that you have to have a Google account to run code in Colab.

Let’s dive into Colab and get started. Just google “Google Colab” and it should be the first result that pops up. This will bring you to a new page within your Google account where you will be presented with several options, including “new notebook”. Click the button and you will be brought to a new blank Python notebook. Each notebook is broken up into modular subcomponents known as cells. There are two types of editable cells: text cells for displaying textual information and code cells for executing code. Each code cell has a small play icon to the left of it that you can click to execute its code. You can add any number of cells in any order to your notebook.

First things first, let’s go to the first blank code cell and type the first few lines to import the python packages that we want to use in this project.

pandas - for importing and handling tabular data
numpy - for handling mathematical data in the form of numerical arrays
re - for performing more advanced string operations
plotly.graph_objects - for creating interactive graphs
math - for performing math operations
scipy.stats - for performing statistical operations

In code, that will look like the following:

import pandas as pd
import numpy as np
import re 
import plotly.graph_objects as go
import math
import scipy.stats as stat

Alright, now that we have all our packages available for use in our code, we can start thinking about data processing. We need a way to get the data from our text file into our program. For this we'll turn to the pandas package, which conveniently has a function called read_csv that takes in a CSV document and produces a pandas data table object known as a dataframe. We still have a problem though: how do we get the file sitting on our computer into the Colab environment so that pandas can read it in? The simplest way is to import the file into the local development environment created by Colab. To do this, click the Files icon on the very left-hand side of the page, which will reveal a file area that is pre-populated with a folder titled sample_data. Click the upload icon above this area and choose the CSV file that we want to process for upload. After you're done you should see something that looks like the image below:

In the example here, the data is named "test_rna_seq_data.csv" and we just uploaded this file into the content folder. So to import the data into our program we'll add the following line, which will create the dataframe object seq_df from our CSV document.

seq_df = pd.read_csv(r'/content/test_rna_seq_data.csv')

Filtering dataframe rows

Ok, so now we have a dataframe. Now what? Well, if you go through the rows of the data you'll see that there are a lot of rows with the identifier "ILMN_", which appear to be reads that could not be mapped to known genes. Let's filter these out of the dataframe since we only want to examine known genes in our analysis. How do we go about this? First we need to identify rows that contain this string in their identifier. We can do this by using the pandas .iloc function to extract the first column from our dataframe. The call for this looks like:

seq_df.iloc[: , 0]

The colon tells pandas to take all rows and the 0 tells pandas that we want the 0th column from the left, i.e. the first column. In other words, this will extract the first column of values as a one-dimensional pandas Series object. Next let's go through this column data and find all the rows that contain "ILMN_". We do this using the str.contains function, which will return a new Series of boolean true or false values depending on whether each row contains the given string or not. The function call for this looks like:

seq_df.iloc[: , 0].str.contains("ILMN_")

Ok cool so now we have a pandas Series object that contains boolean values. This is useful because pandas actually lets us use a boolean Series to extract either rows or columns from a dataframe. We have one problem though, right now we have true wherever a row contains ILMN and if we used this directly, we'd end up extracting all the rows with this string rather than all the rows that don't. To fix this we can use the negate operator "~" which will flip all the boolean values, true to false and false to true. Now we have:

~seq_df.iloc[: , 0].str.contains("ILMN_")

Great now that we have this new boolean series we can extract the rows that we want. To do this we'll use the pandas loc function and pass in the boolean Series as the first argument to tell it which rows we want. We want all the columns from our dataframe so for the second argument we can again use a colon to tell pandas we want every column. By using loc we'll end up with a new dataframe that has only the rows we want without "ILMN_". We can assign this newly extracted dataframe back to our seq_df variable. In summary we end up with the following code: 

seq_df = seq_df.loc[~seq_df.iloc[:, 0].str.contains('ILMN_'), :]

Alright, let’s next focus on figuring out which columns belong to which sample groups so that we can group them appropriately when performing our mean and statistical analyses. Our conditions are named such that the condition name is followed by an underscore and the index of the replicate. So DMSO_1 is replicate 1 from the DMSO condition, DMSO_2 is the second replicate, and so on. To get these condition names, we can go into our dataframe object and access the columns property, which pandas automatically populated from the first row of our CSV file when we read it in with the read_csv function. We can then take the values of this Index object as a numpy array of strings. Finally, we can wrap all of this in the built-in Python list function to convert the numpy array to an easy-to-use Python list object. As a final cleanup step, let’s remove the first value of the list, which will contain the name of the first column, "Identifiers", which we do not want since it is not one of our conditions. The code now looks like:

column_names = list(seq_df.columns.values)
column_names.pop(0)

Grouping columns/data based on conditions

Woo, getting there! Now let's construct a loop to go through each of our column/condition names and create groups that contain all samples associated with a specific condition.

for column_name in column_names:

What's next? Well, we need to store which column indices belong to which groups. For that, let's construct an empty dictionary, which will let us associate a group name (the key) with a value (a list of column indices for the columns representing that group's samples).

groups = {}

Inside the loop we need to examine the column name and determine which group it belongs to. Remember, based on our naming convention it is condition name, underscore, replicate index. So we need to extract the condition name from the column name. To do this we'll utilize the re regular expression ("regex") library. Regular expressions are small string-like expressions that you can use to specify patterns that you want to match in a string. In our case the pattern we want is an underscore followed by a single digit at the end of the string, and the regular expression for this is "_[0-9]$" (if you ever have 10 or more replicates, "_[0-9]+$" will match multi-digit indices too). If you want to learn more about regular expressions take a look at the package guide: https://docs.python.org/3/library/re.html. Back inside our loop we will use the sub function from the re package, which will find this pattern and return a new string where the pattern has been replaced with the empty string ''. This has the effect of removing everything except the group/condition name from our string. As an example, "DMSO_1" will become "DMSO" and "mycondition_5" will become "mycondition".

for column_name in column_names:
    group_name = re.sub('_[0-9]$', '', column_name)

Now that we have the group_name, we need to check if we already have any column data for that group. If we don't, it means we have never seen this group name before and we need to make a new entry in our groups dictionary. If we have seen it, then we need to add the column index to the existing list for the corresponding group. Oops, we have a problem: how do we know the index of the column we are currently looking at? We don't right now, but that's an easy fix. Let's just declare a new variable index and set it to 1 before we enter the loop (remember that the first column we care about is the second column in the dataframe, at index 1). Then in the loop we can increase the index each time as we move through the columns from left to right. All together, our loop now looks like this:

index = 1
for column_name in column_names:
    group_name = re.sub('_[0-9]$', '', column_name)

    if group_name in groups:
        groups[group_name].append(index)
    else:
        groups[group_name] = [index]
    
    index = index + 1

As a quick sanity check, let's also add a print(groups) statement after the loop so we can see what ended up inside.

Altogether the code looks like this now: 

import pandas as pd
import numpy as np
import re 
import plotly.graph_objects as go
import math
import scipy.stats as stat

seq_df = pd.read_csv(r'/content/test_rna_seq_data.csv')
seq_df = seq_df.loc[~seq_df.iloc[:, 0].str.contains('ILMN_'), :]

column_names = list(seq_df.columns.to_numpy())
column_names.pop(0)

groups = {}
index = 1
for column_name in column_names:
    group_name = re.sub('_[0-9]$', '', column_name)

    if group_name in groups:
        groups[group_name].append(index)
    else:
        groups[group_name] = [index]
    
    index = index + 1

print(groups)

If we run it we should get the output below:

{'DMSO': [1, 2, 3], 'ARID1A': [4, 5, 6], 'GSK126': [7, 8, 9]}

Calculating the fold changes

We are making progress! Let's keep going. We now know which columns belong to which groups so we're ready to analyze the data for each gene by going row by row. Assume for now we just want to compare DMSO with ARID1A. So to get that data we need to grab the columns that correspond to each group. An easy way to do that is to use the .keys() function to get the keys (condition names) from our groups dictionary and convert it to a list. The resulting list will have the first group name at index 0, the second at index 1, etc. We can then use the group names to go back into the groups dictionary and extract the column indices as shown below: 

groupnames = list(groups.keys())
g1_col_indices = groups[groupnames[0]]
g2_col_indices = groups[groupnames[1]]

Now it gets interesting. Let's use the indices to extract the values we want. We will turn again to the pandas .iloc function and, just like before, we will get all of the dataframe's rows by passing in the colon, and we will pass in the column indices to get only the columns corresponding to the data we want. We can then convert the resulting dataframes into numpy arrays so that we have access to the numerical methods provided by the numpy package.

g1_values = seq_df.iloc[:, g1_col_indices].to_numpy()
g2_values = seq_df.iloc[:, g2_col_indices].to_numpy()

To get the mean of each row (which corresponds to the mean expression value of a gene in a condition) we can use the numpy mean function and specify axis = 1. This tells numpy to calculate the mean of each of the rows. If instead we had used axis = 0 it would have calculated the means for each column. The result of this function call is a new numpy array that contains the mean expression value of each gene. We'll store these new arrays so we can work with them later. 

g1_means = g1_values.mean(axis = 1)
g2_means = g2_values.mean(axis = 1)
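
If the axis argument feels abstract, a toy 2x3 array (think 2 genes measured in 3 replicates) makes the behavior easy to check:

```python
import numpy as np

toy = np.array([[1.0, 2.0, 3.0],
                [10.0, 20.0, 30.0]])

row_means = toy.mean(axis=1)  # one mean per row/gene
col_means = toy.mean(axis=0)  # one mean per column/sample
```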

With the means in hand, we're ready to calculate the log2 of the fold changes between group 1 and group 2. Again numpy makes this very easy for us. We first use the numpy divide function to divide each element in the second array by the corresponding element in the first array. This produces a new array that we can then pass right into the numpy log2 function, which calculates the base 2 logarithm of each value. The end result of this chain of operations is a numpy array where each value represents the log2 fold change in mean expression of a gene. As a final housekeeping step, let's convert this numpy array to a list so that it's nicer to work with. The code for the fold change calculation looks like this:

foldchanges = list(np.log2(np.divide(g2_means, g1_means)))

Calculating the p-values

Now onto the p-value portion of our journey. We need to go through each row of the data and calculate the p-value associated with our observations by using a t-test. To do this, we'll use the ttest_ind function that we imported via the scipy.stats package. This function takes in two arrays (the individual values from each group) and returns a result containing the t statistic and the corresponding p-value. Once we have our p-value for a row, we can add it onto a growing list of p-values (1 p-value per gene/row). And as always, to go through the rows, we'll be using a for loop. Our p-value calculation code now looks like this:

pvals = []
num_values = g1_means.shape[0]
for row in range(0, num_values):
    ttest_result = stat.ttest_ind(g1_values[row, :],  g2_values[row, :])
    pvalue = ttest_result[1]
    pvals.append(pvalue)

Almost thereeee. But not quite. Patience is the key here. Let's transform our raw p-values so that we can use them when we make our volcano plot. We'll do this in two steps: first we'll correct for the fact that we're calculating p-values many times by using the so-called Bonferroni correction, and then we'll -log10 transform the corrected values for the reasons we mentioned earlier in the post. Let's first convert our p-value list into a numpy array and then multiply it by the number of tests we've performed, which is equal to the total number of rows we analyzed. We can then feed these corrected p-values into numpy's log10 function to get the log10-transformed corrected p-values. Finally, we can multiply those by -1 and convert the values back into a Python list, and boom, there we have it!

transformed_pvals = list(-1*np.log10(num_values*np.array(pvals)))

Graphing the data

We're finally at the fun part where we get to see the values plotted on the volcano plot. For this portion we're going to turn to plotly, an extensive Python package that can render responsive graphs. In our case we'll be using it to render a scatter plot using our fold change list and our transformed_pvals list. First let's add in a few variables to set the plot's title, label the axes, and specify the radius of the scatter points. A nice feature of Colab is that it parses the Python code and allows you to generate an interactive form where you can enter values for variables. In the code this can be done using the #@param {} command, or you can bring up a visual form that lets you specify the name and type of the variable you want in the form.

This can produce the code below: 

plot_title = "Group 1 vs Group 2" #@param {type:"string"}
x_axis_title = "log2 fold change" #@param {type:"string"}
y_axis_title = "-log10 pvalue" #@param {type:"string"}
point_radius = 4 #@param {type:"slider", min:1, max:30, step:1}

Next, let's generate a new figure by calling the plotly Figure() function. We can then set some parameters using the update_layout function.

fig = go.Figure()

fig.update_layout(
    title=plot_title,
    xaxis_title= x_axis_title,
    yaxis_title=y_axis_title,
    paper_bgcolor= 'white',
    plot_bgcolor='white',
)

Now let's get a little fancy with our plot. As a visual bonus, it would be nice if we could highlight the points that represent genes that were significantly increased or decreased. How do we go about doing this? Well, plotly lets you provide an array of colors such that each color element is assigned to each point in a one-to-one relationship. In other words, we need to generate a list of colors, with a length equal to our number of points, where we assign each color based on the point's x and y coordinates. As a baseline, we want all points to be gray, but if a point's y value is above a significance threshold, we will color it blue when its x value is below a negative fold change threshold and red when its x value is above a positive one.

colors = []

for i in range(0, len(foldchanges)):
    if transformed_pvals[i] > 2:
        if foldchanges[i] > 0.5:
            colors.append('#db3232')
        elif foldchanges[i] < -0.5:
            colors.append('#3f65d4')
        else:
            colors.append('rgba(150,150,150,0.5)')
    else:
        colors.append('rgba(150,150,150,0.5)')

The next step is to actually plot our points. The plotly package operates with objects called traces, and we need to add our points to a trace that we can then add to the figure. So we'll generate a new scatter trace using the go.Scattergl function and provide the fold changes as the x values, the transformed p-values as the y values, the names of the genes as the point labels (text), and some parameters to style the points (markers) with the radius we set earlier. We'll add this trace to the figure with the appropriately named add_trace function, and the final step is to show the figure by calling its show function.

fig.add_trace(
    go.Scattergl(
        x = foldchanges,
        y = transformed_pvals,
        mode = 'markers',
        text = seq_df.iloc[:, 0].values.tolist(),
        hovertemplate ='%{text}: %{x}<br>',
        marker= {
            'color':colors,
            'size':point_radius,
        }
    )
)
fig.show()

The final result

If you used this dataset and everything worked, you should end up with the plot below! As expected, most changes cluster near a fold change of 0 with low -log10 p-values, and then, because bigger changes tend to be more significant, we see a spreading out/divergence from the central vertical axis as we go up the graph. The great thing about plotly is that rather than producing a static image of a plot, it actually produces an interactive plot that supports zoom, hover, etc. Just hover over any one of the points to see the size of the fold change and which gene it represents. Although we did some customizations here, we've barely scratched the surface of what you can do with plotly. Basically every aspect of this plot is customizable, from colors to fonts to spacing. I encourage you to look through the plotly documentation so you can see all the amazing things you can do with this package.

The final notebook and code 

Last, but not least, you can look at all the code in an actual colab notebook ready for your use right here: https://colab.research.google.com/drive/1f_ex51iOkdkviT1sNzj9nchzNt23vLrf

import pandas as pd
import numpy as np
import re 
import plotly.graph_objects as go
import math
import scipy.stats as stat

seq_df = pd.read_csv(r'/content/test_rna_seq_data.csv')
seq_df = seq_df.loc[~seq_df.iloc[:, 0].str.contains('ILMN_'), :]

groups = {}
index = 1

column_names = list(seq_df.columns.to_numpy())
column_names.pop(0)
for column_name in column_names:
    group_name = re.sub('_[0-9]$', '', column_name)

    if group_name in groups:
        groups[group_name].append(index)
    else:
        groups[group_name] = [index]
    
    index = index + 1

groupnames = list(groups.keys())
pvalues = []

g1_col_indices = groups[groupnames[0]]
g2_col_indices = groups[groupnames[1]]

g1_values = seq_df.iloc[:, g1_col_indices].to_numpy()
g2_values = seq_df.iloc[:, g2_col_indices].to_numpy()

g1_means = g1_values.mean(axis = 1)
g2_means = g2_values.mean(axis = 1)

foldchanges = list(np.log2(np.divide(g2_means, g1_means)))

pvals = []
num_values = g1_means.shape[0]
for row in range(0, num_values):
    ttest_result = stat.ttest_ind(g1_values[row, :],  g2_values[row, :])
    pvalue = ttest_result[1]
    pvals.append(pvalue)

transformed_pvals = list(-1*np.log10(num_values*np.array(pvals)))

plot_title = "Group 1 vs Group 2" #@param {type:"string"}
x_axis_title = "log2 fold change" #@param {type:"string"}
y_axis_title = "-log10 pvalue" #@param {type:"string"}
point_radius = 4 #@param {type:"slider", min:1, max:30, step:1}

fig = go.Figure()

fig.update_layout(
    title=plot_title,
    xaxis_title= x_axis_title,
    yaxis_title=y_axis_title,
    paper_bgcolor= 'white',
    plot_bgcolor='white',
)

colors = []

for i in range(0, len(foldchanges)):
    if transformed_pvals[i] > 2:
        if foldchanges[i] > 0.5:
            colors.append('#db3232')
        elif foldchanges[i] < -0.5:
            colors.append('#3f65d4')
        else:
            colors.append('rgba(150,150,150,0.5)')
    else:
        colors.append('rgba(150,150,150,0.5)')

fig.add_trace(
    go.Scattergl(
        x = foldchanges,
        y = transformed_pvals,
        mode = 'markers',
        text = seq_df.iloc[:, 0].values.tolist(),
        hovertemplate ='%{text}: %{x}<br>',
        marker= {
            'color':colors,
            'size':point_radius,
        }
    )
)
fig.show()

That was a lot to cover, but hopefully you learned something (or a few things!) through this process and can apply it in your next data analysis or graphing project.