Analyzing Label-free Proteomics Data using Perseus
By Hanrui Zhang on 02/28/2022. This is a cheat sheet we use for label-free proximity labeling data analysis.
References
- Tyanova, S., & Cox, J. (2018). Perseus: A Bioinformatics Platform for Integrative Analysis of Proteomics Data in Cancer Research. Methods in Molecular Biology, 1711, 133-148.
1. Software Download and Installation
Download from the official website.
2. Data File
The data file should be in tab-delimited text (.txt) format; an Excel spreadsheet can be saved as tab-delimited text.
3. Loading Data
- Click the "Generic matrix upload" button.
- In the pop-up window, navigate to the file to be loaded.
- Select all the "LFQ" columns and transfer them to the Main columns window.
4. Filtering Low-quality data, Log Transformation, and Summary Statistics
- In the workflow panel, change the name of the data matrix from matrix 1 to InitialData.
- Go to "Processing -> Filter rows based on categorical column" to exclude proteins identified by site, matching to the reverse database or contaminants.
- Transform the data to a logarithmic scale by going to "Processing -> Basic -> Transform" and specifying the transformation function (e.g., log2(x)).
- Use "Analysis -> Visualization -> Histogram" to confirm if the transformed matrix more or less follows the normal distribution.
- In the "Processing" section, select the "Basic" menu and click on the "Summary statistics (columns)" button. Select all expression columns. Click ok and explore the new matrix.
5. Grouping Samples and Filtering Valid Values According to Groups, and Imputation of Missing Values from Normal Distribution
- Before filtering the data for valid values, it makes sense to group the samples according to replicates using "Processing -> Annot. rows -> Categorical annotation rows".
- The name of the grouping can be left at the default or changed to "Replicate". The remaining column names can then be shortened, or each group given a distinct name; "Rearrange" can also be used to change column names.
- Identifications with just one reported intensity are usually not very useful for further analyses. This is why we filter for valid values with "Processing -> Filter rows -> Filter rows based on valid values".
- Assuming 3 replicates per condition, we want "3" valid values "in at least one group": interaction partners without any background affinity will not appear in the control but should still show three valid values in the actual pulldown.
- Imputation: impute missing values from a normal distribution with "Processing -> Imputation -> Replace missing values from normal distribution". Perseus narrows the distribution to a factor of "0.3" (width) of the observed standard deviation, shifts it down by "1.8" (down shift) standard deviations, and draws random values from this distribution to fill in the missing values. Use the whole matrix as the mode. A scripting sketch of this filtering and imputation step follows the list below.
- Using histograms as before (Analysis → Visualization → Histogram), inspect the distributions of the imputed values by generating histograms for all LFQ intensity columns.
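A minimal numpy/pandas sketch of the valid-value filtering and down-shifted normal imputation, continuing from `log_lfq` above; the replicate column names in the `groups` dictionary are placeholders, and Perseus' exact implementation may differ slightly:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Placeholder replicate names -- replace with your actual LFQ column names.
groups = {
    "pulldown": ["LFQ intensity pd_1", "LFQ intensity pd_2", "LFQ intensity pd_3"],
    "control": ["LFQ intensity ctrl_1", "LFQ intensity ctrl_2", "LFQ intensity ctrl_3"],
}

# Keep proteins with at least 3 valid values in at least one group.
valid = pd.DataFrame({g: log_lfq[cols].notna().sum(axis=1) >= 3 for g, cols in groups.items()})
filtered = log_lfq[valid.any(axis=1)].copy()

# Impute missing values from a normal distribution narrowed to 0.3x the observed
# standard deviation and shifted down by 1.8 standard deviations (whole-matrix mode).
arr = filtered.to_numpy(copy=True)
missing = np.isnan(arr)
mu, sd = np.nanmean(arr), np.nanstd(arr)
arr[missing] = rng.normal(mu - 1.8 * sd, 0.3 * sd, size=missing.sum())
filtered = pd.DataFrame(arr, index=filtered.index, columns=filtered.columns)
```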
6. Exploratory Analysis for Quality Check
- Multi-scatter plot: use multi-scatter plots with "Analysis → Visualization → Multi scatter plot" to analyze the correlation between the samples.
- Hierarchical clustering: hierarchical clustering with "Analysis → Clustering/PCA → Hierarchical clustering". The default color range can be changed to blue-white-red or another gradient better suited to intensity data. To zoom in, hold the left mouse button and drag in the row clustering area, or use the "zoom in" icon.
- PCA plot: "Analysis → Clustering/PCA → PCA".
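The same quality checks can be scripted, continuing from the imputed `filtered` matrix above (scikit-learn is used for the PCA; all names are carried over from the earlier sketches):

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA

# Multi-scatter: pairwise sample-vs-sample scatter plots.
pd.plotting.scatter_matrix(filtered, figsize=(8, 8), s=2)

# Sample-sample Pearson correlations (useful alongside hierarchical clustering).
print(filtered.corr().round(2))

# PCA on the samples (each sample becomes one point).
pca = PCA(n_components=2)
coords = pca.fit_transform(filtered.T)
plt.figure()
plt.scatter(coords[:, 0], coords[:, 1])
for name, (x, y) in zip(filtered.columns, coords):
    plt.annotate(name, (x, y))
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.0%})")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.0%})")
plt.show()
```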
7. Two Sample t-test for Two Groups and Visualization
- Find interactors with "Processing -> Tests -> Two-samples tests".
- Different combinations of s0 and FDR can be tried, e.g. s0 = 1.5 and FDR = 0.05. The results appear in a new matrix with additional columns: "t-test Significant", "-Log t-test p-value", "t-test Difference", and the test statistic.
- Scatter plot: interactors can be viewed using a scatter plot "Analysis -> Visualization -> Scatter plot" by plotting "t-test Difference" vs. "-Log t-test p-value".
- Volcano plot: the volcano plot combines the two-sample t-test and the scatter plot in a single function, with the additional option to easily optimize s0 and FDR, using "Analysis -> Misc. -> Volcano plot".
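For reference, a simplified scripting version of the two-sample test and volcano plot, continuing from `filtered` and the `groups` dictionary above. Perseus uses an s0-modified t-statistic with permutation-based FDR; the plain Welch t-test with Benjamini-Hochberg correction below is only an approximation of that procedure:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.multitest import multipletests

pd_cols, ctrl_cols = groups["pulldown"], groups["control"]

# Difference of group means (log2 fold change) and per-protein Welch t-test.
diff = filtered[pd_cols].mean(axis=1) - filtered[ctrl_cols].mean(axis=1)
t, p = stats.ttest_ind(filtered[pd_cols], filtered[ctrl_cols], axis=1, equal_var=False)
significant = multipletests(p, alpha=0.05, method="fdr_bh")[0]

# Volcano plot: difference vs. -log10 p-value, significant hits highlighted.
plt.scatter(diff, -np.log10(p), c=np.where(significant, "red", "grey"), s=5)
plt.xlabel("t-test Difference (log2, pulldown - control)")
plt.ylabel("-Log t-test p-value")
plt.show()
```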
8. ANOVA for More Than Two Groups
- Go to "Processing -> Tests -> Multiple-sample tests".
- Keep the default value of 0 for the S0 parameter to use the standard test statistic.
- Select permutation-based FDR.
- Inspect the output table. It contains three new columns: ANOVA significant, -Log ANOVA p-value, and ANOVA q-value.
- Go to "Processing -> Filter rows -> Filter rows based on categorical column". Set the Column parameter to ANOVA Significant and the Mode parameter to Keep matching rows to retain all differentially expressed proteins.
- Go to "Processing -> Tests -> Post-hoc tests". Set the Grouping parameter to the same grouping used for the ANOVA test and set the FDR. Tukey's honestly significant difference (HSD) is computed for all proteins and all pairwise comparisons, and the significant hits within the corresponding pairs are marked.
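A rough scripting analogue of the multiple-sample test and the post-hoc step, continuing from `filtered`; the three groups and their column names are placeholders, and Benjamini-Hochberg correction stands in for Perseus' permutation-based FDR:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Placeholder groups -- replace with your actual column names.
anova_groups = {
    "A": ["LFQ intensity A_1", "LFQ intensity A_2", "LFQ intensity A_3"],
    "B": ["LFQ intensity B_1", "LFQ intensity B_2", "LFQ intensity B_3"],
    "C": ["LFQ intensity C_1", "LFQ intensity C_2", "LFQ intensity C_3"],
}
arrays = [filtered[cols].to_numpy() for cols in anova_groups.values()]

# One-way ANOVA per protein, then Benjamini-Hochberg adjusted q-values.
f, p = stats.f_oneway(*arrays, axis=1)
reject, q = multipletests(p, alpha=0.05, method="fdr_bh")[:2]

# Tukey's HSD post-hoc test for one ANOVA-significant protein, as an illustration
# (in practice this would be looped over all significant rows).
idx = int(np.argmax(reject))
values = np.concatenate([a[idx] for a in arrays])
labels = np.repeat(list(anova_groups), [a.shape[1] for a in arrays])
print(pairwise_tukeyhsd(values, labels))
```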
9. Clustering and heatmap visualization based on the ANOVA Significant Proteins
- Select the matrix containing the ANOVA-significant proteins. Z-score the data by row (using the row mean).
- Go to "Analysis -> Clustering/PCA -> Hierarchical clustering". Keep the default parameters.
- Inspect the resulting heatmap and the relationship between the groups and proteins. Change the color gradient as needed.
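A compact scripting sketch of the row-wise z-scoring and clustered heatmap, continuing from `filtered` and the `reject` mask in the ANOVA sketch above (seaborn's `clustermap` defaults differ somewhat from Perseus' hierarchical clustering defaults):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Keep ANOVA-significant proteins and z-score each row using its mean and sd.
sig = filtered.loc[reject]
z = sig.sub(sig.mean(axis=1), axis=0).div(sig.std(axis=1), axis=0)

# Clustered heatmap of the z-scored matrix (rows = proteins, columns = samples).
sns.clustermap(z, cmap="RdBu_r", center=0, figsize=(6, 8))
plt.show()
```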
10. Notes
- Depending on the nature of missing values, different filtering strategies may be employed and are supported in Perseus. For example, if large differences between groups are expected, with proteins having very low expression levels in one of the groups, filtering based on a minimum number of valid values in at least one group is a more suitable approach than filtering for a minimum number of valid values in the complete matrix (Tyanova and Cox, 2018).
- Very stringent filtering is usually not recommended, as a large amount of the data will be lost. Instead, milder filtering combined with imputation may be more appropriate (Tyanova and Cox, 2018).
- Prior to any statistical analysis, data cleansing is usually performed, which includes normalization, to ensure that different samples are comparable, and missing-value handling, to enable methods that require all data points to be present. Data normalization is not always necessary; different types of normalization can be applied to correct for systematic shifts or skewness and to make samples comparable (Tyanova and Cox, 2018).
- Permutation-based FDR, the method with the largest power, is recommended, and at least 250 repetitions are suggested. In the case of technical replicates, these have to be specified as a separate grouping and selected with the "Preserve grouping in randomizations" option; failure to specify technical replicates will result in a wrong FDR calculation. The more conservative Benjamini-Hochberg correction can be used when fewer Type I errors are desired at the cost of lower sensitivity (Tyanova and Cox, 2018).
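To illustrate the idea behind the permutation-based FDR, here is a toy sketch on simulated data (not Perseus' exact algorithm, which also incorporates the s0 parameter): group labels are shuffled many times, the test statistic is recomputed each time, and the FDR at a given cutoff is estimated as the average number of permutation hits divided by the number of observed hits.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 6))   # toy matrix: 1000 proteins x 6 samples
data[:50, :3] += 2                  # spike in 50 true group differences
labels = np.array([0, 0, 0, 1, 1, 1])

def abs_tstat(x, lab):
    """Absolute per-protein t-statistic between the two label groups."""
    return np.abs(stats.ttest_ind(x[:, lab == 0], x[:, lab == 1], axis=1).statistic)

cutoff = 4.0
observed_hits = (abs_tstat(data, labels) > cutoff).sum()

# At least 250 randomizations, as recommended above.
perm_hits = [(abs_tstat(data, rng.permutation(labels)) > cutoff).sum() for _ in range(250)]

fdr = np.mean(perm_hits) / max(observed_hits, 1)
print(f"observed hits: {observed_hits}, estimated FDR at |t| > {cutoff}: {fdr:.3f}")
```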