Analyzing Label-free Proteomics Data using Perseus
By Hanrui Zhang on 02/28/2022. This is a cheat sheet we use for label-free proximity labeling data analysis.
References
- Tyanova, S., & Cox, J. (2018). Perseus: A Bioinformatics Platform for Integrative Analysis of Proteomics Data in Cancer Research. Methods in Molecular Biology, 1711, 133-148.
1. Software Download and Installation
Download from the official website.
2. Data File
The data file should be in tab-delimited text (.txt) format; an Excel spreadsheet can be saved as tab-delimited text.
3. Loading Data
- Click the "Generic matrix upload" button.
- In the pop-up window, navigate to the file to be loaded.
- Select all the "LFQ" columns and transfer them to the Main columns window.
4. Filtering Low-quality data, Log Transformation, and Summary Statistics
- In the workflow panel, change the name of the data matrix from matrix 1 to InitialData.
- Go to "Processing -> Filter rows based on categorical column" to exclude proteins identified by site, matching to the reverse database or contaminants.
- Transform the data to a logarithmic scale by going to "Processing -> Basic -> Transform" and specifying the transformation function (e.g., log2(x)).
- Use "Analysis -> Visualization -> Histogram" to confirm if the transformed matrix more or less follows the normal distribution.
- In the "Processing" section, select the "Basic" menu and click on the "Summary statistics (columns)" button. Select all expression columns. Click ok and explore the new matrix.
5. Grouping Samples and Filtering Valid Values According to Groups, and Imputation of Missing Values from Normal Distribution
- Before filtering the data for valid values, it makes sense to group the samples according to replicates using "Processing -> Annot. rows -> Categorical annotation rows".
- The name of the grouping can be left at the default or changed to "Replicate". The remaining column names can then be shortened, or each group given a distinct name; "Rearrange" can also be used to change column names.
- Identifications with just one reported intensity are usually not very useful for further analyses. This is why we filter for valid values with "Processing -> Filter rows -> Filter rows based on valid values".
- Assuming 3 replicates per condition, we want "3" valid values "in at least one group": interaction partners without any background affinity will not appear in the control but should still show three valid values in the actual pulldown.
- Imputation: impute missing values from a normal distribution with "Processing -> Imputation -> Replace missing values from normal distribution". Perseus narrows the distribution to a factor of "0.3" (width) of the observed standard deviation, shifts it down by "1.8" (down shift) standard deviations, and draws random values from this distribution to fill in the missing values. Use the whole matrix as the mode. A scripting sketch of this filtering and imputation step follows the list below.
- Using histograms as before (Analysis → Visualization → Histogram), inspect the distributions of the imputed values by generating histograms for all LFQ intensity columns.
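A minimal numpy/pandas sketch of the valid-value filtering and down-shifted normal imputation, continuing from `log_lfq` above; the replicate column names in the `groups` dictionary are placeholders, and Perseus' exact implementation may differ slightly:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Placeholder replicate names -- replace with your actual LFQ column names.
groups = {
    "pulldown": ["LFQ intensity pd_1", "LFQ intensity pd_2", "LFQ intensity pd_3"],
    "control": ["LFQ intensity ctrl_1", "LFQ intensity ctrl_2", "LFQ intensity ctrl_3"],
}

# Keep proteins with at least 3 valid values in at least one group.
valid = pd.DataFrame({g: log_lfq[cols].notna().sum(axis=1) >= 3 for g, cols in groups.items()})
filtered = log_lfq[valid.any(axis=1)].copy()

# Impute missing values from a normal distribution narrowed to 0.3x the observed
# standard deviation and shifted down by 1.8 standard deviations (whole-matrix mode).
arr = filtered.to_numpy(copy=True)
missing = np.isnan(arr)
mu, sd = np.nanmean(arr), np.nanstd(arr)
arr[missing] = rng.normal(mu - 1.8 * sd, 0.3 * sd, size=missing.sum())
filtered = pd.DataFrame(arr, index=filtered.index, columns=filtered.columns)
```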
6. Exploratory Analysis for Quality Check
- Multi-scatter plot: use multi-scatter plots with "Analysis → Visualization → Multi scatter plot" to analyze the correlation between the samples.
- Hierarchical clustering: hierarchical clustering with "Analysis → Clustering/PCA → Hierarchical clustering". The default color range can be changed to blue-white-red or another gradient better suited to intensity data. To zoom in, hold the left mouse button and drag in the row clustering area, or use the "zoom in" icon.
- PCA plot: "Analysis → Clustering/PCA → PCA".
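The same quality checks can be scripted, continuing from the imputed `filtered` matrix above (scikit-learn is used for the PCA; all names are carried over from the earlier sketches):

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA

# Multi-scatter: pairwise sample-vs-sample scatter plots.
pd.plotting.scatter_matrix(filtered, figsize=(8, 8), s=2)

# Sample-sample Pearson correlations (useful alongside hierarchical clustering).
print(filtered.corr().round(2))

# PCA on the samples (each sample becomes one point).
pca = PCA(n_components=2)
coords = pca.fit_transform(filtered.T)
plt.figure()
plt.scatter(coords[:, 0], coords[:, 1])
for name, (x, y) in zip(filtered.columns, coords):
    plt.annotate(name, (x, y))
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.0%})")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.0%})")
plt.show()
```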
7. Two Sample t-test for Two Groups and Visualization
- Find interactors with "Processing -> Tests -> Two-samples tests".
- Different combinations of s0 and FDR can be tried, e.g. s0 = 1.5 and FDR = 0.05. The results appear in a new matrix with additional columns: "t-test Significant", "-Log t-test p-value", "t-test Difference", and the test statistic.
- Scatter plot: interactors can be viewed using a scatter plot "Analysis -> Visualization -> Scatter plot" by plotting "t-test Difference" vs. "-Log t-test p-value".
- Volcano plot: the volcano plot combines the two-sample t-test and the scatter plot in a single function, with the additional option to easily optimize s0 and FDR, using "Analysis -> Misc. -> Volcano plot".
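For reference, a simplified scripting version of the two-sample test and volcano plot, continuing from `filtered` and the `groups` dictionary above. Perseus uses an s0-modified t-statistic with permutation-based FDR; the plain Welch t-test with Benjamini-Hochberg correction below is only an approximation of that procedure:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.multitest import multipletests

pd_cols, ctrl_cols = groups["pulldown"], groups["control"]

# Difference of group means (log2 fold change) and per-protein Welch t-test.
diff = filtered[pd_cols].mean(axis=1) - filtered[ctrl_cols].mean(axis=1)
t, p = stats.ttest_ind(filtered[pd_cols], filtered[ctrl_cols], axis=1, equal_var=False)
significant = multipletests(p, alpha=0.05, method="fdr_bh")[0]

# Volcano plot: difference vs. -log10 p-value, significant hits highlighted.
plt.scatter(diff, -np.log10(p), c=np.where(significant, "red", "grey"), s=5)
plt.xlabel("t-test Difference (log2, pulldown - control)")
plt.ylabel("-Log t-test p-value")
plt.show()
```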
8. ANOVA for More Than Two Groups
- Go to "Processing -> Tests -> Multiple-sample tests".
- Keep the default value of 0 for the S0 parameter to use the standard test statistic.
- Select permutation-based FDR.
- Inspect the output table. It contains three new columns: ANOVA significant, -Log ANOVA p-value, and ANOVA q-value.
- Go to "Processing -> Filter rows -> Filter rows based on categorical column". Set the Column parameter to ANOVA Significant and the Mode parameter to Keep matching rows to retain all differentially expressed proteins.
- Go to "Processing -> Tests -> Post-hoc tests". Set the Grouping parameter to the same grouping used for the ANOVA test and set the FDR. Tukey's honestly significant difference (HSD) is computed for all proteins and all pairwise comparisons, and the significant hits within the corresponding pairs are marked.
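A rough scripting analogue of the multiple-sample test and the post-hoc step, continuing from `filtered`; the three groups and their column names are placeholders, and Benjamini-Hochberg correction stands in for Perseus' permutation-based FDR:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Placeholder groups -- replace with your actual column names.
anova_groups = {
    "A": ["LFQ intensity A_1", "LFQ intensity A_2", "LFQ intensity A_3"],
    "B": ["LFQ intensity B_1", "LFQ intensity B_2", "LFQ intensity B_3"],
    "C": ["LFQ intensity C_1", "LFQ intensity C_2", "LFQ intensity C_3"],
}
arrays = [filtered[cols].to_numpy() for cols in anova_groups.values()]

# One-way ANOVA per protein, then Benjamini-Hochberg adjusted q-values.
f, p = stats.f_oneway(*arrays, axis=1)
reject, q = multipletests(p, alpha=0.05, method="fdr_bh")[:2]

# Tukey's HSD post-hoc test for one ANOVA-significant protein, as an illustration
# (in practice this would be looped over all significant rows).
idx = int(np.argmax(reject))
values = np.concatenate([a[idx] for a in arrays])
labels = np.repeat(list(anova_groups), [a.shape[1] for a in arrays])
print(pairwise_tukeyhsd(values, labels))
```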
9. Clustering and heatmap visualization based on the ANOVA Significant Proteins
- Select the matrix containing the ANOVA-significant proteins. Z-score the data by row (using the row mean).
- Go to "Analysis -> Clustering/PCA -> Hierarchical clustering". Keep the default parameters.
- Inspect the resulting heatmap and the relationship between the groups and proteins. Change the color gradient as needed.
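A compact scripting sketch of the row-wise z-scoring and clustered heatmap, continuing from `filtered` and the `reject` mask in the ANOVA sketch above (seaborn's `clustermap` defaults differ somewhat from Perseus' hierarchical clustering defaults):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Keep ANOVA-significant proteins and z-score each row using its mean and sd.
sig = filtered.loc[reject]
z = sig.sub(sig.mean(axis=1), axis=0).div(sig.std(axis=1), axis=0)

# Clustered heatmap of the z-scored matrix (rows = proteins, columns = samples).
sns.clustermap(z, cmap="RdBu_r", center=0, figsize=(6, 8))
plt.show()
```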
10. Notes
- Depending on the nature of missing values, different filtering strategies may be employed and are supported in Perseus. For example, if large differences between groups are expected, with proteins having very low expression levels in one of the groups, filtering based on a minimum number of valid values in at least one group is a more suitable approach than filtering for a minimum number of valid values in the complete matrix (Tyanova and Cox, 2018).
- Very stringent filtering is usually not recommended, as a large amount of the data will be lost. Instead, milder filtering combined with imputation may be more appropriate (Tyanova and Cox, 2018).
- Prior to any statistical analysis, data cleansing is usually performed, which includes normalization, to ensure that different samples are comparable, and missing-value handling, to enable methods that require all data points to be present. Data normalization is not always necessary; different types of normalization can be applied to correct for systematic shifts or skewness and to make samples comparable (Tyanova and Cox, 2018).
- Permutation-based FDR, the method with the largest power, is recommended, and at least 250 repetitions are suggested. In the case of technical replicates, these have to be specified as a separate grouping and selected with the "Preserve grouping in randomizations" option; failure to specify technical replicates will result in a wrong FDR calculation. The more conservative Benjamini-Hochberg correction can be used when fewer Type I errors are desired at the cost of lower sensitivity (Tyanova and Cox, 2018).
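To illustrate the idea behind the permutation-based FDR, here is a toy sketch on simulated data (not Perseus' exact algorithm, which also incorporates the s0 parameter): group labels are shuffled many times, the test statistic is recomputed each time, and the FDR at a given cutoff is estimated as the average number of permutation hits divided by the number of observed hits.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 6))   # toy matrix: 1000 proteins x 6 samples
data[:50, :3] += 2                  # spike in 50 true group differences
labels = np.array([0, 0, 0, 1, 1, 1])

def abs_tstat(x, lab):
    """Absolute per-protein t-statistic between the two label groups."""
    return np.abs(stats.ttest_ind(x[:, lab == 0], x[:, lab == 1], axis=1).statistic)

cutoff = 4.0
observed_hits = (abs_tstat(data, labels) > cutoff).sum()

# At least 250 randomizations, as recommended above.
perm_hits = [(abs_tstat(data, rng.permutation(labels)) > cutoff).sum() for _ in range(250)]

fdr = np.mean(perm_hits) / max(observed_hits, 1)
print(f"observed hits: {observed_hits}, estimated FDR at |t| > {cutoff}: {fdr:.3f}")
```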