Using machine learning (ML) algorithms to classify quantitative PPI results

This vignette gives an introduction to the binaryPPIclassifier package. A support vector machine (SVM) or random forest algorithm, combined with multi-adaptive sampling, is trained on a data set containing positive and random/negative reference protein-protein interaction (PPI) results. The generated models are then applied to a set of PPIs with unknown classification to predict the likelihood that each PPI is positive or negative.

Installation

You can install the development version of binaryPPIclassifier from GitHub with:

# install.packages("devtools")
devtools::install_github("philipptrepte/binary-PPI-classifier")

library(binaryPPIclassifier)

Requirements

The input data should be pre-processed and, if necessary, tidied into the long format with the columns described below.

Example dataset: LuTHy results from PRS-v2 and RRS-v2

We used the LuTHy assay (Trepte et al., Mol Syst Biol, 2018) to detect protein-protein interactions between 60 PRS-v2 and 78 RRS-v2 protein pairs (Choi et al., Nat Commun, 2019), which were each tested in eight tagging configurations with the LuTHy-BRET and LuTHy-LuC assay versions (Trepte & Secker et al., bioRxiv, 2023).

The dataset contains in-cell LuTHy-BRET and cell-free LuTHy-LuC measurements. For LuTHy-BRET, we measure the acceptor protein expression (mean_mCit), which directly influences the BRET ratio (mean_cBRET). For LuTHy-LuC, we measure the raw luminescence of the donor protein after precipitation (mean_LumiOUT), which is normalized to the donor expression to give the mean_cLuC.

For more details on the LuTHy assay and its measurement parameters, please read:

  1. Trepte, P. et al. LuTHy: a double-readout bioluminescence-based two-hybrid technology for quantitative mapping of protein–protein interactions in mammalian cells. Mol Syst Biol 14, e8071 (2018).

  2. Trepte, P. & Secker, C. et al. AI-guided pipeline for protein-protein interaction drug discovery identifies a SARS-CoV-2 inhibitor. bioRxiv 2023.06.14.544560 (2023) doi:10.1101/2023.06.14.544560.

Loading of the data

data("luthy_reference_sets")
Input table showing results for the interaction BAD + BCL2L1 from luthy_reference_sets.
Donor Donor_tag Donor_protein Acceptor Acceptor_tag Acceptor_protein complex interaction sample orientation data score
5575 N1 BAD 5168 N2 BCL2L1 PRS BAD + BCL2L1 BAD+BCL2L1 N1-N2 mean_LumiOUT 2.984230e+05
5575 N1 BAD 5431 C2 BCL2L1 PRS BAD + BCL2L1 BAD+BCL2L1 N1-C2 mean_LumiOUT 6.519733e+04
5838 C1 BAD 5168 N2 BCL2L1 PRS BAD + BCL2L1 BAD+BCL2L1 C1-N2 mean_LumiOUT 4.999130e+05
5838 C1 BAD 5431 C2 BCL2L1 PRS BAD + BCL2L1 BAD+BCL2L1 C1-C2 mean_LumiOUT 7.323867e+04
5694 N2 BCL2L1 5049 N1 BAD PRS BAD + BCL2L1 BCL2L1+BAD N2-N1 mean_LumiOUT 4.725640e+05
5957 C2 BCL2L1 5049 N1 BAD PRS BAD + BCL2L1 BCL2L1+BAD C2-N1 mean_LumiOUT 3.265000e+04
5694 N2 BCL2L1 5312 C1 BAD PRS BAD + BCL2L1 BCL2L1+BAD N2-C1 mean_LumiOUT 2.395853e+06
5957 C2 BCL2L1 5312 C1 BAD PRS BAD + BCL2L1 BCL2L1+BAD C2-C1 mean_LumiOUT 9.105850e+04
5575 N1 BAD 5168 N2 BCL2L1 PRS BAD + BCL2L1 BAD+BCL2L1 N1-N2 mean_mCit 5.011334e+03
5575 N1 BAD 5431 C2 BCL2L1 PRS BAD + BCL2L1 BAD+BCL2L1 N1-C2 mean_mCit 2.547251e+03
5838 C1 BAD 5168 N2 BCL2L1 PRS BAD + BCL2L1 BAD+BCL2L1 C1-N2 mean_mCit 5.091001e+03
5838 C1 BAD 5431 C2 BCL2L1 PRS BAD + BCL2L1 BAD+BCL2L1 C1-C2 mean_mCit 2.249626e+03
5694 N2 BCL2L1 5049 N1 BAD PRS BAD + BCL2L1 BCL2L1+BAD N2-N1 mean_mCit 3.293543e+03
5957 C2 BCL2L1 5049 N1 BAD PRS BAD + BCL2L1 BCL2L1+BAD C2-N1 mean_mCit 1.744793e+03
5694 N2 BCL2L1 5312 C1 BAD PRS BAD + BCL2L1 BCL2L1+BAD N2-C1 mean_mCit 1.157221e+04
5957 C2 BCL2L1 5312 C1 BAD PRS BAD + BCL2L1 BCL2L1+BAD C2-C1 mean_mCit 1.801200e+04
5575 N1 BAD 5168 N2 BCL2L1 PRS BAD + BCL2L1 BAD+BCL2L1 N1-N2 mean_cBRET 2.765355e-01
5575 N1 BAD 5431 C2 BCL2L1 PRS BAD + BCL2L1 BAD+BCL2L1 N1-C2 mean_cBRET 8.709540e-02
5838 C1 BAD 5168 N2 BCL2L1 PRS BAD + BCL2L1 BAD+BCL2L1 C1-N2 mean_cBRET 5.438252e-01
5838 C1 BAD 5431 C2 BCL2L1 PRS BAD + BCL2L1 BAD+BCL2L1 C1-C2 mean_cBRET 1.808837e-01
5694 N2 BCL2L1 5049 N1 BAD PRS BAD + BCL2L1 BCL2L1+BAD N2-N1 mean_cBRET 2.366357e-01
5957 C2 BCL2L1 5049 N1 BAD PRS BAD + BCL2L1 BCL2L1+BAD C2-N1 mean_cBRET 3.644020e-02
5694 N2 BCL2L1 5312 C1 BAD PRS BAD + BCL2L1 BCL2L1+BAD N2-C1 mean_cBRET 6.381235e-01
5957 C2 BCL2L1 5312 C1 BAD PRS BAD + BCL2L1 BCL2L1+BAD C2-C1 mean_cBRET 3.185715e-01
5575 N1 BAD 5168 N2 BCL2L1 PRS BAD + BCL2L1 BAD+BCL2L1 N1-N2 mean_cLuC 7.362038e-01
5575 N1 BAD 5431 C2 BCL2L1 PRS BAD + BCL2L1 BAD+BCL2L1 N1-C2 mean_cLuC 6.521600e-01
5838 C1 BAD 5168 N2 BCL2L1 PRS BAD + BCL2L1 BAD+BCL2L1 C1-N2 mean_cLuC 1.001871e+00
5838 C1 BAD 5431 C2 BCL2L1 PRS BAD + BCL2L1 BAD+BCL2L1 C1-C2 mean_cLuC 7.442408e-01
5694 N2 BCL2L1 5049 N1 BAD PRS BAD + BCL2L1 BCL2L1+BAD N2-N1 mean_cLuC 1.196843e+00
5957 C2 BCL2L1 5049 N1 BAD PRS BAD + BCL2L1 BCL2L1+BAD C2-N1 mean_cLuC 5.047924e-01
5694 N2 BCL2L1 5312 C1 BAD PRS BAD + BCL2L1 BCL2L1+BAD N2-C1 mean_cLuC 2.608016e+00
5957 C2 BCL2L1 5312 C1 BAD PRS BAD + BCL2L1 BCL2L1+BAD C2-C1 mean_cLuC 1.114725e-01
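
Before training, it can be useful to confirm which measurement types and reference sets are present in the loaded data. The following is a minimal sketch using dplyr; the column names complex and data follow the table above.

library(dplyr)

# Count the measurements per reference set (PRS/RRS) and data type
luthy_reference_sets %>%
  count(complex, data)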

Required table columns. Additional metadata columns are optional, e.g. UniProt IDs.
Column name Description
Donor unique id to identify the donor construct
Donor_tag indicates if the donor protein is tagged N-terminally (“N”) or C-terminally (“C”)
Donor_protein a protein identifier, e.g. BCL2L1
Acceptor unique id to identify the acceptor construct
Acceptor_tag indicates if the acceptor protein is tagged N-terminally (“N”) or C-terminally (“C”)
Acceptor_protein a protein identifier, e.g. BAD
complex should contain information if it is a reference interaction (e.g. “PRS”/“RRS”) or if it is part of any other known complex (e.g. “LAMTOR”)
interaction should indicate the tested interaction independent of the tagging configuration, e.g. “BAD + BCL2L1”
sample should indicate the tested interaction dependent on the tagging configuration, e.g. “BCL2L1 + BAD”
orientation indicates the exact tagging orientation of the first indicated donor protein (e.g. N1) and the second indicated acceptor protein (e.g. N2): N1-N2 for BAD + BCL2L1 and N2-N1 for BCL2L1 + BAD
data indicates the type of data measured for an indicated interaction, e.g. “mean_cBRET” or “mean_cLuC”
score The corresponding data value
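
For orientation, a single input row with the required columns could be assembled as shown in the following sketch; the values are copied from the BAD + BCL2L1 example above and are purely illustrative.

library(tibble)

# One illustrative measurement row in the required long format;
# replace with your own pre-processed data
my_ppi_data <- tibble(
  Donor = 5575, Donor_tag = "N1", Donor_protein = "BAD",
  Acceptor = 5168, Acceptor_tag = "N2", Acceptor_protein = "BCL2L1",
  complex = "PRS",
  interaction = "BAD + BCL2L1",
  sample = "BAD+BCL2L1",
  orientation = "N1-N2",
  data = "mean_cBRET",
  score = 0.2765355
)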

Example interaction: BAD + BCL2L1

Tagging configurations and formatting of the interaction and sample columns.
Tagging configuration (orientation column) Example (NL = NanoLuc; mCit = mCitrine) interaction sample
C1-C2 BAD-NL + BCL2L1-mCit BAD + BCL2L1 BAD + BCL2L1
C1-N2 BAD-NL + mCit-BCL2L1 BAD + BCL2L1 BAD + BCL2L1
N1-C2 NL-BAD + BCL2L1-mCit BAD + BCL2L1 BAD + BCL2L1
N1-N2 NL-BAD + mCit-BCL2L1 BAD + BCL2L1 BAD + BCL2L1
C2-C1 BCL2L1-NL + BAD-mCit BAD + BCL2L1 BCL2L1 + BAD
C2-N1 BCL2L1-NL + mCit-BAD BAD + BCL2L1 BCL2L1 + BAD
N2-C1 NL-BCL2L1 + BAD-mCit BAD + BCL2L1 BCL2L1 + BAD
N2-N1 NL-BCL2L1 + mCit-BAD BAD + BCL2L1 BCL2L1 + BAD

LuTHy-BRET

LuTHy-BRET results for one positive (BAD + BCL2L1) and one random (ARSA + DBN1) PPI

Usage

The ML algorithms for the LuTHy-BRET and LuTHy-LuC are trained on the mean_cBRET and mean_mCit or the mean_cLuC and mean_LumiOUT features, respectively. Assays that provide only one quantitative measurement, such as the mN2H, can be trained on that single feature (a sketch follows the table below). Training on more than two features is also possible but has not been tested.

Assay Training features (data column)
LuTHy-BRET mean_cBRET, mean_mCit
LuTHy-LuC mean_cLuC, mean_LumiOUT
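
Using the ppi.prediction() function described in the next section, such a single-feature call could look like the following sketch. The objects mn2h_data and mn2h_reference and the feature label "mean_NLR" are hypothetical placeholders; the remaining arguments mirror the LuTHy-BRET example further below.

# Sketch only: mn2h_data, mn2h_reference and "mean_NLR" are hypothetical
mn2h_prediction <- ppi.prediction(PPIdf = mn2h_data,
               referenceSet = mn2h_reference,
               ensembleSize = 50,
               sampling = "weighted",
               weightBy = "mean_NLR",
               cutoff = "all",
               method.scaling = "none",
               negative.reference = "RRS",
               assay = "mean_NLR",    # a single training feature
               model.type = "svm",
               kernelType = "linear",
               C = 100,
               iter = 5)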

ppi.prediction()

The function ppi.prediction is used to predict the classification probability of a PPIdf with unknown classification labels by training an SVM or random forest machine learning algorithm on a set of reference interactions that carry classification labels. The parameters of the function are described in the documentation (?ppi.prediction).

In the following example, we use the luthy_reference_sets example data to compile distinct training sets by sampling 50 times (ensembleSize = 50) from the entire referenceSet dataset. Weighted sampling is performed (sampling = “weighted”) using the mean_cBRET data (weightBy = “mean_cBRET”) from all interactions (cutoff = “all”). The data are not scaled further (method.scaling = “none”), and we specify that in the complex column the negative/random interactions are indicated by “RRS” (negative.reference = “RRS”). We further specify that the mean_cBRET and mean_mCit features (assay = c(“mean_cBRET”, “mean_mCit”)) are used as training features for the LuTHy-BRET to train the SVM algorithms (model.type = “svm”). For the SVM algorithm, a linear kernel is used (kernelType = “linear”) with a cost of constraints violation of 100 (C = 100). During training, the class labels of the reference set are reclassified in 5 iterations (iter = 5).

example_ppi_prediction <- ppi.prediction(PPIdf = luthy_reference_sets, 
               referenceSet = luthy_reference_sets,
               ensembleSize = 50,
               sampling = "weighted",
               weightBy = "mean_cBRET",
               cutoff = "all",
               method.scaling = "none",
               negative.reference = "RRS",
               assay = c("mean_cBRET", "mean_mCit"),
               model.type = "svm",
               kernelType = "linear",
               C = 100,
               iter = 5)
data("example_ppi_prediction")

The resulting list contains the following objects:

List object Description
predTrainDf Data frame of the training set containing the predicted classifier probabilities in the column predTrainMat
predDf Data frame of the PPIdf test set containing the predicted classifier probabilities in the column predMat
predTrain.model.e List for each ensembleSize as defined in ppi.prediction() containing the trained ML models used to predict the classification of the reference training set
training.sets List for each ensembleSize as defined in ppi.prediction() containing the training sets used to train each ML model
negative.reference as specified in ppi.prediction()
model.type as specified in ppi.prediction()
model.e List for each ensembleSize as defined in ppi.prediction() containing the trained ML models used to predict the classification of the PPIdf test set
testMat Matrix of the PPIdf test set containing the features used for training
trainMat Matrix of the reference training set containing the features used for training
label Classifier labels of the reference training set interactions
cutoff as specified in ppi.prediction()
inclusion as specified in ppi.prediction()
ensembleSize as specified in ppi.prediction()
sampling as specified in ppi.prediction()
kernelType as specified in ppi.prediction()
iter as specified in ppi.prediction()
C as specified in ppi.prediction()
gamma as specified in ppi.prediction()
coef0 as specified in ppi.prediction()
degree as specified in ppi.prediction()
top as specified in ppi.prediction()
seed as specified in ppi.prediction()
system.time System time when the function was run
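
The prediction data frames can be inspected directly. A minimal sketch, assuming predDf retains the input columns (e.g. interaction and orientation) alongside the predicted probability column predMat:

library(dplyr)

# Interactions predicted to be positive with at least 95% probability
example_ppi_prediction$predDf %>%
  filter(predMat >= 0.95) %>%
  arrange(desc(predMat)) %>%
  select(interaction, orientation, predMat) %>%
  head()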

Plotting the results from the ML predictions

The package provides helper functions to plot the results from the ML predictions. probGrid.plot will generate a probability grid, probDis.plot will show the probability distribution against the first feature, and a receiver operating characteristic (ROC) curve can be plotted using the roc.plot function. Finally, the recovery rates at 50%, 75% and 95% probability can be plotted using the function recovery.plot.




cowplot::plot_grid(
    probGrid.plot(example_ppi_prediction, ylim = c(-0.1, 0.8)),
    probDis.plot(example_ppi_prediction),
    roc.plot(example_ppi_prediction),
    recovery.plot(example_ppi_prediction),
  ncol = 2, align = "hv", axis = "tlrb", labels = "AUTO"
)
Training results as (A) probability grid; (B) probability distribution; (C) ROC curve plotting the training features and the probability; (D) recovery rates at 50%, 75% and 95% probability

Confusion Matrix

To evaluate the performance of the classification model, we can calculate the confusion matrix using the function confusion.matrix, which is based on the caret::confusionMatrix function.

example_confusion_matrix <- confusion.matrix(example_ppi_prediction)

A tabulated confusion matrix can be accessed via example_confusion_matrix$confusionMatrixDf.

Tabulated confusion matrix. Rows give the predicted class at the 50%, 75% and 95% probability thresholds; columns give the reference class.
            PRS   RRS
PRS: 50%     42    14
RRS: 50%     18    64
PRS: 75%     30     2
RRS: 75%     30    76
PRS: 95%     23     1
RRS: 95%     37    77
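
At the 95% probability threshold, for example, 23 of the 60 PRS-v2 interactions are recovered as positives (sensitivity 23/60 ≈ 0.38), while 77 of the 78 RRS-v2 pairs are correctly classified as negatives (specificity 77/78 ≈ 0.99), matching the caret statistics shown below.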

Detailed information and statistics can be accessed via example_confusion_matrix$confusionMatrixList for the 50%, 75% and 95% probability thresholds; the output below corresponds to the 95% threshold.
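
For example, the detailed summary for one threshold might be printed as in this sketch; whether the list is indexed by position or by name is an assumption here.

# Detailed caret summary, assuming one list entry per probability threshold
# (here the third entry, i.e. the 95% threshold)
example_confusion_matrix$confusionMatrixList[[3]]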

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction PRS RRS
#>        PRS  23   1
#>        RRS  37  77
#>                                           
#>                Accuracy : 0.7246          
#>                  95% CI : (0.6422, 0.7972)
#>     No Information Rate : 0.5652          
#>     P-Value [Acc > NIR] : 8.065e-05       
#>                                           
#>                   Kappa : 0.3981          
#>                                           
#>  Mcnemar's Test P-Value : 1.365e-08       
#>                                           
#>             Sensitivity : 0.3833          
#>             Specificity : 0.9872          
#>          Pos Pred Value : 0.9583          
#>          Neg Pred Value : 0.6754          
#>              Prevalence : 0.4348          
#>          Detection Rate : 0.1667          
#>    Detection Prevalence : 0.1739          
#>       Balanced Accuracy : 0.6853          
#>                                           
#>        'Positive' Class : PRS             
#> 

Example interaction: BAD + BCL2L1

(A) Predicted probabilities to be true-positive interactions and (B) mean cBRET values for the PRS-v2 interaction BAD + BCL2L1 and the RRS-v2 interaction ARSA + DBN1

learning.curve()

To evaluate the model performance, we can plot learning curves using the function learning.curve(). Performance metrics (accuracy, hinge loss, binary cross-entropy loss) are plotted against the amount of training data. Thereby, it can be determined if the model is training effectively or if more data would improve performance. Learning curves also help to evaluate over- and underfitting of the models.

The function learning.curve only requires a ppi.prediction object as input. Additionally, the relative training sizes can be specified, for example as train_size = base::seq(0.1, 1, by = 0.1), and the number of models used from ppi.prediction$model.e can be specified as models = 10 to use the first 10 models or models = "all" to use all saved models.
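
For instance, learning curves over 10% to 100% of the training data, using only the first 10 ensemble models, could be requested as in the following sketch (argument names as described above); the default call used in this vignette follows below.

example_learning_curve_10 <- learning.curve(
  ppi_prediction_result = example_ppi_prediction,
  train_size = base::seq(0.1, 1, by = 0.1),  # relative training set sizes
  models = 10)                               # use the first 10 saved models
example_learning_curve_10$learning_plot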

example_learning_curve <- learning.curve(ppi_prediction_result = example_ppi_prediction)
example_learning_curve$learning_plot
(A) Accuracy, (B) Hinge Loss, (C) Binary Cross-Entropy Loss

Appendix

Session info

#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur ... 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] binaryPPIclassifier_1.5.5.8 plotly_4.10.0              
#>  [3] varhandle_2.0.5             usethis_2.1.6              
#>  [5] tidyr_1.2.0                 tibble_3.2.1               
#>  [7] stringr_1.4.1               Rmisc_1.5.1                
#>  [9] plyr_1.8.7                  rlang_1.1.1                
#> [11] randomForest_4.7-1.1        purrr_1.0.1                
#> [13] plotROC_2.3.0               ggnewscale_0.4.7           
#> [15] e1071_1.7-11                DescTools_0.99.46          
#> [17] cowplot_1.1.1               caret_6.0-94               
#> [19] lattice_0.20-45             viridis_0.6.2              
#> [21] viridisLite_0.4.1           ggpubr_0.4.0               
#> [23] ggplot2_3.3.6               dplyr_1.1.2                
#> [25] BiocStyle_2.24.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] colorspace_2.0-3     ggsignif_0.6.3       class_7.3-20        
#>   [4] rprojroot_2.0.3      fs_1.5.2             gld_2.6.5           
#>   [7] rstudioapi_0.14      proxy_0.4-27         farver_2.1.1        
#>  [10] listenv_0.8.0        prodlim_2023.03.31   fansi_1.0.3         
#>  [13] mvtnorm_1.1-3        lubridate_1.8.0      codetools_0.2-18    
#>  [16] splines_4.2.1        cachem_1.0.6         rootSolve_1.8.2.3   
#>  [19] knitr_1.40           jsonlite_1.8.0       pROC_1.18.2         
#>  [22] broom_1.0.1          httr_1.4.4           BiocManager_1.30.18 
#>  [25] compiler_4.2.1       backports_1.4.1      assertthat_0.2.1    
#>  [28] lazyeval_0.2.2       Matrix_1.4-1         fastmap_1.1.0       
#>  [31] cli_3.6.1            htmltools_0.5.3      tools_4.2.1         
#>  [34] gtable_0.3.1         glue_1.6.2           lmom_2.9            
#>  [37] reshape2_1.4.4       Rcpp_1.0.9           carData_3.0-5       
#>  [40] cellranger_1.1.0     jquerylib_0.1.4      pkgdown_2.0.6       
#>  [43] vctrs_0.6.3          nlme_3.1-159         iterators_1.0.14    
#>  [46] timeDate_4022.108    gower_1.0.1          xfun_0.32           
#>  [49] globals_0.16.1       lifecycle_1.0.3      rstatix_0.7.0       
#>  [52] future_1.27.0        MASS_7.3-58.1        scales_1.2.1        
#>  [55] ipred_0.9-14         ragg_1.2.2           parallel_4.2.1      
#>  [58] expm_0.999-6         yaml_2.3.5           Exact_3.2           
#>  [61] memoise_2.0.1        gridExtra_2.3        sass_0.4.2          
#>  [64] rpart_4.1.16         stringi_1.7.8        highr_0.9           
#>  [67] desc_1.4.1           foreach_1.5.2        hardhat_1.3.0       
#>  [70] boot_1.3-28          lava_1.7.2.1         pkgconfig_2.0.3     
#>  [73] systemfonts_1.0.4    evaluate_0.16        labeling_0.4.2      
#>  [76] htmlwidgets_1.5.4    recipes_1.0.6        tidyselect_1.2.0    
#>  [79] parallelly_1.32.1    magrittr_2.0.3       bookdown_0.28       
#>  [82] R6_2.5.1             generics_0.1.3       pillar_1.9.0        
#>  [85] withr_2.5.0          survival_3.4-0       abind_1.4-5         
#>  [88] nnet_7.3-17          future.apply_1.9.0   car_3.1-0           
#>  [91] utf8_1.2.2           rmarkdown_2.16       grid_4.2.1          
#>  [94] readxl_1.4.1         data.table_1.14.2    ModelMetrics_1.2.2.2
#>  [97] digest_0.6.29        textshaping_0.3.6    stats4_4.2.1        
#> [100] munsell_0.5.0        bslib_0.4.0