Explainable machine learning approach for cancer prediction through binarilization of RNA sequencing data

Tianjie Chen, Md Faisal Kabir

In recent years, researchers have demonstrated the effectiveness and speed of machine learning-based cancer diagnosis models. However, it is difficult to explain the results generated by machine learning models, especially those that use complex high-dimensional data such as RNA sequencing data. In this study, we propose binarilization as a novel way to process RNA sequencing data and use it to construct explainable cancer prediction models.

We tested our proposed data processing technique on five different models, namely neural network, random forest, XGBoost, support vector machine, and decision tree, using four cancer datasets collected from the National Cancer Institute Genomic Data Commons. Since our datasets are imbalanced, we evaluated the performance of all models using metrics suited to imbalanced data, such as geometric mean, Matthews correlation coefficient, F-measure, and area under the receiver operating characteristic curve.
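The imbalance-aware metrics named above can be illustrated with a minimal sketch. It assumes binary labels with 1 as the positive class and takes the F-measure to be F1; the function names and the toy labels are hypothetical and are not taken from the study.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def imbalance_metrics(y_true, y_pred):
    """Geometric mean, F1 and Matthews correlation coefficient (MCC)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = math.sqrt(sensitivity * specificity)
    pr_sum = precision + sensitivity
    f1 = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # 0 when undefined
    return {"g_mean": g_mean, "f1": f1, "mcc": mcc}


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced case: 8 positive and 2 negative samples, and a model
# that labels every sample as positive.
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 10
print(imbalance_metrics(y_true, y_pred))
```

In this toy case accuracy would be 0.8, yet the geometric mean and MCC are both 0 because no negative sample is recognized. This is the behavior that makes such metrics informative on imbalanced data.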

Machine learning (ML) models have been used in cancer research for almost 40 years. In the past, researchers primarily focused on using clinical and demographic data to predict an individual’s risk of developing cancer [1]. Recent advancements in genomic and computational technology have enabled researchers to study cancer more thoroughly and develop new models for cancer prediction and survival analysis [2–5].

One proven way to study cancer using computational and ML-based methods is to analyze peptide data, specifically data on anti-cancer peptides (ACPs). Because of their low toxicity and greater efficacy, ACPs have recently attracted researchers’ interest as promising therapeutic agents for cancer treatment. However, efficient identification of ACPs is still a challenge. To address this issue, researchers have proposed multiple ML-assisted tools for the prediction of ACPs.

Materials and Methods
In this section, we present our approach for cancer diagnosis using binarilized RNA-seq data. The approach consists of four parts: data binarilization, feature selection, model construction, and model explanation. First, raw RNA-seq data are binarilized. Then, univariate feature selection is applied to select the most relevant features from the binarilized data. The processed data are then randomly split into a training set and a test set. The training set is used for both hyperparameter search and model training, whereas the test set is used to measure the performance of the trained models. The train-test process is repeated 10 times to obtain average results. The models with the highest F1 scores are used to generate SHAP plots.
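The univariate feature selection step can be illustrated with a minimal sketch. The text does not name the scoring function, so the Pearson chi-square statistic on a 2x2 contingency table is used here purely as an assumption; the function names and the toy matrix are hypothetical.

```python
def chi2_binary(feature, labels):
    """Pearson chi-square statistic for a binary feature vs. a binary label.

    Assumption: the study does not specify its scoring function;
    chi-square is one common choice for binary features.
    """
    n = len(labels)
    pairs = list(zip(feature, labels))
    a = sum(1 for f, y in pairs if f == 1 and y == 1)
    b = sum(1 for f, y in pairs if f == 1 and y == 0)
    c = sum(1 for f, y in pairs if f == 0 and y == 1)
    d = sum(1 for f, y in pairs if f == 0 and y == 0)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # constant feature or constant label: no information
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_top_k(X, labels, k):
    """Score every column independently and keep the k best.

    Returns (sorted column indices of the kept features, all scores).
    """
    n_features = len(X[0])
    scores = [chi2_binary([row[j] for row in X], labels)
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


# Toy binarilized matrix: 4 samples x 3 binary features.
X = [[1, 0, 1],
     [1, 1, 0],
     [0, 0, 1],
     [0, 1, 0]]
labels = [1, 1, 0, 0]
selected, scores = select_top_k(X, labels, k=1)
print(selected, scores)  # [0] [4.0, 0.0, 0.0]
```

Each feature is scored on its own, which is what makes the selection univariate. In a train-test protocol like the one described, the scores would normally be computed on the training set only so that the test set does not influence which features are kept.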

In this study, we propose a novel data processing technique for cancer-related RNA-seq data. After binarilization, each gene is split into ten binary features. For each sample, only one of the ten binary features is positive, indicating the range in which the sample’s original value for that gene falls. Because data binarilization increases the number of features, we performed feature selection to filter out irrelevant features. We compared the performance of models using binarilized features with that of models using continuous features.
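The encoding described above can be sketched as follows. The text does not state how the ten value ranges are chosen, so this sketch assumes equal-width ranges between each gene's minimum and maximum; the function name and the toy values are hypothetical.

```python
def binarilize_gene(values, n_bins=10):
    """Turn one gene's expression values into n_bins binary indicators.

    Assumption: equal-width ranges between the observed minimum and
    maximum. Exactly one indicator is set to 1 for each sample.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    rows = []
    for v in values:
        if width == 0:
            idx = 0  # constant gene: every sample falls in the first range
        else:
            # The maximum value is clamped into the last range.
            idx = min(int((v - lo) / width), n_bins - 1)
        row = [0] * n_bins
        row[idx] = 1
        rows.append(row)
    return rows


# Toy expression values for a single gene across five samples.
expression = [0.0, 2.5, 5.0, 9.9, 10.0]
for value, row in zip(expression, binarilize_gene(expression)):
    print(value, row)
```

Applying this to every gene multiplies the number of columns by ten, which is why a feature selection step is needed afterwards. Each retained column then names both a gene and a value range.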

The results show that data binarilization does not affect the predictive performance of models. For model explanation, we used SHAP to rank all features in terms of relevance to prediction. Compared with other explainable models that use continuous data, such as iAFPs-EnC-GA, AIPs-SnTCN, and OncoNetExplainer, models using binarilized data make the results of SHAP analysis easier to understand because each relevant feature is revealed together with its relevant value range.

Early detection of cancer can increase patients’ survival chances [48]. Recent developments in technology have enabled the use of new diagnostic methods such as ML-assisted models. Because ML models excel at processing complex data, researchers can use high-dimensional data such as RNA-seq data to predict cancer and extract relevant biomarkers. However, ML models suffer from problems such as poor interpretability. In this study, we proposed a novel approach that utilizes data binarilization to increase the interpretability of ML models for cancer diagnosis. We showed that models using binarilized data can achieve the same level of performance while relying on fewer features.

Citation: Chen T, Kabir MF (2024) Explainable machine learning approach for cancer prediction through binarilization of RNA sequencing data. PLoS ONE 19(5): e0302947. https://doi.org/10.1371/journal.pone.0302947

Editor: Shahid Akbar, Abdul Wali Khan University Mardan, PAKISTAN

Received: October 15, 2023; Accepted: April 15, 2024; Published: May 10, 2024

Copyright: © 2024 Chen, Kabir. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data underlying the results presented in the study are available from https://www.kaggle.com/datasets/tianjiechen/tcga-rna-datasets/data.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.