Handling Imbalance Classification Virtual Screening Big Data Using Machine Learning Algorithms

Complexity 2021:1-15 (2021)
  Copy   BIBTEX

Abstract

Virtual screening is the most critical process in drug discovery, and it relies on machine learning to facilitate the screening process. It enables the discovery of molecules that bind to a specific protein to form a drug. Despite its benefits, virtual screening generates enormous data and suffers from drawbacks such as high dimensions and imbalance. This paper tackles data imbalance and aims to improve virtual screening accuracy, especially for a minority dataset. For a dataset identified without considering the data’s imbalanced nature, most classification methods tend to have high predictive accuracy for the majority category. However, the accuracy was significantly poor for the minority category. The paper proposes a K-mean algorithm coupled with Synthetic Minority Oversampling Technique to overcome the problem of imbalanced datasets. The proposed algorithm is named as KSMOTE. Using KSMOTE, minority data can be identified at high accuracy and can be detected at high precision. A large set of experiments were implemented on Apache Spark using numeric PaDEL and fingerprint descriptors. The proposed solution was compared to both no-sampling method and SMOTE on the same datasets. Experimental results showed that the proposed solution outperformed other methods.

Links

PhilArchive



    Upload a copy of this work     Papers currently archived: 91,881

External links

Setup an account with your affiliations in order to access resources via your University's proxy server

Through your library

Similar books and articles

Machine Learning and Job Posting Classification: A Comparative Study.Ibrahim M. Nasser & Amjad H. Alzaanin - 2020 - International Journal of Engineering and Information Systems (IJEAIS) 4 (9):06-14.
Classifying Cardiotocography Data based on Rough Neural Network.A. A. Salama, Belal Amin, Mona Gamal, Khaled Mahfouz & Ibrahim Mahmoud El-Henawy - 2019 - International Journal of Advanced Computer Science and Applications 10 (8):352-356.

Analytics

Added to PP
2021-01-29

Downloads
15 (#947,268)

6 months
12 (#213,833)

Historical graph of downloads
How can I increase my downloads?

Citations of this work

No citations found.

Add more citations

References found in this work

No references found.

Add more references