Comparative analysis of preprocessing methods for molecular descriptors in predicting anti-cathepsin activity

Research output: Contribution to journal › Article › peer-review


Quantitative Structure-Activity Relationship (QSAR) is a powerful tool for investigating the correlation between the chemical and biological properties of molecules. It employs mathematical and statistical modeling techniques that focus on molecular descriptors, which represent various characteristics of the molecules. However, the extensive use of descriptors in QSAR modeling introduces complexities in data analysis and computation. To address these challenges, data preprocessing techniques, such as data reduction and feature selection, are critical for directing the inputs of statistical models. Feature selection plays a crucial role in improving the accuracy and efficiency of machine learning algorithms by identifying relevant features that significantly influence the target response. In this study, a comprehensive comparison of preprocessing methods in QSAR modeling is carried out. The methods under investigation include filtering through Recursive Feature Elimination (RFE) and wrapping methods such as Forward Selection (FS), Backward Elimination (BE), and Stepwise Selection (SS). To evaluate the effectiveness of these preprocessing methods, both linear and nonlinear regression models were used. The findings show that feature selection reduces the number of descriptors required for the accurate assessment of anti-cathepsin activity. Notably, the FS, BE, and SS methods, particularly when coupled with nonlinear regression models, exhibit promising performance in terms of R-squared scores. This research emphasizes the significance of data preprocessing, particularly feature selection, in QSAR modeling. The comparative analysis of different preprocessing methods provides a valuable understanding of their effectiveness in reducing descriptor complexity and improving model performance. These findings contribute to the improvement of QSAR modeling techniques, supporting accurate predictions of anti-cathepsin activity and enabling the exploration of structure-activity relationships in drug discovery and design.
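
As a rough illustration of two of the wrapper methods named in the abstract, the sketch below runs Forward Selection and Backward Elimination on synthetic data, scoring each candidate descriptor subset by the R-squared of an ordinary least-squares fit. It is not the authors' implementation: the data, the helper names (`solve`, `r_squared`, `forward_selection`, `backward_elimination`), the fixed target of two descriptors, and the use of in-sample R-squared as the selection criterion are all assumptions made for the example. RFE, Stepwise Selection, and the nonlinear models are not covered.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(X, y, cols):
    """In-sample R-squared of an OLS fit (with intercept) on the columns in `cols`."""
    rows = [[1.0] + [row[c] for c in cols] for row in X]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    pred = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


def forward_selection(X, y, n_select):
    """Start empty; repeatedly add the descriptor that raises R-squared the most."""
    selected, remaining = [], list(range(len(X[0])))
    while len(selected) < n_select:
        best = max(remaining, key=lambda c: r_squared(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected


def backward_elimination(X, y, n_select):
    """Start with all descriptors; repeatedly drop the one whose removal costs the least R-squared."""
    selected = list(range(len(X[0])))
    while len(selected) > n_select:
        drop = max(
            selected,
            key=lambda c: r_squared(X, y, [s for s in selected if s != c]),
        )
        selected.remove(drop)
    return selected


# Synthetic stand-in for a descriptor matrix: 60 "molecules", 6 "descriptors".
# Only descriptors 0 and 3 influence the response (an assumption of this toy setup).
random.seed(0)
X = [[random.gauss(0.0, 1.0) for _ in range(6)] for _ in range(60)]
y = [3.0 * row[0] - 2.0 * row[3] + random.gauss(0.0, 0.1) for row in X]

fs_cols = forward_selection(X, y, 2)
be_cols = backward_elimination(X, y, 2)
print("Forward selection kept:", sorted(fs_cols))
print("Backward elimination kept:", sorted(be_cols))
print("R-squared with selected descriptors:", round(r_squared(X, y, fs_cols), 4))
```

In practice the selection criterion would more likely be a cross-validated score than in-sample R-squared, which never decreases as descriptors are added and therefore cannot by itself decide how many to keep.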

Original language: English
Pages (from-to): 123-135
Number of pages: 13
Journal: South African Journal of Chemical Engineering
Publication status: Published - Jan 2024


  • Backward elimination
  • Feature selection
  • Forward selection
  • Molecular descriptors
  • Recursive feature elimination
  • Stepwise selection

