TY - JOUR
T1 - Comparative analysis of preprocessing methods for molecular descriptors in predicting anti-cathepsin activity
AU - Suprapto, Suprapto
N1 - Publisher Copyright: © 2023 The Author(s)
PY - 2024/1
Y1 - 2024/1
N2 - Quantitative Structure-Activity Relationship (QSAR) is a powerful tool for investigating the correlation between the chemical and biological properties of molecules. It employs mathematical and statistical modeling techniques that focus on molecular descriptors, which represent various characteristics of the molecules. However, the extensive use of descriptors in QSAR modeling introduces complexities in data analysis and computation. To address these challenges, data preprocessing techniques, such as data reduction and feature selection, are critical for directing the inputs of statistical models. Feature selection plays a crucial role in improving the accuracy and efficiency of machine learning algorithms by identifying relevant features that significantly influence the target response. In this study, a comprehensive comparison of preprocessing methods in QSAR modeling is carried out. The methods under investigation include filtering through Recursive Feature Elimination (RFE) and wrapping methods such as Forward Selection (FS), Backward Elimination (BE), and Stepwise Selection (SS). To evaluate the effectiveness of these preprocessing methods, both linear regression and nonlinear regression models were utilized. The research findings reveal the usefulness of feature selection methods in reducing the number of descriptors required for the accurate assessment of anti-cathepsin activity. Remarkably, the FS, BE, and SS methods, particularly when coupled with nonlinear regression models, exhibit promising performance relating to R-squared scores. This research emphasizes the significance of data preprocessing, particularly feature selection, in QSAR modeling. The comparative analysis of different preprocessing methods provides a valuable understanding of their effectiveness in reducing descriptor complexity and improving model performance. These findings contribute to the improvement of QSAR modeling techniques, supporting accurate predictions of anti-cathepsin activity and enabling the exploration of structure-activity relationships in drug discovery and design.
KW - Backward elimination
KW - Feature selection
KW - Forward selection
KW - Molecular descriptors
KW - Recursive feature elimination
KW - Stepwise selection
UR - http://www.scopus.com/inward/record.url?scp=85176765409&partnerID=8YFLogxK
U2 - 10.1016/j.sajce.2023.11.001
DO - 10.1016/j.sajce.2023.11.001
M3 - Article
AN - SCOPUS:85176765409
SN - 1026-9185
VL - 47
SP - 123
EP - 135
JO - South African Journal of Chemical Engineering
JF - South African Journal of Chemical Engineering
ER -