TY - GEN
T1 - Towards Better HS Code Prediction
T2 - 2025 International Conference on Smart Computing, IoT and Machine Learning, SIML 2025
AU - Sasana, Erwin Duadja Betha
AU - Siahaan, Daniel
AU - Purwitasari, Diana
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - The Harmonized System (HS) code is an important instrument for classifying goods in international trade, as it ensures that the proper tariffs are paid and that customs regulations are complied with. However, predicting HS codes is a challenging task, as commodity descriptions are unstructured text that needs to be mapped to hierarchical commodity categories, and common trade terms often differ from HS nomenclature. This study addresses these challenges by evaluating various machine learning models, including traditional, deep learning, and NLP-based approaches, on datasets characterized by short, noisy descriptions. We aim to investigate whether these models maintain their performance on real-world, imperfect data and to understand the underlying factors contributing to model inaccuracies. The analysis demonstrates that NLP models, particularly fastText, consistently outperformed the others, delivering the highest accuracy on 8-digit HS code classification. Despite this result, the study also revealed significant misclassification issues caused by ambiguous terminology, formatting errors, and the common practice of importers copying SKU numbers from invoices or packing lists into import declarations without parsing them into rich commodity descriptions.
AB - The Harmonized System (HS) code is an important instrument for classifying goods in international trade, as it ensures that the proper tariffs are paid and that customs regulations are complied with. However, predicting HS codes is a challenging task, as commodity descriptions are unstructured text that needs to be mapped to hierarchical commodity categories, and common trade terms often differ from HS nomenclature. This study addresses these challenges by evaluating various machine learning models, including traditional, deep learning, and NLP-based approaches, on datasets characterized by short, noisy descriptions. We aim to investigate whether these models maintain their performance on real-world, imperfect data and to understand the underlying factors contributing to model inaccuracies. The analysis demonstrates that NLP models, particularly fastText, consistently outperformed the others, delivering the highest accuracy on 8-digit HS code classification. Despite this result, the study also revealed significant misclassification issues caused by ambiguous terminology, formatting errors, and the common practice of importers copying SKU numbers from invoices or packing lists into import declarations without parsing them into rich commodity descriptions.
KW - HS code prediction
KW - commodity classification
KW - machine learning models
KW - natural language processing
KW - trade compliance
UR - https://www.scopus.com/pages/publications/105012774673
U2 - 10.1109/SIML65326.2025.11081088
DO - 10.1109/SIML65326.2025.11081088
M3 - Conference contribution
AN - SCOPUS:105012774673
T3 - 2025 International Conference on Smart Computing, IoT and Machine Learning, SIML 2025
BT - 2025 International Conference on Smart Computing, IoT and Machine Learning, SIML 2025
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 3 June 2025 through 4 June 2025
ER -