Abstract

Every web page has main content: a section, segment, or block that contains text or multimedia on a single web page. Important information about local governance generally lies within the main content, hence the need for a web content extractor to extract that information. To address this need, this research combines two existing approaches: a template-based approach and a machine learning approach using a Naïve-Bayes classifier. Previous research has generally used only one type of approach, either template-based or machine learning. The results show that, by combining the two approaches, the model could identify 95% of all nodes that contain the main content.
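The abstract does not specify the features, the training data, or how the two approaches are chained, so the following is only a minimal sketch of one plausible combination, not the authors' method. It assumes each page is already parsed into hypothetical `(path, text)` node tuples, treats nodes that repeat identically on every page as the site template, and then labels the remaining nodes with a small categorical naive Bayes classifier. The feature choices (tag name, a five-word length bucket) and all sample data are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def template_nodes(pages):
    """Template step (assumed): nodes identical on every page are site template."""
    common = set(pages[0])
    for page in pages[1:]:
        common &= set(page)
    return common


def features(node):
    """Hypothetical features: tag name and a coarse text-length bucket."""
    path, text = node
    tag = path.rsplit("/", 1)[-1]
    return (tag, "long" if len(text.split()) >= 5 else "short")


class NaiveBayes:
    """Categorical naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, rows, labels):
        self.labels = Counter(labels)
        self.counts = defaultdict(Counter)  # (label, feature index) -> value counts
        self.values = defaultdict(set)      # feature index -> distinct values seen
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.counts[(label, i)][value] += 1
                self.values[i].add(value)
        return self

    def predict(self, row):
        total = sum(self.labels.values())

        def log_score(label):
            score = math.log(self.labels[label] / total)  # log prior
            for i, value in enumerate(row):
                num = self.counts[(label, i)][value] + 1
                den = self.labels[label] + len(self.values[i])
                score += math.log(num / den)              # log likelihood
            return score

        return max(self.labels, key=log_score)


def extract_main(page, template, model):
    """Drop template nodes, then keep nodes the classifier labels 'main'."""
    return [n for n in page
            if n not in template and model.predict(features(n)) == "main"]


# Toy labelled nodes (illustrative only).
train = [
    (("html/body/div/p", "The council approved the new budget for road repairs"), "main"),
    (("html/body/div/p", "Residents can submit feedback until the end of March"), "main"),
    (("html/body/div/h1", "Budget approved"), "main"),
    (("html/body/div/a", "Read more"), "other"),
    (("html/body/div/span", "Share"), "other"),
    (("html/body/div/a", "Next page"), "other"),
]
model = NaiveBayes().fit([features(n) for n, _ in train], [y for _, y in train])

page_a = [
    ("html/body/nav/a", "Home"),
    ("html/body/div/p", "The village office will open a new service counter on Monday"),
    ("html/body/div/a", "Read more"),
    ("html/body/footer/p", "Copyright local government"),
]
page_b = [
    ("html/body/nav/a", "Home"),
    ("html/body/div/p", "Waste collection schedules change during the holiday period this year"),
    ("html/body/footer/p", "Copyright local government"),
]

template = template_nodes([page_a, page_b])
main_nodes = extract_main(page_a, template, model)
```

In this toy setup the template step removes the shared navigation link and footer before classification, and the classifier then separates the article paragraph from the "Read more" link. A real system would need a proper HTML parser, richer features, and labelled pages from the target sites.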

Original language: English
Title of host publication: ICSGTEIS 2023 - 2023 International Conference on Smart-Green Technology in Electrical and Information Systems
Subtitle of host publication: Leveraging Sustainable Technologies Towards Energy Transition and Net Zero Emission, Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 41-46
Number of pages: 6
ISBN (Electronic): 9798350382822
Publication status: Published - 2023
Event: 2023 International Conference on Smart-Green Technology in Electrical and Information Systems, ICSGTEIS 2023 - Hybrid, Bali, Indonesia
Duration: 2 Nov 2023 to 4 Nov 2023

Publication series

Name: Proceedings - International Conference on Smart-Green Technology in Electrical and Information Systems, ICSGTEIS
ISSN (Print): 2831-3992
ISSN (Electronic): 2831-400X

Conference

Conference: 2023 International Conference on Smart-Green Technology in Electrical and Information Systems, ICSGTEIS 2023
Country/Territory: Indonesia
City: Hybrid, Bali
Period: 2/11/23 to 4/11/23

Keywords

  • naïve Bayes
  • template-based
  • web content extractor

Fingerprint

Dive into the research topics of 'Website Main Content Extraction Using Template-Based Approach and Naïve-Bayes Classification'. Together they form a unique fingerprint.
