Water Quality Identification Using Ensemble Machine Learning and Hybrid Resampling SMOTE-ENN Algorithm
Abstract
Water is essential for all living organisms, yet only a small fraction of it is fresh and suitable for consumption. The limited availability of freshwater sources, worsened by pollution, overuse, and climate change, underscores the urgent need for sustainable water management. Traditional water quality identification methods are labour-intensive, slow, and costly. Data-driven water quality identification, in turn, often struggles with data quality, imbalanced datasets, and model interpretability. These challenges lead to inaccuracies, especially in detecting minority classes, which is crucial for identifying pollution. This research explores machine learning (ML) techniques to address the limitations of water quality classification by integrating ensemble learning using LightGBM with hybrid resampling using SMOTE-ENN. Ensemble learning techniques improve accuracy and robustness by aggregating the strengths of multiple models, effectively handling imbalanced data and reducing overfitting. Hybrid resampling techniques enhance model sensitivity by generating synthetic minority-class samples and refining the dataset through noise reduction. Together, these techniques provide a more reliable framework for water quality identification, enabling timely and accurate identification. The method offers a robust solution for addressing data imbalance and overfitting, ensuring more effective detection of polluted conditions. This study highlights the importance of advanced ML techniques in improving water quality classification and underscores LightGBM's effectiveness in handling imbalanced data after SMOTE-ENN is applied. The proposed method achieved the highest evaluation metrics in water quality classification, with accuracy, F1-score, and recall of 94.50%, 94.76%, and 93.00%, respectively, including a 3% increase in recall.
Keywords: Water Quality, Machine Learning, Imbalanced Data, LightGBM, SMOTE-ENN, Ensemble Learning, Hybrid Resampling.
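To make the resampling idea concrete, the following is a minimal, standard-library-only sketch of SMOTE followed by Edited Nearest Neighbours (ENN). It is not the authors' implementation: the toy data, the function names (k_nearest, smote, edited_nearest_neighbours, smote_enn), the choice k=3, and the majority-vote ENN rule are all assumptions made for illustration. In practice one would more likely rely on a library such as imbalanced-learn (its SMOTEENN class) and train LightGBM's LGBMClassifier on the resampled training split.

```python
import math
import random
from collections import Counter


def k_nearest(point, candidates, k, skip=None):
    """Indices of the k candidates closest to `point` (Euclidean distance)."""
    order = sorted((i for i in range(len(candidates)) if i != skip),
                   key=lambda i: math.dist(point, candidates[i]))
    return order[:k]


def smote(X, y, minority, k=3, rng=None):
    """Oversample `minority` up to the size of the largest class by
    interpolating between a minority sample and one of its minority neighbours.
    Assumes the minority class has at least two samples."""
    rng = rng or random.Random(0)
    pts = [x for x, label in zip(X, y) if label == minority]
    n_new = max(Counter(y).values()) - len(pts)
    X_out, y_out = [list(x) for x in X], list(y)
    for _ in range(n_new):
        i = rng.randrange(len(pts))
        j = rng.choice(k_nearest(pts[i], pts, k, skip=i))
        gap = rng.random()  # position along the segment between the two points
        X_out.append([b + gap * (o - b) for b, o in zip(pts[i], pts[j])])
        y_out.append(minority)
    return X_out, y_out


def edited_nearest_neighbours(X, y, k=3):
    """Drop every sample whose label disagrees with the majority label of its
    k nearest neighbours (a simplified, majority-vote ENN rule)."""
    keep = []
    for i, point in enumerate(X):
        nearest = k_nearest(point, X, k, skip=i)
        majority = Counter(y[j] for j in nearest).most_common(1)[0][0]
        if y[i] == majority:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_enn(X, y, minority, k=3, seed=0):
    """Hybrid resampling: SMOTE oversampling followed by ENN cleaning."""
    X_s, y_s = smote(X, y, minority, k, random.Random(seed))
    return edited_nearest_neighbours(X_s, y_s, k)


# Hypothetical toy data: label 0 = "clean" (majority), label 1 = "polluted" (minority).
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8), (0.8, 0.2), (0.3, 0.3),
     (5.1, 5.0),                       # a noisy "clean" label inside the polluted cluster
     (5, 5), (5, 6), (6, 5), (6, 6)]
y = [0] * 9 + [1] * 4

X_res, y_res = smote_enn(X, y, minority=1)
print(sorted(Counter(y).items()))      # [(0, 9), (1, 4)]
print(sorted(Counter(y_res).items()))  # [(0, 8), (1, 9)]
```

In this toy run, SMOTE adds five synthetic minority samples to balance the classes, and ENN then removes the single mislabelled majority point that sits inside the minority cluster. Resampling of this kind is normally applied to the training split only, so that evaluation metrics such as recall and F1-score are measured on untouched data.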
License
Copyright (c) 2024 Moch Deny Pratama, Rifqi Abdillah, Dina Zatusiva Haq
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The rights and licenses of the Fountain of Informatics Journal (FIJ) are set out below. By submitting an article manuscript, the author(s) agree to this policy. No specific document sign-off is required.
1. License
Non-commercial use of the article is governed by the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
2. Author(s)' Warranties
The author(s) warrant that the article is original, was written by the stated author(s), has not been published before, contains no unlawful statements, does not infringe the rights of others, is subject to copyright vested exclusively in the author(s), is free of any third-party rights, and that any necessary written permissions to quote from other sources have been obtained by the author(s).
3. User/Public Rights
FIJ's aim is to disseminate published articles as freely as possible. Under the Creative Commons license, FIJ permits users to copy, distribute, display, and perform the work for non-commercial purposes only. Users must also attribute the authors and FIJ when distributing works in the journal and in other publication media. Unless otherwise stated, the authors are public entities as soon as their articles are published.
4. Rights of Authors
Authors retain all their rights to the published works, including (but not limited to) the following:
- Copyright and other proprietary rights relating to the article, such as patent rights,
- The right to use the substance of the article in their own future works, including lectures and books,
- The right to reproduce the article for their own purposes,
- The right to self-archive the article (please read our deposit policy),
- The right to enter into separate, additional contractual arrangements for the non-exclusive distribution of the article's published version (e.g., posting it to an institutional repository or publishing it in a book), with an acknowledgment of its initial publication in this journal (Fountain of Informatics Journal).
5. Co-Authorship
If the article was jointly prepared by more than one author, the author submitting the manuscript warrants that he/she has been authorized by all co-authors to agree to this copyright and license notice (agreement) on their behalf, and agrees to inform his/her co-authors of the terms of this policy. FIJ will not be held liable for anything that may arise from internal disputes among the author(s). FIJ will only communicate with the corresponding author.
6. Royalties
Because FIJ is an open-access journal that disseminates articles free of charge under the Creative Commons license terms mentioned above, the author(s) are aware that FIJ entitles them to no royalties or other fees.
7. Miscellaneous
FIJ will publish the article (or have it published) in the journal if the article's editorial process is successfully completed. FIJ's editors may modify the article to a style of punctuation, spelling, capitalization, referencing, and usage that they deem appropriate. The author acknowledges that the article may be published so that it will be publicly accessible, and that such access will be free of charge for readers, as mentioned in point 3.