AI-Based Optimization of ETL Workflows for High-Precision Data Curation in Big Data Environments

Authors

  • Isabella Mia Adams, AI/ML Data Engineer, Taiwan

Keywords:

ETL Optimization, Artificial Intelligence, Machine Learning, Data Curation, Big Data, Workflow Automation, Data Quality, Predictive Scaling

Abstract

The exponential growth in data volume, variety, and velocity has rendered traditional Extract, Transform, Load (ETL) workflows inadequate for high-precision data curation in big data environments. This paper explores the integration of Artificial Intelligence (AI) and Machine Learning (ML) techniques to optimize ETL processes autonomously. We propose a framework in which AI models dynamically handle schema mapping, data quality anomaly detection, and resource allocation, with the aim of improving accuracy, efficiency, and reliability. The discussion covers architectural models, a comparative performance analysis, and specific use cases demonstrating reduced latency and enhanced data integrity. Results from a simulated environment indicate a potential 40-60% reduction in processing time and a substantial decrease in data errors compared to rule-based ETL systems.

Published

2026-02-10

How to Cite

Isabella Mia Adams. (2026). AI-Based Optimization of ETL Workflows for High-Precision Data Curation in Big Data Environments. International Journal of Advanced Research in Cyber Security, 7(1), 7-13. https://ijarc.com/index.php/journal/article/view/IJARC.07.01.003