ML-Powered Data Catalogs and Metadata Inference for Self-Describing Big Data Ecosystems

Authors

  • Sivadeep Katangoori, Solutions Architect at Metanoia Solutions Inc, USA

Keywords:

ML-powered data catalogs, metadata inference, big data ecosystems

Abstract

Organizations today hold data that is large in volume, heterogeneous in form, and scattered across systems. Conventional data management methods often fall short when users need to find, understand, and apply that data. Modern data catalogs are more than static inventories; they are dynamic repositories that give structure, meaning, and easy access to large data ecosystems. Machine learning (ML) makes these catalogs most valuable where data sets are messy, poorly documented, or poorly organized. ML-powered catalogs analyze current data trends, user behavior, and contextual signals to automatically tag, categorize, and link data assets, which makes the data self-describing and simplifies governance. This addresses a major problem: creating metadata by hand is laborious, error-prone, and typically does not scale across teams and technologies. Our methodology uses both supervised and unsupervised ML techniques to infer schema details, data lineage, semantic links, and usage contexts, reducing dependence on human input while improving data discoverability and reliability. This paper proposes a scalable architecture that integrates ML models directly into the cataloging process, enabling real-time metadata enrichment as data is ingested. A feedback mechanism continuously improves inference quality through user interactions and validations. Initial findings indicate significant improvements in data discoverability, faster onboarding for data users, and improved compliance readiness, resulting from enhanced metadata completeness and traceability. This work shows how ML-powered data catalogs can evolve from simple documentation tools into intelligent systems that ease self-service analytics, promote data democratization, and support truly self-describing big data platforms, making organizations more agile and insight-driven.
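
The inference-plus-feedback loop described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: simple regular-expression heuristics stand in for the supervised and unsupervised models, and every name here (PATTERNS, ColumnMetadata, infer_column, Catalog, the 0.8 threshold) is a hypothetical choice made only for illustration. The sketch tags columns from sample values at ingestion time and lets a user validation override and persist over later inferences.

```python
import re
from dataclasses import dataclass

# Hypothetical tag patterns; a real catalog would use trained classifiers instead.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "integer": re.compile(r"^-?\d+$"),
}
THRESHOLD = 0.8  # assumed minimum share of matching samples to accept a tag


@dataclass
class ColumnMetadata:
    name: str
    tag: str
    confidence: float
    source: str = "inferred"  # becomes "user" once a person validates the tag


def infer_column(name, samples):
    """Pick the tag whose pattern matches the largest share of non-empty samples."""
    values = [s.strip() for s in samples if s and s.strip()]
    if not values:
        return ColumnMetadata(name, "unknown", 0.0)
    best_tag, best_score = "text", 0.0
    for tag, pattern in PATTERNS.items():
        score = sum(1 for v in values if pattern.match(v)) / len(values)
        if score > best_score:
            best_tag, best_score = tag, score
    if best_score < THRESHOLD:
        # No pattern dominates: fall back to generic text (ad hoc confidence).
        return ColumnMetadata(name, "text", 1.0 - best_score)
    return ColumnMetadata(name, best_tag, best_score)


class Catalog:
    """Toy catalog: enriches metadata on ingest, keeps user-validated tags."""

    def __init__(self):
        self.entries = {}

    def ingest(self, table, columns):
        for col, samples in columns.items():
            key = (table, col)
            existing = self.entries.get(key)
            if existing is not None and existing.source == "user":
                continue  # feedback wins over fresh inference
            self.entries[key] = infer_column(col, samples)

    def validate(self, table, col, tag):
        """Record a user correction or confirmation for a column."""
        self.entries[(table, col)] = ColumnMetadata(col, tag, 1.0, source="user")


batch = {
    "contact": ["a@example.com", "b@example.org", "c@example.net"],
    "signup": ["2024-01-05", "2024-02-11", "2024-03-09"],
    "notes": ["called twice", "prefers email", ""],
}
catalog = Catalog()
catalog.ingest("customers", batch)
catalog.validate("customers", "notes", "free_text")
catalog.ingest("customers", batch)  # re-ingestion keeps the validated tag
```

In a production setting the validated entries would more plausibly be fed back as training labels rather than acting only as overrides; the override is used here because it keeps the sketch short.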

Published

2025-05-08

How to Cite

[1]
S. Katangoori, “ML-Powered Data Catalogs and Metadata Inference for Self-Describing Big Data Ecosystems”, J. of Art. Int. Research and App., vol. 5, no. 1, pp. 54–82, May 2025, Accessed: May 18, 2026. [Online]. Available: https://jaira.org.uk/index.php/jaira/article/view/14
