Achieving Data Acumen for Big Open Data using Natural Language Processing (NLP) Model
Conference
65th ISI World Statistics Congress 2025
Format: CPS Abstract - WSC 2025
Keywords: #officialstatistics, artificial intelligence, big data, data-analysis, data-literacy
Session: CPS 55 - Public Engagement and Statistical Literacy
Monday 6 October 4 p.m. - 5 p.m. (Europe/Amsterdam)
Tuesday 7 October 4 p.m. - 5 p.m. (Europe/Amsterdam)
Tuesday 7 October 5:10 p.m. - 6:10 p.m. (Europe/Amsterdam)
Abstract
With modern advances in technology, open data can be found and collected nearly everywhere. Big open data can be collected at regular and continuous intervals from sources such as social media, clinical notes, web-scraped data, censuses, and operational systems. Many agencies (individual, government, and quasi-government) are involved in the important task of providing useful information to data consumers. However, differences among data producers have diluted the cohesion of these data, hence the need to strengthen and consolidate open data globally for better understanding.
Open data refers to datasets on topics such as climate, health, crime, migration, and the economy that are freely available for anyone to use and republish as they wish, without restrictions from any mechanism of control. They should be machine-readable so that computers and other digital instruments can process them. Although the importance of open data in national and global development cannot be overemphasized, open data risks widening the digital divide and social inequality unless it is handled properly. There is a clear and compelling case that data produced at public expense should be made open and freely available for the public's benefit. However, simply declaring datasets to be open does not in itself make them of practical use to the public unless they are truly accessible and sound inferences are drawn from them.
To address the challenges of building acumen for big open data, data management and data cleaning become essential steps before data exploration begins, and these steps can themselves pose new challenges for data analysis. Digital tools are indispensable for coping with these challenges. This proposal therefore highlights the importance of Natural Language Processing (NLP) techniques for exploring complex datasets, especially text-based big data, efficiently and adequately. NLP can help extract useful information from open data and thereby build competency in statistical literacy.
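As a rough illustration of the cleaning-then-exploration idea described above, the following minimal Python sketch normalizes a few free-text records and counts the most frequent terms. It is not the method of this proposal, which the abstract does not specify. The function names `clean_text` and `top_terms`, the tiny stop-word list, and the sample records are all hypothetical and made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}


def clean_text(text: str) -> list[str]:
    """Lowercase the text, strip URLs and non-letters, and drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep only letters and whitespace
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def top_terms(documents: list[str], n: int = 3) -> list[tuple[str, int]]:
    """Return the n most frequent cleaned tokens across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(clean_text(doc))
    return counts.most_common(n)


# Made-up records standing in for text-based open data.
records = [
    "Flooding reported in the northern district: https://example.org/report",
    "Health clinics in the district report rising malaria cases.",
    "District council publishes open data on flooding and health.",
]

print(top_terms(records))
```

Even a simple term count like this can hint at the dominant themes in a text collection before any formal analysis; a real pipeline would likely rely on an established NLP library and a fuller stop-word list.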