Build your own open-source text analysis tools: Istat's cloud-based toolkit
Conference
65th ISI World Statistics Congress 2025
Format: IPS Abstract - WSC 2025
Keywords: data-science, natural-language-processing, open source
Abstract
Text analysis is crucial for extracting insights from vast amounts of unstructured data. This work presents a toolkit for creating open-source, cloud-based platforms for comprehensive text analysis, built entirely with Python and the Streamlit library. In line with the growing importance of open-source tools such as Python within the statistical community, the toolkit can integrate most natural language processing techniques into a single text analysis platform. Examples include BERTopic for topic extraction, large language models (LLMs) for automated comment generation and information retrieval, and traditional methods such as word frequency analysis.
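As an illustration of the simplest method mentioned above, word frequency analysis, the following is a minimal sketch using only the Python standard library. It is not the toolkit's actual code: the function name `word_frequencies`, the `STOP_WORDS` set, and the sample sentence are all hypothetical and chosen for illustration only.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch);
# a real platform would likely use a fuller, language-specific list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}


def word_frequencies(text, top_n=10):
    """Return up to top_n (word, count) pairs, most frequent first."""
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Count every token that is not a stop word.
    counts = Counter(token for token in tokens if token not in STOP_WORDS)
    return counts.most_common(top_n)


sample = "Open data and open tools make open statistics possible."
print(word_frequencies(sample, top_n=3))
```

In a Streamlit application, a result of this kind could plausibly be displayed with a charting call such as `st.bar_chart`, although how the toolkit itself renders frequencies is not stated here.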
By relying on cloud-native technologies, the toolkit is scalable and accessible by design, making advanced text analysis more efficient and available to a wider audience. Its open-source nature promotes collaboration, transparency, and adaptability to emerging techniques, aligning with the broader movement in statistics towards more open and collaborative practices. This flexibility also allows researchers to rapidly develop proofs-of-concept and demos for their work.
This study details the development process, demonstrates the platform's functionality through case studies, and explores its potential applications in research and academia.