65th ISI World Statistics Congress 2025

Build your own open-source text analysis tools: Istat's cloud-based toolkit

Author

Mauro Bruno

Co-author

  • Francesco Ortame

Conference

65th ISI World Statistics Congress 2025

Format: IPS Abstract - WSC 2025

Keywords: data-science, natural-language-processing, open source

Abstract

Text analysis is crucial for extracting insights from large volumes of unstructured data. This work presents a toolkit for creating open-source, cloud-based platforms for comprehensive text analysis, built entirely with Python and the Streamlit library. In line with the growing importance of open-source tools such as Python within the statistical community, the toolkit can integrate most natural language processing techniques into a single text analysis platform: for instance, BERTopic for topic extraction, large language models (LLMs) for automated comment generation and information retrieval, and traditional methods such as word frequency analysis.
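As an illustration of the simplest technique mentioned above, word frequency analysis, the following is a minimal sketch using only the Python standard library. It is not taken from the toolkit itself: the function name `word_frequencies`, the `STOP_WORDS` set, and the tokenisation rule are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list used only for illustration.
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "for"})


def word_frequencies(text, top_n=10, stop_words=STOP_WORDS):
    """Return the top_n most frequent words in text as (word, count) pairs.

    Tokenisation is a simple assumption: lowercase ASCII letter runs,
    optionally containing one internal apostrophe (e.g. "don't").
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(token for token in tokens if token not in stop_words)
    return counts.most_common(top_n)


if __name__ == "__main__":
    sample = "Open data and open tools make open statistics."
    for word, count in word_frequencies(sample, top_n=3):
        print(word, count)
```

In a Streamlit-based platform, a function of this kind could plausibly feed a table or chart component, with the stop-word list and `top_n` exposed as user inputs.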
Because it relies on cloud-native technologies, the toolkit is scalable and accessible by design, making advanced text analysis more efficient and available to a wider audience. Its open-source nature promotes collaboration, transparency, and adaptability to emerging techniques, aligning with the broader movement in statistics towards more open and collaborative practices. This flexibility also allows researchers to rapidly develop proofs of concept and demos for their work.
This study details the development process, demonstrates the platform's functionality through case studies, and explores its potential applications in research and academia.