New Time Series from Textual Data - Machine Learning meets Statistics

Author

Peter Winker

Conference

65th ISI World Statistics Congress

Format: IPS Abstract - WSC 2025

Keywords: machine learning, nlp, robustness, textmining

Session: IPS 925 - Machine Learning improved Time Series Analysis

Monday 6 October 10:50 a.m. - 12:30 p.m. (Europe/Amsterdam)

Abstract

The fast development of methods for computer-based analyses of text data offers a source for novel text-based indicators. Whenever the documents in a corpus include publication time among their metadata, summarized information such as sentiment or topic weights can be provided as time series. Generating such time series involves a substantial number of steps. First, the collection and preparation of the textual data requires decisions about sample choice and pre-processing steps, e.g., removal of low- and high-frequency words. Second, there exists a large number of natural language processing (NLP) tools, which might be used to extract sentiment, classify documents into categories or extract topics from a corpus. These methods include statistical models as well as machine learning methods. The setting of (meta-)parameters of models and algorithms, as well as decisions about (temporal) aggregation, might influence the outcome. Finally, post-processing, e.g., seasonal adjustment, might also affect the properties of the resulting time series.
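The pipeline described above can be illustrated with a minimal sketch. It is not the method used in the talk: the lexicon, the four-document corpus, the function names and the `max_doc_share` threshold are all hypothetical, chosen only to show how a single pre-processing decision (dropping high-frequency words) can change a monthly sentiment series.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical toy lexicon and corpus, for illustration only.
LEXICON = {"growth": 1.0, "strong": 1.0, "crisis": -1.0, "weak": -1.0}

CORPUS = [
    ("2024-01-05", "strong growth in the market"),
    ("2024-01-20", "the market is weak"),
    ("2024-02-03", "crisis in the market and weak demand"),
    ("2024-02-17", "strong demand in the market"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def document_sentiment(tokens):
    """Mean lexicon score over all tokens; tokens not in the lexicon score 0."""
    if not tokens:
        return 0.0
    return sum(LEXICON.get(t, 0.0) for t in tokens) / len(tokens)


def sentiment_series(corpus, max_doc_share=1.0):
    """Monthly mean sentiment.

    Tokens that occur in more than `max_doc_share` of all documents are
    removed before scoring (assumed stand-in for high-frequency filtering).
    """
    tokenized = [(date, tokenize(text)) for date, text in corpus]
    doc_freq = Counter(t for _, toks in tokenized for t in set(toks))
    n_docs = len(tokenized)
    by_month = defaultdict(list)
    for date, toks in tokenized:
        kept = [t for t in toks if doc_freq[t] / n_docs <= max_doc_share]
        by_month[date[:7]].append(document_sentiment(kept))  # "YYYY-MM" key
    return {month: mean(scores) for month, scores in sorted(by_month.items())}


baseline = sentiment_series(CORPUS)                      # no filtering
filtered = sentiment_series(CORPUS, max_doc_share=0.5)   # drop frequent words
for month in baseline:
    print(month, round(baseline[month], 3), round(filtered[month], 3))
```

In this toy setting the two series differ in level and, for the second month, in sign, although the corpus and the lexicon are identical. Only the filtering threshold has changed.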
We will present available and new empirical evidence on how robust time series derived from textual data are with regard to the aspects addressed. Given the lack of a comprehensive statistical model for the process, evidence will mainly be based on Monte Carlo simulations and bootstrapping from real corpora. We will also present methods for visualizing the degree of uncertainty linked to specific results.
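One simple way to attach an uncertainty band to a text-based series is a percentile bootstrap over documents within each period. The sketch below is an assumption-laden illustration, not the procedure presented in the talk: the scores are made up, `bootstrap_band` is a hypothetical helper, and documents within a month are treated as exchangeable, which real corpora may violate.

```python
import random
from statistics import mean

# Hypothetical document-level sentiment scores grouped by month.
SCORES = {
    "2024-01": [0.4, -0.25, 0.1, 0.3, -0.05],
    "2024-02": [-0.3, 0.2, -0.1, -0.4, 0.0],
}


def bootstrap_band(scores, n_boot=1000, level=0.90, seed=0):
    """Percentile bootstrap interval for the mean of one period's scores."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    # Resample documents with replacement and recompute the period mean.
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_boot)
    )
    lower_idx = int((1 - level) / 2 * n_boot)
    upper_idx = int((1 + level) / 2 * n_boot) - 1
    return means[lower_idx], means[upper_idx]


bands = {month: bootstrap_band(s) for month, s in SCORES.items()}
for month, (low, high) in bands.items():
    point = mean(SCORES[month])
    print(f"{month}: mean={point:+.3f}  90% band=[{low:+.3f}, {high:+.3f}]")
```

Plotting the point estimate together with such a band per period gives a basic visual impression of how much of the movement in the series could be due to the sampling of documents alone. Uncertainty from model choice or parameter settings would need to be varied separately.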