New Time Series from Textual Data - Machine Learning meets Statistics
Conference
65th ISI World Statistics Congress 2025
Format: IPS Abstract - WSC 2025
Keywords: machine learning, nlp, robustness, textmining
Session: IPS 925 - Machine Learning improved Time Series Analysis
Monday 6 October 10:50 a.m. - 12:30 p.m. (Europe/Amsterdam)
Abstract
The fast development of methods for computer-based analyses of text data offers a source for novel text-based indicators. Whenever the documents in a corpus include publication time among their metadata, summarized information such as sentiment or topic weights can be provided as time series. Generating such time series involves a substantial number of steps. First, the collection and preparation of the textual data requires decisions about sample choice and pre-processing steps, e.g., removal of low- and high-frequency words. Second, there exists a large number of natural language processing (NLP) tools, which might be used to extract sentiment, classify documents into categories or extract topics from a corpus. These methods include statistical models as well as machine learning methods. The setting of (meta-)parameters of models and algorithms might influence the outcome, as might decisions about (temporal) aggregation. Finally, post-processing, e.g., seasonal adjustment, might also affect the properties of the resulting time series.
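As a minimal sketch of how dated documents can be turned into a time series, the following Python example scores each document with a toy word list and averages the scores per month. The lexicon, the function names `sentiment_score` and `monthly_sentiment`, and the three sample documents are hypothetical and serve only as illustration; they are not the tools or data used by the authors.

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy lexicon; a real study would rely on an established resource.
POSITIVE = {"growth", "gain", "strong"}
NEGATIVE = {"loss", "crisis", "weak"}


def sentiment_score(text):
    """Return (positives - negatives) / matched words, or 0.0 if none match."""
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return (pos - neg) / total if total else 0.0


def monthly_sentiment(docs):
    """Average document scores per (year, month), ordered by time."""
    buckets = defaultdict(list)
    for published, text in docs:
        buckets[(published.year, published.month)].append(sentiment_score(text))
    return {key: sum(v) / len(v) for key, v in sorted(buckets.items())}


# Invented example documents: (publication date, text).
docs = [
    (date(2024, 1, 5), "strong growth in exports"),
    (date(2024, 1, 20), "crisis and loss in retail"),
    (date(2024, 2, 3), "weak demand but strong gain in services"),
]
series = monthly_sentiment(docs)
```

Each decision in this sketch (the word list, the normalization by matched words, the monthly rather than weekly aggregation) is one of the choices whose influence on the resulting series is at issue.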
We will present available and new empirical evidence on how robust time series derived from textual data are with regard to the choices outlined above. Given the lack of a comprehensive statistical model for the process, evidence will mainly be based on Monte Carlo simulations and bootstrapping from real corpora. We will also present methods for visualizing the degree of uncertainty linked to specific results.
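One simple way to attach uncertainty to a text-based indicator is to resample documents within each period. The Python sketch below assumes per-document scores are already available and computes a percentile bootstrap interval for each monthly mean; the function name `bootstrap_interval`, its parameters, and the score values are hypothetical, and this is not necessarily the resampling scheme used by the authors.

```python
import random


def bootstrap_interval(scores, n_boot=1000, alpha=0.10, seed=0):
    """Percentile bootstrap interval for the mean of per-document scores."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    means = []
    for _ in range(n_boot):
        # Draw n documents with replacement and record the resampled mean.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Invented per-document sentiment scores, grouped by month.
scores_by_month = {
    "2024-01": [0.2, -0.1, 0.4, 0.0, 0.3],
    "2024-02": [-0.5, -0.2, 0.1, -0.3],
}
bands = {m: bootstrap_interval(s) for m, s in scores_by_month.items()}
```

Plotting the lower and upper values in `bands` as a shaded region around the monthly means would give one basic visualization of the uncertainty in the series.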