Differences, distances and fingerprints

The fundamentals of stylometry and multivariate text analysis

Keywords: stylometry, computational text analysis, authorship attribution, Estonian fiction

The recent rapid expansion of computational methods and tools into the humanities has rekindled the conversation about the relationship between an object of study and its mathematical representation, or model. The paper serves as a conceptual introduction to stylometry, a subfield of computational text analysis that studies differences between texts quantitatively, and shows how simplistic models of texts can be used to uncover their complex relationships.

The very term “stylometry” and the field’s development are closely linked to the problem of authorship attribution and identification. The paper briefly introduces the early history of stylometry, highlighting the assumptions about texts and authorship that the field has inherited. It shows how text analysis methods have shifted from the analysis of single features (such as word length) to multivariate computations. The paper explains the basics of modern multivariate text analysis and the notion of a “distance” between texts as a proxy for their difference, focusing primarily on word frequencies as units of analysis.

The paper concludes with a series of preliminary authorship attribution experiments on a small corpus of Estonian fiction, which test the behavior of a few foundational variables and serve as a general proof-of-concept demonstration. The experiments show that the Cosine Delta distance performs reliably on the non-lemmatized Estonian corpus when the frequencies of at least the 100 most frequent words are used. Cosine Delta also achieves stable attribution accuracy for random samples of at least 5000 words, which suggests that texts shorter than this might not be reliably attributed with the proposed methodology. Both findings are consistent with observations made for other languages, but should be treated as preliminary for Estonian.
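
To make the notion of distance concrete, the following is a minimal Python sketch of Cosine Delta as it is commonly described: relative frequencies of the most frequent words are standardized into z-scores across the corpus, and the distance between two texts is one minus the cosine of the angle between their z-score vectors. It is not the paper's own implementation. All function names and the toy texts are hypothetical, and selecting the most frequent words by pooled raw counts is an assumption made for illustration.

```python
import math
from collections import Counter


def most_frequent_words(tokenized_texts, n):
    """Return the n most frequent words by pooled raw counts (an assumption)."""
    totals = Counter()
    for tokens in tokenized_texts:
        totals.update(tokens)
    return [word for word, _ in totals.most_common(n)]


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def z_scores(freq_rows):
    """Standardize each word (column) across texts; needs at least two texts."""
    n = len(freq_rows)
    columns = list(zip(*freq_rows))
    means = [sum(col) / n for col in columns]
    # Sample standard deviation (n - 1 in the denominator).
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
        for col, m in zip(columns, means)
    ]
    # A word with zero variance carries no information and is scored 0.
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
        for row in freq_rows
    ]


def cosine_distance(u, v):
    """One minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention chosen here for all-zero vectors
    return 1.0 - dot / (norm_u * norm_v)


def cosine_delta_matrix(tokenized_texts, n_mfw=100):
    """Pairwise Cosine Delta distances between tokenized texts."""
    vocabulary = most_frequent_words(tokenized_texts, n_mfw)
    freqs = [relative_frequencies(t, vocabulary) for t in tokenized_texts]
    z = z_scores(freqs)
    return [[cosine_distance(a, b) for b in z] for a in z]


# Toy, hypothetical texts; real experiments need thousands of words per text.
texts = [
    "the cat sat on the mat and the dog sat too".split(),
    "a dog and a cat ran on and on and on".split(),
    "the the the mat mat sat".split(),
]
distances = cosine_delta_matrix(texts, n_mfw=5)
```

In an attribution setting, a text of unknown authorship would be assigned to the author of its nearest neighbour in such a matrix. The toy texts are far shorter than the 5000-word samples discussed above and only illustrate the mechanics.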


Artjoms Šeļa (b. 1989), PhD, Researcher in Digital Humanities at the University of Tartu (Jakobi 2, Tartu 51003), Assistant Professor in Computational Stylometry at the Institute of Polish Language, Polish Academy of Sciences, Kraków,