Differences, distances and fingerprints

The fundamentals of stylometry and multivariate text analysis


Keywords: stylometry, computational text analysis, authorship attribution, Estonian fiction

The recent rapid expansion of computational methods and tools into the humanities has rekindled the conversation about the relationship between an object of study and its mathematical representation, or model. This paper serves as a conceptual introduction to stylometry, a sub-field of computational text analysis that studies differences between texts quantitatively, and shows how simplistic models of texts can be used to uncover their complex relationships.

Both the term “stylometry” and the field’s development are closely linked to the problem of authorship attribution and identification. The paper briefly introduces the early history of stylometry, highlighting the assumptions about texts and authorship that the field inherited. It shows how text analysis methods have shifted from the analysis of single features (such as word length) to multivariate computations. The paper then explains the basics of modern multivariate text analysis and the notion of a “distance” between texts as a proxy for their difference, focusing primarily on word frequencies as units of analysis.

The paper concludes with a series of preliminary authorship attribution experiments on a small corpus of Estonian fiction, which test the behavior of a few foundational variables and serve as a general proof-of-concept demonstration. The experiments show that the Cosine Delta distance performs reliably on the non-lemmatized Estonian corpus when frequencies of at least the 100 most frequent words are used. Cosine Delta also achieves stable attribution accuracy for random samples of at least 5000 words, which suggests that texts shorter than this might not be reliably attributed with the proposed methodology. Both findings are consistent with observations made for other languages, but should be treated as preliminary for Estonian.
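
The Cosine Delta distance mentioned above can be illustrated with a minimal sketch. It assumes the common formulation: relative frequencies of the most frequent words are z-scored across the corpus, and the distance between two texts is one minus the cosine of the angle between their z-score vectors. The experiments reported here were run in R with the Stylo package; the Python code below is an independent illustration and not the code used in the paper. The function names, the toy corpus and the use of the sample standard deviation are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def most_frequent_words(texts, n):
    """Return the n most frequent words in the pooled corpus."""
    pooled = Counter()
    for tokens in texts.values():
        pooled.update(tokens)
    return [word for word, _ in pooled.most_common(n)]


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def z_scores(freq_table):
    """Standardize every word (column) across all texts."""
    names = list(freq_table)
    n_words = len(freq_table[names[0]])
    scaled = {name: [] for name in names}
    for j in range(n_words):
        column = [freq_table[name][j] for name in names]
        mean = sum(column) / len(column)
        # Sample standard deviation (n - 1); an assumption of this sketch.
        sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (len(column) - 1))
        for name in names:
            value = (freq_table[name][j] - mean) / sd if sd > 0 else 0.0
            scaled[name].append(value)
    return scaled


def cosine_distance(u, v):
    """One minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def cosine_delta(texts, n_mfw=100):
    """Pairwise Cosine Delta distances between tokenized texts."""
    vocabulary = most_frequent_words(texts, n_mfw)
    freqs = {name: relative_frequencies(tokens, vocabulary)
             for name, tokens in texts.items()}
    scaled = z_scores(freqs)
    return {(a, b): cosine_distance(scaled[a], scaled[b])
            for a, b in combinations(texts, 2)}


def toy_text(ja, on, ei, et):
    """Build an invented 'text' from counts of four Estonian function words."""
    return ["ja"] * ja + ["on"] * on + ["ei"] * ei + ["et"] * et


# Invented toy corpus: two hypothetical authors, two texts each.
texts = {
    "A_1": toy_text(40, 30, 20, 10),
    "A_2": toy_text(38, 32, 19, 11),
    "B_1": toy_text(20, 20, 30, 30),
    "B_2": toy_text(22, 18, 31, 29),
}

distances = cosine_delta(texts, n_mfw=4)
for pair, distance in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, round(distance, 3))
```

In this toy corpus the two texts by the same hypothetical author end up closer to each other than to any text by the other author. A real experiment would use tokenized novels and at least the 100 most frequent words, as the findings above suggest.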


Artjoms Šeļa (b. 1989), PhD, Researcher in Digital Humanities at the University of Tartu (Jakobi 2, Tartu 51003), Assistant Professor in Computational Stylometry at the Institute of Polish Language, Polish Academy of Sciences, Kraków, artjoms.sela@ut.ee


Code and data

The corpus and the code for all experiments and figures are available in the repository: https://github.com/perechen/stylometry_intro_KK. The analysis was carried out in R 4.1.0. The main stylometric experiments additionally relied on the Stylo package (Eder et al. 2016).


Ackerman, James S. 1962. A Theory of style. – The Journal of Aesthetics and Art Criticism, kd 20, nr 3, lk 227-237.
Allison, Sarah; Heuser, Ryan; Jockers, Matthew L.; Moretti, Franco; Witmore, Michael 2011. Quantitative formalism: An experiment. – Literary Lab Pamphlets 1, 15. XI. https://litlab.stanford.edu/LiteraryLabPamphlet1.pdf
Argamon, Shlomo 2008. Interpreting Burrows’s Delta: Geometric and probabilistic foundations. – Literary and Linguistic Computing, kd 23, nr 2, lk 131-147.
Argamon, Shlomo; Goulain, Jean-Baptiste; Horton, Russell; Olsen, Mark 2009a. Vive la Différence! Text mining gender difference in French literature. – Digital Humanities Quarterly, kd 3, nr 2. http://www.digitalhumanities.org/dhq/vol/3/2/000042/000042.html
Argamon, Shlomo; Koppel, Moshe; Pennebaker, James W.; Schler, Jonathan 2009b. Automatically profiling the author of an anonymous text. – Communications of the ACM, kd 52, nr 2, lk 119-123.
Brennan, Michael; Afroz, Sadia; Greenstadt, Rachel 2012. Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. – ACM Transactions on Information and System Security, kd 15, nr 3, art 12.
Burrows, John 2002. ‘Delta’: A measure of stylistic difference and a guide to likely authorship. – Literary and Linguistic Computing, kd 17, nr 3, lk 267-287.
Campbell, Lewis 1867. The Sophistes and Politicus of Plato. Oxford: Oxford Clarendon Press.
Chang, Kent K.; DeDeo, Simon 2020. Divergence and the complexity of difference in text and culture. – Journal of Cultural Analytics, kd 1, nr 1.
Croft, William 2000. Explaining Language Change: An Evolutionary Approach. Harlow: Pearson Education.
Da, Nan Z. 2019. The computational case against computational literary studies. – Critical Inquiry, kd 45, nr 3, lk 601-639.
Dittenberger, Wilhelm 1881. Sprachliche Kriterien für die Chronologie der platonischen Dialoge. – Hermes. Zeitschrift für klassische Philologie, kd 16, lk 321-345.
Eder, Maciej 2013. Does size matter? Authorship attribution, small samples, big problem. – Digital Scholarship in the Humanities, kd 30, nr 2, lk 167-182.
Eder, Maciej 2015. Visualization in stylometry: Cluster analysis using networks. – Digital Scholarship in the Humanities, kd 32, nr 1, lk 50-64.
Eder, Maciej 2017. Short samples in authorship attribution: A new approach. – Digital Humanities 2017. Montreal, Canada, August 8-11, 2017. Conference Abstracts. Montreal: McGill University, lk 221-224.
Eder, Maciej; Rybicki, Jan; Kestemont, Mike 2016. Stylometry with R: a package for computational text analysis. – R Journal, kd 8, nr 1, lk 107-121.
Emmery, Chris; Kádár, Ákos; Chrupała, Grzegorz 2021. Adversarial stylometry in the wild: Transferable lexical substitution attacks on author profiling. – ArXiv.org. Computer Science.
Evert, Stefan; Proisl, Thomas; Jannidis, Fotis; Reger, Isabella; Pielström, Steffen; Schöch, Christof; Vitt, Thorsten 2017. Understanding and explaining Delta measures for authorship attribution. – Digital Scholarship in the Humanities, kd 32, nr suppl_2, lk ii4-ii16.
Ginzburg, Carlo 1979. Clues: Roots of a scientific paradigm. – Theory and Society, kd 7, nr 3, lk 273-288.
Grieve, Jack 2007. Quantitative authorship attribution: An evaluation of techniques. – Literary and Linguistic Computing, kd 22, nr 3, lk 251-270.
Grzybek, Peter 2014. The emergence of stylometry: Prolegomena to the history of term and concept. – Text within Text – Culture within Culture. Toim Katalin Kroó, Peeter Torop. Budapest-Tartu: L’Harmattan, lk 58-75.
Herrmann, Berenike J.; Dalen-Oskam, Karina van; Schöch, Christof 2015. Revisiting style, a key concept in literary studies. – Journal of Literary Theory, kd 9, nr 1, lk 25-52.
Holmes, David I. 1998. The evolution of stylometry in humanities scholarship. – Literary and Linguistic Computing, kd 13, nr 3, lk 111-117.
Holmes, David I.; Kardos, Judit 2003. Who was the author? An introduction to stylometry. – CHANCE, kd 16, nr 2, lk 5-8.
Hughes, James M.; Foti, Nicholas J.; Krakauer, David C.; Rockmore, Daniel N. 2012. Quantitative patterns of stylistic influence in the evolution of literature. – Proceedings of the National Academy of Sciences, kd 109, nr 20, lk 7682-7686.
Jannidis, Fotis; Pielström, Steffen; Schöch, Christof; Vitt, Thorsten 2015. Improving Burrows’ Delta – An empirical evaluation of text distance measures. – Digital Humanities Conference 2015. University of Western Sydney, Australia. Conference presentation. https://www.researchgate.net/publication/280086768_Improving_Burrows’_Delta_-_An_empirical_evaluation_of_text_distance_measures (17. VIII 2021).
Jockers, Matthew L. 2013. Macroanalysis: Digital Methods and Literary History. 1st Edition. Urbana: University of Illinois Press.
Juola, Patrick 2012. Large-scale experiments in authorship attribution. – English Studies, kd 93, nr 3, lk 275-283.
Juola, Patrick 2013. How a computer program helped reveal J. K. Rowling as author of a Cuckoo’s Calling. – Scientific American. https://www.scientificamerican.com/article/how-a-computer-program-helped-show-jk-rowling-write-a-cuckoos-calling/ (2. VI 2021).
Juola, Patrick 2020. Authorship studies and the dark side of social media analytics. – Journal of Universal Computer Science, kd 26, nr 1, lk 156-170.
Kestemont, Mike; Stover, Justin; Koppel, Moshe; Karsdorp, Folgert; Daelemans, Walter 2016. Authenticating the writings of Julius Caesar. – Expert Systems with Applications, kd 63, lk 86-96.
Koolen, Corina W. 2018. Reading beyond the female. The relationship between perception of author gender and literary quality. (ILLC Dissertation Series DS-2018-03.) Universiteit van Amsterdam.
Koppel, Moshe; Schler, Jonathan; Argamon, Shlomo 2009. Computational methods in authorship attribution. – Journal of the American Society for Information Science and Technology, kd 60, nr 1, lk 9-26.
Lotman, Jurij M.; Lotman, Mihail J. 1986. Vokrug desjatoj glavy “Evgenija Onegina”. – Puškin: Issledovanija i materialy. Leningrad: Nauka, lk 124-151. [Юрий М. Лотман, Михаил Ю. Лотман, Вокруг десятой главы “Евгения Онегина”. – Пушкин: Исследования и материалы. Ленинград: Наука.]
Lutosławski, Wincenty 1897a. The Origin and Growth of Plato’s Logic. With an Account of Plato’s Style and of the Chronology of His Writings. London: Longmans, Green and Co.
Lutosławski, Wincenty 1897b. On stylometry. – Classical Review, kd 11, lk 284-286.
Malone, Edward 1787. A dissertation on parts I, II and III of Henry VI tending to show that those plays were not written originally by Shakespeare. London: From the Press of Henry Baldwin.
Mendenhall, Thomas C. 1887. The characteristic curves of composition. – Science, kd ns-9, nr 214S, lk 237-246.
Mendenhall, Thomas 1901. A mechanical solution of a literary problem. – Popular Science Monthly, kd 60, lk 98-105.
Morozov, Nikolaj Aleksandrovič 1915. Lingvističeskie spektry. Sredstvo dlja otličenija plagiatov ot istinnyh proizvedenij togo ili drugogo izvestnogo avtora. – Izvestija otdelenija russkogo jazyka i slovesnosti Imperatorskoj akademii nauk, kd 20, nr 1-4, lk 95-127. [Николай Александрович Морозов, Лингвистические спектры. Средство для отличения плагиатов от истинных произведений того или другого известного автора. – Известия отделения русского языка и словесности Императорской академии наук.]
Mosteller, Frederick; Wallace, David L. 1963. Inference in an authorship problem. – Journal of the American Statistical Association, kd 58, nr 302, lk 275-309.
Newberry, Mitchell G.; Ahern, Christopher A.; Clark, Robin; Plotkin, Joshua B. 2017. Detecting evolutionary forces in language change. – Nature, kd 551, nr 7679, lk 223-226.
Noecker Jr, John; Ryan, Michael; Juola, Patrick 2013. Psychological profiling through textual analysis. – Literary and Linguistic Computing, kd 28, nr 3, lk 382-387.
Piper, Andrew 2017. Fictionality. – Journal of Cultural Analytics, kd 1, nr 1.
Plecháč, Petr; Bobenhausen, Klemens; Hammerich, Benjamin 2018. Versification and authorship attribution. A pilot study on Czech, German, Spanish, and English poetry. – Studia Metrica et Poetica, kd 5, nr 2, lk 29-54.
Rapp, Christof 2010. Aristotle’s Rhetoric. – The Stanford Encyclopedia of Philosophy. Toim Edward N. Zalta. Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/spr2010/entries/aristotle-rhetoric/ (5. VIII 2021).
Sarawgi, Ruchita; Gajulapalli, Kailash; Choi, Yejin 2011. Gender attribution: Tracing stylometric evidence beyond topic and genre. – CoNLL ’11: Proceedings of the Fifteenth Conference on Computational Natural Language Learning. Portland, Oregon, June 23-24, 2011. Stroudsburg: Association for Computational Linguistics, lk 78-86.
Savoy, Jacques 2020. Machine Learning Methods for Stylometry. Springer.
Smith, Peter W. H.; Aldridge, W. 2011. Improving Authorship Attribution: Optimizing Burrows’ Delta Method*. – Journal of Quantitative Linguistics, kd 18, nr 1, lk 63-88.
Spedding, James 1850. Who wrote Shakespeare’s Henry VIII? – Gentleman’s Magazine and Historical Review, kd 34, lk 381-382.
Storey, Grant; Mimno, David 2020. Like two Pis in a pod: Author similarity across time in the Ancient Greek corpus. – Journal of Cultural Analytics, kd 1, nr 1.
Šapir, Maksim Il’ič 2000. Fenomen Baten’kova i problema mistifikacii (lingvostihovedčeskij aspekt). – M. I. Šapir, Universum versus: Jazyk, stih, smysl v russkoj poèzii XVIII-XIX vekov. Moskva: Jazyki russkoj kul’tury, lk 335-443. [Максим Ильич Шапир, Феномен Батенькова и проблема мистификации (Лингвостиховедческий аспект). – М. И. Шапир, Universum versus: Язык, стих, смысл в русской поэзии XVIII-XIX веков. Москва: Языки русской культуры.]
Šeļa, Artjoms; Orekhov, Boris; Leibov, Roman 2020. Weak genres: Modeling association between poetic meter and meaning in Russian poetry. – CHR 2020: Workshop on Computational Humanities Research. Amsterdam: CEUR-WS, lk 12-31.
Zipf, George K. 1949. Human Behavior and the Principle of Least Effort. Cambridge: Addison-Wesley Press, Inc.
Tarlinskaja, Marina 1987. Shakespeare’s Verse. Iambic Pentameter and the Poet’s Idiosyncrasies. New York: Peter Lang.
Tomaševski 1923 = Boris Viktorovič Tomaševskij, Pjatistopnyj jamb Puškina. – Očerki po poètike Puškina. Berlin: Èpoha, lk 7-143. [Борис Викторович Томашевский, Пятистопный ямб Пушкина. – Очерки по поэтике Пушкина. Берлин: Эпоха.]
Tuzzi, Arjuna; Cortelazzo, Michele A. (toim) 2018. Drawing Elena Ferrante’s Profile. Padova: Padova University Press.
Uiboaed, Kristel 2017. Kirjandusteoste automaatanalüüs [Text-mining and Stylometric Analysis of Estonian Novels]. – Tekstikaeve. http://www.tekstikaeve.ee/blog/2017-10-25-kirjandusteoste-automaatanalyys/ (2. VI 2021).
Uiboaed, Kristel (toim) 2018. E-raamatute eeltöödeldud ja lemmatiseeritud failid. https://datadoi.ee/handle/33/76
Underwood, Ted 2017. The life cycles of genres. – Journal of Cultural Analytics, kd 1, nr 1.
Underwood, Ted 2019. Distant Horizons: Digital Evidence and Literary Change. Chicago: University of Chicago Press.
Yule, George Udny 1939. On sentence-length as a statistical characteristic of style in prose: With application to two cases of disputed authorship. – Biometrika, kd 30, nr 3-4, lk 363-390.
Yule, George Udny 1944. The Statistical Study of Literary Vocabulary. Cambridge: Cambridge University Press.