Differences, distances and fingerprints

The fundamentals of stylometry and multivariate text analysis

https://doi.org/10.54013/kk764a3

Keywords: stylometry, computational text analysis, authorship attribution, Estonian fiction

The recent rapid expansion of computational methods and tools into the humanities has rekindled the conversation about the relationship between an object of study and its mathematical representation, or model. This paper serves as a conceptual introduction to stylometry, a subfield of computational text analysis that studies differences between texts quantitatively, and shows how simplistic models of texts can be used to uncover their complex relationships.

The very term “stylometry” and the field’s development are closely linked to the problem of authorship attribution and identification. The paper briefly introduces the early history of stylometry, highlighting the assumptions about texts and authorship that the field has inherited. It shows how text analysis methods have shifted from the analysis of single features (such as word length) to multivariate computations. The paper then explains the basics of modern multivariate text analysis and the notion of a “distance” between texts as a proxy for their difference, focusing primarily on word frequencies as units of analysis.

The paper concludes with a series of preliminary authorship attribution experiments on a small corpus of Estonian fiction, which test the behavior of a few foundational variables and serve as a general proof-of-concept demonstration. The experiments show reliable performance of the Cosine Delta distance on the non-lemmatized Estonian corpus when the frequencies of at least the 100 most frequent words are used. Cosine Delta also achieves stable attribution accuracy for random samples of at least 5000 words, which suggests that texts shorter than this might not be reliably attributed with the proposed methodology. Both findings are consistent with observations made for other languages, but should be treated as preliminary for Estonian.
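As an illustration of the distance mentioned above, the following is a minimal Python sketch of Cosine Delta as it is commonly described in the literature (Jannidis et al. 2015; Evert et al. 2017): relative frequencies of the most frequent words are standardized to z-scores across the corpus, and texts are compared by the cosine distance between their z-score vectors. It is not the code used for the paper's experiments, which were run in R with the Stylo package and may differ in detail. The function names and the tiny English toy corpus are hypothetical and serve only to make the example self-contained.

```python
from collections import Counter
import math


def most_frequent_words(texts, n):
    """Return the n most frequent words, with counts pooled over all texts."""
    pooled = Counter()
    for tokens in texts.values():
        pooled.update(tokens)
    return [word for word, _ in pooled.most_common(n)]


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one tokenized text."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def z_scores(freq_table):
    """Standardize every word (column) across texts: (f - mean) / sd."""
    names = list(freq_table)
    scaled = {name: [] for name in names}
    for column in zip(*(freq_table[name] for name in names)):
        mean = sum(column) / len(column)
        sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (len(column) - 1))
        for name, x in zip(names, column):
            # A word with identical frequency everywhere carries no signal.
            scaled[name].append((x - mean) / sd if sd > 0 else 0.0)
    return scaled


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 2 means opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cosine_delta(texts, n_mfw=100):
    """Pairwise Cosine Delta between all texts (needs at least two texts)."""
    vocabulary = most_frequent_words(texts, n_mfw)
    freqs = {name: relative_frequencies(tokens, vocabulary)
             for name, tokens in texts.items()}
    z = z_scores(freqs)
    names = sorted(texts)
    return {(a, b): cosine_distance(z[a], z[b])
            for i, a in enumerate(names) for b in names[i + 1:]}


# Hypothetical toy corpus: two texts by "author A", one by "author B".
toy_corpus = {
    "A_1": ["the"] * 6 + ["and"] * 2 + ["of"] * 2,
    "A_2": ["the"] * 5 + ["and"] * 3 + ["of"] * 2,
    "B_1": ["the"] * 2 + ["and"] * 2 + ["of"] * 6,
}
distances = cosine_delta(toy_corpus, n_mfw=3)
for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, round(d, 3))
```

In an attribution setting, a text of unknown authorship would be assigned to the author of its nearest neighbor under this distance. In the toy corpus the two "author A" texts are closer to each other than either is to the "author B" text, but real experiments need far more words and features, in line with the thresholds reported above.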

 

Artjoms Šeļa (b. 1989), PhD, Researcher in Digital Humanities at the University of Tartu (Jakobi 2, Tartu 51003), Assistant Professor in Computational Stylometry at the Institute of Polish Language, Polish Academy of Sciences, Kraków, artjoms.sela@ut.ee

References

Code and data

The corpus and the code for all experiments and figures are available in the repository: https://github.com/perechen/stylometry_intro_KK. R 4.1.0 was used for the analysis. The main stylometric experiments additionally relied on the Stylo package (Eder et al. 2016).

Bibliography

Ackerman, James S. 1962. A theory of style. – The Journal of Aesthetics and Art Criticism, vol. 20, no. 3, pp. 227-237.
https://doi.org/10.2307/427321
Allison, Sarah; Heuser, Ryan; Jockers, Matthew L.; Moretti, Franco; Witmore, Michael 2011. Quantitative formalism: An experiment. – Literary Lab Pamphlets 1, 15 November. https://litlab.stanford.edu/LiteraryLabPamphlet1.pdf
Argamon, Shlomo 2008. Interpreting Burrows’s Delta: Geometric and probabilistic foundations. – Literary and Linguistic Computing, vol. 23, no. 2, pp. 131-147.
https://doi.org/10.1093/llc/fqn003
Argamon, Shlomo; Goulain, Jean-Baptiste; Horton, Russell; Olsen, Mark 2009a. Vive la Différence! Text mining gender difference in French literature. – Digital Humanities Quarterly, vol. 3, no. 2. http://www.digitalhumanities.org/dhq/vol/3/2/000042/000042.html
Argamon, Shlomo; Koppel, Moshe; Pennebaker, James W.; Schler, Jonathan 2009b. Automatically profiling the author of an anonymous text. – Communications of the ACM, vol. 52, no. 2, pp. 119-123.
https://doi.org/10.1145/1461928.1461959
Brennan, Michael; Afroz, Sadia; Greenstadt, Rachel 2012. Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. – ACM Transactions on Information and System Security, vol. 15, no. 3, art. 12.
https://doi.org/10.1145/2382448.2382450
Burrows, John 2002. ‘Delta’: A measure of stylistic difference and a guide to likely authorship. – Literary and Linguistic Computing, vol. 17, no. 3, pp. 267-287.
https://doi.org/10.1093/llc/17.3.267
Campbell, Lewis 1867. The Sophistes and Politicus of Plato. Oxford: Clarendon Press.
Chang, Kent K.; DeDeo, Simon 2020. Divergence and the complexity of difference in text and culture. – Journal of Cultural Analytics, vol. 1, no. 1.
https://doi.org/10.22148/001c.17585
Croft, William 2000. Explaining Language Change: An Evolutionary Approach. Harlow: Pearson Education.
Da, Nan Z. 2019. The computational case against computational literary studies. – Critical Inquiry, vol. 45, no. 3, pp. 601-639.
https://doi.org/10.1086/702594
Dittenberger, Wilhelm 1881. Sprachliche Kriterien für die Chronologie der platonischen Dialoge. – Hermes. Zeitschrift für klassische Philologie, vol. 16, pp. 321-345.
Eder, Maciej 2013. Does size matter? Authorship attribution, small samples, big problem. – Digital Scholarship in the Humanities, vol. 30, no. 2, pp. 167-182.
https://doi.org/10.1093/llc/fqt066
Eder, Maciej 2015. Visualization in stylometry: Cluster analysis using networks. – Digital Scholarship in the Humanities, vol. 32, no. 1, pp. 50-64.
https://doi.org/10.1093/llc/fqv061
Eder, Maciej 2017. Short samples in authorship attribution: A new approach. – Digital Humanities 2017. Montreal, Canada, August 8-11, 2017. Conference Abstracts. Montreal: McGill University, pp. 221-224.
Eder, Maciej; Rybicki, Jan; Kestemont, Mike 2016. Stylometry with R: a package for computational text analysis. – R Journal, vol. 8, no. 1, pp. 107-121.
https://doi.org/10.32614/RJ-2016-007
Emmery, Chris; Kádár, Ákos; Chrupała, Grzegorz 2021. Adversarial stylometry in the wild: Transferable lexical substitution attacks on author profiling. – ArXiv.org. Computer Science.
https://doi.org/10.18653/v1/2021.eacl-main.203
Evert, Stefan; Proisl, Thomas; Jannidis, Fotis; Reger, Isabella; Pielström, Steffen; Schöch, Christof; Vitt, Thorsten 2017. Understanding and explaining Delta measures for authorship attribution. – Digital Scholarship in the Humanities, vol. 32, no. suppl_2, pp. ii4-ii16.
https://doi.org/10.1093/llc/fqx023
Ginzburg, Carlo 1979. Clues: Roots of a scientific paradigm. – Theory and Society, vol. 7, no. 3, pp. 273-288.
https://doi.org/10.1007/BF00207323
Grieve, Jack 2007. Quantitative authorship attribution: An evaluation of techniques. – Literary and Linguistic Computing, vol. 22, no. 3, pp. 251-270.
https://doi.org/10.1093/llc/fqm020
Grzybek, Peter 2014. The emergence of stylometry: Prolegomena to the history of term and concept. – Text within Text – Culture within Culture. Ed. by Katalin Kroó, Peeter Torop. Budapest-Tartu: L’Harmattan, pp. 58-75.
Herrmann, Berenike J.; Dalen-Oskam, Karina van; Schöch, Christof 2015. Revisiting style, a key concept in literary studies. – Journal of Literary Theory, vol. 9, no. 1, pp. 25-52.
https://doi.org/10.1515/jlt-2015-0003
Holmes, David I. 1998. The evolution of stylometry in humanities scholarship. – Literary and Linguistic Computing, vol. 13, no. 3, pp. 111-117.
https://doi.org/10.1093/llc/13.3.111
Holmes, David I.; Kardos, Judit 2003. Who was the author? An introduction to stylometry. – CHANCE, vol. 16, no. 2, pp. 5-8.
https://doi.org/10.1080/09332480.2003.10554842
Hughes, James M.; Foti, Nicholas J.; Krakauer, David C.; Rockmore, Daniel N. 2012. Quantitative patterns of stylistic influence in the evolution of literature. – Proceedings of the National Academy of Sciences, vol. 109, no. 20, pp. 7682-7686.
https://doi.org/10.1073/pnas.1115407109
Jannidis, Fotis; Pielström, Steffen; Schöch, Christof; Vitt, Thorsten 2015. Improving Burrows’ Delta – An empirical evaluation of text distance measures. – Digital Humanities Conference 2015. University of Western Sydney, Australia. Conference paper. https://www.researchgate.net/publication/280086768_Improving_Burrows’_Delta_-_An_empirical_evaluation_of_text_distance_measures (accessed 17 August 2021).
Jockers, Matthew L. 2013. Macroanalysis: Digital Methods and Literary History. 1st Edition. Urbana: University of Illinois Press.
https://doi.org/10.5406/illinois/9780252037528.001.0001
Juola, Patrick 2012. Large-scale experiments in authorship attribution. – English Studies, vol. 93, no. 3, pp. 275-283.
https://doi.org/10.1080/0013838X.2012.668792
Juola, Patrick 2013. How a computer program helped reveal J. K. Rowling as author of a Cuckoo’s Calling. – Scientific American. https://www.scientificamerican.com/article/how-a-computer-program-helped-show-jk-rowling-write-a-cuckoos-calling/ (accessed 2 June 2021).
Juola, Patrick 2020. Authorship studies and the dark side of social media analytics. – Journal of Universal Computer Science, vol. 26, no. 1, pp. 156-170.
https://doi.org/10.3897/jucs.2020.009
Kestemont, Mike; Stover, Justin; Koppel, Moshe; Karsdorp, Folgert; Daelemans, Walter 2016. Authenticating the writings of Julius Caesar. – Expert Systems with Applications, vol. 63, pp. 86-96.
https://doi.org/10.1016/j.eswa.2016.06.029
Koolen, Corina W. 2018. Reading beyond the female. The relationship between perception of author gender and literary quality. (ILLC Dissertation Series DS-2018-03.) Universiteit van Amsterdam.
Koppel, Moshe; Schler, Jonathan; Argamon, Shlomo 2009. Computational methods in authorship attribution. – Journal of the American Society for Information Science and Technology, vol. 60, no. 1, pp. 9-26.
https://doi.org/10.1002/asi.20961
Lotman, Jurij M.; Lotman, Mihail J. 1986. Vokrug desjatoj glavy “Evgenija Onegina”. – Puškin: Issledovanija i materialy. Leningrad: Nauka, pp. 124-151. [Юрий М. Лотман, Михаил Ю. Лотман, Вокруг десятой главы “Евгения Онегина”. – Пушкин: Исследования и материалы. Ленинград: Наука.]
Lutosławski, Wincenty 1897a. The Origin and Growth of Plato’s Logic. With an Account of Plato’s Style and of the Chronology of His Writings. London: Longmans, Green and Co.
Lutosławski, Wincenty 1897b. On stylometry. – Classical Review, vol. 11, pp. 284-286.
https://doi.org/10.1017/S0009840X00032315
Malone, Edward 1787. A dissertation on parts I, II and III of Henry VI tending to show that those plays were not written originally by Shakespeare. London: From the Press of Henry Baldwin.
Mendenhall, Thomas C. 1887. The characteristic curves of composition. – Science, vol. ns-9, no. 214S, pp. 237-246.
https://doi.org/10.1126/science.ns-9.214S.237
Mendenhall, Thomas 1901. A mechanical solution of a literary problem. – Popular Science Monthly, vol. 60, pp. 98-105.
Morozov, Nikolaj Aleksandrovič 1915. Lingvističeskie spektry. Sredstvo dlja otličenija plagiatov ot istinnyh proizvedenij togo ili drugogo izvestnogo avtora. – Izvestija otdelenija russkogo jazyka i slovesnosti Imperatorskoj akademii nauk, vol. 20, no. 1-4, pp. 95-127. [Николай Александрович Морозов, Лингвистические спектры. Средство для отличения плагиатов от истинных произведений того или другого известного автора. – Известия отделения русского языка и словесности Императорской академии наук.]
Mosteller, Frederick; Wallace, David L. 1963. Inference in an authorship problem. – Journal of the American Statistical Association, vol. 58, no. 302, pp. 275-309.
https://doi.org/10.1080/01621459.1963.10500849
Newberry, Mitchell G.; Ahern, Christopher A.; Clark, Robin; Plotkin, Joshua B. 2017. Detecting evolutionary forces in language change. – Nature, vol. 551, no. 7679, pp. 223-226.
https://doi.org/10.1038/nature24455
Noecker Jr, John; Ryan, Michael; Juola, Patrick 2013. Psychological profiling through textual analysis. – Literary and Linguistic Computing, vol. 28, no. 3, pp. 382-387.
https://doi.org/10.1093/llc/fqs070
Piper, Andrew 2017. Fictionality. – Journal of Cultural Analytics, vol. 1, no. 1.
https://doi.org/10.22148/16.011
Plecháč, Petr; Bobenhausen, Klemens; Hammerich, Benjamin 2018. Versification and authorship attribution. A pilot study on Czech, German, Spanish, and English poetry. – Studia Metrica et Poetica, vol. 5, no. 2, pp. 29-54.
https://doi.org/10.12697/smp.2018.5.2.02
Rapp, Christof 2010. Aristotle’s Rhetoric. – The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/spr2010/entries/aristotle-rhetoric/ (accessed 5 August 2021).
Sarawgi, Ruchita; Gajulapalli, Kailash; Choi, Yejin 2011. Gender attribution: Tracing stylometric evidence beyond topic and genre. – CoNLL ’11: Proceedings of the Fifteenth Conference on Computational Natural Language Learning. Portland, Oregon, June 23-24, 2011. Stroudsburg: Association for Computational Linguistics, pp. 78-86.
Savoy, Jacques 2020. Machine Learning Methods for Stylometry. Springer.
https://doi.org/10.1007/978-3-030-53360-1
Smith, Peter W. H.; Aldridge, W. 2011. Improving Authorship Attribution: Optimizing Burrows’ Delta Method*. – Journal of Quantitative Linguistics, vol. 18, no. 1, pp. 63-88.
https://doi.org/10.1080/09296174.2011.533591
Spedding, James 1850. Who wrote Shakespeare’s Henry VIII? – Gentleman’s Magazine and Historical Review, vol. 34, pp. 381-382.
Storey, Grant; Mimno, David 2020. Like two Pis in a pod: Author similarity across time in the Ancient Greek corpus. – Journal of Cultural Analytics, vol. 1, no. 1.
https://doi.org/10.22148/001c.13680
Šapir, Maksim Il’ič 2000. Fenomen Baten’kova i problema mistifikacii (lingvostihovedčeskij aspekt). – M. I. Šapir, Universum versus: Jazyk, stih, smysl v russkoj poèzii XVIII-XIX vekov. Moskva: Jazyki russkoj kul’tury, pp. 335-443. [Максим Ильич Шапир, Феномен Батенькова и проблема мистификации (Лингвостиховедческий аспект). – М. И. Шапир, Universum versus: Язык, стих, смысл в русской поэзии XVIII-XIX веков. Москва: Языки русской культуры.]
Šeļa, Artjoms; Orekhov, Boris; Leibov, Roman 2020. Weak genres: Modeling association between poetic meter and meaning in Russian poetry. – CHR 2020: Workshop on Computational Humanities Research. Amsterdam: CEUR-WS, pp. 12-31.
Zipf, George K. 1949. Human Behavior and the Principle of Least Effort. Cambridge: Addison-Wesley Press, Inc.
Tarlinskaja, Marina 1987. Shakespeare’s Verse. Iambic Pentameter and the Poet’s Idiosyncrasies. New York: Peter Lang.
Tomaševski 1923 = Boris Viktorovič Tomaševskij, Pjatistopnyj jamb Puškina. – Očerki po poètike Puškina. Berlin: Èpoha, pp. 7-143. [Борис Викторович Томашевский, Пятистопный ямб Пушкина. – Очерки по поэтике Пушкина. Берлин: Эпоха.]
Tuzzi, Arjuna; Cortelazzo, Michele A. (eds) 2018. Drawing Elena Ferrante’s Profile. Padova: Padova University Press.
Uiboaed, Kristel 2017. Kirjandusteoste automaatanalüüs [Text-mining and Stylometric Analysis of Estonian Novels]. – Tekstikaeve. http://www.tekstikaeve.ee/blog/2017-10-25-kirjandusteoste-automaatanalyys/ (accessed 2 June 2021).
Uiboaed, Kristel (ed.) 2018. E-raamatute eeltöödeldud ja lemmatiseeritud failid [Preprocessed and lemmatized files of e-books]. https://datadoi.ee/handle/33/76
Underwood, Ted 2017. The life cycles of genres. – Journal of Cultural Analytics, vol. 1, no. 1.
https://doi.org/10.22148/16.005
Underwood, Ted 2019. Distant Horizons: Digital Evidence and Literary Change. Chicago: University of Chicago Press.
https://doi.org/10.7208/chicago/9780226612973.001.0001
Yule, George Udny 1939. On sentence-length as a statistical characteristic of style in prose: With application to two cases of disputed authorship. – Biometrika, vol. 30, no. 3-4, pp. 363-390.
https://doi.org/10.1093/biomet/30.3-4.363
Yule, George Udny 1944. The Statistical Study of Literary Vocabulary. Cambridge: Cambridge University Press.