Keywords: corpus linguistics, text classification, text typology, functional text dimensions, multidimensional analysis
Corpus linguists and language technologists are increasingly turning to the Web as a source of language data. However, automatically crawled corpora have a major shortcoming: they contain large amounts of data, but little is known about their content. This has created a need for software that can extract the necessary information from the raw corpus. One such information extraction task in natural language processing is automatic text classification, which in practice poses several challenges, such as inconsistent terminology and the absence of a generally accepted taxonomy. Even if such a taxonomy existed, Web corpora include noisy, highly variable user-generated content that may not fit well into it. In this article we propose a novel theoretical framework for text classification, the Dimensional Text Model (DTM). This approach does not depend on existing genres or genre taxonomies. Instead, it relies on text-external criteria (function) and text-internal criteria (linguistic features), on the assumption that texts sharing a similar set of linguistic features also share a similar function. DTM combines Multidimensional Analysis (MDA, Biber 1988) and Functional Text Dimensions (FTD, Sharoff 2018). From MDA we adapt the concepts and definitions of text-internal criteria and of dimension: a dimension is a quantifiable measure of a set of co-occurring linguistic features. From FTD we adapt the notion of hybridism: instead of deciding whether or not a text belongs to a class (genre), the model characterizes a text through a combination of several parameters, i.e. dimensions, and describes it in the space of those dimensions.
The aim of DTM is to propose a cross-linguistic, universal model that does not depend on defining or classifying genres. Instead, it offers a framework for classifying texts by their characteristic linguistic features, describing them in a single space of dimensions, and interpreting their function from their location in the DTM space.
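As an illustration only, the idea of a dimension as a quantifiable measure of co-occurring linguistic features can be sketched in the style commonly used in multidimensional analysis: feature rates are standardized across the corpus, and a text's score on a dimension is the signed sum of its standardized rates for the features assigned to that dimension. The feature names, rates, dimension labels and signs below are all invented for the example; this is a minimal sketch under those assumptions, not the DTM implementation described in the article.

```python
from statistics import mean, pstdev

# Hypothetical per-text feature rates (e.g. occurrences per 1,000 words).
# Text labels, feature names and values are invented for illustration.
texts = {
    "forum_post": {"first_person_pronouns": 42.0, "nouns": 180.0, "passives": 2.0},
    "news_item": {"first_person_pronouns": 5.0, "nouns": 290.0, "passives": 11.0},
    "legal_act": {"first_person_pronouns": 0.0, "nouns": 340.0, "passives": 24.0},
}

# Hypothetical dimensions: each groups co-occurring features with a sign,
# +1 for features on the positive pole and -1 for the negative pole.
dimensions = {
    "involved_vs_informational": {"first_person_pronouns": +1, "nouns": -1},
    "impersonal_style": {"passives": +1},
}


def standardize(texts):
    """Convert raw feature rates to z-scores computed across all texts."""
    features = sorted({f for rates in texts.values() for f in rates})
    stats = {}
    for f in features:
        values = [rates[f] for rates in texts.values()]
        stats[f] = (mean(values), pstdev(values))
    return {
        name: {
            # A feature with zero variance carries no information: score 0.
            f: (rates[f] - stats[f][0]) / stats[f][1] if stats[f][1] else 0.0
            for f in features
        }
        for name, rates in texts.items()
    }


def dimension_scores(texts, dimensions):
    """Place each text in the space of dimensions as a dict of scores."""
    z = standardize(texts)
    return {
        name: {
            dim: sum(sign * z[name][f] for f, sign in feats.items())
            for dim, feats in dimensions.items()
        }
        for name in texts
    }


if __name__ == "__main__":
    for name, scores in dimension_scores(texts, dimensions).items():
        print(name, {dim: round(value, 2) for dim, value in scores.items()})
```

Each text ends up as a point with one coordinate per dimension rather than a single class label, which is the sense in which a text can be a hybrid: it can score high on several dimensions at once.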
Kristiina Vaik (b. 1990), MA, University of Tartu, Institute of Estonian and General Linguistics, PhD Student (Jakobi 2, 51005 Tartu)
Kairit Sirts (b. 1980), PhD, University of Tartu, Institute of Computer Science, Research Fellow in Language Technology (Narva maantee 18, 51009 Tartu)
Kadri Muischnek (b. 1965), PhD, University of Tartu, Institute of Estonian and General Linguistics, Associate Professor in Computer Linguistics (Jakobi 2-426, 51005 Tartu)
LLC = The London-Lund Corpus of Spoken English.
LOB = The Lancaster-Oslo/Bergen Corpus.
Universal Dependencies.
Atkins, Sue; Clear, Jeremy; Ostler, Nicholas 1992. Corpus design criteria. – Literary and Linguistic Computing, vol. 7, no. 1, pp. 1-30.
Besnier, Niko 1988. The linguistic relationships of spoken and written Nukulaelae registers. – Language, vol. 64, no. 4, pp. 707-736.
Biber, Douglas 1985. Investigating macroscopic textual variation through multifeature/multidimensional analyses. – Linguistics, vol. 23, no. 2, pp. 337-360.
Biber, Douglas 1986. Spoken and written textual dimensions in English: Resolving the contradictory findings. – Language, vol. 62, no. 2, pp. 384-414.
Biber, Douglas 1988. Variation across Speech and Writing. Cambridge: Cambridge University Press.
Biber, Douglas 1994. An analytical framework for register studies. – Sociolinguistic Perspectives on Register. (Oxford Studies in Sociolinguistics.) Eds. Douglas Biber, Edward Finegan. Oxford: Oxford University Press, pp. 31-56.
Biber, Douglas 1995. Dimensions of Register Variation: A Cross-Linguistic Comparison. Cambridge: Cambridge University Press.
Biber, Douglas; Davies, Mark; Jones, James K.; Tracy-Ventura, Nicole 2006. Spoken and written register variation in Spanish: A multi-dimensional analysis. – Corpora, vol. 1, no. 1, pp. 1-37.
Biber, Douglas; Hared, Mohamed 1992. Dimensions of register variation in Somali. – Language Variation and Change, vol. 4, no. 1, pp. 41-75.
Crossley, Scott A.; Louwerse, Max M. 2007. Multi-dimensional register classification using bigrams. – International Journal of Corpus Linguistics, vol. 12, no. 4, pp. 453-478.
Crowston, Kevin; Kwaśnik, Barbara; Rubleski, Joseph 2011. Problems in the use-centered development of a taxonomy of web genres. – Genres on the Web: Computational Models and Empirical Studies. (Text, Speech and Language Technology 42.) Eds. Alexander Mehler, Serge Sharoff, Marina Santini. Dordrecht: Springer Publishing Company, pp. 69-84.
Eggins, Suzanne; Martin, James R. 1997. Genres and registers of discourse. – Discourse as Structure and Process: Discourse Studies: A Multidisciplinary Introduction. Ed. Teun A. van Dijk. London: Sage, pp. 230-256.
Ferguson, Charles 1994. Dialect, register and genre: Working assumptions about conventionalization. – Sociolinguistic Perspectives on Register. (Oxford Studies in Sociolinguistics.) Eds. Douglas Biber, Edward Finegan. Oxford: Oxford University Press, pp. 15-30.
Forsyth, Richard S.; Sharoff, Serge 2013. Document dissimilarity within and across languages: A benchmarking study. – Literary and Linguistic Computing, vol. 29, no. 1, pp. 6-22.
Grieve, Jack 2014. A multi-dimensional analysis of regional variation in American English. – Multi-Dimensional Analysis, 25 years on: A Tribute to Douglas Biber. Eds. Tony Berber Sardinha, Marcia Veirano Pinto. Amsterdam-Philadelphia: John Benjamins Publishing Company, pp. 3-34.
Grieve, Jack; Biber, Douglas; Friginal, Eric; Nekrasova, Tatiana 2011. Variation among blogs: A multi-dimensional analysis. – Genres on the Web: Computational Models and Empirical Studies. (Text, Speech and Language Technology 42.) Eds. Alexander Mehler, Serge Sharoff, Marina Santini. Dordrecht: Springer Publishing Company, pp. 303-322.
Hennoste, Tiit 2000. Allkeeled. – Eesti keele allkeeled. (Tartu Ülikooli eesti keele õppetooli toimetised 16.) Tartu: Tartu Ülikooli Kirjastus, pp. 9-56.
Jang, Shyue-Chian 1998. Dimensions of Spoken and Written Taiwanese: A Corpus Based Study. PhD thesis. University of Hawaii at Manoa.
Katinskaia, Anisia; Sharoff, Serge 2015. Applying multi-dimensional analysis to a Russian webcorpus: Searching for evidence of genres. – The 5th Workshop on Balto-Slavic Natural Language Processing associated with the 10th International Conference on Recent Advances in Natural Language Processing (RANLP 2015), Hissar, Bulgaria 10-11 September 2015: Proceedings. Shoumen, Bulgaria: Incoma Ltd., pp. 65-74.
Kerge, Krista 2003. Keele variatiivsus ja mine-tuletus allkeelte süntaktilise keerukuse tegurina. (Tallinna Pedagoogikaülikooli humanitaarteaduste dissertatsioonid 10.) Tallinn: Tallinna Pedagoogikaülikooli Kirjastus.
Kerge, Krista 2010. Kirjažanrite keeleparameetrid mitme tekstiliigi taustal. – Emakeele Seltsi aastaraamat 55 (2009). Tallinn: Emakeele Selts, pp. 32-62.
Kerge, Krista; Pajupuu, Hille 2010. Text-types in speech technology and language teaching. – Analizar datos > Describir variación / Analysing data > Describing variation. Eds. Jorge L. Bueno Alonso et al. Vigo: Universidade de Vigo, Servizo de Publicacións, pp. 380-390.
Kerge, Krista; Pajupuu, Hille; Altrov, Rene 2007. Tekst, kontekstuaalsus ja kultuur. – Keel ja Kirjandus, no. 8, pp. 624-637.
Kerge, Krista; Pajupuu, Hille; Tamuri, Kairi; Meier, Heidi 2008. Kõnetehnoloogia vajab žanrilist lähenemist. – Eesti Rakenduslingvistika Ühingu aastaraamat, no. 4, pp. 53-65.
Kim, Yong-Jin; Biber, Douglas 1994. A corpus-based analysis of register variation in Korean. – Sociolinguistic Perspectives on Register. (Oxford Studies in Sociolinguistics.) Eds. Douglas Biber, Edward Finegan. Oxford: Oxford University Press, pp. 157-182.
Lahe, Janno 2005. Süü deliktiõiguses. (Dissertationes iuridicae Universitatis Tartuensis 16.) Tartu: Tartu Ülikooli Kirjastus.
Lamb, William E. 2002. Scottish Gaelic Speech and Writing: Register Variation in an Endangered Language. PhD thesis. Belfast: Cló Ollscoil na Banríona.
Lee, David 2001. Genres, registers, text types, domains, and styles: Clarifying the concepts and navigating a path through the BNC jungle. – Language Learning and Technology, vol. 5, no. 3, pp. 37-72.
Lindström, Liina 2000. Narratiiv ja selle sõnajärg. – Keel ja Kirjandus, no. 3, pp. 190-200.
McEnery, Tony; Hardie, Andrew 2012. Corpus Linguistics: Method, Theory and Practice. (Cambridge Textbooks in Linguistics.) Cambridge: Cambridge University Press.
Mehler, Alexander; Sharoff, Serge; Santini, Marina 2010. Riding the rough waves of genre on the web. – Genres on the Web: Computational Models and Empirical Studies. (Text, Speech and Language Technology 42.) Eds. Alexander Mehler, Serge Sharoff, Marina Santini. Dordrecht: Springer Publishing Company, pp. 3-33.
Meier, Heidi 2002. Olulisi aspekte tekstitüübi võrdluses. – Tekstid ja taustad: artikleid tekstianalüüsist. (Tartu Ülikooli eesti keele õppetooli toimetised 23.) Ed. Reet Kasik. Tartu, pp. 101-114.
Meier, Heidi 2003. Essee allkeelte võrdluses. – Tekstid ja taustad II: tekstianalüüsi vaatepunkte. (Tartu Ülikooli eesti keele õppetooli toimetised 26.) Ed. Reet Kasik. Tartu, pp. 116-135.
Pajupuu, Hille; Altrov, Rene; Pajupuu, Jaan 2016. Identifying polarity in different text types. – Folklore. Electronic Journal of Folklore, no. 64, pp. 25-42.
Pajupuu, Hille; Kerge, Krista; Altrov, Rene 2012. Lexicon-based detection of emotion in different types of texts: Preliminary remarks. – Eesti Rakenduslingvistika Ühingu aastaraamat, no. 8, pp. 171-184.
Parodi, Giovanni 2007. Variation across registers in Spanish: Exploring the El-Grial Pucv Corpus. – Working with Spanish Corpora. Ed. Giovanni Parodi. London: Continuum, pp. 11-53.
Passonneau, Rebecca J.; Ide, Nancy; Su, Songqiao; Stuart, Jesse 2014. Biber redux: Reconsidering dimensions of variation in American English. – Proceedings of COLING 2014: The 25th International Conference on Computational Linguistics: Technical Papers. Dublin: Dublin City University and Association for Computational Linguistics, pp. 565-576.
Puksand, Helin; Kerge, Krista 2012. Õpiteksti analüüs kirjaoskuse omandamise kontekstis. – Emakeele Seltsi aastaraamat 57 (2011). Tallinn: Emakeele Selts, pp. 162-217.
Purvis, Tristan M. 2008. A Linguistic and Discursive Analysis of Register Variation in Dagbani. PhD thesis. Bloomington: Indiana University.
Santini, Marina 2007. Automatic Identification of Genre in the Web Pages. PhD thesis. Brighton: University of Brighton.
Sardinha, Tony Berber; Kauffmann, Carlos; Acunzo, Cristina Mayer 2014. A multi-dimensional analysis of register variation in Brazilian Portuguese. – Corpora, vol. 9, no. 2, pp. 239-271.
Shakir, Muhammad; Deuber, Dagmar 2019. A multidimensional analysis of Pakistani and U.S. English blogs and columns. – English World-Wide, vol. 40, no. 1, pp. 1-23.
Sharoff, Serge 2010. In the garden and in the jungle. – Genres on the Web: Computational Models and Empirical Studies. (Text, Speech and Language Technology 42.) Eds. Alexander Mehler, Serge Sharoff, Marina Santini. Dordrecht: Springer Publishing Company, pp. 149-166.
Sharoff, Serge 2018. Functional text dimensions for the annotation of web corpora. – Corpora, vol. 13, no. 1, pp. 65-95.
Sinclair, John; Ball, Jackie 1996. Preliminary Recommendations on Text Typology. EAGLES Document EAG-TCWG-TTYP.
Sorokin, Alexey; Katinskaia, Anisia; Sharoff, Serge 2014. Associating symptoms with syndromes: Reliable genre annotation for a large Russian webcorpus. – Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue” (Bekasovo, June 4 – 8, 2014). Moscow: RGGU, pp. 646-658.
Zhang, Man 2016. A multidimensional analysis of metadiscourse markers across written registers. – Discourse Studies, vol. 18, no. 2, pp. 204-222.
Zipf, George K. 1949. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Cambridge, Massachusetts: Addison Wesley Press.