Place of statistics in a language model


Keywords: morphology, corpus linguistics, linguistic variation, text statistics

The article considers how quantitative data may fit into a theoretical model of language. It argues that a language model should include some idea, however speculative, of the generation procedure at play. One concrete example shows how quantitative data form an integral part of a model of Estonian morphology; another shows how corpus-based statistical models may lead to dubious statistical calculations. Descriptions of two earlier experiments in statistical learning point to a path worth following in corpus linguistics: paying more attention to less obvious features that play a role in human language learning, namely transitional probabilities and the linguistic units that should be left out of computations.

Heiki-Jaan Kaalep (b. 1962), PhD, University of Tartu, Senior Researcher, Heiki-Jaan.Kaalep@ut.ee

