Estonian Interlanguage Corpus


The article introduces the first version of the Estonian Interlanguage Corpus (EIC) of Tallinn University. It surveys the corpus structure, multilevel statistics, corpus annotation, the linguistic error taxonomy, the query system, options for the automatic analysis of Estonian learner language (morphological and syntactic analysis, n-grams), and current EIC-based research.

EIC is a resource consisting of Estonian texts written by learners of Estonian as an official and foreign language. The corpus has hitherto provided material for empirical and applied research on morphosyntactic usage patterns and lexical variation of Estonian, the morphosyntactic complexity and lexical richness of learner language, developments in the Estonian language system, gradual development of language skills and CEFR proficiency levels, error and contrastive analysis (Estonian, Russian, Finnish morphology), and cluster analysis. EIC is a monitor corpus containing over three million word forms.

The major direction of research is the comparative corpus analysis of standard Estonian and learner Estonian in the domain of morphosyntax, with a focus on patterns of language usage. This has been a conscious choice: regularly used multi-component language structures have a definite place in the text creation process of native speakers and learners alike, and the comparative analysis of the resulting texts has heuristic value for understanding Estonian grammar. The results can also benefit applications for the automatic analysis of learner language. Regardless of the starting point of system development, the central issue remains linguistic: how people typically combine words, morphology and linguistic structures or, in other words, what the connection is between the semantics and morphology of a word and the textual functions of its morphology.
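
As an illustration of the comparative n-gram analysis mentioned above, the following minimal Python sketch contrasts relative bigram frequencies in a learner sample and a reference sample. It is not the EIC tooling: the function names (`ngrams`, `relative_freqs`, `compare`) and the two toy samples are hypothetical and were invented for this example, and tokenisation is reduced to whitespace splitting with sentence boundaries ignored.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_freqs(tokens, n):
    """Map each n-gram to its share of all n-grams in the sample."""
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def compare(learner_tokens, reference_tokens, n=2):
    """Rank n-grams by the absolute difference in relative frequency.

    A positive difference means the n-gram is relatively more frequent
    in the learner sample; a negative one, in the reference sample.
    """
    learner = relative_freqs(learner_tokens, n)
    reference = relative_freqs(reference_tokens, n)
    diffs = {
        gram: learner.get(gram, 0.0) - reference.get(gram, 0.0)
        for gram in set(learner) | set(reference)
    }
    # Largest absolute difference first; ties broken alphabetically.
    return sorted(diffs.items(), key=lambda kv: (-abs(kv[1]), kv[0]))


# Invented toy samples (not corpus data): the learner sample uses the
# nominative "kool" where the reference uses the illative "kooli".
learner_sample = "ma lähen kool ja ma lähen kool".split()
reference_sample = "ma lähen kooli ja ma lähen koju".split()

result = compare(learner_sample, reference_sample)
for gram, diff in result[:3]:
    print(" ".join(gram), round(diff, 3))
```

In practice such a comparison would be run on lemmas or morphological tags rather than raw word forms, and raw frequency differences would be replaced by a statistical association measure; the sketch only shows the basic shape of the comparison.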