From a collection of dictionaries to a language portal

https://doi.org/10.54013/kk764a6

Keywords: e-lexicography, corpora, dictionary writing system, language portal, data model, API

This article describes major changes that have taken place in e-lexicography in recent decades, in Europe generally and in Estonia in particular. Digital change has permeated not only dictionary compilation but also the whole workflow, from the creation of lexicographic content to its publication. The focus has shifted from building individual dictionaries to building a central database and infrastructure that can be adapted for further user-facing and NLP applications.

We describe methods and technologies for integrating lexicographic data more closely (several tools have been developed within the Horizon 2020 project European Lexicographic Infrastructure) and for improving access to lexicographic information.

We consider the turning point for digital change in Estonian lexicography to be 2017, when development began on the new dictionary writing system Ekilex and its user interface Sõnaveeb. The long-term goal is to have a single data source that provides consistent information about the Estonian language. In connection with Ekilex and Sõnaveeb, we discuss several issues: the theoretical foundations of the EKI Combined Dictionary, which is the largest lexicographic dataset in Ekilex; improvements in the lexicographic workflow; and the Ekilex data model and API. The EKI Combined Dictionary contains information layers imported from several monolingual explanatory dictionaries, bilingual dictionaries, a collocations dictionary, and an etymology and morphology database. The workflow improvements include working in one general database, closer cooperation between research groups in the institute, and more active involvement of external users.

The Ekilex data model meets the requirements for treating both words and meanings as independent entities and for representing both semasiological and onomasiological data. Created data are stored in Ekilex’s PostgreSQL database and comply with all current standards of data exchange. As of April 2021, Ekilex contains approx. 300,000 headwords from general-language dictionaries and more than 90 terminological databases.
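The idea of treating words and meanings as independent entities can be illustrated with a minimal Python sketch. It is not the actual Ekilex schema: the class names (`Word`, `Meaning`, `Lexicon`), the fields, and the sample entries are all hypothetical and chosen only for illustration. Words and meanings are stored separately and connected by word-meaning links, so the same data can be queried semasiologically (from a word to its meanings) and onomasiologically (from a meaning to the words that express it).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Word:
    """A word form in some language; it carries no meaning by itself."""
    word_id: int
    value: str
    lang: str


@dataclass(frozen=True)
class Meaning:
    """A meaning (concept) that exists independently of any word."""
    meaning_id: int
    definition: str


@dataclass
class Lexicon:
    """Hypothetical container: words, meanings, and many-to-many links."""
    words: dict[int, Word] = field(default_factory=dict)
    meanings: dict[int, Meaning] = field(default_factory=dict)
    links: set[tuple[int, int]] = field(default_factory=set)  # (word_id, meaning_id)

    def add_word(self, word: Word) -> None:
        self.words[word.word_id] = word

    def add_meaning(self, meaning: Meaning) -> None:
        self.meanings[meaning.meaning_id] = meaning

    def link(self, word_id: int, meaning_id: int) -> None:
        # Both ends must already exist as independent entities.
        if word_id not in self.words or meaning_id not in self.meanings:
            raise KeyError("unknown word or meaning")
        self.links.add((word_id, meaning_id))

    def meanings_of(self, value: str, lang: str) -> list[Meaning]:
        """Semasiological view: from a word to its meanings."""
        word_ids = {w.word_id for w in self.words.values()
                    if w.value == value and w.lang == lang}
        ids = sorted(m for (w, m) in self.links if w in word_ids)
        return [self.meanings[m] for m in ids]

    def words_for(self, meaning_id: int) -> list[Word]:
        """Onomasiological view: from a meaning to the words expressing it."""
        ids = sorted(w for (w, m) in self.links if m == meaning_id)
        return [self.words[w] for w in ids]


def build_demo() -> Lexicon:
    """Build a tiny illustrative lexicon (sample data, not taken from Ekilex)."""
    lex = Lexicon()
    lex.add_word(Word(1, "keel", "est"))
    lex.add_word(Word(2, "language", "eng"))
    lex.add_word(Word(3, "tongue", "eng"))
    lex.add_meaning(Meaning(1, "system of communication"))
    lex.add_meaning(Meaning(2, "muscular organ in the mouth"))
    for word_id, meaning_id in [(1, 1), (1, 2), (2, 1), (3, 2)]:
        lex.link(word_id, meaning_id)
    return lex


if __name__ == "__main__":
    demo = build_demo()
    print([m.definition for m in demo.meanings_of("keel", "est")])
    print([w.value for w in demo.words_for(1)])
```

In this sketch a general-language entry starts from `meanings_of`, while a terminological entry starts from `words_for`; both read the same links, which is the property the abstract attributes to the data model.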

 

Margit Langemets (b. 1961), PhD, Institute of the Estonian Language, Leading Lexicographer (Roosikrantsi 6, 10119 Tallinn), margit.langemets@eki.ee

Kristina Koppel (b. 1985), PhD, Institute of the Estonian Language, Senior Computational Lexicographer (Roosikrantsi 6, 10119 Tallinn), kristina.koppel@eki.ee

Jelena Kallas (b. 1976), PhD, Institute of the Estonian Language, Senior Computational Lexicographer (Roosikrantsi 6, 10119 Tallinn), jelena.kallas@eki.ee

Arvi Tavast (b. 1969), PhD, Institute of the Estonian Language, Director (Roosikrantsi 6, 10119 Tallinn), arvi.tavast@eki.ee
