Journal Slovo a slovesnost

Změny v morfologické anotaci korpusů řady SYN: nové možnosti zkoumání české gramatiky a lexikonu

Jan Křivan, Jana Šindlerová



Changes in the morphological annotation of the SYN series corpora: new possibilities for researching Czech grammar and lexicon

This paper introduces some major conceptual enhancements to the morphological annotation of the SYN series corpora of the Czech National Corpus. Apart from minor changes in tokenization and in the positional tagset, three major conceptual changes have been applied which affect the representation of various lexical and grammatical patterns. We present the actual impact of these changes on linguistic data and on search possibilities in three linguistic areas. First, the treatment of phonic, graphemic, and morphological variants via a two-tier lemma structure is discussed. Second, a new approach to periphrastic verb forms, auxiliaries, and participles, and to the interpretation of verbal grammatical categories through a new attribute called verbtag, is explained. Third, a complex multi-value treatment of multiword tokens is introduced.

Key words: lemmatization, tokenization, morphological annotation, verbal morphology, lemma variants
Klíčová slova: lemmatizace, tokenizace, morfologická anotace, slovesná morfologie, varianty lemmatu

This article is available online in the CEEOL database.

Ústav teoretické a komputační lingvistiky FF UK
Celetná 13, 110 00 Praha 1

Slovo a slovesnost, volume 83 (2022), number 2, pp. 122–145
