
This page pertains to UD version 2.

UD Croatian SET

Language: Croatian (code: hr)
Family: Indo-European, Slavic

This treebank has been part of Universal Dependencies since the UD v1.1 release.

The following people have contributed to making this treebank part of UD: Željko Agić, Nikola Ljubešić, Daniel Zeman.

Repository: UD_Croatian-SET
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.2

License: CC BY-SA 4.0

Genre: news, web, wiki

Questions, comments? General annotation questions (either Croatian-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [zeljko • agic (æt) gmail • com]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation Source
Lemmas annotated manually in non-UD style, automatically converted to UD
UPOS annotated manually in non-UD style, automatically converted to UD
XPOS annotated manually
Features annotated manually in non-UD style, automatically converted to UD
Relations annotated manually, natively in UD style

Description

The Croatian UD treebank is based on the SETimes-HR corpus.

The sentences are partially parallel with the smaller Serbian UD treebank, which comes from the Serbian edition of SETimes. For the CoNLL 2018 shared task in parsing (and for UD release 2.2), the Croatian corpus was re-split so that corresponding sentences are in the same section (train/dev/test) in Croatian and Serbian. The re-split had to be done on the Croatian side because the Serbian corpus is smaller and most of it corresponds to what used to be training data in Croatian.

For the time being, sentence ids have not been changed although they contain references to train/dev/test. Therefore it is now possible that e.g. sentence id “train-s2852” occurs in the development data, not in training data. This may be changed in future releases.

Also note that the following description of data split and sources refers to the old data split. Thus, sentences 0001-3557 of the “training set” have ids “train-s1” to “train-s3557” but some of them are now in the dev file and some in the test file.
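
Because the ids no longer match the file they appear in, it can be useful to check which old-split prefixes a given file actually contains. The following is a minimal sketch, assuming the standard CoNLL-U comment line `# sent_id = ...` and ids of the form "train-s2852" as quoted above. The function name `sent_id_prefixes` and the sample sentences (including the id "dev-s17") are made up for illustration and are not taken from the treebank.

```python
import re
from collections import Counter

# Matches CoNLL-U sentence-level comments such as "# sent_id = train-s2852".
SENT_ID_RE = re.compile(r"^#\s*sent_id\s*=\s*(\S+)\s*$")


def sent_id_prefixes(conllu_text):
    """Count sentence-id prefixes (the part before '-s') in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        match = SENT_ID_RE.match(line)
        if match:
            # "train-s2852" -> "train"; assumes the "<section>-s<number>" shape.
            prefix = match.group(1).split("-s", 1)[0]
            counts[prefix] += 1
    return counts


# Hypothetical two-sentence sample; real files would be read from disk.
sample = """# sent_id = train-s2852
# text = Primjer.
1\tPrimjer\tprimjer\tNOUN\t_\t_\t0\troot\t_\t_
2\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_

# sent_id = dev-s17
# text = Drugi.
1\tDrugi\tdrugi\tADJ\t_\t_\t0\troot\t_\t_
"""

if __name__ == "__main__":
    print(sent_id_prefixes(sample))
```

Running this over the development file would show how many of its sentences still carry a "train" prefix from the old split.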

Training set.

Contains 7,689 sentences (169,283 tokens) from three sources:

  1. Sentences 0001-3557: Newspaper text from the Southeast European Times news website, obtained from the SETimes parallel corpus. This part of the treebank is built on top of the SETimes.HR dependency treebank of Croatian.
  2. Sentences 3558-5792: Text from various Croatian web sources.
  3. Sentences 5793-7689: Croatian news web sources.

Development set.

Contains 600 sentences (14,533 tokens) from two sources:

  1. 001-200: newspaper text from the Croatian SETimes,
  2. 201-600: Croatian news web sources.

Test set.

Contains 600 sentences (13,228 tokens) from four sources:

  1. sentences 001-100: newspaper text,
  2. sentences 101-200: Wikipedia,
  3. sentences 201-297: web sources, and
  4. sentences 298-600: Croatian news web sources.
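
The old-split ranges listed above can be encoded as a small lookup table, which makes it possible to recover the text source of a sentence from its id even after the re-split. This is a sketch only: the table `OLD_SPLIT_SOURCES` and the function `old_split_source` are hypothetical names, the source labels are shortened from the lists above, and it assumes that the number in an id such as "train-s2852" is the sentence's position in the old split.

```python
# Old-split ranges (inclusive) and their sources, transcribed from the lists above.
OLD_SPLIT_SOURCES = {
    "train": [
        (1, 3557, "SETimes newspaper text"),
        (3558, 5792, "various Croatian web sources"),
        (5793, 7689, "Croatian news web sources"),
    ],
    "dev": [
        (1, 200, "SETimes newspaper text"),
        (201, 600, "Croatian news web sources"),
    ],
    "test": [
        (1, 100, "newspaper text"),
        (101, 200, "Wikipedia"),
        (201, 297, "web sources"),
        (298, 600, "Croatian news web sources"),
    ],
}


def old_split_source(section, number):
    """Return the source label for sentence `number` of the old `section`."""
    for first, last, source in OLD_SPLIT_SOURCES[section]:
        if first <= number <= last:
            return source
    raise ValueError(f"sentence {number} is outside the old {section} set")


if __name__ == "__main__":
    # The id "train-s2852" falls in the first training range.
    print(old_split_source("train", 2852))
```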

Details

Sentence and word segmentation was manually checked. The treebank does not include multiword tokens. No language-specific features and relations were used. The POS tags and features were converted from Multext East v4 and manually checked. The syntactic annotation was done manually.
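
The statement that the treebank contains no multiword tokens can be checked mechanically. In the CoNLL-U format, a multiword token is a line whose first column is an id range such as "1-2". The sketch below counts such lines; the function name `count_multiword_tokens` and the sample lines are illustrative assumptions, not part of the treebank or its tools.

```python
import re

# In CoNLL-U, multiword tokens have a range id such as "1-2" in the first column.
MWT_ID_RE = re.compile(r"^\d+-\d+$")


def count_multiword_tokens(conllu_text):
    """Count multiword-token lines (range ids) in CoNLL-U text."""
    count = 0
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        token_id = line.split("\t", 1)[0]
        if MWT_ID_RE.match(token_id):
            count += 1
    return count


# Hypothetical sample without multiword tokens.
plain_sample = "1\tPrimjer\tprimjer\tNOUN\t_\t_\t0\troot\t_\t_\n"

if __name__ == "__main__":
    print(count_multiword_tokens(plain_sample))
```

For a file from this treebank, the count is expected to be zero if the description above holds.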

Acknowledgments

When using the Croatian UD treebank, please cite the following paper:

See file LICENSE.txt for further licensing information.

Statistics of UD Croatian SET

POS Tags

ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X

Features

Animacy, Case, Definite, Degree, Gender, Gender[psor], Mood, Number, Number[psor], NumType, Person, Polarity, Poss, PronType, Reflex, Tense, VerbForm, Voice

Relations

acl, advcl, advmod, advmod:emph, amod, appos, aux, aux:pass, case, cc, ccomp, compound, conj, cop, csubj, csubj:pass, dep, det, discourse, dislocated, expl, expl:pv, fixed, flat, flat:foreign, goeswith, iobj, list, mark, nmod, nsubj, nsubj:pass, nummod, obj, obl, orphan, parataxis, punct, root, vocative, xcomp

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

Reflexive Verbs

Verbs with Reflexive Core Objects

Relations Overview