
This page pertains to UD version 2.

UD Chinese CFL

Language: Chinese (code: zh)
Family: Sino-Tibetan

This treebank has been part of Universal Dependencies since the UD v2.1 release.

The following people have contributed to making this treebank part of UD: John Lee, Herman Leung, Keying Li.

Repository: UD_Chinese-CFL
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.2

License: CC BY-SA 4.0

Genre: learner-essays

Questions, comments? General annotation questions (either Chinese-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on Github. If you want to collaborate, please contact [keyingli3-c (æt) my • cityu • edu • hk, tswong-c (æt) my • cityu • edu • hk, jsylee (æt) cityu • edu • hk]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation	Source
Lemmas	not available
UPOS	annotated manually, natively in UD style
XPOS	not available
Features	not available
Relations	annotated manually, natively in UD style

Description

The Chinese-CFL UD treebank is manually annotated by Keying Li with minor manual revisions by Herman Leung and John Lee at City University of Hong Kong, based on essays written by learners of Mandarin Chinese as a foreign language. The data is in Simplified Chinese.

.CONLLUX (extension files)

[NOTE: This is a temporary measure for procedures whose descriptions are not yet available in the UD guidelines.]

Each .conllu file is accompanied by an additional .conllux file of the same name. The .conllux counterpart contains extra information not ordinarily stored in any of the 10 columns of the CoNLL-U format. The non-duplicate columns in .conllux for this treebank are columns 3 (distributional tag), 6 (distributional head), 7 (distributional relation), and 10 (alignment). [If the data in columns 3, 6, and 7 of the .conllux file are the same as their counterparts in .conllu, the distributional annotation is the same as the morphological annotation. For more information on “distributional” vs. “morphological” annotation, see the descriptions further below.]
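As a rough illustration of how the two files relate, the sketch below compares one token line from a .conllu file with the corresponding line from its .conllux counterpart and reports which of the listed columns differ. It assumes that both lines carry 10 tab-separated fields and that the column numbers given above are 1-based; neither assumption is confirmed by this page. The function name `differing_columns` and the placeholder rows are hypothetical and do not come from the treebank.

```python
from typing import Dict, Sequence, Tuple

# Column numbers named in the text; treating them as 1-based is an assumption.
EXTENSION_COLUMNS = (3, 6, 7, 10)


def differing_columns(
    conllu_line: str,
    conllux_line: str,
    columns: Sequence[int] = EXTENSION_COLUMNS,
) -> Dict[int, Tuple[str, str]]:
    """Return {column: (conllu_value, conllux_value)} for columns that differ."""
    base = conllu_line.rstrip("\n").split("\t")
    ext = conllux_line.rstrip("\n").split("\t")
    if len(base) != 10 or len(ext) != 10:
        raise ValueError("expected 10 tab-separated fields per token line")
    return {
        col: (base[col - 1], ext[col - 1])
        for col in columns
        if base[col - 1] != ext[col - 1]
    }


# Placeholder rows (not real treebank data): only fields 3 and 10 differ.
conllu_row = "\t".join(["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10"])
conllux_row = "\t".join(["c1", "c2", "X3", "c4", "c5", "c6", "c7", "c8", "c9", "X10"])
print(differing_columns(conllu_row, conllux_row))
```

An empty result for columns 3, 6, and 7 would correspond to the case described above, where the distributional and morphological annotations coincide.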

ALIGNMENTS

Alignments point to native-Chinese-speaker corrections (by Keying Li) of the learner sentences; where the corrected sentences will be stored is still to be determined. All sentences pertaining to the learner corpus have a sent_id beginning with CFL-; original learner sentences have the parallel-treebank extension /ori in the sent_id, whereas the corrected sentences have the extension /crr in the sent_id. Each alignment consists of the full sent_id followed by ‘#’ and the index of the aligned token. In a one-to-many alignment, the individual alignments are separated by commas (e.g. CFL_A_1-5/crr#5,CFL_A_1-5/crr#6 means the token is aligned to tokens 5 and 6 of the corrected (‘crr’) sentence of ‘CFL_A_1-5’).
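The alignment format described above can be read with a few lines of Python. This is a minimal sketch, not part of the treebank tooling: the names `Alignment` and `parse_alignments` are hypothetical, and treating an empty string or `_` as "no alignment" is an assumption borrowed from the usual CoNLL-U placeholder convention.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    sent_id: str      # full sent_id, including the /ori or /crr extension
    token_index: int  # index of the aligned token in that sentence


def parse_alignments(value: str) -> List[Alignment]:
    """Split an alignment value into (sent_id, token_index) pairs."""
    value = value.strip()
    if value in ("", "_"):  # assumption: these mean "no alignment"
        return []
    alignments = []
    for item in value.split(","):
        # The token index follows the last '#'; the rest is the sent_id.
        sent_id, _, index = item.strip().rpartition("#")
        alignments.append(Alignment(sent_id, int(index)))
    return alignments


print(parse_alignments("CFL_A_1-5/crr#5,CFL_A_1-5/crr#6"))
```

Splitting on the last ‘#’ rather than the first keeps the parser working even if a sent_id were to contain a ‘#’ itself.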

BASIC STATISTICS

Tree count: 451
Word count: 7256
Token count: 7256
Dep. relations: 45, of which 13 language-specific
POS tags: 15
Category=value feature pairs: 0

GENERAL COMMENTS

A “literal annotation” is preferred, i.e., one should annotate “as if the sentence were as syntactically well-formed as it can be, possibly ignoring meaning” (Ragheb and Dickinson, 2014).

WORD SEGMENTATION

Non-words are allowed only when there are spelling errors resulting from orthographic or phonetic confusion. An orthographic confusion must involve characters with similar appearance, e.g., between 了 and 子 in *花花公了.

Phonetic confusion must involve characters with the same pronunciation but different tones, e.g., between 關 and 管 in the sentence *不關多貴我也買, or characters whose pronunciations differ in an easily confused pair of sounds, such as {j, zh} or {x, sh}.

In these cases, the lemma of the misspelt word is its corrected version. For example, the lemma of *花花公了 is 花花公子, and the lemma of 不關 is 不管.

LEMMA

The lemma is the same as the word, except when the word contains a spelling error.

POS TAGGING

POS tagging is performed on the basis of the lemma, rather than the word. Hence, in the sentence *不關多貴我也買, 不關 is not tagged as VERB but rather as SCONJ, on account of its lemma 不管.

When determining the POS, one usually considers both the “morphological evidence”, i.e., the linguistic form of the word, as well as the “distributional evidence”, i.e., its syntactic use in the sentence. In a well-formed sentence, these two kinds of evidence should agree; in learner text, however, they may conflict (Ragheb and Dickinson, 2014).

Consider the word 可怕 kepa “scary” in the sentence 我可怕他 “I scary him”. Morphological evidence suggests the word 可怕 kepa “scary” should be tagged as an adjective (ADJ), reflecting its normal usage. Distributional evidence suggests it should be tagged as a verb, since the trailing pronoun 他 ta “him” implies its use as a verb with a direct object.

When these two kinds of evidence contradict one another, the morphological evidence prevails. The example sentence is thus tagged as:

我/PRON 可怕/ADJ 他/PRON

However, we also include the “distributional POS tag” in column 3 of the .conllux file.

DEPENDENCY RELATIONS

Missing words

When a word seems to be missing from the learner sentence, we annotate according to the UD guidelines on promotion by head elision. For example, in the sentence fragment 在中國最近幾年 zai zhongguo zuijin ji nian “in China recent few years”, we promote 年 nian “year” to be the root. Although both 中國 zhongguo “China” and 年 nian “year” would be obl dependents if a verb were present, 年 nian “year” is promoted because it is closer to the expected location of the verb.

Word-order errors

The annotation should assume no word-order error. Consider, for example, the sentence *我被了他打一頓. The aspect particle 了 le usually modifies the verb that immediately precedes it, and is probably misplaced in this sentence. It is most likely intended to modify 打 da “hit”, and should immediately follow 打 da rather than 被 bei, the passive marker.

To adhere to the principle of “literal annotation”, rather than annotating le as the child of 打 da “hit” with the aux relation, we annotate 了 le as the child of 被 bei with the dep relation.

dep (unspecified dependency)

When learner errors make it difficult to characterize the grammatical relation between a word and the rest of the sentence, we use the dep relation. Typically, when the POS tag differs from the distributional POS tag, the dep relation is needed.

Consider the sentence 我可怕他 “I scary him”. From the point of view of its POS tag, it is unclear how the word 可怕 kepa “scary”, as an adjective, relates to the pronoun. We thus consider kepa as the head of 他 ta “him” with the dep relation.

When a word has a different distributional POS tag, we also include a “distributional” dependency relation on the basis of the word’s distributional POS tag. This relation is stored in column 7 of the .conllux file, the distributional-relation column listed in the .CONLLUX section above. In the example sentence above, the word 可怕 kepa “scary”, as a verb, is the head of 他 ta “him” with the obj relation.
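The two annotation layers for the example 我可怕他 can be pictured as one record per dependency edge. The sketch below is only an illustration of that idea: the `Edge` class and its field names are hypothetical, and it does not reflect how the .conllu and .conllux files actually lay out their columns. The tag and relation values are the ones given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    head: str
    child: str
    head_upos: str       # morphological POS tag of the head
    head_dist_upos: str  # distributional POS tag of the head
    deprel: str          # relation under the morphological annotation
    dist_deprel: str     # relation under the distributional annotation

    def layers_disagree(self) -> bool:
        """True when the two annotation layers differ for this edge."""
        return (
            self.head_upos != self.head_dist_upos
            or self.deprel != self.dist_deprel
        )


# 可怕 is ADJ morphologically but VERB distributionally; 他 attaches to it
# as dep in the main annotation and as obj in the distributional one.
edge = Edge("可怕", "他", "ADJ", "VERB", "dep", "obj")
print(edge.layers_disagree())  # True
```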

REFERENCES

Marwa Ragheb and Markus Dickinson. 2014. Developing a Corpus of Syntactically-annotated Learner Language for English. Proceedings of the 13th International Workshop on Treebanks and Linguistic Theories (TLT).

Acknowledgments

This work is partially supported by a Strategic Research Grant (Project no. 7004494) from City University of Hong Kong.

Statistics of UD Chinese CFL

POS Tags

ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – VERB

Features

Relations

acl – advcl – advmod – advmod:df – amod – appos – aux – case – case:loc – cc – ccomp – clf – compound – compound:dir – compound:ext – compound:vo – compound:vv – conj – cop – csubj – dep – det – discourse – discourse:sp – dislocated – flat – iobj – mark – mark:adv – mark:rel – nmod – nsubj – nsubj:pass – nummod – obj – obl – obl:agent – obl:patient – obl:tmod – parataxis – punct – reparandum – root – vocative – xcomp

Tokenization and Word Segmentation

This corpus contains 451 sentences and 7256 tokens.

This corpus contains 7256 tokens (100%) that are not followed by a space.

This corpus does not contain words with spaces.

This corpus does not contain words that contain both letters and punctuation.

Morphology

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

This corpus uses 2 lemmas as copulas (cop). Examples: 是、就是.

This corpus uses 30 lemmas as auxiliaries (aux). Examples: 了、着、要、会、能、想、过、可以、没、应该、爱、得、敢、需要、可能、没有、不得、似乎、似的、喜欢、回、好像、宁愿、就、希望、必须、愿意、懒得、起来、这.

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

nsubj
- VERB--NOUN (126)
- VERB--NOUN-ADP(在) (1)
- VERB--PRON (393)

obj
- VERB--NOUN (394)
- VERB--PRON (90)

iobj
- VERB--NOUN (1)
- VERB--PRON (5)
- VERB--PRON-ADP(给) (1)

Relations Overview

This corpus uses 13 relation subtypes: advmod:df, case:loc, compound:dir, compound:ext, compound:vo, compound:vv, discourse:sp, mark:adv, mark:rel, nsubj:pass, obl:agent, obl:patient, obl:tmod
The following 5 relation types are not used in this corpus at all: expl, fixed, list, orphan, goeswith