UD Cantonese HK
Language: Cantonese (code: yue)
Family: Sino-Tibetan
This treebank has been part of Universal Dependencies since the UD v2.1 release.
The following people have contributed to making this treebank part of UD: Kim Gerdes, John Lee, Herman Leung, Tak-sum Wong.
Repository: UD_Cantonese-HK
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.2
License: CC BY-SA 4.0
Genre: spoken
Questions, comments? General annotation questions (either Cantonese-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [tswong-c (æt) my • cityu • edu • hk; jsylee (æt) cityu • edu • hk]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.
Annotation | Source |
---|---|
Lemmas | annotated manually, natively in UD style |
UPOS | annotated manually, natively in UD style |
XPOS | not available |
Features | annotated manually, natively in UD style |
Relations | annotated manually, natively in UD style |
Description
The Cantonese-HK UD treebank was manually annotated by Tak-sum Wong and Herman H. M. Leung at City University of Hong Kong, by finely transcribing three films shot by students from the School of Creative Media. The data are in Traditional Chinese. These trees form a parallel treebank with those in Chinese-HK.
ORIGIN
Acknowledgments
This work was partially supported by a grant from the PROCORE-France/Hong Kong Joint Research Scheme sponsored by the Research Grants Council and the Consulate General of France in Hong Kong (Reference No.: F-CityU107/15 and N 35322RG); and by two Strategic Research Grants (Project No. 7004494 and No. 7004736) from City University of Hong Kong.
Statistics of UD Cantonese HK
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – VERB
Features
Relations
acl – advcl – advcl:coverb – advmod – advmod:df – amod – appos – aux – case – case:loc – cc – ccomp – clf – compound – compound:dir – compound:ext – compound:quant – compound:vo – compound:vv – conj – cop – csubj – det – discourse – discourse:sp – dislocated – flat – iobj – mark – mark:rel – nmod – nsubj – nsubj:periph – nummod – obj – obj:periph – obl – obl:agent – obl:patient – obl:tmod – parataxis – punct – reparandum – root – vocative – xcomp
Tokenization and Word Segmentation
- This corpus contains 650 sentences and 6264 tokens.
- This corpus contains 6263 tokens (100%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 4 types of words that contain both letters and punctuation. Examples: Yes!, Next-stop-is-Tin-Fu, angle-sum-of-triangle, last-touch
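Counts like the ones above can be recomputed from the treebank's CoNLL-U files. The following is a minimal sketch, not the script that generated this page: `count_tokens` and `tok` are hypothetical helpers, and the two-sentence sample is invented for illustration rather than taken from the corpus. It assumes the standard ten-column CoNLL-U layout, with `SpaceAfter=No` recorded in the MISC column.

```python
def count_tokens(conllu_text):
    """Return (sentences, tokens, tokens_not_followed_by_space)."""
    sentences = tokens = no_space = 0
    in_sentence = False
    for line in conllu_text.split("\n"):
        if not line.strip():
            in_sentence = False  # a blank line ends the current sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment line
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        if "SpaceAfter=No" in cols[9].split("|"):
            no_space += 1
    return sentences, tokens, no_space


def tok(idx, form, upos, head, deprel, misc="SpaceAfter=No"):
    """Build one CoNLL-U token line (hypothetical helper; lemma = form)."""
    return "\t".join([str(idx), form, form, upos, "_", "_",
                      str(head), deprel, "_", misc])


# Invented sample, for illustration only.
sample = "\n".join([
    "# sent_id = demo-1",
    tok(1, "我", "PRON", 2, "nsubj"),
    tok(2, "識", "VERB", 0, "root"),
    tok(3, "佢", "PRON", 2, "obj"),
    tok(4, "。", "PUNCT", 2, "punct"),
    "",
    "# sent_id = demo-2",
    tok(1, "你", "PRON", 2, "nsubj"),
    tok(2, "去", "VERB", 0, "root"),
    tok(3, "邊度", "PRON", 2, "obj"),
    tok(4, "?", "PUNCT", 2, "punct", misc="_"),
    "",
])

result = count_tokens(sample)
print(result)
```

For this sample the function reports 2 sentences and 8 tokens, 7 of which are not followed by a space. Only the last token lacks the `SpaceAfter=No` flag, which mirrors the 6263-of-6264 pattern reported above.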
Morphology
Tags
- This corpus uses 15 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB
- This corpus does not use the following tags: SYM, X
- This corpus contains 46 word types tagged as particles (PART): 㗎, 㗎喇, 㗎嘛, 下, 之, 之嘛, 份, 個, 先, 可, 吖, 吖嗱, 吖嘛, 呀, 呀馬, 呃, 呢, 呢個, 咋, 咋嘛, 咖嘛, 咩, 唧, 啦, 啫, 喇, 喎, 嗎, 嗱, 嘅, 嘅話, 嘛, 嘞, 嚟, 囉, 囖, 埋, 得, 晒, 的, 落, 褦, 話, , 𠿪, 𢰸
- This corpus contains 28 lemmas tagged as pronouns (PRON): 一個二個, 乜, 乜嘢, 人哋, 你, 你哋, 你自己, 佢, 佢哋, 依個, 個啲, 呢個, 呢啲, 呢度, 咩, 嗰個, 嗰啲, 嗰度, 嗰時, 大家, 幾多, 我, 我哋, 我自己, 自己, 邊, 邊個, 邊度
- This corpus contains 19 lemmas tagged as determiners (DET): 一部分, 下, 乜, 個個, 全, 其他, 冇乜, 另, 呢, 呢啲, 咩, 啲, 嗰, 嗰啲, 多, 好多, 幾多, 成, 每
- Out of the above, 5 lemmas occurred sometimes as PRON and sometimes as DET: 乜, 呢啲, 咩, 嗰啲, 幾多
- This corpus contains 27 lemmas tagged as auxiliaries (AUX): 中意, 住, 使, 係, 係咪, 冇, 可, 可以, 可能, 咗, 唔使, 唔好, 好, 得, 想, 應該, 會, 有, 緊, 肯, 要, 覺得, 該, 識, 識得, 過, 需要
- Out of the above, 14 lemmas occurred sometimes as AUX and sometimes as VERB: 中意, 住, 使, 係, 係咪, 冇, 唔使, 得, 想, 有, 要, 覺得, 識, 過
- This corpus does not use the VerbForm feature.
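The PRON/DET overlap reported in the list above is a set intersection over the two lemma lists. As a small illustration (not the generator script for this page), the sketch below recomputes it from the PRON and DET lemmas given in this section; the variable names are arbitrary.

```python
# Lemma lists copied from the PRON and DET bullets in this section.
pron = set(
    "一個二個 乜 乜嘢 人哋 你 你哋 你自己 佢 佢哋 依個 個啲 呢個 呢啲 呢度 "
    "咩 嗰個 嗰啲 嗰度 嗰時 大家 幾多 我 我哋 我自己 自己 邊 邊個 邊度".split()
)
det = set(
    "一部分 下 乜 個個 全 其他 冇乜 另 呢 呢啲 咩 啲 嗰 嗰啲 多 好多 "
    "幾多 成 每".split()
)

# Lemmas that appear under both tags.
both = sorted(pron & det)
print(len(pron), len(det), len(both), both)
```

The same intersection applied to the AUX and VERB lemma sets would give the AUX/VERB overlap, but the full VERB lemma list is not shown on this page.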
Nominal Features
Degree and Polarity
Verbal Features
Pronouns, Determiners, Quantifiers
Other Features
- NounType
  - Clf
    - NOUN: 個, 啲, 蚊, 年, 張, 條, 隻, 分, 日, 部
    - PART: 個
Syntax
Auxiliary Verbs and Copula
- This corpus uses 2 lemmas as copulas (cop). Examples: 係, 係咪.
- This corpus uses 24 lemmas as auxiliaries (aux). Examples: 咗, 要, 唔好, 過, 可以, 會, 中意, 冇, 想, 緊, 唔使, 得, 應該, 有, 覺得, 住, 識, 可, 可能, 識得, 希望, 肯, 該, 需要.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB--NOUN (59)
  - VERB--PRON (251)
- obj
  - VERB--NOUN (272)
  - VERB--NOUN-ADP(到) (1)
  - VERB--PRON (113)
- iobj
  - VERB--PRON (9)
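Pair counts of this kind can be tallied by looking up each dependent's head and comparing the two UPOS tags. A minimal sketch of that tally follows, under stated assumptions: `verb_argument_pairs` and the token dictionaries are hypothetical, the two sentences are invented for illustration, and the sketch ignores case-marked patterns such as VERB--NOUN-ADP(到).

```python
from collections import Counter

CORE_RELATIONS = ("nsubj", "obj", "iobj")


def verb_argument_pairs(sentences):
    """Count (relation, 'VERB--CHILD_UPOS') pairs for NOUN/PRON dependents."""
    counts = Counter()
    for sent in sentences:
        # Map each token id to its UPOS tag so heads can be looked up.
        upos_by_id = {t["id"]: t["upos"] for t in sent}
        for t in sent:
            if t["deprel"] not in CORE_RELATIONS:
                continue
            if t["upos"] not in ("NOUN", "PRON"):
                continue
            if upos_by_id.get(t["head"]) != "VERB":
                continue  # parent must be a verb
            counts[(t["deprel"], "VERB--" + t["upos"])] += 1
    return counts


# Invented sentences, for illustration only.
demo = [
    [
        {"id": 1, "form": "我", "upos": "PRON", "head": 2, "deprel": "nsubj"},
        {"id": 2, "form": "識", "upos": "VERB", "head": 0, "deprel": "root"},
        {"id": 3, "form": "佢", "upos": "PRON", "head": 2, "deprel": "obj"},
    ],
    [
        {"id": 1, "form": "你", "upos": "PRON", "head": 2, "deprel": "nsubj"},
        {"id": 2, "form": "睇", "upos": "VERB", "head": 0, "deprel": "root"},
        {"id": 3, "form": "書", "upos": "NOUN", "head": 2, "deprel": "obj"},
    ],
]

pairs = verb_argument_pairs(demo)
for (rel, pattern), n in sorted(pairs.items()):
    print(rel, pattern, n)
```

Filtering on the exact relation label means subtyped relations such as `nsubj:periph` are not counted here; whether the figures on this page include them is not stated.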
Relations Overview
- This corpus uses 15 relation subtypes: advcl:coverb, advmod:df, case:loc, compound:dir, compound:ext, compound:quant, compound:vo, compound:vv, discourse:sp, mark:rel, nsubj:periph, obj:periph, obl:agent, obl:patient, obl:tmod
- The following 6 relation types are not used in this corpus at all: expl, fixed, list, orphan, goeswith, dep
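Both figures above follow from the relation inventory listed under Relations. The sketch below recomputes them, assuming the 37 universal relations of UD v2 as the reference set; the variable names are illustrative.

```python
# Relations used in this corpus, copied from the Relations list on this page.
used = """
acl advcl advcl:coverb advmod advmod:df amod appos aux case case:loc cc ccomp
clf compound compound:dir compound:ext compound:quant compound:vo compound:vv
conj cop csubj det discourse discourse:sp dislocated flat iobj mark mark:rel
nmod nsubj nsubj:periph nummod obj obj:periph obl obl:agent obl:patient
obl:tmod parataxis punct reparandum root vocative xcomp
""".split()

# The 37 universal relations of UD v2 (assumed reference inventory).
universal = """
acl advcl advmod amod appos aux case cc ccomp clf compound conj cop csubj dep
det discourse dislocated expl fixed flat goeswith iobj list mark nmod nsubj
nummod obj obl orphan parataxis punct reparandum root vocative xcomp
""".split()

# A subtype is any label of the form base:subtype.
subtypes = sorted(r for r in used if ":" in r)

# A universal relation counts as used if it occurs bare or as the base of a subtype.
base_used = {r.split(":")[0] for r in used}
unused = sorted(set(universal) - base_used)

print(len(subtypes), subtypes)
print(len(unused), unused)
```

With these lists the sketch finds 15 subtypes and 6 unused universal relations (dep, expl, fixed, goeswith, list, orphan), matching the two bullets above.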