UD Japanese GSD
Language: Japanese (code: ja)
Family: Japanese
This treebank has been part of Universal Dependencies since the UD v1.4 release.
The following people have contributed to making this treebank part of UD: Hiroshi Kanayama, Masayuki Asahara, Yusuke Miyao, Takaaki Tanaka, Ryan McDonald, Joakim Nivre, Daniel Zeman, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu.
Repository: UD_Japanese-GSD
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.2
License: CC BY-NC-SA 3.0 US
Genre: news, blog
Questions, comments? General annotation questions (either Japanese-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [hkana (æt) jp • ibm • com]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
Annotation | Source |
---|---|
Lemmas | assigned by a program, with some manual corrections, but not a full manual verification |
UPOS | annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion |
XPOS | annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion |
Features | not available |
Relations | annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion |
Description
This Universal Dependencies (UD) Japanese treebank follows the UD Japanese conventions described in the UD documentation. The original sentences are from Google UDT 2.0.
The Japanese UD treebank contains the sentences from Google Universal Dependency Treebanks v2.0 (legacy): https://github.com/ryanmcd/uni-dep-tb. First, Google UDT v2.0 was converted to UD style with bunsetsu-based word units (referred to here as the “master” corpus).
The word units in the “master” corpus differ significantly from the definition in the documentation, which is based on the Short Word Unit (SWU) [1], so the sentences were automatically re-processed by Hiroshi Kanayama in February 2017. The result is UD_Japanese v2.0, which was used in the CoNLL 2017 shared task. In November 2017, UD_Japanese v2.0 was merged with the “master” data so that the manual dependency annotations could be reflected in the corpus. This reduced the errors in the dependency structures and relation labels.
Slight differences in word units remain between UD_Japanese v2.1 and UD_Japanese-KTC 1.3. Manual segmentation work by Masayuki Asahara's group is ongoing, with the aim of resolving the divergence between the two Japanese treebanks in the future.
Acknowledgments
The original treebank was provided by:
- Adam LaMontagne
- Milan Souček
- Timo Järvinen
- Alessandra Radici
via
- Dan Zeman.
The corpus was converted by:
- Hiroshi Kanayama
through discussion and validation with
- Yusuke Miyao
- Masayuki Asahara
- Takaaki Tanaka
- Yuji Matsumoto
- Shinsuke Mori
- Sumire Uematsu
Statistics of UD Japanese GSD
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB
Features
Relations
acl – advcl – advmod – amod – aux – case – cc – ccomp – compound – cop – csubj – dep – det – fixed – iobj – mark – nmod – nsubj – nummod – obj – obl – punct – root
Tokenization and Word Segmentation
- This corpus contains 8195 sentences and 184348 tokens.
- This corpus contains 183692 tokens (100%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 77 types of words that contain both letters and punctuation. Examples: 、と, 、という, SETI@home, ”と, が、, ら・むうん, スター・ウォーズ, ルイ・ヴィトン, (株), )し, A&M, A.T, D.C.I, E.T, IT'SFRIDAY, J.Z, Jr., L'Arc, L'Orateurdu, L.E.D, O'Malley, PaulKantner's, S&P, ZYX-α, http://en.wikipedia.org/wiki/Acute_intermittent_porphyria, ”する, ”に, 、が, 、で, 、といった, 、など, 、の, 、を, 」し, アテナ&ロビケロッツ, アル・パチーノ, アンディ・ウォーホル, アンドレ・アガシ, イー・モバイル, ウォール・ストリート・ジャーナル, エル・ドラード, エール・フランス, オードリー・ヘップバーン, カール・ツァイス, クリーブランド・ブラウンズ, ゴールドマン・サックス, サム・シェパード, シラノ・ド・ベルジュラック, ジェームズ・ブキャナン, ジェームズ・ワトソン
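Counts like the sentence, token, and "not followed by a space" figures above can be approximated from the treebank's CoNLL-U files. The following is a minimal sketch, not the script that produced this page. It assumes the standard 10-column CoNLL-U layout and that a missing following space is marked with `SpaceAfter=No` in the MISC column. The function name `corpus_stats` and the one-sentence sample are made up for illustration and are not taken from the treebank.

```python
def corpus_stats(conllu_text):
    """Count sentences, tokens, and tokens marked SpaceAfter=No."""
    sentences = tokens = no_space_after = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            in_sentence = False          # a blank line ends the sentence
            continue
        if line.startswith("#"):
            continue                     # sentence-level comment
        cols = line.split("\t")
        # Keep plain word lines only; skip ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        if "SpaceAfter=No" in cols[9].split("|"):
            no_space_after += 1
    return {"sentences": sentences, "tokens": tokens,
            "no_space_after": no_space_after}


# Hypothetical sample sentence (illustration only, not from the treebank).
ROWS = [
    ("1", "これ", "これ", "PRON", "_", "_", "3", "nsubj", "_", "SpaceAfter=No"),
    ("2", "は", "は", "ADP", "_", "_", "1", "case", "_", "SpaceAfter=No"),
    ("3", "本", "本", "NOUN", "_", "_", "0", "root", "_", "SpaceAfter=No"),
    ("4", "だ", "だ", "AUX", "_", "_", "3", "cop", "_", "SpaceAfter=No"),
    ("5", "。", "。", "PUNCT", "_", "_", "3", "punct", "_", "_"),
]
SAMPLE = "# text = これは本だ。\n" + "\n".join("\t".join(r) for r in ROWS) + "\n\n"

stats = corpus_stats(SAMPLE)
print(stats)
```

Run over the real train, dev, and test files, the totals may differ slightly from the figures above, depending on how multiword tokens and empty nodes are treated.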
Morphology
Tags
- This corpus uses 16 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB
- This corpus does not use the following tags: X
- This corpus contains 61 word types tagged as particles (PART): +, -, ~, およそ, か, かしらん, かどうか, か否か, がな, さ, ぞ, ぞお, な, なぁ, なあ, なー, ね, ねえ, の, のよ, ほぼ, よ, よぉ, よー, わ, ナンバー, ベスト, マイナス, マッハ, 丸, 人口, 全長, 分の, 南緯, 各, 同, 夜, 対, 平成, 年, 延べ, 昭和, 最低, 最多, 最大, 最高, 残り, 毎時, 直径, 第, 築, 紀元前, 約, 総計, 翌, 計, 金, 長さ, 高さ, 齢, ~
- This corpus contains 55 lemmas tagged as pronouns (PRON): あちこち, あちら, あなた, あれ, いずれ, おまえ, おめーら, お前, かれ, ここ, こちら, この方, これ, これら, そこ, そちら, その他, それ, それぞれ, それら, どこ, どちら, どなた, どれ, なん, ぼく, みなさま, みなさん, みんな, わしゃ, わたし, われわれ, 他所, 何か, 何処, 俺, 僕, 僕ら, 君, 奴, 彼, 彼ら, 彼女, 彼女たち, 彼方, 彼等, 我々, 手前, 皆, 皆さま, 皆さん, 皆様, 私, 私たち, 誰
- This corpus contains 4 lemmas tagged as determiners (DET): あの, この, その, どの
- This corpus contains 118 lemmas tagged as auxiliaries (AUX): あう, あげる, ある, いく, いける, いただく, いらっしゃる, いる, う, うる, える, おく, おる, かける, かす, かねる, かもしれる, かも知れる, がたい, がちだ, がる, きる, くださる, くれる, ける, げだ, げる, こむ, ございる, させる, ざるをえる, ざるを得る, しまう, すぎる, する, せる, そうだ, た, たい, たっ, たら, たろ, たー, だ, だす, だめ, ちゃう, っぱい, っぱなし, っぽい, つづける, づめる, づらい, て, てる, できる, でした, ない, なければ, なさる, なら, なる, にくい, ぬく, ね, はじめる, ふうだ, べし, べる, ほしい, ま~す, まい, まいる, まう, ます, みせる, みたいだ, みる, める, もらう, もらえる, やすい, やっ, やる, ゆく, よい, よう, ようだ, らしい, らす, らるる, られる, れる, わす, 下さる, 出す, 出来る, 切る, 化, 参る, 合う, 回る, 始める, 尽くす, 得る, 易い, 来る, 欲しい, 済み, 直す, 終わる, 続ける, 良い, 行く, 込む, 過ぎる, 難い, 頂く
- Out of the above, 48 lemmas occurred sometimes as AUX and sometimes as VERB: あう, あげる, ある, いく, いける, いただく, いらっしゃる, いる, おく, おる, かける, きる, くださる, くれる, しまう, すぎる, する, せる, だす, つづける, できる, なさる, なる, はじめる, ます, みせる, みる, もらう, やる, ゆく, 下さる, 出す, 出来る, 切る, 化, 参る, 合う, 回る, 始める, 尽くす, 得る, 来る, 直す, 終わる, 続ける, 行く, 過ぎる, 頂く
- This corpus does not use the VerbForm feature.
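The tag inventory and the AUX/VERB overlap listed above could be recomputed along the following lines. This is a sketch under assumptions: it reads the LEMMA and UPOS columns of standard 10-column CoNLL-U input, the helper names (`to_conllu`, `parse_words`, `tag_summary`) are hypothetical, and the two short sample sentences are invented for illustration rather than taken from the treebank.

```python
# The 17 universal part-of-speech tags defined by UD.
UPOS_TAGS = ("ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
             "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X")


def to_conllu(sentences):
    """Serialize lists of 10-column rows as CoNLL-U text (illustration helper)."""
    return "".join("\n".join("\t".join(row) for row in s) + "\n\n"
                   for s in sentences)


def parse_words(conllu_text):
    """Yield (lemma, upos) for every plain word line."""
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) == 10 and cols[0].isdigit():
            yield cols[2], cols[3]


def tag_summary(conllu_text):
    """Return (used tags, unused tags, lemmas seen as both AUX and VERB)."""
    lemmas_by_tag = {}
    for lemma, upos in parse_words(conllu_text):
        lemmas_by_tag.setdefault(upos, set()).add(lemma)
    used = [t for t in UPOS_TAGS if t in lemmas_by_tag]
    unused = [t for t in UPOS_TAGS if t not in lemmas_by_tag]
    both = sorted(lemmas_by_tag.get("AUX", set()) & lemmas_by_tag.get("VERB", set()))
    return used, unused, both


# Hypothetical sample: いる appears once as VERB and once as AUX.
SAMPLE = to_conllu([
    [("1", "猫", "猫", "NOUN", "_", "_", "3", "nsubj", "_", "SpaceAfter=No"),
     ("2", "が", "が", "ADP", "_", "_", "1", "case", "_", "SpaceAfter=No"),
     ("3", "いる", "いる", "VERB", "_", "_", "0", "root", "_", "_")],
    [("1", "食べ", "食べる", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"),
     ("2", "て", "て", "SCONJ", "_", "_", "1", "mark", "_", "SpaceAfter=No"),
     ("3", "いる", "いる", "AUX", "_", "_", "1", "aux", "_", "_")],
])

used, unused, both = tag_summary(SAMPLE)
print(len(used), "tags used;", "AUX/VERB overlap:", both)
```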
Nominal Features
Degree and Polarity
Verbal Features
Pronouns, Determiners, Quantifiers
- Card
  - NUM: 1, 2, 3, 4, 一, 5, 10, 6, 二, 7
Other Features
Syntax
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop): だ.
- This corpus uses 117 lemmas as auxiliaries (aux). Examples: た, する, いる, れる, だ, ます, ない, こと, ようだ, られる, なる, おる, せる, う, 来る, できる, たい, しまう, くれる, いく, そうだ, ける, たら, もらう, みる, える, べし, なら, やすい, くださる, 続ける, いただく, らしい, める, でした, 始める, 行く, かもしれる, みたいだ, 出す, 出来る, 合う, すぎる, 頂く, ある, ちゃう, 込む, おく, なければ, もらえる.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB--NOUN-ADP(から)-ADP(が) (3)
  - VERB--NOUN-ADP(が) (2519)
  - VERB--NOUN-ADP(が)-ADP(に) (1)
  - VERB--NOUN-ADP(だけ)-ADP(が) (4)
  - VERB--NOUN-ADP(だけ)-ADP(は) (3)
  - VERB--NOUN-ADP(など)-ADP(が) (35)
  - VERB--NOUN-ADP(など)-ADP(は) (7)
  - VERB--NOUN-ADP(のみ)-ADP(が) (1)
  - VERB--NOUN-ADP(は) (1882)
  - VERB--NOUN-ADP(まで)-ADP(が) (2)
  - VERB--NOUN-ADP(まで)-ADP(も)-ADP(が) (1)
  - VERB--PRON-ADP(が) (51)
  - VERB--PRON-ADP(は) (116)
- obj
  - VERB--NOUN-ADP(だけ)-ADP(を) (2)
  - VERB--NOUN-ADP(と)-ADP(を) (3)
  - VERB--NOUN-ADP(という)-ADP(を) (1)
  - VERB--NOUN-ADP(といった)-ADP(を) (1)
  - VERB--NOUN-ADP(とか)-ADP(を) (1)
  - VERB--NOUN-ADP(など)-ADP(を) (53)
  - VERB--NOUN-ADP(に当たる)-ADP(を) (1)
  - VERB--NOUN-ADP(のみ)-ADP(を) (5)
  - VERB--NOUN-ADP(ほど)-ADP(を) (1)
  - VERB--NOUN-ADP(まで)-ADP(を) (1)
  - VERB--NOUN-ADP(を) (4504)
  - VERB--NOUN-ADP(を)-ADP(と)-ADP(も) (2)
  - VERB--NOUN-ADP(を)-ADP(はじめ) (4)
  - VERB--NOUN-ADP(を)-ADP(も) (2)
  - VERB--PRON-ADP(まで)-ADP(を) (1)
  - VERB--PRON-ADP(を) (84)
- iobj
  - VERB--NOUN-ADP(くらい)-ADP(に) (3)
  - VERB--NOUN-ADP(だけ)-ADP(に) (2)
  - VERB--NOUN-ADP(と)-ADP(に) (1)
  - VERB--NOUN-ADP(など)-ADP(に) (21)
  - VERB--NOUN-ADP(など)-ADP(に)-ADP(は) (1)
  - VERB--NOUN-ADP(など)-ADP(に)-ADP(も) (3)
  - VERB--NOUN-ADP(なり)-ADP(に) (1)
  - VERB--NOUN-ADP(に) (3530)
  - VERB--NOUN-ADP(に)-ADP(しか) (2)
  - VERB--NOUN-ADP(に)-ADP(だけ)-ADP(で)-ADP(も) (1)
  - VERB--NOUN-ADP(に)-ADP(だけ)-ADP(は) (2)
  - VERB--NOUN-ADP(に)-ADP(と) (1)
  - VERB--NOUN-ADP(に)-ADP(は) (393)
  - VERB--NOUN-ADP(に)-ADP(は)-ADP(と) (1)
  - VERB--NOUN-ADP(に)-ADP(まで) (10)
  - VERB--NOUN-ADP(に)-ADP(も) (107)
  - VERB--NOUN-ADP(のみ)-ADP(に) (1)
  - VERB--NOUN-ADP(は)-ADP(に) (1)
  - VERB--NOUN-ADP(ほど)-ADP(に) (1)
  - VERB--NOUN-ADP(まで)-ADP(に) (23)
  - VERB--NOUN-ADP(まで)-ADP(に)-ADP(は) (1)
  - VERB--PRON-ADP(に) (72)
  - VERB--PRON-ADP(に)-ADP(しか) (1)
  - VERB--PRON-ADP(に)-ADP(は) (10)
  - VERB--PRON-ADP(に)-ADP(も) (6)
  - VERB--PRON-ADP(まで)-ADP(に) (3)
  - VERB--PRON-ADP(まで)-ADP(に)-ADP(も) (1)
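The pattern notation used in the lists above (parent tag, child tag, then the child's case markers) could be reproduced roughly as follows. This is a sketch under stated assumptions, not the script behind this page: it assumes that the markers are the ADP dependents attached to the noun or pronoun with the `case` relation, that they are listed in surface order, and that the lemma is what appears in parentheses. The function names and the sample sentence are hypothetical.

```python
from collections import Counter


def read_sentences(conllu_text):
    """Yield each sentence as a list of word dicts (plain word lines only)."""
    sentence = []
    for line in conllu_text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        sentence.append({"id": int(cols[0]), "lemma": cols[2], "upos": cols[3],
                         "head": int(cols[6]), "deprel": cols[7]})
    if sentence:
        yield sentence


def case_patterns(conllu_text, relations=("nsubj", "obj", "iobj")):
    """Count (relation, pattern) pairs for VERB parents with NOUN/PRON children."""
    counts = Counter()
    for sent in read_sentences(conllu_text):
        by_id = {w["id"]: w for w in sent}
        for child in sent:
            if child["deprel"] not in relations:
                continue
            if child["upos"] not in ("NOUN", "PRON"):
                continue
            parent = by_id.get(child["head"])
            if parent is None or parent["upos"] != "VERB":
                continue
            # Assumption: case markers are ADP dependents of the child.
            markers = [w for w in sent
                       if w["head"] == child["id"]
                       and w["deprel"] == "case" and w["upos"] == "ADP"]
            pattern = "VERB--" + child["upos"] + "".join(
                "-ADP({})".format(m["lemma"]) for m in markers)
            counts[(child["deprel"], pattern)] += 1
    return counts


# Hypothetical sample sentence (illustration only, not from the treebank).
ROWS = [
    ("1", "猫", "猫", "NOUN", "_", "_", "5", "nsubj", "_", "SpaceAfter=No"),
    ("2", "が", "が", "ADP", "_", "_", "1", "case", "_", "SpaceAfter=No"),
    ("3", "魚", "魚", "NOUN", "_", "_", "5", "obj", "_", "SpaceAfter=No"),
    ("4", "を", "を", "ADP", "_", "_", "3", "case", "_", "SpaceAfter=No"),
    ("5", "食べる", "食べる", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"),
    ("6", "。", "。", "PUNCT", "_", "_", "5", "punct", "_", "_"),
]
SAMPLE = "\n".join("\t".join(r) for r in ROWS) + "\n\n"

counts = case_patterns(SAMPLE)
for (relation, pattern), n in sorted(counts.items()):
    print(relation, pattern, "({})".format(n))
```

Keying the counter on the relation as well as the pattern keeps the output grouped in the same way as the nsubj, obj, and iobj lists.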