home edit page issue tracker

This page pertains to UD version 2.

UD Japanese GSD

Language: Japanese (code: ja)
Family: Japanese

This treebank has been part of Universal Dependencies since the UD v1.4 release.

The following people have contributed to making this treebank part of UD: Hiroshi Kanayama, Masayuki Asahara, Yusuke Miyao, Takaaki Tanaka, Ryan McDonald, Joakim Nivre, Daniel Zeman, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu.

Repository: UD_Japanese-GSD
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.2

License: CC BY-NC-SA 3.0 US

Genre: news, blog

Questions, comments? General annotation questions (either Japanese-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on Github. If you want to collaborate, please contact [hkana (æt) jp • ibm • com]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.

Annotation Source
Lemmas assigned by a program, with some manual corrections, but not a full manual verification
UPOS annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion
XPOS annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion
Features not available
Relations annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion

Description

This Universal Dependencies (UD) Japanese treebank is based on the definition of UD Japanese convention described in the UD documentation. The original sentences are from Google UDT 2.0.

The Japanese UD treebank contains the sentences from Google Universal Dependency Treebanks v2.0 (legacy): https://github.com/ryanmcd/uni-dep-tb. First, Google UDT v2.0 was converted to UD-style with bunsetsu-based word units (say “master” corpus).

The word units in “master” is significantly different from the definition of the documents based on Short Word Unit (SWU) [1], then the sentences are automatically re-processed by Hiroshi Kanayama in Feb 2017. It is the Japanese_UD v2.0 and used in the CoNLL 2017 shared task. In November 2017, UD_Japanese v2.0 is merged with the “master” data so that the manual annotations for dependencies can be reflected to the corpus. It reduced the errors in the dependency structures and relation labels.

Still there are slight differences in the word unit between UD_Japanese v2.1 and UD_Japanese-KTC 1.3. The manual segmentation work is ongoing by the group of Masayuki Asahara so that the divergence of the two Japanese treebanks should be fixed in the future.

Acknowledgments

The original treebank was provided by:

via

The corpus was converted by:

through discussion and validation with

Statistics of UD Japanese GSD

POS Tags

ADJADPADVAUXCCONJDETINTJNOUNNUMPARTPRONPROPNPUNCTSCONJSYMVERB

Features

NumType

Relations

acladvcladvmodamodauxcaseccccompcompoundcopcsubjdepdetfixediobjmarknmodnsubjnummodobjoblpunctroot

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

Relations Overview