This page pertains to UD version 2.

Universal Dependencies

Universal Dependencies (UD) is a framework for cross-linguistically consistent grammatical annotation and an open community effort with over 200 contributors producing more than 100 treebanks in over 70 languages.

Short introduction to UD
UD annotation guidelines
More information on UD:
Query UD treebanks online:
- SETS treebank search maintained by the University of Turku
- PML Tree Query maintained by the Charles University in Prague
- Kontext maintained by the Charles University in Prague
- Grew-match maintained by Inria in Nancy
Download UD treebanks

If you want to receive news about Universal Dependencies, you can subscribe to the UD mailing list. If you want to discuss individual annotation questions, use the GitHub issue tracker.

Current UD Languages

Information about language families (and genera for families with multiple branches) is mostly taken from WALS Online (IE = Indo-European).
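
As a rough illustration, each language heading below packs the language name, the number of treebanks, an approximate size, and the family and genus into a single line. The K figure appears to count thousands of tokens, judging from the token totals quoted in some treebank descriptions. The following Python sketch splits such a line into fields; the `LanguageRow` type and the `parse_language_row` function are hypothetical names introduced here and are not part of any UD tool.

```python
import re
from typing import NamedTuple, Optional


class LanguageRow(NamedTuple):
    """One language heading from the list (hypothetical helper type)."""
    name: str
    treebanks: int
    kilo_tokens: int          # size as printed, assumed to be thousands of tokens
    family: str
    genus: Optional[str]      # None when only a family is given


# Name (possibly several words), treebank count, size such as "1,042K", then family[, genus].
ROW = re.compile(
    r"^(?P<name>.+?)\s+(?P<count>\d+)\s+(?P<size>[\d,]+)K\s+(?P<family>.+)$"
)


def parse_language_row(line: str) -> LanguageRow:
    """Split a heading like 'Arabic 3 1,042K Afro-Asiatic, Semitic' into fields."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError(f"not a language heading: {line!r}")
    # The genus, when present, follows the family after a comma.
    family, _, genus = match.group("family").partition(", ")
    return LanguageRow(
        name=match.group("name"),
        treebanks=int(match.group("count")),
        kilo_tokens=int(match.group("size").replace(",", "")),
        family=family,
        genus=genus or None,
    )


if __name__ == "__main__":
    for heading in ("Arabic 3 1,042K Afro-Asiatic, Semitic", "Basque 1 121K Basque"):
        print(parse_language_row(heading))
```

Multi-word names such as "Ancient Greek" are handled by letting the name run up to the first standalone number. Headings that give a family without a genus, such as Basque, yield `genus=None`.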

Afrikaans 1 49K IE, Germanic

Afrikaans treebanks

AfriBooms 49K

UD Afrikaans-AfriBooms is a conversion of the AfriBooms Dependency Treebank, originally annotated with a simplified PoS set and dependency relations according to a subset of the Stanford tag set. The corpus consists of public government documents.

Contributors: Peter Dirix, Liesbeth Augustinus, Daniel van Niekerk
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Akkadian 1 1K Afro-Asiatic, Semitic

Akkadian treebanks

PISANDUB 1K

A small set of sentences from Babylonian royal inscriptions.

Contributors: Kamil Kopacewicz
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Amharic 1 10K Afro-Asiatic, Semitic

Amharic treebanks

ATT 10K

UD_Amharic-ATT is a manually developed treebank for Amharic. Sentences were collected from grammar books, fiction, biographies, religious texts, and news.

Contributors: Binyam Ephrem, Gashaw Arutie, Tsegay Woldemariam, Juan Ignacio Navarro Horñiacek
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Ancient Greek 2 417K IE, Greek

Ancient Greek treebanks

PROIEL 214K

UD_Ancient_Greek-PROIEL is converted from the Ancient Greek data in the PROIEL treebank, and consists of the New Testament plus selections from Herodotus.

Contributors: Dag Haug
Repository master dev
README
Treebank hub page
Download

Perseus 202K

This Universal Dependencies Ancient Greek Treebank consists of an automatic conversion of a selection of passages from the Ancient Greek and Latin Dependency Treebank 2.1.

Contributors: Giuseppe G. A. Celano, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Ancient Greek treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Arabic 3 1,042K Afro-Asiatic, Semitic

Arabic treebanks

PADT 282K

The Arabic-PADT UD treebank is based on the [Prague Arabic Dependency Treebank](http://ufal.mff.cuni.cz/padt/) (PADT), created at the Charles University in Prague.

Contributors: Daniel Zeman, Zdeněk Žabokrtský, Shadi Saleh
Repository master dev
README
Treebank hub page
Download

PUD 20K

This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Luma Ateyah, Martin Popel, Daniel Zeman, Nizar Habash, Dima Taji
Repository master dev
README
Treebank hub page
Download

NYUAD 738K

The NYUAD Arabic UD treebank is based on the Penn Arabic Treebank (PATB), parts 1, 2, and 3, through conversion to CATiB dependency trees.

Contributors: Nizar Habash, Dima Taji
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Arabic treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Armenian 1 22K IE, Armenian

Armenian treebanks

ArmTDP 22K

The ArmTDP Eastern Armenian UD treebank is based on the ՀայՇտեմ-ArmTDP-East dataset (2.0), created by the ArmTDP team led by Marat M. Yavrumyan at the Yerevan State University.

Contributors: Marat M. Yavrumyan
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Bambara 1 13K Mande

Bambara treebanks

CRB 13K

The UD Bambara treebank is a section of the Corpus Référence du Bambara annotated natively with Universal Dependencies.

Contributors: Katya Aplonova, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Basque 1 121K Basque

Basque treebanks

BDT 121K

The Basque UD treebank is based on an automatic conversion from part of the Basque Dependency Treebank (BDT), created at the University of the Basque Country by the IXA NLP research group. The treebank consists of 8,993 sentences (121,443 tokens) and covers mainly literary and journalistic texts.

Contributors: Maria Jesus Aranzabe, Aitziber Atutxa, Kepa Bengoetxea, Arantza Diaz de Ilarraza, Iakes Goenaga, Koldo Gojenola, Larraitz Uria
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Belarusian 1 8K IE, Slavic

Belarusian treebanks

HSE 8K

The Belarusian UD treebank is based on a sample of the news texts included in the Belarusian-Russian parallel subcorpus of the Russian National Corpus, online search available at: http://ruscorpora.ru/search-para-be.html.

Contributors: Olga Lyashevskaya
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Breton 1 10K IE, Celtic

Breton treebanks

KEB 10K

UD Breton-KEB is a treebank of Breton that has been manually annotated according to the Universal Dependencies guidelines. The tokenisation guidelines and morphological annotation come from a finite-state morphological analyser of Breton released as part of the [Apertium project](http://www.apertium.org).

Contributors: Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Bulgarian 1 156K IE, Slavic

Bulgarian treebanks

BTB 156K

UD_Bulgarian-BTB is based on the HPSG-based BulTreeBank, created at the Institute of Information and Communication Technologies, Bulgarian Academy of Sciences. The original consists of 215,000 tokens (over 15,000 sentences). All the texts were processed automatically at the tokenization, morphological, and chunk levels. The full syntactic analysis was then performed manually by trained annotators.

Contributors: Kiril Simov, Petya Osenova, Martin Popel
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Buryat 1 10K Mongolic

Buryat treebanks

BDT 10K

The UD Buryat treebank was manually annotated natively in UD and contains grammar book sentences, along with news and some fiction.

Contributors: Elena Badmaeva, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Cantonese 1 6K Sino-Tibetan

Cantonese treebanks

HK 6K

The Cantonese-HK UD treebank was manually annotated by Tak-sum Wong and Herman H. M. Leung at City University of Hong Kong, based on detailed transcriptions of three films shot by students from the School of Creative Media. The data are in Traditional Chinese. These trees form a parallel treebank with those in Chinese-HK.

Contributors: Kim Gerdes, John Lee, Herman Leung, Tak-sum Wong
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Catalan 1 531K IE, Romance

Catalan treebanks

AnCora 531K

Catalan data from the AnCora corpus.

Contributors: Héctor Martínez Alonso, Elena Pascual, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Chinese 4 160K Sino-Tibetan

Chinese treebanks

GSD 123K

Traditional Chinese Universal Dependencies Treebank annotated and converted by Google.

Contributors: Mo Shen, Ryan McDonald, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

PUD 21K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Josie Li, Cheuk Ying Li, Martin Popel, Daniel Zeman, Herman Leung
Repository master dev
README
Treebank hub page
Download

HK 8K

A treebank manually annotated at the City University of Hong Kong. It contains subtitles of three films shot by students from the School of Creative Media as well as the official record of proceedings of the Legislative Council of Hong Kong. The data are in Traditional Chinese characters. This treebank is parallel with UD_Cantonese-HK.

Contributors: Kim Gerdes, John Lee, Herman Leung, Tak-sum Wong
Repository master dev
README
Treebank hub page
Download

CFL 7K

The Chinese-CFL UD treebank is manually annotated by Keying Li with minor manual revisions by Herman Leung and John Lee at City University of Hong Kong, based on essays written by learners of Mandarin Chinese as a foreign language. The data is in Simplified Chinese.

Contributors: John Lee, Herman Leung, Keying Li
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Chinese treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Classical Chinese 1 34K Sino-Tibetan

Classical Chinese treebanks

Mencius 34K

Classical Chinese Universal Dependencies Treebank annotated and converted by the Institute for Research in Humanities, Kyoto University.

Contributors: Koichi Yasuoka, Christian Wittern, Tomohiko Morioka, Takumi Ikeda, Naoki Yamazaki, Yoshihiro Nikaido, Shingo Suzuki, Shigeki Moro, Yuan Li, Hiroyuki Shirasu, Kazunori Fujita
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Coptic 1 22K Afro-Asiatic, Egyptian

Coptic treebanks

Scriptorium 22K

UD Coptic contains manually annotated Sahidic Coptic texts, currently from the Gospel of Mark, Shenoute of Atripe's "Not Because a Fox Barks", the Letters of Besa, and several short stories from the Apophthegmata Patrum.

Contributors: Mitchell Abrams, Elizabeth Davidson, Amir Zeldes
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Croatian 1 197K IE, Slavic

Croatian treebanks

SET 197K

The Croatian UD treebank is based on the SETimes-HR corpus.

Contributors: Željko Agić, Nikola Ljubešić, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Czech 5 2,222K IE, Slavic

Czech treebanks

PDT 1,506K

The Czech-PDT UD treebank is based on the Prague Dependency Treebank 3.0 (PDT), created at the Charles University in Prague.

Contributors: Daniel Zeman, Jan Hajič
Repository master dev
README
Treebank hub page
Download

CAC 494K

The UD_Czech-CAC treebank is based on the Czech Academic Corpus 2.0 (CAC; Český akademický korpus; ČAK), created at Charles University in Prague.

Contributors: Barbora Hladká, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

FicTree 167K

FicTree is a treebank of Czech fiction, automatically converted into the UD format. The treebank was built at Charles University in Prague.

Contributors: Tomáš Jelínek, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

PUD 18K

Contributors: Václava Kettnerová, Jan Hajič jr., Silvie Cinková, Zdeňka Urešová, Milan Straka, Jan Hajič, Jaroslava Hlaváčová, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

CLTT 35K

The UD_Czech-CLTT treebank is based on the Czech Legal Text Treebank 1.0, created at Charles University in Prague.

Contributors: Barbora Hladká, Daniel Zeman, Martin Popel
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Czech treebanks.

Language documentation

See the language documentation page.

Danish 2 100K IE, Germanic

Danish treebanks

DDT 100K

The Danish UD treebank is a conversion of the Danish Dependency Treebank.

Contributors: Anders Johannsen, Héctor Martínez Alonso, Barbara Plank
Repository master dev
README
Treebank hub page
Download

DTB -

Contributors: Natalie Schluter
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Dutch 2 307K IE, Germanic

Dutch treebanks

Alpino 208K

This corpus consists of samples from various treebanks annotated at the University of Groningen using the Alpino annotation tools and guidelines.

Contributors: Daniel Zeman, Zdeněk Žabokrtský, Gosse Bouma, Gertjan van Noord
Repository master dev
README
Treebank hub page
Download

LassySmall 98K

This corpus contains sentences from the Wikipedia section of the Lassy Small Treebank. Universal Dependency annotation was generated automatically from the original annotation in Lassy.

Contributors: Gosse Bouma, Gertjan van Noord
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Dutch treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

English 6 586K IE, Germanic

English treebanks

ParTUT 49K

UD_English-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.

Contributors: Cristina Bosco, Manuela Sanguinetti
Repository master dev
README
Treebank hub page
Download

GUM 80K

Universal Dependencies version of the syntax annotations from the GUM corpus (https://corpling.uis.georgetown.edu/gum/).

Contributors: Siyao Peng, Amir Zeldes
Repository master dev
README
Treebank hub page
Download

EWT 254K

A Gold Standard Universal Dependencies Corpus for English, built over the source material of the English Web Treebank LDC2012T13 (https://catalog.ldc.upenn.edu/LDC2012T13).

Contributors: Natalia Silveira, Timothy Dozat, Christopher Manning, Sebastian Schuster, John Bauer, Miriam Connor, Marie-Catherine de Marneffe, Nathan Schneider, Sam Bowman, Hanzhi Zhu, Daniel Galbraith
Repository master dev
README
Treebank hub page
Download

PUD 21K

This is the English portion of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies (http://universaldependencies.org/conll17/).

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Jesse Kirchner, Lorenzo Lambertino, Martin Popel, Daniel Zeman, Christopher Manning, Sebastian Schuster, Siva Reddy
Repository master dev
README
Treebank hub page
Download

LinES 82K

UD English_LinES is the English half of the LinES Parallel Treebank with the original dependency annotation first automatically converted into Universal Dependencies and then partially reviewed. Its contents cover literature, an online manual and Europarl data.

Contributors: Lars Ahrenberg
Repository master dev
README
Treebank hub page
Download

ESL 97K

UD English-ESL / Treebank of Learner English (TLE) contains manual POS tag and dependency annotations for 5,124 English as a Second Language (ESL) sentences drawn from the Cambridge Learner Corpus First Certificate in English (FCE) dataset.

Contributors: Yevgeni Berzak, Jessica Kenney, Carolyn Spadine, Jing Xian Wang, Lucia Lam, Keiko Sophie Mori, Sebastian Garza, Boris Katz, Margarita Misirpashayeva
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of English treebanks.

Language documentation

See the language documentation page.

Erzya 1 15K Uralic, Mordvin

Erzya treebanks

JR 15K

UD Erzya is the original annotation (CoNLL-U) for texts in the Erzya language. It originally consists of a sample from a number of fiction authors writing original works in Erzya.

Contributors: Jack Rueter, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Estonian 1 434K Uralic, Finnic

Estonian treebanks

EDT 434K

UD Estonian is a converted version of the Estonian Dependency Treebank (EDT), originally annotated in the Constraint Grammar (CG) annotation scheme, and consisting of genres of fiction, newspaper texts and scientific texts. The treebank contains 30,723 trees, 434,245 tokens.

Contributors: Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen, Andriela Rääbis
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Faroese 1 10K IE, Germanic

Faroese treebanks

OFT 10K

This is a treebank of Faroese based on the Faroese Wikipedia.

Contributors: Daniel Zeman, Bjartur Mortensen, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Finnish 3 377K Uralic, Finnic

Finnish treebanks

FTB 159K

FinnTreeBank 1 consists of manually annotated grammatical examples from VISK. The UD version of FinnTreeBank 1 was converted from a native annotation model with a script.

Contributors: Jussi Piitulainen, Hanna Nurmi
Repository master dev
README
Treebank hub page
Download

TDT 202K

UD_Finnish-TDT is based on the Turku Dependency Treebank (TDT), a broad-coverage dependency treebank of general Finnish covering numerous genres. The conversion to UD was followed by extensive manual checks and corrections, and the treebank closely adheres to the UD guidelines.

Contributors: Filip Ginter, Jenna Kanerva, Veronika Laippala, Niko Miekka, Anna Missilä, Stina Ojala, Sampo Pyysalo
Repository master dev
README
Treebank hub page
Download

PUD 15K

Contributors: Jenna Kanerva, Filip Ginter, Stina Ojala, Anna Missilä
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Finnish treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

French 8 1,132K IE, Romance

French treebanks

ParTUT 28K

UD_French-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.

Contributors: Cristina Bosco, Manuela Sanguinetti
Repository master dev
README
Treebank hub page
Download

GSD 400K

The French UD was converted in 2015 from the content head version of the universal dependency treebank v2.0 (https://github.com/ryanmcd/uni-dep-tb). It has been updated since 2015 independently of the previous source.

Contributors: Marie-Catherine de Marneffe, Bruno Guillaume, Ryan McDonald, Alane Suhr, Joakim Nivre, Matias Grioni, Carly Dickerson, Guy Perrier
Repository master dev
README
Treebank hub page
Download

Sequoia 70K

UD_French-Sequoia is an automatic conversion of the Sequoia Treebank ([French Sequoia corpus](http://deep-sequoia.inria.fr)).

Contributors: Marie Candito, Djamé Seddah, Guy Perrier, Bruno Guillaume
Repository master dev
README
Treebank hub page
Download

Spoken 34K

A Universal Dependencies corpus for spoken French.

Contributors: Kim Gerdes, Sylvain Kahane, Chunxiao Yan, Aline Etienne, Marine Courtin
Repository master dev
README
Treebank hub page
Download

PUD 24K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Jana Strnadová, Gauthier Caron, Martin Popel, Daniel Zeman, Marie-Catherine de Marneffe
Repository master dev
README
Treebank hub page
Download

FTB 573K

The Universal Dependency version of the French Treebank (Abeillé et al., 2003), hereafter UD_French-FTB, is a treebank of sentences from the newspaper Le Monde, initially manually annotated with morphological information and phrase-structure and then converted to the Universal Dependencies annotation scheme.

Contributors: Marie Candito, Bruno Guillaume, Teresa Lynn, Héctor Martínez Alonso, Benoît Sagot, Djamé Seddah, Eric Villemonte de la Clergerie
Repository master dev
README
Treebank hub page
Download

CrapBank -

Contributors: Djamé Seddah
Repository master dev
README
Treebank hub page
Download

FQB -

Contributors: Djamé Seddah
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of French treebanks.

Language documentation

See the language documentation page.

Galician 2 164K IE, Romance

Galician treebanks

TreeGal 25K

The Galician-TreeGal is a treebank for Galician developed at LyS Group (Universidade da Coruña).

Contributors: Marcos Garcia
Repository master dev
README
Treebank hub page
Download

CTG 138K

The Galician UD treebank is based on the automatic parsing of the Galician Technical Corpus (http://sli.uvigo.gal/CTG) created at the University of Vigo by the TALG NLP research group.

Contributors: Xavier Gómez Guinovart
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Galician treebanks.

Language documentation

See the language documentation page.

German 3 354K IE, Germanic

German treebanks

GSD 292K

The German UD is converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).

Contributors: Slav Petrov, Wolfgang Seeker, Ryan McDonald, Joakim Nivre, Daniel Zeman, Adriane Boyd
Repository master dev
README
Treebank hub page
Download

PUD 21K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Georg Rehm, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Michael Mandl, Sebastian Bank, Martin Popel, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

FRAG 40K

Fragments of German aesthetic essays from the late 18th century.

Contributors: Alessio Salomoni
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of German treebanks.

Language documentation

See the language documentation page.

Gothic 1 55K IE, Germanic

Gothic treebanks

PROIEL 55K

The UD Gothic treebank is based on the Gothic data from the PROIEL treebank, and consists of Wulfila's Bible translation.

Contributors: Dag Haug
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Greek 1 63K IE, Greek

Greek treebanks

GDT 63K

The Greek UD treebank (UD_Greek-GDT) is derived from the Greek Dependency Treebank (http://gdt.ilsp.gr), a resource developed and maintained by researchers at the Institute for Language and Speech Processing/Athena R.C. (http://www.ilsp.gr).

Contributors: Prokopis Prokopidis
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Hebrew 1 161K Afro-Asiatic, Semitic

Hebrew treebanks

HTB 161K

A Universal Dependencies Corpus for Hebrew.

Contributors: Yoav Goldberg, Reut Tsarfaty, Amir More, Shoval Sadde, Victoria Basmov
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Hindi 2 375K IE, Indic

Hindi treebanks

HDTB 351K

The Hindi UD treebank is based on the Hindi Dependency Treebank (HDTB), created at IIIT Hyderabad, India.

Contributors: Riyaz Ahmad Bhat, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

PUD 23K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Esha Banerjee, Pinkey Nainwani, Martin Popel, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Hindi treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Hindi English 1 26K Code switching

Hindi English treebanks

HIENCS 26K

The Hindi-English Code-switching treebank is based on code-switching tweets of Hindi and English multilingual speakers (mostly Indian) on Twitter. The treebank is manually annotated using the UD scheme. The training and evaluation sets were separately annotated by different annotators using UD v2 and v1 guidelines respectively. The evaluation sets are automatically converted from UD v1 to v2.

Contributors: Riyaz Ahmad Bhat, Irshad Ahmad Bhat
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Hungarian 1 42K Uralic, Ugric

Hungarian treebanks

Szeged 42K

The Hungarian UD treebank is derived from the Szeged Dependency Treebank (Vincze et al. 2010).

Contributors: Richárd Farkas, Katalin Simkó, Zsolt Szántó, Viktor Varga, Veronika Vincze
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Indonesian 2 141K Austronesian, Malayo-Sumbawan

Indonesian treebanks

GSD 121K

The Indonesian UD is converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).

Contributors: Ryan McDonald, Joakim Nivre, Daniel Zeman, Septina Dian Larasati
Repository master dev
README
Treebank hub page
Download

PUD 19K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Ruli Manurung, Muh Shohibussirri, Martin Popel, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Indonesian treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Irish 1 23K IE, Celtic

Irish treebanks

IDT 23K

A Universal Dependencies 1020-sentence treebank for modern Irish.

Contributors: Teresa Lynn, Jennifer Foster
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Italian 5 502K IE, Romance

Italian treebanks

ISDT 298K

The Italian corpus annotated according to the UD annotation scheme was obtained by conversion from ISDT (Italian Stanford Dependency Treebank), released for the dependency parsing shared task of Evalita-2014 (Bosco et al. 2014).

Contributors: Cristina Bosco, Alessandro Lenci, Simonetta Montemagni, Maria Simi
Repository master dev
README
Treebank hub page
Download

ParTUT 55K

UD_Italian-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.

Contributors: Cristina Bosco, Manuela Sanguinetti
Repository master dev
README
Treebank hub page
Download

PoSTWITA 124K

PoSTWITA-UD is a collection of Italian tweets annotated in Universal Dependencies that can be exploited for the training of NLP systems to enhance their performance on social media texts.

Contributors: Cristina Bosco, Manuela Sanguinetti
Repository master dev
README
Treebank hub page
Download

PUD 23K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Antonio Stella, Davide Rovati, Martin Popel, Daniel Zeman, Maria Simi, Manuela Sanguinetti
Repository master dev
README
Treebank hub page
Download

VIT -

Contributors: Fabio Tamburini, Maria Simi, Cristina Bosco
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Italian treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Japanese 5 1,688K Japanese

Japanese treebanks

GSD 184K

This Universal Dependencies (UD) Japanese treebank is based on the definition of UD Japanese convention described in the UD documentation. The original sentences are from Google UDT 2.0.

Contributors: Hiroshi Kanayama, Masayuki Asahara, Yusuke Miyao, Takaaki Tanaka, Ryan McDonald, Joakim Nivre, Daniel Zeman, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu
Repository master dev
README
Treebank hub page
Download

PUD 26K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Atsuko Shimada, Anna Trukhina, Martin Popel, Daniel Zeman, Hiroshi Kanayama
Repository master dev
README
Treebank hub page
Download

Modern 14K

This Universal Dependencies (UD) Japanese treebank is based on the definition of UD Japanese convention described in the UD documentation. The original sentences are from the 'Corpus of Historical Japanese' (CHJ).

Contributors: Mai Omura, Masayuki Asahara, Yuta Takahashi
Repository master dev
README
Treebank hub page
Download

BCCWJ 1,273K

This Universal Dependencies (UD) Japanese treebank is based on the definition of UD Japanese convention described in the UD documentation. The original sentences are from the 'Balanced Corpus of Contemporary Written Japanese' (BCCWJ).

Contributors: Mai Omura, Masayuki Asahara, Yusuke Miyao, Takaaki Tanaka, Hiroshi Kanayama, Yuji Matsumoto, Shinsuke Mori, Sumire Uematsu, Yugo Murawaki
Repository master dev
README
Treebank hub page
Download

KTC 189K

Please add a summary section to the treebank readme file

Contributors: Masayuki Asahara, Hiroshi Kanayama, Yuji Matsumoto, Yusuke Miyao, Shunsuke Mori, Takaaki Tanaka, Sumire Uematsu
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Japanese treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Kazakh 1 10K Turkic, Northwestern

Kazakh treebanks

KTB 10K

The UD Kazakh treebank is a combination of text from various sources including Wikipedia, some folk tales, sentences from the UDHR, news and phrasebook sentences. Sentence IDs include partial document identifiers.

Contributors: Aibek Makazhanov, Jonathan North Washington, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Komi Zyrian 2 3K Uralic, Permic

Komi Zyrian treebanks

Lattice 2K

UD Komi-Zyrian Lattice is a treebank of written standard Komi-Zyrian.

Contributors: Niko Partanen, KyungTae Lim, Thierry Poibeau
Repository master dev
README
Treebank hub page
Download

IKDP 1K

This treebank consists of dialectal transcriptions of spoken Komi-Zyrian. The current texts are short recorded segments from different areas where the Iźva dialect of the Komi language is spoken.

Contributors: Niko Partanen, Rogier Blokland, Michael Rießler
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Komi Zyrian treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Korean 5 446K Korean

Korean treebanks

Kaist 350K

The KAIST Korean Universal Dependency Treebank was generated by Chun et al. (2018) from the constituency trees in the [KAIST Tree-Tagging Corpus](http://semanticweb.kaist.ac.kr/home/index.php/Corpus4).

Contributors: Jinho Choi, Na-Rae Han, Jena Hwang, Jayeol Chun
Repository master dev
README
Treebank hub page
Download

GSD 80K

The Google Korean Universal Dependency Treebank was first converted from the [Universal Dependency Treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb), and then enhanced by Chun et al. (2018).

Contributors: Ryan McDonald, Joakim Nivre, Daniel Zeman, Jinho Choi, Na-Rae Han, Jena Hwang, Jayeol Chun
Repository master dev
README
Treebank hub page
Download

PUD 16K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Sookyoung Kwak, Yongseok Cho, Martin Popel, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Penn -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Jinho Choi, Narae Han, Jena Hwang, Jayeol Chun
Repository master dev
README
Treebank hub page
Download

Sejong -

Please add a summary section to the treebank readme file

Contributors: Jaemin Cho
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Korean treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Kurmanji 1 10K IE, Iranian

Kurmanji treebanks

MG 10K

The UD Kurmanji corpus is a corpus of Kurmanji Kurdish. It contains fiction and encyclopaedic texts in roughly equal measure. It has been annotated natively in accordance with the UD annotation scheme.

Contributors: Memduh Gökırmak, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Latin 3 582K IE, Latin

Latin treebanks

PROIEL 199K

The Latin PROIEL treebank is based on the Latin data from the PROIEL treebank, and contains most of the Vulgate New Testament translations plus selections from Caesar's Gallic War, Cicero's Letters to Atticus, Palladius' Opus Agriculturae and the first book of Cicero's De officiis.

Contributors: Dag Haug
Repository master dev
README
Treebank hub page
Download

ITTB 353K

Latin data from the _Index Thomisticus_ Treebank. Data are taken from the _Index Thomisticus_ corpus by Roberto Busa SJ, which contains the complete works of Thomas Aquinas (1225–1274; Medieval Latin) and of 61 other authors related to Thomas.

Contributors: Marco Passarotti, Daniel Zeman, Berta González Saavedra, Flavio Massimiliano Cecchini
Repository master dev
README
Treebank hub page
Download

Perseus 29K

This Universal Dependencies Latin Treebank consists of an automatic conversion of a selection of passages from the Ancient Greek and Latin Dependency Treebank 2.1.

Contributors: Giuseppe G. A. Celano, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Latin treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Latvian 1 152K IE, Baltic

Latvian treebanks

LVTB 152K

The Latvian UD Treebank is based on the Latvian Treebank ([LVTB](http://sintakse.korpuss.lv)), which is being created at the University of Latvia, Institute of Mathematics and Computer Science, [Artificial Intelligence Laboratory](http://ailab.lv).

Contributors: Lauma Pretkalniņa, Laura Rituma, Baiba Saulīte, Gunta Nešpore-Bērzkalne, Normunds Grūzītis
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Lithuanian 2 46K IE, Baltic

Lithuanian treebanks

HSE 5K

A Lithuanian treebank with manually annotated dependencies; morphology was annotated using the Morphological Annotator by CCL, Vytautas Magnus University (http://tekstynas.vdu.lt/), followed by manual disambiguation. A pilot version which includes news and an essay by Tomas Venclova is available here.

Contributors: Olga Lyashevskaya, Dmitry Sichinava
Repository master dev
README
Treebank hub page
Download

ALKSNIS 40K

The Lithuanian dependency treebank ALKSNIS.

Contributors: Erika Rimkutė, Agnė Bielinskienė, Jolanta Kovalevskaitė, Loïc Boizou, Gabrielė Aleksandravičiūtė, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Maltese 1 44K Afro-Asiatic, Semitic

Maltese treebanks

MUDT 44K

MUDT (Maltese Universal Dependencies Treebank) is a manually annotated treebank of Maltese, a Semitic language of Malta descended from North African Arabic with a significant amount of Italo-Romance influence. MUDT was designed as a balanced corpus with four major genres (see Splitting below) represented roughly equally.

Contributors: Slavomír Čéplö, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Marathi 1 3K IE, Indic

Marathi treebanks

UFAL 3K

UD Marathi is a manually annotated treebank consisting primarily of stories from Wikisource, and parts of an article on Wikipedia.

Contributors: Vinit Ravishankar
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Naija 1 12K Creole

Naija treebanks

NSC 12K

A Universal Dependencies corpus for spoken Naija (Nigerian Pidgin).

Contributors: Bernard Caron, Marine Courtin, Kim Gerdes, Sylvain Kahane, Sandra Bellato, Manying Zhang
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

North Sami 1 26K Uralic, Sami

North Sami treebanks

Giella 26K

This is a North Sámi treebank based on a manually disambiguated and function-labelled gold-standard corpus of North Sámi produced by the Giellatekno team at UiT Norgga árktalaš universitehta.

Contributors: Trond Trosterud, Lene Antonsen, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Norwegian 3 625K IE, Germanic

Norwegian treebanks

Bokmaal 310K

The Norwegian UD treebank is based on the Bokmål section of the Norwegian Dependency Treebank (NDT), which is a syntactic treebank of Norwegian. NDT has been automatically converted to the UD scheme by Lilja Øvrelid at the University of Oslo.

Contributors: Lilja Øvrelid, Fredrik Jørgensen, Petter Hohle
Repository master dev
README
Treebank hub page
Download

Nynorsk 301K

The Norwegian UD treebank is based on the Nynorsk section of the Norwegian Dependency Treebank (NDT), which is a syntactic treebank of Norwegian. NDT has been automatically converted to the UD scheme by Lilja Øvrelid at the University of Oslo.

Contributors: Lilja Øvrelid, Fredrik Jørgensen, Petter Hohle
Repository master dev
README
Treebank hub page
Download

NynorskLIA 13K

This Norwegian treebank is based on the LIA treebank of transcribed spoken Norwegian dialects. The treebank has been automatically converted to the UD scheme by Lilja Øvrelid at the University of Oslo.

Contributors: Lilja Øvrelid
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Norwegian treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Old Church Slavonic 1 57K IE, Slavic

Old Church Slavonic treebanks

PROIEL 57K

The Old Church Slavonic (OCS) UD treebank is based on the Old Church Slavonic data from the PROIEL treebank and contains the text of the Codex Marianus New Testament translation.

Contributors: Dag Haug
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Old French 1 170K IE, Romance

Old French treebanks

SRCMF 170K

UD_Old_French-SRCMF is a conversion of (part of) the SRCMF corpus (Syntactic Reference Corpus of Medieval French [srcmf.org](http://srcmf.org/)).

Contributors: Sophie Prévost, Aurélie Collomb, Kim Gerdes, Isabelle Tellier, Marine Courtin, Alexei Lavrentiev, Céline Guillot-Barbance
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Persian 1 152K IE, Iranian

Persian treebanks

Seraji 152K

The Persian Universal Dependency Treebank (Persian UD) is based on the Uppsala Persian Dependency Treebank (UPDT). The conversion of the UPDT to Universal Dependencies was performed semi-automatically with extensive manual checks and corrections.

Contributors: Mojgan Seraji, Filip Ginter, Joakim Nivre
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Polish 2 214K IE, Slavic

Polish treebanks

LFG 130K

The LFG Enhanced UD treebank of Polish is based on a corpus of LFG (Lexical Functional Grammar) syntactic structures generated by an LFG grammar of Polish, POLFIE, and manually disambiguated by human annotators.

Contributors: Agnieszka Patejuk, Adam Przepiórkowski
Repository master dev
README
Treebank hub page
Download

SZ 83K

The UD Polish treebank is based on “Składnica zależnościowa” (the Polish dependency treebank) version 0.5.

Contributors: Daniel Zeman, Jan Mašek, Rudolf Rosa
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Polish treebanks.

Language documentation

See the language documentation page.

Portuguese 3 570K IE, Romance

Portuguese treebanks

Bosque 227K

This Universal Dependencies (UD) Portuguese treebank is based on the Constraint Grammar converted version of the Bosque, which is part of the Floresta Sintá(c)tica treebank. It contains both European (CETEMPúblico) and Brazilian (CETENFolha) variants.

Contributors: Alexandre Rademaker, Eckhard Bick, Fabricio Chalub, Cláudia Freitas, Guilherme Paulino-Passos, Luisa Rocha, Isabela Soares-Bastos, Livy Real, Valeria de Paiva, Daniel Zeman, Martin Popel, David Mareček, Natalia Silveira, André Martins
Repository master dev
README
Treebank hub page
Download

GSD 319K

The Brazilian Portuguese UD is converted from the [Google Universal Dependency Treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).

Contributors: Ryan McDonald, Joakim Nivre, Daniel Zeman, Alexandre Rademaker, Fabricio Chalub, Carlos Ramisch
Repository master dev
README
Treebank hub page
Download

PUD 23K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Gustavo Mendonça, Larissa Rinaldi, Martin Popel, Daniel Zeman, Valeria de Paiva
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Portuguese treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Romanian 2 413K IE, Romance

Romanian treebanks

RRT 218K

The Romanian UD treebank (called RoRefTrees) (Barbu Mititelu et al., 2016) is the reference treebank in UD format for standard Romanian.

Contributors: Verginica Barbu Mititelu, Elena Irimia, Cenel-Augusto Perez, Radu Ion, Radu Simionescu, Martin Popel
Repository master dev
README
Treebank hub page
Download

Nonstandard 195K

The Romanian Non-standard UD treebank (called UAIC-RoDia) is based on the UAIC-RoDia Treebank.

Contributors: Cătălina Mărănduc, Cenel-Augusto Perez, Victoria Bobicev, Cătălin Mititelu, Florinel Hociung, Valentin Roșca
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Romanian treebanks.

Language documentation

See the language documentation page.

Russian 4 1,247K IE, Slavic

Russian treebanks

GSD 99K

Russian Universal Dependencies Treebank annotated and converted by Google.

Contributors: Ryan McDonald, Vitaly Nikolaev, Olga Lyashevskaya
Repository master dev
README
Treebank hub page
Download

SynTagRus 1,107K

Russian data from the SynTagRus corpus.

Contributors: Kira Droganova, Olga Lyashevskaya, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Taiga 20K

Universal Dependencies treebank based on data samples extracted from Taiga Corpus and MorphoRuEval-2017 text collections.

Contributors: Olga Lyashevskaya, Olga Rudina
Repository master dev
README
Treebank hub page
Download

PUD 19K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Tatiana Lando, Olga Loginova, Martin Popel, Daniel Zeman, Kira Droganova
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Russian treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Sanskrit 1 1K IE, Indic

Sanskrit treebanks

UFAL 1K

A small Sanskrit treebank of sentences from Pañcatantra, an ancient Indian collection of interrelated fables by Vishnu Sharma.

Contributors: Puneet Dwivedi, Daniel Zeman, Erica Biagetti
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Serbian 1 86K IE, Slavic

Serbian treebanks

SET 86K

The Serbian UD treebank is based on the SETimes-SR corpus.

Contributors: Tanja Samardžić, Nikola Ljubešić
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Slovak 1 106K IE, Slavic

Slovak treebanks

SNK 106K

The Slovak UD treebank is based on data originally annotated as part of the Slovak National Corpus, following the annotation style of the Prague Dependency Treebank.

Contributors: Katarína Gajdošová, Mária Šimková, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Slovenian 2 170K IE, Slavic

Slovenian treebanks

SSJ 140K

The Slovenian UD Treebank is a rule-based conversion of the ssj500k treebank, the largest collection of manually syntactically annotated data in Slovenian, originally annotated in the JOS annotation scheme.

Contributors: Kaja Dobrovoljc, Tomaž Erjavec, Simon Krek
Repository master dev
README
Treebank hub page
Download

SST 29K

The Spoken Slovenian UD Treebank (SST) is the first syntactically annotated corpus of spoken Slovenian, based on a sample of the reference GOS corpus, a collection of transcribed audio recordings of monologic, dialogic and multi-party spontaneous speech in different everyday situations.

Contributors: Kaja Dobrovoljc, Joakim Nivre
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Slovenian treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Spanish 3 1,004K IE, Romance

Spanish treebanks

AnCora 549K

Spanish data from the AnCora corpus.

Contributors: Héctor Martínez Alonso, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

GSD 431K

The Spanish UD is converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).

Contributors: Miguel Ballesteros, Héctor Martínez Alonso, Ryan McDonald, Elena Pascual, Natalia Silveira, Daniel Zeman, Joakim Nivre
Repository master dev
README
Treebank hub page
Download

PUD 23K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Hector Fernandez Alcalde, Laura Moreno Romero, Martin Popel, Daniel Zeman, Héctor Martínez Alonso
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Spanish treebanks.

Language documentation

See the language documentation page.

Swedish 3 195K IE, Germanic

Swedish treebanks

Talbanken 96K

The Swedish-Talbanken treebank is based on Talbanken, a treebank developed at Lund University in the 1970s.

Contributors: Joakim Nivre, Aaron Smith
Repository master dev
README
Treebank hub page
Download

LinES 79K

UD Swedish_LinES is the Swedish half of the LinES Parallel Treebank with UD annotations. All segments are translations from English and the sources cover literary genres, online manuals and Europarl data.

Contributors: Lars Ahrenberg
Repository master dev
README
Treebank hub page
Download

PUD 19K

Swedish-PUD is the Swedish part of the Parallel Universal Dependencies (PUD) treebanks.

Contributors: Joakim Nivre
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Swedish treebanks.

Language documentation

See the language documentation page.

Swedish Sign Language 1 1K Sign Language

Swedish Sign Language treebanks

SSLC 1K

The Universal Dependencies treebank for Swedish Sign Language (ISO 639-3: swl) is derived from the Swedish Sign Language Corpus (SSLC) from the department of linguistics, Stockholm University.

Contributors: Moa Gärdenfors, Carl Börstell, Robert Östling, Lars Wallin, Mats Wirén
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Tagalog 1 <1K Austronesian, Central Philippine

Tagalog treebanks

TRG <1K

UD_Tagalog-TRG is a UD treebank manually annotated using sentences from a grammar book.

Contributors: Stephanie Samson
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Tamil 1 9K Dravidian, Southern

Tamil treebanks

TTB 9K

The UD Tamil treebank is based on the Tamil Dependency Treebank created at the Charles University in Prague by Loganathan Ramasamy.

Contributors: Loganathan Ramasamy, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Telugu 1 6K Dravidian, South Central

Telugu treebanks

MTG 6K

The Telugu UD treebank consists of sentences from a grammar book, manually annotated in UD.

Contributors: Taraka Rama, Sowmya Vajjala
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Thai 1 22K Tai-Kadai

Thai treebanks

PUD 22K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Rattima Nitisaroj, Yanin Sawanakunanon, Martin Popel, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Turkish 3 74K Turkic, Southwestern

Turkish treebanks

IMST 57K

The UD Turkish Treebank, also called the IMST-UD Treebank, is a semi-automatic conversion of the IMST Treebank (Sulubacak et al., 2016).

Contributors: Çağrı Çöltekin, Gülşen Cebiroğlu Eryiğit, Memduh Gökırmak, Hüner Kaşıkara, Umut Sulubacak, Francis Tyers
Repository master dev
README
Treebank hub page
Download

PUD 16K

Contributors: Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Slav Petrov, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Savas Cetin, Martin Popel, Daniel Zeman, Francis Tyers, Çağrı Çöltekin
Repository master dev
README
Treebank hub page
Download

BOUN -

A Turkish treebank annotated at Boğaziçi University.

Contributors: Betül Bilgin
Repository master dev
README
Treebank hub page
Download

See here for comparative statistics of Turkish treebanks.

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Ukrainian 1 116K IE, Slavic

Ukrainian treebanks

IU 116K

Gold standard Universal Dependencies corpus for Ukrainian, originally developed for UD by the [Institute for Ukrainian](https://mova.institute), an NGO. [[in Ukrainian](https://mova.institute/золотий_стандарт)]

Contributors: Natalia Kotsyba, Bohdan Moskalevskyi, Mykhailo Romanenko
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Upper Sorbian 1 11K IE, Slavic

Upper Sorbian treebanks

UFAL 11K

A small treebank of Upper Sorbian based mostly on Wikipedia.

Contributors: Daniel Zeman, Anna Nedoluzhko
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Urdu 1 138K IE, Indic

Urdu treebanks

UDTB 138K

The Urdu Universal Dependency Treebank was automatically converted from the Urdu Dependency Treebank (UDTB), which is part of an ongoing effort to create multi-layered treebanks for Hindi and Urdu.

Contributors: Riyaz Ahmad Bhat, Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Uyghur 1 40K Turkic, Southeastern

Uyghur treebanks

UDT 40K

The Uyghur UD treebank is based on the Uyghur Dependency Treebank (UDT), created at Xinjiang University in Ürümqi, China.

Contributors: Marhaba Eli, Daniel Zeman, Francis Tyers
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Vietnamese 1 43K Austro-Asiatic, Viet-Muong

Vietnamese treebanks

VTB 43K

The Vietnamese UD treebank is a conversion of the constituent treebank created in the VLSP project (https://vlsp.hpda.vn/).

Contributors: Lương Nguyễn Thị, Linh Hà Mỹ, Phương Lê Hồng, Huyền Nguyễn Thị Minh
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Warlpiri 1 <1K Pama-Nyungan

Warlpiri treebanks

UFAL <1K

A small treebank of grammatical examples in Warlpiri, taken from linguistic literature.

Contributors: Daniel Zeman
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Yoruba 1 2K Niger-Congo, Defoid

Yoruba treebanks

YTB 2K

Parts of the Yoruba Bible, hand-annotated natively in Universal Dependencies.

Contributors: Adédayọ̀ Olúòkun, Daniel Zeman, Seyi Williams, Ọlájídé Ishola
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Upcoming UD Languages

Bengali 2 - IE, Indic

Bengali treebanks

BRU -

Please add a summary section to the treebank readme file

Contributors: Siratun Jannat, Mizanur Rahoman, Shafi Sourov, Jannatul Ferdaousi, Syeda Shahzadi
Repository master dev
README
Treebank hub page
Download

DDS -

Please add a summary section to the treebank readme file

Contributors: Md. Anwarus Salam Khan, Md. Mahfuzus Salam Khan
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Bhojpuri 1 - IE, Indic

Bhojpuri treebanks

BHTB -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Atul Kr. Ojha
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Dargwa 1 - Nakho-Dagestanian

Dargwa treebanks

Mehweb -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Sasha Kozhukhar, Olga Lyashevskaya
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Georgian 1 - Kartvelian

Georgian treebanks

GNC -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Ana Kolkhidashvili
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Kannada 1 - Dravidian, Southern

Kannada treebanks

MKG -

Examples from Modern Kannada Grammar by S. N. Sridhar.

Contributors: Taraka Rama, Sowmya Vajjala
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Kyrgyz 1 - Turkic, Northwestern

Kyrgyz treebanks

KTB -

... 1-2 sentences (see http://universaldependencies.org/release_checklist.html#the-readme-file for README guidelines) ...

Contributors: Kamen Bonov
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Pnar 1 - Austro-Asiatic, Khasian

Pnar treebanks

PTB -

UD Pnar-PTB is a conversion from the Ring (2017) dataset ([doi:10.21979/N9/KVFGBZ](http://dx.doi.org/10.21979/N9/KVFGBZ)) that underpins a grammatical description of the Pnar language (Ring 2015, [http://hdl.handle.net/10356/62519](http://hdl.handle.net/10356/62519)). The corpus consists of folktales and interviews transcribed, translated, and interlinearized.

Contributors: Hiram Ring
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Romansh 2 - IE, Romance

Romansh treebanks

Rumgr -

Please add a summary section to the treebank readme file

Contributors: Sascha Brawer, Martin Cantieni
Repository master dev
README
Treebank hub page
Download

Sursilv -

Please add a summary section to the treebank readme file

Contributors: Sascha Brawer, Martin Cantieni
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Shipibo Konibo 1 - Panoan

Shipibo Konibo treebanks

PUCP -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Ronald Ahmed Cárdenas Acosta
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Sindhi 1 - IE, Indic

Sindhi treebanks

MazharDootio -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Mazhar Dootio
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Somali 1 - Afro-Asiatic, Cushitic

Somali treebanks

STB -

Please add a summary section to the treebank readme file

Contributors: Morgan Nilsson
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Sorani 1 - IE, Iranian

Sorani treebanks

MG -

Please add a summary section to the treebank readme file

Contributors: Memduh Gökırmak
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Welsh 1 - IE, Celtic

Welsh treebanks

CCG -

Corpws Cystrawennol y Gymraeg

Contributors: Francis Tyers, Johannes Heinecke
Repository master dev
README
Treebank hub page
Download

Language documentation

See the language documentation page.

Wolof 1 - Niger-Congo, Northern Atlantic

Wolof treebanks

WTB -

... 1-2 sentences (see [release checklist](http://universaldependencies.org/release_checklist.html#the-readme-file) for README guidelines) ...

Contributors: Bamba Dione
Repository master dev
README
Treebank hub page
Download

Language documentation

The language hub documentation has not yet been created or ported from the UDv1 documentation.

Disclaimer: Our use of flags to symbolise languages is only intended as a visual enhancement of the website and should not be interpreted as a political statement in any way.

Download

The data is released through LINDAT/CLARIN.

The next release (v2.4) is scheduled for May 15, 2019 (data freeze on May 1).
Version 2.3 treebanks are available at http://hdl.handle.net/11234/1-2895. 129 treebanks, 76 languages, released November 15, 2018.
Version 2.2 treebanks are archived at http://hdl.handle.net/11234/1-2837. 122 treebanks, 71 languages, released July 1, 2018.
Version 2.1 treebanks are archived at http://hdl.handle.net/11234/1-2515. 102 treebanks, 60 languages, released November 15, 2017.
Version 2.0 treebanks are archived at http://hdl.handle.net/11234/1-1983. 70 treebanks, 50 languages, released March 1, 2017.
- Test data 2.0 are archived at http://hdl.handle.net/11234/1-2184. 81 treebanks, 49 languages, released May 18, 2017.
Version 1.4 treebanks are archived at http://hdl.handle.net/11234/1-1827. 64 treebanks, 47 languages, released November 15, 2016.
Version 1.3 treebanks are archived at http://hdl.handle.net/11234/1-1699. 54 treebanks, 40 languages, released May 15, 2016.
Version 1.2 treebanks are archived at http://hdl.handle.net/11234/1-1548. 37 treebanks, 33 languages, released November 15, 2015.
Version 1.1 treebanks are archived at http://hdl.handle.net/11234/LRT-1478. 19 treebanks, 18 languages, released May 15, 2015.
Version 1.0 treebanks are archived at http://hdl.handle.net/11234/1-1464. 10 treebanks, 10 languages, released January 15, 2015.
In general, we intend to have regular treebank releases every six months. The v2.0 and v2.2 releases were brought forward because of their use in the CoNLL 2017 and 2018 Multilingual Parsing Shared Tasks.