This page pertains to UD version 2.

UD English ParTUT

Language: English (code: en)
Family: Indo-European, Germanic

This treebank has been part of Universal Dependencies since the UD v2.0 release.

The following people have contributed to making this treebank part of UD: Cristina Bosco, Manuela Sanguinetti.

Repository: UD_English-ParTUT
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.2

License: CC BY-NC-SA 4.0

Genre: legal, news, wiki

Questions, comments? General annotation questions (either English-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [msanguin (æt) di • unito • it]. Development of the treebank happens outside the UD repository, so bugs must be fixed either in the original data source or in the conversion procedure. Do not submit pull requests against the UD repository.

Annotation	Source
Lemmas	annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion
UPOS	annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion
XPOS	annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion
Features	annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion
Relations	annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion

Description

UD_English-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin. It covers a variety of text genres, including talks, legal texts and Wikipedia articles, among others.

UD_English-ParTUT data is derived from the existing parallel treebank Par(allel)TUT.

ParTUT is a morpho-syntactically annotated collection of Italian/French/English parallel sentences. It includes texts from different sources, representing different genres and domains, and is released in several formats.

The treebank comprises approximately 167,000 tokens, with an average of 2,100 sentences per language. The texts currently available in the collection were gathered from a number of sources and domains:

the Creative Commons open license;
the DGT-Translation Memory;
the Europarl parallel corpus (section ep_00_01_17);
publicly available pages from the Facebook website;
the JRC-Acquis multilingual parallel corpus (section jrc52006DC243);
several articles from Project Syndicate© [ABSENT IN UD_French-ParTUT];
the Universal Declaration of Human Rights;
Wikipedia articles retrieved from the English section and then translated into Italian only by graduate students in Translation Studies [ABSENT IN UD_French-ParTUT];
the Web Inventory of Translated Talks.

ParTUT data can be downloaded here and here.

Acknowledgments

We are deeply grateful to Project Syndicate© for letting us download and exploit their articles as text material, under the terms of educational use.

References

Manuela Sanguinetti, Cristina Bosco. 2014. ParTUT: The Turin University Parallel Treebank. In Basili, Bosco, Delmonte, Moschitti, Simi (editors) Harmonization and development of resources and tools for Italian Natural Language Processing within the PARLI project, LNCS, Springer Verlag.
Manuela Sanguinetti, Cristina Bosco. 2014. Converting the parallel treebank ParTUT in Universal Stanford Dependencies. In Proceedings of the 1st Conference for Italian Computational Linguistics (CLiC-it 2014), Pisa (Italy).
Cristina Bosco, Manuela Sanguinetti. 2014. Towards a Universal Stanford Dependencies parallel treebank. In Proceedings of the 13th Workshop on Treebanks and Linguistic Theories (TLT-13), Tübingen (Germany).

Statistics of UD English ParTUT

POS Tags

ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X

Features

Definite – Degree – Foreign – Gender – Mood – Number – NumType – Person – Polarity – Poss – PronType – Tense – VerbForm

Relations

acl – acl:relcl – advcl – advmod – amod – appos – aux – aux:pass – case – cc – ccomp – compound – compound:prt – conj – cop – csubj – csubj:pass – dep – det – det:predet – discourse – dislocated – expl – fixed – flat – flat:foreign – goeswith – iobj – mark – nmod – nmod:npmod – nmod:poss – nmod:tmod – nsubj – nsubj:pass – nummod – obj – obl – orphan – parataxis – punct – root – vocative – xcomp

Tokenization and Word Segmentation

This corpus contains 2090 sentences, 49616 tokens and 49648 syntactic words.

This corpus contains 6475 tokens (13%) that are not followed by a space.

This corpus does not contain words with spaces.
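
The counts above can be approximated from the CoNLL-U files in which UD treebanks are distributed. The following is a minimal sketch, not the script that produced these statistics: the function name `corpus_counts` and the two-sentence sample are hypothetical, and the sketch assumes that a multi-word token range line (such as `2-3`) counts as one token while the words it covers count only as syntactic words.

```python
def row(*cols):
    """Join the ten CoNLL-U columns with tabs."""
    return "\t".join(cols)

# Hypothetical sample, not taken from the treebank.
SAMPLE = "\n".join([
    "# text = I cannot go.",
    row("1", "I", "I", "PRON", "_", "_", "4", "nsubj", "_", "_"),
    row("2-3", "cannot", "_", "_", "_", "_", "_", "_", "_", "_"),
    row("2", "can", "can", "AUX", "_", "_", "4", "aux", "_", "_"),
    row("3", "not", "not", "PART", "_", "_", "4", "advmod", "_", "_"),
    row("4", "go", "go", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"),
    row("5", ".", ".", "PUNCT", "_", "_", "4", "punct", "_", "_"),
    "",
    "# text = Yes.",
    row("1", "Yes", "yes", "INTJ", "_", "_", "0", "root", "_", "SpaceAfter=No"),
    row("2", ".", ".", "PUNCT", "_", "_", "1", "punct", "_", "_"),
    "",
])

def corpus_counts(conllu_text):
    """Count sentences, surface tokens, syntactic words and tokens without a following space."""
    counts = {"sentences": 0, "tokens": 0, "words": 0, "no_space_after": 0}
    in_sentence = False
    covered_until = 0  # last word ID covered by the current multi-word token
    for line in conllu_text.splitlines():
        if not line.strip():          # a blank line ends the sentence
            in_sentence = False
            covered_until = 0
            continue
        if line.startswith("#"):      # sentence-level comment
            continue
        cols = line.split("\t")
        if not in_sentence:
            counts["sentences"] += 1
            in_sentence = True
        tok_id, misc = cols[0], cols[9]
        if "." in tok_id:             # empty node, neither token nor word
            continue
        if "-" in tok_id:             # multi-word token range, e.g. 2-3
            covered_until = int(tok_id.split("-")[1])
            is_surface_token = True
        else:
            counts["words"] += 1
            is_surface_token = int(tok_id) > covered_until
        if is_surface_token:
            counts["tokens"] += 1
            if "SpaceAfter=No" in misc.split("|"):
                counts["no_space_after"] += 1
    return counts

print(corpus_counts(SAMPLE))
# {'sentences': 2, 'tokens': 6, 'words': 7, 'no_space_after': 2}
```

With the figures reported above, the share of tokens not followed by a space is 6475 / 49616 ≈ 0.1305, which rounds to the stated 13%.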

This corpus contains 82 types of words that contain both letters and punctuation. Examples: 's, 're, so-called, 'm, 've, cost-effective, long-term, ’s, hi-tech, self-regulation, 'd, 'll, D', G., Mid-1590s, R&D, S., T., e.g., etc., i.e., late-1990, medium-sized, p., part-time, real-time, A., African-American, C., Co-operation, D., Fine-tune, H., M., Mr., Self-destructive, St., W., W.H., above-mentioned, avant-garde, back-up, best-selling, case-by-case, co-financing, co-ordination, cost-effectiveness, deep-seated, dot-com, fat-soluble

This corpus contains 32 multi-word tokens. On average, one multi-word token consists of 2.00 syntactic words.
There are 9 types of multi-word tokens. Examples: cannot, don't, ain't, can't, won't, aren't, des, had, shouldn't.
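
As a consistency check on the figures above: if each of the 32 multi-word tokens covers 2 syntactic words, each one adds exactly one word beyond the token count, so 49616 + 32 × (2 − 1) = 49648, which matches the reported number of syntactic words. The sketch below shows one way such a tally could be computed from CoNLL-U text. The function name `multiword_tokens` and the sample are hypothetical, and types are counted by surface form exactly as written, which may differ from how the statistics here were produced.

```python
from collections import Counter

def row(*cols):
    """Join the ten CoNLL-U columns with tabs."""
    return "\t".join(cols)

# Hypothetical sample, not taken from the treebank.
SAMPLE = "\n".join([
    row("1", "I", "I", "PRON", "_", "_", "4", "nsubj", "_", "_"),
    row("2-3", "cannot", "_", "_", "_", "_", "_", "_", "_", "_"),
    row("2", "can", "can", "AUX", "_", "_", "4", "aux", "_", "_"),
    row("3", "not", "not", "PART", "_", "_", "4", "advmod", "_", "_"),
    row("4", "go", "go", "VERB", "_", "_", "0", "root", "_", "_"),
    "",
    row("1", "We", "we", "PRON", "_", "_", "4", "nsubj", "_", "_"),
    row("2-3", "don't", "_", "_", "_", "_", "_", "_", "_", "_"),
    row("2", "do", "do", "AUX", "_", "_", "4", "aux", "_", "_"),
    row("3", "n't", "not", "PART", "_", "_", "4", "advmod", "_", "_"),
    row("4", "know", "know", "VERB", "_", "_", "0", "root", "_", "_"),
    "",
])

def multiword_tokens(conllu_text):
    """Return (Counter of multi-word token forms, average number of words per token)."""
    forms = Counter()
    spans = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        tok_id, form = cols[0], cols[1]
        if "-" in tok_id:  # range ID such as 2-3 marks a multi-word token
            start, end = (int(part) for part in tok_id.split("-"))
            forms[form] += 1
            spans.append(end - start + 1)
    average = sum(spans) / len(spans) if spans else 0.0
    return forms, average

forms, average = multiword_tokens(SAMPLE)
print(len(forms), "types,", sum(forms.values()), "tokens, average length", average)
# 2 types, 2 tokens, average length 2.0
```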

Morphology

Nominal Features

Gender

Fem
- ADJ: innovative, sentimentale
- DET: la, une, ma
- NOUN: women, Madam, Mrs, policymakers, wellbeing, additionality, agriculture, bardolatry, codependency, cynicism
- PRON: she, her, one

Masc
- ADJ: Perdu, necessary
- NOUN: Mr, man, men, king, crisis, mycelium, child, children, basis, genius
- PRON: he, him, himself, Nothing

Number

Plur
- ADJ: themselves, innovative
- AUX: are, have, were, do, 're, 've, be
- AUX-Fin: are, have, were, do, 're, 've
- DET: these, those, Many, les
- NOUN: countries, people, states, plays, years, rights, terms, measures, requirements, works
- PRON: we, they, them, those, us, others, these, Many, ourselves, themselves
- VERB-Fin: have, are, know, need, see, include, remain, like, believe, create

Sing
- ADJ: particular, Lankan, clear, following, free, full, general, overall, possible, this
- AUX-Fin: is, was, has, does, were, 's, have, am, do, 'm
- AUX-Part: being
- DET: a, this, an, its, my, that, another, each, every, la
- NOUN: work, Commission, time, Parliament, President, member, license, growth, Directive, programme
- PRON: it, I, he, this, him, everyone, one, what, she, that
- VERB-Fin: is, has, makes, believe, provides, think, appears, comes, remains, seems
- VERB-Part: including, emerging, regarding, concerning, developing, following, rising, amending, arising, growing

Definite

Def
- DET: the, ’s, la, les, Le, une

Ind
- DET: a, an, another, Une

Degree and Polarity

Degree

Cmp
- ADJ: more, greater, better, higher, later, bigger, lower, closer, larger, smaller
- ADV: more, less, later, longer

Pos
- ADJ: other, new, European, economic, financial, social, many, important, first, own
- ADV: real-time

Sup
- ADJ: most, best, greatest, largest, highest, earliest, finest, latest, strongest, biggest
- ADV: least

Polarity

Neg
- ADV: no
- PART: not

Verbal Features

Mood

Imp
- AUX-Fin: do, can, Will, be, have
- VERB-Fin: let, click, Learn, Reach, use, Create, choose, Adjust, Build, Connect

Ind
- AUX-Fin: is, are, was, has, would, should, have, can, shall, will
- VERB-Fin: is, has, have, wrote, know, are, believe, had, need, made
- VERB-Part: annexed

Sub
- AUX-Fin: be
- VERB-Fin: be, express

Tense

Past
- AUX-Fin: was, would, should, were, had, could, did, might, 'd, may
- AUX-Part: been, had
- VERB-Fin: wrote, had, made, became, began, did, provided, died, took, used
- VERB-Part: given, based, made, taken, adopted, used, granted, set, done, entitled

Pres
- AUX: is, are, has, have, can, will, shall, may, do, must
- AUX-Fin: is, are, has, have, can, will, shall, may, do, must
- AUX-Part: being
- VERB-Fin: is, has, have, know, are, believe, need, think, makes, see
- VERB-Inf: live, look
- VERB-Part: including, emerging, developing, regarding, concerning, following, rising, relating, amending, arising

Pronouns, Determiners, Quantifiers

PronType

Art
- DET: the, a, an, another, Le, ’s, L, la, les, une

Dem
- DET: this, such, these, that, those
- PRON: this, that, those, these

Ind
- DET: any, no, some, each, both, every, whatever, certain, numerous, Many
- PRON: all, some, others, each, nothing, Many, other, one, Much, both

Int
- DET: what, which
- PRON: what, who

Neg
- ADV: non, no, none

Prs
- DET: his, their, its, our, your, my, her, ma
- PRON: it, I, we, he, you, they, them, him, everyone, us

Rel
- DET: which
- PRON: which, that, who, what, where, whom, whose, when, whereby

Tot
- DET: all

NumType

Card
- NUM: two, one, 1, three, 2, four, 18, 3, 6, five

Ord
- ADJ: first, last, second, third, II, III, sixth, I, IV, VI
- PRON: first, third, latter, second

Poss

Yes
- DET: his, their, its, our, your, my, her, ma
- PRON: his, our, ours

Person

1
- AUX-Fin: have, am, do, 'm, was
- PRON: I, we, us, me, ourselves
- VERB-Fin: believe, think, have, hope, want, accept, allow, face, feel, know

2
- AUX-Fin: were, are, do, can, Will, have, may
- PRON: you, second
- VERB-Fin: let, Create, Imagine, Learn, Recall, accept, agree, enter, facilitate, own

3
- AUX-Fin: is, was, has, would, should, can, shall, will, may, had
- AUX-Ger: being
- PRON: it, he, they, them, him, everyone, one, she, himself, itself
- VERB-Fin: is, has, wrote, had, made, became, makes, began, provides, did
- VERB-Ger: adapting, guaranteeing, initiating, surpassing
- VERB-Part: granted, coupled, placed, provided, resumed, spent

Other Features

Foreign

Yes
- X: La, Comédie, humaine, De, Illusions, Le, Perdues, Chagrin, Peau, Père
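
The labels used in the listings above, such as AUX-Fin or VERB-Part, appear to combine the UPOS tag with the word's VerbForm value. A minimal sketch of how example words could be grouped this way from the FEATS column of a CoNLL-U file follows. The function names `parse_feats` and `examples_by_feature` and the sample sentence are hypothetical, and folding VerbForm into the label rather than listing it as a feature of its own is an assumption based on the layout of this page.

```python
from collections import Counter, defaultdict

def row(*cols):
    """Join the ten CoNLL-U columns with tabs."""
    return "\t".join(cols)

# Hypothetical sample, not taken from the treebank.
SAMPLE = "\n".join([
    row("1", "She", "she", "PRON", "_",
        "Gender=Fem|Number=Sing|Person=3|PronType=Prs", "3", "nsubj", "_", "_"),
    row("2", "is", "be", "AUX", "_",
        "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin", "3", "cop", "_", "_"),
    row("3", "happy", "happy", "ADJ", "_", "Degree=Pos", "0", "root", "_", "_"),
    "",
])

def parse_feats(feats):
    """Turn 'Number=Sing|Person=3' into {'Number': 'Sing', 'Person': '3'}."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

def examples_by_feature(conllu_text):
    """Map (feature, value, label) to a Counter of word forms."""
    examples = defaultdict(Counter)
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # skip multi-word token ranges and empty nodes
            continue
        form, upos = cols[1], cols[3]
        feats = parse_feats(cols[5])
        verb_form = feats.get("VerbForm")
        label = f"{upos}-{verb_form}" if verb_form else upos
        for name, value in feats.items():
            if name != "VerbForm":  # assumed to be folded into the label
                examples[(name, value, label)][form] += 1
    return examples

for key, forms in sorted(examples_by_feature(SAMPLE).items()):
    print(key, [form for form, _ in forms.most_common(10)])
```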

Syntax

Auxiliary Verbs and Copula

This corpus uses 3 lemmas as copulas (cop). Examples: be, being, is.

This corpus uses 9 lemmas as auxiliaries (aux). Examples: have, will, shall, be, can, may, do, must, might.
This corpus uses 2 lemmas as passive auxiliaries (aux:pass). Examples: be, have.
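
Lemma inventories like these can be gathered by collecting the LEMMA column of every word attached with a given relation. The sketch below assumes CoNLL-U input. The function name `lemmas_by_relation` and the sample sentences are hypothetical.

```python
from collections import defaultdict

def row(*cols):
    """Join the ten CoNLL-U columns with tabs."""
    return "\t".join(cols)

# Hypothetical sample, not taken from the treebank.
SAMPLE = "\n".join([
    row("1", "It", "it", "PRON", "_", "_", "4", "nsubj:pass", "_", "_"),
    row("2", "has", "have", "AUX", "_", "_", "4", "aux", "_", "_"),
    row("3", "been", "be", "AUX", "_", "_", "4", "aux:pass", "_", "_"),
    row("4", "done", "do", "VERB", "_", "_", "0", "root", "_", "_"),
    "",
    row("1", "She", "she", "PRON", "_", "_", "3", "nsubj", "_", "_"),
    row("2", "is", "be", "AUX", "_", "_", "3", "cop", "_", "_"),
    row("3", "happy", "happy", "ADJ", "_", "_", "0", "root", "_", "_"),
    "",
])

def lemmas_by_relation(conllu_text, relations=("cop", "aux", "aux:pass")):
    """Return a dict mapping each requested relation to its sorted list of lemmas."""
    lemmas = defaultdict(set)
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Column 3 is LEMMA, column 8 is DEPREL (1-based numbering).
        if cols[0].isdigit() and cols[7] in relations:
            lemmas[cols[7]].add(cols[2])
    return {relation: sorted(found) for relation, found in lemmas.items()}

print(lemmas_by_relation(SAMPLE))
# {'aux': ['have'], 'aux:pass': ['be'], 'cop': ['be']}
```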

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

nsubj
- VERB-Fin--NOUN (563)
- VERB-Fin--PRON (617)
- VERB-Ger--NOUN (36)
- VERB-Ger--PRON (46)
- VERB-Inf--NOUN (185)
- VERB-Inf--PRON (233)
- VERB-Part--NOUN (133)
- VERB-Part--PRON (91)

obj
- VERB-Fin--NOUN (628)
- VERB-Fin--NOUN-ADP(up) (1)
- VERB-Fin--PRON (120)
- VERB-Ger--NOUN (261)
- VERB-Ger--PRON (17)
- VERB-Inf--NOUN (579)
- VERB-Inf--PRON (78)
- VERB-Part--NOUN (213)
- VERB-Part--NOUN-ADP('s) (1)
- VERB-Part--PRON (21)

iobj
- VERB-Fin--NOUN (2)
- VERB-Fin--PRON (7)
- VERB-Ger--NOUN (3)
- VERB-Ger--PRON (1)
- VERB-Inf--NOUN (2)
- VERB-Inf--PRON (12)
- VERB-Part--NOUN (1)
- VERB-Part--PRON (2)
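
The parent and child patterns above can be approximated by following each word's HEAD pointer in the CoNLL-U data. This is a simplified sketch under stated assumptions: the names `read_sentences`, `verb_label` and `core_argument_patterns` are hypothetical, only exact matches of nsubj, obj and iobj are counted (subtypes such as nsubj:pass are left out), and the adposition suffixes seen in entries like VERB-Fin--NOUN-ADP(up) are not reproduced.

```python
from collections import Counter

CORE_RELATIONS = {"nsubj", "obj", "iobj"}

def row(*cols):
    """Join the ten CoNLL-U columns with tabs."""
    return "\t".join(cols)

# Hypothetical sample, not taken from the treebank.
SAMPLE = "\n".join([
    row("1", "She", "she", "PRON", "_", "Number=Sing|Person=3|PronType=Prs", "2", "nsubj", "_", "_"),
    row("2", "gave", "give", "VERB", "_", "Mood=Ind|Tense=Past|VerbForm=Fin", "0", "root", "_", "_"),
    row("3", "him", "he", "PRON", "_", "Number=Sing|Person=3|PronType=Prs", "2", "iobj", "_", "_"),
    row("4", "books", "book", "NOUN", "_", "Number=Plur", "2", "obj", "_", "SpaceAfter=No"),
    row("5", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
    "",
])

def read_sentences(conllu_text):
    """Yield each sentence as a dict mapping integer word ID to its column list."""
    words = {}
    for line in conllu_text.splitlines():
        if not line.strip():
            if words:
                yield words
            words = {}
        elif not line.startswith("#"):
            cols = line.split("\t")
            if cols[0].isdigit():  # skip multi-word token ranges and empty nodes
                words[int(cols[0])] = cols
    if words:
        yield words

def verb_label(cols):
    """Return the UPOS tag, extended with the VerbForm value when one is present."""
    for pair in cols[5].split("|"):
        if pair.startswith("VerbForm="):
            return cols[3] + "-" + pair.split("=", 1)[1]
    return cols[3]

def core_argument_patterns(conllu_text):
    """Count (relation, 'PARENT--CHILD') pairs for verb parents and noun or pronoun children."""
    patterns = Counter()
    for words in read_sentences(conllu_text):
        for cols in words.values():
            if cols[7] not in CORE_RELATIONS or cols[3] not in {"NOUN", "PRON"}:
                continue
            parent = words.get(int(cols[6]))  # HEAD 0 (root) has no parent entry
            if parent is not None and parent[3] == "VERB":
                patterns[(cols[7], verb_label(parent) + "--" + cols[3])] += 1
    return patterns

for (relation, pattern), count in sorted(core_argument_patterns(SAMPLE).items()):
    print(relation, pattern, count)
# iobj VERB-Fin--PRON 1
# nsubj VERB-Fin--PRON 1
# obj VERB-Fin--NOUN 1
```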

Relations Overview

This corpus uses 10 relation subtypes: acl:relcl, aux:pass, compound:prt, csubj:pass, det:predet, flat:foreign, nmod:npmod, nmod:poss, nmod:tmod, nsubj:pass
The following 3 relation types are not used in this corpus at all: clf, list, reparandum
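
Both statements can be checked against the relation inventory listed in the Relations section of this page. The sketch below copies that inventory into `USED_RELATIONS` and compares it with the 37 universal relations of UD v2. The variable names are illustrative only.

```python
# The 37 universal dependency relations defined in UD v2.
UNIVERSAL_RELATIONS = """
acl advcl advmod amod appos aux case cc ccomp clf compound conj cop csubj
dep det discourse dislocated expl fixed flat goeswith iobj list mark nmod
nsubj nummod obj obl orphan parataxis punct reparandum root vocative xcomp
""".split()

# Relations copied from the Relations list of this treebank.
USED_RELATIONS = """
acl acl:relcl advcl advmod amod appos aux aux:pass case cc ccomp compound
compound:prt conj cop csubj csubj:pass dep det det:predet discourse
dislocated expl fixed flat flat:foreign goeswith iobj mark nmod nmod:npmod
nmod:poss nmod:tmod nsubj nsubj:pass nummod obj obl orphan parataxis punct
root vocative xcomp
""".split()

# A subtype is any relation of the form universal:extension.
subtypes = sorted(rel for rel in USED_RELATIONS if ":" in rel)

# A universal relation counts as used if it occurs bare or as the base of a subtype.
base_used = {rel.split(":")[0] for rel in USED_RELATIONS}
unused = sorted(set(UNIVERSAL_RELATIONS) - base_used)

print(len(subtypes), subtypes)
# 10 ['acl:relcl', 'aux:pass', 'compound:prt', 'csubj:pass', 'det:predet',
#     'flat:foreign', 'nmod:npmod', 'nmod:poss', 'nmod:tmod', 'nsubj:pass']
print(unused)
# ['clf', 'list', 'reparandum']
```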