UD English GUM
Language: English (code: en)
Family: Indo-European, Germanic
This treebank has been part of Universal Dependencies since the UD v2.2 release.
The following people have contributed to making this treebank part of UD: Siyao Peng, Amir Zeldes.
Repository: UD_English-GUM
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.2
License: CC BY-NC-SA 4.0
Genre: academic, fiction, nonfiction, news, spoken, web, wiki
Questions, comments? General annotation questions (either English-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [amir • zeldes (æt) georgetown • edu]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.
Annotation | Source |
---|---|
Lemmas | annotated manually in non-UD style, automatically converted to UD |
UPOS | annotated manually in non-UD style, automatically converted to UD |
XPOS | annotated manually |
Features | annotated manually in non-UD style, automatically converted to UD |
Relations | annotated manually in non-UD style, automatically converted to UD |
Description
Universal Dependencies version of syntax annotations from the GUM corpus (https://corpling.uis.georgetown.edu/gum/)
GUM, the Georgetown University Multilayer corpus, is an open source collection of richly annotated web texts from multiple text types. The corpus is collected and expanded by students as part of the curriculum in the course LING-367 “Computational Corpus Linguistics” at Georgetown University. The selection of text types is meant to represent different communicative purposes, while coming from sources that are readily and openly available (usually Creative Commons licenses), so that new texts can be annotated and published with ease.
The dependencies in the corpus were originally annotated using Stanford Typed Dependencies (de Marneffe & Manning 2013) and converted automatically to UD using DepEdit (https://corpling.uis.georgetown.edu/depedit/). The rule-based conversion takes into account gold annotations found in other annotation layers of the GUM corpus (e.g. entity annotations). The conversion script used can be found in the GUM build bot code, available from the (non-UD) GUM repository. For more details see the corpus website.
Acknowledgments
GUM annotation team (so far - thanks for participating!)
Adrienne Isaac, Akitaka Yamada, Amani Aloufi, Amelia Becker, Andrea Price, Andrew O’Brien, Anna Runova, Anne Butler, Arianna Janoff, Ayan Mandal, Brandon Tullock, Brent Laing, Candice Penelton, Chenyue Guo, Colleen Diamond, Connor O’Dwyer, Dan Simonson, Didem Ikizoglu, Edwin Ko, Emily Pace, Emma Manning, Ethan Beaman, Han Bu, Hang Jiang, Hanwool Choe, Hassan Munshi, Ho Fai Cheng, Jakob Prange, Jehan al-Mahmoud, Jemm Excelle Dela Cruz, Joaquin Gris Roca, John Chi, Jongbong Lee, Juliet May, Katarina Starcevic, Katherine Vadella, Lara Bryfonski, Lindley Winchester, Logan Peng, Lucia Donatelli, Margaret Anne Rowe, Margaret Borowczyk, Maria Stoianova, Mariko Uno, Mary Henderson, Maya Barzilai, Md. Jahurul Islam, Michaela Harrington, Minnie Annan, Mitchell Abrams, Mohammad Ali Yektaie, Naomee-Minh Nguyen, Nicholas Workman, Nicole Steinberg, Rachel Thorson, Rebecca Childress, Ruizhong Li, Ryan Murphy, Sakol Suethanapornkul, Sean Macavaney, Sean Simpson, Shannon Mooney, Siddharth Singh, Siyu Liang, Stephanie Kramer, Sylvia Sierra, Timothy Ingrassia, Wenxi Yang, Xiaopei Wu, Yang Liu, Yilun Zhu, Yingzhu Chen, Yiran Xu, Young-A Son, Yushi Zhao, Zhuxin Wang, Amir Zeldes
… and other annotators who wish to remain anonymous!
References
As a scholarly citation for the corpus in articles, please use this paper:
- Zeldes, Amir (2017) “The GUM Corpus: Creating Multilayer Resources in the Classroom”. Language Resources and Evaluation 51(3), 581–612.
Statistics of UD English GUM
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X
Features
Case – Definite – Degree – Gender – Mood – Number – NumType – Person – Polarity – Poss – PronType – Reflex – Tense – VerbForm
Relations
acl – acl:relcl – advcl – advmod – amod – appos – aux – aux:pass – case – cc – cc:preconj – ccomp – compound – compound:prt – conj – cop – csubj – csubj:pass – dep – det – det:predet – discourse – dislocated – expl – fixed – flat – goeswith – iobj – mark – nmod – nmod:npmod – nmod:poss – nmod:tmod – nsubj – nsubj:pass – nummod – obj – obl – obl:npmod – obl:tmod – orphan – parataxis – punct – reparandum – root – vocative – xcomp
Tokenization and Word Segmentation
- This corpus contains 4399 sentences and 80176 tokens.
- This corpus contains 10536 tokens (13%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 539 types of words that contain both letters and punctuation. Examples: 's, n't, ’s, n’t, 're, 'll, 've, U.S., L'Enfant, ’re, e.g., 'd, ’ve, pro-Beijing, ’ll, 'm, al., i.e., Naqsh-e, how-to, pan-democracy, ’d, St., etc., eye-tracking, north-south, t-shirt, Vava'u, adult-like, anti-establishment, c., e-mail, pan-democrat, re-elected, upside-down, #istandwithahmed, F-E, I-44, Mr., U.S, co-founder, east-west, eco-tourism, follow-up, s/he, to-do, well-known, ’m, Cheuk-yan, Chi-wai
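The tokenization counts above can be approximated from the CoNLL-U files themselves. The following is a minimal sketch, not the script that generated this page: it assumes the standard 10-column CoNLL-U layout, in which a word that is not followed by a space carries `SpaceAfter=No` in the MISC column. The function name `tokenization_stats` and the `SAMPLE` text are hypothetical and serve only as illustration.

```python
def tokenization_stats(conllu_text):
    """Return (sentences, words, words_not_followed_by_space)."""
    sentences = 0
    words = 0
    no_space = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment such as "# text = ...".
            continue
        cols = line.split("\t")
        # Keep only regular word lines; skip multiword token ranges
        # ("1-2") and empty nodes ("1.1"), whose IDs are not plain integers.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        words += 1
        if "SpaceAfter=No" in cols[9].split("|"):
            no_space += 1
    return sentences, words, no_space


# Hypothetical one-sentence sample, invented for this sketch.
SAMPLE = (
    "# text = It's here.\n"
    "1\tIt\tit\tPRON\tPRP\t_\t3\tnsubj\t_\tSpaceAfter=No\n"
    "2\t's\tbe\tAUX\tVBZ\t_\t3\tcop\t_\t_\n"
    "3\there\there\tADV\tRB\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t.\t_\t3\tpunct\t_\t_\n"
    "\n"
)

print(tokenization_stats(SAMPLE))  # (1, 4, 2)
```

The reported share follows from the two counts given above: 10536 / 80176 ≈ 0.131, which rounds to 13%.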
Morphology
Tags
- This corpus uses 17 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
- This corpus contains 11 word types tagged as particles (PART): ', 's, -, n't, n`t, not, n’t, to, ’, ’s, ’t
- This corpus contains 54 lemmas tagged as pronouns (PRON): I, PRP, a, all, another, any, both, each, either, every, half, he, her, herself, him, himself, his, it, its, itself, me, my, myself, one, our, quite, s/he, she, some, such, that, the, their, theirs, them, themselves, there, these, they, this, those, us, we, what, whatever, which, who, whoever, whom, whose, you, your, yours, yourself
- This corpus contains 24 lemmas tagged as determiners (DET): 6, Une, a, all, an, another, any, both, each, either, every, no, other, some, such, that, the, these, this, those, what, whatever, which, you
- Out of the above, 19 lemmas occurred sometimes as PRON and sometimes as DET: a, all, another, any, both, each, either, every, some, such, that, the, these, this, those, what, whatever, which, you
- This corpus contains 18 lemmas tagged as auxiliaries (AUX): 's, able, be, become, ca, can, could, do, get, have, may, might, must, shall, should, will, wo, would
- Out of the above, 6 lemmas occurred sometimes as AUX and sometimes as VERB: be, become, do, get, have, will
- There are 4 (de)verbal forms:
- Fin
- AUX: is, can, was, will, are, would, should, may, were, do
- VERB: have, are, said, is, has, want, think, had, was, says
- Ger
- AUX: being, having
- SCONJ: according, depending, including, regarding
- VERB: using, including, following, making, doing, having, going, living, growing, writing
- Inf
- AUX: be, do, have, get
- VERB: make, have, get, see, do, take, use, know, find, go
- Part
- AUX: been
- SCONJ: based, compared, given, got
- VERB: called, used, known, made, found, given, done, based, elected, seen
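The overlap figures in this list (lemmas that occur sometimes as PRON and sometimes as DET, or sometimes as AUX and sometimes as VERB) amount to a set intersection over the lemmas seen with each UPOS tag. A minimal sketch follows, assuming the standard CoNLL-U column order (LEMMA in column 3, UPOS in column 4); the helper names and the sample data are hypothetical, and this is not necessarily how the page's statistics were produced.

```python
from collections import defaultdict


def lemmas_by_upos(conllu_text):
    """Map each UPOS tag to the set of lemmas it occurs with."""
    table = defaultdict(set)
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Regular word lines have 10 columns and an integer ID.
        if len(cols) == 10 and cols[0].isdigit():
            lemma, upos = cols[2], cols[3]
            table[upos].add(lemma)
    return table


def shared_lemmas(table, tag_a, tag_b):
    """Lemmas attested with both tags, sorted alphabetically."""
    return sorted(table.get(tag_a, set()) & table.get(tag_b, set()))


# Hypothetical sample: "that" appears once as DET and once as PRON.
SAMPLE = (
    "1\tThat\tthat\tDET\tDT\t_\t2\tdet\t_\t_\n"
    "2\tdog\tdog\tNOUN\tNN\t_\t3\tnsubj\t_\t_\n"
    "3\tate\teat\tVERB\tVBD\t_\t0\troot\t_\t_\n"
    "4\tthat\tthat\tPRON\tDT\t_\t3\tobj\t_\t_\n"
    "\n"
    "1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_\n"
    "2\tend\tend\tNOUN\tNN\t_\t0\troot\t_\t_\n"
)

table = lemmas_by_upos(SAMPLE)
print(shared_lemmas(table, "PRON", "DET"))  # ['that']
```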
Nominal Features
- Fem
- PRON: her, she, herself
- Masc
- PRON: he, his, him, himself
- Neut
- PRON: it, its, itself
- Plur
- DET: these, those
- NOUN: people, years, things, days, minutes, ants, hours, children, mice, friends
- PRON: they, we, their, them, our, us, those, themselves, these
- PROPN: States, skittles, Chathams, Mets, Paralympics, Thais, Americans, Games, Netherlands, Thrones
- SCONJ: points
- Sing
- AUX: is, was, has, 's, ’s, does, am, s, Be, isn
- AUX-Fin: is, was, has, 's, ’s, does, am, s, isn
- DET: this, that, The, Une
- NOUN: city, time, way, world, day, language, year, image, something, history
- PRON: it, i, he, her, she, his, this, my, its, that
- PROPN: New, United, Scientology, University, Warhol, York, lee, Party, fort, Wikinews
- SCONJ: suffix
- SYM: %
- VERB-Fin: is, has, was, says, 's, comes, ’s, makes, means, takes
- Acc
- PRON: it, them, me, her, you, him, us, yourself, themselves, himself
- Nom
- DET: you
- PRON: you, it, i, they, we, he, she
- Def
- DET: the
- PRON: the
- Ind
- DET: a, an
- PRON: a
Degree and Polarity
- Cmp
- ADJ: more, better, larger, greater, easier, further, smaller, less, later, higher
- ADV: less, earlier, later, longer, further, better, sooner, worse
- Pos
- ADJ: many, other, new, good, important, own, first, last, different, large
- ADV: well, far, long, soon, little, badly, early, fast, late, close
- AUX: able
- DET: other
- SCONJ: such
- Sup
- ADJ: most, best, least, largest, highest, latest, greatest, biggest, hardest, hottest
- ADV: best, fastest, highest, least
- Neg
- ADV: never, no
- DET: no
- PART: not, n't, n’t, ’t
Verbal Features
- Ind
- AUX-Fin: is, was, are, were, do, has, 's, have, had, did
- VERB-Fin: have, are, said, is, has, want, think, had, was, says
- Past
- AUX-Fin: was, were, had, did, ’d, 'd, became, got, where
- AUX-Part: been
- SCONJ-Part: based, compared, given, got
- VERB-Fin: said, had, was, came, took, wanted, became, made, started, used
- VERB-Part: called, used, known, made, found, given, done, based, elected, seen
- Pres
- AUX-Fin: is, are, do, has, 's, have, 're, ’s, does, 've
- VERB-Fin: have, are, is, has, want, think, says, 's, know, comes
- VERB-Part: going, doing, looking, trying, getting, taking, making, working, moving, talking
Pronouns, Determiners, Quantifiers
- Art
- DET: the, a, an
- PRON: a, the
- Dem
- ADV: then, there, here
- DET: this, these, that, those
- PRON: this, that, those, these
- Int
- DET: what, which, whatever
- PRON: what, which, who, whatever, whose, whoever, whom
- SCONJ: when, how, why, where, wherever, Whenever, While
- Prs
- DET: you
- PRON: you, it, i, your, they, we, he, her, she, his
- Rel
- DET: that
- PRON: that, which, who, what, whom
- SCONJ: where, when, why
- Card
- DET: 6
- NUM: one, two, 1, 2, 3, 15, 4, four, 10, 5
- Mult
- ADV: once, twice
- SCONJ: once
- Ord
- ADJ: first, second, third, 19th, fourth, 20th, 10th, 30th, 135th, 164th
- Yes
- PRON: your, his, their, my, her, its, our, whose, yours
- Yes
- PRON: yourself, themselves, himself, itself, myself, herself
- 1
- AUX-Fin: am, was
- PRON: i, we, my, me, our, us, myself
- VERB-Fin: was
- 2
- DET: you
- PRON: you, your, yourself, yours
- 3
- AUX-Fin: is, was, has, 's, ’s, does, s, isn
- PRON: it, they, he, her, she, his, their, them, its, him
- VERB-Fin: is, has, says, 's, was, comes, ’s, makes, means, takes
Other Features
Syntax
Auxiliary Verbs and Copula
- This corpus uses 4 lemmas as copulas (cop). Examples: be, 's, able, become.
- This corpus uses 14 lemmas as auxiliaries (aux). Examples: have, do, will, can, be, would, should, may, could, might, wo, ca, must, shall.
- This corpus uses 2 lemmas as passive auxiliaries (aux:pass). Examples: be, get.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
- VERB-Fin--NOUN (876)
- VERB-Fin--PRON (269)
- VERB-Fin--PRON-Acc (1)
- VERB-Fin--PRON-Nom (1095)
- VERB-Ger--NOUN (24)
- VERB-Ger--PRON (5)
- VERB-Ger--PRON-Acc (3)
- VERB-Ger--PRON-Nom (15)
- VERB-Inf--NOUN (261)
- VERB-Inf--PRON (75)
- VERB-Inf--PRON-ADP(of) (1)
- VERB-Inf--PRON-Acc (11)
- VERB-Inf--PRON-Nom (565)
- VERB-Part--NOUN (162)
- VERB-Part--PRON (29)
- VERB-Part--PRON-Acc (1)
- VERB-Part--PRON-Nom (211)
- obj
- VERB-Fin--NOUN (1007)
- VERB-Fin--PRON (55)
- VERB-Fin--PRON-Acc (130)
- VERB-Fin--PRON-Nom (2)
- VERB-Ger--NOUN (430)
- VERB-Ger--PRON (10)
- VERB-Ger--PRON-Acc (35)
- VERB-Ger--PRON-Nom (1)
- VERB-Inf--NOUN (1262)
- VERB-Inf--PRON (62)
- VERB-Inf--PRON-Acc (178)
- VERB-Inf--PRON-Nom (2)
- VERB-Part--NOUN (181)
- VERB-Part--PRON (18)
- VERB-Part--PRON-Acc (24)
- VERB-Part--PRON-Nom (1)
- iobj
- VERB-Fin--NOUN (9)
- VERB-Fin--PRON-Acc (19)
- VERB-Ger--NOUN (6)
- VERB-Ger--PRON-Acc (5)
- VERB-Inf--NOUN (13)
- VERB-Inf--PRON-Acc (14)
- VERB-Part--PRON (1)
- VERB-Part--PRON-Acc (1)
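Pattern labels such as `VERB-Fin--PRON-Nom` combine the parent's UPOS and `VerbForm` with the child's UPOS and `Case`. One way such counts could be derived from CoNLL-U data is sketched below. This is an approximation under stated assumptions, not the script behind this page: all function names and the sample sentence are hypothetical, and the sketch ignores adposition-marked patterns such as `PRON-ADP(of)`.

```python
from collections import Counter


def parse_feats(feats):
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def read_sentences(conllu_text):
    """Yield each sentence as a list of 10-column word lines."""
    sentence = []
    for line in conllu_text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) == 10 and cols[0].isdigit():
            sentence.append(cols)
    if sentence:
        yield sentence


def label(cols, feature):
    """UPOS, optionally suffixed with one feature value (e.g. VERB-Fin)."""
    value = parse_feats(cols[5]).get(feature)
    return cols[3] if value is None else f"{cols[3]}-{value}"


def core_argument_patterns(conllu_text, relations=("nsubj", "obj", "iobj")):
    """Count (relation, parent--child) patterns for VERB parents
    and NOUN or PRON children."""
    counts = Counter()
    for sentence in read_sentences(conllu_text):
        by_id = {cols[0]: cols for cols in sentence}
        for cols in sentence:
            parent = by_id.get(cols[6])  # None for the root (HEAD = 0)
            if parent is None or cols[7] not in relations:
                continue
            if parent[3] != "VERB" or cols[3] not in ("NOUN", "PRON"):
                continue
            pattern = f"{label(parent, 'VerbForm')}--{label(cols, 'Case')}"
            counts[(cols[7], pattern)] += 1
    return counts


# Hypothetical sample sentence, invented for this sketch.
SAMPLE = (
    "1\tShe\tshe\tPRON\tPRP\tCase=Nom|Number=Sing|Person=3\t2\tnsubj\t_\t_\n"
    "2\tsaw\tsee\tVERB\tVBD\tMood=Ind|Tense=Past|VerbForm=Fin\t0\troot\t_\t_\n"
    "3\thim\the\tPRON\tPRP\tCase=Acc|Number=Sing|Person=3\t2\tobj\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
)

for (relation, pattern), n in sorted(core_argument_patterns(SAMPLE).items()):
    print(relation, pattern, n)
```

Keying the counter on the relation together with the pattern keeps the output close to the nested layout of the list above.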
Verbs with Reflexive Core Objects
- This corpus contains 29 lemmas that occur at least once with a reflexive core object (obj or iobj). Examples: call themselves, force yourself, give yourself, proclaim himself, Declare himself, ask yourself, attach itself, call myself, coin myself, comfort yourself, declare myself, expose yourself, find yourself, fling themselves, give themselves, go yourself, infect themselves, introduce themselves, make herself, make yourself, prepare yourself, prove itself, redesign itself, remind yourself, reposition itself, support himself, teach himself, tell yourself, turn himself
Relations Overview
- This corpus uses 12 relation subtypes: acl:relcl, aux:pass, cc:preconj, compound:prt, csubj:pass, det:predet, nmod:npmod, nmod:poss, nmod:tmod, nsubj:pass, obl:npmod, obl:tmod
- The following 2 relation types are not used in this corpus at all: clf, list
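Both figures in this overview can be checked against the relation list given in the Relations section of this page. The sketch below hard-codes that list together with the 37 universal relations of UD v2; the function names are hypothetical, and the check is only an illustration of the arithmetic, not the tool that generated the page.

```python
# The 37 universal dependency relations defined in UD v2.
UNIVERSAL_RELATIONS = set("""
acl advcl advmod amod appos aux case cc ccomp clf compound conj cop csubj
dep det discourse dislocated expl fixed flat goeswith iobj list mark nmod
nsubj nummod obj obl orphan parataxis punct reparandum root vocative xcomp
""".split())

# Relations listed for this treebank in the Relations section of this page.
USED_RELATIONS = """
acl acl:relcl advcl advmod amod appos aux aux:pass case cc cc:preconj ccomp
compound compound:prt conj cop csubj csubj:pass dep det det:predet discourse
dislocated expl fixed flat goeswith iobj mark nmod nmod:npmod nmod:poss
nmod:tmod nsubj nsubj:pass nummod obj obl obl:npmod obl:tmod orphan parataxis
punct reparandum root vocative xcomp
""".split()


def relation_subtypes(relations):
    """Relations of the form 'universal:subtype', sorted alphabetically."""
    return sorted(r for r in set(relations) if ":" in r)


def unused_universal(relations):
    """Universal relations that never occur, with or without a subtype."""
    base = {r.split(":", 1)[0] for r in relations}
    return sorted(UNIVERSAL_RELATIONS - base)


print(len(relation_subtypes(USED_RELATIONS)))  # 12
print(unused_universal(USED_RELATIONS))        # ['clf', 'list']
```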