UD Naija NSC
Language: Naija (code: pcm)
Family: Creole
This treebank has been part of Universal Dependencies since the UD v2.2 release.
The following people have contributed to making this treebank part of UD: Bernard Caron, Marine Courtin, Kim Gerdes, Sylvain Kahane, Sandra Bellato, Manying Zhang.
Repository: UD_Naija-NSC
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.2
License: CC BY-SA 4.0
Genre: spoken
Questions, comments?
General annotation questions (either Naija-specific or cross-linguistic) can be raised in the main UD issue tracker.
You can report bugs in this treebank in the treebank-specific issue tracker on GitHub.
If you want to collaborate, please contact [kim (æt) gerdes • fr].
Development of the treebank happens in the UD repository but not directly in the final CoNLL-U files.
You may submit bug fixes as pull requests against the dev branch, but you have to go to the folder called not-to-release and locate the source files there.
Contact the treebank maintainers if in doubt.
Annotation | Source |
---|---|
Lemmas | assigned by a program, not checked manually |
UPOS | annotated manually, natively in UD style |
XPOS | not available |
Features | not available |
Relations | annotated manually, natively in UD style |
Description
A Universal Dependencies corpus for spoken Naija (Nigerian Pidgin).
The corpus is based on dialogues and monologues and comprises 948 sentences and 12863 tokens.
Sentences are annotated with the following metadata:
- sent_id (which also indicates the file)
- text
- text_en (English translation)
- speaker (for dialogues).
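As a minimal sketch of how these metadata fields could be read, the following Python collects the `# key = value` comment lines that precede each sentence in a CoNLL-U file. The sample sentence, its `sent_id` and its speaker label are invented for illustration and are not taken from the corpus; the function name is likewise hypothetical.

```python
# Hypothetical sample in CoNLL-U layout; NOT an actual sentence from the treebank.
SAMPLE = """\
# sent_id = SAMPLE_01__1
# text = I dey go market
# text_en = I am going to the market
# speaker = Sp1
1\tI\tI\tPRON\t_\t_\t3\tnsubj\t_\t_
2\tdey\tdey\tAUX\t_\t_\t3\taux\t_\t_
3\tgo\tgo\tVERB\t_\t_\t0\troot\t_\t_
4\tmarket\tmarket\tNOUN\t_\t_\t3\tobj\t_\t_

"""


def read_sentence_metadata(conllu_text):
    """Return one dict of '# key = value' comments per sentence that has any."""
    sentences = []
    current = {}
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence block.
            if current:
                sentences.append(current)
                current = {}
            continue
        if line.startswith("#") and "=" in line:
            key, _, value = line[1:].partition("=")
            current[key.strip()] = value.strip()
    if current:
        sentences.append(current)
    return sentences


for meta in read_sentence_metadata(SAMPLE):
    # 'speaker' is only present for dialogues, so fall back to a placeholder.
    print(meta["sent_id"], "|", meta["text_en"], "|", meta.get("speaker", "-"))
```

Using `dict.get` for `speaker` reflects the fact that this field is only expected in dialogues.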
Acknowledgments
The treebank was created within the NaijaSynCor project, directed by Bernard Caron and funded by the ANR, the French National Research Agency.
This corpus is a pilot for the larger corpus being built as part of the NaijaSynCor Project (Projet-ANR-16-CE27-0007). Its main aim is to develop and test the annotation scheme and procedures used in the ANR project. It will be part of a larger 500k-word corpus that will be projected onto prosodic and information structures and analysed for sociolinguistic variation (http://naijasyncor.huma-num.fr/).
The pilot corpus was recorded in various locations in Ibadan (Nigeria) by Bukola Babalola and Opeyemi Lewis. It was transcribed, translated and tagged manually using ELAN-CorpA (http://llacan.vjf.cnrs.fr/res_ELAN-CorpA_en.php) by Folakemi Ladoja, Emeka Onwuegbuzia, Biola Oyelere and Samson Tella under the supervision of Bernard Caron. It was converted to CoNLL by Mourad Aouini. The final Universal Dependencies annotations have been manually checked by Sandra Bellato, Marine Courtin, Bernard Caron, Kim Gerdes, Sylvain Kahane, and Manying Zhang using the processing chain developed by Kim Gerdes. The guidelines were written by Marine Courtin and Sandra Bellato under the supervision of Sylvain Kahane, Bernard Caron, and Kim Gerdes.
Statistics of UD Naija NSC
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X
Features
Relations
acl – acl:cleft – acl:relcl – advcl – advcl:periph – advmod – advmod:emph – amod – appos – aux – aux:pass – case – cc – ccomp – ccomp:cleft – compound – compound:prt – compound:redup – compound:svc – conj – conj:appos – conj:coord – conj:dicto – cop – csubj – csubj:quasi – det – det:predet – discourse – dislocated – expl – fixed – flat – goeswith – iobj – list – mark – nmod – nmod:npmod – nmod:poss – nsubj – nsubj:expl – nsubj:pass – nsubj:quasi – nummod – obj – obl – obl:arg – obl:mod – obl:periph – orphan – parataxis:conj – parataxis:discourse – parataxis:dislocated – parataxis:obj – parataxis:parenth – punct – reparandum – root – vocative – xcomp
Tokenization and Word Segmentation
- This corpus contains 948 sentences and 12861 tokens.
- All tokens in this corpus are followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 12 types of words that contain both letters and punctuation. Examples: a'ah, "uh", twenty-seventh, I.B.B., "ah", "ehen", Port-Harcourt, Third-Mainland, how-I-for-do, o'clock, so-so, twenty-eighth
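One way to approximate the "letters and punctuation" check is sketched below. It assumes that "punctuation" means any character in a Unicode `P*` category; the actual criterion used to produce these statistics is not stated here, so this is an illustration only, and the function name is hypothetical.

```python
import unicodedata


def mixes_letters_and_punctuation(word):
    """True if the word has at least one letter and one punctuation character."""
    has_letter = any(ch.isalpha() for ch in word)
    # Assumption: punctuation = Unicode general categories starting with 'P'.
    has_punct = any(unicodedata.category(ch).startswith("P") for ch in word)
    return has_letter and has_punct


# Word forms mentioned on this page, plus a bare comma for contrast.
words = ["twenty-seventh", "I.B.B.", "o'clock", "dey", ","]
mixed = [w for w in words if mixes_letters_and_punctuation(w)]
print(mixed)
```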
Morphology
Tags
- This corpus uses 17 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
- This corpus contains 16 word types tagged as particles (PART): be, dem, dey, ma, na, naim, no, not, now, n~, o, oya, sef, sey, sha, to
- This corpus contains 36 lemmas tagged as pronouns (PRON): I, am, anyone, dat, de, deir, dem, dis, dose, e, each, everything, her, if, im, imself, it, ma, me, naim, one, our, own, she, some, something, una, unasef, us, we, wetin, which, who, you, your, yourself
- This corpus contains 13 lemmas tagged as determiners (DET): a, all, anoder, any, dat, dese, di, dis, dose, every, one, some, which
- Out of the above, 6 lemmas occurred sometimes as PRON and sometimes as DET: dat, dis, dose, one, some, which
- This corpus contains 10 lemmas tagged as auxiliaries (AUX): come, de, dey, don, fit, for, go, make, neva, will
- Out of the above, 4 lemmas occurred sometimes as AUX and sometimes as VERB: come, dey, go, make
- This corpus does not use the VerbForm feature.
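The PRON/DET overlap reported in this list can be reproduced from the two lemma lists as a set intersection. The sketch below simply copies the lemmas given above into Python sets; the variable names are illustrative.

```python
# Lemma lists copied from the PRON and DET bullets on this page.
PRON_LEMMAS = set(
    "I am anyone dat de deir dem dis dose e each everything her if im imself "
    "it ma me naim one our own she some something una unasef us we wetin "
    "which who you your yourself".split()
)
DET_LEMMAS = set(
    "a all anoder any dat dese di dis dose every one some which".split()
)

# Lemmas that occur sometimes as PRON and sometimes as DET.
both = sorted(PRON_LEMMAS & DET_LEMMAS)
print(len(PRON_LEMMAS), len(DET_LEMMAS), both)
# prints: 36 13 ['dat', 'dis', 'dose', 'one', 'some', 'which']
```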
Nominal Features
Degree and Polarity
Verbal Features
Pronouns, Determiners, Quantifiers
Other Features
Syntax
Auxiliary Verbs and Copula
- This corpus uses 4 lemmas as copulas (cop). Examples: na, be, dey, is.
- This corpus uses 12 lemmas as auxiliaries (aux). Examples: dey, go, make, don, come, fit, neva, for, I, de, do, will.
- This corpus uses 1 lemma as a passive auxiliary (aux:pass). Example: dey.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
- VERB--NOUN (242)
- VERB--PRON (848)
- VERB--PRON-ADP(of) (1)
- obj
- VERB--NOUN (428)
- VERB--NOUN-ADP(dem) (1)
- VERB--PRON (241)
- iobj
- VERB--NOUN (1)
- VERB--PRON (29)
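Counts of this kind could be gathered with a sketch like the one below, which tallies `nsubj`, `obj` and `iobj` relations whose parent is a VERB and whose child is a NOUN or PRON. It is a simplification under stated assumptions: it ignores the case-marking detail shown in patterns such as `VERB--PRON-ADP(of)`, and the sample sentence is invented, not taken from the corpus.

```python
from collections import Counter

CORE_RELATIONS = {"nsubj", "obj", "iobj"}

# Hypothetical sample in CoNLL-U layout; NOT an actual sentence from the treebank.
SAMPLE = """\
# sent_id = SAMPLE_01__1
1\tI\tI\tPRON\t_\t_\t3\tnsubj\t_\t_
2\tdey\tdey\tAUX\t_\t_\t3\taux\t_\t_
3\tgo\tgo\tVERB\t_\t_\t0\troot\t_\t_
4\tmarket\tmarket\tNOUN\t_\t_\t3\tobj\t_\t_
"""


def count_core_arguments(conllu_text):
    """Count (relation, 'VERB--<child UPOS>') pairs for core arguments."""
    counts = Counter()
    # Sentences are assumed to be separated by exactly one empty line.
    for block in conllu_text.strip().split("\n\n"):
        tokens = {}  # token id -> (UPOS, head id, relation)
        for line in block.splitlines():
            cols = line.split("\t")
            # Skip comments, multiword ranges and empty nodes.
            if line.startswith("#") or len(cols) != 10 or not cols[0].isdigit():
                continue
            tokens[cols[0]] = (cols[3], cols[6], cols[7])
        for upos, head, deprel in tokens.values():
            parent = tokens.get(head)
            if (deprel in CORE_RELATIONS and parent is not None
                    and parent[0] == "VERB" and upos in {"NOUN", "PRON"}):
                counts[(deprel, f"VERB--{upos}")] += 1
    return counts


for (relation, pattern), n in sorted(count_core_arguments(SAMPLE).items()):
    print(relation, pattern, n)
```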
Relations Overview
- This corpus uses 27 relation subtypes: acl:cleft, acl:relcl, advcl:periph, advmod:emph, aux:pass, ccomp:cleft, compound:prt, compound:redup, compound:svc, conj:appos, conj:coord, conj:dicto, csubj:quasi, det:predet, nmod:npmod, nmod:poss, nsubj:expl, nsubj:pass, nsubj:quasi, obl:arg, obl:mod, obl:periph, parataxis:conj, parataxis:discourse, parataxis:dislocated, parataxis:obj, parataxis:parenth
- The following main type is never used alone; it is always subtyped: parataxis
- The following 2 relation types are not used in this corpus at all: clf, dep
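The distinction between subtypes and main types that only occur subtyped can be derived from the set of relation labels alone. The sketch below assumes that a subtype is any label containing a colon; it is run on a small subset of the labels listed on this page, and the function name is hypothetical.

```python
def summarize_relations(labels):
    """Return (subtyped labels, main types that never occur without a subtype)."""
    used = set(labels)
    subtypes = sorted(label for label in used if ":" in label)
    main_types_of_subtypes = {label.split(":")[0] for label in subtypes}
    always_subtyped = sorted(main_types_of_subtypes - used)
    return subtypes, always_subtyped


# A subset of the relation labels listed on this page, for illustration.
labels = ["nsubj", "nsubj:pass", "obl", "obl:arg",
          "parataxis:obj", "parataxis:conj", "root"]
subtypes, always_subtyped = summarize_relations(labels)
print(subtypes)
print(always_subtyped)
```

On this subset, `parataxis` is the only main type reported as always subtyped, because `nsubj` and `obl` also appear without a subtype.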