UD Bambara CRB
Language: Bambara (code: bm)
Family: Mande
This treebank has been part of Universal Dependencies since the UD v2.3 release.
The following people have contributed to making this treebank part of UD: Katya Aplonova, Francis Tyers.
Repository: UD_Bambara-CRB
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.2
License: CC BY-SA 4.0
Genre: nonfiction, news
Questions, comments? General annotation questions (either Bambara-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [aplooon (æt) gmail • com]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.
Annotation | Source |
---|---|
Lemmas | annotated manually in non-UD style, automatically converted to UD |
UPOS | annotated manually in non-UD style, automatically converted to UD |
XPOS | annotated manually |
Features | annotated manually in non-UD style, automatically converted to UD |
Relations | annotated manually in non-UD style, automatically converted to UD |
Description
The UD Bambara treebank is a section of the Corpus Référence du Bambara annotated natively with Universal Dependencies.
Bambara (also known as Bamana) is the most widely spoken language of the Manding language group (Niger-Congo > Mande > Western Mande). It is spoken mainly in Mali by 13-14 million people; of these, around four million are L1 speakers. Development of the Bambara Reference Corpus started in April 2012 (Vydrin 2013, Maslinsky 2014). The corpus includes a non-disambiguated sub-corpus and a disambiguated one. At present, the whole corpus contains about nine million tokens. The corpus was annotated using the UD Annotatrix annotation tool (Tyers, Sheyanova, Washington 2018).
Documentation for the treebank is available on the UD site (http://universaldependencies.org/bm/dep/).
Acknowledgments
The conversion and annotation were done by Katya Aplonova and Francis M. Tyers at the Higher School of Economics in Moscow. We would like to thank the developers and annotators of the Corpus Référence du Bambara for permission to base this treebank on their work.
References
- Maslinsky, K. (2014). Daba: a model and tools for Manding corpora. In Proceedings of TALAf 2014: Traitement Automatique des Langues Africaines, pages 114–122.
- Tyers, F. M., Sheyanova, M., and Washington, J. N. (2018). UD Annotatrix: An annotation tool for Universal Dependencies. In Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories.
- Vydrin, V. (2013). Bamana reference corpus (BRC). Procedia - Social and Behavioral Sciences, 95, pages 75–80.
Statistics of UD Bambara CRB
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – VERB – X
Features
AdjType – Aspect – Definite – Mood – Number – NumType – Person – Polarity – PronType – Tense – Valency – VerbForm – Voice
Relations
acl – advcl – advmod – amod – appos – aux – case – cc – ccomp – compound – compound:redup – conj – csubj – dep – det – det:rel – discourse – dislocated – fixed – flat – mark – nmod – nmod:poss – nsubj – nummod – obj – obl – orphan – parataxis – parataxis:obj – punct – reparandum – root – vocative – xcomp
Tokenization and Word Segmentation
- This corpus contains 1026 sentences and 13823 tokens.
- This corpus contains 1843 tokens (13%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 6 types of words that contain both letters and punctuation. Examples: k', n', y', b', kelen-kelen, t'
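Counts like the ones above can be recomputed from a CoNLL-U file. The following is a minimal sketch, not the script that generated this page: it assumes the standard 10-column CoNLL-U layout, treats every integer-ID line as one token (which only matches the reported figures if the treebank has no multiword tokens), and reads `SpaceAfter=No` from the MISC column. The `SAMPLE` text is a made-up toy example built from word forms listed on this page, not an excerpt from the treebank.

```python
def tokenization_stats(conllu_text):
    """Return (sentences, tokens, tokens_without_following_space)."""
    sentences = tokens = no_space_after = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Skip malformed lines, multiword-token ranges (1-2), empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        if "SpaceAfter=No" in cols[9].split("|"):
            no_space_after += 1
    return sentences, tokens, no_space_after


# Toy input (hypothetical, for illustration only).
SAMPLE = (
    "# text = a nana.\n"
    "1\ta\t_\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tnana\t_\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t_\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# text = denw nana.\n"
    "1\tdenw\t_\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tnana\t_\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t_\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

sentences, tokens, no_space = tokenization_stats(SAMPLE)
print(sentences, tokens, no_space)  # 2 6 2
```

Applied to the figures reported above, the share of tokens not followed by a space is `round(100 * 1843 / 13823)`, which gives the stated 13%.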
Morphology
Tags
- This corpus uses 16 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB, X
- This corpus does not use the following tags: SYM
- This corpus contains 20 word types tagged as particles (PART): bada, bani, bilen, de, dennin, dè, dun, dɛ, fana, hali, k', ko, koyi, kòni, le, sa, tun, wa, wo, yo
- This corpus contains 36 lemmas tagged as pronouns (PRON): _, à, àle, àlê, á, án, áw, bɛ́ɛ, dì, dɔ̀w, dɔ́, dɔ́w, é, ê, í, jɔ̀n, jɔ́n, jɔ́nì, minw, mín, mîn, mun, mùn, mùnna, né, nê, nìn, ń, ò, ò.lú, òlû, ó, sí, ù, ɔn, ɲɔ́gɔn
- This corpus contains 19 lemmas tagged as determiners (DET): _, bɛ́, bɛ́ɛ, dòn, dɔw, dɔ́, dɔ́rɔn, ìn, jùmɛn, minw, mín, mîn, ninw, nìn, ò, sí, wɛ́rɛ, yɛ̀rɛ, yɛ̀rɛ̂
- Out of the above, 9 lemmas occurred sometimes as PRON and sometimes as DET: _, bɛ́ɛ, dɔ́, minw, mín, mîn, nìn, ò, sí
- This corpus contains 31 lemmas tagged as auxiliaries (AUX): _, b', be, bé, bɛ, bɛ́, bɛ́na, bɛ́nà, dìyé, dòn, k', ka, kà, kàna, kànâ, ká, ma, mà, má, mán, mána, mánà, n', nà, té, tɛ, tɛ́, tɛ́nà, y', ye, yé
- Out of the above, 9 lemmas occurred sometimes as AUX and sometimes as VERB: _, b', bé, bɛ́, dòn, nà, tɛ, tɛ́, yé
- There are 2 (de)verbal forms:
  - Part
    - ADJ: jalenba, jelenba, sigilen
    - VERB: nalen, sigilen, kotò, selen, jèlen, bannen, bintò, bòlen, bònnen, dalen
  - Vnoun
    - NOUN: kanliba, falennò, tobili, nyininkali, FURULI, foli, furakèli, nyinini
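The overlap lists above (lemmas seen both as PRON and DET, or both as AUX and VERB) amount to a set intersection over lemmas grouped by UPOS tag. Here is a minimal sketch under the same CoNLL-U assumptions; the function names and the `ROWS` data are hypothetical, and the rows are illustrative token lines rather than valid trees from the treebank.

```python
from collections import defaultdict


def lemmas_by_upos(conllu_text):
    """Map each UPOS tag to the set of lemmas that occur with it."""
    table = defaultdict(set)
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Keep only regular word lines (integer ID, 10 columns).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        lemma, upos = cols[2], cols[3]
        table[upos].add(lemma)
    return table


def shared_lemmas(table, tag_a, tag_b):
    """Lemmas that occur at least once with each of the two tags."""
    return sorted(table.get(tag_a, set()) & table.get(tag_b, set()))


# Toy token lines (hypothetical; HEAD and DEPREL are left unspecified).
ROWS = (
    "1\tnin\tnìn\tDET\t_\t_\t_\t_\t_\t_\n"
    "2\tnin\tnìn\tPRON\t_\t_\t_\t_\t_\t_\n"
    "3\ta\tà\tPRON\t_\t_\t_\t_\t_\t_\n"
    "4\tin\tìn\tDET\t_\t_\t_\t_\t_\t_\n"
)

table = lemmas_by_upos(ROWS)
print(shared_lemmas(table, "PRON", "DET"))  # ['nìn']
```

The same `shared_lemmas(table, "AUX", "VERB")` call would reproduce the second overlap list when run on the full treebank file.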
Nominal Features
- Plur
  - DET: ninw, dòw, minw
  - NOUN: kòròkèw, denw, misiw, sagaw, dunanw, gòòtèw, julaw, nyèdenw, sosow, surukuw
  - PRON: u, olu, an, aw, dòw, dow, minw, a
  - PROPN: warabaw
- Sing
  - PRON: a, n, ne, e, i, ale, ele, à
- Def
  - DET: nin, ninw, in
  - PRON: nin
Degree and Polarity
- Neg
  - AUX: tè, ma, kana, te, man, tèna
  - VERB: tè, tɛ
- Pos
  - AUX: ye, bè, ka, be, mana, bèna, y', bɛ, na, b'
  - VERB: tagara, bè, ye, nana, bòra, sera, kèra, tora, cira, banna
Verbal Features
- Imp (Aspect)
  - AUX: bè, tè, be, te, bɛ, b'
  - VERB: be, bè, tè, tɛ
- Perf
  - ADJ-Part: jalenba, jelenba, sigilen
  - AUX: ye, ma, y'
  - VERB: tagara, nana, bòra, sera, kèra, tora, cira, banna, bolila, donna
  - VERB-Part: nalen, sigilen, selen, jèlen, bannen, bòlen, bònnen, dalen, dibilen, dilen
- Prog
  - VERB-Part: kotò, bintò, natò
- Cnd
  - AUX: mana
- Imp (Mood)
  - AUX: kana, ye
- Sub
  - AUX: ka
- Fut
  - AUX: bèna, na, n', tèna
- Past
  - PART: tun
- Cau
  - VERB: labò, lajigin, lajè, dalajè, laminè, latila
Pronouns, Determiners, Quantifiers
- Dem
  - DET: nin, ninw
  - PRON: nin
- Emp
  - PRON: ne, e, ale, aw, ele
- Prs
  - PRON: a, n, i, u, olu, an, à
- Rcp
  - PRON: nyògòn, nyògon
- Rel
  - DET: min, minw
  - PRON: min, minw, mun
- Card
  - NUM: 6
- Ord
  - ADJ: SABANAN, filanan, tannan
  - NUM: NAN
- 1
  - PRON: n, ne, an
- 2
  - PRON: e, i, aw, a
- 3
  - PRON: a, u, ale, ele, e, à
Other Features
- AdjType
  - Attr
    - ADJ: caman, camanba, KEGUNMAN
- Valency
  - 1
    - VERB: tagara, nana, bòra, sera, kèra, tora, cira, banna, bolila, donna
  - 2
    - AUX: ye, y'
Syntax
Auxiliary Verbs and Copula
- This corpus does not contain copulas.
- This corpus uses 37 lemmas as auxiliaries (aux). Examples: kà, yé, ye, bɛ́, bɛ, tɛ́, ka, _, ká, tùn, má, ma, tɛ, bé, dìyé, kàna, mána, bɛ́na, kànâ, y', mà, nà, té, k', mánà, à, b', bá, be, bɛ́nà, dòn, fulakɛ, kɔ́, mán, n', tún, tɛ́nà.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB--NOUN (534)
  - VERB--PRON (1212)
  - VERB-Part--NOUN (15)
  - VERB-Part--PRON (15)
- obj
  - VERB--NOUN (517)
  - VERB--PRON (500)
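The parent--child counts above can be approximated by walking each sentence and pairing every `nsubj` or `obj` dependent with its head. The sketch below is hypothetical, not the generator of this page. It assumes that the `VERB-Part` label means a VERB head carrying `VerbForm=Part` in its FEATS column, and it runs on made-up toy sentences built from word forms listed on this page.

```python
from collections import Counter


def head_label(upos, feats):
    """Append '-Part' to the tag of participial heads (assumed convention)."""
    return upos + "-Part" if "VerbForm=Part" in feats.split("|") else upos


def core_argument_pairs(conllu_text, relations=("nsubj", "obj")):
    """Count (relation, 'PARENT--CHILD') pairs for verb heads with noun or pronoun children."""
    counts = Counter()
    for block in conllu_text.strip().split("\n\n"):
        rows = [line.split("\t") for line in block.splitlines()
                if not line.startswith("#")]
        rows = [r for r in rows if len(r) == 10 and r[0].isdigit()]
        by_id = {r[0]: r for r in rows}
        for r in rows:
            parent = by_id.get(r[6])  # HEAD "0" (root) has no parent row
            if r[7] not in relations or parent is None:
                continue
            if parent[3] != "VERB" or r[3] not in ("NOUN", "PRON"):
                continue
            pair = head_label(parent[3], parent[5]) + "--" + r[3]
            counts[(r[7], pair)] += 1
    return counts


# Toy sentences (hypothetical, for illustration only).
TOY = (
    "1\ta\t_\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tnana\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tdenw\t_\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tnana\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "1\ta\t_\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tnalen\t_\tVERB\t_\tAspect=Perf|VerbForm=Part\t0\troot\t_\t_\n"
)

for key, n in sorted(core_argument_pairs(TOY).items()):
    print(key, n)
```

On the toy input, each of the three pairs `VERB--NOUN`, `VERB--PRON`, and `VERB-Part--PRON` is counted once under `nsubj`.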
Relations Overview
- This corpus uses 4 relation subtypes: compound:redup, det:rel, nmod:poss, parataxis:obj
- The following 6 relation types are not used in this corpus at all: iobj, expl, cop, clf, list, goeswith
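The two figures above follow from the relation list given in the Relations section of this page. As a small consistency check, the sketch below compares that list against the 37 universal relations of the UD v2 guidelines; the constant and function names are illustrative, and `USED` is copied from this page rather than read from the treebank files.

```python
# The 37 universal dependency relations defined by UD v2.
UNIVERSAL_RELATIONS = {
    "acl", "advcl", "advmod", "amod", "appos", "aux", "case", "cc", "ccomp",
    "clf", "compound", "conj", "cop", "csubj", "dep", "det", "discourse",
    "dislocated", "expl", "fixed", "flat", "goeswith", "iobj", "list", "mark",
    "nmod", "nsubj", "nummod", "obj", "obl", "orphan", "parataxis", "punct",
    "reparandum", "root", "vocative", "xcomp",
}

# Relations listed for this treebank (copied from the Relations section).
USED = [
    "acl", "advcl", "advmod", "amod", "appos", "aux", "case", "cc", "ccomp",
    "compound", "compound:redup", "conj", "csubj", "dep", "det", "det:rel",
    "discourse", "dislocated", "fixed", "flat", "mark", "nmod", "nmod:poss",
    "nsubj", "nummod", "obj", "obl", "orphan", "parataxis", "parataxis:obj",
    "punct", "reparandum", "root", "vocative", "xcomp",
]


def relation_overview(used):
    """Return (subtypes in use, universal relations never used)."""
    subtypes = sorted(r for r in used if ":" in r)
    # A subtype such as nmod:poss counts as a use of its base relation.
    base_used = {r.split(":")[0] for r in used}
    unused = sorted(UNIVERSAL_RELATIONS - base_used)
    return subtypes, unused


subtypes, unused = relation_overview(USED)
print(subtypes)  # ['compound:redup', 'det:rel', 'nmod:poss', 'parataxis:obj']
print(unused)    # ['clf', 'cop', 'expl', 'goeswith', 'iobj', 'list']
```

The result matches the 4 subtypes and 6 unused relation types reported above, and the absence of `cop` is consistent with the statement that the corpus contains no copulas.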