UD Hindi English HIENCS
Language: Hindi English (code: qhe)
Family: Code switching
This treebank has been part of Universal Dependencies since the UD v2.3 release.
The following people have contributed to making this treebank part of UD: Riyaz Ahmad Bhat, Irshad Ahmad Bhat.
Repository: UD_Hindi_English-HIENCS
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.2
License: CC BY-SA 4.0. The underlying text is not included; the user must obtain it separately and then merge it with the UD annotation using a script distributed with UD.
Genre: social
Questions, comments? General annotation questions (either Hindi English-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [riyaz • ah • bhat (æt) gmail • com]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.
Annotation | Source |
---|---|
Lemmas | annotated manually in non-UD style, automatically converted to UD |
UPOS | annotated manually, natively in UD style |
XPOS | annotated manually |
Features | annotated manually in non-UD style, automatically converted to UD |
Relations | annotated manually in non-UD style, automatically converted to UD |
Description
The Hindi-English Code-switching treebank is based on code-switching tweets of Hindi and English multilingual speakers (mostly Indian) on Twitter. The treebank is manually annotated using the UD scheme. The training and evaluation sets were annotated separately by different annotators, using the UD v2 and v1 guidelines respectively. The evaluation sets were then automatically converted from UD v1 to v2.
Acknowledgments
Any publication reporting the work done using this data should cite the following papers:
Irshad Ahmad Bhat, Riyaz Ahmad Bhat, Manish Shrivastava and Dipti Misra Sharma. Universal Dependency Parsing for Hindi-English Code-switching. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) 2018, New Orleans, USA.
@inproceedings{bhat2018universal, title={Universal Dependency Parsing for Hindi-English Code-Switching}, author={Bhat, Irshad and Bhat, Riyaz A and Shrivastava, Manish and Sharma, Dipti}, booktitle={Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)}, volume={1}, pages={987–998}, year={2018} }
Irshad Ahmad Bhat, Riyaz Ahmad Bhat, Manish Shrivastava and Dipti Misra Sharma. Joining Hands: Exploiting Monolingual Treebanks for Parsing of Code-mixing Data. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL) 2017, Valencia, Spain.
@inproceedings{bhat2017joining, title={Joining Hands: Exploiting Monolingual Treebanks for Parsing of Code-mixing Data}, author={Bhat, Irshad and Bhat, Riyaz A and Shrivastava, Manish and Sharma, Dipti}, booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers}, volume={2}, pages={324–330}, year={2017} }
Statistics of UD Hindi English HIENCS
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X
Features
Relations
acl – acl:relcl – advcl – advmod – amod – appos – aux – case – cc – ccomp – compound – conj – cop – csubj – det – discourse – expl – fixed – flat – iobj – mark – nmod – nsubj – nummod – obj – obl – orphan – parataxis – punct – root – vocative – xcomp
Tokenization and Word Segmentation
- This corpus contains 1898 sentences and 26909 tokens.
- All tokens in this corpus are followed by a space.
- This corpus does not contain words with spaces.
- This corpus does not contain words that contain both letters and punctuation.
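Counts like the ones above can be reproduced from a CoNLL-U file once the annotation has been merged with the text. The following is a minimal sketch, not the script that generated this page: the function name `count_sentences_and_tokens` and the two-sentence `SAMPLE` are made up for illustration, and the sketch assumes that a multiword-token range (an ID such as `1-2`) counts as one token while empty nodes (an ID such as `1.1`) are not counted.

```python
def count_sentences_and_tokens(conllu_text):
    """Return (sentence_count, token_count) for a CoNLL-U formatted string."""
    sentences = 0
    tokens = 0
    in_sentence = False
    covered_until = 0  # last word ID covered by the current multiword range
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            in_sentence = False
            covered_until = 0
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# sent_id = ..."
        token_id = line.split("\t")[0]
        if not in_sentence:
            sentences += 1
            in_sentence = True
        if "-" in token_id:
            # Multiword token range: one surface token spanning several words.
            _, end = token_id.split("-")
            tokens += 1
            covered_until = int(end)
        elif "." in token_id:
            continue  # empty node, not a surface token
        elif int(token_id) > covered_until:
            tokens += 1
    return sentences, tokens


# Made-up sample in CoNLL-U layout; lemmas are "_" as in the distributed data.
SAMPLE = (
    "# sent_id = demo-1\n"
    "1\tyeh\t_\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tmovie\t_\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tawesome\t_\tADJ\t_\t_\t0\troot\t_\t_\n"
    "4\thai\t_\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "\n"
    "# sent_id = demo-2\n"
    "1\tI\t_\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tlove\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tchai\t_\tNOUN\t_\t_\t2\tobj\t_\t_\n"
)

print(count_sentences_and_tokens(SAMPLE))
```

To apply it to the treebank, read each merged `.conllu` file as UTF-8 text and sum the results over the training and evaluation files.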
Morphology
Tags
- This corpus uses 17 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
- This corpus contains 1 word type tagged as a particle (PART): _
- This corpus contains 1 lemma tagged as a pronoun (PRON): _
- This corpus contains 1 lemma tagged as a determiner (DET): _
- Out of the above, 1 lemma occurred sometimes as PRON and sometimes as DET: _
- This corpus contains 1 lemma tagged as an auxiliary (AUX): _
- Out of the above, 1 lemma occurred sometimes as AUX and sometimes as VERB: _
- This corpus does not use the VerbForm feature.
Nominal Features
Degree and Polarity
Verbal Features
Pronouns, Determiners, Quantifiers
Other Features
Syntax
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop). Example: _.
- This corpus uses 1 lemma as an auxiliary (aux). Example: _.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB--NOUN (703)
  - VERB--NOUN-ADP(_) (52)
  - VERB--NOUN-ADP(_)-ADP(_) (3)
  - VERB--PRON (853)
  - VERB--PRON-ADP(_) (22)
- obj
  - VERB--NOUN (836)
  - VERB--NOUN-ADP(_) (75)
  - VERB--NOUN-ADP(_)-ADP(_) (6)
  - VERB--PRON (206)
  - VERB--PRON-ADP(_) (18)
- iobj
  - VERB--NOUN (10)
  - VERB--NOUN-ADP(_) (28)
  - VERB--NOUN-ADP(_)-ADP(_) (1)
  - VERB--PRON (65)
  - VERB--PRON-ADP(_) (12)
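The patterns above can be approximated with a short script. This is a sketch under stated assumptions rather than the procedure used to build this page: it assumes that each `-ADP(_)` suffix stands for one adposition attached to the argument, with `_` being the withheld lemma, and the helper names `parse_sentences` and `argument_patterns` as well as the sample sentences are hypothetical.

```python
from collections import Counter

CORE_RELATIONS = {"nsubj", "obj", "iobj"}


def parse_sentences(conllu_text):
    """Yield sentences as lists of (id, lemma, upos, head, deprel) tuples."""
    sentence = []
    for line in conllu_text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges and empty nodes
        sentence.append((int(cols[0]), cols[2], cols[3], int(cols[6]), cols[7]))
    if sentence:
        yield sentence


def argument_patterns(conllu_text):
    """Count (relation, pattern) pairs for VERB parents with NOUN/PRON children."""
    counts = Counter()
    for sent in parse_sentences(conllu_text):
        upos_by_id = {word_id: upos for word_id, _, upos, _, _ in sent}
        for word_id, _, upos, head, deprel in sent:
            if deprel not in CORE_RELATIONS:
                continue
            if upos not in ("NOUN", "PRON") or upos_by_id.get(head) != "VERB":
                continue
            # Assumption: every ADP child of the argument adds one "-ADP(lemma)".
            adp_lemmas = [lemma for _, lemma, child_upos, child_head, _ in sent
                          if child_head == word_id and child_upos == "ADP"]
            pattern = "VERB--" + upos + "".join(f"-ADP({l})" for l in adp_lemmas)
            counts[(deprel, pattern)] += 1
    return counts


# Made-up sample; lemmas are "_" as in the distributed data.
SAMPLE = (
    "1\tladke\t_\tNOUN\t_\t_\t4\tnsubj\t_\t_\n"
    "2\tne\t_\tADP\t_\t_\t1\tcase\t_\t_\n"
    "3\tchai\t_\tNOUN\t_\t_\t4\tobj\t_\t_\n"
    "4\tpi\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tI\t_\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tlove\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tchai\t_\tNOUN\t_\t_\t2\tobj\t_\t_\n"
)

for (relation, pattern), n in sorted(argument_patterns(SAMPLE).items()):
    print(relation, pattern, n)
```

Because the sketch matches relation labels exactly, subtyped relations would need to be stripped to their base type first in treebanks that use them for core arguments.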
Relations Overview
- This corpus uses 1 relation subtype: acl:relcl
- The following 6 relation types are not used in this corpus at all: dislocated, clf, list, goeswith, reparandum, dep
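The overview above is a set difference between the 37 universal relations of UD v2 and the relations listed for this corpus. The sketch below recomputes it from the relation list on this page; the variable and function names are illustrative only.

```python
# The 37 universal dependency relations defined in UD v2.
UNIVERSAL_RELATIONS = {
    "acl", "advcl", "advmod", "amod", "appos", "aux", "case", "cc", "ccomp",
    "clf", "compound", "conj", "cop", "csubj", "dep", "det", "discourse",
    "dislocated", "expl", "fixed", "flat", "goeswith", "iobj", "list", "mark",
    "nmod", "nsubj", "nummod", "obj", "obl", "orphan", "parataxis", "punct",
    "reparandum", "root", "vocative", "xcomp",
}

# Relations listed for this corpus in the Relations section of this page.
CORPUS_RELATIONS = [
    "acl", "acl:relcl", "advcl", "advmod", "amod", "appos", "aux", "case",
    "cc", "ccomp", "compound", "conj", "cop", "csubj", "det", "discourse",
    "expl", "fixed", "flat", "iobj", "mark", "nmod", "nsubj", "nummod", "obj",
    "obl", "orphan", "parataxis", "punct", "root", "vocative", "xcomp",
]


def summarize_relations(corpus_relations):
    """Return (subtypes, unused universal relations), both sorted."""
    subtypes = sorted(r for r in corpus_relations if ":" in r)
    # A subtype such as acl:relcl also counts as a use of its base type.
    used_base = {r.split(":")[0] for r in corpus_relations}
    unused = sorted(UNIVERSAL_RELATIONS - used_base)
    return subtypes, unused


subtypes, unused = summarize_relations(CORPUS_RELATIONS)
print(subtypes)
print(unused)
```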