UD Norwegian Bokmaal
Language: Norwegian (code: no)
Family: Indo-European, Germanic
This treebank has been part of Universal Dependencies since the UD v1.2 release.
The following people have contributed to making this treebank part of UD: Lilja Øvrelid, Fredrik Jørgensen, Petter Hohle.
Repository: UD_Norwegian-Bokmaal
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.2
License: CC BY-SA
Genre: news, blog, nonfiction
Questions, comments? General annotation questions (either Norwegian-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [liljao (æt) ifi • uio • no]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.
Annotation | Source |
---|---|
Lemmas | annotated manually in non-UD style, automatically converted to UD |
UPOS | annotated manually in non-UD style, automatically converted to UD |
XPOS | not available |
Features | annotated manually in non-UD style, automatically converted to UD |
Relations | annotated manually in non-UD style, automatically converted to UD |
Description
The Norwegian UD treebank is based on the Bokmål section of the Norwegian Dependency Treebank (NDT), which is a syntactic treebank of Norwegian. NDT has been automatically converted to the UD scheme by Lilja Øvrelid at the University of Oslo.
NDT was developed in 2011-2014 at the National Library of Norway in collaboration with the Text Laboratory and the Department of Informatics at the University of Oslo. NDT contains around 300,000 tokens taken from a variety of genres. The treebank texts have been manually annotated for morphosyntactic information. The morphological annotation mainly follows the Oslo-Bergen Tagger. The syntactic annotation follows, to a large extent, the Norwegian Reference Grammar, as well as a dependency annotation scheme formulated at the outset of the annotation project and iteratively refined throughout the construction of the treebank. For more information, see the references below.
DATA SPLITS
In creating the data splits, care has been taken to preserve contiguous texts in the different splits and also to keep a fair balance of genres in each of the splits. Petter Hohle created the splits for the Norwegian UD treebank. The splits were created by concatenating the following files (available with the distribution of NDT):
Training data (15696 sentences, 180 individual files):
- ap001_0000 – ap012_0002 (53 files)
- bt001_0000 – bt005_0001 (28 files)
- db001a_0000 – db013_0004 (42 files)
- kk001_0000 – kk005_0001 (10 files)
- sp-bm001_0000 – sp-bm001_0008 (9 files)
- vg001_0000 – vg002_0003 (8 files)
- blogg-bm001_0000 – blogg-bm003_0000 (9 files)
- nou001_0000 – nou004_0000 (10 files)
- st001_0000 – st005_0000 (11 files)
Development data (2410 sentences, 26 individual files):
- ap012_0003 – ap014_0002 (7 files)
- bt005_0002 – bt005_0005 (4 files)
- db013_0005 – db014_0002 (5 files)
- kk006_00001 – kk007_0000 (2 files)
- sp-bm002_0000 – sp-bm002_0001 (2 files)
- vg002_0004 (1 file)
- blogg-bm003_0001 – blogg-bm003_0002 (2 files)
- nou004_0001 (1 file)
- st005_0001 – st005_0002 (2 files)
Test data (1939 sentences, 26 individual files):
- ap014_0003 – ap015_0002 (7 files)
- bt005_0006 – bt006_0001 (4 files)
- db014_0003 – db014_0007 (5 files)
- kk007_0001 – kk008_0000 (2 files)
- sp-bm003_0000 – sp-bm003_0001 (2 files)
- vg002_0005 (1 file)
- blogg-bm003_0003 – blogg-bm003_0004 (2 files)
- nou004_0002 (1 file)
- st005_0003 – st005_0004 (2 files)
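The file ranges above can be expanded programmatically. The following is a minimal sketch, assuming that file names sort lexicographically in the same order as the ranges are written (an assumption, not something stated in the NDT distribution). The file names in `available` are illustrative only. The sketch also checks that the three split sizes add up to the 20045 trees of the treebank.

```python
# Sentence counts per split, as listed above.
SPLIT_SENTENCES = {"train": 15696, "dev": 2410, "test": 1939}


def files_in_range(names, first, last):
    """Return the names between first and last (inclusive), in sorted order.

    Assumption: lexicographic order matches the order used for the ranges.
    """
    return [name for name in sorted(names) if first <= name <= last]


# Illustrative file names only; the real list ships with NDT.
available = ["ap012_0002", "ap012_0003", "ap013_0000", "ap014_0002", "ap014_0003"]

# Select the "ap" files of the development split.
dev_ap = files_in_range(available, "ap012_0003", "ap014_0002")
print(dev_ap)
print(sum(SPLIT_SENTENCES.values()))
```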
BASIC STATISTICS
Tree count: 20045
Word count: 311277
Token count: 311277
Dep. relations: 35 of which 2 language specific
POS tags: 17
Category=value feature pairs: 31
TOKENIZATION
Whitespace always indicates a token boundary, and punctuation marks constitute separate tokens, except:
- numbers with periods, commas or colons, e.g. 1.3, 0,6, 10:13
- abbreviations, e.g. f.eks., Carl J. Hambro
- URLs, e.g. http://www.ifi.uio.no
The treebank does not contain multiword tokens.
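The tokenization rules above can be illustrated with a small sketch. This is not the tokenizer used for NDT: the regular expressions, the `ABBREVIATIONS` set, and the `PUNCT` set are hypothetical stand-ins, and only punctuation at the edges of a whitespace-delimited chunk is split off.

```python
import re

# Hypothetical patterns approximating the listed exceptions.
NUMBER = re.compile(r"\d+(?:[.,:]\d+)+")    # e.g. 1.3, 0,6, 10:13
URL = re.compile(r"https?://\S*[\w/]")      # e.g. http://www.ifi.uio.no
ABBREVIATIONS = {"f.eks.", "J."}            # illustrative; a real list is far longer
PUNCT = set(".,:;!?()\"'«»")


def is_protected(chunk):
    """True if the chunk must be kept as a single token."""
    return bool(
        NUMBER.fullmatch(chunk) or URL.fullmatch(chunk) or chunk in ABBREVIATIONS
    )


def tokenize(text):
    """Split on whitespace, then peel punctuation off unprotected chunk edges."""
    tokens = []
    for chunk in text.split():
        leading, trailing = [], []
        while chunk and not is_protected(chunk) and chunk[0] in PUNCT:
            leading.append(chunk[0])
            chunk = chunk[1:]
        while chunk and not is_protected(chunk) and chunk[-1] in PUNCT:
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(leading)
        if chunk:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens


tokens = tokenize("Se http://www.ifi.uio.no, f.eks. 1.3 eller 10:13.")
print(tokens)
```

In this sketch the comma after the URL and the final period become separate tokens, while `f.eks.`, `1.3`, and `10:13` stay intact.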
MORPHOLOGY
The PoS-tags follow the universal tag set and do not add any language-specific PoS-tags. The morphological features follow the Oslo-Bergen Tagger scheme (Hagen et al., 2000). PoS-tags and morphological features were converted automatically to the UD scheme.
SYNTAX
The syntactic annotation in the Norwegian UD treebank conforms to the UD guidelines, adding language-specific relations for relative clauses (acl:relcl) and verb particles (compound:prt). The annotation has been automatically converted to UD from the original dependency scheme described in Solberg et al. (2014) and further described in the NDT guidelines (Kinn et al.).
The conversion has not been manually checked. There are a few known discrepancies from UD:
- There is no multiword expression (mwe) analysis in the treebank; this information is not present in the original data either.
Acknowledgments
NDT has been automatically converted to the UD scheme by Lilja Øvrelid at the University of Oslo. Petter Hohle created the data splits and Fredrik Jørgensen aligned the treebank to the original texts. We thank the annotators of the original NDT: Pål Kristian Eriksen, Kari Kinn and Per Erik Solberg.
Statistics of UD Norwegian Bokmaal
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X
Features
Abbr – Animacy – Case – Definite – Degree – Gender – Mood – Number – NumType – Person – Polarity – Poss – PronType – Reflex – Tense – VerbForm – Voice
Relations
acl – acl:cleft – acl:relcl – advcl – advmod – amod – appos – aux – aux:pass – case – cc – ccomp – compound – compound:prt – conj – cop – csubj – csubj:pass – det – discourse – expl – flat:foreign – flat:name – goeswith – iobj – mark – nmod – nsubj – nsubj:pass – nummod – obj – obl – orphan – parataxis – punct – root – xcomp
Tokenization and Word Segmentation
- This corpus contains 20045 sentences and 310222 tokens.
- This corpus contains 34525 tokens (11%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 1583 types of words that contain both letters and punctuation. Examples: tros-, Nord-Korea, Dagbladet.no, Fr.p., pr., tv-kanalen, bl.a., dr., aftenposten.no, Mette-Marit, Kyoto-avtalen, bt.no, I., St., e-post, Sør-Afrika, Thiis-Evensen, ca., handelsportal.no, Rieber-Mohn, Schmidt-Nielsen, W., pst., 70-tallet, CO2-frie, Mayen-loven, olje-, Fr.p.s, M., dvs., f.eks., miljø-, LO-leder, flyktning-, helse-, kl., skatte-, 1970-tallet, 1980-tallet, 60-tallet, A., B., GH:WT, Jong-un, Nord-Koreas, O2-opptaket, m.m., norsk-pakistanske, nærings-, 1800-tallet
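Counts like the ones above can be derived from a CoNLL-U file. The sketch below assumes the standard ten-column CoNLL-U layout, with `SpaceAfter=No` stored in the MISC column; the two-sentence `SAMPLE` is invented for illustration and is not taken from the treebank.

```python
def corpus_counts(conllu_text):
    """Count sentences, tokens, and tokens not followed by a space."""
    sentences = tokens = no_space = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            in_sentence = False          # a blank line ends the sentence
            continue
        if line.startswith("#"):
            continue                     # sentence-level comment
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue                     # skip multiword ranges and empty nodes
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        if "SpaceAfter=No" in cols[9].split("|"):
            no_space += 1
    return {"sentences": sentences, "tokens": tokens, "no_space_after": no_space}


# Invented sample in CoNLL-U format (columns separated by tabs).
SAMPLE = (
    "# text = Hun leser.\n"
    "1\tHun\thun\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tleser\tlese\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# text = Ja.\n"
    "1\tJa\tja\tINTJ\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "2\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "\n"
)

counts = corpus_counts(SAMPLE)
print(counts)
```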
Morphology
Tags
- This corpus uses 17 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
- This corpus contains 4 word types tagged as particles (PART): ei, ikke, og, å
- This corpus contains 49 lemmas tagged as pronouns (PRON): alle, alt, begge, de, den, denne, dere, deres, det, dette, din, disse, du, en, enhver, ham, han, hans, hennes, hun, hva, hvem, hverandre, hverandres, hvilket, hvis, ikkenoe, ingen, ingenting, intet, jag, jeg, man, meg, min, noe, noen, samtlige, seg, sin, sitt, slikt, som, sånt, vi, vår, whatever, you, æ
- This corpus contains 38 lemmas tagged as determiners (DET): 135a, all, alle, annen, begge, de, den, denne, det, dette, disse, egen, en, endel, enhver, fire-fem, forrige, hin, hver, hvilken, hvis, ingen, min, neste, nineish, noe, noen, samme, samtlige, selv, selve, selveste, sjøl, slik, sådan, sånn, tenish, the
- Out of the above, 16 lemmas occurred sometimes as PRON and sometimes as DET: alle, begge, de, den, denne, det, dette, disse, en, enhver, hvis, ingen, min, noe, noen, samtlige
- This corpus contains 10 lemmas tagged as auxiliaries (AUX): bli, burde, få, ha, kunne, måtte, skulle, tørre, ville, være
- Out of the above, 10 lemmas occurred sometimes as AUX and sometimes as VERB: bli, burde, få, ha, kunne, måtte, skulle, tørre, ville, være
- There are 3 (de)verbal forms:
  - Fin
    - AUX: er, har, var, kan, vil, skal, ble, må, hadde, skulle
    - VERB: har, sier, er, blir, kommer, går, mener, ble, får, kom
  - Inf
    - AUX: være, ha, bli, kunne, få, måtte, ville, skulle, ble, vøre
    - VERB: få, ha, bli, ta, gjøre, se, si, gå, komme, gi
  - Part
    - ADJ: sittende, tilsvarende, stående, forurensende, økende, økt, overraskende, ledende, krevende, manglende
    - AUX: vært, blitt, fått, måttet, kunnet, villet
    - VERB: fått, hatt, blitt, tatt, gjort, sett, gått, kommet, lagt, sagt
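The overlap figures above (lemmas seen both as PRON and DET, or both as AUX and VERB) can be recomputed from the LEMMA and UPOS columns. This is a minimal sketch under the assumption of standard CoNLL-U input; the `SAMPLE` sentences are invented for illustration.

```python
from collections import defaultdict


def lemmas_by_upos(conllu_text):
    """Map each UPOS tag to the set of lemmas that occur with it."""
    lemmas = defaultdict(set)
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue                     # skip multiword ranges and empty nodes
        lemmas[cols[3]].add(cols[2])     # UPOS -> LEMMA
    return lemmas


def shared_lemmas(lemmas, tag_a, tag_b):
    """Lemmas that occur with both tags, sorted alphabetically."""
    return sorted(lemmas[tag_a] & lemmas[tag_b])


# Invented sample: "ha" occurs as AUX and VERB, "det" as PRON and DET.
SAMPLE = (
    "1\tHun\thun\tPRON\t_\t_\t3\tnsubj\t_\t_\n"
    "2\thar\tha\tAUX\t_\t_\t3\taux\t_\t_\n"
    "3\tlest\tlese\tVERB\t_\t_\t0\troot\t_\t_\n"
    "4\tdet\tdet\tPRON\t_\t_\t3\tobj\t_\t_\n"
    "5\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
    "1\tHun\thun\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\thar\tha\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tdet\tdet\tDET\t_\t_\t4\tdet\t_\t_\n"
    "4\thuset\thus\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "5\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

by_tag = lemmas_by_upos(SAMPLE)
print(shared_lemmas(by_tag, "PRON", "DET"))
print(shared_lemmas(by_tag, "AUX", "VERB"))
```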
Nominal Features
- Gender
  - Fem
    - DET: den, ei, noen, all, denne, hver, egen, annen, enhver, hvilken
    - NOUN: tid, kirke, kroner, kvinner, støtte, hjelp, uker, side, mor, endringer
    - NUM: halvannen, annenhver
    - PRON: hun, henne, vår, hans, deres, si, hennes, di, mi
    - PROPN: Kristin, Marit, Hanne, Hanna, Märtha, Gro, Ingrid, Maria, Marie, Anne
  - Fem,Masc
    - PRON: den, noen, denne, ingen, enhver, der
  - Masc
    - ADJ: antiautoritære, stor, straffet
    - ADJ-Part: straffet
    - ADV: Jo
    - DET: en, den, denne, ingen, annen, hver, egen, slik, noen, all
    - NOUN: dag, prosent, gang, verden, del, grunn, saken, ganger, ting, millioner
    - NUM: én, halvannen, annenhver, Èn
    - PRON: han, sin, ham, min, hans, vår, din, deres, hennes
    - PROPN: Jan, Espen, Martin, Olav, Erik, Øyvind, Per, Kjell, Aftenposten, Sverre
  - Neut
    - ADJ: mye, helt, godt, litt, langt, samtidig, veldig, mulig, svært, lite
    - ADJ-Part: bortsett, knyttet, samlet, opptalt, fredet, sett, uttalt, basert, betalt, integrert
    - DET: et, det, noe, annet, dette, hvert, eget, alt, slikt, hvilket
    - NOUN: år, folk, land, barn, landet, mennesker, livet, spørsmål, forhold, tillegg
    - NUM: ett, halvannet, mangt, annethvert
    - PRON: det, dette, noe, sitt, alt, mitt, vårt, hans, hennes, ditt
    - PROPN: Stortinget, Dagbladet, Fremskrittspartiet, Senterpartiet, Stortingets, Sørlandet, Internett, Barentshavet, Norden, Vestlandet
- Animacy
  - Hum
    - PRON: jeg, han, vi, hun, du, man, meg, oss, ham, deg
- Number
  - Plur
    - ADJ: mange, store, nye, norske, siste, gode, få, ulike, 22., ansatte
    - ADJ-Part: økte, fredede, gjentatte, interesserte, samlede, forente, kvalifiserte, solgte, tapte, undertrykte
    - DET: de, andre, alle, noen, disse, slike, egne, ingen, begge, hvilke
    - NOUN: år, prosent, folk, barn, mennesker, ganger, kroner, land, ting, millioner
    - NUM: to, tre, fire, 2, fem, ti, 20, seks, 3, 50
    - PRON: vi, de, oss, dem, sine, alle, våre, ingen, dere, hverandre
  - Plur,Sing
    - NOUN: A/S, AS, EKG, IQ, KS
  - Sing
    - ADJ: mye, første, helt, litt, godt, hele, norske, stor, ny, god
    - ADJ-Part: økt, bekymret, knyttet, samlet, overrasket, bortsett, domfelte, interessert, lovforankret, integrert
    - DET: en, et, den, det, denne, noe, annet, dette, annen, ingen
    - NOUN: dag, gang, tid, verden, del, år, kirke, landet, grunn, saken
    - NUM: ett, én, ene, 1, halvannen, annenhver, halvannet, mangt, 1., annethvert
    - PRON: det, jeg, han, hun, du, dette, man, sin, meg, den
    - VERB-Part: overrasket
- Case
  - Acc
    - PRON: seg, meg, oss, dem, ham, deg, henne, dere, han, mæ
  - Gen
    - ADJ: domfeltes, manges, offentliges, ansattes, enkeltes, fattiges, mistenktes, rødgrønnes, sistnevntes, tiltaltes
    - ADJ-Part: domfeltes, mistenktes
    - DET: andres, dens, dets, alles, ens, annens, hvis
    - NOUN: verdens, dagens, landets, årets, kirkens, statens, utvalgets, års, samfunnets, barnets
    - NUM: 2, 2011s, 2s
    - PRON: alles, ens
    - PROPN: Norges, Regjeringens, Cathrines, Obamas, Høyres, FNs, Bertelsens, USAs, Europas, Hannahs
  - Gen,Nom
    - PRON: ens
  - Nom
    - PRON: jeg, han, vi, de, hun, du, man, dere, Eg, mann
- Definite
  - Def
    - ADJ: første, hele, norske, beste, siste, nye, fleste, største, store, viktigste
    - ADJ-Part: domfelte, samlede, mistenkte, økte, domfeltes, planlagte, undertegnede, anbefalte, betalte, drepte
    - DET: samme, neste, forrige, andre, selve, selveste, the
    - NOUN: landet, saken, livet, regjeringen, stedet, tiden, utvalget, politiet, staten, dagens
    - NUM: eneste, ene
  - Def,Ind
    - NOUN: A/S, AS, EKG, IQ, IT, KS
  - Ind
    - ADJ: mye, helt, litt, godt, stor, ny, mest, god, norsk, langt
    - ADJ-Part: økt, bekymret, knyttet, samlet, overrasket, bortsett, interessert, lovforankret, integrert, redusert
    - DET: annet, annen, egen, eget, annens
    - NOUN: år, dag, prosent, gang, tid, folk, verden, land, barn, del
    - VERB-Part: overrasket
Degree and Polarity
- Degree
  - Cmp
    - ADJ: mer, flere, tidligere, bedre, større, mindre, videre, lenger, senere, høyere
  - Pos
    - ADJ: mange, norske, mye, første, store, nye, hele, helt, litt, godt
  - Sup
    - ADJ: mest, beste, fleste, minst, største, best, viktigste, fremst, verste, nærmest
- Polarity
  - Neg
    - DET: ingen, intet
    - PART: ikke
    - PRON: ingen, ingenting
Verbal Features
- Mood
  - Imp
    - AUX-Fin: vær, Få
    - VERB-Fin: les, la, se, tenk, Ha, ta, send, gi, husk, kom
  - Ind
    - AUX-Fin: er, har, var, kan, vil, skal, ble, må, hadde, skulle
    - VERB-Fin: har, sier, er, blir, kommer, går, mener, ble, får, hadde
- Tense
  - Past
    - AUX-Fin: var, ble, hadde, skulle, ville, kunne, måtte, burde, fikk, torde
    - VERB-Fin: ble, hadde, kom, fikk, sa, gikk, tok, var, gjorde, så
  - Pres
    - AUX-Fin: er, har, kan, vil, skal, må, blir, bør, får, tør
    - VERB-Fin: har, sier, er, blir, kommer, går, mener, får, ser, gjør
- Voice
  - Pass
    - VERB-Fin: brukes, legges, sies, gis, kreves, oppheves, settes, vises, snakkes, stilles
    - VERB-Inf: gjøres, tas, brukes, legges, settes, styrkes, behandles, gis, gjennomføres, sies
Pronouns, Determiners, Quantifiers
- PronType
  - Art
    - DET: en, et, ei, ens, at, er, ett
  - Art,Prs
    - PRON: en, ens
  - Dem
    - DET: den, de, det, andre, denne, annet, disse, samme, dette, annen
  - Dem,Ind
    - DET: noe
  - Ind
    - DET: noen, noe, Endel
  - Ind,Prs
    - PRON: noe, noen
  - Int
    - DET: hvilke, hvilken, hvilket
    - PRON: hva, hvem, hvis, hvilket
  - Neg
    - DET: ingen, intet
    - PRON: ingenting
  - Neg,Prs
    - PRON: ingen
  - Prs
    - DET: selv, egen, egne, eget, selve, 135a, sjøl, selveste, the, fire-fem
    - PRON: det, jeg, han, vi, de, seg, hun, du, dette, man
  - Prs,Tot
    - PRON: alle, begge, enhver, samtlige, alles
  - Rcp
    - PRON: hverandre, hverandres
  - Rel
    - PRON: som
  - Tot
    - DET: alle, hver, hvert, all, begge, alt, enhver, samtlige, ethvert, alles
- NumType
  - Card
    - NUM: to, tre, fire, eneste, ett, 2, fem, ti, 20, seks
- Poss
  - Yes
    - PRON: sin, sine, hans, sitt, min, vår, deres, mitt, våre, vårt
- Reflex
  - Yes
    - PRON: seg
- Person
  - 1
    - PRON: jeg, vi, meg, oss, mæ, Eg, mig, æ
  - 2
    - PRON: du, deg, dere
  - 3
    - PRON: det, han, de, hun, dette, den, noe, dem, ham, alt
Other Features
- Abbr
  - Yes
    - ADJ: a, kgl., flg, lat., s.k.
    - ADP: bl.a., pr., bl, f, pr, f., inkl., mht., bla, p.g.a.
    - ADV: ca, ca., dvs., f.eks., m.m., dvs, osv., m.v., mv, o.l.
    - NOUN: dr., nr, NATO, PST, pst., kr, kl., res, eks, Nato
    - PROPN: USA, Frp, FN, EU, Ap, KrF, SV, Sp, Fr.p., FNs
    - VERB-Fin: jf
- Yes
Syntax
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop). Example: være.
- This corpus uses 9 lemmas as auxiliaries (aux). Examples: ha, kunne, ville, skulle, være, måtte, få, burde, tørre.
- This corpus uses 1 lemma as a passive auxiliary (aux:pass). Example: bli.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
- VERB-Fin--NOUN (4772)
- VERB-Fin--NOUN-ADP(med) (10)
- VERB-Fin--NOUN-ADP(over) (2)
- VERB-Fin--PRON (2629)
- VERB-Fin--PRON-Acc (6)
- VERB-Fin--PRON-Nom (5621)
- VERB-Inf--NOUN (1058)
- VERB-Inf--NOUN-Gen (1)
- VERB-Inf--PRON (534)
- VERB-Inf--PRON-Acc (7)
- VERB-Inf--PRON-Nom (1463)
- VERB-Part--NOUN (1045)
- VERB-Part--NOUN-ADP(med) (1)
- VERB-Part--PRON (510)
- VERB-Part--PRON-Nom (1015)
- obj
- VERB-Fin--NOUN (5102)
- VERB-Fin--NOUN-ADP(over) (1)
- VERB-Fin--NOUN-Gen (1)
- VERB-Fin--PRON (784)
- VERB-Fin--PRON-Acc (819)
- VERB-Fin--PRON-Nom (11)
- VERB-Inf--NOUN (3500)
- VERB-Inf--PRON (400)
- VERB-Inf--PRON-ADP(med) (1)
- VERB-Inf--PRON-Acc (499)
- VERB-Inf--PRON-Nom (5)
- VERB-Part--NOUN (1271)
- VERB-Part--PRON (172)
- VERB-Part--PRON-Acc (182)
- iobj
- VERB-Fin--NOUN (61)
- VERB-Fin--PRON (81)
- VERB-Fin--PRON-Acc (216)
- VERB-Inf--NOUN (72)
- VERB-Inf--PRON (14)
- VERB-Inf--PRON-Acc (109)
- VERB-Part--NOUN (12)
- VERB-Part--PRON (14)
- VERB-Part--PRON-Acc (48)
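Pattern labels like those above combine the parent's VerbForm with the child's UPOS tag and Case. The sketch below shows one way such counts could be gathered from CoNLL-U data; it is an approximation and not the script that produced these statistics, and it omits the `ADP(...)` marker that some of the labels carry. The `SAMPLE` sentence is invented.

```python
from collections import Counter


def parse_feats(feats):
    """Turn a FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def argument_patterns(conllu_text, relation):
    """Count VERB parent / NOUN-or-PRON child patterns for one relation."""
    counts = Counter()
    for block in conllu_text.strip().split("\n\n"):
        words = {}
        for line in block.splitlines():
            if line.startswith("#"):
                continue
            cols = line.split("\t")
            if cols[0].isdigit():
                words[cols[0]] = cols    # index words by their ID
        for cols in words.values():
            parent = words.get(cols[6])  # HEAD column; None for the root
            if cols[7] != relation or parent is None:
                continue
            if parent[3] != "VERB" or cols[3] not in ("NOUN", "PRON"):
                continue
            form = parse_feats(parent[5]).get("VerbForm")
            case = parse_feats(cols[5]).get("Case")
            label = "VERB" + (f"-{form}" if form else "")
            label += "--" + cols[3] + (f"-{case}" if case else "")
            counts[label] += 1
    return counts


# Invented sample sentence: "Hun leser boka ."
SAMPLE = (
    "1\tHun\thun\tPRON\t_\tCase=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs\t2\tnsubj\t_\t_\n"
    "2\tleser\tlese\tVERB\t_\tMood=Ind|Tense=Pres|VerbForm=Fin\t0\troot\t_\t_\n"
    "3\tboka\tbok\tNOUN\t_\tDefinite=Def|Gender=Fem|Number=Sing\t2\tobj\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

print(dict(argument_patterns(SAMPLE, "nsubj")))
print(dict(argument_patterns(SAMPLE, "obj")))
```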
Verbs with Reflexive Core Objects
- This corpus contains 273 lemmas that occur at least once with a reflexive core object (obj or iobj). Examples: vise seg, føle seg, ta seg, dreie seg, sette seg, holde seg, komme seg, forholde seg, skille seg, legge seg, utvikle seg, befinne seg, bestemme seg, stille seg, gjøre seg, sikre seg, la seg, nærme seg, skaffe seg, uttale seg, gifte seg, klare seg, bevege seg, glede seg, tenke seg, endre seg, engasjere seg, melde seg, strekke seg, forberede seg, tillate seg, trekke seg, gi seg, reise seg, si seg, bry seg, konsentrere seg, se seg, gjemme seg, kaste seg, lære seg, prøve seg, samle seg, barrikadere seg, etterlate seg, fortone seg, få seg, kjøpe seg, knytte seg, oppføre seg
- Out of those, 18 lemmas occurred more than once, but never without a reflexive dependent. Examples: fortone, pådra, påta, vegre, ombestemme, oppholde, skynde, belage, påberope, venne, dy, forskanse, godte, hive, kvitte, lene, opparbeide, tilegne
Relations Overview
- This corpus uses 8 relation subtypes: acl:cleft, acl:relcl, aux:pass, compound:prt, csubj:pass, flat:foreign, flat:name, nsubj:pass
- The following main type is never used alone; it is always subtyped: flat
- The following 7 relation types are not used in this corpus at all: vocative, dislocated, clf, fixed, list, reparandum, dep
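The three figures in this overview follow from the relation inventory listed under Relations. As a sketch, the check below compares that inventory with the 37 universal relations of UD v2; the variable names are illustrative.

```python
# The 37 universal relations of UD v2.
UNIVERSAL = """
acl advcl advmod amod appos aux case cc ccomp clf compound conj cop csubj
dep det discourse dislocated expl fixed flat goeswith iobj list mark nmod
nsubj nummod obj obl orphan parataxis punct reparandum root vocative xcomp
""".split()

# Relation labels used in this treebank, as listed under "Relations".
USED = """
acl acl:cleft acl:relcl advcl advmod amod appos aux aux:pass case cc ccomp
compound compound:prt conj cop csubj csubj:pass det discourse expl
flat:foreign flat:name goeswith iobj mark nmod nsubj nsubj:pass nummod obj
obl orphan parataxis punct root xcomp
""".split()

# Labels with a colon are subtypes of a main relation.
subtypes = sorted(label for label in USED if ":" in label)

# Main types that appear only in subtyped form.
used_alone = {label for label in USED if ":" not in label}
main_seen = {label.split(":")[0] for label in USED}
always_subtyped = sorted(main_seen - used_alone)

# Universal relations that never appear, with or without a subtype.
unused = sorted(set(UNIVERSAL) - main_seen)

print(len(subtypes), subtypes)
print(always_subtyped)
print(unused)
```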