UD for Catalan
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters. Description of exceptions follows.
- According to typographical rules, many punctuation marks are attached to a neighboring word. We usually tokenize them as separate tokens (words) with the exception of abbreviations such as etc. “etc.” which are kept as one token with the period.
- There are two main classes of multi-word tokens:
- Contractions of prepositions and definite articles. Example: al = a + el “to the”, del = de + el “of the”.
- Certain verb forms (infinitives, imperatives, present participles) are written together with object clitic pronouns, while with other verb forms the clitics are written as separate words. Examples: convertir-se = convertir + se “to become” (lit. “to convert itself”), fer-ho = fer + ho “to do it”. Since the verb-clitic combination is written with a hyphen in Catalan, it could be split during low-level tokenization. However, we treat it as a multi-word token to emphasize parallelism with Spanish, where it is written as one word.
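The two classes of multi-word tokens can be illustrated with a minimal sketch. It assumes a tiny lookup table built only from the examples above (a real tokenizer would need a full lexicon and morphological analysis), and it uses the CoNLL-U convention of an ID range line such as 1-2 before the syntactic words. The names CONTRACTIONS, CLITIC_FORMS, split_multiword and to_conllu_rows are hypothetical.

```python
# Hypothetical lookup tables containing only the examples from the text above.
CONTRACTIONS = {"al": ["a", "el"], "del": ["de", "el"]}
CLITIC_FORMS = {"convertir-se": ["convertir", "se"], "fer-ho": ["fer", "ho"]}


def split_multiword(token):
    """Return the syntactic words of a token, or [token] if it is not multi-word."""
    key = token.lower()
    return CONTRACTIONS.get(key) or CLITIC_FORMS.get(key) or [token]


def to_conllu_rows(tokens):
    """Return (ID, FORM) pairs, with a range row before each multi-word token."""
    rows = []
    next_id = 1
    for token in tokens:
        words = split_multiword(token)
        if len(words) > 1:
            # Range line covering all syntactic words of the surface token.
            rows.append((f"{next_id}-{next_id + len(words) - 1}", token))
        for word in words:
            rows.append((str(next_id), word))
            next_id += 1
    return rows
```

For instance, to_conllu_rows(["fer-ho"]) yields the range row ("1-2", "fer-ho") followed by ("1", "fer") and ("2", "ho").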
Morphology
Tags
- Catalan uses all 17 universal POS categories, including particles (PART).
- The only word to be tagged as particle is no “not”.
- TODO: rules for the PRON vs. DET distinction.
- Catalan auxiliary verbs (AUX) are:
- ser and estar “to be”, used as copulas
- ser “to be” for the passive (la guia va ser presentada “the guide was presented”)
- estar “to be” for the progressive (la globalització està causant els canvis “globalization is causing changes”)
- haver “to have” for the perfect tenses (¿Què ha passat? “What happened?”)
- There are four main (de)verbal forms, distinguished by the UPOS tag and the value of the VerbForm feature:
Nominal Features
- Nominal words (NOUN, PROPN and PRON) have an inherent Gender feature with one of two values: Masc or Fem.
- The following parts of speech inflect for Gender because they must agree with nouns: ADJ, DET. Only a subset of adjectives can inflect for gender. A large group of adjectives (e.g. firal “fair” or gran “big”) have just one form regardless of the gender of the modified noun. These adjectives have the gender feature empty.
- The two values of the Number feature are Sing and Plur. The following parts of speech inflect for number: NOUN, PROPN, PRON, ADJ, DET, VERB, AUX (finite and participles).
- Case has three possible values: Nom, Dat, Acc. It occurs only with personal pronouns (PRON). The “case” (i.e., role w.r.t. predicates or other phrases) of other nominals is expressed using prepositions, not morphologically.
- Definite has 2 values: Ind, Def. It is used to distinguish the indefinite and definite articles (DET).
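The nominal features above can be summarized as a small table of allowed values and parts of speech. The following is a minimal sketch of a checker for a CoNLL-U FEATS string, assuming only the restrictions stated in this section; the names NOMINAL_RULES, parse_feats and check_nominal_feats are hypothetical, and the check may be stricter than the actual treebank data.

```python
# Allowed values and parts of speech, as stated in the list above (an assumption
# of this sketch; features not listed here are simply ignored).
NOMINAL_RULES = {
    "Gender": {"values": {"Masc", "Fem"},
               "upos": {"NOUN", "PROPN", "PRON", "ADJ", "DET"}},
    "Number": {"values": {"Sing", "Plur"},
               "upos": {"NOUN", "PROPN", "PRON", "ADJ", "DET", "VERB", "AUX"}},
    "Case": {"values": {"Nom", "Dat", "Acc"}, "upos": {"PRON"}},
    "Definite": {"values": {"Ind", "Def"}, "upos": {"DET"}},
}


def parse_feats(feats):
    """Turn 'Gender=Masc|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_nominal_feats(upos, feats):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for name, value in parse_feats(feats).items():
        rule = NOMINAL_RULES.get(name)
        if rule is None:
            continue  # feature not covered by this sketch
        if upos not in rule["upos"]:
            problems.append(f"{name} not expected on {upos}")
        elif value not in rule["values"]:
            problems.append(f"{name}={value} is not a listed value")
    return problems
```

An adjective with a single form for both genders would simply carry no Gender feature, which this checker accepts.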
Degree and Polarity
- Degree applies to adjectives (ADJ) and adverbs (ADV) and has one of three possible values: Pos, Cmp, Abs. The absolute superlative is marked morphologically on adjectives. Otherwise, the comparative and superlative of most adjectives are formed periphrastically, and Degree=Cmp is only used with a few irregular forms.
- Polarity is used to mark the negative particle no, i.e., only the Neg value is used.
Verbal Features
- Finite verbs always have one of four values of Mood: Ind, Imp, Sub and Cnd.
- Finite verbs can have one of four values of Tense: Past, Imp, Pres, Fut.
- Imperative and conditional forms do not have the Tense feature. (In Catalan grammar, the conditional is itself often classified as a tense. However, it is a mood in Universal Dependencies.)
- The Tense feature is also used with the past participles (vingut “come”).
- The Aspect feature is currently not used in Catalan. It is not needed for the imperfect past tense because UD has the special value Tense=Imp. And it is not needed for the perfect tenses because they are constructed periphrastically.
- The Voice feature is not used in Catalan because the passive voice is expressed periphrastically.
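The constraints on verbal features lend themselves to a simple consistency check. The sketch below assumes features have already been parsed into a dict and that finite forms are marked with the standard UD value VerbForm=Fin (this section does not list the VerbForm values, so that is an assumption); the function name check_verbal_feats is hypothetical.

```python
MOODS = {"Ind", "Imp", "Sub", "Cnd"}
TENSES = {"Past", "Imp", "Pres", "Fut"}
TENSELESS_MOODS = {"Imp", "Cnd"}       # imperative and conditional carry no Tense
UNUSED_FEATURES = ("Aspect", "Voice")  # stated above as not used in Catalan


def check_verbal_feats(feats):
    """Return a list of problems found in a dict of verbal features."""
    problems = []
    mood = feats.get("Mood")
    tense = feats.get("Tense")
    # Assumption: finite forms are identified by VerbForm=Fin.
    if feats.get("VerbForm") == "Fin" and mood not in MOODS:
        problems.append("finite verb without a listed Mood value")
    if tense is not None and tense not in TENSES:
        problems.append(f"Tense={tense} is not a listed value")
    if mood in TENSELESS_MOODS and tense is not None:
        problems.append(f"Mood={mood} should not carry Tense")
    for name in UNUSED_FEATURES:
        if name in feats:
            problems.append(f"{name} is not used in Catalan")
    return problems
```

Note that Imp is both a Mood value (imperative) and a Tense value (imperfect), so the two features have to be checked separately rather than by value alone.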
Pronouns, Determiners, Quantifiers
- PronType is used with pronouns (PRON) and determiners (DET).
- NumType is used with numerals (NUM), adjectives (ADJ) and determiners (DET).
- The Poss feature marks possessive personal determiners (e.g. meu “my”) and possessive personal pronouns (e.g. meva “mine”).
- The Reflex feature is always used together with PronType=Prs and it marks reflexive pronouns. Note that their forms in the first and second person are ambiguous with irreflexive accusative forms, and the Reflex feature must be decided by context.
- Person is a lexical feature of personal pronouns (PRON) and has three values, 1, 2 and 3. With personal possessive determiners (DET), the feature actually encodes the person of the possessor. Person is not marked on other types of pronouns and on nouns, although they can almost always be interpreted as the 3rd person.
- The Polite feature distinguishes informal second-person pronouns (tu, vosaltres, Polite=Infm) from the formal vostè, vostès (Polite=Form).
- There is one layered feature, Number[psor]. It appears with possessive determiners and encodes the lexical number of the possessor. The extra layer is needed to distinguish this lexical feature from the inflectional number that marks agreement with the modified (possessed) noun.
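To make the role of the layered feature concrete, the following sketch separates the plain Number feature from Number[psor] when reading a FEATS string. The feature string used in the usage note is an illustrative assumption for a possessive determiner, not an excerpt from a treebank, and the function names split_layer and parse_layered_feats are hypothetical.

```python
def split_layer(name):
    """Split 'Number[psor]' into ('Number', 'psor'); plain names get layer None."""
    base, bracket, rest = name.partition("[")
    layer = rest[:-1] if bracket and rest.endswith("]") else None
    return base, layer


def parse_layered_feats(feats):
    """Map (feature, layer) pairs to values for a CoNLL-U FEATS string."""
    if feats == "_":
        return {}
    parsed = {}
    for pair in feats.split("|"):
        name, value = pair.split("=", 1)
        parsed[split_layer(name)] = value
    return parsed
```

With the illustrative string "Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs", the key ("Number", None) holds the agreement number of the possessed noun, while ("Number", "psor") holds the number of the possessor.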
Other Features
Syntax
Core Arguments, Oblique Arguments and Adjuncts
- Nominal subject (nsubj) is a bare noun phrase without preposition. If it is a personal pronoun, it must be in the nominative form (note however that Catalan is a pro-drop language, where pronominal subjects can be omitted).
- Direct nominal object (obj) is either a bare noun phrase (for inanimate objects) or a prepositional phrase with the preposition a (for animate objects) or a personal pronoun in the accusative form.
- Extra attention has to be paid to the reflexive pronoun es. It can function as:
- Core object (obj or iobj): es va veure al mirall “he saw himself in the mirror.”
- Reciprocal core object (obj or iobj): es van besar “they kissed each other.”
- Reflexive passive (expl:pass): s’ha ofert una atenció psicològica a les persones afectades “psychological attention has been offered to the people affected” (lit. “offered itself”).
- Inherently reflexive verb: the verb cannot exist without the reflexive clitic, and the clitic cannot be substituted by an irreflexive pronoun or a noun phrase. In many cases, an irreflexive counterpart of the verb actually exists but its meaning is different because it denotes a different action performed by the agent. In accord with the current UD guidelines, we label the relation between the verb and the clitic as expl:pv, not compound. Example: es tracta d’una immigració “the matter is immigration;” s’havia de riure “he had to laugh.”
- In passive clauses, the subject is labeled nsubj:pass if it is nominal and csubj:pass if it is clausal.
- The auxiliary verb in periphrastic passive is labeled aux:pass.
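The labeling decisions for the reflexive pronoun es can be condensed into a lookup table. This is only a sketch of the four cases listed above: deciding which function es has in a given sentence still requires linguistic judgment, and the keys and the function name candidate_relations are hypothetical.

```python
# Relation labels for each function of "es", as listed above.
ES_RELATIONS = {
    "reflexive": ("obj", "iobj"),    # es va veure al mirall
    "reciprocal": ("obj", "iobj"),   # es van besar
    "passive": ("expl:pass",),       # s'ha ofert una atenció psicològica ...
    "inherent": ("expl:pv",),        # es tracta d'una immigració
}


def candidate_relations(function):
    """Return the relation labels allowed for a given function of 'es'."""
    try:
        return ES_RELATIONS[function]
    except KeyError:
        raise ValueError(f"unknown function of es: {function!r}") from None
```

For the core-object cases the table returns both obj and iobj, because the choice between them depends on the argument structure of the verb.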
Non-verbal Clauses
- The copula verbs ser and estar “to be” are used in equational, attributional, locative, possessive and benefactive nonverbal clauses.
Relations Overview
- The following relation subtypes are used in Catalan:
- acl:relcl for relative clauses
- nsubj:pass for nominal subjects of passive verbs
- csubj:pass for clausal subjects of passive verbs
- aux:pass for passive auxiliaries
- The following relation types are not used in Catalan at all: clf, dislocated
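As a rough illustration of how this overview could be applied, the sketch below flags dependency labels that are not expected in Catalan data. It assumes the four subtypes listed here plus expl:pass and expl:pv, which are used in the section on the reflexive pronoun es although they do not appear in this overview; the function name check_deprel is hypothetical.

```python
# Subtypes listed in the overview.
USED_SUBTYPES = {"acl:relcl", "nsubj:pass", "csubj:pass", "aux:pass"}
# Subtypes used earlier for the reflexive pronoun "es" (not in the overview list).
USED_SUBTYPES |= {"expl:pass", "expl:pv"}
UNUSED_RELATIONS = {"clf", "dislocated"}


def check_deprel(deprel):
    """Return a problem description for a relation label, or None if it looks fine."""
    base = deprel.split(":", 1)[0]
    if base in UNUSED_RELATIONS:
        return f"{base} is not used in Catalan"
    if ":" in deprel and deprel not in USED_SUBTYPES:
        return f"subtype {deprel} is not listed for Catalan"
    return None
```

Plain universal relations such as nsubj or obj pass the check unchanged, since only subtypes and the two unused relations are restricted here.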
Treebanks
There is one Catalan UD treebank: