UD for Spanish
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters. Description of exceptions follows.
- According to typographical rules, many punctuation marks are attached to a neighboring word. We usually tokenize them as separate tokens (words) with the exception of abbreviations such as etc. “etc.” which are kept as one token with the period.
- There are two main classes of multi-word tokens:
- Contractions of prepositions and definite articles. Example: al = a + el “to the”, del = de + el “of the”.
- Certain verb forms (infinitives, imperatives, present participles) are written together with object clitic pronouns, while with other verb forms the clitics are written as separate words. Examples: convertirse = convertir + se “to become” (lit. “to convert itself”), hacerlo = hacer + lo “to do it”.
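The contraction rule above can be sketched in Python. This is a minimal illustration, not the tokenizer used for any treebank: the table `CONTRACTIONS` and the function `to_conllu_rows` are hypothetical names, and only the two contractions named above are covered. It relies on the CoNLL-U convention that a multi-word token gets a range ID (such as 2-3) followed by one row per syntactic word.

```python
# Hypothetical lookup table; only the two contractions named in the text.
CONTRACTIONS = {"al": ["a", "el"], "del": ["de", "el"]}


def to_conllu_rows(tokens):
    """Return (ID, FORM) pairs, adding a range row for each multi-word token."""
    rows = []
    next_id = 1
    for token in tokens:
        parts = CONTRACTIONS.get(token.lower())
        if parts is None:
            # Ordinary token: one surface token, one syntactic word.
            rows.append((str(next_id), token))
            next_id += 1
        else:
            # Multi-word token: range row first, then its syntactic words.
            last_id = next_id + len(parts) - 1
            rows.append((f"{next_id}-{last_id}", token))
            for part in parts:
                rows.append((str(next_id), part))
                next_id += 1
    return rows


for row_id, form in to_conllu_rows(["Voy", "al", "mercado"]):
    print(row_id, form, sep="\t")
```

Verb forms with attached clitics (convertirse, hacerlo) would need a morphological analyzer rather than a lookup table, so they are left out of this sketch.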
Morphology
Tags
- Spanish uses all 17 universal POS categories, including particles (PART).
- The only word to be tagged as particle is no “not”.
- TODO: rules for the PRON vs. DET distinction.
- Spanish auxiliary verbs (AUX) are:
- ser and estar “to be”, used as copulas
- ser “to be” for the passive (la sentencia fue publicada “the sentence was published”)
- estar “to be” for the progressive (mis hijos están estudiando inglés “my children are studying English”)
- haber “to have” for the perfect tenses (ha venido hoy “he has come today”)
- There are four main (de)verbal forms, distinguished by the UPOS tag and the value of the VerbForm feature:
Nominal Features
- Nominal words (NOUN, PROPN and PRON) have an inherent Gender feature with one of two values: Masc or Fem.
- The following parts of speech inflect for Gender because they must agree with nouns: ADJ, DET. Only a subset of adjectives can inflect for gender, with the suffix -o indicating the masculine and -a the feminine. A large group of adjectives (e.g. grande “big” or feliz “happy”) have just one form regardless of the gender of the modified noun. These adjectives have the gender feature empty.
- The two values of the Number feature are Sing and Plur. The following parts of speech inflect for number: NOUN, PROPN, PRON, ADJ, DET, VERB, AUX (finite and participles).
- Case has 4 possible values: Nom, Dat, Acc, Com. It occurs only with personal pronouns (PRON). The “case” (i.e., role w.r.t. predicates or other phrases) of other nominals is expressed using prepositions, not morphologically.
- Definite has 2 values: Ind, Def. It is used to distinguish the indefinite and definite articles (DET).
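The value inventories above can be turned into a small consistency check. The following is a sketch under stated assumptions: `ALLOWED_VALUES`, `parse_feats` and `check_nominal_feats` are hypothetical names, the input is assumed to be a CoNLL-U FEATS string (pairs joined by `|`, `_` for no features, alternative values joined by commas), and only the four features described in this section are checked.

```python
# Value inventories copied from the feature descriptions in this section.
ALLOWED_VALUES = {
    "Gender": {"Masc", "Fem"},
    "Number": {"Sing", "Plur"},
    "Case": {"Nom", "Dat", "Acc", "Com"},
    "Definite": {"Ind", "Def"},
}


def parse_feats(feats):
    """Parse a FEATS string such as 'Gender=Fem|Number=Sing' into a dict."""
    if feats == "_":  # CoNLL-U writes '_' for an empty feature set
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_nominal_feats(upos, feats):
    """Return a list of problems found (empty if there are none)."""
    problems = []
    parsed = parse_feats(feats)
    for name, value in parsed.items():
        allowed = ALLOWED_VALUES.get(name)
        if allowed is None:
            continue  # feature not covered by this sketch
        # A value may list several alternatives separated by commas.
        for single in value.split(","):
            if single not in allowed:
                problems.append(f"{name}={single} is not a documented value")
    if "Case" in parsed and upos != "PRON":
        problems.append("Case is documented only for PRON")
    return problems


print(check_nominal_feats("NOUN", "Case=Gen|Gender=Fem"))
```

An adjective such as grande, which has no gender inflection, would simply carry no Gender pair and passes the check unchanged.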
Degree and Polarity
- Degree applies to adjectives (ADJ) and adverbs (ADV) and has one of three possible values: Pos, Cmp, Abs. The absolute superlative is marked morphologically on adjectives. Otherwise, the comparative and superlative of most adjectives are formed periphrastically, and Degree=Cmp is only used with a few irregular forms.
- Polarity is used to mark the negative particle no, i.e., only the Neg value is used.
Verbal Features
- Finite verbs always have one of four values of Mood: Ind, Imp, Sub and Cnd.
- Finite verbs can have one of four values of Tense: Past, Imp, Pres, Fut.
- Imperative and conditional forms do not have the Tense feature. (In Spanish grammar, the conditional is itself often classified as a tense. However, it is a mood in Universal Dependencies.)
- The Tense feature is also used with the past participles (venido “come”).
- The Aspect feature is currently not used in Spanish. It is not needed for the imperfect past tense because UD has the special value Tense=Imp. And it is not needed for the perfect tenses because they are constructed periphrastically.
- The Voice feature is not used in Spanish because the passive voice is expressed periphrastically.
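The constraints on verbal features can be expressed as a short check. This is a sketch, not an official validator: `check_verbal_feats` is a hypothetical helper that takes an already parsed feature dictionary, and it assumes that finite forms are marked with VerbForm=Fin as in the universal feature inventory. Note that Imp means imperative under Mood but imperfect under Tense.

```python
MOODS = {"Ind", "Imp", "Sub", "Cnd"}      # Imp = imperative
TENSES = {"Past", "Imp", "Pres", "Fut"}   # Imp = imperfect
UNUSED = ("Aspect", "Voice")              # described above as not used in Spanish


def check_verbal_feats(feats):
    """Return a list of problems for one verb's feature dict (empty if none)."""
    problems = []
    mood = feats.get("Mood")
    tense = feats.get("Tense")
    # Assumption: finite forms carry VerbForm=Fin.
    if feats.get("VerbForm") == "Fin" and mood not in MOODS:
        problems.append("finite verb needs Mood in Ind/Imp/Sub/Cnd")
    if tense is not None and tense not in TENSES:
        problems.append(f"Tense={tense} is not a documented value")
    # Imperative and conditional forms do not have Tense.
    if mood in ("Imp", "Cnd") and tense is not None:
        problems.append(f"Mood={mood} must not carry Tense")
    for name in UNUSED:
        if name in feats:
            problems.append(f"{name} is not used in Spanish")
    return problems


print(check_verbal_feats({"VerbForm": "Fin", "Mood": "Cnd", "Tense": "Pres"}))
```

A past participle such as venido, with VerbForm=Part and Tense=Past and no Mood, passes the check because the Mood requirement applies only to finite forms.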
Pronouns, Determiners, Quantifiers
- PronType is used with pronouns (PRON) and determiners (DET).
- NumType is used with numerals (NUM), adjectives (ADJ) and determiners (DET).
- The Poss feature marks possessive personal determiners (e.g. mi “my”), possessive personal pronouns (e.g. mío “mine”), and possessive interrogative or relative determiners (e.g. cuyo “whose”).
- The Reflex feature is always used together with PronType=Prs and it marks reflexive pronouns (me, te, se, nos, os). Note that their forms in the first and second person are ambiguous with irreflexive accusative forms, and the Reflex feature must be decided by context.
- Person is a lexical feature of personal pronouns (PRON) and has three values, 1, 2 and 3. With personal possessive determiners (DET), the feature actually encodes the person of the possessor. Person is not marked on other types of pronouns and on nouns, although they can almost always be interpreted as the 3rd person.
- The Polite feature distinguishes informal second-person pronouns (tú, vosotros, Polite=Infm) from the formal usted, ustedes (Polite=Form).
- There is one layered feature, Number[psor]. It appears with possessive determiners and encodes the lexical number of the possessor. The extra layer is needed to distinguish this lexical feature from the inflectional number that marks agreement with the modified (possessed) noun.
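The difference between Number and Number[psor] can be made concrete with a short sketch. The names `split_layer` and `number_features` are hypothetical, and the feature dictionary for nuestro “our” is an illustrative analysis written for this example, not a row copied from a treebank.

```python
import re

# A feature name, optionally followed by a layer in square brackets.
FEATURE_NAME = re.compile(r"^([A-Za-z]+)(?:\[([a-z]+)\])?$")


def split_layer(feature_name):
    """Split 'Number[psor]' into ('Number', 'psor'); the layer is None if absent."""
    match = FEATURE_NAME.match(feature_name)
    if match is None:
        raise ValueError(f"not a feature name: {feature_name!r}")
    return match.group(1), match.group(2)


def number_features(feats):
    """Separate agreement number from possessor number in a feature dict."""
    result = {}
    for name, value in feats.items():
        base, layer = split_layer(name)
        if base != "Number":
            continue
        if layer is None:
            result["agreement"] = value   # agrees with the possessed noun
        elif layer == "psor":
            result["possessor"] = value   # lexical number of the possessor
    return result


# Illustrative analysis: nuestro "our" before a singular noun.
nuestro = {
    "Number": "Sing",
    "Number[psor]": "Plur",
    "Person": "1",
    "Poss": "Yes",
    "PronType": "Prs",
}
print(number_features(nuestro))
```

Here the possessor is plural (“we”) while the determiner itself is singular because it agrees with the possessed noun.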
Other Features
Syntax
Core Arguments, Oblique Arguments and Adjuncts
- Nominal subject (nsubj) is a bare noun phrase without preposition. If it is a personal pronoun, it must be in the nominative form (note however that Spanish is a pro-drop language, where pronominal subjects can be omitted).
- Direct nominal object (obj) is either a bare noun phrase (for inanimate objects) or a prepositional phrase with the preposition a (for animate objects) or a personal pronoun in the accusative form.
- Extra attention has to be paid to the reflexive pronoun se. It can function as:
- Core object (obj or iobj): él se vio en el espejo “he saw himself in the mirror.”
- Reciprocal core objects (obj or iobj): se besaron “they kissed each other.”
- Reflexive passive (expl:pass): se celebran los cien años del club “the hundred years of the club are celebrated” (lit. “celebrate themselves”); se dice que la escribió en París “it is said that he wrote it in Paris.”
- Inherently reflexive verb: the verb cannot exist without the reflexive clitic, and the clitic cannot be substituted by an irreflexive pronoun or a noun phrase. In many cases, an irreflexive counterpart of the verb actually exists but its meaning is different because it denotes a different action performed by the agent. In accord with the current UD guidelines, we label the relation between the verb and the clitic as expl:pv, not compound. Example: se trataba de un negocio nuevo “it was a matter of a new business.”
- In passive clauses, the subject is labeled nsubj:pass if it is nominal and csubj:pass if it is clausal.
- The auxiliary verb in periphrastic passive is labeled aux:pass.
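The functions of se listed above map onto relation labels as in the following sketch. The table `SE_RELATIONS` and the helper `possible_functions` are hypothetical names, and the function labels are informal descriptions taken from the list, not UD terminology. The sketch only shows which readings a given label is compatible with; choosing between them still depends on context.

```python
# Relation labels documented above for each function of the clitic se.
SE_RELATIONS = {
    "reflexive object": {"obj", "iobj"},
    "reciprocal object": {"obj", "iobj"},
    "reflexive passive": {"expl:pass"},
    "inherently reflexive verb": {"expl:pv"},
}


def possible_functions(deprel):
    """List the functions of se that are compatible with a relation label."""
    return sorted(
        function
        for function, relations in SE_RELATIONS.items()
        if deprel in relations
    )


print(possible_functions("obj"))
print(possible_functions("expl:pv"))
print(possible_functions("compound"))
```

An empty result, as for compound, signals a label that this section does not allow on se.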
Non-verbal Clauses
- The copula verbs ser and estar “to be” are used in equational, attributional, locative, possessive and benefactive nonverbal clauses.
- Existential clauses use a different verb, haber (in the form hay “there is”), and the entity whose existence is asserted is its object: hay algo para comer “there is something to eat.”
Relations Overview
- The following relation subtypes are used in Spanish:
- acl:relcl for relative clauses
- nsubj:pass for nominal subjects of passive verbs
- csubj:pass for clausal subjects of passive verbs
- aux:pass for passive auxiliaries
- The following relation types are not used in Spanish at all: clf, dislocated
Treebanks
There are three Spanish UD treebanks: