ID: Korlex-Serbian-Resource

RESOURCE DESCRIPTION

The lexical resource Korlex-Serbian-Resource provides a list of 102,019
Serbian lemmas, i.e., words in canonical form, each annotated with a
part-of-speech (POS) tag and lexical features. The resource is a flat text
file in which each line contains information about one lemma. The format
of a line can be captured with the following Perl regular expressions:

    # Characters appearing in a word (ISO-8859-2)
    $c = qr/[-{}|.\/\d\w\xA9\xAE\xB9\xBE\xC6\xC8\xD0\xE6\xE8\xF0]/;

    # A lemma
    $l = qr/$c+(?: $c+)*/;

    # A lemma specification (each line in the resource)
    /^($l(?:#$l)?)\s+:(\w+)([\w:]+)\r?$/

In the last expression, $1 is the lemma, $2 is the POS tag, and $3 is the
concatenated list of features. A typical line is:

    vrata :nn:f

in which "vrata" is the lemma, the POS tag is "nn", and the features
include the gender "f" (feminine).

A lemma may contain the hash sign (#), in which case it denotes a
frequently misspelled form. For example, in

    bidem#budem :spec:x

"bidem" is an incorrect form, followed by the correct form "budem".
Additionally, incorrect forms are marked with the feature ":x".

A lemma may contain curly braces ({}) and optionally a pipe character (|),
in which case the lemma differs between the ekavian and ijekavian dialects
of Serbian (ekavian is spoken in Serbia, while ijekavian is spoken in
Montenegro and Republika Srpska (Bosnia)). For example, in

    m{j}esec :nn:m
    d{e|i}o :nn:m

"mesec" and "deo" are the ekavian forms, while "mjesec" and "dio" are the
ijekavian forms. To obtain the ekavian or ijekavian form, it is sufficient
to apply one of the following two regular expressions:

    s/\{(?:([^|}]*)\||())[^}]*}/$1/g;   # ekavian
    s/\{(?:[^|}]*\|)?([^}]*)}/$1/g;     # ijekavian

The resource is encoded in ISO-8859-2 and sorted according to the standard
Serbian lexicographic order.
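For readers who do not work in Perl, the line format and the two dialect substitutions described above can be sketched in Python. This is a minimal port, not part of the resource: the function names `parse_line`, `ekavian`, and `ijekavian` are hypothetical, and the sketch assumes each line is read as raw ISO-8859-2 bytes without its trailing newline.

```python
import re

# Byte-level port of the Perl expressions (assumption: input is raw ISO-8859-2 bytes).
C = rb"[-{}|./\d\w\xA9\xAE\xB9\xBE\xC6\xC8\xD0\xE6\xE8\xF0]"   # word characters
L = C + rb"+(?: " + C + rb"+)*"                                # a lemma
LINE = re.compile(rb"^(" + L + rb"(?:#" + L + rb")?)\s+:(\w+)([\w:]+)\r?$")


def parse_line(raw):
    """Return (lemma, pos, features) for one resource line, or None if it does not match."""
    m = LINE.match(raw)
    if m is None:
        return None
    lemma, pos, feats = (g.decode("iso-8859-2") for g in m.groups())
    # The feature group looks like ":f" or ":m:pl"; split it into a list.
    return lemma, pos, [f for f in feats.split(":") if f]


def ekavian(lemma):
    """Keep the part before '|' inside braces; drop the braces' content if there is no '|'."""
    return re.sub(r"\{(?:([^|}]*)\||())[^}]*\}",
                  lambda m: m.group(1) or "", lemma)


def ijekavian(lemma):
    """Keep the part after '|' inside braces, or the whole content if there is no '|'."""
    return re.sub(r"\{(?:[^|}]*\|)?([^}]*)\}", r"\1", lemma)


print(parse_line(b"vrata :nn:f"))                  # ('vrata', 'nn', ['f'])
print(ekavian("m{j}esec"), ijekavian("m{j}esec"))  # mesec mjesec
print(ekavian("d{e|i}o"), ijekavian("d{e|i}o"))    # deo dio
```

Matching on bytes keeps `\w` restricted to ASCII, so the explicit `\xA9` ... `\xF0` codes cover the Serbian letters exactly as listed in the Perl character class; decoding to text happens only after a line has matched.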
The resource statistics are presented below:

Table of POS Tags

Tag   Part of speech                             Count
------------------------------------------------------
cc    coordinate conjunction                         3
cd    cardinal number                               98
cs    subordinate conjunction                       33
etc   list continuation (etc.)                       5
in    preposition                                   72
jj    adjective                                  24555
jjr   adjective, comparative                      4201
jjs   adjective, superlative                       693
nn    noun                                       45590
nnc   collective noun                              108
nnp   proper noun                                 3457
nns   noun, plural (no regular singular form)       83
od    ordinal number                                70
pr    pronoun                                      299
rb    adverb                                      8547
spec  special syntactic tag (e.g., clitic           72
      and auxiliary verbs)
uh    interjection                                  64
vb    verb                                       14069
------------------------------------------------------
                                               102,019

Table of Features

Tag:Features   Description                Count
-----------------------------------------------
:in:x          incorrect form                 1
:jj:f          gender feminine               50
   :m          gender masculine           24398
   :n          gender neuter                107
   :pl         plural                         6
   :x          incorrect form                46
:jjr:f         gender feminine                3
    :m         gender masculine            4198
    :pl        plural                         1
    :x         incorrect form                 1
:jjs:m         gender masculine             693
:nn:dim        diminutive                   419
   :f          gender feminine            17040
   :m          gender masculine           16710
   :n          gender neuter              11840
   :pl         plural                        60
   :x          incorrect form               110
:nnc:f         gender feminine               25
    :m         gender masculine               4
    :n         gender neuter                 79
:nnp:dim       diminutive                     1
    :f         gender feminine              695
    :m         gender masculine            2716
    :n         gender neuter                 46
    :pl        plural                        44
:nns:dim       diminutive                    17
    :f         gender feminine               48
    :m         gender masculine               7
    :n         gender neuter                  8
:od:x          incorrect form                 1
:pr:dt         demonstrative                 52
   :f          gender feminine               95
   :m          gender masculine             116
   :n          gender neuter                 65
   :pl         plural                       104
   :sg         singular                     178
   :wh         interrogative                 19
:rb:wh         interrogative                 12
   :x          incorrect form                 2
:vb:x          incorrect form                85
-----------------------------------------------
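Counts like those in the two tables can in principle be recomputed from the resource file itself. The following Python sketch shows one way to tally POS tags and tag:feature pairs; the function name `tally` is hypothetical, and the sketch assumes the lines have already been decoded from ISO-8859-2 and that the tag specification is the last whitespace-separated field of each line.

```python
from collections import Counter


def tally(lines):
    """Count POS tags and (POS, feature) pairs over decoded resource lines."""
    pos_counts = Counter()
    feature_counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields or not fields[-1].startswith(":"):
            continue  # skip blank or malformed lines
        # ":nn:f" splits into ["", "nn", "f"]; drop the empty leading piece.
        tags = [t for t in fields[-1].split(":") if t]
        if not tags:
            continue
        pos, features = tags[0], tags[1:]
        pos_counts[pos] += 1
        for feature in features:
            feature_counts[(pos, feature)] += 1
    return pos_counts, feature_counts


sample = ["vrata :nn:f", "bidem#budem :spec:x", "m{j}esec :nn:m", "d{e|i}o :nn:m"]
pos_counts, feature_counts = tally(sample)
print(pos_counts["nn"], feature_counts[("nn", "m")])  # 3 2
```

To run this over the whole resource, the file would be opened with `encoding="iso-8859-2"` and passed to `tally` line by line; the sum of the POS counts should then equal the number of lines that carry a tag specification.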