Sawalha, M; Atwell, E (2013)
Publisher: Edinburgh University Press
Languages: English
Types: Article
Subjects:
The SALMA Morphological Features Tag Set (SALMA, Sawalha Atwell Leeds Morphological Analysis tag set for Arabic) captures long-established traditional morphological features of Arabic grammar in a compact yet transparent notation. First, we introduce Part-of-Speech tagging and tag set standards for English and other European languages, and then survey Arabic Part-of-Speech taggers and corpora, and long-established Arabic traditions in the analysis of morphology. A range of existing Arabic Part-of-Speech tag sets is illustrated and compared, and we review generic design criteria for corpus tag sets. For a morphologically rich language like Arabic, the Part-of-Speech tag set should be defined in terms of morphological features characterizing word structure. We describe the SALMA Tag Set in detail, explaining and illustrating each feature and its possible values. In our analysis, a tag consists of 22 characters; each position represents a feature, and the letter at that position represents a value or attribute of that morphological feature; the dash ‘-’ marks a feature that is not relevant to a given word. The first character shows the main Part of Speech: noun, verb, particle, punctuation, or Other (residual); these last two extend the traditional three classes to handle modern texts. ‘Noun’ in Arabic subsumes what are traditionally referred to in English as ‘noun’ and ‘adjective’. Characters 2, 3, and 4 represent subcategories; traditional Arabic grammar recognizes 34 subclasses of noun (letter 2), 3 subclasses of verb (letter 3), and 21 subclasses of particle (letter 4). Others (residuals) and punctuation marks are represented in letters 5 and 6 respectively.
The next letters represent traditional morphological features: gender (7), number (8), person (9), inflectional morphology (10), case or mood (11), case and mood marks (12), definiteness (13), voice (14), emphasized and non-emphasized (15), transitivity (16), rational (17), and declension and conjugation (18). Finally, there are four characters representing morphological information that is useful in Arabic text analysis, although not all linguists would count these as traditional features: unaugmented and augmented (19), number of root letters (20), verb root (21), and types of nouns according to their final letters (22). The SALMA Tag Set is not tied to a specific tagging algorithm or theory, and other tag sets could be mapped onto this standard to simplify and promote comparison and reuse of Arabic taggers and tagged corpora.
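
The positional layout described above can be illustrated with a minimal sketch. It assumes only what the abstract states: a tag is 22 characters long, each position stands for one feature, and a dash marks a feature that does not apply. The names `FEATURES` and `decode_tag` are hypothetical, and the letters `X` and `Y` in the example are placeholders, not actual SALMA value codes (the abstract does not list them).

```python
from typing import Dict

# Feature names in tag order (positions 1 to 22), taken from the abstract.
FEATURES = [
    "main part of speech",            # 1
    "noun subcategory",               # 2
    "verb subcategory",               # 3
    "particle subcategory",           # 4
    "other (residual)",               # 5
    "punctuation",                    # 6
    "gender",                         # 7
    "number",                         # 8
    "person",                         # 9
    "inflectional morphology",        # 10
    "case or mood",                   # 11
    "case and mood marks",            # 12
    "definiteness",                   # 13
    "voice",                          # 14
    "emphasized and non-emphasized",  # 15
    "transitivity",                   # 16
    "rational",                       # 17
    "declension and conjugation",     # 18
    "unaugmented and augmented",      # 19
    "number of root letters",         # 20
    "verb root",                      # 21
    "noun type by final letters",     # 22
]


def decode_tag(tag: str) -> Dict[str, str]:
    """Map each position of a 22-character tag to its feature name.

    Positions holding '-' are skipped, since the dash marks a feature
    that is not relevant to the word. The value letters are returned
    as-is; interpreting them needs the full tag set definition.
    """
    if len(tag) != len(FEATURES):
        raise ValueError(
            f"expected {len(FEATURES)} characters, got {len(tag)}"
        )
    return {name: value for name, value in zip(FEATURES, tag) if value != "-"}


# Placeholder tag: 'X' at position 1, 'Y' at position 7, dashes elsewhere.
# These letters are illustrative only, not real SALMA codes.
example_tag = "X" + "-" * 5 + "Y" + "-" * 15
decoded = decode_tag(example_tag)
```

Because every feature has a fixed position, a tag from another scheme could in principle be mapped onto this layout by filling the positions it covers and leaving dashes elsewhere, which is the kind of reuse the abstract points to.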
