Atwell, ES; Hughes, J; Souter, DC (1994)
Publisher: Association for Computational Linguistics
Languages: English
Types: Other

Several Corpus Linguistics research groups have gone beyond collation of 'raw' text, to syntactic annotation of the text. However, linguists developing these linguistic resources have used quite different wordtagging and parse-tree labelling schemes in each of these annotated corpora. This restricts the accessibility of each corpus, making it impossible for speech and handwriting researchers to collate them into a single very large training set. This is particularly problematic as there is evidence that one of these parsed corpora on its own is too small for a general statistical model of grammatical structure, but the combined size of all the above annotated corpora should deliver a much more reliable model. We are developing a set of mapping algorithms to map between the main tagsets and phrase structure grammar schemes used in the above corpora. We plan to develop a Multi-tagged Corpus and a MultiTreebank, a single text-set annotated with all the above tagging and parsing schemes. The text-set is the Spoken English Corpus: this is a half-way house between formal written text and colloquial conversational speech. However, the main deliverable to the computational linguistics research community is not the SEC-based MultiTreebank, but the mapping suite used to produce it - this can be used to combine currently-incompatible syntactic training sets into a large unified multicorpus. Our architecture combines standard statistical language modelling and a rule-base derived from linguists' analyses of tagset-mappings, in a novel yet intuitive way. Our development of the mapping algorithms aims to distinguish notational from substantive differences in the annotation schemes, and we will be able to evaluate tagging schemes in terms of how well they fit standard statistical language models such as n-pos (Markov) models.
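
The combination of a rule-base with a statistical language model can be illustrated with a minimal sketch. It is not the authors' implementation: the tag names, the candidate table `CANDIDATES`, the bigram probabilities `BIGRAM`, and the function `map_tags` are all hypothetical and chosen only for illustration. The rule-base proposes one or more target-scheme tags for each source-scheme tag, and a bigram (Markov) model over target tags selects the most probable sequence.

```python
import math

# Hypothetical rule-base: each source-scheme tag maps to one or more
# candidate tags in the target scheme (toy values, not a real tagset mapping).
CANDIDATES = {
    "DET": ["AT"],
    "N": ["NN", "NP"],
    "V": ["VB", "VBD"],
}

# Hypothetical bigram probabilities P(tag | previous tag) over target tags.
BIGRAM = {
    ("<s>", "AT"): 0.6, ("<s>", "NP"): 0.3, ("<s>", "NN"): 0.1,
    ("AT", "NN"): 0.9, ("AT", "NP"): 0.1,
    ("NN", "VBD"): 0.6, ("NN", "VB"): 0.4,
    ("NP", "VBD"): 0.7, ("NP", "VB"): 0.3,
}
FLOOR = 1e-6  # probability assigned to unseen tag pairs


def map_tags(source_tags):
    """Return the most probable target-tag sequence for a source-tag sequence.

    The rule-base restricts the candidates at each position; a Viterbi search
    under the bigram model chooses among them.
    """
    # best[tag] = (log-probability of best path ending in tag, that path)
    best = {"<s>": (0.0, [])}
    for src in source_tags:
        nxt = {}
        for cand in CANDIDATES[src]:
            score, path = max(
                (lp + math.log(BIGRAM.get((prev, cand), FLOOR)), p)
                for prev, (lp, p) in best.items()
            )
            nxt[cand] = (score, path + [cand])
        best = nxt
    return max(best.values())[1]


print(map_tags(["DET", "N", "V"]))
```

In this toy setting the ambiguous source tag `N` resolves differently depending on context: after a determiner the model prefers `NN`, while at the start of a sequence it prefers `NP`. A source tag missing from the rule-base raises `KeyError`, which a fuller version would need to handle.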