Kim, Y.; Ross, S. (2006)
Publisher: Springer
Languages: English
Types: Other
Subjects: ZA4050
Metadata creation is a crucial aspect of the ingest of digital materials into digital libraries. The metadata needed to document and manage digital materials are extensive, and creating them manually is expensive. The Digital Curation Centre (DCC) has undertaken research to automate this process for some classes of digital material. We have segmented the problem, and this paper discusses results in genre classification as a first step toward automating metadata extraction from documents. Here we propose a classification method built on looking at the documents from five directions: as an object exhibiting a specific visual format, as a linear layout of strings with characteristic grammar, as an object with stylometric signatures, as an object with intended meaning and purpose, and as an object linked to previously classified objects and other external sources. The results of some experiments in relation to the first two directions are described here; they are meant to be indicative of the promise underlying this multi-faceted approach.