Ibrahim, O.; Landa-Silva, Dario (2016)
Publisher: Springer
Languages: English
Types: Article
In the context of Information Retrieval (IR) from text documents, the term-weighting scheme (TWS) is a key component of the matching mechanism when using the vector space model (VSM). In this paper we propose a new TWS that is based on computing the average number of term occurrences in documents, and that also uses a discriminative approach based on the document centroid vector to remove less significant weights from the documents. We call our approach Term Frequency With Average Term Occurrence (TF-ATO). An analysis of commonly used document collections shows that test collections are not fully judged, because achieving that is expensive and may be infeasible for large collections. A document collection being fully judged means that every document in the collection acts as a relevant document to a specific query or a group of queries. The discriminative approach used in our proposal is a heuristic method for improving IR effectiveness and performance, and it has the advantage of not requiring prior knowledge of relevance judgements. We compare the performance of the proposed TF-ATO to the well-known TF-IDF approach and show that using TF-ATO results in better effectiveness in both static and dynamic document collections. In addition, this paper investigates the impact that stop-word removal and our discriminative approach have on TF-IDF and TF-ATO. The results show that both stop-word removal and the discriminative approach have a positive effect on both term-weighting schemes. More importantly, it is shown that the proposed discriminative approach improves IR effectiveness and performance without using any relevance-judgement information for the collection.
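The abstract does not give the formulas, so the following is only a minimal sketch of how such a scheme might look, under stated assumptions. It assumes that a term's weight in a document is its raw frequency divided by that document's average term occurrence (total occurrences divided by the number of distinct terms), and that the discriminative step drops any weight that does not exceed the corresponding component of the collection centroid. The function names `tf_ato`, `centroid`, and `discriminate` are hypothetical and do not come from the paper.

```python
from collections import Counter, defaultdict


def tf_ato(docs):
    """Weight each term by tf / ATO, where ATO is the document's
    average term occurrence (ASSUMED definition, not from the abstract).

    docs: list of token lists. Returns a list of {term: weight} dicts.
    """
    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        if not tf:
            weighted.append({})
            continue
        # Average number of occurrences per distinct term in this document.
        ato = sum(tf.values()) / len(tf)
        weighted.append({term: count / ato for term, count in tf.items()})
    return weighted


def centroid(weighted):
    """Mean weight of each term over all documents (absent terms count as 0)."""
    totals = defaultdict(float)
    for doc in weighted:
        for term, weight in doc.items():
            totals[term] += weight
    n = len(weighted)
    return {term: total / n for term, total in totals.items()}


def discriminate(weighted):
    """Keep only weights strictly above the centroid weight of the same term
    (ASSUMED thresholding rule for the discriminative approach)."""
    c = centroid(weighted)
    return [
        {term: w for term, w in doc.items() if w > c[term]}
        for doc in weighted
    ]
```

Note that this sketch needs only term counts and the collection centroid, which is consistent with the abstract's claim that the discriminative approach requires no relevance judgements. With a different thresholding rule the set of removed weights would change.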
