In this paper we exploit concepts of information theory to address the fundamental problem of identifying and defining the most suitable tools for extracting, in an automatic and agnostic way, information from a generic string of characters. In particular, we introduce a class of methods which rely crucially on data compression techniques to define a measure of remoteness, or distance, between pairs of character sequences (e.g. texts) based on their relative information content. We also discuss in detail how specific features of data compression techniques can be used to introduce the notions of the dictionary of a given sequence and of an artificial text, and we show how these new tools can be exploited for information extraction. We point out the versatility and generality of our method, which applies to any kind of corpus of character strings, independently of the underlying coding. As a case study we consider linguistically motivated problems, and we present results for automatic language recognition, authorship attribution and self-consistent classification.
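The core idea above can be illustrated with a minimal sketch. This is not the authors' exact algorithm: it uses Python's `zlib` (an LZ77-family compressor) as a stand-in for a generic data compression tool, and an NCD-style normalization as one possible way to turn the extra compressed length of a concatenation into a distance. Appending text B to text A and measuring how many additional compressed bytes are needed estimates the information content of B relative to A.

```python
import zlib

def compressed_size(data: bytes) -> int:
    # Length of the zlib-compressed representation,
    # used as a proxy for the information content of the string.
    return len(zlib.compress(data, 9))

def relative_distance(a: str, b: str) -> float:
    """Compression-based distance between two texts (a sketch).

    C(a+b) - C(a) approximates the extra information b carries once the
    compressor has already 'learned' the regularities of a; the result is
    symmetrized and normalized in the spirit of the normalized
    compression distance.
    """
    ca = compressed_size(a.encode("utf-8"))
    cb = compressed_size(b.encode("utf-8"))
    cab = compressed_size((a + b).encode("utf-8"))
    cba = compressed_size((b + a).encode("utf-8"))
    return max(cab - ca, cba - cb) / max(ca, cb)

# Toy illustration of language recognition: two English fragments should
# be closer to each other than to an Italian fragment, because the
# compressor exploits their shared vocabulary.
english_1 = "the quick brown fox jumps over the lazy dog " * 20
english_2 = "a lazy dog sleeps while the quick fox runs away " * 20
italian = "la volpe veloce salta sopra il cane pigro nel prato " * 20

d_same = relative_distance(english_1, english_2)
d_diff = relative_distance(english_1, italian)
```

In practice one would compare an unknown text against reference corpora in several candidate languages (or by several candidate authors) and attribute it to the reference that minimizes this distance.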