Babych, B (2016)
Publisher: University of Latvia
Languages: English
Types: Article
This paper presents a methodology for calculating a modified Levenshtein edit distance between character strings and applies it to the task of automated cognate identification from non-parallel (comparable) corpora. This task is an important stage in developing MT systems and bilingual dictionaries beyond the coverage of the traditionally used aligned parallel corpora. Cognate identification can be used to find translation equivalents for the ‘long tail’ of the Zipfian distribution: low-frequency and usually unambiguous lexical items in closely related languages, many of which are under-resourced. Graphonological Levenshtein edit distance relies on editing hierarchical representations of phonological features for graphemes (graphonological representations) and improves on the phonological edit distance proposed for measuring dialectological variation. Graphonological edit distance works directly with character strings and does not require an intermediate stage of phonological transcription. It thereby exploits the advantages of the historical and morphological principles of orthography, which are obscured if only the phonetic principle is applied. Difficulties associated with plain feature representations (unstructured feature sets or vectors) are addressed by using a linguistically motivated feature hierarchy that restricts the matching of lower-level graphonological features when higher-level features are not matched. The paper evaluates the graphonological edit distance against the traditional Levenshtein edit distance in terms of its usefulness for automated cognate identification. It also discusses the advantages of the proposed method, which can be used for morphology induction, for robust transliteration across different alphabets (Latin, Cyrillic, Arabic, etc.) and for robust identification of words with non-standard or distorted spelling, e.g., in user-generated content on the web such as social media posts, blogs and comments.
Software for calculating the modified feature-based Levenshtein distance and the corresponding graphonological feature representations (vectors and hierarchies of graphemes’ features) are released on the author’s webpage: http://corpus.leeds.ac.uk/bogdan/phonologylevenshtein/. Features are currently available for the Latin and Cyrillic alphabets and will be extended to other alphabets and languages.
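
The idea of a feature-based substitution cost can be illustrated with a minimal sketch. This is not the released software: the feature table `FEATURES`, the function names and the cost formula (the share of mismatched lower-level features) are all assumptions made up for illustration. The only part taken from the abstract is the principle that lower-level features are compared only when the higher-level feature (here, `type`) matches.

```python
from typing import Callable, Dict, Optional

# Hypothetical feature hierarchy for a handful of Latin graphemes.
# "type" is the higher-level feature; the remaining keys are lower-level
# features that are compared only when "type" matches.
FEATURES: Dict[str, Dict[str, str]] = {
    "b": {"type": "consonant", "place": "labial", "voice": "voiced", "manner": "plosive"},
    "p": {"type": "consonant", "place": "labial", "voice": "voiceless", "manner": "plosive"},
    "d": {"type": "consonant", "place": "dental", "voice": "voiced", "manner": "plosive"},
    "t": {"type": "consonant", "place": "dental", "voice": "voiceless", "manner": "plosive"},
    "a": {"type": "vowel", "height": "low", "backness": "back"},
    "o": {"type": "vowel", "height": "mid", "backness": "back"},
}


def plain_cost(a: str, b: str) -> float:
    """Traditional Levenshtein substitution cost: 0 for a match, 1 otherwise."""
    return 0.0 if a == b else 1.0


def feature_cost(a: str, b: str) -> float:
    """Assumed feature-based cost: share of mismatched lower-level features."""
    if a == b:
        return 0.0
    fa: Optional[Dict[str, str]] = FEATURES.get(a)
    fb: Optional[Dict[str, str]] = FEATURES.get(b)
    if fa is None or fb is None:
        return 1.0  # unknown grapheme: fall back to the plain cost
    if fa["type"] != fb["type"]:
        return 1.0  # higher-level mismatch: lower-level features are not compared
    lower = [key for key in fa if key != "type"]
    mismatches = sum(fa[key] != fb.get(key) for key in lower)
    return mismatches / len(lower)


def edit_distance(s: str, t: str,
                  sub_cost: Callable[[str, str], float] = plain_cost) -> float:
    """Levenshtein distance with unit insert/delete cost and pluggable substitution cost."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1.0,                # deletion
                            curr[j - 1] + 1.0,            # insertion
                            prev[j - 1] + sub_cost(a, b)))  # substitution
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Transliterated cognate pair used purely as an example.
    print(edit_distance("robota", "rabota"))                # 1.0
    print(edit_distance("robota", "rabota", feature_cost))  # 0.5
```

With the plain cost, replacing "o" by "a" counts as a full edit. With the assumed feature cost it counts as half an edit, because the two vowels differ in only one of their two lower-level features, so the cognate pair scores as closer.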