Veluru, S.; Rahulamathavan, Y.; Manandhar, S.; Rajarajan, M. (2014)
Languages: English
Types: Unknown
Subjects: GN, QA75
Surnames (family names) and forenames evolve over generations and can be used to understand population origins, migration, identity, social norms and cultural customs. These names may have hidden structure, called communities, associated with them. Each community might show strong correlation among several forenames and surnames. In addition, correlation might exist across communities of forenames and surnames. Latent Dirichlet Allocation (LDA), a popular statistical generative model, was developed to find topics in a corpus of documents. The LDA model can also be adapted to identify hidden communities in a names data set. This paper proposes several variants of the LDA model to capture correlation between surnames and forenames, within and across communities, over a set of names collected at different locations. First, we propose a surname correlated LDA model and a forename correlated LDA model. These models identify communities in surnames or forenames and extract the corresponding correlated forenames or surnames in each community, respectively. We then propose a surname community correlated LDA model and a forename community correlated LDA model. These models estimate the correlation between each surname community and the communities of forenames, and vice versa. We experiment on names data sets from India and the United Kingdom and draw conclusions.
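To make the starting point concrete, the following is a minimal sketch of plain LDA fitted by collapsed Gibbs sampling, not the correlated variants the paper proposes. It assumes, for illustration only, that each location plays the role of a document and each surname the role of a word, so that the learned topics can be read as candidate surname communities. The function name `fit_lda`, the hyperparameter values, and the toy surname lists are all hypothetical and do not come from the paper.

```python
import random


def fit_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Plain LDA via collapsed Gibbs sampling (illustrative sketch only).

    docs: list of token lists; here, one list of surnames per location.
    Returns (theta, phi, vocab): per-document topic proportions,
    per-topic word distributions, and the sorted vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]            # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assign = []                                            # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][index[w]] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = index[w]
                z = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + n_topics * alpha)
         for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + n_words * beta)
         for v in range(n_words)]
        for k in range(n_topics)
    ]
    return theta, phi, vocab


# Hypothetical toy data: surnames observed at four locations.
locations = [
    ["Patel", "Shah", "Patel", "Mehta", "Shah", "Patel"],
    ["Smith", "Jones", "Taylor", "Smith", "Jones"],
    ["Patel", "Mehta", "Shah", "Shah", "Mehta"],
    ["Taylor", "Smith", "Jones", "Taylor", "Smith"],
]
theta, phi, vocab = fit_lda(locations, n_topics=2)

# Show the most probable surnames in each inferred community.
for k, row in enumerate(phi):
    top = sorted(zip(row, vocab), reverse=True)[:3]
    print(k, [name for _, name in top])
```

Each row of `theta` gives the community mixture at one location, and each row of `phi` gives a community's distribution over surnames. The models described in the abstract go further by also linking forenames to these communities, which this sketch does not attempt.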