Remember Me
Or use your Academic/Social account:


Or use your Academic/Social account:


You have just completed your registration at OpenAire.

Before you can login to the site, you will need to activate your account. An e-mail will be sent to you with the proper instructions.


Please note that this site is currently undergoing Beta testing.
Any new content you create is not guaranteed to be present to the final version of the site upon release.

Thank you for your patience,
OpenAire Dev Team.

Close This Message


Verify Password:
Verify E-mail:
*All Fields Are Required.
Please Verify You Are Human:
fbtwitterlinkedinvimeoflicker grey 14rssslideshare1
Ntarmos, N.; Patlakas, I.; Triantafillou, P. (2014)
Publisher: VLDB Endowment Inc.
Languages: English
Types: Article
Rank (i.e., top-k) join queries play a key role in modern analytics\ud tasks. However, despite their importance and unlike\ud centralized settings, they have been completely overlooked\ud in cloud NoSQL settings. We attempt to fill this gap: We\ud contribute a suite of solutions and study their performance\ud comprehensively. Baseline solutions are ordered using SQLlike\ud languages (like Hive and Pig), based on MapReduce\ud jobs. We first provide solutions that are based on specialized\ud indices, which may themselves be accessed using either\ud MapReduce or coordinator-based strategies. The first\ud index-based solution is based on inverted indices, which are\ud accessed with MapReduce jobs. The second index-based\ud solution adapts a popular centralized rank-join algorithm.\ud We further contribute a novel statistical structure comprising\ud histograms and Bloom filters, which forms the basis for\ud the third index-based solution. We provide (i) MapReduce\ud algorithms showing how to build these indices and statistical\ud structures, (ii) algorithms to allow for online updates to\ud these indices, and (iii) query processing algorithms utilizing\ud them. We implemented all algorithms in Hadoop (HDFS)\ud and HBase and tested them on TPC-H datasets of various\ud scales, utilizing different queries on tables of various sizes\ud and different score-attribute distributions. We ported our\ud implementations to Amazon EC2 and "in-house" lab clusters\ud of various scales. We provide performance results for\ud three metrics: query execution time, network bandwidth\ud consumption, and dollar-cost for query execution.
  • The results below are discovered through our pilot algorithms. Let us know how we are doing!

    • [1] A. Abouzeid, et al. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB, 2(1):922{933, 2009.
    • [2] F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In Proc. EDBT, 2010.
    • [3] B. H. Bloom. Space/time trade-o s in hash coding with allowable errors. Commun. ACM, 13(7):422{426.
    • [4] C. Bohm and H.-P. Kriegel. A cost model and index architecture for the similarity join. In Proc. ICDE, 2001.
    • [5] P. Cao and Z. Wang. E cient top-k query calculation in distributed networks. In Proc. ACM PODC, 2004.
    • [6] S. Cohen and Y. Matias. Spectral Bloom lters. In Proc. ACM SIGMOD, 2003.
    • [7] J. Dittrich, et al. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB, 3(1-2):515{529, 2010.
    • [8] C. Doulkeridis, et al. Processing of rank joins in highly distributed systems. In IEEE ICDE, 2012.
    • [9] DynamoDB pricing scheme: http://aws.amazon.com/dynamodb/#pricing.
    • [10] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In Proc. ACM PODS, 2001.
    • [11] S. W. Golomb. Run-length encodings. IEEE Transactions on Information Theory, 12(3):399, 1966.
    • [12] I. Ilyas, W. Aref, and A. Elmagarmid. Joining ranked inputs in practice. In Proc. VLDB, 2002.
    • [13] I. Ilyas, W. Aref, and A. Elmagarmid. Supporting top-k join queries in relational databases. In Proc. VLDB, 2003.
    • [14] I. Ilyas, G. Beskales, and M. Soliman. A survey of top-k query processing techniques in relational database systems. ACM Computing Surveys, 40(4):1{58, 2008.
    • [15] Y. Lin, D. Agrawal, C. Chen, B. C. Ooi, and S. Wu. Llama: leveraging columnar storage for scalable join processing in the mapreduce framework. In Proc. ACM SIGMOD, 2011.
    • [16] S. Michel, P. Trianta llou, and G. Weikum. KLEE: A framework for distributed top-k query algorithms. In Proc. VLDB, 2005.
    • [17] M. Mitzenmacher. Compressed Bloom lters. IEEE/ACM Transactions on Networking, 10(5):604{612, 2002.
    • [18] J. Mullin. Estimating the size of a relational join. Information Systems, 18(3):189{196, 1993.
    • [19] A. Natsev, Y.-C. Chang, J. Smith, C.-S. Li, and J. Vitter. Supporting incremental join queries on ranked inputs. In Proc. VLDB, 2001.
    • [20] A. Okcan and M. Riedewald. Processing theta-joins using mapreduce. In Proc. ACM SIGMOD, 2011.
    • [21] C. Olston, et al. Pig Latin: A not-so-foreign language for data processing. In Proc. ACM SIGMOD, 2008.
    • [22] K. Schnaitter and N. Polyzotis. Evaluating rank joins with optimal cost. In Proc. ACM PODS, 2008.
    • [23] M. Stonebraker, et al. Mapreduce and parallel DBMSs: Friends or foes? Comm. ACM, 53(1):64{71, 2010.
    • [24] A. Thusoo, et al. Hive: a warehousing solution over a map-reduce framework. PVLDB, 2(2):1626{1629, 2009.
    • [25] M. Wu, L. Berti-Equille, A. Marian, C. Procopiuc, and D. Srivastava. Processing top-k join queries. PVLDB, 3(1-2):860{870, 2010.
    • [26] C. Xia, H. Lu, B. C. Ooi, and J. Hu. Gorder: An e cient method for kNN join processing. In Proc. VLDB, 2004.
    • [27] C. Xiao, W. Wang, X. Lin, and H. Shang. Top-k set similarity joins. In Proc. ICDE, 2009.
    • [28] D. Zeinalipour-Yazti, et al. The Threshold Join Algorithm for top-k queries in distributed sensor networks. In Proc. ACM DMSN, 2005.
    • [29] K. Zhao, S. Zhou, K.-L. Tan, and A. Zhou. Supporting ranked join in peer-to-peer networks. In Proc. DEXA, 2005.
  • No related research data.
  • No similar publications.

Share - Bookmark

Download from

Cite this article