Castro Fernandez, Raul; Migliavacca, Matteo; Kalyvianaki, Evangelia; Pietzuch, Peter (2014)
Languages: English
Types: Unknown
Subjects: QA75, QA76
Data scientists often implement machine learning algorithms in imperative languages such as Java, Matlab and R. Yet such implementations fail to achieve the performance and scalability of specialised data-parallel processing frameworks. Our goal is to execute imperative Java programs in a data-parallel fashion with high throughput and low latency. This raises two challenges: how to support the arbitrary mutable state of Java programs without compromising scalability, and how to recover that state after failure with low overhead.

Our idea is to infer the dataflow and the types of state accesses from a Java program and use this information to generate a stateful dataflow graph (SDG). By explicitly separating data from mutable state, SDGs have specific features to enable this translation: to ensure scalability, distributed state can be partitioned across nodes if computation can occur entirely in parallel; if this is not possible, partial state gives nodes local instances for independent computation, which are reconciled according to application semantics. For fault tolerance, large in-memory state is checkpointed asynchronously without global coordination. We show that the performance of SDGs for several imperative online applications matches that of existing data-parallel processing frameworks.
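The abstract's distinction between partitioned state (each node owns a disjoint shard, so updates proceed fully in parallel) and partial state (each node keeps a full local instance, later reconciled by an application-level merge) can be sketched in plain Java. This is a minimal illustration outside any SDG runtime; the class and method names are invented for this example and do not reflect the paper's implementation.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

public class StateSketch {
    // Partitioned state: each key is owned by exactly one shard (chosen by
    // hash), so shards can be updated concurrently without coordination.
    static Map<String, Integer> partitionedCount(List<String> words, int partitions)
            throws Exception {
        List<Map<String, Integer>> shards = new ArrayList<>();
        for (int i = 0; i < partitions; i++) shards.add(new HashMap<>());
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        List<Future<?>> tasks = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            final int id = p;
            tasks.add(pool.submit(() -> {
                for (String w : words)
                    if (Math.floorMod(w.hashCode(), partitions) == id)
                        shards.get(id).merge(w, 1, Integer::sum);
            }));
        }
        for (Future<?> t : tasks) t.get();
        pool.shutdown();
        Map<String, Integer> out = new HashMap<>();
        shards.forEach(out::putAll);   // shards hold disjoint key sets
        return out;
    }

    // Partial state: every worker builds a complete local instance from its
    // input split; instances are reconciled afterwards by an application-
    // defined merge (here: per-key summation).
    static Map<String, Integer> partialCount(List<List<String>> splits) {
        List<Map<String, Integer>> locals = splits.parallelStream()
            .map(split -> {
                Map<String, Integer> local = new HashMap<>();
                split.forEach(w -> local.merge(w, 1, Integer::sum));
                return local;
            }).collect(Collectors.toList());
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> local : locals)
            local.forEach((k, v) -> merged.merge(k, v, Integer::sum));
        return merged;
    }

    public static void main(String[] args) throws Exception {
        List<String> words = Arrays.asList("a", "b", "a", "c", "b", "a");
        System.out.println(partitionedCount(words, 2));
        System.out.println(partialCount(
            Arrays.asList(words.subList(0, 3), words.subList(3, 6))));
    }
}
```

Both paths produce the same counts; the difference is that the partitioned version never needs a merge step, while the partial version trades a reconciliation pass for the freedom to compute on any split independently.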
