Languages: English
Types: Doctoral thesis
Subjects: QA76
Pipelined wavefront computations are a ubiquitous class of high performance parallel algorithms used for the solution of many scientific and engineering applications. To aid the design and optimisation of these applications, and to ensure that the platforms chosen during procurement are best suited to these codes, there has been considerable research into analysing and evaluating their operational performance.

Wavefront codes exhibit complex computation, communication and synchronisation patterns, and as a result there exists a large variety of such codes and possible optimisations. The problem is compounded by each new generation of high performance computing system, which has often introduced a previously unexplored architectural trait, requiring previous performance models to be rewritten and re-evaluated.

In this thesis, we address the performance modelling and optimisation of this class of application as a whole. This differs from previous studies, in which bespoke models are applied to specific applications. The analytic performance models are generalised and reusable, and we demonstrate their application to the predictive analysis and optimisation of pipelined wavefront computations running on modern high performance computing systems.

The performance model is based on the LogGP parameterisation and uses a small number of input parameters to specify the particular behaviour of most wavefront codes. The new parameters and model equations capture the key structural and behavioural differences among different wavefront application codes, providing a succinct summary of the operations of each application and insights into alternative wavefront application designs.

The models are applied to three industry-strength wavefront codes and are validated on several systems, including a Cray XT3/XT4 and an InfiniBand commodity cluster. Model predictions show high quantitative accuracy (less than 20% error) for all high performance configurations and excellent qualitative accuracy.

The thesis presents applications, projections and insights for optimisations using the model, which show the utility of reusable analytic models for the performance engineering of high performance computing codes. In particular, we demonstrate the use of the model for: (1) evaluating application configuration and the resulting performance; (2) evaluating hardware platform issues, including platform sizing and configuration; (3) exploring hardware platform design alternatives and system procurement; and (4) considering possible code and algorithmic optimisations.
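The abstract describes a reusable, LogGP-based ("plug-and-play") model for pipelined wavefront codes. As a rough illustration of the general idea only, and not the thesis's actual model equations, the Python sketch below composes a per-step compute cost with LogGP point-to-point message costs to estimate the time of one sweep over a 2D processor array; every name and number in it (w_tile, msg_bytes, the parameter values) is an illustrative assumption.

```python
# Minimal sketch of a LogGP-style wavefront cost estimate.
# Illustrative only: parameter names and values are assumptions,
# not the thesis's validated model or measured machine parameters.

from dataclasses import dataclass

@dataclass
class LogGPParams:
    L: float   # network latency (seconds)
    o: float   # CPU send/receive overhead (seconds)
    g: float   # gap between consecutive short messages (seconds)
    G: float   # gap per byte for long messages (seconds/byte)

def msg_time(p: LogGPParams, nbytes: int) -> float:
    """End-to-end time of one point-to-point message under LogGP."""
    return p.o + p.L + (nbytes - 1) * p.G + p.o

def sweep_time(p: LogGPParams, px: int, py: int,
               n_tiles: int, w_tile: float, msg_bytes: int) -> float:
    """One pipelined sweep over a px-by-py processor array:
    (px + py - 2) steps to fill the pipeline, then one step per
    remaining tile; each step pays one tile of compute plus one
    boundary message to each of the two downstream neighbours."""
    step = w_tile + 2 * msg_time(p, msg_bytes)
    return (px + py - 2 + n_tiles) * step

# Hypothetical figures loosely in the range of a commodity cluster.
params = LogGPParams(L=5e-6, o=2e-6, g=3e-6, G=2e-10)
print(sweep_time(params, px=16, py=16, n_tiles=100,
                 w_tile=1e-4, msg_bytes=8192))
```

Varying px, py and w_tile in a sketch like this exposes the pipeline-fill versus steady-state trade-off that the validated models in the thesis quantify far more precisely.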
  • The results below were discovered through our pilot text-mining algorithms.

    • Appendix A Modelling Contention on CMPs
    • A.1 Dual Core CMP
    • A.2 Quad Core CMP
    • A.3 8 Core CMP
    • A.4 16 Core CMP
    • Appendix B Model Validations
    • B.1 Chimaera Validations
    • B.2 Sweep3D Validations
    • Appendix D Wavefront Model and Extensions
    • D.1 Model Parameters
    • D.2 Single Core Model
    • D.3 2D Model
    • D.4 Extensions for Cray XT3/XT4 CMP Nodes
    • D.5 Model Extensions for Simultaneous Multiple Wavefronts
    • D.6 Model Extensions for Heterogeneous Resources
    • D.7 Model Extensions for Irregular/Unstructured Grids
    • Appendix E Model Parameter Error Propagation
    • E.1 General Case
    • E.2 Error Model for Chimaera
    • 1.1 Operation of a Wavefront computation
    • 2.1 Speedups projected by Amdahl's law
    • 2.2 A superstep in the BSP model
    • 2.3 LogGP parameters
    • 2.4 Stages in the HPC lifecycle
    • 2.5 Performance Engineering Methodologies
    • 3.1 A 2D pipelined wavefront operation on a 1D processor array
    • 3.2 Hyperplanes on a 3D grid of data
    • 3.3 3D data grid mapping on to a 2D processor array
    • 3.4 Pipelined wavefronts on the 2D processor array
    • 3.5 Fine-grained messaging and agglomerated messaging
    • 3.6 LU pipelined wavefront operation on the 2D processor array
    • 3.7 Sweep3D and Chimaera pipelined wavefront operation on the 2D processor array
    • 4.1 Pipelined wavefront operation on a 2D processor array
    • 4.2 Measured and modelled Cray XT4 off-node MPI end-to-end communication times
    • 4.3 Measured and modelled Cray XT3 off-node MPI end-to-end communication times
    • 4.4 Measured and modelled Cray XT4 on-chip MPI end-to-end communication times
    • 4.5 MPI allreduce operation on dual-core nodes
    • 4.6 MPI allreduce operation on quad-core nodes
    • 4.7 Wavefront operation on a 2D data grid
    • 4.8 Wavefront application mapped to multi-core nodes
    • 4.9 Wavefront operation and collisions on dual core nodes
    • 4.10 Wavefront operation and collisions on quad core nodes
    • 6.1 Overview of the PACE simulator and toolset
    • 6.2 Layers in a PACE model
    • 6.3 Layered objects for the PACE Sweep3D model
    • 7.1 Optimisation by shifting computation costs to pre-computation - strong scaling (Speculative Chimaera type application, 240x240x240 Cells, 1 time step, 16 energy groups, 419 iterations, Htile = 1)
    • 7.2 Optimisation by shifting 100% of computation costs to pre-computation - weak scaling (Speculative Chimaera type application, 8x8x1000 Cells/PE, 1 time step, 16 energy groups, 419 iterations, Htile = 1)
    • 7.3 Chimaera Model Validation on an Intel Xeon-InfiniBand cluster (240³ total problem size, Htile = 1)
    • 7.4 Multiple simultaneous sweeps on separate cores
    • 7.5 Simultaneous sweeps on separate cores (Speculative Chimaera type application, 240x240x240 Cells, 1 time step, 16 energy groups, 419 iterations, Htile = …)
    • 7.6 Simultaneous multiple wavefronts overlapping steps
    • 7.7 Simultaneous sweeps on all cores (Speculative Chimaera type application, 240x240x240 Cells, 1 time step, 16 energy groups, 419 iterations, Htile = 1)
    • 7.8 Parallel efficiency of larger problem sizes (Chimaera, 1 time step, 16 energy groups, 419 iterations)
    • 7.9 Change in runtime due to improved computation performance (Chimaera 240x240x240, 1 time step, 16 energy groups, 419 iterations)
    • 7.10 Change in runtime due to reduced network latency (Chimaera 240x240x240, 1 time step, 16 energy groups, 419 iterations)
    • 7.11 Change in runtime due to increased network bandwidth (Chimaera 240x240x240, 1 time step, 16 energy groups, 419 iterations)
    • 4.1 Plug-and-Play Reusable Model Application Parameters
    • 4.2 Plug-and-play LogGP Model: One Core Per Node, on 3D Data Grids
    • 4.3 The ORNL Jaguar: System Details
    • 4.4 XT4 Communication Parameters
    • 4.5 LogGP Model of XT4 MPI Communication
    • 4.6 Validations for the LogGP MPI allreduce model on a Cray XT4
    • 4.7 Plug-and-play LogGP Model for Wavefront Codes on 2D Data Grids
    • 4.8 Re-usable Model Extensions for CMP Nodes
    • 4.9 LU Model Validation on Jaguar (Cray XT3) - 64³ cells per processor
    • 4.10 LU Model Validation on Jaguar (Cray XT3) - 102³ cells per processor
    • 4.11 Sweep3D Model Validation on Jaguar (Cray XT4) - 1000³ total problem size, Htile = 2, mmi = 6
    • 4.12 Sweep3D Model Validation on Jaguar (Cray XT4) - 20×10⁶ total problem size, Htile = 2, mmi = 6
    • 4.13 Chimaera Model Validation on Jaguar (Cray XT4) - 240³ total problem size
    • 6.1 Model Validation Systems
    • 6.2 Sweep3D simulation model validations on an Intel Pentium-3 2-way SMP cluster with a Myrinet 2000 interconnect
    • 6.3 Sweep3D simulation model validations on an AMD Opteron 2-way SMP cluster interconnected by Gigabit Ethernet
    • 6.4 Sweep3D simulation model validations on an SGI Altix Intel Itanium-2 56-way SMP
    • 6.5 Intel InfiniBand (CSC-Francesca) Cluster - Key Specifications
    • 6.6 Chimaera Model Validation on an Intel Xeon-InfiniBand cluster - 120³ total problem size
    • 6.7 Chimaera Model Validation on an Intel Xeon-InfiniBand cluster - 240³ total problem size
    • 6.8 InfiniBand network model parameters
    • B.1 Chimaera Model Validation on Jaguar (Cray XT4) - 60³ problem size, Htile = 1
    • B.2 Chimaera Model Validation on Jaguar (Cray XT4) - 120³ problem size, Htile = 1
    • B.3 Chimaera Model Validation on Jaguar (Cray XT4) - 240³ problem size, Htile = 1
    • B.4 Chimaera Model Validation on an Intel Xeon-InfiniBand cluster - 120³ problem size
    • B.5 Chimaera Model Validation on an Intel Xeon-InfiniBand cluster - 240³ problem size
    • B.6 Sweep3D Model Validation on Jaguar (Cray XT4) - 1000³ total problem size, Htile = 2, mmi = 6
    • B.7 Sweep3D Model Validation on Jaguar (Cray XT4) - 20×10⁶ total problem size, Htile = 2, mmi = 6
    • B.8 Sweep3D Model Validation on Jaguar (Cray XT4) - 5×5×400 per-processor problem size, Htile = 5, mmi = 6
    • B.9 Sweep3D Model Validation on Jaguar (Cray XT4) - 14×14×255 per-processor problem size, Htile = 2.5, mmi = 6


Funded by projects

  • NSF | NeTS-NR: Next Generation Pr...
