Your Search Results

Use this resource - and many more! - in your textbook!

AcademicPub holds over eight million pieces of educational content for you to mix-and-match your way.

Experience the freedom of customizing your course pack with AcademicPub!
Not an educator but still interested in using this content? No problem! Visit our provider's page to contact the publisher and get permission directly.

Tiling, block data layout, and memory hierarchy performance

By: Hong, B.; Park, N.; Prasanna, V.K.;

2003 / IEEE

Description

This item was taken from the IEEE Periodical ' Tiling, block data layout, and memory hierarchy performance ' Recently, several experimental studies have been conducted on block data layout in conjunction with tiling as a data transformation technique to improve cache performance. In this paper, we analyze cache and translation look-aside buffer (TLB) performance of such alternate layouts (including block data layout and Morton layout) when used in conjunction with tiling. We derive a tight lower bound on TLB performance for standard matrix access patterns, and show that block data layout and Morton layout achieve this bound. To improve cache performance, block data layout is used in concert with tiling. Based on the cache and TLB performance analysis, we propose a data block size selection algorithm that finds a tight range for optimal block size. To validate our analysis, we conducted simulations and experiments using tiled matrix multiplication, LU decomposition, and Cholesky factorization. For matrix multiplication, simulation results using UltraSparc II parameters show that tiling and block data layout with a block size given by our block size selection algorithm, reduces up to 93 percent of TLB misses compared with other techniques. The total miss cost is reduced considerably. Experiments on several platforms show that tiling with block data layout achieves up to 50 percent performance improvement over other techniques that use conventional layouts. Morton layout is also analyzed and compared with block data layout. Experimental results show that matrix multiplication using block data layout is up to 15 percent faster than that using Morton data layout.