Flash Attention and CUDA Image Renderer

Flash Attention using Parallel Programming

Class project for CS194: I built an image renderer using data parallelism, tiling, and careful shared-memory management to hide memory latency and maximize the number of particles rendered on screen. This ran on NVIDIA's T4 GPU. For the final course project we implemented flash attention for DNN Transformers! A few thoughts on the course and these two projects:

1. Very cool to learn about the roofline model: you can look up the specs for any GPU and see its impressive peak throughput, only to realize that the peak rarely matters in practice, because memory movement, not compute, is usually the bottleneck on the 'true' throughput of a computation.

2. My favorite part of this course was learning about the balance between arithmetic intensity and memory latency. The roofline model was fascinating, and being at the center of high-performance computing in this course was an amazing experience.

The T4 GPUs ran on AWS clusters: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-t4/t4-tensor-core-datasheet-951643.pdf
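The roofline idea above can be made concrete with a few lines of arithmetic. This is an illustrative sketch, not course code; the T4 figures (roughly 8.1 TFLOPS FP32 peak and 320 GB/s GDDR6 bandwidth) are taken from NVIDIA's datasheet and rounded, and SAXPY is just a stand-in for a low-intensity kernel:

```python
def attainable_gflops(intensity_flops_per_byte, peak_gflops, bandwidth_gb_s):
    # Roofline model: attainable performance is capped by whichever is
    # lower, the compute roof or the memory roof (intensity * bandwidth).
    return min(peak_gflops, intensity_flops_per_byte * bandwidth_gb_s)

# Approximate T4 figures from NVIDIA's datasheet (assumed, rounded).
PEAK_GFLOPS = 8100.0   # ~8.1 TFLOPS FP32
BW_GB_S = 320.0        # ~320 GB/s GDDR6

# SAXPY (y = a*x + y): 2 FLOPs per element, 12 bytes moved
# (read x, read y, write y), so intensity is only ~0.17 FLOP/byte.
saxpy = attainable_gflops(2 / 12, PEAK_GFLOPS, BW_GB_S)
print(saxpy)  # far below the 8100 GFLOP/s compute peak: memory bound
```

The ridge point here is roughly 25 FLOP/byte, so any kernel below that intensity is limited by bandwidth no matter how fast the ALUs are, which is exactly why tiling and shared-memory reuse mattered in the renderer.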
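The core trick in flash attention, computing softmax over tiles of K/V with a running max and running denominator so the full score matrix is never materialized, can be sketched in plain Python. This is an illustrative sketch of the online-softmax idea only (the real kernel streams tiles through GPU shared memory); the function names and the tile size are made up for the example:

```python
import math

def naive_attention(Q, K, V):
    # Reference: full softmax(Q K^T) V, computed one query row at a time.
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        denom = sum(exps)
        out.append([sum(e * v[j] for e, v in zip(exps, V)) / denom
                    for j in range(len(V[0]))])
    return out

def flash_attention(Q, K, V, tile=2):
    # Streaming ("online") softmax: process K/V in tiles, keeping a
    # running max, running denominator, and running weighted sum of V,
    # so the full row of scores never exists at once.
    out = []
    for q in Q:
        m = float("-inf")        # running max of scores seen so far
        denom = 0.0              # running softmax denominator
        acc = [0.0] * len(V[0])  # running weighted sum of V rows
        for start in range(0, len(K), tile):
            Kt, Vt = K[start:start + tile], V[start:start + tile]
            scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in Kt]
            m_new = max(m, max(scores))
            scale = math.exp(m - m_new)  # rescale earlier partial sums
            denom *= scale
            acc = [a * scale for a in acc]
            for s, v in zip(scores, Vt):
                w = math.exp(s - m_new)
                denom += w
                acc = [a + w * vj for a, vj in zip(acc, v)]
            m = m_new
        out.append([a / denom for a in acc])
    return out
```

Both functions produce the same output; the tiled version simply trades one big pass over memory for several small ones whose working set can live in fast on-chip storage.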

Operating Systems

December 2023