Flash Attention and CUDA Image Renderer (December 2023)

Flash Attention using Parallel Programming

Optimized Flash Attention with blocked matrix multiplication and OpenMP, achieving a 3x speedup by improving cache locality and reducing DRAM accesses. Also fused the attention computation with softmax, reducing memory usage by 80% and boosting speed by 5x with SIMD and parallelization.

For the CUDA image renderer, I used similar loop-fusion techniques to reduce DRAM accesses, applied loop tiling to maximize throughput, and used thread pooling to minimize synchronization overhead in the multithreaded code.
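The cache-blocking idea behind the 3x speedup claim can be sketched as a host-side loop nest: the matrices are walked in BLOCK x BLOCK tiles so each tile stays resident in cache and is reused many times instead of being refetched from DRAM. This is a minimal sketch only; the blocked_matmul name, the tile size of 64, and the square row-major layout are illustrative assumptions, not the project's exact code.

```cuda
#include <omp.h>

// Illustrative tile size; the real value would be tuned to the cache sizes.
constexpr int BLOCK = 64;

// Blocked (tiled) matmul: C += A * B, all N x N, row-major, C zero-initialized.
// Each tile of A, B, and C is reused from cache across BLOCK iterations,
// cutting DRAM traffic compared to the naive triple loop.
void blocked_matmul(const float* A, const float* B, float* C, int N) {
    #pragma omp parallel for collapse(2) schedule(static)
    for (int ii = 0; ii < N; ii += BLOCK) {
        for (int jj = 0; jj < N; jj += BLOCK) {
            // Each (ii, jj) pair owns one output tile of C, so threads never race.
            for (int kk = 0; kk < N; kk += BLOCK) {
                for (int i = ii; i < ii + BLOCK && i < N; ++i) {
                    for (int k = kk; k < kk + BLOCK && k < N; ++k) {
                        float a = A[i * N + k];
                        // Innermost loop is contiguous in B and C, so the
                        // compiler can auto-vectorize it with SIMD.
                        for (int j = jj; j < jj + BLOCK && j < N; ++j) {
                            C[i * N + j] += a * B[k * N + j];
                        }
                    }
                }
            }
        }
    }
}
```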
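The attention/softmax fusion works by keeping the softmax running max, running sum, and output accumulator in registers while streaming over the keys, so the full score matrix is never materialized in memory. Below is a single-threaded, single-query sketch of that online-softmax update, assuming row-major K and V and one query row at a time; fused_attention_row and its parameters are hypothetical names for illustration.

```cuda
#include <cmath>
#include <vector>

// Fused attention for one query row q of dimension d against n_keys keys:
// out = softmax(q K^T) V, computed in one streaming pass.
void fused_attention_row(const float* q, const float* K, const float* V,
                         float* out, int n_keys, int d) {
    float m = -INFINITY;              // running max of scores seen so far
    float l = 0.0f;                   // running sum of exp(score - m)
    std::vector<float> acc(d, 0.0f);  // running weighted sum of V rows

    for (int j = 0; j < n_keys; ++j) {
        // Score for key j: dot(q, K[j])
        float s = 0.0f;
        for (int t = 0; t < d; ++t) s += q[t] * K[j * d + t];

        // Online softmax update: rescale previous state if a new max appears.
        float m_new = fmaxf(m, s);
        float scale = expf(m - m_new);
        float p = expf(s - m_new);
        l = l * scale + p;
        for (int t = 0; t < d; ++t)
            acc[t] = acc[t] * scale + p * V[j * d + t];
        m = m_new;
    }

    // Final normalization by the softmax denominator.
    for (int t = 0; t < d; ++t) out[t] = acc[t] / l;
}
```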
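For the renderer, loop fusion at the kernel level means combining what would otherwise be separate per-pixel passes into a single kernel, so each pixel is read and written once, and 2D thread-block tiling keeps memory accesses coalesced. The sketch below uses a hypothetical fused_shade_tonemap kernel with made-up shading and tone-mapping steps; it illustrates the fusion and tiling pattern under those assumptions, not the project's actual renderer.

```cuda
#include <cuda_runtime.h>

// Fused per-pixel kernel: shading and tone mapping happen in registers in one
// pass, instead of two kernels that each round-trip the image through DRAM.
__global__ void fused_shade_tonemap(const float* in, float* out,
                                    int width, int height, float exposure) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int idx = y * width + x;
    float shaded = in[idx] * exposure;       // pass 1: shading
    out[idx] = shaded / (1.0f + shaded);     // pass 2: Reinhard-style tonemap
}

int main() {
    // 2D tiling: each 16x16 thread block covers one image tile, so neighboring
    // threads touch neighboring pixels and accesses coalesce.
    int width = 1024, height = 768;
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);

    float *d_in, *d_out;
    size_t bytes = width * height * sizeof(float);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemset(d_in, 0, bytes);  // placeholder input image

    fused_shade_tonemap<<<grid, block>>>(d_in, d_out, width, height, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```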

Services: Non-memory allocating function integration

Industries: Operating Systems

Date: August 2023 - March 2024