Flash Attention and CUDA Image Renderer (December 2023)
Flash Attention using Parallel Programming
Optimized Flash Attention with blocked matrix multiplication and OpenMP, achieving a 3x speedup by improving cache locality and reducing DRAM accesses. I also fused the attention computation with softmax, cutting memory usage by 80% and delivering a further 5x speedup through SIMD and parallelization; a sketch of the blocked, softmax-fused loop appears below.

For the CUDA Image Renderer, I applied similar ideas: loop fusion to reduce DRAM accesses, loop tiling to maximize throughput, and thread pooling to minimize synchronization overhead across threads. A tiled rendering kernel is sketched below as well.
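As an illustration of the blocking and softmax fusion described above, here is a minimal sketch of a fused attention pass in C++ with OpenMP. The function name fused_attention, the block size Bc, and the toy sizes in main are illustrative assumptions rather than the project's actual code; the sketch only shows the general technique of streaming key/value blocks through an online softmax so the full n x n score matrix never touches DRAM.

// fused_attention.cpp - compile with: g++ -O3 -march=native -fopenmp fused_attention.cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Fused attention for one head: O = softmax(Q K^T / sqrt(d)) V
// Q, K, V, O are row-major [n x d]. Keys/values are visited in blocks of Bc
// rows, and the softmax is kept "online" (running max m, running sum l),
// so the full score matrix is never materialized in memory.
void fused_attention(const float* Q, const float* K, const float* V,
                     float* O, int n, int d, int Bc) {
    const float scale = 1.0f / std::sqrt((float)d);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {              // one query row per iteration
        std::vector<float> acc(d, 0.0f);       // running weighted sum of V rows
        float m = -INFINITY;                   // running max of scores
        float l = 0.0f;                        // running softmax denominator
        for (int j0 = 0; j0 < n; j0 += Bc) {   // key/value block
            int j1 = std::min(j0 + Bc, n);
            for (int j = j0; j < j1; ++j) {
                float s = 0.0f;
                #pragma omp simd reduction(+:s)
                for (int k = 0; k < d; ++k)
                    s += Q[i * d + k] * K[j * d + k];
                s *= scale;
                float m_new = std::max(m, s);
                float corr = std::exp(m - m_new);   // rescale old accumulators
                float p = std::exp(s - m_new);
                l = l * corr + p;
                #pragma omp simd
                for (int k = 0; k < d; ++k)
                    acc[k] = acc[k] * corr + p * V[j * d + k];
                m = m_new;
            }
        }
        for (int k = 0; k < d; ++k)
            O[i * d + k] = acc[k] / l;
    }
}

int main() {
    int n = 512, d = 64;
    std::vector<float> Q(n * d, 0.01f), K(n * d, 0.02f), V(n * d, 1.0f), O(n * d);
    fused_attention(Q.data(), K.data(), V.data(), O.data(), n, d, 64);
    // with identical rows the softmax is uniform, so every output equals 1.0
    std::printf("O[0][0] = %f\n", O[0]);
    return 0;
}

In the same spirit, here is a minimal CUDA sketch of the tiled rendering loop, assuming a simple circle-over-pixel renderer; the Circle struct, the renderTiled kernel, and the additive shading are hypothetical stand-ins for the actual renderer. Each 16x16 pixel tile streams batches of circles through shared memory, fusing overlap testing and shading into a single pass so no per-circle intermediate buffers ever hit DRAM.

// render_tiled.cu - compile with: nvcc -O3 render_tiled.cu
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical circle list: the fields are illustrative, not the renderer's real layout.
struct Circle { float x, y, r, red, green, blue; };

#define TILE 16                 // 16x16 pixel tile per thread block
#define CIRCLES_PER_BATCH 256   // circles staged in shared memory at a time

// Each thread owns one pixel. The block streams circles through shared memory
// in batches (loop tiling over the circle list), testing overlap and shading
// in one fused pass.
__global__ void renderTiled(const Circle* circles, int numCircles,
                            float* image, int width, int height) {
    __shared__ Circle batch[CIRCLES_PER_BATCH];

    int px = blockIdx.x * TILE + threadIdx.x;
    int py = blockIdx.y * TILE + threadIdx.y;
    float r = 0.f, g = 0.f, b = 0.f;

    for (int base = 0; base < numCircles; base += CIRCLES_PER_BATCH) {
        int count = min(CIRCLES_PER_BATCH, numCircles - base);
        // cooperative load of this batch of circles into shared memory
        for (int i = threadIdx.y * TILE + threadIdx.x; i < count; i += TILE * TILE)
            batch[i] = circles[base + i];
        __syncthreads();

        if (px < width && py < height) {
            for (int i = 0; i < count; ++i) {
                float dx = px - batch[i].x, dy = py - batch[i].y;
                if (dx * dx + dy * dy <= batch[i].r * batch[i].r) {
                    // simple additive shading; a real renderer blends in circle order
                    r += batch[i].red; g += batch[i].green; b += batch[i].blue;
                }
            }
        }
        __syncthreads();
    }

    if (px < width && py < height) {
        int idx = 3 * (py * width + px);
        image[idx] = r; image[idx + 1] = g; image[idx + 2] = b;
    }
}

int main() {
    int width = 64, height = 64, numCircles = 3;
    Circle* circles; float* image;
    cudaMallocManaged(&circles, numCircles * sizeof(Circle));
    cudaMallocManaged(&image, 3 * width * height * sizeof(float));
    circles[0] = {32.f, 32.f, 10.f, 1.f, 0.f, 0.f};
    circles[1] = {10.f, 10.f, 5.f,  0.f, 1.f, 0.f};
    circles[2] = {50.f, 20.f, 8.f,  0.f, 0.f, 1.f};
    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    renderTiled<<<grid, block>>>(circles, numCircles, image, width, height);
    cudaDeviceSynchronize();
    std::printf("pixel(32,32) red = %f\n", image[3 * (32 * width + 32)]);
    cudaFree(circles); cudaFree(image);
    return 0;
}

In both sketches the same idea drives the speedup: keep the working set (per-query accumulators, per-tile circle batch) in registers, cache, or shared memory, so the hot loop is compute-bound rather than DRAM-bound.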
Services
Non-memory-allocating function integration
Industries
Operating Systems
Date
August 2023 - March 2024