- Mumbai (UTC +05:30)
- in/shlok-l-50180120b
- @shlok_fx
- https://leetcode.com/u/Shlok_Fx/
Pinned
-
Mini-Attention (Public)
FP16 Flash Attention 2 from scratch in CUDA C++, achieving 96% of cuDNN performance on SM120 (RTX 5090)
CUDA · 5 stars
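A minimal sketch of the online-softmax recurrence that a from-scratch Flash Attention kernel is organized around, using FP32 and one thread per query row for readability. This is not the Mini-Attention code: a real FA2-style kernel stages K/V tiles through shared memory, splits work across warps and the head dimension, and issues FP16 tensor-core MMAs; all sizes and names below are illustrative assumptions.

```cuda
// Single-pass attention via online softmax: each thread owns one query row and
// streams over all keys, keeping a running max (m), a running softmax
// denominator (l), and an output accumulator (acc) that is rescaled whenever
// the running max grows. This avoids materializing the N x N score matrix.
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

__global__ void attention_online_softmax(const float* Q, const float* K,
                                         const float* V, float* O,
                                         int N, int d) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N) return;

    const float scale = rsqrtf((float)d);
    float m = -INFINITY;  // running max of the scores seen so far
    float l = 0.0f;       // running softmax denominator
    float acc[128];       // output accumulator; head dim assumed <= 128 here
    for (int j = 0; j < d; ++j) acc[j] = 0.0f;

    for (int k = 0; k < N; ++k) {
        float s = 0.0f;   // score = (q . k) / sqrt(d)
        for (int j = 0; j < d; ++j) s += Q[row * d + j] * K[k * d + j];
        s *= scale;

        // Online-softmax update: fold the new score in, rescaling old state
        // by exp(m_old - m_new) so everything stays numerically stable.
        float m_new = fmaxf(m, s);
        float corr  = expf(m - m_new);
        float p     = expf(s - m_new);
        l = l * corr + p;
        for (int j = 0; j < d; ++j)
            acc[j] = acc[j] * corr + p * V[k * d + j];
        m = m_new;
    }
    for (int j = 0; j < d; ++j) O[row * d + j] = acc[j] / l;
}

int main() {
    const int N = 64, d = 64;
    size_t bytes = (size_t)N * d * sizeof(float);
    float *Q, *K, *V, *O;
    cudaMallocManaged(&Q, bytes); cudaMallocManaged(&K, bytes);
    cudaMallocManaged(&V, bytes); cudaMallocManaged(&O, bytes);
    for (int i = 0; i < N * d; ++i) {
        Q[i] = 0.01f * (i % 97); K[i] = 0.01f * (i % 89); V[i] = 0.01f * (i % 83);
    }
    attention_online_softmax<<<(N + 127) / 128, 128>>>(Q, K, V, O, N, d);
    cudaDeviceSynchronize();
    printf("O[0][0] = %f\n", O[0]);  // smoke test
    cudaFree(Q); cudaFree(K); cudaFree(V); cudaFree(O);
    return 0;
}
```

Tiling this same recurrence over K/V blocks rather than single keys is what turns it into Flash Attention proper.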
-
100-days-cuda (Public)
This repository documents my 100-day journey of learning and writing CUDA kernels.
-
SageAttention (Public, forked from thu-ml/SageAttention)
[ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized attention achieving a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
CUDA
-
flex-block-attn (Public, forked from Tencent-Hunyuan/flex-block-attn)
An efficient block-sparse attention computation library.
Jupyter Notebook
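For the block-sparse idea, a rough sketch: the same online-softmax loop as above, but driven by a per-(query block, key block) mask so entire key blocks are skipped outright. The mask layout, block size, and kernel below are assumptions for illustration, not the flex-block-attn API.

```cuda
// Block-sparse attention sketch: identical online-softmax recurrence, but a
// boolean block mask lets the kernel skip whole key blocks, so masked blocks
// cost no score computation or memory traffic at all.
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

__global__ void block_sparse_attention(const float* Q, const float* K,
                                       const float* V, const char* block_mask,
                                       float* O, int N, int d, int B) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N) return;

    int qb  = row / B;             // query-block index of this row
    int nkb = (N + B - 1) / B;     // number of key blocks
    const float scale = rsqrtf((float)d);
    float m = -INFINITY, l = 0.0f;
    float acc[128];                // head dim assumed <= 128 here
    for (int j = 0; j < d; ++j) acc[j] = 0.0f;

    for (int kb = 0; kb < nkb; ++kb) {
        if (!block_mask[qb * nkb + kb]) continue;  // the sparsity win: skip the block
        int k0 = kb * B, k1 = min(k0 + B, N);
        for (int k = k0; k < k1; ++k) {
            float s = 0.0f;
            for (int j = 0; j < d; ++j) s += Q[row * d + j] * K[k * d + j];
            s *= scale;
            float m_new = fmaxf(m, s);
            float corr = expf(m - m_new), p = expf(s - m_new);
            l = l * corr + p;
            for (int j = 0; j < d; ++j) acc[j] = acc[j] * corr + p * V[k * d + j];
            m = m_new;
        }
    }
    // Guard against rows whose key blocks are all masked out.
    for (int j = 0; j < d; ++j) O[row * d + j] = (l > 0.0f) ? acc[j] / l : 0.0f;
}

int main() {
    const int N = 64, d = 64, B = 16, nkb = N / B;
    size_t bytes = (size_t)N * d * sizeof(float);
    float *Q, *K, *V, *O; char *mask;
    cudaMallocManaged(&Q, bytes); cudaMallocManaged(&K, bytes);
    cudaMallocManaged(&V, bytes); cudaMallocManaged(&O, bytes);
    cudaMallocManaged(&mask, (size_t)nkb * nkb);
    for (int i = 0; i < N * d; ++i) {
        Q[i] = 0.01f * (i % 97); K[i] = 0.01f * (i % 89); V[i] = 0.01f * (i % 83);
    }
    // Example sliding-window pattern: attend to your own and the previous key block.
    for (int qb = 0; qb < nkb; ++qb)
        for (int kb = 0; kb < nkb; ++kb)
            mask[qb * nkb + kb] = (kb == qb || kb + 1 == qb);
    block_sparse_attention<<<(N + 127) / 128, 128>>>(Q, K, V, mask, O, N, d, B);
    cudaDeviceSynchronize();
    printf("O[0][0] = %f\n", O[0]);
    return 0;
}
```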
-
sglang (Public, forked from sgl-project/sglang)
SGLang is a high-performance serving framework for large language models and multimodal models.
Python
-
ThunderKittens (Public, forked from HazyResearch/ThunderKittens)
Tile primitives for speedy kernels.
CUDA