Hi there, I was reading and saw:
Enabling flash attention reduces memory usage by at least 400 MB. At the moment, it is not supported when CUBLAS is enabled because the kernel implementation is missing.
But I'm curious: would it make sense to set -DSD_FLASH_ATTN=ON for the macOS, Linux, and other non-CUBLAS builds?
- build: "noavx"
  defines: "-DGGML_AVX=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF -DSD_BUILD_SHARED_LIBS=ON"
- build: "avx2"
  defines: "-DGGML_AVX2=ON -DSD_BUILD_SHARED_LIBS=ON"
- build: "avx"
  defines: "-DGGML_AVX2=OFF -DSD_BUILD_SHARED_LIBS=ON"
- build: "avx512"
  defines: "-DGGML_AVX512=ON -DSD_BUILD_SHARED_LIBS=ON"
- build: "cuda12"
Thanks!