Add patches for memory_efficient_attention and NTK scaling #743
Description

What does this PR do?

- Adding NTK scaling patch function `apply_ntk_scaling_patch()`
  - Refactoring the NTK scaling code (Extend context size without fine-tuning #705) by moving the relevant code into `patches.py`, keeping the inference code clean and simplifying usage.
  - Adding the `alpha` parameter for NTK scaling to the patch function and the inference scripts. See the usage below for explanations.
- Adding attention patch function `apply_attention_patch()`
  - Adding support for inference with memory_efficient_attention. On a single 24G GPU, the maximum context size can be scaled up to about 5K without exceeding GPU memory (model loaded in fp16).
  - Adding an option for storing the KV_cache before applying RoPE.
- Updating `inference_hf.py`, `gradio_demo.py` and `openai_api_server.py` to showcase NTK scaling and memory_efficient_attention (see the sketch under Usage below).

Usage
Parameters

- `alpha`: If `'auto'`, `alpha` is calculated with the empirical formula `alpha = (seq_len / 1024 - 1) * 1.1` during generation; otherwise `alpha` is set to the fixed float value given. A short sketch of the `'auto'` computation follows this list.
- `use_memory_efficient_attention`: Whether to use memory_efficient_attention from xformers. Default is `False`.
- `store_kv_before_rope`: Whether to store the KV_cache before applying RoPE. Default is `False`.
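The formula below is the one quoted above for `alpha='auto'`; the flooring at 1.0 and the comment on how `alpha` typically feeds into the RoPE base are assumptions about the implementation, not taken from `patches.py`:

```python
def compute_auto_alpha(seq_len: int) -> float:
    """Empirical formula from this PR for alpha='auto'."""
    alpha = (seq_len / 1024 - 1) * 1.1
    # Assumption: alpha should never shrink the original context, so clamp to 1.0.
    return max(alpha, 1.0)

# Assumption: NTK scaling is commonly implemented by enlarging the RoPE base,
# e.g. base = 10000 * alpha ** (dim / (dim - 2)); the exact formula in patches.py may differ.
print(compute_auto_alpha(4096))  # ~3.3 for an estimated 4K context
```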
Advice
- Set `use_memory_efficient_attention=True` to save GPU memory when processing long texts (a rough sketch of the underlying attention call appears after this list).
- Set `alpha` to a float value (>1) to apply NTK scaling and support long contexts. Empirically, we find `alpha = (seq_len / 1024 - 1)` may be a good choice, where `seq_len` is the estimated context size (the sum of the lengths of the input and the output).
- Set `alpha` to `'auto'` to let the model determine the value of `alpha` dynamically and adaptively.
- Set `store_kv_before_rope=True` if `alpha='auto'` and you encounter performance degradation. See the discussion here.
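For readers curious what the attention patch switches to, here is a rough sketch of a causal memory-efficient attention call with xformers. The tensor layout and mask shown are the standard xformers conventions; the actual patched LlamaAttention forward in this PR may arrange things differently:

```python
import torch
import xformers.ops as xops

batch, seq_len, n_heads, head_dim = 1, 4096, 32, 128

# xformers expects (batch, seq_len, n_heads, head_dim) tensors.
q = torch.randn(batch, seq_len, n_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Memory-efficient attention with a causal mask: the full seq_len x seq_len
# attention matrix is never materialized, which is what saves memory on long contexts.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())
```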