Add patches for memory_efficient_attention and NTK scaling #743
Description

What does this PR do?

- Adding NTK scaling patch function `apply_ntk_scaling_patch()`
  - Refactoring the NTK scaling code (Extend context size without fine-tuning #705) by moving the relevant code into `patches.py`, keeping the inference code clean and simplifying usage.
  - Adding the `alpha` parameter for NTK scaling to the patch function and the inference scripts. See the usage below for explanations.
- Adding attention patch function `apply_attention_patch()`
  - Adding support for inference with memory_efficient_attention. On a single 24G GPU, the maximum context size can be scaled up to about 5K without exceeding GPU memory (model loaded in fp16).
  - Adding an option for storing the KV_cache before applying RoPE.
- Updating `inference_hf.py`, `gradio_demo.py` and `openai_api_server.py` to showcase NTK scaling and memory_efficient_attention (see the sketch under Usage below).

Usage
Parameters

- `alpha`: If `'auto'`, `alpha` is calculated with the empirical formula `alpha = (seq_len / 1024 - 1) * 1.1` during generation; otherwise `alpha` is set to the fixed float value given. A short sketch of the `'auto'` computation follows this list.
- `use_memory_efficient_attention`: Whether to use memory_efficient_attention from xformers. Default is `False`.
- `store_kv_before_rope`: Whether to store the KV_cache before applying RoPE. Default is `False`.
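The formula below is the one quoted above for `alpha='auto'`; the flooring at 1.0 and the comment on how `alpha` typically feeds into the RoPE base are assumptions about the implementation, not taken from `patches.py`:

```python
def compute_auto_alpha(seq_len: int) -> float:
    """Empirical formula from this PR for alpha='auto'."""
    alpha = (seq_len / 1024 - 1) * 1.1
    # Assumption: alpha should never shrink the original context, so clamp to 1.0.
    return max(alpha, 1.0)

# Assumption: NTK scaling is commonly implemented by enlarging the RoPE base,
# e.g. base = 10000 * alpha ** (dim / (dim - 2)); the exact formula in patches.py may differ.
print(compute_auto_alpha(4096))  # ~3.3 for an estimated 4K context
```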
Advice
- Set `use_memory_efficient_attention=True` to save GPU memory when processing long texts (a rough sketch of the underlying attention call appears after this list).
- Set `alpha` to a float value (>1) to apply NTK scaling and support long contexts. Empirically, we find `alpha = (seq_len / 1024 - 1)` may be a good choice, where `seq_len` is the estimated context size (the sum of the lengths of the input and the output).
- Set `alpha` to `'auto'` to let the model determine the value of `alpha` dynamically and adaptively.
- Set `store_kv_before_rope=True` if `alpha='auto'` and you encounter performance degradation. See the discussion here.
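For readers curious what the attention patch switches to, here is a rough sketch of a causal memory-efficient attention call with xformers. The tensor layout and mask shown are the standard xformers conventions; the actual patched LlamaAttention forward in this PR may arrange things differently:

```python
import torch
import xformers.ops as xops

batch, seq_len, n_heads, head_dim = 1, 4096, 32, 128

# xformers expects (batch, seq_len, n_heads, head_dim) tensors.
q = torch.randn(batch, seq_len, n_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Memory-efficient attention with a causal mask: the full seq_len x seq_len
# attention matrix is never materialized, which is what saves memory on long contexts.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())
```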