Smooth Sampling / Quadratic Sampling support #6445


Open · wants to merge 4 commits into master

Conversation

kalomaze (Contributor) commented Apr 2, 2024

Smooth Sampling / Quadratic Sampling

This sampling method differs from the truncation samplers (Top K, Top P, Min P) and from traditional Temperature sampling, since it changes the raw scores in a non-linear fashion.

[image: graph of the quadratic logit transformation]

With this approach, we tweak the original logit scores using a quadratic transformation, based on each score's distance from the topmost logit.

You can view this as an alternative to Temperature that scales differently (though they are not mutually exclusive); the idea is that we can make the good choices more evenly probable while still punishing extreme outlier tokens.

Graph demonstration with voiceover (excuse how I sound here): https://files.catbox.moe/x804ia.mp4

This has been implemented in koboldcpp and text-generation-webui for a while now, and it has received some praise there.
Considering that, I wanted to backport it to llama.cpp for use with the server.

How is this meaningfully different from Temperature?

The interesting element is that even higher, more deterministic values avoid biasing towards the topmost token when there is a group of similarly probable tokens at the top, which meaningfully differs from the linear skew of a lower Temperature.
So instead of a top-two split that was originally 51/49 becoming skewed towards 60/40 with a temperature of, say, 0.3, it would look more like a dead-even 50/50 split with a high "smoothing factor" value, while still nullifying the distribution's low-probability outliers.
Likewise, low values make the top probabilities less deterministic and are a decent fit for creative writing, since low-probability outliers are still reduced "smoothly", without any hard cutoffs.

How do I scale it?

"0" turns off the sampler entirely. 0.01 would be extremely close to a flat distribution (so extremely unhinged like higher Temperature would be).
You can, in theory, scale to arbritarily large values from here, where it will become gradually more and more deterministic.
Consider 10.0 as a reasonable "max", but that's just an arbitrary limit.
"Smoothing Factor" values of 0.2-0.3 are generally good for creative writing.

Preview

Here is a useful webpage that allows you to visualize how the distribution changes in response to this sampler:
https://artefact2.github.io/llm-sampling/index.xhtml
Here is the text-generation-webui PR where I first implemented this: oobabooga/text-generation-webui#5403

kalomaze (Contributor, Author) commented Apr 2, 2024

I would also like to point out that Dynamic Temperature is unrelated to this, and it is not required in order to use the quadratic transformation; they are just combined in one "entropy" function for the sake of not creating an additional function.

github-actions bot commented Apr 2, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3: 498 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=9446.63ms p(90)=25919.5ms fails=0, finish reason: stop=498 truncated=0
  • Prompt processing (pp): avg=243.43tk/s p(90)=735.7tk/s total=197.87tk/s
  • Token generation (tg): avg=97.31tk/s p(90)=265.26tk/s total=130.11tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=smooth-pr commit=91f3db8aab54302c767da26c9c11d7dcd16a0347
Time series charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing.

kalomaze (Contributor, Author) commented Apr 2, 2024

Backported the "curve", which is the second hyperparameter (smoothing_curve).
This adds a cubic transformation on top of the quadratic one, and can help make lower smoothing_factor values work if the curve is set higher.
1.0 = equivalent to no change from the quadratic transformation described earlier.

[image: graph of the smoothing curve's effect on the transformation]

There might be a more intuitive or natural way to implement this, but people seemed to appreciate this addition to the original quad sampling, so I'm adding it as is.
oobabooga/text-generation-webui#5551
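The cubic extension can be sketched the same way. The blend below weights the cubic term by smoothing_curve - 1 so that a curve of 1.0 reduces exactly to the quadratic case; this is an illustrative parameterization only (the PR's actual coefficients may differ), and `smooth_sample_probs_curve` is a hypothetical name:

```python
import math

def smooth_sample_probs_curve(logits, smoothing_factor, smoothing_curve=1.0):
    """Sketch of quadratic sampling plus a cubic "smoothing_curve" term.
    Illustrative parameterization: the only property taken from the PR
    description is that smoothing_curve == 1.0 is the pure quadratic case."""
    if smoothing_factor > 0:
        top = max(logits)
        new = []
        for l in logits:
            d = top - l  # distance from the topmost logit (>= 0)
            penalty = smoothing_factor * (d ** 2 + (smoothing_curve - 1.0) * d ** 3)
            new.append(top - penalty)
        logits = new
    # standard softmax
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

With a higher curve, the penalty grows faster with distance from the top logit, so outliers are suppressed more aggressively even at low smoothing_factor values, which matches the behaviour described above.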

kalomaze changed the title from "Smooth Sampling support" to "Smooth Sampling / Quadratic Sampling support" on Apr 2, 2024
MaggotHATE (Contributor) commented:

"temp": 1.5,
"dynatemp_range": 1.4,
"smoothing_factor": 0.2,
"smoothing_curve": 1.5,
"samplers_sequence": "kt",

And it just works, no major problems with repetition (even though penalties are switched off).

One question: dynatemp (and smoothing now) are not currently used with Mirostat. Would it be beneficial to use them together, theoretically speaking? I tried it once, but Mirostat was difficult to work with originally, and is difficult to compare now.

Artefact2 (Collaborator) commented:

I know it's late to change it, but I really don't like how there's a huge discontinuity in behaviour between 0.0 and small positive values. None of the other samplers are like this, and I expect it will confuse a lot of users who are mentally accustomed to "it's close to the off value, so that means a small effect, right?".

simsvml commented Apr 3, 2024

It would be nice to have a separate llama_sample_smoothing so it can be used on its own, independent of entropy/dynatemp. It looks like you can get the same effect here by calling llama_sample_entropy(ctx, candidates, 1, 1, 1, smoothing_factor, smoothing_curve) so that the entropy part has no effect, but that seems a bit silly, and it also does a bunch of unneeded work (the entropy part doesn't have a fast path for the case where its arguments make it a no-op).

kalomaze (Contributor, Author) commented Apr 4, 2024

> I know it's late to change it, but I really don't like how there's a huge discontinuity in behaviour between 0.0 and small positive values. None of the other samplers are like this, and I expect it will cause confusion to a lot of users that are mentally used to "it's close to the off value, so it means small effects right?".

I can't think of a mathematically identical way to scale the logits that wouldn't require bounding to some arbitrary value / taking away control from the user, unfortunately.

> It would be nice to have a separate llama_sample_smoothing so it can be used on its own, independent of entropy/dynatemp. It looks like you can get the same effect here by calling llama_sample_entropy(ctx, candidates, 1, 1, 1, smoothing_factor, smoothing_curve) so the entropy part has no effect, but that seems a bit silly, and also does a bunch of unneeded work (the entropy part doesn't have a fast path for the case where its arguments make it into a no-op)

I imagine @ggerganov would probably prefer if I did this before it can be merged, since the entropy DynaTemp is distinct from this technique. Though, it would probably require a new case switch / new position in the sampler order if we want temperature + this to be "stackable". How does that sound?

ggerganov (Member) commented:

In my opinion, we already have more than enough samplers implemented. The interface allows user code to implement custom sampling techniques if more is needed.

Still, if we want to extend further, we can do it. But we should write some unit tests also. Maybe add some tests for this functionality and merge it for now?

jukofyork (Collaborator) commented Apr 9, 2024

Have you tried using x*Exp[k*x]?

It's intrinsically linked to the softmax (aka multinomial logistic regression) function and is the only scaling function which is invariant to translation in the same way the logits are (meaning you don't need to subtract from the maximum logit, as the ratio of scale factors is constant if the gap between logits remains the same).

If you reparameterise the temperature as Exp[k*1] = Exp[k], then look at Exp[k*x] and Exp[k*x^2] and so on, then the relationship to polynomials should be clearer:

  • Standard temperature scaling is akin to the constant term.
  • Exp[k*x] is akin to the linear term, where ratios are invariant (i.e. linear).
  • Exp[k*x^2] is akin to the quadratic term and now isn't invariant (i.e. nonlinear), and should likely be 'centered' on the maximum logit value in the same way as you did with your example.

If you do 'center' them then, interestingly, Exp[k*x^2] will be using the negative half of an unnormalised Normal distribution, and Exp[k*x] will be using the negative half of an unnormalised Laplace distribution.

Also IIRC, this idea is equivalent to multiplying the final probability values by a fraction of their negative logarithms (ie: it can be applied post-softmax), and this function has a special name (which I can't seem to remember nor find atm) which is used in actuarial models (IIRC, something to do with 'survival functions' where you want to penalise "weaker" values more and in a nonlinear way).

So the general transformation is:

x*Exp[a]*Exp[b*x]*Exp[c*x^2]*... = x*Exp[a + bx + cx^2 + ...]

which again shows the similarity to a polynomial transformation.

Another alternate parameterisation would be x*Exp[k*x^p] where p > 0. This would be equivalent to using the negative half of an unnormalised p-norm distribution, like so:

[image: density plots of the generalized normal distribution for several values of p]

https://en.wikipedia.org/wiki/Generalized_normal_distribution
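The invariance claim above can be checked numerically. This sketch (illustrative; `reweight` and `softmax` are hypothetical helper names, not PR code) multiplies softmax probabilities by Exp[k*x] weights and by Exp[k*x^2] weights, then shifts every logit by a constant: only the Exp[k*x] weighting leaves the renormalized distribution unchanged, which is why the quadratic version needs to be centered on the maximum logit.

```python
import math

def softmax(ls):
    m = max(ls)
    e = [math.exp(l - m) for l in ls]
    z = sum(e)
    return [x / z for x in e]

def reweight(logits, weight):
    """Multiply softmax probabilities by weight(logit), then renormalize."""
    p = softmax(logits)
    w = [pi * weight(l) for pi, l in zip(p, logits)]
    z = sum(w)
    return [x / z for x in w]

k = 0.5
logits = [2.0, 1.0, -1.0]
shifted = [l + 10.0 for l in logits]

# Exp[k*x] weighting: invariant to shifting all logits by a constant.
a = reweight(logits, lambda l: math.exp(k * l))
b = reweight(shifted, lambda l: math.exp(k * l))

# Exp[k*x^2] weighting: NOT invariant, hence the centering on the max logit.
c = reweight(logits, lambda l: math.exp(k * l ** 2))
d = reweight(shifted, lambda l: math.exp(k * l ** 2))
```

Shifting the logits multiplies every Exp[k*x] weight by the same constant Exp[k*c], which cancels on renormalization; the Exp[k*x^2] weights gain an x-dependent factor and the resulting distribution changes.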

mofosyne added the labels "performance (Speed related topics)", "Review Complexity : Medium (Generally require more time to grok but manageable by beginner to medium expertise level)", and "generation quality (Quality of model output)" on May 10, 2024
github-actions bot commented:
📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 555 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8403.31ms p(95)=20897.45ms fails=, finish reason: stop=492 truncated=63
  • Prompt processing (pp): avg=102.68tk/s p(95)=468.53tk/s
  • Token generation (tg): avg=34.82tk/s p(95)=49.66tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=smooth-pr commit=68e7c2579a15dc02e881235cc9a5baadb1b78a10

Time series charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing.

mofosyne added the label "Review Complexity : High (Generally require indepth knowledge of LLMs or GPUs)" and removed "Review Complexity : Medium" on May 11, 2024
mofosyne (Collaborator) commented:

How's this PR? Is there positive intent now to start merging it?

SilverLatios commented:

Hi, now that DRY was merged, is it possible to continue working on this PR?

Labels: generation quality (Quality of model output) · performance (Speed related topics) · Review Complexity : High (Generally require indepth knowledge of LLMs or GPUs)

8 participants