Smooth Sampling / Quadratic Sampling support #6445


Open · wants to merge 4 commits into master

Conversation

kalomaze (Contributor) commented Apr 2, 2024

Smooth Sampling / Quadratic Sampling

This sampling method differs from the truncation samplers (Top K, Top P, Min P) and from traditional Temperature sampling, since it changes the raw scores in a non-linear fashion.

[image: graph of the quadratic logit transformation]

With this approach, we tweak the original logit scores using a quadratic transformation, based on each score's distance from the topmost logit.

You can view this as an alternative to Temperature that scales differently (though they are not mutually exclusive); the idea is that we can make the good choices more evenly probable while still punishing extreme outlier tokens.

Graph demonstration with voiceover (excuse how I sound here): https://files.catbox.moe/x804ia.mp4

This has been implemented in koboldcpp and text-generation-webui for a while now, and it has received some praise there.
Considering that, I wanted to backport it to llama.cpp for use with the server.

How is this meaningfully different from Temperature?

The interesting element is that even higher, more deterministic values avoid biasing towards the topmost token when there is a group of similarly probable tokens at the top, which meaningfully differs from the linear skew of a lower Temperature.
So instead of a top-two split that was originally 51/49 becoming skewed towards 60/40 with a temperature of, say, 0.3, it would look more like a dead-even 50/50 split with a high "smoothing factor" value, while still nullifying the distribution's low-probability outliers.
Likewise, low values make the top probabilities less deterministic and are a decent fit for creative writing, since low-probability outliers are still reduced "smoothly", without any hard cutoffs.

How do I scale it?

"0" turns off the sampler entirely. 0.01 would be extremely close to a flat distribution (so extremely unhinged like higher Temperature would be).
You can, in theory, scale to arbritarily large values from here, where it will become gradually more and more deterministic.
Consider 10.0 as a reasonable "max", but that's just an arbitrary limit.
"Smoothing Factor" values of 0.2-0.3 are generally good for creative writing.

Preview

Here is a useful webpage that allows you to visualize how the distribution changes in response to this sampler:
https://artefact2.github.io/llm-sampling/index.xhtml
Here is the text-generation-webui PR where I first implemented this: oobabooga/text-generation-webui#5403

kalomaze (Contributor, Author) commented Apr 2, 2024

I would also like to point out that Dynamic Temperature is unrelated to this, and it is not required in order to use the quadratic transformation; they are just combined in one "entropy" function for the sake of not creating an additional function.

github-actions bot commented Apr 2, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3: 498 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=9446.63ms p(90)=25919.5ms fails=0, finish reason: stop=498 truncated=0
  • Prompt processing (pp): avg=243.43tk/s p(90)=735.7tk/s total=197.87tk/s
  • Token generation (tg): avg=97.31tk/s p(90)=265.26tk/s total=130.11tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=smooth-pr commit=91f3db8aab54302c767da26c9c11d7dcd16a0347
Time series charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing.

kalomaze (Contributor, Author) commented Apr 2, 2024

Backported the "curve", which is the second hyperparameter (smoothing_curve).
This adds a cubic transformation on top of the quadratic one, and can help make lower smoothing_factor values work if the curve is set higher.
1.0 = equivalent to no change from the quadratic transformation described earlier.

[image: graph of the smoothing curve's effect on the transformation]

There might be a more intuitive or natural way to implement this, but people seemed to appreciate this addition to the original quad sampling, so I'm adding it as is.
oobabooga/text-generation-webui#5551
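The cubic extension can be sketched the same way. The blend below weights the cubic term by smoothing_curve - 1 so that a curve of 1.0 reduces exactly to the quadratic case; this is an illustrative parameterization only (the PR's actual coefficients may differ), and `smooth_sample_probs_curve` is a hypothetical name:

```python
import math

def smooth_sample_probs_curve(logits, smoothing_factor, smoothing_curve=1.0):
    """Sketch of quadratic sampling plus a cubic "smoothing_curve" term.
    Illustrative parameterization: the only property taken from the PR
    description is that smoothing_curve == 1.0 is the pure quadratic case."""
    if smoothing_factor > 0:
        top = max(logits)
        new = []
        for l in logits:
            d = top - l  # distance from the topmost logit (>= 0)
            penalty = smoothing_factor * (d ** 2 + (smoothing_curve - 1.0) * d ** 3)
            new.append(top - penalty)
        logits = new
    # standard softmax
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

With a higher curve, the penalty grows faster with distance from the top logit, so outliers are suppressed more aggressively even at low smoothing_factor values, which matches the behaviour described above.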

kalomaze changed the title from "Smooth Sampling support" to "Smooth Sampling / Quadratic Sampling support" on Apr 2, 2024
MaggotHATE (Contributor) commented:

"temp": 1.5,
"dynatemp_range": 1.4,
"smoothing_factor": 0.2,
"smoothing_curve": 1.5,
"samplers_sequence": "kt",

And it just works, no major problems with repetition (even though penalties are switched off).

One question: dynatemp (and smoothing now) are not currently used with Mirostat. Would it be beneficial to use them together, theoretically speaking? I tried it once, but Mirostat was difficult to work with originally, and is difficult to compare now.

Artefact2 (Collaborator) commented:

I know it's late to change it, but I really don't like how there's a huge discontinuity in behaviour between 0.0 and small positive values. None of the other samplers are like this, and I expect it will confuse a lot of users who are mentally accustomed to "it's close to the off value, so that means a small effect, right?".

simsvml commented Apr 3, 2024

It would be nice to have a separate llama_sample_smoothing so it can be used on its own, independent of entropy/dynatemp. It looks like you can get the same effect here by calling llama_sample_entropy(ctx, candidates, 1, 1, 1, smoothing_factor, smoothing_curve) so that the entropy part has no effect, but that seems a bit silly, and it also does a bunch of unneeded work (the entropy part doesn't have a fast path for the case where its arguments make it a no-op).

kalomaze (Contributor, Author) commented Apr 4, 2024

> I know it's late to change it, but I really don't like how there's a huge discontinuity in behaviour between 0.0 and small positive values. None of the other samplers are like this, and I expect it will cause confusion to a lot of users that are mentally used to "it's close to the off value, so it means small effects right?".

I can't think of a mathematically identical way to scale the logits that wouldn't require bounding to some arbitrary value / taking away control from the user, unfortunately.

> It would be nice to have a separate llama_sample_smoothing so it can be used on its own, independent of entropy/dynatemp. It looks like you can get the same effect here by calling llama_sample_entropy(ctx, candidates, 1, 1, 1, smoothing_factor, smoothing_curve) so the entropy part has no effect, but that seems a bit silly, and also does a bunch of unneeded work (the entropy part doesn't have a fast path for the case where its arguments make it into a no-op)

I imagine @ggerganov would probably prefer if I did this before it can be merged, since the entropy DynaTemp is distinct from this technique. Though, it would probably require a new case switch / new position in the sampler order if we want temperature + this to be "stackable". How does that sound?

ggerganov (Member) commented:

In my opinion, we already have more than enough samplers implemented. The interface allows user code to implement custom sampling techniques if more is needed.

Still, if we want to extend further, we can do it. But we should write some unit tests also. Maybe add some tests for this functionality and merge it for now?

jukofyork (Collaborator) commented Apr 9, 2024

Have you tried using x*Exp[k*x]?

It's intrinsically linked to the softmax (aka multinomial logistic regression) function and is the only scaling function which is invariant to translation in the same way the logits are (meaning you don't need to subtract from the maximum logit, as the ratio of scale factors is constant if the gap between logits remains the same).

If you reparameterise the temperature as Exp[k*1] = Exp[k], then look at Exp[k*x] and Exp[k*x^2] and so on, then the relationship to polynomials should be clearer:

  • Standard temperature scaling is akin to the constant term.
  • Exp[k*x] is akin to the linear term, where ratios are invariant (i.e. linear).
  • Exp[k*x^2] is akin to the quadratic term and now isn't invariant (i.e. nonlinear), and should likely be 'centered' on the maximum logit value in the same way as you did with your example.

If you do 'center' them then, interestingly, Exp[k*x^2] will be using the negative half of an unnormalised Normal distribution, and Exp[k*x] will be using the negative half of an unnormalised Laplace distribution.

Also IIRC, this idea is equivalent to multiplying the final probability values by a fraction of their negative logarithms (ie: it can be applied post-softmax), and this function has a special name (which I can't seem to remember nor find atm) which is used in actuarial models (IIRC, something to do with 'survival functions' where you want to penalise "weaker" values more and in a nonlinear way).

So the general transformation is:

x*Exp[a]*Exp[b*x]*Exp[c*x^2]*... = x*Exp[a + bx + cx^2 + ...]

which again shows the similarity to a polynomial transformation.

Another alternate parameterisation would be x*Exp[k*x^p] where p > 0. This would be equivalent to using the negative half of an unnormalised p-norm distribution, like so:

[image: density plots of the generalized normal distribution for several values of p]

https://en.wikipedia.org/wiki/Generalized_normal_distribution
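The invariance claim above can be checked numerically. This sketch (illustrative; `reweight` and `softmax` are hypothetical helper names, not PR code) multiplies softmax probabilities by Exp[k*x] weights and by Exp[k*x^2] weights, then shifts every logit by a constant: only the Exp[k*x] weighting leaves the renormalized distribution unchanged, which is why the quadratic version needs to be centered on the maximum logit.

```python
import math

def softmax(ls):
    m = max(ls)
    e = [math.exp(l - m) for l in ls]
    z = sum(e)
    return [x / z for x in e]

def reweight(logits, weight):
    """Multiply softmax probabilities by weight(logit), then renormalize."""
    p = softmax(logits)
    w = [pi * weight(l) for pi, l in zip(p, logits)]
    z = sum(w)
    return [x / z for x in w]

k = 0.5
logits = [2.0, 1.0, -1.0]
shifted = [l + 10.0 for l in logits]

# Exp[k*x] weighting: invariant to shifting all logits by a constant.
a = reweight(logits, lambda l: math.exp(k * l))
b = reweight(shifted, lambda l: math.exp(k * l))

# Exp[k*x^2] weighting: NOT invariant, hence the centering on the max logit.
c = reweight(logits, lambda l: math.exp(k * l ** 2))
d = reweight(shifted, lambda l: math.exp(k * l ** 2))
```

Shifting the logits multiplies every Exp[k*x] weight by the same constant Exp[k*c], which cancels on renormalization; the Exp[k*x^2] weights gain an x-dependent factor and the resulting distribution changes.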

mofosyne added the labels "performance (Speed related topics)", "Review Complexity : Medium (Generally require more time to grok but manageable by beginner to medium expertise level)", and "generation quality (Quality of model output)" on May 10, 2024
github-actions bot commented:
📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 555 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8403.31ms p(95)=20897.45ms fails=, finish reason: stop=492 truncated=63
  • Prompt processing (pp): avg=102.68tk/s p(95)=468.53tk/s
  • Token generation (tg): avg=34.82tk/s p(95)=49.66tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=smooth-pr commit=68e7c2579a15dc02e881235cc9a5baadb1b78a10

Time series charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing.

mofosyne added the label "Review Complexity : High (Generally require indepth knowledge of LLMs or GPUs)" and removed "Review Complexity : Medium" on May 11, 2024
mofosyne (Collaborator) commented:

How's this PR? Is there positive intent now to start merging it?

SilverLatios commented:

Hi, now that DRY was merged, is it possible to continue working on this PR?

Labels: generation quality (Quality of model output) · performance (Speed related topics) · Review Complexity : High (Generally require indepth knowledge of LLMs or GPUs)

8 participants