Closed
Description
When using llama.cpp, the entire prompt is reprocessed on every generation, which makes things like chat mode unbearably slow. The problem compounds as the context grows larger.
Perhaps using interactive mode in the binding would work? Or, more likely, implementing something similar to the prompt fast-forwarding seen here. A rough sketch of that idea follows below.
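For illustration, here is a minimal sketch of the fast-forwarding idea: keep track of the tokens already evaluated, and on the next request only feed the model the suffix after the longest shared prefix. The `tokenize` and `eval_tokens` functions are hypothetical stand-ins for whatever calls the binding actually exposes, not real llama.cpp API; the only assumption is that the model's KV cache persists across eval calls, as it does in llama.cpp.

```python
def tokenize(text: str) -> list[int]:
    """Hypothetical stand-in for the binding's tokenizer (toy version)."""
    return [ord(c) for c in text]

def eval_tokens(tokens: list[int], n_past: int) -> None:
    """Hypothetical stand-in for the binding's eval call; in llama.cpp this
    would extend the KV cache starting at position n_past."""
    pass

class PromptCache:
    def __init__(self) -> None:
        self.evaluated: list[int] = []  # tokens already in the KV cache

    def common_prefix_len(self, tokens: list[int]) -> int:
        """Length of the prefix shared with the previously evaluated tokens."""
        n = 0
        for a, b in zip(self.evaluated, tokens):
            if a != b:
                break
            n += 1
        return n

def generate(cache: PromptCache, prompt: str) -> None:
    tokens = tokenize(prompt)
    n_past = cache.common_prefix_len(tokens)
    # Only the tokens after the shared prefix need evaluation; everything
    # before n_past is already in the model's KV cache.
    eval_tokens(tokens[n_past:], n_past=n_past)
    cache.evaluated = tokens
```

In chat mode the new prompt is almost always the previous prompt plus the latest turn, so the shared prefix covers nearly everything and only the new message has to be evaluated.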