Using llama.cpp, the entire context gets reprocessed each generation #866

Closed
@ClassicDirt

Description

When using llama.cpp, the entire prompt is reprocessed on each generation, which makes things like chat mode unbearably slow. The problem compounds as the context grows larger.

Perhaps using interactive mode in the binding would work? Or, more likely, implementing something similar to the prompt fast-forwarding seen here, as sketched below.
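As a rough illustration of what fast-forwarding could look like in the binding, here is a minimal Python sketch. It assumes a hypothetical binding object `model` exposing `tokenize(text)` and `eval(tokens, n_past)` (names are illustrative, not the actual llama.cpp binding API): it keeps the tokens already fed into the KV cache, finds the longest common prefix with the new prompt, and only evaluates the unseen suffix.

```python
# Sketch of prompt fast-forwarding, assuming a hypothetical binding object
# `model` with tokenize(text) -> list[int] and eval(tokens, n_past), where
# eval feeds `tokens` to llama.cpp starting at position n_past.

class PromptCache:
    def __init__(self, model):
        self.model = model
        self.cached_tokens = []  # tokens already evaluated into the KV cache

    def fast_forward(self, prompt: str) -> int:
        """Evaluate only the part of `prompt` not already in the cache.

        Returns the total number of evaluated positions, so generation can
        continue from there instead of reprocessing the whole context.
        """
        tokens = self.model.tokenize(prompt)

        # Longest common prefix between the new prompt and what has
        # already been evaluated.
        n_past = 0
        for cached, new in zip(self.cached_tokens, tokens):
            if cached != new:
                break
            n_past += 1

        # Only the unseen suffix needs a forward pass.
        new_tokens = tokens[n_past:]
        if new_tokens:
            self.model.eval(new_tokens, n_past)

        self.cached_tokens = tokens
        return len(tokens)
```

In chat mode the prompt usually grows by appending the latest turn, so the common prefix covers nearly the whole context and only the new message needs a forward pass.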
