NVIDIA object never overrides GPT2 Fast Tokenizer #164

@lucian-cap

Description

I am testing out a basic RAG pipeline by chunking HTML pages to use as a document library. To determine whether a chunk is large enough, I am currently trying to count the number of tokens in each chunk. After noticing some odd behavior, however, I dug into the class code a bit and found that the NVIDIA class never overrides the default tokenization behavior inherited from BaseLanguageModel.

It always instantiates a tokenizer by using the transformers library to download the GPT-2 fast tokenizer, completely ignoring the NVIDIA NIM I am pointing it at.
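To illustrate the behavior described above, here is a minimal, self-contained sketch of how the inherited fallback appears to work. The class and function names are simplifications for illustration, not the real langchain-core API, and the GPT-2 tokenizer is replaced by a whitespace stand-in so the example runs without transformers:

```python
from typing import Callable, List, Optional


def gpt2_like_token_ids(text: str) -> List[int]:
    """Stand-in for the GPT-2 fast tokenizer the base class downloads via
    transformers -- here simply one id per whitespace-separated word."""
    return [hash(word) % 50257 for word in text.split()]


class BaseLanguageModelSketch:
    # If this is left unset, the subclass falls back to the GPT-2-style
    # tokenizer regardless of which NIM the model actually targets.
    custom_get_token_ids: Optional[Callable[[str], List[int]]] = None

    def get_token_ids(self, text: str) -> List[int]:
        if self.custom_get_token_ids is not None:
            return self.custom_get_token_ids(text)
        return gpt2_like_token_ids(text)  # GPT-2 fallback; model is ignored

    def get_num_tokens(self, text: str) -> int:
        return len(self.get_token_ids(text))


class NVIDIASketch(BaseLanguageModelSketch):
    # Does not override get_token_ids, so every token count comes from
    # the GPT-2 fallback rather than the deployed model's tokenizer.
    pass
```

With this structure, `NVIDIASketch().get_num_tokens(...)` reflects the fallback tokenizer no matter which endpoint the object is configured against, which matches the behavior reported above.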

Is this expected behavior? If so, why? The documentation for NVIDIA.get_num_tokens() states that it is "Useful for checking if an input fits in a model's context window," but how is that useful if the token count comes from a different model's tokenizer? I can see that the BaseLanguageModel class checks whether the custom_get_token_ids attribute is set and uses it if so, so I can set that myself, but why isn't the object setting it on instantiation, for example by checking with the NIM in some way?
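For anyone hitting the same thing, a workaround sketch based on the `custom_get_token_ids` attribute mentioned above. The tokenizer below is a whitespace stand-in; in a real pipeline you would wrap the actual tokenizer for the deployed checkpoint (for example one loaded with transformers). The constructor usage is an assumption about how the field can be supplied, not confirmed against the package:

```python
from typing import List


def nim_token_ids(text: str) -> List[int]:
    """Hypothetical tokenizer matched to the target NIM. Here it is just
    one id per whitespace-separated word, purely for illustration."""
    return list(range(len(text.split())))


# Hypothetical usage -- class name taken from this issue's title, keyword
# from BaseLanguageModel's custom_get_token_ids field:
# llm = NVIDIA(model="...", custom_get_token_ids=nim_token_ids)
# llm.get_num_tokens(chunk)  # now counts with nim_token_ids, not GPT-2
```

This at least makes `get_num_tokens()` consistent with the tokenizer you choose, though it still leaves the burden of picking the right tokenizer on the caller.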
