train-text-from-scratch.exe stop after "begin training" (tensor->src0 is null) #1869

Description

I'm running the latest release (master-254a7a7) like this:

bin\train-text-from-scratch.exe --vocab-model models\ggml-vocab.bin --checkpoint-in chk-lamartine-256x16.bin --checkpoint-out chk-lamartine-256x16.bin --model-out ggml-lamartine-265x16-f32.bin --train-data "shakespeare.txt"
I tried with several models.

Expected Behavior

Training should run for a long time.

Current Behavior

Training stops immediately, without any error:

D:\git\llama.cpp>bin\train-text-from-scratch.exe --vocab-model models\ggml-vocab.bin --ctx 64 --embd 256 --head 8 --layer 16 --checkpoint-in chk-lamartine-256x16.bin --checkpoint-out chk-lamartine-256x16.bin --model-out ggml-lamartine-265x16-f32.bin --train-data "alphonsedelamartine.txt" -t 6 -b 1 -n 32 --seed 2 --adam-iter 16 --print-details-interval 0 --predict 16 --use-flash
main: seed: 2
llama.cpp: loading model from models\ggml-vocab.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
main: tokenize training data
main: number of training tokens: 474
print_params: n_vocab: 32000
print_params: n_ctx:   64
print_params: n_embd:  256
print_params: n_mult:  256
print_params: n_head:  8
print_params: n_ff:    768
print_params: n_layer: 16
print_params: n_rot:   32
main: number of unique tokens: 253
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080
main: init model
load_checkpoint: Training iterations: 0.
load_checkpoint: Training samples:    0.
load_checkpoint: Training tokens:     0.
main: opt iter 0
used_mem model+cache: 242364416 bytes
main: begin training

Environment and Context

Windows 11
NVIDIA RTX 3080
AMD Ryzen 7 2700
32 GB RAM

Activity

SlyEcho (Collaborator) commented on Jun 15, 2023

It's really fast now.

Entretoize (Author) commented on Jun 15, 2023

Funny... my fault, I should have mentioned that no checkpoint or model file is created in any folder.

SlyEcho (Collaborator) commented on Jun 15, 2023

Sorry for the joke, but this tool is still very new, so it has some problems. There are quite a few open issues.

Entretoize (Author) commented on Jun 16, 2023

But how can I investigate the problem? Can I debug it with Visual Studio?

Entretoize (Author) commented on Jun 16, 2023

I just tried in Visual Studio: it is tensor->src0 (and src1) that are null in ggml-cuda.cu, in the function ggml_cuda_compute_forward. Maybe that helps?
It runs if I disable CUBLAS. At least it seems to; it has been running for some minutes now. But I suppose it will be very slow?
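
For reference, a guard of the kind described here might look like the following at the top of ggml_cuda_compute_forward. This is a sketch only, assuming the null src0 belongs to nodes with no inputs (e.g. parameter leaves) and that the function's bool return value tells ggml.c whether to skip the CPU path, as in the tree from that era; it is not necessarily the fix that later landed:

    // sketch only: bail out for nodes that have no inputs instead of
    // dereferencing a null tensor->src0 further down in the dispatch
    bool ggml_cuda_compute_forward(struct ggml_compute_params * params,
                                   struct ggml_tensor * tensor) {
        if (tensor->src0 == NULL && tensor->src1 == NULL) {
            return false; // nothing to run on the GPU; caller keeps the CPU path
        }
        // ... existing dispatch on tensor->op ...
    }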

robyngraf (Contributor) commented on Jun 16, 2023

For me, too, it works without CUBLAS (it successfully trains the model) and does not work with CUBLAS (it quits without creating a model file).

Entretoize (Author) commented on Jun 16, 2023

I added a check, if (node->src0 != NULL), at line 16009 of ggml.c:

        if (node->src0 != NULL) { // skip nodes whose source tensor is null
            ggml_compute_forward(&params, node);
        }

Since this call is inside a loop and the other nodes do have a non-null src0, only the offending node is skipped. It doesn't crash now and seems to learn, but I don't know the consequences of doing that.

robyngraf (Contributor) commented on Jun 16, 2023

That didn't quite do it for me, but then I added similar checks to the other calls to ggml_compute_forward in ggml.c as well, and it seems to have started training now. Or at least it's getting further than it did before.
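
For the record, the same guard could be shared across call sites with a small wrapper. This is a sketch only, using a hypothetical helper name (ggml_compute_forward_checked does not exist in ggml):

    // hypothetical helper: one null guard shared by every
    // ggml_compute_forward call site in ggml_graph_compute
    static void ggml_compute_forward_checked(struct ggml_compute_params * params,
                                             struct ggml_tensor * node) {
        if (node->src0 != NULL) { // skip nodes with no source tensor
            ggml_compute_forward(params, node);
        }
    }

This only skips the symptom, of course; the open question is why the training graph hands the backend nodes with a null src0 in the first place.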

The title was changed from 'train-text-from-scratch.exe stop after "begin training" without any errors' to 'train-text-from-scratch.exe stop after "begin training" (tensor->src0 is null)' on Jun 16, 2023.
A commit referencing this issue was added on Jun 17, 2023.
A commit referencing this issue (5ec8dd5) was added on Jun 24, 2023.

[4 remaining timeline items not shown]
