I'm running the latest release (master-254a7a7) like this:
bin\train-text-from-scratch.exe --vocab-model models\ggml-vocab.bin --checkpoint-in chk-lamartine-256x16.bin --checkpoint-out chk-lamartine-256x16.bin --model-out ggml-lamartine-265x16-f32.bin --train-data "shakespeare.txt"
I tried with several models.
Expected Behavior
Training should run for a long time.
Current Behavior
Training stops immediately, without any error:
D:\git\llama.cpp>bin\train-text-from-scratch.exe --vocab-model models\ggml-vocab.bin --ctx 64 --embd 256 --head 8 --layer 16 --checkpoint-in chk-lamartine-256x16.bin --checkpoint-out chk-lamartine-256x16.bin --model-out ggml-lamartine-265x16-f32.bin --train-data "alphonsedelamartine.txt" -t 6 -b 1 -n 32 --seed 2 --adam-iter 16 --print-details-interval 0 --predict 16 --use-flash
main: seed: 2
llama.cpp: loading model from models\ggml-vocab.bin
llama_model_load_internal: format = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 1 (mostly F16)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
main: tokenize training data
main: number of training tokens: 474
print_params: n_vocab: 32000
print_params: n_ctx: 64
print_params: n_embd: 256
print_params: n_mult: 256
print_params: n_head: 8
print_params: n_ff: 768
print_params: n_layer: 16
print_params: n_rot: 32
main: number of unique tokens: 253
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080
main: init model
load_checkpoint: Training iterations: 0.
load_checkpoint: Training samples: 0.
load_checkpoint: Training tokens: 0.
main: opt iter 0
used_mem model+cache: 242364416 bytes
main: begin training
Environment and Context
Windows 11
NVidia RTX 3080
Ryzen 7 2700
Ram 32GB
SlyEcho commentedon Jun 15, 2023
It's really fast now.
Entretoize commentedon Jun 15, 2023
Funny... my fault, I should have mentioned that no checkpoint or model file is created in any folder.
SlyEcho commentedon Jun 15, 2023
Sorry about the joking, but this tool is still very new, so it has some problems. There are quite a few open issues.
Entretoize commentedon Jun 16, 2023
But how can I investigate the problem? Can I debug it with Visual Studio?
Entretoize commentedon Jun 16, 2023
I just tried in Visual Studio: it's tensor->src0 (and src1) that are null, in ggml-cuda.cu in the function ggml_cuda_compute_forward. Maybe that helps?
It runs if I disable CUBLAS. At least it seems to: it has been running for some minutes now. But I suppose it will be very slow?
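For reference, a minimal sketch of rebuilding without cuBLAS to reproduce this workaround. The option name `LLAMA_CUBLAS` is an assumption based on the build system of that era; check the CMakeLists.txt of your revision for the exact flag.

```shell
# Rebuild llama.cpp with CUDA/cuBLAS disabled (CPU-only training).
# LLAMA_CUBLAS is assumed to be the relevant option at this revision.
cmake -B build -DLLAMA_CUBLAS=OFF
cmake --build build --config Release
```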
robyngraf commentedon Jun 16, 2023
For me also it works without CUBLAS (successfully trains the model) and does not work with CUBLAS (quits without creating a model file).
Entretoize commentedon Jun 16, 2023
I added
if (node->src0 != NULL)
at line 16009 of ggml.c, as that call is in a loop and the other nodes have a non-null src0. It doesn't crash now and seems to learn, but I don't know the consequences of doing that.
robyngraf commentedon Jun 16, 2023
That didn't quite do it for me, but then I added similar checks to the other calls to ggml_compute_forward in ggml.c as well, and it seems to have started training now. Or at least it's getting further than it did before.
Issue retitled: train-text-from-scratch.exe stop after "begin training" without any errors → train-text-from-scratch.exe stop after "begin training" (tensor->src0 is null)
Linked pull request ggml-org#1869: Fix null reference errors when training from scratch with CUDA