This is the training code used to train StripedHyena-Nous-7B.
First, tokenize your data:
```bash
python tokenization.py \
--dataset your-super-cool-sharegpt-format-dataset \
--tokenizer togethercomputer/StripedHyena-Hessian-7B \
--output tokenized \
--num-proc 32 \
--pad-to-length 4096 \
--truncate
```

Make sure you have run `accelerate config`; we used the provided DeepSpeed configuration.
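The `--pad-to-length 4096 --truncate` flags imply that every example is brought to a fixed token length. A minimal sketch of that pad-or-truncate logic (the function name and pad id are illustrative assumptions, not the actual `tokenization.py` implementation):

```python
def pad_or_truncate(token_ids, length, pad_id=0):
    """Right-pad a token-id list to `length`, or truncate it if longer.

    Hypothetical helper mirroring --pad-to-length / --truncate behavior;
    the real script may pad differently (e.g. with the tokenizer's pad token).
    """
    if len(token_ids) >= length:
        return token_ids[:length]
    return token_ids + [pad_id] * (length - len(token_ids))
```

For example, `pad_or_truncate([1, 2, 3], 5)` returns `[1, 2, 3, 0, 0]`, while a 10-token input with `length=4` is cut to its first 4 tokens.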
Then, train!
```bash
accelerate launch finetune.py \
--model togethercomputer/StripedHyena-Hessian-7B \
--dataset tokenized \
--output-dir output \
--epochs 4 \
--batch-size 12 \
--gradient-accumulate-every 12 \
--warmup-steps 350 \
--learning-rate 0.000004 \
--lr-schedule linear \
--weight-decay 0.1 \
--checkpointing-steps 1000 \
--no-decay poles residues
```

The `--no-decay` option disables weight decay on only the specified parameters.
For StripedHyena, we've found that disabling weight decay on the Hyena operator's poles and residues parameters improves performance.
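One common way to implement a no-decay list is to split parameters into two optimizer groups by name. This is a hypothetical sketch of that pattern (the function and the example parameter names are illustrative; `finetune.py` may do this differently):

```python
def build_param_groups(named_params, no_decay_keys, weight_decay=0.1):
    """Split (name, param) pairs into decay / no-decay optimizer groups.

    Any parameter whose name contains one of `no_decay_keys` (e.g. "poles",
    "residues") gets weight_decay=0.0; everything else gets `weight_decay`.
    Sketch only -- not the actual finetune.py implementation.
    """
    decay, no_decay = [], []
    for name, param in named_params:
        if any(key in name for key in no_decay_keys):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```

The returned list can be passed directly to a PyTorch optimizer such as `torch.optim.AdamW`, which accepts per-group `weight_decay` overrides.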
There is also a `--frozen` option that completely freezes the selected parameter groups.
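Freezing is typically done by clearing `requires_grad` on the matching parameters so the optimizer never updates them. A minimal sketch of that idea, assuming name-substring matching like `--no-decay` (the helper is hypothetical, not the actual `--frozen` implementation):

```python
def freeze_params(named_params, frozen_keys):
    """Set requires_grad=False on parameters whose name matches a key.

    Returns the list of frozen parameter names. Illustrative sketch only;
    finetune.py's --frozen option may select groups differently.
    """
    frozen = []
    for name, param in named_params:
        if any(key in name for key in frozen_keys):
            param.requires_grad = False
            frozen.append(name)
    return frozen
```

With real `torch.nn.Parameter` objects this drops the frozen tensors out of the backward pass; they should also be excluded from the optimizer's parameter groups.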