The main problematic issues in the development and specialization of LLMs are catastrophic forgetting, the risk of overfitting, hallucinations, incorrect interpretations, incorrect handling of exceptional situations, as well as exceptionally high performance requirements for the computing hardware involved. The purpose of the study is to select and develop methods for optimizing the LLM training and fine-tuning process that significantly reduce the computing resources it requires. To achieve this goal, it is proposed to use the following methods of optimizing LLMs and their training algorithms: LoRA and QLoRA, batch size selection, gradient accumulation, gradient checkpointing, mixed precision training, and FlashAttention-2. To obtain a cumulative positive effect when these methods are used together, a number of practical experiments must be performed. When tuning LLM training hyperparameters, one should first determine which batch size gives the best results and then choose adequate methods for optimizing the computing resources used. The application of the presented methods will increase the efficiency of using computing resources when training and fine-tuning large language models and will reduce the associated time and financial costs.
fine-tuning, gradient accumulation, graphics processing unit, Large Language Model, Low-Rank Adaptation, mixed precision
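As an illustration of how the listed methods can be combined in practice, the sketch below configures QLoRA fine-tuning together with batch size selection, gradient accumulation, gradient checkpointing, mixed precision training and FlashAttention-2 using the Hugging Face transformers, peft, datasets and bitsandbytes libraries. The base model, dataset and all hyperparameter values are placeholders chosen for illustration rather than settings taken from the study, and a recent transformers release (4.36 or later) is assumed for the attn_implementation argument.

# Illustrative sketch: combining the optimization methods discussed above.
# All model names, datasets and hyperparameters are placeholders, not values from the study.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model

# QLoRA: load the frozen base model with 4-bit NF4 quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # FlashAttention-2 kernels
    torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model)  # prepares the quantized model for training

# LoRA: train only low-rank adapter matrices on the attention projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Placeholder corpus; any causal-LM text dataset can be substituted.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)

# Batch size, gradient accumulation, gradient checkpointing, mixed precision.
args = TrainingArguments(
    output_dir="llm-finetune-sketch",
    per_device_train_batch_size=4,   # tune the batch size first
    gradient_accumulation_steps=8,   # effective batch size = 4 * 8 = 32
    gradient_checkpointing=True,     # trade recomputation for activation memory
    bf16=True,                       # mixed precision; assumes an Ampere-class or newer GPU (use fp16=True otherwise)
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

In this configuration the per-device batch size and the number of gradient accumulation steps together define the effective batch size, so the batch size can be tuned first, as recommended above, while the remaining memory-oriented options are left unchanged.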