Understanding LLMs: A Comprehensive Overview from Training to Inference

Yiheng Liu

Abstract


The introduction of ChatGPT has led to a significant increase in the use of Large Language Models (LLMs) for addressing downstream tasks. Within this context, there is a growing focus on cost-efficient training and deployment, and low-cost training and deployment of LLMs represent the future development trend. This paper reviews the evolution of large language model training techniques and inference deployment technologies aligned with this emerging trend. The discussion of training covers data preprocessing, training architecture, pre-training tasks, parallel training, and relevant aspects of model fine-tuning. On the inference side, the paper covers model compression, parallel computation, memory scheduling, and structural optimization. It also explores how LLMs are used in practice and provides insights into their future development.



