NTT's Large Language Models 'tsuzumi'
NTT Human Informatics Laboratories

Recently, large-scale language models*1 such as ChatGPT have been attracting significant attention. They demonstrate high language processing performance by leveraging the extensive knowledge embedded in the models, but the energy needed to train one is said to be comparable to the power generated by a nuclear power plant in an hour*4. In addition, operating them requires large-scale GPU clusters, and the cost of the tuning and inference needed to specialize them for particular industries can be substantial. This raises concerns about sustainability and places a financial burden on enterprises that need to prepare a training environment.
At NTT, we have been conducting research and development to resolve these issues and have now developed 'tsuzumi'*2, a large-scale language model that is lightweight yet has top-level Japanese language processing capabilities. The parameter size of 'tsuzumi' ranges from 0.6 to 7 billion, which is relatively small and reduces the cost of training and tuning, a common issue with publicly available cloud-based LLMs. 'tsuzumi' supports both English and Japanese and can run inference on a single GPU or CPU. Moreover, 'tsuzumi' is compatible with various modalities such as visual and audio, and can be tuned specifically for particular industries or corporate organizations.
The NTT Group plans to launch commercial services using 'tsuzumi' in March 2024 and is also pushing forward with research and development on 'tsuzumi' to create new value by adding further multimodality*3. In this document, we will introduce four key features of 'tsuzumi'.

Lightweight LLM
Multilingual Support - Proficiency in Japanese
Flexible Customization - Base Model + Adapter
Multimodality - Language + Visual + Audio + User Situation

Lightweight LLM

Since OpenAI introduced 'ChatGPT' on November 30, 2022, conversational AIs that can hold natural dialogues with users and handle a broader range of tasks than humans across numerous topics have been gaining attention. The language models behind these conversational AIs excel at natural language generation tasks by using a vast number of parameters, trained on extensive datasets with large computational resources (Figure 1).

Figure 1. Parameter Size of LLMs

Increasing the number of parameters improves performance but also increases the computational resources and power required. For instance, training a GPT-3 scale model is said to require approximately 1,300 MWh of power, comparable to the power generated by a nuclear power plant in an hour*4. Instead of increasing the parameter size, 'tsuzumi' improves the quality and quantity of its Japanese training data, achieving high Japanese processing ability with a lightweight model.
As of March 2024, 'tsuzumi' comes in two versions: a lightweight version with 7 billion (7B) parameters and an ultra-lightweight version with 600 million (0.6B) parameters. These are approximately 1/25 and 1/300 the size, respectively, of OpenAI's GPT-3, which has 175 billion (175B) parameters. Because the lightweight version is sized for efficient inference on a single GPU and the ultra-lightweight version for inference on a CPU, the cost of the additional training and inference required in practical use can be kept to a minimum (Figures 2 and 3).
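The size ratios above follow directly from the parameter counts, and a rough memory estimate helps show why a model of this size can fit on a single GPU. The sketch below is illustrative only: the parameter counts come from the text, while the 2 bytes per parameter (16-bit weights) is our assumption and not a published 'tsuzumi' specification.

```python
# Parameter counts stated in the text, in billions.
GPT3_PARAMS_B = 175.0
TSUZUMI_PARAMS_B = {"lightweight": 7.0, "ultra-lightweight": 0.6}

# Assumption (not from the source): weights are stored in 16-bit precision.
BYTES_PER_PARAM = 2


def size_ratio(params_b: float, reference_b: float = GPT3_PARAMS_B) -> float:
    """Return how many times smaller a model is than the reference model."""
    return reference_b / params_b


def weight_memory_gb(params_b: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 10^9 bytes)."""
    return params_b * bytes_per_param


for name, params_b in TSUZUMI_PARAMS_B.items():
    print(
        f"{name}: about 1/{size_ratio(params_b):.0f} of GPT-3, "
        f"roughly {weight_memory_gb(params_b):.1f} GB of weights"
    )
```

Under these assumptions the 7B version is 1/25 the size of GPT-3 and the 0.6B version about 1/292, which the text rounds to 1/300. The memory figure covers weights only; activations and caches add to it in practice.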

Figure 2. Comparison of Training Costs between tsuzumi and GPT-3

Figure 3. Comparison of Inference Costs between tsuzumi and GPT-3

Beyond the cost benefits in model operation, 'tsuzumi' can also be used in local environments, such as healthcare institutions and contact centers. This makes it particularly suitable for use cases where handling sensitive information in cloud environments, including SaaS, may be a barrier. In addition, its fast inference gives it high responsiveness, making it well suited to services that require quick responses.

Multilingual Support - Proficiency in Japanese

NTT has over 40 years of accumulated research in natural language processing, and our expertise in the AI field is at the top level globally. We ranked 12th globally and 1st in Japan in the AI Research Rankings 2022*5. We are also the leading Japanese company in the number of papers accepted at top natural language processing conferences, and over the past decade we have been the top corporate recipient of excellence awards at the Association for Natural Language Processing.
'tsuzumi' supports both Japanese and English. For Japanese language processing in particular, we have confirmed high accuracy in various benchmark comparisons, even with a small parameter size, by leveraging NTT's long-standing research in language processing. Figure 4 shows the performance comparison results on the Rakuda benchmark*6, which is designed for generative AI. In this benchmark, GPT-4 judges the quality of the outputs of two models, making an overall determination that includes the composition and fluency of the text. On the Rakuda benchmark, 'tsuzumi' achieved a win rate of 81.3% against GPT-3.5 and significantly outperformed the top group of domestic LLMs, with win rates of over 70%. (Based on NTT's research, March 2024)
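To make the notion of a win rate concrete, the following is a minimal sketch of how one could be computed from pairwise verdicts returned by a judge model. The verdict labels, the sample data, and the choice to count a tie as half a win are all our assumptions for illustration; they are not taken from the Rakuda benchmark or from NTT's evaluation.

```python
from collections import Counter
from typing import Sequence


def win_rate(verdicts: Sequence[str], model: str = "A") -> float:
    """Share of pairwise comparisons won by `model`.

    Each verdict is "A", "B", or "tie". A tie is counted as half a win,
    which is an assumption made for this sketch.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    counts = Counter(verdicts)
    return (counts[model] + 0.5 * counts["tie"]) / len(verdicts)


# Hypothetical judge verdicts for eight questions (not real benchmark data).
sample_verdicts = ["A", "A", "B", "A", "tie", "A", "B", "A"]
print(f"Win rate of model A: {win_rate(sample_verdicts):.1%}")
```

A win rate above 50% means the judge preferred the model's answer more often than the opponent's over the set of questions.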

Figure 4. Comparison in Rakuda Benchmark (Based on NTT's research, March 2024)

Figure 5. Comparison of Actual LM response and Judgement Examples by GPT-4

Flexible Customization - Base Model + Adapter

Tuning an LLM is the process of adjusting the model's behavior to align with specific tasks or objectives. The main goal is for the model to generate more accurate and useful responses for those tasks. Generally, this process involves improving task suitability, adding constraints that comply with user and legal requirements, acquiring knowledge and vocabulary specific to certain domains or industries, and adjusting response style or tone, among other things. The result is expected to be more desirable responses.
When an LLM is to learn new knowledge, retraining all of its enormous number of parameters incurs a large computational training cost. 'tsuzumi' can efficiently accomplish tuning, such as adapting to the language expressions and knowledge unique to specific industries, with a small amount of additional training, thanks to adapters*7, a mechanism that enables efficient knowledge learning (Figure 6-1, Figure 6-2).
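The idea of training only an adapter while the base model stays fixed can be illustrated with a toy example. The sketch below is not the 'tsuzumi' implementation, whose adapter architecture is not described here. It assumes a low-rank adapter (a "down" projection followed by an "up" projection) added to a tiny frozen linear layer, with hypothetical data and hyperparameters.

```python
# Frozen base weights (2x2). Stored as tuples so tuning cannot modify them.
W_BASE = ((1.0, 0.0), (0.0, 1.0))


def forward(x, down, up):
    """Base output plus the adapter's correction: W_BASE x + up (down x)."""
    h = [sum(d * xj for d, xj in zip(row, x)) for row in down]  # bottleneck
    base = [sum(w * xj for w, xj in zip(row, x)) for row in W_BASE]
    out = [b + sum(u * hk for u, hk in zip(urow, h)) for b, urow in zip(base, up)]
    return out, h


def total_loss(data, down, up):
    """Sum of squared errors (times 1/2) over the dataset."""
    loss = 0.0
    for x, target in data:
        out, _ = forward(x, down, up)
        loss += 0.5 * sum((o - t) ** 2 for o, t in zip(out, target))
    return loss


def tune_adapter(data, down, up, lr=0.1, epochs=300):
    """Plain SGD that updates only the adapter matrices `down` and `up`."""
    for _ in range(epochs):
        for x, target in data:
            out, h = forward(x, down, up)
            err = [o - t for o, t in zip(out, target)]
            # Gradient reaching the bottleneck, computed before `up` changes.
            g_h = [sum(err[i] * up[i][k] for i in range(len(up)))
                   for k in range(len(h))]
            for i in range(len(up)):
                for k in range(len(h)):
                    up[i][k] -= lr * err[i] * h[k]
            for k in range(len(down)):
                for j in range(len(x)):
                    down[k][j] -= lr * g_h[k] * x[j]
    return down, up


# Hypothetical "domain" data: the base behaviour plus a rank-1 shift.
data = [([1.0, 0.0], [1.5, -0.5]),
        ([0.0, 1.0], [0.5, 0.5]),
        ([1.0, 1.0], [2.0, 0.0])]

down = [[0.1, 0.1]]    # small non-zero start
up = [[0.0], [0.0]]    # zero start: the adapter is initially a no-op

loss_before = total_loss(data, down, up)
tune_adapter(data, down, up)
loss_after = total_loss(data, down, up)
print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.6f}")
```

In this toy case the adapter is as large as the base layer, so it shows only the mechanism. In a real model the bottleneck is far narrower than the hidden size, which is where the savings in training cost come from.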

Figure 6-1, 6-2. Benefits of Adapter Tuning

From April 2024 onward, we plan to introduce a 'multi-adapter' feature, which allows users to flexibly switch between multiple adapters depending on the user or scenario, and to combine them for a synergistic effect. With this feature, multiple adapters can be connected to a single 'tsuzumi' base model, so that requests with different tuning targets can be processed in one computation, reducing the cost of providing the service. This means, for example, that we can provide fine-grained tuning at low cost for specific organizations, positions, and even levels of authority within a company (Figure 7).
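The switching and combining of adapters described above can be pictured with a small sketch. Everything here is hypothetical: the class and adapter names, the stand-in base model, and the choice to combine adapters by adding their corrections are our assumptions, since the actual multi-adapter mechanism is not specified in this article.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Adapter = Callable[[Vector], Vector]


def base_model(x: Vector) -> Vector:
    """Stand-in for the single shared, frozen base model."""
    return [2.0 * v for v in x]


class MultiAdapterModel:
    """One base model with named adapters selected per request."""

    def __init__(self, base: Callable[[Vector], Vector]) -> None:
        self._base = base
        self._adapters: Dict[str, Adapter] = {}

    def register(self, name: str, adapter: Adapter) -> None:
        self._adapters[name] = adapter

    def run(self, x: Vector, adapter_names: Sequence[str] = ()) -> Vector:
        """Apply the base model, then add each selected adapter's correction."""
        out = self._base(x)
        for name in adapter_names:
            correction = self._adapters[name](x)
            out = [o + c for o, c in zip(out, correction)]
        return out

    def run_batch(self, requests: Sequence[Tuple[Vector, Sequence[str]]]) -> List[Vector]:
        """Serve requests that use different adapters from the same base model."""
        return [self.run(x, names) for x, names in requests]


model = MultiAdapterModel(base_model)
model.register("legal", lambda x: [1.0 for _ in x])       # hypothetical adapter
model.register("sales", lambda x: [-0.5 * v for v in x])  # hypothetical adapter

results = model.run_batch([
    ([1.0, 2.0], []),                  # base model only
    ([1.0, 2.0], ["legal"]),           # one adapter
    ([1.0, 2.0], ["legal", "sales"]),  # two adapters combined
])
print(results)
```

The point of the design is that only the small adapters differ between users or scenarios, while the expensive base model is loaded once and shared.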

Figure 7. Conceptual Image of Multi-Adapter Implementation

Multimodality - Language + Visual + Audio + User Situation

'tsuzumi' is also planned to support modal extension, which as of March 2024 allows it to handle not only language but also graphical displays. We also plan to support further capabilities, such as understanding nuances in tone of voice, facial expressions, the user's situation, and even a robot's bodily sensations and human physical characteristics, so that it can work cooperatively with people in the real world (Figure 8). (Available after April 2024)

Figure 8. Modal Extension of Language + Visual

With the modal extension of language + visual, it becomes possible not only to answer questions based on language but also to answer questions that are presented with document images. For instance, you can utilize language models in tasks requiring human cognition that are often needed in business situations, such as tasks of searching for and screening image-attached documents like invoices or manuals, or when using AI to compare and evaluate product descriptions or pricing plans posted on websites. Also, contributions can be expected in the development of industrially significant services such as web search and chatbots.
Figure 9-1 presents an example of visual document understanding technology, which reads and interprets graphical documents. Answering the question 'What percentage of people don't trust online reviews as much as brand advertisements?' requires correctly interpreting the pie chart in the poster and reading off the proportion of the darkened section. 'tsuzumi' aims to realize visual document understanding technology that can correctly answer '30%' to such complex questions, which demand a high level of understanding of diagrams. As shown in Figure 9-3, 'tsuzumi' has achieved higher performance than GPT-3.5 and GPT-4 (without vision) on the more complex evaluation datasets. (Based on NTT's research, February 2024) NTT has been working on visual reading comprehension since 2020, with multiple papers accepted [1] [2] at highly competitive international conferences and high rankings in international competitions. Going forward, we will focus on establishing this technology on the basis of 'tsuzumi'.

Figure 9-1, 9-2. Modality Extension Implementation Example (Language + Visual)
Figure 9-3. Modality Extension Performance (Language + Visual)

Furthermore, with the modal extension of language + visual + audio, it becomes possible to generate answers that take the questioner's situation into account, rather than relying on the words of the question alone. In ordinary speech recognition, the nuances contained in the voice are lost when speech is converted to text. Even when emotions are recognized from the voice, they are commonly classified into a few predefined emotion types. In contrast, multimodal 'tsuzumi' can, for example, understand that a child speaking in a dull voice is feeling low, without internally reducing this to emotion types such as 'dull' or 'gloomy', and can respond accordingly with gentle, warm words. These features are expected to be used for automatic responses tailored to various user situations, such as in counseling and call centers (Figure 10-1).
Moreover, not just audio but also the user's situation (location information, parking lot congestion, driver fatigue, time of day, user preference information, etc.) can be used as input, making 'tsuzumi' applicable to concierge tasks such as car navigation and smartphone navigation (Figure 10-2).

Figure 10-1, 10-2. Variations of Modal Extension

End Notes

In this paper, we have given an overview of NTT's large-scale language model 'tsuzumi' as of October 2023. At our laboratories, we will promote the practical application of technology that draws on our many years of language processing research. As further areas of LLM research and development, we will also work on cyber security applications and on an 'AI constellation' in which AIs autonomously coordinate and discuss with one another.

Associated Resources

Press Conference Presentation Material, November 1, 2023.[1.52MB]

[1] Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, Jun Suzuki: InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions. AAAI 2024.

[2] Taku Hasegawa, Kyosuke Nishida, Koki Maeda, Kuniko Saito: DueT: Image-Text Contrastive Transfer Learning with Dual-adapter Tuning. EMNLP 2023: 13607-13624.

Glossary of Terms

*1 Large Language Models (LLM)

A language model trained using a large amount of text data, possessing excellent abilities in language understanding and text generation.

*2 tsuzumi

A trademark application for 'tsuzumi' is currently in progress. The name reflects our emphasis on Japanese language processing performance and our hope that language model technology will lead industrial development, just as the tsuzumi, a drum, signals the start of a Gagaku ensemble performance.

*3 Multimodality

The term 'modality' refers to a type of input information for AI (such as text, images, or audio). 'Multimodality' refers to an AI's ability to combine and use different types of input information.

*4 the power generated by a nuclear power plant in an hour

The amount of power needed to train a GPT-3 scale model, with a parameter count of 175 billion, is about 1,300 MWh (1), which is on par with the power output of one nuclear power plant in one hour (around 1,000 MWh).

(1) https://gizmodo.com/chatgpt-ai-openai-carbon-emissions-stanford-report-1850288635

*5 AI Research Rankings 2022

Top 100 Global Companies Leading in AI Research in 2022


*6 Rakuda Benchmark

One of the benchmarks for evaluating the performance of Japanese language models. It assesses models through question-answering tasks related to Japanese geography, politics, history, and society.


*7 Adapter Tuning

A submodule added externally to the pre-trained model. By only updating the parameters of the adapter while keeping the parameters of the pre-trained base model fixed during tuning, it can learn knowledge without having to retrain the computationally expensive base model.
