It is based on the keynote speech given by Shingo Kinoshita, Research and Development Planning of NTT Corporation, at the "NTT R&D FORUM 2023 ― IOWN ACCELERATION" held from November 14th to 17th, 2023.
We recently announced at a press conference on November 1 that we have developed tsuzumi, NTT's LLM. tsuzumi has four main features.
Its first feature is that it is lightweight. Today's large language models are competing on parameter count and have become very large in scale, so the challenge now is achieving sustainability. For example, GPT-3 has 175B parameters and requires about 1300 MWh of electricity for one training run, which is equivalent to an hour's worth of electricity produced by one nuclear power plant.

Our strategy, in contrast, is as follows. In terms of direction, we do not aim to build one massive LLM that knows everything, but small LLMs with specialized knowledge. Our approach is not simply to increase parameter size, but to make the LLM smarter by improving the quality and quantity of its training data. As a result, we have developed and announced two versions of tsuzumi. The ultralight version, tsuzumi-0.6B, has 0.6B parameters, about 1/300th of GPT-3. The light version, tsuzumi-7B, is 1/25th the size of GPT-3.

So, what are the benefits of reducing size? The first is that training can be carried out at very low cost. GPT-3-scale training is said to cost about 470 million yen per run. In contrast, tsuzumi-7B and tsuzumi-0.6B cost about 19 million yen and 1.6 million yen, respectively, reducing costs to roughly 1/25th and 1/300th. The second benefit is the cost of inference, i.e., the cost of using the language model. GPT-3 would need about five high-end GPUs, costing about 15 million yen, whereas the 7B and 0.6B versions would cost about 0.7 million yen and 0.2 million yen, respectively. In terms of hardware, they run on a single low-end GPU and a single CPU, respectively.
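The size and cost ratios quoted above can be checked directly from the stated figures; this small sketch just reproduces that arithmetic (the yen amounts are the talk's estimates, not authoritative prices):

```python
# Rough cost comparison based on the figures quoted above.
GPT3_PARAMS_B = 175                 # GPT-3 parameter count, in billions
GPT3_TRAIN_COST_JPY = 470_000_000   # quoted cost of one GPT-3-scale training run

models = {
    "tsuzumi-7B":   {"params_b": 7,   "train_cost_jpy": 19_000_000},
    "tsuzumi-0.6B": {"params_b": 0.6, "train_cost_jpy": 1_600_000},
}

for name, m in models.items():
    size_ratio = GPT3_PARAMS_B / m["params_b"]
    cost_ratio = GPT3_TRAIN_COST_JPY / m["train_cost_jpy"]
    print(f"{name}: ~1/{size_ratio:.0f} the parameters, "
          f"~1/{cost_ratio:.0f} the training cost of GPT-3")
```

Both ratios come out near 1/25 for the 7B model and near 1/300 for the 0.6B model, matching the rounded figures in the talk.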
Its second feature is its high linguistic proficiency, especially in Japanese. This is an example showing the answers of tsuzumi on the left and GPT-3.5 on the right to a question about the current situation and possible improvement measures for balancing Japan's energy policies with environmental protection (Fig. 1). tsuzumi's answer presents a well-analyzed response in Japanese. We also compared tsuzumi with GPT-3.5 and other LLMs on the same questions, using the Rakuda benchmark. In this method, two models are asked the same questions, and both answers are given to GPT-4, which judges which answer is better. Against GPT-3.5, tsuzumi achieved a win rate of 52.5%, meaning it beat GPT-3.5 with 52.5% probability. The remaining four comparisons are against top-class LLMs in Japan, over which tsuzumi had overwhelming win rates of 71.3% to 97.5%.
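The pairwise "LLM-as-judge" evaluation described above can be sketched as follows. The `ask_judge` function here is a hypothetical stand-in for a real GPT-4 API call; the placeholder logic exists only to make the sketch runnable:

```python
# Minimal sketch of pairwise win-rate evaluation: two models answer the same
# question, a judge picks the better answer, and the win rate is the fraction
# of comparisons won by model A.
def ask_judge(question: str, answer_a: str, answer_b: str) -> str:
    # Placeholder judge: prefers the longer answer, purely so the sketch runs.
    # A real setup would prompt GPT-4 with both answers here.
    return "A" if len(answer_a) >= len(answer_b) else "B"

def win_rate(questions, model_a, model_b, judge=ask_judge) -> float:
    """Fraction of questions where model A's answer is judged better."""
    wins = sum(1 for q in questions if judge(q, model_a(q), model_b(q)) == "A")
    return wins / len(questions)

# Toy usage with stand-in models.
questions = ["q1", "q2", "q3", "q4"]
model_a = lambda q: f"A detailed, well-analyzed answer to {q}."
model_b = lambda q: "Short answer."
print(f"win rate: {win_rate(questions, model_a, model_b):.1%}")
```

A 52.5% win rate then simply means model A was judged better on 52.5% of the questions.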
Further, tsuzumi can do more than just give answers in Japanese to questions in Japanese. For example, tsuzumi was asked to extract four data items, namely, device name, achievement, exhibition event, and future plans, in JSON format from the text of a recent press release about an artificial photosynthesis device developed by NTT. It gave a properly structured response as requested. How about its English proficiency? Its performance in English is in fact on par with the world's top-level language models. The leftmost is Llama 2, an English language model developed by Meta; tsuzumi produced almost the same English benchmark results. In a translation example, tsuzumi gave a smooth and quick response when asked to translate a Japanese text into English. Beyond natural language, it is also proficient in programming languages: when asked to write code in a specific format, it gave a proper response in the requested format. tsuzumi is currently also being trained in Chinese, Korean, Italian, and German, so it will be able to give answers in those languages in the future.
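The structured-extraction pattern above can be sketched as a prompt that names the required JSON keys, plus a parser that validates the model's reply. The field names and the reply below are illustrative stand-ins, not tsuzumi's actual interface:

```python
import json

# Fields requested from the press-release text (names here are hypothetical).
FIELDS = ["device_name", "achievement", "exhibition_event", "future_plans"]

def build_prompt(press_release: str) -> str:
    """Ask the model to return exactly the requested keys as JSON."""
    return (
        "Extract the following items from the text as a JSON object with "
        f"exactly these keys: {', '.join(FIELDS)}.\n\nText:\n{press_release}"
    )

def parse_reply(reply: str) -> dict:
    """Parse the model's reply and check that no requested field is missing."""
    data = json.loads(reply)
    missing = [k for k in FIELDS if k not in data]
    if missing:
        raise ValueError(f"model omitted fields: {missing}")
    return data

# Toy reply standing in for the model's structured output.
reply = json.dumps({
    "device_name": "artificial photosynthesis device",
    "achievement": "...",
    "exhibition_event": "NTT R&D FORUM 2023",
    "future_plans": "...",
})
print(parse_reply(reply)["device_name"])
```

Validating the parsed JSON against the requested keys is what makes this usable downstream, since a free-text answer would otherwise need fragile post-processing.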
Its third feature is flexible customization. Language models have a base model, which can give fairly adequate answers to general questions. However, to create models that can give specific answers in fields such as finance or the public sector, tuning must be performed. Three tuning methods are currently available (Fig. 2). On the left is prompt engineering, in which financial information is added to the prompt input to the base model so that it gives a more finance-specific response. In the middle is full fine-tuning, which creates a finance-specialized model by re-training the base model with financial data, changing the entire set of parameters. On the right is adapter tuning, where the base model is used as is, and blue adapter components containing finance-specific knowledge are added on top of it like a hat. These methods have different advantages and disadvantages in terms of cost and accuracy. Through tuning, the base model can be made more specific to a particular industry, company, or organization, or updated with the latest information. You can also add capabilities by training the model on new tasks, such as summarization and translation, to make it more task specific.
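The adapter idea can be sketched in the style of LoRA, a common adapter-tuning method (the talk does not specify which adapter method tsuzumi uses, so this is an illustrative assumption). The frozen base weight stays untouched; only two small low-rank matrices are trained for the target domain:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                        # hidden size, adapter rank (r << d)

W = rng.normal(size=(d, d))          # frozen base-model weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection (starts at zero)

def adapted_forward(x: np.ndarray) -> np.ndarray:
    """Base output plus the low-rank adapter's correction."""
    return W @ x + B @ (A @ x)

x = rng.normal(size=d)
# With B initialized to zero, the adapter starts as an exact no-op:
assert np.allclose(adapted_forward(x), W @ x)

# Trainable parameters: two thin matrices vs. the full weight.
print(f"adapter params: {A.size + B.size:,} vs full fine-tuning: {W.size:,}")
```

This is why adapter tuning is cheap: here only 8,192 parameters are trained per layer instead of 262,144, and the base model can be shared across many domain-specific "hats."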
This is an example of fine-tuning for the financial industry (Fig. 3). On the right is the response of the model before tuning, and on the left is the response after fine-tuning with financial-industry data. The LLM was asked to explain the market segments of the Tokyo Stock Exchange. The right response shows the old segments, such as the 1st Section, 2nd Section, JASDAQ, and Mothers. The left response shows the new segments established by the TSE on April 4, 2022. The fine-tuned LLM properly learned and gave the correct segments, namely, Prime, Standard, and Growth.
Its fourth feature is multimodality. Language models in general receive language inputs and produce language outputs; multimodality refers to the ability to handle modalities other than language, such as vision and audio. This is an example of adding a visual modality. The left shows a receipt. The LLM was shown this receipt and asked, "What is the total amount excluding the 10% consumption tax?" The LLM calculated the total by looking at the unit price and quantity columns on the receipt and correctly replied that the total is 9,500 yen.
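The arithmetic the model performs on the receipt is simple to reproduce: multiply unit price by quantity per line, sum, and report the pre-tax total. The line items below are hypothetical (the actual receipt contents are not given in the talk); they are chosen only so that the pre-tax total matches the demo's 9,500 yen:

```python
# Hypothetical receipt line items: (unit price in yen, quantity).
items = [
    (2_000, 3),
    (1_500, 1),
    (500, 4),
]

subtotal = sum(price * qty for price, qty in items)   # pre-tax total
total_with_tax = round(subtotal * 1.10)               # add 10% consumption tax

print(f"total excluding tax: {subtotal} yen")
print(f"total including tax: {total_with_tax} yen")
```

The multimodal step is reading the price and quantity columns out of the image; the reasoning step is this sum.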
Here is another example (Fig. 4). This is a graph from NTT's Green Vision. While being shown this complicated graph, the LLM was asked, "What percentage of power consumption is expected to be reduced by IOWN in 2040?" If you look at the year 2040, you can see IOWN at the upper part and the number 55 beside it. But if you look closer, there is an arrow at the upper part accounted for by energy savings, and farther to the right you can see that IOWN's contribution below it is actually only 45%. The language model correctly analyzed the graph and replied 45%. Thus, the LLM is capable of providing answers by interpreting a question in combination with figures. Next, I would like to talk about the technological capabilities of NTT Laboratories that enabled us to achieve these four features.
First, this is a table that shows the ranking of companies based on the number of publications in the field of AI (Fig. 5). NTT is ranked 12th in the world and 1st in Japan in this report published annually by a U.S. venture capital firm. Ranks 1 to 11 are occupied by GAFA and other major IT vendors in the U.S. and China. Compared with NTT, they are probably spending tens of times more on research funding and have many times more researchers. Nevertheless, NTT has been able to conduct research quite efficiently, enabling us to achieve these rankings. Within AI, natural language processing is a very important area for the development of language models, and NTT is number one in Japan in the number of publications in this particular area.
As shown on the right, it is also number one in the number of awards for excellence given by the Japanese Association for Natural Language Processing. The development of NTT's LLM is backed by a long history and solid track record in research. This slide shows the high-quality data we used for training the LLM, which is its distinguishing feature. The left part shows the data used for pre-training. In terms of tokens, which can be thought of as the number of words, we used more than one trillion. In terms of languages, we used not only Japanese and English but also 21 other languages, as well as programming languages. These data cover a very wide range of domains, from specialized fields to entertainment. The right part shows the data used for instruction tuning, which further trains the pre-trained model to make its responses and behavior more human-like. For this tuning, we used Japanese corpora created over more than 40 years of research, which is NTT's advantage. In addition, we used new tuning data created specifically for generative AI.
In this year's R&D Forum, we put together the exhibits related to tsuzumi in one location on B1F. As shown here, there are 11 exhibits on tsuzumi. I have already introduced Exhibits (1) and (2) in the demonstrations shown earlier, so I will talk about Exhibits (3) onward, briefly introducing some representative exhibits among the many on its specific features.
First is Exhibit (3), tsuzumi comprehensively understands the real world. The demo shows a supervisor and a junior member having a conversation via online communication, during which the supervisor displays power harassment behavior. tsuzumi detects the behavior and alerts the supervisor. "I'm sorry. I'm a little busy with other jobs so I couldn't reply immediately." "If you're busy with other work, isn't it your basic responsibility as a working adult to report that, too?" Shown on the left of this figure (Fig. 6) is the supervisor, making such power harassment statements, with his remarks displayed below his video feed. tsuzumi analyzes his emotions based on facial expressions and speech, determining to what degree the person is laughing or angry as percentages. Next, in this column, the blue part on the left shows what the supervisor said, and the middle shows a harassment level of about 71% and a 73% level of interrupting while the other person is speaking. These percentages indicate a relatively high level of harassment behavior. In response, as shown in the pink part, the LLM gives advice to encourage a change in the supervisor's behavior: "While it is important for junior members to properly report on their work, it is also important to encourage them to do so. An effective way to encourage them is for the supervisor to create opportunities to check on junior members regularly and adjust their workload as necessary. It is also important to listen patiently to junior members to create an environment in which they can work with peace of mind." Thus, tsuzumi can give appropriate advice to help the supervisor address their behavior.
Next is Exhibit (5), tsuzumi understands the user situation. tsuzumi creates specific travel plans based on user attributes and preferences, taking into account road congestion and other factors. The gray text on the left of the figure (Fig. 7) shows the user's request: "Tomorrow, after enjoying the mining experience at the Ashio Copper Mine, I plan to try Utsunomiya's famous gyoza. My son is interested in geology and is excited about ore mining. The whole family also loves gyoza and is looking forward to Utsunomiya's specialty." Once the car navigation system hears this, it structures and analyzes the input into specific information such as departure time and departure point, as shown in the middle. It then searches the web to gather information on directions, hotel reservations, and restaurant reservations. Finally, tsuzumi creates an action or travel plan to propose to the user.
Next is Exhibit (6), tsuzumi with physical sensory capabilities. A robot equipped with tsuzumi can create a menu and set a table according to the user's request. In this example, the user says, "Prepare a dinner table that will warm the body up on a cold winter day. Make considerations for left-handedness." The robot then analyzes the request and serves the food while explaining the arrangement. For example, it says, "Curry is good for warming the body up, and salad, too. They pair well with spring rolls for a seasonal feel. And tea warms the body, too. In consideration of left-handers, the chopsticks and spoon are placed on the opposite side." The demo features a robot serving food while giving such explanations.
Next is Exhibit (7), ultra-high-speed software development. The demo shows how to add a new function, namely a review function, to a shopping site. The website on the left of this figure (Fig. 8) only gives the usual introduction of a product, without a review function. You can then instruct tsuzumi to add a product review function. In response, it analyzes the source code and carries out the necessary changes. The demo shows how tsuzumi writes new source code to create a review section on the website.
Next is Exhibit (8), next-generation security operations. tsuzumi can handle incident response on behalf of security experts in a dialogue format. For example, if a virus is detected on a user's computer, tsuzumi analyzes the virus. It then informs the user, asks the user whether they have accessed the malicious site, and instructs the user to respond as soon as possible via chat. The demo shows how tsuzumi interacts with the user step by step to urge the user to respond to the security issue.
The last security demo is phishing site detection. This was featured in newspapers a few months ago. The language model analyzes the input website and determines whether it is a phishing site or not. tsuzumi's accuracy in detecting phishing sites is more than 98%, which is much more accurate than checking by humans.
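The detection flow described above can be sketched as a prompt-and-verdict loop. Both the prompt shape and the `classify` post-processing here are illustrative assumptions, not tsuzumi's actual interface:

```python
# Sketch: give the model a page's URL and extracted text, ask for a
# phishing / legitimate verdict, and parse the reply into a boolean.
def build_prompt(url: str, page_text: str) -> str:
    return (
        "Decide whether the following website is a phishing site. "
        "Answer exactly 'phishing' or 'legitimate'.\n"
        f"URL: {url}\nPage text:\n{page_text[:2000]}"   # truncate long pages
    )

def classify(reply: str) -> bool:
    """Return True if the model's reply flags the site as phishing."""
    return reply.strip().lower().startswith("phishing")

# Toy replies standing in for model output.
assert classify("Phishing: this page imitates a bank login form.") is True
assert classify("legitimate") is False
```

Constraining the model to a fixed answer vocabulary is what makes the verdict easy to measure against labeled data, which is how an accuracy figure such as 98% would be computed.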
Next, I would like to move on to the advancement of IOWN. First, I would like to explain the IOWN roadmap (Fig. 9).
IOWN 1.0 is a networking technology that connects data centers using optical fiber. Next, IOWN 2.0 optically links the boards inside the server inside the data center. Evolving further, IOWN 3.0 will optically connect the chips, while IOWN 4.0 will enable optical connection inside the chip. Now, let's look at the roadmap by generation. There are a number of elemental technologies that make up IOWN for each generation. An example is a device called PEC, or photonics-electronics convergence device. Along with the evolution of IOWN generations from 1.0 to 4.0, PEC will also continue to evolve from the 2nd to the 3rd, 4th, and 5th generations. The All-Photonics Network (APN) will evolve within IOWN 1.0 by adding more functions and increasing performance. Further, the super white box of the Data Centric Infrastructure will evolve with the evolution from IOWN 1.0, to 2.0, and to 3.0, along with the evolution of PEC devices, as Steps 0, 1, and 2. This roadmap shows how we will be moving forward with the advancement of these technologies.
First, I would like to introduce what we achieved for IOWN 1.0 this year. One achievement is significant progress in the commercialization of APN. APN consists of APN-I for the core network, APN-G for the edge network, APN-T installed at user sites, and OTN Anywhere in user terminals. Different companies have launched specific products as shown here. In March this year, NTT East and NTT West began providing network services using these products. This 100-Gbit/s leased-line service allows users exclusive use of optical wavelengths from end to end. Further, using OTN Anywhere enables visualizing latency and provides functions for adjusting and aligning differences in delay. Using this service, we have conducted various PoCs this year.
In particular, many of the PoCs were in the entertainment field, such as for concerts, e-sports, comedy, and dance. Other than entertainment, we plan to use APN to realize the data center of the future, which is our main goal. Thus far, the range of data center-to-data center connections has been limited because of significant delays in conventional networks. It is said that the limit is about 60 km, but there is not enough land within this range, making it difficult to add more data centers. If the connection distance between data centers can be increased from 60 km to 100 km by using APN, then there will be more land available for building connected data centers. We believe that APN is highly suitable for such expansion of connected data centers. We are now conducting demonstration experiments in various locations to achieve this. And, by expanding further beyond the Tokyo metropolitan area to major cities throughout Japan and even to other parts of the world, we believe that it will be possible to build a global APN network.
Next, I will report on the status of IOWN 2.0 and 3.0. The first is about Data Centric Infrastructure (DCI). DCI is a next-generation computing architecture aimed at achieving high performance with low power consumption by allocating optimally subdivided computer resources centered on data.
In Step 0, the unit of computer resource subdivision is the server and storage, and APN is used to connect them. Next, in Step 1, the unit of subdivision is the board inside the server. By connecting boards with 3rd-generation PEC devices, we aim to achieve ultra-low power consumption and ultra-high-speed switching. Evolving further, in Step 2, computing resources are subdivided by chip and connected with 4th-generation PEC devices, enabling even lower power consumption and higher performance. The key device for Step 1 is the 3rd-generation PEC device called the optical engine. The yellow parts of this figure (Fig. 10) correspond to the optical engines. We have been conducting experiments with Broadcom on the chip in the middle, which has a switching capacity of about 51.2 Tbps, with each optical engine having a transmission capacity of 3.2 Tbps. It is therefore possible to configure a single device with 51.2 Tbps of switching capacity. Evolving further, the 4th-generation device will optically connect chips at an implementation efficiency six times higher and a power efficiency two times higher than the 3rd generation, further improving performance and lowering power consumption.
The third topic is the synergy between LLM and IOWN. For IOWN, we are conducting experiments combining DCI Step 0, APN, and LLM. We have a large amount of training data in Yokosuka and wanted to install GPUs nearby, but there was not enough power or space. So, we constructed a GPU cloud in Mitaka and connected it to the database in Yokosuka via APN for remote access. Conventionally, at this distance, NFS would be quite slow and there would be considerable performance degradation. With APN, however, we achieved a connection with almost zero performance degradation even at a distance of 100 km; specifically, performance is reduced by only about 0.5%. Further, by optically connecting each CPU and GPU directly using an optical switch, we can carry out LLM training and inference with a minimal, optimized combination of computing resources. Currently, a lot of GPUs are used to train LLMs, but many of them are sometimes idle. Using this mechanism, we aim to implement training with as few computing resources as possible while making all devices work at full capacity.
NTT's future vision for the world of AI is to create an AI constellation. We envision a next-generation AI architecture that solves social issues more smartly and efficiently by combining multiple small, specialized LLMs, as shown on the right, instead of creating a single monolithic massive LLM like the one on the left. Today's R&D Forum features a demonstration of this. For example, AIs with personas representing a human resources manager, a clinical psychologist, a truck driver, and an elementary school teacher discuss "what is needed to revitalize our shrinking community." They offer their opinions and reach a consensus on what to do, with the involvement of humans as necessary. We believe we can create a mechanism for building consensus through these interactions.
As we announced yesterday, we are planning to form a business partnership with Sakana AI and conduct joint research to build the AI constellation mentioned above. Sakana AI is a venture company in the spotlight right now, founded by well-known AI key persons. David Ha was a lead researcher at Google Brain and at Stability AI, the company that created the Stable Diffusion image-generation AI, while Llion Jones was one of the Google developers who created the Transformer, the basic algorithm behind the T in ChatGPT. They established Sakana AI, based in Japan, to conduct R&D on new LLMs and AI constellations. We have entered into a business partnership to work together with them in these areas.
Finally, I would like to conclude by talking about the three resolutions of NTT Laboratories (Fig. 11). "Do research by drawing from the fountain of knowledge and provide specific benefits to society through commercial development." These words, proclaimed in 1950 by Goro Yoshida, the first Director of the Electrical Communication Laboratories, embody the vision of NTT Laboratories. They point to three elements built on top of each other. "Doing research by drawing from the fountain of knowledge" is the foundation. On top of this is the development phase of "commercial development," and, on the topmost is "providing specific benefits to society."
The most important thing is the research at the bottom, done "by drawing from the fountain of knowledge." Not only in AI, as mentioned earlier, but also in all engineering fields, NTT is ranked 11th in the world in the number of publications (Fig. 12). In world-class research areas, such as speech recognition, information security, optical communications, and quantum computing, NTT boasts the world's highest number of publications, beating Google and IBM. We would like to build on these accomplishments and solidify our position as a world leader in research by further aiming for the top and expanding our world-class research areas. This is our first resolution.
Next is development "through commercial development" in the middle. IOWN and LLM are the two key technologies that we would like to robustly develop and put into commercial use as an embodiment of our second resolution.
Last is the social implementation phase of "providing specific benefits to society." In this regard, we newly established the Research and Development Market Strategy Division in June this year. Thus far, the Laboratories and the R&D Planning Department, to which I belong, have been working together in various ways with customers, partner companies, and business companies. Under the new organization, we established the Marketing Planning and Analysis Department, wherein R&D Planning will work together with both the Marketing and Alliance departments to enhance and broaden the scope of our activities.
Thus, our third resolution is to implement research and development results into society going forward.