All About NTT's Fully Domestic AI: tsuzumi 2 and the tsuzumi 2 Vision Model

NTT's homegrown large language model "tsuzumi," developed entirely in Japan, evolved into "tsuzumi 2" in October 2025 with its parameters expanded to approximately 30 billion (30B), and on May 19, 2026, added the "tsuzumi 2 Vision model," which can read Japanese business documents containing charts and figures as images. Its greatest strengths are that it runs on a single GPU, comes pre-loaded with knowledge in finance, healthcare, and the public sector, and can be deployed on-premises without sending confidential data outside the organization. This article traces the journey and technology of tsuzumi, the philosophy of Senior Distinguished Researcher Kyosuke Nishida (西田京介) who oversees its development, and its adoption by Tokyo Online University (東京通信大学) and the government AI platform "Gennai," while also positioning it against Sili

What Is tsuzumi (つづみ)? ― A Purely Domestic Japanese AI That Competes on "Lightness" Rather Than "Size"

Let me start with a concrete picture of what kind of AI tsuzumi is. A regional bank employee asks, "How do I handle this error code in our internal system?" and gets an immediate answer. A hospital administrator summarizes a thick clinical guideline. A city office drafts a notification letter for residents. All of these tasks are completed entirely within a single server (with one GPU) placed on-premises or in a data center, without sending any data to an external cloud. This is the intended use case for tsuzumi.

Tsuzumi is a large language model (LLM) developed in-house by NTT, specialized for the Japanese language. The name derives from the traditional Japanese instrument "tsuzumi" (hand drum), embodying the spirit of a Japan-born AI that resonates richly despite its small size. While OpenAI's GPT series and Google's Gemini pursue the goal of "an all-purpose model that simply scales up to handle anything," tsuzumi takes the exact opposite approach. NTT has put forward a vision of "realizing societal well-being together with humans not through the scaling-up and centralization of LLMs, but through the collaboration of many AIs with distinct personalities," and made it a foundational design principle to build AI of a "usable size" — one that fits within on-site budget and hardware constraints — rather than a single massive brain.

This "lightness" is not merely a philosophy; it connects directly to economic viability. Frontier-scale large models require dozens to hundreds of GPUs for inference (actually running the AI), and their power consumption and operational costs have become a barrier to enterprise adoption. Tsuzumi fits into a single GPU, dramatically reducing power consumption and costs. Furthermore, it adheres strictly to "full-scratch" (fully in-house development from the ground up) — using only data for which NTT holds rights or has obtained licenses — which avoids the copyright and intellectual property dispute risks that plague overseas models trained on massive amounts of web content without permission. This is also an important differentiating factor for enterprises and government agencies seeking peace of mind.

The Journey of tsuzumi ― From Its 2023 Launch to Commercialization and Generational Transition

Tsuzumi made its debut on November 1, 2023. At a press conference, NTT announced tsuzumi as a proprietary LLM developed on the foundation of approximately 40 years of natural language processing research accumulated at its laboratories. The first-generation tsuzumi was remarkably compact, with an ultra-lightweight version at 600 million (0.6B) parameters and a lightweight version at 7 billion (7B) parameters—roughly 1/300th and 1/25th, respectively, of OpenAI's GPT-3 (175 billion parameters). The design philosophy of "small yet strong in Japanese" was already clearly defined at this stage.

On March 25, 2024, NTT began commercial offerings of tsuzumi. NTT Communications and NTT DATA served as the initial points of contact, with other group companies such as NTT East and NTT West subsequently rolling out the service in sequence. The lightweight model's ability to meet customer demand for on-premises (self-hosted) deployment was seen as a distinct advantage over overseas competitors whose offerings assumed cloud-based infrastructure. In November 2024, tsuzumi also became available via Microsoft Azure. The first-generation tsuzumi expanded its adoption primarily among local governments, financial institutions, and healthcare providers, sectors characterized by high confidentiality requirements and a reluctance to entrust data to overseas cloud providers.

Then, on October 20, 2025, NTT began offering its next-generation model, "tsuzumi 2." This generational leap, the subject of this article, inherits the first generation's approach (lightweight, highly secure, and low-cost) while raising performance by another level. NTT positions tsuzumi 2 at the core of the "AI For Quality Growth" vision championed by President Akira Shimada: solving customers' challenges through AI and jointly realizing sustainable, high-quality growth.

tsuzumi 2 ― Scaling to 30B and the Design Philosophy of "Running on a Single GPU"

The biggest change in tsuzumi 2 is a dramatic expansion of parameter scale, from 7B in the first generation to approximately 30 billion (30B). Generally speaking, larger models become smarter but also far more expensive to run. tsuzumi 2 grows in size while keeping the absolute requirement carried over from the first generation, running on a single GPU, through techniques that reduce the memory required during inference. According to NTT, the necessary initial hardware investment amounts to approximately 5 million yen, equivalent to a single NVIDIA A100 (40GB). Hardware costs for similar use cases are put at around 100 million yen for DeepSeek-V3.1 (approximately 700B) and around 50 million yen for Meta's Llama-4 (approximately 400B), so tsuzumi 2 requires roughly one-twentieth and one-tenth of that hardware outlay, respectively.
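The single-GPU claim and the cost comparison can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the article does not say which memory-reduction technique NTT uses, so the bit widths are assumptions, and the estimate covers model weights only (it ignores the KV cache, activations, and runtime overhead). The parameter count, GPU memory, and hardware costs are the figures quoted above.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9


GPU_MEMORY_GB = 40   # NVIDIA A100 (40GB), as quoted in the article
PARAMS_BILLION = 30  # tsuzumi 2, approximately 30B parameters

# Bit widths are assumptions for illustration, not NTT's disclosed settings.
for bits in (16, 8, 4):
    gb = weight_memory_gb(PARAMS_BILLION, bits)
    print(f"{bits:>2}-bit weights: {gb:.0f} GB, fits in 40 GB: {gb < GPU_MEMORY_GB}")

# Initial hardware investment in millions of yen, as quoted in the article.
hardware_cost = {"tsuzumi 2": 5, "Llama-4": 50, "DeepSeek-V3.1": 100}
cost_ratio = {
    name: hardware_cost["tsuzumi 2"] / cost
    for name, cost in hardware_cost.items()
    if name != "tsuzumi 2"
}
print(cost_ratio)
```

Under these assumptions, 16-bit weights alone (60 GB) would not fit in 40 GB, while 8-bit (30 GB) or 4-bit (15 GB) weights would, which is why some form of memory reduction is needed to meet the single-GPU requirement. The cost ratios come out to one-tenth (versus Llama-4) and one-twentieth (versus DeepSeek-V3.1).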

The internal design also forgoes competing on general-purpose intelligence, instead focusing squarely on "domains that Japanese enterprises actually use." NTT has reinforced knowledge of industry terminology, regulations, and practical documents for three fields — finance, healthcare, and public services (local government) — starting from the pre-training stage. Furthermore, it improves the accuracy of RAG (Retrieval-Augmented Generation), which retrieves internal documents to generate answers, as well as the efficiency of fine-tuning for adapting the model to specific use cases with small amounts of data. A telling example comes from validation in the financial domain: on questions equivalent to the Financial Planning Skills Test Level 2 (FP Level 2), Google's Gemma 27B model achieved a 64% accuracy rate with 1,900 additional training samples, while tsuzumi 2 reached 70% with just 200 additional samples, according to NTT. In other words, the practical advantage for real-world deployment lies in being "easy to develop into a domain expert with minimal training material."
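For readers unfamiliar with RAG, the retrieve-then-generate flow mentioned above can be sketched in a few lines. This is a minimal illustration, not NTT's implementation: the documents are hypothetical, retrieval uses simple word overlap (production systems typically use embedding-based search), and the final call to a language model is omitted because no tsuzumi API is assumed here.

```python
import re
from collections import Counter

# Hypothetical internal documents, for illustration only.
DOCS = {
    "error_codes": "Error code E42 means the batch job timed out. "
                   "Restart the job from the operations console.",
    "expenses": "Travel expenses must be submitted within 30 days "
                "with receipts attached.",
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, doc: str) -> int:
    """Count the tokens shared by the query and the document."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    return sum(min(q[t], d[t]) for t in q)


def retrieve(query: str, docs: dict[str, str], k: int = 1) -> list[tuple[str, str]]:
    """Return the k documents with the highest overlap score."""
    ranked = sorted(docs.items(), key=lambda kv: score(query, kv[1]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: dict[str, str], k: int = 1) -> str:
    """Assemble the prompt that would be passed to the language model."""
    context = "\n".join(f"[{name}] {text}" for name, text in retrieve(query, docs, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


question = "How do I handle error code E42?"
print(build_prompt(question, DOCS))
```

The point of the design is that the model answers from retrieved internal text rather than from memory alone, so the documents never need to leave the organization's own servers.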

tsuzumi 2 is designed for on-premises or private cloud deployment, enabling organizations to handle confidential information without it leaving their environment. Rather than a general-purpose knowledge engine, it is a practical tool for reducing day-to-day operational "friction" — summarizing internal manuals, searching through regulations, and handling document-based Q&A. NTT itself positions tsuzumi 2 in exactly these terms.

tsuzumi 2 Vision Model ― Reading Business Documents with Charts and Diagrams "as Images"

On May 19, 2026, NTT announced a major update to tsuzumi 2, introducing what is known as the "tsuzumi 2 Vision model." This is a multimodal extension capable of understanding not only text but also images, with a primary focus on visually comprehending the tables, graphs, and diagrams (charts) commonly found in Japanese business documents by ingesting entire documents as images.

There are practical reasons behind the "as images" approach. Confidential documents such as financial statements, design specifications, application forms, and internal approval documents often pack critical figures and conditions into charts and tables, not just text. Processing these through conventional text extraction disrupts the layout and table structure, leading to misinterpretation. tsuzumi 2 Vision can extract particularly important information from charts and tables to build databases, pull required fields from forms, and understand the flow of process diagrams. In addition, the level of "logical reasoning and numerical processing" has been elevated — including comprehension and calculation of numerical data such as sales figures, and interpretation of functions contained in technical documents such as API documentation.
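Why text extraction can mislead is easy to show with a toy case. The sketch below uses hypothetical data and a deliberately naive extractor; it does not describe how tsuzumi 2 Vision works internally. It only illustrates how a single empty cell causes a value to land in the wrong column once the table layout is flattened into plain text.

```python
# Hypothetical table: an empty cell shifts later values into the wrong
# column once the table is flattened to plain text.
header = ["Item", "FY2024", "FY2025"]
rows = [
    ["Sales", "120", "135"],
    ["Grants", "", "10"],  # no grants in FY2024, so that cell is empty
]


def naive_extract(header, rows):
    """Mimic a text extractor that emits only visible cell text, space-separated."""
    lines = [header] + rows
    return "\n".join(" ".join(cell for cell in line if cell) for line in lines)


def naive_parse(text):
    """Rebuild records by position alone, as a text-only pipeline would have to."""
    head, *body = [line.split() for line in text.splitlines()]
    return [dict(zip(head, values)) for values in body]


flat = naive_extract(header, rows)
records = naive_parse(flat)
print(flat)
print(records[1])
```

The second record comes back as {'Item': 'Grants', 'FY2024': '10'}: the FY2025 figure has been attributed to FY2024. Reading the page as an image keeps the spatial position of each cell, which is the information the flattened text has lost.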

The use cases NTT cites include credit assessment workflows that involve ingesting materials filled with charts and tables, and support for technical inquiry operations that require referencing technical documents to provide answers. Crucially, this advanced chart and diagram comprehension is still achieved within a single-GPU environment. The ability to interpret confidential documents containing charts and diagrams entirely in-house — without uploading them to overseas cloud services — holds significant value for enterprises and government agencies that favor on-premises deployments. Availability is planned to roll out gradually through NTT Group companies.

How to Read Japanese-Language Performance — The Substance and Limits of "GPT-5 Class" Evaluations

When discussing tsuzumi 2, the expression "GPT-5-class Japanese language performance" inevitably comes up. This requires careful understanding.

In NTT's evaluation, tsuzumi 2 is said to outperform same-size competitors Google's Gemma-3 27B and Alibaba's Qwen-2.5 32B across four benchmarks critical to business use: knowledge, analysis, instruction-following, and safety. NTT further explains that the model achieves scores rivaling GPT-5 — a model far larger than its own — on many tasks in the Japanese-language MT-Bench, which measures conversational quality. Multiple specialist outlets including Ledge.ai have reported that it "achieves GPT-5-class Japanese language performance in a lightweight model built entirely from scratch."

However, a measured caveat is warranted here. As several analytical articles point out, these results are evaluated specifically in "Japanese" and within the "same size class" — and in terms of general all-purpose performance across every task, frontier models such as GPT-5, Anthropic's Claude, and Google's Gemini 3 Pro still hold the lead. tsuzumi 2 is not a model that "beats ChatGPT in every dimension"; rather, it is a model that excels on a specific playing field: "enabling Japanese enterprises that cannot send confidential data outside their walls to handle Japanese-language business tasks at a high level, at realistic cost." This deliberate positioning is the core of tsuzumi's strategy, and it would be a mistake to read the benchmark numbers at face value as "defeating the giant models."

Senior Distinguished Researcher Kyosuke Nishida, Who Leads Development

Leading the research and development of tsuzumi is Kyosuke Nishida, Senior Distinguished Researcher at NTT Human Informatics Laboratories. The title of "Senior Distinguished Researcher" is a position granted by the NTT Group to exceptionally talented researchers expected to make long-term contributions, carrying the mission of driving innovative and pioneering technological development in fields of long-term strategic importance to the group.

Nishida's expertise spans large language models, natural language processing, machine reading comprehension (AI that reads text and answers questions), and Vision-and-Language models that connect text and images, as well as deep learning. This background is telling. The direction tsuzumi 2 Vision took, reading documents with figures and diagrams as images, is a natural extension of the years of work Nishida and his colleagues have accumulated in machine reading comprehension and vision-language models. His research achievements are extensive: he has published numerous papers at highly competitive international conferences in natural language processing and AI, including ACL, AAAI, ICLR, and EMNLP, and has received high recognition both domestically and internationally, including the NLP2021 Best Paper Award, the 2024 NTT R&D Award, and various awards from the Association for Natural Language Processing in 2025.

What Nishida repeatedly articulates is a vision of the future in which intelligence is not concentrated in a single massive AI; instead, a large number of AIs, each with its own distinct character, work in collaboration with people. He champions "the realization of general-purpose AI that can naturally coexist with humans in any environment," yet pursues it through the lightweight tsuzumi rather than an ultra-large model. That seemingly paradoxical choice is precisely where NTT's AI philosophy is revealed.

Case Studies ― Universities, Power, and the Government "Gennai"

The adoption of tsuzumi 2 is steadily expanding from use cases where balancing confidentiality and cost is a critical concern.

In the education sector, Tokyo Online University (東京通信大学) became the first educational institution to adopt tsuzumi 2. The university built an on-premises LLM infrastructure that operates without relying on the cloud, keeping student and faculty data within the institution, and uses it for advanced course-related Q&A, support for creating teaching materials and exams, and personalized counseling on enrollment and career matters. The ability to use AI while protecting students' personal data was the defining reason for choosing tsuzumi, which runs on-premises.

In the energy sector, on January 26, 2026, NTT Docomo Business (formerly NTT Communications) and Chugoku Electric Power announced the start of construction and verification of a power-industry-specific LLM leveraging tsuzumi 2. The plan involves training the LLM on Chugoku Electric Power's operational information and domain expertise to create a model specialized for the power industry, with full-scale deployment in mind from fiscal year 2026 onward. In the financial sector, a collaboration is also progressing that combines Fujifilm Business Innovation's document structuring technology "REiLI" with tsuzumi to handle unstructured corporate documents.

The most symbolic adoption, however, is by the government. On March 6, 2026, the Digital Agency selected tsuzumi 2 as one of seven domestically produced LLMs, chosen from 15 applicants, to be trialed in "Gennai (GENAI)," the generative AI platform for staff across all ministries and agencies. The name "Gennai" is derived from Edo-era inventor Hiraga Gennai and doubles as a play on words with "GenAI" (generative AI). tsuzumi 2 is expected to contribute as "a model with strong Japanese-language capability oriented toward practical use in business and public administration," particularly in drafting, summarizing, and organizing administrative documents and leveraging operational knowledge. The selection of a fully domestically produced model for an infrastructure handling critical national information represents a major vote of confidence for tsuzumi.

How Silicon Valley and the World View It — The Geopolitics of "Sovereign AI"

Here I want to place tsuzumi in a context that Silicon Valley venture capitalists are watching keenly right now. The keyword is "Sovereign AI": the concept of developing and operating AI under the control of one's own country's data, culture, and legal systems.

The most vocal proponent of this trend is Jensen Huang, CEO of NVIDIA, which dominates the world in AI semiconductors. At venues such as the World Government Summit, he has stated that "every country will build AI" and "nobody needs an atomic bomb, but everybody needs AI," defining Sovereign AI as "encoding your culture, your society's intelligence, its common sense, its history — your own data, owned by you." He goes so far as to advise leaders of developing nations to "encode your own language and cultural data into your own large language models." Within this worldview, which regards AI infrastructure as a national foundation, NTT's tsuzumi is positioned as a prime example of "Japan's Sovereign AI." NTT President Shimada himself espouses the very philosophy of Sovereign AI — that each country should develop technology suited to its own cultural and historical context.

VC money is flowing heavily in this direction as well. In AI investment in 2026, sovereign wealth funds such as Saudi Arabia's PIF and Abu Dhabi's Mubadala are growing their presence as major contributors to large fundraising rounds. This is because markets in each country have strong demand for AI built specifically for them, driven by concerns around data residency, regulatory compliance, and information security.

In Japan, the startups most directly comparable to tsuzumi embody this demand. Among them, Sakana AI raised a $135 million (approximately ¥20 billion) Series B on November 17, 2025, reaching a valuation of $2.65 billion (approximately ¥400 billion). Founded in 2023 by Google alumni including Llion Jones, a co-author of the "Attention Is All You Need" paper, the company specializes in creating models optimized for the Japanese language and culture using limited data and efficient post-training. Its investors include MUFG (Mitsubishi UFJ Financial Group), Khosla Ventures, NEA, Lux Capital, and In-Q-Tel, a VC affiliated with U.S. intelligence agencies. In Europe, France's Mistral AI raised a €1.7 billion (approximately ¥280 billion) Series C in September 2025 in a round led by ASML, the dominant maker of semiconductor lithography equipment, which became its largest shareholder; the round sent Mistral's valuation soaring to approximately $13.8 billion (approximately ¥2.07 trillion). That round also counted NVIDIA and Andreessen Horowitz (a16z) among its participants. The picture of nations and regions pouring enormous sums into "their own AI" reflects the global fever surrounding Sovereign AI.

Compared to these examples, tsuzumi differs fundamentally in its origins: rather than a startup chasing rapid growth on VC money, it is a model cultivated in-house by telecommunications infrastructure company NTT from a research and development base. However, it is significant that overseas media outlets — AI News, Computer Weekly, and others — uniformly praise tsuzumi as a "lightweight approach that runs on a single GPU, in contrast to hyperscaler strategies requiring tens to hundreds of GPUs," positioning it as a practical solution for organizations that cannot afford to use frontier large models. Tsuzumi is Japan's most compelling implementation of the "backlash against the relentless push toward gigantism" underway in Silicon Valley — the trend of using small, efficient models (SLMs) in task-specific deployments.

The Lineup of Competing Domestic LLMs — tsuzumi's Position

Let us also map out the domestic competitors tsuzumi faces in the sovereign AI market. The seven models selected for the Digital Agency's "Gennai" platform effectively capture the current landscape of Japan's LLM ecosystem. The chosen models are NTT DATA's "tsuzumi 2," KDDI/ELYZA's "Llama-3.1-ELYZA-JP-70B," SoftBank's "Sarashina2 mini," NEC's "cotomi v3," Fujitsu's "Takane 32B," Preferred Networks (PFN)'s "PLaMo 2.0 Prime," and Customer Cloud's "CC Gov-LLM."

Development approaches fall broadly into two camps. On one side are the full-scratch developers, such as tsuzumi and PFN's PLaMo, who build foundation models entirely in-house. On the other are the continued pre-training developers, such as ELYZA, who take Meta's Llama and continue training it on Japanese-language data. PFN is partnering with Sakura Internet and NICT to advance development toward "PLaMo 3.0 Prime," a model capable of extended reasoning, which they claim will approach the level of overseas models such as Qwen3-235B and gpt-oss-120b. SB Intuitions, a SoftBank subsidiary, fields the Sarashina line with a Mixture-of-Experts (MoE) architecture of approximately 460 billion parameters. Fujitsu's Takane pursues an enterprise-oriented path combining quantization and distillation. KDDI's ELYZA leads in commercial deployment. Each company is carving out its own niche through differentiated strengths.

Within this lineup, tsuzumi 2's positioning is clear. Rather than competing on maximum parameter count — the territory of PLaMo and Sarashina — tsuzumi 2 defines itself as a mid-weight model optimized for the practical requirements of enterprises and government agencies: the ability to run on a single GPU, deep domain knowledge in finance, healthcare, and the public sector, and on-premises deployment that keeps sensitive data from leaving the organization. What the Gennai selection reveals is that the government's preference for domestically developed models is not necessarily driven by raw performance alone, but by design philosophies centered on data sovereignty, security, and procurement requirements — and that is precisely tsuzumi's home turf.

Going Forward ― Multilingual & Voice, and the Watershed of Government Procurement in 2027

Finally, let us look ahead to where tsuzumi is heading and on what timeline.

On the technical side, NTT has indicated a policy of further improving Japanese and English processing performance while expanding supported languages, including Chinese, Korean, French, and German, to broaden its user base. tsuzumi 2 has multimodal support in its sights, handling not only text and images but also audio, and the Vision model announced in May 2026 is positioned as the first step in acquiring that "sense of sight." On the way to the future described by Nishida and others, "a world where many AIs with distinct personalities work together," the next milestones after chart and diagram comprehension are expected to be expansions into audio and more advanced reasoning.

On the business side, the greatest watershed is the timeline surrounding "Gennai," the government AI platform. According to the Digital Agency's plan, the seven selected models will begin trial use across all 39 ministries and agencies—approximately 180,000 users—around August 2026, with evaluation results to be published around January 2027. From April 2027 onward, the best-performing models are scheduled to be procured by the government on a paid basis. In other words, the key upcoming events narrow down to three moments: "the launch of large-scale trials in August 2026," "the publication of government evaluations in January 2027," and "full-scale procurement from April 2027 onward." If tsuzumi 2 performs well here, a path opens for it to become one pillar of a fully domestic AI platform spanning all ministries and agencies. Based on media reports, domestic inquiries have already reached some 2,000 cases, and the footprint is steadily expanding, centered on local governments, finance, and healthcare.

NTT accepts head-on that it cannot catch up with the global frontier in the race to scale up, and is competing on a different axis instead: "lightness," "Japanese language," and "sovereignty." In an era where, as Jensen Huang has said, every nation needs its own AI, tsuzumi 2 and tsuzumi 2 Vision are entering the moment when their true worth will be tested, as one of the most realistic options for Japan to handle its own language, culture, and confidential information with its own hands.