Drug Discovery Acceleration Driven by Synthetic Data Biobanks

Developing a new drug takes an average of 10–15 years and costs $2.6 billion (approximately ¥390 billion), with roughly 90% of clinical trials ending in failure. Eroom's Law, the observation that drug discovery productivity has halved approximately every nine years for more than half a century (the inverse of Moore's Law), is now on the verge of being fundamentally overturned. The 2024 Nobel Prize in Chemistry awarded to Demis Hassabis and John Jumper (AlphaFold) and David Baker (computational protein design) symbolizes that AI-driven drug discovery has moved past the question of whether it is possible and on to the question of when it will be realized at scale. Insilico Medicine's INS018_055 (rentosertib) has become the first fully AI-discovered and AI-designed drug candidate to reach Phase II clinical trials, and more than 100 AI-discovered molecules are now in clinical trials. Large-scale biobanks such as the UK Biobank (500,000 participants), All of Us (over 800,000), and BioBank Japan (270,000) provide genotype–phenotype "ground truth," and synthetic data technologies address data scarcity while preserving privacy. Isomorphic Labs (Alphabet) has signed deals worth up to $1.7 billion (approximately ¥255 billion) with Eli Lilly and up to $1.2 billion (approximately ¥180 billion) with Novartis, and Xaira Therapeutics was founded in 2024 with over $1 billion (approximately ¥150 billion) in seed funding. The AI drug discovery market is projected to grow from approximately $2.5 billion (approximately ¥375 billion) in 2024 to $10–12 billion (approximately ¥1.5–1.8 trillion) by 2030 (CAGR of 24–28%). BCG estimates that AI drug discovery could generate $50–100 billion (approximately ¥7.5–15 trillion) in annual value across the entire pharmaceutical value chain.
This paper comprehensively examines the full landscape of drug discovery acceleration, the roles of synthetic data and biobanks, the technologies and products of key players, enabling technologies, market data, and the outlook ahead.

Structural Challenges in Drug Discovery — Why New Drugs Are So Costly and Slow

The inefficiency of new drug development is one of the pharmaceutical industry's most deep-rooted challenges.

According to estimates from the Tufts Center for the Study of Drug Development (CSDD), the average cost of bringing a single approved drug to market is $2.6 billion, with an average timeline of 10–15 years. A 2021 analysis by BIO/QLS Advisors/Informa found that only 12% of drugs entering Phase I trials ultimately receive FDA approval. The overall clinical trial failure rate reaches approximately 90%. In oncology, the probability of advancing from Phase I to approval is a mere 5%.
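The approval rates above translate directly into how many candidates must enter the clinic to yield one approved drug. The following is a minimal Python sketch of that arithmetic, assuming each candidate succeeds independently with the quoted probability; the function name is illustrative and does not come from any of the cited studies.

```python
def candidates_per_approval(approval_probability: float) -> float:
    """Expected number of Phase I entrants needed to obtain one approval.

    Assumes independent candidates that each succeed with the same
    probability, so the expectation is simply 1 / p.
    """
    if not 0.0 < approval_probability <= 1.0:
        raise ValueError("approval_probability must be in (0, 1]")
    return 1.0 / approval_probability


overall = candidates_per_approval(0.12)   # all indications, figure quoted above
oncology = candidates_per_approval(0.05)  # oncology, figure quoted above

print(f"All indications: about {overall:.1f} Phase I candidates per approval")
print(f"Oncology: about {oncology:.0f} Phase I candidates per approval")
```

Under this simplification, the quoted rates imply roughly eight Phase I entrants per approval overall and about twenty in oncology.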

This inefficiency is summarized by "Eroom's Law" (Moore's Law spelled backwards), which holds that inflation-adjusted drug discovery productivity has roughly halved every nine years since the 1950s. The commonly cited causes are increasingly stringent regulation, the depletion of "low-hanging fruit" (easy targets already have drugs), and, critically, a shortage of high-quality, diverse biological data.
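To make the halving rule concrete, the sketch below treats Eroom's Law as a simple exponential decay with a nine-year halving period. This is an idealized reading of the trend for illustration only; the function name and the chosen time spans are assumptions, not values from the original analysis.

```python
def relative_productivity(years_elapsed: float, halving_period: float = 9.0) -> float:
    """Fraction of the starting productivity left after `years_elapsed` years,
    if productivity halves once every `halving_period` years."""
    return 0.5 ** (years_elapsed / halving_period)


for years in (9, 18, 60):
    factor = relative_productivity(years)
    print(f"After {years} years: {factor:.4f} of the starting productivity "
          f"(about 1/{1 / factor:.0f})")
```

Sixty years of halving every nine years compounds to a decline of roughly a hundredfold, which is why the trend is treated as a structural problem rather than a temporary slump.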

Global pharmaceutical R&D spending surpassed $265 billion in 2024 (per IQVIA estimates), yet the vast majority of that spending is consumed by trials that fail. The combination of biobanks, synthetic data, and AI/machine learning is now poised to fundamentally change this picture.

Biobank — The "Ground Truth" Connecting Genotypes and Phenotypes

A biobank is a research infrastructure that collects and stores genomic data, biological samples such as blood and urine, and health records from large populations over extended periods. In drug discovery, it provides the "ground truth" for validating causal relationships between genes and diseases at the population level.

UK Biobank is the world's best-known biobank. It recruited 500,000 participants between 2006 and 2010, and whole-genome sequencing (WGS) for all participants was completed in 2023. Proteomics data measuring approximately 3,000 proteins across all 500,000 participants, produced in collaboration with Olink, has also been made publicly available. The resource has attracted more than 30,000 registered researchers and supported over 10,000 approved projects and more than 8,000 peer-reviewed publications. Its cumulative budget stands at approximately £260 million (around ¥50 billion). An open-access model allows researchers worldwide to access the data.

All of Us (NIH, National Institutes of Health) targets more than 1 million participants reflecting the diversity of the United States. As of 2025, more than 800,000 people have enrolled, and more than 500,000 have provided biological samples. A landmark aspect is that racial and ethnic minorities who have historically been excluded from research account for more than 50% of participants. The budget for the first phase is $1.4 billion (approximately ¥210 billion).

BioBank Japan (BBJ) has approximately 270,000 participants and 47 target diseases, and is operated by RIKEN and the University of Tokyo. It is one of the world's largest biobanks of a non-European population and is essential for understanding the genetic structure of East Asian populations. It has contributed to the identification of more than 200 disease-associated loci.

FinnGen (Finland) has more than 500,000 participants and is particularly valuable for discovering rare variants by leveraging Finland's founder effect. It is a public-private partnership involving 13 biobanks and 11 pharmaceutical companies (AbbVie, AstraZeneca, Pfizer, etc.).

deCODE Genetics (Iceland) holds genotype data for more than 190,000 Icelanders (more than half the population), combined with over 1,000 years of genealogical data. Amgen acquired it in 2012 for $415 million (approximately ¥62.25 billion), and it has contributed to the identification of numerous drug targets.

The greatest value that biobank data brings to drug discovery is the finding that drug targets with genetic evidence have a twofold higher success rate in clinical trials (Nelson et al., Nature Genetics, 2015). An update by King et al. (2019) shows that drugs with genetic support are 2.6 times more likely to progress from Phase I to approval.
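A "twofold higher success rate" of this kind is a ratio of two success rates. The sketch below shows how such a relative success figure is calculated from counts of drug programs. The counts are HYPOTHETICAL placeholders chosen to give a ratio of 2; they are not the data from Nelson et al. or King et al., and the function name is illustrative.

```python
def relative_success(approved_with: int, total_with: int,
                     approved_without: int, total_without: int) -> float:
    """Ratio of the success rate of genetically supported programs to the
    success rate of programs without genetic support."""
    rate_with = approved_with / total_with
    rate_without = approved_without / total_without
    return rate_with / rate_without


# HYPOTHETICAL counts, for illustration only:
# 20 of 100 genetically supported programs approved vs. 10 of 100 unsupported.
ratio = relative_success(20, 100, 10, 100)
print(f"Relative success: {ratio:.1f}x")
```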

Synthetic Data — Breaking Through the Walls of Privacy and Scarcity

Synthetic data is artificially generated data that mimics the statistical properties of real patient data without identifying any individual. It is intended to satisfy privacy regulations such as HIPAA (United States) and GDPR (EU) by design.

Synthetic data offers three main benefits in drug discovery. The first is relief from data scarcity. More than 7,000 rare diseases are known worldwide, but many have patient populations in the hundreds, which makes conventional clinical trial design impractical. Synthetic data methods can generate cohorts of thousands of patients whose statistical properties match this small amount of real data. The second is privacy-protected data sharing: multiple medical institutions can collaborate using synthetic data without exchanging real patient records. The third is clinical trial simulation (in silico trials). Unlearn.AI has an FDA-approved approach to generating synthetic control groups (digital twins) that can reduce control group size by 20–30% and save $10–50 million (approximately ¥1.5–7.5 billion) per trial.
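The idea of generating a large cohort from a small real one can be illustrated with a deliberately simple model: fit a two-variable Gaussian to a handful of records and sample from it. This is a toy sketch and not the method of any company named in this article. The eight "real" records are HYPOTHETICAL, the function names are invented for illustration, and a model this simple offers no formal privacy guarantee.

```python
import math
import random
import statistics

# HYPOTHETICAL "real" cohort: (age in years, biomarker level) for 8 patients.
REAL_COHORT = [(34, 1.9), (41, 2.4), (47, 2.2), (52, 3.1),
               (58, 3.0), (63, 3.8), (69, 3.5), (74, 4.4)]


def fit_bivariate_gaussian(rows):
    """Estimate means, standard deviations, and correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def sample_synthetic(params, n, seed=0):
    """Draw n synthetic (age, biomarker) pairs with the fitted statistics."""
    mx, my, sx, sy, rho = params
    rng = random.Random(seed)
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))  # guard against rounding
    cohort = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        age = mx + sx * z1
        biomarker = my + sy * (rho * z1 + residual * z2)  # correlated with age
        cohort.append((age, biomarker))
    return cohort


params = fit_bivariate_gaussian(REAL_COHORT)
synthetic = sample_synthetic(params, n=5000)
refit = fit_bivariate_gaussian(synthetic)
print(f"real mean age {params[0]:.1f}, synthetic mean age {refit[0]:.1f}")
print(f"real correlation {params[4]:.2f}, synthetic correlation {refit[4]:.2f}")
```

The synthetic cohort reproduces the means and the correlation of the source records without copying any of them. The validation and bias concerns discussed later in this article apply even to this toy case, because the sample can only reflect what the eight source records contain.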

Key companies include Syntegra (San Francisco, synthetic EHR data, raised approximately $17 million in Series A), MDClone (Israel/US, "ADAMS" platform, adopted by Mayo Clinic and others, approximately $63 million raised to date), Gretel.ai (San Diego, differential privacy guarantees, raised approximately $68 million in Series B), Mostly AI (Vienna, GDPR-focused, approximately $31 million raised), and Datavant (San Francisco, data network connecting over 70,000 hospitals, raised over $110 million).

The synthetic data healthcare market is projected to reach approximately $1.2–1.5 billion (approximately ¥180–225 billion) in 2025 and expand to $4–5.5 billion (approximately ¥600–825 billion) by 2030 (CAGR 25–28%). Gartner predicts that by 2030, synthetic data will surpass real data in training AI models.

Major Companies in AI Drug Discovery — A New Era Pioneered by Technology

In the AI drug discovery space, billion-dollar companies are proliferating and transforming every stage of the pharmaceutical value chain.

Recursion Pharmaceuticals (Salt Lake City, NASDAQ: RXRX) has raised more than $1.5 billion in total and announced a merger with Exscientia (approximately $688 million) in August 2024, making it one of the world's largest AI drug discovery companies. The company says it holds "the world's largest proprietary biological and chemical dataset," comprising trillions of data points, and has formed a multi-year partnership with NVIDIA (which invested $50 million). More than eight programs are in clinical stages.

Insilico Medicine (Hong Kong) is an iconic presence in AI drug discovery. INS018_055 (rentosertib) has reached Phase II clinical trials as the world's first fully AI-discovered and AI-designed drug, with both the target discovered by AI (PandaOmics) and the molecule designed by AI (Chemistry42). Targeting idiopathic pulmonary fibrosis (IPF), Phase IIa reported acceptable safety and early efficacy signals. The company signed a deal with Sanofi worth up to $1.2 billion (approximately ¥180 billion).

Isomorphic Labs (London, Alphabet/DeepMind), led by CEO Demis Hassabis, applies AlphaFold technology to drug discovery. In January 2024, the company signed deals worth up to $1.7 billion (approximately ¥255 billion) with Eli Lilly and up to $1.2 billion (approximately ¥180 billion) with Novartis. AlphaFold 3 (announced May 2024) can predict the structures of protein-ligand, protein-DNA, and protein-RNA complexes, with capabilities directly applicable to drug discovery.

Xaira Therapeutics (San Francisco/Seattle) was founded in 2024 with over $1 billion in seed funding. Investors include ARCH Venture Partners, Foresite Capital, Sequoia Capital, and Lightspeed Venture Partners. The company licensed IP from the David Baker laboratory (University of Washington Institute for Protein Design) and aims to build foundation models for biology. It represents one of the largest founding rounds in biotech startup history.

Generate Biomedicines (Somerville, MA) was founded in 2020 by Flagship Pioneering, the creator of Moderna. Having raised over $573 million, it designs protein therapeutics from scratch using generative AI. Its diffusion model "Chroma," published in Nature, generates proteins with specified properties.

Absci (Vancouver, WA, NASDAQ: ABSI) specializes in antibody design using generative AI. In Nature Biotechnology (2023), the company published the first demonstration of zero-shot design of antibodies that bind a target without a starting antibody. It signed a deal with AstraZeneca worth up to $610 million (approximately ¥91.5 billion).

In Japan, Takeda Pharmaceutical has formed partnerships with Recursion, Schrödinger, and Exscientia (pre-merger), investing over $500 million in data and digital transformation. Daiichi Sankyo is collaborating with Preferred Networks (PFN) to apply AI to the optimization of ADCs (antibody-drug conjugates). Sumitomo Pharma partnered with Exscientia to develop DSP-1181 (an OCD therapeutic) — one of the earliest examples of an AI-designed molecule to enter Phase I clinical trials.

Component Technologies — From AlphaFold to Biological Foundation Models

The evolution of core technologies supporting AI drug discovery is remarkable.

AlphaFold 2 (2020) solved the 50-year-old grand challenge of protein structure prediction, and in collaboration with EMBL-EBI, predicted and published the structures of more than 200 million proteins. AlphaFold 3 (May 2024) adopts a diffusion architecture to predict the structures of protein-ligand and protein-DNA/RNA complexes. The accuracy of protein-ligand interaction prediction improved by more than 50% over conventional methods. The 2024 Nobel Prize in Chemistry awarded to Hassabis, Jumper, and Baker symbolizes the milestone this field has reached.

Molecular generation via diffusion models is at the cutting edge of AI drug discovery. RFdiffusion (David Baker Lab, Nature 2023) generates novel protein structures with specified properties. DiffDock (MIT, 2023) applies diffusion models to molecular docking and surpasses conventional docking software. Chroma (Generate Biomedicines) is a generative model for protein structures.
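The common idea behind these models is to learn to reverse a gradual noising process. The sketch below shows only the forward (noising) step of the standard denoising diffusion formulation, applied to toy 3D atom coordinates. It is an illustration of the general technique under stated assumptions: RFdiffusion, DiffDock, and Chroma each use their own, more elaborate representations, the schedule values are conventional defaults rather than values taken from those papers, and all function names here are hypothetical.

```python
import math
import random


def linear_beta_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly over the schedule."""
    return [beta_start + (beta_end - beta_start) * t / (steps - 1)
            for t in range(steps)]


def cumulative_alpha_bar(betas):
    """Running product of (1 - beta): the share of signal left at each step."""
    out, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        out.append(product)
    return out


def noise_coordinates(coords, alpha_bar_t, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [tuple(signal * c + noise * rng.gauss(0, 1) for c in atom)
            for atom in coords]


alpha_bars = cumulative_alpha_bar(linear_beta_schedule(1000))
toy_atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.2, 1.3, 0.0)]  # toy values
rng = random.Random(0)
for t in (0, 499, 999):
    print(t, noise_coordinates(toy_atoms, alpha_bars[t], rng))
```

At early steps the coordinates are almost unchanged, and by the final step they are close to pure Gaussian noise. A generative model is trained to run this process in reverse, starting from noise and arriving at a plausible structure.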

Large language models for biology are on the rise. ESM-2/ESMFold (Meta AI) is trained on more than 250 million protein sequences and directly predicts structure from sequence. ProGen/ProGen2 (Salesforce Research) generates functional protein sequences and has demonstrated that the generated proteins function as active enzymes. Evo (2024), from the Arc Institute (co-founded by Patrick Collison), is a genomic foundation model trained on 2.7 million genomes, capable of generating gene- and genome-scale DNA sequences.

Clinical trials using digital twins are also advancing. Unlearn.AI generates patient digital twins from historical trial data and constructs synthetic control arms using FDA-approved covariate adjustment methods. This reduces the required control group size by 20–30%, saving time and cost.
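One way to see where a reduction of this size could come from is the textbook variance-reduction result for covariate adjustment. If a prognostic score (here, a digital twin's predicted outcome) has correlation rho with the observed outcome, adjusting for it shrinks the residual variance, and with it the required sample size, by roughly a factor of 1 - rho^2. The sketch below applies that approximation. It is an illustration under that assumption, not Unlearn.AI's actual procedure, and the function names are hypothetical.

```python
import math


def sample_size_factor(rho: float) -> float:
    """Approximate fraction of the unadjusted sample size still required
    after adjusting for a covariate with correlation rho to the outcome."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must be between -1 and 1")
    return 1.0 - rho ** 2


def correlation_for_reduction(reduction: float) -> float:
    """Correlation needed to cut the required sample size by `reduction`."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return math.sqrt(reduction)


for reduction in (0.20, 0.30):
    rho = correlation_for_reduction(reduction)
    print(f"{reduction:.0%} smaller group needs rho of about {rho:.2f} "
          f"(check: factor = {sample_size_factor(rho):.2f})")
```

Under this approximation, a 20–30% reduction corresponds to a prognostic score whose correlation with the outcome is roughly 0.45 to 0.55, which indicates how predictive the digital twin needs to be.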

Silicon Valley VC Perspective — "Biology Has Become Information Science"

Silicon Valley VCs are positioning AI drug discovery as a "once-in-a-generation investment opportunity."

a16z Bio (Andreessen Horowitz), led by Vijay Pande (inventor of Folding@home, former Stanford professor), assembled over $1.5 billion in Bio funds between 2019 and 2023. Operating under the thesis that "software will eat biology," the firm has invested in Insitro, Freenome, and others. Pande writes and speaks extensively on "engineering biology."

Flagship Pioneering (Cambridge, MA) is the venture creation firm behind Moderna (peak market cap exceeding $150 billion). Led by CEO Noubar Afeyan, it manages over $10 billion in cumulative capital. Generate Biomedicines stands as its flagship AI-biology investment.

ARCH Venture Partners, led by Robert Nelsen and Kristina Burow, has invested in Illumina (as an early backer) and led Xaira Therapeutics' $1 billion-plus round. The firm has stated: "Biology has become an information science. The convergence of AI and biology is the greatest once-in-a-generation investment opportunity."

Total VC investment in AI drug discovery grew from $4.6 billion in 2020 to $6.7 billion in 2024 (PitchBook, BioCentury), with 2025 projected to reach $7–8 billion.

Jensen Huang (NVIDIA CEO) declared at his GTC 2024 keynote, "The next computing platform is biology. Every pharmaceutical company will become a technology company." NVIDIA's BioNeMo platform and GPUs (H100/B200) are serving as the "picks and shovels" of AI drug discovery.

Perspectives from Notable Figures — From Nobel Laureates to VC Entrepreneurs

Demis Hassabis (CEO of Google DeepMind/Isomorphic Labs, 2024 Nobel Prize in Chemistry) has stated that "AlphaFold is the most important work I have ever been involved in — it is the most impactful application of AI," predicting that "within five years, AI will dramatically accelerate the early stages of drug discovery, and within ten years, the entire process will be fundamentally transformed."

David Baker (University of Washington, 2024 Nobel Prize in Chemistry) said in his Nobel lecture: "We have entered an era where we can design molecules from scratch that do things evolution never explored. The combination of computational design and AI is transforming what is possible in medicine."

Daphne Koller (CEO of Insitro, Professor at Stanford) explains: "The reason drug discovery is so expensive is that we run experiments in humans because we lack sufficiently good predictive models. If we can build better predictive models with machine learning, we can move faster, cheaper, and fail earlier."

Eric Topol (Director, Scripps Research Translational Institute) states: "We are at an inflection point. The convergence of rich genomic data from biobanks and AI will transform how drugs are discovered," predicting that "by 2030, not having AI in your drug discovery pipeline will be like not having email in the office — it will no longer be an option."

AI Drug Discovery by the Numbers — A Rapidly Expanding Market and Clinical Outcomes

The numbers behind AI drug discovery tell the story of a rapidly expanding field.

The AI drug discovery market is projected to grow from approximately $2.5 billion in 2024 to $10–12 billion by 2030 (CAGR of 24–28%, Precedence Research). The broader AI in pharma market is estimated to exceed $20 billion by 2032 (Grand View Research).
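The growth rate implied by these endpoints can be checked with the standard compound annual growth rate formula. The sketch below is a quick sanity check using only the figures quoted above; the function name is illustrative.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


years = 2030 - 2024
low = cagr(2.5, 10.0, years)   # $2.5B in 2024 to $10B in 2030
high = cagr(2.5, 12.0, years)  # $2.5B in 2024 to $12B in 2030
print(f"Implied CAGR: {low:.1%} to {high:.1%}")
```

With these endpoints the implied rate is roughly 26% to 30% per year, broadly in line with the quoted 24–28%. Published CAGR figures depend on the exact base year and base value each research firm uses, so small differences of this kind are expected.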

Clinical progress for AI-discovered molecules is accelerating. As of early 2025, AI-discovered molecules in clinical trials numbered more than 50 in Phase I, 15–20 in Phase II, and 2–3 in Phase III. The first FDA approval is anticipated between 2026 and 2028. The cost of whole-genome sequencing has fallen to approximately $200 per genome (compared with roughly $3 billion for the Human Genome Project). The biobank market is expected to expand from roughly $3.5–4 billion in 2024 to $6–7 billion by 2030.

Japan Trends — The Intersection of Biobanks and Pharmaceutical AI

Japan is a key player in both biobank infrastructure and AI-driven drug discovery.

BioBank Japan hosts approximately 270,000 participants and covers 47 target diseases, making it an indispensable resource for understanding the genetic structure of East Asian populations. It also contributes to trans-ethnic genome-wide association studies (GWAS) alongside UK Biobank and All of Us.

Tohoku Medical Megabank (ToMMo) was established as a reconstruction project following the 2011 Great East Japan Earthquake and has approximately 150,000 participants. Its Japanese reference genome panel (3.5KJPNv2/8.3KJPN) is essential for genomics research specific to the Japanese population.

The Japanese government's "Genomic Medicine Implementation Strategy" (formulated in 2019, updated in 2023) targets over 100,000 cancer genome sequences, and AMED (Japan Agency for Medical Research and Development) supports drug discovery research with an annual budget of approximately 400 billion yen.

Preferred Networks (PFN) has partnered with Daiichi Sankyo and AMED in AI-driven drug discovery and has particular strengths in deep learning for molecular simulation. It is one of Japan's largest AI startups, with a valuation exceeding $3.5 billion.

Challenges — The Hurdles Behind the Optimism

The future of AI drug discovery is bright, but significant challenges remain.

Data bias is the greatest concern. Approximately 78% of GWAS participants are of European ancestry (as of 2023, though this is improving), and UK Biobank participants are healthier, wealthier, and more likely to be white than the UK population as a whole. Synthetic data can propagate and amplify the biases present in its training data.

Regulatory uncertainty also persists. As of early 2026, no AI-discovered drug has received full FDA/EMA approval. The FDA has published frameworks, but no regulatory pathway specifically tailored to AI-discovered drugs has been established.

Validation of synthetic data is another challenge. How can one prove that synthetic data accurately reflects real-world biology? There is a risk that "hallucinated" patterns — statistical artifacts — may be embedded in synthetic data.

Integration with traditional pharma is also far from straightforward. AI predictions require wet lab validation (the "last-mile problem"), and cultural resistance in the form of "this is how we've always done it" remains deeply entrenched.

Future Outlook — The Future of Drug Discovery Is Changing

Industry leaders are showing an unusually unified optimism about the future of AI-driven drug discovery.

2026–2028: The first FDA approval of an AI-discovered drug is expected (the leading candidates are INS018_055 and the Recursion/Exscientia programs). According to McKinsey, AI drug discovery will become standard practice at major pharmaceutical companies.

2028–2030: Foundation models for biology will reach their "GPT-3 moment" — general-purpose biological AI will become fine-tunable for any drug discovery task. The integration of real-world biobank data, synthetic data, and AI models will become standard operating procedure in pharmaceutical R&D.

2032–2035: Drug discovery timelines will be compressed to 3–5 years for many indications, with costs reduced by 50–70%. Drugs for rare diseases will become economically viable (currently, approximately 95% of more than 7,000 rare diseases have no approved treatments).

Alex Zhavoronkov, CEO of Insilico Medicine, says: "We have proven that AI can discover and design drugs. The question is no longer 'can it be done' but 'how quickly can we scale it.'" The day that question is answered is no longer in the distant future.

Impact on the Industry

First, the combination of biobanks and synthetic data is structurally dismantling the "data wall" in drug discovery. Real-world data from UK Biobank (500,000 participants, full WGS completed), All of Us (800,000+ participants, more than 50% from racial and ethnic minority groups), and BioBank Japan (270,000 participants) provides "ground truth," while synthetic data overcomes constraints of privacy and data scarcity. The finding that drug targets with genetic evidence have a 2–2.6× higher clinical success rate underpins the economic rationale for this combination.

Second, AI drug discovery has moved beyond the "proof of concept" stage and entered the "clinical validation" stage. More than 100 AI-discovered molecules are in clinical trials, and the first FDA approval is expected between 2026 and 2028. Isomorphic Labs' deals with major pharmaceutical companies, worth up to $2.9 billion in total, and Xaira Therapeutics' founding round of over $1 billion demonstrate the depth of confidence in this field.

Third, the Nobel Prize in Chemistry awarded to Hassabis, Jumper, and Baker signifies that the convergence of AI and biology has been recognized at the highest academic level. AlphaFold 3's diffusion architecture has capabilities directly applicable to drug discovery, and together with generative models such as RFdiffusion and Chroma, it is making the era of "designing molecules from scratch" a reality.

Fourth, through the presence of BioBank Japan, ToMMo, AMED, and PFN, Japan holds a unique position at the intersection of East Asian population genomics and AI drug discovery. The AI partnerships of Takeda, Daiichi Sankyo, and Sumitomo Pharma demonstrate that Japan's pharmaceutical industry is actively participating in this transformation.

References: Tufts CSDD Drug Development Cost Study (2020), BIO/QLS Advisors Clinical Trial Success Rates (2021), IQVIA Global R&D Spending Report (2024), Eroom's Law (Scannell et al., Nature Reviews Drug Discovery, 2012), UK Biobank Open Access Data, NIH All of Us Research Program, BioBank Japan/RIKEN, FinnGen Public-Private Partnership, deCODE Genetics/Amgen, Nelson et al. "The support of human genetic evidence for approved drug indications" (Nature Genetics, 2015), King et al. update (2019), Recursion-Exscientia Merger Announcement (Aug 2024), Insilico INS018_055 Phase II Results, Isomorphic Labs-Lilly-Novartis Deals (Jan 2024), Xaira Therapeutics $1B+ Launch, AlphaFold 3 Release (May 2024), Nobel Prize Chemistry 2024, Generate Biomedicines Chroma (Nature 2023), Absci Zero-Shot Antibody Design (Nature Biotechnology 2023), RFdiffusion (Nature 2023), Evo Model (Arc Institute 2024), Unlearn.AI FDA Guidance, Syntegra/MDClone/Gretel.ai/Datavant Company Data, a16z Bio Fund, Flagship Pioneering, ARCH Venture Partners, PitchBook/BioCentury AI Drug Discovery Funding Data, NVIDIA BioNeMo, Jensen Huang GTC 2024, Demis Hassabis Nobel Lecture, David Baker Nobel Lecture, Daphne Koller a16z Podcast (2023), Eric Topol "Ground Truths" Substack, Patrick Collison Arc Institute, Precedence Research AI Drug Discovery Market, Grand View Research, ToMMo Japanese Reference Genome, AMED Budget Data, Preferred Networks/Daiichi Sankyo Partnership, Takeda Digital Transformation, Sumitomo Pharma/Exscientia DSP-1181