The world's first foundational dataset purpose-built for biological AI

Our knowledge graph is driving a step change in the performance of AI-based biological design – surpassing speed, cost and complexity limits that were previously considered insurmountable

We design more complex systems than anyone else

We are designing genomes and proteins with far greater novelty, controllability, and performance than the previous state of the art. Our advances come from our unique, proprietary dataset of real organisms and the diligent care we take to teach AI as much as possible about the underlying biology.

We're delivering step changes across R&D, from personalised medicine to plastic degradation

The deep learning models we build can, for the first time ever, understand complete biological context. This allows our platform to transform biological R&D outcomes in fields as varied and important as genome editing, biocatalysis, and agriculture.

More accurate structure predictions

than Google DeepMind's AlphaFold2, unlocking more reliable small molecule docking for larger and more complex proteins than ever before


40% more proteins annotated

than all other state-of-the-art algorithms, including CLEAN and Google's ProteInfer, allowing us to discover and classify the most difficult dark matter sequences


Controllable sequence generation

that leverages our dataset's superior diversity, context, and quality to design proteins and genomic systems to best match our partners' desired function and performance

Drawing from the best possible foundational data

AI performance improves dramatically with more diversity and more context

The dataset that a model is trained on defines the AI’s ‘imagination’: its ability to think creatively to solve problems.

Put simply, the AI will, by definition, never be able to ‘think outside this box’: it can only ever recapitulate and reorder patterns that it has already been exposed to. Expanding the training dataset therefore quite literally enlarges the box it can think within.
Out of necessity, all foundational models in biology today are trained on the same public datasets. Lacking diversity, consistency, context, quality, and clear commercialisation rights, these public datasets are seriously unfit for the AI era.

Two-thirds of all public sequencing data come from just 12 species, while there are at least a trillion species on our planet. 10% of all enzyme classes have only a single sequence representative in public datasets.

For generative AI applications, these are among the most severe class imbalance problems in machine learning. That's why we've created a categorically superior dataset that gives us categorically superior performance.
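To illustrate what this kind of skew looks like in practice, the concentration and singleton rate of a species-labelled sequence corpus can be measured in a few lines. The counts below are invented toy numbers for illustration only, not real dataset statistics:

```python
from collections import Counter

# Toy species labels for a hypothetical public sequence corpus
# (illustrative numbers only; not real dataset statistics).
sequences_by_species = (
    ["E. coli"] * 400 + ["H. sapiens"] * 260 + ["M. musculus"] * 140
    + ["S. cerevisiae"] * 100 + ["rare species %d" % i for i in range(100)]
)

counts = Counter(sequences_by_species)
total = sum(counts.values())

# Share of the corpus held by the few most-sequenced species.
top4_share = sum(c for _, c in counts.most_common(4)) / total

# Fraction of species with a single representative sequence --
# analogous to enzyme classes with only one known member.
singletons = sum(1 for c in counts.values() if c == 1) / len(counts)

print(f"top-4 species hold {top4_share:.0%} of sequences")
print(f"{singletons:.0%} of species are singletons")
```

In this toy corpus, four species account for 90% of sequences while almost every other species is a singleton, which is the shape of imbalance the figures above describe.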
Learn more
Technology highlights

Basecamp Research's platform applied across the bioeconomy

Protein Evolution, Inc. and Basecamp Research Aim to Make Polyurethane and Nylon Infinitely Recyclable with Expanded Strategic Collaboration
The companies will develop novel enzymes that break down difficult-to-recycle plastic waste to solve a major bottleneck in the fashion industry.
JM announces partnership with Basecamp Research to accelerate the adoption of biocatalysis solutions
This partnership combines Johnson Matthey’s catalysis expertise with Basecamp Research’s AI-enabled biodiversity genetic mapping to meet the growing demands of the pharmaceutical and chemicals industries.
Expanding the repertoire of recombinases that can integrate large DNA fragments into the human genome by over 30X
CRISPR for gene editing applications is beyond doubt a milestone in modern biotechnology. Moving beyond edits, recent discoveries have enabled ‘gene writing’ technology: the insertion of large DNA ‘cargoes’ into host genomes.
Deep evolutionary context

Our knowledge graph has 4X more sequence diversity and 20X more genomic context than public resources

Continuous growth

Our unique Nagoya-compliant data supply chain covers 5 continents and over 60% of global biomes

Case studies

[ High-impact outcomes for protein design ]

- Reducing development time from 2 years to 1 month
- Zero-shot multi-domain design, beating best-in-class & freedom-to-operate for therapeutics customers
- Zero-shot design of ultra-rare chemistries
- Gene-writing therapeutic systems with safe human integration sites
Dive into the details

Recent publications & views

Improving AlphaFold2 Performance with a Global Metagenomic & Biological Data Supply Chain
Our dataset captures higher protein sequence diversity than existing public data; we apply this advantage to the protein folding problem by supplementing MSAs during AlphaFold2 inference.
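The supplementation step itself amounts to appending extra aligned homologs to the alignment that AlphaFold2 consumes. A minimal sketch, assuming a3m-format text; the function name, sequence names, and toy sequences here are all hypothetical, not the published pipeline:

```python
def supplement_msa(public_a3m: str, extra_hits: dict) -> str:
    """Append extra aligned homologs to an existing a3m-format MSA.

    extra_hits maps sequence names to aligned sequences sharing the
    query's alignment columns (in a3m, lowercase marks insertions).
    """
    lines = public_a3m.rstrip().splitlines()
    for name, seq in extra_hits.items():
        lines.append(f">{name}")
        lines.append(seq)
    return "\n".join(lines) + "\n"

# Hypothetical public alignment plus one proprietary homolog.
public = ">query\nMKTAYIAK\n>hit_1\nMKTA-IAK\n"
extra = {"bc_hit_1": "MKSAYLAK"}
msa = supplement_msa(public, extra)
```

The enriched `msa` string would then be handed to the folding model in place of the public alignment alone.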
HiFi-NN annotates the microbial dark matter with Enzyme Commission numbers
Here, we present HiFi-NN (Hierarchically-Finetuned Nearest Neighbor search), which annotates protein sequences with greater precision and recall than all existing deep learning methods.
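The nearest-neighbour idea behind this style of annotator can be sketched in a few lines. The 2-D embeddings and EC labels below are made-up toy values; the actual method uses fine-tuned protein language model embeddings and hierarchical labels:

```python
import math

# Toy reference embeddings with EC-number labels (hypothetical values).
reference = [
    ([1.0, 0.0], "1.1.1.1"),
    ([0.0, 1.0], "2.7.11.1"),
    ([0.7, 0.7], "3.4.21.4"),
]

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def annotate(query):
    """Assign the EC number of the most similar labeled reference."""
    return max(reference, key=lambda r: cosine(query, r[0]))[1]

print(annotate([0.9, 0.1]))  # closest to the first reference
```

A query embedding simply inherits the label of its nearest labeled neighbour; the published method refines this with hierarchical fine-tuning of the embedding space.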
ZymCTRL: a conditional language model for the controllable generation of artificial enzymes
Pre-print from our first collaboration with Dr Noelia Ferruz on an enzyme-specific language model to provide new opportunities to design purpose-built artificial enzymes.
Explore the world with us

We target our expeditions based on your requirements

We know the exact geological and chemical properties of the locations where various protein classes can be found. When you partner with us, we proactively find samples in places predicted to have a high hit rate for you.
[ A map of connected locations that Basecamp Research has sampled ]