The world's first foundational dataset purpose-built for biological AI
Our knowledge graph is driving a step change in the performance of AI-based biological design – surpassing speed, cost and complexity limits that were previously considered unattainable
We design more complex systems than anyone else
We are designing genomes and complex genomic systems with far greater novelty, accuracy and end performance than the previous state of the art. Our advances come from our hard work, our data and the diligent care we take to teach AI as much as possible about the underlying biology.
We are designing genomes and proteins with far greater novelty, controllability, and performance than the previous state of the art. Our advances come from our unique, proprietary dataset of real organisms and the diligent care we take to teach AI as much as possible about the underlying biology.
We're delivering step changes across R&D, from personalised medicine to plastic degradation
The deep learning models we build can, for the first time ever, understand complete biological context. This allows our platform to transform biological R&D outcomes in fields as varied and important as genome editing, biocatalysis, and agriculture.
More accurate structure predictions
than Google DeepMind's AlphaFold2, unlocking more reliable small molecule docking for larger and more complex proteins than ever before
40% more proteins annotated
than all other state-of-the-art algorithms, including CLEAN and Google's ProteInfer, allowing us to discover and classify the most difficult dark matter sequences
Controllable sequence generation
that leverages our dataset's superior diversity, context, and quality to design proteins and genomic systems to best match our partners' desired function and performance
Drawing from the best possible foundational data
AI performance improves dramatically with more diversity and more context
The dataset that a model is trained on defines the AI’s ‘imagination’: its ability to think creatively to solve problems.
Put simply, the AI can never ‘think outside this box’, as it can only ever recapitulate and reorder patterns that it has already been exposed to. Expanding the training dataset therefore enlarges the box itself, giving the AI more room to think.
Out of necessity, all foundational models in biology today are trained on the same public datasets. Lacking diversity, consistency, context, quality, and clear commercialisation rights, these public datasets are seriously unfit for the AI era.
Two-thirds of all public sequencing data come from just 12 species, while there are at least a trillion species on our planet. 10% of all enzyme classes have only a single sequence representative in public datasets.
For generative AI applications, these are probably the worst class imbalance problems ever encountered. That's why we've created a categorically superior dataset that gives us categorically superior performance.
Technology highlights
Basecamp Research's platform applied across the bioeconomy
Protein Evolution, Inc. and Basecamp Research Aim to Make Polyurethane and Nylon Infinitely Recyclable with Expanded Strategic Collaboration
The companies will develop novel enzymes that break down difficult-to-recycle plastic waste to solve a major bottleneck in the fashion industry.
JM announces partnership with Basecamp Research to accelerate the adoption of biocatalysis solutions
This partnership combines Johnson Matthey’s catalysis expertise with Basecamp Research’s AI-enabled biodiversity genetic mapping to meet the growing demand of the pharmaceutical and chemicals industry.
Expanding the repertoire of recombinases that can integrate large DNA fragments into the human genome by over 30X
CRISPR for gene editing applications is beyond doubt a milestone in modern biotechnology. Moving beyond edits, recent discoveries have enabled ‘gene writing’ technology, that is, the insertion of large DNA ‘cargoes’ into host genomes.
Deep evolutionary context
Our knowledge graph has 4X more sequence diversity and 20X more genomic context than public resources
Continuous growth
Our unique Nagoya-compliant data supply chain covers 5 continents and over 60% of global biomes
Case studies
[ High-impact outcomes for protein design ]
[ Selecting 3-4 examples from this notion page: https://www.notion.so/Basecamp-Research-Protein-Design-GenAI-Overview-7097d8e7c2134dc799fdb7fc751ede52?pvs=4:
- Reducing development time from 2 years to 1 month
- Zero-shot multi-domain design, beating best-in-class & freedom-to-operate for therapeutics customers
- Zero-shot design of ultra-rare chemistries
- Gene-writing therapeutic systems with safe human integration sites ]
Dive into the details
Recent publications & views
Improving AlphaFold2 Performance with a Global Metagenomic & Biological Data Supply Chain
With higher protein sequence diversity captured in this dataset compared to existing public data, we apply this data advantage to the protein folding problem by MSA supplementation during inference of AlphaFold2.
HiFi-NN annotates the microbial dark matter with Enzyme Commission numbers
Here, we present HiFi-NN (Hierarchically-Finetuned Nearest Neighbor search), which annotates protein sequences with greater precision and recall than all existing deep learning methods.
ZymCTRL: a conditional language model for the controllable generation of artificial enzymes
Pre-print from our first collaboration with Dr Noelia Ferruz on an enzyme-specific language model to provide new opportunities to design purpose-built artificial enzymes.