satoru.bio · DigiShield Labs

Satoru

悟る — to perceive, to awaken to understanding

Satoru uses large language models to extract and synthesise fragmented biological knowledge at scale — making decades of specialised research queryable and accessible for the first time. We begin where data is richest and most opaque: deep time.

Datasets

SMILE · Phase One Complete

Systematic Microbiome Intelligence for Lost Ecosystems

SMILE is the first comprehensive, queryable database of prehistoric oral microbiomes. It aggregates ancient samples from the published literature into a standardised, spatially indexed corpus with authentication metadata. This makes cross-study comparative research tractable by resolving the metadata fragmentation that has blocked systematic analysis of human-microbe co-evolution across deep time.

1,247 · Oral samples

Publications

100k+ · Year span

PostGIS · Spatial index

~49,000 BP · Neanderthal to Medieval Europe

Satoru Isotopes · In Development

Stable Isotope Database · δ¹³C · δ¹⁵N · δ¹⁸O · δ³⁴S

Satoru Isotopes is a comprehensive, open-access database of stable isotope values extracted from archaeological publications, and the first to aggregate δ¹³C, δ¹⁵N, δ¹⁸O, and δ³⁴S measurements from human and faunal remains across the British Isles and Europe. Extraction is validated against the PI's own doctoral dataset before the full run. The target is 2,000–3,000 publications and 10,000+ individual measurements.
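
Validation against a reference dataset can be pictured as matching each known value to its extracted counterpart and counting agreements within a tolerance. The sketch below is illustrative only: the `IsotopeValue` fields, the `agreement_rate` function, and the 0.1‰ default tolerance are hypothetical and are not taken from Satoru's codebase.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class IsotopeValue:
    """One measurement; field names are assumptions for this sketch."""
    sample_id: str
    isotope: str         # e.g. "d13C", "d15N", "d18O", "d34S"
    value_permil: float  # delta value in per mil


def agreement_rate(
    extracted: Iterable[IsotopeValue],
    reference: Iterable[IsotopeValue],
    tol: float = 0.1,
) -> float:
    """Fraction of reference values reproduced by extraction within `tol`.

    A reference value that is missing from the extracted set counts as a miss.
    """
    ref = {(r.sample_id, r.isotope): r.value_permil for r in reference}
    if not ref:
        raise ValueError("reference dataset is empty")
    got = {(e.sample_id, e.isotope): e.value_permil for e in extracted}
    hits = sum(
        1
        for key, true_value in ref.items()
        if key in got and abs(got[key] - true_value) <= tol
    )
    return hits / len(ref)
```

Keying on the pair of sample identifier and isotope means a value assigned to the wrong sample is scored as a miss rather than a near match, which is usually the stricter and more useful reading for extraction quality.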

Approach

AI-Powered Extraction

A tiered large language model pipeline processes each scientific publication through a sequence of stages: classification, sample reconciliation, taxonomic composition extraction, authentication scoring, and methodological metadata capture. Model selection is calibrated to task complexity, and quality is validated against known ground-truth datasets.
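
One way to picture tiered routing is a cheap-first rule that escalates to a larger model when confidence is low. This is a minimal sketch under stated assumptions, not Satoru's implementation: the stage names follow the description above, while the "small"/"large" tier labels, the tier assigned to each stage, the `call_model` callable, the `relevant` flag, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Stage:
    name: str
    tier: str  # "small" or "large": assumed labels for model size and cost


# Stage order follows the pipeline description; tier assignments are guesses.
STAGES = [
    Stage("classification", "small"),
    Stage("sample_reconciliation", "large"),
    Stage("taxonomic_composition", "large"),
    Stage("authentication_scoring", "large"),
    Stage("methodological_metadata", "small"),
]

# A model call is a plain callable so the sketch runs without any LLM client.
# It takes (tier, stage name, text) and returns (output, confidence in [0, 1]).
ModelCall = Callable[[str, str, str], tuple[dict, float]]


def run_pipeline(text: str, call_model: ModelCall, min_conf: float = 0.8) -> dict:
    """Run the stages in order, escalating low-confidence small-tier results."""
    record: dict = {}
    for stage in STAGES:
        output, conf = call_model(stage.tier, stage.name, text)
        if conf < min_conf and stage.tier == "small":
            # Retry the same stage once on the larger tier.
            output, conf = call_model("large", stage.name, text)
        record[stage.name] = {"output": output, "confidence": conf}
        # Skip the remaining stages when the paper is classified as out of scope.
        if stage.name == "classification" and not output.get("relevant", False):
            break
    return record
```

Passing the model call in as a parameter keeps the routing logic testable with a stub, so the same function can be exercised against a ground-truth set before any real model is attached.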

Domain Expertise as Ground Truth

Satoru is built and validated by a specialist in bioarchaeological data. The principal investigator holds a PhD in Archaeology, specialising in stable isotope analysis and organic residue analysis, and so has the disciplinary grounding to assess and correct extraction outputs directly.

Open Science

All aggregated databases are freely accessible under a CC BY 4.0 licence. Extraction methodology will be published for peer review and community replication. The infrastructure is designed to be extended, forked, and adapted across biological domains beyond the initial scope.

Spatial Infrastructure

PostgreSQL 16 with PostGIS enables geographic and temporal queries across the corpus, supporting regional comparisons, site-level drill-downs, and spatiotemporal visualisation. Each sample is georeferenced as a point in SRID 4326 (WGS 84) and linked to archival sequence accessions where available.
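
A combined geographic and temporal query might look like the sketch below, which builds a parameterised radius search around a point. The schema is an assumption made for illustration: the `samples` table and its `sample_id`, `site_name`, `age_bp`, and `geom` columns are hypothetical names, with `geom` assumed to be a PostGIS point in SRID 4326. `ST_DWithin`, `ST_SetSRID`, and `ST_MakePoint` are standard PostGIS functions.

```python
# Hypothetical schema: samples(sample_id, site_name, age_bp, geom),
# where geom is geometry(Point, 4326) and age_bp is years before present.
# Casting to geography makes ST_DWithin measure distance in metres.
NEARBY_SAMPLES_SQL = """
SELECT sample_id, site_name, age_bp
FROM samples
WHERE ST_DWithin(
        geom::geography,
        ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
        %s
      )
  AND age_bp BETWEEN %s AND %s
ORDER BY age_bp DESC;
"""


def nearby_samples_params(
    lon: float,
    lat: float,
    radius_km: float,
    youngest_bp: int,
    oldest_bp: int,
) -> tuple:
    """Validate inputs and return parameters in placeholder order.

    ST_MakePoint takes (x, y), which is (longitude, latitude) in SRID 4326.
    """
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        raise ValueError("longitude/latitude out of range")
    if radius_km <= 0:
        raise ValueError("radius must be positive")
    if youngest_bp > oldest_bp:
        raise ValueError("youngest_bp must not exceed oldest_bp")
    return (lon, lat, radius_km * 1000.0, youngest_bp, oldest_bp)
```

The `%s` placeholders follow the style used by DB-API drivers such as psycopg, where the statement and the parameter tuple are passed together to `cursor.execute`, so values are never interpolated into the SQL string by hand.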

Collaborate or Inquire