Benchmark

Measured environment: Python 3.12, Apple M-series, chematic v0.4.26, RDKit 2026.03.3.


Summary

| Metric | chematic | RDKit |
|---|---|---|
| Import time | ~35 ms | ~400 ms (~11× slower) |
| SMILES parse (5,000 mol) | ~5 ms | ~50 ms (~10× slower) |
| ECFP4 batch (10,000 mol) | 36 ms | ~500 ms (~14× slower) |
| Descriptor accuracy vs RDKit | 19 of 19 metrics at ≥98.7% agreement, 100% for all but stereocenter counts: MW/HBA/HBD/TPSA/LogP/ARC/RotB/Spiro/Bridge/… (4,999-molecule corpus) | baseline |
| Install | pip install chematic | conda or cmake |
| C/C++ dependencies | None | Required |
| WASM binary size | 504 KB | ~30 MB |

1. Startup Time (import latency)

Cold-process import time measured by spawning a fresh Python subprocess per sample (5 samples, median reported). No module-cache warm-up.

|  | chematic | RDKit |
|---|---|---|
| import only | ~35 ms | ~400 ms |
| import + first parse | ~38 ms | ~430 ms |
| Speedup | ~11× | baseline |

How to reproduce

python scripts/bench_startup.py --runs 5

Why chematic is faster: chematic is a single PyO3 extension module with no transitive Python dependencies. RDKit initialises multiple C++ modules, reads SMARTS definition files, and triggers Boost data-structure setup on first import.
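
The cold-process protocol described in this section (fresh subprocess per sample, median reported) can be sketched roughly as follows. This is not the contents of scripts/bench_startup.py: the function names are hypothetical, the stdlib module `json` stands in for the library under test, and subtracting an empty-interpreter baseline is an assumption about how one might isolate import cost.

```python
import statistics
import subprocess
import sys
import time


def _fresh_process_ms(code: str) -> float:
    """Wall-clock time (ms) to run `code` in a brand-new Python process."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", code], check=True)
    return (time.perf_counter() - start) * 1000.0


def cold_import_ms(module: str, runs: int = 5) -> float:
    """Median time over `runs` fresh processes to start Python and import `module`."""
    return statistics.median([_fresh_process_ms(f"import {module}") for _ in range(runs)])


def startup_baseline_ms(runs: int = 5) -> float:
    """Median time to start Python and do nothing (interpreter overhead only)."""
    return statistics.median([_fresh_process_ms("pass") for _ in range(runs)])


if __name__ == "__main__":
    # "json" is a stand-in; swap in the module you want to measure.
    total = cold_import_ms("json")
    baseline = startup_baseline_ms()
    print(f"process + import: {total:.1f} ms, interpreter only: {baseline:.1f} ms")
```

Because every sample runs in a new process, nothing imported by an earlier sample can be reused, which is what makes the measurement "cold". Note that the operating system's file cache still warms up after the first sample, which is one reason to report a median rather than a mean.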


2. SMILES Parse Throughput

Timed on the built-in 20-molecule diverse corpus, repeated until it reaches N total parses (N = 5,000 for the headline figure). The warm-up pass is excluded from the timing.

| N | chematic | RDKit | Speedup |
|---|---|---|---|
| 1,000 | ~1 ms | ~10 ms | ~10× |
| 5,000 | ~5 ms | ~50 ms | ~10× |
| 10,000 | ~10 ms | ~100 ms | ~10× |

Per-molecule: ~1 µs/mol (chematic) vs ~10 µs/mol (RDKit).

chematic

import chematic
mols = [chematic.from_smiles(s) for s in smiles_list]
# or batch:
mols = chematic.from_smiles_list(smiles_list)

RDKit

from rdkit import Chem
mols = [Chem.MolFromSmiles(s) for s in smiles_list]

How to reproduce

python scripts/bench_smiles_parse.py --n 5000 --rdkit
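
For readers who want to time their own parser the same way, here is a minimal sketch of the protocol described in this section: repeat a small corpus to N entries, run one untimed warm-up pass, then time the N parses. The helper `parse_throughput` is hypothetical and is not taken from scripts/bench_smiles_parse.py; the stand-in parser in the usage lines is `len`, so the block runs without chematic or RDKit installed.

```python
import time
from typing import Callable, Sequence, Tuple


def parse_throughput(
    parse: Callable[[str], object], corpus: Sequence[str], n: int
) -> Tuple[float, float]:
    """Return (total_ms, us_per_molecule) for `n` parses, warm-up excluded."""
    # Cycle through the corpus until exactly n inputs have been collected.
    inputs = [corpus[i % len(corpus)] for i in range(n)]

    # Warm-up pass over the corpus; deliberately not timed.
    for smiles in corpus:
        parse(smiles)

    start = time.perf_counter()
    for smiles in inputs:
        parse(smiles)
    elapsed_s = time.perf_counter() - start

    return elapsed_s * 1e3, elapsed_s * 1e6 / n


if __name__ == "__main__":
    # Ethanol, benzene, acetic acid; `len` is only a placeholder for a real parser.
    corpus = ["CCO", "c1ccccc1", "CC(=O)O"]
    total_ms, us_per_mol = parse_throughput(len, corpus, 5000)
    print(f"{total_ms:.2f} ms total, {us_per_mol:.3f} us/mol")
```

To compare libraries, pass `chematic.from_smiles` or `Chem.MolFromSmiles` as `parse` with the same corpus and N.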

3. Speed — ECFP4 Fingerprint Generation (batch)

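chematic parallelises the batch across all CPU cores with Rayon. In the measurements below the speedup grows with batch size, from roughly 5.6× at N ≤ 1,000 to ~14× at N = 10,000.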

| Molecules (N) | chematic (bulk.ecfp4) | RDKit (Python loop) | Speedup |
|---|---|---|---|
| 100 | 0.36 ms | 2 ms | ~5.6× |
| 1,000 | 3.6 ms | 20 ms | ~5.6× |
| 10,000 | 36 ms | ~500 ms | ~14× |

Per-molecule: 3.6 µs/mol (chematic) vs 20–50 µs/mol (RDKit).
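
The per-molecule figures and speedups follow directly from the table above; the short sketch below redoes that arithmetic. The timings are copied from the table (the 500 ms entry is approximate in the source), and `derive` is a hypothetical helper name.

```python
# (N, chematic ms, RDKit ms), copied from the ECFP4 table in this section.
ROWS = [(100, 0.36, 2.0), (1_000, 3.6, 20.0), (10_000, 36.0, 500.0)]


def derive(n: int, chematic_ms: float, rdkit_ms: float):
    """Return (speedup, chematic us/mol, RDKit us/mol) for one table row."""
    speedup = rdkit_ms / chematic_ms
    chematic_us = chematic_ms * 1000.0 / n
    rdkit_us = rdkit_ms * 1000.0 / n
    return speedup, chematic_us, rdkit_us


for n, c_ms, r_ms in ROWS:
    speedup, c_us, r_us = derive(n, c_ms, r_ms)
    print(f"N={n}: {speedup:.1f}x, {c_us:.1f} vs {r_us:.0f} us/mol")
```

The three rows work out to about 5.6×, 5.6× and 13.9×, with chematic constant at 3.6 µs/mol and RDKit rising from 20 to 50 µs/mol, which is where the "20–50 µs/mol" range comes from.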

chematic

import chematic
fps = chematic.bulk.ecfp4(smiles_list)  # (N, 2048) uint8 numpy array

RDKit

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
fps = [rdMolDescriptors.GetMorganFingerprintAsBitVect(
           Chem.MolFromSmiles(s), 2, 2048)
       for s in smiles_list]

How to reproduce

python scripts/benchmark_vs_rdkit.py --rdkit

4. Descriptor Accuracy vs RDKit

Tested on a 4,999-molecule ChEMBL-derived SMILES corpus (scripts/bench5k.py). See Validation for full per-metric breakdown.

| Descriptor | Agreement | Tolerance |
|---|---|---|
| Molecular weight | 100% | exact |
| Heavy atom count | 100% | exact |
| H-bond donors (HBD) | 100% | exact |
| H-bond acceptors (HBA) | 100% | exact |
| TPSA | 100% | ±0.1 Ų |
| LogP (Crippen) | 100% | exact* |
| MR (molar refractivity) | 100% | ±0.01 |
| Fsp3 | 100% | ±0.001 |
| Aromatic ring count | 100% | exact |
| Aliphatic ring count | 100% | exact |
| Saturated ring count | 100% | exact |
| Rotatable bonds | 100% | exact |
| Num heteroatoms | 100% | exact |
| Num spiro atoms | 100% | exact |
| Num bridgehead atoms | 100% | exact |
| Num amide bonds | 100% | exact |
| Aromatic/aliphatic heterocycles | 100% | exact |
| Num stereocenters (legacy) | 99.98% | exact |
| Num stereocenters (new CIP) | 98.7% | exact |
| [nH] SMARTS match | 100% | precision/recall |

19 of 19 metrics reach ≥98.7% agreement on the 4,999-molecule ChEMBL corpus (RDKit 2026.03.3). Stereocenters are reported against two RDKit oracles:

- Legacy CalcNumAtomStereoCenters: 99.98% (4998/4999). The single discrepancy is a polyester where chematic correctly identifies 4 stereocenters while the legacy oracle misses 2 (confirmed by FindPotentialStereo).
- New CIP FindPotentialStereo: 98.7% (4932/4999). The new oracle counts cage/bridgehead atoms as potential stereocenters in 67 molecules; chematic and the legacy oracle both correctly exclude these false positives.
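
To make the "Agreement" and "Tolerance" columns concrete, the sketch below shows one plausible way such a percentage could be computed: a molecule counts as agreeing when the two values differ by no more than the tolerance, with 0 meaning exact. The function `agreement_pct` is hypothetical, and whether scripts/bench5k.py uses exactly this rule is an assumption.

```python
from typing import Sequence


def agreement_pct(
    ours: Sequence[float], reference: Sequence[float], tol: float = 0.0
) -> float:
    """Percentage of positions where |ours - reference| <= tol (tol=0 means exact)."""
    if len(ours) != len(reference):
        raise ValueError("value lists must have the same length")
    if not ours:
        raise ValueError("value lists must not be empty")
    hits = sum(1 for a, b in zip(ours, reference) if abs(a - b) <= tol)
    return 100.0 * hits / len(ours)


if __name__ == "__main__":
    # Synthetic counts reproducing the stereocenter ratios quoted above.
    legacy = agreement_pct([1] * 4998 + [0], [1] * 4999)
    new_cip = agreement_pct([1] * 4932 + [0] * 67, [1] * 4999)
    print(f"legacy: {legacy:.2f}%  new CIP: {new_cip:.1f}%")
```

With 4998 of 4999 matches this gives 99.98%, and with 4932 of 4999 it gives 98.7% to one decimal place, matching the two stereocenter rows.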

How to reproduce

# Requires RDKit and the 5k SMILES file
python scripts/bench5k.py path/to/SMILES.csv --detail

5. Installation & Deployment

|  | chematic | RDKit |
|---|---|---|
| Python | pip install chematic | conda install -c conda-forge rdkit |
| C/C++ compiler | Not required | Required (Boost) |
| Docker image size delta | ~4 MB | ~200 MB+ |
| GitHub Actions | Single pip line | Separate conda setup step |
| JavaScript / WASM | npm install @kent-tokyo/chematic (504 KB) | No official package |
| Browser deployment | Yes | No |

6. Feature Comparison

| Feature | chematic | RDKit |
|---|---|---|
| pKa prediction | Built-in (15 SMARTS rules) | External tool required |
| ADMET profile (BBB, Caco-2, hERG, CYP3A4) | Built-in | External tool required |
| MCP server (AI agent integration) | 15 tools | Not available |
| LSH approximate nearest-neighbour index | Built-in | Not available |
| IUPAC name generation | Built-in (offline) | Not available |
| Browser / WASM deployment | Yes (504 KB) | No |
| ECFP4 batch speed | 5–14× faster | Baseline |
| SMARTS atom map :N | Yes | Yes |
| Retrosynthesis (template-based) | 60 retro-SMIRKS built-in | External tool |
| File formats | 20+ | 100+ |
| 3D conformer quality | Good (ETKDG rules) | Better (ML-assisted) |
| Community & publications | Growing | Established (20+ years) |

7. Batch Descriptor Computation

chematic.bulk.descriptors returns 55+ descriptors per molecule including ADMET and pKa — all in parallel. chematic.bulk.descriptors_array returns selected columns as numpy arrays (~25% faster for column-oriented access).

| N | chematic (bulk.descriptors) | Descriptors per call |
|---|---|---|
| 100 | ~10 ms | 55+ (incl. pKa, ADMET) |
| 1,000 | ~50 ms | 55+ (incl. pKa, ADMET) |

import chematic
import pandas as pd

# list-of-dicts (general purpose)
df = pd.DataFrame(chematic.bulk.descriptors(smiles_list))

# columnar numpy arrays (faster for specific columns)
result = chematic.bulk.descriptors_array(smiles_list, ["mw", "logp", "tpsa", "hba"])
df = pd.DataFrame(result)   # float64 / bool arrays, no per-molecule dict overhead
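
The difference between the two return shapes can be illustrated in plain Python. The sketch below does not call chematic; `to_columns` is a hypothetical helper and the numeric values are placeholders rather than real descriptor values. It only shows what "list of dicts" versus "one array per column" means for downstream code.

```python
from typing import Dict, List, Sequence


def to_columns(
    records: Sequence[Dict[str, float]], keys: Sequence[str]
) -> Dict[str, List[float]]:
    """Pivot a list of per-molecule dicts into {column name: list of values}."""
    return {key: [rec[key] for rec in records] for key in keys}


if __name__ == "__main__":
    # Placeholder rows standing in for per-molecule descriptor dicts.
    rows = [
        {"mw": 1.0, "logp": 2.0, "tpsa": 3.0},
        {"mw": 4.0, "logp": 5.0, "tpsa": 6.0},
    ]
    print(to_columns(rows, ["mw", "tpsa"]))
```

A row-oriented result needs one dict lookup per molecule per column to build each column, whereas a columnar result already holds each column contiguously, which is the overhead the columnar variant avoids.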

Compound screening

# One call bundles lipinski / veber / pains / brenk / qed / sa_score:
results = chematic.screen(smiles_list, profile="druglike")
passing = [r for r in results if r["overall_pass"]]

Large SDF files (streaming)

# iter_sdf() streams one record at a time — no full-file load:
for rec in chematic.iter_sdf("large.sdf"):
    print(rec.smiles, rec.get("Activity"))

# batch pipeline:
for batch in chematic.iter_sdf_batched("large.sdf", batch_size=1000):
    descs = chematic.bulk.descriptors([r.smiles for r in batch])
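
For intuition on how a batched iterator keeps memory bounded, here is a generic sketch that groups any stream into lists of at most `batch_size` items. It is not chematic's implementation of iter_sdf_batched; the helper name `batched` is hypothetical, and a plain `range` stands in for the record stream.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to `batch_size` items from `records`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        # Pull at most batch_size items; only this batch is held in memory.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    for batch in batched(range(5), 2):
        print(batch)
```

The final batch may be shorter than `batch_size`. On Python 3.12 and later, the standard library's `itertools.batched` offers the same grouping but yields tuples instead of lists.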

For comparison, RDKit's rdkit.Chem.Descriptors.CalcMolDescriptors covers ~200 descriptors but does not include pKa or ADMET predictions.