Benchmark
Measured environment: Python 3.12, Apple M-series, chematic v0.4.26, RDKit 2026.03.3.
Summary
| Metric | chematic | RDKit |
|---|---|---|
| Import time | ~35 ms | ~400 ms (11×) |
| SMILES parse — 5,000 mol | ~5 ms | ~50 ms (10×) |
| ECFP4 batch — 10,000 mol | 36 ms | ~500 ms (14×) |
| Descriptor accuracy vs RDKit | 19 metrics 100% — MW/HBA/HBD/TPSA/LogP/ARC/RotB/Spiro/Bridge/… (4,999-mol) | baseline |
| Install | pip install chematic | conda or cmake |
| C/C++ dependencies | Zero | Required |
| WASM binary size | 504 KB | ~30 MB |
1. Startup Time (import latency)
Cold-process import time measured by spawning a fresh Python subprocess per sample (5 samples, median reported). No module-cache warm-up.
| | chematic | RDKit |
|---|---|---|
| import only | ~35 ms | ~400 ms |
| import + first parse | ~38 ms | ~430 ms |
| Speedup | ~11× | baseline |
Why chematic is faster: chematic is a single PyO3 extension module with no transitive Python dependencies. RDKit initialises multiple C++ modules, reads SMARTS definition files, and triggers Boost data-structure setup on first import.
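The cold-import protocol described in this section (a fresh subprocess per sample, median of 5) could be sketched roughly as follows. `cold_run_ms` is a hypothetical helper name, not part of chematic or its benchmark scripts, and this version times the whole subprocess, so it includes interpreter startup; whether the published figures subtract that baseline is not stated here. Standard-library modules stand in for the real targets so the sketch runs anywhere.

```python
import statistics
import subprocess
import sys
import time


def cold_run_ms(code: str, samples: int = 5) -> float:
    """Median wall-clock time (ms) to run `code` in a fresh Python process.

    Hypothetical helper for illustration; not taken from the chematic repo.
    """
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        # A new interpreter per sample: nothing is cached in sys.modules.
        subprocess.run([sys.executable, "-c", code], check=True)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


if __name__ == "__main__":
    # Bare interpreter startup, useful as a reference point.
    baseline = cold_run_ms("pass")
    print(f"interpreter only: {baseline:.1f} ms")
    # Stand-in modules; swap in "chematic" or "rdkit" where installed.
    for module in ("json", "decimal"):
        total = cold_run_ms(f"import {module}")
        print(f"import {module}: {total:.1f} ms (incl. interpreter startup)")
```

Reporting the median rather than the mean keeps a single slow sample (for example, a disk-cache miss on the first run) from dominating the result.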
2. SMILES Parse Throughput
Timed on the built-in 20-molecule diverse corpus repeated to 5,000 total parses. Warm-up pass excluded.
| N | chematic | RDKit | Speedup |
|---|---|---|---|
| 1,000 | ~1 ms | ~10 ms | ~10× |
| 5,000 | ~5 ms | ~50 ms | ~10× |
| 10,000 | ~10 ms | ~100 ms | ~10× |
Per-molecule: ~1 µs/mol (chematic) vs ~10 µs/mol (RDKit).
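The timing protocol above (a small corpus repeated up to N parses, with the warm-up pass excluded) might look like the sketch below. `time_parses` is a hypothetical name, the three SMILES strings are only placeholders for the 20-molecule corpus, and `str.upper` stands in for a real parser so the example runs without either toolkit installed.

```python
import time
from typing import Callable, Sequence, Tuple


def time_parses(
    parse: Callable[[str], object],
    corpus: Sequence[str],
    n_total: int,
) -> Tuple[float, float]:
    """Return (total_ms, us_per_mol) for n_total parses cycled from corpus."""
    # Repeat the small corpus until the workload reaches n_total entries.
    workload = [corpus[i % len(corpus)] for i in range(n_total)]

    # Warm-up pass over the corpus; deliberately not timed.
    for smiles in corpus:
        parse(smiles)

    start = time.perf_counter()
    for smiles in workload:
        parse(smiles)
    elapsed_s = time.perf_counter() - start

    return elapsed_s * 1e3, elapsed_s * 1e6 / n_total


if __name__ == "__main__":
    corpus = ["CCO", "c1ccccc1", "CC(=O)O"]  # placeholder corpus
    for n in (1_000, 5_000, 10_000):
        # Assumption: pass chematic.from_smiles or Chem.MolFromSmiles here.
        total_ms, us_per_mol = time_parses(str.upper, corpus, n)
        print(f"N={n}: {total_ms:.3f} ms total, {us_per_mol:.3f} us/mol")
```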
chematic
import chematic
mols = [chematic.from_smiles(s) for s in smiles_list]
# or batch:
mols = chematic.from_smiles_list(smiles_list)
3. Speed — ECFP4 Fingerprint Generation (batch)
chematic parallelises fingerprint generation across all CPU cores with Rayon, so the speedup over a single-threaded Python loop grows with batch size.
| Molecules (N) | chematic (bulk.ecfp4) | RDKit (Python loop) | Speedup |
|---|---|---|---|
| 100 | 0.36 ms | 2 ms | 5× |
| 1,000 | 3.6 ms | 20 ms | 5× |
| 10,000 | 36 ms | ~500 ms | ~14× |
Per-molecule: 3.6 µs/mol (chematic) vs 20–50 µs/mol (RDKit).
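As a quick check on how the speedup and per-molecule figures follow from the table, the arithmetic can be reproduced directly. The timings are copied from the table above; `speedup_rows` is a hypothetical helper name.

```python
# Timings in milliseconds, copied from the ECFP4 table: N -> (chematic, RDKit).
timings_ms = {
    100: (0.36, 2.0),
    1_000: (3.6, 20.0),
    10_000: (36.0, 500.0),
}


def speedup_rows(timings):
    """Derive speedup and per-molecule cost (microseconds) for each batch size."""
    rows = []
    for n, (chematic_ms, rdkit_ms) in timings.items():
        rows.append(
            {
                "n": n,
                "speedup": rdkit_ms / chematic_ms,
                "chematic_us_per_mol": chematic_ms * 1000.0 / n,
                "rdkit_us_per_mol": rdkit_ms * 1000.0 / n,
            }
        )
    return rows


for row in speedup_rows(timings_ms):
    print(
        f"N={row['n']}: {row['speedup']:.1f}x, "
        f"{row['chematic_us_per_mol']:.1f} vs {row['rdkit_us_per_mol']:.1f} us/mol"
    )
```

The ratios come out at roughly 5.6× for the two smaller batches and 13.9× at 10,000 molecules. The chematic cost stays at 3.6 µs/mol throughout, while the RDKit figure moves from 20 to 50 µs/mol, which is where the 20–50 µs/mol range comes from.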
RDKit
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

# Morgan fingerprint with radius 2 (ECFP4-equivalent), folded to 2048 bits
fps = [
    rdMolDescriptors.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, 2048)
    for s in smiles_list
]
4. Descriptor Accuracy vs RDKit
Tested on a 4,999-molecule ChEMBL-derived SMILES corpus (scripts/bench5k.py). See Validation for full per-metric breakdown.
| Descriptor | Agreement | Tolerance |
|---|---|---|
| Molecular weight | 100% | exact |
| Heavy atom count | 100% | exact |
| H-bond donors (HBD) | 100% | exact |
| H-bond acceptors (HBA) | 100% | exact |
| TPSA | 100% | ±0.1 Ų |
| LogP (Crippen) | 100% | exact* |
| MR (molar refractivity) | 100% | ±0.01 |
| Fsp3 | 100% | ±0.001 |
| Aromatic ring count | 100% | exact |
| Aliphatic ring count | 100% | exact |
| Saturated ring count | 100% | exact |
| Rotatable bonds | 100% | exact |
| Num heteroatoms | 100% | exact |
| Num spiro atoms | 100% | exact |
| Num bridgehead atoms | 100% | exact |
| Num amide bonds | 100% | exact |
| Aromatic/aliphatic heterocycles | 100% | exact |
| Num stereocenters (legacy) | 99.98% | exact |
| Num stereocenters (new CIP) | 98.7% | exact |
| [nH] SMARTS match | 100% | precision/recall |
19 of 19 metrics reach ≥98.7% on the 4,999-molecule ChEMBL corpus (RDKit 2026.03.3).
Stereocenters are reported against two RDKit oracles:
- Legacy CalcNumAtomStereoCenters: 99.98% (4998/4999). The 1 discrepancy is a polyester
where chematic correctly identifies 4 stereocenters while legacy misses 2 (confirmed by
FindPotentialStereo).
- New CIP FindPotentialStereo: 98.7% (4932/4999). The new oracle counts cage/bridgehead
atoms as potential stereocenters in 67 molecules; chematic and legacy both correctly
exclude these false positives.
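The agreement percentages in this section can be read as "fraction of molecules whose value matches the RDKit value within the stated tolerance". A minimal sketch of that calculation follows, assuming the tolerance is an absolute difference; `agreement_percent` is a hypothetical name and is not taken from scripts/bench5k.py.

```python
from typing import Sequence


def agreement_percent(
    candidate: Sequence[float],
    reference: Sequence[float],
    tolerance: float = 0.0,
) -> float:
    """Percentage of positions where |candidate - reference| <= tolerance."""
    if len(candidate) != len(reference):
        raise ValueError("both value lists must cover the same molecules")
    if not reference:
        raise ValueError("cannot compute agreement on an empty corpus")
    matches = sum(
        abs(c - r) <= tolerance for c, r in zip(candidate, reference)
    )
    return 100.0 * matches / len(reference)


# Made-up TPSA values, compared with a tolerance of 0.1:
print(agreement_percent([20.23, 37.30, 63.60], [20.30, 37.30, 64.00], 0.1))

# The stereocenter figures follow from the match counts quoted above:
print(round(100.0 * 4998 / 4999, 2))  # legacy oracle
print(round(100.0 * 4932 / 4999, 1))  # new CIP oracle
```

The last two lines give 99.98 and 98.7, matching the reported stereocenter agreement.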
5. Installation & Deployment
| | chematic | RDKit |
|---|---|---|
| Python | pip install chematic | conda install -c conda-forge rdkit |
| C/C++ compiler | Not required | Required (Boost) |
| Docker image size delta | ~4 MB | ~200 MB+ |
| GitHub Actions | Single pip line | Separate conda setup step |
| JavaScript / WASM | npm install @kent-tokyo/chematic (504 KB) | No official package |
| Browser deployment | Yes | No |
6. Feature Comparison
| Feature | chematic | RDKit |
|---|---|---|
| pKa prediction | Built-in (15 SMARTS rules) | External tool required |
| ADMET profile (BBB, Caco-2, hERG, CYP3A4) | Built-in | External tool required |
| MCP server (AI agent integration) | 15 tools | Not available |
| LSH approximate nearest-neighbour index | Built-in | Not available |
| IUPAC name generation | Built-in (offline) | Not available |
| Browser / WASM deployment | Yes (504 KB) | No |
| ECFP4 batch speed | 5–14× faster | Baseline |
| SMARTS atom map :N | Yes | Yes |
| Retrosynthesis (template-based) | 60 retro-SMIRKS built-in | External tool |
| File formats | 20+ | 100+ |
| 3D conformer quality | Good (ETKDG rules) | Better (ML-assisted) |
| Community & publications | Growing | Established (20+ years) |
7. Batch Descriptor Computation
chematic.bulk.descriptors returns 55+ descriptors per molecule including ADMET and pKa — all in parallel.
chematic.bulk.descriptors_array returns selected columns as numpy arrays (~25% faster for column-oriented access).
| N | chematic (bulk.descriptors) | Descriptors per call |
|---|---|---|
| 100 | ~10 ms | 55+ (incl. pKa, ADMET) |
| 1,000 | ~50 ms | 55+ (incl. pKa, ADMET) |
import chematic
import pandas as pd
# list-of-dicts (general purpose)
df = pd.DataFrame(chematic.bulk.descriptors(smiles_list))
# columnar numpy arrays (faster for specific columns)
result = chematic.bulk.descriptors_array(smiles_list, ["mw", "logp", "tpsa", "hba"])
df = pd.DataFrame(result) # float64 / bool arrays, no per-molecule dict overhead
Compound screening
# One call bundles lipinski / veber / pains / brenk / qed / sa_score:
results = chematic.screen(smiles_list, profile="druglike")
passing = [r for r in results if r["overall_pass"]]
Large SDF files (streaming)
# iter_sdf() streams one record at a time, without loading the whole file:
for rec in chematic.iter_sdf("large.sdf"):
    print(rec.smiles, rec.get("Activity"))
# batch pipeline:
for batch in chematic.iter_sdf_batched("large.sdf", batch_size=1000):
    descs = chematic.bulk.descriptors([r.smiles for r in batch])
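To illustrate why streaming keeps memory flat, here is a generic pure-Python sketch of record-at-a-time SDF reading and batching. It is not chematic's implementation; `iter_sdf_records` and `batched` are hypothetical names, and the only format fact relied on is that each SDF record ends with a `$$$$` line.

```python
import io
from typing import Iterable, Iterator, List, TextIO


def iter_sdf_records(handle: TextIO) -> Iterator[str]:
    """Yield one SDF record (raw text) at a time from an open text file."""
    lines: List[str] = []
    for line in handle:
        if line.rstrip("\r\n") == "$$$$":
            # Record terminator reached: emit the buffered record and reset.
            yield "".join(lines)
            lines = []
        else:
            lines.append(line)
    # Emit a trailing record that lacks a terminator, unless it is blank.
    if any(text.strip() for text in lines):
        yield "".join(lines)


def batched(records: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group a record stream into lists of at most batch_size items."""
    batch: List[str] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Tiny stand-in for a file on disk; real records contain a full molblock.
fake_sdf = io.StringIO("mol A\n$$$$\nmol B\n$$$$\nmol C\n$$$$\n")
for group in batched(iter_sdf_records(fake_sdf), batch_size=2):
    print(len(group), "records in this batch")
```

Only one record (or one batch) is held in memory at a time, so peak memory depends on the batch size rather than the file size.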
RDKit's rdkit.Chem.Descriptors.CalcMolDescriptors covers ~200 descriptors but does not include pKa or ADMET.