kgcnn.data.datasets package
Submodules
kgcnn.data.datasets.ClinToxDataset module
class kgcnn.data.datasets.ClinToxDataset.ClinToxDataset(reload=False, verbose: int = 10, label_index: int = 0)
Bases: kgcnn.data.datasets.MoleculeNetDataset2018.MoleculeNetDataset2018
Store and process the ‘ClinTox’ dataset from the MoleculeNet database. The class inherits from MoleculeNetDataset2018 and downloads the dataset on class initialization.
Compare reference: DeepChem, reading: Qualitative data of drugs approved by the FDA and those that have failed clinical trials for toxicity reasons. Random splitting is recommended for this dataset.
The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity. The dataset includes two classification tasks for 1491 drug compounds with known chemical structures:
clinical trial toxicity (or absence of toxicity)
FDA approval status.
The list of FDA-approved drugs is compiled from the SWEETLEAD database, and the list of drugs that failed clinical trials for toxicity reasons is compiled from the Aggregate Analysis of ClinicalTrials.gov (AACT) database.
References
Gayvert, Kaitlyn M., Neel S. Madhukar, and Olivier Elemento. A data-driven approach to predicting successes and failures of clinical trials. Cell chemical biology 23.10 (2016): 1294-1301.
Artemov, Artem V., et al. Integrated deep learned transcriptomic and structure-based predictor of clinical trials outcomes. bioRxiv (2016): 095653.
Novick, Paul A., et al. SWEETLEAD: an in silico database of approved drugs, regulated chemicals, and herbal isolates for computer-aided drug discovery. PloS one 8.11 (2013): e79568.
Aggregate Analysis of ClinicalTrials.gov (AACT) Database. https://www.ctti-clinicaltrials.org/aact-database .
__init__(reload=False, verbose: int = 10, label_index: int = 0)
Initialize ClinTox dataset.
Parameters
reload (bool) – Whether to reload the data and make new dataset. Default is False.
verbose (int) – Print progress or info for processing where 60=silent. Default is 10.
label_index (int) – Which information should be taken as binary label. Default is 0. Determines the positive class label, which is ‘approved’ by default.
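The effect of label_index can be pictured with a small, dependency-free sketch. It assumes, hypothetically, that the two ClinTox tasks are stored as the CSV columns FDA_APPROVED and CT_TOX (in that order) and that label_index simply picks one of them. The helper select_binary_label and the two toy rows are illustrative only and are not part of kgcnn.

```python
import csv
import io

# Toy CSV with the assumed column layout; the rows are illustrative, not real ClinTox entries.
CSV_TEXT = "smiles,FDA_APPROVED,CT_TOX\nCCO,1,0\nCCN,0,1\n"

# Assumed order of the two classification tasks (index 0 = approval status).
LABEL_COLUMNS = ["FDA_APPROVED", "CT_TOX"]


def select_binary_label(rows, label_index=0):
    """Return the binary label of every row for the task chosen by label_index."""
    column = LABEL_COLUMNS[label_index]
    return [int(row[column]) for row in rows]


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
approved = select_binary_label(rows, label_index=0)  # default task: 'approved'
toxic = select_binary_label(rows, label_index=1)     # second task: toxicity
```

With the default label_index=0 the positive class corresponds to ‘approved’, matching the parameter description above.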
read_in_memory(**kwargs)
Load the list of molecules from the cached SDF-file into memory. The file name must be given in file_name and the path information in the constructor of this class.
It further checks the csv-file for graph labels specified by label_column_name. Labels that do not have valid smiles or a molecule in the SDF-file are also skipped, but added as None to keep the index and the molecule assignment.
Set further molecular attributes or features by string identifier. Requires MolecularGraphRDKit. Default values are features that have been used by Luo et al (2019).
The argument additional_callbacks allows adding custom properties to each element of the dataset. It is a dictionary whose string keys are the names of the properties and whose values are callable function objects which define how the property is derived from either the MolecularGraphRDKit or the corresponding row of the original CSV file. Those callback functions accept two parameters:
mg: The MolecularGraphRDKit instance of the molecule.
ds: A pandas data series that matches the data in the CSV file for the specific molecule.
Example:
    from os import linesep
    import numpy as np
    # MoleculeNetDataset and OneHotEncoder must be imported from kgcnn beforehand.

    csv = f"index,name,label,smiles{linesep}1,Propanolol,1,[Cl].CC(C)NCC(O)COc1cccc2ccccc12"
    with open('/tmp/moleculenet_example.csv', mode='w') as file:
        file.write(csv)

    dataset = MoleculeNetDataset('/tmp', 'example', 'moleculenet_example.csv')
    dataset.prepare_data(smiles_column_name='smiles')
    dataset.read_in_memory(label_column_name='label')
    dataset.set_attributes(
        nodes=['Symbol'],
        encoder_nodes={'Symbol': OneHotEncoder(['C', 'O'], dtype='str')},
        edges=['BondType'],
        encoder_edges={'BondType': int},
        additional_callbacks={
            # It is important that the callbacks return a numpy array, even if it is just a single element.
            'name': lambda mg, ds: np.array(ds['name'], dtype='str')
        }
    )

    mol: dict = dataset[0]
    mol['node_attributes']  # np array of one hot encoded atom type per node
    mol['edge_attributes']  # int value representing the bond type
    mol['name']  # Array of a single string which is the name from the original CSV data
- Parameters
label_column_name (str) – Column name in the csv-file where to take graph labels from. For multi-targets you can supply a list of column names or positions. A slice can be provided for selecting columns as graph labels. Default is None.
nodes (list) – A list of node attributes as string. In place of names also functions can be added.
edges (list) – A list of edge attributes as string. In place of names also functions can be added.
graph (list) – A list of graph attributes as string. In place of names also functions can be added.
encoder_nodes (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_edges (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_graph (dict) – A dictionary of callable encoder where the key matches the attribute.
add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is False.
has_conformers (bool) – Whether to add node coordinates from conformer. Default is True.
make_directed (bool) – Whether to have directed or undirected bonds. Default is False.
sanitize (bool) – Whether to sanitize molecule. Default is True.
compute_partial_charges (str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is None.
additional_callbacks (dict) – A dictionary whose keys are string attribute names which the elements of the dataset are supposed to have and whose values are callback function objects which implement how those attributes are derived from the MolecularGraphRDKit of the molecule in question or the row of the CSV file.
custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.
- Returns
self
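The callback contract described for additional_callbacks (each callable receives mg and ds and returns the derived property) can be illustrated without kgcnn or RDKit. The following is a minimal sketch under stated assumptions: apply_callbacks is a hypothetical helper, a plain dict stands in for the MolecularGraphRDKit instance and for the pandas row, and one-element lists stand in for the numpy arrays that real callbacks should return.

```python
def apply_callbacks(callbacks, mg, ds):
    """Evaluate every callback for one molecule and collect the results by property name."""
    return {name: function(mg, ds) for name, function in callbacks.items()}


# Keys are the property names; values derive the property from the graph or the CSV row.
callbacks = {
    "name": lambda mg, ds: [ds["name"]],             # taken from the CSV row
    "num_atoms": lambda mg, ds: [len(mg["atoms"])],  # taken from the molecular graph
}

mg = {"atoms": ["C", "C", "O"]}                        # stand-in for MolecularGraphRDKit
ds = {"name": "ethanol", "label": "1", "smiles": "CCO"}  # stand-in for a pandas Series

properties = apply_callbacks(callbacks, mg, ds)
```

Every key of the callback dictionary becomes a property of the dataset element, which is why the example above can read mol['name'] after set_attributes.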
kgcnn.data.datasets.CoraDataset module
class kgcnn.data.datasets.CoraDataset.CoraDataset(reload=False, verbose: int = 10)
Bases: kgcnn.data.download.DownloadDataset, kgcnn.data.base.MemoryGraphDataset
Store and process (full) Cora dataset.
Data loaded from https://github.com/abojchevski/graph2gauss/raw/master/data .
From the paper “Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking”. Nodes represent documents and edges represent citation links.
References
Bojchevski, Aleksandar and Günnemann, Stephan, Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking, arXiv, doi:10.48550/ARXIV.1707.03815, 2017
download_info = {'data_directory_name': 'cora', 'dataset_name': 'Cora', 'download_file_name': 'cora.npz', 'download_url': 'https://github.com/abojchevski/graph2gauss/raw/master/data/cora.npz', 'unpack_directory_name': None, 'unpack_tar': False, 'unpack_zip': False}
kgcnn.data.datasets.CoraLuDataset module
class kgcnn.data.datasets.CoraLuDataset.CoraLuDataset(reload=False, verbose: int = 10)
Bases: kgcnn.data.download.DownloadDataset, kgcnn.data.base.MemoryGraphDataset
Store and process the Cora dataset after Lu et al. 2003.
The description on Papers with Code reads: The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
Downloaded from: https://linqs-data.soe.ucsc.edu/public/lbc/ .
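The 0/1-valued word vector mentioned in the description can be sketched in a few lines. This is an illustration of the encoding only: binary_word_vector and the four-word dictionary are hypothetical, whereas the real Cora dictionary has 1433 entries.

```python
def binary_word_vector(words, dictionary):
    """Return 1 for every dictionary word present in the publication, else 0."""
    present = set(words)
    return [1 if word in present else 0 for word in dictionary]


# Tiny stand-in dictionary; repeated words still map to a single 1.
dictionary = ["graph", "kernel", "neural", "network"]
vector = binary_word_vector(["neural", "network", "network"], dictionary)
```

The vector length always equals the dictionary size, so every publication (node) gets a feature vector of the same dimension.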
References
McCallum, A.K., Nigam, K., Rennie, J. et al. Automating the Construction of Internet Portals with Machine Learning. Information Retrieval 3, 127–163 (2000). https://doi.org/10.1023/A:1009953814988
Lu, Qing and Lise Getoor. “Link-Based Classification.” Encyclopedia of Machine Learning and Data Mining (2003).
download_info = {'data_directory_name': 'cora_lu', 'dataset_name': 'cora_lu', 'download_file_name': 'cora.tgz', 'download_url': 'https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz', 'unpack_directory_name': 'cora_lu', 'unpack_tar': True, 'unpack_zip': False}
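How the unpack_tar, unpack_zip and unpack_directory_name entries of a download_info dictionary might be interpreted is sketched below with the standard library. This is not the implementation of DownloadDataset; unpack_download is a hypothetical helper, and the two dictionaries only copy the relevant keys from the download_info values shown for CoraDataset and CoraLuDataset.

```python
import os
import tarfile
import zipfile

# Relevant keys copied from the download_info attributes documented above.
CORA_INFO = {"download_file_name": "cora.npz", "unpack_directory_name": None,
             "unpack_tar": False, "unpack_zip": False}
CORA_LU_INFO = {"download_file_name": "cora.tgz", "unpack_directory_name": "cora_lu",
                "unpack_tar": True, "unpack_zip": False}


def unpack_download(info, directory):
    """Unpack an already downloaded file if requested; return the resulting path."""
    archive = os.path.join(directory, info["download_file_name"])
    if not (info.get("unpack_tar") or info.get("unpack_zip")):
        return archive  # plain file such as cora.npz, nothing to unpack
    target = os.path.join(directory, info["unpack_directory_name"])
    os.makedirs(target, exist_ok=True)
    if info.get("unpack_tar"):
        with tarfile.open(archive, "r:*") as tar:  # "r:*" detects the compression
            tar.extractall(target)
    else:
        with zipfile.ZipFile(archive) as zipped:
            zipped.extractall(target)
    return target
```

Calling unpack_download(CORA_LU_INFO, data_dir) would extract cora.tgz into a cora_lu subdirectory, while the Cora entry is returned unchanged because both unpack flags are False.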
kgcnn.data.datasets.ESOLDataset module
class kgcnn.data.datasets.ESOLDataset.ESOLDataset(reload=False, verbose: int = 10)
Bases: kgcnn.data.datasets.MoleculeNetDataset2018.MoleculeNetDataset2018
Store and process the ‘ESOL’ dataset from the MoleculeNet database. The class inherits from MoleculeNetDataset2018 and downloads the dataset on class initialization.
Compare reference: DeepChem, reading: Water solubility data (log solubility in mols per litre) for common organic small molecules. Random or Scaffold splitting is recommended for this dataset. Description in DeepChem reads: ‘The Delaney (ESOL) dataset is a regression dataset containing structures and water solubility data for 1128 compounds. The dataset is widely used to validate machine learning models on estimating solubility directly from molecular structures (as encoded in SMILES strings).’
References
Delaney, John S. ESOL: estimating aqueous solubility directly from molecular structure. Journal of chemical information and computer sciences 44.3 (2004): 1000-1005.
kgcnn.data.datasets.FreeSolvDataset module
class kgcnn.data.datasets.FreeSolvDataset.FreeSolvDataset(reload=False, verbose: int = 10)
Bases: kgcnn.data.datasets.MoleculeNetDataset2018.MoleculeNetDataset2018
Store and process the ‘FreeSolv’ dataset from the MoleculeNet database. The class inherits from MoleculeNetDataset2018 and downloads the dataset on class initialization.
Compare reference: DeepChem, reading: Experimental and calculated hydration free energy of small molecules in water. Description in DeepChem reads: ‘The FreeSolv dataset is a collection of experimental and calculated hydration free energies for small molecules in water, along with their experimental values. Here, we are using a modified version of the dataset with the molecule smile string and the corresponding experimental hydration free energies.’ Random splitting is recommended for this dataset.
References
Lukasz Maziarka, et al. Molecule Attention Transformer. NeurIPS 2019 arXiv:2002.08264v1 [cs.LG].
Mobley DL, Guthrie JP. FreeSolv: a database of experimental and calculated hydration free energies, with input files. J Comput Aided Mol Des. 2014;28(7):711-720. doi:10.1007/s10822-014-9747-x
kgcnn.data.datasets.GraphTUDataset2020 module
class kgcnn.data.datasets.GraphTUDataset2020.GraphTUDataset2020(dataset_name: str, reload: bool = False, verbose: int = 10)
Bases: kgcnn.data.tudataset.GraphTUDataset, kgcnn.data.download.DownloadDataset
Base class for loading graph datasets published by TU Dortmund University .
This general base class has functionality to load TUDatasets in a generic way.
Note
Subclasses of GraphTUDataset in kgcnn.data.datasets should still be made if the dataset needs more refined post-processing. Not all datasets can provide all types of graph properties like edge_attributes etc.
References
TUDataset: A collection of benchmark datasets for learning with graphs. Christopher Morris and Nils M. Kriege and Franka Bause and Kristian Kersting and Petra Mutzel and Marion Neumann, ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020) www.graphlearning.io .
__init__(dataset_name: str, reload: bool = False, verbose: int = 10)
Initialize a GraphTUDataset instance from string identifier.
-
tudataset_ids
= ['AIDS', 'alchemy_full', 'aspirin', 'benzene', 'BZR', 'BZR_MD', 'COX2', 'COX2_MD', 'DHFR', 'DHFR_MD', 'ER_MD', 'ethanol', 'FRANKENSTEIN', 'malonaldehyde', 'MCF-7', 'MCF-7H', 'MOLT-4', 'MOLT-4H', 'Mutagenicity', 'MUTAG', 'naphthalene', 'NCI1', 'NCI109', 'NCI-H23', 'NCI-H23H', 'OVCAR-8', 'OVCAR-8H', 'P388', 'P388H', 'PC-3', 'PC-3H', 'PTC_FM', 'PTC_FR', 'PTC_MM', 'PTC_MR', 'QM9', 'salicylic_acid', 'SF-295', 'SF-295H', 'SN12C', 'SN12CH', 'SW-620', 'SW-620H', 'toluene', 'Tox21_AhR_training', 'Tox21_AhR_testing', 'Tox21_AhR_evaluation', 'Tox21_AR_training', 'Tox21_AR_testing', 'Tox21_AR_evaluation', 'Tox21_AR-LBD_training', 'Tox21_AR-LBD_testing', 'Tox21_AR-LBD_evaluation', 'Tox21_ARE_training', 'Tox21_ARE_testing', 'Tox21_ARE_evaluation', 'Tox21_aromatase_training', 'Tox21_aromatase_testing', 'Tox21_aromatase_evaluation', 'Tox21_ATAD5_training', 'Tox21_ATAD5_testing', 'Tox21_ATAD5_evaluation', 'Tox21_ER_training', 'Tox21_ER_testing', 'Tox21_ER_evaluation', 'Tox21_ER-LBD_training', 'Tox21_ER-LBD_testing', 'Tox21_ER-LBD_evaluation', 'Tox21_HSE_training', 'Tox21_HSE_testing', 'Tox21_HSE_evaluation', 'Tox21_MMP_training', 'Tox21_MMP_testing', 'Tox21_MMP_evaluation', 'Tox21_p53_training', 'Tox21_p53_testing', 'Tox21_p53_evaluation', 'Tox21_PPAR-gamma_training', 'Tox21_PPAR-gamma_testing', 'Tox21_PPAR-gamma_evaluation', 'UACC257', 'UACC257H', 'uracil', 'Yeast', 'YeastH', 'ZINC_full', 'ZINC_test', 'ZINC_train', 'ZINC_val', 'DD', 'ENZYMES', 'KKI', 'OHSU', 'Peking_1', 'PROTEINS', 'PROTEINS_full', 'COIL-DEL', 'COIL-RAG', 'Cuneiform', 'Fingerprint', 'FIRSTMM_DB', 'Letter-high', 'Letter-low', 'Letter-med', 'MSRC_9', 'MSRC_21', 'MSRC_21C', 'COLLAB', 'dblp_ct1', 'dblp_ct2', 'DBLP_v1', 'deezer_ego_nets', 'facebook_ct1', 'facebook_ct2', 'github_stargazers', 'highschool_ct1', 'highschool_ct2', 'IMDB-BINARY', 'IMDB-MULTI', 'infectious_ct1', 'infectious_ct2', 'mit_ct1', 'mit_ct2', 'REDDIT-BINARY', 'REDDIT-MULTI-5K', 'REDDIT-MULTI-12K', 'reddit_threads', 'tumblr_ct1', 'tumblr_ct2', 
'twitch_egos', 'TWITTER-Real-Graph-Partial', 'COLORS-3', 'SYNTHETIC', 'SYNTHETICnew', 'Synthie', 'TRIANGLES']¶
kgcnn.data.datasets.ISO17Dataset module
class kgcnn.data.datasets.ISO17Dataset.ISO17Dataset(reload=False, verbose: int = 10)
Bases: kgcnn.data.download.DownloadDataset, kgcnn.data.base.MemoryGraphDataset
Dataset ‘ISO17’ with molecules randomly taken from the QM9 dataset [1] with a fixed composition of atoms (C7O2H10), arranged in different chemically valid structures. It is an extension of the isomer MD data used in [2].
The information below and the dataset itself are copied and downloaded from http://quantum-machine.org/datasets/ . The database consists of 129 molecules, each containing 5000 conformational geometries, energies and forces with a resolution of 1 femtosecond in the molecular dynamics trajectories. The database was generated from molecular dynamics simulations using the Fritz-Haber Institute ab initio simulation package (FHI-aims) [3]. The simulations in ISO17 were carried out using the standard quantum chemistry computational method density functional theory (DFT) in the generalized gradient approximation (GGA) with the Perdew-Burke-Ernzerhof (PBE) functional [4] and the Tkatchenko-Scheffler (TS) van der Waals correction method [5]. The dataset is stored in ASE sqlite format with the total energy in eV and forces in eV/Ang. Download-url: http://quantum-machine.org/datasets/iso17.tar.gz .
    from ase.db import connect

    # path_to_db: path to one of the unpacked ISO17 database files, e.g. reference.db
    with connect(path_to_db) as conn:
        for row in conn.select(limit=10):
            print(row.toatoms())
            print(row['total_energy'])
            print(row.data['atomic_forces'])
References
[1] R. Ramakrishnan et al. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1. (2014) https://www.nature.com/articles/sdata201422 .
[2] Schütt, K. T. et al. Quantum-chemical insights from deep tensor neural networks. Nature Communications, 8, 13890. (2017) https://www.nature.com/articles/ncomms13890 .
[3] Blum, V. et al. Ab Initio Molecular Simulations with Numeric Atom-Centered Orbitals. Comput. Phys. Commun. 2009, 180 (11), 2175–2196 https://www.sciencedirect.com/science/article/pii/S0010465509002033 .
[4] Perdew, J. P. et al. Generalized Gradient Approximation Made Simple. Phys. Rev. Lett. 1996, 77 (18), 3865–3868 https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.77.3865 .
[5] Tkatchenko, A. et al. Accurate Molecular Van Der Waals Interactions from Ground-State Electron Density and Free-Atom Reference Data. Phys. Rev. Lett. 2009, 102 (7), 73005 https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.102.073005 .
[6] Schütt, K. T., et al. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. Advances in Neural Information Processing System (accepted). 2017. https://arxiv.org/abs/1706.08566
__init__(reload=False, verbose: int = 10)
Initialize full ISO17Dataset dataset.
download_info = {'data_directory_name': 'ISO17', 'dataset_name': 'ISO17', 'download_file_name': 'iso17.tar.gz', 'download_url': 'http://quantum-machine.org/datasets/iso17.tar.gz', 'unpack_directory_name': 'iso17', 'unpack_tar': True, 'unpack_zip': False}
read_in_memory()
Load the complete ISO17Dataset data into memory. Additionally, the different train and validation properties are set according to http://quantum-machine.org/datasets/ .
The data is partitioned as used in the SchNet paper [6]:
reference.db: 80% of steps of 80% of MD trajectories
reference_eq.db: equilibrium conformations of those molecules
test_within.db: remaining 20% unseen steps of reference trajectories
test_other.db: remaining 20% unseen MD trajectories
test_eq.db: equilibrium conformations of test trajectories
Here ‘reference.db’ and ‘reference_eq.db’ have the ‘train’ property with index 0 and 1, respectively, and ‘test_within’, ‘test_other’ and ‘test_eq’ have the ‘test’ property with index 0, 1 and 2, respectively. The original validation geometries are noted by the ‘valid’ property with index 0. Use get_train_test_indices for reading out split indices.
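The split bookkeeping described above can be pictured with a small stand-alone sketch. It does not call get_train_test_indices; it assumes, for illustration only, that every graph is a dictionary carrying optional ‘train’, ‘test’ and ‘valid’ entries that hold the split index. The helper indices_for_split and the toy graph list are hypothetical.

```python
def indices_for_split(graphs, split_name, split_index):
    """Return the positions of all graphs whose split property contains split_index."""
    return [
        position
        for position, graph in enumerate(graphs)
        if split_index in graph.get(split_name, ())
    ]


# Toy stand-in for the loaded dataset; the storage layout is an assumption.
graphs = [
    {"train": [0]},  # geometry from reference.db
    {"train": [1]},  # geometry from reference_eq.db
    {"test": [0]},   # geometry from test_within.db
    {"test": [1]},   # geometry from test_other.db
    {"test": [2]},   # geometry from test_eq.db
    {"valid": [0]},  # original validation geometry
]

train_reference = indices_for_split(graphs, "train", 0)
test_other = indices_for_split(graphs, "test", 1)
validation = indices_for_split(graphs, "valid", 0)
```

Selecting by property name and index in this way keeps the five database files distinguishable after they have been merged into a single in-memory dataset.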
kgcnn.data.datasets.LipopDataset module
class kgcnn.data.datasets.LipopDataset.LipopDataset(reload=False, verbose: int = 10)
Bases: kgcnn.data.datasets.MoleculeNetDataset2018.MoleculeNetDataset2018
Store and process the ‘Lipop’ dataset from the MoleculeNet database. The class inherits from MoleculeNetDataset2018 and downloads the dataset on class initialization.
Compare reference: DeepChem, reading: Experimental results of octanol/water distribution coefficient (logD at pH 7.4). Description in DeepChem reads: ‘Lipophilicity is an important feature of drug molecules that affects both membrane permeability and solubility. The lipophilicity dataset, curated from ChEMBL database, provides experimental results of octanol/water distribution coefficient (logD at pH 7.4) of 4200 compounds.’ Random or Scaffold splitting is recommended for this dataset.
References
Hersey, A. ChEMBL Deposited Data Set - AZ dataset; 2015. https://doi.org/10.6019/chembl3301361
kgcnn.data.datasets.MD17Dataset module
class kgcnn.data.datasets.MD17Dataset.MD17Dataset(trajectory_name: Optional[str] = None, reload=False, verbose=10)
Bases: kgcnn.data.download.DownloadDataset, kgcnn.data.base.MemoryGraphDataset
Store and process trajectories from the MD17Dataset dataset. The dataset contains atomic coordinates of molecular dynamics trajectories, as well as the total energy (in kcal/mol) and forces (kcal/mol/Angstrom) on each atom. For the reference data source, refer to the links http://www.sgdml.org/#datasets or http://quantum-machine.org/gdml/data/ .
Which trajectory is downloaded is determined by the trajectory_name argument. There are two different versions of trajectories: a long trajectory on DFT level of theory and a short trajectory on coupled cluster level of theory, marked in the name by ‘dft’ and ‘ccsd_t’ (‘ccsd’ for aspirin), respectively.
Overview:
| Molecule | Level of Theory | trajectory_name | graphs |
| --- | --- | --- | --- |
| Aspirin | DFT | aspirin_dft | 211762 |
| Azobenzene | DFT | azobenzene_dft | 99999 |
| Benzene | DFT | benzene2017_dft | 627983 |
| Benzene | DFT | benzene2018_dft | 49863 |
| Ethanol | DFT | ethanol_dft | 555092 |
| Malonaldehyde | DFT | malonaldehyde_dft | 993237 |
| Naphthalene | DFT | naphthalene_dft | 326250 |
| Paracetamol | DFT | paracetamol_dft | 106490 |
| Salicylic | DFT | salicylic_dft | 320231 |
| Toluene | DFT | toluene_dft | 442790 |
| Uracil | DFT | uracil_dft | 133770 |
| Aspirin | CCSD | aspirin_ccsd | 1500 |
| Benzene | CCSD | benzene_ccsd_t | 1500 |
| Ethanol | CCSD | ethanol_ccsd_t | 2000 |
| Malonaldehyde | CCSD | malonaldehyde_ccsd_t | 1500 |
| Toluene | CCSD | toluene_ccsd_t | 1501 |
It is recommended to use the given train-test splits. Only the requested trajectory is downloaded.
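MD17 reports energies in kcal/mol, while other datasets in this package, such as ISO17, use eV. A minimal conversion sketch follows; the helper name kcal_per_mol_to_ev is hypothetical, and the factor is derived from 1 kcal = 4.184 kJ and 1 eV = 96.48533212 kJ/mol rather than taken from kgcnn.

```python
# 1 kcal/mol expressed in eV: (4.184 kJ/mol) / (96.48533212 kJ/mol per eV).
KCAL_PER_MOL_IN_EV = 4.184 / 96.48533212


def kcal_per_mol_to_ev(value):
    """Convert an energy from kcal/mol to eV (per molecule)."""
    return value * KCAL_PER_MOL_IN_EV


energy_ev = kcal_per_mol_to_ev(100.0)  # roughly 4.34 eV
```

Forces convert with the same factor, since kcal/mol/Angstrom and eV/Angstrom share the length unit.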
__init__(trajectory_name: Optional[str] = None, reload=False, verbose=10)
Initialize MD17Dataset dataset.
-
datasets_download_info
= {'CG-CG': {'download_file_name': 'CG-CG.npz'}, 'aspirin_ccsd': {'download_file_name': 'aspirin_ccsd.zip', 'unpack_directory_name': 'aspirin_ccsd', 'unpack_zip': True}, 'aspirin_dft': {'download_file_name': 'aspirin_dft.npz'}, 'azobenzene_dft': {'download_file_name': 'azobenzene_dft.npz'}, 'benzene2017_dft': {'download_file_name': 'benzene2017_dft.npz'}, 'benzene2018_dft': {'download_file_name': 'benzene2018_dft.npz'}, 'benzene_ccsd_t': {'download_file_name': 'benzene_ccsd_t.zip', 'unpack_directory_name': 'benzene_ccsd_t', 'unpack_zip': True}, 'ethanol_ccsd_t': {'download_file_name': 'ethanol_ccsd_t.zip', 'unpack_directory_name': 'ethanol_ccsd_t', 'unpack_zip': True}, 'ethanol_dft': {'download_file_name': 'ethanol_dft.npz'}, 'malonaldehyde_ccsd_t': {'download_file_name': 'malonaldehyde_ccsd_t.zip', 'unpack_directory_name': 'malonaldehyde_ccsd_t', 'unpack_zip': True}, 'malonaldehyde_dft': {'download_file_name': 'malonaldehyde_dft.npz'}, 'naphthalene_dft': {'download_file_name': 'naphthalene_dft.npz'}, 'paracetamol_dft': {'download_file_name': 'paracetamol_dft.npz'}, 'salicylic_dft': {'download_file_name': 'salicylic_dft.npz'}, 'toluene_ccsd_t': {'download_file_name': 'toluene_ccsd_t.zip', 'unpack_directory_name': 'toluene_ccsd_t', 'unpack_zip': True}, 'toluene_dft': {'download_file_name': 'toluene_dft.npz'}, 'uracil_dft': {'download_file_name': 'uracil_dft.npz'}}¶
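How a trajectory_name could be resolved against datasets_download_info is sketched below. The dictionary is a three-entry excerpt copied from the attribute above; resolve_trajectory is a hypothetical helper and not the logic kgcnn uses internally.

```python
# Excerpt of datasets_download_info as documented above.
datasets_download_info = {
    "aspirin_dft": {"download_file_name": "aspirin_dft.npz"},
    "aspirin_ccsd": {"download_file_name": "aspirin_ccsd.zip",
                     "unpack_directory_name": "aspirin_ccsd", "unpack_zip": True},
    "toluene_ccsd_t": {"download_file_name": "toluene_ccsd_t.zip",
                       "unpack_directory_name": "toluene_ccsd_t", "unpack_zip": True},
}


def resolve_trajectory(trajectory_name, info=datasets_download_info):
    """Return (file name, needs unzipping) for one trajectory or raise for unknown names."""
    if trajectory_name not in info:
        raise ValueError(f"Unknown trajectory_name: {trajectory_name!r}")
    entry = info[trajectory_name]
    # DFT entries are single .npz files and carry no unpack_zip key.
    return entry["download_file_name"], entry.get("unpack_zip", False)


dft_file, dft_unzip = resolve_trajectory("aspirin_dft")
ccsd_file, ccsd_unzip = resolve_trajectory("aspirin_ccsd")
```

In the documented dictionary the DFT trajectories are plain .npz files, whereas the coupled cluster trajectories are zip archives with their own unpack directory.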
kgcnn.data.datasets.MD17RevisedDataset module
class kgcnn.data.datasets.MD17RevisedDataset.MD17RevisedDataset(trajectory_name: Optional[str] = None, reload=False, verbose=10)
Bases: kgcnn.data.download.DownloadDataset, kgcnn.data.base.MemoryGraphDataset
Store and process trajectories from the MD17DatasetRevised dataset.
The information of the readme file is given below:
The molecules are taken from the original MD17 dataset by Chmiela et al., and 100,000 structures are taken, and the energies and forces are recalculated at the PBE/def2-SVP level of theory using very tight SCF convergence and very dense DFT integration grid. As such, the dataset is practically free from numerical noise.
One warning: As the structures are taken from a molecular dynamics simulation (i.e. time series data), they are not guaranteed to be independent samples. This is easily evident from the auto-correlation function for the original MD17 dataset.
In short: DO NOT train a model on more than 1000 samples from this dataset. Data already published with 50K samples on the original MD17 dataset should be considered meaningless due to this fact and due to the noise in the original data.
The data: The ten molecules are saved in Numpy .npz format. The keys correspond to:
‘nuclear_charges’ : The nuclear charges for the molecule
‘coords’ : The coordinates for each conformation (in units of Angstrom)
‘energies’ : The total energy of each conformation (in units of kcal/mol)
‘forces’ : The cartesian forces of each conformation (in units of kcal/mol/Angstrom)
‘old_indices’ : The index of each conformation in the original MD17 dataset
‘old_energies’ : The energy of each conformation taken from the original MD17 dataset
‘old_forces’ : The forces of each conformation taken from the original MD17 dataset
Note that for Azobenzene, only 99988 samples are available, because 11 DFT calculations failed due to van der Waals clashes and the original dataset only contained 99999 structures.
Data splits: Five training and test splits are saved in CSV format containing the corresponding indices.
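The readme's advice (use the provided index splits and do not train on more than 1000 samples) can be combined in a short sketch. The CSV layout is an assumption here: one integer index per line. The helper read_split_indices is hypothetical and not part of kgcnn.

```python
import csv
import io

MAX_TRAIN_SAMPLES = 1000  # upper bound recommended in the readme above


def read_split_indices(handle, limit=MAX_TRAIN_SAMPLES):
    """Read one index per CSV row (assumed layout) and keep at most `limit` of them."""
    indices = [int(row[0]) for row in csv.reader(handle) if row]
    return indices[:limit]


# In-memory stand-in for one of the split files.
split_file = io.StringIO("5\n3\n9\n")
train_indices = read_split_indices(split_file, limit=2)
```

The returned indices refer to conformations in the revised .npz arrays, so they can be used directly to index ‘coords’, ‘energies’ and ‘forces’.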
References
Anders Christensen, O. Anatole von Lilienfeld, Revised MD17 dataset, Materials Cloud Archive 2020.82 (2020), doi: 10.24435/materialscloud:wy-kn.
__init__(trajectory_name: Optional[str] = None, reload=False, verbose=10)
Initialize MD17DatasetRevised dataset for a specific trajectory.
download_info = {'data_directory_name': 'MD17Revised', 'dataset_name': 'MD17Revised', 'download_file_name': 'rmd17.tar.bz2', 'download_url': 'https://archive.materialscloud.org/record/file?filename=rmd17.tar.bz2&record_id=466', 'unpack_directory_name': 'rmd17', 'unpack_tar': True, 'unpack_zip': False}
possible_trajectory_names = ['aspirin', 'azobenzene', 'benzene', 'ethanol', 'malonaldehyde', 'naphthalene', 'paracetamol', 'salicylic', 'toluene', 'uracil']
kgcnn.data.datasets.MUTAGDataset module
class kgcnn.data.datasets.MUTAGDataset.MUTAGDataset(reload=False, verbose: int = 10)
Bases: kgcnn.data.datasets.GraphTUDataset2020.GraphTUDataset2020
Store and process the MUTAG dataset from TUDatasets.
The description on Papers with Code reads: In particular, MUTAG is a collection of nitroaromatic compounds and the goal is to predict their mutagenicity on Salmonella typhimurium. Input graphs are used to represent chemical compounds, where vertices stand for atoms and are labeled by the atom type (represented by one-hot encoding), while edges between vertices represent bonds between the corresponding atoms. It includes 188 samples of chemical compounds with 7 discrete node labels.
References
Debnath, A.K., Lopez de Compadre, R.L., Debnath, G., Shusterman, A.J., and Hansch, C. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. J. Med. Chem. 34(2):786-797 (1991).
Nils Kriege, Petra Mutzel. 2012. Subgraph Matching Kernels for Attributed Graphs. International Conference on Machine Learning 2012.
kgcnn.data.datasets.MatBenchDataset2020 module
class kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020(dataset_name: str, reload: bool = False, verbose: int = 10)
Bases: kgcnn.data.crystal.CrystalDataset, kgcnn.data.download.DownloadDataset
Base class for loading graph datasets from MatBench, a collection of materials datasets. For graph learning only those with structure are relevant. Processes and loads from serialized pymatgen structures.
Note
This class does not follow the interface of MatBench and therefore also does not use the original splits required for submission of benchmark values.
Matbench is an automated leaderboard for benchmarking state of the art ML algorithms predicting a diverse range of solid materials’ properties. It is hosted and maintained by the Materials Project .
Matbench is an ImageNet for materials science; a curated set of 13 supervised, pre-cleaned, ready-to-use ML tasks for benchmarking and fair comparison. The tasks span a wide domain of inorganic materials science applications including electronic, thermodynamic, mechanical, and thermal properties among crystals, 2D materials, disordered metals, and more.
References
Dunn, A., Wang, Q., Ganose, A. et al. Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm. npj Comput Mater 6, 138 (2020). https://doi.org/10.1038/s41524-020-00406-3 .
__init__(dataset_name: str, reload: bool = False, verbose: int = 10)
Initialize a MatBenchDataset2020 instance from string identifier.
-
datasets_download_info
= {'matbench_dielectric': {'data_directory_name': 'matbench_dielectric', 'dataset_name': 'matbench_dielectric', 'download_file_name': 'matbench_dielectric.json.gz', 'extract_file_name': 'matbench_dielectric.json', 'extract_gz': True}, 'matbench_expt_gap': {'data_directory_name': 'matbench_expt_gap', 'dataset_name': 'matbench_expt_gap', 'download_file_name': 'matbench_expt_gap.json.gz', 'extract_file_name': 'matbench_expt_gap.json', 'extract_gz': True}, 'matbench_expt_is_metal': {'data_directory_name': 'matbench_expt_is_metal', 'dataset_name': 'matbench_expt_is_metal', 'download_file_name': 'matbench_expt_is_metal.json.gz', 'extract_file_name': 'matbench_expt_is_metal.json', 'extract_gz': True}, 'matbench_glass': {'data_directory_name': 'matbench_glass', 'dataset_name': 'matbench_glass', 'download_file_name': 'matbench_glass.json.gz', 'extract_file_name': 'matbench_glass.json', 'extract_gz': True}, 'matbench_jdft2d': {'data_directory_name': 'matbench_jdft2d', 'dataset_name': 'matbench_jdft2d', 'download_file_name': 'matbench_jdft2d.json.gz', 'extract_file_name': 'matbench_jdft2d.json', 'extract_gz': True}, 'matbench_log_gvrh': {'data_directory_name': 'matbench_log_gvrh', 'dataset_name': 'matbench_log_gvrh', 'download_file_name': 'matbench_log_gvrh.json.gz', 'extract_file_name': 'matbench_log_gvrh.json', 'extract_gz': True}, 'matbench_log_kvrh': {'data_directory_name': 'matbench_log_kvrh', 'dataset_name': 'matbench_log_kvrh', 'download_file_name': 'matbench_log_kvrh.json.gz', 'extract_file_name': 'matbench_log_kvrh.json', 'extract_gz': True}, 'matbench_mp_e_form': {'data_directory_name': 'matbench_mp_e_form', 'dataset_name': 'matbench_mp_e_form', 'download_file_name': 'matbench_mp_e_form.json.gz', 'extract_file_name': 'matbench_mp_e_form.json', 'extract_gz': True}, 'matbench_mp_gap': {'data_directory_name': 'matbench_mp_gap', 'dataset_name': 'matbench_mp_gap', 'download_file_name': 'matbench_mp_gap.json.gz', 'extract_file_name': 'matbench_mp_gap.json', 'extract_gz': 
True}, 'matbench_mp_is_metal': {'data_directory_name': 'matbench_mp_is_metal', 'dataset_name': 'matbench_mp_is_metal', 'download_file_name': 'matbench_mp_is_metal.json.gz', 'extract_file_name': 'matbench_mp_is_metal.json', 'extract_gz': True}, 'matbench_perovskites': {'data_directory_name': 'matbench_perovskites', 'dataset_name': 'matbench_perovskites', 'download_file_name': 'matbench_perovskites.json.gz', 'extract_file_name': 'matbench_perovskites.json', 'extract_gz': True}, 'matbench_phonons': {'data_directory_name': 'matbench_phonons', 'dataset_name': 'matbench_phonons', 'download_file_name': 'matbench_phonons.json.gz', 'extract_file_name': 'matbench_phonons.json', 'extract_gz': True}, 'matbench_steels': {'data_directory_name': 'matbench_steels', 'dataset_name': 'matbench_steels', 'download_file_name': 'matbench_steels.json.gz', 'extract_file_name': 'matbench_steels.json', 'extract_gz': True}}¶
-
datasets_prepare_data_info
= {'matbench_dielectric': {'file_column_name': 'structure'}, 'matbench_expt_gap': {'file_column_name': 'composition'}, 'matbench_expt_is_metal': {'file_column_name': 'composition'}, 'matbench_glass': {'file_column_name': 'composition'}, 'matbench_jdft2d': {'file_column_name': 'structure'}, 'matbench_log_gvrh': {'file_column_name': 'structure'}, 'matbench_log_kvrh': {'file_column_name': 'structure'}, 'matbench_mp_e_form': {'file_column_name': 'structure'}, 'matbench_mp_gap': {'file_column_name': 'structure'}, 'matbench_mp_is_metal': {'file_column_name': 'structure'}, 'matbench_perovskites': {'file_column_name': 'structure'}, 'matbench_phonons': {'file_column_name': 'structure'}, 'matbench_steels': {'file_column_name': 'composition'}}¶
-
datasets_read_in_memory_info
= {'matbench_dielectric': {'label_column_name': 'n'}, 'matbench_expt_gap': {'label_column_name': 'gap expt'}, 'matbench_expt_is_metal': {'label_column_name': 'is_metal'}, 'matbench_glass': {'label_column_name': 'gfa'}, 'matbench_jdft2d': {'label_column_name': 'exfoliation_en'}, 'matbench_log_gvrh': {'label_column_name': 'log10(G_VRH)'}, 'matbench_log_kvrh': {'label_column_name': 'log10(K_VRH)'}, 'matbench_mp_e_form': {'label_column_name': 'e_form'}, 'matbench_mp_gap': {'label_column_name': 'gap pbe'}, 'matbench_mp_is_metal': {'label_column_name': 'is_metal'}, 'matbench_perovskites': {'label_column_name': 'e_form'}, 'matbench_phonons': {'label_column_name': 'last phdos peak'}, 'matbench_steels': {'label_column_name': 'yield strength'}}¶
prepare_data(file_column_name: Optional[str] = None, overwrite: bool = False)
Default preparation for crystal datasets.
Try to load all crystal structures from single files and save them as a pymatgen json serialization. Can load multiple CIF files from a table that keeps file names and possible labels or additional information.
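Since only datasets with a structure input are relevant for graph learning, the class attribute datasets_prepare_data_info can be used to tell them apart from composition-only tasks. The sketch below copies that dictionary from the attribute listing above; the helper structure_datasets is hypothetical and not part of kgcnn.

```python
# Copied from the datasets_prepare_data_info attribute documented above.
datasets_prepare_data_info = {
    "matbench_dielectric": {"file_column_name": "structure"},
    "matbench_expt_gap": {"file_column_name": "composition"},
    "matbench_expt_is_metal": {"file_column_name": "composition"},
    "matbench_glass": {"file_column_name": "composition"},
    "matbench_jdft2d": {"file_column_name": "structure"},
    "matbench_log_gvrh": {"file_column_name": "structure"},
    "matbench_log_kvrh": {"file_column_name": "structure"},
    "matbench_mp_e_form": {"file_column_name": "structure"},
    "matbench_mp_gap": {"file_column_name": "structure"},
    "matbench_mp_is_metal": {"file_column_name": "structure"},
    "matbench_perovskites": {"file_column_name": "structure"},
    "matbench_phonons": {"file_column_name": "structure"},
    "matbench_steels": {"file_column_name": "composition"},
}


def structure_datasets(info):
    """Return the sorted names of all datasets whose input column is a crystal structure."""
    return sorted(
        name for name, entry in info.items()
        if entry["file_column_name"] == "structure"
    )


graph_ready = structure_datasets(datasets_prepare_data_info)
```

Of the 13 Matbench tasks, this leaves the nine structure-based ones; the four composition-based tasks carry no crystal structure to build a graph from.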
kgcnn.data.datasets.MatProjectDielectricDataset module¶
-
class
kgcnn.data.datasets.MatProjectDielectricDataset.
MatProjectDielectricDataset
(reload=False, verbose: int = 10)[source]¶ Bases:
kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020
Store and process
MatProjectDielectricDataset
from MatBench database. Name within Matbench: ‘matbench_dielectric’.
Matbench test dataset for predicting refractive index from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having refractive indices less than 1 and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Number of samples: 4764
Task type: regression
Input type: structure
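The ordering requirement matters because cross-validation folds are defined by sample position: if the rows are reordered, the same fold indices select different materials. The following is a minimal sketch of deterministic fold generation, assuming a seeded shuffle of positions; the function name `kfold_indices`, the fold count of 5 and the seed 0 are illustrative assumptions and not the values prescribed by the Matbench protocol.

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Return a list of (train, test) index lists that is reproducible for a fixed seed."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so every run gives the same permutation
    # Deal the shuffled positions round-robin into n_folds disjoint folds.
    folds = [sorted(indices[i::n_folds]) for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test = folds[k]
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        splits.append((train, test))
    return splits


# 4764 is the number of samples stated for this dataset.
splits = kfold_indices(4764)
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

For nested cross-validation the same procedure would be repeated on each training part to obtain the inner folds.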
kgcnn.data.datasets.MatProjectEFormDataset module¶
-
class
kgcnn.data.datasets.MatProjectEFormDataset.
MatProjectEFormDataset
(reload=False, verbose: int = 10)[source]¶ Bases:
kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020
Store and process
MatProjectEFormDataset
from MatBench database. Name within Matbench: ‘matbench_mp_e_form’.
Matbench test dataset for predicting DFT formation energy from structure. Adapted from Materials Project database. Removed entries having formation energy more than 2.5eV and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Number of samples: 132752.
Task type: regression.
Input type: structure.
kgcnn.data.datasets.MatProjectGapDataset module¶
-
class
kgcnn.data.datasets.MatProjectGapDataset.
MatProjectGapDataset
(reload=False, verbose: int = 10)[source]¶ Bases:
kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020
Store and process
MatProjectGapDataset
from MatBench database. Name within Matbench: ‘matbench_mp_gap’.
Matbench test dataset for predicting DFT PBE band gap from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Number of samples: 106113
Task type: regression
Input type: structure
kgcnn.data.datasets.MatProjectIsMetalDataset module¶
-
class
kgcnn.data.datasets.MatProjectIsMetalDataset.
MatProjectIsMetalDataset
(reload=False, verbose: int = 10)[source]¶ Bases:
kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020
Store and process
MatProjectIsMetalDataset
from MatBench database. Name within Matbench: ‘matbench_mp_is_metal’.
Matbench test dataset for predicting DFT metallicity from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Number of samples: 106113.
Task type: classification.
Input type: structure.
kgcnn.data.datasets.MatProjectJdft2dDataset module¶
-
class
kgcnn.data.datasets.MatProjectJdft2dDataset.
MatProjectJdft2dDataset
(reload=False, verbose: int = 10)[source]¶ Bases:
kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020
Store and process
MatProjectJdft2dDataset
from MatBench database. Name within Matbench: ‘matbench_jdft2d’.
Matbench test dataset for predicting exfoliation energies from crystal structure (computed with the OptB88vdW and TBmBJ functionals). Adapted from the JARVIS DFT database. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Number of samples: 636
Task type: regression
Input type: structure
kgcnn.data.datasets.MatProjectLogGVRHDataset module¶
-
class
kgcnn.data.datasets.MatProjectLogGVRHDataset.
MatProjectLogGVRHDataset
(reload=False, verbose: int = 10)[source]¶ Bases:
kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020
Store and process
MatProjectLogGVRHDataset
from MatBench database. Name within Matbench: ‘matbench_log_gvrh’.
Matbench v0.1 test dataset for predicting DFT log10 VRH-average shear modulus from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having negative G_Voigt, G_Reuss, G_VRH, K_Voigt, K_Reuss, or K_VRH and those failing G_Reuss <= G_VRH <= G_Voigt or K_Reuss <= K_VRH <= K_Voigt and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Number of samples: 10987
Task type: regression
Input type: structure
kgcnn.data.datasets.MatProjectLogKVRHDataset module¶
-
class
kgcnn.data.datasets.MatProjectLogKVRHDataset.
MatProjectLogKVRHDataset
(reload=False, verbose: int = 10)[source]¶ Bases:
kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020
Store and process
MatProjectLogKVRHDataset
from MatBench database. Name within Matbench: ‘matbench_log_kvrh’.
Matbench v0.1 test dataset for predicting DFT log10 VRH-average bulk modulus from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having negative G_Voigt, G_Reuss, G_VRH, K_Voigt, K_Reuss, or K_VRH and those failing G_Reuss <= G_VRH <= G_Voigt or K_Reuss <= K_VRH <= K_Voigt and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Number of samples: 10987
Task type: regression
Input type: structure
kgcnn.data.datasets.MatProjectPerovskitesDataset module¶
-
class
kgcnn.data.datasets.MatProjectPerovskitesDataset.
MatProjectPerovskitesDataset
(reload=False, verbose: int = 10)[source]¶ Bases:
kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020
Store and process
MatProjectPerovskitesDataset
from MatBench database. Name within Matbench: ‘matbench_perovskites’.
Matbench test dataset for predicting formation energy from crystal structure. Adapted from an original dataset generated by Castelli et al. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Number of samples: 18928
Task type: regression
Input type: structure
kgcnn.data.datasets.MatProjectPhononsDataset module¶
-
class
kgcnn.data.datasets.MatProjectPhononsDataset.
MatProjectPhononsDataset
(reload=False, verbose: int = 10)[source]¶ Bases:
kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020
Store and process
MatProjectPhononsDataset
from MatBench database. Name within Matbench: ‘matbench_phonons’.
Matbench test dataset for predicting vibration properties from crystal structure. Original data retrieved from Petretto et al. Original calculations done via ABINIT in the harmonic approximation based on density functional perturbation theory. Removed entries having a formation energy (or energy above the convex hull) more than 150meV. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Number of samples: 1265
Task type: regression
Input type: structure
last phdos peak: Target variable. Frequency of the highest frequency optical phonon mode peak, in units of 1/cm; may be used as an estimation of dominant longitudinal optical phonon frequency.
kgcnn.data.datasets.MoleculeNetDataset2018 module¶
-
class
kgcnn.data.datasets.MoleculeNetDataset2018.
MoleculeNetDataset2018
(dataset_name: str, reload: bool = False, verbose: int = 10)[source]¶ Bases:
kgcnn.data.moleculenet.MoleculeNetDataset
,kgcnn.data.download.DownloadDataset
Downloader for MoleculeNet datasets. This class inherits from
MoleculeNetDataset
. QM datasets are, however, excluded from this class because they have dedicated classes in kgcnn.data.datasets that inherit from QMDataset.
MoleculeNet is a benchmark specially designed for testing machine learning methods of molecular properties. It curates a number of dataset collections. All methods and datasets are integrated as parts of the open source DeepChem package (MIT license).
- Stats:
Name           #graphs   #features  #classes
ESOL           1,128     9          1
FreeSolv       642       9          1
Lipophilicity  4,200     9          1
PCBA           437,929   9          128
MUV            93,087    9          17
HIV            41,127    9          1
BACE           1,513     9          1
BBBP           2,050     9          1
Tox21          7,831     9          12
ToxCast        8,597     9          617
SIDER          1,427     9          27
ClinTox        1,484     9          2
References
Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, Vijay Pande, MoleculeNet: A Benchmark for Molecular Machine Learning, arXiv preprint, arXiv: 1703.00564, 2017.
-
__init__
(dataset_name: str, reload: bool = False, verbose: int = 10)[source]¶ Initialize a MoleculeNetDataset2018 instance from string identifier.
-
datasets_download_info
= {'BACE': {'data_directory_name': 'BACE', 'dataset_name': 'BACE', 'download_file_name': 'bace.csv'}, 'BBBP': {'data_directory_name': 'BBBP', 'dataset_name': 'BBBP', 'download_file_name': 'BBBP.csv'}, 'ClinTox': {'data_directory_name': 'ClinTox', 'dataset_name': 'ClinTox', 'download_file_name': 'clintox.csv.gz', 'extract_file_name': 'clintox.csv', 'extract_gz': True}, 'ESOL': {'data_directory_name': 'ESOL', 'dataset_name': 'ESOL', 'download_file_name': 'delaney-processed.csv'}, 'FreeSolv': {'data_directory_name': 'FreeSolv', 'dataset_name': 'FreeSolv', 'download_file_name': 'SAMPL.csv'}, 'HIV': {'data_directory_name': 'HIV', 'dataset_name': 'HIV', 'download_file_name': 'HIV.csv'}, 'Lipop': {'data_directory_name': 'Lipop', 'dataset_name': 'Lipop', 'download_file_name': 'Lipophilicity.csv'}, 'MUV': {'data_directory_name': 'MUV', 'dataset_name': 'MUV', 'download_file_name': 'muv.csv.gz', 'extract_file_name': 'muv.csv', 'extract_gz': True}, 'PCBA': {'data_directory_name': 'PCBA', 'dataset_name': 'PCBA', 'download_file_name': 'pcba.csv.gz', 'extract_file_name': 'pcba.csv', 'extract_gz': True}, 'SIDER': {'data_directory_name': 'SIDER', 'dataset_name': 'SIDER', 'download_file_name': 'sider.csv.gz', 'extract_file_name': 'sider.csv', 'extract_gz': True}, 'Tox21': {'data_directory_name': 'Tox21', 'dataset_name': 'Tox21', 'download_file_name': 'tox21.csv.gz', 'extract_file_name': 'tox21.csv', 'extract_gz': True}, 'ToxCast': {'data_directory_name': 'ToxCast', 'dataset_name': 'ToxCast', 'download_file_name': 'toxcast_data.csv.gz', 'extract_file_name': 'toxcast_data.csv', 'extract_gz': True}}¶
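Entries with `'extract_gz': True` name both a compressed download (`download_file_name`) and the file expected after decompression (`extract_file_name`). A minimal sketch of such an extraction step with the standard library is shown below; the function `extract_gz` and its `overwrite` flag are hypothetical and only illustrate the idea, and the CSV content is a toy stand-in written by the example itself.

```python
import gzip
import os
import shutil
import tempfile


def extract_gz(directory, download_file_name, extract_file_name, overwrite=False):
    """Hypothetical helper: decompress a .gz download next to itself."""
    source = os.path.join(directory, download_file_name)
    target = os.path.join(directory, extract_file_name)
    if os.path.exists(target) and not overwrite:
        return target  # already extracted, nothing to do
    with gzip.open(source, "rb") as f_in, open(target, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)  # stream, so large files are not held in memory
    return target


# Self-contained demonstration with a throw-away archive (toy content, not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    with gzip.open(os.path.join(tmp, "clintox.csv.gz"), "wt") as f:
        f.write("smiles,label_a,label_b\nCCO,1,0\n")
    path = extract_gz(tmp, "clintox.csv.gz", "clintox.csv")
    with open(path) as f:
        extracted_text = f.read()

print(extracted_text.splitlines()[0])
```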
-
datasets_prepare_data_info
= {'BACE': {'add_hydrogen': True, 'make_conformers': True, 'smiles_column_name': 'mol'}, 'BBBP': {'add_hydrogen': True, 'make_conformers': True, 'smiles_column_name': 'smiles'}, 'ClinTox': {'add_hydrogen': True, 'make_conformers': True, 'smiles_column_name': 'smiles'}, 'ESOL': {'add_hydrogen': True, 'make_conformers': True}, 'FreeSolv': {'add_hydrogen': True, 'make_conformers': True}, 'HIV': {'add_hydrogen': True, 'make_conformers': True}, 'Lipop': {'add_hydrogen': True, 'make_conformers': True}, 'MUV': {'add_hydrogen': True, 'make_conformers': True}, 'PCBA': {'add_hydrogen': True, 'make_conformers': True}, 'SIDER': {'add_hydrogen': True, 'make_conformers': True, 'smiles_column_name': 'smiles'}, 'Tox21': {'add_hydrogen': True, 'make_conformers': True, 'smiles_column_name': 'smiles'}, 'ToxCast': {'add_hydrogen': True, 'make_conformers': True, 'smiles_column_name': 'smiles'}}¶
-
datasets_read_in_memory_info
= {'BACE': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': 'Class'}, 'BBBP': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': 'p_np'}, 'ClinTox': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': [1, 2]}, 'ESOL': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': 'measured log solubility in mols per litre'}, 'FreeSolv': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': 'expt'}, 'HIV': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': 'HIV_active'}, 'Lipop': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': 'exp'}, 'MUV': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': slice(0, 17, None)}, 'PCBA': {'add_hydrogen': False, 'has_conformers': False, 'label_column_name': slice(0, 128, None)}, 'SIDER': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': slice(1, 28, None)}, 'Tox21': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': slice(0, 12, None)}, 'ToxCast': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': slice(1, 618, None)}}¶
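The `label_column_name` entries above take three forms: a column name (for example `'Class'`), a list of column positions (`[1, 2]`) and a `slice`. The sketch below shows one way these forms could be resolved against a CSV header; the function `select_label_columns` and the toy header are illustrative assumptions and do not reproduce the kgcnn implementation.

```python
def select_label_columns(header, label_column_name):
    """Resolve a name, a list of names/positions, or a slice to a list of column names."""
    if isinstance(label_column_name, str):
        return [label_column_name]
    if isinstance(label_column_name, slice):
        return header[label_column_name]
    # Otherwise treat it as a list that may mix integer positions and names.
    return [header[c] if isinstance(c, int) else c for c in label_column_name]


# Toy header: one SMILES column followed by two label columns.
header = ["smiles", "label_a", "label_b"]
print(select_label_columns(header, [1, 2]))
print(select_label_columns(header, slice(1, 3)))
print(select_label_columns(header, "label_b"))
```

A slice is convenient for multi-task datasets with many consecutive label columns, where listing every name would be impractical.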
kgcnn.data.datasets.MutagenicityDataset module¶
-
class
kgcnn.data.datasets.MutagenicityDataset.
MutagenicityDataset
(reload=False, verbose: int = 10)[source]¶ Bases:
kgcnn.data.datasets.GraphTUDataset2020.GraphTUDataset2020
Store and process Mutagenicity dataset from TUDatasets .
Mutagenicity is a chemical compound dataset of drugs, which can be categorized into two classes: mutagen and non-mutagen.
References
Riesen, K. and Bunke, H.: IAM Graph Database Repository for Graph Based Pattern Recognition and Machine Learning. In: da Vitoria Lobo, N. et al. (Eds.), SSPR&SPR 2008, LNCS, vol. 5342, pp. 287-297, 2008.
kgcnn.data.datasets.PROTEINSDataset module¶
-
class
kgcnn.data.datasets.PROTEINSDataset.
PROTEINSDataset
(reload=False, verbose: int = 10)[source]¶ Bases:
kgcnn.data.datasets.GraphTUDataset2020.GraphTUDataset2020
Store and process PROTEINS dataset from TUDatasets .
In Papers with Code : PROTEINS is a dataset of proteins that are classified as enzymes or non-enzymes. Nodes represent the amino acids and two nodes are connected by an edge if they are less than 6 Angstroms apart.
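The edge rule quoted above (connect two nodes if they are less than 6 Angstroms apart) can be sketched as a simple distance cutoff. The function name `edges_within_cutoff` and the toy coordinates are assumptions for illustration; the TUDataset files already contain the edges, so this step is not needed when loading the dataset.

```python
import math


def edges_within_cutoff(coordinates, cutoff=6.0):
    """Return undirected edges (i, j) with i < j for all pairs closer than the cutoff."""
    edges = []
    for i in range(len(coordinates)):
        for j in range(i + 1, len(coordinates)):
            if math.dist(coordinates[i], coordinates[j]) < cutoff:
                edges.append((i, j))
    return edges


# Three toy positions in Angstrom: the first two are 3 apart, the third is far away.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
edges = edges_within_cutoff(coords)
print(edges)
```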
References
K. M. Borgwardt, C. S. Ong, S. Schoenauer, S. V. N. Vishwanathan, A. J. Smola, and H. P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(Suppl 1):i47–i56, Jun 2005.
P. D. Dobson and A. J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. J. Mol. Biol., 330(4):771–783, Jul 2003.
kgcnn.data.datasets.QM7Dataset module¶
-
class
kgcnn.data.datasets.QM7Dataset.
QM7Dataset
(reload: bool = False, verbose: int = 10)[source]¶ Bases:
kgcnn.data.qm.QMDataset
,kgcnn.data.download.DownloadDataset
Store and process the QM7 dataset from Quantum Machine.
From Quantum Machine : This dataset is a subset of GDB-13 (a database of nearly 1 billion stable and synthetically accessible organic molecules) composed of all molecules of up to 23 atoms (including 7 heavy atoms C, N, O, and S), totalling 7165 molecules. We provide the Coulomb matrix representation of these molecules and their atomization energies computed similarly to the FHI-AIMS implementation of the Perdew-Burke-Ernzerhof hybrid functional (PBE0). This dataset features a large variety of molecular structures such as double and triple bonds, cycles, carboxy, cyanide, amide, alcohol and epoxy. The atomization energies are given in kcal/mol and are ranging from -800 to -2000 kcal/mol. The dataset is composed of three multidimensional arrays X (7165 x 23 x 23), Tm(7165) and P (5 x 1433) representing the inputs (Coulomb matrices), the labels (atomization energies) and the splits for cross-validation, respectively. The dataset also contain two additional multidimensional arrays Z (7165) and R (7165 x 3) representing the atomic charge and the cartesian coordinate of each atom in the molecules.
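For orientation, the Coulomb matrix mentioned above follows the definition of Rupp et al. (2012): the diagonal holds 0.5 * Z_i^2.4 and the off-diagonal entries hold Z_i * Z_j / |R_i - R_j|, in atomic units. A minimal sketch of that definition is given below; the function name and the toy two-atom input are illustrative, and the sorting or zero-padding to 23 x 23 used for the stored arrays is not reproduced.

```python
import math


def coulomb_matrix(charges, positions):
    """Build the Coulomb matrix from nuclear charges Z and Cartesian positions R."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                distance = math.dist(positions[i], positions[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Toy input: two hydrogen atoms separated by 1.4 length units.
cm = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0)])
print(cm)
```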
Here, the coordinates are given and converted with
QMDataset
to molecular structure. Labels are not scaled but have original units. Original splits are added to the dataset.
References
L. C. Blum, J.-L. Reymond, 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13, J. Am. Chem. Soc., 131:8732, 2009.
M. Rupp, A. Tkatchenko, K.-R. Müller, O. A. von Lilienfeld: Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning, Physical Review Letters, 108(5):058301, 2012.
-
download_info
= {'data_directory_name': 'qm7', 'dataset_name': 'QM7', 'download_file_name': 'qm7.mat', 'download_url': 'http://quantum-machine.org/data/qm7.mat', 'unpack_tar': False, 'unpack_zip': False}¶
-
prepare_data
(overwrite: bool = False, file_column_name: Optional[str] = None, make_sdf: bool = True)[source]¶ Pre-computation of molecular structure information in a sdf-file from a xyz-file or a folder of xyz-files.
If there is no single xyz-file, it will be created with the information of a csv-file with the same name.
- Parameters
- Returns
self
-
read_in_memory
(**kwargs)[source]¶ Read geometric information into memory.
Graph labels require a column specified by
label_column_name
.
- Parameters
label_column_name (str, list) – Name of labels for columns in CSV file.
nodes (list) – A list of node attributes as string. In place of names also functions can be added.
edges (list) – A list of edge attributes as string. In place of names also functions can be added.
graph (list) – A list of graph attributes as string. In place of names also functions can be added.
encoder_nodes (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_edges (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_graph (dict) – A dictionary of callable encoder where the key matches the attribute.
add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is False.
make_directed (bool) – Whether to have directed or undirected bonds. Default is False.
compute_partial_charges (str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is None.
sanitize (bool) – Whether to sanitize molecule. Default is False.
additional_callbacks (dict) – A dictionary whose keys are string attribute names which the elements of the dataset are supposed to have and whose values are callback function objects which implement how those attributes are derived from the MolecularGraphRDKit of the molecule in question or the row of the CSV file.
custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.
- Returns
self
kgcnn.data.datasets.QM7bDataset module¶
-
class
kgcnn.data.datasets.QM7bDataset.
QM7bDataset
(reload: bool = False, verbose: int = 10)[source]¶ Bases:
kgcnn.data.qm.QMDataset
,kgcnn.data.download.DownloadDataset
Store and process QM7b dataset from Quantum Machine .
From Quantum Machine : This dataset is an extension of the QM7 dataset for multitask learning where 13 additional properties (e.g. polarizability, HOMO and LUMO eigenvalues, excitation energies) have to be predicted at different levels of theory (ZINDO, SCS, PBE0, GW). Additional molecules comprising chlorine atoms are also included, totalling 7211 molecules.
The dataset is composed of two multidimensional arrays X (7211 x 23 x 23) and T (7211 x 14) representing the inputs (Coulomb matrices) and the labels (molecular properties) and one array names of size 14 listing the names of the different properties.
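Because the diagonal of a Coulomb matrix encodes the nuclear charge (0.5 * Z^2.4) and the off-diagonal entries encode Z_i * Z_j / distance, charges and pairwise distances can be read back from it. The sketch below shows only this first step under those assumptions; the function name is hypothetical, rows with a zero diagonal are treated as padding, and turning the distances into 3D coordinates would need an additional embedding step that is not shown here. Whether kgcnn proceeds exactly this way is not stated in this documentation.

```python
def charges_and_distances(matrix):
    """Recover integer nuclear charges and pairwise distances from a Coulomb matrix."""
    # Padded rows have a zero diagonal and are skipped.
    atoms = [i for i in range(len(matrix)) if matrix[i][i] > 0]
    # Invert the diagonal definition C_ii = 0.5 * Z_i ** 2.4.
    charges = [round((2.0 * matrix[i][i]) ** (1.0 / 2.4)) for i in atoms]
    n = len(atoms)
    distances = [[0.0] * n for _ in range(n)]
    for a, i in enumerate(atoms):
        for b, j in enumerate(atoms):
            if a != b:
                # Invert the off-diagonal definition C_ij = Z_i * Z_j / d_ij.
                distances[a][b] = charges[a] * charges[b] / matrix[i][j]
    return charges, distances


# Toy matrix: carbon (Z=6) and hydrogen (Z=1) at distance 2.0, plus one padding row.
toy = [
    [0.5 * 6 ** 2.4, 6.0 / 2.0, 0.0],
    [6.0 / 2.0, 0.5 * 1 ** 2.4, 0.0],
    [0.0, 0.0, 0.0],
]
charges, distances = charges_and_distances(toy)
print(charges, distances[0][1])
```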
Here, the Coulomb matrices are converted back into coordinates and with
QMDataset
to molecular structure. Labels are not scaled but have original units.
References
L. C. Blum, J.-L. Reymond, 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13, J. Am. Chem. Soc., 131:8732, 2009.
G. Montavon, M. Rupp, V. Gobre, A. Vazquez-Mayagoitia, K. Hansen, A. Tkatchenko, K.-R. Müller, O.A. von Lilienfeld, Machine Learning of Molecular Electronic Properties in Chemical Compound Space, New J. Phys. 15 095003, 2013.
-
download_info
= {'data_directory_name': 'qm7b', 'dataset_name': 'QM7b', 'download_file_name': 'qm7b.mat', 'download_url': 'http://quantum-machine.org/data/qm7b.mat', 'unpack_tar': False, 'unpack_zip': False}¶
-
prepare_data
(overwrite: bool = False, file_column_name: Optional[str] = None, make_sdf: bool = True)[source]¶ Pre-computation of molecular structure information in a sdf-file from a xyz-file or a folder of xyz-files.
If there is no single xyz-file, it will be created with the information of a csv-file with the same name.
- Parameters
- Returns
self
kgcnn.data.datasets.QM8Dataset module¶
-
class
kgcnn.data.datasets.QM8Dataset.
QM8Dataset
(reload: bool = False, verbose: int = 10)[source]¶ Bases:
kgcnn.data.qm.QMDataset
,kgcnn.data.download.DownloadDataset
Store and process QM8 dataset from MoleculeNet datasets.
From Quantum Machine : Due to its favorable computational efficiency, time-dependent (TD) density functional theory(DFT) enables the prediction of electronic spectra in a high-throughput manner across chemical space. Its predictions, however, can be quite inaccurate. We resolve this issue with machine learning models trained on deviations of reference second-order approximate coupled-cluster (CC2) singles and doubles spectra from TDDFT counterparts, or even from DFT gap. We applied this approach to low-lying singlet-singlet vertical electronic spectra of over 20000 synthetically feasible small organic molecules with up to eight CONF atoms. The prediction errors decay monotonously as a function of training set size. For a training set of 10000 molecules, CC2 excitation energies can be reproduced to within ±0.1 eV for the remaining molecules. Analysis of our spectral database via chromophore counting suggests that even higher accuracies can be achieved. Based on the evidence collected, we discuss open challenges associated with data-driven modeling of high-lying spectra and transition intensities.
Note
We take the pre-processed dataset from MoleculeNet .
References
L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52, 2864–2875, 2012.
R. Ramakrishnan, M. Hartmann, E. Tapavicza, O. A. von Lilienfeld, Electronic Spectra from TDDFT and Machine Learning in Chemical Space, J. Chem. Phys. 143 084111, 2015.
-
download_info
= {'data_directory_name': 'qm8', 'dataset_name': 'QM8', 'download_file_name': 'gdb8.tar.gz', 'download_url': 'https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/gdb8.tar.gz', 'unpack_directory_name': 'gdb8', 'unpack_tar': True, 'unpack_zip': False}¶
kgcnn.data.datasets.QM9Dataset module¶
-
class
kgcnn.data.datasets.QM9Dataset.
QM9Dataset
(reload: bool = False, verbose: int = 10)[source]¶ Bases:
kgcnn.data.qm.QMDataset
,kgcnn.data.download.DownloadDataset
Store and process the QM9 dataset from Quantum Machine.
Dataset of 134k stable small organic molecules made up of C,H,O,N,F.
From Quantum Machine : Computational de novo design of new drugs and materials requires rigorous and unbiased exploration of chemical compound space. However, large uncharted territories persist due to its size scaling combinatorially with molecular size. We report computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF. These molecules correspond to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical universe of 166 billion organic molecules. We report geometries minimal in energy, corresponding harmonic frequencies, dipole moments, polarizabilities, along with energies, enthalpies, and free energies of atomization. All properties were calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry. Furthermore, for the predominant stoichiometry, C7H10O2, there are 6,095 constitutional isomers among the 134k molecules. We report energies, enthalpies, and free energies of atomization at the more accurate G4MP2 level of theory for all of them. As such, this data set provides quantum chemical properties for a relevant, consistent, and comprehensive chemical space of small organic molecules. This database may serve the benchmarking of existing methods, development of new methods, such as hybrid quantum mechanics/machine learning, and systematic identification of structure-property relationships.
Labels include geometric, energetic, electronic, and thermodynamic properties. Typically, a random 10% validation and 10% test set are used. In literature, test errors are given as MAE and for energies are in [eV].
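The evaluation convention described above (random 10% validation and 10% test sets, errors reported as MAE) can be sketched with the standard library alone. The function names, the seed 42 and the toy label values are illustrative assumptions; only the split fractions and the molecule count of 133,885 come from the text.

```python
import random


def random_split(n_samples, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Return disjoint train, validation and test index lists from a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * val_fraction)
    n_test = int(n_samples * test_fraction)
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test


def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between two equally long sequences of numbers."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


train, val, test = random_split(133885)
print(len(train), len(val), len(test))

# Toy predictions, only to show the metric.
mae = mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
print(mae)
```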
Molecules that have a different smiles code after convergence can be removed with
remove_uncharacterized
. Also labels with removed atomization energy are generated.
    from kgcnn.data.datasets.QM9Dataset import QM9Dataset
    dataset = QM9Dataset(reload=True)
    print(dataset[0])
References
L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52, 2864–2875, 2012.
R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld, Quantum chemistry structures and properties of 134 kilo molecules, Scientific Data 1, 140022, 2014.
-
download_info
= {'data_directory_name': 'qm9', 'dataset_name': 'QM9', 'download_file_name': 'dsgdb9nsd.xyz.tar.bz2', 'download_url': 'https://ndownloader.figshare.com/files/3195389', 'unpack_directory_name': 'dsgdb9nsd.xyz', 'unpack_tar': True, 'unpack_zip': False}¶
-
prepare_data
(overwrite: bool = False, file_column_name: Optional[str] = None, make_sdf: bool = True)[source]¶ Process data by loading all single xyz-files and storing all pickled information to file. The single files are deleted afterwards, so the tar-file must be re-extracted for an overwrite.
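To illustrate what loading a single xyz-file involves, here is a minimal sketch of a parser for the common xyz layout: atom count, comment line, then one line per atom with element symbol and three coordinates. The function name is hypothetical and this is not the kgcnn reader. The replacement of `*^` by `e` assumes that some coordinate values use this exponent notation; further columns or trailing lines in a file are simply ignored.

```python
def parse_xyz_block(text):
    """Parse one xyz block into element symbols, coordinates and the comment line."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    comment = lines[1]
    symbols, coords = [], []
    for line in lines[2:2 + n_atoms]:
        parts = line.split()
        symbols.append(parts[0])
        # Only the first three numeric columns are read as x, y, z.
        coords.append(tuple(float(v.replace("*^", "e")) for v in parts[1:4]))
    return symbols, coords, comment


# Toy water molecule, coordinates for illustration only.
example = """3
water, toy coordinates
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""
symbols, coords, comment = parse_xyz_block(example)
print(symbols, coords[1], comment)
```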
-
read_in_memory
(**kwargs)[source]¶ Read geometric information into memory.
Graph labels require a column specified by
label_column_name
.
- Parameters
label_column_name (str, list) – Name of labels for columns in CSV file.
nodes (list) – A list of node attributes as string. In place of names also functions can be added.
edges (list) – A list of edge attributes as string. In place of names also functions can be added.
graph (list) – A list of graph attributes as string. In place of names also functions can be added.
encoder_nodes (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_edges (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_graph (dict) – A dictionary of callable encoder where the key matches the attribute.
add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is False.
make_directed (bool) – Whether to have directed or undirected bonds. Default is False.
compute_partial_charges (str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is None.
sanitize (bool) – Whether to sanitize molecule. Default is False.
additional_callbacks (dict) – A dictionary whose keys are string attribute names which the elements of the dataset are supposed to have and whose values are callback function objects which implement how those attributes are derived from the MolecularGraphRDKit of the molecule in question or the row of the CSV file.
custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.
- Returns
self
kgcnn.data.datasets.QM9MolNetDataset module¶
-
class
kgcnn.data.datasets.QM9MolNetDataset.
QM9MolNetDataset
(reload: bool = False, verbose: int = 10)[source]¶ Bases:
kgcnn.data.qm.QMDataset
,kgcnn.data.download.DownloadDataset
Store and process QM9 dataset from MoleculeNet dataset.
This is the QM9 dataset as preprocessed from MoleculeNet with structure and labels. See
kgcnn.data.datasets.QM9Dataset
for documentation and comparison.
References
L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52, 2864–2875, 2012.
R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld, Quantum chemistry structures and properties of 134 kilo molecules, Scientific Data 1, 140022, 2014.
-
download_info
= {'data_directory_name': 'qm9_mol_net', 'dataset_name': 'QM9MolNet', 'download_file_name': 'gdb9.tar.gz', 'download_url': 'https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/gdb9.tar.gz', 'unpack_directory_name': 'gdb9', 'unpack_tar': True, 'unpack_zip': False}¶
kgcnn.data.datasets.SIDERDataset module¶
-
class
kgcnn.data.datasets.SIDERDataset.
SIDERDataset
(reload=False, verbose: int = 10)[source]¶ Bases:
kgcnn.data.datasets.MoleculeNetDataset2018.MoleculeNetDataset2018
Store and process ‘SIDER’ dataset from MoleculeNet database.
Compare reference: DeepChem reading: The Side Effect Resource (SIDER) is a database of marketed drugs and adverse drug reactions (ADR). The version of the SIDER dataset in DeepChem has grouped drug side effects into 27 system organ classes following MedDRA classifications measured for 1427 approved drugs.
Random splitting is recommended for this dataset.
The raw data csv file contains columns below:
“smiles”: SMILES representation of the molecular structure
“Hepatobiliary disorders” ~ “Injury, poisoning and procedural complications”: Recorded side effects for the drug. Please refer to http://sideeffects.embl.de/se/?page=98 for details on ADRs.
References
Kuhn, Michael, et al. “The SIDER database of drugs and side effects.” Nucleic acids research 44.D1 (2015): D1075-D1079.
Altae-Tran, Han, et al. “Low data drug discovery with one-shot learning.” ACS central science 3.4 (2017): 283-293.
Medical Dictionary for Regulatory Activities. http://www.meddra.org/
-
read_in_memory
(**kwargs)[source]¶ Load list of molecules from cached SDF-file into memory. File name must be given in file_name and path information in the constructor of this class.
It further checks the csv-file for graph labels specified by label_column_name. Labels that do not have valid smiles or molecule in the SDF-file are also skipped, but added as None to keep the index and the molecule assignment.
Set further molecular attributes or features by string identifier. Requires MolecularGraphRDKit. Default values are features that have been used by Luo et al. (2019).
The argument additional_callbacks allows adding custom properties to each element of the dataset. It is a dictionary whose string keys are the names of the properties and whose values are callable function objects which define how the property is derived from either the MolecularGraphRDKit or the corresponding row of the original CSV file. Those callback functions accept two parameters:
mg: The MolecularGraphRDKit instance of the molecule.
ds: A pandas data series that matches the data in the CSV file for the specific molecule.
Example:
    from os import linesep
    import numpy as np
    from kgcnn.data.moleculenet import MoleculeNetDataset

    csv = f"index,name,label,smiles{linesep}1,Propanolol,1,[Cl].CC(C)NCC(O)COc1cccc2ccccc12"
    with open('/tmp/moleculenet_example.csv', mode='w') as file:
        file.write(csv)

    dataset = MoleculeNetDataset('/tmp', 'example', 'moleculenet_example.csv')
    dataset.prepare_data(smiles_column_name='smiles')
    dataset.read_in_memory(label_column_name='label')
    dataset.set_attributes(
        nodes=['Symbol'],
        encoder_nodes={'Symbol': OneHotEncoder(['C', 'O'], dtype='str')},
        edges=['BondType'],
        encoder_edges={'BondType': int},
        additional_callbacks={
            # It is important that the callbacks return a numpy array, even if it is just a single element.
            'name': lambda mg, ds: np.array(ds['name'], dtype='str')
        }
    )

    mol: dict = dataset[0]
    mol['node_attributes']  # np array of one hot encoded atom type per node
    mol['edge_attributes']  # int value representing the bond type
    mol['name']  # Array of a single string which is the name from the original CSV data
- Parameters
label_column_name (str) – Column name in the csv-file where to take graph labels from. For multi-targets you can supply a list of column names or positions. A slice can be provided for selecting columns as graph labels. Default is None.
nodes (list) – A list of node attributes as strings. Functions can also be given in place of names.
edges (list) – A list of edge attributes as strings. Functions can also be given in place of names.
graph (list) – A list of graph attributes as strings. Functions can also be given in place of names.
encoder_nodes (dict) – A dictionary of callable encoders where each key matches a node attribute.
encoder_edges (dict) – A dictionary of callable encoders where each key matches an edge attribute.
encoder_graph (dict) – A dictionary of callable encoders where each key matches a graph attribute.
add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is False.
has_conformers (bool) – Whether to add node coordinates from conformer. Default is True.
make_directed (bool) – Whether to have directed or undirected bonds. Default is False.
sanitize (bool) – Whether to sanitize molecule. Default is True.
compute_partial_charges (str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is None.
additional_callbacks (dict) – A dictionary whose keys are the string attribute names that the elements of the dataset are supposed to have and whose values are callback function objects that implement how those attributes are derived from the MolecularGraphRDKit of the molecule in question or the row of the CSV file.
custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.
- Returns
self
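The label_column_name parameter described above accepts a single column name, a list of names or positions, or a slice. The following is a minimal sketch, using only the standard library, of how such a selection could be resolved against a CSV header. The helper resolve_label_columns and the toy CSV are hypothetical and not part of kgcnn; they only mirror the documented options.

```python
import csv
import io
from typing import List, Sequence, Union

Selector = Union[str, int, slice, Sequence[Union[str, int]]]


def resolve_label_columns(header: List[str], selector: Selector) -> List[str]:
    """Hypothetical helper: map a column selector to a list of column names."""
    if isinstance(selector, slice):
        return header[selector]
    if isinstance(selector, (str, int)):
        selector = [selector]
    # Positions are translated to names; names are checked against the header.
    names = [header[s] if isinstance(s, int) else s for s in selector]
    missing = [n for n in names if n not in header]
    if missing:
        raise KeyError(f"Columns not found in CSV header: {missing}")
    return names


# Toy CSV with two label columns (all values are made up for illustration).
raw = "index,name,label_a,label_b,smiles\n1,Ethanol,0,1,CCO\n2,Benzene,1,0,c1ccccc1\n"
rows = list(csv.DictReader(io.StringIO(raw)))
header = list(rows[0].keys())

single = resolve_label_columns(header, "label_a")       # one column by name
by_position = resolve_label_columns(header, [2, 3])     # list of positions
by_slice = resolve_label_columns(header, slice(2, 4))   # slice over the header

# One list of graph labels per molecule, in the order of the selected columns.
labels = [[int(row[c]) for c in by_slice] for row in rows]
```

All three selectors address the same columns here, which is why a slice is convenient when many adjacent label columns are used as targets.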
kgcnn.data.datasets.Tox21MolNetDataset module¶
-
class
kgcnn.data.datasets.Tox21MolNetDataset.
Tox21MolNetDataset
(reload=False, verbose: int = 10, remove_nan: bool = False)[source]¶ Bases:
kgcnn.data.datasets.MoleculeNetDataset2018.MoleculeNetDataset2018
Store and process ‘TOX21’ dataset from MoleculeNet database.
Compare reference: DeepChem reading: The “Toxicology in the 21st Century” (Tox21) initiative created a public database measuring toxicity of compounds, which has been used in the 2014 Tox21 Data Challenge. This dataset contains qualitative toxicity measurements for 8k compounds on 12 different targets, including nuclear receptors and stress response pathways.
Random splitting is recommended for this dataset.
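A seeded random split over dataset indices can be sketched as follows. The 80/10/10 ratio, the seed and the function name random_split are illustrative assumptions, not values prescribed by kgcnn or MoleculeNet.

```python
import random
from typing import List, Tuple


def random_split(n_items: int, frac_train: float = 0.8, frac_valid: float = 0.1,
                 seed: int = 42) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle the indices 0..n_items-1 and cut them into three disjoint parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # A local RNG keeps the split reproducible.
    n_train = int(frac_train * n_items)
    n_valid = int(frac_valid * n_items)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # Whatever remains goes to the test set.
    return train, valid, test


train_idx, valid_idx, test_idx = random_split(1000)
```

The returned index lists can then be used to select the corresponding molecules and labels from the loaded dataset.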
The raw data csv file contains the columns below:
“smiles”: SMILES representation of the molecular structure
“NR-XXX”: Nuclear receptor signaling bioassays results
“SR-XXX”: Stress response bioassays results
Please refer to https://tripod.nih.gov/tox21/challenge/data.jsp for details.
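To make the column layout concrete, the sketch below parses a tiny Tox21-style table with the standard library, groups the assay columns by their NR-/SR- prefix and maps empty cells to None. The two rows and their values are made up for illustration. NR-AR and SR-ARE are assumed here as examples of the NR-XXX and SR-XXX naming pattern, and how the class itself treats missing values (see remove_nan) is not shown.

```python
import csv
import io
from typing import Dict, List, Optional

# Made-up rows; an empty cell stands for an assay that was not measured.
raw = (
    "NR-AR,SR-ARE,smiles\n"
    "0,1,CCO\n"
    ",0,c1ccccc1\n"
)

rows = list(csv.DictReader(io.StringIO(raw)))
columns = list(rows[0].keys())

# Group the target columns by assay family using the documented prefixes.
nuclear_receptor = [c for c in columns if c.startswith("NR-")]
stress_response = [c for c in columns if c.startswith("SR-")]
targets = nuclear_receptor + stress_response


def parse_label(cell: str) -> Optional[int]:
    """Convert a CSV cell to 0/1, or None if the measurement is missing."""
    return int(float(cell)) if cell.strip() else None


labels: Dict[str, List[Optional[int]]] = {
    row["smiles"]: [parse_label(row[t]) for t in targets] for row in rows
}

# Molecules for which every selected assay has a value.
complete = [smiles for smiles, values in labels.items() if None not in values]
```

Keeping missing measurements as None, rather than dropping the row, preserves the remaining valid targets of a molecule for multi-task training.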
References
Tox21 Challenge. https://tripod.nih.gov/tox21/challenge/
-
__init__
(reload=False, verbose: int = 10, remove_nan: bool = False)[source]¶ Initialize Tox21 dataset.
-
read_in_memory
(**kwargs)[source]¶ Load the list of molecules from the cached SDF-file into memory. The file name must be given in file_name and the path information in the constructor of this class.
It further checks the csv-file for graph labels specified by label_column_name. Labels that do not have valid smiles or a molecule in the SDF-file are also skipped, but added as None to keep the index and the molecule assignment.
Set further molecular attributes or features by string identifier. Requires MolecularGraphRDKit. Default values are the features that have been used by Luo et al. (2019).
The argument additional_callbacks allows adding custom properties to each element of the dataset. It is a dictionary whose string keys are the names of the properties and whose values are callable function objects that define how the property is derived from either the MolecularGraphRDKit or the corresponding row of the original CSV file. Those callback functions accept two parameters:
mg: The MolecularGraphRDKit instance of the molecule.
ds: A pandas data series that matches the data in the CSV file for the specific molecule.
Example:
    from os import linesep
    import numpy as np
    # MoleculeNetDataset and OneHotEncoder are assumed to be imported from kgcnn.

    csv = f"index,name,label,smiles{linesep}1,Propanolol,1,[Cl].CC(C)NCC(O)COc1cccc2ccccc12"
    with open('/tmp/moleculenet_example.csv', mode='w') as file:
        file.write(csv)

    dataset = MoleculeNetDataset('/tmp', 'example', 'moleculenet_example.csv')
    dataset.prepare_data(smiles_column_name='smiles')
    dataset.read_in_memory(label_column_name='label')
    dataset.set_attributes(
        nodes=['Symbol'],
        encoder_nodes={'Symbol': OneHotEncoder(['C', 'O'], dtype='str')},
        edges=['BondType'],
        encoder_edges={'BondType': int},
        additional_callbacks={
            # It is important that the callbacks return a numpy array, even if it is just a single element.
            'name': lambda mg, ds: np.array(ds['name'], dtype='str')
        }
    )

    mol: dict = dataset[0]
    mol['node_attributes']  # np array of one hot encoded atom type per node
    mol['edge_attributes']  # int value representing the bond type
    mol['name']  # Array of a single string which is the name from the original CSV data
- Parameters
label_column_name (str) – Column name in the csv-file where to take graph labels from. For multi-targets you can supply a list of column names or positions. A slice can be provided for selecting columns as graph labels. Default is None.
nodes (list) – A list of node attributes as strings. Functions can also be given in place of names.
edges (list) – A list of edge attributes as strings. Functions can also be given in place of names.
graph (list) – A list of graph attributes as strings. Functions can also be given in place of names.
encoder_nodes (dict) – A dictionary of callable encoders where each key matches a node attribute.
encoder_edges (dict) – A dictionary of callable encoders where each key matches an edge attribute.
encoder_graph (dict) – A dictionary of callable encoders where each key matches a graph attribute.
add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is False.
has_conformers (bool) – Whether to add node coordinates from conformer. Default is True.
make_directed (bool) – Whether to have directed or undirected bonds. Default is False.
sanitize (bool) – Whether to sanitize molecule. Default is True.
compute_partial_charges (str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is None.
additional_callbacks (dict) – A dictionary whose keys are the string attribute names that the elements of the dataset are supposed to have and whose values are callback function objects that implement how those attributes are derived from the MolecularGraphRDKit of the molecule in question or the row of the CSV file.
custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.
- Returns
self
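The additional_callbacks argument described above expects functions of the form (mg, ds) that return numpy arrays. The sketch below shows two such callbacks written as named functions and exercised in isolation. It makes several assumptions: a plain dict stands in for the pandas data series, mg is set to None because neither callback uses the molecule object, and the property name name_length is invented for illustration.

```python
import numpy as np


def name_callback(mg, ds):
    """Return the molecule name from the CSV row as a numpy string array."""
    return np.array(ds["name"], dtype="str")


def name_length_callback(mg, ds):
    """Return a one-element integer array, since callbacks should return arrays."""
    return np.array([len(ds["name"])], dtype="int")


# Keys become the property names attached to each element of the dataset.
additional_callbacks = {
    "name": name_callback,
    "name_length": name_length_callback,  # hypothetical property name
}

# Stand-ins for the two callback arguments (assumptions, see the text above).
row = {"index": 1, "name": "Propanolol", "label": 1}
mg = None

derived = {key: fn(mg, row) for key, fn in additional_callbacks.items()}
```

Named functions are easier to test on their own than lambdas, which helps when a callback grows beyond a single lookup.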