kgcnn.data.datasets package

Submodules

kgcnn.data.datasets.ClinToxDataset module

class kgcnn.data.datasets.ClinToxDataset.ClinToxDataset(reload=False, verbose: int = 10, label_index: int = 0)[source]

Bases: kgcnn.data.datasets.MoleculeNetDataset2018.MoleculeNetDataset2018

Store and process ‘ClinTox’ dataset from MoleculeNet database. Class inherits from MoleculeNetDataset2018 and downloads dataset on class initialization.

Compare the DeepChem reference, which reads: Qualitative data of drugs approved by the FDA and those that have failed clinical trials for toxicity reasons. Random splitting is recommended for this dataset.

The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity. The dataset includes two classification tasks for 1491 drug compounds with known chemical structures:

  • clinical trial toxicity (or absence of toxicity)

  • FDA approval status.

The list of FDA-approved drugs is compiled from the SWEETLEAD database, and the list of drugs that failed clinical trials for toxicity reasons is compiled from the Aggregate Analysis of ClinicalTrials.gov (AACT) database.

References

  1. Gayvert, Kaitlyn M., Neel S. Madhukar, and Olivier Elemento. A data-driven approach to predicting successes and failures of clinical trials. Cell chemical biology 23.10 (2016): 1294-1301.

  2. Artemov, Artem V., et al. Integrated deep learned transcriptomic and structure-based predictor of clinical trials outcomes. bioRxiv (2016): 095653.

  3. Novick, Paul A., et al. SWEETLEAD: an in silico database of approved drugs, regulated chemicals, and herbal isolates for computer-aided drug discovery. PloS one 8.11 (2013): e79568.

  4. Aggregate Analysis of ClinicalTrials.gov (AACT) Database. https://www.ctti-clinicaltrials.org/aact-database .

__init__(reload=False, verbose: int = 10, label_index: int = 0)[source]

Initialize ClinTox dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

  • label_index (int) – Which information should be taken as binary label. Default is 0. Determines the positive class label, which is ‘approved’ by default.
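
As a rough illustration of what label_index selects, the sketch below assumes the two ClinTox tasks are stored as columns of a label table, with approval first (matching the default described above). The rows and the helper select_label are hypothetical and not part of kgcnn.

```python
# Hypothetical two-task label rows: (FDA approval, clinical-trial toxicity).
# The values are made up for illustration and are not taken from ClinTox.
labels = [
    (1, 0),
    (0, 1),
    (1, 0),
]


def select_label(rows, label_index=0):
    """Return the single binary task at position `label_index` for every row."""
    return [row[label_index] for row in rows]


approved = select_label(labels)               # default: first column
toxic = select_label(labels, label_index=1)   # second task
```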

read_in_memory(**kwargs)[source]

Load list of molecules from cached SDF-file into memory. File name must be given in file_name and path information in the constructor of this class.

It further checks the csv-file for graph labels specified by label_column_name. Labels that do not have valid smiles or molecule in the SDF-file are also skipped, but added as None to keep the index and the molecule assignment.
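
The index-preserving behaviour described above can be pictured with a small sketch. It is not the library's implementation; the helper align_labels and the placeholder molecules are assumptions made for illustration.

```python
def align_labels(molecules, labels):
    """Keep one label slot per CSV row; rows without a valid molecule get None."""
    return [
        label if mol is not None else None
        for mol, label in zip(molecules, labels)
    ]


# Hypothetical parsed molecules; None marks an entry that could not be read.
molecules = ["mol_a", None, "mol_c"]
labels = [1, 0, 1]

aligned = align_labels(molecules, labels)
# Positions still match the CSV rows, so index i refers to the same molecule.
valid_indices = [i for i, label in enumerate(aligned) if label is not None]
```

Keeping the None placeholders means a later filtering step can drop invalid entries without shifting the remaining indices.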

Set further molecular attributes or features by string identifier. Requires MolecularGraphRDKit. Default values are features that have been used by Luo et al (2019).

The argument additional_callbacks allows adding custom properties to each element of the dataset. It is a dictionary whose string keys are the names of the properties and the values are callable function objects which define how the property is derived from either the MolecularGraphRDKit or the corresponding row of the original CSV file. Those callback functions accept two parameters:

  • mg: The MolecularGraphRDKit instance of the molecule

  • ds: A pandas data series that match data in the CSV file for the specific molecule.

Example:

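from os import linesep
import numpy as np
from kgcnn.data.moleculenet import MoleculeNetDataset
# Import path of OneHotEncoder is an assumption; it may differ between kgcnn versions.
from kgcnn.mol.encoder import OneHotEncoder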
csv = f"index,name,label,smiles{linesep}1,Propanolol,1,[Cl].CC(C)NCC(O)COc1cccc2ccccc12"
with open('/tmp/moleculenet_example.csv', mode='w') as file:
    file.write(csv)

dataset = MoleculeNetDataset('/tmp', 'example', 'moleculenet_example.csv')
dataset.prepare_data(smiles_column_name='smiles')
dataset.read_in_memory(label_column_name='label')
dataset.set_attributes(
    nodes=['Symbol'],
    encoder_nodes={'Symbol': OneHotEncoder(['C', 'O'], dtype='str')},
    edges=['BondType'],
    encoder_edges={'BondType': int},
    additional_callbacks={
        # It is important that the callbacks return a numpy array, even if it is just a single element.
        'name': lambda mg, ds: np.array(ds['name'], dtype='str')
    }
)

mol: dict = dataset[0]
mol['node_attributes']  # np array of one hot encoded atom type per node
mol['edge_attributes']  # int value representing the bond type
mol['name']  # Array of a single string which is the name from the original CSV data
Parameters
  • label_column_name (str) – Column name in the csv-file where to take graph labels from. For multi-targets you can supply a list of column names or positions. A slice can be provided for selecting columns as graph labels. Default is None.

  • nodes (list) – A list of node attributes as string. In place of names also functions can be added.

  • edges (list) – A list of edge attributes as string. In place of names also functions can be added.

  • graph (list) – A list of graph attributes as string. In place of names also functions can be added.

  • encoder_nodes (dict) – A dictionary of callable encoder where the key matches the attribute.

  • encoder_edges (dict) – A dictionary of callable encoder where the key matches the attribute.

  • encoder_graph (dict) – A dictionary of callable encoder where the key matches the attribute.

  • add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is False.

  • has_conformers (bool) – Whether to add node coordinates from conformer. Default is True.

  • make_directed (bool) – Whether to have directed or undirected bonds. Default is False.

  • sanitize (bool) – Whether to sanitize molecule. Default is True.

  • compute_partial_charges (str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is None.

  • additional_callbacks (dict) – A dictionary whose keys are string attribute names which the elements of the dataset are supposed to have and the elements are callback function objects which implement how those attributes are derived from the MolecularGraphRDKit of the molecule in question or the row of the CSV file.

  • custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.

Returns

self

kgcnn.data.datasets.CoraDataset module

class kgcnn.data.datasets.CoraDataset.CoraDataset(reload=False, verbose: int = 10)[source]

Bases: kgcnn.data.download.DownloadDataset, kgcnn.data.base.MemoryGraphDataset

Store and process (full) Cora dataset.

Data loaded from https://github.com/abojchevski/graph2gauss/raw/master/data .

From the paper “Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking”. Nodes represent documents and edges represent citation links.

References

  1. Bojchevski, Aleksandar and Günnemann, Stephan, Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking, arXiv, doi:10.48550/ARXIV.1707.03815, 2017

__init__(reload=False, verbose: int = 10)[source]

Initialize full Cora dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.
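
The "60=silent" convention lines up with Python's standard logging levels, where the highest built-in level is CRITICAL (50). Assuming verbose is used as a logger threshold (an assumption; the text above does not say how the value is consumed), the effect can be sketched as follows. The helper make_logger and the logger names are hypothetical.

```python
import logging


def make_logger(name, verbose=10):
    """Create a logger whose threshold is the `verbose` value (assumed mapping)."""
    logger = logging.getLogger(name)
    logger.setLevel(verbose)
    return logger


chatty = make_logger("example.chatty", verbose=10)   # 10 == logging.DEBUG
silent = make_logger("example.silent", verbose=60)   # above logging.CRITICAL (50)

chatty_shows_info = chatty.isEnabledFor(logging.INFO)
silent_shows_critical = silent.isEnabledFor(logging.CRITICAL)
```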

download_info = {'data_directory_name': 'cora', 'dataset_name': 'Cora', 'download_file_name': 'cora.npz', 'download_url': 'https://github.com/abojchevski/graph2gauss/raw/master/data/cora.npz', 'unpack_directory_name': None, 'unpack_tar': False, 'unpack_zip': False}
read_in_memory()[source]

Load full Cora data into memory and already split into items.

kgcnn.data.datasets.CoraLuDataset module

class kgcnn.data.datasets.CoraLuDataset.CoraLuDataset(reload=False, verbose: int = 10)[source]

Bases: kgcnn.data.download.DownloadDataset, kgcnn.data.base.MemoryGraphDataset

Store and process Cora dataset after Lu et al. 2003.

The description on Papers with Code reads: The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
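
The 0/1-valued word vector mentioned above can be illustrated with a toy dictionary. The words and the helper binary_word_vector are made up for this sketch; the real dictionary has 1433 entries.

```python
def binary_word_vector(document_words, dictionary):
    """Return a 0/1 vector marking which dictionary words occur in the document."""
    present = set(document_words)
    return [1 if word in present else 0 for word in dictionary]


# Toy dictionary standing in for the real word list.
dictionary = ["graph", "kernel", "neural", "network"]
vector = binary_word_vector(["neural", "network", "neural"], dictionary)
```

Repeated words do not change the result, since only presence or absence is recorded.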

Downloaded from: https://linqs-data.soe.ucsc.edu/public/lbc/ .

References

  1. McCallum, A.K., Nigam, K., Rennie, J. et al. Automating the Construction of Internet Portals with Machine Learning. Information Retrieval 3, 127–163 (2000). https://doi.org/10.1023/A:1009953814988

  2. Lu, Qing and Lise Getoor. “Link-Based Classification.” Encyclopedia of Machine Learning and Data Mining (2003).

__init__(reload=False, verbose: int = 10)[source]

Initialize Cora dataset after Lu et al. 2003.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

download_info = {'data_directory_name': 'cora_lu', 'dataset_name': 'cora_lu', 'download_file_name': 'cora.tgz', 'download_url': 'https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz', 'unpack_directory_name': 'cora_lu', 'unpack_tar': True, 'unpack_zip': False}
read_in_memory()[source]

Load Cora data into memory and already split into items.
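
The download_info entries above carry flags such as unpack_tar and unpack_zip. How the library acts on them is not shown here; the following is a sketch of one plausible interpretation using only the standard library. The function unpack_download is hypothetical.

```python
import tarfile
import zipfile
from pathlib import Path


def unpack_download(info, directory):
    """Unpack the downloaded archive according to the flags in `info`.

    Hypothetical helper: returns the target directory, or None if the
    entry requests no unpacking (for example a plain .npz file).
    """
    if not (info.get("unpack_tar") or info.get("unpack_zip")):
        return None
    archive = Path(directory) / info["download_file_name"]
    target = Path(directory) / info["unpack_directory_name"]
    target.mkdir(parents=True, exist_ok=True)
    if info.get("unpack_tar"):
        # tarfile detects gzip/bzip2 compression automatically.
        with tarfile.open(archive) as tar:
            tar.extractall(target)
    else:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
    return target
```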

kgcnn.data.datasets.ESOLDataset module

class kgcnn.data.datasets.ESOLDataset.ESOLDataset(reload=False, verbose: int = 10)[source]

Bases: kgcnn.data.datasets.MoleculeNetDataset2018.MoleculeNetDataset2018

Store and process ‘ESOL’ dataset from MoleculeNet database. Class inherits from MoleculeNetDataset2018 and downloads dataset on class initialization.

Compare the DeepChem reference, which reads:

Water solubility data (log solubility in mols per litre) for common organic small molecules. Random or Scaffold splitting is recommended for this dataset. Description in DeepChem reads: ‘The Delaney (ESOL) dataset is a regression dataset containing structures and water solubility data for 1128 compounds. The dataset is widely used to validate machine learning models on estimating solubility directly from molecular structures (as encoded in SMILES strings).’
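
Since the labels are base-10 log solubilities in mol/L, a prediction can be converted back to a concentration by exponentiation. The helper name below is hypothetical and the input value is made up.

```python
def log_solubility_to_mol_per_litre(log_s):
    """Convert a base-10 log solubility value to a concentration in mol/L."""
    return 10.0 ** log_s


# A log solubility of -2 corresponds to one hundredth of a mol per litre.
concentration = log_solubility_to_mol_per_litre(-2.0)
```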

References

  1. Delaney, John S. ESOL: estimating aqueous solubility directly from molecular structure. Journal of chemical information and computer sciences 44.3 (2004): 1000-1005.

__init__(reload=False, verbose: int = 10)[source]

Initialize ESOL dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

kgcnn.data.datasets.FreeSolvDataset module

class kgcnn.data.datasets.FreeSolvDataset.FreeSolvDataset(reload=False, verbose: int = 10)[source]

Bases: kgcnn.data.datasets.MoleculeNetDataset2018.MoleculeNetDataset2018

Store and process ‘FreeSolv’ dataset from MoleculeNet database. Class inherits from MoleculeNetDataset2018 and downloads dataset on class initialization. Compare the DeepChem reference, which reads: Experimental and calculated hydration free energy of small molecules in water. Description in DeepChem reads: ‘The FreeSolv dataset is a collection of experimental and calculated hydration free energies for small molecules in water, along with their experimental values. Here, we are using a modified version of the dataset with the molecule SMILES string and the corresponding experimental hydration free energies.’

Random splitting is recommended for this dataset.
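
A random split of the kind recommended above can be sketched with the standard library. The function random_split, the 80/20 ratio, and the seed are illustrative choices, not values prescribed by the dataset.

```python
import random


def random_split(num_items, test_fraction=0.2, seed=42):
    """Shuffle item indices and split them into train and test index lists."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)   # seeded for reproducibility
    num_test = int(round(num_items * test_fraction))
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = random_split(10, test_fraction=0.2)
```

Fixing the seed keeps the split reproducible between runs, which helps when comparing models.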

References

  1. Lukasz Maziarka, et al. Molecule Attention Transformer. NeurIPS 2019 arXiv:2002.08264v1 [cs.LG].

  2. Mobley DL, Guthrie JP. FreeSolv: a database of experimental and calculated hydration free energies, with input files. J Comput Aided Mol Des. 2014;28(7):711-720. doi:10.1007/s10822-014-9747-x

__init__(reload=False, verbose: int = 10)[source]

Initialize FreeSolv dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

kgcnn.data.datasets.GraphTUDataset2020 module

class kgcnn.data.datasets.GraphTUDataset2020.GraphTUDataset2020(dataset_name: str, reload: bool = False, verbose: int = 10)[source]

Bases: kgcnn.data.tudataset.GraphTUDataset, kgcnn.data.download.DownloadDataset

Base class for loading graph datasets published by TU Dortmund University.

This general base class has functionality to load TUDatasets in a generic way.

Note

Subclasses of GraphTUDataset in kgcnn.data.datasets should still be made if the dataset needs more refined post-processing. Not all datasets provide all types of graph properties, such as edge_attributes.

References

  1. TUDataset: A collection of benchmark datasets for learning with graphs. Christopher Morris and Nils M. Kriege and Franka Bause and Kristian Kersting and Petra Mutzel and Marion Neumann, ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020) www.graphlearning.io .

__init__(dataset_name: str, reload: bool = False, verbose: int = 10)[source]

Initialize a GraphTUDataset instance from string identifier.

Parameters
  • dataset_name (str) – Name of a dataset.

  • reload (bool) – Download the dataset again and prepare data on disk.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

tudataset_ids = ['AIDS', 'alchemy_full', 'aspirin', 'benzene', 'BZR', 'BZR_MD', 'COX2', 'COX2_MD', 'DHFR', 'DHFR_MD', 'ER_MD', 'ethanol', 'FRANKENSTEIN', 'malonaldehyde', 'MCF-7', 'MCF-7H', 'MOLT-4', 'MOLT-4H', 'Mutagenicity', 'MUTAG', 'naphthalene', 'NCI1', 'NCI109', 'NCI-H23', 'NCI-H23H', 'OVCAR-8', 'OVCAR-8H', 'P388', 'P388H', 'PC-3', 'PC-3H', 'PTC_FM', 'PTC_FR', 'PTC_MM', 'PTC_MR', 'QM9', 'salicylic_acid', 'SF-295', 'SF-295H', 'SN12C', 'SN12CH', 'SW-620', 'SW-620H', 'toluene', 'Tox21_AhR_training', 'Tox21_AhR_testing', 'Tox21_AhR_evaluation', 'Tox21_AR_training', 'Tox21_AR_testing', 'Tox21_AR_evaluation', 'Tox21_AR-LBD_training', 'Tox21_AR-LBD_testing', 'Tox21_AR-LBD_evaluation', 'Tox21_ARE_training', 'Tox21_ARE_testing', 'Tox21_ARE_evaluation', 'Tox21_aromatase_training', 'Tox21_aromatase_testing', 'Tox21_aromatase_evaluation', 'Tox21_ATAD5_training', 'Tox21_ATAD5_testing', 'Tox21_ATAD5_evaluation', 'Tox21_ER_training', 'Tox21_ER_testing', 'Tox21_ER_evaluation', 'Tox21_ER-LBD_training', 'Tox21_ER-LBD_testing', 'Tox21_ER-LBD_evaluation', 'Tox21_HSE_training', 'Tox21_HSE_testing', 'Tox21_HSE_evaluation', 'Tox21_MMP_training', 'Tox21_MMP_testing', 'Tox21_MMP_evaluation', 'Tox21_p53_training', 'Tox21_p53_testing', 'Tox21_p53_evaluation', 'Tox21_PPAR-gamma_training', 'Tox21_PPAR-gamma_testing', 'Tox21_PPAR-gamma_evaluation', 'UACC257', 'UACC257H', 'uracil', 'Yeast', 'YeastH', 'ZINC_full', 'ZINC_test', 'ZINC_train', 'ZINC_val', 'DD', 'ENZYMES', 'KKI', 'OHSU', 'Peking_1', 'PROTEINS', 'PROTEINS_full', 'COIL-DEL', 'COIL-RAG', 'Cuneiform', 'Fingerprint', 'FIRSTMM_DB', 'Letter-high', 'Letter-low', 'Letter-med', 'MSRC_9', 'MSRC_21', 'MSRC_21C', 'COLLAB', 'dblp_ct1', 'dblp_ct2', 'DBLP_v1', 'deezer_ego_nets', 'facebook_ct1', 'facebook_ct2', 'github_stargazers', 'highschool_ct1', 'highschool_ct2', 'IMDB-BINARY', 'IMDB-MULTI', 'infectious_ct1', 'infectious_ct2', 'mit_ct1', 'mit_ct2', 'REDDIT-BINARY', 'REDDIT-MULTI-5K', 'REDDIT-MULTI-12K', 'reddit_threads', 'tumblr_ct1', 
'tumblr_ct2', 'twitch_egos', 'TWITTER-Real-Graph-Partial', 'COLORS-3', 'SYNTHETIC', 'SYNTHETICnew', 'Synthie', 'TRIANGLES']
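
Because dataset_name is a plain string, it can be useful to check it against the identifier list before triggering a download. Whether the class itself performs such a check is not stated here; check_dataset_name below is a hypothetical helper operating on an excerpt of the list.

```python
# Excerpt of the identifier list above; the full list is much longer.
known_ids = ["AIDS", "MUTAG", "PROTEINS", "ENZYMES", "IMDB-BINARY"]


def check_dataset_name(dataset_name, valid_ids):
    """Return the name if it is known; identifiers are matched case-sensitively."""
    if dataset_name not in valid_ids:
        raise ValueError(f"Unknown dataset name: {dataset_name!r}")
    return dataset_name


name = check_dataset_name("MUTAG", known_ids)
```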

kgcnn.data.datasets.ISO17Dataset module

class kgcnn.data.datasets.ISO17Dataset.ISO17Dataset(reload=False, verbose: int = 10)[source]

Bases: kgcnn.data.download.DownloadDataset, kgcnn.data.base.MemoryGraphDataset

Dataset ‘ISO17’ with molecules randomly taken from the QM9 dataset [1] with a fixed composition of atoms (C7O2H10). The molecules are arranged in different chemically valid structures, and the dataset is an extension of the isomer MD data used in [2].

Information below and the dataset itself is copied and downloaded from http://quantum-machine.org/datasets/ . The database consists of 129 molecules, each containing 5000 conformational geometries, energies and forces with a resolution of 1 femtosecond in the molecular dynamics trajectories. The database was generated from molecular dynamics simulations using the Fritz-Haber Institute ab initio simulation package (FHI-aims) [3]. The simulations in ISO17 were carried out using the standard quantum chemistry computational method density functional theory (DFT) in the generalized gradient approximation (GGA) with the Perdew-Burke-Ernzerhof (PBE) functional [4] and the Tkatchenko-Scheffler (TS) van der Waals correction method [5]. The dataset is stored in ASE sqlite format with the total energy in eV and forces in eV/Ang. Download-url: http://quantum-machine.org/datasets/iso17.tar.gz .

from ase.db import connect
with connect(path_to_db) as conn:
   for row in conn.select(limit=10):
       print(row.toatoms())
       print(row['total_energy'])
       print(row.data['atomic_forces'])
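
ISO17 stores energies in eV and forces in eV/Ang, whereas other molecular dynamics datasets often use kcal/mol. If values need to be compared across such datasets, a unit conversion along the following lines may help. The helper names are hypothetical and the factor is rounded.

```python
# 1 eV per particle corresponds to roughly 23.0605 kcal/mol (rounded).
EV_TO_KCAL_PER_MOL = 23.0605


def energy_ev_to_kcal_per_mol(energy_ev):
    """Convert an energy from eV to kcal/mol."""
    return energy_ev * EV_TO_KCAL_PER_MOL


def force_ev_to_kcal_per_mol(force_ev_per_ang):
    """Convert force components from eV/Angstrom to kcal/mol/Angstrom."""
    return [component * EV_TO_KCAL_PER_MOL for component in force_ev_per_ang]


one_ev = energy_ev_to_kcal_per_mol(1.0)
```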

References

  1. R. Ramakrishnan et al. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1. (2014) https://www.nature.com/articles/sdata201422 .

  2. Schütt, K. T. et al. Quantum-chemical insights from deep tensor neural networks. Nature Communications, 8, 13890. (2017) https://www.nature.com/articles/ncomms13890 .

  3. Blum, V. et al. Ab Initio Molecular Simulations with Numeric Atom-Centered Orbitals. Comput. Phys. Commun. 2009, 180 (11), 2175–2196 https://www.sciencedirect.com/science/article/pii/S0010465509002033 .

  4. Perdew, J. P. et al. Generalized Gradient Approximation Made Simple. Phys. Rev. Lett. 1996, 77 (18), 3865–3868 https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.77.3865.

  5. Tkatchenko, A. et al. Accurate Molecular Van Der Waals Interactions from Ground-State Electron Density and Free-Atom Reference Data. Phys. Rev. Lett. 2009, 102 (7), 73005 https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.102.073005.

  6. Schütt, K. T., et al. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. Advances in Neural Information Processing Systems (accepted). 2017. https://arxiv.org/abs/1706.08566

__init__(reload=False, verbose: int = 10)[source]

Initialize full ISO17Dataset dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

download_info = {'data_directory_name': 'ISO17', 'dataset_name': 'ISO17', 'download_file_name': 'iso17.tar.gz', 'download_url': 'http://quantum-machine.org/datasets/iso17.tar.gz', 'unpack_directory_name': 'iso17', 'unpack_tar': True, 'unpack_zip': False}
read_in_memory()[source]

Load complete ISO17Dataset data into memory. Additionally, the different train and validation properties are set according to http://quantum-machine.org/datasets/.

The data is partitioned as used in the SchNet paper [6]:

  • reference.db: 80% of steps of 80% of MD trajectories

  • reference_eq.db: equilibrium conformations of those molecules

  • test_within.db: remaining 20% unseen steps of reference trajectories

  • test_other.db: remaining 20% unseen MD trajectories

  • test_eq.db: equilibrium conformations of test trajectories

Where ‘reference.db’ and ‘reference_eq.db’ have ‘train’ property with index 0, 1, respectively, and ‘test_within’, ‘test_other’, ‘test_eq’ have ‘test’ property 0, 1, 2, respectively. The original validation geometries are noted by ‘valid’ property with index 0. Use get_train_test_indices for reading out split indices.
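
The split bookkeeping described above can be pictured as per-graph properties holding a split index. The sketch below is not the signature or implementation of get_train_test_indices; the helper split_indices and the item layout are assumptions for illustration.

```python
def split_indices(items, key, split_index):
    """Return positions of items whose `key` property contains `split_index`."""
    return [
        position
        for position, item in enumerate(items)
        if split_index in item.get(key, [])
    ]


# Hypothetical per-graph properties mirroring the description above.
items = [
    {"train": [0]},   # from reference.db
    {"train": [1]},   # from reference_eq.db
    {"test": [0]},    # from test_within.db
    {"test": [1]},    # from test_other.db
    {"test": [2]},    # from test_eq.db
]

train_reference = split_indices(items, "train", 0)
test_other = split_indices(items, "test", 1)
```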

kgcnn.data.datasets.LipopDataset module

class kgcnn.data.datasets.LipopDataset.LipopDataset(reload=False, verbose: int = 10)[source]

Bases: kgcnn.data.datasets.MoleculeNetDataset2018.MoleculeNetDataset2018

Store and process ‘Lipop’ dataset from MoleculeNet database. Class inherits from MoleculeNetDataset2018 and downloads dataset on class initialization. Compare the DeepChem reference, which reads: Experimental results of octanol/water distribution coefficient (logD at pH 7.4). Description in DeepChem reads: ‘Lipophilicity is an important feature of drug molecules that affects both membrane permeability and solubility. The lipophilicity dataset, curated from ChEMBL database, provides experimental results of octanol/water distribution coefficient (logD at pH 7.4) of 4200 compounds.’ Random or Scaffold splitting is recommended for this dataset.

References

  1. Hersey, A. ChEMBL Deposited Data Set - AZ dataset; 2015. https://doi.org/10.6019/chembl3301361

__init__(reload=False, verbose: int = 10)[source]

Initialize Lipop dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

kgcnn.data.datasets.MD17Dataset module

class kgcnn.data.datasets.MD17Dataset.MD17Dataset(trajectory_name: Optional[str] = None, reload=False, verbose=10)[source]

Bases: kgcnn.data.download.DownloadDataset, kgcnn.data.base.MemoryGraphDataset

Store and process trajectories from the MD17Dataset dataset. The dataset contains atomic coordinates of molecular dynamics trajectories, as well as the total energy (in kcal/mol) and forces (kcal/mol/Angstrom) on each atom. For reference data source, refer to the links http://www.sgdml.org/#datasets or http://quantum-machine.org/gdml/data/ .

Which trajectory is downloaded is determined by trajectory_name argument. There are two different versions of trajectories, which are a long trajectory on DFT level of theory and a short trajectory on coupled cluster level of theory marked in the name by ‘dft’ and ‘ccsd_t’ respectively.

Overview:

=============  ===============  ====================  ======
Molecule       Level of Theory  trajectory_name       graphs
=============  ===============  ====================  ======
Aspirin        DFT              aspirin_dft           211762
Azobenzene     DFT              azobenzene_dft        99999
Benzene        DFT              benzene2017_dft       627983
Benzene        DFT              benzene2018_dft       49863
Ethanol        DFT              ethanol_dft           555092
Malonaldehyde  DFT              malonaldehyde_dft     993237
Naphthalene    DFT              naphthalene_dft       326250
Paracetamol    DFT              paracetamol_dft       106490
Salicylic      DFT              salicylic_dft         320231
Toluene        DFT              toluene_dft           442790
Uracil         DFT              uracil_dft            133770
Aspirin_ccsd   CCSD             aspirin_ccsd          1500
Benzene        CCSD             benzene_ccsd_t        1500
Ethanol        CCSD             ethanol_ccsd_t        2000
Malonaldehyde  CCSD             malonaldehyde_ccsd_t  1500
Toluene        CCSD             toluene_ccsd_t        1501
=============  ===============  ====================  ======

It is recommended to use the given train-test splits. Only the requested trajectory is downloaded.
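
The naming scheme in the overview above encodes the level of theory in the suffix of trajectory_name. A small helper can recover it; level_of_theory is hypothetical and only covers the suffixes listed in the overview.

```python
def level_of_theory(trajectory_name):
    """Infer the level of theory from the suffix of a trajectory name."""
    if trajectory_name.endswith("_dft"):
        return "DFT"
    # Note that the aspirin entry uses '_ccsd' without the trailing '_t'.
    if trajectory_name.endswith(("_ccsd", "_ccsd_t")):
        return "CCSD"
    raise ValueError(f"Unrecognized trajectory name: {trajectory_name!r}")


levels = {
    name: level_of_theory(name)
    for name in ["aspirin_dft", "aspirin_ccsd", "toluene_ccsd_t"]
}
```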

__init__(trajectory_name: Optional[str] = None, reload=False, verbose=10)[source]

Initialize MD17Dataset dataset.

Parameters
  • trajectory_name (str) – Name of a trajectory or molecule.

  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

datasets_download_info = {'CG-CG': {'download_file_name': 'CG-CG.npz'}, 'aspirin_ccsd': {'download_file_name': 'aspirin_ccsd.zip', 'unpack_directory_name': 'aspirin_ccsd', 'unpack_zip': True}, 'aspirin_dft': {'download_file_name': 'aspirin_dft.npz'}, 'azobenzene_dft': {'download_file_name': 'azobenzene_dft.npz'}, 'benzene2017_dft': {'download_file_name': 'benzene2017_dft.npz'}, 'benzene2018_dft': {'download_file_name': 'benzene2018_dft.npz'}, 'benzene_ccsd_t': {'download_file_name': 'benzene_ccsd_t.zip', 'unpack_directory_name': 'benzene_ccsd_t', 'unpack_zip': True}, 'ethanol_ccsd_t': {'download_file_name': 'ethanol_ccsd_t.zip', 'unpack_directory_name': 'ethanol_ccsd_t', 'unpack_zip': True}, 'ethanol_dft': {'download_file_name': 'ethanol_dft.npz'}, 'malonaldehyde_ccsd_t': {'download_file_name': 'malonaldehyde_ccsd_t.zip', 'unpack_directory_name': 'malonaldehyde_ccsd_t', 'unpack_zip': True}, 'malonaldehyde_dft': {'download_file_name': 'malonaldehyde_dft.npz'}, 'naphthalene_dft': {'download_file_name': 'naphthalene_dft.npz'}, 'paracetamol_dft': {'download_file_name': 'paracetamol_dft.npz'}, 'salicylic_dft': {'download_file_name': 'salicylic_dft.npz'}, 'toluene_ccsd_t': {'download_file_name': 'toluene_ccsd_t.zip', 'unpack_directory_name': 'toluene_ccsd_t', 'unpack_zip': True}, 'toluene_dft': {'download_file_name': 'toluene_dft.npz'}, 'uracil_dft': {'download_file_name': 'uracil_dft.npz'}}
read_in_memory()[source]

Load a trajectory into memory.

kgcnn.data.datasets.MD17RevisedDataset module

class kgcnn.data.datasets.MD17RevisedDataset.MD17RevisedDataset(trajectory_name: Optional[str] = None, reload=False, verbose=10)[source]

Bases: kgcnn.data.download.DownloadDataset, kgcnn.data.base.MemoryGraphDataset

Store and process trajectories from MD17DatasetRevised dataset.

The information of the readme file is given below:

The molecules are taken from the original MD17 dataset by Chmiela et al., and 100,000 structures are taken for each; the energies and forces are recalculated at the PBE/def2-SVP level of theory using very tight SCF convergence and a very dense DFT integration grid. As such, the dataset is practically free from numerical noise.

One warning: As the structures are taken from a molecular dynamics simulation (i.e. time series data), they are not guaranteed to be independent samples. This is easily evident from the auto-correlation function for the original MD17 dataset.

In short: DO NOT train a model on more than 1000 samples from this dataset. Data already published with 50K samples on the original MD17 dataset should be considered meaningless due to this fact and due to the noise in the original data.
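
The warning about correlated samples can be made concrete with a normalized autocorrelation estimate. The series below is a toy signal, not MD17 data, and the function autocorrelation is a plain-Python sketch rather than anything provided by the dataset.

```python
def autocorrelation(series, lag):
    """Normalized autocorrelation of a 1D series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


# Toy "trajectory": a slowly varying signal, so neighbouring steps are similar.
energies = [float(step // 10) for step in range(100)]
lag1 = autocorrelation(energies, 1)
lag50 = autocorrelation(energies, 50)
```

A value near 1 at small lags indicates that consecutive frames carry little independent information, which is the concern raised in the readme.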

The data: The ten molecules are saved in Numpy .npz format. The keys correspond to:

  • ‘nuclear_charges’ : The nuclear charges for the molecule

  • ‘coords’ : The coordinates for each conformation (in units of Angstrom)

  • ‘energies’ : The total energy of each conformation (in units of kcal/mol)

  • ‘forces’ : The cartesian forces of each conformation (in units of kcal/mol/Angstrom)

  • ‘old_indices’ : The index of each conformation in the original MD17 dataset

  • ‘old_energies’ : The energy of each conformation taken from the original MD17 dataset

  • ‘old_forces’ : The forces of each conformation taken from the original MD17 dataset

Note that for Azobenzene, only 99988 samples are available due to 11 failed DFT calculations due to van der Waals clash, and the original dataset only contained 99999 structures.

Data splits: Five training and test splits are saved in CSV format containing the corresponding indices.
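
Applying an index-based split amounts to reading the indices and selecting the matching samples. The sketch below uses an in-memory stand-in for a split file; the actual file names and CSV layout are not described here and may differ.

```python
import csv
import io


def read_split_indices(handle):
    """Read integer indices from a CSV file handle (one or more per row)."""
    return [
        int(value)
        for row in csv.reader(handle)
        for value in row
        if value.strip()
    ]


# Stand-in for a split file; the real file names and layout may differ.
train_file = io.StringIO("3\n0\n4\n")
train_indices = read_split_indices(train_file)

energies = [-1.0, -1.1, -0.9, -1.2, -1.05]   # made-up values
train_energies = [energies[i] for i in train_indices]
```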

References

  1. Anders Christensen, O. Anatole von Lilienfeld, Revised MD17 dataset, Materials Cloud Archive 2020.82 (2020), doi: 10.24435/materialscloud:wy-kn.

__init__(trajectory_name: Optional[str] = None, reload=False, verbose=10)[source]

Initialize MD17DatasetRevised dataset for a specific trajectory.

Parameters
  • trajectory_name (str) – Name of trajectory to load.

  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

download_info = {'data_directory_name': 'MD17Revised', 'dataset_name': 'MD17Revised', 'download_file_name': 'rmd17.tar.bz2', 'download_url': 'https://archive.materialscloud.org/record/file?filename=rmd17.tar.bz2&record_id=466', 'unpack_directory_name': 'rmd17', 'unpack_tar': True, 'unpack_zip': False}
possible_trajectory_names = ['aspirin', 'azobenzene', 'benzene', 'ethanol', 'malonaldehyde', 'naphthalene', 'paracetamol', 'salicylic', 'toluene', 'uracil']
read_in_memory()[source]

Read dataset trajectory into memory.

Returns

self.

kgcnn.data.datasets.MUTAGDataset module

class kgcnn.data.datasets.MUTAGDataset.MUTAGDataset(reload=False, verbose: int = 10)[source]

Bases: kgcnn.data.datasets.GraphTUDataset2020.GraphTUDataset2020

Store and process MUTAG dataset from TUDatasets.

From Papers with Code: In particular, MUTAG is a collection of nitroaromatic compounds and the goal is to predict their mutagenicity on Salmonella typhimurium. Input graphs are used to represent chemical compounds, where vertices stand for atoms and are labeled by the atom type (represented by one-hot encoding), while edges between vertices represent bonds between the corresponding atoms. It includes 188 samples of chemical compounds with 7 discrete node labels.
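
The one-hot encoding of the 7 discrete node labels can be sketched as follows. The helper one_hot and the example labels are illustrative and do not reflect the actual atom-type ordering in MUTAG.

```python
def one_hot(label, num_classes=7):
    """Encode an integer label in [0, num_classes) as a one-hot list."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} outside [0, {num_classes})")
    return [1 if index == label else 0 for index in range(num_classes)]


# Hypothetical node labels of a three-atom fragment.
node_attributes = [one_hot(label) for label in [0, 2, 0]]
```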

References

  1. Debnath, A.K., Lopez de Compadre, R.L., Debnath, G., Shusterman, A.J., and Hansch, C. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. J. Med. Chem. 34(2):786-797 (1991).

  2. Nils Kriege, Petra Mutzel. 2012. Subgraph Matching Kernels for Attributed Graphs. International Conference on Machine Learning 2012.

__init__(reload=False, verbose: int = 10)[source]

Initialize MUTAG dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

read_in_memory(verbose: int = 1)[source]

Load MUTAG data into memory and already split into items.

kgcnn.data.datasets.MatBenchDataset2020 module

class kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020(dataset_name: str, reload: bool = False, verbose: int = 10)[source]

Bases: kgcnn.data.crystal.CrystalDataset, kgcnn.data.download.DownloadDataset

Base class for loading graph datasets from MatBench, a collection of materials datasets. For graph learning only those with structure are relevant. Processes and loads from serialized pymatgen structures.

Note

This class does not follow the interface of MatBench and therefore does not provide the original splits required for submission of benchmark values.

Matbench is an automated leaderboard for benchmarking state of the art ML algorithms predicting a diverse range of solid materials’ properties. It is hosted and maintained by the Materials Project.

Matbench is an ImageNet for materials science; a curated set of 13 supervised, pre-cleaned, ready-to-use ML tasks for benchmarking and fair comparison. The tasks span a wide domain of inorganic materials science applications including electronic, thermodynamic, mechanical, and thermal properties among crystals, 2D materials, disordered metals, and more.

References

  1. Dunn, A., Wang, Q., Ganose, A. et al. Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm. npj Comput Mater 6, 138 (2020). https://doi.org/10.1038/s41524-020-00406-3 .

__init__(dataset_name: str, reload: bool = False, verbose: int = 10)[source]

Initialize a MatBenchDataset2020 instance from string identifier.

Parameters
  • dataset_name (str) – Name of a dataset.

  • reload (bool) – Download the dataset again and prepare data on disk.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

datasets_download_info = {'matbench_dielectric': {'data_directory_name': 'matbench_dielectric', 'dataset_name': 'matbench_dielectric', 'download_file_name': 'matbench_dielectric.json.gz', 'extract_file_name': 'matbench_dielectric.json', 'extract_gz': True}, 'matbench_expt_gap': {'data_directory_name': 'matbench_expt_gap', 'dataset_name': 'matbench_expt_gap', 'download_file_name': 'matbench_expt_gap.json.gz', 'extract_file_name': 'matbench_expt_gap.json', 'extract_gz': True}, 'matbench_expt_is_metal': {'data_directory_name': 'matbench_expt_is_metal', 'dataset_name': 'matbench_expt_is_metal', 'download_file_name': 'matbench_expt_is_metal.json.gz', 'extract_file_name': 'matbench_expt_is_metal.json', 'extract_gz': True}, 'matbench_glass': {'data_directory_name': 'matbench_glass', 'dataset_name': 'matbench_glass', 'download_file_name': 'matbench_glass.json.gz', 'extract_file_name': 'matbench_glass.json', 'extract_gz': True}, 'matbench_jdft2d': {'data_directory_name': 'matbench_jdft2d', 'dataset_name': 'matbench_jdft2d', 'download_file_name': 'matbench_jdft2d.json.gz', 'extract_file_name': 'matbench_jdft2d.json', 'extract_gz': True}, 'matbench_log_gvrh': {'data_directory_name': 'matbench_log_gvrh', 'dataset_name': 'matbench_log_gvrh', 'download_file_name': 'matbench_log_gvrh.json.gz', 'extract_file_name': 'matbench_log_gvrh.json', 'extract_gz': True}, 'matbench_log_kvrh': {'data_directory_name': 'matbench_log_kvrh', 'dataset_name': 'matbench_log_kvrh', 'download_file_name': 'matbench_log_kvrh.json.gz', 'extract_file_name': 'matbench_log_kvrh.json', 'extract_gz': True}, 'matbench_mp_e_form': {'data_directory_name': 'matbench_mp_e_form', 'dataset_name': 'matbench_mp_e_form', 'download_file_name': 'matbench_mp_e_form.json.gz', 'extract_file_name': 'matbench_mp_e_form.json', 'extract_gz': True}, 'matbench_mp_gap': {'data_directory_name': 'matbench_mp_gap', 'dataset_name': 'matbench_mp_gap', 'download_file_name': 'matbench_mp_gap.json.gz', 'extract_file_name': 
'matbench_mp_gap.json', 'extract_gz': True}, 'matbench_mp_is_metal': {'data_directory_name': 'matbench_mp_is_metal', 'dataset_name': 'matbench_mp_is_metal', 'download_file_name': 'matbench_mp_is_metal.json.gz', 'extract_file_name': 'matbench_mp_is_metal.json', 'extract_gz': True}, 'matbench_perovskites': {'data_directory_name': 'matbench_perovskites', 'dataset_name': 'matbench_perovskites', 'download_file_name': 'matbench_perovskites.json.gz', 'extract_file_name': 'matbench_perovskites.json', 'extract_gz': True}, 'matbench_phonons': {'data_directory_name': 'matbench_phonons', 'dataset_name': 'matbench_phonons', 'download_file_name': 'matbench_phonons.json.gz', 'extract_file_name': 'matbench_phonons.json', 'extract_gz': True}, 'matbench_steels': {'data_directory_name': 'matbench_steels', 'dataset_name': 'matbench_steels', 'download_file_name': 'matbench_steels.json.gz', 'extract_file_name': 'matbench_steels.json', 'extract_gz': True}}
datasets_prepare_data_info = {'matbench_dielectric': {'file_column_name': 'structure'}, 'matbench_expt_gap': {'file_column_name': 'composition'}, 'matbench_expt_is_metal': {'file_column_name': 'composition'}, 'matbench_glass': {'file_column_name': 'composition'}, 'matbench_jdft2d': {'file_column_name': 'structure'}, 'matbench_log_gvrh': {'file_column_name': 'structure'}, 'matbench_log_kvrh': {'file_column_name': 'structure'}, 'matbench_mp_e_form': {'file_column_name': 'structure'}, 'matbench_mp_gap': {'file_column_name': 'structure'}, 'matbench_mp_is_metal': {'file_column_name': 'structure'}, 'matbench_perovskites': {'file_column_name': 'structure'}, 'matbench_phonons': {'file_column_name': 'structure'}, 'matbench_steels': {'file_column_name': 'composition'}}
datasets_read_in_memory_info = {'matbench_dielectric': {'label_column_name': 'n'}, 'matbench_expt_gap': {'label_column_name': 'gap expt'}, 'matbench_expt_is_metal': {'label_column_name': 'is_metal'}, 'matbench_glass': {'label_column_name': 'gfa'}, 'matbench_jdft2d': {'label_column_name': 'exfoliation_en'}, 'matbench_log_gvrh': {'label_column_name': 'log10(G_VRH)'}, 'matbench_log_kvrh': {'label_column_name': 'log10(K_VRH)'}, 'matbench_mp_e_form': {'label_column_name': 'e_form'}, 'matbench_mp_gap': {'label_column_name': 'gap pbe'}, 'matbench_mp_is_metal': {'label_column_name': 'is_metal'}, 'matbench_perovskites': {'label_column_name': 'e_form'}, 'matbench_phonons': {'label_column_name': 'last phdos peak'}, 'matbench_steels': {'label_column_name': 'yield strength'}}
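The three class attributes above are keyed by the same Matbench dataset name, so the settings for one dataset can be gathered with plain dictionary lookups. The following is a minimal sketch that copies two entries from the dictionaries above into local variables; the helper collect_info is a hypothetical name used for illustration and is not part of kgcnn.

```python
# Excerpts copied from the class attributes listed above (two datasets only).
datasets_download_info = {
    "matbench_dielectric": {
        "data_directory_name": "matbench_dielectric",
        "dataset_name": "matbench_dielectric",
        "download_file_name": "matbench_dielectric.json.gz",
        "extract_file_name": "matbench_dielectric.json",
        "extract_gz": True,
    },
    "matbench_steels": {
        "data_directory_name": "matbench_steels",
        "dataset_name": "matbench_steels",
        "download_file_name": "matbench_steels.json.gz",
        "extract_file_name": "matbench_steels.json",
        "extract_gz": True,
    },
}
datasets_prepare_data_info = {
    "matbench_dielectric": {"file_column_name": "structure"},
    "matbench_steels": {"file_column_name": "composition"},
}
datasets_read_in_memory_info = {
    "matbench_dielectric": {"label_column_name": "n"},
    "matbench_steels": {"label_column_name": "yield strength"},
}


def collect_info(name):
    """Merge the three per-dataset dictionaries for one name (hypothetical helper)."""
    if name not in datasets_download_info:
        raise ValueError(f"Unknown dataset name: {name!r}")
    info = {}
    info.update(datasets_download_info[name])
    info.update(datasets_prepare_data_info[name])
    info.update(datasets_read_in_memory_info[name])
    return info


info = collect_info("matbench_steels")
print(info["file_column_name"], "->", info["label_column_name"])
```

Structure-based datasets use the 'structure' column as input, while composition-based ones such as 'matbench_steels' use 'composition'.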
prepare_data(file_column_name: Optional[str] = None, overwrite: bool = False)[source]

Default preparation for crystal datasets.

Try to load all crystal structures from single files and save them as a pymatgen json serialization. Can load multiple CIF files from a table that keeps file names and possible labels or additional information.

Parameters
  • file_column_name (str) – Name of the column that has file names found in file_directory. Default is None.

  • overwrite (bool) – Whether to rerun the data extraction. Default is False.

Returns

self

kgcnn.data.datasets.MatProjectDielectricDataset module

class kgcnn.data.datasets.MatProjectDielectricDataset.MatProjectDielectricDataset(reload=False, verbose: int = 10)[source]

Bases: kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020

Store and process MatProjectDielectricDataset from MatBench database. Name within Matbench: ‘matbench_dielectric’.

Matbench test dataset for predicting refractive index from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having refractive indices less than 1 and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

  • Number of samples: 4764

  • Task type: regression

  • Input type: structure
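The requirement that the dataset order be identical to the retrieved data matters because cross-validation folds are usually defined by row position. The toy sketch below illustrates this with contiguous positional folds; it is not Matbench's actual splitting procedure (see the Automatminer/Matbench publication for that), and positional_folds is a hypothetical helper.

```python
def positional_folds(n_samples, n_folds):
    """Split row positions 0..n_samples-1 into n_folds contiguous folds."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


samples = ["A", "B", "C", "D", "E", "F"]    # order as retrieved
reordered = ["F", "A", "B", "C", "D", "E"]  # same samples, different order
folds = positional_folds(len(samples), 3)

# The same positional fold selects different samples after reordering.
test_fold_original = [samples[i] for i in folds[0]]
test_fold_reordered = [reordered[i] for i in folds[0]]
print(test_fold_original)   # ['A', 'B']
print(test_fold_reordered)  # ['F', 'A']
```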

__init__(reload=False, verbose: int = 10)[source]

Initialize ‘matbench_dielectric’ dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

kgcnn.data.datasets.MatProjectEFormDataset module

class kgcnn.data.datasets.MatProjectEFormDataset.MatProjectEFormDataset(reload=False, verbose: int = 10)[source]

Bases: kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020

Store and process MatProjectEFormDataset from MatBench database. Name within Matbench: ‘matbench_mp_e_form’.

Matbench test dataset for predicting DFT formation energy from structure. Adapted from Materials Project database. Removed entries having formation energy more than 2.5eV and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

  • Number of samples: 132752.

  • Task type: regression.

  • Input type: structure.

__init__(reload=False, verbose: int = 10)[source]

Initialize ‘matbench_mp_e_form’ dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

kgcnn.data.datasets.MatProjectGapDataset module

class kgcnn.data.datasets.MatProjectGapDataset.MatProjectGapDataset(reload=False, verbose: int = 10)[source]

Bases: kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020

Store and process MatProjectGapDataset from MatBench database. Name within Matbench: ‘matbench_mp_gap’.

Matbench test dataset for predicting DFT PBE band gap from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

  • Number of samples: 106113

  • Task type: regression

  • Input type: structure

__init__(reload=False, verbose: int = 10)[source]

Initialize ‘matbench_mp_gap’ dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

kgcnn.data.datasets.MatProjectIsMetalDataset module

class kgcnn.data.datasets.MatProjectIsMetalDataset.MatProjectIsMetalDataset(reload=False, verbose: int = 10)[source]

Bases: kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020

Store and process MatProjectIsMetalDataset from MatBench database. Name within Matbench: ‘matbench_mp_is_metal’.

Matbench test dataset for predicting DFT metallicity from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

  • Number of samples: 106113.

  • Task type: classification.

  • Input type: structure.

__init__(reload=False, verbose: int = 10)[source]

Initialize ‘matbench_mp_is_metal’ dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

kgcnn.data.datasets.MatProjectJdft2dDataset module

class kgcnn.data.datasets.MatProjectJdft2dDataset.MatProjectJdft2dDataset(reload=False, verbose: int = 10)[source]

Bases: kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020

Store and process MatProjectJdft2dDataset from MatBench database. Name within Matbench: ‘matbench_jdft2d’.

Matbench test dataset for predicting exfoliation energies from crystal structure (computed with the OptB88vdW and TBmBJ functionals). Adapted from the JARVIS DFT database. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

  • Number of samples: 636

  • Task type: regression

  • Input type: structure

__init__(reload=False, verbose: int = 10)[source]

Initialize ‘matbench_jdft2d’ dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

kgcnn.data.datasets.MatProjectLogGVRHDataset module

class kgcnn.data.datasets.MatProjectLogGVRHDataset.MatProjectLogGVRHDataset(reload=False, verbose: int = 10)[source]

Bases: kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020

Store and process MatProjectLogGVRHDataset from MatBench database. Name within Matbench: ‘matbench_log_gvrh’.

Matbench v0.1 test dataset for predicting DFT log10 VRH-average shear modulus from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having negative G_Voigt, G_Reuss, G_VRH, K_Voigt, K_Reuss, or K_VRH and those failing G_Reuss <= G_VRH <= G_Voigt or K_Reuss <= K_VRH <= K_Voigt and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

  • Number of samples: 10987

  • Task type: regression

  • Input type: structure
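The filtering rule and the log10 target described above can be made concrete with a short sketch. It assumes the standard Voigt-Reuss-Hill definition, in which the VRH average is the arithmetic mean of the Voigt and Reuss bounds; the function names and the numerical values (taken to be in GPa) are made up for illustration.

```python
import math


def vrh_average(voigt, reuss):
    """Voigt-Reuss-Hill average: arithmetic mean of the two bounds."""
    return 0.5 * (voigt + reuss)


def passes_filter(voigt, reuss, vrh):
    """Reject negative moduli and entries violating Reuss <= VRH <= Voigt."""
    if min(voigt, reuss, vrh) < 0:
        return False
    return reuss <= vrh <= voigt


g_voigt, g_reuss = 50.0, 30.0            # made-up shear moduli
g_vrh = vrh_average(g_voigt, g_reuss)    # 40.0
if passes_filter(g_voigt, g_reuss, g_vrh):
    target = math.log10(g_vrh)           # regression target, log10(G_VRH)
    print(round(target, 4))
```

The same two functions apply unchanged to the bulk modulus (K_Voigt, K_Reuss, K_VRH) used by ‘matbench_log_kvrh’.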

__init__(reload=False, verbose: int = 10)[source]

Initialize ‘matbench_log_gvrh’ dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

kgcnn.data.datasets.MatProjectLogKVRHDataset module

class kgcnn.data.datasets.MatProjectLogKVRHDataset.MatProjectLogKVRHDataset(reload=False, verbose: int = 10)[source]

Bases: kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020

Store and process MatProjectLogKVRHDataset from MatBench database. Name within Matbench: ‘matbench_log_kvrh’.

Matbench v0.1 test dataset for predicting DFT log10 VRH-average bulk modulus from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having negative G_Voigt, G_Reuss, G_VRH, K_Voigt, K_Reuss, or K_VRH and those failing G_Reuss <= G_VRH <= G_Voigt or K_Reuss <= K_VRH <= K_Voigt and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

  • Number of samples: 10987

  • Task type: regression

  • Input type: structure

__init__(reload=False, verbose: int = 10)[source]

Initialize ‘matbench_log_kvrh’ dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

kgcnn.data.datasets.MatProjectPerovskitesDataset module

class kgcnn.data.datasets.MatProjectPerovskitesDataset.MatProjectPerovskitesDataset(reload=False, verbose: int = 10)[source]

Bases: kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020

Store and process MatProjectPerovskitesDataset from MatBench database. Name within Matbench: ‘matbench_perovskites’.

Matbench test dataset for predicting formation energy from crystal structure. Adapted from an original dataset generated by Castelli et al. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

  • Number of samples: 18928

  • Task type: regression

  • Input type: structure

__init__(reload=False, verbose: int = 10)[source]

Initialize ‘matbench_perovskites’ dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

kgcnn.data.datasets.MatProjectPhononsDataset module

class kgcnn.data.datasets.MatProjectPhononsDataset.MatProjectPhononsDataset(reload=False, verbose: int = 10)[source]

Bases: kgcnn.data.datasets.MatBenchDataset2020.MatBenchDataset2020

Store and process MatProjectPhononsDataset from MatBench database. Name within Matbench: ‘matbench_phonons’.

Matbench test dataset for predicting vibration properties from crystal structure. Original data retrieved from Petretto et al. Original calculations done via ABINIT in the harmonic approximation based on density functional perturbation theory. Removed entries having a formation energy (or energy above the convex hull) more than 150meV. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.

  • Number of samples: 1265

  • Task type: regression

  • Input type: structure

last phdos peak: Target variable. Frequency of the highest frequency optical phonon mode peak, in units of 1/cm; may be used as an estimation of dominant longitudinal optical phonon frequency.
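Since the target is given in 1/cm, it can help to convert it to a frequency or an energy when comparing with other sources. The sketch below uses only exact SI constants; the function names are illustrative and not part of kgcnn.

```python
SPEED_OF_LIGHT = 299792458.0          # m/s (exact SI value)
PLANCK = 6.62607015e-34               # J s (exact SI value)
ELEMENTARY_CHARGE = 1.602176634e-19   # C (exact SI value)


def wavenumber_to_thz(wavenumber_per_cm):
    """Convert a wavenumber in 1/cm to a frequency in THz (nu = c * wavenumber)."""
    per_metre = wavenumber_per_cm * 100.0
    return SPEED_OF_LIGHT * per_metre / 1e12


def wavenumber_to_mev(wavenumber_per_cm):
    """Convert a wavenumber in 1/cm to a photon/phonon energy in meV (E = h * c * wavenumber)."""
    per_metre = wavenumber_per_cm * 100.0
    energy_joule = PLANCK * SPEED_OF_LIGHT * per_metre
    return energy_joule / ELEMENTARY_CHARGE * 1e3


peak = 1000.0  # made-up 'last phdos peak' value in 1/cm
print(f"{wavenumber_to_thz(peak):.2f} THz")   # about 29.98 THz
print(f"{wavenumber_to_mev(peak):.2f} meV")   # about 123.98 meV
```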

__init__(reload=False, verbose: int = 10)[source]

Initialize ‘matbench_phonons’ dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

kgcnn.data.datasets.MoleculeNetDataset2018 module

class kgcnn.data.datasets.MoleculeNetDataset2018.MoleculeNetDataset2018(dataset_name: str, reload: bool = False, verbose: int = 10)[source]

Bases: kgcnn.data.moleculenet.MoleculeNetDataset, kgcnn.data.download.DownloadDataset

Downloader for MoleculeNet datasets. This class inherits from MoleculeNetDataset. QM datasets are excluded from this class, however, because they have dedicated classes in kgcnn.data.datasets that inherit from QMDataset.

MoleculeNet is a benchmark specially designed for testing machine learning methods of molecular properties. Their work curates a number of dataset collections. All methods and datasets are integrated as parts of the open source DeepChem package (MIT license).

Stats:

Name           #graphs  #features  #classes
ESOL           1,128    9          1
FreeSolv       642      9          1
Lipophilicity  4,200    9          1
PCBA           437,929  9          128
MUV            93,087   9          17
HIV            41,127   9          1
BACE           1,513    9          1
BBBP           2,050    9          1
Tox21          7,831    9          12
ToxCast        8,597    9          617
SIDER          1,427    9          27
ClinTox        1,484    9          2

References

  1. Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, Vijay Pande, MoleculeNet: A Benchmark for Molecular Machine Learning, arXiv preprint, arXiv: 1703.00564, 2017.

__init__(dataset_name: str, reload: bool = False, verbose: int = 10)[source]

Initialize a MoleculeNetDataset2018 instance from string identifier.

Parameters
  • dataset_name (str) – Name of a dataset.

  • reload (bool) – Whether to download the dataset again and prepare the data on disk. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

datasets_download_info = {'BACE': {'data_directory_name': 'BACE', 'dataset_name': 'BACE', 'download_file_name': 'bace.csv'}, 'BBBP': {'data_directory_name': 'BBBP', 'dataset_name': 'BBBP', 'download_file_name': 'BBBP.csv'}, 'ClinTox': {'data_directory_name': 'ClinTox', 'dataset_name': 'ClinTox', 'download_file_name': 'clintox.csv.gz', 'extract_file_name': 'clintox.csv', 'extract_gz': True}, 'ESOL': {'data_directory_name': 'ESOL', 'dataset_name': 'ESOL', 'download_file_name': 'delaney-processed.csv'}, 'FreeSolv': {'data_directory_name': 'FreeSolv', 'dataset_name': 'FreeSolv', 'download_file_name': 'SAMPL.csv'}, 'HIV': {'data_directory_name': 'HIV', 'dataset_name': 'HIV', 'download_file_name': 'HIV.csv'}, 'Lipop': {'data_directory_name': 'Lipop', 'dataset_name': 'Lipop', 'download_file_name': 'Lipophilicity.csv'}, 'MUV': {'data_directory_name': 'MUV', 'dataset_name': 'MUV', 'download_file_name': 'muv.csv.gz', 'extract_file_name': 'muv.csv', 'extract_gz': True}, 'PCBA': {'data_directory_name': 'PCBA', 'dataset_name': 'PCBA', 'download_file_name': 'pcba.csv.gz', 'extract_file_name': 'pcba.csv', 'extract_gz': True}, 'SIDER': {'data_directory_name': 'SIDER', 'dataset_name': 'SIDER', 'download_file_name': 'sider.csv.gz', 'extract_file_name': 'sider.csv', 'extract_gz': True}, 'Tox21': {'data_directory_name': 'Tox21', 'dataset_name': 'Tox21', 'download_file_name': 'tox21.csv.gz', 'extract_file_name': 'tox21.csv', 'extract_gz': True}, 'ToxCast': {'data_directory_name': 'ToxCast', 'dataset_name': 'ToxCast', 'download_file_name': 'toxcast_data.csv.gz', 'extract_file_name': 'toxcast_data.csv', 'extract_gz': True}}
datasets_prepare_data_info = {'BACE': {'add_hydrogen': True, 'make_conformers': True, 'smiles_column_name': 'mol'}, 'BBBP': {'add_hydrogen': True, 'make_conformers': True, 'smiles_column_name': 'smiles'}, 'ClinTox': {'add_hydrogen': True, 'make_conformers': True, 'smiles_column_name': 'smiles'}, 'ESOL': {'add_hydrogen': True, 'make_conformers': True}, 'FreeSolv': {'add_hydrogen': True, 'make_conformers': True}, 'HIV': {'add_hydrogen': True, 'make_conformers': True}, 'Lipop': {'add_hydrogen': True, 'make_conformers': True}, 'MUV': {'add_hydrogen': True, 'make_conformers': True}, 'PCBA': {'add_hydrogen': True, 'make_conformers': True}, 'SIDER': {'add_hydrogen': True, 'make_conformers': True, 'smiles_column_name': 'smiles'}, 'Tox21': {'add_hydrogen': True, 'make_conformers': True, 'smiles_column_name': 'smiles'}, 'ToxCast': {'add_hydrogen': True, 'make_conformers': True, 'smiles_column_name': 'smiles'}}
datasets_read_in_memory_info = {'BACE': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': 'Class'}, 'BBBP': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': 'p_np'}, 'ClinTox': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': [1, 2]}, 'ESOL': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': 'measured log solubility in mols per litre'}, 'FreeSolv': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': 'expt'}, 'HIV': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': 'HIV_active'}, 'Lipop': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': 'exp'}, 'MUV': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': slice(0, 17, None)}, 'PCBA': {'add_hydrogen': False, 'has_conformers': False, 'label_column_name': slice(0, 128, None)}, 'SIDER': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': slice(1, 28, None)}, 'Tox21': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': slice(0, 12, None)}, 'ToxCast': {'add_hydrogen': False, 'has_conformers': True, 'label_column_name': slice(1, 618, None)}}
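In datasets_read_in_memory_info, label_column_name takes three forms: a column name (str), a list of column positions, or a slice over column positions. How kgcnn resolves these internally is not shown here; the sketch below is one plausible reading in which positions index the CSV header, with a hypothetical helper resolve_label_columns and illustrative headers.

```python
def resolve_label_columns(columns, selector):
    """Map a str, list or slice selector to a list of column names (hypothetical helper)."""
    if isinstance(selector, str):
        if selector not in columns:
            raise KeyError(f"Column {selector!r} not found")
        return [selector]
    if isinstance(selector, slice):
        return list(columns[selector])
    # Otherwise assume an iterable of positions and/or names.
    return [columns[item] if isinstance(item, int) else item for item in selector]


# Illustrative header for a ClinTox-like table: SMILES first, then two tasks.
clintox_header = ["smiles", "FDA_APPROVED", "CT_TOX"]
print(resolve_label_columns(clintox_header, [1, 2]))

# A ToxCast-like header: one SMILES column followed by 617 task columns.
toxcast_header = ["smiles"] + [f"task_{i}" for i in range(617)]
print(len(resolve_label_columns(toxcast_header, slice(1, 618, None))))
```

Under this reading, slice(1, 618) selects 617 columns, which matches the number of classes listed for ToxCast in the table above.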

kgcnn.data.datasets.MutagenicityDataset module

class kgcnn.data.datasets.MutagenicityDataset.MutagenicityDataset(reload=False, verbose: int = 10)[source]

Bases: kgcnn.data.datasets.GraphTUDataset2020.GraphTUDataset2020

Store and process Mutagenicity dataset from TUDatasets.

Mutagenicity is a chemical compound dataset of drugs, which can be categorized into two classes: mutagen and non-mutagen.

References

  1. Riesen, K. and Bunke, H.: IAM Graph Database Repository for Graph Based Pattern Recognition and Machine Learning. In: da Vitoria Lobo, N. et al. (Eds.), SSPR&SPR 2008, LNCS, vol. 5342, pp. 287-297, 2008.

__init__(reload=False, verbose: int = 10)[source]

Initialize Mutagenicity dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

read_in_memory()[source]

Load the Mutagenicity dataset into memory, split it into individual items, and apply further cleaning and processing.

kgcnn.data.datasets.PROTEINSDataset module

class kgcnn.data.datasets.PROTEINSDataset.PROTEINSDataset(reload=False, verbose: int = 10)[source]

Bases: kgcnn.data.datasets.GraphTUDataset2020.GraphTUDataset2020

Store and process PROTEINS dataset from TUDatasets.

From Papers with Code: PROTEINS is a dataset of proteins that are classified as enzymes or non-enzymes. Nodes represent the amino acids, and two nodes are connected by an edge if they are less than 6 Angstroms apart.

References

  1. K. M. Borgwardt, C. S. Ong, S. Schoenauer, S. V. N. Vishwanathan, A. J. Smola, and H. P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(Suppl 1):i47–i56, Jun 2005.

  2. P. D. Dobson and A. J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. J. Mol. Biol., 330(4):771–783, Jul 2003.

__init__(reload=False, verbose: int = 10)[source]

Initialize PROTEINS dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

read_in_memory()[source]

Load the PROTEINS dataset into memory, split it into individual items, and apply further cleaning and processing.

kgcnn.data.datasets.QM7Dataset module

class kgcnn.data.datasets.QM7Dataset.QM7Dataset(reload: bool = False, verbose: int = 10)[source]

Bases: kgcnn.data.qm.QMDataset, kgcnn.data.download.DownloadDataset

Store and process QM7 dataset from Quantum Machine.

From Quantum Machine : This dataset is a subset of GDB-13 (a database of nearly 1 billion stable and synthetically accessible organic molecules) composed of all molecules of up to 23 atoms (including 7 heavy atoms C, N, O, and S), totalling 7165 molecules. We provide the Coulomb matrix representation of these molecules and their atomization energies computed similarly to the FHI-AIMS implementation of the Perdew-Burke-Ernzerhof hybrid functional (PBE0). This dataset features a large variety of molecular structures such as double and triple bonds, cycles, carboxy, cyanide, amide, alcohol and epoxy. The atomization energies are given in kcal/mol and are ranging from -800 to -2000 kcal/mol. The dataset is composed of three multidimensional arrays X (7165 x 23 x 23), Tm(7165) and P (5 x 1433) representing the inputs (Coulomb matrices), the labels (atomization energies) and the splits for cross-validation, respectively. The dataset also contain two additional multidimensional arrays Z (7165) and R (7165 x 3) representing the atomic charge and the cartesian coordinate of each atom in the molecules.

Here, the coordinates are given and converted with QMDataset to molecular structure. Labels are not scaled but have original units. Original splits are added to the dataset.
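For readers unfamiliar with the Coulomb matrix mentioned above, the following sketch computes it from atomic charges and coordinates using the definition commonly attributed to Rupp et al. (reference 2): 0.5 * Z_i^2.4 on the diagonal and Z_i * Z_j / |R_i - R_j| off the diagonal. The function name and the H2-like example are illustrative, coordinates are assumed to be in atomic units, and the zero-padding to 23 x 23 used for the stored arrays is not shown.

```python
import math


def coulomb_matrix(charges, coords):
    """Build the Coulomb matrix for one molecule from charges Z and positions R."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal: fit to the energy of the isolated atom.
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal: Coulomb repulsion between nuclei i and j.
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Made-up H2-like example: two hydrogen atoms 1.4 length units apart.
charges = [1, 1]
coords = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)]
for row in coulomb_matrix(charges, coords):
    print([round(value, 4) for value in row])
```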

References

  1. L. C. Blum, J.-L. Reymond, 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13, J. Am. Chem. Soc., 131:8732, 2009.

  2. M. Rupp, A. Tkatchenko, K.-R. Müller, O. A. von Lilienfeld: Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning, Physical Review Letters, 108(5):058301, 2012.

__init__(reload: bool = False, verbose: int = 10)[source]

Initialize QM7 dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

download_info = {'data_directory_name': 'qm7', 'dataset_name': 'QM7', 'download_file_name': 'qm7.mat', 'download_url': 'http://quantum-machine.org/data/qm7.mat', 'unpack_tar': False, 'unpack_zip': False}
prepare_data(overwrite: bool = False, file_column_name: Optional[str] = None, make_sdf: bool = True)[source]

Precompute molecular structure information and store it in an SDF file, starting from a single xyz file or a folder of xyz files.

If no single xyz file exists, one is created from the information in a CSV file of the same name.

Parameters
  • overwrite (bool) – Overwrite existing database SDF file. Default is False.

  • file_column_name (str) – Name of the column in the CSV file that lists the xyz files located in file_directory. Default is None.

  • make_sdf (bool) – Whether to try to make an SDF file from the xyz information via RDKit and OpenBabel. Default is True.

Returns

self

read_in_memory(**kwargs)[source]

Read geometric information into memory.

Graph labels require a column specified by label_column_name.

Parameters
  • label_column_name (str, list) – Name of labels for columns in CSV file.

  • nodes (list) – A list of node attributes given as strings. Functions can be passed in place of names.

  • edges (list) – A list of edge attributes given as strings. Functions can be passed in place of names.

  • graph (list) – A list of graph attributes given as strings. Functions can be passed in place of names.

  • encoder_nodes (dict) – A dictionary of callable encoders whose keys match the node attribute names.

  • encoder_edges (dict) – A dictionary of callable encoders whose keys match the edge attribute names.

  • encoder_graph (dict) – A dictionary of callable encoders whose keys match the graph attribute names.

  • add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is False.

  • make_directed (bool) – Whether to have directed or undirected bonds. Default is False.

  • compute_partial_charges (str) – Method used to compute partial charges, e.g. ‘gasteiger’. Default is None, in which case no partial charges are computed.

  • sanitize (bool) – Whether to sanitize molecule. Default is False.

  • additional_callbacks (dict) – A dictionary whose keys are string attribute names which the elements of the dataset are supposed to have and the elements are callback function objects which implement how those attributes are derived from the MolecularGraphRDKit of the molecule in question or the row of the CSV file.

  • custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.

Returns

self
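The encoder arguments above are dictionaries that map an attribute name to a callable. The sketch below shows that pattern in plain Python only; the attribute name "AtomicNum", the one_hot factory, and the way the encoder is applied are assumptions made for illustration and do not reproduce kgcnn's internal processing.

```python
def one_hot(categories):
    """Return a callable that one-hot encodes a value against fixed categories."""
    def encode(value):
        return [1 if value == category else 0 for category in categories]
    return encode


# Hypothetical encoder dictionary: key = attribute name, value = callable.
encoder_nodes = {"AtomicNum": one_hot([1, 6, 7, 8, 16])}

# Raw attributes of a single (made-up) nitrogen atom.
raw_node = {"AtomicNum": 7, "Mass": 14.007}

# Apply an encoder where one is registered, pass other attributes through.
encoded_node = {
    name: encoder_nodes[name](value) if name in encoder_nodes else value
    for name, value in raw_node.items()
}
print(encoded_node)
```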

kgcnn.data.datasets.QM7bDataset module

class kgcnn.data.datasets.QM7bDataset.QM7bDataset(reload: bool = False, verbose: int = 10)[source]

Bases: kgcnn.data.qm.QMDataset, kgcnn.data.download.DownloadDataset

Store and process QM7b dataset from Quantum Machine.

From Quantum Machine : This dataset is an extension of the QM7 dataset for multitask learning where 13 additional properties (e.g. polarizability, HOMO and LUMO eigenvalues, excitation energies) have to be predicted at different levels of theory (ZINDO, SCS, PBE0, GW). Additional molecules comprising chlorine atoms are also included, totalling 7211 molecules.

The dataset is composed of two multidimensional arrays X (7211 x 23 x 23) and T (7211 x 14) representing the inputs (Coulomb matrices) and the labels (molecular properties) and one array names of size 14 listing the names of the different properties.

Here, the Coulomb matrices are converted back into coordinates and with QMDataset to molecular structure. Labels are not scaled but have original units.
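The first step of converting a Coulomb matrix back into a structure can be sketched as follows. This assumes the standard definition (0.5 * Z_i^2.4 on the diagonal, Z_i * Z_j / d_ij off the diagonal) and trailing zero-padding; it recovers atomic numbers and pairwise distances only. Reconstructing Cartesian coordinates from the distances is a further step that is not shown, and the function names are hypothetical rather than kgcnn's own.

```python
def charges_from_coulomb(matrix):
    """Recover integer atomic numbers from the diagonal, skipping zero padding."""
    charges = []
    for i in range(len(matrix)):
        if matrix[i][i] > 0:
            charges.append(round((2.0 * matrix[i][i]) ** (1.0 / 2.4)))
    return charges


def distances_from_coulomb(matrix, charges):
    """Recover pairwise distances d_ij = Z_i * Z_j / C_ij for the real atoms."""
    n = len(charges)
    distances = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                distances[i][j] = charges[i] * charges[j] / matrix[i][j]
    return distances


# Made-up CO-like example (Z = 6 and 8, distance 2.1), padded to 3 x 3 with zeros.
padded = [
    [0.5 * 6 ** 2.4, 48.0 / 2.1, 0.0],
    [48.0 / 2.1, 0.5 * 8 ** 2.4, 0.0],
    [0.0, 0.0, 0.0],
]
charges = charges_from_coulomb(padded)
distances = distances_from_coulomb(padded, charges)
print(charges, round(distances[0][1], 6))
```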

References

  1. L. C. Blum, J.-L. Reymond, 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13, J. Am. Chem. Soc., 131:8732, 2009.

  2. G. Montavon, M. Rupp, V. Gobre, A. Vazquez-Mayagoitia, K. Hansen, A. Tkatchenko, K.-R. Müller, O.A. von Lilienfeld, Machine Learning of Molecular Electronic Properties in Chemical Compound Space, New J. Phys. 15 095003, 2013.

__init__(reload: bool = False, verbose: int = 10)[source]

Initialize QM7b dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

download_info = {'data_directory_name': 'qm7b', 'dataset_name': 'QM7b', 'download_file_name': 'qm7b.mat', 'download_url': 'http://quantum-machine.org/data/qm7b.mat', 'unpack_tar': False, 'unpack_zip': False}
prepare_data(overwrite: bool = False, file_column_name: Optional[str] = None, make_sdf: bool = True)[source]

Precompute molecular structure information and store it in an SDF file, starting from a single xyz file or a folder of xyz files.

If no single xyz file exists, one is created from the information in a CSV file of the same name.

Parameters
  • overwrite (bool) – Overwrite existing database SDF file. Default is False.

  • file_column_name (str) – Name of the column in the CSV file that lists the xyz files located in file_directory. Default is None.

  • make_sdf (bool) – Whether to try to make an SDF file from the xyz information via RDKit and OpenBabel. Default is True.

Returns

self

kgcnn.data.datasets.QM8Dataset module

class kgcnn.data.datasets.QM8Dataset.QM8Dataset(reload: bool = False, verbose: int = 10)[source]

Bases: kgcnn.data.qm.QMDataset, kgcnn.data.download.DownloadDataset

Store and process QM8 dataset from MoleculeNet datasets.

From Quantum Machine : Due to its favorable computational efficiency, time-dependent (TD) density functional theory(DFT) enables the prediction of electronic spectra in a high-throughput manner across chemical space. Its predictions, however, can be quite inaccurate. We resolve this issue with machine learning models trained on deviations of reference second-order approximate coupled-cluster (CC2) singles and doubles spectra from TDDFT counterparts, or even from DFT gap. We applied this approach to low-lying singlet-singlet vertical electronic spectra of over 20000 synthetically feasible small organic molecules with up to eight CONF atoms. The prediction errors decay monotonously as a function of training set size. For a training set of 10000 molecules, CC2 excitation energies can be reproduced to within ±0.1 eV for the remaining molecules. Analysis of our spectral database via chromophore counting suggests that even higher accuracies can be achieved. Based on the evidence collected, we discuss open challenges associated with data-driven modeling of high-lying spectra and transition intensities.

Note

We take the pre-processed dataset from MoleculeNet.

References

  1. L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52, 2864–2875, 2012.

  2. R. Ramakrishnan, M. Hartmann, E. Tapavicza, O. A. von Lilienfeld, Electronic Spectra from TDDFT and Machine Learning in Chemical Space, J. Chem. Phys. 143 084111, 2015.

__init__(reload: bool = False, verbose: int = 10)[source]

Initialize QM8 dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

download_info = {'data_directory_name': 'qm8', 'dataset_name': 'QM8', 'download_file_name': 'gdb8.tar.gz', 'download_url': 'https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/gdb8.tar.gz', 'unpack_directory_name': 'gdb8', 'unpack_tar': True, 'unpack_zip': False}

kgcnn.data.datasets.QM9Dataset module

class kgcnn.data.datasets.QM9Dataset.QM9Dataset(reload: bool = False, verbose: int = 10)[source]

Bases: kgcnn.data.qm.QMDataset, kgcnn.data.download.DownloadDataset

Store and process QM9 dataset from Quantum Machine.

Dataset of 134k stable small organic molecules made up of C,H,O,N,F.

From Quantum Machine : Computational de novo design of new drugs and materials requires rigorous and unbiased exploration of chemical compound space. However, large uncharted territories persist due to its size scaling combinatorially with molecular size. We report computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF. These molecules correspond to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical universe of 166 billion organic molecules. We report geometries minimal in energy, corresponding harmonic frequencies, dipole moments, polarizabilities, along with energies, enthalpies, and free energies of atomization. All properties were calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry. Furthermore, for the predominant stoichiometry, C7H10O2, there are 6,095 constitutional isomers among the 134k molecules. We report energies, enthalpies, and free energies of atomization at the more accurate G4MP2 level of theory for all of them. As such, this data set provides quantum chemical properties for a relevant, consistent, and comprehensive chemical space of small organic molecules. This database may serve the benchmarking of existing methods, development of new methods, such as hybrid quantum mechanics/machine learning, and systematic identification of structure-property relationships.

Labels include geometric, energetic, electronic, and thermodynamic properties. Typically, a random 10% validation set and a random 10% test set are used. In the literature, test errors are reported as mean absolute error (MAE), with energies given in eV.
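The evaluation protocol described above can be sketched in plain Python. This is a minimal illustration that does not use kgcnn: the helper names random_split and mean_absolute_error, the fixed seed, and the toy numbers are assumptions made for the example.

```python
import random


def random_split(n_samples, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Return disjoint (train, val, test) index lists from a random shuffle.

    Hypothetical helper for illustration; not part of kgcnn.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * val_fraction)
    n_test = int(n_samples * test_fraction)
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test


def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between two equally long sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


train_idx, val_idx, test_idx = random_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100

# Toy energies in eV, invented for the example.
print(mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))  # 0.5
```

The resulting index lists could then be used to select graphs from a loaded dataset; how the selection is done depends on the dataset container in use.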

Molecules whose SMILES code differs after convergence can be removed with remove_uncharacterized. Labels with the atomization energy removed are also generated.

from kgcnn.data.datasets.QM9Dataset import QM9Dataset
dataset = QM9Dataset(reload=True)
print(dataset[0])

References

  1. L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52, 2864–2875, 2012.

  2. R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld, Quantum chemistry structures and properties of 134 kilo molecules, Scientific Data 1, 140022, 2014.

__init__(reload: bool = False, verbose: int = 10)[source]

Initialize QM9 dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

download_info = {'data_directory_name': 'qm9', 'dataset_name': 'QM9', 'download_file_name': 'dsgdb9nsd.xyz.tar.bz2', 'download_url': 'https://ndownloader.figshare.com/files/3195389', 'unpack_directory_name': 'dsgdb9nsd.xyz', 'unpack_tar': True, 'unpack_zip': False}

prepare_data(overwrite: bool = False, file_column_name: Optional[str] = None, make_sdf: bool = True)[source]

Process data by loading each individual xyz-file and storing the pickled information to file. The individual files are deleted afterwards, so the tar-file must be extracted again before the data can be overwritten.

Parameters
  • overwrite (bool) – Whether to redo the processing. This requires unpacking the data again. Defaults to False.

  • file_column_name (str) – Not used.

  • make_sdf (bool) – Whether to make SDF file.

read_in_memory(**kwargs)[source]

Read geometric information into memory.

Graph labels require a column specified by label_column_name.

Parameters
  • label_column_name (str, list) – Name of labels for columns in CSV file.

  • nodes (list) – A list of node attributes as strings. Functions can also be given in place of names.

  • edges (list) – A list of edge attributes as strings. Functions can also be given in place of names.

  • graph (list) – A list of graph attributes as strings. Functions can also be given in place of names.

  • encoder_nodes (dict) – A dictionary of callable encoders whose keys match the node attribute names.

  • encoder_edges (dict) – A dictionary of callable encoders whose keys match the edge attribute names.

  • encoder_graph (dict) – A dictionary of callable encoders whose keys match the graph attribute names.

  • add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is False.

  • make_directed (bool) – Whether to have directed or undirected bonds. Default is False.

  • compute_partial_charges (str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is None.

  • sanitize (bool) – Whether to sanitize molecule. Default is False.

  • additional_callbacks (dict) – A dictionary whose keys are string attribute names which the elements of the dataset are supposed to have and the elements are callback function objects which implement how those attributes are derived from the MolecularGraphRDKit of the molecule in question or the row of the CSV file.

  • custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.

Returns

self
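To illustrate what a dictionary of callable encoders keyed by attribute name looks like, the following sketch applies such a dictionary to plain Python values. It does not call kgcnn: make_one_hot, encode_attributes and the attribute values are hypothetical stand-ins chosen for the example.

```python
def make_one_hot(categories):
    """Build a callable that one-hot encodes a value over fixed categories.

    Unknown values map to an all-zero vector.
    Hypothetical helper for illustration; not part of kgcnn.
    """
    def encode(value):
        return [1 if value == category else 0 for category in categories]
    return encode


def encode_attributes(attributes, encoders):
    """Apply the encoder whose key matches the attribute name, if any."""
    return {
        name: encoders[name](value) if name in encoders else value
        for name, value in attributes.items()
    }


# Keys of the encoder dictionary match the attribute names.
encoder_nodes = {"Symbol": make_one_hot(["C", "N", "O"]), "TotalNumHs": int}

# Illustrative attribute values for a single atom.
atom = {"Symbol": "N", "TotalNumHs": "2", "IsAromatic": False}
print(encode_attributes(atom, encoder_nodes))
# {'Symbol': [0, 1, 0], 'TotalNumHs': 2, 'IsAromatic': False}
```

Any callable works as an encoder, including built-ins such as int; attributes without a matching key are passed through unchanged in this sketch.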

remove_uncharacterized()[source]

Remove the 3054 uncharacterized molecules that failed the structure test from this dataset.

kgcnn.data.datasets.QM9MolNetDataset module

class kgcnn.data.datasets.QM9MolNetDataset.QM9MolNetDataset(reload: bool = False, verbose: int = 10)[source]

Bases: kgcnn.data.qm.QMDataset, kgcnn.data.download.DownloadDataset

Store and process the QM9 dataset from MoleculeNet.

This is the QM9 dataset as preprocessed by MoleculeNet, with structures and labels. See kgcnn.data.datasets.QM9Dataset for documentation and comparison.

References

  1. L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52, 2864–2875, 2012.

  2. R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld, Quantum chemistry structures and properties of 134 kilo molecules, Scientific Data 1, 140022, 2014.

__init__(reload: bool = False, verbose: int = 10)[source]

Initialize QM9 dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

download_info = {'data_directory_name': 'qm9_mol_net', 'dataset_name': 'QM9MolNet', 'download_file_name': 'gdb9.tar.gz', 'download_url': 'https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/gdb9.tar.gz', 'unpack_directory_name': 'gdb9', 'unpack_tar': True, 'unpack_zip': False}
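As a rough illustration of how the fields of download_info relate to each other, the sketch below derives local paths and the unpack step from the dictionary. The helper describe_download and the base directory "datasets" are assumptions made for the example; the actual cache layout is decided by kgcnn.

```python
import os

download_info = {
    "data_directory_name": "qm9_mol_net",
    "dataset_name": "QM9MolNet",
    "download_file_name": "gdb9.tar.gz",
    "download_url": "https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/gdb9.tar.gz",
    "unpack_directory_name": "gdb9",
    "unpack_tar": True,
    "unpack_zip": False,
}


def describe_download(info, base_dir):
    """Derive local paths and the unpack step from a download_info dict.

    base_dir is an assumed cache location, not the one kgcnn uses.
    Hypothetical helper for illustration; not part of kgcnn.
    """
    data_dir = os.path.join(base_dir, info["data_directory_name"])
    if info["unpack_tar"]:
        unpack = "tar"
    elif info["unpack_zip"]:
        unpack = "zip"
    else:
        unpack = None
    return {
        "archive": os.path.join(data_dir, info["download_file_name"]),
        "unpacked": os.path.join(data_dir, info["unpack_directory_name"]),
        "unpack": unpack,
    }


paths = describe_download(download_info, base_dir="datasets")
print(paths["archive"])   # archive file inside the dataset directory
print(paths["unpacked"])  # directory the archive is unpacked into
print(paths["unpack"])    # tar
```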

kgcnn.data.datasets.SIDERDataset module

class kgcnn.data.datasets.SIDERDataset.SIDERDataset(reload=False, verbose: int = 10)[source]

Bases: kgcnn.data.datasets.MoleculeNetDataset2018.MoleculeNetDataset2018

Store and process ‘SIDER’ dataset from MoleculeNet database.

Compare reference: DeepChem reading: The Side Effect Resource (SIDER) is a database of marketed drugs and adverse drug reactions (ADR). The version of the SIDER dataset in DeepChem has grouped drug side effects into 27 system organ classes following MedDRA classifications measured for 1427 approved drugs.

Random splitting is recommended for this dataset.

The raw data csv file contains columns below:

  • “smiles”: SMILES representation of the molecular structure

  • “Hepatobiliary disorders” ~ “Injury, poisoning and procedural complications”: Recorded side effects for the drug. Please refer to http://sideeffects.embl.de/se/?page=98 for details on ADRs.

References

  1. Kuhn, Michael, et al. “The SIDER database of drugs and side effects.” Nucleic acids research 44.D1 (2015): D1075-D1079.

  2. Altae-Tran, Han, et al. “Low data drug discovery with one-shot learning.” ACS central science 3.4 (2017): 283-293.

  3. Medical Dictionary for Regulatory Activities. http://www.meddra.org/

__init__(reload=False, verbose: int = 10)[source]

Initialize SIDER dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

read_in_memory(**kwargs)[source]

Load the list of molecules from the cached SDF-file into memory. The file name must be given in file_name, and the path information in the constructor of this class.

It further checks the csv-file for graph labels specified by label_column_name. Labels that do not have valid smiles or molecule in the SDF-file are also skipped, but added as None to keep the index and the molecule assignment.

Set further molecular attributes or features by string identifier. Requires MolecularGraphRDKit. Default values are the features that have been used by Luo et al. (2019).

The argument additional_callbacks allows adding custom properties to each element of the dataset. It is a dictionary whose string keys are the names of the properties and the values are callable function objects which define how the property is derived from either the MolecularGraphRDKit or the corresponding row of the original CSV file. Those callback functions accept two parameters:

  • mg: The MolecularGraphRDKit instance of the molecule

  • ds: A pandas data series that matches the data in the CSV file for the specific molecule.

Example:

from os import linesep
import numpy as np
from kgcnn.data.moleculenet import MoleculeNetDataset
# Import path assumed; it may differ between kgcnn versions.
from kgcnn.mol.encoder import OneHotEncoder

csv = f"index,name,label,smiles{linesep}1,Propanolol,1,[Cl].CC(C)NCC(O)COc1cccc2ccccc12"
with open('/tmp/moleculenet_example.csv', mode='w') as file:
    file.write(csv)

dataset = MoleculeNetDataset('/tmp', 'example', 'moleculenet_example.csv')
dataset.prepare_data(smiles_column_name='smiles')
dataset.read_in_memory(label_column_name='label')
dataset.set_attributes(
    nodes=['Symbol'],
    encoder_nodes={'Symbol': OneHotEncoder(['C', 'O'], dtype='str')},
    edges=['BondType'],
    encoder_edges={'BondType': int},
    additional_callbacks={
        # It is important that the callbacks return a numpy array, even if it is just a single element.
        'name': lambda mg, ds: np.array(ds['name'], dtype='str')
    }
)

mol: dict = dataset[0]
mol['node_attributes']  # np array of one hot encoded atom type per node
mol['edge_attributes']  # int value representing the bond type
mol['name']  # Array of a single string which is the name from the original CSV data

Parameters
  • label_column_name (str) – Column name in the csv-file where to take graph labels from. For multi-targets you can supply a list of column names or positions. A slice can be provided for selecting columns as graph labels. Default is None.

  • nodes (list) – A list of node attributes as strings. Functions can also be given in place of names.

  • edges (list) – A list of edge attributes as strings. Functions can also be given in place of names.

  • graph (list) – A list of graph attributes as strings. Functions can also be given in place of names.

  • encoder_nodes (dict) – A dictionary of callable encoders whose keys match the node attribute names.

  • encoder_edges (dict) – A dictionary of callable encoders whose keys match the edge attribute names.

  • encoder_graph (dict) – A dictionary of callable encoders whose keys match the graph attribute names.

  • add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is False.

  • has_conformers (bool) – Whether to add node coordinates from conformer. Default is True.

  • make_directed (bool) – Whether to have directed or undirected bonds. Default is False.

  • sanitize (bool) – Whether to sanitize molecule. Default is True.

  • compute_partial_charges (str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is None.

  • additional_callbacks (dict) – A dictionary whose keys are string attribute names which the elements of the dataset are supposed to have and the elements are callback function objects which implement how those attributes are derived from the MolecularGraphRDKit of the molecule in question or the row of the CSV file.

  • custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.

Returns

self
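The label_column_name parameter accepts a single name, a list of names or positions, or a slice. The sketch below shows one way such a specification could be resolved against a CSV header. The helper select_label_columns is hypothetical and is not the kgcnn implementation, and the header is shortened for illustration.

```python
def select_label_columns(columns, label_column_name):
    """Resolve a label specification to a list of column names.

    Accepts a single name, a list of names or integer positions, or a slice.
    Hypothetical helper for illustration; not the kgcnn implementation.
    """
    if isinstance(label_column_name, str):
        return [label_column_name]
    if isinstance(label_column_name, slice):
        return list(columns[label_column_name])
    return [columns[c] if isinstance(c, int) else c for c in label_column_name]


# Shortened, illustrative header; the real SIDER file has 27 label columns.
columns = ["smiles", "Hepatobiliary disorders", "Eye disorders", "Cardiac disorders"]

print(select_label_columns(columns, "Eye disorders"))
# ['Eye disorders']
print(select_label_columns(columns, ["Eye disorders", 3]))
# ['Eye disorders', 'Cardiac disorders']
print(select_label_columns(columns, slice(1, None)))
# ['Hepatobiliary disorders', 'Eye disorders', 'Cardiac disorders']
```

A slice is convenient for multi-task datasets such as SIDER, where all columns after the SMILES column are labels in this illustrative header.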

kgcnn.data.datasets.Tox21MolNetDataset module

class kgcnn.data.datasets.Tox21MolNetDataset.Tox21MolNetDataset(reload=False, verbose: int = 10, remove_nan: bool = False)[source]

Bases: kgcnn.data.datasets.MoleculeNetDataset2018.MoleculeNetDataset2018

Store and process ‘TOX21’ dataset from MoleculeNet database.

Compare reference: DeepChem reading: The “Toxicology in the 21st Century” (Tox21) initiative created a public database measuring toxicity of compounds, which has been used in the 2014 Tox21 Data Challenge. This dataset contains qualitative toxicity measurements for 8k compounds on 12 different targets, including nuclear receptors and stress response pathways.

Random splitting is recommended for this dataset.

The raw data csv file contains columns below:

  • “smiles”: SMILES representation of the molecular structure

  • “NR-XXX”: Nuclear receptor signaling bioassays results

  • “SR-XXX”: Stress response bioassays results

Please refer to https://tripod.nih.gov/tox21/challenge/data.jsp for details.

References

  1. Tox21 Challenge. https://tripod.nih.gov/tox21/challenge/

__init__(reload=False, verbose: int = 10, remove_nan: bool = False)[source]

Initialize Tox21 dataset.

Parameters
  • reload (bool) – Whether to reload the data and make new dataset. Default is False.

  • verbose (int) – Print progress or info for processing where 60=silent. Default is 10.

read_in_memory(**kwargs)[source]

Load the list of molecules from the cached SDF-file into memory. The file name must be given in file_name, and the path information in the constructor of this class.

It further checks the csv-file for graph labels specified by label_column_name. Labels that do not have valid smiles or molecule in the SDF-file are also skipped, but added as None to keep the index and the molecule assignment.
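The idea of keeping invalid entries as None so that positions stay aligned with the CSV rows can be sketched as follows. The helper align_labels and the toy data are assumptions made for the example, not the kgcnn implementation.

```python
def align_labels(molecules, labels):
    """Pair molecules with labels, using None where a molecule is invalid.

    molecules holds None for entries that could not be parsed. Invalid entries
    are kept as None so that list positions still match the CSV rows.
    Hypothetical helper for illustration; not the kgcnn implementation.
    """
    return [
        None if mol is None else {"mol": mol, "label": label}
        for mol, label in zip(molecules, labels)
    ]


molecules = ["CCO", None, "c1ccccc1"]  # the second entry failed to parse
labels = [0, 1, 1]

aligned = align_labels(molecules, labels)
print(aligned[1])  # None
print([i for i, item in enumerate(aligned) if item is not None])  # [0, 2]
```

Because the placeholder keeps its slot, index 2 in the result still refers to the third CSV row.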

Set further molecular attributes or features by string identifier. Requires MolecularGraphRDKit. Default values are the features that have been used by Luo et al. (2019).

The argument additional_callbacks allows adding custom properties to each element of the dataset. It is a dictionary whose string keys are the names of the properties and the values are callable function objects which define how the property is derived from either the MolecularGraphRDKit or the corresponding row of the original CSV file. Those callback functions accept two parameters:

  • mg: The MolecularGraphRDKit instance of the molecule

  • ds: A pandas data series that matches the data in the CSV file for the specific molecule.

Example:

from os import linesep
import numpy as np
from kgcnn.data.moleculenet import MoleculeNetDataset
# Import path assumed; it may differ between kgcnn versions.
from kgcnn.mol.encoder import OneHotEncoder

csv = f"index,name,label,smiles{linesep}1,Propanolol,1,[Cl].CC(C)NCC(O)COc1cccc2ccccc12"
with open('/tmp/moleculenet_example.csv', mode='w') as file:
    file.write(csv)

dataset = MoleculeNetDataset('/tmp', 'example', 'moleculenet_example.csv')
dataset.prepare_data(smiles_column_name='smiles')
dataset.read_in_memory(label_column_name='label')
dataset.set_attributes(
    nodes=['Symbol'],
    encoder_nodes={'Symbol': OneHotEncoder(['C', 'O'], dtype='str')},
    edges=['BondType'],
    encoder_edges={'BondType': int},
    additional_callbacks={
        # It is important that the callbacks return a numpy array, even if it is just a single element.
        'name': lambda mg, ds: np.array(ds['name'], dtype='str')
    }
)

mol: dict = dataset[0]
mol['node_attributes']  # np array of one hot encoded atom type per node
mol['edge_attributes']  # int value representing the bond type
mol['name']  # Array of a single string which is the name from the original CSV data

Parameters
  • label_column_name (str) – Column name in the csv-file where to take graph labels from. For multi-targets you can supply a list of column names or positions. A slice can be provided for selecting columns as graph labels. Default is None.

  • nodes (list) – A list of node attributes as strings. Functions can also be given in place of names.

  • edges (list) – A list of edge attributes as strings. Functions can also be given in place of names.

  • graph (list) – A list of graph attributes as strings. Functions can also be given in place of names.

  • encoder_nodes (dict) – A dictionary of callable encoders whose keys match the node attribute names.

  • encoder_edges (dict) – A dictionary of callable encoders whose keys match the edge attribute names.

  • encoder_graph (dict) – A dictionary of callable encoders whose keys match the graph attribute names.

  • add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is False.

  • has_conformers (bool) – Whether to add node coordinates from conformer. Default is True.

  • make_directed (bool) – Whether to have directed or undirected bonds. Default is False.

  • sanitize (bool) – Whether to sanitize molecule. Default is True.

  • compute_partial_charges (str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is None.

  • additional_callbacks (dict) – A dictionary whose keys are string attribute names which the elements of the dataset are supposed to have and the elements are callback function objects which implement how those attributes are derived from the MolecularGraphRDKit of the molecule in question or the row of the CSV file.

  • custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.

Returns

self

Module contents