kgcnn.data package¶

Subpackages¶

Submodules¶

kgcnn.data.base module¶

kgcnn.data.base.MemoryGeometricGraphDataset¶: alias of kgcnn.data.base.MemoryGraphDataset

class kgcnn.data.base.MemoryGraphDataset(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, file_directory: Optional[str] = None, verbose: int = 10)[source]¶

Bases: kgcnn.data.base.MemoryGraphList

Dataset class for lists of graph tensor dictionaries stored on file and fit into memory.

This class inherits from MemoryGraphList and can be used (after loading and setup) as such. It has further information about a location on disk, i.e. a file directory and a file name as well as a name of the dataset.

import numpy as np
from kgcnn.data.base import MemoryGraphDataset
dataset = MemoryGraphDataset(data_directory="", dataset_name="Example")
# Methods of MemoryGraphList
dataset.set("edge_indices", [np.array([[1, 0], [0, 1]])])
dataset.set("edge_labels", [np.array([[0], [1]])])
dataset.save()
dataset.load()

The file directory and file name are used in child classes and in save and load.

__init__(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, file_directory: Optional[str] = None, verbose: int = 10)[source]¶

Initialize a base class of MemoryGraphDataset.

Parameters

data_directory (str) – Full path to directory of the dataset. Default is None.
file_name (str) – Generic filename for dataset to read into memory like a ‘csv’ file. Default is None.
file_directory (str) – Name or relative path from data_directory to a directory containing sorted files. Default is None.
dataset_name (str) – Name of the dataset. Important for naming and saving files. Default is None.
verbose (int) – Logging level. Default is 10.

_verify_data_directory() → Optional[str][source]¶: Utility function that checks if data_directory is set correctly.

assert_valid_model_input(hyper_input: Union[list, dict], raise_error_on_fail: bool = True)[source]¶

Check whether dataset has graph properties (in tensor format) requested by model input.

The hyper_input list defines the model inputs and follows the same interface as the model hyperparameters: it is a list of layer configs for the keras Input layers. It must contain one dictionary per model input, each with “name” and “shape” keys.

hyper_input = [
    {"shape": [None, 8710], "name": "node_attributes", "dtype": "float32", "ragged": True},
    {"shape": [None, 1], "name": "edge_weights", "dtype": "float32", "ragged": True},
    {"shape": [None, 2], "name": "edge_indices", "dtype": "int64", "ragged": True}
]

Parameters

hyper_input (list) – List of properties that need to be available to a model for training.
raise_error_on_fail (bool) – Whether to raise an error if assertion failed.
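The check described above can be pictured with plain dictionaries. The following is a minimal sketch, not the library's implementation: missing_model_inputs is a hypothetical helper, graphs are modelled as ordinary dicts, and only the “name” key of each input config is inspected.

```python
from typing import Dict, List


def missing_model_inputs(graphs: List[Dict], hyper_input: List[Dict]) -> Dict[str, List[int]]:
    """Hypothetical helper: map each requested property name to the graph indices lacking it."""
    missing = {}
    for config in hyper_input:
        name = config["name"]
        lacking = [i for i, graph in enumerate(graphs) if name not in graph]
        if lacking:
            missing[name] = lacking
    return missing


hyper_input = [
    {"shape": [None, 1], "name": "edge_weights", "dtype": "float32", "ragged": True},
    {"shape": [None, 2], "name": "edge_indices", "dtype": "int64", "ragged": True},
]
# Plain dicts stand in for graph items here (an assumption for illustration).
graphs = [
    {"edge_indices": [[0, 1], [1, 0]], "edge_weights": [[1.0], [1.0]]},
    {"edge_indices": [[0, 1], [1, 0]]},
]
report = missing_model_inputs(graphs, hyper_input)
print(report)
```

An empty report would mean that every graph provides every requested property.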

collect_files_in_file_directory(file_column_name: Optional[str] = None, table_file_path: Optional[str] = None, read_method_file: Optional[Callable] = None, update_counter: int = 1000, append_file_content: bool = True, read_method_return_list: bool = False) → list [source]¶

Utility function to collect single files in file_directory by names in CSV table file.

Parameters

file_column_name (str) – Name of the column in Table file that holds list of file names.
table_file_path (str) – Path to table file. Can be None. Default is None.
read_method_file (Callable) – Callable read-file method to return (processed) file content.
update_counter (int) – Loop counter to show progress. Default is 1000.
append_file_content (bool) – Whether to append or add return of read_method_file.
read_method_return_list (bool) – Whether read_method_file returns list of items or the item itself.

Returns

File content loaded from single files.

Return type

list

error(*args, **kwargs)[source]¶: Pass error to class’ logger instance.

property file_directory_path¶: Construct file-directory path from ‘data_directory’ and ‘file_directory’ given in init.

property file_path¶: Construct filepath from ‘file_name’ given in init.

fits_in_memory = True¶

get_train_test_indices(train: str = 'train', test: str = 'test', valid: Optional[str] = None, split_index: Optional[Union[int, list]] = None, shuffle: bool = False, seed: Optional[int] = None) → List[List[numpy.ndarray]][source]¶

Get train and test indices from graph list. The ‘train’ and ‘test’ properties must be set on each graph; an additional property for the validation split is optional. Each of these properties may take one of the following forms:

The property is a boolean integer value indicating whether the corresponding element of the dataset belongs to that part of the split (train / test).
The property is a list of integer split indices, where each split index in that list marks the corresponding dataset element as part of that particular split. In this case the split_index parameter may also be a list of split indices, specifying the splits for which train and test indices are returned.

The return value is a list with one entry per split index given in split_index. The split_index parameter defaults to None.

Parameters

train (str) – Name of graph property that has train split assignment. Defaults to ‘train’.
test (str) – Name of graph property that has test split assignment. Defaults to ‘test’.
valid (str) – Name of graph property that has validation assignment. Defaults to None.
split_index (int, list) – Split index to get indices for. Can also be list.
shuffle (bool) – Whether to shuffle splits. Default is False.
seed (int) – Random seed for shuffle. Default is None.

Returns

List of tuples (or triples) of train, test, (validation) split indices.

Return type

list
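To illustrate the list-valued form of the split properties, here is a minimal sketch under stated assumptions. split_indices is a hypothetical function, graphs are plain dicts, and the boolean form, the validation split, shuffling and the seed are all left out.

```python
from typing import Dict, List, Tuple


def split_indices(graphs: List[Dict], split_index: List[int],
                  train: str = "train", test: str = "test") -> List[Tuple[List[int], List[int]]]:
    """Hypothetical sketch: collect (train, test) graph indices for each requested split."""
    result = []
    for k in split_index:
        # A graph belongs to split k if k is listed in its split property.
        train_idx = [i for i, g in enumerate(graphs) if k in g.get(train, [])]
        test_idx = [i for i, g in enumerate(graphs) if k in g.get(test, [])]
        result.append((train_idx, test_idx))
    return result


graphs = [
    {"train": [0], "test": [1]},
    {"train": [0, 1], "test": []},
    {"train": [1], "test": [0]},
]
splits = split_indices(graphs, split_index=[0, 1])
for k, (train_idx, test_idx) in enumerate(splits):
    print(k, train_idx, test_idx)
```

The result has one entry per requested split index, which matches the description of the return value.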

info(*args, **kwargs)[source]¶: Pass information to class’ logger instance.

load(filepath: Optional[str] = None)[source]¶

Load graph properties from a pickled file. By default, loads a file named dataset_name.kgcnn.pickle in data_directory.

Parameters: filepath (str) – Full path of input file.

read_in_table_file(file_path: Optional[str] = None, **kwargs)[source]¶

Read a table file from file_path into the data_frame attribute using pandas. By default, uses file_name. Checks for a ‘.csv’ file first and then for Excel file endings. This means the file extension of file_path is ignored, but the file on disk must have one of the following endings: ‘.csv’, ‘.xls’, ‘.xlsx’, ‘.odt’.

Parameters

file_path (str) – File path to table file. Default is None.
kwargs – Kwargs for pandas read_csv function.

Returns

self
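The extension lookup described for read_in_table_file can be sketched as follows. This is only an illustration of the stated order (‘.csv’ first, then the other endings); find_table_file is a hypothetical helper, and how the library actually resolves and reads the file is not shown here.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Endings listed in the description, tried in this order (assumed order).
TABLE_SUFFIXES = (".csv", ".xls", ".xlsx", ".odt")


def find_table_file(file_path: str) -> Optional[Path]:
    """Hypothetical helper: ignore the given extension and return the first existing candidate."""
    path = Path(file_path)
    for suffix in TABLE_SUFFIXES:
        candidate = path.parent / (path.stem + suffix)
        if candidate.exists():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Only an Excel-style file exists on disk, although a '.csv' path is requested.
    (Path(tmp) / "table.xlsx").write_text("placeholder")
    found = find_table_file(str(Path(tmp) / "table.csv"))
    found_name = found.name if found is not None else None
    missing = find_table_file(str(Path(tmp) / "other.csv"))

print(found_name, missing)
```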

relocate(data_directory: Optional[str] = None, file_name: Optional[str] = None, file_directory: Optional[str] = None)[source]¶

Change file information. Does not copy files on disk!

Parameters

data_directory (str) – Full path to directory of the dataset. Default is None.
file_name (str) – Generic filename for dataset to read into memory like a ‘csv’ file. Default is None.
file_directory (str) – Name or relative path from data_directory to a directory containing sorted files. Default is None.

Returns

self

save(filepath: Optional[str] = None)[source]¶

Save all graph properties, as python dictionaries, to a pickled file. By default, saves a file named dataset_name.kgcnn.pickle in data_directory.

Parameters: filepath (str) – Full path of output file. Default is None.
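The default file naming used by save and load can be illustrated with the standard pickle module. This is a sketch under assumptions: default_pickle_path is a hypothetical helper, graphs are plain dicts, and the exact object layout the library writes to disk is not reproduced.

```python
import os
import pickle
import tempfile


def default_pickle_path(data_directory: str, dataset_name: str) -> str:
    """Hypothetical helper: build '<data_directory>/<dataset_name>.kgcnn.pickle'."""
    return os.path.join(data_directory, dataset_name + ".kgcnn.pickle")


graphs = [{"edge_indices": [[0, 1], [1, 0]], "edge_labels": [[0], [1]]}]

with tempfile.TemporaryDirectory() as tmp:
    path = default_pickle_path(tmp, "Example")
    # Save: write the list of property dictionaries to disk.
    with open(path, "wb") as f:
        pickle.dump(graphs, f)
    # Load: read the same file back into memory.
    with open(path, "rb") as f:
        restored = pickle.load(f)

file_name = os.path.basename(path)
print(file_name)
```

As with any pickle file, only files from trusted sources should be loaded.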

set_methods(method_list: List[dict]) → None [source]¶

Apply a list of serialized class-methods on the dataset.

This can extend the config-serialization scheme in kgcnn.utils.serial.

for method_item in method_list:
    for method, kwargs in method_item.items():
        if hasattr(self, method):
            getattr(self, method)(**kwargs)

Parameters: method_list (list) – A list of dictionaries that specify class methods. The dict key denotes the method and the value must contain kwargs for the method.
Returns: None.
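A usage sketch of this dispatch loop on a stand-in class follows. ToyDataset and its scale and shift methods are hypothetical and exist only for illustration; the loop itself is the one shown above.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset class; not part of kgcnn."""

    def __init__(self):
        self.calls = []

    def scale(self, factor: float = 1.0):
        self.calls.append(("scale", factor))

    def shift(self, offset: float = 0.0):
        self.calls.append(("shift", offset))

    def set_methods(self, method_list):
        # Same loop as in the description: unknown method names are skipped.
        for method_item in method_list:
            for method, kwargs in method_item.items():
                if hasattr(self, method):
                    getattr(self, method)(**kwargs)


toy = ToyDataset()
toy.set_methods([
    {"scale": {"factor": 2.0}},
    {"unknown_method": {}},
    {"shift": {"offset": -1.0}},
])
print(toy.calls)
```

Because of the hasattr check, a misspelled method name does not raise an error with this loop, so a serialized method list is worth double-checking.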

set_multi_target_labels(graph_labels: str = 'graph_labels', multi_target_indices: Optional[list] = None, data_unit: Optional[Union[str, list]] = None)[source]¶

Select multiple targets in labels.

Parameters

graph_labels (str) – Name of the property that holds multiple targets.
multi_target_indices (list) – List of indices of targets to select.
data_unit (str, list) – Optional list of data units for all labels in graph_labels.

Returns

List of label names and label units for each target.

Return type

tuple

set_train_test_indices_k_fold(n_splits: int = 5, shuffle: bool = False, random_state: Optional[int] = None, train: str = 'train', test: str = 'test')[source]¶

Helper function to set train/test indices for each graph in the list from a random k-fold cross-validation.

Parameters

n_splits (int) – Number of splits.
shuffle (bool) – Whether to shuffle indices.
random_state (int) – Random seed for split.
train (str) – Property to assign train indices to.
test (str) – Property to assign test indices to.

Returns

None.
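The resulting data layout can be sketched without any dependencies. This is an assumption-laden illustration, not the library's code: assign_k_fold is hypothetical, folds are contiguous and unshuffled, and each graph (a plain dict here) stores the split indices in which it is used for training or testing.

```python
from typing import Dict, List


def assign_k_fold(graphs: List[Dict], n_splits: int = 5,
                  train: str = "train", test: str = "test") -> None:
    """Hypothetical sketch: store per-graph lists of split indices for a k-fold split."""
    n = len(graphs)
    # Fold sizes differ by at most one; the first n % n_splits folds get the extra item.
    sizes = [n // n_splits + (1 if k < n % n_splits else 0) for k in range(n_splits)]
    for g in graphs:
        g[train], g[test] = [], []
    start = 0
    for k, size in enumerate(sizes):
        test_ids = set(range(start, start + size))
        for i, g in enumerate(graphs):
            if i in test_ids:
                g[test].append(k)
            else:
                g[train].append(k)
        start += size


graphs = [{} for _ in range(6)]
assign_k_fold(graphs, n_splits=3)
for i, g in enumerate(graphs):
    print(i, g)
```

Each graph ends up in the test set of exactly one split and in the training set of all others.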

warning(*args, **kwargs)[source]¶: Pass warning to class’ logger instance.

class kgcnn.data.base.MemoryGraphList(iterable: Optional[list] = None)[source]¶

Bases: list

Class to store a list of graph dictionaries in memory.

Inherits from a python list. The items of the list are GraphDict objects, which define graph properties by tensor-like (numpy) arrays for indices, attributes, labels, symbols, etc. Items are accessed via the [] indexing operator.

For a single named property, a python list with the values of all GraphDict items can be obtained via get, or assigned from a python list via set.

The MemoryGraphList further provides simple map functionality via map_list, which applies a method to each GraphDict, and casts properties to tensors via tensor.

Cleaning the list for missing properties or empty graphs is done with clean.

import numpy as np
from kgcnn.data.base import MemoryGraphList

data = MemoryGraphList()
data.empty(1)
data.set("edge_indices", [np.array([[0, 1], [1, 0]])])
data.set("node_labels", [np.array([[0], [1]])])
print(data.get("edge_indices"))
data.set("node_coordinates", [np.array([[1, 0, 0], [0, 1, 0], [0, 2, 0], [0, 3, 0]])])
data.map_list("set_range", max_distance=1.5, max_neighbours=10, self_loops=False)
data.clean("range_indices")  # Returns cleaned graph indices
print(len(data))
print(data[0])

__init__(iterable: Optional[list] = None)[source]¶

Initialize an empty MemoryGraphList instance.

Parameters: iterable (list, MemoryGraphList) – A list or MemoryGraphList of GraphDict items.

append(graph)[source]¶: Append object to the end of the list.

assign_property(key: str, value: list)[source]¶

Assign a list of numpy arrays of a property to the respective GraphDict items in this list.

Parameters

key (str) – Name of the property.
value (list) – List of numpy arrays for property key.

Returns

self

clean(inputs: Union[list, str])[source]¶

Given a list of property names, this method removes every GraphDict from the list that lacks at least one of those properties. Consequently, only graphs that define all properties specified by inputs remain in the list.

Parameters

inputs (list, str) – A list of strings, where each string is supposed to be a property name, which the graphs in this list may possess. Within kgcnn, this can simply be the ‘input’ category of a model configuration, i.e. a list of dicts that specify the name of the property with the ‘name’ key.

Returns

A list of indices of the graphs that did not have the required properties and have been removed.

Return type

invalid_graphs (np.ndarray)
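The cleaning step can be sketched on a list of plain dicts. This is a simplified illustration and not the library's implementation: clean_graphs is a hypothetical function, it returns a python list instead of a numpy array, and it only tests whether a property is present and not None.

```python
from typing import Dict, List, Union


def clean_graphs(graphs: List[Dict], inputs: Union[list, str]) -> List[int]:
    """Hypothetical sketch: drop graphs lacking any requested property, return dropped indices."""
    if isinstance(inputs, str):
        inputs = [inputs]
    # Accept either property names or model-input dicts with a 'name' key.
    names = [x["name"] if isinstance(x, dict) else x for x in inputs]
    invalid = [i for i, g in enumerate(graphs) if any(g.get(n) is None for n in names)]
    # Delete from the end so that earlier indices stay valid.
    for i in reversed(invalid):
        del graphs[i]
    return invalid


graphs = [
    {"edge_indices": [[0, 1]], "node_labels": [[0], [1]]},
    {"edge_indices": [[0, 1]]},
    {"node_labels": [[1]]},
]
removed = clean_graphs(graphs, ["edge_indices", {"name": "node_labels"}])
print(removed, len(graphs))
```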

copy(deep_copy: bool = False)[source]¶: Copy data in the list.

empty(length: int)[source]¶

Create an empty list in place. Overwrites existing list.

Parameters: length (int) – Length of the empty list.
Returns: self

get(key: str) → Optional[List]¶

Returns a list with the values of all the graphs defined for the string property name key. If none of the graphs in the list have this property, returns None.

Parameters: key (str) – The string name of the property to be retrieved for all the graphs contained in this list

insert(index: int, value) → None [source]¶: Insert object before index.

property length¶: Length of list.

map_list(method: Union[str, Callable], **kwargs)[source]¶

Map a method over this list and apply it to each GraphDict. If method is a string, either a class method of GraphDict or a preprocessor is chosen, for backward compatibility.

for i, x in enumerate(self):
    method(x, **kwargs)

Parameters

method (str, Callable) – Name of the GraphDict method, or a callable to apply to each GraphDict.
kwargs – Kwargs for method.

Returns

self

obtain_property(key: str) → Optional[List][source]¶

Returns a list with the values of all the graphs defined for the string property name key. If none of the graphs in the list have this property, returns None.

Parameters: key (str) – The string name of the property to be retrieved for all the graphs contained in this list

rename_property_on_graphs(old_property_name: str, new_property_name: str) → list [source]¶

Change the name of a graph property on all graphs in the list.

Parameters

old_property_name (str) – Old property name.
new_property_name (str) – New property name.

Returns

List indices of replaced property names.

Return type

list

set(key: str, value: list)¶

Assign a list of numpy arrays of a property to the respective GraphDict items in this list.

Parameters

key (str) – Name of the property.
value (list) – List of numpy arrays for property key.

Returns

self
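The get and set semantics described for this class can be pictured with plain dicts. The sketch below is an illustration only: set_property and get_property are hypothetical functions, and the requirement that value has the same length as the list is an assumption made for this example.

```python
from typing import Dict, List, Optional


def set_property(graphs: List[Dict], key: str, value: list) -> List[Dict]:
    """Hypothetical sketch: assign value[i] as property 'key' of graph i."""
    if len(value) != len(graphs):
        # Assumption for this sketch: one value per graph is required.
        raise ValueError("Need exactly one value per graph.")
    for graph, item in zip(graphs, value):
        graph[key] = item
    return graphs


def get_property(graphs: List[Dict], key: str) -> Optional[list]:
    """Hypothetical sketch: list the property for all graphs, or None if no graph has it."""
    values = [graph.get(key) for graph in graphs]
    if all(v is None for v in values):
        return None
    return values


graphs = [{}, {}]
set_property(graphs, "node_labels", [[[0], [1]], [[1]]])
labels = get_property(graphs, "node_labels")
absent = get_property(graphs, "edge_indices")
print(labels, absent)
```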

tensor(items: Union[list, Dict], make_copy=True)[source]¶

Make tensor objects from multiple graph properties in list.

It is recommended to run clean beforehand.

Parameters

items (list) – List of dictionaries that specify graph properties in list via ‘name’ key. The dict-items match the tensor input for tf.keras.layers.Input layers. Required dict-keys should be ‘name’ and ‘ragged’. Optionally shape information can be included via ‘shape’ and ‘dtype’. E.g.: [{‘name’: ‘edge_indices’, ‘ragged’: True}, {…}, …].
make_copy (bool) – Whether to copy the data. Default is True.

Returns

List of Tensors.

Return type

list

tf_dataset_disjoint(inputs, **kwargs)[source]¶

Return generator via tf.data.Dataset from this list. Uses kgcnn.io.loader.tf_dataset_disjoint_generator

Parameters

inputs –
kwargs – Kwargs for tf_dataset_disjoint_generator

Returns

Dataset from generator.

Return type

tf.data.Dataset

update(other) → None [source]¶

validate()[source]¶

kgcnn.data.crystal module¶

class kgcnn.data.crystal.CrystalDataset(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, file_directory: Optional[str] = None, file_name_pymatgen_json: Optional[str] = None, verbose: int = 10)[source]¶

Bases: kgcnn.data.base.MemoryGraphDataset

Class for making graph dataset from periodic structures such as crystals.

The dataset class requires a data_directory to store a table ‘.csv’ file containing labels and information of the structures stored in multiple (CIF, POSCAR, …) files in file_directory. The file names must be included in the ‘.csv’ table. The table file must have one line of header with column names!

├── data_directory
    ├── file_directory
    │   ├── *.cif
    │   ├── *.cif
    │   └── ...
    ├── file_name.csv
    ├── file_name.pymatgen.json
    └── dataset_name.kgcnn.pickle

This class uses pymatgen.core.structure.Structure and therefore requires pymatgen to be installed. A ‘.pymatgen.json’ serialized file is generated to store a list of structures from single ‘.cif’ files via prepare_data(). Consequently, a ‘file_name.pymatgen.json’ can be directly stored in data_directory. In this case, prepare_data() does not have to be used. Additionally, a table file ‘file_name.csv’ that lists the single file names and possible labels or classification targets is required.

from kgcnn.data.crystal import CrystalDataset
dataset = CrystalDataset(
    data_directory="data_directory/",
    dataset_name="ExampleCrystal",
    file_name="file_name.csv",
    file_directory="file_directory")
dataset.prepare_data(file_column_name="file_name", overwrite=True)
dataset.read_in_memory(label_column_name="label")

__init__(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, file_directory: Optional[str] = None, file_name_pymatgen_json: Optional[str] = None, verbose: int = 10)[source]¶

Initialize a base class of CrystalDataset.

Parameters

data_directory (str) – Full path to directory of the dataset. Default is None.
file_name (str) – Filename for dataset to read into memory. This is a table file. The ‘.csv’ should contain file names that are expected to be CIF-files in file_directory. Default is None.
file_directory (str) – Name or relative path from data_directory to a directory containing sorted ‘cif’ files. Default is None.
file_name_pymatgen_json (str) – This class will generate a ‘json’ file with pymatgen structures. You can specify the file name of that file with this argument. By default, it will be named from file_name when passed None.
dataset_name (str) – Name of the dataset. Important for naming and saving files. Default is None.
verbose (int) – Logging level. Default is 10.

_map_callbacks(structs: list, data: pandas.core.series.Series, callbacks: Dict[str, Callable[[pymatgen.core.structure.Structure, pandas.core.series.Series], Optional[numpy.ndarray]]], assign_to_self: bool = True) → dict [source]¶

Map callbacks on a data series object plus structure list.

Parameters

structs (list) – List of pymatgen structures.
data (pd.Series, pd.DataFrame) – Data Frame matching the structure list.
callbacks (dict) – Dictionary of callbacks that take a data object plus pymatgen structure as argument.
assign_to_self (bool) – Whether to already assign the output of callbacks to this class.

Returns

Values of callbacks.

Return type

dict

get_structures_from_json_file(file_path: Optional[str] = None) → List[source]¶

Load pymatgen serialized json-file into memory.

Structures are not added to CrystalDataset but returned by this function.

Parameters: file_path (str) – File path to json-file, uses class default. Default is None.
Returns: List of pymatgen structures.
Return type: list

prepare_data(file_column_name: Optional[str] = None, overwrite: bool = False)[source]¶

Default preparation for crystal datasets.

Try to load all crystal structures from single files and save them as a pymatgen json serialization. Can load multiple CIF files from a table that keeps file names and possible labels or additional information.

Parameters

file_column_name (str) – Name of the column that has file names found in file_directory. Default is None.
overwrite (bool) – Whether to rerun the data extraction. Default is False.

Returns

self

property pymatgen_json_file_path¶: Internal file name for the pymatgen serialization information to store to disk.

read_in_memory(label_column_name: Optional[str] = None, additional_callbacks: Optional[Dict[str, Callable[[pymatgen.core.structure.Structure, pandas.core.series.Series], None]]] = None)[source]¶

Read structures from pymatgen json serialization and convert them into graph information.

Parameters

label_column_name (str) – Columns of labels for graph in table file. Default is None.
additional_callbacks (dict) – Callbacks to add during read into memory.

Returns

self

save_structures_to_json_file(structs: list, file_path: Optional[str] = None)[source]¶

Save a list of pymatgen structures to file.

Parameters

structs (list) – List of pymatgen structures.
file_path (str) – File path to store structures to disk, uses class-default. Default is None.

Returns

None.

set_representation(pre_processor: Union[kgcnn.crystal.base.CrystalPreprocessor, dict], reset_graphs: bool = False)[source]¶

Build a graph representation for this dataset using kgcnn.crystal.

Parameters

pre_processor (CrystalPreprocessor) – Crystal preprocessor to use.
reset_graphs (bool) – Whether to reset the graph information. Default is False.

Returns:

kgcnn.data.download module¶

class kgcnn.data.download.DownloadDataset(dataset_name: Optional[str] = None, download_file_name: Optional[str] = None, data_directory_name: Optional[str] = None, unpack_directory_name: Optional[str] = None, extract_file_name: Optional[str] = None, download_url: Optional[str] = None, unpack_tar: bool = False, unpack_zip: bool = False, extract_gz: bool = False, reload: bool = False, verbose: int = 10, data_main_dir: str = '/home/docs/.kgcnn/datasets', execute_download_dataset_on_init: bool = True)[source]¶

Bases: object

Download class for datasets.

Provides static methods and functions to download and unpack the data. They are intentionally kept general and could also be used without this class definition. By default, downloading is handled by download_dataset_to_disk, which is already called in init. Dataset-specific functions like prepare_data must be implemented in subclasses.

Note

Note that DownloadDataset uses a main directory located at ‘~/.kgcnn/datasets’ for downloading datasets as default.

Classes in kgcnn.data.datasets inherit from this class, but DownloadDataset can also be used as a member via composition.

Warning

Downloads are not checked for safety or malware. Use with caution!

Example on how to use DownloadDataset standalone:

from kgcnn.data.download import DownloadDataset
download = DownloadDataset(
    download_url="https://github.com/aimat-lab/gcnn_keras/blob/master/README.md",
    data_main_dir="./",
    data_directory_name="",
    download_file_name="README.html",
    reload=True,
    execute_download_dataset_on_init=False
)
download.download_dataset_to_disk()

__init__(dataset_name: Optional[str] = None, download_file_name: Optional[str] = None, data_directory_name: Optional[str] = None, unpack_directory_name: Optional[str] = None, extract_file_name: Optional[str] = None, download_url: Optional[str] = None, unpack_tar: bool = False, unpack_zip: bool = False, extract_gz: bool = False, reload: bool = False, verbose: int = 10, data_main_dir: str = '/home/docs/.kgcnn/datasets', execute_download_dataset_on_init: bool = True)[source]¶

Base initialization function for downloading and extracting the data. The arguments to the constructor determine what to download and whether to unpack the download. The main function download_dataset_to_disk is already called in the constructor.

Parameters

dataset_name (str) – Name of the dataset to download (optional). Default is None.
download_file_name (str) – Name of the file that the source is downloaded to. Default is None.
data_directory_name (str) – Name of the data directory in data_main_dir in which the file is saved. Default is None.
unpack_directory_name (str) – The name of a new directory in data_directory_name to unpack archive. Default is None.
extract_file_name (str) – Name of a gz-file to extract. Default is None.
download_url (str) – Url for file to download. Default is None.
unpack_tar (bool) – Whether to unpack a tar-archive. Default is False.
unpack_zip (bool) – Whether to unpack a zip-archive. Default is False.
extract_gz (bool) – Whether to unpack a gz-archive. Default is False.
reload (bool) – Whether to reload the data and make new dataset. Default is False.
verbose (int) – Logging level. Default is 10.
execute_download_dataset_on_init (bool) – Whether to start download on class construction.

static download_database(path: str, download_url: str, filename: str, overwrite: bool = False, logger=None)[source]¶

Download dataset file.

Parameters

path (str) – Target filepath to store file (without filename).
download_url (str) – URL to fetch the database from.
filename (str) – Name the dataset is downloaded to.
overwrite (bool) – Overwrite existing database. Default is False.
logger – Logger to print information or warnings.

Returns

Filepath of downloaded file.

Return type

os.path

download_dataset_to_disk()[source]¶: Main download function to run the download and unpack of the dataset. Defined by attributes in self.

static extract_gz_file(path: str, filename: str, out_filename: Optional[str] = None, overwrite: bool = False, logger=None)[source]¶

Extract gz-file.

Parameters

path (str) – Filepath where the gz-file to extract is located.
filename (str) – Name of the gz-file to extract.
out_filename (str) – Name of the extracted file.
overwrite (bool) – Overwrite existing database. Default is False.
logger – Logger to print information or warnings.

Returns

Filepath of the extracted file.

Return type

os.path
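What such a gz extraction step involves can be sketched with the standard library. This is not the library's implementation: extract_gz is a hypothetical function that mirrors the path, filename, out_filename and overwrite parameters, and the fallback name used when out_filename is None is an assumption.

```python
import gzip
import os
import shutil
import tempfile
from typing import Optional


def extract_gz(path: str, filename: str, out_filename: Optional[str] = None,
               overwrite: bool = False) -> str:
    """Hypothetical sketch: decompress '<path>/<filename>' and return the output file path."""
    if out_filename is None:
        # Assumed fallback: strip a trailing '.gz', otherwise append a marker.
        out_filename = filename[:-3] if filename.endswith(".gz") else filename + ".extracted"
    out_path = os.path.join(path, out_filename)
    if os.path.exists(out_path) and not overwrite:
        return out_path  # Keep the existing file untouched.
    with gzip.open(os.path.join(path, filename), "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    with gzip.open(os.path.join(tmp, "data.txt.gz"), "wb") as f:
        f.write(b"hello")
    out_path = extract_gz(tmp, "data.txt.gz")
    out_name = os.path.basename(out_path)
    with open(out_path, "rb") as f:
        content = f.read()

print(out_name, content)
```

Streaming with shutil.copyfileobj avoids loading the whole decompressed file into memory, which matters for large datasets.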

static setup_dataset_dir(data_main_dir: str, data_directory: str, logger=None)[source]¶

Make directory for each dataset.

Parameters

data_main_dir (str) – Path-location of the directory for all datasets.
data_directory (str) – Path of the directory for specific dataset to create.
logger – Logger to print information or warnings.

static setup_dataset_main(data_main_dir, logger=None)[source]¶

Make the main-directory for all datasets to store data.

Parameters

data_main_dir (str) – Path to create directory.
logger – Logger to print information or warnings.

static unpack_tar_file(path: str, filename: str, unpack_directory: str, overwrite: bool = False, logger=None)[source]¶

Extract tar-file.

Parameters

path (str) – Filepath where the tar-file to extract is located.
filename (str) – Name of the dataset to extract.
unpack_directory (str) – Directory to extract data to.
overwrite (bool) – Overwrite existing database. Default is False.
logger – Logger to print information or warnings.

Returns

Filepath of the extracted dataset folder.

Return type

os.path

static unpack_zip_file(path: str, filename: str, unpack_directory: str, overwrite: bool = False, logger=None)[source]¶

Extract zip-file.

Parameters

path (str) – Filepath where the zip-file to extract is located.
filename (str) – Name of the dataset to extract.
unpack_directory (str) – Directory to extract data to.
overwrite (bool) – Overwrite existing database. Default is False.
logger – Logger to print information or warnings.

Returns

Filepath of the extracted dataset folder.

Return type

os.path

kgcnn.data.force module¶

class kgcnn.data.force.ForceDataset(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, verbose: int = 10, file_directory: Optional[str] = None, file_name_xyz: Optional[str] = None, file_name_mol: Optional[str] = None, file_name_force_xyz: Optional[str] = None)[source]¶

Bases: kgcnn.data.qm.QMDataset

This is a base class for Force datasets. Inherits all functionality from QMDataset.

It generates graph properties from an xyz-file, which stores atomic coordinates. Additionally, collecting multiple single xyz-files into one file is supported. The file names and labels are given by a CSV or table file. The table file must have one line of header with column names!

├── data_directory
    ├── file_directory
    │   ├── *.xyz
    │   ├── *.xyz
    │   └── ...
    ├── file_name.csv
    ├── file_name.xyz
    ├── file_name.sdf
    ├── file_name_force.xyz
    ├── ...
    └── dataset_name.kgcnn.pickle

Additionally, forces xyz information can be read in with this class.

__init__(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, verbose: int = 10, file_directory: Optional[str] = None, file_name_xyz: Optional[str] = None, file_name_mol: Optional[str] = None, file_name_force_xyz: Optional[str] = None)[source]¶

Default initialization. File information on the location of the dataset on disk should be provided here.

Parameters

data_directory (str) – Full path to directory of the dataset. Optional. Default is None.
file_name (str) – Filename for reading table ‘.csv’ file into memory. Must be given! For example as ‘.csv’ formatted file with QM labels such as energy, states, dipole etc. Moreover, the table file can contain a list of file names of individual ‘.xyz’ files to collect. Files are expected to be in file_directory. Default is None.
file_name_xyz (str) – Filename of a single ‘.xyz’ file. This file is generated when collecting single ‘.xyz’ files in file_directory. If not specified, the name is generated based on file_name.
file_name_mol (str) – Filename of a single ‘.sdf’ file. This file is generated from the single ‘.xyz’ file. SDF generation does require proper geometries. If not specified, the name is generated based on file_name.
file_directory (str) – Name or relative path from data_directory to a directory containing sorted ‘.xyz’ files. Only used if file_name is None. Default is None.
dataset_name (str) – Name of the dataset. Important for naming and saving files. Default is None.
verbose (int) – Logging level. Default is 10.

property file_path_force_xyz¶: Try to determine a file name for the force xyz information to store.

prepare_data(overwrite: bool = False, file_column_name: Optional[str] = None, file_column_name_force: Optional[str] = None, make_sdf: bool = False)[source]¶

Pre-computation of molecular structure information in an sdf-file from an xyz-file or a folder of xyz-files.

If there is no single xyz-file, it will be created with the information of a csv-file with the same name.

Parameters

overwrite (bool) – Overwrite existing database SDF file. Default is False.
file_column_name (str) – Name of the column in csv file with list of xyz-files located in file_directory. This is for the positions only.
file_column_name_force (str, list) – Column name of xyz files for forces in file directory.
make_sdf (bool) – Whether to try to make a sdf file from xyz information via RDKit and OpenBabel.

Returns

self

read_in_memory(label_column_name: Optional[Union[str, list]] = None, nodes: Optional[list] = None, edges: Optional[list] = None, graph: Optional[list] = None, encoder_nodes: Optional[dict] = None, encoder_edges: Optional[dict] = None, encoder_graph: Optional[dict] = None, add_hydrogen: bool = True, sanitize: bool = False, make_directed: bool = False, compute_partial_charges: bool = False, additional_callbacks: Optional[Dict[str, Callable[[kgcnn.molecule.base.MolGraphInterface, dict], None]]] = None, custom_transform: Optional[Callable[[kgcnn.molecule.base.MolGraphInterface], kgcnn.molecule.base.MolGraphInterface]] = None)[source]¶

Read geometric information into memory.

Graph labels require a column specified by label_column_name.

Parameters

label_column_name (str, list) – Name of labels for columns in CSV file.
nodes (list) – A list of node attributes as string. In place of names also functions can be added.
edges (list) – A list of edge attributes as string. In place of names also functions can be added.
graph (list) – A list of graph attributes as string. In place of names also functions can be added.
encoder_nodes (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_edges (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_graph (dict) – A dictionary of callable encoder where the key matches the attribute.
add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is True.
make_directed (bool) – Whether to have directed or undirected bonds. Default is False.
compute_partial_charges (bool) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is False.
sanitize (bool) – Whether to sanitize molecule. Default is False.
additional_callbacks (dict) – A dictionary whose keys are string attribute names which the elements of the dataset are supposed to have and the elements are callback function objects which implement how those attributes are derived from the MolecularGraphRDKit of the molecule in question or the row of the CSV file.
custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.

Returns

self

kgcnn.data.moleculenet module¶

class kgcnn.data.moleculenet.MoleculeNetDataset(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, file_name_mol: Optional[str] = None, file_name_smiles: Optional[str] = None, verbose: int = 10)[source]¶

Bases: kgcnn.data.base.MemoryGraphDataset

Class for using ‘MoleculeNet’ datasets.

The concept is to load a table of smiles and corresponding targets and convert them into a tensor representation for graph networks.

├── data_directory
    ├── file_name.csv
    ├── file_name.SMILES
    ├── file_name.sdf
    └── dataset_name.kgcnn.pickle

The class provides properties and methods for making graph features from smiles. The typical input is a csv or excel file with smiles and corresponding graph labels. The table file must have one line of header with column names!

The graph structure matches the molecular graph, i.e. the chemical structure. The atomic coordinates are generated by a conformer guess. Since this requires some computation time, it is only done once, and the molecular coordinates or mol-blocks are stored in a single SDF file with the base name of the csv file_name. Conversion uses the MolConverter class.

The selection of smiles and whether conformers should be generated is handled by subclasses or specified in the methods prepare_data and read_in_memory, see the documentation of the methods for further details.

Attribute generation is carried out via the MolecularGraphRDKit class and requires RDKit as backend. You can also use a pre-processed SDF or SMILES file in data_directory and add their name in the class initialization.
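
As a rough illustration of the file layout described above, the following sketch derives the file paths the class is documented to expect. The helper expected_files is hypothetical and not part of kgcnn. It only assumes that the SDF and SMILES files share the base name of the table file and that the pickle file is named after dataset_name.

```python
import os


def expected_files(data_directory: str, dataset_name: str, file_name: str) -> dict:
    """Hypothetical helper: list the file paths from the directory tree above."""
    base, _ = os.path.splitext(file_name)  # "file_name.csv" -> "file_name"
    return {
        "table": os.path.join(data_directory, file_name),
        "smiles": os.path.join(data_directory, base + ".SMILES"),
        "mol": os.path.join(data_directory, base + ".sdf"),
        "pickle": os.path.join(data_directory, dataset_name + ".kgcnn.pickle"),
    }


paths = expected_files("/tmp", "example", "moleculenet_example.csv")
for key, path in paths.items():
    print(key, path)
```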

__init__(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, file_name_mol: Optional[str] = None, file_name_smiles: Optional[str] = None, verbose: int = 10)[source]¶

Initialize a MoleculeNetDataset with information of the dataset location on disk.

Parameters

file_name (str) – Filename for reading into memory. This must be the name of the ‘.csv’ file. Default is None.
file_name_mol (str) – Filename of the SDF file. This file is generated from the SMILES file, which in turn is generated from the list of smiles given in the table file specified by file_name. When passed None, the name is chosen equal to file_name.
file_name_smiles (str) – Filename of the SMILES file that is generated from the list of smiles given in the table file specified by file_name. When passed None, the name is chosen equal to file_name.
data_directory (str) – Full path to directory containing all dataset files. Default is None. Not used by this subclass. Ignored.
dataset_name (str) – Name of the dataset. Important for naming. Default is None.
verbose (int) – Logging level. Default is 10.

property file_path_mol¶: Try to determine a file path for the mol information to store.

property file_path_smiles¶: Try to determine a file path for the SMILES information to store.

get_mol_blocks_from_sdf_file()[source]¶

prepare_data(overwrite: bool = False, smiles_column_name: str = 'smiles', add_hydrogen: bool = True, sanitize: bool = True, make_conformers: bool = True, optimize_conformer: bool = True, external_program: Optional[dict] = None, num_workers: Optional[int] = None)[source]¶

Computation of molecular structure information and optionally conformers from smiles.

This function reads smiles from the csv-file given by file_name and creates a single SDF File of generated mol-blocks with the same file name. The function requires RDKit and (optionally) OpenBabel. Smiles that are not compatible with both RDKit and OpenBabel result in an empty mol-block in the SDF file to keep the number of molecules the same.

Parameters

overwrite (bool) – Overwrite existing database SDF file. Default is False.
smiles_column_name (str) – Column name where smiles are given in csv-file. Default is “smiles”.
add_hydrogen (bool) – Whether to add H after SMILES translation. Default is True.
sanitize (bool) – Whether to sanitize molecule. Default is True.
make_conformers (bool) – Whether to make conformers. Default is True.
optimize_conformer (bool) – Whether to optimize conformer via force field. Only possible with make_conformers. Default is True.
external_program (dict) – External program for translating smiles. Default is None. If you want to use an external program you have to supply a dictionary of the form: {“class_name”: “balloon”, “config”: {“balloon_executable_path”: …, …}}. Note that usually the parameters like add_hydrogen are ignored. And you need to control the SDF file generation within config of the external_program.
num_workers (int) – Parallel execution for translating smiles.

Returns

self
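
The index-preserving behaviour described above (an empty mol-block for every smiles that cannot be converted) can be sketched as follows. This is a minimal illustration under stated assumptions: smiles_to_mol_blocks and toy_convert are hypothetical stand-ins, and no RDKit or OpenBabel call is made.

```python
from typing import Callable, List, Optional


def smiles_to_mol_blocks(smiles: List[str], convert: Callable[[str], Optional[str]]) -> List[str]:
    """Hypothetical sketch: one mol-block per input row, empty string on failure."""
    blocks = []
    for smile in smiles:
        try:
            block = convert(smile)
        except ValueError:
            block = None
        # An empty entry keeps row i of the table aligned with entry i of the SDF file.
        blocks.append(block if block else "")
    return blocks


def toy_convert(smile: str) -> str:
    """Stand-in for a real SMILES-to-mol conversion (assumption, not RDKit)."""
    if not smile or "?" in smile:
        raise ValueError(f"cannot parse {smile!r}")
    return f"mol-block for {smile}"


blocks = smiles_to_mol_blocks(["CCO", "??", "c1ccccc1"], toy_convert)
print(blocks)
```

Because the list keeps its length, the labels read from the table file can later be matched to molecules by position.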

read_in_memory(label_column_name: Optional[Union[str, list]] = None, nodes: Optional[list] = None, edges: Optional[list] = None, graph: Optional[list] = None, encoder_nodes: Optional[dict] = None, encoder_edges: Optional[dict] = None, encoder_graph: Optional[dict] = None, add_hydrogen: bool = False, make_directed: bool = False, has_conformers: bool = True, sanitize: bool = True, compute_partial_charges: Optional[str] = None, additional_callbacks: Optional[Dict[str, Callable[[kgcnn.molecule.base.MolGraphInterface, dict], None]]] = None, custom_transform: Optional[Callable[[kgcnn.molecule.base.MolGraphInterface], kgcnn.molecule.base.MolGraphInterface]] = None)¶

Load list of molecules from cached SDF-file into memory. File name must be given in file_name and path information in the constructor of this class.

It further checks the csv-file for graph labels specified by label_column_name. Entries without valid smiles or a valid molecule in the SDF-file are skipped, but None is added in their place to keep the index and the molecule assignment.

Set further molecular attributes or features by string identifier. Requires MolecularGraphRDKit. Default values are features that have been used by Luo et al (2019).

The argument additional_callbacks allows adding custom properties to each element of the dataset. It is a dictionary whose string keys are the names of the properties and the values are callable function objects which define how the property is derived from either the MolecularGraphRDKit or the corresponding row of the original CSV file. Those callback functions accept two parameters:

mg: The MolecularGraphRDKit instance of the molecule

ds: A pandas data series that matches the data in the CSV file for the specific molecule.

Example:

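from os import linesep
import numpy as np
from kgcnn.data.moleculenet import MoleculeNetDataset
# OneHotEncoder is assumed to be importable from kgcnn.molecule.encoder
from kgcnn.molecule.encoder import OneHotEncoder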
csv = f"index,name,label,smiles{linesep}1,Propranolol,1,[Cl].CC(C)NCC(O)COc1cccc2ccccc12"
with open('/tmp/moleculenet_example.csv', mode='w') as file:
    file.write(csv)

dataset = MoleculeNetDataset('/tmp', 'example', 'moleculenet_example.csv')
dataset.prepare_data(smiles_column_name='smiles')
dataset.read_in_memory(label_column_name='label')
dataset.set_attributes(
    nodes=['Symbol'],
    encoder_nodes={'Symbol': OneHotEncoder(['C', 'O'], dtype='str')},
    edges=['BondType'],
    encoder_edges={'BondType': int},
    additional_callbacks={
        # It is important that the callbacks return a numpy array, even if it is just a single element.
        'name': lambda mg, ds: np.array(ds['name'], dtype='str')
    }
)

mol: dict = dataset[0]
mol['node_attributes']  # np array of one hot encoded atom type per node
mol['edge_attributes']  # int value representing the bond type
mol['name']  # Array of a single string which is the name from the original CSV data

Parameters

label_column_name (str) – Column name in the csv-file where to take graph labels from. For multi-targets you can supply a list of column names or positions. A slice can be provided for selecting columns as graph labels. Default is None.
nodes (list) – A list of node attributes as string. In place of names also functions can be added.
edges (list) – A list of edge attributes as string. In place of names also functions can be added.
graph (list) – A list of graph attributes as string. In place of names also functions can be added.
encoder_nodes (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_edges (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_graph (dict) – A dictionary of callable encoder where the key matches the attribute.
add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is False.
has_conformers (bool) – Whether to add node coordinates from conformer. Default is True.
make_directed (bool) – Whether to have directed or undirected bonds. Default is False.
sanitize (bool) – Whether to sanitize molecule. Default is True.
compute_partial_charges (str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is None.
additional_callbacks (dict) – A dictionary whose keys are string attribute names which the elements of the dataset are supposed to have and the elements are callback function objects which implement how those attributes are derived from the MolecularGraphRDKit of the molecule in question or the row of the CSV file.
custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.

Returns

self

set_attributes(label_column_name: Optional[Union[str, list]] = None, nodes: Optional[list] = None, edges: Optional[list] = None, graph: Optional[list] = None, encoder_nodes: Optional[dict] = None, encoder_edges: Optional[dict] = None, encoder_graph: Optional[dict] = None, add_hydrogen: bool = False, make_directed: bool = False, has_conformers: bool = True, sanitize: bool = True, compute_partial_charges: Optional[str] = None, additional_callbacks: Optional[Dict[str, Callable[[kgcnn.molecule.base.MolGraphInterface, dict], None]]] = None, custom_transform: Optional[Callable[[kgcnn.molecule.base.MolGraphInterface], kgcnn.molecule.base.MolGraphInterface]] = None)[source]¶

Load list of molecules from cached SDF-file into memory. File name must be given in file_name and path information in the constructor of this class.

It further checks the csv-file for graph labels specified by label_column_name. Entries without valid smiles or a valid molecule in the SDF-file are skipped, but None is added in their place to keep the index and the molecule assignment.

Set further molecular attributes or features by string identifier. Requires MolecularGraphRDKit. Default values are features that have been used by Luo et al (2019).

The argument additional_callbacks allows adding custom properties to each element of the dataset. It is a dictionary whose string keys are the names of the properties and the values are callable function objects which define how the property is derived from either the MolecularGraphRDKit or the corresponding row of the original CSV file. Those callback functions accept two parameters:

mg: The MolecularGraphRDKit instance of the molecule

ds: A pandas data series that matches the data in the CSV file for the specific molecule.

Example:

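from os import linesep
import numpy as np
from kgcnn.data.moleculenet import MoleculeNetDataset
# OneHotEncoder is assumed to be importable from kgcnn.molecule.encoder
from kgcnn.molecule.encoder import OneHotEncoder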
csv = f"index,name,label,smiles{linesep}1,Propranolol,1,[Cl].CC(C)NCC(O)COc1cccc2ccccc12"
with open('/tmp/moleculenet_example.csv', mode='w') as file:
    file.write(csv)

dataset = MoleculeNetDataset('/tmp', 'example', 'moleculenet_example.csv')
dataset.prepare_data(smiles_column_name='smiles')
dataset.read_in_memory(label_column_name='label')
dataset.set_attributes(
    nodes=['Symbol'],
    encoder_nodes={'Symbol': OneHotEncoder(['C', 'O'], dtype='str')},
    edges=['BondType'],
    encoder_edges={'BondType': int},
    additional_callbacks={
        # It is important that the callbacks return a numpy array, even if it is just a single element.
        'name': lambda mg, ds: np.array(ds['name'], dtype='str')
    }
)

mol: dict = dataset[0]
mol['node_attributes']  # np array of one hot encoded atom type per node
mol['edge_attributes']  # int value representing the bond type
mol['name']  # Array of a single string which is the name from the original CSV data

Parameters

label_column_name (str) – Column name in the csv-file where to take graph labels from. For multi-targets you can supply a list of column names or positions. A slice can be provided for selecting columns as graph labels. Default is None.
nodes (list) – A list of node attributes as string. In place of names also functions can be added.
edges (list) – A list of edge attributes as string. In place of names also functions can be added.
graph (list) – A list of graph attributes as string. In place of names also functions can be added.
encoder_nodes (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_edges (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_graph (dict) – A dictionary of callable encoder where the key matches the attribute.
add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is False.
has_conformers (bool) – Whether to add node coordinates from conformer. Default is True.
make_directed (bool) – Whether to have directed or undirected bonds. Default is False.
sanitize (bool) – Whether to sanitize molecule. Default is True.
compute_partial_charges (str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is None.
additional_callbacks (dict) – A dictionary whose keys are string attribute names which the elements of the dataset are supposed to have and the elements are callback function objects which implement how those attributes are derived from the MolecularGraphRDKit of the molecule in question or the row of the CSV file.
custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.

Returns

self
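
The role of the encoder dictionaries can be illustrated with a small sketch: each key names an attribute and each value is a callable applied to the raw attribute value. The factory make_one_hot below is hypothetical and only mimics the idea of a one-hot encoder. kgcnn's own OneHotEncoder may differ, for example in how it treats unknown values.

```python
from typing import Callable, Dict, List


def make_one_hot(categories: List[str]) -> Callable[[str], List[int]]:
    """Hypothetical encoder factory; kgcnn's own OneHotEncoder may behave differently."""
    def encode(value: str) -> List[int]:
        # Values outside `categories` map to an all-zero vector in this sketch.
        return [1 if value == category else 0 for category in categories]
    return encode


# Keys of the encoder dictionary match the attribute names requested in `nodes`.
nodes = ["Symbol"]
encoder_nodes: Dict[str, Callable] = {"Symbol": make_one_hot(["C", "O"])}

# Stand-in for per-atom attributes of one molecule (assumption for illustration).
atoms = [{"Symbol": "C"}, {"Symbol": "O"}, {"Symbol": "N"}]
node_attributes = [
    [x for name in nodes for x in encoder_nodes[name](atom[name])] for atom in atoms
]
print(node_attributes)
```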

kgcnn.data.moleculenet.map_molecule_callbacks(mol_list: List[str], data: Union[pandas.core.series.Series, pandas.core.frame.DataFrame], callbacks: Dict[str, Callable[[kgcnn.molecule.base.MolGraphInterface, pandas.core.series.Series], None]], custom_transform: Optional[Callable[[kgcnn.molecule.base.MolGraphInterface], kgcnn.molecule.base.MolGraphInterface]] = None, add_hydrogen: bool = False, make_directed: bool = False, sanitize: bool = True, compute_partial_charges: Optional[str] = None, mol_interface_class=None, logger=None, loop_update_info: int = 5000) → dict [source]¶

This function receives the list of molecules as well as the data from a pandas data series or data frame. It then iterates over all molecules and data rows and invokes the callbacks for each.

The “callbacks” parameter is supposed to be a dictionary whose keys are string names of attributes which are supposed to be derived from the molecule / data and the values are function objects which define how to derive that data. Those callback functions get two parameters:

mg: The MolGraphInterface instance for the current molecule

ds: A pandas data series that matches the data in the CSV file for the specific molecule.

The string keys of the “callbacks” dictionary are also the names later used to assign the properties of the underlying GraphList. This means that each element of the dataset will then have a field with the same name.

Note

If a molecule cannot be properly loaded by MolGraphInterface, then for all attributes “None” is added without invoking the callback!

Example:

mol_net = MoleculeNetDataset()
mol_net.prepare_data()

mol_values = map_molecule_callbacks(
    mol_net.get_mol_blocks_from_sdf_file(),
    mol_net.read_in_table_file().data_frame,
    callbacks={
        'graph_size': lambda mg, dd: len(mg.node_number),
        'index': lambda mg, dd: dd['index']
    }
)

for key, value in mol_values.items():
    mol_net.assign_property(key, value)

mol: dict = mol_net[0]
assert 'graph_size' in mol.keys()
assert 'index' in mol.keys()

Parameters

mol_list (list) – List of mol strings.
data (pd.DataFrame) – Pandas data frame or series matching list of mol-strings.
callbacks (dict) – Dictionary of callbacks to perform on MolecularGraph object and table entries.
add_hydrogen (bool) – Whether to add hydrogen when making a MolecularGraphRDKit instance.
make_directed (bool) – Whether to have directed or undirected bonds. Default is False.
sanitize (bool) – Whether to sanitize molecule. Default is True.
custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.
compute_partial_charges (str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is None.
mol_interface_class – Interface for molecular graphs. Must be a MolGraphInterface.
logger – Logger to report error and progress.
loop_update_info (int) – Updates for processed molecules.

Returns

Values of callbacks.

Return type

dict
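
The callback mechanism can be sketched with plain Python objects. The function map_callbacks_sketch below is a hypothetical re-implementation of the idea only: strings stand in for molecule objects and dictionaries stand in for pandas rows, so it is not the library's code.

```python
from typing import Callable, Dict, List, Optional


def map_callbacks_sketch(
    mol_list: List[Optional[str]],
    rows: List[dict],
    callbacks: Dict[str, Callable[[str, dict], object]],
) -> Dict[str, list]:
    """Hypothetical re-implementation of the idea, using plain Python objects."""
    values: Dict[str, list] = {name: [] for name in callbacks}
    for mol, row in zip(mol_list, rows):
        for name, callback in callbacks.items():
            # Invalid molecules yield None without invoking the callback.
            values[name].append(callback(mol, row) if mol else None)
    return values


result = map_callbacks_sketch(
    mol_list=["CCO", "", "CC"],  # stand-ins for parsed molecules; "" marks a failure
    rows=[{"index": 1}, {"index": 2}, {"index": 3}],
    callbacks={
        "graph_size": lambda mg, ds: len(mg),
        "index": lambda mg, ds: ds["index"],
    },
)
print(result)
```

Every returned list has one entry per molecule, so it can be assigned to the dataset as a property under the callback's key.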

kgcnn.data.qm module¶

class kgcnn.data.qm.QMDataset(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, verbose: int = 10, file_directory: Optional[str] = None, file_name_xyz: Optional[str] = None, file_name_mol: Optional[str] = None)[source]¶

Bases: kgcnn.data.base.MemoryGraphDataset

This is a base class for QM (quantum mechanical) datasets.

It generates graph properties from an xyz-file, which stores atomic coordinates. Additionally, collecting multiple single xyz-files into one file is supported. The file names and labels are given by a CSV or table file. The table file must have one line of header with column names!

├── data_directory
    ├── file_directory
    │   ├── *.xyz
    │   ├── *.xyz
    │   └── ...
    ├── file_name.csv
    ├── file_name.xyz
    ├── file_name.sdf
    └── dataset_name.kgcnn.pickle

It should be possible to generate approximate chemical bonding information via OpenBabel, if this additional package is installed. The class inherits from MemoryGraphDataset. If OpenBabel is not installed, minimal loading of labels and coordinates should still be supported.

For additional attributes, set_attributes enables further features that require RDKit or OpenBabel to be installed. Note that for QMDataset the mol-information, if it is generated, is not cleaned during reading by default.

__init__(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, verbose: int = 10, file_directory: Optional[str] = None, file_name_xyz: Optional[str] = None, file_name_mol: Optional[str] = None)[source]¶

Default initialization. File information on the location of the dataset on disk should be provided here.

Parameters

data_directory (str) – Full path to directory of the dataset. Optional. Default is None.
file_name (str) – Filename for reading table ‘.csv’ file into memory. Must be given! For example as ‘.csv’ formatted file with QM labels such as energy, states, dipole etc. Moreover, the table file can contain a list of file names of individual ‘.xyz’ files to collect. Files are expected to be in file_directory. Default is None.
file_name_xyz (str) – Filename of a single ‘.xyz’ file. This file is generated when collecting single ‘.xyz’ files in file_directory. If not specified, the name is generated based on file_name.
file_name_mol (str) – Filename of a single ‘.sdf’ file. This file is generated from the single ‘.xyz’ file. SDF generation requires proper geometries. If not specified, the name is generated based on file_name.
file_directory (str) – Name or relative path from data_directory to a directory containing sorted ‘.xyz’ files. Only used if file_name is None. Default is None.
dataset_name (str) – Name of the dataset. Important for naming and saving files. Default is None.
verbose (int) – Logging level. Default is 10.

property file_path_mol¶: Try to determine a file name for the mol information to store.

property file_path_xyz¶: Try to determine a file name for the xyz information to store.

get_geom_from_xyz_file(file_path: str) → list [source]¶

Get a list of xyz items from file.

Parameters: file_path (str) – File path of XYZ file. Default None uses file_path_xyz.
Returns: List of xyz lists.
Return type: list
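
For orientation, a single XYZ file holding several structures can be read with a few lines of plain Python. The parser below is a hypothetical sketch and not the library's reader. The tuple layout (symbols, coordinates) is an assumption made for illustration.

```python
from typing import List, Tuple

XYZ_TEXT = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.471
H 0.000 -0.757 -0.471
2
hydrogen
H 0.000 0.000 0.000
H 0.000 0.000 0.740
"""


def parse_xyz(text: str) -> List[Tuple[List[str], List[List[float]]]]:
    """Hypothetical parser: split concatenated XYZ blocks into (symbols, coordinates)."""
    lines = text.splitlines()
    structures = []
    pos = 0
    while pos < len(lines) and lines[pos].strip():
        num_atoms = int(lines[pos])  # first line of a block: atom count
        atom_lines = lines[pos + 2: pos + 2 + num_atoms]  # skip the comment line
        symbols = [line.split()[0] for line in atom_lines]
        coords = [[float(v) for v in line.split()[1:4]] for line in atom_lines]
        structures.append((symbols, coords))
        pos += 2 + num_atoms
    return structures


geometries = parse_xyz(XYZ_TEXT)
print(len(geometries), geometries[0][0])
```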

get_mol_blocks_from_sdf_file(file_path: Optional[str] = None) → list [source]¶

Get a list of mol-blocks from file.

Parameters: file_path (str) – File path of SDF file. Default None uses file_path_mol.
Returns: List of mol-strings.
Return type: list

prepare_data(overwrite: bool = False, file_column_name: Optional[str] = None, make_sdf: bool = True)[source]¶

Pre-computation of molecular structure information in an sdf-file from an xyz-file or a folder of xyz-files.

If there is no single xyz-file, it will be created with the information of a csv-file with the same name.

Parameters

overwrite (bool) – Overwrite existing database SDF file. Default is False.
file_column_name (str) – Name of the column in csv file with list of xyz-files located in file_directory
make_sdf (bool) – Whether to try to make a sdf file from xyz information via RDKit and OpenBabel.

Returns

self

read_in_memory(label_column_name: Optional[Union[str, list]] = None, nodes: Optional[list] = None, edges: Optional[list] = None, graph: Optional[list] = None, encoder_nodes: Optional[dict] = None, encoder_edges: Optional[dict] = None, encoder_graph: Optional[dict] = None, add_hydrogen: bool = True, sanitize: bool = False, make_directed: bool = False, compute_partial_charges: bool = False, additional_callbacks: Optional[Dict[str, Callable[[kgcnn.molecule.base.MolGraphInterface, dict], None]]] = None, custom_transform: Optional[Callable[[kgcnn.molecule.base.MolGraphInterface], kgcnn.molecule.base.MolGraphInterface]] = None)[source]¶

Read geometric information into memory.

Graph labels require a column specified by label_column_name.

Parameters

label_column_name (str, list) – Name of labels for columns in CSV file.
nodes (list) – A list of node attributes as string. In place of names also functions can be added.
edges (list) – A list of edge attributes as string. In place of names also functions can be added.
graph (list) – A list of graph attributes as string. In place of names also functions can be added.
encoder_nodes (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_edges (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_graph (dict) – A dictionary of callable encoder where the key matches the attribute.
add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is True.
make_directed (bool) – Whether to have directed or undirected bonds. Default is False.
compute_partial_charges (str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is None.
sanitize (bool) – Whether to sanitize molecule. Default is False.
additional_callbacks (dict) – A dictionary whose keys are string attribute names which the elements of the dataset are supposed to have and the elements are callback function objects which implement how those attributes are derived from the MolecularGraphRDKit of the molecule in question or the row of the CSV file.
custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.

Returns

self

read_in_memory_xyz(file_path: Optional[str] = None, atomic_coordinates: Optional[str] = 'node_coordinates', atomic_symbol: Optional[str] = 'node_symbol', atomic_number: Optional[str] = 'node_number')[source]¶

Read XYZ-file with geometric information into memory.

Parameters

file_path (str) – Filepath to xyz file.
atomic_coordinates (str) – Name of graph property of atomic coordinates. Default is “node_coordinates”.
atomic_symbol (str) – Name of graph property of atomic symbol. Default is “node_symbol”.
atomic_number (str) – Name of graph property of atomic number. Default is “node_number”.

Returns

self

set_attributes(label_column_name: Optional[Union[str, list]] = None, nodes: Optional[list] = None, edges: Optional[list] = None, graph: Optional[list] = None, encoder_nodes: Optional[dict] = None, encoder_edges: Optional[dict] = None, encoder_graph: Optional[dict] = None, add_hydrogen: bool = True, make_directed: bool = False, sanitize: bool = False, compute_partial_charges: Optional[str] = None, additional_callbacks: Optional[Dict[str, Callable[[kgcnn.molecule.base.MolGraphInterface, dict], None]]] = None, custom_transform: Optional[Callable[[kgcnn.molecule.base.MolGraphInterface], kgcnn.molecule.base.MolGraphInterface]] = None)[source]¶

Read SDF-file with chemical structure information into memory.

Parameters

label_column_name (str, list) – Name of labels for columns in CSV file.
nodes (list) – A list of node attributes as string. In place of names also functions can be added.
edges (list) – A list of edge attributes as string. In place of names also functions can be added.
graph (list) – A list of graph attributes as string. In place of names also functions can be added.
encoder_nodes (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_edges (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_graph (dict) – A dictionary of callable encoder where the key matches the attribute.
add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is True.
make_directed (bool) – Whether to have directed or undirected bonds. Default is False.
sanitize (bool) – Whether to sanitize molecule. Default is False.
compute_partial_charges (str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is None.
additional_callbacks (dict) – A dictionary whose keys are string attribute names which the elements of the dataset are supposed to have and the elements are callback function objects which implement how those attributes are derived from the MolecularGraphRDKit of the molecule in question or the row of the CSV file.
custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.

Returns

self

kgcnn.data.serial module¶

kgcnn.data.serial.deserialize(dataset: Union[str, dict])[source]¶

Deserialize a dataset class from dictionary including “class_name” and “config” keys.

Furthermore, the methods prepare_data, read_in_memory and map_list can be called during deserialization if they are set manually as a list in the ‘methods’ key. The function also tries to resolve datasets without a module_name key. Otherwise, you can use the general kgcnn.utils.serial.

Parameters: dataset (str, dict) – Dictionary of the dataset serialization.
Returns: Deserialized dataset.
Return type: MemoryGraphDataset
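
The general pattern of building an object from “class_name” and “config” keys and then running listed methods can be sketched as follows. ToyDataset, REGISTRY and deserialize_sketch are hypothetical names, and the layout of the ‘methods’ entries (one method name mapped to its keyword arguments) is an assumption that may not match the library exactly.

```python
class ToyDataset:
    """Stand-in for a dataset class (hypothetical, not part of kgcnn)."""

    def __init__(self, dataset_name=None):
        self.dataset_name = dataset_name
        self.calls = []

    def prepare_data(self, overwrite=False):
        self.calls.append(("prepare_data", overwrite))
        return self

    def read_in_memory(self):
        self.calls.append(("read_in_memory",))
        return self


REGISTRY = {"ToyDataset": ToyDataset}


def deserialize_sketch(config: dict):
    """Build an object from "class_name"/"config" and run optional methods."""
    dataset = REGISTRY[config["class_name"]](**config.get("config", {}))
    # Assumed layout: each entry maps one method name to its keyword arguments.
    for entry in config.get("methods", []):
        for method_name, kwargs in entry.items():
            getattr(dataset, method_name)(**kwargs)
    return dataset


ds = deserialize_sketch({
    "class_name": "ToyDataset",
    "config": {"dataset_name": "Example"},
    "methods": [{"prepare_data": {"overwrite": True}}, {"read_in_memory": {}}],
})
print(ds.dataset_name, ds.calls)
```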

kgcnn.data.tudataset module¶

class kgcnn.data.tudataset.GraphTUDataset(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, file_directory: Optional[str] = None, verbose: int = 10)[source]¶

Bases: kgcnn.data.base.MemoryGraphDataset

Base class for loading graph datasets published by TU Dortmund University.

Datasets contain non-isomorphic graphs for many graph classification or regression tasks. This general base class has functionality to load TUDatasets in a generic way. The datasets are already in a graph-like format and do not need further processing via e.g. prepare_data.

Note

Note that subclasses of GraphTUDataset2020 in kgcnn.data.datasets download the datasets. There are also further TU-datasets in kgcnn.data.datasets for cases where further processing is used in the literature. Not all datasets provide all types of graph properties, such as edge_attributes.

The file structure of GraphTUDataset for a given dataset ‘DS’ (replace DS with dataset name).

├── data_directory
    ├── DS_graph_indicator.txt
    ├── DS_A.txt
    ├── DS_node_labels.txt
    ├── DS_node_attributes.txt
    ├── DS_edge_labels.txt
    ├── DS_edge_attributes.txt
    ├── DS_graph_labels.txt
    ├── DS_graph_attributes.txt
    ├──  ...
    └── dataset_name.kgcnn.pickle

A single file can additionally be set up with the save method of the base class.

__init__(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, file_directory: Optional[str] = None, verbose: int = 10)[source]¶

Initialize a GraphTUDataset instance from file.

Parameters

data_directory (str) – Full path to directory of the dataset. Default is None.
file_name (str) – Filename for reading into memory. Not used for general TUDataset, since there are multiple files with a prefix and pre-defined suffix. Default is None.
file_directory (str) – Name or relative path from data_directory to a directory containing sorted files. Default is None.
dataset_name (str) – Name of the dataset. Important for base-name for naming of files. Default is None.
verbose (int) – Logging level. Default is 10.

static read_csv_simple(filepath: str, delimiter: str = ', ', dtype: Callable = <class 'float'>)[source]¶

Very simple python-only function to read in a csv-file from file.

Parameters

filepath (str) – Full filepath of csv-file to read in.
delimiter (str) – Delimiter character for separation. Default is “,”.
dtype – Callable type conversion from string. Default is float.

Returns

Python list of values. Length of the list equals the number of lines.

Return type

list
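
A reader of this kind needs only the standard library. The function read_csv_sketch below is a hypothetical sketch of the idea, not the library's implementation. It assumes one list of converted values per non-empty line.

```python
import os
import tempfile
from typing import Callable, List


def read_csv_sketch(filepath: str, delimiter: str = ",", dtype: Callable = float) -> List[list]:
    """Hypothetical pure-Python reader: one list of converted values per line."""
    rows = []
    with open(filepath, "r") as file:
        for line in file:
            if line.strip():
                rows.append([dtype(x.strip()) for x in line.strip().split(delimiter)])
    return rows


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "DS_A.txt")
    with open(path, "w") as file:
        file.write("1, 2\n2, 1\n3, 4\n")
    edges = read_csv_sketch(path, dtype=int)
print(edges)
```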

read_in_memory()[source]¶

Read the GraphTUDataset into memory.

The TUDataset is stored in disjoint representations. The data is cast to a list of separate graph properties for MemoryGraphDataset.

Returns: self
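
The conversion from the disjoint representation to a list of separate graphs can be sketched in plain Python. It relies on the TUDataset convention that DS_graph_indicator.txt holds a 1-based graph id per node and DS_A.txt holds 1-based node-id pairs. The function split_disjoint is hypothetical and not the library's code.

```python
from typing import Dict, List, Tuple


def split_disjoint(graph_indicator: List[int], edges: List[Tuple[int, int]]) -> List[List[List[int]]]:
    """Hypothetical sketch: per-graph edge indices with node ids local to each graph."""
    num_graphs = max(graph_indicator)
    local_id: Dict[int, int] = {}  # global node id (1-based) -> index within its graph
    counts = [0] * num_graphs
    for node, graph in enumerate(graph_indicator, start=1):
        local_id[node] = counts[graph - 1]
        counts[graph - 1] += 1
    per_graph: List[List[List[int]]] = [[] for _ in range(num_graphs)]
    for i, j in edges:
        graph = graph_indicator[i - 1]  # edges never cross graph boundaries
        per_graph[graph - 1].append([local_id[i], local_id[j]])
    return per_graph


# Two graphs: nodes 1-3 belong to graph 1, nodes 4-5 to graph 2.
indicator = [1, 1, 1, 2, 2]
adjacency = [(1, 2), (2, 1), (2, 3), (3, 2), (4, 5), (5, 4)]
edge_indices = split_disjoint(indicator, adjacency)
print(edge_indices)
```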

kgcnn.data.utils module¶

kgcnn.data.utils.load_hyper_file(file_name: str, **kwargs) → dict [source]¶

Load hyperparameters from file. File type can be ‘.yaml’, ‘.json’, ‘.pickle’ or ‘.py’.

Parameters: file_name (str) – Path or name of the file containing the hyperparameters.
Returns: Dictionary of hyperparameters.
Return type: hyper (dict)
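
Choosing a loader by file extension can be sketched with the standard library. The dispatcher load_by_extension is hypothetical and covers only ‘.json’ and ‘.pickle’. The keys of the example dictionary are also placeholders chosen for illustration.

```python
import json
import os
import pickle
import tempfile


def load_by_extension(file_name: str):
    """Hypothetical dispatcher covering only '.json' and '.pickle' files."""
    extension = os.path.splitext(file_name)[1]
    if extension == ".json":
        with open(file_name, "r") as file:
            return json.load(file)
    if extension == ".pickle":
        with open(file_name, "rb") as file:
            return pickle.load(file)
    raise ValueError(f"Unsupported file type: {extension!r}")


# Placeholder hyperparameters (hypothetical keys).
hyper = {"model": {"config": {"depth": 3}}, "training": {"epochs": 10}}
with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "hyper.json")
    with open(json_path, "w") as file:
        json.dump(hyper, file)
    loaded = load_by_extension(json_path)
print(loaded == hyper)
```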

kgcnn.data.utils.load_json_file(file_path: str, **kwargs)[source]¶

Load json file.

Parameters: file_path (str) – File path or name to load.
Returns: Python-object of file.
Return type: obj

kgcnn.data.utils.load_pickle_file(file_path: str, **kwargs)[source]¶

Load pickle file.

Parameters: file_path (str) – File path or name to load.
Returns: Python-object of file.
Return type: obj

kgcnn.data.utils.load_yaml_file(file_path: str)[source]¶

Load yaml file.

Parameters: file_path (str) – File path or name to load.
Returns: Python-object of file.
Return type: obj

kgcnn.data.utils.pad_np_array_list_batch_dim(values: list, dtype: Optional[str] = None)[source]¶

Pad a list of numpy arrays along first dimension.

Parameters

values (list) – List of np.ndarray .
dtype (str) – Data type of values tensor. Defaults to None.

Returns

Padded np.ndarray of values and the corresponding mask np.ndarray.

Return type

tuple
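
Padding along the first dimension with a matching mask can be sketched in NumPy as follows. The function pad_batch_sketch is hypothetical. Zero padding, the boolean mask and its shape (batch, max_length) are assumptions that may differ from what the library returns.

```python
import numpy as np


def pad_batch_sketch(values: list, dtype=None):
    """Hypothetical sketch: zero-pad along the first axis and return a boolean mask."""
    max_len = max(len(x) for x in values)
    trailing = values[0].shape[1:]  # all arrays must agree on these dimensions
    padded = np.zeros((len(values), max_len) + trailing, dtype=dtype or values[0].dtype)
    mask = np.zeros((len(values), max_len), dtype=bool)
    for i, x in enumerate(values):
        padded[i, :len(x)] = x
        mask[i, :len(x)] = True  # True marks real entries, False marks padding
    return padded, mask


padded, mask = pad_batch_sketch([np.array([[0.0]]), np.array([[1.0], [2.0], [3.0]])])
print(padded.shape, mask.shape)
```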

kgcnn.data.utils.ragged_tensor_from_nested_numpy(numpy_list: list, dtype: Optional[str] = None, row_splits_dtype: str = 'int64')[source]¶

Make ragged tensor from a list of numpy arrays. Each array can have a different length but must match in shape except for the first dimension. This results in a ragged tensor with a ragged dimension only at the first axis (ragged_rank=1), i.e. of shape (batch, None, …). This way a tensor can be generated faster than with tf.ragged.constant().

Warning

The data will be copied for this operation.

import tensorflow as tf
import numpy as np
from kgcnn.data.utils import ragged_tensor_from_nested_numpy
ragged_tensor = ragged_tensor_from_nested_numpy([np.array([[0]]), np.array([[1], [2], [3]])])
print(ragged_tensor)
# <tf.RaggedTensor [[[0]], [[1], [2], [3]]]>
print(ragged_tensor.shape)
# (2, None, 1)

Parameters

numpy_list (list) – List of numpy arrays of different length but else identical shape.
dtype (str) – Data type of values tensor. Defaults to None.
row_splits_dtype (str) – Data type of partition tensor. Default is “int64”.

Returns

Ragged tensor of former nested list of numpy arrays.

Return type

tf.RaggedTensor
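
The two ingredients of a ragged tensor with ragged_rank=1, flat values and row splits, can be computed in NumPy alone. The sketch below only illustrates that decomposition. The results could be passed to tf.RaggedTensor.from_row_splits, but whether the library builds the tensor exactly this way is an assumption.

```python
import numpy as np

arrays = [np.array([[0]]), np.array([[1], [2], [3]])]

# Flat values: all rows stacked along the first axis, shape (4, 1).
values = np.concatenate(arrays, axis=0)
# Row splits: start offset of each array plus the final end offset.
row_lengths = np.array([len(x) for x in arrays], dtype="int64")
row_splits = np.concatenate([np.zeros(1, dtype="int64"), np.cumsum(row_lengths)])

print(values.shape, row_splits)
```

Concatenating once is the data copy that the warning above refers to.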

kgcnn.data.utils.save_json_file(obj, file_path: str, **kwargs)[source]¶

Save json file.

Parameters

obj – Python-object to dump.
file_path (str) – File path or name to save ‘obj’ to.

Returns

None.

kgcnn.data.utils.save_pickle_file(obj, file_path: str, **kwargs)[source]¶

Save pickle file.

Parameters

obj – Python-object to dump.
file_path (str) – File path or name to save ‘obj’ to.

Returns

None.

kgcnn.data.utils.save_yaml_file(obj, file_path: str, default_flow_style: bool = False, **kwargs)[source]¶

Save yaml file.

Parameters

obj – Python-object to dump.
file_path (str) – File path or name to save ‘obj’ to.
default_flow_style (bool) – Flag for flow style. Default to False.

Returns

None.
