kgcnn.data package¶
Subpackages¶
- kgcnn.data.datasets package
- Submodules
- kgcnn.data.datasets.ClinToxDataset module
- kgcnn.data.datasets.CoraDataset module
- kgcnn.data.datasets.CoraLuDataset module
- kgcnn.data.datasets.ESOLDataset module
- kgcnn.data.datasets.FreeSolvDataset module
- kgcnn.data.datasets.GraphTUDataset2020 module
- kgcnn.data.datasets.ISO17Dataset module
- kgcnn.data.datasets.LipopDataset module
- kgcnn.data.datasets.MD17Dataset module
- kgcnn.data.datasets.MD17RevisedDataset module
- kgcnn.data.datasets.MUTAGDataset module
- kgcnn.data.datasets.MatBenchDataset2020 module
- kgcnn.data.datasets.MatProjectDielectricDataset module
- kgcnn.data.datasets.MatProjectEFormDataset module
- kgcnn.data.datasets.MatProjectGapDataset module
- kgcnn.data.datasets.MatProjectIsMetalDataset module
- kgcnn.data.datasets.MatProjectJdft2dDataset module
- kgcnn.data.datasets.MatProjectLogGVRHDataset module
- kgcnn.data.datasets.MatProjectLogKVRHDataset module
- kgcnn.data.datasets.MatProjectPerovskitesDataset module
- kgcnn.data.datasets.MatProjectPhononsDataset module
- kgcnn.data.datasets.MoleculeNetDataset2018 module
- kgcnn.data.datasets.MutagenicityDataset module
- kgcnn.data.datasets.PROTEINSDataset module
- kgcnn.data.datasets.QM7Dataset module
- kgcnn.data.datasets.QM7bDataset module
- kgcnn.data.datasets.QM8Dataset module
- kgcnn.data.datasets.QM9Dataset module
- kgcnn.data.datasets.QM9MolNetDataset module
- kgcnn.data.datasets.SIDERDataset module
- kgcnn.data.datasets.Tox21MolNetDataset module
- Module contents
- kgcnn.data.transform package
Submodules¶
kgcnn.data.base module¶
-
kgcnn.data.base.
MemoryGeometricGraphDataset
¶ alias of
kgcnn.data.base.MemoryGraphDataset
-
class
kgcnn.data.base.
MemoryGraphDataset
(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, file_directory: Optional[str] = None, verbose: int = 10)[source]¶ Bases:
kgcnn.data.base.MemoryGraphList
Dataset class for lists of graph tensor dictionaries stored on file and fit into memory.
This class inherits from MemoryGraphList and can be used (after loading and setup) as such. It has further information about a location on disk, i.e. a file directory and a file name as well as a name of the dataset.
    import numpy as np
    from kgcnn.data.base import MemoryGraphDataset
    dataset = MemoryGraphDataset(data_directory="", dataset_name="Example")
    # Methods of MemoryGraphList
    dataset.set("edge_indices", [np.array([[1, 0], [0, 1]])])
    dataset.set("edge_labels", [np.array([[0], [1]])])
    dataset.save()
    dataset.load()
The file directory and file name are used in child classes and in save and load.
-
__init__
(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, file_directory: Optional[str] = None, verbose: int = 10)[source]¶ Initialize a base class of
MemoryGraphDataset.
- Parameters
data_directory (str) – Full path to directory of the dataset. Default is None.
file_name (str) – Generic filename for dataset to read into memory like a ‘csv’ file. Default is None.
file_directory (str) – Name or relative path from data_directory to a directory containing sorted files. Default is None.
dataset_name (str) – Name of the dataset. Important for naming and saving files. Default is None.
verbose (int) – Logging level. Default is 10.
-
_verify_data_directory
() → Optional[str][source]¶ Utility function that checks if data_directory is set correctly.
-
assert_valid_model_input
(hyper_input: Union[list, dict], raise_error_on_fail: bool = True)[source]¶ Check whether dataset has graph properties (in tensor format) requested by model input.
The list hyper_input defines the model input and must match the hyperparameter interface. The model input is set up by a list of layer configs for the keras Input layer. The list must contain dictionaries for each model input with “name” and “shape” keys.
    hyper_input = [
        {"shape": [None, 8710], "name": "node_attributes", "dtype": "float32", "ragged": True},
        {"shape": [None, 1], "name": "edge_weights", "dtype": "float32", "ragged": True},
        {"shape": [None, 2], "name": "edge_indices", "dtype": "int64", "ragged": True}
    ]
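The kind of check described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `missing_model_inputs` is a hypothetical helper, plain dictionaries stand in for `GraphDict` items, and only the “name” key is compared (the real method may also inspect shapes or dtypes).

```python
from typing import Dict, List


def missing_model_inputs(hyper_input: List[dict], graphs: List[Dict[str, object]]) -> List[str]:
    """Return the requested input names that at least one graph does not define.

    Hypothetical helper for illustration; not part of kgcnn.
    """
    names = [config["name"] for config in hyper_input]
    return [name for name in names if any(name not in graph for graph in graphs)]


hyper_input = [
    {"shape": [None, 8710], "name": "node_attributes", "dtype": "float32", "ragged": True},
    {"shape": [None, 1], "name": "edge_weights", "dtype": "float32", "ragged": True},
    {"shape": [None, 2], "name": "edge_indices", "dtype": "int64", "ragged": True},
]

# Plain dictionaries stand in for GraphDict items here.
graphs = [
    {"node_attributes": [[0.0]], "edge_indices": [[0, 1]]},
    {"node_attributes": [[1.0]], "edge_indices": [[1, 0]], "edge_weights": [[1.0]]},
]

# The first graph has no "edge_weights", so that name is reported.
missing = missing_model_inputs(hyper_input, graphs)
print(missing)
```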
-
collect_files_in_file_directory
(file_column_name: Optional[str] = None, table_file_path: Optional[str] = None, read_method_file: Optional[Callable] = None, update_counter: int = 1000, append_file_content: bool = True, read_method_return_list: bool = False) → list[source]¶ Utility function to collect single files in
file_directory by names in CSV table file.
- Parameters
file_column_name (str) – Name of the column in Table file that holds list of file names.
table_file_path (str) – Path to table file. Can be None. Default is None.
read_method_file (Callable) – Callable read-file method to return (processed) file content.
update_counter (int) – Loop counter to show progress. Default is 1000.
append_file_content (bool) – Whether to append or add return of read_method_file.
read_method_return_list (bool) – Whether read_method_file returns list of items or the item itself.
- Returns
File content loaded from single files.
- Return type
list
-
property
file_directory_path
¶ Construct file-directory path from ‘data_directory’ and ‘file_directory’ given in init.
-
property
file_path
¶ Construct filepath from ‘file_name’ given in init.
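As a rough illustration of the two path properties, the composition can be sketched with `os.path.join`. That both properties are joined onto `data_directory` is an assumption based on the descriptions above, and the directory and file names are made up for the example.

```python
import os

# Made-up example values for the constructor arguments.
data_directory = os.path.join("datasets", "example")
file_directory = "structures"
file_name = "example.csv"

# Assumed composition, mirroring the property descriptions.
file_directory_path = os.path.join(data_directory, file_directory)
file_path = os.path.join(data_directory, file_name)

print(file_directory_path)
print(file_path)
```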
-
fits_in_memory
= True¶
-
get_train_test_indices
(train: str = 'train', test: str = 'test', valid: Optional[str] = None, split_index: Optional[Union[int, list]] = None, shuffle: bool = False, seed: Optional[int] = None) → List[List[numpy.ndarray]][source]¶ Get train and test indices from graph list. The ‘train’ and ‘test’ properties must be set on the graph, and optionally an additional property for the validation split may be present. All of these properties may have either of the following values:
The property is a boolean integer value indicating whether the corresponding element of the dataset belongs to that part of the split (train / test)
The property is a list containing integer split indices, where each split index present within that list implies that the corresponding dataset element is part of that particular split. In this case the split_index parameter may also be a list of split indices that specify for which of these split indices the train test index split is to be returned by this method.
The return value of this method is a list with the same length as the split_index parameter, which by default will be None.
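The two accepted forms of the split properties can be illustrated with a small stand-alone sketch. It does not call kgcnn: `in_split` and `split_indices` are hypothetical helpers, plain dictionaries replace the graphs, plain lists replace numpy arrays, and treating a flag as valid for every split index is an assumption.

```python
from typing import List, Tuple


def in_split(value, split_index: int) -> bool:
    """Interpret a split property that is either a flag or a list of split indices."""
    if isinstance(value, (list, tuple)):
        return split_index in value
    # Flag form (also covers a missing property, which arrives as None).
    return bool(value)


def split_indices(graphs: List[dict], split_index: int,
                  train: str = "train", test: str = "test") -> Tuple[List[int], List[int]]:
    """Collect the positions of graphs that belong to the train and test part."""
    train_idx = [i for i, g in enumerate(graphs) if in_split(g.get(train), split_index)]
    test_idx = [i for i, g in enumerate(graphs) if in_split(g.get(test), split_index)]
    return train_idx, test_idx


graphs = [
    {"train": [0, 1]},
    {"train": [0], "test": [1]},
    {"test": [0]},
    {"train": [1], "test": [0]},
]

# One (train, test) pair per requested split index.
splits = [split_indices(graphs, k) for k in (0, 1)]
print(splits)
```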
- Parameters
train (str) – Name of graph property that has train split assignment. Defaults to ‘train’.
test (str) – Name of graph property that has test split assignment. Defaults to ‘test’.
valid (str) – Name of graph property that has validation assignment. Defaults to None.
split_index (int, list) – Split index to get indices for. Can also be list.
shuffle (bool) – Whether to shuffle splits. Default is False.
seed (int) – Random seed for shuffle. Default is None.
- Returns
List of tuples (or triples) of train, test, (validation) split indices.
- Return type
List[List[numpy.ndarray]]
-
load
(filepath: Optional[str] = None)[source]¶ Load graph properties from a pickled file. By default, loads a file named
dataset_name.kgcnn.pickle in data_directory.
- Parameters
filepath (str) – Full path of input file.
-
read_in_table_file
(file_path: Optional[str] = None, **kwargs)[source]¶ Read a data frame in
data_frame from file path. By default, uses file_name and pandas. Checks for a ‘.csv’ file and then for Excel file endings. This means the file extension given in file_path is ignored, but a file with one of the following extensions must exist: ‘.csv’, ‘.xls’, ‘.xlsx’, ‘.odt’.
- Parameters
file_path (str) – File path to table file. Default is None.
kwargs – Kwargs for pandas read_csv function.
- Returns
self
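The extension handling described for this method can be sketched as a lookup over candidate endings. This is only an illustration of the described behaviour using the standard library: `find_table_file` is a hypothetical helper, and the actual reading with pandas is left out.

```python
import os
import tempfile
from typing import Optional

# Order taken from the description: '.csv' first, then the Excel-style endings.
TABLE_EXTENSIONS = (".csv", ".xls", ".xlsx", ".odt")


def find_table_file(file_path: str) -> Optional[str]:
    """Return the first existing table file that shares the base name of file_path."""
    base, _ = os.path.splitext(file_path)
    for extension in TABLE_EXTENSIONS:
        candidate = base + extension
        if os.path.exists(candidate):
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "example.csv"), "w") as handle:
        handle.write("smiles,label\nCCO,1\n")
    # The extension passed in is ignored; the '.csv' file is found first.
    found = find_table_file(os.path.join(tmp, "example.xlsx"))
    found_name = os.path.basename(found)
    not_found = find_table_file(os.path.join(tmp, "other.csv"))

print(found_name)
```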
-
relocate
(data_directory: Optional[str] = None, file_name: Optional[str] = None, file_directory: Optional[str] = None)[source]¶ Change file information. Does not copy files on disk!
- Parameters
data_directory (str) – Full path to directory of the dataset. Default is None.
file_name (str) – Generic filename for dataset to read into memory like a ‘csv’ file. Default is None.
file_directory (str) – Name or relative path from
data_directory
to a directory containing sorted files. Default is None.
- Returns
self
-
save
(filepath: Optional[str] = None)[source]¶ Save all graph properties to python dictionary as pickled file. By default, saves a file named
dataset_name.kgcnn.pickle in data_directory.
- Parameters
filepath (str) – Full path of output file. Default is None.
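The save and load round trip can be pictured with the standard `pickle` module. This sketch only mirrors the default file naming described above; that the pickled object is a plain list of dictionaries is an assumption, and the real methods may store additional information.

```python
import os
import pickle
import tempfile

dataset_name = "Example"
# Plain dictionaries stand in for the graph properties of each graph.
graphs = [{"edge_indices": [[0, 1], [1, 0]], "graph_labels": [0.5]}]

with tempfile.TemporaryDirectory() as data_directory:
    # Default naming scheme from the description: <dataset_name>.kgcnn.pickle
    filepath = os.path.join(data_directory, dataset_name + ".kgcnn.pickle")
    with open(filepath, "wb") as handle:
        pickle.dump(graphs, handle)
    with open(filepath, "rb") as handle:
        restored = pickle.load(handle)

print(restored == graphs)
```

Since unpickling can execute arbitrary code, pickled dataset files should only be loaded from trusted sources.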
-
set_methods
(method_list: List[dict]) → None[source]¶ Apply a list of serialized class-methods on the dataset.
This can extend the config-serialization scheme in kgcnn.utils.serial.
    for method_item in method_list:
        for method, kwargs in method_item.items():
            if hasattr(self, method):
                getattr(self, method)(**kwargs)
- Parameters
method_list (list) – A list of dictionaries that specify class methods. The dict key denotes the method and the value must contain kwargs for the method
- Returns
None.
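A small stand-alone example may help to see what a method_list looks like and how the loop above dispatches it. `ToyDataset` is a hypothetical stand-in class that only records calls; the method names and kwargs are chosen to resemble those documented on this page.

```python
from typing import Dict, List, Optional


class ToyDataset:
    """Hypothetical stand-in that records which methods were called."""

    def __init__(self):
        self.calls = []

    def prepare_data(self, overwrite: bool = False):
        self.calls.append(("prepare_data", overwrite))

    def read_in_memory(self, label_column_name: Optional[str] = None):
        self.calls.append(("read_in_memory", label_column_name))

    def set_methods(self, method_list: List[Dict[str, dict]]) -> None:
        # Same loop as shown in the description above.
        for method_item in method_list:
            for method, kwargs in method_item.items():
                if hasattr(self, method):
                    getattr(self, method)(**kwargs)


toy = ToyDataset()
toy.set_methods([
    {"prepare_data": {"overwrite": True}},
    {"read_in_memory": {"label_column_name": "label"}},
    {"not_a_method": {}},  # skipped silently because hasattr() is False
])
print(toy.calls)
```

Because unknown method names are skipped without an error, a typo in a serialized config goes unnoticed unless the caller checks the result.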
-
set_multi_target_labels
(graph_labels: str = 'graph_labels', multi_target_indices: Optional[list] = None, data_unit: Optional[Union[str, list]] = None)[source]¶ Select multiple targets in labels.
- Parameters
- Returns
List of label names and label units for each target.
- Return type
-
-
class
kgcnn.data.base.
MemoryGraphList
(iterable: Optional[list] = None)[source]¶ Bases:
list
Class to store a list of graph dictionaries in memory.
Inherits from a python list. The graph properties are defined by tensor-like (numpy) arrays for indices, attributes, labels, symbol etc. in GraphDict, which are the items of the list. Access to items via [] indexing operator.
A python list of a single named property can be obtained from each GraphDict in MemoryGraphList via get or assigned from a python list via set methods.
The MemoryGraphList further provides simple map-functionality map_list to apply methods for each GraphDict, and to cast properties to tensor with tensor.
Cleaning the list for missing properties or empty graphs is done with clean.
    import numpy as np
    from kgcnn.data.base import MemoryGraphList
    data = MemoryGraphList()
    data.empty(1)
    data.set("edge_indices", [np.array([[0, 1], [1, 0]])])
    data.set("node_labels", [np.array([[0], [1]])])
    print(data.get("edge_indices"))
    data.set("node_coordinates", [np.array([[1, 0, 0], [0, 1, 0], [0, 2, 0], [0, 3, 0]])])
    data.map_list("set_range", max_distance=1.5, max_neighbours=10, self_loops=False)
    data.clean("range_indices")  # Returns cleaned graph indices
    print(len(data))
    print(data[0])
-
__init__
(iterable: Optional[list] = None)[source]¶ Initialize an empty
MemoryGraphList instance.
- Parameters
iterable (list, MemoryGraphList) – A list or MemoryGraphList of GraphDict items.
-
assign_property
(key: str, value: list)[source]¶ Assign a list of numpy arrays of a property to the respective
GraphDict items in this list.
-
clean
(inputs: Union[list, str])[source]¶ Given a list of property names, this method removes all elements from the internal list of GraphDict items that lack at least one of those properties. That is, only graphs that define all properties specified by inputs remain in the list.
- Parameters
inputs (list) – A list of strings, where each string is a property name that the graphs in this list may possess. Within kgcnn, this can simply be the ‘input’ category of a model configuration, in which case it is a list of dicts that specify the property name under the ‘name’ key.
- Returns
A list of graph indices that do not have the required properties and which have been removed.
- Return type
invalid_graphs (np.ndarray)
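The filtering rule can be sketched on a plain list of dictionaries. `clean_sketch` is a hypothetical helper that only covers missing properties (the handling of empty graphs mentioned for this class is not reproduced) and returns a plain list instead of a numpy array.

```python
from typing import List, Union


def clean_sketch(graphs: List[dict], inputs: Union[list, str]) -> List[int]:
    """Remove graphs that lack any requested property and return their former indices."""
    if isinstance(inputs, str):
        inputs = [inputs]
    # Accept both plain names and model-input dicts with a 'name' key.
    names = [item["name"] if isinstance(item, dict) else item for item in inputs]
    invalid = [i for i, graph in enumerate(graphs)
               if any(graph.get(name) is None for name in names)]
    # Modify the list in place so that callers holding a reference see the result.
    graphs[:] = [graph for i, graph in enumerate(graphs) if i not in invalid]
    return invalid


graphs = [
    {"edge_indices": [[0, 1]], "node_labels": [[0], [1]]},
    {"edge_indices": [[0, 1]]},
    {"node_labels": [[0]]},
]
removed = clean_sketch(graphs, [{"name": "edge_indices"}, "node_labels"])
print(removed, len(graphs))
```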
-
empty
(length: int)[source]¶ Create an empty list in place. Overwrites existing list.
- Parameters
length (int) – Length of the empty list.
- Returns
self
-
get
(key: str) → Optional[List]¶ Returns a list with the values of all the graphs defined for the string property name key. If none of the graphs in the list have this property, returns None.
- Parameters
key (str) – The string name of the property to be retrieved for all the graphs contained in this list
-
property
length
¶ Length of list.
-
map_list
(method: Union[str, Callable], **kwargs)[source]¶ Map a method over this list and apply on each
GraphDict. For method being string, either a class-method or a preprocessor is chosen for backward compatibility.
    for i, x in enumerate(self):
        method(x, **kwargs)
- Parameters
method (str) – Name of the GraphDict method.
kwargs – Kwargs for method.
- Returns
self
-
obtain_property
(key: str) → Optional[List][source]¶ Returns a list with the values of all the graphs defined for the string property name key. If none of the graphs in the list have this property, returns None.
- Parameters
key (str) – The string name of the property to be retrieved for all the graphs contained in this list
-
rename_property_on_graphs
(old_property_name: str, new_property_name: str) → list[source]¶ Change the name of a graph property on all graphs in the list.
-
set
(key: str, value: list)¶ Assign a list of numpy arrays of a property to the respective
GraphDict items in this list.
-
tensor
(items: Union[list, Dict], make_copy=True)[source]¶ Make tensor objects from multiple graph properties in list.
It is recommended to run
clean beforehand.
- Parameters
items (list) – List of dictionaries that specify graph properties in list via ‘name’ key. The dict-items match the tensor input for tf.keras.layers.Input layers. Required dict-keys should be ‘name’ and ‘ragged’. Optionally shape information can be included via ‘shape’ and ‘dtype’. E.g.: [{‘name’: ‘edge_indices’, ‘ragged’: True}, {…}, …].
make_copy (bool) – Whether to copy the data. Default is True.
- Returns
List of Tensors.
- Return type
-
tf_dataset_disjoint
(inputs, **kwargs)[source]¶ Return generator via
tf.data.Dataset from this list. Uses kgcnn.io.loader.tf_dataset_disjoint_generator.
- Parameters
inputs –
kwargs – Kwargs for
tf_dataset_disjoint_generator
- Returns
Dataset from generator.
- Return type
tf.data.Dataset
-
kgcnn.data.crystal module¶
-
class
kgcnn.data.crystal.
CrystalDataset
(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, file_directory: Optional[str] = None, file_name_pymatgen_json: Optional[str] = None, verbose: int = 10)[source]¶ Bases:
kgcnn.data.base.MemoryGraphDataset
Class for making graph dataset from periodic structures such as crystals.
The dataset class requires a
data_directory to store a table ‘.csv’ file containing labels and information of the structures stored in multiple (CIF, POSCAR, …) files in file_directory. The file names must be included in the ‘.csv’ table. The table file must have one line of header with column names!
    ├── data_directory
        ├── file_directory
        │   ├── *.cif
        │   ├── *.cif
        │   └── ...
        ├── file_name.csv
        ├── file_name.pymatgen.json
        └── dataset_name.kgcnn.pickle
This class uses pymatgen.core.structure.Structure and therefore requires pymatgen to be installed. A ‘.pymatgen.json’ serialized file is generated to store a list of structures from single ‘.cif’ files via prepare_data(). Consequently, a ‘file_name.pymatgen.json’ can be directly stored in data_directory. In this case, prepare_data() does not have to be used. Additionally, a table file ‘file_name.csv’ that lists the single file names and possible labels or classification targets is required.
    from kgcnn.data.crystal import CrystalDataset
    dataset = CrystalDataset(
        data_directory="data_directory/",
        dataset_name="ExampleCrystal",
        file_name="file_name.csv",
        file_directory="file_directory")
    dataset.prepare_data(file_column_name="file_name", overwrite=True)
    dataset.read_in_memory(label_column_name="label")
-
__init__
(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, file_directory: Optional[str] = None, file_name_pymatgen_json: Optional[str] = None, verbose: int = 10)[source]¶ Initialize a base class of
CrystalDataset.
- Parameters
data_directory (str) – Full path to directory of the dataset. Default is None.
file_name (str) – Filename for dataset to read into memory. This is a table file. The ‘.csv’ should contain file names that are expected to be CIF-files in file_directory. Default is None.
file_directory (str) – Name or relative path from data_directory to a directory containing sorted ‘cif’ files. Default is None.
file_name_pymatgen_json (str) – This class will generate a ‘json’ file with pymatgen structures. You can specify the file name of that file with this argument. By default, it will be named from file_name when passed None.
dataset_name (str) – Name of the dataset. Important for naming and saving files. Default is None.
verbose (int) – Logging level. Default is 10.
-
_map_callbacks
(structs: list, data: pandas.core.series.Series, callbacks: Dict[str, Callable[[pymatgen.core.structure.Structure, pandas.core.series.Series], Optional[numpy.ndarray]]], assign_to_self: bool = True) → dict[source]¶ Map callbacks on a data series object plus structure list.
- Parameters
structs (list) – List of pymatgen structures.
data (pd.Series, pd.DataFrame) – Data Frame matching the structure list.
callbacks (dict) – Dictionary of callbacks that take a data object plus pymatgen structure as argument.
assign_to_self (bool) – Whether to already assign the output of callbacks to this class.
- Returns
Values of callbacks.
- Return type
-
get_structures_from_json_file
(file_path: Optional[str] = None) → List[source]¶ Load pymatgen serialized json-file into memory.
Structures are not added to
CrystalDataset
but returned by this function.
-
prepare_data
(file_column_name: Optional[str] = None, overwrite: bool = False)[source]¶ Default preparation for crystal datasets.
Try to load all crystal structures from single files and save them as a pymatgen json serialization. Can load multiple CIF files from a table that keeps file names and possible labels or additional information.
-
property
pymatgen_json_file_path
¶ Internal file name for the pymatgen serialization information to store to disk.
-
read_in_memory
(label_column_name: Optional[str] = None, additional_callbacks: Optional[Dict[str, Callable[[pymatgen.core.structure.Structure, pandas.core.series.Series], None]]] = None)[source]¶ Read structures from pymatgen json serialization and convert them into graph information.
-
save_structures_to_json_file
(structs: list, file_path: Optional[str] = None)[source]¶ Save a list of pymatgen structures to file.
-
set_representation
(pre_processor: Union[kgcnn.crystal.base.CrystalPreprocessor, dict], reset_graphs: bool = False)[source]¶ Build a graph representation for this dataset using
kgcnn.crystal.
- Parameters
pre_processor (CrystalPreprocessor) – Crystal preprocessor to use.
reset_graphs (bool) – Whether to reset the graph information. Default is False.
Returns:
-
kgcnn.data.download module¶
-
class
kgcnn.data.download.
DownloadDataset
(dataset_name: Optional[str] = None, download_file_name: Optional[str] = None, data_directory_name: Optional[str] = None, unpack_directory_name: Optional[str] = None, extract_file_name: Optional[str] = None, download_url: Optional[str] = None, unpack_tar: bool = False, unpack_zip: bool = False, extract_gz: bool = False, reload: bool = False, verbose: int = 10, data_main_dir: str = '/home/docs/.kgcnn/datasets', execute_download_dataset_on_init: bool = True)[source]¶ Bases:
object
Download class for datasets.
Provides static-methods and functions for download and unzip of the data. They are intentionally kept general and could also be used without this class definition. Downloading is handled by
download_dataset_to_disk already in init by default. Dataset-specific functions like prepare_data must be implemented in subclasses.
Note
Note that DownloadDataset uses a main directory located at ‘~/.kgcnn/datasets’ for downloading datasets as default.
Classes in kgcnn.data.datasets inherit from this class, but DownloadDataset can also be used as a member via composition.
Warning
Downloads are not checked for safety or malware. Use with caution!
Example on how to use DownloadDataset standalone:
    from kgcnn.data.download import DownloadDataset
    download = DownloadDataset(
        download_url="https://github.com/aimat-lab/gcnn_keras/blob/master/README.md",
        data_main_dir="./",
        data_directory_name="",
        download_file_name="README.html",
        reload=True,
        execute_download_dataset_on_init=False
    )
    download.download_dataset_to_disk()
-
__init__
(dataset_name: Optional[str] = None, download_file_name: Optional[str] = None, data_directory_name: Optional[str] = None, unpack_directory_name: Optional[str] = None, extract_file_name: Optional[str] = None, download_url: Optional[str] = None, unpack_tar: bool = False, unpack_zip: bool = False, extract_gz: bool = False, reload: bool = False, verbose: int = 10, data_main_dir: str = '/home/docs/.kgcnn/datasets', execute_download_dataset_on_init: bool = True)[source]¶ Base initialization function for downloading and extracting the data. The arguments to the constructor determine what to download and whether to unpack the download. The main function
download_dataset_to_disk is called in the constructor by default; set execute_download_dataset_on_init to False to defer it.
- Parameters
dataset_name (str) – Name of the dataset to download (optional). Default is None.
download_file_name (str) – Name of the file that the source is downloaded to. Default is None.
data_directory_name (str) – Name of the data directory in data_main_dir the file is saved. Default is None.
unpack_directory_name (str) – The name of a new directory in data_directory_name to unpack archive. Default is None.
extract_file_name (str) – Name of a gz-file to extract. Default is None.
download_url (str) – Url for file to download. Default is None.
unpack_tar (bool) – Whether to unpack a tar-archive. Default is False.
unpack_zip (bool) – Whether to unpack a zip-archive. Default is False.
extract_gz (bool) – Whether to unpack a gz-archive. Default is False.
reload (bool) – Whether to reload the data and make new dataset. Default is False.
verbose (int) – Logging level. Default is 10.
execute_download_dataset_on_init (bool) – Whether to start download on class construction.
-
static
download_database
(path: str, download_url: str, filename: str, overwrite: bool = False, logger=None)[source]¶ Download dataset file.
- Parameters
path (str) – Target filepath to store file (without filename).
download_url (str) – String of the download url to catch database from.
filename (str) – Name the dataset is downloaded to.
overwrite (bool) – Overwrite existing database. Default is False.
logger – Logger to print information or warnings.
- Returns
Filepath of downloaded file.
- Return type
os.path
-
download_dataset_to_disk
()[source]¶ Main download function to run the download and unpack of the dataset. Defined by attributes in self.
-
static
extract_gz_file
(path: str, filename: str, out_filename: Optional[str] = None, overwrite: bool = False, logger=None)[source]¶ Extract gz-file.
- Parameters
- Returns
Filepath of the extracted file.
- Return type
os.path
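What extracting a gz-file amounts to can be shown with the standard `gzip` and `shutil` modules. `extract_gz_sketch` is a hypothetical stand-in with a similar signature, not the library's code; in particular, deriving the output name by stripping the `.gz` suffix is an assumption.

```python
import gzip
import os
import shutil
import tempfile
from typing import Optional


def extract_gz_sketch(path: str, filename: str, out_filename: Optional[str] = None,
                      overwrite: bool = False) -> str:
    """Decompress path/filename and return the path of the extracted file."""
    if out_filename is None:
        # Assumed default: drop the '.gz' suffix.
        out_filename = filename[:-3] if filename.endswith(".gz") else filename + ".out"
    out_path = os.path.join(path, out_filename)
    if overwrite or not os.path.exists(out_path):
        with gzip.open(os.path.join(path, filename), "rb") as source:
            with open(out_path, "wb") as target:
                shutil.copyfileobj(source, target)  # streams without loading all data
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    with gzip.open(os.path.join(tmp, "table.csv.gz"), "wt") as handle:
        handle.write("smiles,label\n")
    extracted = extract_gz_sketch(tmp, "table.csv.gz")
    extracted_name = os.path.basename(extracted)
    with open(extracted) as handle:
        content = handle.read()

print(extracted_name)
```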
-
static
setup_dataset_dir
(data_main_dir: str, data_directory: str, logger=None)[source]¶ Make directory for each dataset.
-
static
setup_dataset_main
(data_main_dir, logger=None)[source]¶ Make the main-directory for all datasets to store data.
- Parameters
data_main_dir (str) – Path to create directory.
logger – Logger to print information or warnings.
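The two directory helpers, setup_dataset_main and setup_dataset_dir, boil down to creating a main directory and one sub-directory per dataset. A minimal sketch with `os.makedirs` follows; `setup_dataset_dirs` is a hypothetical helper that combines both steps, and expanding `~` in the main directory is an assumption based on the default location mentioned in the note on `DownloadDataset`.

```python
import os
import tempfile


def setup_dataset_dirs(data_main_dir: str, data_directory: str) -> str:
    """Create the main directory and the dataset directory inside it."""
    main_dir = os.path.expanduser(data_main_dir)
    os.makedirs(main_dir, exist_ok=True)
    dataset_dir = os.path.join(main_dir, data_directory)
    os.makedirs(dataset_dir, exist_ok=True)
    return dataset_dir


with tempfile.TemporaryDirectory() as tmp:
    main = os.path.join(tmp, ".kgcnn", "datasets")
    created = setup_dataset_dirs(main, "ESOL")
    created_exists = os.path.isdir(created)
    # Calling it again is harmless because exist_ok=True.
    created_again = setup_dataset_dirs(main, "ESOL")

print(os.path.basename(created), created_exists)
```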
-
static
unpack_tar_file
(path: str, filename: str, unpack_directory: str, overwrite: bool = False, logger=None)[source]¶ Extract tar-file.
- Parameters
- Returns
Filepath of the extracted dataset folder.
- Return type
os.path
-
kgcnn.data.force module¶
-
class
kgcnn.data.force.
ForceDataset
(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, verbose: int = 10, file_directory: Optional[str] = None, file_name_xyz: Optional[str] = None, file_name_mol: Optional[str] = None, file_name_force_xyz: Optional[str] = None)[source]¶ Bases:
kgcnn.data.qm.QMDataset
This is a base class for Force datasets. Inherits all functionality from QMDataset.
It generates graph properties from a xyz-file, which stores atomic coordinates. Additionally, loading multiple single xyz-files into one file is supported. The file names and labels are given by a CSV or table file. The table file must have one line of header with column names!
    ├── data_directory
        ├── file_directory
        │   ├── *.xyz
        │   ├── *.xyz
        │   └── ...
        ├── file_name.csv
        ├── file_name.xyz
        ├── file_name.sdf
        ├── file_name_force.xyz
        ├── ...
        └── dataset_name.kgcnn.pickle
Additionally, forces xyz information can be read in with this class.
-
__init__
(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, verbose: int = 10, file_directory: Optional[str] = None, file_name_xyz: Optional[str] = None, file_name_mol: Optional[str] = None, file_name_force_xyz: Optional[str] = None)[source]¶ Default initialization. File information on the location of the dataset on disk should be provided here.
- Parameters
data_directory (str) – Full path to directory of the dataset. Optional. Default is None.
file_name (str) – Filename for reading table ‘.csv’ file into memory. Must be given! For example as ‘.csv’ formatted file with QM labels such as energy, states, dipole etc. Moreover, the table file can contain a list of file names of individual ‘.xyz’ files to collect. Files are expected to be in
file_directory. Default is None.
file_name_xyz (str) – Filename of a single ‘.xyz’ file. This file is generated when collecting single ‘.xyz’ files in file_directory. If not specified, the name is generated based on file_name.
file_name_mol (str) – Filename of a single ‘.sdf’ file. This file is generated from the single ‘.xyz’ file. SDF generation does require proper geometries. If not specified, the name is generated based on file_name.
file_directory (str) – Name or relative path from data_directory to a directory containing sorted ‘.xyz’ files. Only used if file_name is None. Default is None.
dataset_name (str) – Name of the dataset. Important for naming and saving files. Default is None.
verbose (int) – Logging level. Default is 10.
-
property
file_path_force_xyz
¶ Try to determine a file path for the force xyz information to store.
-
prepare_data
(overwrite: bool = False, file_column_name: Optional[str] = None, file_column_name_force: Optional[str] = None, make_sdf: bool = False)[source]¶ Pre-computation of molecular structure information in a sdf-file from a xyz-file or a folder of xyz-files.
If there is no single xyz-file, it will be created with the information of a csv-file with the same name.
- Parameters
overwrite (bool) – Overwrite existing database SDF file. Default is False.
file_column_name (str) – Name of the column in csv file with list of xyz-files located in file_directory. This is for the positions only.
file_column_name_force (str, list) – Column name of xyz files for forces in file directory.
make_sdf (bool) – Whether to try to make a sdf file from xyz information via RDKit and OpenBabel.
- Returns
self
-
read_in_memory
(label_column_name: Optional[Union[str, list]] = None, nodes: Optional[list] = None, edges: Optional[list] = None, graph: Optional[list] = None, encoder_nodes: Optional[dict] = None, encoder_edges: Optional[dict] = None, encoder_graph: Optional[dict] = None, add_hydrogen: bool = True, sanitize: bool = False, make_directed: bool = False, compute_partial_charges: bool = False, additional_callbacks: Optional[Dict[str, Callable[[kgcnn.molecule.base.MolGraphInterface, dict], None]]] = None, custom_transform: Optional[Callable[[kgcnn.molecule.base.MolGraphInterface], kgcnn.molecule.base.MolGraphInterface]] = None)[source]¶ Read geometric information into memory.
Graph labels require a column specified by
label_column_name.
- Parameters
label_column_name (str, list) – Name of labels for columns in CSV file.
nodes (list) – A list of node attributes as string. In place of names also functions can be added.
edges (list) – A list of edge attributes as string. In place of names also functions can be added.
graph (list) – A list of graph attributes as string. In place of names also functions can be added.
encoder_nodes (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_edges (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_graph (dict) – A dictionary of callable encoder where the key matches the attribute.
add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is True.
make_directed (bool) – Whether to have directed or undirected bonds. Default is False.
compute_partial_charges (bool, str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is False.
sanitize (bool) – Whether to sanitize molecule. Default is False.
additional_callbacks (dict) – A dictionary whose keys are string attribute names which the elements of the dataset are supposed to have and the elements are callback function objects which implement how those attributes are derived from the
MolecularGraphRDKit of the molecule in question or the row of the CSV file.
custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.
- Returns
self
-
kgcnn.data.moleculenet module¶
-
class
kgcnn.data.moleculenet.
MoleculeNetDataset
(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, file_name_mol: Optional[str] = None, file_name_smiles: Optional[str] = None, verbose: int = 10)[source]¶ Bases:
kgcnn.data.base.MemoryGraphDataset
Class for using ‘MoleculeNet’ datasets.
The concept is to load a table of smiles and corresponding targets and convert them into a tensor representation for graph networks.
    ├── data_directory
        ├── file_name.csv
        ├── file_name.SMILES
        ├── file_name.sdf
        └── dataset_name.kgcnn.pickle
The class provides properties and methods for making graph features from smiles. The typical input is a csv or excel file with smiles and corresponding graph labels. The table file must have one line of header with column names!
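The expected table format can be illustrated with the standard `csv` module. The column name `smiles` matches the default of smiles_column_name in prepare_data; the `label` column and its values are made up for this example, and no RDKit or kgcnn call is involved.

```python
import csv
import io

# In-memory stand-in for 'file_name.csv'; the first line is the required header.
table = io.StringIO("smiles,label\nCCO,0.0\nc1ccccc1,1.0\n")

rows = list(csv.DictReader(table))
smiles = [row["smiles"] for row in rows]
labels = [float(row["label"]) for row in rows]

print(smiles)
print(labels)
```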
The graph structure matches the molecular graph, i.e. the chemical structure. The atomic coordinates are generated by a conformer guess. Since this requires some computation time, it is only done once and the molecular coordinates or mol-blocks are stored in a single SDF file with the base-name of the csv file_name. Conversion is using the MolConverter class.
The selection of smiles and whether conformers should be generated is handled by subclasses or specified in the methods prepare_data and read_in_memory, see the documentation of the methods for further details.
Attribute generation is carried out via the MolecularGraphRDKit class and requires RDKit as backend. You can also use a pre-processed SDF or SMILES file in data_directory and add their name in the class initialization.
-
__init__
(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, file_name_mol: Optional[str] = None, file_name_smiles: Optional[str] = None, verbose: int = 10)[source]¶ Initialize a
MoleculeNetDataset with information of the dataset location on disk.
- Parameters
file_name (str) – Filename for reading into memory. This must be the name of the ‘.csv’ file. Default is None.
file_name_mol (str) – Filename of the SDF file that is generated from the SMILES file that is generated from a list of smiles given in the table file specified by file_name. By default, the name is chosen equal to file_name when passed None.
file_name_smiles (str) – Filename of the SMILES file that is generated from a list of smiles given in the table file specified by file_name. By default, the name is chosen equal to file_name when passed None.
data_directory (str) – Full path to directory containing all dataset files. Default is None. Not used by this subclass. Ignored.
dataset_name (str) – Name of the dataset. Important for naming. Default is None.
verbose (int) – Logging level. Default is 10.
-
property
file_path_mol
¶ Try to determine a file path for the mol information to store.
-
property
file_path_smiles
¶ Try to determine a file path for the SMILES information to store.
-
prepare_data
(overwrite: bool = False, smiles_column_name: str = 'smiles', add_hydrogen: bool = True, sanitize: bool = True, make_conformers: bool = True, optimize_conformer: bool = True, external_program: Optional[dict] = None, num_workers: Optional[int] = None)[source]¶ Computation of molecular structure information and optionally conformers from SMILES.
This function reads SMILES from the CSV file given by file_name and creates a single SDF file of generated mol-blocks with the same file name. The function requires RDKit and (optionally) OpenBabel. SMILES that neither RDKit nor OpenBabel can process result in an empty mol-block in the SDF file to keep the number of molecules the same.
- Parameters
overwrite (bool) – Overwrite existing database mol-json file. Default is False.
smiles_column_name (str) – Column name where smiles are given in csv-file. Default is “smiles”.
add_hydrogen (bool) – Whether to add H after SMILES translation. Default is True.
sanitize (bool) – Whether to sanitize molecule. Default is True.
make_conformers (bool) – Whether to make conformers. Default is True.
optimize_conformer (bool) – Whether to optimize conformer via force field. Only possible with make_conformers. Default is True.
external_program (dict) – External program for translating SMILES. Default is None. If you want to use an external program, you have to supply a dictionary of the form: {“class_name”: “balloon”, “config”: {“balloon_executable_path”: …, …}}. Note that parameters like add_hydrogen are then usually ignored, and you need to control the SDF file generation within the config of the external_program.
num_workers (int) – Number of workers for parallel translation of SMILES. Default is None.
- Returns
self
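The header requirement and the placeholder behaviour described above can be illustrated with the standard library alone. This is only a sketch of the idea, not the implementation used by prepare_data; the table content and the helper name read_smiles_column are made up for illustration.

```python
import csv
import io

# Hypothetical table content: one header line, then one row per molecule.
table_text = "index,name,label,smiles\n1,ethanol,0,CCO\n2,broken,1,\n3,benzene,1,c1ccccc1\n"


def read_smiles_column(text, smiles_column_name="smiles"):
    """Return one entry per table row; rows without SMILES become None."""
    reader = csv.DictReader(io.StringIO(text))  # the first line is the header
    if smiles_column_name not in reader.fieldnames:
        raise KeyError(f"Column {smiles_column_name!r} not found in header.")
    # Keep a placeholder for empty entries so indices stay aligned with the table.
    return [row[smiles_column_name] or None for row in reader]


smiles = read_smiles_column(table_text)
print(smiles)  # ['CCO', None, 'c1ccccc1']
```

Keeping a placeholder for every row, valid or not, is what allows labels read later from the same table to be matched by position.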
-
read_in_memory
(label_column_name: Optional[Union[str, list]] = None, nodes: Optional[list] = None, edges: Optional[list] = None, graph: Optional[list] = None, encoder_nodes: Optional[dict] = None, encoder_edges: Optional[dict] = None, encoder_graph: Optional[dict] = None, add_hydrogen: bool = False, make_directed: bool = False, has_conformers: bool = True, sanitize: bool = True, compute_partial_charges: Optional[str] = None, additional_callbacks: Optional[Dict[str, Callable[[kgcnn.molecule.base.MolGraphInterface, dict], None]]] = None, custom_transform: Optional[Callable[[kgcnn.molecule.base.MolGraphInterface], kgcnn.molecule.base.MolGraphInterface]] = None)¶ Load list of molecules from cached SDF-file into memory. File name must be given in file_name and path information in the constructor of this class.
It further checks the csv-file for graph labels specified by label_column_name. Labels that do not have valid smiles or molecule in the SDF-file are also skipped, but added as None to keep the index and the molecule assignment.
Set further molecular attributes or features by string identifier. Requires MolecularGraphRDKit. Default values are features that have been used by Luo et al (2019).
The argument additional_callbacks allows adding custom properties to each element of the dataset. It is a dictionary whose string keys are the names of the properties and whose values are callable function objects that define how the property is derived from either the MolecularGraphRDKit or the corresponding row of the original CSV file. Those callback functions accept two parameters:
mg: The MolecularGraphRDKit instance of the molecule.
ds: A pandas data series that matches the data in the CSV file for the specific molecule.
Example:
from os import linesep

csv = f"index,name,label,smiles{linesep}1,Propanolol,1,[Cl].CC(C)NCC(O)COc1cccc2ccccc12"
with open('/tmp/moleculenet_example.csv', mode='w') as file:
    file.write(csv)

dataset = MoleculeNetDataset('/tmp', 'example', 'moleculenet_example.csv')
dataset.prepare_data(smiles_column_name='smiles')
dataset.read_in_memory(label_column_name='label')
dataset.set_attributes(
    nodes=['Symbol'],
    encoder_nodes={'Symbol': OneHotEncoder(['C', 'O'], dtype='str')},
    edges=['BondType'],
    encoder_edges={'BondType': int},
    additional_callbacks={
        # It is important that the callbacks return a numpy array, even if it is just a single element.
        'name': lambda mg, ds: np.array(ds['name'], dtype='str')
    }
)

mol: dict = dataset[0]
mol['node_attributes']  # np array of one hot encoded atom type per node
mol['edge_attributes']  # int value representing the bond type
mol['name']  # Array of a single string which is the name from the original CSV data
- Parameters
label_column_name (str) – Column name in the csv-file where to take graph labels from. For multi-targets you can supply a list of column names or positions. A slice can be provided for selecting columns as graph labels. Default is None.
nodes (list) – A list of node attributes as string. In place of names also functions can be added.
edges (list) – A list of edge attributes as string. In place of names also functions can be added.
graph (list) – A list of graph attributes as string. In place of names also functions can be added.
encoder_nodes (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_edges (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_graph (dict) – A dictionary of callable encoder where the key matches the attribute.
add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is False.
has_conformers (bool) – Whether to add node coordinates from conformer. Default is True.
make_directed (bool) – Whether to have directed or undirected bonds. Default is False.
sanitize (bool) – Whether to sanitize molecule. Default is True.
compute_partial_charges (str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is None.
additional_callbacks (dict) – A dictionary whose keys are string attribute names which the elements of the dataset are supposed to have and whose values are callback function objects which implement how those attributes are derived from the MolecularGraphRDKit of the molecule in question or the row of the CSV file.
custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.
- Returns
self
-
set_attributes
(label_column_name: Optional[Union[str, list]] = None, nodes: Optional[list] = None, edges: Optional[list] = None, graph: Optional[list] = None, encoder_nodes: Optional[dict] = None, encoder_edges: Optional[dict] = None, encoder_graph: Optional[dict] = None, add_hydrogen: bool = False, make_directed: bool = False, has_conformers: bool = True, sanitize: bool = True, compute_partial_charges: Optional[str] = None, additional_callbacks: Optional[Dict[str, Callable[[kgcnn.molecule.base.MolGraphInterface, dict], None]]] = None, custom_transform: Optional[Callable[[kgcnn.molecule.base.MolGraphInterface], kgcnn.molecule.base.MolGraphInterface]] = None)[source]¶ Load list of molecules from cached SDF-file into memory. File name must be given in file_name and path information in the constructor of this class.
It further checks the csv-file for graph labels specified by label_column_name. Labels that do not have valid smiles or molecule in the SDF-file are also skipped, but added as None to keep the index and the molecule assignment.
Set further molecular attributes or features by string identifier. Requires MolecularGraphRDKit. Default values are features that have been used by Luo et al (2019).
The argument additional_callbacks allows adding custom properties to each element of the dataset. It is a dictionary whose string keys are the names of the properties and whose values are callable function objects that define how the property is derived from either the MolecularGraphRDKit or the corresponding row of the original CSV file. Those callback functions accept two parameters:
mg: The MolecularGraphRDKit instance of the molecule.
ds: A pandas data series that matches the data in the CSV file for the specific molecule.
Example:
from os import linesep

csv = f"index,name,label,smiles{linesep}1,Propanolol,1,[Cl].CC(C)NCC(O)COc1cccc2ccccc12"
with open('/tmp/moleculenet_example.csv', mode='w') as file:
    file.write(csv)

dataset = MoleculeNetDataset('/tmp', 'example', 'moleculenet_example.csv')
dataset.prepare_data(smiles_column_name='smiles')
dataset.read_in_memory(label_column_name='label')
dataset.set_attributes(
    nodes=['Symbol'],
    encoder_nodes={'Symbol': OneHotEncoder(['C', 'O'], dtype='str')},
    edges=['BondType'],
    encoder_edges={'BondType': int},
    additional_callbacks={
        # It is important that the callbacks return a numpy array, even if it is just a single element.
        'name': lambda mg, ds: np.array(ds['name'], dtype='str')
    }
)

mol: dict = dataset[0]
mol['node_attributes']  # np array of one hot encoded atom type per node
mol['edge_attributes']  # int value representing the bond type
mol['name']  # Array of a single string which is the name from the original CSV data
- Parameters
label_column_name (str) – Column name in the csv-file where to take graph labels from. For multi-targets you can supply a list of column names or positions. A slice can be provided for selecting columns as graph labels. Default is None.
nodes (list) – A list of node attributes as string. In place of names also functions can be added.
edges (list) – A list of edge attributes as string. In place of names also functions can be added.
graph (list) – A list of graph attributes as string. In place of names also functions can be added.
encoder_nodes (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_edges (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_graph (dict) – A dictionary of callable encoder where the key matches the attribute.
add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is False.
has_conformers (bool) – Whether to add node coordinates from conformer. Default is True.
make_directed (bool) – Whether to have directed or undirected bonds. Default is False.
sanitize (bool) – Whether to sanitize molecule. Default is True.
compute_partial_charges (str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is None.
additional_callbacks (dict) – A dictionary whose keys are string attribute names which the elements of the dataset are supposed to have and whose values are callback function objects which implement how those attributes are derived from the MolecularGraphRDKit of the molecule in question or the row of the CSV file.
custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.
- Returns
self
-
-
kgcnn.data.moleculenet.
map_molecule_callbacks
(mol_list: List[str], data: Union[pandas.core.series.Series, pandas.core.frame.DataFrame], callbacks: Dict[str, Callable[[kgcnn.molecule.base.MolGraphInterface, pandas.core.series.Series], None]], custom_transform: Optional[Callable[[kgcnn.molecule.base.MolGraphInterface], kgcnn.molecule.base.MolGraphInterface]] = None, add_hydrogen: bool = False, make_directed: bool = False, sanitize: bool = True, compute_partial_charges: Optional[str] = None, mol_interface_class=None, logger=None, loop_update_info: int = 5000) → dict[source]¶ This method receives the list of molecules, as well as the data from a pandas data series. It then iterates over all the molecules / data rows and invokes the callbacks for each.
The “callbacks” parameter is supposed to be a dictionary whose keys are string names of attributes which are supposed to be derived from the molecule / data and whose values are function objects which define how to derive that data. Those callback functions get two parameters:
mg: The MolGraphInterface instance for the current molecule.
ds: A pandas data series that matches the data in the CSV file for the specific molecule.
The string keys of the “callbacks” dictionary are also the string names which are later used to assign the properties of the underlying GraphList. This means that each element of the dataset will then have a field with the same name.
Note
If a molecule cannot be properly loaded by MolGraphInterface, then “None” is added for all attributes without invoking the callback!
Example:
mol_net = MoleculeNetDataset()
mol_net.prepare_data()
mol_values = map_molecule_callbacks(
    mol_net.get_mol_blocks_from_sdf_file(),
    mol_net.read_in_table_file().data_frame,
    callbacks={
        'graph_size': lambda mg, dd: len(mg.node_number),
        'index': lambda mg, dd: dd['index']
    }
)

for key, value in mol_values.items():
    mol_net.assign_property(key, value)

mol: dict = mol_net[0]
assert 'graph_size' in mol.keys()
assert 'index' in mol.keys()
- Parameters
mol_list (list) – List of mol strings.
data (pd.DataFrame) – Pandas data frame or series matching list of mol-strings.
callbacks (dict) – Dictionary of callbacks to perform on MolecularGraph object and table entries.
add_hydrogen (bool) – Whether to add hydrogen when making a MolecularGraphRDKit instance.
make_directed (bool) – Whether to have directed or undirected bonds. Default is False.
sanitize (bool) – Whether to sanitize molecule. Default is True.
custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.
compute_partial_charges (str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is None.
mol_interface_class – Interface for molecular graphs. Must be a MolGraphInterface.
logger – Logger to report error and progress.
loop_update_info (int) – Updates for processed molecules.
- Returns
Values of callbacks.
- Return type
dict
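The iteration described for map_molecule_callbacks can be sketched in plain Python. The function below is a simplified stand-in written for illustration, not the library code: molecules are represented by hypothetical lists of atom symbols, table rows by plain dictionaries, and a molecule that failed to load by None.

```python
def map_callbacks_sketch(molecules, rows, callbacks):
    """Apply every callback to each (molecule, row) pair.

    A molecule of None stands in for one that failed to load; in that case
    None is stored for every attribute and no callback is invoked.
    """
    values = {name: [] for name in callbacks}
    for mg, ds in zip(molecules, rows):
        for name, func in callbacks.items():
            values[name].append(None if mg is None else func(mg, ds))
    return values


# Hypothetical stand-ins: a "molecule" is just a list of atom symbols here.
molecules = [["C", "C", "O"], None, ["C", "O"]]
rows = [{"index": 1}, {"index": 2}, {"index": 3}]

result = map_callbacks_sketch(
    molecules,
    rows,
    callbacks={
        "graph_size": lambda mg, ds: len(mg),
        "index": lambda mg, ds: ds["index"],
    },
)
print(result)  # {'graph_size': [3, None, 2], 'index': [1, None, 3]}
```

Each key of the returned dictionary maps to a list with one entry per molecule, which is the shape needed to assign it as a property of the dataset.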
kgcnn.data.qm module¶
-
class
kgcnn.data.qm.
QMDataset
(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, verbose: int = 10, file_directory: Optional[str] = None, file_name_xyz: Optional[str] = None, file_name_mol: Optional[str] = None)[source]¶ Bases:
kgcnn.data.base.MemoryGraphDataset
This is a base class for QM (quantum mechanical) datasets.
It generates graph properties from an xyz-file, which stores atomic coordinates. Additionally, collecting multiple single xyz-files into one file is supported. The file names and labels are given by a CSV or table file. The table file must have one line of header with column names!
├── data_directory
    ├── file_directory
    │   ├── *.xyz
    │   ├── *.xyz
    │   └── ...
    ├── file_name.csv
    ├── file_name.xyz
    ├── file_name.sdf
    └── dataset_name.kgcnn.pickle
It should be possible to generate approximate chemical bonding information via openbabel, if this additional package is installed. The class inherits from MemoryGraphDataset. If openbabel is not installed, minimal loading of labels and coordinates should still be supported.
For additional attributes, set_attributes enables further features that require RDKit or Openbabel to be installed. Note that for QMDataset the mol-information, if it is generated, is not cleaned during reading by default.
-
__init__
(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, verbose: int = 10, file_directory: Optional[str] = None, file_name_xyz: Optional[str] = None, file_name_mol: Optional[str] = None)[source]¶ Default initialization. File information on the location of the dataset on disk should be provided here.
- Parameters
data_directory (str) – Full path to directory of the dataset. Optional. Default is None.
file_name (str) – Filename for reading table ‘.csv’ file into memory. Must be given! For example as ‘.csv’ formatted file with QM labels such as energy, states, dipole etc. Moreover, the table file can contain a list of file names of individual ‘.xyz’ files to collect. Files are expected to be in file_directory. Default is None.
file_name_xyz (str) – Filename of a single ‘.xyz’ file. This file is generated when collecting single ‘.xyz’ files in file_directory. If not specified, the name is generated based on file_name.
file_name_mol (str) – Filename of a single ‘.sdf’ file. This file is generated from the single ‘.xyz’ file. SDF generation does require proper geometries. If not specified, the name is generated based on file_name.
file_directory (str) – Name or relative path from data_directory to a directory containing sorted ‘.xyz’ files. Only used if file_name is None. Default is None.
dataset_name (str) – Name of the dataset. Important for naming and saving files. Default is None.
verbose (int) – Logging level. Default is 10.
-
property
file_path_mol
¶ Try to determine a file name for the mol information to store.
-
property
file_path_xyz
¶ Try to determine a file name for the xyz information to store.
-
get_geom_from_xyz_file
(file_path: str) → list[source]¶ Get a list of xyz items from file.
- Parameters
file_path (str) – File path of XYZ file. Default None uses file_path_xyz.
- Returns
List of xyz lists.
- Return type
list
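For orientation, a file that concatenates several XYZ blocks can be read with a few lines of plain Python. The sketch below assumes the common XYZ layout (atom count, comment line, then one line per atom); the structure it returns, a list of (symbols, coordinates) pairs, is an assumption for illustration and may differ from what get_geom_from_xyz_file produces.

```python
def parse_xyz_text(text):
    """Parse concatenated XYZ blocks into a list of (symbols, coordinates)."""
    lines = text.splitlines()
    geometries, pos = [], 0
    while pos < len(lines):
        if not lines[pos].strip():  # skip blank lines between blocks
            pos += 1
            continue
        num_atoms = int(lines[pos].split()[0])
        # lines[pos + 1] is the comment line and is ignored here.
        atom_lines = lines[pos + 2: pos + 2 + num_atoms]
        symbols = [ln.split()[0] for ln in atom_lines]
        coords = [[float(v) for v in ln.split()[1:4]] for ln in atom_lines]
        geometries.append((symbols, coords))
        pos += 2 + num_atoms
    return geometries


xyz_text = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.471
H 0.000 -0.757 -0.471
2
hydrogen
H 0.0 0.0 0.0
H 0.0 0.0 0.74
"""
geoms = parse_xyz_text(xyz_text)
print(len(geoms), geoms[1][0])  # 2 ['H', 'H']
```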
-
get_mol_blocks_from_sdf_file
(file_path: Optional[str] = None) → list[source]¶ Get a list of mol-blocks from file.
- Parameters
file_path (str) – File path of SDF file. Default None uses file_path_mol.
- Returns
List of mol-strings.
- Return type
list
-
prepare_data
(overwrite: bool = False, file_column_name: Optional[str] = None, make_sdf: bool = True)[source]¶ Pre-computation of molecular structure information in a sdf-file from a xyz-file or a folder of xyz-files.
If there is no single xyz-file, it will be created with the information of a csv-file with the same name.
- Parameters
- Returns
self
-
read_in_memory
(label_column_name: Optional[Union[str, list]] = None, nodes: Optional[list] = None, edges: Optional[list] = None, graph: Optional[list] = None, encoder_nodes: Optional[dict] = None, encoder_edges: Optional[dict] = None, encoder_graph: Optional[dict] = None, add_hydrogen: bool = True, sanitize: bool = False, make_directed: bool = False, compute_partial_charges: bool = False, additional_callbacks: Optional[Dict[str, Callable[[kgcnn.molecule.base.MolGraphInterface, dict], None]]] = None, custom_transform: Optional[Callable[[kgcnn.molecule.base.MolGraphInterface], kgcnn.molecule.base.MolGraphInterface]] = None)[source]¶ Read geometric information into memory.
Graph labels require a column specified by label_column_name.
- Parameters
label_column_name (str, list) – Name of labels for columns in CSV file.
nodes (list) – A list of node attributes as string. In place of names also functions can be added.
edges (list) – A list of edge attributes as string. In place of names also functions can be added.
graph (list) – A list of graph attributes as string. In place of names also functions can be added.
encoder_nodes (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_edges (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_graph (dict) – A dictionary of callable encoder where the key matches the attribute.
add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is True.
make_directed (bool) – Whether to have directed or undirected bonds. Default is False.
compute_partial_charges (str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is None.
sanitize (bool) – Whether to sanitize molecule. Default is False.
additional_callbacks (dict) – A dictionary whose keys are string attribute names which the elements of the dataset are supposed to have and whose values are callback function objects which implement how those attributes are derived from the MolecularGraphRDKit of the molecule in question or the row of the CSV file.
custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.
- Returns
self
-
read_in_memory_xyz
(file_path: Optional[str] = None, atomic_coordinates: Optional[str] = 'node_coordinates', atomic_symbol: Optional[str] = 'node_symbol', atomic_number: Optional[str] = 'node_number')[source]¶ Read XYZ-file with geometric information into memory.
- Parameters
file_path (str) – Filepath to xyz file.
atomic_coordinates (str) – Name of graph property of atomic coordinates. Default is “node_coordinates”.
atomic_symbol (str) – Name of graph property of atomic symbol. Default is “node_symbol”.
atomic_number (str) – Name of graph property of atomic number. Default is “node_number”.
- Returns
self
-
set_attributes
(label_column_name: Optional[Union[str, list]] = None, nodes: Optional[list] = None, edges: Optional[list] = None, graph: Optional[list] = None, encoder_nodes: Optional[dict] = None, encoder_edges: Optional[dict] = None, encoder_graph: Optional[dict] = None, add_hydrogen: bool = True, make_directed: bool = False, sanitize: bool = False, compute_partial_charges: Optional[str] = None, additional_callbacks: Optional[Dict[str, Callable[[kgcnn.molecule.base.MolGraphInterface, dict], None]]] = None, custom_transform: Optional[Callable[[kgcnn.molecule.base.MolGraphInterface], kgcnn.molecule.base.MolGraphInterface]] = None)[source]¶ Read SDF-file with chemical structure information into memory.
- Parameters
label_column_name (str, list) – Name of labels for columns in CSV file.
nodes (list) – A list of node attributes as string. In place of names also functions can be added.
edges (list) – A list of edge attributes as string. In place of names also functions can be added.
graph (list) – A list of graph attributes as string. In place of names also functions can be added.
encoder_nodes (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_edges (dict) – A dictionary of callable encoder where the key matches the attribute.
encoder_graph (dict) – A dictionary of callable encoder where the key matches the attribute.
add_hydrogen (bool) – Whether to keep hydrogen after reading the mol-information. Default is True.
make_directed (bool) – Whether to have directed or undirected bonds. Default is False.
sanitize (bool) – Whether to sanitize molecule. Default is False.
compute_partial_charges (str) – Whether to compute partial charges, e.g. ‘gasteiger’. Default is None.
additional_callbacks (dict) – A dictionary whose keys are string attribute names which the elements of the dataset are supposed to have and whose values are callback function objects which implement how those attributes are derived from the MolecularGraphRDKit of the molecule in question or the row of the CSV file.
custom_transform (Callable) – Custom transformation function to modify the generated MolecularGraphRDKit before callbacks are carried out. The function must take a single MolecularGraphRDKit instance as argument and return a (new) MolecularGraphRDKit instance.
- Returns
self
-
kgcnn.data.serial module¶
-
kgcnn.data.serial.
deserialize
(dataset: Union[str, dict])[source]¶ Deserialize a dataset class from dictionary including “class_name” and “config” keys.
Furthermore, prepare_data, read_in_memory and map_list can be called during deserialization if they are set manually as a list in the ‘methods’ key. The function also tries to resolve datasets without a module_name key. Otherwise, you can use the general kgcnn.utils.serial.
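The general pattern of building an object from a dictionary with “class_name” and “config” keys and then calling the listed methods can be sketched as follows. Everything here is hypothetical: ToyDataset, REGISTRY and deserialize_sketch are illustration names, and the shape of the ‘methods’ entries (a list of {method_name: kwargs} dictionaries) is an assumption, not a statement about the format kgcnn expects.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset class."""

    def __init__(self, dataset_name=None):
        self.dataset_name = dataset_name
        self.calls = []

    def prepare_data(self, overwrite=False):
        self.calls.append(("prepare_data", overwrite))
        return self

    def read_in_memory(self):
        self.calls.append(("read_in_memory",))
        return self


# Hypothetical lookup table from class name to class.
REGISTRY = {"ToyDataset": ToyDataset}


def deserialize_sketch(config):
    """Instantiate the class named in the config and call the listed methods."""
    cls = REGISTRY[config["class_name"]]
    dataset = cls(**config.get("config", {}))
    # Assumed shape of 'methods': a list of {method_name: kwargs} dictionaries.
    for entry in config.get("methods", []):
        for method_name, kwargs in entry.items():
            getattr(dataset, method_name)(**kwargs)
    return dataset


ds = deserialize_sketch({
    "class_name": "ToyDataset",
    "config": {"dataset_name": "toy"},
    "methods": [{"prepare_data": {"overwrite": True}}, {"read_in_memory": {}}],
})
print(ds.dataset_name, ds.calls)
# toy [('prepare_data', True), ('read_in_memory',)]
```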
kgcnn.data.tudataset module¶
-
class
kgcnn.data.tudataset.
GraphTUDataset
(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, file_directory: Optional[str] = None, verbose: int = 10)[source]¶ Bases:
kgcnn.data.base.MemoryGraphDataset
Base class for loading graph datasets published by TU Dortmund University.
Datasets contain non-isomorphic graphs for many graph classification or regression tasks. This general base class has functionality to load TUDatasets in a generic way. The datasets are already in a graph-like format and do not need further processing via e.g. prepare data.
Note
Note that subclasses of GraphTUDataset2020 in kgcnn.data.datasets download the datasets. There are also further TU-datasets in kgcnn.data.datasets, if further processing is used in literature. Not all datasets can provide all types of graph properties like edge_attributes etc.
The file structure of GraphTUDataset for a given dataset ‘DS’ (replace DS with dataset name):
├── data_directory
    ├── DS_graph_indicator.txt
    ├── DS_A.txt
    ├── DS_node_labels.txt
    ├── DS_node_attributes.txt
    ├── DS_edge_labels.txt
    ├── DS_edge_attributes.txt
    ├── DS_graph_labels.txt
    ├── DS_graph_attributes.txt
    ├── ...
    └── dataset_name.kgcnn.pickle
A single file can additionally be set up with the save method of the base class.
-
__init__
(data_directory: Optional[str] = None, dataset_name: Optional[str] = None, file_name: Optional[str] = None, file_directory: Optional[str] = None, verbose: int = 10)[source]¶ Initialize a GraphTUDataset instance from file.
- Parameters
data_directory (str) – Full path to directory of the dataset. Default is None.
file_name (str) – Filename for reading into memory. Not used for general TUDataset, since there are multiple files with a prefix and pre-defined suffix. Default is None.
file_directory (str) – Name or relative path from data_directory to a directory containing sorted files. Default is None.
dataset_name (str) – Name of the dataset. Important for base-name for naming of files. Default is None.
verbose (int) – Logging level. Default is 10.
-
static
read_csv_simple
(filepath: str, delimiter: str = ', ', dtype: Callable = <class 'float'>)[source]¶ Very simple python-only function to read in a csv-file from file.
-
read_in_memory
()[source]¶ Read the GraphTUDataset into memory.
The TUDataset is stored in disjoint representations. The data is cast to a list of separate graph properties for MemoryGraphDataset.
- Returns
self
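What casting a disjoint representation into separate graphs means can be shown with a small pure-Python sketch. It assumes the usual TUDataset conventions: DS_graph_indicator.txt lists a 1-based graph id per node, DS_A.txt lists 1-based (source, target) node pairs, and the nodes of each graph are numbered contiguously. The helper split_disjoint_graphs is made up for illustration and is not the method's actual code.

```python
def split_disjoint_graphs(graph_indicator, edges):
    """Split a disjoint (batched) graph into per-graph edge lists.

    graph_indicator[i] is the 1-based graph id of node i + 1, and edges holds
    1-based (source, target) node ids over all graphs.
    """
    num_graphs = max(graph_indicator)
    # Global id of the first node of each graph (nodes assumed contiguous).
    first_node = {}
    for node_id, graph_id in enumerate(graph_indicator, start=1):
        first_node.setdefault(graph_id, node_id)
    per_graph = [[] for _ in range(num_graphs)]
    for src, dst in edges:
        graph_id = graph_indicator[src - 1]
        offset = first_node[graph_id]
        # Shift to 0-based node indices that are local to the graph.
        per_graph[graph_id - 1].append((src - offset, dst - offset))
    return per_graph


graph_indicator = [1, 1, 1, 2, 2]
edges = [(1, 2), (2, 1), (2, 3), (3, 2), (4, 5), (5, 4)]
edge_indices = split_disjoint_graphs(graph_indicator, edges)
print(edge_indices)
# [[(0, 1), (1, 0), (1, 2), (2, 1)], [(0, 1), (1, 0)]]
```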
-
kgcnn.data.utils module¶
-
kgcnn.data.utils.
load_hyper_file
(file_name: str, **kwargs) → dict[source]¶ Load hyperparameter from file. File type can be ‘.yaml’, ‘.json’, ‘.pickle’ or ‘.py’.
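Choosing a loader from the file extension might look like the sketch below. It covers only JSON and pickle so that it runs with the standard library; the helper name load_by_extension and the example hyperparameter keys are hypothetical and not taken from kgcnn.

```python
import json
import os
import pickle
import tempfile


def load_by_extension(file_path):
    """Pick a loader from the file extension (only JSON and pickle here)."""
    extension = os.path.splitext(file_path)[1].lower()
    if extension == ".json":
        with open(file_path, "r") as f:
            return json.load(f)
    if extension == ".pickle":
        with open(file_path, "rb") as f:
            return pickle.load(f)
    raise ValueError(f"Unsupported file type: {extension!r}")


# Hypothetical hyperparameter dictionary, written to a temporary JSON file.
hyper = {"model": {"config": {"depth": 3}}}
with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "hyper.json")
    with open(json_path, "w") as f:
        json.dump(hyper, f)
    loaded = load_by_extension(json_path)

print(loaded == hyper)  # True
```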
-
kgcnn.data.utils.
load_json_file
(file_path: str, **kwargs)[source]¶ Load json file.
- Parameters
file_path (str) – File path or name to load.
- Returns
Python-object of file.
- Return type
obj
-
kgcnn.data.utils.
load_pickle_file
(file_path: str, **kwargs)[source]¶ Load pickle file.
- Parameters
file_path (str) – File path or name to load.
- Returns
Python-object of file.
- Return type
obj
-
kgcnn.data.utils.
load_yaml_file
(file_path: str)[source]¶ Load yaml file.
- Parameters
file_path (str) – File path or name to load.
- Returns
Python-object of file.
- Return type
obj
-
kgcnn.data.utils.
pad_np_array_list_batch_dim
(values: list, dtype: Optional[str] = None)[source]¶ Pad a list of numpy arrays along first dimension.
-
kgcnn.data.utils.
ragged_tensor_from_nested_numpy
(numpy_list: list, dtype: Optional[str] = None, row_splits_dtype: str = 'int64')[source]¶ Make ragged tensor from a list of numpy arrays. Each array can have different length but must match in shape except the first dimension. This will result in a ragged tensor with ragged dimension only at first axis (ragged_rank=1), like shape (batch, None, …). This way a tensor can be generated faster than tf.ragged.constant().
Warning
The data will be copied for this operation.
import tensorflow as tf
import numpy as np

ragged_tensor = ragged_tensor_from_nested_numpy([np.array([[0]]), np.array([[1], [2], [3]])])
print(ragged_tensor)
# <tf.RaggedTensor [[[0]], [[1], [2], [3]]]>
print(ragged_tensor.shape)
# (2, None, 1)
- Parameters
- Returns
Ragged tensor of former nested list of numpy arrays.
- Return type
tf.RaggedTensor
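A ragged tensor with ragged_rank=1 can be described by a flat array of values plus row splits that mark where each row begins and ends. The sketch below computes both with plain Python lists for the same input as the example above; it illustrates the data layout only and is an assumption about the approach, not a copy of the function. In TensorFlow, such a pair can be passed to tf.RaggedTensor.from_row_splits.

```python
from itertools import accumulate


def flatten_with_row_splits(nested):
    """Concatenate the rows of all items and record the split positions."""
    lengths = [len(rows) for rows in nested]
    row_splits = [0] + list(accumulate(lengths))
    flat_values = [row for rows in nested for row in rows]
    return flat_values, row_splits


nested = [[[0]], [[1], [2], [3]]]
flat_values, row_splits = flatten_with_row_splits(nested)
print(flat_values)  # [[0], [1], [2], [3]]
print(row_splits)   # [0, 1, 4]
```

Item i of the batch is then flat_values[row_splits[i]:row_splits[i + 1]], which is why only the first dimension may differ between the arrays.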
-
kgcnn.data.utils.
save_json_file
(obj, file_path: str, **kwargs)[source]¶ Save json file.
- Parameters
obj – Python-object to dump.
file_path (str) – File path or name to save ‘obj’ to.
- Returns
None.