qxmt.datasets.builder module

class qxmt.datasets.builder.DatasetBuilder(config, logger=<Logger qxmt.datasets.builder (INFO)>)

Bases: object

[NOTE]: Currently, this class only supports NumPy arrays. Other data types will be supported in the future.

DatasetBuilder is responsible for loading, preprocessing, splitting, and transforming the dataset. The dataset is either loaded from a path or generated by a method defined in the config. After loading, the optional raw preprocess method is applied. The dataset is then split into train and test sets; the test set size is defined in the config. Finally, the optional transform method is applied. The builder returns a Dataset object that contains the train and test splits of the dataset.

Examples

>>> import numpy as np
>>> from qxmt.configs import ExperimentConfig
>>> from qxmt.datasets.builder import DatasetBuilder
>>> config = ExperimentConfig(path="configs/my_run.yaml")
>>> dataset = DatasetBuilder(config).build()
>>> dataset
Dataset(
    X_train=array([[15.81324596, -9.07999965], ...]),
    y_train=array([0, ...]),
    X_test=array([[6.71832813, 11.2653727], ...]),
    y_test=array([1, ...]),
    config=DatasetConfig(type='file', ...)
)
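The order of the stages described above (load, optional raw preprocess, split, optional transform) can be sketched with plain NumPy. This is only an illustration of the flow, not the qxmt implementation: `run_pipeline`, `identity_raw`, `identity_transform`, `load_toy`, and `head_tail_split` are hypothetical names introduced here, and the four-element split layout is an assumption made for brevity.

```python
import numpy as np


def identity_raw(X, y):
    # Stand-in for an unconfigured raw preprocess step: return inputs unchanged.
    return X, y


def identity_transform(*parts):
    # Stand-in for an unconfigured transform step: return all splits unchanged.
    return parts


def run_pipeline(load, split, raw_preprocess=identity_raw, transform=identity_transform):
    # 1. Load (or generate) the raw features and labels.
    X, y = load()
    # 2. Optional preprocessing on the whole dataset, before splitting.
    X, y = raw_preprocess(X, y)
    # 3. Split into train and test parts.
    parts = split(X, y)
    # 4. Optional transform applied to the split data.
    return transform(*parts)


def load_toy():
    # Hypothetical toy data: 10 samples with 2 features each.
    X = np.arange(20, dtype=float).reshape(10, 2)
    y = np.array([0, 1] * 5)
    return X, y


def head_tail_split(X, y, n_test=2):
    # Deterministic split used only to keep the example short.
    return X[:-n_test], y[:-n_test], X[-n_test:], y[-n_test:]


X_train, y_train, X_test, y_test = run_pipeline(load_toy, head_tail_split)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Because both optional steps default to an identity function, leaving them unset changes nothing about the data, which mirrors the role of the default methods documented below.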

Parameters:

config (ExperimentConfig)
logger (Logger)

__init__(config, logger=<Logger qxmt.datasets.builder (INFO)>)

Initialize the DatasetBuilder.

Parameters:

config (ExperimentConfig) – experiment config loaded from the YAML file
logger (Logger, optional) – logger for output messages. Defaults to LOGGER.

Return type:

None

build()

Build the dataset. This method loads, preprocesses, splits, and transforms the dataset.

Returns:

Dataset object that contains the train and test split of the dataset

Return type:

Dataset

default_raw_preprocess(X, y)

Default raw preprocess method. This method does not apply any preprocessing.

Parameters:

X (np.ndarray) – raw features of the dataset
y (np.ndarray) – raw labels of the dataset

Returns:

raw features and labels of the dataset

Return type:

RAW_DATASET_TYPE

default_transform(X_train, y_train, X_val, y_val, X_test, y_test)

Default transform method. This method does not apply any transformation.

Parameters:

X_train (np.ndarray) – raw features of the training data
y_train (np.ndarray) – raw labels of the training data
X_val (Optional[np.ndarray]) – raw features of the validation data. None if validation set is not used
y_val (Optional[np.ndarray]) – raw labels of the validation data. None if validation set is not used
X_test (np.ndarray) – raw features of the test data
y_test (np.ndarray) – raw labels of the test data

Returns:

train, val and test split of dataset (features and labels)

Return type:

PROCESSCED_DATASET_TYPE

load()

Load the dataset from the path defined in config.

Returns:

features and labels of the dataset

Return type:

RAW_DATASET_TYPE

raw_preprocess(X, y)

Preprocess the raw dataset. This step runs before the dataset is split. Typical uses include filtering and data sampling.

Parameters:

X (np.ndarray) – raw features of the dataset
y (np.ndarray) – raw labels of the dataset

Returns:

preprocessed features and labels of the dataset

Return type:

RAW_DATASET_TYPE
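As an illustration of the kind of filtering this step is meant for, the sketch below keeps only the samples whose label belongs to a chosen set. `keep_labels` is a hypothetical function written for this example. It follows the `(X, y) -> (X, y)` shape documented above, but how such a function is registered in the config is not covered here.

```python
import numpy as np


def keep_labels(X, y, labels=(0, 1)):
    # Boolean mask that is True where the label is one of the requested classes.
    mask = np.isin(y, labels)
    # Apply the same mask to features and labels so they stay aligned.
    return X[mask], y[mask]


X = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]])
y = np.array([0, 2, 1, 2])

X_filtered, y_filtered = keep_labels(X, y)
print(X_filtered.shape, y_filtered)  # (2, 2) [0 1]
```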

split(X, y)

Split the dataset into train and test sets. The test set size is defined in the config.

Parameters:

X (np.ndarray) – raw features of the dataset
y (np.ndarray) – raw labels of the dataset

Returns:

train and test split of dataset (features and labels)

Return type:

PROCESSCED_DATASET_TYPE
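A shuffled split with an optional validation part might look like the sketch below. It is written with plain NumPy and is not the library's implementation. The function name `split_sketch`, its `test_size`, `val_size`, and `seed` parameters, and the six-element return layout (mirroring the arguments of `transform`, with `None` for an unused validation set) are all assumptions made for illustration.

```python
import numpy as np


def split_sketch(X, y, test_size=0.2, val_size=0.0, seed=0):
    n = len(X)
    n_test = int(round(n * test_size))
    n_val = int(round(n * val_size))
    # Shuffle sample indices once, then carve out disjoint index ranges.
    order = np.random.default_rng(seed).permutation(n)
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    # Report the validation split as None when it is not used.
    X_val = X[val_idx] if n_val > 0 else None
    y_val = y[val_idx] if n_val > 0 else None
    return X[train_idx], y[train_idx], X_val, y_val, X[test_idx], y[test_idx]


X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10) % 2

X_train, y_train, X_val, y_val, X_test, y_test = split_sketch(X, y, test_size=0.3)
print(len(X_train), X_val, len(X_test))  # 7 None 3
```

Fixing the seed makes the split reproducible across runs, which matters when results from different experiments are compared.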

transform(X_train, y_train, X_val, y_val, X_test, y_test)

Transform the dataset. Typical uses include feature scaling and dimensionality reduction.

Parameters:

X_train (np.ndarray) – raw features of the training data
y_train (np.ndarray) – raw labels of the training data
X_val (Optional[np.ndarray]) – raw features of the validation data. None if validation set is not used
y_val (Optional[np.ndarray]) – raw labels of the validation data. None if validation set is not used
X_test (np.ndarray) – raw features of the test data
y_test (np.ndarray) – raw labels of the test data

Returns:

transformed train, val and test split of dataset (features and labels)

Return type:

PROCESSCED_DATASET_TYPE
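As one example of feature scaling, the sketch below standardizes every split with statistics computed on the training features only, and passes a missing validation set through as `None`. `standardize` is a hypothetical function that follows the six-argument signature documented above. It is a sketch under that assumption, not code taken from qxmt.

```python
import numpy as np


def standardize(X_train, y_train, X_val, y_val, X_test, y_test):
    # Fit the scaling statistics on the training features only,
    # so no information from validation or test data leaks in.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled instead of dividing by zero

    def scale(X):
        # The validation split may be None when it is not used.
        return None if X is None else (X - mean) / std

    return scale(X_train), y_train, scale(X_val), y_val, scale(X_test), y_test


X_train = np.array([[1.0, 10.0], [3.0, 10.0]])
y_train = np.array([0, 1])
X_test = np.array([[2.0, 10.0]])
y_test = np.array([1])

out = standardize(X_train, y_train, None, None, X_test, y_test)
print(out[0])  # scaled training features
print(out[4])  # scaled test features
```

Labels are returned untouched, so the output keeps the same six-element layout as the input.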