qxmt.datasets.builder module
- class qxmt.datasets.builder.DatasetBuilder(config, logger=<Logger qxmt.datasets.builder (INFO)>)
Bases:
object
[NOTE]: Currently, this class only supports the NumPy array data type. Other data types will be supported in the future.
The DatasetBuilder class is responsible for loading, preprocessing, and transforming the dataset. The dataset is either loaded from a path or generated by a method defined in the config. After loading, the raw preprocess method is applied to the dataset; this step is optional. The dataset is then split into train and test sets, with the test set size defined in the config. Finally, the transform method is applied to the dataset; this step is also optional. The builder returns a Dataset object that contains the train and test splits of the dataset.
Examples
>>> import numpy as np
>>> from qxmt.configs import ExperimentConfig
>>> from qxmt.datasets.builder import DatasetBuilder
>>> config = ExperimentConfig(path="configs/my_run.yaml")
>>> dataset = DatasetBuilder(config).build()
>>> dataset
Dataset(
    X_train=array([[15.81324596, -9.07999965], ...]),
    y_train=array([0, ...]),
    X_test=array([[6.71832813, 11.2653727], ...]),
    y_test=array([1, ...]),
    config=DatasetConfig(type='file', ...)
)
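The order of the steps described above (load, raw preprocess, split, transform) can be illustrated with a minimal NumPy-only sketch. It does not use qxmt itself: every function below, the synthetic data, and the `test_size` default are hypothetical stand-ins chosen for illustration, and the real builder reads these settings from the config.

```python
import numpy as np


def load():
    # Hypothetical stand-in for loading: generate a small synthetic dataset.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 2))
    y = np.array([0, 1] * 5)
    return X, y


def raw_preprocess(X, y):
    # Optional step; this sketch returns the data unchanged.
    return X, y


def split(X, y, test_size=0.2):
    # Hold out the last `test_size` fraction as the test set (no shuffling).
    n_test = int(round(len(X) * test_size))
    n_train = len(X) - n_test
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]


def transform(X_train, y_train, X_test, y_test):
    # Optional step; this sketch returns the splits unchanged.
    return X_train, y_train, X_test, y_test


def build():
    # Same order as described above: load, preprocess, split, transform.
    X, y = load()
    X, y = raw_preprocess(X, y)
    return transform(*split(X, y))


X_train, y_train, X_test, y_test = build()
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

The preprocessing step runs before the split, so it sees the whole dataset, while the transform step runs after it and receives each split separately.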
- Parameters:
config (ExperimentConfig)
logger (Logger)
- __init__(config, logger=<Logger qxmt.datasets.builder (INFO)>)
Initialize the DatasetBuilder.
- Parameters:
config (ExperimentConfig) – experiment config loaded from the YAML file
logger (Logger, optional) – logger for output messages. Defaults to LOGGER.
- Return type:
None
- build()
Build the dataset. This method loads, preprocesses, splits, and transforms the dataset.
- Returns:
Dataset object that contains the train and test split of the dataset
- Return type:
Dataset
- default_raw_preprocess(X, y)
Default raw preprocess method. This method does not apply any preprocessing.
- Parameters:
X (np.ndarray) – raw features of the dataset
y (np.ndarray) – raw labels of the dataset
- Returns:
raw features and labels of the dataset
- Return type:
RAW_DATASET_TYPE
- default_transform(X_train, y_train, X_val, y_val, X_test, y_test)
Default transform method. This method does not apply any transformation.
- Parameters:
X_train (np.ndarray) – raw features of the training data
y_train (np.ndarray) – raw labels of the training data
X_val (Optional[np.ndarray]) – raw features of the validation data. None if validation set is not used
y_val (Optional[np.ndarray]) – raw labels of the validation data. None if validation set is not used
X_test (np.ndarray) – raw features of the test data
y_test (np.ndarray) – raw labels of the test data
- Returns:
train, val and test split of dataset (features and labels)
- Return type:
PROCESSCED_DATASET_TYPE
- load()
Load the dataset from the path defined in config.
- Returns:
features and labels of the dataset
- Return type:
RAW_DATASET_TYPE
- raw_preprocess(X, y)
Preprocess the raw dataset. This step runs before the dataset is split. Examples include filtering and data sampling.
- Parameters:
X (np.ndarray) – raw features of the dataset
y (np.ndarray) – raw labels of the dataset
- Returns:
preprocessed features and labels of the dataset
- Return type:
RAW_DATASET_TYPE
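As an illustration of a filtering step of this kind, the following sketch keeps only the samples whose label belongs to a chosen set. The function name `keep_labels` and its `labels` parameter are hypothetical; how a custom function is registered with the builder is not covered here.

```python
import numpy as np


def keep_labels(X, y, labels=(0, 1)):
    """Hypothetical raw preprocess step: keep samples whose label is in `labels`."""
    mask = np.isin(y, labels)  # boolean mask over samples
    return X[mask], y[mask]


X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([0, 1, 2, 0, 2, 1])

X_f, y_f = keep_labels(X, y)
print(y_f)        # [0 1 0 1]
print(X_f.shape)  # (4, 2)
```

The function takes and returns a features/labels pair, mirroring the (X, y) signature documented above.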
- split(X, y)
Split the dataset into train and test sets. Test set size is defined in the config.
- Parameters:
X (np.ndarray) – raw features of the dataset
y (np.ndarray) – raw labels of the dataset
- Returns:
train and test split of dataset (features and labels)
- Return type:
PROCESSCED_DATASET_TYPE
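A shuffled train/test split can be sketched with NumPy alone, as below. This is an assumption-laden illustration, not the library's implementation: `split_train_test`, `test_size`, and `seed` are hypothetical names, and the function returns a plain four-tuple rather than the library's PROCESSCED_DATASET_TYPE, whose exact layout is not shown in this reference.

```python
import numpy as np


def split_train_test(X, y, test_size=0.25, seed=0):
    """Hypothetical split: shuffle indices, then hold out a fraction for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))            # shuffled sample indices
    n_test = int(round(len(X) * test_size))  # number of held-out samples
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]


X = np.arange(16, dtype=float).reshape(8, 2)
y = np.array([0, 1] * 4)

X_train, y_train, X_test, y_test = split_train_test(X, y)
print(X_train.shape, X_test.shape)  # (6, 2) (2, 2)
```

Fixing the seed keeps the split reproducible between runs.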
- transform(X_train, y_train, X_val, y_val, X_test, y_test)
Transform the dataset. Examples include feature scaling and dimensionality reduction.
- Parameters:
X_train (np.ndarray) – raw features of the training data
y_train (np.ndarray) – raw labels of the training data
X_val (Optional[np.ndarray]) – raw features of the validation data. None if validation set is not used
y_val (Optional[np.ndarray]) – raw labels of the validation data. None if validation set is not used
X_test (np.ndarray) – raw features of the test data
y_test (np.ndarray) – raw labels of the test data
- Returns:
transformed train, val and test split of dataset (features and labels)
- Return type:
PROCESSCED_DATASET_TYPE
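Feature scaling with this six-argument signature might look like the sketch below. The function name `standardize` is hypothetical, and returning a plain six-tuple is an assumption about the expected return layout. The statistics are computed on the training data only and reused for the other splits, and a validation set of None is passed through untouched.

```python
import numpy as np


def standardize(X_train, y_train, X_val, y_val, X_test, y_test):
    """Hypothetical transform: scale features using training-set statistics."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant features

    def scale(X):
        return (X - mean) / std

    # The validation set is optional, so pass None through unchanged.
    X_val_t = None if X_val is None else scale(X_val)
    return scale(X_train), y_train, X_val_t, y_val, scale(X_test), y_test


X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
y_train = np.array([0, 1])
X_test = np.array([[2.0, 20.0]])
y_test = np.array([1])

out = standardize(X_train, y_train, None, None, X_test, y_test)
print(out[0])  # scaled training features
print(out[4])  # scaled test features
```

Fitting the scaling on the training split alone avoids leaking information from the validation and test data into the transform.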