qxmt.datasets.openml package

Submodules

Module contents

class qxmt.datasets.openml.OpenMLDataLoader(name=None, id=None, save_path=None, return_format='numpy', use_cache=True, logger=<Logger qxmt.datasets.openml.loader (INFO)>)

Bases: object

This class loads a dataset from OpenML (https://www.openml.org/), an online platform for sharing datasets and machine learning experiments, and converts it to the specified return format. The dataset can be identified by name or by ID. If a name is specified, the loader searches the OpenML database for the dataset ID and loads the latest version. Dataset IDs can be looked up on the official OpenML website (https://www.openml.org/search). If both are given, the ID takes priority over the name.
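The precedence rule described above can be sketched as follows. This is an illustration only, not the loader's actual code: `resolve_identifier` is a hypothetical helper, and raising `ValueError` when neither argument is given is an assumption, not documented behavior.

```python
from typing import Optional, Tuple, Union


def resolve_identifier(
    name: Optional[str] = None, id: Optional[int] = None
) -> Tuple[str, Union[str, int]]:
    """Hypothetical helper (not part of qxmt): decide which identifier to use."""
    if id is not None:
        # The ID takes priority, even when a name is also supplied.
        return ("id", id)
    if name is not None:
        # A name would still need to be looked up on OpenML to obtain the ID.
        return ("name", name)
    # Assumption: rejecting the case where neither is given.
    raise ValueError("either name or id must be given")


print(resolve_identifier(name="mnist_784", id=554))
print(resolve_identifier(name="mnist_784"))
```

When both arguments are passed, the sketch ignores the name entirely, which mirrors the statement that the ID has the higher priority.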

Supported return formats:
  • numpy: returned as a tuple of numpy arrays.

  • pandas: returned as a pandas DataFrame.

Supported save formats:
  • numpy: .npz, .npy

  • pandas: .csv, .tsv
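The pairing of return formats and save extensions listed above can be checked before constructing a loader. The sketch below is a minimal, standalone illustration. `SAVE_EXTENSIONS` and `is_supported_save_path` are hypothetical names, and it assumes the save extension must match the chosen return format, which the loader itself may or may not enforce in this way.

```python
from pathlib import Path

# Extensions listed in the documentation for each return format.
SAVE_EXTENSIONS = {
    "numpy": {".npz", ".npy"},
    "pandas": {".csv", ".tsv"},
}


def is_supported_save_path(save_path, return_format: str) -> bool:
    """Hypothetical check (not part of qxmt) for a save path and return format."""
    if return_format not in SAVE_EXTENSIONS:
        raise ValueError(f"unsupported return format: {return_format}")
    # Compare the file extension against the ones listed for this format.
    return Path(save_path).suffix in SAVE_EXTENSIONS[return_format]


print(is_supported_save_path("data/mnist.npz", "numpy"))
print(is_supported_save_path("data/mnist.csv", "numpy"))
```

Under these assumptions, "data/mnist.npz" is accepted for the numpy format, while "data/mnist.csv" is accepted only for the pandas format.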

Examples

Load the dataset “mnist_784” from OpenML and save it as a numpy file.

>>> loader = OpenMLDataLoader(name="mnist_784", save_path="data/mnist.npz")
>>> X, y = loader.load()
X = np.array([[0, 0, 0, ..., 0, 0, 0], ..]), y = np.array([5, 0, 4, ..., 4, 5, 6])

Load the dataset id=554 (“mnist_784”) from OpenML and save it as a pandas DataFrame.

>>> loader = OpenMLDataLoader(id=554, save_path="data/mnist.csv", return_format="pandas")
>>> data = loader.load()
data = pd.DataFrame([[0, 0, 0, ..., 0, 0, 0], .., [5, 0, 4, ..., 4, 5, 6]])

Parameters:
  • name (str | None)

  • id (int | None)

  • save_path (str | Path | None)

  • return_format (str)

  • use_cache (bool)

  • logger (Logger)

__init__(name=None, id=None, save_path=None, return_format='numpy', use_cache=True, logger=<Logger qxmt.datasets.openml.loader (INFO)>)

Initialize the OpenML dataset loader.

Parameters:
  • name (str | None) – dataset name. If the name is specified, the latest version is used. Defaults to None.

  • id (int | None) – dataset ID. Takes priority over the name if both are given. Defaults to None.

  • save_path (Optional[str | Path], optional) – path where the loaded dataset is saved. If None, the dataset is not saved. Defaults to None.

  • return_format (str, optional) – return format of the loaded dataset, either “numpy” or “pandas”. Defaults to “numpy”.

  • use_cache (bool)

  • logger (Logger)

Return type:

None

load()

Load the dataset from OpenML and convert it to the specified return format. If save_path is not None, the loaded dataset is also saved to that path.

Raises:

ValueError – if the return format is unsupported

Returns:

loaded dataset

Return type:

tuple[np.ndarray, np.ndarray | None] | pd.DataFrame
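Because the return type is a union, calling code may need to branch on what load() returned. The following is a minimal sketch of one way to do that. `split_features_and_labels` is a hypothetical helper, not part of qxmt, and it relies only on the documented shape: a 2-tuple for the numpy format (where the second element may be None) or a single object for the pandas format.

```python
from typing import Any, Optional, Tuple


def split_features_and_labels(result: Any) -> Tuple[Any, Optional[Any]]:
    """Hypothetical helper (not part of qxmt) for the return value of load()."""
    if isinstance(result, tuple):
        # numpy format: a (X, y) tuple, where y may be None.
        X, y = result
        return X, y
    # pandas format: a single DataFrame with no separate label array.
    return result, None


# Plain Python objects stand in for the arrays and the DataFrame here.
X, y = split_features_and_labels(([[0, 0], [1, 1]], None))
print(X, y)
```

With the numpy format, callers should be prepared for y to be None before using it as a label array.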