
Dataset size reduction fixed, updated TargetValidator to match signatures #1250


Merged: 99 commits, Feb 1, 2022
Changes from 5 commits

Commits (99)
9369343
Moved to new splitter, moved to util file
eddiebergman Sep 15, 2021
c2be383
flake8'd
eddiebergman Sep 15, 2021
6e6a607
Fixed errors, added test specifically for CustomStratifiedShuffleSplit
eddiebergman Sep 16, 2021
786e508
flake8'd
eddiebergman Sep 16, 2021
58dc49b
Updated docstring
eddiebergman Sep 16, 2021
4bce38f
Updated types in docstring
eddiebergman Sep 16, 2021
206c3df
reduce_dataset_size_if_too_large supports more types
eddiebergman Oct 5, 2021
d6f018f
flake8'd
eddiebergman Oct 5, 2021
6ed3e2c
flake8'd
eddiebergman Oct 5, 2021
5981fad
Updated docstring
eddiebergman Oct 5, 2021
65c8667
Seperated out the data subsampling into individual functions
eddiebergman Oct 6, 2021
f130424
Improved typing from Automl.fit to reduce_dataset_size_if_too_large
eddiebergman Oct 6, 2021
9b6f613
flak8'd
eddiebergman Oct 7, 2021
a12cf33
subsample tested
eddiebergman Oct 7, 2021
077cb2c
Finished testing and flake8'd
eddiebergman Oct 7, 2021
9af22a7
Cleaned up transform function that was touched
eddiebergman Oct 7, 2021
8057766
^
eddiebergman Oct 7, 2021
e1cce3f
Removed double typing
eddiebergman Oct 7, 2021
c8693a9
Cleaned up typing of convert_if_sparse
eddiebergman Oct 7, 2021
2591cc2
Cleaned up splitters and added size test
eddiebergman Oct 7, 2021
a6cc39f
Cleanup doc in data
eddiebergman Oct 7, 2021
f987c65
rogue line added was removed
eddiebergman Oct 7, 2021
3c4964a
Test fix
eddiebergman Oct 7, 2021
a53b1e5
flake8'd
eddiebergman Oct 7, 2021
5343ab6
Typo fix
eddiebergman Oct 7, 2021
84ee347
Fixed ordering of things
eddiebergman Oct 7, 2021
019a06e
Fixed typing and tests of target_validator fit, transform, inv_transform
eddiebergman Oct 7, 2021
8aea9b9
Updated doc
eddiebergman Oct 8, 2021
065fbe1
Updated Type return
eddiebergman Oct 8, 2021
1abe8f0
Removed elif gaurd
eddiebergman Oct 8, 2021
0288a70
removed extraneuous overload
eddiebergman Oct 8, 2021
d55c687
Updated return type of feature validator
eddiebergman Oct 8, 2021
1e7e2a9
Type fixes for target validator fit
eddiebergman Oct 8, 2021
234ae5e
flake8'd
eddiebergman Oct 8, 2021
04f6d46
Moved to new splitter, moved to util file
eddiebergman Sep 15, 2021
ea82405
flake8'd
eddiebergman Sep 15, 2021
d3cd1cf
Fixed errors, added test specifically for CustomStratifiedShuffleSplit
eddiebergman Sep 16, 2021
f04d65a
flake8'd
eddiebergman Sep 16, 2021
a1038b1
Updated docstring
eddiebergman Sep 16, 2021
de475dd
Updated types in docstring
eddiebergman Sep 16, 2021
9021edc
reduce_dataset_size_if_too_large supports more types
eddiebergman Oct 5, 2021
3b7e49c
flake8'd
eddiebergman Oct 5, 2021
b835f48
flake8'd
eddiebergman Oct 5, 2021
8369a17
Updated docstring
eddiebergman Oct 5, 2021
86f5b65
Seperated out the data subsampling into individual functions
eddiebergman Oct 6, 2021
445b0ba
Improved typing from Automl.fit to reduce_dataset_size_if_too_large
eddiebergman Oct 6, 2021
658a244
flak8'd
eddiebergman Oct 7, 2021
b12a5f5
subsample tested
eddiebergman Oct 7, 2021
8ea2575
Finished testing and flake8'd
eddiebergman Oct 7, 2021
49a37bf
Cleaned up transform function that was touched
eddiebergman Oct 7, 2021
401a049
^
eddiebergman Oct 7, 2021
b46be44
Removed double typing
eddiebergman Oct 7, 2021
24a19cf
Cleaned up typing of convert_if_sparse
eddiebergman Oct 7, 2021
8922143
Cleaned up splitters and added size test
eddiebergman Oct 7, 2021
a071950
Cleanup doc in data
eddiebergman Oct 7, 2021
5c9b012
rogue line added was removed
eddiebergman Oct 7, 2021
cc5dcba
Test fix
eddiebergman Oct 7, 2021
fe15c14
flake8'd
eddiebergman Oct 7, 2021
54c4f2a
Typo fix
eddiebergman Oct 7, 2021
99c02a9
Fixed ordering of things
eddiebergman Oct 7, 2021
0e28bb3
Fixed typing and tests of target_validator fit, transform, inv_transform
eddiebergman Oct 7, 2021
b34e169
Updated doc
eddiebergman Oct 8, 2021
972f65e
Updated Type return
eddiebergman Oct 8, 2021
33ef1fd
Removed elif gaurd
eddiebergman Oct 8, 2021
1136573
removed extraneuous overload
eddiebergman Oct 8, 2021
b1f419b
Updated return type of feature validator
eddiebergman Oct 8, 2021
8585be7
Type fixes for target validator fit
eddiebergman Oct 8, 2021
aac7b26
flake8'd
eddiebergman Oct 8, 2021
e4c3426
Fixed err message str and automl sparse y tests
eddiebergman Oct 8, 2021
5585532
merged
eddiebergman Oct 8, 2021
75a974b
Flak8'd
eddiebergman Oct 8, 2021
5bf53a2
Fix sort indices
eddiebergman Nov 2, 2021
7ae1d87
list type to List
eddiebergman Nov 2, 2021
4ebfdc2
Remove uneeded comment
eddiebergman Nov 2, 2021
1e87a52
Updated comment to make it more clear
eddiebergman Nov 2, 2021
06196d3
Comment update
eddiebergman Nov 2, 2021
c7a47cb
Fixed warning message for reduce_dataset_if_too_large
eddiebergman Nov 2, 2021
c0305f9
Fix test
eddiebergman Nov 2, 2021
c109d93
Added check for error message in tests
eddiebergman Nov 2, 2021
1c2fe7e
Test Updates
eddiebergman Nov 2, 2021
377e260
Fix error msg
eddiebergman Nov 2, 2021
f909edc
reinclude csr y to test
eddiebergman Nov 2, 2021
f170fcc
Reintroduced explicit subsample values test
eddiebergman Nov 2, 2021
b4958e8
flaked
eddiebergman Nov 2, 2021
6861e34
Missed an uncomment
eddiebergman Nov 3, 2021
37f6948
Update the comment for test of splitters
eddiebergman Nov 3, 2021
a20291d
Updated warning message in CustomSplitter
eddiebergman Nov 3, 2021
4b0f8a0
Update comment in test
eddiebergman Nov 4, 2021
536c4c6
Update tests
eddiebergman Nov 4, 2021
f35102d
Removed overloads
eddiebergman Nov 4, 2021
ec0ed55
Narrowed type of subsample
eddiebergman Nov 4, 2021
5439235
Removed overload import
eddiebergman Nov 4, 2021
3d21282
Fix `todense` giving np.matrix, using `toarray`
eddiebergman Nov 5, 2021
e1317b1
Merge branch 'development' into use_new_splitter
eddiebergman Nov 5, 2021
f56356d
Made subsampling a little less aggresive
eddiebergman Nov 14, 2021
42e4397
Changed multiplier back to 10
eddiebergman Nov 15, 2021
9bcb210
Allow argument to specfiy how auto-sklearn handles compressing datase…
eddiebergman Dec 17, 2021
2cd1d48
Merge branch 'development' into use_new_splitter
eddiebergman Dec 17, 2021
a1cc277
Fixed bad merge
eddiebergman Dec 18, 2021
18 changes: 1 addition & 17 deletions autosklearn/data/validation.py
@@ -1,6 +1,6 @@
# -*- encoding: utf-8 -*-
import logging
-from typing import List, Optional, Tuple, Union, overload
+from typing import List, Optional, Tuple, Union

import numpy as np

@@ -152,22 +152,6 @@ def fit(

return self

-@overload
-def transform(
-self,
-X: SUPPORTED_FEAT_TYPES,
-y: None
-) -> Tuple[Union[np.ndarray, pd.DataFrame, spmatrix], None]:
-...
-
-@overload
-def transform(
-self,
-X: SUPPORTED_FEAT_TYPES,
-y: Union[List, pd.Series, pd.DataFrame, np.ndarray]
-) -> Tuple[Union[spmatrix, pd.DataFrame, np.ndarray], np.ndarray]:
-...

def transform(
self,
X: SUPPORTED_FEAT_TYPES,
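For readers unfamiliar with the deleted pattern: `typing.overload` lets a static checker pick a return type from the argument types, while a single runtime implementation backs all of the declared overloads. A minimal standalone sketch of the pattern (names and signatures hypothetical, not auto-sklearn's actual API):

```python
from typing import Optional, Tuple, overload

import numpy as np


@overload
def transform(X: np.ndarray, y: None) -> Tuple[np.ndarray, None]: ...
@overload
def transform(X: np.ndarray, y: np.ndarray) -> Tuple[np.ndarray, np.ndarray]: ...


def transform(
    X: np.ndarray, y: Optional[np.ndarray] = None
) -> Tuple[np.ndarray, Optional[np.ndarray]]:
    # Single runtime implementation; the overloads above only guide the
    # type checker and can be dropped at the cost of a looser return type.
    return X, y
```

Removing the overloads, as this diff does, trades a little checker precision for simpler code: callers then see the union return type in both cases.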
19 changes: 7 additions & 12 deletions autosklearn/util/data.py
@@ -135,8 +135,8 @@ def predict_RAM_usage(X: np.ndarray, categorical: List[bool]) -> float:


def subsample(
-X: SUPPORTED_FEAT_TYPES,
-y: Union[List, np.ndarray, pd.DataFrame, pd.Series],
+X: Union[np.ndarray, spmatrix],
+y: np.ndarray,
is_classification: bool,
sample_size: Union[float, int],
random_state: Optional[Union[int, np.random.RandomState]] = None,
@@ -154,19 +154,12 @@ def subsample(
Interestingly enough, StratifiedShuffleSplut and descendants don't support
sparse `y` in `split(): _check_array` call. Hence, neither do we.

-NOTE3:
-The core autosklearn library doesn't rely on the full type of X.
-The typing could be reduced to:
-* X: np.ndarray | spmatrix
-* Y: np.ndarray


Parameters
----------
-X: SUPPORTED_FEAT_TYPES
+X: Union[np.ndarray, spmatrix]
The X's to subsample

-Y: List | np.ndarray | pd.DataFrame | Series
+y: np.ndarray
The Y's to subsample

is_classification: bool
@@ -182,7 +175,7 @@

Returns
-------
-(SUPPORTED_FEAT_TYPES, List | np.ndarray | pd.DataFrame | Series)
+(np.ndarray | spmatrix, np.ndarray)
The X and y subsampled according to sample_size
"""
if isinstance(X, List):
@@ -198,6 +191,8 @@
)
left_idxs, _ = next(splitter.split(X=X, y=y))

+# This function supports pandas objects but they won't get here
+# yet as we do not reduce the size of pandas dataframes.
if isinstance(X, pd.DataFrame):
idxs = X.index[left_idxs]
X = X.loc[idxs]
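A simplified sketch of the subsampling approach above, using scikit-learn's stock `ShuffleSplit`/`StratifiedShuffleSplit` (the PR itself routes through a `CustomStratifiedShuffleSplit`; this only illustrates the idea):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit


def subsample_sketch(X, y, is_classification, sample_size, random_state=None):
    """Return a subsample of (X, y); stratified when classifying."""
    cls = StratifiedShuffleSplit if is_classification else ShuffleSplit
    splitter = cls(n_splits=1, train_size=sample_size, random_state=random_state)
    # Keep the "train" side of a single split as the subsample.
    keep_idxs, _ = next(splitter.split(X=X, y=y))
    return X[keep_idxs], y[keep_idxs]
```

As NOTE2 in the docstring says, the stratified splitters reject sparse `y` in `split()`, which is one reason the narrowed typing insists on a dense `np.ndarray` target.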
4 changes: 3 additions & 1 deletion test/test_data/test_target_validator.py
@@ -215,10 +215,12 @@ def dtype(arr):
# These next part of the tests rely on some encoding to have taken place
# This happens when `is_classification` and not task_type = multilabel-indicator
#
-# TargetValidator._fit()
+# As state in TargetValidator._fit()
# > Also, encoding multilabel indicator data makes the data multiclass
# Let the user employ a MultiLabelBinarizer if needed
#
+# As a result of this, we don't encode 'multilabel-indicator' labels and
+# there is nothing else to check here
if validator.type_of_target == 'multilabel-indicator':
assert validator.encoder is None

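Since the comment above defers multilabel encoding to the user, here is what that looks like with scikit-learn's `MultiLabelBinarizer` (illustrative data, not from the test suite):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Turn per-sample label sets into a 0/1 indicator matrix; this is the
# encoding the validator deliberately leaves to the user.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform([{"red", "blue"}, {"blue"}, set()])
# Columns follow mlb.classes_, which is sorted: ['blue', 'red']
```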
33 changes: 19 additions & 14 deletions test/test_util/test_data.py
@@ -166,6 +166,8 @@ def test_reduce_precision_correctly_reduces_precision(X, dtype, x_type):
expected: Dict[type, type] = {
np.float32: np.float32,
np.float64: np.float32,
+np.dtype('float32'): np.float32,
+np.dtype('float64'): np.float32
}
if hasattr(np, 'float96'):
expected[np.float96] = np.float64
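The two added keys cover lookups keyed by a `np.dtype` instance rather than by the scalar type itself; the two compare equal but are distinct objects:

```python
import numpy as np

x = np.zeros(3, dtype="float64")
# `x.dtype` is an np.dtype instance, not the np.float64 scalar type,
# although the two compare equal.
assert x.dtype == np.float64
assert isinstance(x.dtype, np.dtype)
assert x.dtype is not np.float64
```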
@@ -189,8 +191,8 @@ def test_reduce_precision_with_unsupported_dtypes(X, dtype):
with pytest.raises(ValueError) as err:
reduce_precision(X)

-expected = f"X.dtype = {dtype} not equal to any supported {supported_precision_reductions}"
-assert err.value == expected
+expected = f"X.dtype = {X.dtype} not equal to any supported {supported_precision_reductions}"
+assert err.value.args[0] == expected


@parametrize("X", [
@@ -215,15 +217,18 @@ def test_reduce_dataset_reduces_size_and_precision(
random_state = 0
memory_limit = 1 # Force reductions

-X_out, y_out = reduce_dataset_size_if_too_large(
-X=X,
-y=y,
-random_state=random_state,
-memory_limit=memory_limit,
-operations=operations,
-multiplier=multiplier,
-is_classification=is_classification,
-)
+with warnings.catch_warnings():
+warnings.filterwarnings("ignore")
+
+X_out, y_out = reduce_dataset_size_if_too_large(
+X=X,
+y=y,
+random_state=random_state,
+memory_limit=memory_limit,
+operations=operations,
+multiplier=multiplier,
+is_classification=is_classification,
+)

def bytes(arr):
return arr.nbytes if isinstance(arr, np.ndarray) else arr.data.nbytes
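The `catch_warnings` wrapper added above silences the expected reduction warnings only inside its block; a minimal sketch of the pattern:

```python
import warnings


def noisy_reduce():
    # Stand-in for a function that warns when it shrinks a dataset.
    warnings.warn("dataset was reduced", UserWarning)
    return "reduced"


with warnings.catch_warnings():
    warnings.filterwarnings("ignore")  # suppressed only inside this block
    result = noisy_reduce()
# The previous warning filters are restored on exit from the block.
```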
@@ -254,8 +259,8 @@ def test_reduce_dataset_invalid_dtype_for_precision_reduction():
is_classification=False
)

-expected_err = f"Unsupported type `{dtype}` for precision reduction"
-assert err.value == expected_err
+expected_err = f"Unsupported type `{X.dtype}` for precision reduction"
+assert err.value.args[0] == expected_err


def test_reduce_dataset_invalid_operations():
@@ -272,7 +277,7 @@
)

expected_err = f"Unknown operation `{invalid_op}`"
-assert err.value == expected_err
+assert err.value.args[0] == expected_err
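The corrected assertions compare the exception message rather than the `ExceptionInfo` wrapper; the distinction in isolation:

```python
import pytest


def fail():
    raise ValueError("Unknown operation `bad-op`")


with pytest.raises(ValueError) as err:
    fail()

# `err` is an ExceptionInfo wrapper; `err.value` is the raised exception.
# Comparing `err.value` to a string is always False, which is why the
# tests now compare `err.value.args[0]` (equivalently, `str(err.value)`).
assert err.value.args[0] == "Unknown operation `bad-op`"
assert str(err.value) == "Unknown operation `bad-op`"
```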


@pytest.mark.parametrize(
Expand Down