imblearn.over_sampling.SMOTENC
class imblearn.over_sampling.SMOTENC(categorical_features, sampling_strategy='auto', random_state=None, k_neighbors=5, n_jobs=1)
Synthetic Minority Over-sampling Technique for Nominal and Continuous (SMOTE-NC).
Unlike SMOTE, SMOTE-NC handles datasets containing both continuous and categorical features.
Read more in the User Guide.
Parameters: - categorical_features : ndarray, shape (n_cat_features,) or (n_features,)
Specifies which features are categorical. Can either be:
- array of indices specifying the categorical features;
- mask array of shape (n_features,) and bool dtype for which True indicates the categorical features.
- {sampling_strategy}
- {random_state}
- k_neighbors : int or object, optional (default=5)
If int, the number of nearest neighbours used to construct synthetic samples. If object, an estimator that inherits from sklearn.neighbors.base.KNeighborsMixin that will be used to find the k_neighbors.
- n_jobs : int, optional (default=1)
The number of threads to open if possible.
See also
Notes
See the original paper [1] for more details.
Supports multi-class resampling. A one-vs.-rest scheme is used as originally proposed in [1].
See Comparison of the different over-sampling algorithms, and sphx_glr_auto_examples_over-sampling_plot_smote.py.
References
[1] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of artificial intelligence research, 321-357, 2002.
Examples
>>> from collections import Counter
>>> from numpy.random import RandomState
>>> from sklearn.datasets import make_classification
>>> from imblearn.over_sampling import SMOTENC
>>> X, y = make_classification(n_classes=2, class_sep=2,
... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape (%s, %s)' % X.shape)
Original dataset shape (1000, 20)
>>> print('Original dataset samples per class {}'.format(Counter(y)))
Original dataset samples per class Counter({1: 900, 0: 100})
>>> # simulate the 2 last columns to be categorical features
>>> X[:, -2:] = RandomState(10).randint(0, 4, size=(1000, 2))
>>> sm = SMOTENC(random_state=42, categorical_features=[18, 19])
>>> X_res, y_res = sm.fit_resample(X, y)
>>> print('Resampled dataset samples per class {}'.format(Counter(y_res)))
Resampled dataset samples per class Counter({0: 900, 1: 900})
__init__(categorical_features, sampling_strategy='auto', random_state=None, k_neighbors=5, n_jobs=1)
Initialize self. See help(type(self)) for accurate signature.
fit(X, y)
Check inputs and statistics of the sampler.
You should use fit_resample in all cases.
Parameters: - X : {array-like, sparse matrix}, shape (n_samples, n_features)
Data array.
- y : array-like, shape (n_samples,)
Target array.
Returns: - self : object
Return the instance itself.
fit_resample(X, y)
Resample the dataset.
Parameters: - X : {array-like, sparse matrix}, shape (n_samples, n_features)
Matrix containing the data to be resampled.
- y : array-like, shape (n_samples,)
Corresponding label for each sample in X.
Returns: - X_resampled : {array-like, sparse matrix}, shape (n_samples_new, n_features)
The array containing the resampled data.
- y_resampled : array-like, shape (n_samples_new,)
The corresponding label of X_resampled.
fit_sample(X, y)
Resample the dataset.
Parameters: - X : {array-like, sparse matrix}, shape (n_samples, n_features)
Matrix containing the data to be resampled.
- y : array-like, shape (n_samples,)
Corresponding label for each sample in X.
Returns: - X_resampled : {array-like, sparse matrix}, shape (n_samples_new, n_features)
The array containing the resampled data.
- y_resampled : array-like, shape (n_samples_new,)
The corresponding label of X_resampled.
get_params(deep=True)
Get parameters for this estimator.
Parameters: - deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: - params : mapping of string to any
Parameter names mapped to their values.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
Returns: - self