featureprocessing Package

ComBat Module

WORC.featureprocessing.ComBat.ComBat(features_train_in, labels_train, config, features_train_out, features_test_in=None, labels_test=None, features_test_out=None, VarianceThreshold=True, scaler=False, logarithmic=False)[source]

Apply ComBat feature harmonization.

Based on: https://github.com/Jfortin1/ComBatHarmonization

WORC.featureprocessing.ComBat.ComBatMatlab(dat, batch, command, mod=None, par=1, per_feature='true')[source]

Run the ComBat Function Matlab script.

par = 0 is non-parametric.

WORC.featureprocessing.ComBat.ComBatPython(dat, batch, mod=None, par=1, eb=1, per_feature=False, plotting=False)[source]

Run the ComBat Function python script.

par = 0 is non-parametric.

WORC.featureprocessing.ComBat.Synthetictest(n_patients=50, n_features=10, par=1, eb=1, per_feature=False, difscale=False, logarithmic=False, oddpatient=True, oddfeat=True, samefeat=True)[source]

Test for ComBat with Synthetic data.

Decomposition Module

WORC.featureprocessing.Decomposition.Decomposition(features, patientinfo, config, output, label_type=None, verbose=True)[source]

Perform decompositions to two components of the feature space.

Usage is similar to StatisticalTestFeatures.

Parameters

features: string, mandatory

Contains the paths to all .hdf5 feature files used, formatted as modalityname1=file1,file2,file3,… modalityname2=file1,… Thus, modality names are always between a space and an equal sign, and files are split by commas. We assume that the lists of files for each modality have the same length. Files at the same position in each list should belong to the same patient.

patientinfo: string, mandatory

Contains the path referring to a .txt file containing the patient label(s) and value(s) to be used for learning. See the Github Wiki for the format.

config: string, mandatory

path referring to a .ini file containing the parameters used for feature extraction. See the Github Wiki for the possible fields and their description.

# TODO: outputs

verbose: boolean, default True

print final feature values and labels to command line or not.
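The format of the features string described above can be illustrated with a small parser. This is only a sketch: parse_feature_string is a hypothetical helper, not part of WORC, the file names are made up, and it assumes that paths contain no spaces or commas.

```python
def parse_feature_string(features):
    """Split 'mod1=f1,f2 mod2=f1,f2' into {'mod1': [...], 'mod2': [...]}."""
    modalities = {}
    for chunk in features.split():
        # The modality name sits between a space and the equal sign.
        name, _, files = chunk.partition("=")
        modalities[name] = files.split(",")
    # The description assumes equally long file lists, one entry per patient.
    lengths = {len(files) for files in modalities.values()}
    if len(lengths) > 1:
        raise ValueError("All modalities must list the same number of files")
    return modalities


parsed = parse_feature_string(
    "MR=pat1_MR.hdf5,pat2_MR.hdf5 CT=pat1_CT.hdf5,pat2_CT.hdf5"
)
# Files at the same position in each list belong to the same patient.
patients = list(zip(*parsed.values()))
```

Here patients holds one tuple of files per patient, which mirrors the positional pairing the description relies on.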

FeatureConverter Module

WORC.featureprocessing.FeatureConverter.FeatureConverter(feat_in, toolbox, config, feat_out)[source]

Convert features as extracted by a third-party toolbox to WORC format.

Parameters

feat_in: string

Path to input feature file as outputted by the feature extraction toolbox.

toolbox: string

Name of toolbox from which features are extracted.

config: string

Path to .ini file containing the configuration for this function.

feat_out: string

Path to .hdf5 file to which converted features should be saved

WORC.featureprocessing.FeatureConverter.convert_PREDICT(features, feat_out)[source]

Convert features from PREDICT toolbox to WORC compatible format.

As PREDICT is the WORC default toolbox, we only need to add the name of the toolbox.

WORC.featureprocessing.FeatureConverter.convert_pyradiomics(features, feat_out=None)[source]

Convert features from PyRadiomics toolbox to WORC compatible format.

WORC.featureprocessing.FeatureConverter.convert_pyradiomics_featurevector(featureVector)[source]

Convert a PyRadiomics feature vector to WORC compatible features.

ICCThreshold Module

class WORC.featureprocessing.ICCThreshold.ICCThreshold(ICCtype='intra', threshold=0.75)[source]

Bases: BaseEstimator, SelectorMixin

Object to fit feature selection based on intra- or inter-class correlation coefficient as defined by

Shrout, Patrick E., and Joseph L. Fleiss. “Intraclass correlations: uses in assessing rater reliability.” Psychological bulletin 86.2 (1979): 420. http://rokwa.x-y.net/Shrout-Fleiss-ICC.pdf

For the intra-class ICC, we use ICC(3,1). For the inter-class ICC, we should use ICC(2,1) according to the definitions in the paper, but following the radiomics literature (https://www.tandfonline.com/doi/pdf/10.1080/0284186X.2018.1445283?needAccess=true, https://www.tandfonline.com/doi/pdf/10.3109/0284186X.2013.812798?needAccess=true), we use ICC(3,1) anyway.

The default threshold of 0.75 is also based on the literature mentioned above.
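As a rough illustration of ICC(3,1) in the Shrout and Fleiss sense, the sketch below computes it for one feature from the two-way ANOVA mean squares and compares it with the default threshold. The function icc_3_1 and the ratings table are illustrative assumptions, not the WORC implementation, and whether WORC compares with >= or > is not stated in this documentation.

```python
def icc_3_1(ratings):
    """ICC(3,1) for a table with one row per object and one column per observer."""
    n = len(ratings)      # number of objects, e.g. patients
    k = len(ratings[0])   # number of observers
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)                # between-objects mean square
    ems = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems)


# One feature, measured on six objects by four observers (illustrative values).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc_3_1(ratings)
selected = icc >= 0.75  # assumed comparison against the default threshold
```

For this table the ICC is roughly 0.71, so the feature would fall just below the default threshold of 0.75.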

__init__(ICCtype='intra', threshold=0.75)[source]

Parameters

ICCtype: string, default ‘intra’

Type of ICC used. intra results in ICC(3,1), inter in ICC(2,1)

threshold: float, default 0.75

Threshold for ICC-value in order for feature to be selected

fit(X_trains)[source]

Select only features specified by the metric and threshold per patient.

Parameters

X_trains: numpy array, mandatory

Array containing feature values used for model_selection. Number of objects on first axis, features on second axis, observers on third axis.

Y_train: numpy array, mandatory

Array containing the binary labels for each object in X_train.

transform(inputarray)[source]

Transform the inputarray to select only the features based on the result from the fit function.

Parameters

inputarray: numpy array, mandatory

Array containing the items to use selection on. The type of item in this list does not matter, e.g. floats, strings etc.

WORC.featureprocessing.ICCThreshold.convert_features_ICC_threshold(features_in, csv_out=None, features_out=None, threshold=0.75)[source]

For features from multiple observers, compute ICC, return values, and optionally apply thresholding and save output.

Parameters

features_in: list

Contains one list per observer.

csv_out: csv file

Name of the file to which the ICC values should be written.

features_out: list

Contains the file names of the output features.

Imputer Module

class WORC.featureprocessing.Imputer.Imputer(missing_values='nan', strategy='mean', n_neighbors=5)[source]

Bases: object

Module for feature imputation.

__init__(missing_values='nan', strategy='mean', n_neighbors=5)[source]

Imputation of feature values using either sklearn, missingpy or (WIP) fancyimpute approaches.

Parameters

missing_values: number, string, np.nan (default) or None

The placeholder for the missing values. All occurrences of missing_values will be imputed.

strategy: string, optional (default=“mean”)

The imputation strategy.

Supported using sklearn:

  • If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.

  • If “median”, then replace missing values using the median along each column. Can only be used with numeric data.

  • If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data.

  • If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.

Supported using missingpy:

  • If “knn”, then use a nearest neighbor search. Can be used with strings or numeric data.

WIP: More strategies using fancyimpute

n_neighbors: int, optional (default = 5)

Number of neighboring samples to use for imputation if method is knn.
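To make the column-wise “mean” and “median” strategies concrete, here is a minimal pure-Python sketch. The helper impute_columns is hypothetical and only mimics the idea; the actual class delegates to sklearn or missingpy as described above.

```python
import math
import statistics


def impute_columns(rows, strategy="mean"):
    """Replace NaN entries by the per-column mean or median of observed values."""
    reducers = {"mean": statistics.fmean, "median": statistics.median}
    reducer = reducers[strategy]
    fill_values = []
    for j in range(len(rows[0])):
        # Only the non-missing values in a column determine its fill value.
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        fill_values.append(reducer(observed))
    return [
        [fill_values[j] if math.isnan(value) else value
         for j, value in enumerate(row)]
        for row in rows
    ]


nan = float("nan")
X = [[1.0, 10.0], [nan, 20.0], [3.0, nan], [8.0, 60.0]]
X_mean = impute_columns(X, "mean")      # fills with 4.0 and 30.0
X_median = impute_columns(X, "median")  # fills with 3.0 and 20.0
```

The difference between the two results shows why the median can be preferable when a column contains outliers.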

fit(X, y=None)[source]
transform(X)[source]

OneHotEncoderWrapper Module

class WORC.featureprocessing.OneHotEncoderWrapper.OneHotEncoderWrapper(feature_labels_tofit, handle_unknown='ignore', verbose=False)[source]

Bases: object

Module for OneHotEncoding features.

__init__(feature_labels_tofit, handle_unknown='ignore', verbose=False)[source]

Init preprocessor of features.

fit(X, feature_labels, y=None)[source]

Fit OneHotEncoder for labels in feature_labels.

transform(inputarray)[source]

Transform feature array.

Transform the inputarray to select only the features based on the result from the fit function.

Parameters

inputarray: numpy array, mandatory

Array containing the items to use selection on. The type of item in this list does not matter, e.g. floats, strings etc.

WORC.featureprocessing.OneHotEncoderWrapper.test()[source]

Test OneHotEncoderWrapper object.
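The idea behind one-hot encoding with handle_unknown='ignore' can be sketched without any dependencies. The helpers fit_categories and one_hot are hypothetical and are not the wrapper's API; the sketch assumes the sklearn convention that a category not seen during fitting is encoded as all zeros.

```python
def fit_categories(values):
    """Collect the sorted set of categories seen during fitting."""
    return sorted(set(values))


def one_hot(values, categories):
    """Encode each value as a 0/1 row; unknown values become all zeros."""
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        if value in index:  # unseen categories are ignored
            row[index[value]] = 1
        encoded.append(row)
    return encoded


categories = fit_categories(["male", "female", "female"])
encoded = one_hot(["female", "male", "unknown"], categories)
```

The third row stays all zeros because "unknown" was not among the fitted categories.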

Preprocessor Module

class WORC.featureprocessing.Preprocessor.Preprocessor(verbose=True)[source]

Bases: object

Module for feature preprocessing.

Currently implemented:
  • Remove features with > 80% NaNs

__init__(verbose=True)[source]

Init preprocessor of features.

fit(X, y=None, feature_labels=None)[source]

Select columns with too many missing values (>80%).

transform(inputarray)[source]

Transform feature array.

Transform the inputarray to select only the features based on the result from the fit function.

Parameters

inputarray: numpy array, mandatory

Array containing the items to use selection on. The type of item in this list does not matter, e.g. floats, strings etc.
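The rule of removing features with more than 80% NaNs can be sketched as a fit step that records which columns to keep and a transform step that applies the selection. The helper names are hypothetical, and the sketch assumes the NaN fraction is computed per feature over all objects.

```python
import math


def columns_to_keep(rows, max_nan_fraction=0.8):
    """Return indices of columns whose NaN fraction does not exceed the limit."""
    n_rows = len(rows)
    keep = []
    for j in range(len(rows[0])):
        n_nan = sum(math.isnan(row[j]) for row in rows)
        if n_nan / n_rows <= max_nan_fraction:
            keep.append(j)
    return keep


def select_columns(rows, keep):
    """Apply a previously determined column selection."""
    return [[row[j] for j in keep] for row in rows]


nan = float("nan")
X = [
    [1.0, nan, nan],
    [2.0, nan, nan],
    [3.0, nan, nan],
    [4.0, nan, nan],
    [5.0, nan, 7.0],
]
keep = columns_to_keep(X)          # column 1 is 100% NaN and is dropped
X_selected = select_columns(X, keep)
```

Column 2 is kept because exactly 80% missing does not exceed the limit; only a strictly larger fraction leads to removal in this sketch.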

Relief Module

class WORC.featureprocessing.Relief.SelectMulticlassRelief(n_neighbours=3, sample_size=1, distance_p=2, numf=None, random_state=None)[source]

Bases: BaseEstimator, SelectorMixin

Object to fit feature selection based on the type group the feature belongs to. The label for the feature is used for this procedure.

__init__(n_neighbours=3, sample_size=1, distance_p=2, numf=None, random_state=None)[source]

Parameters

n_neighbours: integer

Number of nearest neighbours used.

sample_size: float

Percentage of samples used to calculate score

distance_p: integer

Parameter of the Minkowski distance used for the nearest neighbour calculation

numf: integer, default None

Number of important features to be selected with respect to their ranking. If None, all are used.

fit(X, y, random_state=None)[source]

Select only features specified by parameters per patient.

Parameters

feature_values: numpy array, mandatory

Array containing feature values used for model_selection. Number of objects on first axis, features on second axis.

feature_labels: list, mandatory

Contains the labels of all features used. The index in this list will be used in the transform function to select features.

multi_class_relief(feature_set, label_set, nb=3, sample_size=1, distance_p=2, numf=None, random_state=None)[source]
single_class_relief(feature_set, label_set, nb=3, sample_size=1, distance_p=2, numf=None, random_state=None)[source]
transform(inputarray)[source]

Transform the inputarray to select only the features based on the result from the fit function.

Parameters

inputarray: numpy array, mandatory

Array containing the items to use selection on. The type of item in this list does not matter, e.g. floats, strings etc.

Scalers Module

class WORC.featureprocessing.Scalers.LogStandardScaler(*, copy=True, with_mean=True, with_std=True)[source]

Bases: StandardScaler

Scale features using z-score and a logarithmic transform.

This scaler first applies a logarithmic transform to each feature before applying a z-score, i.e. the standard scaler. To handle negative and zero values, a constant is added before applying the logarithm:

l_ij = log(f_ij - min(F_j) + median(F_j) - min(F_j))

Z_ij = (l_ij - mu) / sigma

Based on https://arxiv.org/pdf/2012.06875v1.pdf.
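The two formulas above can be traced for a single feature column with a short sketch. The function log_standard_scale is hypothetical, and it assumes that mu and sigma are the mean and population standard deviation of the log-transformed values, as in a standard z-score.

```python
import math
import statistics


def log_standard_scale(column):
    """Apply log(f - 2 * min + median) followed by a z-score to one feature."""
    minimum = min(column)
    median = statistics.median(column)
    # f - min + median - min, written as a single additive constant.
    constant = median - 2 * minimum
    logged = [math.log(f + constant) for f in column]
    mu = statistics.fmean(logged)
    sigma = statistics.pstdev(logged)  # population standard deviation
    return [(value - mu) / sigma for value in logged]


# Zero and negative values are shifted into the positive range first.
z = log_standard_scale([0.0, 1.0, 3.0])
z_negative = log_standard_scale([-5.0, 0.0, 10.0])
```

Note that if the median of a feature equals its minimum, the smallest value is shifted to exactly zero and the logarithm is undefined, so such features would need separate handling.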

fit(X, y=None)[source]

Compute the mean and std to be used for later scaling.

Parameters

X: {array-like, sparse matrix}, shape [n_samples, n_features]

The data used to compute the mean and standard deviation used for later scaling along the features axis.

y

Ignored

class WORC.featureprocessing.Scalers.RobustStandardScaler(*, copy=True, with_mean=True, with_std=True)[source]

Bases: StandardScaler

Scale features using z-score that is robust to outliers.

This scaler removes outliers (<5th and >95th percentile) and afterwards uses z-scoring to scale the features.

This scaler is thus a combination of the RobustScaler and StandardScaler from sklearn; see their respective documentation for more information:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html

fit(X, y=None)[source]

Compute the mean and std to be used for later scaling.

Note: if over 80% of the features are excluded in robustness, we switch to the standardscaler, as otherwise all numbers will be NaN after scaling.

Parameters

X: {array-like, sparse matrix}, shape [n_samples, n_features]

The data used to compute the mean and standard deviation used for later scaling along the features axis.

y

Ignored
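A minimal sketch of the robust z-score idea for one feature: the mean and standard deviation are estimated only from values between the 5th and 95th percentile, and then applied to all values. The function robust_z_score is hypothetical, and the percentile definition (statistics.quantiles with the inclusive method) is an assumption that may differ from the actual implementation.

```python
import statistics


def robust_z_score(column, lower=5, upper=95):
    """Z-score a feature using mean and std estimated without the tails."""
    # 99 cut points; index p - 1 holds the p-th percentile.
    cuts = statistics.quantiles(column, n=100, method="inclusive")
    low, high = cuts[lower - 1], cuts[upper - 1]
    inliers = [value for value in column if low <= value <= high]
    mu = statistics.fmean(inliers)
    sigma = statistics.pstdev(inliers)
    # All values are scaled, including the ones excluded from the estimate.
    return [(value - mu) / sigma for value in column], mu, sigma


# Values 0..19 plus one extreme outlier.
column = list(range(20)) + [1000]
scaled, mu, sigma = robust_z_score(column)
```

The outlier at 1000 does not influence mu or sigma here, whereas a plain z-score would be dominated by it.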

class WORC.featureprocessing.Scalers.WORCScaler(method='robust_z_score', skip_features=None, verbose=False)[source]

Bases: TransformerMixin, BaseEstimator

Scale features using an sklearn scaler.

Additionally, several features can be excluded from scaling. This is mostly useful when categorical features, such as patient sex, are also used.

__init__(method='robust_z_score', skip_features=None, verbose=False)[source]

Initialize object.

Parameters

method: string

Name of scaler used: robust_z_score, z_score, robust, or minmax

skip_features: list of strings

If any of these elements occur as substring in a feature label, this feature is excluded.
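The substring rule for skip_features can be illustrated as follows. The helper columns_to_scale and the labels semf_sex and semf_age are made up for this sketch; only the matching logic is the point.

```python
def columns_to_scale(feature_labels, skip_features):
    """Return indices of features whose label contains none of the substrings."""
    return [
        i for i, label in enumerate(feature_labels)
        if not any(skip in label for skip in skip_features)
    ]


# Hypothetical feature labels; the last two are treated as categorical.
labels = ["hf_mean", "sf_compactness", "semf_sex", "semf_age"]
to_scale = columns_to_scale(labels, skip_features=["semf_"])
```

Only the columns listed in to_scale would be passed to the scaler; the remaining columns are left unchanged.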

fit(X_train, feature_labels=None)[source]

Fit the scaler.

transform(X_test)[source]

Transform feature values with fitted scaler.

WORC.featureprocessing.Scalers.test()[source]

Test Scaling.

SelectGroups Module

class WORC.featureprocessing.SelectGroups.SelectGroups(parameters, toolboxes=['PREDICT'])[source]

Bases: BaseEstimator, SelectorMixin

Object to fit feature selection based on the type group the feature belongs to. The label for the feature is used for this procedure.

The following groups can be selected, and are detected through looking for the following substrings in the feature label:

__init__(parameters, toolboxes=['PREDICT'])[source]

Parameters

parameters: dict, mandatory

Contains the settings for the groups to be selected. Should contain the settings for the following groups:

  • histogram_features
  • shape_features
  • orientation_features
  • semantic_features
  • dicom_features
  • coliage_features
  • phase_features
  • vessel_features
  • texture_Gabor_features
  • texture_GLCM_features
  • texture_GLCMMS_features
  • texture_GLRLM_features
  • texture_GLSZM_features
  • texture_GLDZM_features
  • texture_NGTDM_features
  • texture_NGLDM_features
  • texture_LBP_features
  • fractal_features
  • location_features
  • RGRD_features

Also, should contain a parameter for selecting per feature toolbox:

  • PREDICT
  • PyRadiomics

And a parameter to select whether transformations have been applied:

  • original_features
  • wavelet_features
  • log_features

fit(feature_labels)[source]

Select only features specified by parameters per patient.

Parameters

feature_labels: list, optional

Contains the labels of all features used. The index in this list will be used in the transform function to select features.

transform(inputarray)[source]

Transform the inputarray to select only the features based on the result from the fit function.

Parameters

inputarray: numpy array, mandatory

Array containing the items to use selection on. The type of item in this list does not matter, e.g. floats, strings etc.

SelectIndividuals Module

class WORC.featureprocessing.SelectIndividuals.SelectIndividuals(parameters=['hf_mean', 'sf_compactness'])[source]

Bases: BaseEstimator, SelectorMixin

Object to fit feature selection based on the type group the feature belongs to. The label for the feature is used for this procedure.

__init__(parameters=['hf_mean', 'sf_compactness'])[source]

Parameters

parameters: dict, mandatory

Contains the settings for the groups to be selected. Should contain the settings for the following groups:

  • histogram_features
  • shape_features
  • orientation_features
  • semantic_features
  • patient_features
  • coliage_features
  • phase_features
  • vessel_features
  • log_features
  • texture_features

fit(feature_labels)[source]

Select only features specified by parameters per patient.

Parameters

feature_labels: list, optional

Contains the labels of all features used. The index in this list will be used in the transform function to select features.

transform(inputarray)[source]

Transform the inputarray to select only the features based on the result from the fit function.

Parameters

inputarray: numpy array, mandatory

Array containing the items to use selection on. The type of item in this list does not matter, e.g. floats, strings etc.

StatisticalTestFeatures Module

WORC.featureprocessing.StatisticalTestFeatures.StatisticalTestFeatures(features, patientinfo, config, output_csv=None, output_png=None, output_tex=None, plot_test='MWU', Bonferonni=True, fontsize='small', yspacing=1, threshold=0.05, verbose=True, label_type=None)[source]

Perform several statistical tests on features, such as a Student's t-test.

Parameters

features: string, mandatory

Contains the paths to all .hdf5 feature files used, formatted as modalityname1=file1,file2,file3,… modalityname2=file1,… Thus, modality names are always between a space and an equal sign, and files are split by commas. We assume that the lists of files for each modality have the same length. Files at the same position in each list should belong to the same patient.

patientinfo: string, mandatory

Contains the path referring to a .txt file containing the patient label(s) and value(s) to be used for learning. See the Github Wiki for the format.

config: string, mandatory

path referring to a .ini file containing the parameters used for feature extraction. See the Github Wiki for the possible fields and their description.

# TODO: outputs

verbose: boolean, default True

print final feature values and labels to command line or not.

StatisticalTestThreshold Module

class WORC.featureprocessing.StatisticalTestThreshold.StatisticalTestThreshold(metric='ttest', threshold=0.05)[source]

Bases: BaseEstimator, SelectorMixin

Object to fit feature selection based on statistical tests.

__init__(metric='ttest', threshold=0.05)[source]

Parameters

metric: string, default ‘ttest’

Statistical test used for selection. Options are ttest, Welch, Wilcoxon, MannWhitneyU

threshold: float, default 0.05

Threshold for p-value in order for feature to be selected

fit(X_train, Y_train)[source]

Select only features specified by the metric and threshold per patient.

Parameters

X_train: numpy array, mandatory

Array containing feature values used for model_selection. Number of objects on first axis, features on second axis.

Y_train: numpy array, mandatory

Array containing the binary labels for each object in X_train.

transform(inputarray)[source]

Transform the inputarray to select only the features based on the result from the fit function.

Parameters

inputarray: numpy array, mandatory

Array containing the items to use selection on. The type of item in this list does not matter, e.g. floats, strings etc.

VarianceThreshold Module

class WORC.featureprocessing.VarianceThreshold.VarianceThresholdMean(threshold)[source]

Bases: BaseEstimator, SelectorMixin

Select features based on variance among objects. Similar to VarianceThreshold from sklearn, but does take the mean of the feature into account.

__init__(threshold)[source]
fit(image_features)[source]
transform(inputarray)[source]

Transform the inputarray to select only the features based on the result from the fit function.

Parameters

inputarray: numpy array, mandatory

Array containing the items to use selection on. The type of item in this list does not matter, e.g. floats, strings etc.

WORC.featureprocessing.VarianceThreshold.selfeat_variance(image_features, labels=None, thresh=0.99, method='nomean')[source]

Select features using a variance threshold.

Parameters

image_features: numpy array, mandatory

Array containing the feature values to apply the variance threshold selection on. The rows correspond to the patients, the columns to the features.

labels: numpy array, optional

Array containing the labels of the corresponding features. Array should therefore have the same shape as the image_features array.

thresh: float, default 0.99

Threshold to be used as lower boundary for feature variance among patients.

method: string, default nomean.

Method to use for selection. Default: do not use the mean of the features. Other valid option is ‘mean’.

Returns

image_features: numpy array

Transformed features array.

labels: list or None

When labels are given, returns the transformed labels. That object contains a list of all label names kept.

sel: VarianceThreshold object

The fitted variance threshold object.
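A generic variance threshold, in the spirit of the “nomean” method, can be sketched as below. The function variance_select is hypothetical; it compares the population variance of each feature directly with the given threshold, which is an assumption and not necessarily how thresh is interpreted internally. The “mean” variant is not covered.

```python
import statistics


def variance_select(image_features, labels=None, threshold=0.0):
    """Keep features whose variance among patients exceeds the threshold."""
    n_features = len(image_features[0])
    variances = [
        statistics.pvariance([row[j] for row in image_features])
        for j in range(n_features)
    ]
    keep = [j for j, variance in enumerate(variances) if variance > threshold]
    selected = [[row[j] for j in keep] for row in image_features]
    # Labels, when given, are filtered with the same column selection.
    kept_labels = None if labels is None else [labels[j] for j in keep]
    return selected, kept_labels


# Rows are patients, columns are features (illustrative values and labels).
X = [[1.0, 5.0, 0.0], [2.0, 5.0, 1.0], [3.0, 5.0, 0.0]]
names = ["feature_a", "feature_b", "feature_c"]
X_selected, names_selected = variance_select(X, names, threshold=0.5)
```

The constant second feature and the nearly constant third feature are dropped, and the labels are reduced in the same way as the feature columns.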