Configuration

Introduction

WORC has defaults for all settings so it can be run out of the box to test the examples. However, you may want to alter the fastr configuration to your system settings, e.g. to set the locations of your input and output folders and the degree of parallelization of the execution.

Fastr will search for a config file named config.py in the $FASTRHOME directory (which defaults to ~/.fastr/ if it is not set); hence, if $FASTRHOME is set, ~/.fastr/ is ignored. Additionally, .py files from the $FASTRHOME/config.d folder are parsed as well. You will see that upon installation, WORC has already put a WORC_config.py file in the config.d folder.

As WORC and the default tools used are mostly Python based, we’ve chosen to put our configuration in a configparser object. This has several advantages:

  1. The object can be treated as a Python dictionary and thus is easily adjusted.

  2. Each tool can be set to parse only specific parts of the configuration, enabling us to supply one file to all tools instead of needing many parameter files.

Creation and interaction

The default configuration is generated through the WORC.defaultconfig() function. You can then change fields as you would in a dictionary and append the result to the configs source:

>>> network = WORC.WORC('somename')
>>> config = network.defaultconfig()
>>> config['Classification']['classifier'] = 'RF'
>>> network.configs.append(config)

When executing the WORC.set() command, the config objects are saved as .ini files in the WORC.fastr_tempdir folder and added to the WORC.fastrconfigs() source.

Below are some details on several of the fields in the configuration. Note that for many of the fields, we currently only provide one default value. However, when adding your own tools, these fields can be adjusted to your specific settings.

WORC performs Combined Algorithm Selection and Hyperparameter optimization (CASH). The configuration determines how the optimization is performed and which hyperparameters and models will be included. Repeating specific models/parameters in the config will make them more likely to be used, e.g.

>>> config['Classification']['classifiers'] = 'SVM, SVM, LR'

means that the SVM is twice as likely to be tested in the model selection as LR.

Note

All fields in the config must be supplied as strings. A list can be created by separating the values with commas.
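
For example, to supply a list of two scaling methods as a single comma-separated string (both values appear in the FeatureScaling options below):

>>> config['FeatureScaling']['scaling_method'] = 'robust_z_score, z_score'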

Contents

The config object can be indexed as config[key][subkey] = value. The various keys, subkeys, and the values (description, defaults and options) can be found below.

Bootstrap, Classification, ComBat, CrossValidation, Ensemble, Evaluation, FeatPreProcess, Featsel, FeatureScaling, Fingerprinting, General, HyperOptimization, ImageFeatures, Imputation, Labels, OneHotEncoding, Preprocessing, PyRadiomics, Resampling, SMAC, Segmentix, SelectFeatGroup

Details on each section of the config can be found below.

General

These fields contain general settings for when using WORC. For more information on the joblib settings, which are used in the joblib Parallel function, see the joblib documentation. When you run WORC on a cluster where only a single core can be used per node, e.g. the BIGR cluster, use only 1 core and the threading backend.

Note

If you want to override configuration fields that are fingerprinted, e.g. the preprocessing, turn the fingerprinting off.

Description:

cross_validation: Determine whether a cross-validation will be performed or not. Obsolete, will be removed.
Segmentix: Determine whether to use the segmentix tool for segmentation preprocessing.
FeatureCalculators: Specifies which feature calculation tools should be used. A list can be provided to use multiple tools.
Preprocessing: Specifies which tool will be used for image preprocessing.
RegistrationNode: Specifies which tool will be used for image registration.
TransformationNode: Specifies which tool will be used for applying image transformations.
Joblib_ncores: Number of cores to be used by joblib for multicore processing.
Joblib_backend: Type of backend to be used by joblib for multicore processing.
tempsave: Determines whether the result of every cross-validation iteration is saved, in addition to the result after all iterations. Especially useful for debugging.
AssumeSameImageAndMaskMetadata: Assume that the image and mask have the same metadata. If True and there is a mismatch, metadata from the image is copied to the mask.
ComBat: Whether to use ComBat feature harmonization on your FULL dataset, i.e. not in a train-test setting. See https://github.com/Jfortin1/ComBatHarmonization for more information.
Fingerprint: Whether to use fingerprinting or not.
DoTestNRSNEns: If True, repeat the experiments from the WORC paper to check the performance of various N_RS, N_Ens and advanced ensembling combinations.

Defaults and Options:

cross_validation: True (options: True, False)
Segmentix: True (options: True, False)
FeatureCalculators: [predict/CalcFeatures:1.0, pyradiomics/Pyradiomics:1.0] (options: predict/CalcFeatures:1.0, pyradiomics/Pyradiomics:1.0, pyradiomics/CF_pyradiomics:1.0, your own tool reference)
Preprocessing: worc/PreProcess:1.0 (options: worc/PreProcess:1.0, your own tool reference)
RegistrationNode: elastix4.8/Elastix:4.8 (options: elastix4.8/Elastix:4.8, your own tool reference)
TransformationNode: elastix4.8/Transformix:4.8 (options: elastix4.8/Transformix:4.8, your own tool reference)
Joblib_ncores: 1 (options: Integer > 0)
Joblib_backend: threading (options: multiprocessing, threading)
tempsave: True (options: True, False)
AssumeSameImageAndMaskMetadata: False (options: True, False)
ComBat: False (options: True, False)
Fingerprint: True (options: True, False)
DoTestNRSNEns: False (options: Boolean)
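
For example, when running on a single-core cluster node as described above, and when you want to override fingerprinted fields yourself, you could set (a sketch; adjust to your own system):

>>> config['General']['Joblib_ncores'] = '1'
>>> config['General']['Joblib_backend'] = 'threading'
>>> config['General']['Fingerprint'] = 'False'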

Labels

Set the label used for classification.

This part is quite important, as it should match your label file. Suppose the patientclass.txt file you supplied as the labels source looks like this:

Patient     Label1    Label2
patient1    1         0
patient2    2         1
patient3    1         5

You can supply a single label or multiple labels split by commas; for each label, an estimator will be fit. For example, suppose you simply want to use Label1 for classification, then set:

>>> config['Labels']['label_names'] = 'Label1'

If you want to first train a classifier on Label1 and then on Label2, set:

>>> config['Labels']['label_names'] = 'Label1, Label2'

Description:

label_names: The labels used from your label file for classification.
modus: Determine whether multilabel or singlelabel classification or regression will be performed.
url: WIP
projectID: WIP

Defaults and Options:

label_names: Label1, Label2 (options: String(s))
modus: singlelabel (options: singlelabel, multilabel)
url: WIP (options: WIP)
projectID: WIP (options: WIP)

Fingerprinting

The fingerprinting nodes are the first computational nodes: they create a fingerprint of your dataset and adjust some configuration settings accordingly, see the WORC paper.

Description:

max_num_image: Maximum number of images and segmentations to evaluate during fingerprinting, to limit the workload.

Defaults and Options:

max_num_image: 100 (options: Integer)

Preprocessing

The preprocessing node acts on the image before the feature extraction. Additionally, scans with image type CT (see later in the tutorial) provided as DICOM are scaled to Hounsfield units. For more details on the preprocessing options, please see the additional functionality chapter.

Note

As several preprocessing functions are fingerprinted, if you want to edit these configuration settings yourself, please turn off the fingerprinting; see the General section of the config.

Description:

CheckSpacing: Determine whether to check the spacing or not. If True and the spacing of the image is [1x1x1], we assume the spacing is incorrect and overwrite it using the DICOM metadata.
Clipping: Determine whether to use intensity clipping in the preprocessing of the image or not.
Clipping_Range: Lower and upper bound of the intensities to be used in clipping.
Normalize: Determine whether to use normalization in the preprocessing of the image or not.
Normalize_ROI: If a mask is supplied and this is set to True, normalize the image based on the supplied ROI. Otherwise, the full image is used for normalization using the SimpleITK Normalize function. Lastly, setting this to False results in no normalization being applied.
Method: Method used for normalization if an ROI is supplied. Currently, z-scoring or using the minimum and median of the ROI can be used.
ROIDetermine: Choose whether an ROI for normalization is provided, or Otsu thresholding is used to determine one.
ROIdilate: Determine whether the ROI has to be dilated with a disc element or not.
ROIdilateradius: Radius of the disc element to be used in the ROI dilation.
Resampling: Determine whether the image and mask will be resampled or not.
Resampling_spacing: Spacing to resample the image and mask to, if resampling is used.
BiasCorrection: Determine whether N4 bias correction will be applied or not.
BiasCorrection_Mask: Whether, within the bias correction, a mask generated through Otsu thresholding is used or not.
CheckOrientation: Determine whether to check the image orientation or not. If checked and the orientation does not equal the OrientationPrimaryAxis, the image is rotated.
OrientationPrimaryAxis: If CheckOrientation is True and the primary axis is not this one, rotate the image such that it is. Currently, only axial is supported.
HistogramEqualization: Determine whether to use histogram equalization or not.
HistogramEqualization_Alpha: Controls how much the filter acts like the classical histogram equalization method, see https://simpleitk.org/doxygen/latest/html/classitk_1_1simple_1_1AdaptiveHistogramEqualizationImageFilter.html.
HistogramEqualization_Beta: Controls how much the filter acts like an unsharp mask, see https://simpleitk.org/doxygen/latest/html/classitk_1_1simple_1_1AdaptiveHistogramEqualizationImageFilter.html.
HistogramEqualization_Radius: Controls the window size, see https://simpleitk.org/doxygen/latest/html/classitk_1_1simple_1_1AdaptiveHistogramEqualizationImageFilter.html.

Defaults and Options:

CheckSpacing: False (options: True, False)
Clipping: False (options: True, False)
Clipping_Range: -1000.0, 3000.0 (options: Float, Float)
Normalize: True (options: True, False)
Normalize_ROI: Full (options: True, False, Full)
Method: z_score (options: z_score, minmed)
ROIDetermine: Provided (options: Provided, Otsu)
ROIdilate: False (options: True, False)
ROIdilateradius: 10 (options: Integer > 0)
Resampling: False (options: True, False)
Resampling_spacing: 1, 1, 1 (options: Float, Float, Float)
BiasCorrection: False (options: True, False)
BiasCorrection_Mask: False (options: True, False)
CheckOrientation: False (options: True, False)
OrientationPrimaryAxis: axial (options: axial)
HistogramEqualization: False (options: True, False)
HistogramEqualization_Alpha: 0.3 (options: Float)
HistogramEqualization_Beta: 0.3 (options: Float)
HistogramEqualization_Radius: 5 (options: Float)
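
As an illustration, the following sketch enables intensity clipping with a custom range; since several preprocessing fields are fingerprinted, fingerprinting is turned off first so the custom values are actually used (values are examples, not recommendations):

>>> config['General']['Fingerprint'] = 'False'
>>> config['Preprocessing']['Clipping'] = 'True'
>>> config['Preprocessing']['Clipping_Range'] = '-1000.0, 3000.0'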

Segmentix

These fields are only important if you specified using the segmentix tool in the general configuration.

Description:

mask: If None, masks are not used by segmentix. If a mask is supplied, determines whether the mask is subtracted from the contour or multiplied with it.
segtype: If Ring, a ring around the segmentation will be used as contour. If Dilate, the segmentation will be dilated per 2-D axial slice with a disc.
segradius: Define the radius of the ring or disc used if segtype is Ring or Dilate, respectively.
N_blobs: How many of the largest blobs are extracted from the segmentation. If None, no blob extraction is used.
fillholes: Determines whether hole filling will be used.
remove_small_objects: Determines whether small objects will be removed.
min_object_size: Minimum size of objects in voxels to be kept when small objects are removed.

Defaults and Options:

mask: None (options: None, subtract, multiply)
segtype: None (options: None, Ring, Dilate)
segradius: 5 (options: Integer > 0)
N_blobs: 1 (options: Integer > 0)
fillholes: True (options: True, False)
remove_small_objects: False (options: True, False)
min_object_size: 2 (options: Integer > 0)
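
For example, a sketch that enables segmentix in the general configuration and uses a ring with a radius of 5 around the segmentation as contour:

>>> config['General']['Segmentix'] = 'True'
>>> config['Segmentix']['segtype'] = 'Ring'
>>> config['Segmentix']['segradius'] = '5'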

ImageFeatures

If using the PREDICT toolbox for feature extraction, you can specify some settings for the feature computation here. You can also select whether certain features are computed or not.

Description:

shape: Determine whether shape features are computed or not.
histogram: Determine whether histogram features are computed or not.
orientation: Determine whether orientation features are computed or not.
texture_Gabor: Determine whether Gabor texture features are computed or not.
texture_LBP: Determine whether LBP texture features are computed or not.
texture_GLCM: Determine whether GLCM texture features are computed or not.
texture_GLCMMS: Determine whether GLCM multislice texture features are computed or not.
texture_GLRLM: Determine whether GLRLM texture features are computed or not.
texture_GLSZM: Determine whether GLSZM texture features are computed or not.
texture_NGTDM: Determine whether NGTDM texture features are computed or not.
coliage: Determine whether coliage features are computed or not.
vessel: Determine whether vessel features are computed or not.
log: Determine whether LoG features are computed or not.
phase: Determine whether local phase features are computed or not.
image_type: Modality of the supplied images; determines how the image is loaded. Mandatory to supply by the user. Should be one of the valid quantitative modalities [CT, PET, Thermography, ADC, MG] or qualitative modalities [MRI, MR, DWI, US].
extraction_mode: Determine how to extract the features: 2D if your masks and/or images have only one 2-D slice, 3D for true 3-D images, 2.5D for 3-D images processed in a slice-by-slice stacked 2-D manner. The latter is recommended when the slice thickness is much larger (more than 2x) than the pixel spacing.
gabor_frequencies: Frequencies of the Gabor filters used: can be a single float or a list.
gabor_angles: Angles of the Gabor filters in degrees: can be a single integer or a list.
GLCM_angles: Angles used in the GLCM computation in radians: can be a single float or a list.
GLCM_levels: Number of grayscale levels used in discretization before GLCM computation.
GLCM_distances: Distance(s) used in the GLCM computation in pixels: can be a single integer or a list.
LBP_radius: Radii used for the LBP computation: can be a single integer or a list.
LBP_npoints: Number(s) of points used in the LBP computation: can be a single integer or a list.
phase_minwavelength: Minimal wavelength in pixels used for the phase features.
phase_nscale: Number of scales used in the phase feature computation.
log_sigma: Standard deviation(s) in pixels used in the LoG feature computation: can be a single integer or a list.
vessel_scale_range: Scale in pixels used for the Frangi vessel filter, given as a minimum and a maximum.
vessel_scale_step: Step size used to go from the minimum to the maximum scale in the Frangi vessel filter.
vessel_radius: Radius determining the boundary between the inner part and the edge in the Frangi vessel filter.
dicom_feature_tags: DICOM tags to be extracted as features. See https://worc.readthedocs.io/en/latest/static/features.html.
dicom_feature_labels: For each of the DICOM tag values extracted, the name that should be assigned to the feature. See https://worc.readthedocs.io/en/latest/static/features.html.

Defaults and Options:

shape: True (options: True, False)
histogram: True (options: True, False)
orientation: True (options: True, False)
texture_Gabor: True (options: True, False)
texture_LBP: True (options: True, False)
texture_GLCM: True (options: True, False)
texture_GLCMMS: True (options: True, False)
texture_GLRLM: False (options: True, False)
texture_GLSZM: False (options: True, False)
texture_NGTDM: False (options: True, False)
coliage: False (options: True, False)
vessel: True (options: True, False)
log: True (options: True, False)
phase: True (options: True, False)
image_type: no default, mandatory to supply (options: String)
extraction_mode: 2.5D (options: String: 2D, 2.5D or 3D)
gabor_frequencies: 0.05, 0.2, 0.5 (options: Float(s))
gabor_angles: 0, 45, 90, 135 (options: Integer(s))
GLCM_angles: 0, 0.79, 1.57, 2.36 (options: Float(s))
GLCM_levels: 16 (options: Integer > 0)
GLCM_distances: 1, 3 (options: Integer(s) > 0)
LBP_radius: 3, 8, 15 (options: Integer(s) > 0)
LBP_npoints: 12, 24, 36 (options: Integer(s) > 0)
phase_minwavelength: 3 (options: Integer > 0)
phase_nscale: 5 (options: Integer > 0)
log_sigma: 1, 5, 10 (options: Integer(s))
vessel_scale_range: 1, 10 (options: Two Integers: min and max)
vessel_scale_step: 2 (options: Integer > 0)
vessel_radius: 5 (options: Integer > 0)
dicom_feature_tags: 0010 1010, 0010 0040 (options: DICOM tag keys, e.g. 0010 0010, separated by commas)
dicom_feature_labels: age, sex (options: List of strings)
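
Since image_type is mandatory, a minimal sketch for a CT study that additionally switches to full 3-D extraction:

>>> config['ImageFeatures']['image_type'] = 'CT'
>>> config['ImageFeatures']['extraction_mode'] = '3D'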

PyRadiomics

If using the PyRadiomics toolbox, you can specify some settings for the feature computation here. For more information, see https://pyradiomics.readthedocs.io/en/latest/customization.html.

Description:

geometryTolerance: See https://pyradiomics.readthedocs.io/en/latest/customization.html.
normalize: See https://pyradiomics.readthedocs.io/en/latest/customization.html.
normalizeScale: See https://pyradiomics.readthedocs.io/en/latest/customization.html.
resampledPixelSpacing: See https://pyradiomics.readthedocs.io/en/latest/customization.html.
interpolator: See https://pyradiomics.readthedocs.io/en/latest/customization.html?highlight=sitkbspline#feature-extractor-level.
preCrop: See https://pyradiomics.readthedocs.io/en/latest/customization.html.
binCount: We advise using a fixed bin count instead of a fixed bin width, as on imaging modalities such as MR the scale of the values varies a lot, which is incompatible with a fixed bin width. See https://pyradiomics.readthedocs.io/en/latest/customization.html.
binWidth: See https://pyradiomics.readthedocs.io/en/latest/customization.html.
force2D: See https://pyradiomics.readthedocs.io/en/latest/customization.html.
force2Ddimension: See https://pyradiomics.readthedocs.io/en/latest/customization.html.
voxelArrayShift: See https://pyradiomics.readthedocs.io/en/latest/customization.html.
Original: Enable/disable computation of original image features.
Wavelet: Enable/disable computation of wavelet image features.
LoG: Enable/disable computation of Laplacian of Gaussian (LoG) image features.
label: "Intensity" of the pixels in the mask to be used for feature extraction. If using segmentix, use 1, as your mask will be boolean. Otherwise, select the integer(s) corresponding to the ROI in your mask.
extract_firstorder: Determine whether first-order features are computed or not.
extract_shape: Determine whether shape features are computed or not.
texture_GLCM: Determine whether GLCM features are computed or not.
texture_GLRLM: Determine whether GLRLM features are computed or not.
texture_GLSZM: Determine whether GLSZM features are computed or not.
texture_GLDM: Determine whether GLDM features are computed or not.
texture_NGTDM: Determine whether NGTDM features are computed or not.

Defaults and Options:

geometryTolerance: 0.0001 (options: Float)
normalize: False (options: Boolean)
normalizeScale: 100 (options: Integer)
resampledPixelSpacing: None (options: Float, Float, Float)
interpolator: sitkBSpline (options: see https://pyradiomics.readthedocs.io/en/latest/customization.html?highlight=sitkbspline#feature-extractor-level)
preCrop: True (options: True, False)
binCount: 16 (options: Integer or None)
binWidth: None (options: Integer or None)
force2D: False (options: True, False)
force2Ddimension: 0 (options: 0 = axial, 1 = coronal, 2 = sagittal)
voxelArrayShift: 300 (options: Integer)
Original: True (options: True, False)
Wavelet: False (options: True, False)
LoG: False (options: True, False)
label: 1 (options: Integer)
extract_firstorder: False (options: True, False)
extract_shape: True (options: True, False)
texture_GLCM: False (options: True, False)
texture_GLRLM: True (options: True, False)
texture_GLSZM: True (options: True, False)
texture_GLDM: True (options: True, False)
texture_NGTDM: True (options: True, False)
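
For example, to additionally compute PyRadiomics wavelet and first-order features (a sketch):

>>> config['PyRadiomics']['Wavelet'] = 'True'
>>> config['PyRadiomics']['extract_firstorder'] = 'True'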

ComBat

If using the ComBat toolbox, you can specify some settings for the feature harmonization here. For more information, see https://github.com/Jfortin1/ComBatHarmonization.

Description:

language: Name of the software implementation to use.
batch: Name of the batch variable, i.e. the variable to correct for.
mod: Name of the moderation variable(s), i.e. the variables for which the variation in the features will be preserved.
par: Either use the parametric (1) or the non-parametric (0) version of ComBat.
eb: Either use the empirical Bayes (1) or the simple mean-shifting (0) version of ComBat.
per_feature: Either use ComBat for all features combined (0) or per feature (1). In the latter case, if eb=1, a second feature equal to the single feature plus random noise is added.
excluded_features: Provide substrings of the labels of features which should be excluded from ComBat. Recommended for features unaffected by the batch variable.
matlab: If using Matlab, path to the Matlab executable.

Defaults and Options:

language: python (options: python, matlab)
batch: Hospital (options: String)
mod: [] (options: String(s), or [])
par: 1 (options: 0 or 1)
eb: 1 (options: 0 or 1)
per_feature: 0 (options: 0 or 1)
excluded_features: sf_, of_, semf_, pf_ (options: List of strings, comma separated)
matlab: C:\Program Files\MATLAB\R2015b\bin\matlab.exe (options: String)
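
A minimal sketch to enable ComBat, assuming your label file contains a Hospital label to correct for and an Age label whose variation should be preserved:

>>> config['General']['ComBat'] = 'True'
>>> config['ComBat']['batch'] = 'Hospital'
>>> config['ComBat']['mod'] = 'Age'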

FeatPreProcess

Before the features are given to the classification function, and thus the hyperoptimization, they can be preprocessed as follows.

Description:

Use: If True, use the feature preprocessor in the classify node. Currently, this excludes features with more than 80% NaN values.
Combine: If True, features of multiple objects (e.g. lesions) of the same patient are combined.
Combine_method: If features of multiple objects are combined, this determines the method. Currently included options are mean and max.

Defaults and Options:

Use: False (options: Boolean)
Combine: False (options: Boolean)
Combine_method: mean (options: mean or max)

OneHotEncoding

Optionally, you can use OneHotEncoding on specific features. For more information on why and how this is done, see https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html. By default, this is not done, as WORC does not know for which specific features you would like to do this.

Description:

Use: If True, use OneHotEncoding for specific features as determined by the field below.
feature_labels_tofit: Labels of the features for which to use OneHotEncoding. WORC checks whether any of the values specified in this field is a substring of a feature name. For example, if you give glcm, all features with glcm in the feature label will be one-hot encoded.

Defaults and Options:

Use: False (options: Boolean(s))
feature_labels_tofit: empty by default, as this is dataset specific (options: List of strings)

Imputation

These settings are used for feature imputation. Note that these settings are actually used in the hyperparameter optimization. Hence, you can provide multiple values per field, from which random samples will be drawn; the best setting in combination with the other hyperparameters is finally selected.

Description:

use: If True, use feature imputation methods to replace NaN values. If False, all NaN features will be set to zero.
strategy: Method to be used for imputation.
n_neighbors: When using k-nearest neighbors (kNN) for feature imputation, determines the number of neighbors used for imputation. Can be a single integer or a list.
skipallNaN: If True, a feature that is NaN for all objects/patients is simply removed for all patients.

Defaults and Options:

use: True (options: Boolean(s))
strategy: mean, median, most_frequent, constant, knn (options: mean, median, most_frequent, constant, knn)
n_neighbors: 5, 5 (options: Two Integers: loc and scale)
skipallNaN: True (options: Boolean(s))
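
For example, to let the optimization choose between only mean and kNN imputation (a sketch):

>>> config['Imputation']['use'] = 'True'
>>> config['Imputation']['strategy'] = 'mean, knn'
>>> config['Imputation']['n_neighbors'] = '5, 5'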

FeatureScaling

Determines which method is applied to scale each feature.

Description:

scaling_method: Determine the scaling method.
skip_features: Determine which features should be skipped. This field should contain a comma-separated list of substrings: when one or more of these is in a feature name, the feature is skipped.

Defaults and Options:

scaling_method: robust_z_score (options: robust_z_score, z_score, robust, minmax, log_z_score, None)
skip_features: semf_, pf_ (options: Comma separated list of strings)

Featsel

Define feature selection methods. Note that these settings are actually used in the hyperparameter optimization. Hence, you can provide multiple values per field, from which random samples will be drawn; the best setting in combination with the other hyperparameters is finally selected. Again, these should be formatted as a string containing the actual values, e.g. value1, value2.

Description:

Variance: Percentage of times features with a variance < 0.01 are excluded. Based on sklearn's VarianceThreshold, see https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html.
GroupwiseSearch: Randomly select which feature groups to use. The parameters are determined by the SelectFeatGroup part of the config, see below.
SelectFromModel: Percentage of times features are selected by first training a machine learning model that can rank the features by importance. See also sklearn's SelectFromModel.
SelectFromModel_estimator: Machine learning model / estimator used: can be LASSO, LogisticRegression, or a random forest.
SelectFromModel_lasso_alpha: When using LASSO, search space of the weight of the L1 term, see also https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html.
SelectFromModel_n_trees: When using a random forest, search space of the number of trees used.
UsePCA: Percentage of times Principal Component Analysis (PCA) is used to select features.
PCAType: Method to select the number of PCA components: either the number of components that explains 95% of the variance, or a fixed number of components.
StatisticalTestUse: Percentage of times a statistical test is used to select features.
StatisticalTestMetric: Define the type of statistical test to be used.
StatisticalTestThreshold: Specify a threshold for the p-value used in the statistical test to select features. The first element defines the lower boundary, the other the upper boundary. Random sampling will occur between the boundaries.
ReliefUse: Percentage of times Relief is used to select features.
ReliefNN: Min and max of the search range for the number of nearest neighbors in Relief.
ReliefSampleSize: Min and max of the search range for the sample size in Relief.
ReliefDistanceP: Min and max of the search range for the positive distance in Relief.
ReliefNumFeatures: Min and max of the search range for the number of features selected in Relief.
RFE: Percentage of times recursive feature elimination (RFE) is used to select features.
RFE_estimator: Machine learning model / estimator used: can be LASSO, LogisticRegression, or a random forest.
RFE_lasso_alpha: When using LASSO, search space of the weight of the L1 term, see also https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html.
RFE_n_trees: When using a random forest, search space of the number of trees used.
RFE_n_features_to_select: Number of features to select. Since we use sklearn < 0.24, this currently has to be an integer, not a float representing a fraction of the features.
RFE_step: Number of features eliminated per step.

Defaults and Options:

Variance: 1.0 (options: Float)
GroupwiseSearch: True (options: Boolean(s))
SelectFromModel: 0.275 (options: Float)
SelectFromModel_estimator: Lasso, LR, RF (options: Lasso, LR, RF)
SelectFromModel_lasso_alpha: 0.1, 1.4 (options: Two Floats: loc and scale)
SelectFromModel_n_trees: 10, 90 (options: Two Integers: loc and scale)
UsePCA: 0.275 (options: Float)
PCAType: 95variance, 10, 50, 100 (options: Integer(s), 95variance)
StatisticalTestUse: 0.275 (options: Float)
StatisticalTestMetric: MannWhitneyU (options: ttest, Welch, Wilcoxon, MannWhitneyU)
StatisticalTestThreshold: -3, 2.5 (options: Two Integers: loc and scale)
ReliefUse: 0.275 (options: Float)
ReliefNN: 2, 4 (options: Two Integers: loc and scale)
ReliefSampleSize: 0.75, 0.2 (options: Two Floats: loc and scale)
ReliefDistanceP: 1, 3 (options: Two Integers: loc and scale)
ReliefNumFeatures: 10, 40 (options: Two Integers: loc and scale)
RFE: 0.0 (options: Float)
RFE_estimator: Lasso, LR, RF (options: Lasso, LR, RF)
RFE_lasso_alpha: 0.1, 1.4 (options: Two Floats: loc and scale)
RFE_n_trees: 10, 90 (options: Two Integers: loc and scale)
RFE_n_features_to_select: 10, 90 (options: Two Integers: loc and scale)
RFE_step: 1, 9 (options: Two Integers: loc and scale)
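
As these fields are hyperparameters, you can steer the search. For example, a sketch that disables PCA and statistical testing while always applying the variance threshold:

>>> config['Featsel']['UsePCA'] = '0.0'
>>> config['Featsel']['StatisticalTestUse'] = '0.0'
>>> config['Featsel']['Variance'] = '1.0'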

SelectFeatGroup

If the PREDICT and/or PyRadiomics feature computation tools are used, you can perform a gridsearch over the various feature groups to find the optimal combination. Here, you determine which groups can be selected.

Description:

shape_features: If True, use shape features in the model.
histogram_features: If True, use histogram features in the model.
orientation_features: If True, use orientation features in the model.
texture_Gabor_features: If True, use Gabor texture features in the model.
texture_GLCM_features: If True, use GLCM texture features in the model.
texture_GLDM_features: If True, use GLDM texture features in the model.
texture_GLCMMS_features: If True, use GLCM multislice texture features in the model.
texture_GLRLM_features: If True, use GLRLM texture features in the model.
texture_GLSZM_features: If True, use GLSZM texture features in the model.
texture_GLDZM_features: If True, use GLDZM texture features in the model.
texture_NGTDM_features: If True, use NGTDM texture features in the model.
texture_NGLDM_features: If True, use NGLDM texture features in the model.
texture_LBP_features: If True, use LBP texture features in the model.
dicom_features: If True, use DICOM features in the model.
semantic_features: If True, use semantic features in the model.
coliage_features: If True, use coliage features in the model.
vessel_features: If True, use vessel features in the model.
phase_features: If True, use phase features in the model.
fractal_features: If True, use fractal features in the model.
location_features: If True, use location features in the model.
rgrd_features: If True, use rgrd features in the model.
toolbox: List of names of the toolboxes to be used, or All.
original_features: If True, use original features in the model.
wavelet_features: If True, use wavelet features in the model.
log_features: If True, use log features in the model.

Defaults and Options:

shape_features: True, False (options: Boolean(s))
histogram_features: True, False (options: Boolean(s))
orientation_features: True, False (options: Boolean(s))
texture_Gabor_features: True, False (options: Boolean(s))
texture_GLCM_features: True, False (options: Boolean(s))
texture_GLDM_features: True, False (options: Boolean(s))
texture_GLCMMS_features: True, False (options: Boolean(s))
texture_GLRLM_features: True, False (options: Boolean(s))
texture_GLSZM_features: True, False (options: Boolean(s))
texture_GLDZM_features: True, False (options: Boolean(s))
texture_NGTDM_features: True, False (options: Boolean(s))
texture_NGLDM_features: True, False (options: Boolean(s))
texture_LBP_features: True, False (options: Boolean(s))
dicom_features: False (options: Boolean(s))
semantic_features: False (options: Boolean(s))
coliage_features: False (options: Boolean(s))
vessel_features: True, False (options: Boolean(s))
phase_features: True, False (options: Boolean(s))
fractal_features: True, False (options: Boolean(s))
location_features: True, False (options: Boolean(s))
rgrd_features: True, False (options: Boolean(s))
toolbox: All, PREDICT, PyRadiomics (options: All, or name of toolbox (PREDICT, PyRadiomics))
original_features: True (options: Boolean(s))
wavelet_features: True, False (options: Boolean(s))
log_features: True, False (options: Boolean(s))

Resampling

Before performing the hyperoptimization, you can use various resampling techniques to resample the data (under-sampling, over-sampling, or both). All methods are adopted from imbalanced-learn.

Description:

Use: Percentage of times object (e.g. patient) resampling is used.
Method: One of the adopted methods, see also https://imbalanced-learn.readthedocs.io/en/stable/api/.
sampling_strategy: Sampling strategy, see also https://imbalanced-learn.readthedocs.io/en/stable/api/.
n_neighbors: Number of neighbors used in resampling. This should be (much) smaller than the number of objects/patients you supply. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
k_neighbors: Number of neighbors used in resampling. This should be (much) smaller than the number of objects/patients you supply. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
threshold_cleaning: Threshold for the cleaning of samples. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).

Defaults and Options:

Use: 0.20 (options: Float)
Method: RandomUnderSampling, RandomOverSampling, NearMiss, NeighbourhoodCleaningRule, ADASYN, BorderlineSMOTE, SMOTE, SMOTEENN, SMOTETomek (options: RandomUnderSampling, RandomOverSampling, NearMiss, NeighbourhoodCleaningRule, ADASYN, BorderlineSMOTE, SMOTE, SMOTEENN, SMOTETomek)
sampling_strategy: auto, majority, minority, not minority, not majority, all (options: auto, majority, minority, not minority, not majority, all)
n_neighbors: 3, 12 (options: Two Integers: loc and scale)
k_neighbors: 5, 15 (options: Two Integers: loc and scale)
threshold_cleaning: 0.25, 0.5 (options: Two Floats: loc and scale)

Classification

Determine settings for the classification in the hyperoptimization. Most of the classifiers are implemented using sklearn; hence descriptions of the hyperparameters can also be found there.

Defaults for XGB are based on https://towardsdatascience.com/doing-xgboost-hyper-parameter-tuning-the-smart-way-part-1-of-2-f6d255a45dde and https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

Note that, as XGB and AdaBoost take significantly longer to fit (roughly 3x), they are picked less often by default.

Description:

fastr: Use fastr for the optimization gridsearch (recommended on clusters, default) or, if set to False, joblib (recommended for PCs, but not on Windows).
fastr_plugin: Name of the execution plugin to be used. By default, the same plugin as the self.fastr_plugin of the WORC object is used.
classifiers: Select the estimator(s) to use. Most are implemented using sklearn. For abbreviations, see the options: e.g. LR = logistic regression.
max_iter: Maximum number of iterations to use in training an estimator. Only for specific estimators, see sklearn.
SVMKernel: When using an SVM, specify the kernel type.
SVMC: Range of the SVM slack parameter. We sample on a uniform log scale: the parameters specify the range of the exponent (loc, loc + scale).
SVMdegree: Range of the SVM polynomial degree when using a polynomial kernel. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
SVMcoef0: Range of the SVM homogeneity parameter. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
SVMgamma: Range of the SVM gamma parameter. We sample on a uniform log scale: the parameters specify the range of the exponent (loc, loc + scale).
RFn_estimators: Range of the number of trees in an RF. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
RFmin_samples_split: Range of the minimum number of samples required to split a branch in an RF. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
RFmax_depth: Range of the maximum depth of an RF. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
LRpenalty: Penalty term used in LR.
LRC: Range of the regularization strength in LR. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
LR_solver: Solver used in LR.
LR_l1_ratio: Ratio between the l1 and l2 penalty when using the elasticnet penalty, see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.
LDA_solver: Solver used in LDA.
LDA_shrinkage: Range of the LDA shrinkage parameter. We sample on a uniform log scale: the parameters specify the range of the exponent (loc, loc + scale).
QDA_reg_param: Range of the QDA regularization parameter. We sample on a uniform log scale: the parameters specify the range of the exponent (loc, loc + scale).
ElasticNet_alpha: Range of the ElasticNet penalty parameter. We sample on a uniform log scale: the parameters specify the range of the exponent (loc, loc + scale).
ElasticNet_l1_ratio: Range of the l1 ratio in ElasticNet. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
SGD_alpha: Range of the SGD penalty parameter. We sample on a uniform log scale: the parameters specify the range of the exponent (loc, loc + scale).
SGD_l1_ratio: Range of the l1 ratio in SGD. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
SGD_loss: Loss function of SGD.
SGD_penalty: Penalty term in SGD.
CNB_alpha: Regularization strength in ComplementNB. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
AdaBoost_n_estimators: Number of estimators used in AdaBoost. Default is equal to config['Classification']['RFn_estimators'].
AdaBoost_learning_rate: Learning rate in AdaBoost.
XGB_boosting_rounds: Number of estimators / boosting rounds used in XGB. Default is equal to config['Classification']['RFn_estimators'].
XGB_max_depth: Maximum depth of XGB.
XGB_learning_rate: Learning rate of XGB. Default is equal to config['Classification']['AdaBoost_learning_rate'].
XGB_gamma: Gamma of XGB.
XGB_min_child_weight: Minimum child weights in XGB.
XGB_colsample_bytree: Column sample by tree in XGB.
LightGBM_num_leaves: Maximum tree leaves for base learners. See also https://lightgbm.readthedocs.io/en/latest/Parameters.html.
LightGBM_max_depth: Maximum tree depth for base learners. See also https://lightgbm.readthedocs.io/en/latest/Parameters.html.
LightGBM_min_child_samples: Minimum number of data needed in a child (leaf). See also https://lightgbm.readthedocs.io/en/latest/Parameters.html.
LightGBM_reg_alpha: L1 regularization term on weights. See also https://lightgbm.readthedocs.io/en/latest/Parameters.html.
LightGBM_reg_lambda: L2 regularization term on weights. See also https://lightgbm.readthedocs.io/en/latest/Parameters.html.
LightGBM_min_child_weight: Minimum sum of instance weight (hessian) needed in a child (leaf). See also https://lightgbm.readthedocs.io/en/latest/Parameters.html.

Defaults and Options:

fastr: True (options: True, False)
fastr_plugin: LinearExecution (options: any fastr execution plugin)
classifiers: SVM, RF, LR, LDA, QDA, GaussianNB, AdaBoostClassifier, XGBClassifier (options: SVM, SVR, SGD, SGDR, RF, LDA, QDA, ComplementNB, GaussianNB, AdaBoostClassifier, XGBClassifier, LR, RFR, Lasso, ElasticNet, LinR, Ridge, AdaBoostRegressor, XGBRegressor; all are estimators from sklearn)
max_iter: 100000 (options: Integer)
SVMKernel: linear, poly, rbf (options: poly, linear, rbf)
SVMC: 0, 6 (options: Two Integers: loc and scale)
SVMdegree: 1, 6 (options: Two Integers: loc and scale)
SVMcoef0: 0, 1 (options: Two Integers: loc and scale)
SVMgamma: -5, 5 (options: Two Integers: loc and scale)
RFn_estimators: 10, 90 (options: Two Integers: loc and scale)
RFmin_samples_split: 2, 3 (options: Two Integers: loc and scale)
RFmax_depth: 5, 5 (options: Two Integers: loc and scale)
LRpenalty: l1, l2, elasticnet (options: none, l1, l2, elasticnet)
LRC: 0.01, 0.99 (options: Two Floats: loc and scale)
LR_solver: lbfgs, saga (options: comma-separated list of strings; for the options, see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
LR_l1_ratio: 0, 1 (options: Float between 0.0 and 1.0)
LDA_solver: svd, lsqr, eigen (options: svd, lsqr, eigen)
LDA_shrinkage: -5, 5 (options: Two Integers: loc and scale)
QDA_reg_param: -5, 5 (options: Two Integers: loc and scale)
ElasticNet_alpha: -5, 5 (options: Two Integers: loc and scale)
ElasticNet_l1_ratio: 0, 1 (options: Two Integers: loc and scale)
SGD_alpha: -5, 5 (options: Two Integers: loc and scale)
SGD_l1_ratio: 0, 1 (options: Two Integers: loc and scale)
SGD_loss: squared_loss, huber, epsilon_insensitive, squared_epsilon_insensitive (options: hinge, squared_hinge, modified_huber)
SGD_penalty: none, l2, l1 (options: none, l2, l1)
CNB_alpha: 0, 1 (options: Two Integers: loc and scale)
AdaBoost_n_estimators: 10, 90 (options: Two Integers: loc and scale)
AdaBoost_learning_rate: 0.01, 0.99 (options: Two Floats: loc and scale)
XGB_boosting_rounds: 10, 90 (options: Two Integers: loc and scale)
XGB_max_depth: 3, 12 (options: Two Integers: loc and scale)
XGB_learning_rate: 0.01, 0.99 (options: Two Floats: loc and scale)
XGB_gamma: 0.01, 9.99 (options: Two Floats: loc and scale)
XGB_min_child_weight: 1, 6 (options: Two Integers: loc and scale)
XGB_colsample_bytree: 0.3, 0.7 (options: Two Floats: loc and scale)
LightGBM_num_leaves: 5, 95 (options: Two Integers: loc and scale)
LightGBM_max_depth: 3, 12 (options: Two Integers: loc and scale)
LightGBM_min_child_samples: 5, 45 (options: Two Integers: loc and scale)
LightGBM_reg_alpha: 0.01, 0.99 (options: Two Floats: loc and scale)
LightGBM_reg_lambda: 0.01, 0.99 (options: Two Floats: loc and scale)
LightGBM_min_child_weight: -7, 4 (options: Two Integers: loc and scale)
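
For example, a sketch restricting the search to an SVM with an rbf kernel, keeping the default slack range:

>>> config['Classification']['classifiers'] = 'SVM'
>>> config['Classification']['SVMKernel'] = 'rbf'
>>> config['Classification']['SVMC'] = '0, 6'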

CrossValidation

When using cross validation, specify the following settings.

Description:

Type: If performing a cross-validation, the type of cross-validation used. Currently, random-splitting and leave-one-out (LOO) are supported.
N_iterations: Number of times the data is split into training and test sets in the outer cross-validation when using random-splitting.
test_size: The percentage of data to be used for testing when using random-splitting.
fixed_seed: If True, use a fixed seed for the cross-validation splits when using random-splitting.

Defaults and Options:

Type: random_split (options: random_split, LOO)
N_iterations: 100 (options: Integer)
test_size: 0.2 (options: Float)
fixed_seed: False (options: Boolean)
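
For example, to use 50 random splits with 30% of the data for testing (a sketch):

>>> config['CrossValidation']['N_iterations'] = '50'
>>> config['CrossValidation']['test_size'] = '0.3'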

HyperOptimization

Specify the hyperparameter optimization procedure here.

Description:

scoring_method: Specify the optimization metric for your hyperparameter search.
test_size: Size of the test set in the hyperoptimization cross-validation, given as a percentage of the whole dataset.
n_splits: Number of iterations in the train-validation cross-validation used for model optimization.
N_iterations: Number of iterations used in the hyperparameter optimization. This corresponds to the number of samples drawn from the parameter grid.
n_jobspercore: Number of jobs assigned to a single core. Only used if fastr is set to True in the classification config.
maxlen: Number of estimators for which the fitted outcomes and parameters are saved. Increasing this number will increase the memory usage.
ranking_score: Score used for ranking the performance of the evaluated workflows.
memory: When using the DRMAA plugin, e.g. on the BIGR cluster, memory usage of a single optimization job. Should be a string consisting of an integer + "G".
refit_training_workflows: If True, refit all workflows trained on the full training dataset automatically during training. This will save time when performing inference, but will take more time during training and make the saved model much larger.
refit_validation_workflows: If True, refit all workflows trained on the train-validation training dataset automatically during training. This will save time when performing validation evaluation, but will take more time during training and make the saved model much larger.
fix_random_seed: If True, a fixed random seed is used during training for all fitted methods that have a random seed. In this way, if you run the experiment again, you get exactly the same result.

Defaults and Options:

scoring_method: f1_weighted (options: manual WORC metrics: f1_weighted_predictproba, average_precision_weighted, gmean; other accepted values are any sklearn metric)
test_size: 0.2 (options: Float)
n_splits: 5 (options: Integer)
N_iterations: 1000 (options: Integer)
n_jobspercore: 200 (options: Integer)
maxlen: 100 (options: Integer)
ranking_score: test_score (options: String)
memory: 3G (options: String consisting of an integer + "G")
refit_training_workflows: False (options: Boolean)
refit_validation_workflows: False (options: Boolean)
fix_random_seed: False (options: Boolean)
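
For a quick test run, you could lower the optimization budget (a sketch; do not use such small values for an actual experiment):

>>> config['HyperOptimization']['N_iterations'] = '100'
>>> config['HyperOptimization']['n_jobspercore'] = '10'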

SMAC

WORC enables the use of the SMAC algorithm for the hyperparameter optimization. SMAC uses the same parameter options as the default random search, except for resampling which is currently not compatible with SMAC.

Description:

use: If True, use SMAC as the optimization strategy.
n_smac_cores: Number of independent, parallel SMAC instances to use.
budget_type: Type of budget to use for the SMAC optimization: either an evaluation limit or a time limit.
budget: Size of the budget, which depends on the budget type: the number of evaluations for an evaluation limit, or wallclock seconds for a time limit.
init_method: Initialization method of SMAC. Supported are a random initialization or a Sobol sequence.
init_budget: Number of evaluations used for the initialization. Always an evaluation limit, regardless of the budget type chosen for the optimization.

Defaults and Options:

use: False (options: True, False)
n_smac_cores: 1 (options: Integer)
budget_type: evals (options: evals, time)
budget: 100 (options: Integer)
init_method: random (options: random, sobol)
init_budget: 20 (options: Integer)
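
A sketch enabling SMAC with a time budget of one hour (3600 wallclock seconds):

>>> config['SMAC']['use'] = 'True'
>>> config['SMAC']['budget_type'] = 'time'
>>> config['SMAC']['budget'] = '3600'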

Ensemble

WORC supports ensembling of workflows. This is not a default approach in radiomics; if you do not wish to use an ensemble, use the Single method or top_N with a size of 1.

Description:

Method: Choose which ensemble method to use. If you do not wish to use an ensemble, use Single or top_N with size 1.
Size: Number of estimators to use in the ensemble for the top_N method, or the number of bags for the Bagging method.
Metric: Metric used to determine the ranking of estimators in the ensemble. When using Default, the metric used in the hyperoptimization is used.

Defaults and Options:

Method: top_N (options: Single, top_N, FitNumber, ForwardSelection, Caruana, Bagging)
Size: 100 (options: Integer)
Metric: Default (options: Default, generalization)
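
For example, to select only the single best-performing workflow instead of an ensemble (a sketch):

>>> config['Ensemble']['Method'] = 'top_N'
>>> config['Ensemble']['Size'] = '1'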

Evaluation

In the evaluation of the performance, several adjustments can be made.

Description:

OverfitScaler: Whether to fit a separate scaler on the test set (which is overfitting) or to use the scaler fitted on the training dataset. Only used for experimental purposes: never overfit your scaler for the actual performance evaluation.

Defaults and Options:

OverfitScaler: False (options: True, False)

Bootstrap

Besides cross validation, WORC supports bootstrapping on the test set for performance evaluation.

Description:

Use: Determine whether to use bootstrapping or not.
N_iterations: Number of iterations to use for bootstrapping.

Defaults and Options:

Use: False (options: Boolean)
N_iterations: 10000 (options: Integer)
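
A sketch enabling bootstrapping with a reduced number of iterations:

>>> config['Bootstrap']['Use'] = 'True'
>>> config['Bootstrap']['N_iterations'] = '1000'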