Configuration

Introduction

WORC has defaults for all settings so it can be run out of the box to test the examples. However, you may want to alter the fastr configuration to your system settings, e.g. to set the locations of your input and output folders and the degree of parallelization of the execution.

Fastr will search for a config file named config.py in the $FASTRHOME directory (which defaults to ~/.fastr/ if it is not set); hence, if $FASTRHOME is set, ~/.fastr/ is ignored. Additionally, .py files from the $FASTRHOME/config.d folder are parsed as well. You will see that upon installation, WORC has already put a WORC_config.py file in the config.d folder.

As WORC and the default tools used are mostly Python based, we’ve chosen to put our configuration in a configparser object. This has several advantages:

  1. The object can be treated as a Python dictionary and thus is easily adjusted.

  2. Each tool can be set to parse only specific parts of the configuration, enabling us to supply one file to all tools instead of needing many parameter files.

Creation and interaction

The default configuration is generated through the WORC.defaultconfig() function. You can then change fields as you would in a dictionary and append the result to the configs source:

>>> network = WORC.WORC('somename')
>>> config = network.defaultconfig()
>>> config['Classification']['classifier'] = 'RF'
>>> network.configs.append(config)

When executing the WORC.set() command, the config objects are saved as .ini files in the WORC.fastr_tempdir folder and added to the WORC.fastrconfigs() source.

Below are some details on several of the fields in the configuration. Note that for many of the fields, we currently only provide one default value. However, when adding your own tools, these fields can be adjusted to your specific settings.

WORC performs Combined Algorithm Selection and Hyperparameter optimization (CASH). The configuration determines how the optimization is performed and which hyperparameters and models will be included. Repeating specific models/parameters in the config will make them more likely to be used, e.g.

>>> config['Classification']['classifiers'] = 'SVM, SVM, LR'

means that the SVM is twice as likely to be tested in the model selection as LR.

Note

All fields in the config must be supplied as strings. A list can be created by separating the values with commas.
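
For example, to supply a list of two scaling methods as a single comma-separated string (both values appear in the FeatureScaling options below):

>>> config['FeatureScaling']['scaling_method'] = 'robust_z_score, z_score'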

Contents

The config object can be indexed as config[key][subkey] = value. The various keys, subkeys, and the values (description, defaults and options) can be found below.

Bootstrap, Classification, ComBat, CrossValidation, Ensemble, Evaluation, FeatPreProcess, Featsel, FeatureScaling, Fingerprinting, General, HyperOptimization, ImageFeatures, Imputation, Labels, OneHotEncoding, Preprocessing, PyRadiomics, Resampling, SMAC, Segmentix, SelectFeatGroup

Details on each section of the config can be found below.

General

These fields contain general settings for when using WORC. For more information on the joblib settings, which are used in the joblib Parallel function, see the joblib documentation. When you run WORC on a cluster where only a single core can be used per node, e.g. the BIGR cluster, use only 1 core and the threading backend.

Note

If you want to override configuration fields that are fingerprinted, e.g. the preprocessing, turn the fingerprinting off.

Description:

cross_validation: Determine whether a cross-validation will be performed or not. Obsolete, will be removed.
Segmentix: Determine whether to use the segmentix tool for segmentation preprocessing.
FeatureCalculators: Specifies which feature calculation tools should be used. A list can be provided to use multiple tools.
Preprocessing: Specifies which tool will be used for image preprocessing.
RegistrationNode: Specifies which tool will be used for image registration.
TransformationNode: Specifies which tool will be used for applying image transformations.
Joblib_ncores: Number of cores to be used by joblib for multicore processing.
Joblib_backend: Type of backend to be used by joblib for multicore processing.
tempsave: Determines whether the result of every cross-validation iteration is saved, in addition to the result after all iterations. Especially useful for debugging.
AssumeSameImageAndMaskMetadata: Assume that the image and mask have the same metadata. If True and there is a mismatch, metadata from the image is copied to the mask.
ComBat: Whether to use ComBat feature harmonization on your FULL dataset, i.e. not in a train-test setting. See https://github.com/Jfortin1/ComBatHarmonization for more information.
Fingerprint: Whether to use fingerprinting or not.
DoTestNRSNEns: If True, repeat the experiments from the WORC paper to check the performance of various N_RS, N_Ens and advanced ensembling combinations.

Defaults and Options:

cross_validation: True (options: True, False)
Segmentix: True (options: True, False)
FeatureCalculators: [predict/CalcFeatures:1.0, pyradiomics/Pyradiomics:1.0] (options: predict/CalcFeatures:1.0, pyradiomics/Pyradiomics:1.0, pyradiomics/CF_pyradiomics:1.0, your own tool reference)
Preprocessing: worc/PreProcess:1.0 (options: worc/PreProcess:1.0, your own tool reference)
RegistrationNode: elastix4.8/Elastix:4.8 (options: elastix4.8/Elastix:4.8, your own tool reference)
TransformationNode: elastix4.8/Transformix:4.8 (options: elastix4.8/Transformix:4.8, your own tool reference)
Joblib_ncores: 1 (options: Integer > 0)
Joblib_backend: threading (options: multiprocessing, threading)
tempsave: True (options: True, False)
AssumeSameImageAndMaskMetadata: False (options: True, False)
ComBat: False (options: True, False)
Fingerprint: True (options: True, False)
DoTestNRSNEns: False (options: Boolean)
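
For example, when running on a single-core cluster node as described above, and when you want to override fingerprinted fields yourself, you could set (a sketch; adjust to your own system):

>>> config['General']['Joblib_ncores'] = '1'
>>> config['General']['Joblib_backend'] = 'threading'
>>> config['General']['Fingerprint'] = 'False'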

Labels

Set the label used for classification.

This part is quite important, as it should match your label file. Suppose the patientclass.txt file you supplied as the labels source looks like this:

Patient     Label1    Label2
patient1    1         0
patient2    2         1
patient3    1         5

You can supply a single label or multiple labels split by commas; for each label, an estimator will be fit. For example, suppose you simply want to use Label1 for classification, then set:

>>> config['Labels']['label_names'] = 'Label1'

If you want to first train a classifier on Label1 and then on Label2, set:

>>> config['Labels']['label_names'] = 'Label1, Label2'

Description:

label_names: The labels used from your label file for classification.
modus: Determine whether multilabel or singlelabel classification or regression will be performed.
url: WIP
projectID: WIP

Defaults and Options:

label_names: Label1, Label2 (options: String(s))
modus: singlelabel (options: singlelabel, multilabel)
url: WIP (options: WIP)
projectID: WIP (options: WIP)

Fingerprinting

The fingerprinting nodes are the first computational nodes: they create a fingerprint of your dataset and adjust some configuration settings accordingly, see the WORC paper.

Description:

max_num_image: Maximum number of images and segmentations to evaluate during fingerprinting, to limit the workload.

Defaults and Options:

max_num_image: 100 (options: Integer)

Preprocessing

The preprocessing node acts on the image before the feature extraction. Additionally, scans with image type CT (see later in the tutorial) provided as DICOM are scaled to Hounsfield units. For more details on the preprocessing options, please see the additional functionality chapter.

Note

As several preprocessing functions are fingerprinted, if you want to edit these configuration settings yourself, please turn off the fingerprinting; see the General section of the config.

Description:

CheckSpacing: Determine whether to check the spacing or not. If True and the spacing of the image is [1x1x1], we assume the spacing is incorrect and overwrite it using the DICOM metadata.
Clipping: Determine whether to use intensity clipping in the preprocessing of the image or not.
Clipping_Range: Lower and upper bound of the intensities to be used in clipping.
Normalize: Determine whether to use normalization in the preprocessing of the image or not.
Normalize_ROI: If a mask is supplied and this is set to True, normalize the image based on the supplied ROI. Otherwise, the full image is used for normalization using the SimpleITK Normalize function. Lastly, setting this to False results in no normalization being applied.
Method: Method used for normalization if an ROI is supplied. Currently, z-scoring or using the minimum and median of the ROI can be used.
ROIDetermine: Choose whether an ROI for normalization is provided, or Otsu thresholding is used to determine one.
ROIdilate: Determine whether the ROI has to be dilated with a disc element or not.
ROIdilateradius: Radius of the disc element to be used in the ROI dilation.
Resampling: Determine whether the image and mask will be resampled or not.
Resampling_spacing: Spacing to resample the image and mask to, if resampling is used.
BiasCorrection: Determine whether N4 bias correction will be applied or not.
BiasCorrection_Mask: Whether, within the bias correction, a mask generated through Otsu thresholding is used or not.
CheckOrientation: Determine whether to check the image orientation or not. If checked and the orientation does not equal the OrientationPrimaryAxis, the image is rotated.
OrientationPrimaryAxis: If CheckOrientation is True and the primary axis is not this one, rotate the image such that it is. Currently, only axial is supported.
HistogramEqualization: Determine whether to use histogram equalization or not.
HistogramEqualization_Alpha: Controls how much the filter acts like the classical histogram equalization method, see https://simpleitk.org/doxygen/latest/html/classitk_1_1simple_1_1AdaptiveHistogramEqualizationImageFilter.html.
HistogramEqualization_Beta: Controls how much the filter acts like an unsharp mask, see https://simpleitk.org/doxygen/latest/html/classitk_1_1simple_1_1AdaptiveHistogramEqualizationImageFilter.html.
HistogramEqualization_Radius: Controls the window size, see https://simpleitk.org/doxygen/latest/html/classitk_1_1simple_1_1AdaptiveHistogramEqualizationImageFilter.html.

Defaults and Options:

CheckSpacing: False (options: True, False)
Clipping: False (options: True, False)
Clipping_Range: -1000.0, 3000.0 (options: Float, Float)
Normalize: True (options: True, False)
Normalize_ROI: Full (options: True, False, Full)
Method: z_score (options: z_score, minmed)
ROIDetermine: Provided (options: Provided, Otsu)
ROIdilate: False (options: True, False)
ROIdilateradius: 10 (options: Integer > 0)
Resampling: False (options: True, False)
Resampling_spacing: 1, 1, 1 (options: Float, Float, Float)
BiasCorrection: False (options: True, False)
BiasCorrection_Mask: False (options: True, False)
CheckOrientation: False (options: True, False)
OrientationPrimaryAxis: axial (options: axial)
HistogramEqualization: False (options: True, False)
HistogramEqualization_Alpha: 0.3 (options: Float)
HistogramEqualization_Beta: 0.3 (options: Float)
HistogramEqualization_Radius: 5 (options: Float)
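
As an illustration, the following sketch enables intensity clipping with a custom range; since several preprocessing fields are fingerprinted, fingerprinting is turned off first so the custom values are actually used (values are examples, not recommendations):

>>> config['General']['Fingerprint'] = 'False'
>>> config['Preprocessing']['Clipping'] = 'True'
>>> config['Preprocessing']['Clipping_Range'] = '-1000.0, 3000.0'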

Segmentix

These fields are only important if you specified using the segmentix tool in the general configuration.

Description:

mask: If None, masks are not used by segmentix. If a mask is supplied, determines whether the mask is subtracted from the contour or multiplied with it.
segtype: If Ring, a ring around the segmentation will be used as contour. If Dilate, the segmentation will be dilated per 2-D axial slice with a disc.
segradius: Define the radius of the ring or disc used if segtype is Ring or Dilate, respectively.
N_blobs: How many of the largest blobs are extracted from the segmentation. If None, no blob extraction is used.
fillholes: Determines whether hole filling will be used.
remove_small_objects: Determines whether small objects will be removed.
min_object_size: Minimum size of objects in voxels to be kept when small objects are removed.

Defaults and Options:

mask: None (options: None, subtract, multiply)
segtype: None (options: None, Ring, Dilate)
segradius: 5 (options: Integer > 0)
N_blobs: 1 (options: Integer > 0)
fillholes: True (options: True, False)
remove_small_objects: False (options: True, False)
min_object_size: 2 (options: Integer > 0)
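
For example, a sketch that enables segmentix in the general configuration and uses a ring with a radius of 5 around the segmentation as contour:

>>> config['General']['Segmentix'] = 'True'
>>> config['Segmentix']['segtype'] = 'Ring'
>>> config['Segmentix']['segradius'] = '5'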

ImageFeatures

If using the PREDICT toolbox for feature extraction, you can specify some settings for the feature computation here. You can also select whether certain features are computed or not.

Description:

shape: Determine whether shape features are computed or not.
histogram: Determine whether histogram features are computed or not.
orientation: Determine whether orientation features are computed or not.
texture_Gabor: Determine whether Gabor texture features are computed or not.
texture_LBP: Determine whether LBP texture features are computed or not.
texture_GLCM: Determine whether GLCM texture features are computed or not.
texture_GLCMMS: Determine whether GLCM multislice texture features are computed or not.
texture_GLRLM: Determine whether GLRLM texture features are computed or not.
texture_GLSZM: Determine whether GLSZM texture features are computed or not.
texture_NGTDM: Determine whether NGTDM texture features are computed or not.
coliage: Determine whether coliage features are computed or not.
vessel: Determine whether vessel features are computed or not.
log: Determine whether LoG features are computed or not.
phase: Determine whether local phase features are computed or not.
image_type: Modality of the supplied images; determines how the image is loaded. Mandatory to supply by the user. Should be one of the valid quantitative modalities [CT, PET, Thermography, ADC, MG] or qualitative modalities [MRI, MR, DWI, US].
extraction_mode: Determine how to extract the features: 2D if your masks and/or images have only one 2-D slice, 3D for true 3-D images, 2.5D for 3-D images processed in a slice-by-slice stacked 2-D manner. The latter is recommended when the slice thickness is much larger (more than 2x) than the pixel spacing.
gabor_frequencies: Frequencies of the Gabor filters used: can be a single float or a list.
gabor_angles: Angles of the Gabor filters in degrees: can be a single integer or a list.
GLCM_angles: Angles used in the GLCM computation in radians: can be a single float or a list.
GLCM_levels: Number of grayscale levels used in discretization before GLCM computation.
GLCM_distances: Distance(s) used in the GLCM computation in pixels: can be a single integer or a list.
LBP_radius: Radii used for the LBP computation: can be a single integer or a list.
LBP_npoints: Number(s) of points used in the LBP computation: can be a single integer or a list.
phase_minwavelength: Minimal wavelength in pixels used for the phase features.
phase_nscale: Number of scales used in the phase feature computation.
log_sigma: Standard deviation(s) in pixels used in the LoG feature computation: can be a single integer or a list.
vessel_scale_range: Scale in pixels used for the Frangi vessel filter, given as a minimum and a maximum.
vessel_scale_step: Step size used to go from the minimum to the maximum scale in the Frangi vessel filter.
vessel_radius: Radius determining the boundary between the inner part and the edge in the Frangi vessel filter.
dicom_feature_tags: DICOM tags to be extracted as features. See https://worc.readthedocs.io/en/latest/static/features.html.
dicom_feature_labels: For each of the DICOM tag values extracted, the name that should be assigned to the feature. See https://worc.readthedocs.io/en/latest/static/features.html.

Defaults and Options:

shape: True (options: True, False)
histogram: True (options: True, False)
orientation: True (options: True, False)
texture_Gabor: True (options: True, False)
texture_LBP: True (options: True, False)
texture_GLCM: True (options: True, False)
texture_GLCMMS: True (options: True, False)
texture_GLRLM: False (options: True, False)
texture_GLSZM: False (options: True, False)
texture_NGTDM: False (options: True, False)
coliage: False (options: True, False)
vessel: True (options: True, False)
log: True (options: True, False)
phase: True (options: True, False)
image_type: no default, mandatory to supply (options: String)
extraction_mode: 2.5D (options: String: 2D, 2.5D or 3D)
gabor_frequencies: 0.05, 0.2, 0.5 (options: Float(s))
gabor_angles: 0, 45, 90, 135 (options: Integer(s))
GLCM_angles: 0, 0.79, 1.57, 2.36 (options: Float(s))
GLCM_levels: 16 (options: Integer > 0)
GLCM_distances: 1, 3 (options: Integer(s) > 0)
LBP_radius: 3, 8, 15 (options: Integer(s) > 0)
LBP_npoints: 12, 24, 36 (options: Integer(s) > 0)
phase_minwavelength: 3 (options: Integer > 0)
phase_nscale: 5 (options: Integer > 0)
log_sigma: 1, 5, 10 (options: Integer(s))
vessel_scale_range: 1, 10 (options: Two Integers: min and max)
vessel_scale_step: 2 (options: Integer > 0)
vessel_radius: 5 (options: Integer > 0)
dicom_feature_tags: 0010 1010, 0010 0040 (options: DICOM tag keys, e.g. 0010 0010, separated by commas)
dicom_feature_labels: age, sex (options: List of strings)
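
Since image_type is mandatory, a minimal sketch for a CT study that additionally switches to full 3-D extraction:

>>> config['ImageFeatures']['image_type'] = 'CT'
>>> config['ImageFeatures']['extraction_mode'] = '3D'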

PyRadiomics

If using the PyRadiomics toolbox, you can specify some settings for the feature computation here. For more information, see https://pyradiomics.readthedocs.io/en/latest/customization.html.

Description:

geometryTolerance: See https://pyradiomics.readthedocs.io/en/latest/customization.html.
normalize: See https://pyradiomics.readthedocs.io/en/latest/customization.html.
normalizeScale: See https://pyradiomics.readthedocs.io/en/latest/customization.html.
resampledPixelSpacing: See https://pyradiomics.readthedocs.io/en/latest/customization.html.
interpolator: See https://pyradiomics.readthedocs.io/en/latest/customization.html?highlight=sitkbspline#feature-extractor-level.
preCrop: See https://pyradiomics.readthedocs.io/en/latest/customization.html.
binCount: We advise using a fixed bin count instead of a fixed bin width, as on imaging modalities such as MR the scale of the values varies a lot, which is incompatible with a fixed bin width. See https://pyradiomics.readthedocs.io/en/latest/customization.html.
binWidth: See https://pyradiomics.readthedocs.io/en/latest/customization.html.
force2D: See https://pyradiomics.readthedocs.io/en/latest/customization.html.
force2Ddimension: See https://pyradiomics.readthedocs.io/en/latest/customization.html.
voxelArrayShift: See https://pyradiomics.readthedocs.io/en/latest/customization.html.
Original: Enable/disable computation of original image features.
Wavelet: Enable/disable computation of wavelet image features.
LoG: Enable/disable computation of Laplacian of Gaussian (LoG) image features.
label: "Intensity" of the pixels in the mask to be used for feature extraction. If using segmentix, use 1, as your mask will be boolean. Otherwise, select the integer(s) corresponding to the ROI in your mask.
extract_firstorder: Determine whether first-order features are computed or not.
extract_shape: Determine whether shape features are computed or not.
texture_GLCM: Determine whether GLCM features are computed or not.
texture_GLRLM: Determine whether GLRLM features are computed or not.
texture_GLSZM: Determine whether GLSZM features are computed or not.
texture_GLDM: Determine whether GLDM features are computed or not.
texture_NGTDM: Determine whether NGTDM features are computed or not.

Defaults and Options:

geometryTolerance: 0.0001 (options: Float)
normalize: False (options: Boolean)
normalizeScale: 100 (options: Integer)
resampledPixelSpacing: None (options: Float, Float, Float)
interpolator: sitkBSpline (options: see https://pyradiomics.readthedocs.io/en/latest/customization.html?highlight=sitkbspline#feature-extractor-level)
preCrop: True (options: True, False)
binCount: 16 (options: Integer or None)
binWidth: None (options: Integer or None)
force2D: False (options: True, False)
force2Ddimension: 0 (options: 0 = axial, 1 = coronal, 2 = sagittal)
voxelArrayShift: 300 (options: Integer)
Original: True (options: True, False)
Wavelet: False (options: True, False)
LoG: False (options: True, False)
label: 1 (options: Integer)
extract_firstorder: False (options: True, False)
extract_shape: True (options: True, False)
texture_GLCM: False (options: True, False)
texture_GLRLM: True (options: True, False)
texture_GLSZM: True (options: True, False)
texture_GLDM: True (options: True, False)
texture_NGTDM: True (options: True, False)
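
For example, to additionally compute PyRadiomics wavelet and first-order features (a sketch):

>>> config['PyRadiomics']['Wavelet'] = 'True'
>>> config['PyRadiomics']['extract_firstorder'] = 'True'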

ComBat

If using the ComBat toolbox, you can specify some settings for the feature harmonization here. For more information, see https://github.com/Jfortin1/ComBatHarmonization.

Description:

language: Name of the software implementation to use.
batch: Name of the batch variable, i.e. the variable to correct for.
mod: Name of the moderation variable(s), i.e. the variables for which the variation in the features will be preserved.
par: Either use the parametric (1) or the non-parametric (0) version of ComBat.
eb: Either use the empirical Bayes (1) or the simple mean-shifting (0) version of ComBat.
per_feature: Either use ComBat for all features combined (0) or per feature (1). In the latter case, if eb=1, a second feature equal to the single feature plus random noise is added.
excluded_features: Provide substrings of the labels of features which should be excluded from ComBat. Recommended for features unaffected by the batch variable.
matlab: If using Matlab, path to the Matlab executable.

Defaults and Options:

language: python (options: python, matlab)
batch: Hospital (options: String)
mod: [] (options: String(s), or [])
par: 1 (options: 0 or 1)
eb: 1 (options: 0 or 1)
per_feature: 0 (options: 0 or 1)
excluded_features: sf_, of_, semf_, pf_ (options: List of strings, comma separated)
matlab: C:\Program Files\MATLAB\R2015b\bin\matlab.exe (options: String)
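
A minimal sketch to enable ComBat, assuming your label file contains a Hospital label to correct for and an Age label whose variation should be preserved:

>>> config['General']['ComBat'] = 'True'
>>> config['ComBat']['batch'] = 'Hospital'
>>> config['ComBat']['mod'] = 'Age'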

FeatPreProcess

Before the features are given to the classification function, and thus the hyperoptimization, they can be preprocessed as follows.

Description:

Use: If True, use the feature preprocessor in the classify node. Currently, this excludes features with more than 80% NaN values.
Combine: If True, features of multiple objects (e.g. lesions) of the same patient are combined.
Combine_method: If features of multiple objects are combined, this determines the method. Currently included options are mean and max.

Defaults and Options:

Use: False (options: Boolean)
Combine: False (options: Boolean)
Combine_method: mean (options: mean or max)

OneHotEncoding

Optionally, you can use OneHotEncoding on specific features. For more information on why and how this is done, see https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html. By default, this is not done, as WORC does not know for which specific features you would like to do this.

Description:

Use: If True, use OneHotEncoding for specific features as determined by the field below.
feature_labels_tofit: Labels of the features for which to use OneHotEncoding. WORC checks whether any of the values specified in this field is a substring of a feature name. For example, if you give glcm, all features with glcm in the feature label will be one-hot encoded.

Defaults and Options:

Use: False (options: Boolean(s))
feature_labels_tofit: empty by default, as this is dataset specific (options: List of strings)

Imputation

These settings are used for feature imputation. Note that these settings are actually used in the hyperparameter optimization. Hence, you can provide multiple values per field, from which random samples will be drawn; the best setting in combination with the other hyperparameters is finally selected.

Description:

use: If True, use feature imputation methods to replace NaN values. If False, all NaN features will be set to zero.
strategy: Method to be used for imputation.
n_neighbors: When using k-nearest neighbors (kNN) for feature imputation, determines the number of neighbors used for imputation. Can be a single integer or a list.
skipallNaN: If True, a feature that is NaN for all objects/patients is simply removed for all patients.

Defaults and Options:

use: True (options: Boolean(s))
strategy: mean, median, most_frequent, constant, knn (options: mean, median, most_frequent, constant, knn)
n_neighbors: 5, 5 (options: Two Integers: loc and scale)
skipallNaN: True (options: Boolean(s))
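
For example, to let the optimization choose between only mean and kNN imputation (a sketch):

>>> config['Imputation']['use'] = 'True'
>>> config['Imputation']['strategy'] = 'mean, knn'
>>> config['Imputation']['n_neighbors'] = '5, 5'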

FeatureScaling

Determines which method is applied to scale each feature.

Description:

scaling_method: Determine the scaling method.
skip_features: Determine which features should be skipped. This field should contain a comma-separated list of substrings: when one or more of these is in a feature name, the feature is skipped.

Defaults and Options:

scaling_method: robust_z_score (options: robust_z_score, z_score, robust, minmax, log_z_score, None)
skip_features: semf_, pf_ (options: Comma separated list of strings)

Featsel

Define feature selection methods. Note that these settings are actually used in the hyperparameter optimization. Hence, you can provide multiple values per field, from which random samples will be drawn; the best setting in combination with the other hyperparameters is finally selected. Again, these should be formatted as a string containing the actual values, e.g. value1, value2.

Description:

Variance: Percentage of times features with a variance < 0.01 are excluded. Based on sklearn's VarianceThreshold, see https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html.
GroupwiseSearch: Randomly select which feature groups to use. The parameters are determined by the SelectFeatGroup part of the config, see below.
SelectFromModel: Percentage of times features are selected by first training a machine learning model that can rank the features by importance. See also sklearn's SelectFromModel.
SelectFromModel_estimator: Machine learning model / estimator used: can be LASSO, LogisticRegression, or a random forest.
SelectFromModel_lasso_alpha: When using LASSO, search space of the weight of the L1 term, see also https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html.
SelectFromModel_n_trees: When using a random forest, search space of the number of trees used.
UsePCA: Percentage of times Principal Component Analysis (PCA) is used to select features.
PCAType: Method to select the number of PCA components: either the number of components that explains 95% of the variance, or a fixed number of components.
StatisticalTestUse: Percentage of times a statistical test is used to select features.
StatisticalTestMetric: Define the type of statistical test to be used.
StatisticalTestThreshold: Specify a threshold for the p-value used in the statistical test to select features. The first element defines the lower boundary, the other the upper boundary. Random sampling will occur between the boundaries.
ReliefUse: Percentage of times Relief is used to select features.
ReliefNN: Min and max of the search range for the number of nearest neighbors in Relief.
ReliefSampleSize: Min and max of the search range for the sample size in Relief.
ReliefDistanceP: Min and max of the search range for the positive distance in Relief.
ReliefNumFeatures: Min and max of the search range for the number of features selected in Relief.
RFE: Percentage of times recursive feature elimination (RFE) is used to select features.
RFE_estimator: Machine learning model / estimator used: can be LASSO, LogisticRegression, or a random forest.
RFE_lasso_alpha: When using LASSO, search space of the weight of the L1 term, see also https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html.
RFE_n_trees: When using a random forest, search space of the number of trees used.
RFE_n_features_to_select: Number of features to select. Since we use sklearn < 0.24, this currently has to be an integer, not a float representing a fraction of the features.
RFE_step: Number of features eliminated per step.

Defaults and Options:

Variance: 1.0 (options: Float)
GroupwiseSearch: True (options: Boolean(s))
SelectFromModel: 0.275 (options: Float)
SelectFromModel_estimator: Lasso, LR, RF (options: Lasso, LR, RF)
SelectFromModel_lasso_alpha: 0.1, 1.4 (options: Two Floats: loc and scale)
SelectFromModel_n_trees: 10, 90 (options: Two Integers: loc and scale)
UsePCA: 0.275 (options: Float)
PCAType: 95variance, 10, 50, 100 (options: Integer(s), 95variance)
StatisticalTestUse: 0.275 (options: Float)
StatisticalTestMetric: MannWhitneyU (options: ttest, Welch, Wilcoxon, MannWhitneyU)
StatisticalTestThreshold: -3, 2.5 (options: Two Integers: loc and scale)
ReliefUse: 0.275 (options: Float)
ReliefNN: 2, 4 (options: Two Integers: loc and scale)
ReliefSampleSize: 0.75, 0.2 (options: Two Floats: loc and scale)
ReliefDistanceP: 1, 3 (options: Two Integers: loc and scale)
ReliefNumFeatures: 10, 40 (options: Two Integers: loc and scale)
RFE: 0.0 (options: Float)
RFE_estimator: Lasso, LR, RF (options: Lasso, LR, RF)
RFE_lasso_alpha: 0.1, 1.4 (options: Two Floats: loc and scale)
RFE_n_trees: 10, 90 (options: Two Integers: loc and scale)
RFE_n_features_to_select: 10, 90 (options: Two Integers: loc and scale)
RFE_step: 1, 9 (options: Two Integers: loc and scale)
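
As these fields are hyperparameters, you can steer the search. For example, a sketch that disables PCA and statistical testing while always applying the variance threshold:

>>> config['Featsel']['UsePCA'] = '0.0'
>>> config['Featsel']['StatisticalTestUse'] = '0.0'
>>> config['Featsel']['Variance'] = '1.0'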

SelectFeatGroup

If the PREDICT and/or PyRadiomics feature computation tools are used, you can perform a gridsearch over the various feature groups to find the optimal combination. Here, you determine which groups can be selected.

Description:

shape_features: If True, use shape features in the model.
histogram_features: If True, use histogram features in the model.
orientation_features: If True, use orientation features in the model.
texture_Gabor_features: If True, use Gabor texture features in the model.
texture_GLCM_features: If True, use GLCM texture features in the model.
texture_GLDM_features: If True, use GLDM texture features in the model.
texture_GLCMMS_features: If True, use GLCM multislice texture features in the model.
texture_GLRLM_features: If True, use GLRLM texture features in the model.
texture_GLSZM_features: If True, use GLSZM texture features in the model.
texture_GLDZM_features: If True, use GLDZM texture features in the model.
texture_NGTDM_features: If True, use NGTDM texture features in the model.
texture_NGLDM_features: If True, use NGLDM texture features in the model.
texture_LBP_features: If True, use LBP texture features in the model.
dicom_features: If True, use DICOM features in the model.
semantic_features: If True, use semantic features in the model.
coliage_features: If True, use coliage features in the model.
vessel_features: If True, use vessel features in the model.
phase_features: If True, use phase features in the model.
fractal_features: If True, use fractal features in the model.
location_features: If True, use location features in the model.
rgrd_features: If True, use rgrd features in the model.
toolbox: List of names of the toolboxes to be used, or All.
original_features: If True, use original features in the model.
wavelet_features: If True, use wavelet features in the model.
log_features: If True, use log features in the model.

Defaults and Options:

shape_features: True, False (options: Boolean(s))
histogram_features: True, False (options: Boolean(s))
orientation_features: True, False (options: Boolean(s))
texture_Gabor_features: True, False (options: Boolean(s))
texture_GLCM_features: True, False (options: Boolean(s))
texture_GLDM_features: True, False (options: Boolean(s))
texture_GLCMMS_features: True, False (options: Boolean(s))
texture_GLRLM_features: True, False (options: Boolean(s))
texture_GLSZM_features: True, False (options: Boolean(s))
texture_GLDZM_features: True, False (options: Boolean(s))
texture_NGTDM_features: True, False (options: Boolean(s))
texture_NGLDM_features: True, False (options: Boolean(s))
texture_LBP_features: True, False (options: Boolean(s))
dicom_features: False (options: Boolean(s))
semantic_features: False (options: Boolean(s))
coliage_features: False (options: Boolean(s))
vessel_features: True, False (options: Boolean(s))
phase_features: True, False (options: Boolean(s))
fractal_features: True, False (options: Boolean(s))
location_features: True, False (options: Boolean(s))
rgrd_features: True, False (options: Boolean(s))
toolbox: All, PREDICT, PyRadiomics (options: All, or name of toolbox (PREDICT, PyRadiomics))
original_features: True (options: Boolean(s))
wavelet_features: True, False (options: Boolean(s))
log_features: True, False (options: Boolean(s))

Resampling

Before performing the hyperoptimization, you can use various resampling techniques to resample the data (under-sampling, over-sampling, or both). All methods are adopted from imbalanced-learn.

Description:

Use: Percentage of times object (e.g. patient) resampling is used.
Method: One of the adopted methods, see also https://imbalanced-learn.readthedocs.io/en/stable/api/.
sampling_strategy: Sampling strategy, see also https://imbalanced-learn.readthedocs.io/en/stable/api/.
n_neighbors: Number of neighbors used in resampling. This should be (much) smaller than the number of objects/patients you supply. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
k_neighbors: Number of neighbors used in resampling. This should be (much) smaller than the number of objects/patients you supply. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
threshold_cleaning: Threshold for the cleaning of samples. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).

Defaults and Options:

Use: 0.20 (options: Float)
Method: RandomUnderSampling, RandomOverSampling, NearMiss, NeighbourhoodCleaningRule, ADASYN, BorderlineSMOTE, SMOTE, SMOTEENN, SMOTETomek (options: RandomUnderSampling, RandomOverSampling, NearMiss, NeighbourhoodCleaningRule, ADASYN, BorderlineSMOTE, SMOTE, SMOTEENN, SMOTETomek)
sampling_strategy: auto, majority, minority, not minority, not majority, all (options: auto, majority, minority, not minority, not majority, all)
n_neighbors: 3, 12 (options: Two Integers: loc and scale)
k_neighbors: 5, 15 (options: Two Integers: loc and scale)
threshold_cleaning: 0.25, 0.5 (options: Two Floats: loc and scale)

Classification

Determine settings for the classification in the hyperoptimization. Most of the classifiers are implemented using sklearn; hence descriptions of the hyperparameters can also be found there.

Defaults for XGB are based on https://towardsdatascience.com/doing-xgboost-hyper-parameter-tuning-the-smart-way-part-1-of-2-f6d255a45dde and https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

Note that, as XGB and AdaBoost take significantly longer to fit (roughly 3x), they are picked less often by default.

Description:

fastr: Use fastr for the optimization gridsearch (recommended on clusters, default) or, if set to False, joblib (recommended for PCs, but not on Windows).
fastr_plugin: Name of the execution plugin to be used. By default, the same plugin as the self.fastr_plugin of the WORC object is used.
classifiers: Select the estimator(s) to use. Most are implemented using sklearn. For abbreviations, see the options: e.g. LR = logistic regression.
max_iter: Maximum number of iterations to use in training an estimator. Only for specific estimators, see sklearn.
SVMKernel: When using an SVM, specify the kernel type.
SVMC: Range of the SVM slack parameter. We sample on a uniform log scale: the parameters specify the range of the exponent (loc, loc + scale).
SVMdegree: Range of the SVM polynomial degree when using a polynomial kernel. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
SVMcoef0: Range of the SVM homogeneity parameter. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
SVMgamma: Range of the SVM gamma parameter. We sample on a uniform log scale: the parameters specify the range of the exponent (loc, loc + scale).
RFn_estimators: Range of the number of trees in an RF. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
RFmin_samples_split: Range of the minimum number of samples required to split a branch in an RF. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
RFmax_depth: Range of the maximum depth of an RF. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
LRpenalty: Penalty term used in LR.
LRC: Range of the regularization strength in LR. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
LR_solver: Solver used in LR.
LR_l1_ratio: Ratio between the l1 and l2 penalty when using the elasticnet penalty, see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.
LDA_solver: Solver used in LDA.
LDA_shrinkage: Range of the LDA shrinkage parameter. We sample on a uniform log scale: the parameters specify the range of the exponent (loc, loc + scale).
QDA_reg_param: Range of the QDA regularization parameter. We sample on a uniform log scale: the parameters specify the range of the exponent (loc, loc + scale).
ElasticNet_alpha: Range of the ElasticNet penalty parameter. We sample on a uniform log scale: the parameters specify the range of the exponent (loc, loc + scale).
ElasticNet_l1_ratio: Range of the l1 ratio in ElasticNet. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
SGD_alpha: Range of the SGD penalty parameter. We sample on a uniform log scale: the parameters specify the range of the exponent (loc, loc + scale).
SGD_l1_ratio: Range of the l1 ratio in SGD. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
SGD_loss: Loss function of SGD.
SGD_penalty: Penalty term in SGD.
CNB_alpha: Regularization strength in ComplementNB. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
AdaBoost_n_estimators: Number of estimators used in AdaBoost. Default is equal to config['Classification']['RFn_estimators'].
AdaBoost_learning_rate: Learning rate in AdaBoost.
XGB_boosting_rounds: Number of estimators / boosting rounds used in XGB. Default is equal to config['Classification']['RFn_estimators'].
XGB_max_depth: Maximum depth of XGB.
XGB_learning_rate: Learning rate of XGB. Default is equal to config['Classification']['AdaBoost_learning_rate'].
XGB_gamma: Gamma of XGB.
XGB_min_child_weight: Minimum child weights in XGB.
XGB_colsample_bytree: Column sample by tree in XGB.
LightGBM_num_leaves: Maximum tree leaves for base learners. See also https://lightgbm.readthedocs.io/en/latest/Parameters.html.
LightGBM_max_depth: Maximum tree depth for base learners. See also https://lightgbm.readthedocs.io/en/latest/Parameters.html.
LightGBM_min_child_samples: Minimum number of data needed in a child (leaf). See also https://lightgbm.readthedocs.io/en/latest/Parameters.html.
LightGBM_reg_alpha: L1 regularization term on weights. See also https://lightgbm.readthedocs.io/en/latest/Parameters.html.
LightGBM_reg_lambda: L2 regularization term on weights. See also https://lightgbm.readthedocs.io/en/latest/Parameters.html.
LightGBM_min_child_weight: Minimum sum of instance weight (hessian) needed in a child (leaf). See also https://lightgbm.readthedocs.io/en/latest/Parameters.html.

Defaults and Options:

fastr: True (options: True, False)
fastr_plugin: LinearExecution (options: any fastr execution plugin)
classifiers: SVM, RF, LR, LDA, QDA, GaussianNB, AdaBoostClassifier, XGBClassifier (options: SVM, SVR, SGD, SGDR, RF, LDA, QDA, ComplementNB, GaussianNB, AdaBoostClassifier, XGBClassifier, LR, RFR, Lasso, ElasticNet, LinR, Ridge, AdaBoostRegressor, XGBRegressor; all are estimators from sklearn)
max_iter: 100000 (options: Integer)
SVMKernel: linear, poly, rbf (options: poly, linear, rbf)
SVMC: 0, 6 (options: Two Integers: loc and scale)
SVMdegree: 1, 6 (options: Two Integers: loc and scale)
SVMcoef0: 0, 1 (options: Two Integers: loc and scale)
SVMgamma: -5, 5 (options: Two Integers: loc and scale)
RFn_estimators: 10, 90 (options: Two Integers: loc and scale)
RFmin_samples_split: 2, 3 (options: Two Integers: loc and scale)
RFmax_depth: 5, 5 (options: Two Integers: loc and scale)
LRpenalty: l1, l2, elasticnet (options: none, l1, l2, elasticnet)
LRC: 0.01, 0.99 (options: Two Floats: loc and scale)
LR_solver: lbfgs, saga (options: comma-separated list of strings; for the options, see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
LR_l1_ratio: 0, 1 (options: Float between 0.0 and 1.0)
LDA_solver: svd, lsqr, eigen (options: svd, lsqr, eigen)
LDA_shrinkage: -5, 5 (options: Two Integers: loc and scale)
QDA_reg_param: -5, 5 (options: Two Integers: loc and scale)
ElasticNet_alpha: -5, 5 (options: Two Integers: loc and scale)
ElasticNet_l1_ratio: 0, 1 (options: Two Integers: loc and scale)
SGD_alpha: -5, 5 (options: Two Integers: loc and scale)
SGD_l1_ratio: 0, 1 (options: Two Integers: loc and scale)
SGD_loss: squared_loss, huber, epsilon_insensitive, squared_epsilon_insensitive (options: hinge, squared_hinge, modified_huber)
SGD_penalty: none, l2, l1 (options: none, l2, l1)
CNB_alpha: 0, 1 (options: Two Integers: loc and scale)
AdaBoost_n_estimators: 10, 90 (options: Two Integers: loc and scale)
AdaBoost_learning_rate: 0.01, 0.99 (options: Two Floats: loc and scale)
XGB_boosting_rounds: 10, 90 (options: Two Integers: loc and scale)
XGB_max_depth: 3, 12 (options: Two Integers: loc and scale)
XGB_learning_rate: 0.01, 0.99 (options: Two Floats: loc and scale)
XGB_gamma: 0.01, 9.99 (options: Two Floats: loc and scale)
XGB_min_child_weight: 1, 6 (options: Two Integers: loc and scale)
XGB_colsample_bytree: 0.3, 0.7 (options: Two Floats: loc and scale)
LightGBM_num_leaves: 5, 95 (options: Two Integers: loc and scale)
LightGBM_max_depth: 3, 12 (options: Two Integers: loc and scale)
LightGBM_min_child_samples: 5, 45 (options: Two Integers: loc and scale)
LightGBM_reg_alpha: 0.01, 0.99 (options: Two Floats: loc and scale)
LightGBM_reg_lambda: 0.01, 0.99 (options: Two Floats: loc and scale)
LightGBM_min_child_weight: -7, 4 (options: Two Integers: loc and scale)
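
For example, a sketch restricting the search to an SVM with an rbf kernel, keeping the default slack range:

>>> config['Classification']['classifiers'] = 'SVM'
>>> config['Classification']['SVMKernel'] = 'rbf'
>>> config['Classification']['SVMC'] = '0, 6'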

CrossValidation

When using cross validation, specify the following settings.

Description:

Type: If performing a cross-validation, the type of cross-validation used. Currently, random-splitting and leave-one-out (LOO) are supported.
N_iterations: Number of times the data is split into training and test sets in the outer cross-validation when using random-splitting.
test_size: The percentage of data to be used for testing when using random-splitting.
fixed_seed: If True, use a fixed seed for the cross-validation splits when using random-splitting.

Defaults and Options:

Type: random_split (options: random_split, LOO)
N_iterations: 100 (options: Integer)
test_size: 0.2 (options: Float)
fixed_seed: False (options: Boolean)
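
For example, to use 50 random splits with 30% of the data for testing (a sketch):

>>> config['CrossValidation']['N_iterations'] = '50'
>>> config['CrossValidation']['test_size'] = '0.3'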

HyperOptimization

Specify the hyperparameter optimization procedure here.

Description:

scoring_method: Specify the optimization metric for your hyperparameter search.
test_size: Size of the test set in the hyperoptimization cross-validation, given as a percentage of the whole dataset.
n_splits: Number of iterations in the train-validation cross-validation used for model optimization.
N_iterations: Number of iterations used in the hyperparameter optimization. This corresponds to the number of samples drawn from the parameter grid.
n_jobspercore: Number of jobs assigned to a single core. Only used if fastr is set to True in the classification config.
maxlen: Number of estimators for which the fitted outcomes and parameters are saved. Increasing this number will increase the memory usage.
ranking_score: Score used for ranking the performance of the evaluated workflows.
memory: When using the DRMAA plugin, e.g. on the BIGR cluster, memory usage of a single optimization job. Should be a string consisting of an integer + "G".
refit_training_workflows: If True, refit all workflows trained on the full training dataset automatically during training. This will save time when performing inference, but will take more time during training and make the saved model much larger.
refit_validation_workflows: If True, refit all workflows trained on the train-validation training dataset automatically during training. This will save time when performing validation evaluation, but will take more time during training and make the saved model much larger.
fix_random_seed: If True, a fixed random seed is used during training for all fitted methods that have a random seed. In this way, if you run the experiment again, you get exactly the same result.

Defaults and Options:

scoring_method: f1_weighted (options: manual WORC metrics: f1_weighted_predictproba, average_precision_weighted, gmean; other accepted values are any sklearn metric)
test_size: 0.2 (options: Float)
n_splits: 5 (options: Integer)
N_iterations: 1000 (options: Integer)
n_jobspercore: 200 (options: Integer)
maxlen: 100 (options: Integer)
ranking_score: test_score (options: String)
memory: 3G (options: String consisting of an integer + "G")
refit_training_workflows: False (options: Boolean)
refit_validation_workflows: False (options: Boolean)
fix_random_seed: False (options: Boolean)
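
For a quick test run, you could lower the optimization budget (a sketch; do not use such small values for an actual experiment):

>>> config['HyperOptimization']['N_iterations'] = '100'
>>> config['HyperOptimization']['n_jobspercore'] = '10'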

SMAC

WORC enables the use of the SMAC algorithm for the hyperparameter optimization. SMAC uses the same parameter options as the default random search, except for resampling which is currently not compatible with SMAC.

Description:

use: If True, use SMAC as the optimization strategy.
n_smac_cores: Number of independent, parallel SMAC instances to use.
budget_type: Type of budget to use for the SMAC optimization: either an evaluation limit or a time limit.
budget: Size of the budget, which depends on the budget type: the number of evaluations for an evaluation limit, or wallclock seconds for a time limit.
init_method: Initialization method of SMAC. Supported are a random initialization or a Sobol sequence.
init_budget: Number of evaluations used for the initialization. Always an evaluation limit, regardless of the budget type chosen for the optimization.

Defaults and Options:

use: False (options: True, False)
n_smac_cores: 1 (options: Integer)
budget_type: evals (options: evals, time)
budget: 100 (options: Integer)
init_method: random (options: random, sobol)
init_budget: 20 (options: Integer)
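
A sketch enabling SMAC with a time budget of one hour (3600 wallclock seconds):

>>> config['SMAC']['use'] = 'True'
>>> config['SMAC']['budget_type'] = 'time'
>>> config['SMAC']['budget'] = '3600'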

Ensemble

WORC supports ensembling of workflows. This is not a default approach in radiomics; if you do not wish to use an ensemble, use the Single method or top_N with a size of 1.

Description:

Method: Choose which ensemble method to use. If you do not wish to use an ensemble, use Single or top_N with size 1.
Size: Number of estimators to use in the ensemble for the top_N method, or the number of bags for the Bagging method.
Metric: Metric used to determine the ranking of estimators in the ensemble. When using Default, the metric used in the hyperoptimization is used.

Defaults and Options:

Method: top_N (options: Single, top_N, FitNumber, ForwardSelection, Caruana, Bagging)
Size: 100 (options: Integer)
Metric: Default (options: Default, generalization)
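
For example, to select only the single best-performing workflow instead of an ensemble (a sketch):

>>> config['Ensemble']['Method'] = 'top_N'
>>> config['Ensemble']['Size'] = '1'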

Evaluation

In the evaluation of the performance, several adjustments can be made.

Description:

OverfitScaler: Whether to fit a separate scaler on the test set (which is overfitting) or to use the scaler fitted on the training dataset. Only used for experimental purposes: never overfit your scaler for the actual performance evaluation.

Defaults and Options:

OverfitScaler: False (options: True, False)

Bootstrap

Besides cross validation, WORC supports bootstrapping on the test set for performance evaluation.

Description:

Use: Determine whether to use bootstrapping or not.
N_iterations: Number of iterations to use for bootstrapping.

Defaults and Options:

Use: False (options: Boolean)
N_iterations: 10000 (options: Integer)
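
A sketch enabling bootstrapping with a reduced number of iterations:

>>> config['Bootstrap']['Use'] = 'True'
>>> config['Bootstrap']['N_iterations'] = '1000'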