Configuration¶
Introduction¶
WORC has defaults for all settings, so it can be run out of the box to test the examples. However, you may want to alter the fastr configuration to match your system, e.g. to locate your input and output folders and to set how much you want to parallelize the execution.

Fastr will search for a config file named config.py in the $FASTRHOME directory (which defaults to ~/.fastr/ if the variable is not set). Hence, if $FASTRHOME is set, ~/.fastr/ will be ignored. Additionally, .py files from the $FASTRHOME/config.d folder will be parsed as well. You will see that upon installation, WORC has already put a WORC_config.py file in the config.d folder.
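The lookup rules above can be sketched in a few lines of Python. This is a simplified illustration of the search order, not fastr's actual implementation:

```python
import os
from pathlib import Path

def fastr_config_dir(env=os.environ):
    # $FASTRHOME takes precedence; ~/.fastr/ is only used when it is not set.
    home = env.get("FASTRHOME")
    return Path(home) if home else Path.home() / ".fastr"

def config_files(env=os.environ):
    base = fastr_config_dir(env)
    files = [base / "config.py"]
    # .py files from $FASTRHOME/config.d are parsed as well,
    # e.g. the WORC_config.py that WORC installs there.
    files += sorted((base / "config.d").glob("*.py"))
    return files
```

Note that setting $FASTRHOME relocates both config.py and the config.d folder.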
As WORC and the default tools used are mostly Python based, we have chosen to put our configuration in a configparser object. This has several advantages:

- The object can be treated as a Python dictionary and thus is easily adjusted.
- Each tool can be set to parse only specific parts of the configuration, enabling us to supply one file to all tools instead of needing many parameter files.
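A minimal sketch of this idea, using hypothetical section names rather than WORC's actual schema: one configparser file serves multiple tools, and each tool reads only its own section.

```python
from configparser import ConfigParser

config = ConfigParser()
# One file holds settings for several tools; the sections here are illustrative.
config.read_string("""
[Classification]
classifiers = SVM, RF

[FeatureScaling]
scaling_method = robust_z_score
""")

# The object behaves like a dictionary of dictionaries...
assert config['Classification']['classifiers'] == 'SVM, RF'

# ...and each tool only parses its own section.
scaling_settings = dict(config['FeatureScaling'])
```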
Creation and interaction¶
The default configuration is generated through the WORC.defaultconfig() function. You can then change settings as you would in a dictionary and append the object to the configs source:

>>> network = WORC.WORC('somename')
>>> config = network.defaultconfig()
>>> config['Classification']['classifier'] = 'RF'
>>> network.configs.append(config)

When executing the WORC.set() command, the config objects are saved as .ini files in the WORC.fastr_tempdir folder and added to the WORC.fastrconfigs() source.
Below are some details on several of the fields in the configuration. Note that for many of the fields, we currently only provide one default value. However, when adding your own tools, these fields can be adjusted to your specific settings.
WORC performs Combined Algorithm Selection and Hyperparameter optimization (CASH). The configuration determines how the optimization is performed and which hyperparameters and models will be included. Repeating specific models/parameters in the config will make them more likely to be used, e.g.

>>> config['Classification']['classifiers'] = 'SVM, SVM, LR'

means that the SVM is twice as likely to be tested in the model selection as LR.
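The effect of repeating an entry can be illustrated with uniform random sampling over the listed values. This is a sketch of the principle only, not WORC's actual sampler:

```python
import random

random.seed(0)
classifiers = 'SVM, SVM, LR'.split(', ')  # ['SVM', 'SVM', 'LR']

# Uniform sampling over the list: SVM occupies 2 of the 3 slots,
# so it is drawn roughly twice as often as LR.
draws = [random.choice(classifiers) for _ in range(3000)]
svm_fraction = draws.count('SVM') / len(draws)  # close to 2/3
```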
Note
All fields in the config must be supplied as strings. A list can be created by separating values with commas, e.g. 'value1, value2'.
Contents¶
The config object can be indexed as config[key][subkey] = value. The various keys, subkeys, and values (description, defaults, and options) can be found below.
The config contains the following keys:

- Bootstrap
- Classification
- ComBat
- CrossValidation
- Ensemble
- Evaluation
- FeatPreProcess
- Featsel
- FeatureScaling
- General
- HyperOptimization
- ImageFeatures
- Imputation
- Labels
- OneHotEncoding
- Preprocessing
- PyRadiomics
- Resampling
- Segmentix
- SelectFeatGroup
Details on each section of the config can be found below.
General¶
These fields contain general settings for WORC. For more info on the Joblib settings, which are used in the Joblib Parallel function, see here. When you run WORC on a cluster where each node supports only a single core, e.g. the BIGR cluster, use 1 core and the threading backend.
Description:
| Subkey | Description |
|---|---|
| cross_validation | Determine whether a cross-validation will be performed or not. Obsolete, will be removed. |
| Segmentix | Determine whether to use the Segmentix tool for segmentation preprocessing. |
| FeatureCalculators | Specifies which feature calculation tools should be used. A list can be provided to use multiple tools. |
| Preprocessing | Specifies which tool will be used for image preprocessing. |
| RegistrationNode | Specifies which tool will be used for image registration. |
| TransformationNode | Specifies which tool will be used for applying image transformations. |
| Joblib_ncores | Number of cores to be used by joblib for multicore processing. |
| Joblib_backend | Type of backend to be used by joblib for multicore processing. |
| tempsave | Determines whether the result is saved after every cross-validation iteration, in addition to the result after all iterations. Especially useful for debugging. |
| AssumeSameImageAndMaskMetadata | Make the assumption that the image and mask have the same metadata. If True and there is a mismatch, metadata from the image will be copied to the mask. |
| ComBat | Whether to use ComBat feature harmonization on your FULL dataset, i.e. not in a train-test setting. See https://github.com/Jfortin1/ComBatHarmonization for more information. |
Defaults and Options:
| Subkey | Default | Options |
|---|---|---|
| cross_validation | True | True, False |
| Segmentix | True | True, False |
| FeatureCalculators | [predict/CalcFeatures:1.0, pyradiomics/Pyradiomics:1.0] | predict/CalcFeatures:1.0, pyradiomics/Pyradiomics:1.0, pyradiomics/CF_pyradiomics:1.0, your own tool reference |
| Preprocessing | worc/PreProcess:1.0 | worc/PreProcess:1.0, your own tool reference |
| RegistrationNode | elastix4.8/Elastix:4.8 | elastix4.8/Elastix:4.8, your own tool reference |
| TransformationNode | elastix4.8/Transformix:4.8 | elastix4.8/Transformix:4.8, your own tool reference |
| Joblib_ncores | 1 | Integer > 0 |
| Joblib_backend | threading | multiprocessing, threading |
| tempsave | False | True, False |
| AssumeSameImageAndMaskMetadata | False | True, False |
| ComBat | False | True, False |
Labels¶
Set the label used for classification.
This part is quite important, as it should match your label file. Suppose the patientclass.txt file you supplied as source for the labels looks like this:

| Patient | Label1 | Label2 |
|---|---|---|
| patient1 | 1 | 0 |
| patient2 | 2 | 1 |
| patient3 | 1 | 5 |
You can supply a single label or multiple labels separated by commas, and for each of them an estimator will be fit. For example, if you simply want to use Label1 for classification, set:

config['Labels']['label_names'] = 'Label1'

If you want to first train a classifier on Label1 and then on Label2, set:

config['Labels']['label_names'] = 'Label1, Label2'
Description:
| Subkey | Description |
|---|---|
| label_names | The labels used from your label file for classification. |
| modus | Determine whether multilabel or singlelabel classification or regression will be performed. |
| url | WIP |
| projectID | WIP |
Defaults and Options:
| Subkey | Default | Options |
|---|---|---|
| label_names | Label1, Label2 | String(s) |
| modus | singlelabel | singlelabel, multilabel |
| url | WIP | WIP |
| projectID | WIP | WIP |
Preprocessing¶
The preprocessing node acts on the image before feature extraction. Additionally, scans with imagetype CT (see later in the tutorial) provided as DICOM are scaled to Hounsfield units. For more details on the preprocessing options, please see the additional functionality chapter.
Description:
| Subkey | Description |
|---|---|
| CheckSpacing | Determine whether to check the spacing or not. If True and the spacing of the image is [1x1x1], we assume the spacing is incorrect and overwrite it using the DICOM metadata. |
| Clipping | Determine whether to use intensity clipping in preprocessing of the image or not. |
| Clipping_Range | Lower and upper bound of the intensities to be used in clipping. |
| Normalize | Determine whether to use normalization in preprocessing of the image or not. |
| Normalize_ROI | If a mask is supplied and this is set to True, normalize the image based on the supplied ROI. Otherwise, the full image is used for normalization using the SimpleITK Normalize function. Lastly, setting this to False will result in no normalization being applied. |
| Method | Method used for normalization if an ROI is supplied. Currently, z-scoring or using the minimum and median of the ROI can be used. |
| ROIDetermine | Choose whether an ROI for normalization is provided, or Otsu thresholding is used to determine one. |
| ROIdilate | Determine whether the ROI has to be dilated with a disc element or not. |
| ROIdilateradius | Radius of the disc element to be used in ROI dilation. |
| Resampling | Determine whether the image and mask will be resampled or not. |
| Resampling_spacing | Spacing to resample the image and mask to, if resampling is used. |
| BiasCorrection | Determine whether N4 bias correction will be applied or not. |
| BiasCorrection_Mask | Whether, within bias correction, a mask generated through Otsu thresholding is used or not. |
| CheckOrientation | Determine whether to check the image orientation or not. If checked and the orientation is not equal to the OrientationPrimaryAxis, the image is rotated. |
| OrientationPrimaryAxis | If CheckOrientation is True and the primary axis is not this one, rotate the image such that it is. Currently, only "axial" is supported. |
Defaults and Options:
| Subkey | Default | Options |
|---|---|---|
| CheckSpacing | False | True, False |
| Clipping | False | True, False |
| Clipping_Range | -1000.0, 3000.0 | Float, Float |
| Normalize | True | True, False |
| Normalize_ROI | Full | True, False, Full |
| Method | z_score | z_score, minmed |
| ROIDetermine | Provided | Provided, Otsu |
| ROIdilate | False | True, False |
| ROIdilateradius | 10 | Integer > 0 |
| Resampling | False | True, False |
| Resampling_spacing | 1, 1, 1 | Float, Float, Float |
| BiasCorrection | False | True, False |
| BiasCorrection_Mask | False | True, False |
| CheckOrientation | False | True, False |
| OrientationPrimaryAxis | axial | axial |
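For instance, to enable intensity clipping for CT you would set the fields above as strings. The snippet below sketches how such a range takes effect; it is a plain-Python stand-in for illustration, not WORC's actual preprocessing node:

```python
# Hypothetical config fragment mirroring the fields above.
config = {'Preprocessing': {'Clipping': 'True',
                            'Clipping_Range': '-1000.0, 3000.0'}}

# The comma-separated string field is parsed into two floats.
lower, upper = (float(x) for x in
                config['Preprocessing']['Clipping_Range'].split(','))

def clip(intensity):
    # Clamp a Hounsfield-unit value into [lower, upper].
    return max(lower, min(upper, intensity))

clip(-2000.0)  # -> -1000.0
```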
Segmentix¶
These fields are only important if you enabled the Segmentix tool in the General configuration.
Description:
| Subkey | Description |
|---|---|
| mask | If a mask is supplied, should the mask be subtracted from the contour or multiplied with it. |
| segtype | If Ring, a ring around the segmentation will be used as contour. If Dilate, the segmentation will be dilated per 2-D axial slice with a disc. |
| segradius | Define the radius of the ring or disc used if segtype is Ring or Dilate, respectively. |
| N_blobs | How many of the largest blobs are extracted from the segmentation. If None, no blob extraction is used. |
| fillholes | Determines whether hole filling will be used. |
| remove_small_objects | Determines whether small objects will be removed. |
| min_object_size | Minimum size of objects in voxels that are kept when small objects are removed. |
Defaults and Options:
| Subkey | Default | Options |
|---|---|---|
| mask | subtract | subtract, multiply |
| segtype | None | None, Ring, Dilate |
| segradius | 5 | Integer > 0 |
| N_blobs | 1 | Integer > 0 |
| fillholes | True | True, False |
| remove_small_objects | False | True, False |
| min_object_size | 2 | Integer > 0 |
ImageFeatures¶
If using the PREDICT toolbox for feature extraction, you can specify settings for the feature computation here. You can also select whether certain features are computed or not.
Description:
| Subkey | Description |
|---|---|
| shape | Determine whether shape features are computed or not. |
| histogram | Determine whether histogram features are computed or not. |
| orientation | Determine whether orientation features are computed or not. |
| texture_Gabor | Determine whether Gabor texture features are computed or not. |
| texture_LBP | Determine whether LBP texture features are computed or not. |
| texture_GLCM | Determine whether GLCM texture features are computed or not. |
| texture_GLCMMS | Determine whether GLCM Multislice texture features are computed or not. |
| texture_GLRLM | Determine whether GLRLM texture features are computed or not. |
| texture_GLSZM | Determine whether GLSZM texture features are computed or not. |
| texture_NGTDM | Determine whether NGTDM texture features are computed or not. |
| coliage | Determine whether coliage features are computed or not. |
| vessel | Determine whether vessel features are computed or not. |
| log | Determine whether LoG features are computed or not. |
| phase | Determine whether local phase features are computed or not. |
| image_type | Modality of the images supplied. Determines how the image is loaded. |
| gabor_frequencies | Frequencies of the Gabor filters used: can be a single float or a list. |
| gabor_angles | Angles of the Gabor filters in degrees: can be a single integer or a list. |
| GLCM_angles | Angles used in GLCM computation in radians: can be a single float or a list. |
| GLCM_levels | Number of grayscale levels used in discretization before GLCM computation. |
| GLCM_distances | Distance(s) used in GLCM computation in pixels: can be a single integer or a list. |
| LBP_radius | Radii used for LBP computation: can be a single integer or a list. |
| LBP_npoints | Number(s) of points used in LBP computation: can be a single integer or a list. |
| phase_minwavelength | Minimal wavelength in pixels used for phase features. |
| phase_nscale | Number of scales used in phase feature computation. |
| log_sigma | Standard deviation(s) in pixels used in LoG feature computation: can be a single integer or a list. |
| vessel_scale_range | Scale in pixels used for the Frangi vessel filter, given as a minimum and a maximum. |
| vessel_scale_step | Step size used to go from the minimum to the maximum scale in the Frangi vessel filter. |
| vessel_radius | Radius to determine the boundary between the inner part and the edge in the Frangi vessel filter. |
| dicom_feature_tags | DICOM tags to be extracted as features. See https://worc.readthedocs.io/en/latest/static/features.html. |
| dicom_feature_labels | For each of the DICOM tag values extracted, the name that should be assigned to the feature. See https://worc.readthedocs.io/en/latest/static/features.html. |
Defaults and Options:
| Subkey | Default | Options |
|---|---|---|
| shape | True | True, False |
| histogram | True | True, False |
| orientation | True | True, False |
| texture_Gabor | True | True, False |
| texture_LBP | True | True, False |
| texture_GLCM | True | True, False |
| texture_GLCMMS | True | True, False |
| texture_GLRLM | False | True, False |
| texture_GLSZM | False | True, False |
| texture_NGTDM | False | True, False |
| coliage | False | True, False |
| vessel | True | True, False |
| log | True | True, False |
| phase | True | True, False |
| image_type | CT | CT |
| gabor_frequencies | 0.05, 0.2, 0.5 | Float(s) |
| gabor_angles | 0, 45, 90, 135 | Integer(s) |
| GLCM_angles | 0, 0.79, 1.57, 2.36 | Float(s) |
| GLCM_levels | 16 | Integer > 0 |
| GLCM_distances | 1, 3 | Integer(s) > 0 |
| LBP_radius | 3, 8, 15 | Integer(s) > 0 |
| LBP_npoints | 12, 24, 36 | Integer(s) > 0 |
| phase_minwavelength | 3 | Integer > 0 |
| phase_nscale | 5 | Integer > 0 |
| log_sigma | 1, 5, 10 | Integer(s) |
| vessel_scale_range | 1, 10 | Two integers: min and max |
| vessel_scale_step | 2 | Integer > 0 |
| vessel_radius | 5 | Integer > 0 |
| dicom_feature_tags | 0010 1010, 0010 0040 | DICOM tag keys, e.g. 0010 0010, separated by commas |
| dicom_feature_labels | age, sex | List of strings |
PyRadiomics¶
If using the PyRadiomics toolbox, you can specify settings for the feature computation here. For more information, see https://pyradiomics.readthedocs.io/en/latest/customization.html.
Description:
| Subkey | Description |
|---|---|
| geometryTolerance | See https://pyradiomics.readthedocs.io/en/latest/customization.html. |
| normalize | See https://pyradiomics.readthedocs.io/en/latest/customization.html. |
| normalizeScale | See https://pyradiomics.readthedocs.io/en/latest/customization.html. |
| resampledPixelSpacing | See https://pyradiomics.readthedocs.io/en/latest/customization.html. |
| interpolator | See https://pyradiomics.readthedocs.io/en/latest/customization.html. |
| preCrop | See https://pyradiomics.readthedocs.io/en/latest/customization.html. |
| binCount | We advise using a fixed bin count instead of a fixed bin width, as on imaging modalities such as MRI the scale of the values varies a lot, which is incompatible with a fixed bin width. See https://pyradiomics.readthedocs.io/en/latest/customization.html. |
| binWidth | See https://pyradiomics.readthedocs.io/en/latest/customization.html. |
| force2D | See https://pyradiomics.readthedocs.io/en/latest/customization.html. |
| force2Ddimension | See https://pyradiomics.readthedocs.io/en/latest/customization.html. |
| voxelArrayShift | See https://pyradiomics.readthedocs.io/en/latest/customization.html. |
| Original | Enable/disable computation of original image features. |
| Wavelet | Enable/disable computation of wavelet image features. |
| LoG | Enable/disable computation of Laplacian of Gaussian (LoG) image features. |
| label | "Intensity" of the pixels in the mask to be used for feature extraction. If using segmentix, use 1, as your mask will be boolean. Otherwise, select the integer(s) corresponding to the ROI in your mask. |
| extract_firstorder | Determine whether first-order features are computed or not. |
| extract_shape | Determine whether shape features are computed or not. |
| texture_GLCM | Determine whether GLCM features are computed or not. |
| texture_GLRLM | Determine whether GLRLM features are computed or not. |
| texture_GLSZM | Determine whether GLSZM features are computed or not. |
| texture_GLDM | Determine whether GLDM features are computed or not. |
| texture_NGTDM | Determine whether NGTDM features are computed or not. |
Defaults and Options:
| Subkey | Default | Options |
|---|---|---|
| geometryTolerance | 0.0001 | Float |
| normalize | False | Boolean |
| normalizeScale | 100 | Integer |
| resampledPixelSpacing | None | Float, Float, Float |
| interpolator | sitkBSpline | |
| preCrop | True | True, False |
| binCount | 16 | Integer or None |
| binWidth | None | Integer or None |
| force2D | False | True, False |
| force2Ddimension | 0 | 0 = axial, 1 = coronal, 2 = sagittal |
| voxelArrayShift | 300 | Integer |
| Original | True | True, False |
| Wavelet | False | True, False |
| LoG | False | True, False |
| label | 1 | Integer |
| extract_firstorder | False | True, False |
| extract_shape | True | True, False |
| texture_GLCM | False | True, False |
| texture_GLRLM | True | True, False |
| texture_GLSZM | True | True, False |
| texture_GLDM | True | True, False |
| texture_NGTDM | True | True, False |
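The advantage of a fixed bin count over a fixed bin width can be seen in a small sketch: the same discretization works regardless of the absolute intensity scale. This is an illustration of the principle only; PyRadiomics' own binning is more involved:

```python
def discretize(values, bin_count=16):
    # Map intensities to bins 0..bin_count-1 using the observed range,
    # so the result is independent of the absolute intensity scale.
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_count or 1.0
    return [min(int((v - lo) / width), bin_count - 1) for v in values]

# An MRI-like scale and a rescaled copy discretize identically,
# which would not hold with a fixed bin width.
a = [0.1, 0.5, 0.9, 1.3]
b = [x * 1000 for x in a]
discretize(a) == discretize(b)  # True
```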
ComBat¶
If using the ComBat toolbox, you can specify some settings for the feature harmonization here. For more information, see https://github.com/Jfortin1/ComBatHarmonization.
Description:
| Subkey | Description |
|---|---|
| language | Name of the software implementation to use. |
| batch | Name of the batch variable, i.e. the variable to correct for. |
| mod | Name of the moderation variable(s), i.e. the variables for which variation in the features will be preserved. |
| par | Either use the parametric (1) or non-parametric (0) version of ComBat. |
| eb | Either use the empirical Bayes (1) or simple mean shifting (0) version of ComBat. |
| per_feature | Either use ComBat for all features combined (0) or per feature (1), in which case a second feature equal to the single feature plus random noise will be added if eb=1. |
| excluded_features | Provide substrings of feature labels of features which should be excluded from ComBat. Recommended for features unaffected by the batch variable. |
| matlab | If using Matlab, path to the Matlab executable. |
Defaults and Options:
| Subkey | Default | Options |
|---|---|---|
| language | python | python, matlab |
| batch | Hospital | String |
| mod | [] | String(s), or [] |
| par | 1 | 0 or 1 |
| eb | 1 | 0 or 1 |
| per_feature | 0 | 0 or 1 |
| excluded_features | | List of strings, comma separated |
| matlab | C:\Program Files\MATLAB\R2015b\bin\matlab.exe | String |
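In its simplest form (par=1, eb=0, i.e. mean shifting only), harmonization removes per-batch offsets. The sketch below illustrates that idea in plain Python; it is not the actual ComBat algorithm, which additionally models variance and empirical Bayes priors:

```python
def mean_shift_harmonize(features, batches):
    """Shift each batch's mean onto the overall mean (illustrative only)."""
    overall = sum(features) / len(features)
    out = {}
    for b in set(batches):
        vals = [f for f, bb in zip(features, batches) if bb == b]
        shift = overall - sum(vals) / len(vals)
        out[b] = [v + shift for v in vals]
    return out

# Hospital A measures systematically higher than hospital B;
# after the shift, both batches share the same mean.
harmonized = mean_shift_harmonize([10.0, 12.0, 1.0, 3.0], ['A', 'A', 'B', 'B'])
```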
FeatPreProcess¶
Before the features are given to the classification function, and thus to the hyperoptimization, they can be preprocessed as follows.
Description:
| Subkey | Description |
|---|---|
| Use | If True, use the feature preprocessor in the classify node. Currently excludes features with >80% NaNs. |
| Combine | If True, features of multiple objects (e.g. lesions) of the same patient are combined. |
| Combine_method | If features of multiple objects are combined, this determines the method. Currently included options are mean and max. |
Defaults and Options:
| Subkey | Default | Options |
|---|---|---|
| Use | False | Boolean |
| Combine | False | Boolean |
| Combine_method | mean | mean or max |
OneHotEncoding¶
Optionally, you can use OneHotEncoding on specific features. For more information on why and how this is done, see https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html. By default, this is not done, as WORC does not know for which specific features you would like to do this.
Description:
| Subkey | Description |
|---|---|
| Use | If True, use OneHotEncoding for specific features as determined by the field below. |
| feature_labels_tofit | Labels of the features for which to use OneHotEncoding. WORC will check whether any of the values specified in this field is a substring of a feature name. For example, if you give glcm, all features for which glcm is in the feature label will be one-hot encoded. |
Defaults and Options:
| Subkey | Default | Options |
|---|---|---|
| Use | False | Boolean(s) |
| feature_labels_tofit | | List of strings |
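The substring matching described above can be sketched as follows. The feature names are illustrative, and this is not WORC's internal code:

```python
def select_by_substring(feature_labels, patterns):
    # A feature is selected when any pattern is a substring of its label.
    return [f for f in feature_labels if any(p in f for p in patterns)]

features = ['tf_glcm_contrast', 'hf_mean', 'tf_glcm_energy']
select_by_substring(features, ['glcm'])  # -> ['tf_glcm_contrast', 'tf_glcm_energy']
```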
Imputation¶
These settings are used for feature imputation. Note that they are actually used in the hyperparameter optimization: you can provide multiple values per field, from which random samples are drawn, and the best setting in combination with the other hyperparameters is finally selected.
Description:
| Subkey | Description |
|---|---|
| use | If True, use feature imputation methods to replace NaN values. If False, all NaN features will be set to zero. |
| strategy | Method to be used for imputation. |
| n_neighbors | When using k-nearest neighbors (kNN) for feature imputation, determines the number of neighbors used for imputation. Can be a single integer or a list. |
Defaults and Options:
| Subkey | Default | Options |
|---|---|---|
| use | True | Boolean(s) |
| strategy | mean, median, most_frequent, constant, knn | mean, median, most_frequent, constant, knn |
| n_neighbors | 5, 5 | Two integers: loc and scale |
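A minimal sketch of the mean strategy in plain Python; WORC's actual imputer wraps sklearn-style transformers, which this does not reproduce:

```python
import math

def impute_mean(column):
    # Replace NaNs with the mean of the observed (non-NaN) values.
    observed = [v for v in column if not math.isnan(v)]
    # For an all-NaN column we fall back to zero here (our choice for the sketch).
    fill = sum(observed) / len(observed) if observed else 0.0
    return [fill if math.isnan(v) else v for v in column]

impute_mean([1.0, float('nan'), 3.0])  # -> [1.0, 2.0, 3.0]
```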
FeatureScaling¶
Determines which method is applied to scale each feature.
Description:
| Subkey | Description |
|---|---|
| scaling_method | Determine the scaling method. |
| skip_features | Determine which features should be skipped. This field should contain a comma-separated list of substrings: when one or more of these are in a feature name, the feature is skipped. |
Defaults and Options:
| Subkey | Default | Options |
|---|---|---|
| scaling_method | robust_z_score | robust_z_score, z_score, robust, minmax, log_z_score, None |
| skip_features | | Comma-separated list of strings |
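As an illustration of what a robust z-score does, here is a sketch that centers on the median and scales by the interquartile range; WORC's actual implementation may differ in details such as the percentile choice:

```python
import statistics

def robust_z_score(values):
    # Center on the median and scale by the interquartile range (IQR),
    # making the scaling insensitive to outliers such as the 100 below.
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = (q3 - q1) or 1.0
    return [(v - med) / iqr for v in values]

scaled = robust_z_score([1, 2, 3, 4, 100])  # median maps to 0.0
```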
Featsel¶
Define the feature selection methods. Note that these settings are actually used in the hyperparameter optimization: you can provide multiple values per field, from which random samples are drawn, and the best setting in combination with the other hyperparameters is finally selected. Again, these should be formatted as a string containing the actual values, e.g. value1, value2.
Description:
| Subkey | Description |
|---|---|
| Variance | Percentage of times that features with a variance < 0.01 are excluded. Based on sklearn's VarianceThreshold (https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html). |
| GroupwiseSearch | Randomly select which feature groups to use. Parameters are determined by the SelectFeatGroup config part, see below. |
| SelectFromModel | Percentage of times that features are selected by first training a machine learning model which can rank the features by importance. See also sklearn's SelectFromModel. |
| SelectFromModel_estimator | Machine learning model / estimator used: can be LASSO, LogisticRegression, or a random forest. |
| SelectFromModel_lasso_alpha | When using LASSO, search space of the weight of the L1 term, see also https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html. |
| SelectFromModel_n_trees | When using a random forest, search space of the number of trees used. |
| UsePCA | Percentage of times that Principal Component Analysis (PCA) is used to select features. |
| PCAType | Method to select the number of components using PCA: either the number of components that explains 95% of the variance, or a fixed number of components. |
| StatisticalTestUse | Percentage of times that a statistical test is used to select features. |
| StatisticalTestMetric | Define the type of statistical test to be used. |
| StatisticalTestThreshold | Specify a threshold for the p-value used in the statistical test to select features. The first element defines the lower boundary, the other the upper boundary. Random sampling will occur between the boundaries. |
| ReliefUse | Percentage of times that Relief is used to select features. |
| ReliefNN | Min and max of the number-of-nearest-neighbors search range in Relief. |
| ReliefSampleSize | Min and max of the sample-size search range in Relief. |
| ReliefDistanceP | Min and max of the positive-distance search range in Relief. |
| ReliefNumFeatures | Min and max of the number-of-selected-features search range in Relief. |
Defaults and Options:
| Subkey | Default | Options |
|---|---|---|
| Variance | 1.0 | Float |
| GroupwiseSearch | True | Boolean(s) |
| SelectFromModel | 0.2 | Float |
| SelectFromModel_estimator | Lasso, LR, RF | Lasso, LR, RF |
| SelectFromModel_lasso_alpha | 0.1, 1.4 | Two floats: loc and scale |
| SelectFromModel_n_trees | 10, 90 | Two integers: loc and scale |
| UsePCA | 0.2 | Float |
| PCAType | 95variance, 10, 50, 100 | Integer(s), 95variance |
| StatisticalTestUse | 0.2 | Float |
| StatisticalTestMetric | MannWhitneyU | ttest, Welch, Wilcoxon, MannWhitneyU |
| StatisticalTestThreshold | -3, 2.5 | Two integers: loc and scale |
| ReliefUse | 0.2 | Float |
| ReliefNN | 2, 4 | Two integers: loc and scale |
| ReliefSampleSize | 0.75, 0.2 | Two floats: loc and scale |
| ReliefDistanceP | 1, 3 | Two integers: loc and scale |
| ReliefNumFeatures | 10, 40 | Two integers: loc and scale |
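Many of the ranges above are given as (loc, scale) pairs for uniform sampling, i.e. values are drawn from [loc, loc + scale]. A sketch of how such a string field could be sampled, for illustration only and not WORC's actual sampler:

```python
import random

def sample_uniform(field):
    # A 'loc, scale' config field, e.g. '2, 4', maps to the range [loc, loc + scale].
    loc, scale = (float(x) for x in field.split(','))
    return random.uniform(loc, loc + scale)

random.seed(1)
# ReliefNN default '2, 4': the number of neighbors is drawn from [2, 6].
draw = sample_uniform('2, 4')
```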
SelectFeatGroup¶
If the PREDICT and/or PyRadiomics feature computation tools are used, you can do a grid search among the various feature groups for the optimal combination. Here, you determine which groups can be selected.
Description:
| Subkey | Description |
|---|---|
| shape_features | If True, use shape features in the model. |
| histogram_features | If True, use histogram features in the model. |
| orientation_features | If True, use orientation features in the model. |
| texture_Gabor_features | If True, use Gabor texture features in the model. |
| texture_GLCM_features | If True, use GLCM texture features in the model. |
| texture_GLDM_features | If True, use GLDM texture features in the model. |
| texture_GLCMMS_features | If True, use GLCM Multislice texture features in the model. |
| texture_GLRLM_features | If True, use GLRLM texture features in the model. |
| texture_GLSZM_features | If True, use GLSZM texture features in the model. |
| texture_GLDZM_features | If True, use GLDZM texture features in the model. |
| texture_NGTDM_features | If True, use NGTDM texture features in the model. |
| texture_NGLDM_features | If True, use NGLDM texture features in the model. |
| texture_LBP_features | If True, use LBP texture features in the model. |
| dicom_features | If True, use DICOM features in the model. |
| semantic_features | If True, use semantic features in the model. |
| coliage_features | If True, use coliage features in the model. |
| vessel_features | If True, use vessel features in the model. |
| phase_features | If True, use phase features in the model. |
| fractal_features | If True, use fractal features in the model. |
| location_features | If True, use location features in the model. |
| rgrd_features | If True, use rgrd features in the model. |
| toolbox | List of names of toolboxes to be used, or All. |
| original_features | If True, use original features in the model. |
| wavelet_features | If True, use wavelet features in the model. |
| log_features | If True, use log features in the model. |
Defaults and Options:
| Subkey | Default | Options |
|---|---|---|
| shape_features | True, False | Boolean(s) |
| histogram_features | True, False | Boolean(s) |
| orientation_features | True, False | Boolean(s) |
| texture_Gabor_features | True, False | Boolean(s) |
| texture_GLCM_features | True, False | Boolean(s) |
| texture_GLDM_features | True, False | Boolean(s) |
| texture_GLCMMS_features | True, False | Boolean(s) |
| texture_GLRLM_features | True, False | Boolean(s) |
| texture_GLSZM_features | True, False | Boolean(s) |
| texture_GLDZM_features | True, False | Boolean(s) |
| texture_NGTDM_features | True, False | Boolean(s) |
| texture_NGLDM_features | True, False | Boolean(s) |
| texture_LBP_features | True, False | Boolean(s) |
| dicom_features | False | Boolean(s) |
| semantic_features | False | Boolean(s) |
| coliage_features | False | Boolean(s) |
| vessel_features | True, False | Boolean(s) |
| phase_features | True, False | Boolean(s) |
| fractal_features | True, False | Boolean(s) |
| location_features | True, False | Boolean(s) |
| rgrd_features | True, False | Boolean(s) |
| toolbox | All, PREDICT, PyRadiomics | All, or name of toolbox (PREDICT, PyRadiomics) |
| original_features | True | Boolean(s) |
| wavelet_features | True, False | Boolean(s) |
| log_features | True, False | Boolean(s) |
Resampling¶
Before performing the hyperoptimization, you can use various resampling techniques to resample the data (under-sampling, over-sampling, or both). All methods are adopted from imbalanced-learn.
Description:
| Subkey | Description |
|---|---|
| Use | Percentage of times that object (e.g. patient) resampling is used. |
| Method | One of the methods adopted, see also https://imbalanced-learn.readthedocs.io/en/stable/api. |
| sampling_strategy | Sampling strategy, see also https://imbalanced-learn.readthedocs.io/en/stable/api. |
| n_neighbors | Number of neighbors used in resampling. This should be (much) smaller than the number of objects/patients you supply. We sample on a uniform scale: the parameters specify the range (loc, loc + scale). |
| k_neighbors | Number of neighbors used in resampling. This should be (much) smaller than the number of objects/patients you supply. We sample on a uniform scale: the parameters specify the range (loc, loc + scale). |
| threshold_cleaning | Threshold for the cleaning of samples. We sample on a uniform scale: the parameters specify the range (loc, loc + scale). |
Defaults and Options:
| Subkey | Default | Options |
|---|---|---|
| Use | 0.20 | Float |
| Method | RandomUnderSampling, RandomOverSampling, NearMiss, NeighbourhoodCleaningRule, ADASYN, BorderlineSMOTE, SMOTE, SMOTEENN, SMOTETomek | RandomUnderSampling, RandomOverSampling, NearMiss, NeighbourhoodCleaningRule, ADASYN, BorderlineSMOTE, SMOTE, SMOTEENN, SMOTETomek |
| sampling_strategy | auto, majority, minority, not minority, not majority, all | auto, majority, minority, not minority, not majority, all |
| n_neighbors | 3, 12 | Two integers: loc and scale |
| k_neighbors | 5, 15 | Two integers: loc and scale |
| threshold_cleaning | 0.25, 0.5 | Two floats: loc and scale |
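The simplest of these methods, random under-sampling, can be sketched in a few lines. This is an illustration of the technique only; WORC uses imbalanced-learn's implementations:

```python
import random

def random_under_sample(samples, labels, seed=42):
    # Drop majority-class samples at random until all classes are equally sized.
    random.seed(seed)
    by_class = {}
    for s, l in zip(samples, labels):
        by_class.setdefault(l, []).append(s)
    n_min = min(len(group) for group in by_class.values())
    out = []
    for l, group in by_class.items():
        for s in random.sample(group, n_min):
            out.append((s, l))
    return out

# Three class-0 patients and one class-1 patient: one of each remains.
balanced = random_under_sample(['p1', 'p2', 'p3', 'p4'], [0, 0, 0, 1])
```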
Classification¶
Determine the settings for the classification in the hyperoptimization. Most of the classifiers are implemented using sklearn; hence, descriptions of the hyperparameters can also be found there.

Defaults for XGB are based on https://towardsdatascience.com/doing-xgboost-hyper-parameter-tuning-the-smart-way-part-1-of-2-f6d255a45dde and https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/.

Note: as XGB and AdaBoost take significantly longer to fit (roughly 3x), they are picked less often by default.
Description:

Subkey | Description
---|---
fastr | Use fastr for the optimization grid search (recommended on clusters, default) or, if set to False, joblib (recommended for PCs, but not on Windows).
fastr_plugin | Name of the execution plugin to be used. By default, the same as the self.fastr_plugin of the WORC object is used.
classifiers | Select the estimator(s) to use. Most are implemented using sklearn. For abbreviations, see the options: LR = logistic regression.
max_iter | Maximum number of iterations used in training an estimator. Only applies to specific estimators, see sklearn.
SVMKernel | When using an SVM, specify the kernel type.
SVMC | Range of the SVM slack parameter. We sample on a uniform log scale: the parameters specify the range of the exponent (loc, loc + scale).
SVMdegree | Range of the SVM polynomial degree when using a polynomial kernel. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
SVMcoef0 | Range of the SVM homogeneity parameter. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
SVMgamma | Range of the SVM gamma parameter. We sample on a uniform log scale: the parameters specify the range of the exponent (loc, loc + scale).
RFn_estimators | Range of the number of trees in an RF. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
RFmin_samples_split | Range of the minimum number of samples required to split a branch in an RF. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
RFmax_depth | Range of the maximum depth of an RF. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
LRpenalty | Penalty term used in LR.
LRC | Range of the regularization strength in LR. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
LR_solver | Solver used in LR.
LR_l1_ratio | Ratio between the l1 and l2 penalty when using the elasticnet penalty, see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.
LDA_solver | Solver used in LDA.
LDA_shrinkage | Range of the LDA shrinkage parameter. We sample on a uniform log scale: the parameters specify the range of the exponent (loc, loc + scale).
QDA_reg_param | Range of the QDA regularization parameter. We sample on a uniform log scale: the parameters specify the range of the exponent (loc, loc + scale).
ElasticNet_alpha | Range of the ElasticNet penalty parameter. We sample on a uniform log scale: the parameters specify the range of the exponent (loc, loc + scale).
ElasticNet_l1_ratio | Range of the l1 ratio in ElasticNet. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
SGD_alpha | Range of the SGD penalty parameter. We sample on a uniform log scale: the parameters specify the range of the exponent (loc, loc + scale).
SGD_l1_ratio | Range of the l1 ratio in SGD. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
SGD_loss | Loss function of SGD.
SGD_penalty | Penalty term in SGD.
CNB_alpha | Regularization strength in ComplementNB. We sample on a uniform scale: the parameters specify the range (loc, loc + scale).
AdaBoost_n_estimators | Number of estimators used in AdaBoost. Default is equal to config['Classification']['RFn_estimators'].
AdaBoost_learning_rate | Learning rate in AdaBoost.
XGB_boosting_rounds | Number of estimators / boosting rounds used in XGB. Default is equal to config['Classification']['RFn_estimators'].
XGB_max_depth | Maximum depth of XGB.
XGB_learning_rate | Learning rate in XGB. Default is equal to config['Classification']['AdaBoost_learning_rate'].
XGB_gamma | Gamma of XGB.
XGB_min_child_weight | Minimum child weight in XGB.
XGB_colsample_bytree | Column sample by tree (colsample_bytree) in XGB.
Defaults and Options:

Subkey | Default | Options
---|---|---
fastr | True | True, False
fastr_plugin | LinearExecution | Any fastr execution plugin.
classifiers | SVM, RF, LR, LDA, QDA, GaussianNB, AdaBoostClassifier, XGBClassifier | SVM, SVR, SGD, SGDR, RF, LDA, QDA, ComplementNB, GaussianNB, AdaBoostClassifier, XGBClassifier, LR, RFR, Lasso, ElasticNet, LinR, Ridge, AdaBoostRegressor, XGBRegressor. All are estimators from sklearn.
max_iter | 100000 | Integer
SVMKernel | linear, poly, rbf | poly, linear, rbf
SVMC | 0, 6 | Two Integers: loc and scale
SVMdegree | 1, 6 | Two Integers: loc and scale
SVMcoef0 | 0, 1 | Two Integers: loc and scale
SVMgamma | -5, 5 | Two Integers: loc and scale
RFn_estimators | 10, 90 | Two Integers: loc and scale
RFmin_samples_split | 2, 3 | Two Integers: loc and scale
RFmax_depth | 5, 5 | Two Integers: loc and scale
LRpenalty | l1, l2, elasticnet | none, l1, l2, elasticnet
LRC | 0.01, 0.99 | Two Floats: loc and scale
LR_solver | lbfgs, saga | Comma-separated list of strings; for the options, see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
LR_l1_ratio | 0, 1 | Float between 0.0 and 1.0.
LDA_solver | svd, lsqr, eigen | svd, lsqr, eigen
LDA_shrinkage | -5, 5 | Two Integers: loc and scale
QDA_reg_param | -5, 5 | Two Integers: loc and scale
ElasticNet_alpha | -5, 5 | Two Integers: loc and scale
ElasticNet_l1_ratio | 0, 1 | Two Integers: loc and scale
SGD_alpha | -5, 5 | Two Integers: loc and scale
SGD_l1_ratio | 0, 1 | Two Integers: loc and scale
SGD_loss | squared_loss, huber, epsilon_insensitive, squared_epsilon_insensitive | hinge, squared_hinge, modified_huber
SGD_penalty | none, l2, l1 | none, l2, l1
CNB_alpha | 0, 1 | Two Integers: loc and scale
AdaBoost_n_estimators | 10, 90 | Two Integers: loc and scale
AdaBoost_learning_rate | 0.01, 0.99 | Two Floats: loc and scale
XGB_boosting_rounds | 10, 90 | Two Integers: loc and scale
XGB_max_depth | 3, 12 | Two Integers: loc and scale
XGB_learning_rate | 0.01, 0.99 | Two Floats: loc and scale
XGB_gamma | 0.01, 0.99 | Two Floats: loc and scale
XGB_min_child_weight | 1, 6 | Two Integers: loc and scale
XGB_colsample_bytree | 0.3, 0.7 | Two Floats: loc and scale
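Most of the defaults above are (loc, scale) pairs defining a uniform distribution on [loc, loc + scale], with the draw interpreted as an exponent for the log-scale parameters. The stand-alone sketch below illustrates this convention with the standard library; it mimics the documented ranges, not WORC's internal sampling code.

```python
# Illustration of the (loc, scale) convention in the tables above:
# values are drawn uniformly from [loc, loc + scale]; for log-scale
# parameters (e.g. SVMC, SVMgamma) the draw is an exponent.
import random

rng = random.Random(42)

# SVMC default (0, 6): exponent uniform on [0, 6], so C lies in [1, 1e6]
loc, scale = 0, 6
C = 10 ** rng.uniform(loc, loc + scale)

# RFn_estimators default (10, 90): uniform on [10, 100]
loc, scale = 10, 90
n_estimators = int(rng.uniform(loc, loc + scale))

print(f"C={C:.3g}, n_estimators={n_estimators}")
```

As noted in the introduction, repeating an estimator in config['Classification']['classifiers'] makes it more likely to be sampled during the CASH optimization.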
CrossValidation¶
When using cross-validation, specify the following settings.

Description:

Subkey | Description
---|---
Type | Type of cross-validation used. Currently, random-splitting and leave-one-out (LOO) are supported.
N_iterations | Number of times the data is split into training and test sets in the outer cross-validation when using random-splitting.
test_size | The percentage of data to be used for testing when using random-splitting.
fixed_seed | If True, use a fixed seed for the cross-validation splits when using random-splitting.

Defaults and Options:

Subkey | Default | Options
---|---|---
Type | random_split | random_split, LOO
N_iterations | 100 | Integer
test_size | 0.2 | Float
fixed_seed | False | Boolean
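For example, a lighter but reproducible outer cross-validation could be configured as below. A plain `ConfigParser` stands in for the object returned by `WORC.defaultconfig()`; all values are strings, as `configparser` requires.

```python
# Sketch: fewer random train/test splits with a fixed seed, so repeated
# runs produce identical splits. Stand-in for the WORC config object.
import configparser

config = configparser.ConfigParser()
config['CrossValidation'] = {
    'Type': 'random_split',
    'N_iterations': '25',    # 25 random train/test splits instead of 100
    'test_size': '0.2',      # hold out 20% of the data per split
    'fixed_seed': 'True',    # reproducible splits across runs
}
print(config['CrossValidation'].getint('N_iterations'))
```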
HyperOptimization¶
Specify the hyperparameter optimization procedure here.

Description:

Subkey | Description
---|---
scoring_method | Specify the optimization metric for the hyperparameter search.
test_size | Size of the test set in the hyperoptimization cross-validation, given as a percentage of the whole dataset.
n_splits | Number of train-validation cross-validation splits used for model optimization.
N_iterations | Number of iterations used in the hyperparameter optimization, i.e. the number of samples drawn from the parameter grid.
n_jobspercore | Number of jobs assigned to a single core. Only used if fastr is set to True in the Classification section.
maxlen | Number of estimators for which the fitted outcomes and parameters are saved. Increasing this number increases memory usage.
ranking_score | Score used for ranking the performance of the evaluated workflows.
memory | When using a DRMAA plugin, e.g. on the BIGR cluster, the memory usage of a single optimization job. Should be a string consisting of an integer followed by "G".
refit_workflows | If True, automatically refit all workflows in the ensemble during training. This saves time during inference, but takes more time during training and makes the saved model much larger.

Defaults and Options:

Subkey | Default | Options
---|---|---
scoring_method | f1_weighted | Manual metrics provided by WORC: f1_weighted_predictproba, average_precision_weighted, gmean. Otherwise, any sklearn metric is accepted.
test_size | 0.2 | Float
n_splits | 5 | Integer
N_iterations | 1000 | Integer
n_jobspercore | 500 | Integer
maxlen | 100 | Integer
ranking_score | test_score | String
memory | 3G | String consisting of an integer followed by "G"
refit_workflows | False | Boolean
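The sketch below trades search breadth for runtime: fewer random samples from the parameter grid, ranked by a different metric. A plain `ConfigParser` stands in for WORC's config object; whether these exact values suit your data is a judgment call, not a recommendation.

```python
# Sketch: a smaller hyperparameter search with a WORC-provided metric.
import configparser

config = configparser.ConfigParser()
config['HyperOptimization'] = {
    'scoring_method': 'average_precision_weighted',  # WORC-provided metric
    'N_iterations': '250',   # sample 250 workflows from the grid, not 1000
    'n_splits': '5',         # train/validation splits per sampled workflow
    'memory': '4G',          # per-job memory when using a DRMAA plugin
}
print(config['HyperOptimization']['scoring_method'])
```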
Ensemble¶
WORC supports ensembling of workflows: instead of selecting only the single best performing workflow, the predictions of multiple top-ranked workflows can be combined.

Description:

Subkey | Description
---|---
Use | Determine whether to use ensembling or not. Provide an integer to state how many estimators to include: 1 equals no ensembling.
Metric | Metric used to determine the ranking of estimators in the ensemble. When using Default, the metric used in the hyperoptimization is used.

Defaults and Options:

Subkey | Default | Options
---|---|---
Use | 100 | Integer
Metric | Default | Default, generalization
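For instance, an ensemble of the 50 top-ranked workflows, ranked by the generalization metric instead of the hyperoptimization metric, could be configured as below (plain `ConfigParser` standing in for WORC's config object):

```python
# Sketch: a 50-workflow ensemble; 'Use' takes an integer (as a string),
# where '1' means no ensembling.
import configparser

config = configparser.ConfigParser()
config['Ensemble'] = {
    'Use': '50',                 # average the 50 best-ranked workflows
    'Metric': 'generalization',
}
print(config['Ensemble'].getint('Use') > 1)  # ensembling is active
```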
Evaluation¶
In the evaluation of the performance, several adjustments can be made.

Description:

Subkey | Description
---|---
OverfitScaler | Whether to fit a separate scaler on the test set (which constitutes overfitting) or to use the scaler fitted on the training dataset. Only intended for experimental purposes: never overfit your scaler for the actual performance evaluation.

Defaults and Options:

Subkey | Default | Options
---|---|---
OverfitScaler | False | True, False
Bootstrap¶
Besides cross-validation, WORC supports bootstrapping on the test set for performance evaluation.

Description:

Subkey | Description
---|---
Use | Determine whether to use bootstrapping or not.
N_iterations | Number of iterations to use for bootstrapping.

Defaults and Options:

Subkey | Default | Options
---|---|---
Use | False | Boolean
N_iterations | 1000 | Integer
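Enabling bootstrapping follows the same dictionary-style pattern as the other sections; the sketch below uses a plain `ConfigParser` in place of WORC's config object.

```python
# Sketch: enable bootstrap resampling of the test set for the
# performance estimate.
import configparser

config = configparser.ConfigParser()
config['Bootstrap'] = {
    'Use': 'True',
    'N_iterations': '1000',  # number of bootstrap resamples of the test set
}
print(config['Bootstrap'].getboolean('Use'))
```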