Additional functionality¶
When using SimpleWORC, or WORC with similar simple configuration settings, you can
already benefit from the main functionality of WORC, i.e. the automatic algorithm
optimization. However, several additional functionalities are provided, which are discussed in
this chapter.
For a description of the radiomics features, please see the radiomics features chapter. For a description of the data mining components, see the data mining chapter. All other components are discussed here.
For a comprehensive overview of all functions and parameters, please look at the config chapter.
Image Preprocessing¶
Preprocessing of the image and of the corresponding mask is done in the
WORC.processing.preprocessing and WORC.processing.segmentix scripts, respectively.
Options for preprocessing the image include, in the following order:
N4 Bias field correction, see also https://simpleitk.readthedocs.io/en/master/link_N4BiasFieldCorrection_docs.html.
Checking and optionally correcting the spacing if it’s 1x1x1 and the DICOM metadata says otherwise.
Clipping of the image intensities above and below a certain value.
Normalization, see WORC.processing.preprocessing.normalize_image for all options.
Transposing the image to another "main" orientation, e.g. axial.
Resampling the image to a different spacing.
Options for preprocessing the segmentation include:
Hole filling. Many feature computations cannot deal with holes.
Removing small objects. Many feature computations cannot deal with multiple
objects in a single segmentation.
Extracting the largest blob. Many feature computations cannot deal with
multiple objects in a single segmentation.
Instead of using the full segmentation, extracting a ring around the border
of the segmentation to compute the features on. The ring captures both the inner and outer border.
Dilating the contour.
Masking the contour with another contour.
Copying the metadata of the image to the segmentation, when the segmentation
can be assumed to match the image.
Checking and optionally correcting the spacing if it’s 1x1x1 and the
DICOM metadata says otherwise. Same as image preprocessing step 2.
Transposing the segmentation to another "main" orientation, e.g. axial.
Same as image preprocessing step 5.
Resampling the segmentation to a different spacing.
Same as image preprocessing step 6.
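The ring option above can be illustrated with a minimal sketch, assuming a 2D binary mask stored as nested lists and a 4-connected neighbourhood. This is not the WORC.processing.segmentix implementation; the function names are hypothetical and the ring width is fixed at one pixel on each side of the border.

```python
# 4-connected neighbourhood (including the pixel itself), used as structuring element.
NEIGHBOURHOOD = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]


def _neighbours(mask, row, col):
    """Yield the mask values in the neighbourhood; outside the grid counts as 0."""
    n_rows, n_cols = len(mask), len(mask[0])
    for d_row, d_col in NEIGHBOURHOOD:
        r, c = row + d_row, col + d_col
        yield mask[r][c] if 0 <= r < n_rows and 0 <= c < n_cols else 0


def erode(mask):
    """Keep a pixel only if its whole neighbourhood is foreground."""
    return [[int(all(_neighbours(mask, r, c))) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def dilate(mask):
    """Set a pixel if any pixel in its neighbourhood is foreground."""
    return [[int(any(_neighbours(mask, r, c))) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def extract_ring(mask):
    """Ring = dilated mask minus eroded mask, i.e. inner plus outer border."""
    eroded, dilated = erode(mask), dilate(mask)
    return [[d - e for d, e in zip(d_row, e_row)]
            for d_row, e_row in zip(dilated, eroded)]


# Example: a 3x3 square in the centre of a 7x7 grid.
mask = [[1 if 2 <= r <= 4 and 2 <= c <= 4 else 0 for c in range(7)]
        for r in range(7)]
ring = extract_ring(mask)
```

In this example the ring contains the outer layer of the square (inner border) and the background pixels directly adjacent to it (outer border), while the centre pixel of the square is excluded.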
Feature scaling¶
The default method for feature scaling in WORC is a robust version
of z-scoring. Additional options include:
regular z-scoring
MinMax scaling, i.e., scaling to a range between 0 and 1
Scaling by centering using the median and IQR
A combination of z-scoring with a logarithmic transform and a correction term to better cope with outliers and non-normally distributed features [CIT1].
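As an illustration of three of the options above, the following is a minimal sketch using only the Python standard library. It is not the WORC implementation: the function names are hypothetical, the population standard deviation and the "inclusive" quartile method are assumptions, and the robust z-scoring and logarithmic variants are deliberately left out because their exact definitions are not given here.

```python
import statistics


def z_score(values):
    """Regular z-scoring: zero mean, unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumes the feature is not constant
    return [(v - mean) / std for v in values]


def min_max(values):
    """MinMax scaling to the range [0, 1]."""
    low, high = min(values), max(values)  # assumes high > low
    return [(v - low) / (high - low) for v in values]


def median_iqr(values):
    """Centre on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


# One feature measured in five patients, with a single outlier.
feature = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled_z = z_score(feature)
scaled_minmax = min_max(feature)
scaled_robust = median_iqr(feature)
```

With the outlier present, MinMax scaling compresses the four regular values into a small part of the [0, 1] range, whereas the median and IQR based scaling keeps them evenly spread around zero.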
Image Registration¶
When using multiple modalities or sequences, and there is only a segmentation
on a single image, image registration is applied to spatially align all
sequences and warp the segmentation to the other images through elastix [CIT2].
Usage of elastix is automatically included in WORC
when only a single segmentation and multiple modalities are supplied.
The image on which the segmentation is provided is used as the moving image,
the others as the fixed image, as the segmentations will be moved from the
segmented image to the others.
Registration is by default performed using a
rigid transformation model, based on a mutual information metric and the adaptive
stochastic gradient descent optimizer. Manual
overrides of these defaults are included in the WORC configuration.
When using elastix, parameter files have to be provided in the
network.Elastix_Para object, e.g.
network.Elastix_Para = [['Parameters_Rigid.txt', 'Parameters_BSpline.txt']]
The outer list defines the parameter files used per modality. If only one
element is provided, the same will be applied for all modalities. Each element
of the list should be a list of its own, containing the filenames
of the elastix parameter files. In the example, we provided two files, so that
first a rigid registration is performed, followed by a B-spline registration.
Examples of elastix parameter files can be found at https://github.com/SuperElastix/ElastixModelZoo/tree/master/models/default
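The nesting of the parameter file list can be made explicit with a small sketch. The helper below is hypothetical and not part of WORC; it only mimics the rule described above, in which a single inner list is reused for every modality.

```python
def parameter_files_per_modality(elastix_para, n_modalities):
    """Hypothetical helper: return one list of parameter files per modality.

    elastix_para follows the nested-list layout described above: the outer
    list runs over modalities, each inner list holds the parameter files
    that are applied one after the other.
    """
    if len(elastix_para) == 1:
        # A single element is reused for all modalities.
        return [list(elastix_para[0]) for _ in range(n_modalities)]
    if len(elastix_para) != n_modalities:
        raise ValueError("Provide either one element or one element per modality.")
    return [list(files) for files in elastix_para]


elastix_para = [['Parameters_Rigid.txt', 'Parameters_BSpline.txt']]
expanded = parameter_files_per_modality(elastix_para, n_modalities=3)
```

Here every one of the three modalities is assigned the rigid parameter file followed by the B-spline parameter file.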
ComBat¶
Commonly, radiomics studies include multicenter data, resulting in heterogeneity in the acquisition protocols. As radiomics features are generally sensitive to these variations, this limits the repeatability and reproducibility. To compensate for the differences in acquisition, feature harmonization techniques may be used, of which ComBat is one of the most frequently used. In ComBat, feature distributions are harmonized for variations in the imaging acquisition, e.g. due to differences in hospitals, manufacturers, or acquisition parameters. The dataset is divided into groups based on these differences, and a correction of the error caused by these differences is estimated using empirical Bayes.
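To give an intuition for what harmonizing feature distributions across groups means, the sketch below aligns the mean and standard deviation of a single feature per group. This is a deliberately simplified illustration and not ComBat itself: ComBat additionally uses empirical Bayes to pool information across features and can preserve the effect of a moderation variable. The function name is hypothetical.

```python
import statistics
from collections import defaultdict


def align_batches(values, batches):
    """Shift and scale each batch to the overall mean and standard deviation.

    Simplified location/scale alignment of one feature; NOT the ComBat
    algorithm. Assumes each batch holds at least two distinct values.
    """
    grand_mean = statistics.fmean(values)
    grand_std = statistics.pstdev(values)

    # Collect the feature values per batch (e.g. per hospital or scanner).
    per_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        per_batch[batch].append(value)
    stats = {batch: (statistics.fmean(vals), statistics.pstdev(vals))
             for batch, vals in per_batch.items()}

    # Standardize within the batch, then map back to the overall scale.
    return [grand_mean + grand_std * (value - stats[batch][0]) / stats[batch][1]
            for value, batch in zip(values, batches)]


feature = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
hospital = ['A', 'A', 'A', 'B', 'B', 'B']
harmonized = align_batches(feature, hospital)
```

After alignment, the values from both hospitals share the same mean and spread, so the offset between the two hospitals no longer dominates the feature.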
ComBat is included in WORC and can be turned on in the configuration,
including options to use empirical Bayes or not, a parametric or
non-parametric approach, and a moderation variable.
A wrapper around the original ComBat code, compatible with the other tools
provided by WORC, is included in the WORC installation.
When using ComBat, configure the following:
Set config['General']['ComBat'] to 'True'.
To change the ComBat parameters (i.e. which batch and moderation variable to use), change the relevant config fields, see the Config chapter.
WORC extracts the batch and moderation variables from the label file which you also use to give WORC the actual label you want to predict. The same format therefore applies, see the User manual for more details.
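The first setting can be sketched as follows. A plain configparser object is used here as a stand-in for the WORC configuration, which is an assumption made to keep the example self-contained; the point is only that the value is the string 'True' rather than a Python boolean.

```python
import configparser

# Stand-in for the WORC configuration object (assumption for illustration).
config = configparser.ConfigParser()
config.optionxform = str  # keep the capitalisation of keys such as 'ComBat'

config['General'] = {}
config['General']['ComBat'] = 'True'  # a string, not the boolean True

# A parser can still interpret the string as a boolean when needed.
combat_enabled = config.getboolean('General', 'ComBat')
```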
Note
In line with current literature, ComBat is applied once on the full dataset straight after the feature extraction, thus before the actual hyperparameter optimization. Hence, to avoid serious overfitting, we advise NEVER using the variable you are trying to predict as the moderation variable.
Bayesian optimization with SMAC instead of random search¶
Note
The SMAC algorithm only works on Linux, because of its random forest surrogate model
implementation. Make sure to use swig3.0. To circumvent pyrfr issues
with SMAC, we use a custom fork of the original SMAC package that needs to be installed separately.
Steps to take in order to use SMAC within WORC:
sudo apt-get remove swig
sudo apt-get install swig3.0
sudo ln -s /usr/bin/swig3.0 /usr/bin/swig
pip install pyrfr==0.8.0
pip install git+https://github.com/mitchelldeen/SMAC3.git
The SMAC algorithm, using Bayesian optimization, can be used for the hyperparameter optimization by
setting the config['SMAC']['use'] parameter to 'True'. For details on which SMAC parameters
can be modified, see the Config chapter.
The core functionality of SMAC within WORC is implemented in
WORC.resources.fastr_tools.worc.bin.smac_tool. The configuration space of SMAC is specified
in WORC.classification.smac, which is also where new methods can be added to the search space.
There is additional output when using SMAC. The final output file smac_results_all_0.json
is added along with the regular performance file in the output folder. It contains information on the
optimization procedure for each cross-validation split, with statistics on the performance and all
intermediate best found configurations. The end of the file contains a summary of the average statistics
over all train-test cross-validations.
Multilabel classification and regression¶
While WORC was primarily designed for binary classification, as also
demonstrated in the main manuscript, various other types of machine
learning workflows have been included as well.
In multiclass classification, several mutually exclusive classes are
predicted at the same time. This is a special form of multilabel classification,
in which the classes do not have to be mutually exclusive. When using
multilabel classification in WORC, the only difference from binary
classification in the workflow is in the machine learning component.
For the other components, e.g. feature selection and resampling, methods that
do not support multiclass classification are performed per
class in a one-vs-rest approach. Some of the binary classifiers naturally
support multilabel classification (i.e., random forest, AdaBoost,
and extreme gradient boosting) and are thus used as is. Others only
support binary classification (i.e., LDA, QDA, Naive Bayes, SVM, logistic
regression), and are therefore also performed per class in a one-vs-rest
approach and combined in a single multilabel model. In the evaluation,
the same metrics as in the binary classification are evaluated per class.
Additionally, the multiclass AUC [CIT3] and multiclass BCR are computed.
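The one-vs-rest approach and the per-class evaluation can be sketched as follows. This is a minimal illustration with hypothetical function names, not the WORC implementation: each class in turn is treated as the positive class, after which binary metrics such as sensitivity and specificity can be computed for that class.

```python
def binarize(labels, positive_class):
    """One-vs-rest target: 1 for the positive class, 0 for all other classes."""
    return [int(label == positive_class) for label in labels]


def sensitivity_specificity(y_true, y_pred):
    """Binary sensitivity and specificity from 0/1 labels.

    Assumes both classes occur at least once in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


y_true = ['A', 'B', 'C', 'A', 'B', 'C']
y_pred = ['A', 'B', 'A', 'A', 'C', 'C']

# Evaluate every class against the rest.
per_class = {
    cls: sensitivity_specificity(binarize(y_true, cls), binarize(y_pred, cls))
    for cls in sorted(set(y_true))
}
```

In this example class 'A' is always detected but one 'C' sample is mistaken for 'A', so the sensitivity for 'A' is 1.0 while its specificity is 0.75.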
In regression, a continuous label is predicted. As there are no classes,
all class-based feature and sample preprocessing methods
(RELIEF, univariate testing, and all resampling methods) cannot be used.
In the machine learning component, WORC includes the following regressors:
linear regression;
support vector machines;
random forest;
elastic net;
LASSO;
ridge regression;
AdaBoost;
extreme gradient boosting (XGBoost).
The optimization is by default based on the R2-score. Performance metrics computed are the R2-score, mean squared error, intraclass correlation coefficient, Pearson coefficient and p-value, and Spearman coefficient and p-value.
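For reference, three of these metrics can be written out in a few lines. The sketch below uses the standard textbook definitions and hypothetical function names; it is not the WORC evaluation code and omits the p-values, the Spearman coefficient, and the correlation coefficient between raters.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Predictions with a constant offset of 0.5 from the true values.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.5, 3.5, 4.5]

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
rho = pearson(y_true, y_pred)
```

The example shows why several metrics are reported side by side: a constant offset leaves the Pearson coefficient at its maximum, while the R2-score and mean squared error do penalize it.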
References¶
[CIT1] Chen, Jianan, et al. AMINN: Autoencoder-based Multiple Instance Neural Network for Outcome Prediction of Multifocal Liver Metastases. arXiv preprint arXiv:2012.06875 (2020).
[CIT2] Klein, Stefan, et al. Elastix: a toolbox for intensity-based medical image registration. IEEE transactions on medical imaging 29.1 (2009): 196-205.
[CIT3] Hand, David J., and Robert J. Till. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine learning 45.2 (2001): 171-186.