.. _additonalfunctionality-chapter:

Additional functionality
========================

When using ``SimpleWORC``, or WORC with similarly simple configuration settings,
you can already benefit from the main functionality of WORC, i.e. the automatic
algorithm optimization. However, several additional functionalities are provided,
which are discussed in this chapter.

For a description of the radiomics features, please see
:ref:`the radiomics features chapter `. For a description of the data mining
components, see :ref:`the data mining chapter `. All other components are
discussed here. For a comprehensive overview of all functions and parameters,
please look at :ref:`the config chapter `.

Image Preprocessing
--------------------
Preprocessing of the image, and accordingly the mask, is done in the
:py:mod:`WORC.processing.preprocessing` and :py:mod:`WORC.processing.segmentix`
scripts, respectively.

Options for preprocessing the image include, in the following order:

1. N4 bias field correction, see also
   https://simpleitk.readthedocs.io/en/master/link_N4BiasFieldCorrection_docs.html.
2. Checking and optionally correcting the spacing if it is 1x1x1 while the
   DICOM metadata says otherwise.
3. Clipping of the image intensities above and below a certain value.
4. Normalization, see :py:mod:`WORC.processing.preprocessing.normalize_image`
   for all options.
5. Transposing the image to another "main" orientation, e.g. axial.
6. Resampling the image to a different spacing.

Options for preprocessing the segmentation include:

1. Hole filling. Many feature computations cannot deal with holes.
2. Removing small objects. Many feature computations cannot deal with multiple
   objects in a single segmentation.
3. Extracting the largest blob. Many feature computations cannot deal with
   multiple objects in a single segmentation.
4. Instead of using the full segmentation, extracting a ring around the border
   of the segmentation to compute the features on. The ring captures both the
   inner and outer border.
5. Dilating the contour.
6. Masking the contour with another contour.
7. When assuming the same image and metadata, copying the metadata of the image
   to the segmentation.
8. Checking and optionally correcting the spacing if it is 1x1x1 while the
   DICOM metadata says otherwise. Same as image preprocessing step 2.
9. Transposing the segmentation to another "main" orientation, e.g. axial.
   Same as image preprocessing step 5.
10. Resampling the segmentation to a different spacing. Same as image
    preprocessing step 6.

Feature scaling
--------------------
The default method for feature scaling in ``WORC`` is a robust version of
z-scoring. Additional options include:

1. regular z-scoring;
2. MinMax scaling, i.e., scaling to a range between 0 and 1;
3. scaling by centering using the median and IQR;
4. a combination of z-scoring with a logarithmic transform and a correction
   term to better cope with outliers and non-normally distributed features [CIT1]_.
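As an illustration of how such preprocessing and scaling options can be
adjusted, the sketch below overrides a few configuration fields through the
``SimpleWORC`` facade. The field names, values, and the ``add_config_overrides``
method are assumptions for illustration only and may differ per WORC version;
the authoritative list of options is in the config chapter.

.. code-block:: python

    from WORC import SimpleWORC

    # A minimal sketch, assuming images, segmentations, and labels are added to
    # the experiment elsewhere (see the user manual).
    experiment = SimpleWORC('preprocessing_example')

    # The field names below are illustrative assumptions; verify them against
    # the config chapter of your WORC version before use.
    overrides = {
        'Preprocessing': {
            'Normalize': 'True',    # assumed field: intensity normalization (image step 4)
            'Clipping': 'True',     # assumed field: intensity clipping (image step 3)
            'Resampling': 'True',   # assumed field: resampling to a fixed spacing (image step 6)
        },
        'FeatureScaling': {
            'scaling_method': 'robust_z_score',  # assumed name/value for the default robust z-scoring
        },
    }
    experiment.add_config_overrides(overrides)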
Image Registration
-------------------
When using multiple modalities or sequences, and there is only a segmentation
on a single image, image registration is applied to spatially align all
sequences and warp the segmentation to the other images through ``elastix``
[CIT2]_. Usage of ``elastix`` is automatically included in ``WORC`` when only a
single segmentation and multiple modalities are supplied.

The image on which the segmentation is provided is used as the moving image,
the others as the fixed images, as the segmentation will be moved from the
segmented image to the others. Registration is by default performed using a
rigid transformation model with mutual information as the similarity metric,
optimized using adaptive stochastic gradient descent. Manual overrides of these
defaults are included in the ``WORC`` configuration.

When using elastix, parameter files have to be provided in the
``network.Elastix_Para`` object, e.g.:

.. code-block:: python

    network.Elastix_Para = [['Parameters_Rigid.txt', 'Parameters_BSpline.txt']]

The outer list defines the parameter files used per modality. If only one
element is provided, the same will be applied for all modalities. Each element
of the list should be a list of its own, containing the ``elastix`` parameter
filenames. In the example, we provided two files, resulting in first a rigid
registration being performed, followed by a B-spline registration. Examples of
``elastix`` parameter files can be found at
https://github.com/SuperElastix/ElastixModelZoo/tree/master/models/default

ComBat
--------
Commonly, radiomics studies include multicenter data, resulting in
heterogeneity in the acquisition protocols. As radiomics features are generally
sensitive to these variations, this limits the repeatability and
reproducibility. To compensate for the differences in acquisition, feature
harmonization techniques may be used, one of the most frequently used being
ComBat. In ComBat, feature distributions are harmonized for variations in the
imaging acquisition, e.g. due to differences in hospitals, manufacturers, or
acquisition parameters. The dataset is divided into groups based on these
differences, and a correction of the error caused by these differences is
estimated using empirical Bayes.

ComBat is included in ``WORC`` and can be turned on in the configuration,
including options to use empirical Bayes or not, a parametric or non-parametric
approach, and a moderation variable. A wrapper around the original
`ComBat code `_, compatible with the other tools provided by ``WORC``, is
included in the ``WORC`` installation.

When using ComBat, the following configuration steps should be taken:

1. Set ``config['General']['ComBat']`` to ``'True'``.
2. To change the ComBat parameters (i.e. which batch and moderation variable to
   use), change the relevant config fields, see the :ref:`Config chapter `.
3. WORC extracts the batch and moderation variables from the label file which
   you also use to give WORC the actual label you want to predict. The same
   format therefore applies, see the :ref:`User manual ` for more details.

.. note:: In line with current literature, ComBat is applied once on the full
    dataset straight after the feature extraction, thus before the actual
    hyperoptimization. Hence, to avoid serious overfitting, we advise you to
    **NEVER** use the variable you are trying to predict as the moderation
    variable.
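As a sketch of what these steps look like in practice, the example below
enables ComBat and points it to hypothetical batch and moderation columns. Only
``config['General']['ComBat']`` is taken from the text above; the other field
names, the column names, and the ``add_config_overrides`` call are assumptions
to be verified against the config chapter and user manual.

.. code-block:: python

    from WORC import SimpleWORC

    # A minimal sketch, assuming a label file containing the prediction label
    # plus the batch and moderation columns has already been added.
    experiment = SimpleWORC('combat_example')

    overrides = {
        'General': {'ComBat': 'True'},  # documented switch: turn on ComBat harmonization
        'ComBat': {
            'batch': ['Hospital'],      # assumed field/column: defines the acquisition groups
            'mod': ['Age'],             # assumed field/column: moderation variable; never the prediction label
        },
    }
    experiment.add_config_overrides(overrides)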
Bayesian optimization with SMAC instead of random search
----------------------------------------------------------

.. note:: The SMAC algorithm only works on Linux, because of its random forest
    surrogate model implementation.

Make sure to use ``swig3.0``. To circumvent ``pyrfr`` issues with SMAC, we use
a custom fork of the original SMAC package that needs to be installed
separately. Steps to take in order to use SMAC within WORC:

1. ``sudo apt-get remove swig``
2. ``sudo apt-get install swig3.0``
3. ``sudo ln -s /usr/bin/swig3.0 /usr/bin/swig``
4. ``pip install pyrfr==0.8.0``
5. ``pip install git+https://github.com/mitchelldeen/SMAC3.git``

The SMAC algorithm, using Bayesian optimization, can be used for the
hyperparameter optimization by setting the ``config['SMAC']['use']`` parameter
to ``'True'``. For details on which SMAC parameters can be modified, see the
:ref:`Config chapter `. The core functionality of SMAC within WORC is
implemented in :py:mod:`WORC.resources.fastr_tools.worc.bin.smac_tool`. The
configuration space of SMAC is specified in :py:mod:`WORC.classification.smac`,
which is also where new methods can be added to the search space.

There is additional output when using SMAC. The final output file
``smac_results_all_0.json`` is added along with the regular performance file in
the output folder. It contains information on the optimization procedure for
each cross-validation split, with statistics on the performance and all
intermediate best found configurations. The end of the file contains a summary
of the average statistics over all train-test cross-validations.
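A minimal sketch of switching from random search to SMAC after completing the
installation steps above: only ``config['SMAC']['use']`` is documented here,
while the ``add_config_overrides`` call and any further SMAC settings (such as
an evaluation budget) are assumptions to be checked against the config chapter.

.. code-block:: python

    from WORC import SimpleWORC

    # A minimal sketch, assuming the SMAC fork and swig3.0 are installed and
    # the experiment has otherwise been fully configured.
    experiment = SimpleWORC('smac_example')

    # Documented switch: use SMAC's Bayesian optimization instead of the
    # default random search for the hyperparameter optimization.
    experiment.add_config_overrides({'SMAC': {'use': 'True'}})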
Multilabel classification and regression
----------------------------------------
While ``WORC`` was primarily designed for binary classification, as also
demonstrated in the main manuscript, various other types of machine learning
workflows have been included as well.

In multilabel classification, several mutually exclusive classes are predicted
at the same time. This is a special form of multiclass classification, in which
the classes do not have to be mutually exclusive. When using multilabel
classification in ``WORC``, the only difference from binary classification in
the workflows is in the machine learning component. For the other components,
e.g. feature selection and resampling, when not supporting multiclass
classification, the methods are performed per class in a one-vs-rest approach.
Some of the binary classifiers naturally support multilabel classification
(i.e., random forest, AdaBoost, and extreme gradient boosting) and are thus
used directly. Others only support binary classification (i.e., LDA, QDA,
Naive Bayes, SVM, logistic regression), and are therefore also applied per
class in a one-vs-rest approach and combined in a single multilabel model. In
the evaluation, the same metrics as in the binary classification are evaluated
per class. Additionally, the multiclass AUC [CIT3]_ and multiclass BCR are
computed.

In regression, a continuous label is predicted. As there are no classes, all
class-based feature and sample preprocessing methods (RELIEF, univariate
testing, and all resampling methods) cannot be used. In the machine learning
component, ``WORC`` includes the following regressors:

1. linear regression;
2. support vector machines;
3. random forest;
4. elastic net;
5. LASSO;
6. ridge regression;
7. AdaBoost;
8. extreme gradient boosting (XGBoost).

The optimization is by default based on the R2-score. Performance metrics
computed are the R2-score, mean squared error, intraclass correlation
coefficient (ICC), Pearson coefficient and p-value, and Spearman coefficient
and p-value.

References
------------

.. [CIT1] Chen, Jianan, et al. *AMINN: Autoencoder-based Multiple Instance
    Neural Network for Outcome Prediction of Multifocal Liver Metastases.*
    arXiv preprint arXiv:2012.06875 (2020).

.. [CIT2] Klein, Stefan, et al. *elastix: a toolbox for intensity-based medical
    image registration.* IEEE Transactions on Medical Imaging 29.1 (2009):
    196-205.

.. [CIT3] Hand, David J., and Robert J. Till. *A simple generalisation of the
    area under the ROC curve for multiple class classification problems.*
    Machine Learning 45.2 (2001): 171-186.