W-Net: A Deep Model for Fully Unsupervised Image Segmentation Reproduction

Guru Deep Singh
Apr 15, 2021

Original Authors: Xide Xia, Brian Kulis

Reproduction Authors: Guru Deep Singh, Nadine Duursma

Section I: Introduction

In this blog post we describe our PyTorch implementation reproducing the deep learning paper “W-Net: A Deep Model for Fully Unsupervised Image Segmentation” [1]. This work was done as a student assignment for the course Deep Learning 2020–2021 at Delft University of Technology.

W-Net is a deep learning model for unsupervised image segmentation. Unsupervised segmentation is becoming increasingly important because image labels are time-consuming and expensive to produce, and difficult to obtain in novel domains. The W-Net architecture consists of two concatenated U-Net architectures: the first acts as an encoder that outputs a segmentation of the image, and the second as a decoder that reconstructs the image from this segmentation. We adapted and modified three existing GitHub repositories for this reproduction.

In Section II we explain the W-Net model architecture and the training and evaluation methodology of the paper. In Section III we discuss why and how we modified the three existing repositories to reproduce the W-Net model. In Section IV we discuss the results and their limitations, and finally in Section V we conclude.

Section II: W-Net Model

Architecture

The W-Net architecture is shown in Figure 1. It consists of two parts:

  • U-Encoder: Outputs image segmentations from the unlabelled original images.
  • U-Decoder: Outputs reconstructed images from these segmentations.

The W-Net consists of 18 modules; each module has two 3x3 convolutional layers, each followed by a ReLU non-linearity and batch normalization. There are 46 convolutional layers in total. The first nine modules form the encoder unit that produces the image segmentation, and the other nine modules form the decoder unit that outputs the reconstructed image.

Figure 1: The architecture of the W-Net model [1].

First, the input image is resized to 224 x 224 pixels and passed through the first module’s 3x3 convolutions. The resulting feature map is passed on to the second module by means of a 2x2 max-pooling layer, which halves the width and height from 224 x 224 to 112 x 112 while the number of feature channels doubles from 64 to 128. From here on, instead of a regular convolution, a depth-wise separable convolution is performed. This operation consists of a depth-wise convolution and a pointwise convolution, which gains performance more efficiently with the same number of parameters, because spatial correlations and cross-channel correlations are examined independently. Note that modules 2 to 8 and modules 11 to 17 perform depth-wise separable convolutions instead of regular convolutions; only modules 1, 9, 10 and 18 consist of two regular 3x3 convolutions.
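As a concrete illustration, a single separable convolution layer of this kind could be written in PyTorch as in the minimal sketch below. The pointwise-before-depth-wise ordering and changing the channel count in the pointwise step are our own assumptions (discussed in Section IV), not something fixed by the paper.

```python
import torch
import torch.nn as nn

class SeparableConvBlock(nn.Module):
    """One separable convolution layer as used in modules 2-8 and 11-17:
    a 1x1 pointwise convolution that changes the channel count, followed by
    a 3x3 depth-wise convolution, ReLU and batch normalization. Ordering and
    channel-change location are assumptions (see Section IV)."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.depthwise = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                                   padding=1, groups=out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.bn(self.relu(self.depthwise(self.pointwise(x))))

# Example: the first layer of module 2 maps 64 -> 128 channels at 112 x 112.
block = SeparableConvBlock(64, 128)
out = block(torch.randn(1, 64, 112, 112))  # -> (1, 128, 112, 112)
```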

Modules 3, 4 and 5 follow the same pattern as module 2. This results in a 14 x 14 feature map with 1024 channels. Between modules 5 and 6, up-sampling is performed; each up-sampling step doubles the spatial dimensions. Module 6 concatenates the output of module 4 with the up-sampled output of module 5 so that spatial information is not lost. The same is done in modules 7, 8 and 9, which concatenate the outputs of modules 3, 2 and 1 respectively with the output of the previous module. Note that each module in this expansive path also halves the number of channels.

The final convolutional layer of the encoder is a 1x1 convolution that maps each 64-component feature vector to the desired number of classes K. Finally, a softmax layer rescales the K class logits to values between 0 and 1, producing a soft segmentation.

The decoder has a similar architecture to the encoder. It reduces the spatial dimensions and doubles the number of feature channels by means of 2x2 max pooling between modules 10, 11, 12, 13 and 14. It then up-samples the feature maps, doubling the spatial dimensions and halving the number of channels. The decoder’s final layer is a 1x1 convolution that maps the 64 feature channels back to the 3 channels of the reconstructed RGB image.

Training

To train the model, an optimizer (the paper does not mention which one) is used to update the parameters of the encoder and the decoder. A batch consists of 10 images, and the algorithm uses a new batch for every iteration.

Figure 2: Pseudo Code for training [1]

There are two separate optimizers: one updates the encoder parameters using the soft N-cut loss, and the other updates the whole model, i.e. both the encoder and the decoder parameters, using the reconstruction loss. Both loss functions are given below. With these loss functions, the gradients are computed with backpropagation and used to update the model parameters.
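For reference, the two loss functions as defined in the paper [1] are, in the paper’s notation (with w(u, v) the pixel affinity and p(u = A_k) the encoder’s soft assignment of pixel u to class k):

$$J_{\text{soft-Ncut}}(V, K) = K - \sum_{k=1}^{K} \frac{\sum_{u \in V}\sum_{v \in V} w(u, v)\, p(u = A_k)\, p(v = A_k)}{\sum_{u \in V}\sum_{t \in V} w(u, t)\, p(u = A_k)}$$

$$J_{\text{reconstr}} = \big\lVert X - U_{\text{Dec}}\big(U_{\text{Enc}}(X; W_{\text{Enc}}); W_{\text{Dec}}\big) \big\rVert_2^2$$

The soft N-cut loss depends only on the encoder output, while the reconstruction loss depends on the full W-Net.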

The model is trained on the PASCAL VOC2012 dataset, which contains 11,530 images and 6,929 segmentations. The learning rate is set to 0.003 and divided by 10 after every 1000 iterations. A dropout of 0.65 is added to prevent overfitting. Training is stopped after 50,000 iterations.
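This schedule maps naturally onto PyTorch’s built-in StepLR scheduler; below is a minimal sketch, assuming the scheduler is stepped once per training iteration and using Adam only as a placeholder, since the paper does not name its optimizer.

```python
import torch

# Placeholder model and optimizer; the paper does not specify the optimizer.
model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.003)

# Divide the learning rate by 10 every 1000 iterations
# (call scheduler.step() once per training iteration).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.1)
```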

The reconstruction loss and the N-cut loss of the model during training are shown in Figure 3.

Figure 3: The reconstruction loss and Ncut loss after training the model, obtained from the reference paper [1].

Test

The BSDS300 and BSDS500 datasets are used for testing. These datasets contain 300 and 500 images respectively, with human-annotated segmentations as the ground truth. Different metrics are used for the evaluation, namely Variation of Information (VI), Probabilistic Rand Index (PRI) and Segmentation Covering (SC). The results from the reference paper, averaged over the Optimal Dataset Scale (ODS) and Optimal Image Scale (OIS), are given in Table 1.

Table 1: The results of the paper to reproduce.

Section III: Reproduction

To reproduce the results of the paper, three existing GitHub repositories were used. Repository [2] functioned as the base code of the W-Net model. The problem with this repository was that the computation of the N-cut loss was inefficient, which made it difficult to run locally without CUDA running out of memory; even when run on a virtual server such as Google Colab, it took almost 10 hours of computation time per epoch. Apart from this computational problem, the repository failed to incorporate the 5th and 14th modules of the architecture. We therefore decided to modify the architecture to our needs and use the N-cut loss from another repository.

Repository [3] was used to calculate the N-cut loss during training. Although this repository had an efficient implementation of the loss function, its model architecture was too poorly written to be understood. Hence, we decided to use the model from the first repository [2].

Repository [4] was used to evaluate the metrics SC, PRI and VI by comparing the predicted segmentation with the ground truth segmentation of the image (annotated by humans). Below we explain the existing code and how we modified the repositories to reproduce the results from the W-Net paper.

Repository setup

The repository is built up as follows. The folder ‘datasets’ contains all training and test images. The folder ‘models’ is initially empty; this is where all model parameters will be stored. Furthermore, the folder ‘latent_images’ is where visualizations of the segmented images are placed after every epoch, showing how the model improves as training proceeds. Other relevant files are ‘config.py’, ‘train.py’ and ‘test.py’, which are used for the model set-up, training, and generating the segmentations of the test images respectively.

Model parameters

The configuration of the implementation is described in the config file, from which multiple parameters can be changed. We set the image input size to 224 and the batch size to 10. The repository provides the option to use batch normalization or instance normalization; we chose batch normalization, as was done in the paper. The number of classes K was not explicitly given in the paper, but we found many repositories using K = 64, so after discussion with our teaching assistant we set this value to 64. The model runs for 32 epochs with a dropout rate of 0.65. These configurations can be seen below in Code 1.

Code 1: Config
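A minimal sketch of what such a configuration could look like is given below; the variable names are illustrative and may differ from the actual config.py.

```python
# config.py -- illustrative sketch only; actual variable names may differ.
input_size = 224         # images are resized to 224 x 224
batch_size = 10          # as in the paper
k_classes = 64           # number of segmentation classes K (assumption, see text)
epochs = 32              # total training epochs
dropout = 0.65           # dropout rate from the paper
normalization = "batch"  # "batch" or "instance"; we use batch norm as in the paper
learning_rate = 3e-3     # initial learning rate
```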

Data loader

After completing the set-up, the data was loaded. First, we resized each image to the input size of 224 x 224 and then transformed it into a tensor. This was done for both the training data and the validation data (here, validation data means the data used to produce the progress images after every epoch).

Code 2: Dataloader
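A minimal sketch of an equivalent data pipeline with torchvision is shown below; the directory paths, file extension and the simple unlabelled dataset class are assumptions, not necessarily what our repository uses.

```python
import glob

from PIL import Image
import torchvision.transforms as T
from torch.utils.data import Dataset, DataLoader

# Resize every image to 224 x 224 and convert it to a float tensor in [0, 1].
transform = T.Compose([T.Resize((224, 224)), T.ToTensor()])

class ImageDataset(Dataset):
    """Unlabelled image dataset: W-Net only needs the images themselves."""

    def __init__(self, folder: str):
        self.paths = sorted(glob.glob(folder + "/*.jpg"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return transform(Image.open(self.paths[idx]).convert("RGB"))

# Hypothetical paths; the repository keeps its images under 'datasets/'.
train_loader = DataLoader(ImageDataset("datasets/train"), batch_size=10, shuffle=True)
val_loader = DataLoader(ImageDataset("datasets/val"), batch_size=10, shuffle=False)
```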

Optimizer

Since the paper does not mention the optimizer, we went with Adam in its default settings, except for the learning rate. We would like to highlight that none of the repositories we found online implemented the training according to the algorithm in the paper, shown in Figure 2. Most of them added the N-cut and reconstruction losses together and computed a single gradient. We observed that doing this causes the N-cut loss not to update at all, due to the difference in scale between the losses: the reconstruction loss is in the thousands, whereas the N-cut loss is only in the hundreds, so the N-cut loss was given no preference in the optimization.

In our repository we use two optimizers, both Adam. The first optimizer updates the parameters of the encoder only, based on the N-cut loss. The second optimizer updates the complete W-Net, based on the reconstruction loss. The paper uses a learning rate of 0.003 reduced by a factor of 10 after every 1000 iterations, but with this schedule we found that after 5 epochs the learning rate dropped so low (3 x 10^-8) that no improvements in the reconstructions were observed in the progress images. Thus, we decided to keep the learning rate constant at 0.003 for the first 16 epochs and at 0.0003 for the remaining 16 epochs.

Code 3: Optimizer
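The sketch below illustrates this two-optimizer scheme; `model`, `soft_ncut_loss` and `train_loader` are placeholders standing in for the actual objects in our repository, and `model(x)` is assumed to return both the segmentation and the reconstruction.

```python
import torch

def train_one_epoch(model, train_loader, soft_ncut_loss,
                    enc_optimizer, all_optimizer):
    """One epoch of the two-step update from Figure 2 (sketch).

    `model(x)` is assumed to return (segmentation, reconstruction);
    `soft_ncut_loss` is the loss from Code 4. Both are placeholders for
    the actual objects in our repository.
    """
    mse = torch.nn.MSELoss()
    for images in train_loader:
        # Step 1: update only the encoder with the soft N-cut loss.
        enc_optimizer.zero_grad()
        segmentation, _ = model(images)
        soft_ncut_loss(segmentation).backward()
        enc_optimizer.step()

        # Step 2: update the whole W-Net with the reconstruction loss.
        all_optimizer.zero_grad()
        _, reconstruction = model(images)
        mse(reconstruction, images).backward()
        all_optimizer.step()

# The two optimizers themselves (both Adam, as described above), assuming the
# encoder half is exposed as model.encoder:
# enc_optimizer = torch.optim.Adam(model.encoder.parameters(), lr=3e-3)
# all_optimizer = torch.optim.Adam(model.parameters(), lr=3e-3)
```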

Soft-N-cut loss

The soft N-cut loss was then taken from [3], because that implementation was computationally cheaper than the one in the base repository [2].

Code 4: N-Cut Loss
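The following is a simplified sketch in the spirit of the efficient implementation in [3]: the pixel weights here depend only on spatial distance and are applied with a grouped convolution, whereas the full loss in [1] also weights pixel pairs by intensity similarity.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(radius: int = 4, sigma: float = 4.0) -> torch.Tensor:
    """2D Gaussian used as the spatial weight w(u, v)."""
    coords = torch.arange(-radius, radius + 1, dtype=torch.float32)
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    kernel = torch.outer(g, g)
    return (kernel / kernel.sum()).view(1, 1, 2 * radius + 1, 2 * radius + 1)

def soft_ncut_loss(probs: torch.Tensor, radius: int = 4, sigma: float = 4.0) -> torch.Tensor:
    """Simplified soft N-cut loss.

    probs: (B, K, H, W) softmax output of the encoder. Only spatial weights
    are used here; the full loss in [1] and [3] also includes an intensity term.
    """
    k = probs.shape[1]
    kernel = gaussian_kernel(radius, sigma).to(probs.device).repeat(k, 1, 1, 1)

    # (w * p_k)(u): spatially weighted class probability around each pixel u.
    weighted_probs = F.conv2d(probs, kernel, padding=radius, groups=k)
    # (w * 1)(u): total spatial weight mass around u (handles image borders).
    weight_mass = F.conv2d(torch.ones_like(probs), kernel, padding=radius, groups=k)

    numerator = (probs * weighted_probs).sum(dim=(2, 3))   # assoc(A_k, A_k)
    denominator = (probs * weight_mass).sum(dim=(2, 3))    # assoc(A_k, V)
    return (k - (numerator / (denominator + 1e-8)).sum(dim=1)).mean()
```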

Testing

The model was trained with this loss criterion. To test the model, the BSDS ground truth segmentations were converted from .mat files to NumPy arrays and saved as .npy files so they could be used later.

Code 5: Function to convert ground truth from “.mat” to “.npy”
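A minimal sketch of such a conversion with SciPy is shown below; the indexing into the BSDS ‘groundTruth’ struct follows the layout commonly found in these files and should be treated as an assumption.

```python
import glob
import os

import numpy as np
from scipy.io import loadmat

def convert_ground_truth(mat_dir: str, out_dir: str) -> None:
    """Convert BSDS ground-truth .mat files to .npy arrays.

    Each .mat file holds several human annotations; here we keep the first
    annotator's segmentation map (an assumption -- one could also keep all).
    """
    os.makedirs(out_dir, exist_ok=True)
    for path in glob.glob(os.path.join(mat_dir, "*.mat")):
        gt = loadmat(path)["groundTruth"]
        segmentation = gt[0, 0]["Segmentation"][0, 0]  # first annotator
        name = os.path.splitext(os.path.basename(path))[0]
        np.save(os.path.join(out_dir, name + ".npy"), segmentation)
```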

Then we checked how many 224 x 224 patches could be extracted from each test image, so that these could be fed to the trained model and their segmentations extracted. The segmented patches are then concatenated back together. This is done instead of simply resizing the image, because when calculating metrics such as SC, PRI and VI it would not be possible to match resized images against their ground truth.

Code 6: Generating the segmentations for the test data
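A rough sketch of this patch-based segmentation is given below, assuming non-overlapping 224 x 224 patches and a `model` placeholder that returns per-pixel class probabilities for a single patch.

```python
import torch

def segment_in_patches(image: torch.Tensor, model, patch: int = 224) -> torch.Tensor:
    """Segment an image by tiling it with non-overlapping patches.

    image: (3, H, W) tensor; model: callable returning (1, K, patch, patch)
    class probabilities for a (1, 3, patch, patch) input. Both are placeholders
    for the objects in our repository. Pixels that do not fit into a full patch
    are left out, mirroring the cropping described above.
    """
    _, h, w = image.shape
    rows, cols = h // patch, w // patch
    out = torch.zeros(rows * patch, cols * patch, dtype=torch.long)
    with torch.no_grad():
        for r in range(rows):
            for c in range(cols):
                tile = image[:, r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
                probs = model(tile.unsqueeze(0))          # (1, K, patch, patch)
                labels = probs.argmax(dim=1).squeeze(0)   # hard segmentation
                out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = labels
    return out
```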

The base repository [2] also uses a function “pixel_count” to make the predicted segmentation values lie in the same range as the ground truth. Because this function needs the ground truth values, we converted the ground truth from “.mat” to “.npy” format, as described above.

Metric Calculation

Lastly, the metrics were evaluated using repository [4]. This repository requires the ground truth and the predictions as “.mat” files in order to evaluate the Segmentation Covering, Probabilistic Rand Index and Variation of Information. It reports these metrics for each image as well as the average over the complete dataset.
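Exporting our predictions to .mat can be done with scipy.io.savemat; in the sketch below the variable name ‘segs’ inside the .mat file is hypothetical and has to match whatever the evaluation scripts in [4] actually load.

```python
import numpy as np
from scipy.io import savemat

def save_prediction_as_mat(segmentation: np.ndarray, out_path: str) -> None:
    """Save a predicted label map so a MATLAB-based evaluator can read it.

    The key 'segs' is a hypothetical name; it must match the variable name
    that the evaluation scripts in [4] expect.
    """
    savemat(out_path, {"segs": segmentation.astype(np.uint16)})
```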

For the full implementation, we refer you to our GitHub repository: Guru-Deep-Singh/Group-31-W-Net-A-Deep-Model-for-Fully-Unsupervised-Image-Segmentation (github.com)

Section IV: Results and Discussion

Results

After training the model with the code from the previous section, we plotted the N-cut loss and the reconstruction loss for every iteration. Both losses follow approximately the same pattern as in the original paper: they decrease quickly at the beginning and flatten out after a larger number of iterations. The N-cut loss dropped from 60 to nearly 20. The magnitude of both losses is, however, higher than in the original implementation; the cause for this is discussed after the presentation of all results.

Figure 4: Reconstruction loss and the N-cut loss after training our own model.

In addition to plotting the loss function, we have visualized the segmentations both on the training images and the validation set to evaluate the convergence.

Figure 5: Visualization of the image segmentations and reconstructions at initialization (top-left), after the 5th (top-right), 20th (bottom-left) and last 30th epoch (bottom-right).

It can be seen that the segmentations improve after every epoch. Details are better captured, and the region separations show better overlap with the original image. The reconstructions also improve: the colours become more representative of the original colours, and the amount of detail in the reconstructed images increases as well.

Below, the segmentations on the test sets are shown next to the ground truth segmentations.

Figure 6: Segmentations of the BSDS300 images by our own model (left) vs ground truth (right).
Figure 7: Segmentations of the BSDS500 images by our own model (left) vs ground truth (right).

It can be noticed that the segmentations do not closely correspond to the ground truth segmentations. This could be due to the many assumptions we had to make during this project for details that were missing from the original paper. However, the objects in both the BSDS300 and BSDS500 sets are clearly distinguishable in our own implementation, and therefore we consider this a successful result.

Then we move on to the metrics. A summary of the evaluated metrics is given in the table below, where the scores of the original W-Net implementation are given in brackets. Since the original paper reports two scales, Optimal Dataset Scale (ODS) and Optimal Image Scale (OIS), we compare our results to the average of the two.

Table 2: Metrics results of our own implementation compared to the paper.

Discussion

We noticed that the performance of our reproduced W-Net is slightly worse than that of the original implementation: the magnitude of both losses is higher and the performance metric scores are lower. This can be caused by the following points.

First, it should be noticed that the initial magnitude of both loss functions is higher: around 23,000 instead of 17,500 for the reconstruction loss, and around 60 instead of 19 for the soft N-cut loss. The paper mentions neither the weight initialization nor the optimizer it used, so we had to make several guesses. In addition, the paper leaves out a key piece of information, the number of classes K. This can drastically change the N-cut loss used to update the encoder parameters, and therefore the reconstruction loss is affected as well.

In addition, the order in which the pointwise and depth-wise convolutions are performed in each module was not explicitly mentioned in the paper, nor was it mentioned whether the number of channels is changed in the pointwise or in the depth-wise convolution. We found it more common to apply the pointwise convolution before the depth-wise convolution, and we changed the number of channels in the pointwise convolution. However, it is uncertain whether this matches exactly what the paper did, and that might have affected our results.

Lastly, we did not implement a learning-rate schedule in which the learning rate is divided by 10 every 1000 iterations; we adjusted the learning rate manually instead, and it could have been made even smaller towards the end. At the end of training, the loss keeps oscillating around a certain point. These oscillations can be due either to stochastic gradient descent with a small batch size or to a learning rate that is high enough to keep overshooting the optimum. To prevent this, the batch size should be increased and the learning rate decreased.

Section V: Conclusion

Our aim was to reproduce the W-Net model for fully unsupervised image segmentation [1]. We reconstructed the W-Net model by combining and adjusting three existing GitHub repositories [2][3][4]: we used [2] as a base, [3] to increase the computational efficiency of training by changing the soft N-cut loss, and [4] to evaluate the metrics for the results.

Visually, the image segmentations created by our implementation looked good on both the BSDS300 and BSDS500 test sets, and the original input image could be clearly recognized from the reconstruction as well. The losses were higher than in the original paper, and the performance metric scores were close but not identical. The main reason for this is probably the difference in the chosen parameters, which were not available from the paper.

Here, we would like to analyse the paper in the context of independent reproducibility as discussed by Edward Raff [5]. Regarding the readability of the paper: although its motivation and context are clear, the paper itself is not self-contained and refers to external sources for some very key concepts, which sometimes makes it hard to follow because the continuity is broken. Next, we would like to highlight that the paper omits some important hyperparameters, such as the number of classes K, the kind of optimizer, and the initialization. Nevertheless, the paper provides an unsupervised model for image segmentation that has proven to be successful and has myriad applications.

Bibliography

[1] Xia, X., & Kulis, B. (2017). W-Net: A Deep Model for Fully Unsupervised Image Segmentation. ArXiv, abs/1711.08506.

[2] gr-b. (2020). W-Net: A Deep Model for Fully Unsupervised Image Segmentation — Implementation in Pytorch. fkodom/wnet-unsupervised-image-segmentation (github.com).

[3] Frank Odom. (2019). W-Net Unsupervised Image Segmentation. fkodom/wnet-unsupervised-image-segmentation (github.com)

[4] Kuang Haofei. (2020). BSD500-Segmentation-Evaluator. KuangHaofei/BSD500-Segmentation-Evaluator (github.com).

[5] Raff, E. (2019). A Step Toward Quantifying Independently Reproducible Machine Learning Research. In NeurIPS. http://arxiv.org/abs/1909.06674
