Learning Deconvolution Network for Semantic Segmentation

17 November 2018

Authors: Hyeonwoo Noh, Seunghoon Hong, Bohyung Han. Department of Computer Science and Engineering, POSTECH, Korea

Abstract: We propose a novel semantic segmentation algorithm by learning a deconvolution network. We learn the network on top of the convolutional layers adopted from the VGG 16-layer net. The deconvolution network is composed of deconvolution and unpooling layers, which identify pixel-wise class labels and predict segmentation masks. We apply the trained network to each proposal in an input image, and construct the final semantic segmentation map by combining the results from all proposals in a simple manner. The proposed algorithm mitigates the limitations of existing methods based on fully convolutional networks by integrating a deep deconvolution network and proposal-wise prediction; our segmentation method typically identifies detailed structures and handles objects in multiple scales naturally. Our network demonstrates outstanding performance on the PASCAL VOC 2012 dataset, and we achieve the best accuracy (72.5%) among the methods trained with no external data through an ensemble with the fully convolutional network.

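Summary: We propose a new semantic segmentation algorithm based on learning a deconvolution network. The network is learned on top of the convolutional layers of the 16-layer VGG net. The deconvolution network consists of deconvolution and unpooling layers; it identifies per-pixel class labels and predicts segmentation masks. We apply the trained network to every proposal in the input image, then merge the results of all proposals in a simple manner to build the final semantic segmentation map. By integrating a deep deconvolution network with proposal-wise prediction, the algorithm mitigates the limitations of existing segmentation methods based on fully convolutional networks. Our algorithm typically identifies detailed structures and naturally handles objects at multiple scales. Our network shows excellent performance on the PASCAL VOC 2012 dataset and, through an ensemble with the fully convolutional network, obtains the best accuracy (72.5%) among methods that use no external data.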

We employ a completely different strategy to perform semantic segmentation based on CNN. Our main contributions are summarized below:

  • We learn a multi-layer deconvolution network, which is composed of deconvolution, unpooling, and rectified linear unit (ReLU) layers. Learning a deconvolution network for semantic segmentation is meaningful, but to our knowledge no one has attempted it yet.
  • The trained network is applied to individual object proposals to obtain instance-wise segmentations, which are combined for the final semantic segmentation; it is free from the scale issues found in FCN-based methods and identifies finer details of an object.
  • We achieve outstanding performance using a deconvolution network trained only on the PASCAL VOC 2012 dataset, and obtain the best accuracy through an ensemble with FCN by exploiting the heterogeneous and complementary characteristics of our algorithm with respect to FCN-based methods.

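Based on CNNs, we adopt a completely different strategy for semantic segmentation. Our main contributions are as follows:

  • We learn a multi-layer deconvolution network made up of deconvolution, unpooling, and ReLU layers. Learning a deconvolution network for semantic segmentation is meaningful, but as far as we know nobody has tried it.
  • The trained network is applied to individual object proposals to obtain instance-level segmentations, which are combined into the final semantic segmentation; it does not suffer from the scale issues found in FCN-based methods and can recover finer details of an object.
  • We achieve excellent performance with a deconvolution network trained only on the PASCAL VOC 2012 dataset, and obtain the best accuracy through an ensemble with FCN, which exploits the heterogeneous and complementary characteristics of our algorithm relative to FCN-based methods.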


Figure 2 illustrates the detailed configuration of the entire deep network. Our trained network is composed of two parts - convolution and deconvolution networks. The convolution network corresponds to a feature extractor that transforms the input image into a multidimensional feature representation, whereas the deconvolution network is a shape generator that produces object segmentation from the features extracted by the convolution network. The final output of the network is a probability map of the same size as the input image, indicating the probability that each pixel belongs to one of the predefined classes.

We employ the VGG 16-layer net for the convolutional part, with its last classification layer removed. Our convolution network has 13 convolutional layers altogether, interleaved with rectification and pooling operations, and 2 fully connected layers are appended at the end to impose class-specific projection. Our deconvolution network is a mirrored version of the convolution network, and has multiple series of unpooling, deconvolution, and rectification layers. Contrary to the convolution network, which reduces the size of activations through feed-forwarding, the deconvolution network enlarges the activations through the combination of unpooling and deconvolution operations.
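
The shrinking and enlarging of activations described above can be traced with a few lines of arithmetic. This is only a rough sketch: the 224x224 input, the 2x2 stride-2 pooling, the size-preserving convolutions, and the 7x7 kernel of the first fully connected layer are assumptions taken from the usual VGG-16 setup rather than from the text above, and `trace_sizes` is a hypothetical helper name.

```python
def trace_sizes(input_size=224, num_pools=5, fc_kernel=7):
    """Return (stage name, spatial size) pairs for a mirrored conv/deconv network.

    Assumptions (not stated in the notes): convolutions preserve the spatial
    size, every pooling is 2x2 with stride 2, and the first fully connected
    layer acts like an unpadded fc_kernel x fc_kernel convolution.
    """
    stages = [("input", input_size)]
    size = input_size
    # Convolution network: each pooling halves the resolution.
    for i in range(1, num_pools + 1):
        size //= 2
        stages.append((f"pool{i}", size))
    # Fully connected layers collapse the map to a single location.
    size = size - fc_kernel + 1
    stages.append(("fc6/fc7", size))
    # Deconvolution network: first expand back to the pool5 resolution.
    size = size + fc_kernel - 1
    stages.append(("deconv-fc6", size))
    # Each unpooling doubles the resolution; deconvolutions keep it unchanged.
    for i in range(num_pools, 0, -1):
        size *= 2
        stages.append((f"unpool{i}", size))
    return stages


if __name__ == "__main__":
    for name, size in trace_sizes():
        print(f"{name:>10}: {size} x {size}")
```

Under these assumptions the output map ends up at the input resolution, which is what the mirrored design is meant to guarantee.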


We employ unpooling layers in the deconvolution network, which perform the reverse operation of pooling and reconstruct the original size of activations, as illustrated in Figure 3. The locations of the maximum activations selected during the pooling operation are recorded in switch variables, which are then used to place each activation back at its original pooled location. This unpooling strategy is particularly useful for reconstructing the structure of the input object.
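
A minimal sketch of pooling with switch variables and the matching unpooling, written in plain Python on nested lists. The function names and the non-overlapping 2x2 window are illustrative assumptions, not the authors' implementation.

```python
def max_pool_with_switches(x, k=2):
    """Non-overlapping k x k max pooling over a 2-D list.

    Returns the pooled map and the switches: for every pooled cell, the
    (row, col) position in x where the maximum was found.
    """
    h, w = len(x), len(x[0])
    pooled, switches = [], []
    for i in range(0, h, k):
        pooled_row, switch_row = [], []
        for j in range(0, w, k):
            window = [(x[r][c], (r, c))
                      for r in range(i, min(i + k, h))
                      for c in range(j, min(j + k, w))]
            value, location = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            switch_row.append(location)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool(pooled, switches, out_shape):
    """Place every pooled activation back at its recorded location.

    All other positions stay zero, so the result is enlarged but sparse.
    """
    h, w = out_shape
    out = [[0.0] * w for _ in range(h)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (r, c) in zip(pooled_row, switch_row):
            out[r][c] = value
    return out
```

In the network itself the switches come from the mirrored pooling layer of the convolution part, while the values being placed come from the preceding deconvolution stage; the sketch reuses the pooled values only to keep the example short.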


The output of an unpooling layer is an enlarged, yet sparse activation map. The deconvolution layers densify the sparse activations obtained by unpooling through convolution-like operations with multiple learned filters. However, contrary to convolutional layers, which connect multiple input activations within a filter window to a single activation, deconvolutional layers associate a single input activation with multiple outputs, as illustrated in Figure 3. The output of the deconvolutional layer is an enlarged and dense activation map. We crop the boundary of the enlarged activation map to keep the size of the output map identical to the one from the unpooling layer.
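
The one-to-many mapping and the boundary crop can be sketched for a single channel and a single filter. The stride of 1, the odd kernel size, and the name `deconv2d` are assumptions made for illustration; a real layer sums over many input channels and learns many filters.

```python
def deconv2d(x, kernel):
    """Single-channel, stride-1 deconvolution (transposed convolution).

    Every input activation scatters a scaled copy of the kernel into the
    output, so one input touches many outputs. The full result has size
    (h + kh - 1) x (w + kw - 1); its border is cropped so the returned map
    has the same h x w size as the input (assumes odd kernel sizes).
    """
    h, w = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    full = [[0.0] * (w + kw - 1) for _ in range(h + kh - 1)]
    for i in range(h):
        for j in range(w):
            if x[i][j] == 0:
                continue  # zero activations contribute nothing
            for a in range(kh):
                for b in range(kw):
                    full[i + a][j + b] += x[i][j] * kernel[a][b]
    # Crop the enlarged map back to the input size.
    top, left = (kh - 1) // 2, (kw - 1) // 2
    return [row[left:left + w] for row in full[top:top + h]]
```

Feeding it a sparse map with a single non-zero activation shows the densifying effect: the output contains a copy of the filter centred on that activation, clipped at the map boundary.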

The learned filters in deconvolutional layers correspond to bases for reconstructing the shape of an input object. Therefore, similar to the convolution network, a hierarchical structure of deconvolutional layers is used to capture different levels of shape detail. The filters in lower layers tend to capture the overall shape of an object, while class-specific fine details are encoded in the filters in higher layers. In this way, the network directly takes class-specific shape information into account for semantic segmentation, which is often ignored in other approaches based only on convolutional layers.

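The learned filters in the deconvolution layers serve as bases for reconstructing the shape of the input object. As in a convolution network, the hierarchy of deconvolution layers is therefore used to capture shape details at different levels. Filters in lower layers tend to capture the overall shape of an object, while class-specific details are encoded in the filters of higher layers. In this way the network takes class-specific shape information directly into account for semantic segmentation, something that is often ignored by other methods based only on convolutional layers.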

Analysis of Deconvolution Network

Figure 4 visualizes the outputs of the network layer by layer, which helps in understanding the internal operations of the deconvolution network.

We can observe that coarse-to-fine object structures are reconstructed through the propagation in the deconvolutional layers; lower layers tend to capture the overall coarse configuration of an object (e.g. location, shape, and region), while more complex patterns are discovered in higher layers. Note that unpooling and deconvolution play different roles in the construction of the segmentation masks. Unpooling captures example-specific structures by tracing the original locations with strong activations back to image space. As a result, it effectively reconstructs the detailed structure of an object in finer resolutions. On the other hand, learned filters in deconvolution layers tend to capture class-specific shapes. Through deconvolutions, the activations closely related to the target classes are amplified while noisy activations from other regions are suppressed effectively. By the combination of unpooling and deconvolution, our network generates accurate segmentation maps.

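We can observe that the object structure is built up from coarse to fine: the overall coarse configuration of an object (e.g. location, shape, and region) is propagated through the network, while more complex patterns emerge in higher layers. Note that unpooling and deconvolution play different roles in building the segmentation mask. Unpooling captures example-specific structure by tracing the original locations of the strongest activations back to image space; as a result, it effectively rebuilds the detailed structure of an object at a finer resolution. The learned filters in the deconvolution layers, on the other hand, tend to capture class-specific shapes. Through deconvolution, activations closely related to the target class are amplified, while noisy activations from other regions are effectively suppressed. By combining unpooling and deconvolution, our network generates accurate segmentation maps.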

System Overview

Given our network, semantic segmentation of a whole image is obtained by applying the network to each candidate proposal extracted from the image and aggregating the outputs of all proposals in the original image space.
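
One simple way to aggregate proposal outputs is a pixel-wise maximum over all proposals, sketched below for a single class. The choice of the maximum, the `(top, left, score_map)` proposal format, and the name `aggregate_proposals` are assumptions for illustration; the text above only says that the outputs are aggregated in the original image space.

```python
def aggregate_proposals(image_shape, proposals):
    """Merge per-proposal score maps into one image-sized map for one class.

    proposals: list of (top, left, score_map), where score_map is a 2-D list
    already resized to the proposal's bounding box (an assumption).
    Pixels covered by several proposals keep the maximum score; pixels that
    no proposal covers stay at 0.
    """
    h, w = image_shape
    out = [[0.0] * w for _ in range(h)]
    for top, left, score_map in proposals:
        for r, row in enumerate(score_map):
            for c, score in enumerate(row):
                y, x = top + r, left + c
                if 0 <= y < h and 0 <= x < w:  # ignore parts outside the image
                    out[y][x] = max(out[y][x], score)
    return out
```

Running this once per class gives a stack of image-sized score maps, from which per-pixel labels can then be derived.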

Instance-wise segmentation has a few advantages over image-level prediction. It handles objects at various scales and identifies fine details of objects, whereas approaches with fixed-size receptive fields have trouble with these issues. It also alleviates training complexity by reducing the search space for prediction, and reduces the memory requirement for training.

Ensemble with FCN

We develop a simple method to combine the outputs of both algorithms. Given two sets of class conditional probability maps of an input image computed independently by the proposed method and FCN, we compute the mean of both output maps and apply the CRF to obtain the final semantic segmentation.
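
The averaging step can be sketched as follows. The CRF refinement is deliberately left out, so the per-pixel argmax at the end is only a stand-in for it; `ensemble_labels` and the `[class][row][col]` layout are hypothetical choices.

```python
def ensemble_labels(prob_a, prob_b):
    """Average two sets of class-conditional probability maps.

    prob_a, prob_b: nested lists indexed as [class][row][col], same shape.
    Returns the mean maps and, as a stand-in for the CRF step (omitted here),
    the per-pixel argmax label of the mean.
    """
    mean = [[[(a + b) / 2.0 for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(map_a, map_b)]
            for map_a, map_b in zip(prob_a, prob_b)]
    num_classes = len(mean)
    h, w = len(mean[0]), len(mean[0][0])
    labels = [[max(range(num_classes), key=lambda k: mean[k][r][c])
               for c in range(w)]
              for r in range(h)]
    return mean, labels
```

Because both inputs are probability maps over the same classes, their mean is again a valid probability map, which is what allows it to be passed on to the CRF.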

Table 2 summarizes the detailed configuration of the proposed network presented in Figure 2.

One-sentence summary: unpooling, deconvolution, crop