Authors: Guosheng Lin, Anton Milan, Chunhua Shen, Ian Reid. Nanyang Technological University, University of Adelaide, Australian Centre for Robotic Vision
Abstract: Recently, very deep convolutional neural networks (CNNs) have shown outstanding performance in object recognition and have also been the first choice for dense classification problems such as semantic segmentation. However, repeated subsampling operations like pooling or convolution striding in deep CNNs lead to a significant decrease in the initial image resolution. Here, we present RefineNet, a generic multi-path refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections. In this way, the deeper layers that capture high-level semantic features can be directly refined using fine-grained features from earlier convolutions. The individual components of RefineNet employ residual connections following the identity mapping mindset, which allows for effective end-to-end training. Further, we introduce chained residual pooling, which captures rich background context in an efficient manner. We carry out comprehensive experiments and set new state-of-the-art results on seven public datasets. In particular, we achieve an intersection-over-union score of 83.4 on the challenging PASCAL VOC 2012 dataset, which is the best reported result to date.
Multiple stages of spatial pooling and convolution strides reduce the final image prediction typically by a factor of 32 in each dimension, thereby losing much of the finer image structure.
We argue that features from all levels are helpful for semantic segmentation. High-level semantic features help the category recognition of image regions, while low-level visual features help to generate sharp, detailed boundaries for high-resolution prediction. How to effectively exploit middle-layer features remains an open question and deserves more attention. To this end, we propose a novel network architecture which effectively exploits multi-level features for generating high-resolution predictions. Our main contributions are as follows:
1. We propose a multi-path refinement network (RefineNet) which exploits features at multiple levels of abstraction for high-resolution semantic segmentation. RefineNet refines low-resolution (coarse) semantic segmentation with fine-grained low-level features in a recursive manner to generate high-resolution semantic feature maps. Our model is flexible in that it can be cascaded and modified easily.
2. Our cascaded RefineNets can be effectively trained end-to-end, which is crucial for best prediction performance. More specifically, all components in RefineNet employ residual connections with identity mappings, such that gradients can be directly propagated through short-range and long-range residual connections, allowing for both effective and efficient end-to-end training.
3. We propose a new network component we call chained residual pooling, which is able to capture background context from a large image region. It does so by efficiently pooling features with multiple window sizes and fusing them together with residual connections and learnable weights.
4. The proposed RefineNet achieves new state-of-the-art performance on 7 public datasets, including PASCAL VOC 2012, PASCAL-Context, NYUDv2, SUN-RGBD, Cityscapes, ADE20K, and the object parsing Person-Parts dataset. In particular, we achieve an IoU score of 83.4 on the PASCAL VOC 2012 dataset, outperforming the currently best approach Deeplab by a large margin.
Although some related work exists, how to effectively exploit intermediate-layer features remains an open problem. We propose a new network architecture, RefineNet, to address it.
The network architecture of RefineNet is distinct from existing methods. It consists of a number of specially designed components which are able to refine the coarse high-level semantic features by exploiting low-level visual features. In particular, RefineNet employs short-range and long-range residual connections with identity mappings, which enable effective end-to-end training of the whole system and thus help to achieve superior performance.
Proposed Method
We propose a new framework that provides multiple paths over which information from different resolutions and via potentially long-range connections is assimilated using a generic building block, the RefineNet. Fig. 2(c) shows one possible arrangement of the building blocks to achieve our goal of high-resolution semantic segmentation.
Multi-Path Refinement
For our standard multi-path architecture, we divide the pre-trained ResNet (trained with ImageNet) into 4 blocks according to the resolutions of the feature maps, and employ a 4-cascaded architecture with 4 RefineNet units, each of which directly connects to the output of one ResNet block as well as to the preceding RefineNet in the cascade.
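The wiring of the 4-cascaded architecture can be sketched as plain resolution bookkeeping. This is only an illustration under assumptions: the helper name `cascade_plan` is hypothetical, the per-block strides 4, 8, 16 and 32 are the usual ResNet stage strides rather than values stated in the text above, and the rule that a unit outputs at its largest input resolution follows the fusion step described later in these notes.

```python
# Assumed ResNet block strides relative to the input image (not stated in the notes).
RESNET_STRIDES = [4, 8, 16, 32]


def cascade_plan(image_size, strides=RESNET_STRIDES):
    """List, for each step of the cascade, the input and output resolutions.

    Step 1 is the RefineNet unit attached to the deepest (coarsest) ResNet block;
    every later unit also receives the output of the preceding unit.
    """
    plan = []
    prev_res = None  # output resolution of the preceding RefineNet unit
    for step, stride in enumerate(reversed(strides), start=1):
        block_res = image_size // stride  # resolution of this ResNet block's output
        inputs = [block_res] if prev_res is None else [block_res, prev_res]
        out_res = max(inputs)  # fusion upsamples everything to the largest input
        plan.append({"step": step, "resnet_stride": stride,
                     "inputs": inputs, "output": out_res})
        prev_res = out_res
    return plan


if __name__ == "__main__":
    for entry in cascade_plan(512):
        print(entry)
```

Under these assumptions, a 512-pixel input leaves the last unit of the cascade at 128 pixels per side, that is, a quarter of the input resolution.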
RefineNet
The architecture of one RefineNet block is illustrated in Fig. 3(a). Note, however, that our architecture is generic and each RefineNet block can be easily modified to accept an arbitrary number of feature maps with arbitrary resolutions and depths.
Residual convolution unit
The first part of each RefineNet block consists of an adaptive convolution set that mainly fine-tunes the pretrained ResNet weights for our task. To that end, each input is passed sequentially through two residual convolution units (RCU), each of which is a simplified version of the convolution unit in the original ResNet, with the batch-normalization layers removed.
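A residual convolution unit can be illustrated on a 1-D signal with the standard library alone. This is a toy sketch, not the authors' implementation: the ReLU, convolution, ReLU, convolution ordering is an assumption (the text above only says that batch normalization is removed), the function names are hypothetical, and a real RCU operates on multi-channel 2-D feature maps.

```python
def relu(xs):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in xs]


def conv1d_same(xs, kernel):
    """Zero-padded 'same' 1-D cross-correlation with an odd-length kernel."""
    half = len(kernel) // 2
    n = len(xs)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:  # positions outside the signal count as zero
                acc += w * xs[j]
        out.append(acc)
    return out


def rcu(xs, kernel_a, kernel_b):
    """Residual convolution unit: x + conv(relu(conv(relu(x)))), no batch norm."""
    h = conv1d_same(relu(xs), kernel_a)
    h = conv1d_same(relu(h), kernel_b)
    # Identity shortcut: the input is added back unchanged.
    return [x + r for x, r in zip(xs, h)]


def adaptive_conv(xs, kernels):
    """Pass one input through two RCUs in sequence, as described above."""
    (a1, b1), (a2, b2) = kernels
    return rcu(rcu(xs, a1, b1), a2, b2)
```

With all-zero kernels the unit reduces to the identity, which is the property that lets gradients pass straight through the shortcut.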
Multi-resolution fusion
All path inputs are then fused into a high-resolution feature map by the multi-resolution fusion block, depicted in Fig. 3(c). This block first applies convolutions for input adaptation, which generate feature maps of the same feature dimension (the smallest one among the inputs), and then upsamples all (smaller) feature maps to the largest resolution of the inputs. Finally, all feature maps are fused by summation. The input adaptation in this block also helps to re-scale the feature values appropriately along different paths, which is important for the subsequent sum-fusion. If there is only one input path, it passes directly through this block without changes.
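The adapt, upsample and sum sequence can be sketched on single-channel square maps. Several simplifications here are assumptions made for illustration: the adaptation convolution is reduced to one scalar scale per path, upsampling is nearest-neighbour with an integer factor, and the names `upsample_nearest` and `fuse` are hypothetical.

```python
def upsample_nearest(grid, factor):
    """Nearest-neighbour upsampling of a 2-D list by an integer factor."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]  # repeat each column
        out.extend([list(wide) for _ in range(factor)])  # repeat each row
    return out


def fuse(paths, scales):
    """Fuse square single-channel maps: adapt (scalar scale), upsample, sum."""
    if len(paths) == 1:
        return paths[0]  # a single input path passes through unchanged
    target = max(len(p) for p in paths)  # largest resolution among the inputs
    fused = [[0.0] * target for _ in range(target)]
    for grid, scale in zip(paths, scales):
        adapted = [[scale * v for v in row] for row in grid]  # stand-in for the conv
        up = upsample_nearest(adapted, target // len(grid))
        for i in range(target):
            for j in range(target):
                fused[i][j] += up[i][j]
    return fused


if __name__ == "__main__":
    fine = [[1.0] * 4 for _ in range(4)]
    coarse = [[1.0, 2.0], [3.0, 4.0]]
    for row in fuse([fine, coarse], [1.0, 0.5]):
        print(row)
```

The per-path scale plays the role the text assigns to input adaptation: it brings feature values from different paths to comparable magnitudes before they are summed.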
Chained residual pooling
The output feature map then goes through the chained residual pooling block, schematically depicted in Fig. 3(d). The proposed chained residual pooling aims to capture background context from a large image region. It is able to efficiently pool features with multiple window sizes and fuse them together using learnable weights. In particular, this component is built as a chain of multiple pooling blocks, each consisting of one max-pooling layer and one convolution layer. One pooling block takes the output of the previous pooling block as input. Therefore, the current pooling block is able to re-use the result from the previous pooling operation and thus access the features from a large region without using a large pooling window. If not further specified, we use two pooling blocks each with stride 1 in our experiments.
The output feature maps of all pooling blocks are fused together with the input feature map through summation of residual connections. Note that our choice to employ residual connections also persists in this building block, which once again facilitates gradient propagation during training.
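The chain and its residual summation can be sketched on a single-channel map. This is a minimal illustration under assumptions: the 5x5 window is assumed (the notes above only specify stride 1), each block's convolution is reduced to one scalar weight, and the function names are hypothetical.

```python
def max_pool_same(grid, window=5):
    """Stride-1 max pooling that keeps the map size (borders use a clipped window)."""
    half = window // 2
    h, w = len(grid), len(grid[0])
    return [[max(grid[a][b]
                 for a in range(max(0, i - half), min(h, i + half + 1))
                 for b in range(max(0, j - half), min(w, j + half + 1)))
             for j in range(w)]
            for i in range(h)]


def chained_residual_pooling(grid, weights, window=5):
    """Sum the input with the output of every pooling block in the chain."""
    out = [row[:] for row in grid]  # residual path: start from the input itself
    current = grid
    for wgt in weights:  # one (pool, scalar "conv") block per weight
        pooled = max_pool_same(current, window)
        current = [[wgt * v for v in row] for row in pooled]  # feeds the next block
        out = [[o + c for o, c in zip(ro, rc)] for ro, rc in zip(out, current)]
    return out


if __name__ == "__main__":
    grid = [[0.0] * 11 for _ in range(11)]
    grid[5][5] = 1.0  # a single active location in the centre
    for row in chained_residual_pooling(grid, [1.0, 1.0]):
        print(row)
```

With two chained 5x5 blocks, the single active location influences a 9x9 neighbourhood in the output, which illustrates how the chain reaches a large region without a large pooling window.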
The final step of each RefineNet block is another residual convolution unit (RCU).
Identity Mappings in RefineNet
We have both short-range and long-range residual connections in RefineNet. Short-range residual connections refer to local shortcut connections in one RCU or the residual pooling component, while long-range residual connections refer to the connections between RefineNet modules and the ResNet blocks.
One-sentence summary: an elaborate RefineNet block, multi-path information, and short-range and long-range connections.