Authors: Vijay Badrinarayanan, Alex Kendall, Roberto Cipolla, Senior Member, IEEE
Abstract: We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network and a corresponding decoder network, followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature maps. Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN, and also with the well-known DeepLab-LargeFOV and DeconvNet architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance.
SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures and can be trained end-to-end using stochastic gradient descent. We also performed a controlled benchmark of SegNet and other architectures on both road scene and SUN RGB-D indoor scene segmentation tasks. These quantitative assessments show that SegNet provides good performance, with competitive inference time and the most memory-efficient inference compared to other architectures. We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet/.
Our motivation to design SegNet arises from the need to map low resolution feature maps to input resolution for pixel-wise classification. This mapping must produce features which are useful for accurate boundary localization.
Reusing max-pooling indices in the decoding process has several practical advantages: (i) it improves boundary delineation, (ii) it reduces the number of parameters, enabling end-to-end training, and (iii) this form of upsampling can be incorporated into any encoder-decoder architecture with only a little modification.
The key learning module is an encoder-decoder network. An encoder consists of convolution with a filter bank, an element-wise tanh non-linearity, and max-pooling with sub-sampling to obtain the feature maps. For each sample, the indices of the max locations computed during pooling are stored and passed to the decoder. The decoder upsamples the feature maps using the stored pooling indices, and convolves the upsampled map with a trainable decoder filter bank to reconstruct the input image.
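The stored-index pooling and unpooling pair described above can be sketched in NumPy (a minimal single-channel sketch with a 2x2 window; the function names are illustrative, not the actual Caffe implementation):

```python
import numpy as np

def max_pool_with_indices(x, k=2):
    """k x k max-pooling that also returns the flat index of each max
    location, as a SegNet-style encoder would store them."""
    h, w = x.shape
    ph, pw = h // k, w // k
    pooled = np.zeros((ph, pw), dtype=x.dtype)
    indices = np.zeros((ph, pw), dtype=np.int64)
    for i in range(ph):
        for j in range(pw):
            window = x[i*k:(i+1)*k, j*k:(j+1)*k]
            di, dj = divmod(int(np.argmax(window)), k)
            pooled[i, j] = window[di, dj]
            # flat index of the max location in the *input* map
            indices[i, j] = (i*k + di) * w + (j*k + dj)
    return pooled, indices

def max_unpool(pooled, indices, out_shape):
    """SegNet-style decoder upsampling: place each pooled value back at
    its stored max location; every other position stays zero, so the
    result is the sparse map that the decoder convolutions densify."""
    out = np.zeros(out_shape, dtype=pooled.dtype)
    np.put(out, indices.ravel(), pooled.ravel())
    return out
```

Because only the index of the max is kept per window, the memory cost is small, yet the unpooled map puts activations back at their original spatial positions, which is what helps boundary delineation.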
SegNet has an encoder network and a corresponding decoder network, followed by a final pixel-wise classification layer. This architecture is illustrated in Fig.2. The encoder network consists of 13 convolutional layers which correspond to the first 13 convolutional layers in the VGG16 network designed for object classification.
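The encoder layout can be written down schematically (a sketch: the channel widths are VGG16's, 'M' marks the max-pooling stage closing each block, and the mirrored decoder with index-driven upsampling 'U' is a structural illustration — see Fig.2 for the exact configuration):

```python
# VGG16's 13 convolutional layers (channel widths), with 'M' marking
# the max-pooling stage that closes each of the five blocks.
ENCODER = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
           512, 512, 512, 'M', 512, 512, 512, 'M']

# A schematic mirrored decoder: each encoder pooling becomes an
# upsampling 'U' driven by the stored pooling indices, and each encoder
# conv has a corresponding trainable decoder conv.
DECODER = ['U' if layer == 'M' else layer for layer in reversed(ENCODER)]

# Sanity checks: 13 convs and 5 pooling/upsampling stages on each side.
assert sum(isinstance(layer, int) for layer in ENCODER) == 13
assert ENCODER.count('M') == 5 and DECODER.count('U') == 5
```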
Each encoder layer has a corresponding decoder layer and hence the decoder network has 13 layers. The final decoder output is fed to a multi-class classifier to produce class probabilities for each pixel independently.
On the left in Fig.3 is the decoding technique used by SegNet, where there is no learning involved in the upsampling step. However, the upsampled maps are convolved with trainable multi-channel decoder filters to densify their sparse inputs.
On the right in Fig.3 is the FCN decoding technique. In a decoder of this network, upsampling is performed by inverse convolution using a fixed or trainable multi-channel upsampling kernel. This manner of upsampling is also termed deconvolution.
One-sentence summary: the decoder's upsampling approach, the datasets used, and the several architectural variants.