Authors: Christian Szegedy, Google Inc. Wei Liu, University of North Carolina, Chapel Hill. Yangqing Jia, Google Inc. Pierre Sermanet, Google Inc. Scott Reed, University of Michigan. Dragomir Anguelov, Google Inc. Dumitru Erhan, Google Inc. Vincent Vanhoucke, Google Inc. Andrew Rabinovich, Google Inc.
Note: Since I was still unclear about this paper's exposition and design philosophy, I studied it by translating it; the translation mainly covers Part 4, Architectural Details. I also consulted this blog post: https://blog.csdn.net/qq_38906523/article/details/80061075.
Abstract: We propose a deep convolutional neural network architecture codenamed Inception, which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22-layer-deep network, the quality of which is assessed in the context of classification and detection.
The main idea of the Inception architecture is based on finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components. Note that assuming translation invariance means that our network will be built from convolutional building blocks. All we need is to find the optimal local construction and to repeat it spatially.
We assume that each unit from the earlier layer corresponds to some region of the input images, and these units are grouped into filter banks. In the lower layers (the ones close to the input) correlated units would concentrate in local regions. This means we would end up with a lot of clusters concentrated in a single region, and they can be covered by a layer of 1 x 1 convolutions in the next layer. However, one can also expect that there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions.
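The 1 x 1 convolutions mentioned above act as per-position channel mixing: every pixel's channel vector is multiplied by one shared weight matrix. A minimal NumPy sketch (the shapes and channel counts below are illustrative choices, not the paper's configuration):

```python
import numpy as np

def conv1x1(x, w):
    """1 x 1 convolution as a shared per-pixel matrix multiply.

    x: feature map of shape (H, W, C_in); w: weights of shape (C_in, C_out).
    NumPy's matmul broadcasts the (C_in, C_out) multiply over every position.
    """
    return x @ w

x = np.random.randn(28, 28, 192)  # e.g. a 28 x 28 grid with 192 channels
w = np.random.randn(192, 64)      # mix 192 input channels down to 64
y = conv1x1(x, w)
print(y.shape)  # (28, 28, 64): spatial size unchanged, channels reduced
```

This is also why 1 x 1 convolutions can serve as cheap dimension reductions later in the section: spatial resolution is untouched while channel depth shrinks.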
In order to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1 x 1, 3 x 3, and 5 x 5; however, this decision was based more on convenience than necessity. It also means that the suggested architecture is a combination of all those layers with their output filter banks concatenated into a single output vector forming the input of the next stage. Additionally, since pooling operations have been essential for the success of current state-of-the-art convolutional networks, it suggests that adding an alternative parallel pooling path in each such stage should have an additional beneficial effect, too.
As these "Inception modules" are stacked on top of each other, their output correlation statistics are bound to vary: as features of higher abstraction are captured by higher layers, their spatial concentration is expected to decrease, suggesting that the ratio of 3 x 3 and 5 x 5 convolutions should increase as we move to higher layers.
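Since the branch outputs are concatenated along the channel axis, a naive module's output depth is the sum of its branch depths, and the pooling path passes all input channels through unchanged. A small bookkeeping sketch (the filter counts are made up for illustration) shows how depth grows when such modules are stacked:

```python
def naive_inception_out_channels(c_in, n1x1, n3x3, n5x5):
    """Output depth of a naive Inception module.

    The pooling path has no filters of its own: it forwards all c_in input
    channels, so the concatenated output keeps getting deeper stage by stage.
    """
    return n1x1 + n3x3 + n5x5 + c_in

c = 256  # hypothetical input depth
for stage in range(3):
    c = naive_inception_out_channels(c, n1x1=64, n3x3=128, n5x5=32)
    print(stage, c)  # depth grows: 480, 704, 928
```

This monotone growth in depth is exactly the "inevitable increase in the number of outputs from stage to stage" discussed next.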
One big problem with the above modules, at least in this naive form, is that even a modest number of 5 x 5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters. This problem becomes even more pronounced once pooling units are added to the mix: their number of output filters equals the number of filters in the previous stage. The merging of the output of the pooling layer with the outputs of the convolutional layers would lead to an inevitable increase in the number of outputs from stage to stage. Even while this architecture might cover the optimal sparse structure, it would do it very inefficiently, leading to a computational blow-up within a few stages.
This leads to the second idea of the proposed architecture: judiciously applying dimension reductions and projections wherever the computational requirements would otherwise increase too much. This is based on the success of embeddings: even low-dimensional embeddings might contain a lot of information about a relatively large patch. However, embeddings represent information in a dense, compressed form, and compressed information is harder to model. We would like to keep our representation sparse at most places and compress the signals only whenever they have to be aggregated en masse. That is, 1 x 1 convolutions are used to compute reductions before the expensive 3 x 3 and 5 x 5 convolutions. Besides being used as reductions, they also include the use of rectified linear activation, which makes them dual-purpose. The final result is depicted in Figure 2(b).
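The savings from computing reductions before the expensive convolutions can be checked with a rough multiply-accumulate count, H * W * k * k * C_in * C_out per layer. The channel counts below are illustrative assumptions, not the paper's configuration:

```python
def conv_macs(h, w, k, c_in, c_out):
    """Multiply-accumulate count of a k x k conv on an h x w output grid
    (stride 1, 'same' padding assumed, biases ignored)."""
    return h * w * k * k * c_in * c_out

h = w = 28
# direct 5 x 5 branch: 192 channels in, 32 out
direct = conv_macs(h, w, 5, 192, 32)
# reduced branch: 1 x 1 down to 16 channels, then the same 5 x 5 conv
reduced = conv_macs(h, w, 1, 192, 16) + conv_macs(h, w, 5, 16, 32)
print(direct, reduced, direct / reduced)  # roughly a 9-10x saving
```

Even with the extra 1 x 1 layer, the reduced branch is almost an order of magnitude cheaper here, which is why the reductions make the 5 x 5 branches affordable.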
In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid. For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary, simply reflecting some infrastructural inefficiencies in our current implementation.
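The grid halving from a stride-2 max pool follows the usual output-size formula; a short sketch (the 3 x 3 window and ceil-mode rounding are assumptions from typical implementations, not stated in this section):

```python
import math

def pool_out(n, k=3, s=2, ceil_mode=True):
    """Output side length of a k x k pool with stride s on an n x n grid.

    ceil_mode keeps the partial window at the border, which is what makes
    a 3 x 3 / stride-2 pool exactly halve even-sized grids.
    """
    frac = (n - k) / s
    return (math.ceil(frac) if ceil_mode else math.floor(frac)) + 1

for n in (56, 28, 14):
    print(n, "->", pool_out(n))  # 56 -> 28, 28 -> 14, 14 -> 7
```

With floor rounding instead, the same pool would shrink the grid slightly below half, which is why the rounding convention matters when reproducing layer sizes.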
One of the main beneficial aspects of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity. The ubiquitous use of dimension reduction allows for shielding the large number of input filters of the last stage from the next layer, reducing their dimension before convolving over them with a large patch size. Another practically useful aspect of this design is that it aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously.
The improved use of computational resources allows for increasing both the width of each stage and the number of stages without getting into computational difficulties. Another way to utilize the Inception architecture is to create slightly inferior, but computationally cheaper, versions of it. We have found that all the included knobs and levers allow for a controlled balancing of computational resources that can result in networks that are 2-3x faster than similarly performing networks with a non-Inception architecture; however, this requires careful manual design at this point.
Here, the most successful particular instance (named GoogLeNet) is described in Table 1 for demonstrational purposes.