segmentation-paper

Semantic Segmentation

[FCN - CVPR2015]

-A fully convolutional (fully conv) network with no fully connected (fc) layers; it can therefore handle inputs of arbitrary size.

-Deconvolution (deconv) layers that upsample the feature maps, enabling fine-grained output.

-A skip architecture that combines results from layers of different depths, ensuring both robustness and precision (see the sketch below).
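
A minimal PyTorch sketch of the skip/upsampling idea, assuming an FCN-8s-style layout (the layer names and channel arguments are illustrative, not the paper's exact configuration):

```python
import torch.nn as nn
import torch.nn.functional as F

class FCNHead(nn.Module):
    """Score features from three depths with 1x1 convs, upsample with
    transposed convolutions, and fuse via skip connections (FCN-8s style)."""
    def __init__(self, c3, c4, c5, num_classes):
        super().__init__()
        self.score5 = nn.Conv2d(c5, num_classes, 1)  # coarsest, most semantic (stride 32)
        self.score4 = nn.Conv2d(c4, num_classes, 1)  # stride 16
        self.score3 = nn.Conv2d(c3, num_classes, 1)  # stride 8
        self.up_a = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up_b = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)

    def forward(self, f3, f4, f5, out_size):
        x = self.up_a(self.score5(f5)) + self.score4(f4)  # skip: fuse at stride 16
        x = self.up_b(x) + self.score3(f3)                # skip: fuse at stride 8
        return F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)
```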


[U-Net - MICCAI2015]


[RefineNet - CVPR2017]

-Deconvolution operations cannot recover the low-level visual features that are lost to down-sampling in the convolutional forward stage.

-Dilated convolutions introduce a coarse sub-sampling of features, which potentially leads to a loss of important details

-RefineNet provides a generic means to fuse coarse high-level semantic features with finer-grained low-level features to generate high-resolution semantic feature maps.
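
A hypothetical, heavily simplified sketch of the multi-resolution fusion idea; RefineNet's real blocks also contain residual conv units and chained residual pooling, which are omitted here:

```python
import torch.nn as nn
import torch.nn.functional as F

class FuseBlock(nn.Module):
    """Project both inputs, upsample the coarse (high-level) path to the
    resolution of the fine (low-level) path, then sum."""
    def __init__(self, coarse_ch, fine_ch, out_ch):
        super().__init__()
        self.conv_coarse = nn.Conv2d(coarse_ch, out_ch, 3, padding=1, bias=False)
        self.conv_fine = nn.Conv2d(fine_ch, out_ch, 3, padding=1, bias=False)

    def forward(self, coarse, fine):
        coarse = F.interpolate(self.conv_coarse(coarse), size=fine.shape[2:],
                               mode="bilinear", align_corners=False)
        return coarse + self.conv_fine(fine)  # high-res map carrying both feature kinds
```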


[PSPNet - CVPR2017]

-Current FCN-based models lack a suitable strategy to utilize global scene-category clues.

-Global context information, along with sub-region context, helps distinguish among the various categories (pyramid pooling sketch below).
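
A minimal sketch of the Pyramid Pooling Module with the paper's bin sizes (1, 2, 3, 6); the 1x1 bin is the global-context branch the notes refer to. The 1/4 channel reduction per branch follows the paper; everything else is a simplification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pool at several grid sizes, project, upsample, and concatenate
    the results with the input features."""
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(bins)  # channel reduction used in the paper
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),  # b x b sub-region pooling (b=1 is global)
                nn.Conv2d(in_ch, out_ch, 1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for b in bins
        )

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [x] + [
            F.interpolate(stage(x), size=(h, w), mode="bilinear", align_corners=False)
            for stage in self.stages
        ]
        return torch.cat(feats, dim=1)  # doubles the channel count
```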


[GCN - CVPR2017] Large Kernel Matters——Improve Semantic Segmentation by Global Convolutional Network


[DFN - CVPR2018] Learning a Discriminative Feature Network for Semantic Segmentation


[BiSeNet - ECCV2018] BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation

-Spatial Path (SP) and Context Path (CP). As their names imply, the two components are designed to address the loss of spatial information and the shrinkage of the receptive field, respectively.

-SP: three layers, each consisting of a convolution with stride 2, followed by batch normalization and ReLU (sketched after this list).

-CP: utilizes a lightweight backbone and global average pooling to provide a large receptive field.

-Loss function: a principal cross-entropy loss on the final output plus auxiliary cross-entropy losses that supervise the Context Path, L = l_p + α Σ_i l_i, with α = 1 in the paper.
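
A minimal sketch of the Spatial Path as described above; the channel widths (64/128/256) are an assumption, not stated in the notes:

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class SpatialPath(nn.Module):
    """Three stride-2 conv+BN+ReLU layers -> a 1/8-resolution feature map
    that preserves spatial detail."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            conv_bn_relu(3, 64),
            conv_bn_relu(64, 128),
            conv_bn_relu(128, 256),
        )

    def forward(self, x):
        return self.layers(x)  # (B, 256, H/8, W/8)
```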


[ICNet - ECCV2018] ICNet for Real-Time Semantic Segmentation on High-Resolution Images


[DFANet - CVPR2019] DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation


[DeepLabv1 - ICLR2015]

-Atrous (dilated) convolution: enlarges the receptive field without downsampling or extra parameters (example below).
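
A tiny runnable example: with dilation 2 and matching padding, a 3x3 kernel covers a 5x5 window while the spatial resolution is preserved.

```python
import torch
import torch.nn as nn

# 3x3 conv with dilation 2: same 9 weights, but the taps span a 5x5 window;
# padding = dilation keeps the spatial resolution unchanged.
atrous = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)

x = torch.randn(1, 64, 65, 65)
print(atrous(x).shape)  # torch.Size([1, 64, 65, 65])
```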

[DeepLabv2]

-Atrous Spatial Pyramid Pooling (ASPP): parallel atrous convolutions with different rates capture multi-scale context (sketch below).
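
A minimal DeepLabv2-style ASPP sketch (rates 6/12/18/24, branch outputs fused by summation); DeepLabv3's ASPP adds BN and image-level pooling, which this sketch omits:

```python
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel 3x3 atrous convolutions with different rates over the same
    feature map; branch outputs are fused by summation."""
    def __init__(self, in_ch, num_classes, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, num_classes, 3, padding=r, dilation=r) for r in rates
        )

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)
```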

[DeepLabv3]

[DeepLabv3+ - ECCV2018]


“Attention” in Segmentation

[NLNet - CVPR2018] Non-local Neural Networks

-Capturing long-range dependencies is of central importance in deep neural networks. Intuitively, a non-local operation computes the response at a position as a weighted sum of the features at all positions in the input feature maps.

-Generic non-local operation:

y_i = (1 / C(x)) * Σ_j f(x_i, x_j) g(x_j)

where f computes a scalar affinity between positions i and j, g computes a representation of the input at position j, and C(x) is a normalization factor. In the embedded-Gaussian instantiation, f(x_i, x_j) = exp(θ(x_i)ᵀ φ(x_j)) and the normalization becomes a softmax over j.

-The (HW x HW) affinity matrix is multiplied with the (HW x 512) value matrix. Each row of the former holds the f values of one position against all positions; since the same f weights are applied to every one of the 512 channels, the operation acts as a spatial attention (see the sketch below).
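
A sketch of the embedded-Gaussian non-local block matching the shapes above (the inter-channel width here is C/2 rather than the 512 in the note; both are just a chosen bottleneck width):

```python
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block: attention map is (HW x HW),
    values are (HW x C/2), with a residual connection at the end."""
    def __init__(self, in_ch):
        super().__init__()
        inter = in_ch // 2
        self.theta = nn.Conv2d(in_ch, inter, 1)
        self.phi = nn.Conv2d(in_ch, inter, 1)
        self.g = nn.Conv2d(in_ch, inter, 1)
        self.out = nn.Conv2d(inter, in_ch, 1)

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (B, HW, C/2)
        k = self.phi(x).flatten(2)                    # (B, C/2, HW)
        v = self.g(x).flatten(2).transpose(1, 2)      # (B, HW, C/2)
        attn = F.softmax(q @ k, dim=-1)               # (B, HW, HW): normalized f(x_i, x_j)
        y = (attn @ v).transpose(1, 2).reshape(B, -1, H, W)
        return x + self.out(y)                        # residual connection
```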

[DANet - CVPR2019] Dual Attention Network for Scene Segmentation

-Introduces self-attention mechanisms to capture feature dependencies in both the spatial and channel dimensions (channel branch sketched below).
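
A simplified sketch of the channel-attention branch; the position-attention branch mirrors the non-local block above. The official implementation normalizes the channel energies slightly differently, which is omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Simplified channel attention: a C x C affinity between channel maps,
    applied back onto the features (no convs, as in the paper)."""
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        B, C, H, W = x.shape
        q = x.flatten(2)                                 # (B, C, HW)
        attn = F.softmax(q @ q.transpose(1, 2), dim=-1)  # (B, C, C) channel affinities
        y = (attn @ q).reshape(B, C, H, W)               # re-weighted channel maps
        return self.gamma * y + x
```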


[CCNet - ICCV2019] CCNet: Criss-Cross Attention for Semantic Segmentation

-The full non-local operation can be replaced by two consecutive criss-cross operations, each of which has only sparse connections (H + W - 1) for each position in the feature maps. By serially stacking two criss-cross attention modules, every pixel can collect contextual information from all other pixels. The decomposition greatly reduces the time and space complexity from O((HxW)x(HxW)) to O((HxW)x(H + W - 1)).


-Details of the criss-cross attention module: 1x1 convolutions produce query, key, and value maps; for each position, affinities are computed only against the positions in its own row and column (H + W - 1 positions), a softmax over them yields the attention weights, and the aggregated values are added back to the input. A second (recurrent) pass propagates information to the remaining positions (rough complexity check below).
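
A quick back-of-the-envelope check of the claimed reduction in attention-map entries (the feature-map size is an arbitrary example):

```python
# Affinities computed per feature map: full non-local vs. one criss-cross pass.
H, W = 96, 96                                 # example feature-map size (assumption)
full_nonlocal = (H * W) * (H * W)             # O((HxW) x (HxW))
criss_cross = (H * W) * (H + W - 1)           # O((HxW) x (H + W - 1)) per pass
print(f"{full_nonlocal / criss_cross:.1f}x")  # ~48x fewer affinities per pass
```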