An overview of segmentation: introduction, datasets, and algorithms
Intro
Foreword
- By now segmentation, as one of the three fundamental vision tasks, is widely known
- It is worth noting that segmentation tasks are usually considered to come in two kinds
- Semantic segmentation
- Semantic segmentation distinguishes only categories, not individual instances
- Representative works include FCN and the DeepLab series
- Instance segmentation
- Instance segmentation distinguishes not only categories but also different individuals within the same category
- The representative work is Mask R-CNN
- Semantic segmentation (illustrated here with a figure borrowed from GluonCV)
Algorithm
Semantic Segmentation
Unet
- paper U-Net: Convolutional Networks for Biomedical Image Segmentation
- git https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/
- Of course, unofficial implementations are everywhere on GitHub
- A very classic encoder-decoder structure. Its hallmark is that during decoding the high-level features are repeatedly fused with low-level ones, which preserves the structure and semantics of the image well and lets local features be fully expressed
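- A minimal sketch of one U-Net decoder step (upsample, concatenate the encoder skip feature, then convolve). This is my own illustration, not the released code; channel sizes and the ConvTranspose2d choice are assumptions:

```python
import torch
import torch.nn as nn


class UNetUpBlock(nn.Module):
    """One decoder step: upsample, concat the encoder skip feature, then two convs."""

    def __init__(self, in_channels, skip_channels, out_channels):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_channels + skip_channels, out_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                      # recover spatial resolution
        x = torch.cat([x, skip], dim=1)     # fuse low-level (skip) with high-level features
        return self.conv(x)


# toy shapes: a 1/16-resolution feature fused with a 1/8-resolution skip feature
up = UNetUpBlock(in_channels=256, skip_channels=128, out_channels=128)
out = up(torch.randn(1, 256, 32, 32), torch.randn(1, 128, 64, 64))  # -> (1, 128, 64, 64)
```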
Segnet
- paper SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
- code http://mi.eng.cam.ac.uk/projects/segnet/
- Very similar to U-Net; the differences listed in the paper feel somewhat forced:
- As compared to SegNet, U-Net [16] (proposed for the medical imaging community) does not reuse pooling indices but instead transfers the entire feature map (at the cost of more memory) to the corresponding decoders and concatenates them to upsampled (via deconvolution) decoder feature maps. There is no conv5 and max-pool 5 block in U-Net as in the VGG net architecture. SegNet, on the other hand, uses all of the pre-trained convolutional layer weights from VGG net as pre-trained weights.
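- To make the quoted difference concrete, here is a minimal sketch (my own, not from either repo) of SegNet-style unpooling with stored max-pooling indices versus U-Net-style concatenation of the full encoder feature map; U-Net actually upsamples via deconvolution, bilinear interpolation is used below only to keep the sketch short:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)

# SegNet style: keep only the argmax indices, reuse them to place values during upsampling
pooled, indices = F.max_pool2d(x, kernel_size=2, stride=2, return_indices=True)
unpooled = F.max_unpool2d(pooled, indices, kernel_size=2, stride=2)   # sparse, memory-cheap

# U-Net style: keep the whole encoder feature map and concatenate it after upsampling
upsampled = F.interpolate(pooled, scale_factor=2, mode='bilinear', align_corners=False)
fused = torch.cat([upsampled, x], dim=1)                              # dense, memory-heavy
```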
Deeplab v1 & v2
- paper DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
- home http://liangchiehchen.com/projects/DeepLab.html
- In that era multi-stage pipelines were still generally considered superior to end-to-end training: a DCNN extracts features, the output is upsampled, and a fully connected CRF produces the final result. That is the main idea of DeepLab v1
- CRFs have been broadly used in semantic segmentation to combine class scores computed by multi-way classifiers with the lowlevel information captured by the local interactions of pixels and edges [23], [24] or superpixels [25]. Even though works of increased sophistication have been proposed to model the hierarchical dependency [26], [27], [28] and/or highorder dependencies of segments [29], [30], [31], [32], [33], we use the fully connected pairwise CRF proposed by [22] for its efficient computation, and ability to capture fine edge details while also catering for long range dependencies.
- One of the paper's biggest contributions is that it pushed forward the use of atrous convolution in segmentation
- The version with ASPP, known as DeepLab v2, is equally strong: it simply assembles atrous convolutions with multiple rates (a sketch follows this list)
- With repeated CRF iterations the edge information becomes richer and the segmentation results get cleaner
- LargeFOV is effectively a single-branch ASPP; experiments show that ASPP with large rates brings a clear improvement
- And, of course, a SOTA claim has to be made
- Interestingly, ASPP performs poorly on PASCAL-Person-Part, worse than not using it at all
- On Cityscapes, ASPP does help
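- A minimal ASPP sketch, assuming the DeepLab v2 setting of four parallel atrous branches with rates 6/12/18/24 whose scores are summed (later variants concatenate instead); channel counts are illustrative, and the per-branch 1x1 layers of the original are omitted:

```python
import torch
import torch.nn as nn


class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel dilated convs with different rates."""

    def __init__(self, in_channels, num_classes, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, num_classes, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])

    def forward(self, x):
        # each branch sees a different receptive field; DeepLab v2 sums the branch scores
        return sum(branch(x) for branch in self.branches)


aspp = ASPP(in_channels=2048, num_classes=21)
scores = aspp(torch.randn(1, 2048, 65, 65))   # -> (1, 21, 65, 65)
```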
FCN
- paper Fully Convolutional Networks for Semantic Segmentation
- git https://github.com/shelhamer/fcn.berkeleyvision.org
- Very simple and direct; it opened the door to a new world of segmentation
- At the same time, FCN also opened the door to multi-level feature fusion
- 32s / 16s / 8s denote predictions made from features downsampled by 2^5, 2^4 and 2^3; the improvement is visible to the naked eye (a fusion sketch follows this list)
- Simple structure, decent results, fast: what more could you ask for?
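- A sketch of the FCN-8s style score fusion, assuming pool3/pool4/pool5 features are already extracted; the cropping and bilinear-initialized deconvolutions of the official code are replaced by plain interpolation here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FCN8sHead(nn.Module):
    """Fuse coarse (1/32) scores with 1/16 and 1/8 features, then upsample 8x."""

    def __init__(self, c_pool3, c_pool4, c_pool5, num_classes):
        super().__init__()
        self.score5 = nn.Conv2d(c_pool5, num_classes, 1)
        self.score4 = nn.Conv2d(c_pool4, num_classes, 1)
        self.score3 = nn.Conv2d(c_pool3, num_classes, 1)

    def forward(self, pool3, pool4, pool5):
        s = self.score5(pool5)                                            # 1/32 scores
        s = F.interpolate(s, size=pool4.shape[2:], mode='bilinear', align_corners=False)
        s = s + self.score4(pool4)                                        # fuse at 1/16 (FCN-16s)
        s = F.interpolate(s, size=pool3.shape[2:], mode='bilinear', align_corners=False)
        s = s + self.score3(pool3)                                        # fuse at 1/8 (FCN-8s)
        return F.interpolate(s, scale_factor=8, mode='bilinear', align_corners=False)


head = FCN8sHead(256, 512, 512, num_classes=21)
out = head(torch.randn(1, 256, 64, 64), torch.randn(1, 512, 32, 32), torch.randn(1, 512, 16, 16))
```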
Enet
- paper ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation
- The authors realized that early max-pooling hurts the input, so a 3x3 stride-2 conv is used to mitigate the information loss; the bottleneck design is presumably inspired by ResNet, with small differences such as using PReLU
- The main structure: stages 2 and 3 use dilated convolutions with increasing rates
- Asymmetric convolution (sketched after this list)
- Sometimes we replace it with asymmetric convolution i.e. a sequence of 5 × 1 and 1 × 5 convolutions instead of 5 × 5.
- Shows the speed advantage over SegNet
- On Cityscapes it is slightly behind SegNet
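- The asymmetric factorization is simple enough to show directly; a sketch replacing a 5x5 conv with a 5x1 followed by a 1x5 (the channel count is an assumption):

```python
import torch
import torch.nn as nn

channels = 128

# dense 5x5 convolution: 25 * C * C weights
full = nn.Conv2d(channels, channels, kernel_size=5, padding=2)

# ENet-style asymmetric pair: 5x1 then 1x5, only 10 * C * C weights for a similar receptive field
asymmetric = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=(5, 1), padding=(2, 0)),
    nn.Conv2d(channels, channels, kernel_size=(1, 5), padding=(0, 2)),
)

x = torch.randn(1, channels, 64, 64)
assert full(x).shape == asymmetric(x).shape   # both preserve the spatial size
```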
PSP
- paper Pyramid Scene Parsing Network
- The heavyweight of semantic segmentation
- The code is the easiest way to understand it:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def _PSP1x1Conv(in_channels, out_channels, norm_layer, norm_kwargs):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, 1, bias=False),
        norm_layer(out_channels, **({} if norm_kwargs is None else norm_kwargs)),
        nn.ReLU(True)
    )


class _PyramidPooling(nn.Module):
    def __init__(self, in_channels, **kwargs):
        super(_PyramidPooling, self).__init__()
        out_channels = int(in_channels / 4)
        # four parallel branches: global, 2x2, 3x3 and 6x6 adaptive average pooling
        self.avgpool1 = nn.AdaptiveAvgPool2d(1)
        self.avgpool2 = nn.AdaptiveAvgPool2d(2)
        self.avgpool3 = nn.AdaptiveAvgPool2d(3)
        self.avgpool4 = nn.AdaptiveAvgPool2d(6)
        self.conv1 = _PSP1x1Conv(in_channels, out_channels, **kwargs)
        self.conv2 = _PSP1x1Conv(in_channels, out_channels, **kwargs)
        self.conv3 = _PSP1x1Conv(in_channels, out_channels, **kwargs)
        self.conv4 = _PSP1x1Conv(in_channels, out_channels, **kwargs)

    def forward(self, x):
        size = x.size()[2:]
        # pool to each scale, project with a 1x1 conv, upsample back, then concat with the input
        feat1 = F.interpolate(self.conv1(self.avgpool1(x)), size, mode='bilinear', align_corners=True)
        feat2 = F.interpolate(self.conv2(self.avgpool2(x)), size, mode='bilinear', align_corners=True)
        feat3 = F.interpolate(self.conv3(self.avgpool3(x)), size, mode='bilinear', align_corners=True)
        feat4 = F.interpolate(self.conv4(self.avgpool4(x)), size, mode='bilinear', align_corners=True)
        return torch.cat([x, feat1, feat2, feat3, feat4], dim=1)
```
- In short: average-pool to several scales to capture context at different granularities
- The aux loss is an auxiliary loss branched off stage4; its head is an FCN head (without the PSP module). See the sketch after this list
- The Ablation Study analyzes this aux loss; it simply works, so add it wherever you can
- Shows PSP's point-by-point gains on the ImageNet scene parsing challenge 2016
- PSP's results are nothing short of impressive
- Cityscapes is, of course, not spared either
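- A sketch of how the auxiliary loss is typically wired up, assuming the stage4 feature feeds an FCN-style head and the auxiliary term is down-weighted (0.4 is the commonly used weight); the layer names and the 255 ignore index are assumptions:

```python
import torch
import torch.nn.functional as F


def pspnet_loss(main_logits, aux_logits, target, aux_weight=0.4):
    """Main PSP-head loss plus a down-weighted FCN-head loss branched off stage4."""
    main_loss = F.cross_entropy(main_logits, target, ignore_index=255)
    aux_loss = F.cross_entropy(aux_logits, target, ignore_index=255)
    return main_loss + aux_weight * aux_loss   # the aux branch is dropped at inference time


# toy example with 21 classes on a 64x64 prediction
target = torch.randint(0, 21, (2, 64, 64))
loss = pspnet_loss(torch.randn(2, 21, 64, 64), torch.randn(2, 21, 64, 64), target)
```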
ICNet
- paper ICNet for Real-Time Semantic Segmentation on High-Resolution Images
- git https://github.com/hszhao/ICNet
- PSP is great, but it is not fast enough; ICNet aims for a balance between accuracy and speed
- Features from three inputs of different sizes, processed by three different sub-networks, are fused to produce the final result
- Details of the CFF module (a sketch follows this list)
- Lists several common semantic segmentation structures. I have doubts about the "ours (d)" diagram: in reality every sub-network has its own output
- Cityscapes results follow the YOLO playbook: nothing faster is as accurate, nothing more accurate is as fast
- Results on CamVid and COCO-Stuff are similarly good
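- A sketch of the cascade feature fusion (CFF) idea as I read it: upsample the low-resolution branch 2x, refine it with a dilated 3x3 conv, add it to a 1x1-projected higher-resolution branch, and supervise the upsampled branch with an auxiliary classifier. Channels, rates and the aux placement are assumptions and may differ from the released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CascadeFeatureFusion(nn.Module):
    """Fuse a low-resolution feature with a higher-resolution one (ICNet-style CFF)."""

    def __init__(self, low_channels, high_channels, out_channels, num_classes):
        super().__init__()
        self.conv_low = nn.Sequential(
            nn.Conv2d(low_channels, out_channels, 3, padding=2, dilation=2, bias=False),
            nn.BatchNorm2d(out_channels))
        self.conv_high = nn.Sequential(
            nn.Conv2d(high_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels))
        self.cls_low = nn.Conv2d(out_channels, num_classes, 1)   # auxiliary supervision branch

    def forward(self, low, high):
        low = F.interpolate(low, size=high.shape[2:], mode='bilinear', align_corners=True)
        low_feat = self.conv_low(low)
        fused = F.relu(low_feat + self.conv_high(high), inplace=True)
        aux_logits = self.cls_low(low_feat)      # supervised with a downsampled label map
        return fused, aux_logits
```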
DenseASPP
- paper DenseASPP for Semantic Segmentation in Street Scenes
- Inspired by DenseNet: make the ASPP branches densely connected as well
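- A sketch of the dense connection pattern, assuming each atrous branch consumes the concatenation of the backbone feature and all previous branch outputs (rates 3/6/12/18/24); the 1x1 reduction convs of the original are omitted and the branch width is an assumption:

```python
import torch
import torch.nn as nn


class DenseASPP(nn.Module):
    """Each atrous branch consumes the input plus every earlier branch's output."""

    def __init__(self, in_channels, branch_channels=64, rates=(3, 6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList()
        channels = in_channels
        for r in rates:
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, branch_channels, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(branch_channels),
                nn.ReLU(inplace=True)))
            channels += branch_channels          # outputs are densely concatenated

    def forward(self, x):
        feats = [x]
        for branch in self.branches:
            feats.append(branch(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)
```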
Deeplab v3
- paper Rethinking Atrous Convolution for Semantic Image Segmentation
- The main change is an improved ASPP
- An image-pooling branch is added to ASPP, and block4 uses dilated convolutions with rate 2 (a sketch of the head follows this list)
- A modest improvement over PSP; JFT means pretrained on the JFT dataset
- The same holds on Cityscapes
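- A sketch of the v3-style ASPP head: a 1x1 branch, three atrous 3x3 branches (rates 6/12/18 at output stride 16) and a global image-pooling branch, all concatenated and projected. The 256-channel width follows my reading of the paper; treat the details as assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ASPPv3(nn.Module):
    def __init__(self, in_channels, out_channels=256, rates=(6, 12, 18)):
        super().__init__()

        def conv(k, r=1):
            return nn.Sequential(
                nn.Conv2d(in_channels, out_channels, k, padding=(k // 2) * r, dilation=r, bias=False),
                nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))

        self.branches = nn.ModuleList([conv(1)] + [conv(3, r) for r in rates])
        self.image_pool = nn.Sequential(                      # the added image-level feature
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))
        self.project = nn.Sequential(
            nn.Conv2d(out_channels * (len(rates) + 2), out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=x.shape[2:], mode='bilinear', align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```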
Deeplab v3+
- paper Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
- git https://github.com/tensorflow/models/tree/master/research/deeplab
- (a) is the DeepLab v3 structure, (b) is the common encoder-decoder structure; combining them gives (c), which becomes DeepLab v3+
- The structure is simple and clear
- Along the way they also do some alchemy on Xception
- Thanks to the backbone and the fusion of high- and low-level features, the improvement is substantial (a decoder sketch follows)
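- A sketch of the v3+ decoder: project the low-level (1/4) feature to a few channels, upsample the ASPP output 4x, concatenate, refine with 3x3 convs, then upsample to full resolution. The 48-channel projection follows the paper; the rest is a simplified assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepLabV3PlusDecoder(nn.Module):
    def __init__(self, low_channels, aspp_channels=256, num_classes=21):
        super().__init__()
        self.reduce = nn.Sequential(               # keep the low-level feature from dominating
            nn.Conv2d(low_channels, 48, 1, bias=False),
            nn.BatchNorm2d(48), nn.ReLU(inplace=True))
        self.fuse = nn.Sequential(
            nn.Conv2d(aspp_channels + 48, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1))

    def forward(self, aspp_out, low_level):
        aspp_out = F.interpolate(aspp_out, size=low_level.shape[2:], mode='bilinear', align_corners=False)
        logits = self.fuse(torch.cat([aspp_out, self.reduce(low_level)], dim=1))
        return F.interpolate(logits, scale_factor=4, mode='bilinear', align_corners=False)
```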
EncNet
- paper Context Encoding for Semantic Segmentation
- Reweights the final prediction, and adds a new SE-loss that supervises the probability that each class is present in the image (a sketch follows this list)
- Compared with DeepLab v3+, released just a month earlier, these results do not hold up well
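- A sketch of the SE-loss idea as I read it: pool the encoded feature to a per-class presence score and supervise it with binary cross-entropy against which classes actually appear in the image. This simplifies away the paper's encoding module, so take it as an assumption-laden illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SELoss(nn.Module):
    """Predict per-image class presence and supervise it with BCE."""

    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, feat, target_mask):
        pred = self.fc(F.adaptive_avg_pool2d(feat, 1).flatten(1))            # (N, num_classes)
        # presence label: 1 if the class appears anywhere in the ground-truth mask
        present = torch.stack([(target_mask == c).flatten(1).any(dim=1)
                               for c in range(pred.size(1))], dim=1).float()
        return F.binary_cross_entropy_with_logits(pred, present)
```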
BiSeNet
- paper BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation
- From Megvii
- The ARM is essentially an SE block with the two channel-squeezing convs removed; the FFM is a classic attention structure, here called the feature fusion module (both are sketched after this list)
- Compared with the previous cost-performance king ICNet: 3x faster at the same accuracy, and still twice as fast while substantially more accurate
- With a large backbone its accuracy is also top-tier
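- A sketch of the two modules as described above: the ARM is global-pool → 1x1 conv → sigmoid reweighting (an SE block without the channel-squeeze pair), and the FFM concatenates the two paths and then applies an SE-style gate on top. Channel choices are my assumptions:

```python
import torch
import torch.nn as nn


class AttentionRefinementModule(nn.Module):
    """ARM: an SE-style gate without the channel-reduction pair of convs."""

    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)


class FeatureFusionModule(nn.Module):
    """FFM: concat spatial + context paths, then reweight with a small SE branch."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_channels, out_channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels // 4, out_channels, 1), nn.Sigmoid())

    def forward(self, spatial, context):
        feat = self.fuse(torch.cat([spatial, context], dim=1))
        return feat + feat * self.gate(feat)
```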
DANet
- paper Dual Attention Network for Scene Segmentation
- git https://github.com/junfu1115/DANet/
- Fuses spatial attention and channel attention
- As for the specifics: the spatial attention is fairly standard, while the channel attention is a bit odd...
- Straight to the code:
```python
import torch
import torch.nn as nn


class _PositionAttentionModule(nn.Module):
    """Position attention module"""

    def __init__(self, in_channels, **kwargs):
        super(_PositionAttentionModule, self).__init__()
        self.conv_b = nn.Conv2d(in_channels, in_channels // 8, 1)
        self.conv_c = nn.Conv2d(in_channels, in_channels // 8, 1)
        self.conv_d = nn.Conv2d(in_channels, in_channels, 1)
        self.alpha = nn.Parameter(torch.zeros(1))
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        batch_size, _, height, width = x.size()
        feat_b = self.conv_b(x).view(batch_size, -1, height * width).permute(0, 2, 1)
        feat_c = self.conv_c(x).view(batch_size, -1, height * width)
        attention_s = self.softmax(torch.bmm(feat_b, feat_c))  # (HW x HW) spatial affinity
        feat_d = self.conv_d(x).view(batch_size, -1, height * width)
        feat_e = torch.bmm(feat_d, attention_s.permute(0, 2, 1)).view(batch_size, -1, height, width)
        out = self.alpha * feat_e + x                          # learnable residual weight
        return out


class _ChannelAttentionModule(nn.Module):
    """Channel attention module"""

    def __init__(self, **kwargs):
        super(_ChannelAttentionModule, self).__init__()
        self.beta = nn.Parameter(torch.zeros(1))
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        batch_size, _, height, width = x.size()
        feat_a = x.view(batch_size, -1, height * width)
        feat_a_transpose = x.view(batch_size, -1, height * width).permute(0, 2, 1)
        attention = torch.bmm(feat_a, feat_a_transpose)        # (C x C) channel affinity
        # the "odd" part: subtract from the per-row max before the softmax
        attention_new = torch.max(attention, dim=-1, keepdim=True)[0].expand_as(attention) - attention
        attention = self.softmax(attention_new)
        feat_e = torch.bmm(attention, feat_a).view(batch_size, -1, height, width)
        out = self.beta * feat_e + x
        return out
```
- On Cityscapes it is slightly better than DeepLab v3
- On VOC it is roughly at PSP's level
CGNet
- paper CGNet: A Light-weight Context Guided Network for Semantic Segmentation
- git https://github.com/wutianyiRosun/CGNet
- Shows the paper's main idea: (a) is FCN, (b) is FCN plus a context module such as PSP or ASPP; here, context features are injected at every stage
- Behind the fancy names it is just a 3x3 conv and a 3x3 dilation-3 conv concatenated, followed by SE (a sketch follows this list)
- The position of the residual connection inside the module is also ablated
- The baselines are ENet and ESPNet; with the same number of parameters it gains 5%-10% in accuracy
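- A sketch of the CG block as summarized above: a local 3x3 conv and a 3x3 dilated (rate 3) conv computed in parallel, concatenated, then reweighted by a small SE-style global-context branch. The residual placement and channel split are simplified assumptions:

```python
import torch
import torch.nn as nn


class ContextGuidedBlock(nn.Module):
    def __init__(self, channels, reduction=8, dilation=3):
        super().__init__()
        half = channels // 2
        self.local = nn.Conv2d(channels, half, 3, padding=1, bias=False)
        self.surround = nn.Conv2d(channels, half, 3, padding=dilation, dilation=dilation, bias=False)
        self.bn_act = nn.Sequential(nn.BatchNorm2d(channels), nn.PReLU(channels))
        self.gate = nn.Sequential(                       # global context reweighting (SE-style)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        feat = self.bn_act(torch.cat([self.local(x), self.surround(x)], dim=1))
        return x + feat * self.gate(feat)                # residual connection
```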
OCNet
- paper OCNet: Object Context Network for Scene Parsing
- git https://github.com/PkuRainBow/OCNet.pytorch
- From Microsoft
- Shows several OC architectures; the paper does not draw the basic OC structure, so straight to the code
- OCM → BaseOC_Module, OCP → BaseOC_Context_Module
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from inplace_abn import InPlaceABNSync   # fused BN + activation, as used in the OCNet repo


class _SelfAttentionBlock(nn.Module):
    '''
    The basic implementation for self-attention block/non-local block
    Input:
        N X C X H X W
    Parameters:
        in_channels : the dimension of the input feature map
        key_channels : the dimension after the key/query transform
        value_channels : the dimension after the value transform
        scale : choose the scale to downsample the input feature maps (save memory cost)
    Return:
        N X C X H X W
        position-aware context features. (w/o concat or add with the input)
    '''
    def __init__(self, in_channels, key_channels, value_channels, out_channels=None, scale=1):
        super(_SelfAttentionBlock, self).__init__()
        self.scale = scale
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.key_channels = key_channels
        self.value_channels = value_channels
        if out_channels is None:
            self.out_channels = in_channels
        self.pool = nn.MaxPool2d(kernel_size=(scale, scale))
        self.f_key = nn.Sequential(
            nn.Conv2d(in_channels=self.in_channels, out_channels=self.key_channels,
                      kernel_size=1, stride=1, padding=0),
            InPlaceABNSync(self.key_channels),
        )
        self.f_query = self.f_key                 # query and key share the same transform
        self.f_value = nn.Conv2d(in_channels=self.in_channels, out_channels=self.value_channels,
                                 kernel_size=1, stride=1, padding=0)
        self.W = nn.Conv2d(in_channels=self.value_channels, out_channels=self.out_channels,
                           kernel_size=1, stride=1, padding=0)
        nn.init.constant_(self.W.weight, 0)
        nn.init.constant_(self.W.bias, 0)

    def forward(self, x):
        batch_size, h, w = x.size(0), x.size(2), x.size(3)
        if self.scale > 1:
            x = self.pool(x)
        value = self.f_value(x).view(batch_size, self.value_channels, -1)
        value = value.permute(0, 2, 1)
        query = self.f_query(x).view(batch_size, self.key_channels, -1)
        query = query.permute(0, 2, 1)
        key = self.f_key(x).view(batch_size, self.key_channels, -1)
        sim_map = torch.matmul(query, key)        # (HW x HW) similarity map
        sim_map = (self.key_channels ** -.5) * sim_map
        sim_map = F.softmax(sim_map, dim=-1)
        context = torch.matmul(sim_map, value)
        context = context.permute(0, 2, 1).contiguous()
        context = context.view(batch_size, self.value_channels, *x.size()[2:])
        context = self.W(context)
        if self.scale > 1:
            context = F.interpolate(context, size=(h, w), mode='bilinear', align_corners=True)
        return context


class SelfAttentionBlock2D(_SelfAttentionBlock):
    def __init__(self, in_channels, key_channels, value_channels, out_channels=None, scale=1):
        super(SelfAttentionBlock2D, self).__init__(in_channels,
                                                   key_channels,
                                                   value_channels,
                                                   out_channels,
                                                   scale)


class BaseOC_Module(nn.Module):
    """
    Implementation of the BaseOC module
    Parameters:
        in_features / out_features: the channels of the input / output feature maps.
        dropout: we choose 0.05 as the default value.
        size: you can apply multiple sizes. Here we only use one size.
    Return:
        features fused with Object context information.
    """
    def __init__(self, in_channels, out_channels, key_channels, value_channels, dropout, sizes=([1])):
        super(BaseOC_Module, self).__init__()
        self.stages = nn.ModuleList(
            [self._make_stage(in_channels, out_channels, key_channels, value_channels, size) for size in sizes])
        self.conv_bn_dropout = nn.Sequential(
            nn.Conv2d(2 * in_channels, out_channels, kernel_size=1, padding=0),
            InPlaceABNSync(out_channels),
            nn.Dropout2d(dropout)
        )

    def _make_stage(self, in_channels, output_channels, key_channels, value_channels, size):
        return SelfAttentionBlock2D(in_channels,
                                    key_channels,
                                    value_channels,
                                    output_channels,
                                    size)

    def forward(self, feats):
        priors = [stage(feats) for stage in self.stages]
        context = priors[0]
        for i in range(1, len(priors)):
            context += priors[i]
        # OCM: concat the aggregated context with the input features before projecting
        output = self.conv_bn_dropout(torch.cat([context, feats], 1))
        return output


class BaseOC_Context_Module(nn.Module):
    """
    Output only the context features.
    Parameters:
        in_features / out_features: the channels of the input / output feature maps.
        dropout: specify the dropout ratio
        fusion: We provide two different fusion methods, "concat" or "add"
        size: we find that directly learning the attention weights on even 1/8 feature maps is hard.
    Return:
        features after "concat" or "add"
    """
    def __init__(self, in_channels, out_channels, key_channels, value_channels, dropout, sizes=([1])):
        super(BaseOC_Context_Module, self).__init__()
        self.stages = nn.ModuleList(
            [self._make_stage(in_channels, out_channels, key_channels, value_channels, size) for size in sizes])
        self.conv_bn_dropout = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, padding=0),
            InPlaceABNSync(out_channels),
        )

    def _make_stage(self, in_channels, output_channels, key_channels, value_channels, size):
        return SelfAttentionBlock2D(in_channels,
                                    key_channels,
                                    value_channels,
                                    output_channels,
                                    size)

    def forward(self, feats):
        priors = [stage(feats) for stage in self.stages]
        context = priors[0]
        for i in range(1, len(priors)):
            context += priors[i]
        # OCP: project only the aggregated context, without concatenating the input
        output = self.conv_bn_dropout(context)
        return output
```
- The takeaway: the so-called OC module is a remake of the familiar spatial (self-)attention
- Adding OC on top of an ASPP structure still gives a decent boost
- The change is simple, but the effect is remarkable
- Experiments on LIP also show strong results
DUNet
- paper Decoders Matter for Semantic Segmentation: Data-Dependent Decoding Enables Flexible Feature Aggregation
- The overall architecture is standard; the interesting part is how DUpsample works
- I do not find the paper's figure very intuitive, so straight to the code
- In short: a 1x1 conv maps C channels to C x scale x scale, followed by a couple of reshapes
```python
import torch.nn as nn


class DUpsampling(nn.Module):
    """DUpsampling module"""

    def __init__(self, in_channels, out_channels, scale_factor=2, **kwargs):
        super(DUpsampling, self).__init__()
        self.scale_factor = scale_factor
        # 1x1 conv: C -> out_channels * scale^2, learned jointly with the rest of the network
        self.conv_w = nn.Conv2d(in_channels, out_channels * scale_factor * scale_factor, 1, bias=False)

    def forward(self, x):
        x = self.conv_w(x)
        n, c, h, w = x.size()
        # N, C, H, W --> N, W, H, C
        x = x.permute(0, 3, 2, 1).contiguous()
        # N, W, H, C --> N, W, H * scale, C // scale
        x = x.view(n, w, h * self.scale_factor, c // self.scale_factor)
        # N, W, H * scale, C // scale --> N, H * scale, W, C // scale
        x = x.permute(0, 2, 1, 3).contiguous()
        # N, H * scale, W, C // scale --> N, H * scale, W * scale, C // (scale ** 2)
        x = x.view(n, h * self.scale_factor, w * self.scale_factor, c // (self.scale_factor * self.scale_factor))
        # N, H * scale, W * scale, C // (scale ** 2) --> N, C // (scale ** 2), H * scale, W * scale
        x = x.permute(0, 3, 1, 2)
        return x
```
- The VOC comparison shows DUpsample does have an edge over bilinear upsampling
- The method also works on DeepLab v3+, bringing a 0.3% improvement
fastFCN
- paper FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation
- git https://github.com/wuhuikai/FastFCN
- A standard overall structure; the main contribution is the JPU
- Upsample to 1/8 resolution, convolve with multiple dilation rates, then concat and conv to get the result... isn't that just ASPP at 1/8 resolution? So where does the "fast" come from? (a JPU sketch follows this list)
- The gains are modest
- Works well in combination with EncNet
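- A sketch of the JPU as I understand the figure: reduce and upsample the stride-32 and stride-16 features to stride 8, concatenate them with the stride-8 feature, then run several parallel convs with different dilation rates and concatenate again. The widths and rates (1/2/4/8) are assumptions, and the separable convs of the original are replaced with plain convs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JPU(nn.Module):
    """Joint Pyramid Upsampling: merge stride-8/16/32 features, then multi-dilation convs."""

    def __init__(self, in_channels=(512, 1024, 2048), width=512, rates=(1, 2, 4, 8)):
        super().__init__()
        self.reduce = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, width, 3, padding=1, bias=False),
                          nn.BatchNorm2d(width), nn.ReLU(inplace=True))
            for c in in_channels])
        self.dilated = nn.ModuleList([
            nn.Sequential(nn.Conv2d(width * len(in_channels), width, 3,
                                    padding=r, dilation=r, bias=False),
                          nn.BatchNorm2d(width), nn.ReLU(inplace=True))
            for r in rates])

    def forward(self, c3, c4, c5):                       # backbone features at strides 8, 16, 32
        feats = [reduce(f) for reduce, f in zip(self.reduce, (c3, c4, c5))]
        size = feats[0].shape[2:]
        feats = [feats[0]] + [F.interpolate(f, size=size, mode='bilinear', align_corners=False)
                              for f in feats[1:]]
        merged = torch.cat(feats, dim=1)
        return torch.cat([d(merged) for d in self.dilated], dim=1)     # stride-8 output
```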
LEDNET
- paper LEDNET: A LIGHTWEIGHT ENCODER-DECODER NETWORK FOR REAL-TIME SEMANTIC SEGMENTATION
- The main tricks are in the decoder part
Fast-SCNN
- paper Fast-SCNN: Fast Semantic Segmentation Network
- Feels like DeepLab v3+ with the ASPP removed
HRNet
- paper High-Resolution Representations for Labeling Pixels and Regions
- git https://github.com/HRNet
- HRNet really is unstoppable on segmentation, and by now everyone is familiar with the structure
- On LIP it reaches 55.9 without any extra supervision, which is remarkably high
DFANet
- paper DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation
- Lists the different structures
- It looks like three copies of the same network being concatenated, quite similar in spirit to HRNet
- Its advantage is speed, thanks to the reduced overall depth of the network; it is the best choice in 100 fps scenarios
OCRNet
- paper Object-Contextual Representations for Semantic Segmentation
- No source code yet, so the exact operations remain open to question
- The results are stunning; now we just wait for the authors to open-source it
- It can be seen that our approach (HRNetV2 + OCR) achieves very competitive performance w/o using the video information or depth information. We then combine our OCR with ASPP [4] by replacing the global average pooling with our OCR, which (HRNetV2 + OCR (w/ ASP)) achieves 1st on 1 metric and 2nd on 3 of the 4 metrics with only a single model.
- It also mentions that when used together with ASPP, simply replacing the GAP with OCR works wonders
Instance Segmentation
It is fair to say that the deep-learning wave in instance segmentation started with Mask R-CNN
Mask-RCNN
- paper Mask R-CNN
- git
- The overall architecture
- RoIPooling is upgraded to RoIAlign: from simple max pooling with rounded coordinates to bilinear interpolation (a usage sketch follows this list)
- Shows the slightly different head designs for Faster R-CNN with and without FPN
- A huge improvement over previous results
- Moreover, comparing pure detection performance, Mask R-CNN is also clearly better than Faster R-CNN when using the same backbone
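- torchvision ships a RoIAlign op, so the quantization-free pooling can be tried directly; the feature map, boxes and scales below are made-up values just to show the call:

```python
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 50, 50)           # a feature map at 1/16 resolution
# boxes in (batch_index, x1, y1, x2, y2) format, in input-image coordinates
boxes = torch.tensor([[0, 32.7, 14.2, 251.9, 310.4]])

# bilinear sampling at fractional coordinates instead of rounding to the nearest cell
pooled = roi_align(features, boxes, output_size=(7, 7),
                   spatial_scale=1.0 / 16, sampling_ratio=2, aligned=True)
print(pooled.shape)                               # torch.Size([1, 256, 7, 7])
```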
PANet
- paper Path Aggregation Network for Instance Segmentation
- git https://github.com/ShuLiu1993/PANet
- PANet is also a heavyweight, with plenty of honors:
- CVPR 2018 Spotlight paper
- 1st place of COCO Instance Segmentation Challenge 2017
- 2nd place of COCO Detection Challenge 2017
- 1st place of 2018 Scene Understanding Challenge for Autonomous Navigation in Unstructured Environments
- Any one of these honors alone would make it a heavyweight
- Stage (a) is FPN plus the red line, which denotes the element-wise addition of low-level and high-level features; stage (b) reassembles the features, with the green line playing a role similar to the red line in (a); stage (c) performs RoI pooling and fuses the results to get the final output
- Looking at it piece by piece, this is the detail of stage (b): essentially FPN with the upsampling replaced by a stride-2 3x3 conv
- Note that N2 is simply P2, without any processing.
- It is worth noting that in multi-level detectors the RoI pooling of each level is normally done independently, yet in this figure the RoI proposals are all aligned; there is actually an extra alignment step in between, after which the corresponding features are fused
- https://github.com/ShuLiu1993/PANet/blob/master/lib/modeling/collect_and_distribute_fpn_rpn_proposals.py
- “””Merge RPN proposals generated at multiple FPN levels and then distribute those proposals to their appropriate FPN levels. An anchor at one FPN level may predict an RoI that will map to another level, hence the need to redistribute the proposals.“”“
- The source code has a dedicated routine for exactly this; simply put, every proposal is copied to all the other levels to achieve the alignment
- In the mask branch the authors also combine conv and fc paths to improve segmentation accuracy
- It beats Mask R-CNN w/ FPN hands down, though this also comes with a visible increase in computation
- On detection it is just as dominant
- In the Ablation Studies the authors also reproduce Mask R-CNN and gain 4.4 points with training tricks
- They also share the methods behind their COCO first place; each of these tricks clearly brings solid gains
- Summary: a paper packed with practical insights, well worth reading
MS R-CNN
- paper Mask Scoring R-CNN
- git https://github.com/zjhuang22/maskscoring_rcnn
- The RCNN head and Mask head are standard components of Mask R-CNN. The new MaskIoU branch predicts an IoU score for each class
- Such a simple trick brings easy gains, and it works in nearly every setting. What more could you ask for?
yolact
- paper YOLACT Real-time Instance Segmentation
- git https://github.com/dbolya/yolact
- Based on RetinaNet. The head predicts bbox, class, and mask coefficients; after NMS these are combined with the protonet output to obtain per-instance masks (a sketch of this assembly follows this list)
- The head is slightly different: the tower is shared to cut computation and parameters, and an extra mask branch is added; its dimension k is set in the config:
```python
if cfg.mask_type == mask_type.direct:
    cfg.mask_dim = cfg.mask_size ** 2            # direct prediction: one output per mask pixel (mask_size is 16 here)
elif cfg.mask_type == mask_type.lincomb:
    cfg.mask_dim = num_grids + num_features      # k coefficients for the linear combination of prototypes
```
- loss: "Since each pixel can be assigned to more than one class, we use sigmoid and c channels instead of softmax and c + 1. This loss is given a weight of 1 and results in a +0.4 mAP boost."
- Overall, at the same accuracy it absolutely crushes FCIS in speed; a solid choice for real-time use
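- A sketch of the assembly step: the prototypes from protonet are combined linearly with each detection's predicted coefficients and passed through a sigmoid; YOLACT then crops each mask with its box and thresholds it. The shapes (k = 32 prototypes at 138x138) are illustrative assumptions:

```python
import torch

# protonet output: k prototype masks at roughly 1/4 resolution
prototypes = torch.randn(32, 138, 138)            # (k, H, W)
# per-detection mask coefficients predicted by the head (after NMS), here 5 detections
coeffs = torch.randn(5, 32)                       # (num_dets, k)

# linear combination of prototypes followed by a sigmoid gives per-instance masks
masks = torch.sigmoid(torch.einsum('nk,khw->nhw', coeffs, prototypes))   # (num_dets, H, W)
```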
PolarMask
- paper PolarMask: Single Shot Instance Segmentation with Polar Representation
- git https://github.com/xieenze/PolarMask
- Shows the details of modeling in Cartesian versus polar coordinates
- The paper extends FCOS: a bbox is viewed as a polygon with 4 equally spaced angles in polar coordinates, and a mask as a polygon with arbitrarily many equally spaced angles
- Centerness is computed differently; later experiments show the advantage of this centerness
- Since polar coordinates are used, the IoU computation also has to change; the formula is written as an integral, but in practice it is discretized into n equal sectors (see the sketch after this list)
- (figures read left → right, top → bottom)
- "Rays" is the number of sectors; experiments show 36 rays is about enough
- Compares smooth-L1 against the Polar IoU loss: you really cannot do without the IoU loss
- Polar centerness works better than Cartesian centerness
- The box branch barely matters either way
- The stronger the backbone the better, as usual
- Larger scales are of course better too
- With only 12 epochs of training and no augmentation it reaches 30+ COCO mask mAP
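- A sketch of the two pieces described above, discretizing the IoU integral over n rays as the paper does; `pred` and `target` are per-ray distances (assumed positive), and the shapes are toy values:

```python
import torch


def polar_iou_loss(pred, target, eps=1e-6):
    """Polar IoU loss over n rays: log(sum of max distances / sum of min distances)."""
    d_min = torch.min(pred, target).sum(dim=-1)
    d_max = torch.max(pred, target).sum(dim=-1)
    return torch.log((d_max + eps) / (d_min + eps)).mean()


def polar_centerness(target):
    """Polar centerness: sqrt(min ray / max ray) of the ground-truth distances."""
    return torch.sqrt(target.min(dim=-1)[0] / target.max(dim=-1)[0])


# 36 rays per location, 8 positive locations in this toy example
pred = torch.rand(8, 36) * 10 + 1
target = torch.rand(8, 36) * 10 + 1
loss = polar_iou_loss(pred, target)
weights = polar_centerness(target)
```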
CenterMask
- paper CenterMask: Real-Time Anchor-Free Instance Segmentation
- Amazing! It beats Mask R-CNN in both speed and accuracy, and in the real-time arena it also beats YOLACT by a wide margin. Let's see how it is done
- Overall architecture: FCOS serves as something like an RPN to produce bboxes and classes, then each bbox goes through the SAG-Mask branch, which suppresses pixel-level noise to produce the mask
- Not shown in the figure, but the authors mention an Adaptive RoI Assignment Function that adapts across levels and normalizes boxes of different sizes to a single size
- SAM is a pooling + sigmoid + element-wise-mul spatial attention guided mask, i.e. plain spatial attention (a sketch follows this list)
- A highlight of the paper is VoVNet v2, which improves VoVNet's performance; the main changes are in the OSA module
- Residual connection: figure (b) above
- eSE: figure (c) above; essentially GAP then FC then element-wise-mul (ECANet does the same thing under a different name)
- Using FCOS-R50 as the example, the authors show the steps of converting it into CenterMask and the added runtime; "mask scoring" is the MaskIoU loss branch from MS R-CNN mentioned earlier, costing an extra 15 ms per image, which is acceptable
- Shows the effect of the two improvements in VoVNetV2: a very good accuracy trade-off for a small increase in runtime
- Compared with today's mainstream backbones (ResNet, ResNeXt, HRNet), VoVNet runs faster on GPU within CenterMask at the same (or slightly higher) accuracy
- Compared with current real-time instance segmentation architectures, CenterMask gains considerable accuracy at the same speed
- A highly recommended method
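- A sketch of the SAM described above: average- and max-pool along the channel axis, a small conv, a sigmoid, then element-wise multiplication with the RoI feature. The 3x3 kernel and the RoI feature shape are assumptions:

```python
import torch
import torch.nn as nn


class SpatialAttentionGuidedMask(nn.Module):
    """SAM: channel-wise avg/max pooling -> conv -> sigmoid -> element-wise gate."""

    def __init__(self, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg_pool = x.mean(dim=1, keepdim=True)           # (N, 1, H, W)
        max_pool = x.max(dim=1, keepdim=True)[0]         # (N, 1, H, W)
        gate = torch.sigmoid(self.conv(torch.cat([avg_pool, max_pool], dim=1)))
        return x * gate                                  # suppress pixel-level noise in the RoI


sam = SpatialAttentionGuidedMask()
roi_feat = torch.randn(4, 256, 14, 14)                   # pooled RoI features (assumed shape)
out = sam(roi_feat)                                      # same shape, spatially re-weighted
```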
Dataset
Semantic Segmentation
Pascal VOC 2012
- home http://host.robots.ox.ac.uk:8080/pascal/VOC/voc2012/index.html
- download http://host.robots.ox.ac.uk:8080/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
- 20 classes. The 2012 dataset contains images from 2008-2011 for which additional segmentations have been prepared. As in previous years the assignment to training/test sets has been maintained. The total number of images with segmentation has been increased from 7,062 to 9,993.
- VOCAug
- VOCAug is also a very commonly used dataset, derived from VOC
- 11355 train / 2857 val
Cityscapes
- home https://www.cityscapes-dataset.com/
- download: available from the home page; registration is required
- Baidu Yun mirror https://pan.baidu.com/s/1w3W_dQBUiHcwkLOtbSJ1Tg (code: 1bln)
- 30 classes. We present a new large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high quality pixel-level annotations of 5 000 frames in addition to a larger set of 20 000 weakly annotated frames. The dataset is thus an order of magnitude larger than similar previous attempts. Details on annotated classes and examples of our annotations are available at this webpage.
ADE20K
- home http://groups.csail.mit.edu/vision/datasets/ADE20K/
- download http://groups.csail.mit.edu/vision/datasets/ADE20K/ADE20K_2016_07_26.zip
- 20210 train / 2000 val (plus a test set). The class set is open-ended; there are currently at least 250 classes
Instance Segmentation
COCO 17
- home http://cocodataset.org/#download | http://cocodataset.org/#stuff-2017
- download http://cocodataset.org/#download
- There are quite a few files to download; just download everything you see
- The task includes 55K COCO images (train 40K, val 5K, test-dev 5K, test-challenge 5K) with annotations for 91 stuff classes and 1 ‘other’ class. The stuff annotations cover 38M superpixels (10B pixels) with 296K stuff regions (5.4 stuff labels per image). Annotations for train and val are now available for download, while test set annotations will remain private.