
Segmentation Survey

An organized walkthrough of segmentation: background, datasets, and algorithms


Intro


Foreword

  • By now, segmentation, as one of the three fundamental vision tasks, is widely familiar
  • It is worth noting that segmentation tasks are usually considered to fall into two categories
    • Semantic segmentation
      • Semantic segmentation distinguishes classes only, not individual instances
      • Representative works include FCN and the DeepLab series
    • Instance segmentation
      • Instance segmentation distinguishes classes and also separates different instances of the same class
      • A representative work is Mask R-CNN
  • (A figure borrowed from gluoncv illustrates the difference)
  • Intro1.png

Algorithm


Semantic Segmentation

Unet

Segnet

  • paper SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
  • code http://mi.eng.cam.ac.uk/projects/segnet/
  • segnet1.png
  • Very similar to U-Net; the differences claimed in the paper feel a bit forced...
    • As compared to SegNet, U-Net [16] (proposed for the medical imaging community) does not reuse pooling indices but instead transfers the entire feature map (at the cost of more memory) to the corresponding decoders and concatenates them to upsampled (via deconvolution) decoder feature maps. There is no conv5 and max-pool 5 block in U-Net as in the VGG net architecture. SegNet, on the other hand, uses all of the pre-trained convolutional layer weights from VGG net as pre-trained weights.
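  • A minimal sketch of the two skip styles contrasted in that quote: SegNet reuses the max-pooling indices for unpooling, while U-Net transfers and concatenates the whole encoder feature map (bilinear upsampling stands in here for the paper's deconvolution); tensor sizes are illustrative
    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 64, 32, 32)

    # SegNet-style: keep only the argmax indices from max pooling and use them to unpool
    pooled, indices = F.max_pool2d(x, kernel_size=2, stride=2, return_indices=True)
    unpooled = F.max_unpool2d(pooled, indices, kernel_size=2, stride=2)   # sparse, same size as x

    # U-Net-style: keep the entire encoder feature map and concatenate it with the
    # upsampled decoder feature (costs more memory, carries dense information)
    decoder_feat = F.interpolate(pooled, scale_factor=2, mode='bilinear', align_corners=False)
    skip = torch.cat([x, decoder_feat], dim=1)                            # doubles the channels

    print(unpooled.shape, skip.shape)  # (1, 64, 32, 32) and (1, 128, 32, 32)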

Deeplab v1 & v2

  • paper DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
  • home http://liangchiehchen.com/projects/DeepLab.html
  • deeplabv1&2_1.png
  • At that time, multi-stage pipelines were still widely believed to beat end-to-end ones: a DCNN extracts features, the score map is upsampled, and a fully connected CRF produces the final result. That is the core idea of DeepLab v1
    • CRFs have been broadly used in semantic segmentation to combine class scores computed by multi-way classifiers with the lowlevel information captured by the local interactions of pixels and edges [23], [24] or superpixels [25]. Even though works of increased sophistication have been proposed to model the hierarchical dependency [26], [27], [28] and/or highorder dependencies of segments [29], [30], [31], [32], [33], we use the fully connected pairwise CRF proposed by [22] for its efficient computation, and ability to capture fine edge details while also catering for long range dependencies.
  • deeplabv1&2_2.png
  • One of the paper's biggest contributions is popularizing atrous convolution for segmentation tasks
  • deeplabv1&2_3.png
  • The version with ASPP, known as DeepLab v2, is just as strong: parallel atrous convolutions with multiple rates stitched together (see the sketch at the end of this section)
  • deeplabv1&2_4.png
  • With successive CRF iterations the edge information becomes richer and the segmentation results get noticeably cleaner
  • deeplabv1&2_5.png
  • The large-FOV variant is essentially a single-branch ASPP; experiments show that a large atrous rate brings a clear improvement
  • deeplabv1&2_6.png
  • And of course a SOTA comparison is obligatory
  • deeplabv1&2_7.png
  • Interestingly, ASPP underperforms on PASCAL-Person-Part; the model does better without it
  • deeplabv1&2_8.png
  • On Cityscapes, ASPP does help
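  • A minimal ASPP sketch matching the description above: parallel 3×3 atrous convolutions with different rates over the same feature map, fused by summation as in DeepLab v2; channel sizes and rates here are illustrative, not the paper's exact head
    import torch
    import torch.nn as nn

    class ASPP(nn.Module):
        """Minimal sketch, assuming DeepLab v2 style fusion (branch score maps summed)."""
        def __init__(self, in_channels, num_classes, rates=(6, 12, 18, 24)):
            super().__init__()
            self.branches = nn.ModuleList([
                nn.Conv2d(in_channels, num_classes, kernel_size=3,
                          padding=r, dilation=r)   # padding = rate keeps the spatial size
                for r in rates
            ])

        def forward(self, x):
            # fuse the parallel atrous branches by summation
            out = self.branches[0](x)
            for branch in self.branches[1:]:
                out = out + branch(x)
            return out

    scores = ASPP(512, 21)(torch.randn(1, 512, 33, 33))  # -> (1, 21, 33, 33)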

FCN

Enet

  • paper ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation
  • enet1.png
  • The authors recognize how damaging max pooling is to the input at the early stage, and use a 3×3 stride-2 conv in the initial block to mitigate the information loss; the bottleneck design is presumably inspired by ResNet, with small differences such as the use of PReLU
  • enet2.png
  • The main architecture: stages 2 and 3 both use dilated convolutions with increasing rates
    • asymmetric (see the parameter-count sketch after this list)
      • Sometimes we replace it with asymmetric convolution i.e. a sequence of 5 × 1 and 1 × 5 convolutions instead of 5 × 5.
  • enet3.png
  • Shows the speed advantage over SegNet
  • enet4.png
  • On Cityscapes it is slightly behind SegNet
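  • A quick sketch of the asymmetric factorization quoted above, comparing the parameter count of a plain 5×5 conv with a 5×1 followed by a 1×5 (the channel count is illustrative)
    import torch
    import torch.nn as nn

    channels = 128

    # standard 5x5 conv: 128 * 128 * 5 * 5 = 409,600 weights
    full = nn.Conv2d(channels, channels, kernel_size=5, padding=2, bias=False)

    # ENet-style asymmetric factorization: 5x1 followed by 1x5,
    # 2 * 128 * 128 * 5 = 163,840 weights for a similar receptive field
    asym = nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=(5, 1), padding=(2, 0), bias=False),
        nn.Conv2d(channels, channels, kernel_size=(1, 5), padding=(0, 2), bias=False),
    )

    x = torch.randn(1, channels, 64, 64)
    assert full(x).shape == asym(x).shape  # both keep (1, 128, 64, 64)
    print(sum(p.numel() for p in full.parameters()),
          sum(p.numel() for p in asym.parameters()))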

PSP

  • paper Pyramid Scene Parsing Network
  • A long-reigning heavyweight of semantic segmentation
  • psp1.png
  • Going straight to the code, which is very easy to follow
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    def _PSP1x1Conv(in_channels, out_channels, norm_layer, norm_kwargs):
        # 1x1 conv + norm + ReLU applied to each pooled branch
        return nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, bias=False),
            norm_layer(out_channels, **({} if norm_kwargs is None else norm_kwargs)),
            nn.ReLU(True)
        )


    class _PyramidPooling(nn.Module):
        def __init__(self, in_channels, **kwargs):
            super(_PyramidPooling, self).__init__()
            out_channels = int(in_channels / 4)
            # pool the feature map down to 1x1, 2x2, 3x3 and 6x6 grids
            self.avgpool1 = nn.AdaptiveAvgPool2d(1)
            self.avgpool2 = nn.AdaptiveAvgPool2d(2)
            self.avgpool3 = nn.AdaptiveAvgPool2d(3)
            self.avgpool4 = nn.AdaptiveAvgPool2d(6)
            self.conv1 = _PSP1x1Conv(in_channels, out_channels, **kwargs)
            self.conv2 = _PSP1x1Conv(in_channels, out_channels, **kwargs)
            self.conv3 = _PSP1x1Conv(in_channels, out_channels, **kwargs)
            self.conv4 = _PSP1x1Conv(in_channels, out_channels, **kwargs)

        def forward(self, x):
            size = x.size()[2:]
            # upsample every pooled branch back to the input resolution, then concatenate with the input
            feat1 = F.interpolate(self.conv1(self.avgpool1(x)), size, mode='bilinear', align_corners=True)
            feat2 = F.interpolate(self.conv2(self.avgpool2(x)), size, mode='bilinear', align_corners=True)
            feat3 = F.interpolate(self.conv3(self.avgpool3(x)), size, mode='bilinear', align_corners=True)
            feat4 = F.interpolate(self.conv4(self.avgpool4(x)), size, mode='bilinear', align_corners=True)
            return torch.cat([x, feat1, feat2, feat3, feat4], dim=1)
  • In short: adaptive average pooling to several output sizes captures context at different granularities (a quick shape check follows at the end of this section)
  • psp2.png
  • The aux loss is an auxiliary loss branching off stage 4; the head attached there is an FCN head (without the PSP module)
  • psp3.png
  • The ablation study analyzes this aux loss; it simply works and gives a consistent boost
  • psp4.png
  • Shows the incremental gains of PSPNet's components on the ImageNet scene parsing challenge 2016
  • psp5.png
  • PSPNet's results are genuinely impressive
  • psp6.png
  • Cityscapes is, of course, not left out either
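  • A quick shape check of the _PyramidPooling module above (2048 input channels is the usual ResNet stage-5 width; the norm settings are passed through **kwargs as in the code)
    import torch
    import torch.nn as nn

    pp = _PyramidPooling(2048, norm_layer=nn.BatchNorm2d, norm_kwargs=None)
    out = pp(torch.randn(1, 2048, 60, 60))
    print(out.shape)  # (1, 4096, 60, 60): 2048 input channels + 4 pooled branches of 512 each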

ICNet

  • paper ICNet for Real-Time Semantic Segmentation on High-Resolution Images
  • git https://github.com/hszhao/ICNet
  • icnet1.png
  • PSPNet is strong but not fast enough; ICNet aims for a balance between accuracy and speed
  • icnet2.png
  • Features from three sub-networks fed with inputs at three different resolutions are fused to produce the final result
  • icnet3.png
  • Details of CFF, the cascade feature fusion unit (a sketch follows at the end of this section)
  • icnet4.png
  • Lists several common semantic segmentation structures. I have doubts about the "ours" diagram (d): in reality every sub-network has its own output
  • icnet5.png
  • Cityscapes results, pitched YOLO-style: nothing faster is as accurate, nothing more accurate is as fast
  • icnet6.png
  • Equally solid on CamVid and COCO-Stuff
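  • A hedged sketch of the cascade feature fusion unit as I read it from the paper: upsample the low-resolution branch, refine it with a dilated 3×3 conv, project the high-resolution branch with a 1×1 conv, then sum + ReLU, with an auxiliary classifier on the upsampled branch; class names and channel choices are my own
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CascadeFeatureFusion(nn.Module):
        """Sketch only; not the official implementation."""
        def __init__(self, low_channels, high_channels, out_channels, num_classes):
            super().__init__()
            self.conv_low = nn.Sequential(
                nn.Conv2d(low_channels, out_channels, 3, padding=2, dilation=2, bias=False),
                nn.BatchNorm2d(out_channels))
            self.conv_high = nn.Sequential(
                nn.Conv2d(high_channels, out_channels, 1, bias=False),
                nn.BatchNorm2d(out_channels))
            self.conv_aux = nn.Conv2d(low_channels, num_classes, 1)  # auxiliary loss head

        def forward(self, x_low, x_high):
            x_low = F.interpolate(x_low, size=x_high.shape[2:], mode='bilinear', align_corners=True)
            aux = self.conv_aux(x_low)   # supervised by the cascade label guidance loss
            fused = F.relu(self.conv_low(x_low) + self.conv_high(x_high), inplace=True)
            return fused, aux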

DenseASPP

Deeplab v3

Deeplab v3+

EncNet

  • paper Context Encoding for Semantic Segmentation
  • encnet1.png
  • Re-weights the final features and adds a new SE-loss that supervises the probability of each class being present in the image (a simplified sketch follows at the end of this section)
  • encnet2.png
  • Compared with DeepLab v3+, released a month earlier, these results do not hold up well
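  • A heavily simplified sketch of the idea described above, substituting plain global average pooling for the paper's learned encoding layer: a global context vector both re-weights the channels and predicts per-class presence for the SE-loss (trained with BCE); all names here are my own
    import torch
    import torch.nn as nn

    class SimplifiedContextHead(nn.Module):
        """Sketch only, not the EncNet encoding layer."""
        def __init__(self, channels, num_classes):
            super().__init__()
            self.fc_scale = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())
            self.fc_se = nn.Linear(channels, num_classes)  # class-presence logits for the SE-loss

        def forward(self, x):
            ctx = x.mean(dim=(2, 3))                      # (N, C) global context vector
            scale = self.fc_scale(ctx)[:, :, None, None]  # channel re-weighting
            se_logits = self.fc_se(ctx)                   # feed to BCEWithLogitsLoss against class presence
            return x * scale, se_logits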

BiSeNet

DANet

  • paper Dual Attention Network for Scene Segmentation
  • git https://github.com/junfu1115/DANet/
  • danet1.png
  • Fuses spatial attention and channel attention
  • danet2.png
  • The concrete attention designs: the spatial one is fairly standard, while the channel one is a bit odd...
  • Going straight to the code (a quick shape check follows at the end of this section)
    import torch
    import torch.nn as nn


    class _PositionAttentionModule(nn.Module):
        """ Position attention module"""

        def __init__(self, in_channels, **kwargs):
            super(_PositionAttentionModule, self).__init__()
            self.conv_b = nn.Conv2d(in_channels, in_channels // 8, 1)
            self.conv_c = nn.Conv2d(in_channels, in_channels // 8, 1)
            self.conv_d = nn.Conv2d(in_channels, in_channels, 1)
            self.alpha = nn.Parameter(torch.zeros(1))
            self.softmax = nn.Softmax(dim=-1)

        def forward(self, x):
            batch_size, _, height, width = x.size()
            feat_b = self.conv_b(x).view(batch_size, -1, height * width).permute(0, 2, 1)
            feat_c = self.conv_c(x).view(batch_size, -1, height * width)
            attention_s = self.softmax(torch.bmm(feat_b, feat_c))  # (HW x HW) affinity between positions
            feat_d = self.conv_d(x).view(batch_size, -1, height * width)
            feat_e = torch.bmm(feat_d, attention_s.permute(0, 2, 1)).view(batch_size, -1, height, width)  # aggregate values weighted by position affinity
            out = self.alpha * feat_e + x

            return out


    class _ChannelAttentionModule(nn.Module):
        """Channel attention module"""

        def __init__(self, **kwargs):
            super(_ChannelAttentionModule, self).__init__()
            self.beta = nn.Parameter(torch.zeros(1))
            self.softmax = nn.Softmax(dim=-1)

        def forward(self, x):
            batch_size, _, height, width = x.size()
            feat_a = x.view(batch_size, -1, height * width)
            feat_a_transpose = x.view(batch_size, -1, height * width).permute(0, 2, 1)
            attention = torch.bmm(feat_a, feat_a_transpose)  # (C x C) channel affinity: x @ x^T
            attention_new = torch.max(attention, dim=-1, keepdim=True)[0].expand_as(attention) - attention  # max - x
            attention = self.softmax(attention_new)

            feat_e = torch.bmm(attention, feat_a).view(batch_size, -1, height, width)
            out = self.beta * feat_e + x

            return out
  • danet3.png
  • On Cityscapes, slightly better than DeepLab v3
  • danet4.png
  • On VOC, roughly on par with PSPNet
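  • A quick shape check of the two modules above (a small spatial size keeps the HW × HW position-attention matrix manageable)
    import torch

    x = torch.randn(2, 64, 32, 32)
    pam, cam = _PositionAttentionModule(64), _ChannelAttentionModule()
    print(pam(x).shape, cam(x).shape)  # both (2, 64, 32, 32)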

CGNet

OCNet

  • paper OCNet: Object Context Network for Scene Parsing
  • git https://github.com/PkuRainBow/OCNet.pytorch
  • From Microsoft
  • ocnet1.png
  • Shows several OC-based architectures; the paper does not draw the basic OC structure, so going straight to the code
  • OCM → BaseOC_Module OCP → BaseOC_Context_Module
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    # InPlaceABNSync (fused BN + activation) and torch_ver come from the OCNet repo


    class _SelfAttentionBlock(nn.Module):
        '''
        The basic implementation for self-attention block/non-local block
        Input:
            N X C X H X W
        Parameters:
            in_channels : the dimension of the input feature map
            key_channels : the dimension after the key/query transform
            value_channels : the dimension after the value transform
            scale : choose the scale to downsample the input feature maps (save memory cost)
        Return:
            N X C X H X W
            position-aware context features.(w/o concate or add with the input)
        '''
        def __init__(self, in_channels, key_channels, value_channels, out_channels=None, scale=1):
            super(_SelfAttentionBlock, self).__init__()
            self.scale = scale
            self.in_channels = in_channels
            self.out_channels = out_channels
            self.key_channels = key_channels
            self.value_channels = value_channels
            if out_channels == None:
                self.out_channels = in_channels
            self.pool = nn.MaxPool2d(kernel_size=(scale, scale))
            self.f_key = nn.Sequential(
                nn.Conv2d(in_channels=self.in_channels, out_channels=self.key_channels,
                          kernel_size=1, stride=1, padding=0),
                InPlaceABNSync(self.key_channels),
            )
            self.f_query = self.f_key
            self.f_value = nn.Conv2d(in_channels=self.in_channels, out_channels=self.value_channels,
                                     kernel_size=1, stride=1, padding=0)
            self.W = nn.Conv2d(in_channels=self.value_channels, out_channels=self.out_channels,
                               kernel_size=1, stride=1, padding=0)
            nn.init.constant(self.W.weight, 0)
            nn.init.constant(self.W.bias, 0)

        def forward(self, x):
            batch_size, h, w = x.size(0), x.size(2), x.size(3)
            if self.scale > 1:
                x = self.pool(x)

            value = self.f_value(x).view(batch_size, self.value_channels, -1)
            value = value.permute(0, 2, 1)
            query = self.f_query(x).view(batch_size, self.key_channels, -1)
            query = query.permute(0, 2, 1)
            key = self.f_key(x).view(batch_size, self.key_channels, -1)

            sim_map = torch.matmul(query, key)
            sim_map = (self.key_channels**-.5) * sim_map
            sim_map = F.softmax(sim_map, dim=-1)

            context = torch.matmul(sim_map, value)
            context = context.permute(0, 2, 1).contiguous()
            context = context.view(batch_size, self.value_channels, *x.size()[2:])
            context = self.W(context)
            if self.scale > 1:
                if torch_ver == '0.4':
                    context = F.upsample(input=context, size=(h, w), mode='bilinear', align_corners=True)
                elif torch_ver == '0.3':
                    context = F.upsample(input=context, size=(h, w), mode='bilinear')
            return context


    class SelfAttentionBlock2D(_SelfAttentionBlock):
        def __init__(self, in_channels, key_channels, value_channels, out_channels=None, scale=1):
            super(SelfAttentionBlock2D, self).__init__(in_channels,
                                                       key_channels,
                                                       value_channels,
                                                       out_channels,
                                                       scale)


    class BaseOC_Module(nn.Module):
        """
        Implementation of the BaseOC module
        Parameters:
            in_features / out_features: the channels of the input / output feature maps.
            dropout: we choose 0.05 as the default value.
            size: you can apply multiple sizes. Here we only use one size.
        Return:
            features fused with Object context information.
        """
        def __init__(self, in_channels, out_channels, key_channels, value_channels, dropout, sizes=([1])):
            super(BaseOC_Module, self).__init__()
            self.stages = []
            self.stages = nn.ModuleList([self._make_stage(in_channels, out_channels, key_channels, value_channels, size) for size in sizes])
            self.conv_bn_dropout = nn.Sequential(
                nn.Conv2d(2 * in_channels, out_channels, kernel_size=1, padding=0),
                InPlaceABNSync(out_channels),
                nn.Dropout2d(dropout)
            )

        def _make_stage(self, in_channels, output_channels, key_channels, value_channels, size):
            return SelfAttentionBlock2D(in_channels,
                                        key_channels,
                                        value_channels,
                                        output_channels,
                                        size)

        def forward(self, feats):
            priors = [stage(feats) for stage in self.stages]
            context = priors[0]
            for i in range(1, len(priors)):
                context += priors[i]
            output = self.conv_bn_dropout(torch.cat([context, feats], 1))
            return output


    class BaseOC_Context_Module(nn.Module):
        """
        Output only the context features.
        Parameters:
            in_features / out_features: the channels of the input / output feature maps.
            dropout: specify the dropout ratio
            fusion: We provide two different fusion method, "concat" or "add"
            size: we find that directly learn the attention weights on even 1/8 feature maps is hard.
        Return:
            features after "concat" or "add"
        """
        def __init__(self, in_channels, out_channels, key_channels, value_channels, dropout, sizes=([1])):
            super(BaseOC_Context_Module, self).__init__()
            self.stages = []
            self.stages = nn.ModuleList([self._make_stage(in_channels, out_channels, key_channels, value_channels, size) for size in sizes])
            self.conv_bn_dropout = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, padding=0),
                InPlaceABNSync(out_channels),
            )

        def _make_stage(self, in_channels, output_channels, key_channels, value_channels, size):
            return SelfAttentionBlock2D(in_channels,
                                        key_channels,
                                        value_channels,
                                        output_channels,
                                        size)

        def forward(self, feats):
            priors = [stage(feats) for stage in self.stages]
            context = priors[0]
            for i in range(1, len(priors)):
                context += priors[i]
            output = self.conv_bn_dropout(context)
            return output
  • The takeaway: the so-called OC module is a re-packaging of the familiar spatial self-attention
  • ocnet2.png
  • Adding OC on top of the ASPP structure gives a decent boost
  • ocnet3.png
  • Despite the simple modification, the results are excellent
  • ocnet4.png
  • Experiments on LIP also show strong results

DUNet

  • paper Decoders Matter for Semantic Segmentation: Data-Dependent Decoding Enables Flexible Feature Aggregation
  • dunet1.png
  • The overall architecture is conventional; the interesting part is how DUpsampling works
  • dunet2.png
  • I don't find this figure very intuitive, so going straight to the code
  • In essence a 1×1 conv maps the channels from c to c × factor × factor, followed by two reshapes that trade those channels for spatial resolution (a quick shape check follows at the end of this section)
    import torch
    import torch.nn as nn


    class DUpsampling(nn.Module):
        """DUpsampling module"""

        def __init__(self, in_channels, out_channels, scale_factor=2, **kwargs):
            super(DUpsampling, self).__init__()
            self.scale_factor = scale_factor
            self.conv_w = nn.Conv2d(in_channels, out_channels * scale_factor * scale_factor, 1, bias=False)

        def forward(self, x):
            x = self.conv_w(x)
            n, c, h, w = x.size()

            # N, C, H, W --> N, W, H, C
            x = x.permute(0, 3, 2, 1).contiguous()

            # N, W, H, C --> N, W, H * scale, C // scale
            x = x.view(n, w, h * self.scale_factor, c // self.scale_factor)

            # N, W, H * scale, C // scale --> N, H * scale, W, C // scale
            x = x.permute(0, 2, 1, 3).contiguous()

            # N, H * scale, W, C // scale --> N, H * scale, W * scale, C // (scale ** 2)
            x = x.view(n, h * self.scale_factor, w * self.scale_factor, c // (self.scale_factor * self.scale_factor))

            # N, H * scale, W * scale, C // (scale ** 2) --> N, C // (scale ** 2), H * scale, W * scale
            x = x.permute(0, 3, 1, 2)

            return x
  • dunet3.png
  • Comparison on VOC: DUpsampling does show an advantage over bilinear upsampling
  • dunet4.png
  • The method also works on DeepLab v3+, giving a 0.3% improvement
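  • A quick shape check of the DUpsampling module above: the 1×1 conv expands the channels by scale², and the reshapes trade those channels for spatial resolution (256 channels → 21 classes and 1/16 → 1/4 resolution here are just example numbers)
    import torch

    dup = DUpsampling(256, 21, scale_factor=4)
    print(dup(torch.randn(1, 256, 60, 60)).shape)  # (1, 21, 240, 240)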

fastFCN

LEDNET

Fast-SCNN

HRNet

DFANet

OCRNet

  • paper Object-Contextual Representations for Semantic Segmentation
  • ocrnet1.png
  • No source code has been released yet, so the exact operations are still open to interpretation
  • ocrnet2.png
  • ocrnet3.png
  • The results are striking; now just waiting for the official open-source release
  • It can be seen that our approach (HRNetV2 + OCR) achieves very competitive performance w/o using the video information or depth information. We then combine our OCR with ASPP [4] by replacing the global average pooling with our OCR, which (HRNetV2 + OCR (w/ ASP)) achieves 1st on 1 metric and 2nd on 3 of the 4 metrics with only a single model.
  • It also mentions that when combined with ASPP, simply replacing the global average pooling with OCR works remarkably well

Instance Segmentation

It is fair to say that the deep-learning wave in instance segmentation started with Mask R-CNN

Mask-RCNN

PANet

  • paper Path Aggregation Network for Instance Segmentation
  • git https://github.com/ShuLiu1993/PANet
  • PANet is another heavyweight with plenty of accolades
  • panet1.png
  • Stage (a) is an FPN; the red line marks the element-wise addition of low-level and high-level features. Stage (b) re-assembles the features in a bottom-up path; the green line plays a role similar to the red line in (a). Stage (c) performs RoI pooling and fuses the results to produce the final output
  • panet2.png
  • Breaking it down, this is the detail of stage (b): essentially the FPN upsampling is replaced by a stride-2 3×3 convolution
  • Note that N2 is simply P2, without any processing.
  • panet3.png
  • It is worth noting that in multi-stage detectors, RoI pooling is normally performed independently per FPN level, yet in this figure the RoI proposals are aligned across levels. There is actually an extra alignment step in between; after alignment the corresponding features are fused
  • https://github.com/ShuLiu1993/PANet/blob/master/lib/modeling/collect_and_distribute_fpn_rpn_proposals.py
  • """Merge RPN proposals generated at multiple FPN levels and then distribute those proposals to their appropriate FPN levels. An anchor at one FPN level may predict an RoI that will map to another level, hence the need to redistribute the proposals."""
  • The source code has a dedicated routine for exactly this; in short, all proposals are copied to every level to achieve the alignment (a hedged sketch of per-level pooling and fusion follows at the end of this section)
  • panet4.png
  • In the mask branch the authors also combine conv and fc paths to improve segmentation accuracy
  • panet5.png
  • It clearly outperforms Mask R-CNN w/ FPN, though with a visibly higher computational cost
  • panet6.png
  • On detection it is just as dominant
  • panet7.png
  • In the ablation studies the authors also re-implement Mask R-CNN and, with training tricks, improve it by 4.4 points
  • panet8.png
  • They share the recipe behind their first-place COCO entry; each of these techniques clearly brings a hefty gain
  • Summary: a paper packed with practical content, well worth reading
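  • A hedged illustration of the adaptive feature pooling idea described above (RoIs pooled from every FPN level, then fused), sketched with torchvision's roi_align; the (feature, stride) pairs, box format, and element-wise max fusion point are simplifications of the paper's per-head fusion
    import torch
    from torchvision.ops import roi_align

    def adaptive_feature_pooling(fpn_feats, boxes, output_size=7):
        # pool every RoI from all FPN levels instead of a single assigned level
        pooled = [
            roi_align(feat, boxes, output_size=output_size, spatial_scale=1.0 / stride)
            for feat, stride in fpn_feats          # e.g. [(P2, 4), (P3, 8), (P4, 16), (P5, 32)]
        ]
        # fuse with element-wise max (the paper fuses inside each head and also reports sum fusion)
        fused = pooled[0]
        for p in pooled[1:]:
            fused = torch.max(fused, p)
        return fused

    feats = [(torch.randn(1, 256, s, s), 256 // s) for s in (64, 32, 16, 8)]  # hypothetical P2-P5
    boxes = [torch.tensor([[10., 10., 100., 120.]])]
    print(adaptive_feature_pooling(feats, boxes).shape)  # (1, 256, 7, 7)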

MS R-CNN

yolact

  • paper YOLACT Real-time Instance Segmentation
  • git https://github.com/dbolya/yolact
  • yolact1.png
  • Based on RetinaNet. The head predicts bbox, class, and mask coefficients; after NMS these are combined with the protonet output to obtain each instance's mask (a sketch of the assembly step follows at the end of this section)
  • yolact2.png
  • The head differs slightly: the tower is shared to reduce computation and parameters, and an extra mask-coefficient branch is added; its dimension k is set in the config
    if cfg.mask_type == mask_type.direct:
        cfg.mask_dim = cfg.mask_size ** 2          # cfg.mask_size is 16 by default
    elif cfg.mask_type == mask_type.lincomb:
        cfg.mask_dim = num_grids + num_features
  • loss: Since each pixel can be assigned to more than one class, we use sigmoid and c channels instead of softmax and c + 1. This loss is given a weight of 1 and results in a +0.4 mAP boost.
  • yolact3.png
  • Overall, at comparable accuracy it is dramatically faster than FCIS, making it a good choice for real-time use
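  • A minimal sketch of the assembly step described above, assuming protonet outputs an H × W × k prototype tensor and the head outputs k coefficients per detection (cropping with the predicted box and thresholding are omitted)
    import torch

    def assemble_masks(protos, coeffs):
        # each detection's mask is a linear combination of the prototypes,
        # weighted by its coefficients and passed through a sigmoid
        # protos: (H, W, k), coeffs: (N, k)  ->  (H, W, N)
        return torch.sigmoid(protos @ coeffs.t())

    masks = assemble_masks(torch.randn(138, 138, 32), torch.randn(5, 32))
    print(masks.shape)  # (138, 138, 5)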

PolarMask

  • paper PolarMask: Single Shot Instance Segmentation with Polar Representation
  • git https://github.com/xieenze/PolarMask
  • polarmask1.png
  • Shows the details of Cartesian versus polar-coordinate modeling
  • polarmask2.png
  • The paper extends FCOS: a bbox is viewed as a polygon with 4 angular divisions in polar coordinates, and a mask as a polygon with arbitrarily many angular divisions
  • polarmask3.png
  • Centerness is computed differently; later experiments demonstrate the advantage of this polar centerness (a sketch follows at the end of this section)
  • polarmask4.png
  • Since the polar representation is used, IoU must be computed differently as well; although written as an integral, in practice it is discretized into n angular bins
  • polarmask5.png
  • (left → right, top → bottom)
    • rays is the number of angular divisions; experiments show 36 is roughly enough
    • Comparing smooth-L1 with the Polar IoU loss: skipping the IoU loss really does not work
    • Polar centerness works better than Cartesian centerness
    • The box branch makes little difference either way
    • The stronger the backbone, the better
    • Larger input scales also help, unsurprisingly
  • polarmask6.png
  • With only 12 epochs of training and no augmentation it reaches 30+ COCO mask mAP
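  • A hedged sketch of the two polar-specific pieces discussed above, polar centerness and the discretized Polar IoU loss; d_pred / d_gt are hypothetical (N, n_rays) tensors of ray lengths
    import torch

    def polar_centerness(d):
        # Polar Centerness = sqrt(min(d) / max(d)) over the rays of each sample
        return torch.sqrt(d.min(dim=1).values / d.max(dim=1).values.clamp(min=1e-6))

    def polar_iou_loss(d_pred, d_gt):
        # discretized Polar IoU ~ sum(min(d, d*)) / sum(max(d, d*)); loss is its negative log
        d_min = torch.min(d_pred, d_gt).sum(dim=1)
        d_max = torch.max(d_pred, d_gt).sum(dim=1)
        return torch.log(d_max / d_min.clamp(min=1e-6)).mean()

    d_pred, d_gt = torch.rand(4, 36) + 0.1, torch.rand(4, 36) + 0.1
    print(polar_centerness(d_gt), polar_iou_loss(d_pred, d_gt))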

CenterMask

  • paper CenterMask: Real-Time Anchor-Free Instance Segmentation
  • centermask1.png
  • Amazing! It beats Mask R-CNN in both speed and accuracy, and also beats YOLACT by a wide margin among real-time models. Let's see how it is done
  • centermask2.png
  • Overall architecture: FCOS serves as an RPN-like network producing bboxes and classes; each bbox then passes through the SAG-Mask branch, which suppresses pixel-level noise, to produce the mask
    • Not shown in the figure, but the authors also describe an Adaptive RoI Assignment Function for adaptive multi-level assignment, normalizing boxes of different sizes to a single pooled size
    • SAM is a pooling + sigmoid + element-wise-multiplication spatial attention guided mask (i.e., just a spatial attention)
  • centermask3.png
  • A highlight of the paper is VoVNetV2, which improves VoVNet's performance; the main changes are in the OSA module
  • Residual connection: figure (b) above
  • eSE: figure (c) above; essentially GAP, then an fc, then element-wise multiplication (ECA-Net does much the same under a different name); a sketch follows at the end of this section
  • centermask4.png
  • Taking FCOS-R50 as an example, the authors show the steps of converting it into CenterMask and the added time cost; the mask scoring here is the mask-IoU branch from MS R-CNN mentioned earlier, adding roughly 15 ms per image, which is acceptable
  • centermask5.png
  • Shows the effect of the two VoVNetV2 improvements: for a small increase in runtime, accuracy improves noticeably, a good trade-off
  • centermask6.png
  • Compared against mainstream backbones (ResNet, ResNeXt, HRNet): at equal or slightly higher accuracy, VoVNet runs faster on GPU within CenterMask
  • centermask7.png
  • Compared with current real-time instance segmentation architectures, CenterMask delivers a sizable accuracy gain at the same speed
  • A highly recommendable method
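  • A minimal sketch of the eSE block as described above (GAP → a single 1×1 conv acting as the fc, no channel reduction → gating → channel-wise multiplication); the hard sigmoid follows VoVNetV2, and the module and variable names are my own
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class eSEModule(nn.Module):
        """Sketch only; not the official CenterMask implementation."""
        def __init__(self, channels):
            super().__init__()
            self.fc = nn.Conv2d(channels, channels, kernel_size=1)

        def forward(self, x):
            w = F.adaptive_avg_pool2d(x, 1)   # (N, C, 1, 1) channel descriptor (GAP)
            w = F.hardsigmoid(self.fc(w))     # gating weights in [0, 1]
            return x * w                      # channel-wise re-weighting

    print(eSEModule(256)(torch.randn(1, 256, 14, 14)).shape)  # (1, 256, 14, 14)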

Dataset


Semantic Segmentation

Pascal VOC 2012

Cityscapes

  • home https://www.cityscapes-dataset.com/
  • download: available from the home page; registration is required
  • baiduyun https://pan.baidu.com/s/1w3W_dQBUiHcwkLOtbSJ1Tg 1bln
  • 30 classes. We present a new large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high quality pixel-level annotations of 5 000 frames in addition to a larger set of 20 000 weakly annotated frames. The dataset is thus an order of magnitude larger than similar previous attempts. Details on annotated classes and examples of our annotations are available at this webpage.

ADE20K

Instance Segmentation

COCO 17

  • home http://cocodataset.org/#download | http://cocodataset.org/#stuff-2017
  • download http://cocodataset.org/#download
    • There are quite a few files to download; just grab everything listed
  • The task includes 55K COCO images (train 40K, val 5K, test-dev 5K, test-challenge 5K) with annotations for 91 stuff classes and 1 ‘other’ class. The stuff annotations cover 38M superpixels (10B pixels) with 296K stuff regions (5.4 stuff labels per image). Annotations for train and val are now available for download, while test set annotations will remain private.