Good Good Study



linux

Posted on 2019-03-28

Common commands

vncserver -geometry 1920x1080 create a VNC session at 1920x1080
vncserver -kill :1 kill the VNC session with display ID 1
mkdir XXX create a directory
mkdir -p AAA/BBB create BBB, creating AAA first if it does not exist
cd XX change to a directory
cd .. go up one level
cd ~ go to the home directory
pip install install a Python package
nvidia-smi show GPU usage
./XX.sh run a .sh script
rm -rf XXX delete a directory and everything in it
rm XXX delete a file
cp -a XX(source) YY(destination) copy XX (recursively, preserving attributes) to YY
mv XX YY move XX to YY
df -h show filesystem disk usage
du -sh * show the size of each file and directory under the current directory
du -sh FILE/ show the total size of a directory
tar -zxvf XXX.tar.gz extract a .tar.gz archive
tar -zcvf XXX.tar.gz XXX pack XXX into a .tar.gz archive

Installation

zjuvpn installation

https://www.cc98.org/topic/2323871

1. Remove the old xl2tpd package:
sudo dpkg --purge xl2tpd

2. Download:
https://pan.baidu.com/s/1eRNQwng#list/path=%2F

3. Install:
sudo dpkg -i xl2tpd_1.1.12-zju2_i386.deb

If it reports an error that iproute is missing, install it first:
sudo apt-get install iproute

4. Configure:
sudo vpn-connect -c

Follow the prompts; note that the username is your student ID followed by @a.

5. Connect:
sudo vpn-connect

6. Disconnect:
sudo vpn-connect -d

High Dynamic Range Paper

Posted on 2019-03-25

Using a series of low dynamic range images at different exposures

Generally, this problem can be broken down into two stages: 1) aligning the input LDR images and 2) merging the aligned images into an HDR image.

This classic merging approach produces spectacular images for tripod-mounted cameras and static scenes, but generates results with ghosting artifacts when the scene is dynamic or the camera is hand-held.


Deep High Dynamic Range Imaging of Dynamic Scenes - SIGGRAPH2017

-the artifacts of the alignment can be significantly reduced during merging

-Preprocessing the Input LDR Images:

If the LDR images are not in the RAW format, we first linearize them using the camera response function (CRF), then apply gamma correction (γ = 2.2). The gamma correction basically maps the images into a domain that is closer to what we perceive with our eyes.

-Alignment:

Produce aligned images by registering the images with low (Z1) and high (Z3) exposures to the reference image Z2 using traditional methods (optical flow).

-HDR Merge:

1)Model:

2)Loss Function:

Since HDR images are usually displayed after tonemapping, we propose to compute our loss function between the tonemapped estimated and ground truth HDR images. We propose to use μ-law, a commonly-used range compressor in audio processing, which is differentiable.

We train the learning system by minimizing the L2 distance between the tonemapped estimated and ground truth HDR images, defined as:
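A reconstruction of the missing formulas (the μ-law tonemapper with compression parameter μ, which the paper sets to a large value such as μ = 5000):

T(H) = \frac{\log(1 + \mu H)}{\log(1 + \mu)}, \qquad \mathcal{L} = \left\lVert T(\hat{H}) - T(H) \right\rVert_2^2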


Deep High Dynamic Range Imaging with Large Foreground Motions - ECCV2018

-CNNs have been demonstrated to have the ability to learn misalignment and hallucinate missing details

-Three advantages: 1) trained end-to-end without optical flow alignment. 2) can hallucinate plausible details that are totally missing or whose presence is extremely weak in all LDR inputs. 3) the same framework can be easily extended to more LDR inputs, and possibly with any specified reference image.

-Network Architecture

We separate the first two layers as encoders for each exposure input. After extracting the features, the network learns to merge them, mostly in the middle layers, and to decode them into an HDR output, mostly in the last few layers.

-Processing Pipeline and Loss Function

Given a stack of LDR images, if they are not in RAW format, we first linearize the images using the estimated inverse of Camera Response Function (CRF), which is often referred to as radiometric calibration. We then apply gamma correction to produce the input to our system.

We first map LDRs to H = {H1;H2;H3} in the HDR domain, using simple gamma encoding:
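The mapping, as a reconstruction of the missing formula (γ = 2.2):

H_i = \frac{I_i^{\gamma}}{t_i}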

where ti is the exposure time of image Ii. We then concatenate I and H channel-wise into a 6-channel input and feed it directly to the network. The LDRs facilitate the detection of misalignments and saturation, while the exposure-adjusted HDRs improve the robustness of the network across LDRs with various exposure levels.

Tonemapping function and loss function are the same as the previous paper.

-Data Preparation

First align the backgrounds using a simple homography transformation. Without it, we found that our network tends to produce blurry edges where the background is largely misaligned.

Crop the images into 256x256 patches with a stride of 64. To keep the training focused on foreground motions, we detect large motion patches by thresholding the structural similarity between different exposure shots, and replicate these patches in the training set.

Using a single low dynamic range image

One intrinsic limitation of this approach is the total reliance on one single input LDR image, which often fails in highly contrastive scenes due to large-scale saturation.

HDR image reconstruction from a single exposure using deep CNNs - 1710

-Estimating missing information in bright image parts, such as highlights, lost due to saturation of the camera sensor


ExpandNet: A Deep Convolutional Neural Network for High Dynamic Range Expansion from Low Dynamic Range Content - EUROGRAPHICS 2018

-Designed to avoid upsampling of downsampled features, in an attempt to reduce blocking and/or haloing artefacts that may arise from more straightforward approaches.

-It is argued that upsampling, especially the frequently used deconvolutional layers, cause checkerboard artefacts. Furthermore, upsampling may cause unwanted information bleeding in areas where context is missing, for example large overexposed areas.

-The local branch handles local detail, the dilation branch medium-level detail, and the global branch higher-level, image-wide features.

-Loss Function

The L1 distance is chosen for this problem since the more frequently used L2 distance was found to cause blurry results for images. An additional cosine similarity term is added to ensure color correctness of the RGB vectors of each pixel.

Cosine similarity measures how close two vectors are by comparing the angle between them, not taking magnitude into account. For the context of this work, it ensures that each pixel points in the same direction of the three dimensional RGB space. It provides improved color stability, especially for low luminance values, which are frequent in HDR images, since slight variations in any of the RGB components of these low values do not contribute much to the L1 loss, but they may however cause noticeable color shifts.
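A minimal PyTorch sketch of such a loss, paraphrasing the description above (the weight lambda_cos is an assumed hyperparameter, not a value from the paper):

import torch
import torch.nn.functional as F

def expandnet_style_loss(pred, target, lambda_cos=5.0):
    """L1 distance plus a cosine-similarity term on per-pixel RGB vectors.

    pred, target: (B, 3, H, W) predicted / ground-truth HDR tensors.
    lambda_cos is a hypothetical weighting factor.
    """
    l1 = F.l1_loss(pred, target)
    # cosine similarity along the channel (RGB) dimension, one value per pixel
    cos = F.cosine_similarity(pred, target, dim=1)
    return l1 + lambda_cos * (1.0 - cos).mean()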

Dataset

proposed by Kalantari(Deep High Dynamic Range Imaging of Dynamic Scenes):

To generate the ground truth HDR image, we capture a static set by asking a subject to stay still and taking three images with different exposures on a tripod.

Next, we capture a dynamic set to use as our input by asking the subject to move and taking three bracketed exposure images either by holding the camera (to simulate camera motion) or on a tripod.

Capture all the images in RAW format at a resolution of 5760 × 3840 using a Canon EOS-5D Mark III camera. Downsample all the images (including the dynamic set) to a resolution of 1500 × 1000.

Use color channel swapping and geometric transformation (rotating 90 degrees and flipping) with 6 and 8 different combinations, respectively. This process produces a total of 48 different combinations of data augmentation, from which we randomly choose 10 combinations to augment each training scene. Our data augmentation process increases the number of training scenes from 74 to 740.

Finally, since training on full images is slow, we break down the training images into overlapping patches of size 40 × 40 with a stride of 20. This process produces a set of training patches consisting of the aligned patches in the LDR and HDR domains as well as their corresponding ground truth HDR patches. We then select the training patches where more than 50 percent of their reference patch is under/over-exposed, which results in around 1,000,000 selected patches. This selection is performed to put the main focus of the networks on the challenging regions.

Described in DeepHDR

The dataset was split into 74 training examples and 15 testing examples. Crop the images into 256x256 patches with a stride of 64, which produces around 19000 patches. We then perform data augmentation (flipping and rotation), further increasing the training data by 8 times.

In fact, a large portion of these patches contain only background regions, and exhibit little foreground motions. To keep the training focused on foreground motions, we detect large motion patches by thresholding the structural similarity
between different exposure shots, and replicate these patches in the training set.

ExpandNet

Results

Sen: Robust Patch-Based HDR Reconstruction of Dynamic Scenes. ACM TOG 31(6), 203:1-203:11 (2012)

Hu: HDR Deghosting: How to deal with Saturation? In: IEEE CVPR (2013)

Kalantari: Deep High Dynamic Range Imaging of Dynamic Scenes. ACM TOG 36(4) (2017)

HDRCNN: HDR image reconstruction from a single exposure using deep cnns. ACM TOG 36(6) (2017)

Ours: Deep High Dynamic Range Imaging with Large Foreground Motions

Running Time

PC with i7-4790K (4.0GHz) and 32GB RAM, 3 LDR images of size 896x1408 as input.

When run with GPU (Titan X Pascal), our Unet and ResNet take 0.225s and 0.239s respectively.

Quantitative Comparison

Pytorch

Posted on 2019-01-23

view

transpose

contiguous

repeat

squeeze

unsqueeze

.detach().numpy()
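A minimal sketch illustrating what these operations do (shapes chosen arbitrarily for the example):

import torch

x = torch.arange(24).reshape(2, 3, 4)   # shape (2, 3, 4)

v = x.view(2, 12)        # reshape without copying; requires contiguous memory
t = x.transpose(1, 2)    # swap dims 1 and 2 -> shape (2, 4, 3); a non-contiguous view
c = t.contiguous()       # copy into contiguous memory so .view() works again
r = x.repeat(2, 1, 1)    # tile the tensor -> shape (4, 3, 4)
u = x.unsqueeze(0)       # add a size-1 dim -> shape (1, 2, 3, 4)
s = u.squeeze(0)         # remove the size-1 dim -> back to (2, 3, 4)

# .detach() cuts the autograd graph; .numpy() converts the CPU tensor to a numpy array
y = (x.float() * 2).detach().numpy()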

【leetcode】771:Jewels and Stones

Posted on 2018-12-21

771. Jewels and Stones

Difficulty: Easy

Problem description

You are given a string J representing the types of stones that are jewels, and a string S representing the stones you have. Each character of S is a type of stone you have, and you want to know how many of your stones are jewels.

The letters in J are distinct, and all characters in J and S are letters. Letters are case sensitive, so "a" is considered a different type of stone from "A".

Examples

Input: J = "aA", S = "aAAbbbb"
Output: 3
Input: J = "z", S = "ZZ"
Output: 0

Notes

  • S and J will consist of at most 50 letters.
  • The characters in J are distinct.

Solution 1

class Solution {
public:
    int numJewelsInStones(string J, string S) {
        int res = 0;
        for (char s : S) {
            for (char j : J) {
                if (s == j) {
                    ++res;
                    break;
                }
            }
        }
        return res;
    }
};

Solution 2

class Solution {
public:
    int numJewelsInStones(string J, string S) {
        int res = 0;
        unordered_set<char> s;
        for (char c : J) s.insert(c);
        for (char c : S) {
            if (s.count(c)) ++res;
        }
        return res;
    }
};

Approach

Use a hash set to optimize the time complexity: insert every character of the jewel string J into the hash set, then iterate over each character of the stone string and look it up in the hash set; if it is present, increment the counter.

Caffe

Posted on 2018-12-20

Image Classification and Filter Visualization

Instant recognition with a pre-trained model and a tour of the net interface for visualizing features and parameters layer-by-layer.

set CPU mode and load net for test

caffe.set_mode_cpu()

net = caffe.Net(model_def,      # defines the structure of the model
                model_weights,  # contains the trained weights
                caffe.TEST)     # use test mode (e.g., don't perform dropout)

input preprocessing

Set up input preprocessing. (We'll use Caffe's caffe.io.Transformer to do this, but this step is independent of other parts of Caffe, so any custom preprocessing code may be used).

Our default CaffeNet is configured to take images in BGR format. Values are expected to start in the range [0, 255] and then have the mean ImageNet pixel value subtracted from them. In addition, the channel dimension is expected as the first (outermost) dimension.

As matplotlib will load images with values in the range [0, 1] in RGB format with the channel as the innermost dimension, we are arranging for the needed transformations here.

# create transformer for the input called 'data'
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})

transformer.set_transpose('data', (2,0,1)) # move image channels to outermost dimension
transformer.set_mean('data', mu) # subtract the dataset-mean value in each channel
transformer.set_raw_scale('data', 255) # rescale from [0, 1] to [0, 255]
transformer.set_channel_swap('data', (2,1,0)) # swap channels from RGB to BGR

set the size of the input

# set the size of the input (we can skip this if we're happy
# with the default; we can also change it later, e.g., for different batch sizes)
net.blobs['data'].reshape(50,        # batch size
                          3,         # 3-channel (BGR) images
                          227, 227)  # image size is 227x227

load an image and perform the preprocessing

image = caffe.io.load_image(caffe_root + 'examples/images/cat.jpg')
transformed_image = transformer.preprocess('data', image)

classify

# copy the image data into the memory allocated for the net
net.blobs['data'].data[...] = transformed_image

### perform classification
output = net.forward()

output_prob = output['prob'][0] # the output probability vector for the first image in the batch

print 'predicted class is:', output_prob.argmax()

top5

# sort top five predictions from softmax output
top_inds = output_prob.argsort()[::-1][:5] # reverse sort and take five largest items

print 'probabilities and labels:'
zip(output_prob[top_inds], labels[top_inds])

switch to GPU

caffe.set_device(0)  # if we have multiple GPUs, pick the first one
caffe.set_mode_gpu()
net.forward() # run once before timing to set up memory

show the output shape of each layers

for layer_name, blob in net.blobs.iteritems():
    print layer_name + '\t' + str(blob.data.shape)

data	(50, 3, 227, 227)
conv1 (50, 96, 55, 55)
pool1 (50, 96, 27, 27)
norm1 (50, 96, 27, 27)
conv2 (50, 256, 27, 27)
pool2 (50, 256, 13, 13)
norm2 (50, 256, 13, 13)
conv3 (50, 384, 13, 13)
conv4 (50, 384, 13, 13)
conv5 (50, 256, 13, 13)
pool5 (50, 256, 6, 6)
fc6 (50, 4096)
fc7 (50, 4096)
fc8 (50, 1000)
prob (50, 1000)

show the parameter shape
We need to index the resulting values with either [0] for weights or [1] for biases.
The param shapes typically have the form (output_channels, input_channels, filter_height, filter_width) (for the weights) and the 1-dimensional shape (output_channels,) (for the biases).

for layer_name, param in net.params.iteritems():
    print layer_name + '\t' + str(param[0].data.shape), str(param[1].data.shape)

conv1	(96, 3, 11, 11) (96,)
conv2 (256, 48, 5, 5) (256,)
conv3 (384, 256, 3, 3) (384,)
conv4 (384, 192, 3, 3) (384,)
conv5 (256, 192, 3, 3) (256,)
fc6 (4096, 9216) (4096,)
fc7 (4096, 4096) (4096,)
fc8 (1000, 4096) (1000,)

Learning LeNet

Define, train, and test the classic LeNet with the Python interface.

create LeNet

We’ll need two external files to help out:

  • the net prototxt, defining the architecture and pointing to the train/test data
  • the solver prototxt, defining the learning parameters
from caffe import layers as L, params as P

def lenet(lmdb, batch_size):
    # our version of LeNet: a series of linear and simple nonlinear transformations
    n = caffe.NetSpec()

    n.data, n.label = L.Data(batch_size=batch_size, backend=P.Data.LMDB, source=lmdb,
                             transform_param=dict(scale=1./255), ntop=2)

    n.conv1 = L.Convolution(n.data, kernel_size=5, num_output=20, weight_filler=dict(type='xavier'))
    n.pool1 = L.Pooling(n.conv1, kernel_size=2, stride=2, pool=P.Pooling.MAX)
    n.conv2 = L.Convolution(n.pool1, kernel_size=5, num_output=50, weight_filler=dict(type='xavier'))
    n.pool2 = L.Pooling(n.conv2, kernel_size=2, stride=2, pool=P.Pooling.MAX)
    n.fc1 = L.InnerProduct(n.pool2, num_output=500, weight_filler=dict(type='xavier'))
    n.relu1 = L.ReLU(n.fc1, in_place=True)
    n.score = L.InnerProduct(n.relu1, num_output=10, weight_filler=dict(type='xavier'))
    n.loss = L.SoftmaxWithLoss(n.score, n.label)

    return n.to_proto()

with open('mnist/lenet_auto_train.prototxt', 'w') as f:
    f.write(str(lenet('mnist/mnist_train_lmdb', 64)))

with open('mnist/lenet_auto_test.prototxt', 'w') as f:
    f.write(str(lenet('mnist/mnist_test_lmdb', 100)))

The net has been written to disk in a more verbose but human-readable serialization format using Google’s protobuf library. You can read, write, and modify this description directly. Let’s take a look at the train net.

layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  transform_param {
    scale: 0.00392156862745
  }
  data_param {
    source: "mnist/mnist_train_lmdb"
    batch_size: 64
    backend: LMDB
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 20
    kernel_size: 5
    weight_filler {
      type: "xavier"
    }
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  convolution_param {
    num_output: 50
    kernel_size: 5
    weight_filler {
      type: "xavier"
    }
  }
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "conv2"
  top: "pool2"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "fc1"
  type: "InnerProduct"
  bottom: "pool2"
  top: "fc1"
  inner_product_param {
    num_output: 500
    weight_filler {
      type: "xavier"
    }
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "fc1"
  top: "fc1"
}
layer {
  name: "score"
  type: "InnerProduct"
  bottom: "fc1"
  top: "score"
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "xavier"
    }
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "score"
  bottom: "label"
  top: "loss"
}

Now let’s see the learning parameters, which are also written as a prototxt file (already provided on disk). We’re using SGD with momentum, weight decay, and a specific learning rate schedule

# The train/test net protocol buffer definition
train_net: "mnist/lenet_auto_train.prototxt"
test_net: "mnist/lenet_auto_test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# Display every 100 iterations
display: 100
# The maximum number of iterations
max_iter: 10000
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "mnist/lenet"

loading and checking the solver

caffe.set_device(0)
caffe.set_mode_gpu()

### load the solver and create train and test nets
solver = None # ignore this workaround for lmdb data (can't instantiate two solvers on the same data)
solver = caffe.SGDSolver('mnist/lenet_auto_solver.prototxt')

check

# each output is (batch size, feature dim, spatial dim)
[(k, v.data.shape) for k, v in solver.net.blobs.items()]

# just print the weight sizes (we'll omit the biases)
[(k, v[0].data.shape) for k, v in solver.net.params.items()]

solver.net.forward()  # train net
solver.test_nets[0].forward() # test net (there can be more than one)

Writing a custom training loop

niter = 200
test_interval = 25
# losses will also be stored in the log
train_loss = zeros(niter)
test_acc = zeros(int(np.ceil(niter / test_interval)))
output = zeros((niter, 8, 10))

# the main solver loop
for it in range(niter):
    solver.step(1)  # SGD by Caffe

    # store the train loss
    train_loss[it] = solver.net.blobs['loss'].data

    # store the output on the first test batch
    # (start the forward pass at conv1 to avoid loading new data)
    solver.test_nets[0].forward(start='conv1')
    output[it] = solver.test_nets[0].blobs['score'].data[:8]

    # run a full test every so often
    # (Caffe can also do this for us and write to a log, but we show here
    #  how to do it directly in Python, where more complicated things are easier.)
    if it % test_interval == 0:
        print 'Iteration', it, 'testing...'
        correct = 0
        for test_it in range(100):
            solver.test_nets[0].forward()
            correct += sum(solver.test_nets[0].blobs['score'].data.argmax(1)
                           == solver.test_nets[0].blobs['label'].data)
        test_acc[it // test_interval] = correct / 1e4

define solver

from caffe.proto import caffe_pb2
s = caffe_pb2.SolverParameter()

# Set a seed for reproducible experiments:
# this controls for randomization in training.
s.random_seed = 0xCAFFE

# Specify locations of the train and (maybe) test networks.
s.train_net = train_net_path
s.test_net.append(test_net_path)
s.test_interval = 500 # Test after every 500 training iterations.
s.test_iter.append(100) # Test on 100 batches each time we test.

s.max_iter = 10000 # no. of times to update the net (training iterations)

# EDIT HERE to try different solvers
# solver types include "SGD", "Adam", and "Nesterov" among others.
s.type = "SGD"

# Set the initial learning rate for SGD.
s.base_lr = 0.01 # EDIT HERE to try different learning rates
# Set momentum to accelerate learning by
# taking weighted average of current and previous updates.
s.momentum = 0.9
# Set weight decay to regularize and prevent overfitting
s.weight_decay = 5e-4

# Set `lr_policy` to define how the learning rate changes during training.
# This is the same policy as our default LeNet.
s.lr_policy = 'inv'
s.gamma = 0.0001
s.power = 0.75
# EDIT HERE to try the fixed rate (and compare with adaptive solvers)
# `fixed` is the simplest policy that keeps the learning rate constant.
# s.lr_policy = 'fixed'

# Display the current training loss and accuracy every 1000 iterations.
s.display = 1000

# Snapshots are files used to store networks we've trained.
# We'll snapshot every 5K iterations -- twice during training.
s.snapshot = 5000
s.snapshot_prefix = 'mnist/custom_net'

# Train on the GPU
s.solver_mode = caffe_pb2.SolverParameter.GPU

# Write the solver to a temporary file and return its filename.
with open(solver_config_path, 'w') as f:
    f.write(str(s))


Fine-tuning for Style Recognition

Fine-tune the ImageNet-trained CaffeNet on new data.
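No code was kept for this post; a minimal pycaffe sketch of the idea (the prototxt/caffemodel paths are placeholders, not the tutorial's actual files):

import caffe

caffe.set_mode_gpu()

# Solver prototxt pointing at the new (style) dataset; the path is a placeholder.
solver = caffe.SGDSolver('models/finetune_style/solver.prototxt')

# Initialize layers whose names match CaffeNet from the ImageNet-trained weights;
# the renamed classifier layer keeps its random initialization.
solver.net.copy_from('models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel')

solver.step(200)  # fine-tune for a few hundred iterations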


Object Detection Paper Summarization

Posted on 2018-12-14

Basic Detection Framework

Two stage:

[R-CNN]
[SPP-Net]
[Fast R-CNN]
[Faster R-CNN]

One stage:

[YOLO]
[SSD]
[YOLOv2]
[YOLOv3]

Improvement for Two Stage Framework

To alleviate the problems arising from scale variation and small object instances

Construct feature pyramid

[FPN - CVPR2017] Feature Pyramid Networks for Object Detection
[paper] https://arxiv.org/abs/1612.03144
[summarization]
-Difference with previous segmentation methods which use top-down and skip connections architecture: predictions are independently made on each level.

-The topdown pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels.
-Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway.
-The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times.

[TDM - CVPR2017] Beyond Skip Connections: Top-Down Modulation for Object Detection

Multi-scale training approach (image pyramid)

[SNIP - CVPR2018] An Analysis of Scale Invariance in Object Detection – SNIP
[paper] https://arxiv.org/abs/1711.08189
[summarization]
-Scale Normalization for Image Pyramids

-Use image pyramids; for each image (at multiple scales), training is only performed on objects that fall in the desired scale range and the remainder are simply ignored during back-propagation

-Deformable R-FCN + Soft-NMS ResNet-101/DPN-98 -> 48.3mAP

[SNIPER - NIPS2018] SNIPER: Efficient Multi-Scale Training
[paper] https://arxiv.org/abs/1805.09300
[official code] https://github.com/mahyarnajibi/SNIPER
[summarization]
-Chip Generation: For each scale, K×K-pixel chips are placed at equal intervals of d pixels

-Positive Chip Selection: A ground-truth box is said to be covered if it is completely enclosed inside a chip. Ground-truth instances which have a partial overlap (IoU > 0) with a chip are cropped. All the cropped ground-truth boxes (valid or invalid) are retained in the chip

-Negative Chip Selection: First train an RPN for a couple of epochs. Then, for each scale i, we greedily select all the chips which cover at least M proposals.

To improve localization accuracy

Improve bounding box refinement method

Previous methods use iterative bounding box regression to refine a bounding box.
This idea ignores two problems:
(1) a regressor trained at a low IoU threshold (such as 0.5, used to define positives/negatives) is suboptimal for proposals of higher IoUs.
(2) the distribution of bounding boxes changes significantly after each iteration.
Usually, there is no benefit beyond applying the same regression function twice.

[Cascade R-CNN]

[IoU-Net]

Predict localization confidence

-Two drawbacks without localization confidence:
(1) In nms, the classification scores are typically used as the metric for ranking the proposals. But the localization accuracy is not well correlated with the classification confidence.
(2) The absence of localization confidence makes the widely adopted bounding box regression less interpretable. Bounding box regression may degenerate the localization of input bounding boxes if applied multiple times.

[IoU-Net ECCV2018] Acquisition of Localization Confidence for Accurate Object Detection
[paper] https://arxiv.org/abs/1807.11590
[official code] https://github.com/vacancy/PreciseRoIPooling (PreciseRoIPooling)
[summarization]
-Introduce IoU-Net, which predicts the IoU between detected bounding boxes and their corresponding ground-truth boxes, making the networks aware of the localization criterion.

-Generate bounding boxes and labels for training the IoU-Net by augmenting the ground-truth, instead of taking proposals from RPNs.

-IoU-guided NMS: Replace the classification confidence with the predicted IoU as the ranking keyword in NMS. When a box i eliminates a box j, update the classification confidence si of box i by si = max(si, sj) (see the sketch after this list).

-New bounding box refinement method: Optimization-based bounding box refinement (on par with traditional regression-based methods.)

-Precise RoI Pooling: It avoids any quantization of coordinates and has a continuous gradient on bounding box coordinates.
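A rough Python/PyTorch sketch of the IoU-guided NMS rule described above (my own notation, not the authors' implementation):

import torch

def _iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = torch.max(a[0], b[0]), torch.max(a[1], b[1])
    x2, y2 = torch.min(a[2], b[2]), torch.min(a[3], b[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def iou_guided_nms(boxes, cls_scores, pred_ious, iou_thresh=0.5):
    """Rank boxes by the predicted localization IoU instead of the
    classification score; when box i suppresses box j, keep s_i = max(s_i, s_j)."""
    order = pred_ious.argsort(descending=True)
    boxes, scores = boxes[order], cls_scores[order].clone()

    keep = []
    alive = torch.ones(len(boxes), dtype=torch.bool)
    for i in range(len(boxes)):
        if not alive[i]:
            continue
        keep.append(i)
        for j in range(i + 1, len(boxes)):
            if alive[j] and _iou(boxes[i], boxes[j]) > iou_thresh:
                scores[i] = torch.max(scores[i], scores[j])
                alive[j] = False
    return boxes[keep], scores[keep]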

To remove duplicated bounding boxes

Widely-adopted approach: NMS

Modification for NMS

To eliminate high-scored false positives

Learning high quality object detectors

-IoU threshold (u): set to determine positives/negatives

When u is high, the positives contain less background, but it is difficult to assemble enough positive training examples. When u is low, a richer and more diversified positive training set is available, but the trained detector has little incentive to reject close false positives. In general, it is very difficult to ask a single classifier to perform uniformly well over all IoU levels.

A low IoU threshold improves low-IoU samples more, but for high-IoU samples it is less useful than a high threshold. The reason is that the sample distribution differs under different thresholds, so a single threshold can hardly work well for all samples.

[Cascade RCNN - CVPR2018] Cascade R-CNN: Delving into High Quality Object Detection
[paper] https://arxiv.org/abs/1712.00726
[official code] https://github.com/zhaoweicai/cascade-rcnn
[summarization]

-At each stage t, the R-CNN includes a classifier ht and a regressor ft optimized for IoU threshold ut, where ut > ut−1. This is guaranteed by minimizing the loss
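A sketch of the per-stage loss (my reconstruction from the paper; x^t and b^t are the stage-t features and box, g the ground truth, y^t the label under threshold u^t, and [\cdot] the indicator function):

L(x^t, g) = L_{cls}\big(h_t(x^t), y^t\big) + \lambda\,[\,y^t \ge 1\,]\, L_{loc}\big(f_t(x^t, b^t), g\big)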


-Cascaded regression is a resampling procedure that changes the distribution of hypotheses to be processed by the different stages. By adjusting bounding boxes, each stage aims to find a good set of close false positives for training the next stage

-A bounding box regressor trained for a certain u tends to produce bounding boxes of higher IoU. Hence, starting from a set of examples (xi, bi), cascade regression successively resamples an example distribution (x′i, b′i) of higher IoU.

Improve the classification power

(1)Shared feature representation for both classification and localization may not be optimal

(2)joint optimization also leads to possible sub-optimal to balance the goals of multiple tasks and could not directly utilize the full potential on individual tasks;

(3)large receptive fields could lead to inferior classification capacity by introducing redundant context information for small objects.

[DCR - ECCV2018] Revisiting RCNN: On Awakening the Classification Power of Faster RCNN
[paper] https://arxiv.org/abs/1803.06799
[official code] https://github.com/bowenc0221/Decoupled-Classification-Refinement
[summarization]
-Propose Decoupled Classification Refinement to eliminate high-scored false positives and improve the region proposal classification results.

-It takes input from a base classifier, e.g. Faster RCNN, and refines the classification results using an RCNN-styled network.

-Adaptive Receptive Field

[DCR V2] Decoupled Classification Refinement: Hard False Positive Suppression for Object Detection

Anchor

Anchors are regression references and classification candidates to predict proposals (for two-stage detectors) or final bounding boxes (for single-stage detectors). Modern object detection pipelines usually begin with a large set of densely distributed anchors.

Two general rules for a reasonable anchor design: (1)Alignment: anchor centers need to be well aligned with feature map pixels. (2)Consistency: the receptive field and semantic scope are consistent in different regions of a feature map, so the scale and shape of anchors across different locations should be consistent.

The uniform anchoring scheme can lead to two difficulties: (1) A neat set of anchors of fixed aspect ratios has to be predefined for different problems. A wrong design may hamper the speed and accuracy of the detector. (2) To maintain a sufficiently high recall for proposals, a large number of anchors are needed, while most of them correspond to false candidates that are irrelevant to the object of interests.

Modify anchor generation scheme

[Guided Anchoring - CVPR2019] Region Proposal by Guided Anchoring
[paper] https://arxiv.org/abs/1901.03278
[official code] https://github.com/open-mmlab/mmdetection
[summarization]

(1)Anchor Location Prediction:
-a 1x1 convolution and an element-wise sigmoid function.
-yields a probability map that indicates the possible locations of the objects.
-selecting those locations whose corresponding probability values are above a predefined threshold.
-use masked convolution at inference time
-define center/ignore/outside region, use focal loss when training

(2)Anchor Shape Prediction:
-predict the best shape (w, h) for each location.
-a 1x1 convolutional layer that yields a two-channel map containing the values of dw and dh, followed by an element-wise transform layer that implements Eq.(2) (see the sketch below).

-when training, sample some common values of w and h, calculate the IoU of these sampled anchors with the ground truth, and use the maximum; a bounded IoU loss is used.
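As a sketch of that transform (my recollection of Eq. (2); s is the stride of the feature map and σ an empirical scaling factor, e.g. 8):

w = \sigma \cdot s \cdot e^{dw}, \qquad h = \sigma \cdot s \cdot e^{dh}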

(3)Anchor Guided Feature Adaptation
-Ideally, the feature for a large anchor should encode the content over a large region, while those for small anchors should have smaller scopes accordingly.
-first predict an offset field from the output of anchor shape prediction branch, and then apply 3x3 deformable convolution to the original feature map with the offsets.

(4)The Use of High quality Proposals
-set a higher positive/negative threshold and use fewer samples when training detectors with GA-RPN compared to RPN.

The way to extract fixed-length feature vector

RoI Pooling [Fast R-CNN]
RoI Align [Mask R-CNN]
Precise RoI Pooling [IoU-Net]

To accommodate geometric variations

[DCN - ICCV2017] Deformable Convolutional Networks
[summarization]
-CNNs are inherently limited in modeling large, unknown transformations; they lack internal mechanisms to handle geometric transformations.

-Deformable convolution

The offsets are obtained by applying a convolutional layer over the same input feature map. The output offset fields have the same spatial resolution with the input feature map. The channel dimension 2N corresponds to N 2D offsets.
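As a sketch, the deformable-convolution output at location p_0 in the standard DCN formulation (R is the regular sampling grid of the kernel and Δp_n are the learned offsets):

y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)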

-Deformable RoI Pooling

[DCNv2 - 1811] Deformable ConvNets v2: More Deformable, Better Results
[summarization]
-Stacking More Deformable Conv Layers

-Modulated Deformable Modules
Not only adjust offsets in perceiving input features, but also modulate the input feature amplitudes from different spatial locations / bins.

Improvement for One Stage Framework

Foreground-background class imbalance problem

Class imbalance is addressed in R-CNN-like detectors by a two-stage cascade and sampling heuristics. The proposal stage (e.g., Selective Search, EdgeBoxes, DeepMask, RPN) rapidly narrows down the number of candidate object locations to a small number (e.g., 1-2k), filtering out most background samples. In the second classification stage, sampling heuristics, such as a fixed foreground-to-background ratio (1:3), or online hard example mining (OHEM), are performed to maintain a manageable balance between foreground and background.

In contrast, a one-stage detector must process a much larger set of candidate object locations regularly sampled across an image. In practice this often amounts to enumerating ~100k locations that densely cover spatial positions, scales, and aspect ratios. While similar sampling heuristics may also be applied, they are inefficient as the training procedure is still dominated by easily classified background examples.

New classification loss function

[Focal Loss - ICCV2017] Focal Loss for Dense Object Detection
[paper]
[summarization]
-Identify class imbalance during training as the main obstacle impeding one-stage detector from achieving state-of-the-art accuracy.

-Focal Loss
Dynamically scaled cross entropy loss: down-weight the contribution of easy examples during training and rapidly focus the model on hard examples.

Consider the cross entropy (CE) loss for binary classification:

Rewrite it by pt:

Define the focal loss as

-RetinaNet

To alleviate the problems arising from scale variation and small object instances

Construct feature pyramid

[PFPNet - ECCV2018] Parallel Feature Pyramid Network for Object Detection
[paper]
[summarization]
-Employ the SPP module to generate pyramid-shaped feature maps via widening the network width instead of increasing its depth.

-SSD / RefineDet VGG16

[M2Det - AAAI2019] M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network
[paper] https://arxiv.org/abs/1811.04533
[official code] https://github.com/qijiezhao/M2Det
[summarization]
-In previous feature pyramid methods, each feature map in the pyramid (used for detecting objects in a specific range of sizes) mainly or only consists of single-level features, which results in suboptimal detection performance.

-Multi-Level Feature Pyramid Network (MLFPN).

-Based on SSD


Anchor(default box)

In short, anchor method suggests dividing the box space (including position, size, class, etc.) into discrete bins (not necessarily disjoint) and generating each object box via the anchor function defined in the corresponding bin.

Currently most detectors model anchors via enumeration, i.e. predefining a number of anchor boxes with all kinds of positions, sizes and class labels, which leads to the following issues. First, anchor boxes need careful design (chosen by hand or by statistical methods like clustering). Second, predefined anchor functions may introduce too many parameters.

[MetaAnchor - NIPS2018] MetaAnchor: Learning to Detect Objects with Customized Anchor
[paper] https://arxiv.org/abs/1807.00980
[summarization]

Anchor free method

Other improvement

Dataset

Framework

[mmdetection] https://github.com/open-mmlab/mmdetection

[OneStageDet] https://github.com/TencentYoutuResearch/ObjectDetection-OneStageDet

[Detectron] https://github.com/facebookresearch/Detectron

Experience in Network Structure

Posted on 2018-12-14

Improvement for Feature Expression

inverted residual with linear bottleneck

ReLU6, ReLU, PReLU

Acceleration and Compression

Depthwise Separable Convolution

The computational cost is reduced from that of a standard convolution to the sum of a depthwise and a pointwise convolution; see the sketch below.
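A sketch of the two costs in MobileNet-style notation (D_K kernel size, D_F output feature-map size, M input channels, N output channels; these symbols are my assumption, not spelled out above):

D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F \;\longrightarrow\; D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F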

Group Convolution

from 35x35x320 to 17x17x640:

  1. conv + pooling, or pooling + conv
  2. conv with stride 2

  3. It remains to be studied which of the above options is faster.

Inspired Architecture

Insight

The manifold of interest should lie in a low-dimensional subspace of the higher-dimensional activation space (non-linearity destroys information in low-dimensional spaces).

Something about Python

Posted on 2018-11-16

[TOC]


lstrip() and rstrip()

str.lstrip([chars]) # strip whitespace (or the specified characters) from the left end of the string
str.rstrip([chars]) # strip whitespace (or the specified characters) from the right end of the string

Finding the min/max in a dict

dogdistance = {'dog-dog': 33, 'dog-cat': 36, 'dog-car': 41, 'dog-bird': 42}
min(dogdistance, key=dogdistance.get) # returns the key with the minimum value: 'dog-dog'
max(dogdistance, key=dogdistance.get) # returns the key with the maximum value: 'dog-bird'

Finding the min/max in a list

c = [-10,-5,0,5,3,10,15,-20,25]
print c.index(min(c)) # returns the index of the minimum value
print c.index(max(c)) # returns the index of the maximum value

Sorting a dict by value

Returns a list of (key, value) pairs:

sorted(d.items(),key = lambda x:x[1],reverse = True)

import operator
sorted(d.items(),key = operator.itemgetter(1))

Returns only the keys (as a list):

sorted(d,key=d.__getitem__)


Finding the min/max in a numpy array

a = np.arange(9).reshape((3,3))
a
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

print(np.max(a)) # global max
8
print(np.max(a,axis=0)) # max of each column
[6 7 8]
print(np.max(a,axis=1)) # max of each row
[2 5 8]
print(np.where(a==np.max(a)))
(array([2], dtype=int64), array([2], dtype=int64))
print(np.where(a==np.max(a,axis=0)))
(array([2, 2, 2], dtype=int64), array([0, 1, 2], dtype=int64))

Element-wise (integer) division on a list

list(map(lambda x: x//4, list_a))

google_deeplab code analysis

Posted on 2018-11-16

vis.py :

predictions = model.predict_labels(
)

->model.py

def predict_labels()

outputs_to_scales_to_logits = multi_scale_logits(
)

def multi_scale_logits()

outputs_to_logits = _get_logits()

def _get_logits()

features, end_point = extract_features()

def extract_features()

->core/feature_extractor.py

def extract_features()

model_variant ['resnet', 'xception', 'mobilenet', 'nas']

arg_scope = arg_scopes_map[model_variant]()
features, end_point = get_network(model_variant,..., arg_scope)

model_variant='xception-XX'

-> arg_scopes_map[model_variant]=xception_arg_scope

->core/xception.py

def xception_arg_scope() (not sure what this does)

def get_network()

func = network_map[network_name]

->core/xception.py

def xception_71()

yolov3-pytorch code analysis

Posted on 2018-11-16

GitHub:https://github.com/ayooshkathuria/pytorch-yolo-v3

Tutorial: Implement YOLO v3 from scratch


1. Creating the layers of the network architecture

Configuration File

The configuration file describes the layout of the network, block by block. You can also see the full Darknet-53 architecture in the following diagram.

There are 5 types of layers that are used in YOLO:

Convolutional

Shortcut (Residual): the parameter from: -3 means the shortcut layer's output is the sum of the output from three layers back and the previous layer's output

Upsample: YOLOv3 adopts an FPN-like structure and makes predictions at three scales, so upsampling is performed twice

Route:
When layers attribute has only one value, it outputs the feature maps of the layer indexed by the value. In our example, it is -4, so the layer will output feature map from the 4th layer backwards from the Route layer.

When layers has two values, it returns the concatenated feature maps of the layers indexed by its values. In our example it is -1, 61, and the layer will output feature maps from the previous layer (-1) and the 61st layer, concatenated along the depth dimension.

[route]
layers = -4

[route]
layers = -1, 61

YOLO:
The anchors attribute describes 9 anchors, but only the anchors indexed by the attributes of the mask tag are used.

[yolo]
mask = 0,1,2
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=80
num=9
jitter=.3
ignore_thresh = .5
truth_thresh = 1
random=1

Parsing the configuration file

def parse_cfg(cfgfile):
    """
    Takes a configuration file

    Returns a list of blocks. Each block describes a block in the neural
    network to be built. A block is represented as a dictionary in the list

    """

Creating the building blocks

def create_modules(blocks):
    net_info = blocks[0]   # Captures the information about the input and pre-processing
    module_list = nn.ModuleList()
    prev_filters = 3
    output_filters = []

Our function will return an nn.ModuleList.

prev_filters is used to keep track of the number of filters in the layer on which the convolutional layer is being applied.

output_filters is used to store the number of output filters of each block, which the route layer needs.

for x in blocks:
    module = nn.Sequential()

The nn.Sequential class is used to sequentially execute a number of nn.Module objects. We use its add_module function to string together all layers.

for the conv layer and upsample layer (PyTorch provides pre-built layers):

#Add the convolutional layer
conv = nn.Conv2d(prev_filters, filters, kernel_size, stride, pad, bias = bias)
module.add_module("conv_{0}".format(index), conv)

#Add the Batch Norm Layer
bn = nn.BatchNorm2d(filters)
module.add_module("batch_norm_{0}".format(index), bn)

#Check the activation.
#It is either Linear or a Leaky ReLU for YOLO
activn = nn.LeakyReLU(0.1, inplace = True)
module.add_module("leaky_{0}".format(index), activn)

#If it's an upsampling layer
#We use Bilinear2dUpsampling
upsample = nn.Upsample(scale_factor = 2, mode = "bilinear")
module.add_module("upsample_{}".format(index), upsample)

for Route Layer / Shortcut Layers:

shortcut = EmptyLayer()
module.add_module("shortcut_{}".format(index), shortcut)

route = EmptyLayer()
module.add_module("route_{0}".format(index), route)

The empty layer is defined as:

class EmptyLayer(nn.Module):
    def __init__(self):
        super(EmptyLayer, self).__init__()

We use an empty layer and perform the concatenation directly in the forward function of the nn.Module object representing darknet.

for yolo layers:

detection = DetectionLayer(anchors)
module.add_module("Detection_{}".format(index), detection)

We define a new layer DetectionLayer that holds the anchors used to detect bounding boxes.

The detection layer is defined as: (the forward function will be proposed later)

class DetectionLayer(nn.Module):
    def __init__(self, anchors):
        super(DetectionLayer, self).__init__()
        self.anchors = anchors

At the end of the loop for each block, we do some bookkeeping.

module_list.append(module)
prev_filters = filters
output_filters.append(filters)
index += 1

At the end of the function create_modules, we return a tuple containing the net_info, and module_list.

2. Implementing the forward pass of the network

Defining the Network

class Darknet(nn.Module):
    def __init__(self, cfgfile):
        super(Darknet, self).__init__()
        self.blocks = parse_cfg(cfgfile)
        self.net_info, self.module_list = create_modules(self.blocks)

Implementing the forward pass of the network

forward serves two purposes. First, to calculate the output, and second, to transform the output detection feature maps so that they can be processed more easily.

def forward(self, x, CUDA):
    modules = self.blocks[1:]  # the first element is the net block, which is not needed here
    outputs = {}               # We cache the outputs for the route layer

    write = 0                  # This is explained a bit later
    for i, module in enumerate(modules):
        module_type = (module["type"])

Since route and shortcut layers need output maps from previous layers, we cache the output feature maps of every layer in a dict outputs. The keys are the indices of the layers, and the values are the feature maps.

Convolutional and Upsample Layers

if module_type == "convolutional" or module_type == "upsample":
    x = self.module_list[i](x)

Route Layer / Shortcut Layer

elif module_type == "route":
layers = module["layers"]
layers = [int(a) for a in layers]

if (layers[0]) > 0:
layers[0] = layers[0] - i

if len(layers) == 1:
x = outputs[i + (layers[0])]

else:
if (layers[1]) > 0:
layers[1] = layers[1] - i

map1 = outputs[i + layers[0]]
map2 = outputs[i + layers[1]]

x = torch.cat((map1, map2), 1) # concatenate the feature maps along the depth

elif module_type == "shortcut":
from_ = int(module["from"])
x = outputs[i-1] + outputs[i+from_]

In PyTorch, the input and output of a convolutional layer have the format B x C x H x W. The depth corresponds to the channel dimension.

YOLO (Detection Layer)

The output of YOLO is a convolutional feature map that contains the bounding box attributes along the depth of the feature map.

There are two problems. First, the attributes of the bounding boxes predicted by a cell are stacked one after another along the depth, which is very inconvenient for output processing. Second, it would be nice to be able to do these operations on a single tensor, rather than three separate tensors.

To remedy these problems, we will introduce the function predict_transform first.

Transforming the output

def predict_transform(prediction, inp_dim, anchors, num_classes, CUDA = True):

This function takes a detection feature map and turns it into a 2-D tensor, where each row of the tensor corresponds to the attributes of a bounding box, in the following order.

The code to do above transformation is:

batch_size = prediction.size(0)
stride = inp_dim // prediction.size(2)
grid_size = inp_dim // stride
bbox_attrs = 5 + num_classes
num_anchors = len(anchors)

prediction = prediction.view(batch_size, bbox_attrs*num_anchors, grid_size*grid_size)
prediction = prediction.transpose(1,2).contiguous()
prediction = prediction.view(batch_size, grid_size*grid_size*num_anchors, bbox_attrs)

The dimensions of the anchors are in accordance with the height and width attributes of the net block. These attributes describe the dimensions of the input image, which is larger (by a factor of stride) than the detection map. Therefore, we must divide the anchors by the stride of the detection feature map.

anchors = [(a[0]/stride, a[1]/stride) for a in anchors]

Now, we need to transform our output according to the equations
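The equations in question are the standard YOLOv2/v3 box transform (c_x, c_y are the grid-cell offsets and p_w, p_h the anchor dimensions):

b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h}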

First, sigmoid the x,y coordinates and the objectness score.

# Sigmoid the centre_X, centre_Y, and object confidence
prediction[:,:,0] = torch.sigmoid(prediction[:,:,0])
prediction[:,:,1] = torch.sigmoid(prediction[:,:,1])
prediction[:,:,4] = torch.sigmoid(prediction[:,:,4])

Add the grid offsets to the center coordinate predictions.

#Add the center offsets
grid = np.arange(grid_size)
a,b = np.meshgrid(grid, grid)

x_offset = torch.FloatTensor(a).view(-1,1)
y_offset = torch.FloatTensor(b).view(-1,1)

if CUDA:
    x_offset = x_offset.cuda()
    y_offset = y_offset.cuda()

x_y_offset = torch.cat((x_offset, y_offset), 1).repeat(1,num_anchors).view(-1,2).unsqueeze(0)

prediction[:,:,:2] += x_y_offset

Apply the anchors to the dimensions of the bounding box.

#log space transform height and the width
anchors = torch.FloatTensor(anchors)

if CUDA:
    anchors = anchors.cuda()

anchors = anchors.repeat(grid_size*grid_size, 1).unsqueeze(0)
prediction[:,:,2:4] = torch.exp(prediction[:,:,2:4])*anchors

Apply sigmoid activation to the class scores.

prediction[:,:,5: 5 + num_classes] = torch.sigmoid((prediction[:,:, 5 : 5 + num_classes]))

The last thing we want to do here is to resize the detection map to the size of the input image. The bounding box attributes here are sized according to the feature map (say, 13 x 13). If the input image was 416 x 416, we multiply the attributes by 32, i.e. the stride variable.

prediction[:,:,:4] *= stride

return prediction

Detection Layer Revisited

    elif module_type == 'yolo':
        anchors = self.module_list[i][0].anchors
        #Get the input dimensions
        inp_dim = int(self.net_info["height"])

        #Get the number of classes
        num_classes = int(module["classes"])

        #Transform
        x = x.data
        x = predict_transform(x, inp_dim, anchors, num_classes, CUDA)
        if not write:   #if no collector has been initialised
            detections = x
            write = 1

        else:
            detections = torch.cat((detections, x), 1)

    outputs[i] = x

return detections

3. Confidence Thresholding and Non-maximum Suppression

To be precise, our output is a tensor of shape B x 10647 x 85. B is the number of images in a batch, 10647 is the number of bounding boxes predicted per image, and 85 is the number of bounding box attributes.
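For a 416 × 416 input, the 10647 comes from the three detection scales (a quick sanity check):

(13^2 + 26^2 + 52^2) \times 3 = (169 + 676 + 2704) \times 3 = 10647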

However, we must subject our output to objectness score thresholding and non-maximum suppression, to obtain what we will call in the rest of this post the true detections. To do that, we will create a function called write_results.

def write_results(prediction, confidence, num_classes, nms_conf = 0.4):

Object Confidence Thresholding

conf_mask = (prediction[:,:,4] > confidence).float().unsqueeze(2)
prediction = prediction*conf_mask

Performing Non-maximum Suppression

Transform the (center x, center y, height, width) attributes of our boxes, to (top-left corner x, top-left corner y, right-bottom corner x, right-bottom corner y).

box_corner = prediction.new(prediction.shape)
box_corner[:,:,0] = (prediction[:,:,0] - prediction[:,:,2]/2)
box_corner[:,:,1] = (prediction[:,:,1] - prediction[:,:,3]/2)
box_corner[:,:,2] = (prediction[:,:,0] + prediction[:,:,2]/2)
box_corner[:,:,3] = (prediction[:,:,1] + prediction[:,:,3]/2)
prediction[:,:,:4] = box_corner[:,:,:4]

Confidence thresholding and NMS has to be done for one image at once. This means, we must loop over the first dimension of prediction.

batch_size = prediction.size(0)

write = False

for ind in range(batch_size):
    image_pred = prediction[ind]   #image Tensor
    #confidence thresholding
    #NMS

Notice each bounding box row has 85 attributes, out of which 80 are the class scores. At this point, we're only concerned with the class score having the maximum value. So, we remove the 80 class scores from each row, and instead add the index of the class having the maximum value, as well as the class score of that class.

max_conf, max_conf_score = torch.max(image_pred[:,5:5+ num_classes], 1)
max_conf = max_conf.float().unsqueeze(1)
max_conf_score = max_conf_score.float().unsqueeze(1)
seq = (image_pred[:,:5], max_conf, max_conf_score)
image_pred = torch.cat(seq, 1)

Get rid of the bounding box rows having an object confidence less than the threshold.

non_zero_ind = (torch.nonzero(image_pred[:,4]))
try:
    image_pred_ = image_pred[non_zero_ind.squeeze(),:].view(-1,7)
except:
    continue

#For PyTorch 0.4 compatibility
#Since the above code will not raise an exception for no detections
#as scalars are supported in PyTorch 0.4
if image_pred_.shape[0] == 0:
    continue

The try-except block is there to handle situations where we get no detections. In that case, we use continue to skip the rest of the loop body for this image.

Now, let's get the classes detected in an image.

#Get the various classes detected in the image
img_classes = unique(image_pred_[:,-1]) # -1 index holds the class index

Since there can be multiple true detections of the same class, we use a function called unique to get classes present in any given image.

def unique(tensor):
    tensor_np = tensor.cpu().numpy()
    unique_np = np.unique(tensor_np)
    unique_tensor = torch.from_numpy(unique_np)

    tensor_res = tensor.new(unique_tensor.shape)
    tensor_res.copy_(unique_tensor)
    return tensor_res

Then, we perform NMS classwise.

for cls in img_classes:
    #perform NMS

Once we are inside the loop, the first thing we do is extract the detections of a particular class (denoted by variable cls).

#get the detections with one particular class
cls_mask = image_pred_*(image_pred_[:,-1] == cls).float().unsqueeze(1)
class_mask_ind = torch.nonzero(cls_mask[:,-2]).squeeze()
image_pred_class = image_pred_[class_mask_ind].view(-1,7)

#sort the detections such that the entry with the maximum objectness confidence is at the top
conf_sort_index = torch.sort(image_pred_class[:,4], descending = True )[1]
image_pred_class = image_pred_class[conf_sort_index]
idx = image_pred_class.size(0) #Number of detections

Now, we perform NMS.

for i in range(idx):
    #Get the IOUs of all boxes that come after the one we are looking at
    #in the loop
    try:
        ious = bbox_iou(image_pred_class[i].unsqueeze(0), image_pred_class[i+1:])
    except ValueError:
        break

    except IndexError:
        break

    #Zero out all the detections that have IoU > threshold
    iou_mask = (ious < nms_conf).float().unsqueeze(1)
    image_pred_class[i+1:] *= iou_mask

    #Remove the non-zero entries
    non_zero_ind = torch.nonzero(image_pred_class[:,4]).squeeze()
    image_pred_class = image_pred_class[non_zero_ind].view(-1,7)

Here, we use a function bbox_iou. The first input is the bounding box row that is indexed by the variable i in the loop.
The second input is a tensor of multiple rows of bounding boxes. The output of the function bbox_iou is a tensor containing the IoUs of the bounding box represented by the first input with each of the bounding boxes in the second input.
If we have two bounding boxes of the same class with an IoU larger than a threshold, then the one with the lower class confidence is eliminated.

Also notice that we have put the line of code that computes the ious in a try-except block. This is because the loop is designed to run idx iterations. However, as we proceed with the loop, a number of bounding boxes may be removed from image_pred_class, which means we cannot have idx iterations in most instances. Hence, we might try to index a value that is out of bounds (IndexError), or the slice image_pred_class[i+1:] may return an empty tensor, assigning which triggers a ValueError. At that point, we can ascertain that NMS can remove no further bounding boxes, and we break out of the loop.

Calculating the IoU

def bbox_iou(box1, box2):
    """
    Returns the IoU of two bounding boxes
    """
    #Get the coordinates of bounding boxes
    b1_x1, b1_y1, b1_x2, b1_y2 = box1[:,0], box1[:,1], box1[:,2], box1[:,3]
    b2_x1, b2_y1, b2_x2, b2_y2 = box2[:,0], box2[:,1], box2[:,2], box2[:,3]

    #get the coordinates of the intersection rectangle
    inter_rect_x1 = torch.max(b1_x1, b2_x1)
    inter_rect_y1 = torch.max(b1_y1, b2_y1)
    inter_rect_x2 = torch.min(b1_x2, b2_x2)
    inter_rect_y2 = torch.min(b1_y2, b2_y2)

    #Intersection area
    inter_area = torch.clamp(inter_rect_x2 - inter_rect_x1 + 1, min=0) * torch.clamp(inter_rect_y2 - inter_rect_y1 + 1, min=0)

    #Union Area
    b1_area = (b1_x2 - b1_x1 + 1)*(b1_y2 - b1_y1 + 1)
    b2_area = (b2_x2 - b2_x1 + 1)*(b2_y2 - b2_y1 + 1)

    iou = inter_area / (b1_area + b2_area - inter_area)

    return iou

Writing the predictions

The function write_results outputs a tensor of shape D x 8. Here D is the number of true detections in all of the images, each represented by a row. Each detection has 8 attributes, namely: the index of the image in the batch to which the detection belongs, 4 corner coordinates, the objectness score, the score of the class with maximum confidence, and the index of that class.

batch_ind = image_pred_class.new(image_pred_class.size(0), 1).fill_(ind)
#Repeat the batch_id for as many detections of the class cls in the image
seq = batch_ind, image_pred_class

if not write:
    output = torch.cat(seq,1)
    write = True
else:
    out = torch.cat(seq,1)
    output = torch.cat((output,out))

4. Designing the input and the output pipelines

Loading the Network

Load the class file.

num_classes = 80    #For COCO
classes = load_classes("data/coco.names")

def load_classes(namesfile):
    fp = open(namesfile, "r")
    names = fp.read().split("\n")[:-1]
    return names

Initialize the network and load weights.

#Set up the neural network
print("Loading network.....")
model = Darknet(args.cfgfile)
model.load_weights(args.weightsfile)
print("Network successfully loaded")

model.net_info["height"] = args.reso
inp_dim = int(model.net_info["height"])
assert inp_dim % 32 == 0
assert inp_dim > 32

#If there's a GPU available, put the model on the GPU
if CUDA:
    model.cuda()

#Set the model in evaluation mode
model.eval()

Read the Input images

read_dir = time.time()
#Detection phase
try:
    imlist = [osp.join(osp.realpath('.'), images, img) for img in os.listdir(images)]
except NotADirectoryError:
    imlist = []
    imlist.append(osp.join(osp.realpath('.'), images))
except FileNotFoundError:
    print ("No file or directory with the name {}".format(images))
    exit()

use OpenCV to load the images.

load_batch = time.time()
loaded_ims = [cv2.imread(x) for x in imlist]

Write the function letterbox_image to resize our image, keeping the aspect ratio consistent and padding the left-out areas with the color (128,128,128).

def letterbox_image(img, inp_dim):
    '''resize image with unchanged aspect ratio using padding'''
    img_w, img_h = img.shape[1], img.shape[0]
    w, h = inp_dim
    new_w = int(img_w * min(w/img_w, h/img_h))
    new_h = int(img_h * min(w/img_w, h/img_h))
    resized_image = cv2.resize(img, (new_w,new_h), interpolation = cv2.INTER_CUBIC)

    canvas = np.full((inp_dim[1], inp_dim[0], 3), 128)

    canvas[(h-new_h)//2:(h-new_h)//2 + new_h, (w-new_w)//2:(w-new_w)//2 + new_w, :] = resized_image

    return canvas

Write the function prep_image to take an OpenCV image and convert it to PyTorch's input format.

def prep_image(img, inp_dim):
    """
    Prepare image for inputting to the neural network.

    Returns a Variable
    """

    img = cv2.resize(img, (inp_dim, inp_dim))
    img = img[:,:,::-1].transpose((2,0,1)).copy()
    img = torch.from_numpy(img).float().div(255.0).unsqueeze(0)
    return img
