Sparse Neural Networks pt.1: kwinner – the new line of defense against noise? With an implementation in tiny-dnn!

Numenta has just released a new paper on strengthening the noise resistance of neural networks, proposing that a sparse network is more robust against noise than current ones. Well, let’s try it out and implement a portion of the paper ourselves – in tiny-dnn!

Implementing the layer

The idea is simple. Given an input, select the top k values to keep and discard the rest. It is basically global max pooling, except the result contains k values instead of being reduced to a single value. Or, if you are familiar with HTM: global inhibition for neural networks.
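
To make the idea concrete before wiring it into tiny-dnn, here is a minimal, framework-free sketch of the operation (the function name and the nth_element trick are mine, not from the paper or tiny-dnn):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

//Keep the k largest activations, zero out everything else.
//Assumes 1 <= k <= x.size().
std::vector<float> kwinner_take_all(const std::vector<float>& x, std::size_t k)
{
	//Find the value of the k-th largest element without fully sorting the input
	std::vector<float> tmp = x;
	std::nth_element(tmp.begin(), tmp.begin() + k - 1, tmp.end(), std::greater<float>());
	const float threshold = tmp[k - 1];

	std::vector<float> out(x.size(), 0.0f);
	std::size_t kept = 0;
	for(std::size_t i = 0; i < x.size() && kept < k; i++) {
		if(x[i] >= threshold) { //this value survives the "inhibition"
			out[i] = x[i];
			kept++;
		}
	}
	return out;
}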

To have it running in tiny-dnn, let’s make ourselves a new layer, kwinner_layer. Give it appropriate constructors and overload some virtual methods. I won’t go into detail about this. The boilerplate for making a new layer is pretty well documented here (although outdated), and staring at a simple layer of tiny-dnn is enough to copy the structure and make it work.

class kwinner_layer : public tiny_dnn::layer
{
public:
	kwinner_layer() : layer({vector_type::data}, {vector_type::data}) {}
	kwinner_layer(std::vector<size_t> input_shape, float density)
		: layer({vector_type::data}, {vector_type::data})
		, num_on_cells_(density*std::accumulate(input_shape.begin(), input_shape.end(), 1, std::multiplies<size_t>()))
		, input_shape_(input_shape) {
	}

	std::string layer_type() const override {
		return "kwinner";
	}

	std::vector<shape3d> in_shape() const override {
		// return input shapes
		// order of shapes must be equal to argument of layer constructor
		return { io_shape() };
	}

	std::vector<shape3d> out_shape() const override {
		return { io_shape() };
	}

	shape3d io_shape() const {
		//Pad the shape with 1s so we always have at least 3 dimensions
		auto s = input_shape_;
		for(size_t i = 0; i < 3; i++)
			s.push_back(1);
		return shape3d(s[0], s[1], s[2]);
	}

private:
	//Number of values allowed to survive (k)
	size_t num_on_cells_ = 0;
	//Shape of the incoming tensor
	std::vector<size_t> input_shape_;
	//Indices of the surviving values of each sample, kept for backprop
	std::vector<std::vector<size_t>> indices_;
};

And now the meat of the layer: forward and backward propagation. Since tiny-dnn doesn’t come with autograd, we will have to write both passes by hand. Fortunately, it is very easy in this case.

//Returns the permutation of indices that sorts vec (needs <algorithm> and <numeric>)
template <typename T, typename Compare>
inline std::vector<std::size_t> sort_permutation(
    const T& vec,
    Compare compare)
{
	std::vector<std::size_t> p(vec.size());
	std::iota(p.begin(), p.end(), 0);
	std::sort(p.begin(), p.end(),
		[&](std::size_t i, std::size_t j){ return compare(vec[i], vec[j]); });
	return p;
}

void forward_propagation(const std::vector<tensor_t*>& in_data,
			std::vector<tensor_t*>& out_data) override {
	//The input and output tensors
	const tensor_t &in = *in_data[0];
	tensor_t &out = *out_data[0];
	//How many samples are there in the current batch
	const size_t sample_count = in.size();
	//Make our internal storage larger if we don't have enough
	if (indices_.size() < sample_count)
		indices_.resize(sample_count, std::vector<size_t>(num_on_cells_));

	//tiny-dnn's automatically parallelized for loop
	for_i(sample_count, [&](size_t sample) {
		//Get the current sample in the batch
		const vec_t &in_vec = in[sample];
		vec_t &out_vec = out[sample];

		//Sort the input values in descending order and return the indices
		auto p = sort_permutation(in_vec, [](auto a, auto b){ return a > b; });

		//Only allow the top k values to survive and drop the rest
		for(size_t i=0;i<out_vec.size();i++)
			out_vec[i] = 0;
		for(size_t i=0;i<num_on_cells_;i++) {
			size_t idx = p[i];
			out_vec[idx] = in_vec[idx];
		}

		//Store the indices of the top k values for backprop
		std::copy(p.begin(), p.begin()+num_on_cells_, indices_[sample].begin());
	});
}

The forward_propagation method accepts two parameters: in_data contains all tensors coming into the layer and out_data contains all tensors going out of it. Since kwinner has only one input and one output, we use in_data[0] and out_data[0] exclusively. We resize our internal buffer used for backpropagation. for_i is an automatically parallelized for loop in tiny-dnn, so we don’t have to parallelize anything ourselves. Within the loop, we find the top-k values, discard all the others, and copy the indices of the surviving values into our internal buffer.

And in the backpropagation phase we do the opposite: we copy a gradient back only if its index is in our buffer, and discard the rest.

void back_propagation(const std::vector<tensor_t *> &in_data,
		const std::vector<tensor_t *> &out_data,
		std::vector<tensor_t *> &out_grad,
		std::vector<tensor_t *> &in_grad) override {
	tensor_t &prev_delta       = *in_grad[0];
	const tensor_t &curr_delta = *out_grad[0];

	CNN_UNREFERENCED_PARAMETER(in_data);
	CNN_UNREFERENCED_PARAMETER(out_data);

	for_i(prev_delta.size(), [&](size_t sample) {
		auto& s = prev_delta[sample];
		size_t sz = s.size();
		for (size_t i = 0; i < sz; i++)
			s[i] = 0;
		//Copy the gradients of the surviving values back to their original positions
		for(size_t i=0;i<num_on_cells_;i++) {
			size_t idx = indices_[sample][i];
			s[idx] += curr_delta[sample][idx];
		}
		}
	});
}

That’s it. Now we can use it as an ordinary layer in tiny-dnn. (Besides serializing/deserializing it, as I didn’t implement the interfaces for that and would need to modify tiny-dnn’s core to register a new layer.)

Experiments and testing

No paper is complete without experiments and tests. Now that we have it implemented, let’s test how good kwinner really is! For this post, I’ll compare how well LeNet-5 with kwinner performs against the same network with Dropout, with Batch Normalization, and a raw network without any regularization.

To add noise to the images, a simple bit of code is used:

auto noisy_images = test_image;
for(auto& image : noisy_images) {
	for(auto& d : image) {
		//Corrupt each pixel with probability noise_factor
		if(dist(rng) < noise_factor)
			d = std::max(std::min(noise_dist(rng), 1.f), 0.f); //replace with a random value clamped to [0, 1]
	}
}
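
The snippet assumes dist, noise_dist, rng and noise_factor are defined elsewhere. A plausible setup could look like the following (the exact distributions are my assumption; the clamp above only tells us that noise_dist can fall outside [0, 1]):

#include <random>

std::mt19937 rng(42);
//Fraction of pixels that get corrupted, e.g. 0.3 for the "0.3 noise" row below
float noise_factor = 0.3f;
//Decides whether a given pixel is corrupted
std::uniform_real_distribution<float> dist(0.0f, 1.0f);
//Produces the replacement value; clamped to [0, 1] in the loop above
std::normal_distribution<float> noise_dist(0.5f, 0.5f);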

Here are the network definitions.

using activation = tiny_dnn::activation::tanh;

//Raw network
nn << conv(32, 32, 5, 1, 6)
	<< max_pool(28, 28, 6, 2)
	<< activation()
	<< conv(14, 14, 5, 6, 16)
	<< max_pool(10, 10, 16, 2)
	<< activation()
	<< conv(5, 5, 5, 16, 120)
	<< activation()
	<< fc(120, 10)
	<< softmax();

//Dropout
nn << conv(32, 32, 5, 1, 6)
	<< max_pool(28, 28, 6, 2)
	<< activation()
	<< dropout_layer(6*14*14, 0.3)        //Dropout
	<< conv(14, 14, 5, 6, 16)
	<< max_pool(10, 10, 16, 2)
	<< activation()
	<< dropout_layer(5*5*16, 0.3)         //Dropout
	<< conv(5, 5, 5, 16, 120)
	<< activation()
	<< dropout_layer(120, 0.3)            //Dropout
	<< fc(120, 10)
	<< softmax();

//Batch norm
nn << conv(32, 32, 5, 1, 6)
	<< max_pool(28, 28, 6, 2)
	<< activation()
	<< batch_normalization_layer(14*14, 6) //Batch norm
	<< conv(14, 14, 5, 6, 16)
	<< max_pool(10, 10, 16, 2)
	<< activation()
	<< batch_normalization_layer(5*5, 16)  //Batch norm
	<< conv(5, 5, 5, 16, 120)
	<< activation()
	<< batch_normalization_layer(120, 1)   //Batch norm
	<< fc(120, 10)
	<< softmax();

//kwinner
nn << conv(32, 32, 5, 1, 6)
	<< max_pool(28, 28, 6, 2)
	<< activation()
	<< conv(14, 14, 5, 6, 16)
	<< max_pool(10, 10, 16, 2)
	<< activation()
	<< conv(5, 5, 5, 16, 120)
	<< activation()
	<< kwinner_layer({120}, 0.4) //Kwinner
	<< fc(120, 10)
	<< softmax();

In all the tests, the networks are trained for 10 epochs with the Adam optimizer and the multi-class cross-entropy loss function.
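
For reference, the training call in tiny-dnn looks roughly like this (the batch size and the empty callbacks are my choice; only the optimizer, loss and epoch count are stated above):

tiny_dnn::adam opt;
//train_images / train_labels are the MNIST training set, loaded elsewhere
nn.train<tiny_dnn::cross_entropy_multiclass>(
	opt, train_images, train_labels,
	/*batch_size=*/32, /*epoch=*/10,
	[](){}, [](){}); //per-minibatch and per-epoch callbacks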

Here are the results. WOW, k-winner is amazing. It beats almost every other regularization method and can still sustain roughly 50% accuracy when 70% of the image is corrupted (set to random values). It is truly amazing.

Noise     Raw       Dropout   Batch Norm   k-winner   (accuracy in %)
0         98.17     95.73     93.08        95.81
0.05      97.72     95.26     92.15        95.09
0.1       96.74     94.15     91.81        94.08
0.2       93.48     92.06     88.56        92.07
0.3       87.31     86.48     80.63        88.53
0.4       77.19     78.55     65.13        83.2
0.5       63.35     67.84     47.05        76.17
0.6       45.56     55.61     28.36        65.7
0.7       29.93     41.78     15.00        51.79
0.8       18.49     28.21     10.67        34.82

Up to a noise level of 0.2 the raw network actually outperforms every regularized version; from 0.3 onward k-winner takes the lead. Among the regularized methods, k-winner has the highest accuracy in every row except at 0.05 and 0.1 noise, where Dropout edges it out.

Conclusion

I think k-winner is truly amazing. It beats the common Dropout and Batch Norm methods and gives the network better noise resistance without ever training on noisy images.

If you have read the paper by this point (which I doubt), you’ll notice I didn’t implement the kwinner algorithm exactly as the paper describes. In the paper, the gradient is the sign of the sum of the not-discarded gradients. I simply pass the surviving gradients through without calculating the sign. This (seemingly) boosts the performance greatly (96% with my method vs 77% with the paper’s method). Or maybe the sign calculation is beneficial to other parts of the paper.
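
Concretely, the difference in the backward pass looks roughly like this (the paper’s variant is sketched from my reading of it, so take it with a grain of salt):

//Inside back_propagation, for one sample; s is prev_delta[sample]

//My variant (used above): pass each surviving gradient straight through
for(size_t i = 0; i < num_on_cells_; i++) {
	size_t idx = indices_[sample][i];
	s[idx] += curr_delta[sample][idx];
}

//Alternatively, the paper's rule as I understand it: propagate only
//the sign of the sum of the surviving gradients
float sum = 0;
for(size_t i = 0; i < num_on_cells_; i++)
	sum += curr_delta[sample][indices_[sample][i]];
float sign = (sum > 0.f) ? 1.f : (sum < 0.f ? -1.f : 0.f);
for(size_t i = 0; i < num_on_cells_; i++)
	s[indices_[sample][i]] = sign;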

Also… I think I’ll switch to PyTorch next time. Although I love tiny-dnn, it is really slow for a DNN library in 2019 (8s per epoch on an R7 1800, really?) and its development has basically stopped. 😦

The source code is available on GitHub for everyone interested.
