It has been a while since I touched upon the topic of sparse neural networks. Let’s finally get to the second part. Please make sure you have read my last post about KWinner and sparse NNs; otherwise you’ll be confused.

## Boosting

Boosting is the method HTM uses to keep common features in the input signal from being over-represented during the Spatial Pooling process. The method has since been adapted to sparse networks to regularize them and suppress over-activating neurons.

Wait… what do you mean by over-activating neurons? Shouldn’t more information be better? Yes, but… let me explain. Assume you have a vector of size 100 representing the activity feeding into a kwinner layer, and the kwinner layer is set to 4% density. After kwinner, the resulting vector should have 4 non-zero elements and 96 zeros. That’s how kwinner forces the network to only look at important features. **But** what if one (or more) elements get stuck at a very high value? Oops, you just lost a degree of freedom and your network’s performance suffers. What boosting does is encourage rarely active cells to become active while discouraging overly active cells over time.
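The selection step itself can be sketched in a few lines. This is a hypothetical standalone helper (not the tiny-dnn layer we build later): keep the k largest activations and zero everything else.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal k-winner sketch: keep the k largest activations, zero the rest.
std::vector<float> kwinner(const std::vector<float>& x, size_t k) {
    // Order indices so the k largest values come first.
    std::vector<size_t> idx(x.size());
    for (size_t i = 0; i < idx.size(); i++) idx[i] = i;
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](size_t a, size_t b) { return x[a] > x[b]; });

    std::vector<float> out(x.size(), 0.f);
    for (size_t i = 0; i < k; i++) out[idx[i]] = x[idx[i]];
    return out;
}
```

With a 100-element vector and k = 4 (4% density), exactly 4 entries survive.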

Boosting works by storing each cell’s average active frequency (the inverse of how long, on average, it takes a cell to become active) and multiplying the input value by a function of that frequency. It is like an activation layer, but instead of being a function of x alone, it is a function of x and x’s past history.

Assume we have the cell’s value `x`, the average active frequency `a`, the expected activation frequency `t`, and the boost factor `b` (how aggressive the boosting should be). We can calculate the new value after boosting, `x'`, as `x' = x*exp((t-a)*b)`. With t=0.15 and b=2.5, the boost multiplier as a function of the frequency is `f(a) = exp((0.15-a)*2.5)`. We can plot it as:

If you squint close enough, you’ll find that…

when the average active frequency equals the target frequency, you get a factor of 1. When it is less than the target frequency, the factor is larger than 1, thereby increasing the chance of the cell being selected. And when the average active frequency is larger than the target frequency, the cell is discouraged.
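The three cases can be checked directly against the boost function from above, using the post’s values t = 0.15 and b = 2.5 (the function name here is my own):

```cpp
#include <cmath>

// Boost multiplier as a function of a cell's average active frequency `a`,
// with target frequency t and aggressiveness b: f(a) = exp((t - a) * b).
float boost_factor(float a, float t = 0.15f, float b = 2.5f) {
    return std::exp((t - a) * b);
}
```

At a = 0.15 the multiplier is exactly 1; below the target it exceeds 1 (the cell is encouraged), above it the multiplier drops below 1 (the cell is discouraged).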

## Implementation

Now we can modify our previous kwinner code so it also supports boosting.

Here is the new forward_propagation function. It is almost the same as our last implementation. The only difference is that we calculate the boost factors and apply them before running kwinner.

```cpp
void forward_propagation(const std::vector<tensor_t*>& in_data,
                         std::vector<tensor_t*>& out_data) override {
    const tensor_t& in = *in_data[0];
    tensor_t& out      = *out_data[0];
    const size_t sample_count = in.size();
    if (indices_.size() < sample_count)
        indices_.resize(sample_count, std::vector<size_t>(num_on_cells_));

    vec_t boost_factors = vec_t(in[0].size(), 1);
    // Only boost when training
    if (boost_factor_ != 0 && phase_ == net_phase::train) {
        for (size_t i = 0; i < boost_factors.size(); i++) {
            float target_density    = (float)num_on_cells_ / in[0].size();
            float average_frequency = (float)count_active_[i] / num_forwards_;
            // calculate the boost factor
            boost_factors[i] = exp((target_density - average_frequency) * boost_factor_);
        }
    }

    for_i(sample_count, [&](size_t sample) {
        vec_t in_vec   = in[sample];
        vec_t& out_vec = out[sample];
        for (size_t i = 0; i < in_vec.size() && phase_ == net_phase::train; i++)
            in_vec[i] *= boost_factors[i];

        // sort so the largest (boosted) activations come first
        auto p = sort_permutation(in_vec, [](auto a, auto b) { return a > b; });
        for (size_t i = 0; i < out_vec.size(); i++)
            out_vec[i] = 0;
        for (size_t i = 0; i < num_on_cells_; i++) {
            size_t idx   = p[i];
            out_vec[idx] = in_vec[idx];
            // Increment the activation counter
            if (phase_ == net_phase::train)
                count_active_[idx]++;
        }
        std::copy(p.begin(), p.begin() + num_on_cells_, indices_[sample].begin());
    });
    // count every sample seen, so average_frequency stays a per-sample frequency
    num_forwards_ += sample_count;
}
```

And the other parts stay exactly the same! No need to change anything. (I don’t know why they didn’t also change the backprop function in the paper, but they didn’t.)

## Results

This time we’ll be using a new method to apply noise. Last time we randomly set pixels to a random value; this time we’ll be adding Gaussian noise to the entire image.

`new_image = clamp(image + factor*gaussian_noise, 0, 1);`
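The formula above can be sketched as a standalone helper (the names and the zero-mean, unit-variance noise are my assumptions; the actual implementation lives in the linked repo):

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Add zero-mean, unit-variance Gaussian noise scaled by `factor`,
// then clamp every pixel back into [0, 1].
std::vector<float> add_noise(const std::vector<float>& image, float factor,
                             std::mt19937& rng) {
    std::normal_distribution<float> gauss(0.f, 1.f);
    std::vector<float> out(image.size());
    for (size_t i = 0; i < image.size(); i++)
        out[i] = std::clamp(image[i] + factor * gauss(rng), 0.f, 1.f);
    return out;
}
```

The clamp keeps the noisy pixels valid inputs for the network; with factor = 0 the image passes through unchanged.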

noise / accuracy (%) | Raw | Dropout | Batch Norm | KWinner | KWinner+boosting
---|---|---|---|---|---
0.0 | 98.91 | 98.11 | 97.79 | 98.47 | 98.22
0.05 | 94.58 | 76.99 | 62.23 | 71.35 | 95.68
0.1 | 94.33 | 76.97 | 61.77 | 70.89 | 95.55
0.2 | 93.4 | 76.39 | 60.6 | 68.28 | 95.35
0.3 | 91.29 | 75.69 | 58.73 | 66.03 | 94.81
0.4 | 88.58 | 74.25 | 56.49 | 62.31 | 94.15
0.5 | 85.57 | 71.75 | 53.05 | 57.43 | 92.66
0.6 | 82.28 | 68.25 | 48.23 | 51.25 | 88.96
0.7 | 77.52 | 63.22 | 42.65 | 43.39 | 82.41
0.8 | 71.72 | 53.76 | 32.94 | 35.75 | 74.18

The networks are trained for 30 epochs.

This time simply using KWinner is not enough to protect against noise; it is even beaten by Dropout. But KWinner + boosting outperforms everything! I’m truly amazed.

Source code available at: https://github.com/marty1885/tiny-kwinner
