Layer-wise decorrelation in deep-layered artificial neuronal networks

The most commonly used deep networks are purely feed-forward nets. The input is passed to layers 1, 2, 3, then at some point to the final layer (which can be 10, 100 or even 1000 layers away from the input). Each of these layers contains neurons that are activated differently by different inputs. Whereas activation patterns in earlier layers might reflect the similarity of the inputs, activation patterns in later layers mirror the similarity of the outputs. For example, a picture of an orange and a picture of a yellowish desert are similar in the input space, but very different with respect to the output of the network. But I want to know what happens in between. What does the transition look like? And how can this transition be quantified?

To answer this question, I’ve performed a very simple analysis by comparing the activation patterns of each layer for a large set of different inputs. To compare the activations, I simply used the correlation coefficient between activations for each pair of inputs.

As the network, I used a deep network (GoogleNet) that is pretrained on the ImageNet dataset to distinguish ca. 1000 different output categories. Here are the network’s top five outputs for some example input images from the ImageNet dataset:


Each of these images produces a 1008-element activation vector (for 1008 different possible output categories). In total, I compared the output vectors of each pair out of 500 input images, resulting in a 500×500 correlation matrix. (The computationally most costly part is not running the network to get the activations, but computing the correlations.)
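As a minimal sketch of this step, here is roughly what the computation looks like in Python/NumPy; the variable names and the placeholder data are my own, not the actual code I used:

```python
import numpy as np

# Placeholder for the real data: one 1008-element output vector per image,
# stacked into a (500, 1008) array (the variable name is made up).
output_activations = np.random.rand(500, 1008)

# np.corrcoef treats each row as one variable, so this directly yields the
# 500x500 matrix of pairwise correlations between images.
corr_matrix = np.corrcoef(output_activations)
print(corr_matrix.shape)   # (500, 500)
```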

Then I performed the same analysis not for the activations of the output layer, but for those of each intermediate layer and also of the input layer. The correlation matrix of the input layer simply measures the pixel-wise similarity between each pair of input images. Here are some of those layers, clustered in input space:


The structure obtained by clustering the inputs’ correlation matrix is clearly not maintained in higher layers. Most likely, the input layer similarity reflects features like overall color of the images, and it makes sense that this information is not so crucial for later classification.

When clustering the same correlation matrices with respect to the outputs’ correlation matrix, the inputs seem to be more or less uncorrelated, but there is some structure also in the intermediate layers:


Now, to quantify how different images evoke different or similar activation patterns in different layers, I simply computed the correlation between each pair of the above correlation matrices. In the resulting matrix, entries are high if the two corresponding layers exhibit a similar correlational structure. By correlational structure, I mean that e.g. inputs A and B evoke highly correlated activations, whereas inputs C and D evoke anti-correlated activations, etc.
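Again a minimal sketch of this second-level comparison, with placeholder data and example layer names: each layer’s 500×500 correlation matrix is flattened (using only the upper triangle, to avoid the trivial diagonal) and then correlated with every other layer’s matrix.

```python
import numpy as np

def matrix_similarity(corr_a, corr_b):
    """Correlate the upper triangles of two correlation matrices."""
    iu = np.triu_indices_from(corr_a, k=1)   # exclude the trivial diagonal
    return np.corrcoef(corr_a[iu], corr_b[iu])[0, 1]

# Placeholder: in the real analysis these are the 500x500 correlation matrices
# computed per layer as above (layer names here are just examples).
layer_corrs = {name: np.corrcoef(np.random.rand(500, 64))
               for name in ['input', 'mixed4b', 'mixed5a', 'output']}

names = list(layer_corrs)
layer_similarity = np.array([[matrix_similarity(layer_corrs[a], layer_corrs[b])
                              for b in names] for a in names])
print(np.round(layer_similarity, 2))
```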

Here is the result for the main layers of the network (the layers are described in more detail in the GoogleNet paper). This is the most interesting part of this blog post:


Let’s have a close look. First, layers that are closer in the hierarchy are more similar and therefore show higher values in the matrix. Second, the similarity structure of the input space gets lost very quickly. However, decorrelation in the later convolutional layers (from mixed4b to mixed5a) slows down somewhat, and is stronger again in the transition between mixed5a and mixed5b.

Most prominent, however, is the sharp decrease of correlation when going from the second-to-last layer (avgpool0) to the output layer. To my naive mind, this sharp decrease means that the weights connecting the last two layers have a disproportionately large effect on the final classification. One possible explanation for this effect is the strong reduction of the number of “neurons” from the second-to-last to the output layer. Another explanation is that learning in the earlier layers is simply not very efficient, i.e., not very helpful for the final task of classification.

In both cases, my intuition tells me that it would be much better to decorrelate the representations smoothly and with constant increments over layers, instead of having a sharp decorrelation step in the final layer; and maybe an analysis like the one above would be helpful for designing networks that compress the available information in an optimized way.

A sidenote: The layers above are only the main layers of the GoogleNet. However, as described in Fig. 2b of the original paper, many of those layers are inception modules, basically parallel 1×1, 3×3, 5×5 and MaxPool units whose outputs are concatenated. Here’s the same analysis as above, but only for the input, the output and the two inception modules mixed4c and mixed4d (black borders).


Interestingly, the 5×5 convolutional units (dashed grey borders) stand out as being not very similar to the adjacent 1×1, 3×3 or MaxPool units, and also less similar to the final classification (output) than the smaller units of the same inception module. From this, it seems likely to me that the 5×5 units of the inception modules are of little importance, and it could be worth trying to set up the network as it is, but without the 5×5 convolutional units in the inception modules. However, it does not follow strictly from my analysis that the 5×5 units are useless – maybe they carry some information for the output that no other inception subunit can carry, and although this is not enough to make their representation similar to the output, it could still be a useful additional piece of information.

Another sidenote: While writing this blog post, I realized how difficult it is to write about “correlation matrices of correlation matrices”. Sometimes I replaced “correlations” with “similarities” to make it more intuitive, but I still have the feeling that a hierarchy of several correlations is somehow difficult to grasp intuitively. (For mathematics, that’s easy.)

Yet another sidenote: It would be interesting to see how the decorrelation matrix shown above changes over learning. There is some work that uses decorrelation similar to what I’ve used here as a part of the loss function to prevent overfitting (Cogswell et al., 2016). But their analysis mostly focuses on the classification accuracies, instead of what would be more interesting, the activation patterns and the resulting decorrelation in the network. What a pity!

Along similar lines, I think it might be a mistake to check only the change of the loss function during learning and the test set performance afterwards, while disregarding some obvious statistics: correlations between activations in response to different inputs across layers (that’s what I did here), but also the sparseness of activation patterns or the distributions of weights – and how all of this changes over learning. For example, the original GoogleNet paper (link) motivates its choice of architecture by aiming at a sparsely activated network, but it only reports the final performance, and not the sparseness of the activation patterns.

So far, I have not seen a lot of effort going in this direction (maybe because I do not know the best keywords to search for?). A recent paper by Raghu et al., NIPS (2017), which I only encountered after doing the analysis shown above, seems to partially fill this gap. It is mainly designed for a more challenging task, comparing the similarity of activation patterns across different networks. Basically, it reduces the neuronal activation patterns to lower-dimensional ones and compares them afterwards in this reduced subspace. But it can also be used to compare the similarity of activation patterns for one single layer, across time during learning and/or across input space.

I think that this sort of analysis, if applied routinely by neuronal network designers, would contribute a lot to making neuronal networks more transparent, and I hope to see much more research going in this direction!


Understanding style transfer

‘Style transfer’ is a method based on deep networks which extracts the style of a painting or picture in order to transfer it to a second picture. For example, the style of a butterfly image (left) is transferred to the picture of a forest (middle; pictures by myself, style transfer with


Early on I was intrigued by these results: How is it possible to clearly separate ‘style’ and ‘content’ and mix them together as if they were independent channels? The seminal paper by Gatys et al., 2015 (link) referred to a mathematically defined optimization loss which was, however, not really self-explanatory. In this blog post, I will try to convey the intuitive step-by-step understanding that I was missing in the paper myself.

The resulting image (right) must satisfy two constraints: content (forest) and style (butterfly). The ‘content’ is well represented by the location-specific activations of the different layers of the deep network. For the ‘style’, Gatys et al. suggest calculating the joint activation patterns, i.e., correlations between the activation patterns of different feature maps. These correlation matrices are, mathematically speaking, the Gram matrices of the network layers. This means that the Gram matrices of the butterfly image (left) and of the resulting style-transferred image (right) are optimized to be as similar as possible. But what does this Gram matrix actually mean? Why does this method work?
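To make the definition concrete, here is a minimal NumPy sketch (my own illustration, not the code of Gatys et al.): the activations of one layer, of shape (height, width, channels), are flattened over space, and the Gram matrix is the matrix of inner products between all pairs of feature maps, normalized here by the number of spatial positions.

```python
import numpy as np

def gram_matrix(features):
    """features: activations of one layer, shape (height, width, channels)."""
    h, w, c = features.shape
    flat = features.reshape(h * w, c)     # each column is one flattened feature map
    return flat.T @ flat / (h * w)        # (channels, channels), averaged over positions

layer_activations = np.random.rand(28, 28, 256)   # placeholder activations
G = gram_matrix(layer_activations)
print(G.shape)   # (256, 256)
```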

A slightly better understanding comes with an earlier paper of Gatys et al. on texture synthesis (link). From there it becomes clear that the Gram matrix does not appear from nowhere but is inspired by comparably old-fashioned texture synthesis papers, especially by Portilla and Simoncelli, 2000 (link). This paper deals with the statistical properties that define what humans consistently perceive as the same texture, and the key word here is ‘joint statistics’. More precisely, they argue that it is not sufficient to look at the distributions of single features (like edginess or other simple filters), but at the joint occurrence of features. This could be high spatial frequencies (feature 1) co-occurring with horizontal edges (feature 2). Or small spirals (feature 1) co-occurring with a blue color (feature 2). Co-occurrences can be intuitively quantified by the spatial correlation between each pair of feature maps, since correlations are simply a measure of similarity between two (or more) things. As an important side-effect of the inner product underlying the correlation, the Gram matrix is invariant to the positions of features, which makes sense in the context of textures.

On a sidenote, Portilla and Simoncelli are not the first to have had a close look at joint statistics of textures. This is going back at least to Béla Julesz (1962), who conjectured that two images with the same second order statistics (= joint statistics) have textures that are indistinguishable for humans. (Later, he disproved his own conjecture based on counterexamples, but the idea of using joint statistics for texture synthesis remained useful.)

In old-school texture synthesis, features were handcrafted and carefully selected. When working with deep networks, features are much more numerous: they are simply the activation patterns of the layers. Each layer of a deep network consists not only of a single representation or feature map, but of many of them (up to 100s or 1000s). Some are locally activated by edges, others by colors or parallel lines, etc. For the visualizations shown below, I’ve set up a Jupyter notebook to make it as transparent as possible. All of it is based on a GoogLeNet, pre-trained on the ImageNet dataset. Here are the feature maps (= activation patterns) of four input pictures (four columns). Green indicates high activation, blue low activation.


For the textures of cracked concrete, the feature maps 4 and 5 (second and third rows) are very similar to each other (correlated) and feature map 15 is highly dissimilar (anti-correlated). Feature map 15 seems to have learned to detect large, bright and smooth surfaces like clouds. Therefore, the Gram matrix entry for the feature pair [4,5] will be consistently high for input images of cracked concrete, but low for cloud images. These are only a few examples, but I think they make it pretty clear why correlations of feature maps are a better indicator of a texture than the simple mean activation of single feature maps.

To complement this analysis, I generated an input, shown in the right-most column, that optimizes the activation of the respective feature map (see below for an explanation of how I did this). Whereas feature maps 4 and 5 show edges and high-frequency structures, feature map 15 seems to prefer smooth textures.

Next, let’s have a look at what a full-blown Gram matrix looks like! But which layer should I choose for this analysis? I’m using a variant of the Inception network/GoogLeNet here, which seems to be a little less well-suited for style transfer than the VGG network that is typically used. To find a layer that is indicative of style, I fed in 20 images of cloud textures and 20 images of cracked concrete. Then I computed the confusion matrix of the Gram matrices for each layer, allowing me to find the layer that optimally distinguishes these two textures (it is layer ‘mixed4c_3x3_pre_relu/conv’; more details are in the Jupyter notebook). As inputs, I used greyscale images to prevent the color channels from dominating the similarity measurements.
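Roughly, the layer selection can be sketched like this (a simplified illustration with made-up function names and random placeholder Gram matrices; the exact criterion I used is in the notebook): for each candidate layer, compare the within-class similarity of the Gram matrices to the between-class similarity and keep the layer with the largest margin.

```python
import numpy as np

def gram_similarity(g1, g2):
    """Correlation between the flattened upper triangles of two Gram matrices."""
    iu = np.triu_indices_from(g1)
    return np.corrcoef(g1[iu], g2[iu])[0, 1]

def layer_separability(cloud_grams, concrete_grams):
    """Mean within-class minus mean between-class Gram similarity for one layer."""
    within = [gram_similarity(a, b)
              for grams in (cloud_grams, concrete_grams)
              for i, a in enumerate(grams)
              for b in grams[i + 1:]]
    between = [gram_similarity(a, b) for a in cloud_grams for b in concrete_grams]
    return np.mean(within) - np.mean(between)

# Placeholder Gram matrices for one candidate layer (20 images per texture class).
cloud_grams = [np.corrcoef(np.random.rand(256, 50)) for _ in range(20)]
concrete_grams = [np.corrcoef(np.random.rand(256, 50)) for _ in range(20)]
print(layer_separability(cloud_grams, concrete_grams))
```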


For the 16 inputs above, here come the 16 corresponding 256×256 Gram matrices of the chosen layer, arising from 256 feature maps. To clarify the presentation, I have rearranged the features in the matrices to highlight the clustering. The x- and y-axes of each matrix can be interpreted as the features of this layer, and the clustering highlights some of the feature similarities.


From that, it is quite clear that all cloud pictures display similar Gram matrices. The lower two rows with pictures of cracked concrete exhibit a more or less common pattern as well, which in turn is distinct from the cloudy Gram matrices.

As is clearly visible, the feature space is rather large. Therefore, since the contribution of single features is small, it does not make sense to look e.g. at a single feature pair that is highly correlated for clouds and anti-correlated for cracked concrete. Instead, let’s reduce the complexity and have a look at the clusters shown above.

To understand what those clusters of feature maps are encoding, I used the deep dream technique, based on a Jupyter notebook by Alexander Mordvintsev. Basically, it uses gradient ascent on the input image to compute an input that evokes high activity in the respective feature maps. This yields the following deep dreams, starting from random noise. The feature maps whose activation has been optimized correspond to the clusters 4, 5 and 7 shown above in the Gram matrices (yellow highlights).
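In spirit, the optimization looks like the sketch below. To be clear about the assumptions: I use Keras’ InceptionV3 as a stand-in model (not the GoogLeNet graph used for the actual figures), the layer name ‘mixed4’ and the channel indices are arbitrary examples, and all refinements of the original notebook (octaves, jitter, smoothing) are left out.

```python
import tensorflow as tf

# Stand-in model (assumption): Keras' InceptionV3 instead of the GoogLeNet graph
# used for the figures; 'mixed4' is one of its inception modules.
base = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
feature_model = tf.keras.Model(base.input, base.get_layer('mixed4').output)

channels = [4, 5]                                   # hypothetical feature maps to maximize
img = tf.Variable(tf.random.uniform((1, 299, 299, 3), -1.0, 1.0))  # start from random noise

for step in range(100):
    with tf.GradientTape() as tape:
        activations = feature_model(img)            # shape (1, h, w, n_channels)
        loss = tf.reduce_mean(tf.gather(activations, channels, axis=-1))
    grad = tape.gradient(loss, img)
    grad /= tf.math.reduce_std(grad) + 1e-8         # normalize the gradient
    img.assign_add(0.01 * grad)                     # gradient *ascent* on the input image
```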


Cluster 4 clearly prefers smooth and cloudy inputs, whereas cluster 5 likes smaller tiles, separated by dark edges. However, it is difficult to say what the network makes out of those features. First, they will interact with other feature maps. Second, the Gram matrix analysis does not tell us whether the feature map clusters are active at all for a given input, or at which locations. Third, as mentioned before, textures are not determined by patterns of feature activations, but by correlations of feature activations.

So let’s go one step further and modify the deep dream algorithm to maximize the correlational structure within a cluster of features in a layer, instead of the simple activations of the features. Below is the result for cluster 4, with the deep dream maximizing either the activity in this cluster of features (left) or the correlational structure across features within this cluster (right).
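One way to write such a modified objective is sketched below (again with the stand-in model from the previous sketch and made-up cluster indices; this is my reading of ‘maximizing the correlational structure’, namely the sum of the off-diagonal Gram entries within the cluster):

```python
import tensorflow as tf

cluster = [12, 47, 103, 200]     # hypothetical indices of the feature maps in one cluster

def cluster_gram_loss(activations, cluster):
    """Sum of the off-diagonal Gram entries among the cluster's feature maps."""
    feats = tf.gather(activations[0], cluster, axis=-1)   # (h, w, k)
    flat = tf.reshape(feats, (-1, len(cluster)))          # (h*w, k)
    gram = tf.matmul(flat, flat, transpose_a=True)        # (k, k) Gram matrix
    return tf.reduce_sum(gram) - tf.reduce_sum(tf.linalg.diag_part(gram))

# The optimization loop is the same as in the previous sketch, with
# loss = cluster_gram_loss(feature_model(img), cluster).
```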


The result is, maybe surprisingly, not very informative. It suggests that the texture of clouds is not localized in a single cluster of the Gram matrix (which is what the right-hand image optimizes for), but distributed across the full spectrum of features, and probably also across several layers.

Together, the analysis so far has shown what Gram matrices look like, how they cluster and how these clusters can be interpreted. However, the complexity and the distributed nature of the computations in the network make it very difficult to intuitively understand what is going on and to predict what would happen in specific layers, feature maps or Gram matrices when the network is exposed to a given input picture.

To sum it up, correlated features (= Gram matrices) can be used to compare the textures of two images and can be employed by a loss function to measure texture similarity. This works both for texture synthesis and style transfer. As a byproduct, the correlation matrix of feature maps, the Gram matrix, can be used to understand how the feature space is divided up by a bunch of clusters of similarly tuned channels. If you want to play around with this, my Jupyter notebook on Github could be a good starting point.

An interesting aspect is the fact that joint statistics – a somewhat arbitrary and empirical measurement – are sufficient to generate textures that seem natural to humans. Would it not be a good idea for the human brain, when it comes to texture instead of object recognition, to read out the correlated activity of ‘feature neurons’ with the same receptive field and then simply average over all receptive fields? The target neurons that read out co-active feature neurons would thus see something like the Gram matrix of the feature activations. There is already work in experimental and theoretical neuroscience that goes somewhat in this direction (Okazawa et al., 2014, link, short summary here).

For further reading, I can recommend Li et al., 2017 (link), who reframe the Gram matrix method by describing it as a Maximum Mean Discrepancy (MMD) minimization with a specific kernel. In addition, they show that other kernels are also useful to measure distances between feature distributions, thereby generalizing the style transfer method. (On the other hand, this paper did not really improve my intuitive understanding of style transfer.)
For an overview of implementations of the style transfer method, there is a nice and recent review by Ying et al., 2017 (link). It is not really well-written, but very informative and concise.


Can two-photon scanning be too fast?

The following back-of-the-envelope calculations do not lead to any useful result, but you might be interested in reading through them if you want to get a better understanding of what happens during two-photon excitation microscopy.

The basic idea of two-photon microscopy is to direct so many photons onto a single confined location in the sample that two photons interact with a fluorophore roughly at the same time, leading to fluorescence. The confinement in time seems to be given by the duration of the laser pulse (ca. 50-500 fs). The confinement in space is in the best case given by the resolution limit (let’s say ca. 0.3 μm in xy and 1 μm in z).

However, since the laser beam is moving around, I wondered whether this might influence the excitation efficiency (spoiler: not really). I thought that this would be the case if the scanning speed in the sample were so high that the fs pulse would be smeared out over a distance greater than the lateral beam size (0.3 μm FWHM).

For normal 8 kHz resonant scanning, the maximum speed (at the center of the FOV) times the temporal pulse width is, assuming a large FOV (1 mm) and a laser pulse that is strongly dispersed through optics and tissue (FWHM = 500 fs):

Δx1 = vmax × Δt = 1 mm × π × 8 kHz × 500 fs ≈ 0.01 nm

This is clearly below the critical limit. Is there anything faster? AOD scanning can run at 100 kHz (reference), although it cannot really scan a 1 mm FOV. TAG lenses are used as scanning devices for two-photon point scanning (reference) and for two-photon light-sheet microscopes (reference). They run at up to 1000 kHz sinusoidally. This scanning is performed in the low-resolution direction (z) and usually covers only a few hundred microns, but even if it were to cover 1 mm, the spatial spread of the laser pulse would be

Δx1 = 1 mm × π × 1000 kHz × 500 fs ≈ 1.6 nm

This is already in the range of the size of a typical genetically expressed fluorophore (ca. 2 nm or a bit more for GFP), but clearly less than the resolution limit.

However, even if the infrared pulse were smeared over a couple of micrometers, the excitation efficiency would still not be decreased in reality. Why is this so? It can be explained by the requirement that the two photons arriving at the fluorophore have to be absorbed almost ‘simultaneously’. I was unable to find much data on ‘how simultaneous’ this must be, but the interaction window in time seems to be something like Δt < 1 fs (reference). What does this mean? It reduces the relevant Δx to a fraction of the above results:

Δx2 = 1 mm × π × 1000 kHz × 1 fs ≈ 0.003 nm
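For the record, here is a small Python sanity check of these three numbers, assuming sinusoidal scanning with a peak velocity of vmax = π × FOV × f at the center of the FOV:

```python
import math

def pulse_smear(fov_m, line_rate_hz, window_s):
    """Spatial smear of a time window at the fastest point of a sinusoidal line scan."""
    v_max = math.pi * fov_m * line_rate_hz   # peak scan velocity [m/s]
    return v_max * window_s

print(pulse_smear(1e-3, 8e3, 500e-15))   # resonant scanner, 500 fs pulse:    ~1.3e-11 m (0.01 nm)
print(pulse_smear(1e-3, 1e6, 500e-15))   # TAG lens, 500 fs pulse:            ~1.6e-09 m (1.6 nm)
print(pulse_smear(1e-3, 1e6, 1e-15))     # TAG lens, 1 fs interaction window: ~3.1e-12 m (0.003 nm)
```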

Therefore, smearing of the physical laser pulses (Δx1) does not really matter. What matters is the smearing of the temporal interaction window Δt over a spatial distance larger than the resolution limit (Δx2). This, however, would require a line scanning frequency in the GHz range – which will never, ever happen. The scan rate must always be significantly lower than the repetition rate of the pulsed excitation, and the repetition rate in turn is limited to <500 MHz by fluorescence lifetimes of >1-3 ns. Case closed.


The basis of feature spaces in deep networks

In a new article on Distill, Olah et al. write up a very readable and useful summary of methods to look into the black box of deep networks by feature visualization. I had already spent some time with this topic before (link), but this review pointed me to a couple of interesting aspects that I had not noticed before. In the following, I will write about one aspect of the article in more depth: whether a deep network encodes features on a neuronal basis or rather on a distributed, network-wide basis.

‘Feature visualization’ as discussed here means optimizing the input pattern (the image that is fed into the network) such that it maximizes the activity of a selected neuron somewhere in the network. The article discusses strategies to prevent this maximization process from generating non-naturalistic images (“regularization” techniques). As a sidenote, however, the authors also ask what happens when one optimizes the input image not for a single neuron’s activity, but for the joint activity of two or more neurons.


Joint optimization of the activity of two neurons. From Olah et al., Distill (2017) / CC BY 4.0.

Supported by some examples, and pointing at some other examples collected before by Szegedy et al., they write:

Individual neurons are the basis directions of activation space, and it is not clear that these should be any more special than any other direction.

It is a somewhat natural thought that individual neurons are the basis of coding/activation space, and that any linear combination could be used for coding just as well as any single-neuron-based representation/activation. In linear algebra, it is obvious that a rotation of the basis that spans the coding space does not change anything about the processes and transformations that take place in this space.

However, this picture breaks down when switching from linear algebra to non-linear transformations, and deep networks are by construction highly non-linear. My intuition would be that the non-linear transformation of inputs (especially by rectifying units) sparsens activity patterns with increasing depth – thereby localizing the activations to fewer and fewer neurons – even without any sparseness constraint during weight learning. This does not necessarily mean that the preferred input images of random directions in activation space would be meaningless; but it would predict that the activation patterns of to-be-classified inputs are not pointing into random directions of activation space, but have an activation direction that prefers the ‘physical’, neuronal basis.

I think that this can be tested more or less directly by analyzing the distributions of activation patterns across layers. If activation patterns were distributed, i.e., pointing into random directions, the distribution would be rather flat across the activation units of each layer. If, on the other hand, activation directions were aligned with the neuronal basis, the distribution would be rather skewed and sparse.

Probably this needs more thorough testing than I’m able to do by myself, but for starters I used the Inception network, trained on the ImageNet dataset, and I used this Python script on the Tensorflow Github page as a starting point. To test the network activation, I automatically downloaded the first ~200 image hits on Google for 100×100 JPGs of “animal picture”, fed them into the network and observed the activation pattern statistics across layers. I uploaded a Jupyter notebook with all the code and some sample pictures to Github.

The result is that activation patterns are sparse and tend to become sparser with increasing depth. The distributions are dominated by zero activations, indicating a net input less than or equal to zero. I have excluded the zeros from the histograms and instead give the percentage of non-zero activations as text in the respective histogram. The y-axis of each histogram is in log scale.
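The statistic itself is trivial to compute; here is a minimal sketch with placeholder activations (the full pipeline is in the linked notebook):

```python
import numpy as np

def nonzero_fraction(activations):
    """Fraction of units with non-zero activation - a simple sparseness proxy."""
    return np.count_nonzero(activations) / activations.size

# Placeholder activations: dict {layer_name: (n_images, n_units) array of ReLU outputs}.
layer_activations = {'mixed_3': np.maximum(np.random.randn(200, 4096), 0),
                     'mixed_7': np.maximum(np.random.randn(200, 2048) - 1, 0)}

for name, acts in layer_activations.items():
    print(f'{name}: {100 * nonzero_fraction(acts):.1f}% non-zero activations')
```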


It is also interesting that the fraction of non-zero activations decreases with depth, but reaches a minimum at a certain level (here from ‘mixed_7’ until ‘mixed_9’ – the mixed layers are inception modules), and the activity becomes less sparse again afterwards when approaching the (small) output layer.

A simple analysis (correlation between activation patterns stemming from different input images) shows that de-correlation (red), that is, a decrease of correlation between activations by different input images, is accompanied by sparsening of the activation levels (blue):


It is a bit strange that network layers 2, 4 and 6 generate sparser activation patterns than the respective previous layers (1, 3 and 5), accompanied by less decorrelated activity. It would be interesting to analyze this correlational structure in more depth. For example, I’d be curious to look at the activation patterns of inputs that lead to the same categorization in the output layer, and to see from which layer on they start to exhibit correlated activations.

Of course there is a large body of literature in neuroscience, especially theoretical neuroscience, that discusses local, sparse or distributed codes and the advantages and disadvantages that come with them. For example, according to theoretical work by Kanerva, sparseness in memory systems helps to prevent different memories from interfering too much with each other, although it is still unclear whether something similar is implemented in biological systems (you will find many experimental papers with evidence in favor of and against it, often for the same brain area). If you would like to read more about sparse and dense codes, Scholarpedia is a good starting point.


All-optical entirely passive laser scanning with MHz rates

Is it possible to let a laser beam scan over an angle without moving any mechanical parts to deflect the beam? It is. One strategy is to use a very short-pulsed laser beam: a short pulse width means a finite spectral width of the laser (→ Heisenberg). A dispersive element like a grating can then be used to automatically diffract the beam into smaller beamlets, which in turn can somehow be used to scan or de-scan an object. This technique is called dispersive Fourier transformation, although there seem to be different names for only slightly different methods. (I have no experience in this field and am not aware of the current state of the art, but I found this short introductory review useful as a primer.)

Recently, I stumbled over an article that describes a similar scanning technique, but without dispersing the beam spectrally: Multi-MHz laser-scanning single-cell fluorescence microscopy by spatiotemporally encoded virtual source array. First I didn’t believe this could be possible, but apparently it is. In simple words, the authors of the study have designed a device that uses a single laser pulse as an input and outputs several laser pulses, separated in time and with different propagation directions – which is scanning.

Wu et al. from the University of Hong Kong describe their technique in more detail in an earlier paper in Light: Science & Applications, and in even more detail in its supplementary information, which I found especially interesting. At first, it looked like a Fabry-Pérot interferometer to me, but it is actually completely different and is not even based on wave optics.

The idea is to shoot an optically converging pulsed beam (e.g. coming from an ultra-fast Ti:Sa laser) into an area that is bounded by two mirrors that are almost parallel, but slightly misaligned by an angle α < 1°. The authors call these two misaligned mirrors a ‘FACED device’. Due to the misalignment, the beam is reflected multiple times, but comes back once it hits the surface orthogonally (see e.g. the black light path below). Therefore, the continuous range of incidence angles is automatically translated into a discrete set of mini-pulses coming out of the device, because a given part of the beam gets reflected either 14 times or 15 times – obviously, there is no such thing as 14.5 reflections, at least in ray optics. This difference of one reflection makes the 15-reflection beam spend an additional Δt ≈ 2S/c in the device, with S being the separation of the two mirrors and c the speed of light.

It took me some time to understand how this works and what these pulselets coming out of the FACED device look like, but I have to admit that I find it really cool. The schematic drawings in the supplementary information, especially figures S1 and S5, are very helpful for understanding what is going on.

Schematic drawing (adapted) from Wu et al., LS&A (2016) / CC BY 4.0.

As the authors note (without showing any experiments), this approach could be used for multi-photon imaging as well. It is probably true that there are some hidden difficulties and finite-size effects that make an implementation of this scanning technique challenging in practice, but let’s imagine for a minute how this could look.

Ideally, we want laser pulses that are spaced with a temporal distance of the fluorescence lifetime (ca. 3 ns) in order to prevent temporal crosstalk during detection. This would require the two FACED mirrors to be spaced by roughly S ≈ 50 cm, according to the formula mentioned above. Next, we want to resolve, say, 250 points along this fast-scanning axis, which means that the FACED device would need to split the original pulse into 250 delayed pulselets. The input pulsed beam therefore would need to have a pulse repetition rate of ca. 1.3 MHz (which is then also the line scanning frequency), and each of those pulses would need enough power for the whole line scan.
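A quick numerical sanity check of these two numbers, using Δt ≈ 2S/c from above:

```python
c = 3e8                  # speed of light [m/s]
pulse_spacing = 3e-9     # desired temporal spacing, roughly one fluorescence lifetime [s]
n_points = 250           # resolved points along the fast axis

S = c * pulse_spacing / 2                     # mirror separation so that 2S/c = 3 ns
line_rate = 1 / (n_points * pulse_spacing)    # required pulse repetition rate = line rate

print(f'mirror separation S = {S * 100:.0f} cm')      # 45 cm, i.e. roughly the 50 cm above
print(f'line scan rate = {line_rate / 1e6:.2f} MHz')  # 1.33 MHz
```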

How long would the FACED mirrors need to be? This is difficult to answer, since it depends on the divergence angle of the input pulsed beam that hits the FACED device, but I would guess that they would need to be a couple of meters long, given the spacing of the mirrors (ca. 50 cm) and the high number of desired pulselets (250). (In a more modest scenario, one could envision splitting each pulse of an 80 MHz laser into only 4 pulselets, thereby multiplexing additional regular scanning similar to approaches described before.)

However, I would also ask myself whether the created beamlets are not too dispersed in time, thereby precluding the two-photon effect. And I also wonder how all of this behaves when transitioning from geometric rays to wave optics. Complex things might happen in this regime. Certainly a lot of work is required to move this from an optical table to a biologist’s microscope, but I hope that somebody accepts this challenge and maybe, maybe replaces the kHz scanners of typical multi-photon microscopes with a device that achieves MHz scanning in a couple of years.


The most interesting machine learning AMAs on Reddit

It is very clear that Reddit is part of the rather wild zone of the internet. But especially for practical questions, Reddit can be very useful, and even more so for anything connected to the internet or computer technology, like machine learning.

In the machine learning subreddit, there is a series of very nice AMAs (Ask Me Anything) with several of the most prominent machine learning experts (with a bias towards deep learning). To me, as somebody who is not working directly in the field but is nevertheless curious about what is going on, it is interesting to read these experts talking about machine learning in a less formal environment, sometimes also ranting about misconceptions or misdirected research attention.

Here are my top picks, starting with the ones I found most interesting to read:

  • Yann LeCun, director of Facebook AI research, is not a fan of ‘cute math’.
  • Jürgen Schmidhuber, AI researcher in Munich and Lugano, finds it obvious that ‘art and science and music are driven by the same basic principle’ (which is ‘compression’).
  • Michael Jordan, machine learning researcher at Berkeley, takes an opportunity ‘to exhibit [his] personal incoherence’ and describes his interest in Natural Language Processing (NLP).
  • Geoffrey Hinton, machine learning researcher at Google and Toronto, thinks that the ‘pooling operation used in convolutional neural networks is a big mistake’.
  • Yoshua Bengio, researcher at Montreal, suggests that the ‘subject most relevant to machine learning’ is ‘understanding how learning proceeds in brains’.

And if you want more of that, you can go on with Andrew Ng and Adam Coates from Baidu AI, or Nando de Freitas, a scientist at Deepmind and Oxford. Or just discover the machine learning subreddit yourself.


P.S. If you think that there might be similarly interesting AMAs with top neuroscientists: No, there aren’t.


How deconvolution of calcium data degrades with noise

How does the noisiness of recorded calcium data affect the performance of spike-inferring deconvolution algorithms? I cannot offer a rigorous treatment of this question, but I can offer some intuitive examples. The short answer: if a calcium transient is not visible at all in the calcium data, the deconvolution will miss the transient as well. It seems that once the signal-to-noise ratio drops below 0.5-0.7, the deconvolution quickly degrades.

To make this a bit more quantitative, I used an algorithm based on convolutional networks (developed by Stephan Gerhard and myself; you can find it on Github, and it’s described here) and a small part of the Allen Brain Observatory dataset.

I assumed that the standard deviation of the raw calcium trace measures the ‘signal’ (a reasonable approximation), and I took the standard deviation of the Gaussian noise that I added on top as the ‘noise’. Then I deconvolved both the noisified and the unchanged calcium traces and computed the correlation between the spiking traces inferred from calcium+noise vs. calcium alone. If the correlation (y-axis) is high, the performance of the algorithm is not much affected by the noise. The curve drops steeply at an SNR of 0.5-0.7.
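Here is a hedged sketch of the procedure with synthetic data; the deconvolution function below is a toy stand-in (a thresholded derivative), not the convolutional network actually used for the figures:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic toy data: Poisson spikes convolved with an exponential calcium kernel.
spikes = rng.poisson(0.05, size=2000).astype(float)
kernel = np.exp(-np.arange(100) / 20.0)
calcium = np.convolve(spikes, kernel)[:2000]

def deconvolve(trace):
    """Toy stand-in for the actual conv-net spike inference: non-negative derivative."""
    return np.maximum(np.diff(trace, prepend=trace[0]), 0)

baseline = deconvolve(calcium)
for noise_sd in [0.1, 0.3, 0.6, 1.2]:
    noisy = calcium + rng.normal(0, noise_sd, size=calcium.shape)
    snr = np.std(calcium) / noise_sd          # 'signal' := std of the raw trace
    r = np.corrcoef(deconvolve(noisy), baseline)[0, 1]
    print(f'SNR {snr:.2f} -> correlation to noise-free deconvolution: {r:.2f}')
```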


To get some intuition, here are some examples: left, the calcium trace plus Gaussian noise; right, the deconvolved spiking probabilities (the numbers to the left indicate the SNR and the correlation to ground truth, respectively):


The next example was perturbed with the same absolute amount of noise, but due to the larger signal, the spike inference remained largely unaffected for all but the highest noise levels.


The obvious thing to note is the following: when transients are no longer visible in the calcium trace, they disappear from the deconvolved traces as well. I’d also like to note that both calcium time series in the examples above are from the same mouse, the same recording, and even the same imaging plane, but their SNR differs considerably. Therefore, lumping together neurons from the same recording but of different recording quality mixes different levels of detected detail. An alternative would be to set an SNR threshold for the neurons to be included – depending on the precision required by the respective analysis.
