Thursday 31 January 2019

Adventures in "wow that was unusual"

Astronomy, sales prediction, subscription signups… what do they have in common?

There are all sorts of situations where we want a simple summary of interesting events in data. It's really useful to be able to say "we had three bursts of sales activity: one just before Mothers' Day and the other two around the time we started advertising". If you only have a handful of products you can just eyeball this, but the challenge comes when you have a lot of products to keep track of.
Likewise, when did we get a surprising number of sign-ups? These are often the most informative signs of a company hitting a real need.
This post describes a really simple way of doing this that is surprisingly effective. The notebook I've taken this from is
Anyway, I hit this problem doing some astronomy lately. I was looking at a signal from a radio telescope which had been observing potential planetary nebulae in the Large Magellanic Cloud. Here a "peak" in the signal corresponds to some element floating around in the shrapnel of the planetary nebula's shell after it has exploded. For example, there should be a spike around 5900, which corresponds to the radio wavelength of hydrogen. Doppler gets in the way, though: the LMC is running away from us, so hydrogen shows up nearly but not quite at 5900. I wanted a technique that could find these peaks automatically and calculate their flux.
I want to get from the left picture (what came out of the WiFeS system) to the right-hand one, and get a list of those peaks so that I can analyse them.

So fire up your Jupyter notebooks, and here we go. Starting off like any other data science project...
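(The opening cells aren't reproduced in this post, so this is my reconstruction of the usual preamble rather than the original cell.)

```python
# The standard data-science opening moves (a reconstruction; the
# original notebook cell isn't shown in the post).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```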
That last cell won't frighten many data scientists, but the next one is a bit astronomy-specific. FITS (Flexible Image Transport System) is a common astronomical format.

You can find out about the file with some fits functions...
There are other functions for extracting the metadata (e.g. fits.getheader(image_file)) so that you know what star you are observing.
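In case you don't have a FITS file handy, here's a self-contained sketch using astropy; the stand-in file and the header values (CRVAL1, CDELT1) are invented for illustration, not taken from the original notebook:

```python
import numpy as np
from astropy.io import fits

# Build a tiny stand-in FITS file so this sketch runs anywhere;
# image_file would normally be the spectrograph's output.
image_file = "example_spectrum.fits"
hdu = fits.PrimaryHDU(data=np.zeros(16, dtype=np.float32))
hdu.header["CRVAL1"] = 5000.0   # invented starting frequency
hdu.header["CDELT1"] = 0.77     # invented step between samples
hdu.writeto(image_file, overwrite=True)

fits.info(image_file)                # prints one line per HDU in the file
header = fits.getheader(image_file)  # the metadata block
print(header["CRVAL1"], header["CDELT1"])
```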

But ultimately we just want the data. It comes out as a numpy array.
There is one data sample for each frequency that the telescope observed at.
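astropy's fits.getdata is the usual shortcut for this; again with an invented stand-in file so the sketch is self-contained:

```python
import numpy as np
from astropy.io import fits

# Stand-in file; in the post this is the telescope's FITS output.
image_file = "example_spectrum.fits"
fits.PrimaryHDU(data=np.arange(8, dtype=np.float32)).writeto(image_file, overwrite=True)

data = fits.getdata(image_file)   # just the data, no metadata
print(type(data))                 # a plain numpy ndarray
```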

Life would be better if it came out as a pandas Series, but it doesn't, so we need to construct our own index. In the FITS file, CRVAL1 gives the starting frequency and CDELT1 gives the step between successive samples.
I could do this more efficiently with numpy.arange rather than a loop, but we only have a few thousand elements to create in this index, so performance doesn't matter much.
Which means we can now create a pandas Series, and have a look at the data.
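Roughly like this, with invented header values standing in for the real CRVAL1 and CDELT1 (and the numpy.arange shortcut rather than a loop):

```python
import numpy as np
import pandas as pd

# Invented header values; the real ones come from the FITS header.
crval1 = 5000.0   # starting frequency
cdelt1 = 0.77     # step between samples
data = np.random.default_rng(0).normal(size=4000)  # stand-in for fits.getdata(...)

# numpy.arange builds the whole frequency index in one shot.
index = crval1 + cdelt1 * np.arange(len(data))

spectrum = pd.Series(data, index=index)
spectrum.plot()   # have a look at the data
```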

Whenever I see a data set, I like to run a fitter over it. This is a brute-force trial of a very large number of distributions to see what it matches most closely.
If you're trying this at home, don't worry if these next two cells don't work for you. We don't need them. Also fitter-1.1.10 seems to have a compilation bug, so I had to pip install fitter==1.0.9 .
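If fitter won't install for you at all, here's a rough sketch of the same idea using scipy directly (fitter brute-forces a far longer list of scipy distributions and ranks them; this tries just three, on invented stand-in data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(size=2000)   # stand-in for the spectrum values

results = {}
for name in ["norm", "cauchy", "gamma"]:
    dist = getattr(stats, name)
    params = dist.fit(data)
    # Kolmogorov-Smirnov statistic: smaller means a closer match.
    results[name] = stats.kstest(data, name, args=params).statistic

best = min(results, key=results.get)
print(best, results)
```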

Summary: this data set looks like no earthly distribution. Which is appropriate really for a dataset that started on its journey 160,000 years ago. 😀
Just as a reminder, it's definitely not a gaussian, so gaussian rules of thumb like "3 standard deviations away from the mean is a less-than-1% event" don't apply here.
It's entirely co-incidental that the dotted line looks like it belongs there as a divider:

There's actually a nasty little mistake there that could come back to bite us later.
The mean and standard deviation were calculated over the whole dataset: background noise and spikes together. That's not good, because the spikes drag the mean and standard deviation much higher than they "should" be if the peaks were excluded.
So let's take a tentative shot at wiping out things that are likely to be spikes and then recalculating the mean, standard deviation and distribution.
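On synthetic data (the seed and injected spikes are invented), that recalculation looks something like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# A stand-in spectrum: gaussian background noise plus a few big spikes.
spectrum = pd.Series(rng.normal(0.0, 1.0, 4000))
spectrum.iloc[[500, 1200, 2900]] += 50.0

# First pass: the spikes inflate both statistics.
mean, std = spectrum.mean(), spectrum.std()

# Tentatively wipe out anything that looks like a spike, then recalculate.
clipped = spectrum[spectrum < mean + 3 * std]
mean2, std2 = clipped.mean(), clipped.std()
print(mean, std, "became", mean2, std2)
```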
It's hard to argue that the line around 5890 isn't some kind of peak, substantially different from the data around it!
Maybe we should do it again, and keep going until fitter says that we have something gaussian (or some other standard distribution).
But let's just stop here since we've converged pretty quickly already. The remainder aren't going to add much.
Let's go back to the data and drop anything that's below this new threshold.
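Something like this, again on an invented toy spectrum:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
spectrum = pd.Series(rng.normal(0.0, 1.0, 1000))
spectrum.iloc[400:403] += 30.0   # an injected "peak"

# Recalculate the statistics with the obvious spikes excluded, as above.
background = spectrum[spectrum < spectrum.mean() + 3 * spectrum.std()]
threshold = background.mean() + 3 * background.std()

# Zero out everything below the new threshold so only peaks survive.
peaks_only = spectrum.where(spectrum > threshold, 0.0)
print(int((peaks_only > 0).sum()), "samples above threshold")
```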

They look like very fine lines, but zoom in on the range 6200 to 6400 and it's not quite that simple.

So we need to associate a "peak number" with any continuous sequence of non-zero measurements.
Here's how I did it (there are plenty of other approaches).
These are all the initial conditions. When we start at the lowest frequency, we're not in a peak, we haven't seen a peak yet (so we have no "most recently seen frequency"), and there hasn't been a strongest measurement yet. So the peaks dictionary is empty.
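Putting those initial conditions and the loop together on a toy spectrum (the variable names are my guesses, not necessarily the notebook's):

```python
import numpy as np
import pandas as pd

# A toy spectrum that is zero except for two continuous runs of
# non-zero measurements.
values = np.zeros(100)
values[10:13] = [2.0, 5.0, 3.0]   # first peak
values[40:42] = [1.5, 4.0]        # second peak
spectrum = pd.Series(values, index=5000 + 0.77 * np.arange(100))

# Initial conditions: not in a peak, no peaks seen yet, so the
# peaks dictionary starts empty.
peaks = {}            # peak number: [start freq, end freq, strongest flux]
in_peak = False
peak_number = 0

# Walk from the lowest frequency, grouping each continuous run of
# non-zero measurements under one peak number.
for freq, flux in spectrum.items():
    if flux > 0:
        if not in_peak:
            peak_number += 1                        # a new peak starts here
            peaks[peak_number] = [freq, freq, flux]
            in_peak = True
        else:
            peaks[peak_number][1] = freq            # extend the current peak
            peaks[peak_number][2] = max(peaks[peak_number][2], flux)
    else:
        in_peak = False

print(peaks)
```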