Thursday, 3 July 2014

Killer Graph - Mapamation


Today's post is a special post to celebrate World UFO day, which was on 2 July, and US Independence Day on 4 July - a day that Will Smith has ensured will forever be affiliated with aliens and spaceships.

Rather than tell a story, I'm just going to let the picture do the talking.

Picture - Reported UFO Sightings*
By location and shape
                             Sources: Infochimps; Google; Wikipedia; TomBilston
                             * Some reports excluded due to lack of information

Each figure represents the shape of UFO that was reported

*  - Light               ▲ - Triangle             ☒ - Cone                ✡ - Fireball      
▆ - Rectangle         ⊕ - Sphere               O - Disc                 ◆ - Diamond        
^ - Chevron           △ - Delta                 ● - Oval                = - Cylinder
e - Egg                   ? - Unknown or uncategorised

Each letter marker represents a select defence base

A - Edwards           B - Eglin               C - Fairchild             D - Nellis
E - Pope                 F - Lewis

Colours are random

I'd also like to encourage anybody to leave a comment if they are especially interested in me homing in on a specific region or would like me to add a marker for a certain point of interest.

Next post, let's move away from UFOs. And back to some machine learning. 


Saturday, 14 June 2014

I'll be back in two minutes - what did I miss?

So there's a UFO right in front of you: what do you do? Should you get the camera, phone a friend, or seek shelter in the bunker you bought from Danoz Direct one late night?

Well, you don't have much time to make the decision. Half of reported UFO sightings lasted for two minutes or less (Figure 1). And 90 per cent lasted for six minutes or less. 
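
As a rough illustration of how shares like these can be computed, here is a minimal Python sketch (the post's own calculations were done in R). The durations are made-up values, not the Infochimps data, and the helper name share_at_or_below is hypothetical.

```python
import statistics

# Hypothetical sighting durations in minutes (illustrative values only).
durations = [0.5, 1, 1, 2, 2, 2, 3, 5, 6, 30]

def share_at_or_below(values, threshold):
    """Fraction of values that are less than or equal to threshold."""
    return sum(v <= threshold for v in values) / len(values)

# The median is the duration that half of the sightings do not exceed.
median_minutes = statistics.median(durations)
share_two_minutes = share_at_or_below(durations, 2)
share_six_minutes = share_at_or_below(durations, 6)
```

With real data, the durations would first need to be parsed into a common unit, which is the painful part described in the technical notes.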

Figure 1

This would barely give you enough time to cook a quick snack. But it would give you plenty of time to take your phone out of your pocket and snap a quick photo. Some people did this, apparently. About 5 per cent of the statements given after reporting a UFO sighting mentioned the word camera. None of the reports mentioned guns, bunkers, or two minute noodles.
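
One way to estimate the share of statements that mention a word is sketched below in Python (the post used R). The statements are invented examples, the function name share_mentioning is hypothetical, and whole-word matching is an assumption; the post does not say how matches were counted.

```python
import re

# Hypothetical report statements (illustrative only).
statements = [
    "Bright light hovering over the lake, grabbed my camera",
    "Triangle shaped object moving fast",
    "Three red lights in the sky",
    "Saw a disc, no cameras nearby",
]

def share_mentioning(texts, word):
    """Share of texts containing `word` as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return sum(bool(pattern.search(t)) for t in texts) / len(texts)

# Whole-word matching means "cameras" does not count as a match for "camera".
camera_share = share_mentioning(statements, "camera")
```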

The length of sightings has become slightly shorter over the past two decades (Figure 2). I don't think this means too much. But it does give me an opportunity to put a moving gif on the blog. I guess if you squinted you could say that this gives some evidence that the speed of UFOs has fractionally increased in recent years, which would be consistent with a rise in the speed of military aircraft over the period. But this could be bending the data.

Figure 2

Next post, the UFO research will reach a pinnacle with the release of a killer graph. Perhaps it will be so big that I will time its release to coincide with Independence Day. A day that will forever be remembered for alien invasions.

Technical stuff

Despite the post's apparent simplicity, getting the unstructured data into a workable format was a pain. In the end, I parsed it using the tm library. But in hindsight it would have been easier and more efficient just to use R's base string functions. I made the graph of the dancing density function using the animation library, which was actually comparatively easy.


Friday, 6 June 2014

Summer has came

It's getting hot over there, so hot, I want to see some UFOs, I am getting too hot, I want to see some UFOs
Not quite Nelly

Summer is here (well, in the US; in Australia, it's winter). And along with the rise in temperature comes a rise in reported UFO sightings (Figure 1). Just like our 'day of the week' analysis, this finding is statistically significant at any reasonable level.


A keen observer would note that Figure 1 is missing some key information. Namely, that the 'summer effect' could be due to one particularly active summer, rather than a consistent pattern recurring each year. That is why we need to look at the time series of the number of reported UFO sightings (Figure 2). Glancing over this series shows clear evidence of seasonality. Decomposing it reaffirms this finding (see technical stuff). 

The time series also tells us that the number of reports has been trending up since the early 90s. And the growth in the upward trend has been faster than growth in the US population (Figure 3).


Tying all the results together reaffirms earlier findings (apologies for some repetition). People are more likely to stay out late and drink more in the summer, while the upward trend likely relates to an increase in the ease of reporting sightings, a fall in the stigma related to reporting sightings, or biases in how the data are collected. There are also some large movements that, provided the demand for UFO information remains high, I will look at later.

Next post, “Ma... Ma… there's big ole shiny fing in the sky. Will there be 'nough time to get the camera and shotgun? Or would it be best if I just got the shotgun?” We look at how long UFO sightings tend to last for.

Technical Stuff

Ideally, I would seasonally adjust the data using X-13 from the US Census Bureau. This is basically an ARIMA specially designed for seasonal adjustment. But I can't get it working on Ubuntu (this didn't help). So, for the time being, to look at the seasonal components you'll just need to make do with a decomposition (Figure 4). Simply put, this process splits the data into three components: a trend, a seasonal and an irregular.
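
To make the three components concrete, here is a minimal Python sketch of one common approach, classical additive decomposition with a centred moving average. The post does not say which decomposition method was used, so this is an assumption, and the function name decompose_additive and the monthly numbers are hypothetical.

```python
def decompose_additive(series, period=12):
    """Classical additive decomposition: series = trend + seasonal + irregular.

    Assumes an even period (such as 12 for monthly data) and at least
    two full cycles of observations.
    """
    n = len(series)
    half = period // 2
    trend = [None] * n
    # Centred moving average: period + 1 points with half weight on both ends.
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        total = sum(window) - 0.5 * (window[0] + window[-1])
        trend[t] = total / period
    # Average the detrended values for each position in the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    raw = [sum(b) / len(b) for b in buckets]
    mean_raw = sum(raw) / period
    cycle = [r - mean_raw for r in raw]  # centre so the cycle sums to zero
    seasonal = [cycle[t % period] for t in range(n)]
    # Whatever is left after removing trend and seasonal is the irregular.
    irregular = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, irregular

# Hypothetical monthly counts: a steady upward trend plus a summer peak.
pattern = [-3, -3, -2, -1, 1, 3, 5, 4, 1, -1, -2, -2]
monthly = [100 + t + pattern[t % 12] for t in range(36)]
trend, seasonal, irregular = decompose_additive(monthly)
```

The trend and irregular are undefined for the first and last six months because the moving average needs a full window on both sides.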


Saturday, 31 May 2014

"You write like a man!"


Women and men are different. To be clear, one is not better than the other. Different men and women are.
- Future chiastic proverb

I want to look at just one difference between men and women. The difference in writing styles. And most importantly, can a computer algorithm tell if a book was written by a man or a woman? Or, more interestingly, can it tell if a book was written by a woman but published under a male pseudonym?

I know academics have tackled related questions before (for example Argamon, et al. (2003)). But I'm interested in establishing the proof for myself, learning a few things, and explaining it from an approachable perspective.

These posts will not be written for computer scientists. They will be written for someone who is interested in text analysis and machine learning (I'll explain in a minute), but who is either at the very beginning of their journey in understanding these concepts, or only interested in passing. 
Prepare for buzz words: machine learning; natural language processing; big data

Machine learning is a broad family of algorithms used to make predictions by examining patterns, correlations or natural cleavages. From spam filters on email accounts, the Roomba that vacuums at least one person's house, to the auto-correct on your phone that causes you to send inappropriate texts to your mum, there are literally thousands of applications for machine learning.

Many of these algorithms have been around for decades. But as the power of computers has risen, the performance and accessibility of these algorithms have increased. These factors have stretched the study of machine learning out of the purely computer science realm, and into practically every discipline.

This success has spurred resentment. So much so that machine learning techniques are often referred to in a pejorative sense, as data mining. Some of the resentment is warranted. The patterns found by machine learning algorithms to make predictions may be 'spurious', or correlated purely by coincidence, just like how the number of films starring Nicolas Cage is correlated with the number of drownings in swimming pools (tylervigen). This can lead researchers to make catastrophic errors; though to be fair, these errors can also occur in other approaches.

Some specifics

I've put together a data set of 100 books from Project Gutenberg – 50 each by men and women. A random sample of 70 of these books will be used to train different algorithms. The other 30 books will be used to test how much the computer has 'learned'. Then, hopefully, we can branch out and see if we can correctly identify books written by women under male pseudonyms and so on.
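
The 70/30 split can be sketched as follows in Python (the project itself is in R). The book identifiers and the seed are placeholders, not the actual titles or settings used.

```python
import random

# Hypothetical labelled corpus: 50 books by men and 50 by women (titles omitted).
books = [(f"book_{i:03d}", "male" if i < 50 else "female") for i in range(100)]

rng = random.Random(2014)  # fixed seed so the split is reproducible
shuffled = books[:]        # copy so the original order is left untouched
rng.shuffle(shuffled)

# The first 70 shuffled books train the algorithms; the remaining 30 test them.
train, test = shuffled[:70], shuffled[70:]
```

A purely random split like this does not guarantee an even mix of male and female authors in each set; splitting each group separately would be one way to enforce that.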

Show me the graphs!

As is this blog's design, I'm going to dribble out one chunk of analysis at a time. To whet your appetite, below are two networks: one for the male corpus (a corpus is a bunch of unstructured data, such as, in our case, a collection of books) and one for the female corpus (Figures 1 and 2).

 Figure 1 - Interconnectedness of Words in Books Written by Men
 50 classics
* Links indicate that both words were in at least 70 per cent of books; nodes exist if the word is used at least 4000 times.
Sources: Project Gutenberg; TomBilston

 Figure 2 - Interconnectedness of Words in Books Written by Women
 50 classics
 
* Links indicate that both words were in at least 70 per cent of books; nodes exist if the word is used at least 4000 times.
Sources: Project Gutenberg; TomBilston
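
The node and link rules in the figure notes can be sketched roughly as below, in Python rather than the R used for the figures. The function name build_word_network is hypothetical, the toy books and thresholds are illustrative, and reading "both words were in at least 70 per cent of books" as "both words appear together in at least that share of books" is an assumption.

```python
from collections import Counter
from itertools import combinations

def build_word_network(books, min_count, min_share):
    """books: list of token lists. Returns (nodes, links)."""
    # A word becomes a node if it is used at least min_count times overall.
    totals = Counter(tok for book in books for tok in book)
    nodes = sorted(w for w, c in totals.items() if c >= min_count)
    vocab_per_book = [set(book) for book in books]
    links = []
    # Two nodes are linked if both appear in at least min_share of the books.
    for a, b in combinations(nodes, 2):
        together = sum(1 for vocab in vocab_per_book if a in vocab and b in vocab)
        if together / len(books) >= min_share:
            links.append((a, b))
    return nodes, links

# Toy corpus; the figures would use min_count=4000 and min_share=0.7 on 50 books.
toy_books = [
    ["the", "sea", "the", "ship"],
    ["the", "sea", "love"],
    ["the", "love", "love"],
    ["sea", "the", "ship"],
]
nodes, links = build_word_network(toy_books, min_count=3, min_share=0.7)
```
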
Next post, Summer is coming...and I promise I won't use as much jargon next post. Probably.

Technical stuff

I'm going to contain this project to R - again. The tm library will be used extensively. The networks were made using Rgraphviz. I'll try to point out some good resources as I go along. Some notable ones so far are:

Conway D. and J.M. White (2012), 'Machine Learning for Hackers', published by O'Reilly Media.

D'Auria T. (2012), 'How to Build a Document Classifier in Under 25 Minutes using R', Boston Decision

Wu et al. (2007), 'Top 10 algorithms in data mining', Springer-Verlag, London


A gender studies academic has also kindly put together a list of gender-based corpus analysis - Link.





Friday, 23 May 2014

How long should you wait before reporting a UFO sighting?

"It's a bird. It's a plane. No actually, I don't know what that is...but despite the high likelihood of being ostracised, I still want to tell people"

So, how long should you wait before reporting a UFO sighting? On the one hand, reporting the sighting immediately exposes you to criticism. After all, you'd need some pretty amazing evidence to be taken seriously. But on the other hand, anybody could say they saw a UFO 20 years ago after a few drinks one Saturday night on a holiday in North Dakota. It also may not be immediately obvious which agency is the best to report the sighting to. Come to think about it, I don't know the answer to that.

This decision process does actually appear to be at play in the data (Figure 1). In statistician-speak, the distribution of the time it takes to report a sighting is, somewhat, bimodal. About half of UFO sightings are reported within one day. Some even note that the UFO was still around as they made the report. But some wait for years, with about one in eight taking longer than 10 years to report the sighting.

Figure 1 - Delay Before Reporting UFO Sightings
Since 1990 in the United States
                                                                  Sources: Infochimps; TomBilston

Interestingly, the time taken before reporting sightings has shrunk in recent years, especially since 1995. This suggests that either: improvements in technology have made it easier to report UFO sightings quickly; the stigma associated with reporting a UFO sighting has fallen; or there are biases in how the data were collected.

Next post, Summer is coming (in the Northern Hemisphere). Should I be worried about UFOs? But first, to avoid being typecast as a conspiracy theorist blog, I'll post something completely different. It's a data science blog.

Technical stuff

Once again, all of the calculations were in R. But nothing here was overly complicated or used non-standard libraries. The radial chart above was made in ggplot2, which I'm quickly learning is the gold standard in R's graph libraries. But because you are looking here you may actually want to see the density plot of the length of time it takes to report sightings (Figure 2).

Figure 2 - Delay Before Reporting UFO Sightings
Log Density Function






Saturday, 17 May 2014

Saturday night is an alright night for UFO sightings

Saturday night is an alright night for fighting UFO sightings.

Well at least that is what the data says (Figure 1). Importantly, this finding is also statistically significant at any reasonable level. (While not essential, details on the statistical test are in the technical stuff section).

Figure 1 - Day of the Week Heat Map
Number of UFO sightings since 1990 in the US
                           
                                Sources: Infochimps; TomBilston

This Saturday effect, or more broadly weekend effect, is not overly surprising. It's unlikely that more alien spacecraft come to our planet on Saturdays. To me, it would be a huge coincidence if an alien society used the same seven-day week that we do. But people are more likely to stay up late on Saturdays, giving them more time to look into the sky. Many people would also drink and use more recreational drugs.

This effect may not be completely consistent with our previous finding of a relationship between UFO sightings and the number of air force personnel in particular states. I would expect air force activity to be relatively low on Saturdays, as more personnel probably have the night off. However, this result doesn't rule out either finding; instead, it suggests that UFO sightings are probably related to multiple factors. Just like practically all economic variables.

Next post, what is an appropriate length of time to wait before reporting a UFO sighting?

Technical Stuff
 
To work out which statistical test to use, you could look at your old text books. But that would be a waste of time. Instead, just google it and you'll probably be in luck. I did that and it turns out that 'day of the week' analysis is used (or at least misused) in police analysis.

As described in this blog, the χ² (or chi-squared) goodness-of-fit test is an appropriate test for 'day of the week' analysis. In this test, our null hypothesis is that UFO sightings are equally likely any day of the week. If we can reject this, we have sufficient evidence to suggest that UFO sightings are indeed more likely to happen on certain days - in our case Saturdays.

Using this method, our test statistic is 270.3 and our χ² critical value (99 per cent confidence level and six degrees of freedom) is 16.8. So we can easily reject the null hypothesis.
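
For readers who want to see the mechanics, the test statistic is the sum over the seven days of (observed - expected)^2 / expected, where the expected count under the null is the total divided by seven. The Python sketch below uses made-up counts, so it will not reproduce the 270.3 above; the function name chi_squared_uniform is hypothetical, and 16.81 is the standard table value for six degrees of freedom at the 99 per cent level.

```python
# Hypothetical sightings per day, Monday to Sunday (illustrative, not the real counts).
observed = [100, 100, 100, 100, 100, 170, 100]

def chi_squared_uniform(counts):
    """Goodness-of-fit statistic against equal expected counts in every category."""
    expected = sum(counts) / len(counts)
    return sum((o - expected) ** 2 / expected for o in counts)

statistic = chi_squared_uniform(observed)

# Critical value for the 99 per cent level with 7 - 1 = 6 degrees of freedom.
CRITICAL_99_DF6 = 16.81
reject_null = statistic > CRITICAL_99_DF6
```

If the statistic exceeds the critical value, the null hypothesis of equally likely days is rejected.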

I mentioned above that these sorts of tests can be misused. It's easy to think of how this could happen in crime fighting terms. Let's say the police do some analysis and find that there is a 'day of the week effect', with more crime happening on Fridays. In response, more police are put on the beat on Fridays. Then, not surprisingly, even more crimes are seen and reported on Fridays, simply because there are more police around to observe the crime or for people to report the crime to. So according to the numbers it looks like more crimes happen on Fridays. As a result, even more police are put on the beat on Fridays and these police find evidence of even more crimes, artificially amplifying the 'day of the week effect'.

Even so, while it is important to think through these 'endogeneity' problems, I really can't see a good reason why this might occur with UFO sightings. That said, if anybody sees a UFO watchers' society that only meets on Saturdays, please let me know.




Tuesday, 13 May 2014

US UFO Dataset - Heatmaps

It's heat map time! Or to be more specific choropleth map time!

Below are three maps and a scatter plot I put together over the past few days. The first map, Figure 1, shows the absolute (or unadjusted) number of UFO sightings by state. Looking at the data this way is problematic as you would expect more UFO sightings to be reported in larger states, simply because of the numbers. The second map, Figure 2, eliminates this problem by scaling UFO sightings by population. The third map and the scatter plot, Figures 3 and 4, show the relationship between UFO sightings and active duty air force personnel residing in each state. You'll find out why we look at this shortly.
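
Scaling by population amounts to a simple rate calculation, sketched here in Python (the maps themselves were made in R). The function name annual_rate_per_100k and the example figures are hypothetical, and using a single population number instead of an average over the period is a simplification.

```python
def annual_rate_per_100k(sightings, population, years):
    """Average yearly sightings per 100 000 residents."""
    return sightings / years / population * 100_000

# Hypothetical state: 1 950 sightings over 25 years and 600 000 residents.
rate = annual_rate_per_100k(1950, 600_000, 25)  # roughly 13 per 100 000 per year
```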

And for the key results...
  • 15 per cent of all reported UFO sightings since 1990 have been in California. Perhaps Californians' creativity, as the centre of movie production, 'inspires' them to see more UFOs? Or perhaps it has something to do with their acceptance of medical marijuana?
Figure 1 - UFO Sightings Since 1990*
                                                                   * Earlier observations removed; 46 347 observations                                                                  
                                                                   Sources: Infochimps; TomBilston
  • North Dakotans reported seeing more UFOs per capita than those in any other state. Indeed, in an average year, 13 out of every 100 000 North Dakotans reported seeing a UFO.
  • Thankfully, in Virginia, my next home, fewer than 1 person in 1 million people reported seeing a UFO in each year.
  • In general, people in the west or north west reported seeing the most UFOs.
Figure 2 - UFO Sightings Since 1990*
Share of average population
                                                                   * Earlier observations removed; 46 347 observations                                                                  
                                                                   Sources: Infochimps; TomBilston
  • Looking at UFO sightings as a share of active air force personnel seems to show that North Dakotans aren't that strange after all. An unusually large share of active air force personnel reside in North Dakota. This might suggest that people are seeing military aircraft and reporting them as UFOs. Or it might be that UFOs are more likely to hang around air force bases. Or air force personnel are more likely to report a UFO sighting. We really can't rule any of these out.   
  • The high number of UFO sightings per capita in the north west also seems to be related to this air force personnel effect.
  • On the other hand, it is hard to find a reason why Vermonters report seeing the most UFOs as a share of active air force personnel. Ben and Jerry's maybe? 
Figure 3 - UFO Sightings Since 1990*
Share of active air force personnel in 2013
                                                             * Earlier observations removed; 46 347 observations                                                                  
                                                             Sources: Defense Manpower Data Center; Infochimps; TomBilston

      Figure 4 - UFO Sightings vs Active Air Force Personnel
       Share of population
It would be good to look at a few more maps and correlations before finishing off this project. Maybe comparing UFO sightings to recreational drug consumption or to education levels. If these data show up I'll have a look at them in another post.

Next post, the dance floor is heating up. Let's find out which night is ladies' UFO sighting night.

Technical stuff

To put the maps together, I was tempted to use Stata or Mathematica. I've already made quite a few maps using these programs. As it turns out, however, R is pretty good for map making. I used the choroplethr library from GitHub, but the ggplot2 and maps libraries look just as good.

* An earlier version of this post incorrectly stated the number of  Vermonters and Virginians that report seeing UFOs each year. Apologies for this error.


Saturday, 10 May 2014

US UFO Dataset - WordCloud

Welcome to my first ever blog post.

I recently found a data set of US UFO sightings. Over the next few weeks, I will explore this dataset and post my findings on this blog. I know what you are thinking: 'umm, why UFOs?'. That's a fair question. I don't have a big interest in UFO stories.

It's just a great data set (grab it yourself from infochimps)! With over 60 000 observations, it covers when and where the UFO was sighted, the duration of the sighting, and a brief statement describing the event.

I'm not going to try and authenticate any of the sightings; I'll leave that to the 'experts'. But I am going to try and answer a few questions. Where are UFOs most frequently sighted? Are UFOs seen more in summer or winter? Are they seen more frequently after an alien movie is released? Along with a few more ideas I'll hold onto for now. (If you have any ideas, post them in the comments, and we can see if they're feasible).

I'll answer these questions over the coming weeks, and then put all of the key findings into an easy to read graphic (not very economist of me!). I'll then start on another project, which I already have a few ideas for.

Now for the analysis. For my first post, I give you a word cloud made from the most frequently used words in the UFO sighting reports (Figure 1). Word clouds are great. They look fantastic (this one is even in the shape of a UFO!). This will form the centrepiece of my easy to read graphic. And they are informative: bigger words mean more mentions.

Figure 1
Word Cloud

Source: Infochimps; Tagul; TomBilston

The most surprising finding from the analysis so far is how rare completely 'out there' statements are. There are no mentions of 'abductions', 'probes', 'aliens', or 'little green men'. The most common words are 'object', 'light(s)', 'sky' and 'saw'. All words that could be used to describe the sighting of a standard or experimental aircraft, rather than an alien aircraft. But it's still far too early to jump to any conclusions.

There are, however, a few interesting words that come up fairly frequently. The words 'large', 'big' and 'huge' are said in around a third of statements, about double the frequency of 'small'. In describing shape, 'triangle' is said about twice as much as 'circle', and 'saucer' barely garners a mention. 'White' is the most commonly mentioned colour, followed by 'red', 'green' and 'black'. 'Hovering' is mentioned in about 10 per cent of reports.

Next post, I will put up some heat maps showing where the most UFOs are seen.

Technical stuff

I made the wordcloud with R's tm library and Tagul. Using R's tm library, I cleaned the text by removing punctuation, numbers, very small words, and certain stopwords. Tagul is used to make the actual word cloud. R has a wordcloud function, but it can't be used to make complex shapes like a UFO. After I have a look at some of the other data, I'll probably come back to the text analysis.
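
The cleaning steps described above can be sketched in Python as follows. This is not the tm code used for the post: the function name clean_tokens, the stopword list, the minimum word length and the sample reports are all assumptions for illustration.

```python
import re
from collections import Counter

# A small illustrative stopword list; the post does not say which stopwords were removed.
STOPWORDS = {"the", "and", "was", "over", "that", "with"}

def clean_tokens(text, min_length=3):
    """Lower-case, drop punctuation and numbers, then drop short words and stopwords."""
    letters_only = re.sub(r"[^a-z\s]", " ", text.lower())
    return [w for w in letters_only.split()
            if len(w) >= min_length and w not in STOPWORDS]

# Hypothetical report statements (illustrative only).
reports = [
    "A bright light was hovering over the field at 9 pm.",
    "Saw 3 lights in the sky; the object was huge!",
]

# Word frequencies across all reports; these counts would drive the word sizes.
counts = Counter(tok for r in reports for tok in clean_tokens(r))
```

Note that this sketch treats 'light' and 'lights' as separate words; stemming would be needed to merge them.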