I recently found a data set of US UFO sightings. Over the next few weeks, I will explore this data set and post my findings on this blog. I know what you are thinking: 'umm, why UFOs?'. That's a fair question. I don't have a big interest in UFO stories.
It's just a great data set (grab it yourself from infochimps)! With over 60 000 observations, it covers when and where the UFO was sighted, the duration of the sighting, and a brief statement describing the event.
I'm not going to try and authenticate any of the sightings; I'll leave that to the 'experts'. But I am going to try and answer a few questions. Where are UFOs most frequently sighted? Are UFOs seen more in summer or winter? Are they seen more frequently after an alien movie is released? I also have a few more ideas that I'll hold onto for now. (If you have any ideas, post them in the comments, and we can see if they're feasible.)
I'll answer these questions over the coming weeks, and then put all of the key findings into an easy-to-read graphic (not very economist of me!). I'll then start on another project, which I already have a few ideas for.
Now for the analysis. For my first post, I give you a word cloud made from the most frequently used words in the UFO sighting reports (Figure 1). Word clouds are great. They look fantastic (this one is even in the shape of a UFO!), and they are informative: bigger words mean more mentions. This one will form the centrepiece of my easy-to-read graphic.
Figure 1: Word cloud
The most surprising finding from the analysis so far is how rare completely 'out there' statements are. There are no mentions of 'abductions', 'probes', 'aliens', or 'little green men'. The most common words are 'object', 'light(s)', 'sky' and 'saw', all words that could be used to describe the sighting of a standard or experimental aircraft rather than an alien aircraft. But it's still far too early to jump to any conclusions.
There are, however, a few interesting words that come up fairly frequently. The words 'large', 'big' and 'huge' appear in around a third of statements, about double the frequency of 'small'. In describing shape, 'triangle' is mentioned about twice as often as 'circle', and 'saucer' barely garners a mention. 'White' is the most commonly mentioned colour, followed by 'red', 'green' and 'black'. 'Hovering' is mentioned in about 10 per cent of reports.
Next post, I will put up some heat maps showing where the most UFOs are seen.
Technical stuff
I made the word cloud with R's tm library and Tagul. Using R's tm library, I cleaned the text by removing punctuation, numbers, very small words, and certain stopwords. I then used Tagul to make the actual word cloud. R has a wordcloud function, but it can't be used to make complex shapes like a UFO. After I have a look at some of the other data, I'll probably come back to the text analysis.
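
For anyone who wants to try the same kind of cleaning, here is a minimal sketch of the steps described above. It is written in plain Python rather than R's tm, so it is a stand-in and not the code behind Figure 1. The stopword list, the three-letter cut-off for 'very small words', the function names and the sample reports are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stopword list for illustration; tm ships its own, much longer one.
STOPWORDS = {"the", "and", "was", "were", "that", "this", "with", "from", "then"}
MIN_LENGTH = 3  # assumed cut-off for "very small words"


def clean(report):
    """Lowercase a report, strip punctuation and numbers, drop short words and stopwords."""
    # Anything that is not a lowercase letter or whitespace becomes a space.
    text = re.sub(r"[^a-z\s]", " ", report.lower())
    return [w for w in text.split() if len(w) >= MIN_LENGTH and w not in STOPWORDS]


def word_counts(reports):
    """Total mentions of each word; in a word cloud, bigger counts mean bigger words."""
    counts = Counter()
    for report in reports:
        counts.update(clean(report))
    return counts


def share_mentioning(reports, words):
    """Fraction of reports that mention at least one of the given words."""
    wanted = set(words)
    hits = sum(1 for report in reports if wanted & set(clean(report)))
    return hits / len(reports)


# Made-up reports for illustration only; they are not taken from the data set.
reports = [
    "Saw 3 bright lights in the sky, hovering.",
    "A large triangle object was hovering over the lake!",
    "Bright white light, then it was gone.",
]
counts = word_counts(reports)
share_large = share_mentioning(reports, {"large", "big", "huge"})
print(counts.most_common(3))
print(share_large)
```

The word counts are what you would feed into a word cloud tool, and share_mentioning gives the 'mentioned in about X per cent of reports' style of figure. Note that this sketch treats 'light' and 'lights' as different words; merging them would need an extra stemming step.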
