Women and men are different. To be clear, one is not better than the other. Different men and women are.
- Future chiastic proverb
I want to look at just one difference between men and women: the difference in writing styles. Most importantly, can a computer algorithm tell whether a book was written by a man or a woman? Or, more interestingly, can it tell whether a book was written by a woman but published under a male pseudonym?
I know academics have tackled related questions before (for example, Argamon et al. (2003)). But I'm interested in establishing the proof for myself, learning a few things, and explaining it from an approachable perspective.
These posts will not be written for computer scientists. They will be written for someone who is interested in text analysis and machine learning (I'll explain in a minute), but who is either at the very beginning of their journey in understanding these concepts or has only a passing interest.

Machine learning is a broad family of algorithms used to make predictions by examining patterns, correlations or natural cleavages. From spam filters on email accounts, the Roomba that vacuums at least one person's house, to the auto-correct on your phone that causes you to send inappropriate texts to your mum, there are literally thousands of applications for machine learning.
Many of these algorithms have been around for decades. But as the power of computers has risen, the performance and accessibility of these algorithms have increased. These factors have stretched the study of machine learning out of the purely computer science realm and into practically every discipline.
This success has spurred resentment. So much so that machine learning techniques are often referred to in a pejorative sense, as data mining. Some of the resentment is warranted. The patterns that machine learning algorithms use to make predictions may be 'spurious', that is, correlated purely by coincidence, just like how the number of films starring Nicolas Cage is correlated with the number of drownings in swimming pools (tylervigen). This can lead researchers to make catastrophic errors; though to be fair, these errors can also occur in other approaches.
Some specifics
I've put together a data set of 100 books from Project Gutenberg, 50 by men and 50 by women. A random sample of 70 of these books will be used to train different algorithms. The other 30 books will be used to test how much the computer has 'learned'. Then, hopefully, we can branch out and see if we can correctly identify books written by women under male pseudonyms, and so on.
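For the curious, here is a rough sketch of what a 70/30 split like that can look like in base R. It is only an illustration, not the code behind these posts: the `books` data frame, its column names and the seed are all assumptions I've made up for the example.

```r
# Hypothetical stand-in for the corpus index: 100 books, 50 per gender.
# The column names (id, gender) are assumptions for this sketch.
books <- data.frame(
  id     = 1:100,
  gender = rep(c("male", "female"), each = 50)
)

set.seed(42)  # arbitrary seed so the random split can be reproduced

# Draw 70 row indices at random (without replacement) for training.
train_idx <- sample(nrow(books), size = 70)

train <- books[train_idx, ]   # 70 books used to fit the algorithms
test  <- books[-train_idx, ]  # the remaining 30 books, held back for testing

# The split is purely random, so the gender balance in each set can vary.
table(train$gender)
table(test$gender)
```

Because the sample is drawn without regard to gender, the training set will not necessarily contain 35 books of each. Sampling 35 from each group separately would be one way to force that balance, if it turns out to matter.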
Show me the graphs!
As is this blog's design, I'm going to dribble out one chunk of analysis at a time.
To whet your appetite, below are two networks, one for the male corpus and one for the female corpus (Figures 1 and 2). A corpus is a bunch of unstructured data; in our case, a collection of books.
Figure 1 - Interconnectedness of Words in Books Written by Men
* Links indicate that both words were in at least 70 per cent of books; nodes exist if the word is used at least 4000 times.
Sources: Project Gutenberg; TomBilston
Figure 2 - Interconnectedness of Words in Books Written by Women
* Links indicate that both words were in at least 70 per cent of books; nodes exist if the word is used at least 4000 times.
Sources: Project Gutenberg; TomBilston
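To make the rule in the figure notes concrete, here is a small base R sketch on made-up numbers. It is not how the figures above were produced. I'm assuming a document-term matrix (books in rows, words in columns), reading 'both words were in at least 70 per cent of books' as 'the two words appear together in at least 70 per cent of books', and shrinking the 4000-use threshold to 8 so that a toy example has something to show.

```r
# Toy document-term matrix: rows are books, columns are words, cells are counts.
# All numbers are invented for illustration.
dtm <- matrix(
  c(5, 3, 6, 1,
    4, 0, 0, 1,
    6, 2, 5, 0,
    3, 4, 0, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("book", 1:4), c("said", "man", "little", "eyes"))
)

min_count <- 8    # toy-scale stand-in for the 4000-use threshold
min_share <- 0.7  # share of books in which both words must appear

# Nodes: words whose total count across all books meets the threshold.
nodes <- colnames(dtm)[colSums(dtm) >= min_count]

# 1 if a node word appears in a book at all, 0 otherwise.
present <- (dtm[, nodes, drop = FALSE] > 0) * 1

# For every pair of node words, the fraction of books containing both.
share <- crossprod(present) / nrow(dtm)

# Links: each unordered pair of distinct node words that meets the share threshold.
pairs <- which(share >= min_share & upper.tri(share), arr.ind = TRUE)
links <- data.frame(from = nodes[pairs[, "row"]], to = nodes[pairs[, "col"]])
links
```

On this toy data 'said', 'man' and 'little' qualify as nodes, but only 'said' and 'man' appear together in enough books (three out of four) to earn a link.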
Next post, Summer is coming... and I promise I won't use as much jargon next post.
Probably.
Technical stuff
I'm going to confine this project to R - again. The tm library will be used extensively. The networks were made using Rgraphviz. I'll try to point out some good resources as I go along. Some notable ones so far are:
Conway D. and J.M. White (2012), 'Machine Learning for Hackers', published by O'Reilly Media.
D'Auria T. (2012), 'How to Build a Document Classifier in Under 25 Minutes using R', Boston Decision
Wu et al. (2007), 'Top 10 algorithms in data mining', Springer-Verlag, London
A gender studies academic has also kindly put together a list of gender-based corpus analysis - Link.