The motivation behind this project was two-fold:
1) I wanted to learn how to write MapReduce-like functions for Apache Spark. Spark allows for fast processing of large amounts of data using clusters of computers.
2) I wanted the WNYC team to like me
For this project, I pulled two datasets from Project Gutenberg. After initially getting blocked by the Project Gutenberg server for "hacking around" too much, I downloaded 92 of the most popular books from their service. Additionally, I found a ~5.5 GB file of Project Gutenberg books on Reddit.
First, I imported the necessary libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt; plt.rcdefaults()
from matplotlib.ticker import FuncFormatter
import codecs
from IPython.display import YouTubeVideo
from nltk import word_tokenize,sent_tokenize
For this question, I used the Natural Language Toolkit (NLTK). NLTK has two functions, 1) word_tokenize and 2) sent_tokenize, that split a piece of text into words and sentences, respectively.
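As a quick illustration of what those tokenizers return (the sample sentence is just an example, not from my data, and NLTK's 'punkt' models need to be downloaded once beforehand):
from nltk import sent_tokenize, word_tokenize
# import nltk; nltk.download('punkt')  # one-time download of the sentence tokenizer models
sample = "Call me Ishmael. Some years ago, I went to sea."
print(sent_tokenize(sample))      # ['Call me Ishmael.', 'Some years ago, I went to sea.']
print(word_tokenize(sample)[:4])  # ['Call', 'me', 'Ishmael', '.']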
To find an answer, I mapped each book to two values: 1) its total number of words and 2) its average sentence length.
with codecs.open('Desktop/manybooks.txt', "r", encoding='utf-8', errors='ignore') as guten_92_file:
    allbooks = guten_92_file.read().strip().replace('\n','').replace('\u201c', '').replace('\r', '')
books = allbooks.split("*** START OF THIS PROJECT")
titles = []
with codecs.open('Desktop/manybooks.txt', "r", encoding='utf-8', errors='ignore') as guten_92_file:
    for line in guten_92_file.readlines():
        if "Title: " in line and "_Title: " not in line:
            titles.append(line.replace('Title: ', '') \
                              .replace('\r', '') \
                              .replace('\n', '') \
                              .encode('ascii', errors='ignore'))
for title in titles[:10]:
    print title
I wanted to use Spark's map function, but first I needed a way to split each book into sentences and count the words in each sentence. I think there is another way to do this using only Spark commands: flatMap to split the entire text into books, flatMap again to split each book into sentences, assign a meaningful key to each sentence, and then reduceByKey to get the average sentence length per book; a rough sketch of that approach follows.
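Here is an untested sketch of what that pure-Spark version might look like (sentence_lengths, books_rdd, and avg_by_book are names I made up for illustration; it assumes the books list and the SparkContext sc from above):
def sentence_lengths(indexed_book):
    # flatMap helper: emit one (book_index, sentence_length_in_words) pair per sentence
    index, book = indexed_book
    return [(index, len(sentence.split(" "))) for sentence in sent_tokenize(book)]

books_rdd = sc.parallelize(list(enumerate(books[1:-1])))
avg_by_book = books_rdd.flatMap(sentence_lengths) \
                       .mapValues(lambda length: (length, 1)) \
                       .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
                       .mapValues(lambda total_and_count: total_and_count[0] / float(total_and_count[1]))
avg_by_book.take(5)
For this post, though, I stuck with a small Python helper and a single map: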
def avg_sent_len(book):
    sent_token_list = sent_tokenize(book)
    lengths = 0.0
    for sent_token in sent_token_list:
        lengths += len(sent_token.split(" "))
    return lengths/len(sent_token_list)
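As a quick sanity check on a made-up two-sentence string (2 words and 3 words should average to 2.5):
avg_sent_len("One two. Three four five.")  # returns 2.5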
ninety_two_books = sc.parallelize(books[1:-1])
book_sentences = ninety_two_books.map(lambda book: (len(word_tokenize(book)), avg_sent_len(book)))
length_and_avg_sent = book_sentences.collect()
book_sentences.take(5)
Now that we have the length (in number of words) and the average sentence length, we can plot the results:
length_in_words = [avg_sent_tup[0] for avg_sent_tup in length_and_avg_sent]
avg_sent_len = [avg_sent_tup[1] for avg_sent_tup in length_and_avg_sent]
colors = (0,0,0)
plt.figure(figsize=[8,8])
plt.scatter(length_in_words, avg_sent_len, s=np.pi*5, c="b", alpha=0.6)
plt.title('Book Length vs Average Sentence Length')
plt.xlabel('Length (in number of words)')
plt.ylabel('Average Sentence Length')
plt.show()
We find only a very small correlation between the two - c'est la vie.
avg_sent_len_df = pd.DataFrame()
avg_sent_len_df['length_in_words'] = length_in_words
avg_sent_len_df['avg_sent_len'] = avg_sent_len
avg_sent_len_df.corr()
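For what it's worth, .corr() computes the Pearson correlation by default; if you wanted a rank-based check instead, the same call takes a method argument:
avg_sent_len_df.corr(method='spearman')  # Spearman rank correlation instead of the default Pearson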
Zipf's Law describes the frequency with which words appear in a corpus of language. As Wikipedia puts it: "the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc."
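To make that concrete with made-up numbers: if the top-ranked word appeared 600 times in some sample, an ideally Zipfian corpus would have the second-ranked word appear about 300 times, the third about 200, and so on:
# Idealized Zipfian counts: the k-th ranked word appears roughly 1/k as often as the top word.
top_word_count = 600  # arbitrary example value, not taken from my data
for rank in range(1, 6):
    print("rank %d: ~%d occurrences" % (rank, top_word_count // rank))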
I tested whether my two datasets follow this pattern by 1) gathering the word counts using Spark and 2) determining whether the word frequencies follow the Zipf equation.
For a more in-depth explanation of Zipf's Law:
YouTubeVideo('fCn8zs912OE', width=853, height=480, start=2)
ninety_two_books = sc.textFile('Desktop/allbooks.txt')
words_and_counts = ninety_two_books.map(lambda x: x.replace('?',' ') \
                                                   .replace(',',' ') \
                                                   .replace('.',' ') \
                                                   .replace('-',' ') \
                                                   .lower()) \
                                   .flatMap(lambda x: x.split()) \
                                   .map(lambda x: (x, 1)) \
                                   .reduceByKey(lambda x, y: x + y) \
                                   .map(lambda x: (x[1], x[0])) \
                                   .sortByKey(False)
total_count = ninety_two_books.flatMap(lambda line: line.split(" ")) \
                              .map(lambda word: 1) \
                              .reduce(lambda a, b: a + b)
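As an aside, I believe the same total could be had a little more directly with the count() action (total_count_alt is just an illustrative name):
total_count_alt = ninety_two_books.flatMap(lambda line: line.split(" ")).count()  # count the flattened words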
top_fifteen = words_and_counts.take(15)
floats_and_words = []
for count_and_word in top_fifteen:
    floats_and_words.append((count_and_word[0]/float(total_count), \
                             count_and_word[1].encode('ascii', errors="ignore")))
words_rank = [word[1] for word in floats_and_words]  # already ASCII-encoded above
freq = [word[0] for word in floats_and_words]
top_92_ind = np.arange(15)
top_92_width = 0.6
figure, axis = plt.subplots()
top_92_bar = axis.bar(top_92_ind, freq, top_92_width, color='blue', alpha=0.7)
axis.set_ylabel('Percentage Frequency')
axis.set_xlabel('Word Ranking')
axis.set_title('Word Frequency from 92 PG Books')
axis.set_xticks(top_92_ind + .1 / 2)
axis.set_xticklabels(words_rank)
axis.set_yticklabels(['{:2.2f}%'.format(x*100) for x in freq])
axis.yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
plt.show()
I also ran this analysis on my "mega book" - the 5.5 GB text file of roughly 13,000 Project Gutenberg books. Following are the results:
mega_book = sc.textFile('Desktop/megabook.txt')
mega_book_counts = mega_book.flatMap(lambda line: line.split(" ")) \
                            .map(lambda word: 1) \
                            .reduce(lambda a, b: a + b)
mega_book_word_counts = mega_book.map(lambda x: x.replace('?',' ') \
                                                 .replace(',',' ') \
                                                 .replace('.',' ') \
                                                 .replace('-',' ') \
                                                 .lower()) \
                                 .flatMap(lambda x: x.split()) \
                                 .map(lambda x: (x, 1)) \
                                 .reduceByKey(lambda x, y: x + y) \
                                 .map(lambda x: (x[1], x[0])) \
                                 .sortByKey(False)
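Since this is the exact same cleaning-and-counting chain I ran on the 92-book file, it could arguably be factored into a small helper shared by both datasets; a rough sketch (word_frequencies is a name I'm inventing here):
def word_frequencies(text_rdd):
    # Strip basic punctuation, lowercase, then count words and sort by count (descending).
    return (text_rdd
            .map(lambda x: x.replace('?', ' ')
                            .replace(',', ' ')
                            .replace('.', ' ')
                            .replace('-', ' ')
                            .lower())
            .flatMap(lambda x: x.split())
            .map(lambda x: (x, 1))
            .reduceByKey(lambda x, y: x + y)
            .map(lambda x: (x[1], x[0]))
            .sortByKey(False))

# e.g. mega_book_word_counts = word_frequencies(mega_book)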
top_fifteen_mega_book = mega_book_word_counts.take(15)
floats_and_words_mega_book = []
for word_tuple in top_fifteen_mega_book:
    floats_and_words_mega_book.append((word_tuple[0]/float(mega_book_counts), \
                                       word_tuple[1].encode('ascii', errors="ignore")))
mega_book_word_rank = [word[1] for word in floats_and_words_mega_book]  # already ASCII-encoded above
mega_book_freq = [word[0] for word in floats_and_words_mega_book]
mega_book_ind = np.arange(15)
mega_book_width = 0.6
figure, axis = plt.subplots()
mega_book_bar = axis.bar(mega_book_ind, mega_book_freq, mega_book_width, color='blue', alpha=0.7)
axis.set_ylabel('Percentage Frequency')
axis.set_xlabel('Word Ranking')
axis.set_title('Word Frequency from Mega Book')
axis.set_xticks(mega_book_ind + .1 / 2)
axis.set_xticklabels(mega_book_word_rank)
axis.set_yticklabels(['{:2.2f}%'.format(x*100) for x in mega_book_freq])
axis.yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
plt.show()
According to Wikipedia, given the Zipf distribution, one can calculate how frequently a word is expected to appear given its rank.
The equation is:
\begin{equation*} f(k; s, N) = \frac{1/k^s}{\sum_{n=1}^{N} (1/n^s)} \end{equation*}
where k is the word's rank in the frequency table, s is the exponent characterizing the distribution (classic Zipf's law uses s = 1, which is what I use below), and N is the number of elements being ranked (here, the number of words in the vocabulary).
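To make the formula concrete, take s = 1 and a toy vocabulary of only N = 4 words (numbers made up for illustration):
\begin{equation*} f(1; 1, 4) = \frac{1/1}{1/1 + 1/2 + 1/3 + 1/4} = \frac{1}{25/12} = 0.48 \end{equation*}
so the top-ranked word would be expected to account for 48% of all word occurrences in that toy corpus, the second-ranked word half of that (24%), and so on.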
With this equation, we can see how closely our findings match up with the expected values from Zipf's law:
zipf_expected = []
# The divisor is the same for every rank, so compute it once.
# There are ~171,500 words in the English language, so N is 171,500.
expected_frequency_divisor = 0
for n in xrange(1, 171500):
    expected_frequency_divisor += 1/float(n)
for rank in range(1, 16):
    expected_frequency_dividend = 1/float(rank)
    zipf_expected.append(expected_frequency_dividend/expected_frequency_divisor)
ind = np.arange(15)
width = 0.2
fig, ax = plt.subplots()
expected = ax.bar(ind, zipf_expected, width, color='gray')
top_92 = ax.bar(ind + width, freq, width, color='blue')
mega_book = ax.bar(ind + width + width, mega_book_freq, width, color='red')
ax.set_ylabel('Percentage Frequency')
ax.set_xlabel('Word Ranking')
ax.set_title('Word Frequency, Expected vs. Top 92 vs MegaBook')
ax.set_xticks(ind + width + width / 2)
ax.set_xticklabels(('1st', '2nd', '3rd', '4th', '5th', '6th', '7th', 'etc'))
ax.legend((expected[0], top_92[0], mega_book[0]), ('Expected', 'Top 92', 'MegaBook'))
plt.show()