This project explores headline frequency and popularity.
The dataset is a collection of headlines from the HackerNews portal gathered over the period 2006-2015.
Note: to see the final result, scroll to the end.
Let’s dive into an exploration of the headline metadata and find out:
- How does popularity correlate with headline length?
- How does popularity vary across quarterly periods?
- How does overall popularity change over time on the website?
- Which buzzphrases appear in the headlines, and how do they change over time?
Intuitively we can tell why particular articles were popular - that most probably depends on the content, relevance, and the author. Still, let’s look for some less obvious correlations and then visualize the most popular headlines!
Dataset
The dataset was compiled by Arnaud Drizard using the Hacker News API, and can be found here. The file contains 1,553,934 entries, is 171 MB uncompressed, and uses the following column titles:
id, created_at, created_at_i, author, points, url_hostname, num_comments, title
For the sake of this mission only 4 columns are kept and renamed appropriately:
- submission_time – when the story was submitted.
- upvotes – the number of upvotes the submission got.
- url – the base domain of the submission.
- headline – the headline of the submission. Users can edit this, and it doesn’t have to match the headline of the original article.
Read-in, Preprocess
Read in the data and preprocess it for text mining.
import pandas as pd
df = pd.read_csv('stories.csv')
df.columns=['id', 'submission_time', 'posix_time', 'author', 'upvotes', 'url', 'comments_num', 'headline']
df = df[['submission_time', 'upvotes', 'url', 'headline']]
df.head(3)
OK, now we can see what the data is about.
Here is the info on the dataframe.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1553933 entries, 0 to 1553932
Data columns (total 4 columns):
submission_time 1553933 non-null object
upvotes 1553933 non-null int64
url 1459198 non-null object
headline 1550599 non-null object
dtypes: int64(1), object(3)
memory usage: 47.4+ MB
It is rather large, 1.5M entries, which makes it interesting to explore.
Now let’s comb it with dropna() so that we get a clean dataset. We can afford this luxury: the dataset is rich, and filling in nulls would not add much value here.
df.dropna(inplace=True)
len(df)
1455871
Roughly a hundred thousand records were removed, which is acceptable.
Core preprocessing
For further processing the nltk library is essential; it is particularly well suited to analyzing larger datasets.
The steps are:
- lowercase each headline
- tokenize
- strip punctuation symbols
- remove stopwords
- lemmatize
- join the words back into a single string
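One practical note before running the code: nltk.word_tokenize, the stopword list, and the WordNet lemmatizer rely on NLTK data packages that may not be installed by default. If they are missing, a one-off download (a small convenience step, not part of the original pipeline) fixes it:
import nltk

# One-off downloads; safe to re-run, skipped if already present.
nltk.download('punkt')      # tokenizer models used by nltk.word_tokenize
nltk.download('stopwords')  # English stopword list
nltk.download('wordnet')    # data for WordNetLemmatizer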
import nltk

# Build the stopword set and lemmatizer once, rather than on every call,
# to speed up the processing of 1.5M headlines. A set gives fast membership checks.
stopwords = set(nltk.corpus.stopwords.words('english'))
WNlemma = nltk.WordNetLemmatizer()

def preprocess_headline(headline):
    headline = headline.lower()
    # Tokenize (smart split)
    tokens = nltk.word_tokenize(headline)
    # Strip punctuation: keep purely alphabetic tokens only
    words_tokenized_nopunct = [w for w in tokens if w.isalpha()]
    # Remove stopwords
    words_except_stop = [w for w in words_tokenized_nopunct if w not in stopwords]
    # Lemmatize (smart stemming of the words)
    words_lemmatized = [WNlemma.lemmatize(t) for t in words_except_stop]
    # Gather back into a single phrase string
    return ' '.join(words_lemmatized)
At this point, creating a copy of the original dataframe is a good idea, so that we continue the transformation on the copy!
dataset_df = df.copy()
And now apply the preprocessing to the headlines.
dataset_df['processed_headline'] = dataset_df['headline'].apply(preprocess_headline)
The processing might take a few minutes to transform 1.5M records. By the end we get nicely processed headlines in a new column.
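If a few minutes is too long, the apply can be parallelized across CPU cores. This is only a rough sketch using the standard library (it reuses the preprocess_headline function and column names above); spawning worker processes from a notebook can be finicky on some platforms, so treat it as optional.
from multiprocessing import Pool

def preprocess_in_parallel(headlines, processes=4):
    # Map preprocess_headline over the headlines using a pool of worker processes
    with Pool(processes=processes) as pool:
        return pool.map(preprocess_headline, headlines)

# dataset_df['processed_headline'] = preprocess_in_parallel(dataset_df['headline'].tolist())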
dataset_df.head()
Great, so far we have phrases consisting of normalized words: all lowercased, reduced to the same base form, and stripped of punctuation. Now we can work with the column further and count similar phrases!
Next we can extract so-called noun phrases from the headlines. Noun phrases are much better for our exploration than single words: for example, ‘steve jobs’ is a noun phrase occurring in a headline, whereas if we counted single words we would only get ‘steve’ or ‘jobs’ separately.
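A tiny illustration of what noun-phrase extraction does, on a single made-up headline (note that TextBlob needs its corpora installed, e.g. via python -m textblob.download_corpora):
from textblob import TextBlob

# Hypothetical preprocessed headline, just to show the shape of the output
sample = 'steve jobs announces new iphone'
print(TextBlob(sample).noun_phrases)  # a WordList of multi-word noun phrases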
So, let’s derive noun phrases and put them all in one huge list!
from textblob import TextBlob
noun_phrase_list = [ list(TextBlob(processed_headline).noun_phrases) for processed_headline in dataset_df['processed_headline'] ]
noun_phrase_list[:3]
[['business advice'], ['note superfish'], []]
We now have a list of lists; let’s flatten it.
noun_phrase_flat = [ item for sublist in noun_phrase_list for item in sublist ]
noun_phrase_flat[:3]
['business advice',
'note superfish',
'php uk conference diversity scholarship programme']
Looks great!
A convenient technique is to feed the list into a Counter from the collections module. A Counter is basically a dictionary specialized for counting, with handy methods such as most_common()!
from collections import Counter
counter_collection = Counter(noun_phrase_flat)
counter_collection.most_common(20)
[('show hn', 8360),
('open source', 2021),
('social medium', 1691),
('social network', 1354),
('steve job', 1296),
('big data', 1204),
('silicon valley', 951),
('small business', 686),
('new york', 658),
('combinator bookmarklet', 656),
('mobile apps', 533),
('google glass', 533),
('mobile phone', 416),
('google chrome', 408),
('mobile app', 400),
('hacker news', 391),
('new way', 388),
('bill gate', 366),
('app store', 363),
('search engine', 343)]
Cool! However, the topmost phrase is ‘show hn’, which stands for ‘Show Hacker News’ - the conventional prefix users add when showcasing their own projects. It is not a meaningful buzzphrase for our purposes, so let’s clean it out.
phrases_top_20 = counter_collection.most_common(21)[1:]
Get distribution
Let’s visualize the distribution of the top 20 most frequent phrases.
import matplotlib
import matplotlib.pyplot as plt
rslt_dist_df = pd.DataFrame(phrases_top_20, columns=('Phrase','freq')).set_index('Phrase')
matplotlib.style.use('ggplot')
bars = rslt_dist_df.plot.bar(rot=0, figsize=(16,10), width=0.8)
plt.title('Top Frequent Headline buzzPhrases for 10 years in HackerNews')
plt.xticks(rotation=40)
plt.show();
Great! We have a clear picture of buzzphrases in HackerNews headlines over 10 years!
Most frequent domains
Visualize the most frequent domains. Some addresses appear as subdomain.domain.com, which we convert to domain.com.
import re

# Convert subdomain.domain.tld -> domain.tld. Addresses ending in .co.uk are left as-is,
# since trimming them would drop the actual domain name.
def unify_domains(value):
    url = str(value)
    is_gb = re.search(r'\.co\.uk$', url)
    subdom_dom_match = re.search(r'\.(\w+\.\w{2,3})$', url)
    if subdom_dom_match and not is_gb:
        url = subdom_dom_match.group(1)
    return url

domain_series = dataset_df['url'].apply(unify_domains)
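A quick sanity check on a few hypothetical addresses shows the intended behaviour:
for example in ('blog.example.com', 'example.com', 'news.bbc.co.uk'):
    print(example, '->', unify_domains(example))
# blog.example.com -> example.com
# example.com -> example.com
# news.bbc.co.uk -> news.bbc.co.uk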
Get top frequent domains
top_domains = domain_series.value_counts()[:10]
print(top_domains)
blogspot.com 36807
github.com 30312
techcrunch.com 26609
nytimes.com 24125
youtube.com 22029
google.com 16303
wordpress.com 15531
medium.com 12991
arstechnica.com 12336
wired.com 11070
Name: url, dtype: int64
Frequency by hour of the day
Which hours of the day are the most prolific for publishing?
import dateutil.parser

# Extract the submission hour from the timestamp string
def parse_hours(value):
    datetime_val = dateutil.parser.parse(str(value))
    return datetime_val.hour
hour_series = dataset_df['submission_time'].apply(parse_hours)
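Parsing 1.5M timestamps one by one with dateutil is fairly slow. Assuming the timestamps parse cleanly with pandas, a vectorized sketch of the same step would be:
# Let pandas parse the whole column at once, then pull out the hour component
hour_series = pd.to_datetime(dataset_df['submission_time']).dt.hour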
Display publish distribution over hours
hour_dist = hour_series.value_counts()
print(hour_dist[:5])
16 95123
17 94090
15 92098
18 89685
14 85719
Name: submission_time, dtype: int64
dist_sorted = hour_dist.sort_index()
matplotlib.style.use('ggplot')
bars = dist_sorted.plot.bar(rot=0, figsize=(16,10), width=0.8)
plt.title('Publish frequency during day (UTC time) for 2006-2015')
plt.show();
Fair enough - most submissions are posted in the afternoon and early evening (UTC).
Popularity to Headline length
Is there such a correlation?
To check, we need an additional column holding the length of each headline.
dataset_df['headline_length'] = dataset_df['headline'].apply(len)
dataset_df.head()
Checking Pearson’s correlation coefficient.
dataset_df.corr()
A coefficient below 0.25 is weak and suggests there is no meaningful linear correlation.
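If the full correlation matrix is noisy to read, the single coefficient we care about can also be pulled out directly (a small convenience, not part of the original flow):
# Pearson correlation between headline length and upvotes only
print(dataset_df['headline_length'].corr(dataset_df['upvotes']))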
Let’s visualize a scatterplot to see the shape and any possible clustering.
dataset_df.plot.scatter('headline_length','upvotes',figsize=(16,10))
plt.show();
It is evident that articles with headlines of 80 characters or more are unpopular, with only a few exceptions.
The boundary is suspiciously sharp - a sign of an underlying cause, most likely the 80-character limit Hacker News imposes on titles.
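As a quick sanity check (a small sketch, not in the original analysis), we can see what share of the data sits beyond that boundary:
# Share of headlines longer than 80 characters
long_share = (dataset_df['headline_length'] > 80).mean()
print('{:.2%} of headlines exceed 80 characters'.format(long_share))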
Popularity over time
Visualize how the overall popularity of articles changes over time.
dataset_df['submission_time'] = pd.to_datetime(dataset_df['submission_time'])
# Sum the upvotes per quarter (resample on quarter start)
popularity_per_quarters = dataset_df.resample('QS', on='submission_time')['upvotes'].sum()
popularity_per_quarters.plot(figsize=(16,10))
plt.title('Articles popularity over time')
plt.show();
More and more articles got upvoted over time - a good reflection of HackerNews’ growing popularity.
Buzzphrases
It would be interesting to see how buzzphrase trends evolve over time. Which phrases appear most frequently in headlines in each period? This gives an insight into which topics were most popular in the Hacker community, from the early days until recently.
First, contemplate the design of how to present it - an exciting step with many possible variations: let imagination flow. Pictures pop up and take shape as you think about the goal of the graph: what should it tell the reader? A good practice is to sketch it with a pencil on paper.
Details:
I got the design in my mind and a sketch on paper. The idea is to draw top-phrase bar plots for each period (quarters or half-years) across the whole span (2006-2015), with the buzzphrase trends drawn as lines on top. As you can imagine, it is not easy to make this visually appealing: too many lines get intertwined. So animation comes in: initially the patches and lines are gray and barely distinguishable, and hovering the mouse lights up a group of patches and the corresponding lines. Sounds cool - that should work great. A minimal sketch of the hover idea follows below.
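Here is a minimal, self-contained sketch of that hover-highlight mechanic in matplotlib, on toy data and with made-up colors - just to show the event wiring, not the final chart:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Toy data: one bar per period
bars = ax.bar(range(4), [5, 7, 4, 6], color='lightgray')

def on_hover(event):
    if event.inaxes != ax:
        return
    for bar in bars:
        contains, _ = bar.contains(event)
        # Light up the patch under the cursor, grey out the rest
        bar.set_color('tomato' if contains else 'lightgray')
    fig.canvas.draw_idle()

fig.canvas.mpl_connect('motion_notify_event', on_hover)
plt.show()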
From a technical perspective, the following steps are to be implemented:
- add a dataframe column with the corresponding quarter period
- add a dataframe column with the month period (lines will be drawn per month to keep them smooth)
- get the top buzzphrases and their aggregated frequency for each period, with the aggregated frequency presented per month
- get the top buzzphrase frequencies per month
- plot bars - one bar per period, showing the mean of the top buzzphrases
- plot lines - one for each top buzzphrase
- apply animation to highlight the group of patches for a certain period together with the related buzzphrase trend lines
- add annotations to display the actual phrases
- add annotations for interesting points
- focus on the color palette and make it visually appealing:
  - initially the bars and lines are gray and thin
  - upon hover (or click?), a certain period is highlighted, with its buzzphrase lines thickened and highlighted across all periods!
Each step has its technical caveats. We have a design and a plan, and can start making it real.
Add periods
Create the indices quarter_periods and month_periods, for further sampling per quarter/month.
# Making a copy for manipulations
df_buzz = dataset_df.copy()
# resetting an index, to save orig_index numbers, just in case it might be useful
df_buzz.reset_index(inplace=True)
df_buzz.rename(columns={'index':'orig_index'}, inplace=True)
# Create quarter periods.
# To achieve it: create a copy of time column, then set it as index,
# and then convert this index to quarters via to_period attribute.
df_buzz['submission_time_ind'] = df_buzz['submission_time']
df_buzz.set_index('submission_time_ind', inplace=True)
df_buzz = df_buzz.to_period('Q', copy=True)
# then we reset index, and rename the column to what it presents: quarter_periods
df_buzz.reset_index(inplace=True)
df_buzz.rename(columns={'submission_time_ind':'quarter_periods'}, inplace=True)
# Create month periods.
# To achieve it: create a copy of time column, then set it as index,
# and then convert this index to month via to_period attribute.
df_buzz['submission_time_ind'] = df_buzz['submission_time']
df_buzz.set_index('submission_time_ind', inplace=True)
df_buzz = df_buzz.to_period('M', copy=True)
# then we reset index, and rename the column to what it presents: month
df_buzz.reset_index(inplace=True)
df_buzz.rename(columns={'submission_time_ind':'month_periods'}, inplace=True)
# Setting a multiindex of quarters and months
df_buzz.set_index(['quarter_periods', 'month_periods'], inplace=True)
df_buzz.head()
Looks good. Creating the multiindex involved quite a few manipulations, and takes some computational power on a 1.5M-row dataset.
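For comparison, the same two period columns can be derived more directly through the Series .dt accessor; a hedged alternative sketch (skipping the orig_index bookkeeping) would be:
# Alternative: derive the periods straight from the timestamp column,
# then set them as a multiindex in one go
df_alt = dataset_df.copy()
df_alt['quarter_periods'] = df_alt['submission_time'].dt.to_period('Q')
df_alt['month_periods'] = df_alt['submission_time'].dt.to_period('M')
df_alt = df_alt.set_index(['quarter_periods', 'month_periods'])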
get period lists
The next step is to create datasets of top Buzzphrases per quarter and month periods, suitable for plotting.
# here a copy of dataset is created to perform manipulations
df_buzz_reind = df_buzz.copy()
#df_buzz_reind.reset_index(inplace=True)
df_buzz_reind.sort_values('submission_time', inplace=True)
df_buzz_reind.head()
Pick the period to explore: from the start of 2010 to the start of 2015.
df_buzz_reind = df_buzz_reind[(df_buzz_reind['submission_time']>'2010-01-01') & (df_buzz_reind['submission_time']<'2015-01-01')]
Get tidy lists of the quarter periods and month periods.
Quarter periods
First, get the quarter period lists.
quarters = list(set(df_buzz_reind.index.get_level_values('quarter_periods')))
quarters_series = pd.Series(quarters)
quarters_series_sorted = quarters_series.sort_values()
Get the quarter period list and the quarter period name list, both sorted in ascending order.
quarter_periods_list = list(quarters_series_sorted)
quarter_names_list = quarters_series_sorted.astype(str).tolist()
quarter_periods_list[:5]
[Period('2010Q1', 'Q-DEC'),
Period('2010Q2', 'Q-DEC'),
Period('2010Q3', 'Q-DEC'),
Period('2010Q4', 'Q-DEC'),
Period('2011Q1', 'Q-DEC')]
quarter_names_list
['2010Q1',
'2010Q2',
'2010Q3',
'2010Q4',
'2011Q1',
'2011Q2',
'2011Q3',
'2011Q4',
'2012Q1',
'2012Q2',
'2012Q3',
'2012Q4',
'2013Q1',
'2013Q2',
'2013Q3',
'2013Q4',
'2014Q1',
'2014Q2',
'2014Q3',
'2014Q4']
quarters_list=list(zip(quarter_periods_list,quarter_names_list))
quarters_list[:5]
[(Period('2010Q1', 'Q-DEC'), '2010Q1'),
(Period('2010Q2', 'Q-DEC'), '2010Q2'),
(Period('2010Q3', 'Q-DEC'), '2010Q3'),
(Period('2010Q4', 'Q-DEC'), '2010Q4'),
(Period('2011Q1', 'Q-DEC'), '2011Q1')]
Good, we now have our lists: quarter_periods_list and quarter_names_list, plus a version zipped into a list of (period, period_name) tuples: quarters_list.
get most frequent buzzphrases for each period
We are building a dictionary of { period_name : top frequent phrases with frequencies }. This involves slicing the dataframe for each period and extracting the phrases with their frequencies.
Earlier in this project we already extracted the top buzzphrases across the whole 10-year span, so let’s gather those steps into a single function that can be applied to each period to build the dict.
Define a function that extracts and returns the top frequent phrases.
def get_top_freq_phrases(processed_headlines_series):
    # Getting the noun phrase list
    noun_phrase_list = [ list(TextBlob(processed_headline).noun_phrases) for processed_headline in processed_headlines_series ]
    # Flattening the list
    noun_phrase_flat = [ item for sublist in noun_phrase_list for item in sublist ]
    # Converting the list to a Counter collection
    counter_collection = Counter(noun_phrase_flat)
    # Remove the 'show hn' phrase (if present), as it is not a meaningful buzzphrase
    counter_collection.pop('show hn', None)
    # Finally obtain the top 5 most common phrases as (phrase, count) tuples
    top_phrases = counter_collection.most_common(5)
    return top_phrases, counter_collection  # return the top frequent phrases and the full counter
We can now iterate over the periods in the dataframe and build a dictionary of the top frequent phrases per period.
quarters_top_phrases_dict = {}
for period, period_name in quarters_list:
    # Slice the rows belonging to this quarter via the multiindex
    df_period = df_buzz_reind.loc[period]
    print(period_name)
    top_phrases, collection_for_period = get_top_freq_phrases(df_period['processed_headline'])
    quarters_top_phrases_dict[period_name] = [top_phrases, collection_for_period]
2010Q1
2010Q2
2010Q3
2010Q4
2011Q1
2011Q2
2011Q3
2011Q4
2012Q1
2012Q2
2012Q3
2012Q4
2013Q1
2013Q2
2013Q3
2013Q4
2014Q1
2014Q2
2014Q3
2014Q4
Perfect - a dictionary with the top phrases and their frequencies for each quarter is gathered.
Now let’s get one flat list of all the top phrases encountered, which will let us build a line graph for each of them.
top_phrases_uniq = list(set([phrase_freq[0] for sublist in quarters_top_phrases_dict.values() for phrase_freq in sublist[0]]))
top_phrases_uniq
['steve job',
'world news',
'triunfo del amor capitulo',
'google glass',
'angry bird',
'combinator bookmarklet',
'open source',
'ipad mini',
'google instant',
'google buzz',
'mobile app',
'new ipad',
'flappy bird',
'social network',
'artificial intelligence',
'social medium',
'google nexus',
'website need',
'real estate',
'big data',
'silicon valley',
'window phone',
'aaron swartz',
'net neutrality',
'reina del sur capitulo',
'elon musk']
Perfect!
Now, let’s obtain a list of frequencies for each phrase! Each list will be used as the series data to plot a line.
phrase_series_dict = {}
for phrase in top_phrases_uniq:
    # 'null' is used (rather than NaN) for missing quarters because these series will be
    # pasted into a JavaScript (Highcharts) config later, where null marks a gap in a line.
    phrase_series_dict[phrase] = [ quarters_top_phrases_dict[period_name][1][phrase]
                                   if phrase in quarters_top_phrases_dict[period_name][1] else 'null'
                                   for period_name in quarter_names_list ]
Now, for convenience, let’s print the value series for each phrase!
for phrase in top_phrases_uniq:
    print(phrase, phrase_series_dict[phrase])
steve job [18, 78, 32, 39, 70, 37, 93, 396, 56, 49, 39, 49, 32, 29, 39, 30, 38, 20, 8, 17]
world news ['null', 'null', 'null', 'null', 1, 1, 5, 81, 47, 7, 'null', 'null', 'null', 1, 'null', 'null', 1, 'null', 'null', 'null']
triunfo del amor capitulo ['null', 'null', 'null', 1, 9, 73, 'null', 'null', 'null', 1, 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null']
google glass ['null', 'null', 'null', 'null', 'null', 'null', 'null', 1, 5, 17, 14, 7, 95, 125, 57, 55, 43, 64, 17, 16]
angry bird ['null', 1, 4, 14, 45, 25, 15, 18, 20, 6, 11, 1, 3, 5, 2, 4, 5, 3, 'null', 2]
combinator bookmarklet [4, 5, 2, 9, 23, 40, 61, 37, 33, 42, 72, 66, 86, 62, 65, 26, 4, 5, 3, 1]
open source [68, 51, 58, 64, 55, 58, 61, 71, 94, 89, 109, 81, 90, 103, 94, 112, 106, 104, 82, 110]
ipad mini ['null', 'null', 'null', 'null', 'null', 'null', 'null', 2, 4, 4, 13, 53, 7, 4, 1, 2, 'null', 'null', 'null', 3]
google instant ['null', 'null', 32, 13, 4, 'null', 'null', 1, 2, 1, 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null']
google buzz [51, 2, 'null', 'null', 'null', 'null', 'null', 3, 'null', 'null', 'null', 'null', 'null', 1, 1, 'null', 'null', 'null', 'null', 'null']
mobile app [3, 3, 4, 8, 8, 6, 15, 12, 14, 22, 23, 26, 35, 26, 24, 36, 33, 12, 39, 24]
new ipad ['null', 1, 1, 'null', 5, 2, 1, 2, 115, 20, 3, 2, 1, 1, 1, 1, 'null', 'null', 1, 'null']
flappy bird ['null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 42, 11, 5, 2]
social network [35, 39, 49, 77, 58, 54, 76, 58, 63, 79, 68, 45, 63, 38, 53, 42, 41, 30, 31, 38]
artificial intelligence [5, 8, 4, 10, 10, 6, 7, 8, 7, 11, 11, 17, 13, 8, 9, 22, 16, 14, 23, 47]
social medium [45, 39, 50, 56, 66, 87, 129, 97, 129, 115, 72, 90, 97, 101, 84, 59, 39, 41, 60, 40]
google nexus [32, 5, 'null', 8, 'null', 5, 3, 2, 'null', 1, 7, 13, 8, 2, 1, 6, 2, 'null', 1, 1]
website need ['null', 28, 1, 'null', 2, 'null', 1, 2, 2, 1, 'null', 2, 'null', 1, 2, 'null', 'null', 'null', 'null', 'null']
real estate [10, 8, 12, 6, 7, 12, 38, 65, 17, 15, 8, 7, 3, 1, 7, 5, 4, 7, 8, 6]
big data [5, 2, 9, 12, 19, 17, 26, 30, 66, 73, 70, 89, 116, 124, 115, 85, 77, 64, 69, 64]
silicon valley [10, 17, 11, 27, 34, 40, 41, 41, 42, 59, 45, 35, 42, 53, 45, 53, 77, 48, 59, 31]
window phone [6, 1, 18, 35, 30, 14, 16, 22, 25, 36, 15, 18, 20, 10, 11, 9, 14, 7, 4, 3]
aaron swartz ['null', 'null', 'null', 1, 'null', 1, 3, 1, 1, 1, 2, 'null', 122, 7, 7, 1, 8, 2, 2, 'null']
net neutrality [6, 13, 14, 12, 4, 7, 1, 4, 2, 5, 2, 'null', 1, 1, 7, 5, 33, 49, 21, 32]
reina del sur capitulo ['null', 'null', 'null', 'null', 3, 66, 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null', 'null']
elon musk ['null', 3, 8, 'null', 2, 2, 2, 6, 2, 6, 10, 16, 24, 22, 36, 22, 8, 32, 16, 42]
Great! We now have data showing the top phrases and their frequencies per quarter!
Visualize the graph
Since we have gathered the frequency data, it’s time to plot the graph.
That is quite a challenge: how do we represent this series data in a nice way? Keeping in mind the following aims:
- display the topmost phrases
- display the tendency for each phrase throughout the 5 years
How about nice curved lines, each highlighted when hovered over? Adding some icons and annotations would be great as well.
After hours of research and trial, I found a great solution: Highcharts.com!
Highcharts is absolutely great - smooth and stylish, with so many capabilities! I found it much better for this than plotly or bokeh.
The only catch is that it is JavaScript-based. I copy-pasted the data series by hand and then crafted the chart in jsfiddle.net. You can check the working code here.
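Rather than copy-pasting by hand, the series can be dumped in a Highcharts-friendly form with a few lines of Python (a convenience sketch; the 'null' placeholders become real JSON nulls):
import json

# Emit one Highcharts series object per phrase, converting 'null' strings to JSON null
for phrase, values in phrase_series_dict.items():
    data = [None if v == 'null' else v for v in values]
    print(json.dumps({'name': phrase, 'data': data}) + ',')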
Steve Jobs’ legacy is by far the most popular. Aaron Swartz is a very interesting figure I got acquainted with through this project - a very sad story: a brilliant young man who took his own life in 2013Q1 under the heavy pressure that fell on him. Google Glass and Big Data peaked in popularity somewhere in mid-2013 and then declined. Remember those Flappy Birds? That was fun. We can also see the popularity of Elon Musk and Artificial Intelligence rising together, seemingly unstoppably…
That was interesting - peeking into those patterns, and moreover exploring the instruments for such endeavors.
All the best :)