Using Reddit's API for Predicting Comments
My model for predicting whether a Reddit post will elicit an above-average number of comments, based on data scraped from thousands of Reddit posts.
Executive Summary:
Reddit is an online content platform which recently surpassed Facebook as the third most trafficked site in the US. Because Reddit is structured as a community-driven platform, and its users are more interested in user-generated content than in paid advertisements, companies have strong incentives to create content that appears user-generated.
The site consists of posts on which users can vote or comment. Each post gets a score based on the number of “upvotes” and “downvotes” it receives, and each post belongs to a particular “subreddit”, a page organized around a particular topic. Posts from a variety of subreddits are aggregated on the front page, and can be sorted by ‘new’, ‘hot’, ‘trending’, and ‘top’.
This presents an important question for advertisers and their opponents: can Reddit be gamed? In general, what influences the popularity of a post?
In this analysis I attempt to predict whether a Reddit post gets an above- or below-median amount of interaction (as measured by number of comments), using natural language processing and classification models.
I find that the most important predictors of a post’s success are its score and the length of time since it was posted. The contributions of individual words added very little explanatory power compared to a model which predicted comment levels based on a post’s subreddit, length of time online, and score. This strongly suggests that it is difficult to game Reddit based solely on picking good titles. Most of a post’s success appears to depend on its quality as judged by Reddit users (and bots).
In this project I attempt to figure out what information can help predict the number of comments a Reddit post receives. The two major steps are:
- Scraping data from Reddit into a usable format.
- Building a model using that data to predict a post’s number of comments, and interpreting that model.
My problem statement is: What characteristics of a post on Reddit contribute most to the number of comments?
The source of data is the ‘hot’ tab of Reddit’s homepage. I’ll acquire five pieces of information about each thread:
- The title of the thread
- The subreddit that the thread corresponds to
- The length of time it has been up on Reddit
- The post’s score (a function of upvotes and downvotes)
- The number of comments on the thread
Then, I build a classification model that uses natural language processing to predict whether a post will have more or fewer comments than the median for all the posts I scraped.
Scraping Post Info from Reddit
import requests
import json
import time
import pandas as pd
# the URL from which I'm going to scrape posts
URL = ""
# In order to scrape Reddit I need to give a custom user agent (a string that identifies the client), otherwise
# Python will use the default user agent, and Reddit will block it since so many people
# use the same one, so here I define my own user agent.
headers = {'User-agent': 'Ben_Ironside'}
# make a request object to get and store the data from the above URL
res = requests.get(URL, headers = headers)
# check the status of the response. Since I'm using a custom
# user agent, it should be fine (status = 200)
res.status_code
# download the webpage's data in JSON format using the res (requests) object
json_data = res.json()
# check how the data is organized in the JSON
json_data.keys()
dict_keys(['kind', 'data'])
# check out what's in the 'data' key
json_data['data']
{'after': 't3_8o8nsm',
'before': None,
'children': [{'data': {'approved_at_utc': None,
'approved_by': None,
'archived': False,
'author': 'maxwellhill',
'author_flair_css_class': None,
'author_flair_template_id': None,
'author_flair_text': None,
'banned_at_utc': None,
'banned_by': None,
'can_gild': False,
'can_mod_post': False,
'clicked': False,
'contest_mode': False,
'created': 1528066997.0,
'created_utc': 1528038197.0,
'distinguished': None,
'domain': '',
'downs': 0,
'edited': False,
'gilded': 0,
'hidden': False,
'hide_score': False,
'id': '8o917w',
'is_crosspostable': False,
'is_reddit_media_domain': False,
'is_self': False,
'is_video': False,
'likes': None,
'link_flair_css_class': None,
'link_flair_text': None,
'locked': False,
'media': None,
'media_embed': {},
'media_only': False,
'mod_note': None,
'mod_reason_by': None,
'mod_reason_title': None,
'mod_reports': [],
'name': 't3_8o917w',
'no_follow': False,
'num_comments': 3225,
'num_crossposts': 0,
'num_reports': None,
'over_18': False,
'parent_whitelist_status': 'all_ads',
'permalink': '/r/worldnews/comments/8o917w/trudeau_its_insulting_that_the_us_considers/',
'pinned': False,
'post_categories': None,
'post_hint': 'link',
'preview': {'enabled': False,
'images': [{'id': 'UPq31VlrvZ6XMJCJ-nAqotpxibElKMS51EDx0KFl9EQ',
'resolutions': [{'height': 60,
'url': '',
'width': 108},
{'height': 121,
'url': '',
'width': 216},
{'height': 179,
'url': '',
'width': 320},
{'height': 359,
'url': '',
'width': 640},
{'height': 539,
'url': '',
'width': 960}],
'source': {'height': 551,
'url': '',
'width': 980},
'variants': {}}]},
'pwls': 6,
'quarantine': False,
'removal_reason': None,
'report_reasons': None,
'saved': False,
'score': 26783,
'secure_media': None,
'secure_media_embed': {},
'selftext': '',
'selftext_html': None,
'send_replies': False,
'spoiler': False,
'stickied': False,
'subreddit': 'worldnews',
'subreddit_id': 't5_2qh13',
'subreddit_name_prefixed': 'r/worldnews',
'subreddit_subscribers': 18800039,
'subreddit_type': 'public',
'suggested_sort': None,
'thumbnail': 'default',
'thumbnail_height': 78,
'thumbnail_width': 140,
'title': "Trudeau: It's 'insulting' that the US considers Canada a national security threat",
'ups': 26783,
'url': '',
'user_reports': [],
'view_count': None,
'visited': False,
'whitelist_status': 'all_ads',
'wls': 6},
'dist': 25,
'modhash': ''}
It looks like all the data I want is in the ‘data’ item.
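A minimal sketch of how the listing is nested, assuming the response has the shape shown in the output above (a top-level 'data' item holding 'children' and 'after'). The helper name `split_listing` and the abbreviated toy dictionary are illustrative and not part of the original notebook.

```python
def split_listing(listing):
    """Return (posts, after_token) from a Reddit-style listing dictionary."""
    data = listing.get('data', {})
    # each child wraps one post; the post's own fields live under child['data']
    posts = [child['data'] for child in data.get('children', [])]
    # 'after' is the token used to request the next page (None if there is none)
    return posts, data.get('after')


# toy listing with the same nesting as the output above (most fields omitted)
sample_listing = {
    'kind': 'Listing',
    'data': {
        'after': 't3_8o8nsm',
        'children': [
            {'kind': 't3',
             'data': {'title': "Trudeau: It's 'insulting' that the US considers Canada a national security threat",
                      'subreddit': 'worldnews',
                      'score': 26783,
                      'num_comments': 3225}},
        ],
    },
}

sample_posts, after_token = split_listing(sample_listing)
print(sample_posts[0]['subreddit'], after_token)
```

Using `.get` with defaults means an empty or unexpected response yields an empty list rather than a `KeyError`, which can be convenient when scraping many pages.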
For each post, I want to get its title, time posted, score (~ number of upvotes), the subreddit it came from, and how many comments it got.
As can be seen above, the keys for these categories are:
- ‘num_comments’
- ‘score’
- ‘title’
- ‘subreddit’
- ‘created’ (This actually represents when it was posted, and I’m interested in the time elapsed since it was created, so I’ll later subtract the time it was created from the current time to get this info.)
Getting the data I want for an individual post:
# import the datetime library to deal with the time posted
import datetime
# instantiate dictionary of the data about one post
post_scrape = {}
# get one post to demo on
demo_post = json_data['data']['children'][0]
# convert the 10-digit timestamp (seconds since 1970) to a datetime, subtract it from the current time, and express the result in minutes
post_scrape['mins_since_post'] = round((datetime.datetime.now() - datetime.datetime.fromtimestamp(demo_post['data']['created'])).total_seconds() / 60)
# get the rest of the data
post_scrape['num_comments'] = demo_post['data']['num_comments']
post_scrape['score'] = demo_post['data']['score']
post_scrape['title'] = demo_post['data']['title']
post_scrape['subreddit'] = demo_post['data']['subreddit']
# output the dictionary to see if it looks right
post_scrape
{'mins_since_post': 55,
'num_comments': 2,
'score': 84,
'subreddit': 'cactus',
'title': 'Found these cute little guys in Jeonju, South Korea'}
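The elapsed-time calculation can also be written with timezone-aware datetimes. This is a sketch under an assumption: in the sample post above, 'created' and 'created_utc' differ by 28800 seconds (8 hours), so 'created_utc' may be the safer field to compare against the current UTC time. The function name `minutes_since_post` and the `now` parameter are hypothetical additions for illustration.

```python
import datetime


def minutes_since_post(created_utc, now=None):
    """Whole minutes elapsed between a UTC epoch timestamp and `now`."""
    posted = datetime.datetime.fromtimestamp(created_utc, tz=datetime.timezone.utc)
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    return round((now - posted).total_seconds() / 60)


# fixed reference time so the example is reproducible: 55 minutes after the post
created_utc = 1528038197.0  # 'created_utc' from the sample post above
reference = datetime.datetime.fromtimestamp(created_utc + 55 * 60,
                                            tz=datetime.timezone.utc)
print(minutes_since_post(created_utc, now=reference))
```

Passing `now` explicitly keeps the function testable; in the scraper it would simply be left as `None`.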
Turning the above code into a function so I can call it for each post once I scrape them:
def scrape_post(post):
    # collect the fields of interest from one post into a dictionary
    post_scrape = {}
    # minutes elapsed between the post's creation time and now
    post_scrape['mins_since_post'] = round((datetime.datetime.now() - datetime.datetime.fromtimestamp(post['data']['created'])).total_seconds() / 60)
    post_scrape['num_comments'] = post['data']['num_comments']
    post_scrape['score'] = post['data']['score']
    post_scrape['title'] = post['data']['title']
    post_scrape['subreddit'] = post['data']['subreddit']
    return post_scrape
Scraping data about a bunch of posts to give me data to base my analysis on:
## This code snippet scrapes all 25 posts that display on one Reddit page, and uses
## the ID of the last post (the 'after' value in that page's listing data)
## to make sure the next request goes to posts after that post, then scrapes them.
## It adds all the scraped posts to the list [posts].
## Credit to Riley Dallas
posts = []
after = None
for i in range(100):
    # the first request has no 'after' token; later requests continue from the last post seen
    if after is None:
        params = {}
    else:
        params = {'after': after}
    URL = ""
    res = requests.get(URL, params=params, headers=headers)
    json_data = res.json()
    posts.extend(json_data['data']['children'])
    after = json_data['data']['after']
    # wait a second between requests to avoid putting excessive load on the servers
    time.sleep(1)
# Checking how many posts were scraped
len(posts)
# 2500, as expected since I ran my loop 100 times
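The pagination logic above can be sketched independently of the network by passing in the function that fetches a page. This is an illustration only: `collect_posts`, `fetch_page`, and the fake pages are hypothetical names, and in the real scraper `fetch_page` would wrap `requests.get(URL, params=params, headers=headers).json()`.

```python
import time


def collect_posts(fetch_page, n_pages, pause_seconds=1.0):
    """Follow 'after' tokens across up to n_pages listing pages."""
    posts = []
    after = None
    for _ in range(n_pages):
        params = {} if after is None else {'after': after}
        listing = fetch_page(params)
        posts.extend(listing['data']['children'])
        after = listing['data']['after']
        if after is None:
            break  # no further pages to request
        time.sleep(pause_seconds)  # be polite to the server between requests
    return posts


# two fake pages standing in for real API responses
fake_pages = {
    None: {'data': {'children': [{'data': {'title': 'a'}}], 'after': 't3_x'}},
    't3_x': {'data': {'children': [{'data': {'title': 'b'}}], 'after': None}},
}


def fake_fetch(params):
    return fake_pages[params.get('after')]


demo_posts = collect_posts(fake_fetch, n_pages=5, pause_seconds=0)
print(len(demo_posts))
```

Stopping when 'after' comes back as `None` avoids re-requesting the first page if the listing runs out before the loop does.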
Putting info about all the posts into a big list of dictionaries
# instantiate list
posts_infodicts_list = []
# call function on each post, add results to [posts_infodicts_list]
for post in posts:
    posts_infodicts_list.append(scrape_post(post))
# checking if the list looks right
posts_infodicts_list[0:5]
Saving my dataframe of information as a CSV
Saving my data to the disk as a Comma-Separated-Values file so that it won’t be lost if this notebook crashes
# first I make the list of dictionaries into a DataFrame for easy export
df = pd.DataFrame(posts_infodicts_list)
# export to csv in the local directory (the file name here is a placeholder)
df.to_csv('reddit_posts.csv', index=False)
Feature Engineering and Data Prep
Using natural language processing to turn the titles into word vectors, which count the occurrence of different words:
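# make a corpus which combines each post's title with its subreddit name
# this will be used to teach the word vectorizer which words to count
# note: no separator is added, so the last word of a title fuses with the subreddit name
# into a single token; df['title'] + ' ' + df['subreddit'] would keep them apart
corpus = list(df['title']+df['subreddit'])
# import and instantiate CountVectorizer, a word vectorizer which counts the occurrence of each word in the corpus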
from sklearn.feature_extraction.text import CountVectorizer
cvec = CountVectorizer()
# use the vectorizer to transform the corpus
cvec_corpus = cvec.fit_transform(corpus)
# make a dataframe of the vector counts, after turning them from a sparse matrix to a dense matrix, and label the columns using cvec's get_feature_names_out() method (get_feature_names() in older scikit-learn versions)
vector_counts = pd.DataFrame(cvec_corpus.todense(),columns=cvec.get_feature_names_out())
# do the same as in the above cell, but with just the subreddits
# this will help me make a simpler model below as a proof of concept
subcorpus = list(df['subreddit'])
sub_cvec = CountVectorizer()
sub_cvec_corpus = sub_cvec.fit_transform(subcorpus)
sub_counts = pd.DataFrame(sub_cvec_corpus.todense(),columns=sub_cvec.get_feature_names_out())
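To make the word-counting step concrete, here is a small standard-library sketch that only approximates CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters). It is a stand-in, not CountVectorizer itself. It also shows what happens when a title and a subreddit name are joined without a space. Tokens such as 'hmmmhmmm' in the importance table further down look like this kind of fusion, though that is my reading rather than something verified.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters


def count_words(text):
    """Lowercase the text and count each token."""
    return Counter(TOKEN.findall(text.lower()))


title, subreddit = 'hmmm', 'hmmm'
fused = count_words(title + subreddit)           # no separator between the two
separate = count_words(title + ' ' + subreddit)  # a space keeps the tokens apart
print(fused)
print(separate)
```

With the space, the subreddit name is counted as the same word wherever it appears. Without it, each title/subreddit pair can create its own one-off token.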
Making one dataframe out of my scraped data (minus ‘title’ and ‘subreddit’), and my vectorized word counts:
# dataframe of data without title column or subreddit column since both have been vectorized
df_notitle = df.drop(columns=['title','subreddit'])
# add the dataframes together (concatenate them)
posts_df = pd.concat([df_notitle,vector_counts], axis = 1)
How many words are in the corpus anyway?
# apparently there were 6563 unique words in the titles of my posts
(5000, 10790)
I want to predict whether the number of comments was low or high. Before I can do that, I need to get the median number of comments and make a variable for whether each post is above or below that.
import numpy as np
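# learning the median number of comments
np.median(posts_df['num_comments'])
# add new column to dataframe: True if the post has at least the median number of comments (20)
posts_df['above_median'] = posts_df['num_comments'] >= 20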
# drop number of comments from the dataframe since using it to predict would be cheating
posts_df = posts_df.drop(columns=['num_comments'])
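A small sketch of the labeling step, assuming the threshold is computed from the data rather than typed in by hand. The comment counts below are made up for illustration. Note that `>=` puts posts exactly at the median into the positive class, so the two classes need not be the same size.

```python
import statistics

num_comments = [2, 5, 20, 20, 48, 3225]  # made-up counts for illustration
median_comments = statistics.median(num_comments)

# True when a post has at least the median number of comments
above_median = [n >= median_comments for n in num_comments]

# share of posts in the positive class; worth checking before modeling
share_positive = sum(above_median) / len(above_median)
print(median_comments, above_median, share_positive)
```

Checking the class share is a quick way to see what accuracy a trivial model would achieve on this target.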
Predicting ‘above_median’ Number of Comments With a Random Forest and a Logistic Regression Model:
Splitting my data into a testing set and a training set so that I can avoid overfitting my model:
# separating the target (y) from the predictors (x). I'm going to make
# 2 separate x dataframes, one with all the predictors, and one with just subreddit
# so that I can build a simple proof-of-concept model below.
X_subr = sub_counts
X = posts_df.drop(columns=['above_median'])
y = posts_df['above_median']
from sklearn.model_selection import train_test_split
# use train_test_split to make a training and a testing set for both Xs
X_train, X_test, y_train, y_test = train_test_split(X,y)
X_train_subr, X_test_subr, y_train_subr, y_test_subr = train_test_split(X_subr,y)
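Because the two calls above are made separately and without a fixed seed, the two feature sets end up split on different rows. One possible alternative, sketched here on toy arrays, is to pass every array to a single `train_test_split` call so they share one split. The `_demo` names, the seed, and the use of `stratify` are my own choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy stand-ins for X (all predictors), X_subr (subreddit counts) and y
X_demo = np.arange(40).reshape(20, 2)        # row i is [2i, 2i + 1]
X_subr_demo = np.arange(20).reshape(20, 1)   # row i is [i]
y_demo = np.array([0, 1] * 10)

# one call splits every array on the same rows; stratify keeps the class balance
(X_train_demo, X_test_demo,
 X_train_subr_demo, X_test_subr_demo,
 y_train_demo, y_test_demo) = train_test_split(
    X_demo, X_subr_demo, y_demo,
    test_size=0.25, random_state=42, stratify=y_demo)

print(X_train_demo.shape, X_test_demo.shape)
```

With a shared split, a difference in scores between the two models reflects the features rather than which posts happened to land in each test set.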
Creating a Random Forest model to predict High/Low number of comments. Starting with a proof of concept which uses only the subreddit as a feature:
# import packages from scikit-learn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# instantiate random forest model
forest = RandomForestClassifier()
# fit model on training data and score it on testing data
forest.fit(X_train_subr, y_train_subr)
forest.score(X_test_subr, y_test_subr)
# I ran this model on ten different train_test_splits to check for overfitting and all scores were >0.6 and <0.67
Making a similar model but using words in the title, minutes since post, and score as predictors as well as subreddit, to see whether considering the words in the title makes the model better:
forest = RandomForestClassifier()
# fit model on training data and score it on testing data
forest.fit(X_train, y_train)
forest.score(X_test, y_test)
# I ran this model on ten different train_test_splits to check for overfitting and all scores were >0.7 and <0.77
As you can see, the model improved only modestly (accuracy rose by roughly 0.1) when I included the vectorized titles, score, and minutes since post as well as the subreddits. This suggests that the influence of the words on a post’s number of comments is small, if it exists at all.
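Accuracy scores like these are easier to read next to a naive baseline. The sketch below, using made-up labels, computes the accuracy of always predicting the most common training label. Since the target here is split at the median, I would expect that baseline to sit somewhere near 0.5, but that is an assumption and not a figure from the notebook.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    hits = sum(label == majority_label for label in test_labels)
    return hits / len(test_labels)


# made-up labels for illustration only
train_labels = [True, True, True, False, False]
test_labels = [True, False, True, False]
print(majority_baseline_accuracy(train_labels, test_labels))
```

Any model worth keeping should beat this number on the same test set.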
Which features were most important to the model?
# make a dataframe which contains the variable names and their relative importances in the model
df_importances = pd.DataFrame({'variable': X_train.columns,
'relative_importance': forest.feature_importances_})
# reorder the dataframe based on absolute value - credit for this implementation goes to a Stack Overflow answer posted by EdChum
df_importances.reindex(df_importances['relative_importance'].abs().sort_values(inplace=False, ascending=False).index).head(50)
# below is a table of relative feature importances for the full model.
# the relative importance is a fraction of 1, where 1 represents the model's total explanatory power
# Score and mins_since_post are clearly the most important variables.
index | relative_importance | variable |
1 | 0.192166 | score |
0 | 0.086844 | mins_since_post |
9682 | 0.004962 | to |
9506 | 0.004301 | the |
9578 | 0.003135 | this |
4840 | 0.002912 | in |
5048 | 0.002803 | it |
6717 | 0.002771 | of |
686 | 0.002650 | and |
10541 | 0.002598 | with |
10707 | 0.002433 | you |
5023 | 0.002285 | is |
6410 | 0.001926 | my |
3789 | 0.001905 | for |
827 | 0.001815 | are |
8583 | 0.001758 | should |
5218 | 0.001718 | just |
5605 | 0.001584 | like |
1135 | 0.001536 | be |
2665 | 0.001528 | de |
4305 | 0.001517 | gun |
916 | 0.001445 | at |
6779 | 0.001440 | on |
3896 | 0.001415 | from |
4430 | 0.001395 | have |
3605 | 0.001377 | femcelsbraincels |
9567 | 0.001372 | think |
5981 | 0.001369 | me |
7463 | 0.001365 | pride |
4050 | 0.001358 | get |
2931 | 0.001335 | do |
10056 | 0.001278 | up |
9616 | 0.001269 | through |
4591 | 0.001268 | hmmmhmmm |
10446 | 0.001259 | when |
4412 | 0.001257 | has |
9936 | 0.001239 | two |
733 | 0.001230 | anon |
484 | 0.001223 | actual |
10495 | 0.001202 | will |
9497 | 0.001158 | that |
6715 | 0.001150 | odroid |
5205 | 0.001113 | jump |
3933 | 0.001109 | furry |
4154 | 0.001106 | golden |
9077 | 0.001104 | still |
6699 | 0.001090 | oc |
1589 | 0.001087 | bruce |
1664 | 0.001086 | bustedfashionreps |
1028 | 0.001085 | back |
Repeat the model-building process with a non-tree-based method.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
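# Standard scaler helps prepare data for a logistic regression model
# by putting all numerical predictors on a similar scale
ss = StandardScaler()
# fit the scaler on the training data only (X_train_scaled is an assumed variable name)
X_train_scaled = ss.fit_transform(X_train)
# instantiate model (the liblinear solver supports both the l1 and l2 penalties) and fit it
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train_scaled, y_train)
# search over the penalty type and the regularization strength C
grid_params = {
    'penalty': ['l1','l2'],
    'C': [.8,.9,1]
}
grid = GridSearchCV(logreg, grid_params)
grid.fit(X_train_scaled, y_train)
grid.best_params_
{'C': 0.9, 'penalty': 'l1'}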
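One way to combine the scaling and the grid search is to wrap both steps in a pipeline, so that each cross-validation fold is scaled using only its own training portion. This is a sketch on synthetic data, not the notebook's actual run: the `_demo` arrays, the seed, and `cv=5` are assumptions, and `solver='liblinear'` is chosen because it supports both penalties in the grid.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data standing in for X_train / y_train
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0

# the scaler and the model are fit together inside every cross-validation fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear'))

# make_pipeline names each step after its lowercased class name
grid_params = {
    'logisticregression__penalty': ['l1', 'l2'],
    'logisticregression__C': [0.8, 0.9, 1],
}
grid = GridSearchCV(pipe, grid_params, cv=5)
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 3))
```

After fitting, `grid.score(X_test, y_test)` would apply the same scaling and the best model to held-out data in one step.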