Predictive Models for the Online News Popularity Data Set
The purpose of this repository is to create predictive models and automate R Markdown reports. The analyses are performed on the Online News Popularity Data Set from the UCI Machine Learning Repository. Additional information about this data can be accessed here.
The data contains the following variables:
- url: URL of the article (non-predictive)
- timedelta: Days between the article publication and the dataset acquisition (non-predictive)
- n_tokens_title: Number of words in the title
- n_tokens_content: Number of words in the content
- n_unique_tokens: Rate of unique words in the content
- n_non_stop_words: Rate of non-stop words in the content
- n_non_stop_unique_tokens: Rate of unique non-stop words in the content
- num_hrefs: Number of links
- num_self_hrefs: Number of links to other articles published by Mashable
- num_imgs: Number of images
- num_videos: Number of videos
- average_token_length: Average length of the words in the content
- num_keywords: Number of keywords in the metadata
- data_channel_is_lifestyle: Is data channel ‘Lifestyle’?
- data_channel_is_entertainment: Is data channel ‘Entertainment’?
- data_channel_is_bus: Is data channel ‘Business’?
- data_channel_is_socmed: Is data channel ‘Social Media’?
- data_channel_is_tech: Is data channel ‘Tech’?
- data_channel_is_world: Is data channel ‘World’?
- kw_min_min: Worst keyword (min. shares)
- kw_max_min: Worst keyword (max. shares)
- kw_avg_min: Worst keyword (avg. shares)
- kw_min_max: Best keyword (min. shares)
- kw_max_max: Best keyword (max. shares)
- kw_avg_max: Best keyword (avg. shares)
- kw_min_avg: Avg. keyword (min. shares)
- kw_max_avg: Avg. keyword (max. shares)
- kw_avg_avg: Avg. keyword (avg. shares)
- self_reference_min_shares: Min. shares of referenced articles in Mashable
- self_reference_max_shares: Max. shares of referenced articles in Mashable
- self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
- weekday_is_monday: Was the article published on a Monday?
- weekday_is_tuesday: Was the article published on a Tuesday?
- weekday_is_wednesday: Was the article published on a Wednesday?
- weekday_is_thursday: Was the article published on a Thursday?
- weekday_is_friday: Was the article published on a Friday?
- weekday_is_saturday: Was the article published on a Saturday?
- weekday_is_sunday: Was the article published on a Sunday?
- is_weekend: Was the article published on the weekend?
- LDA_00: Closeness to LDA topic 0
- LDA_01: Closeness to LDA topic 1
- LDA_02: Closeness to LDA topic 2
- LDA_03: Closeness to LDA topic 3
- LDA_04: Closeness to LDA topic 4
- global_subjectivity: Text subjectivity
- global_sentiment_polarity: Text sentiment polarity
- global_rate_positive_words: Rate of positive words in the content
- global_rate_negative_words: Rate of negative words in the content
- rate_positive_words: Rate of positive words among non-neutral tokens
- rate_negative_words: Rate of negative words among non-neutral tokens
- avg_positive_polarity: Avg. polarity of positive words
- min_positive_polarity: Min. polarity of positive words
- max_positive_polarity: Max. polarity of positive words
- avg_negative_polarity: Avg. polarity of negative words
- min_negative_polarity: Min. polarity of negative words
- max_negative_polarity: Max. polarity of negative words
- title_subjectivity: Title subjectivity
- title_sentiment_polarity: Title polarity
- abs_title_subjectivity: Absolute subjectivity level
- abs_title_sentiment_polarity: Absolute polarity level
- shares: Number of shares (target)
In this project, subsets by data_channel_is_* were produced to automate the R Markdown reports. The predictive models include linear regression models, a random forest model, and a boosted tree model. Each model was fit on a training data set and then evaluated on a testing data set; the best model was selected based on the lowest RMSE. A sketch of the channel subsetting is shown below.
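For illustration, here is a minimal sketch of how a single channel column (the newData$channel used in the rendering code below) could be derived from the data_channel_is_* indicators. The name rawData and the use of readr are assumptions for this sketch, not details confirmed by the repository.

```r
library(dplyr)

# Assumption: rawData holds the full UCI data set, e.g.
# rawData <- readr::read_csv("OnlineNewsPopularity.csv")

# Collapse the six data_channel_is_* indicators into one channel column.
newData <- rawData %>%
  mutate(channel = case_when(
    data_channel_is_lifestyle == 1     ~ "Lifestyle",
    data_channel_is_entertainment == 1 ~ "Entertainment",
    data_channel_is_bus == 1           ~ "Business",
    data_channel_is_socmed == 1        ~ "Social media",
    data_channel_is_tech == 1          ~ "Tech",
    data_channel_is_world == 1         ~ "World"
  )) %>%
  filter(!is.na(channel))  # drop articles with no channel flag set
```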
List of packages used:
- caret: To run the regression and ensemble methods with train/test split and cross validation.
- dplyr: A part of the tidyverse used for manipulating data.
- GGally: To create ggcorr() and ggpairs() correlation plots.
- glmnet: To access best subset selection.
- ggplot2: A part of the tidyverse used for creating graphics.
- gridExtra: To plot with multiple grid objects.
- gt: To test a low-dimensional null hypothesis against high-dimensional alternative models.
- knitr: To get nice table printing formats, mainly for the contingency tables.
- leaps: To identify different best models of different sizes.
- markdown: To render several output formats.
- MASS: To access forward and backward selection algorithms.
- randomForest: To access random forest algorithms.
- tidyr: A part of the tidyverse used for data cleaning.
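As an illustration of the caret workflow described above, here is a hedged sketch of fitting and comparing the three model types on one channel subset. The data frame name channelData, the 70/30 split, and the 5-fold cross validation are assumptions for this sketch, not values confirmed by the repository.

```r
library(caret)

set.seed(42)
# Assumption: channelData is one data_channel_is_* subset with predictors and shares.
trainIndex <- createDataPartition(channelData$shares, p = 0.7, list = FALSE)
trainData <- channelData[trainIndex, ]
testData  <- channelData[-trainIndex, ]

ctrl <- trainControl(method = "cv", number = 5)  # 5-fold cross validation

# Linear regression, random forest, and boosted tree, as named above.
lmFit  <- train(shares ~ ., data = trainData, method = "lm",  trControl = ctrl)
rfFit  <- train(shares ~ ., data = trainData, method = "rf",  trControl = ctrl)
gbmFit <- train(shares ~ ., data = trainData, method = "gbm", trControl = ctrl,
                verbose = FALSE)

# Compare test-set RMSE; the model with the lowest RMSE is selected.
sapply(list(lm = lmFit, rf = rfFit, gbm = gbmFit), function(fit) {
  postResample(predict(fit, newdata = testData), obs = testData$shares)["RMSE"]
})
```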
Links to view the results of:
The analysis for Lifestyle articles is available here.
The analysis for Entertainment articles is available here.
The analysis for Business articles is available here.
The analysis for Social media articles is available here.
The analysis for Tech articles is available here.
The analysis for World articles is available here.
Code used to render the analyses:
```r
library(rmarkdown)
library(tibble)

# One report per channel found in the data.
selectID <- unique(newData$channel)

# Output file name and parameter list for each channel.
output_file <- paste0(selectID, "Analysis.md")
params <- lapply(selectID, FUN = function(x) { list(channel = x) })
reports <- tibble(output_file, params)

# Render Project_3.Rmd once per row, passing the channel as a parameter.
apply(reports, MARGIN = 1,
      FUN = function(x) {
        render(input = "./Project_3.Rmd",
               output_format = "github_document",
               output_file = x[[1]],
               params = x[[2]])
      })
```
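For this automation to work, Project_3.Rmd must declare a params field in its YAML header; a minimal illustration (the title and default channel value shown are assumptions):

```yaml
---
title: "Online News Popularity Analysis"
output: github_document
params:
  channel: "Lifestyle"  # overridden by render(..., params = ...) above
---
```

Inside the document, the value is then available as params$channel and can be used to filter newData down to the relevant channel subset.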