Predictive Models for the Online News Popularity Data Set
The purpose of this repository is to create predictive models and automate R Markdown reports. The analyses are performed on the Online News Popularity Data Set from the UCI Machine Learning Repository. Additional information about this data can be accessed here.
The data contains the following variables:
- url: URL of the article (non-predictive)
- timedelta: Days between the article publication and the dataset acquisition (non-predictive)
- n_tokens_title: Number of words in the title
- n_tokens_content: Number of words in the content
- n_unique_tokens: Rate of unique words in the content
- n_non_stop_words: Rate of non-stop words in the content
- n_non_stop_unique_tokens: Rate of unique non-stop words in the content
- num_hrefs: Number of links
- num_self_hrefs: Number of links to other articles published by Mashable
- num_imgs: Number of images
- num_videos: Number of videos
- average_token_length: Average length of the words in the content
- num_keywords: Number of keywords in the metadata
- data_channel_is_lifestyle: Is data channel ‘Lifestyle’?
- data_channel_is_entertainment: Is data channel ‘Entertainment’?
- data_channel_is_bus: Is data channel ‘Business’?
- data_channel_is_socmed: Is data channel ‘Social Media’?
- data_channel_is_tech: Is data channel ‘Tech’?
- data_channel_is_world: Is data channel ‘World’?
- kw_min_min: Worst keyword (min. shares)
- kw_max_min: Worst keyword (max. shares)
- kw_avg_min: Worst keyword (avg. shares)
- kw_min_max: Best keyword (min. shares)
- kw_max_max: Best keyword (max. shares)
- kw_avg_max: Best keyword (avg. shares)
- kw_min_avg: Avg. keyword (min. shares)
- kw_max_avg: Avg. keyword (max. shares)
- kw_avg_avg: Avg. keyword (avg. shares)
- self_reference_min_shares: Min. shares of referenced articles in Mashable
- self_reference_max_shares: Max. shares of referenced articles in Mashable
- self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
- weekday_is_monday: Was the article published on a Monday?
- weekday_is_tuesday: Was the article published on a Tuesday?
- weekday_is_wednesday: Was the article published on a Wednesday?
- weekday_is_thursday: Was the article published on a Thursday?
- weekday_is_friday: Was the article published on a Friday?
- weekday_is_saturday: Was the article published on a Saturday?
- weekday_is_sunday: Was the article published on a Sunday?
- is_weekend: Was the article published on the weekend?
- LDA_00: Closeness to LDA topic 0
- LDA_01: Closeness to LDA topic 1
- LDA_02: Closeness to LDA topic 2
- LDA_03: Closeness to LDA topic 3
- LDA_04: Closeness to LDA topic 4
- global_subjectivity: Text subjectivity
- global_sentiment_polarity: Text sentiment polarity
- global_rate_positive_words: Rate of positive words in the content
- global_rate_negative_words: Rate of negative words in the content
- rate_positive_words: Rate of positive words among non-neutral tokens
- rate_negative_words: Rate of negative words among non-neutral tokens
- avg_positive_polarity: Avg. polarity of positive words
- min_positive_polarity: Min. polarity of positive words
- max_positive_polarity: Max. polarity of positive words
- avg_negative_polarity: Avg. polarity of negative words
- min_negative_polarity: Min. polarity of negative words
- max_negative_polarity: Max. polarity of negative words
- title_subjectivity: Title subjectivity
- title_sentiment_polarity: Title polarity
- abs_title_subjectivity: Absolute subjectivity level
- abs_title_sentiment_polarity: Absolute polarity level
- shares: Number of shares (target)
In this project, subsets by data_channel_is_* were produced to automate the R Markdown reports. The predictive models include linear regression models, a random forest model, and a boosted tree model. Each model was fit on a training data set and then evaluated on a testing data set; the best model was selected based on the lowest RMSE. A sketch of the channel subsetting is shown below.
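For illustration, here is a minimal sketch of how a single channel column (the newData$channel used in the rendering code below) could be derived from the data_channel_is_* indicators. The name rawData and the use of readr are assumptions for this sketch, not details confirmed by the repository.

```r
library(dplyr)

# Assumption: rawData holds the full UCI data set, e.g.
# rawData <- readr::read_csv("OnlineNewsPopularity.csv")

# Collapse the six data_channel_is_* indicators into one channel column.
newData <- rawData %>%
  mutate(channel = case_when(
    data_channel_is_lifestyle == 1     ~ "Lifestyle",
    data_channel_is_entertainment == 1 ~ "Entertainment",
    data_channel_is_bus == 1           ~ "Business",
    data_channel_is_socmed == 1        ~ "Social media",
    data_channel_is_tech == 1          ~ "Tech",
    data_channel_is_world == 1         ~ "World"
  )) %>%
  filter(!is.na(channel))  # drop articles with no channel flag set
```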
List of packages used:
- caret: To run the regression and ensemble methods with train/test split and cross validation.
- dplyr: A part of the tidyverse used for manipulating data.
- GGally: To create ggcorr() and ggpairs() correlation plots.
- glmnet: To access best subset selection.
- ggplot2: A part of the tidyverse used for creating graphics.
- gridExtra: To plot with multiple grid objects.
- gt: To test a low-dimensional null hypothesis against high-dimensional alternative models.
- knitr: To get nice table printing formats, mainly for the contingency tables.
- leaps: To identify different best models of different sizes.
- markdown: To render several output formats.
- MASS: To access forward and backward selection algorithms.
- randomForest: To access random forest algorithms.
- tidyr: A part of the tidyverse used for data cleaning.
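As an illustration of the caret workflow described above, here is a hedged sketch of fitting and comparing the three model types on one channel subset. The data frame name channelData, the 70/30 split, and the 5-fold cross validation are assumptions for this sketch, not values confirmed by the repository.

```r
library(caret)

set.seed(42)
# Assumption: channelData is one data_channel_is_* subset with predictors and shares.
trainIndex <- createDataPartition(channelData$shares, p = 0.7, list = FALSE)
trainData <- channelData[trainIndex, ]
testData  <- channelData[-trainIndex, ]

ctrl <- trainControl(method = "cv", number = 5)  # 5-fold cross validation

# Linear regression, random forest, and boosted tree, as named above.
lmFit  <- train(shares ~ ., data = trainData, method = "lm",  trControl = ctrl)
rfFit  <- train(shares ~ ., data = trainData, method = "rf",  trControl = ctrl)
gbmFit <- train(shares ~ ., data = trainData, method = "gbm", trControl = ctrl,
                verbose = FALSE)

# Compare test-set RMSE; the model with the lowest RMSE is selected.
sapply(list(lm = lmFit, rf = rfFit, gbm = gbmFit), function(fit) {
  postResample(predict(fit, newdata = testData), obs = testData$shares)["RMSE"]
})
```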
Links to view the results of:
The analysis for Lifestyle articles is available here.
The analysis for Entertainment articles is available here.
The analysis for Business articles is available here.
The analysis for Social media articles is available here.
The analysis for Tech articles is available here.
The analysis for World articles is available here.
Code used to render the analyses:
```r
library(rmarkdown)
library(tibble)

# One report per channel found in the data.
selectID <- unique(newData$channel)

# Output file name and parameter list for each channel.
output_file <- paste0(selectID, "Analysis.md")
params <- lapply(selectID, FUN = function(x) { list(channel = x) })
reports <- tibble(output_file, params)

# Render Project_3.Rmd once per row, passing the channel as a parameter.
apply(reports, MARGIN = 1,
      FUN = function(x) {
        render(input = "./Project_3.Rmd",
               output_format = "github_document",
               output_file = x[[1]],
               params = x[[2]])
      })
```
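For this automation to work, Project_3.Rmd must declare a params field in its YAML header; a minimal illustration (the title and default channel value shown are assumptions):

```yaml
---
title: "Online News Popularity Analysis"
output: github_document
params:
  channel: "Lifestyle"  # overridden by render(..., params = ...) above
---
```

Inside the document, the value is then available as params$channel and can be used to filter newData down to the relevant channel subset.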