Project 3: Predictive Model for data_channel_is_world Analysis

Smitali Paknaik and Paula Bailey 2022-11-16

Introduction

Social and online media now pervade our society. By learning how people think and what they like, we gain valuable insight into how news and information spread through a small community or around the globe, and how they affect the members of a community, their thinking, and their actions. This information is very useful to product companies trying to sell products, advertising agencies, writers, influencers, and others; it also has a large impact on political and economic changes occurring within a state or country. What is advertised well tends to sell well, and by that theory companies have started applying machine learning methods to historical data to learn about consumer responses and expand their markets.

The main goal of this project is to analyze the OnlineNewsPopularity data using Exploratory Data Analysis (EDA) and to apply regression and ensemble methods to predict the response variable shares from a set of predictor variables chosen through variable selection. That is, we will try to find out which parameters impact the number of online shares and whether they can help us predict that number.

The OnlineNewsPopularity data is divided into genres, or channels, and each channel has characteristics that need to be examined separately. Some generalization of the analysis is also necessary, as it speeds up the process and automates the entire analysis at the click of a mouse.

Our first step is to reshape the data so that the various data_channel groups are in a format where each channel can be examined separately. This is needed so that the automation script can properly filter the required channel. EDA is then performed on each channel using graphs, tables, and heatmaps.

The next step is to create training and testing sets for model fitting and prediction. Prior to model fitting, variable selection and any needed pre-processing are done. Two linear regression models form the first stage of model fitting and prediction: the data may be adequately explained by a set of linear equations, without complex algorithms that cost time and computation. Hence, regression models are often a good starting point for prediction.

The other approach evaluated comprises two ensemble methods: Random Forest and Boosted Trees. Both are tree-based ensemble methods that aggregate many decision trees to predict future values. They do not follow a single prescribed model fit; instead they work from the best statistic obtained while constructing the fit. Ensemble methods are generally expected to predict well, and using this data we will evaluate on which channels these models perform best.

At the end of each channel analysis, all 4 models are compared and the best model for that channel is noted.

About Data and Data Preparation

There are 48 parameters in the dataset used for analysis. The six data_channel_is_* indicator columns together encode a categorical channel variable, which will be used to filter the data in this project and split our analysis into separate channel files. The categories are: ‘lifestyle’, ‘entertainment’, ‘bus’, ‘socmed’, ‘tech’, and ‘world’. Additional information about this data can be accessed here.

The csv file is imported using read.csv(). Because we have automated the analysis, each channel must be fed into the same R code to produce its analysis and predictions as a separate file. The channel categories are stored in column (wide) format, so they are converted into long format using pivot_longer(). The required category is then easily filtered by passing the channel variable into the filter code.

Irrelevant columns such as url and timedelta are removed, as they have no significance in our prediction models.

The exploratory analysis and modelling were carried out on the remaining set of selected variables.

Package Imports

  1. caret To run the regression and ensemble methods with a train/test split and cross validation.
  2. dplyr A part of the tidyverse used for manipulating data.
  3. GGally To create ggcorr and ggpairs correlation plots.
  4. glmnet To fit penalized regression models.
  5. ggplot2 A part of the tidyverse used for creating graphics.
  6. gridExtra To plot with multiple grid objects.
  7. gt To create presentation-ready display tables.
  8. knitr To get nice table printing formats, mainly for the contingency tables.
  9. leaps To identify different best models of different sizes.
  10. markdown To render several output formats.
  11. MASS To access forward and backward selection algorithms.
  12. randomForest To access random forest algorithms.
  13. tidyr A part of the tidyverse used for data cleaning.
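
For reference, a setup chunk along the following lines (assuming all of the packages above are installed) loads them at the top of the Rmd. Note that loading MASS after dplyr masks select(), which is why dplyr::select() is called explicitly throughout this report.

library(caret)         # regression and ensemble methods with cross validation
library(dplyr)         # data manipulation
library(GGally)        # ggcorr() and ggpairs() correlation plots
library(glmnet)        # penalized regression models
library(ggplot2)       # graphics
library(gridExtra)     # arranging multiple plots in one grid
library(gt)            # presentation-ready tables
library(knitr)         # kable() table formatting
library(leaps)         # best subset selection
library(markdown)      # rendering output formats
library(MASS)          # stepwise selection; masks dplyr::select()
library(randomForest)  # random forest algorithms
library(tidyr)         # pivot_longer() and other tidying tools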

Load data and check for NAs

data <- read.csv("OnlineNewsPopularity.csv") %>% 
                              rename(
                                    Monday      = `weekday_is_monday`,
                                    Tuesday     = `weekday_is_tuesday`,
                                    Wednesday   = `weekday_is_wednesday`,
                                    Thursday    = `weekday_is_thursday`,
                                    Friday      = `weekday_is_friday`,
                                    Saturday    = `weekday_is_saturday`,
                                    Sunday      = `weekday_is_sunday`) %>% 
                                    dplyr::select(-url, -timedelta)

#check for missing values
anyNA(data)  
## [1] FALSE

The UCI site notes that url and timedelta are non-predictive variables, so we removed them from the data set during import. Afterwards, we checked that the data set contains no missing values: anyNA(data) returned FALSE, so the file has no missing data.

For the automation process, it will be easier if all the channels (ie data_channel_is_*) are in one column. We used pivot_longer() to pivot columns: data_channel_is_lifestyle, data_channel_is_entertainment, data_channel_is_bus, data_channel_is_socmed, data_channel_is_tech, and data_channel_is_world from wide to long format.

dataPivot <- data %>% pivot_longer(cols = c("data_channel_is_lifestyle", "data_channel_is_entertainment", "data_channel_is_bus","data_channel_is_socmed","data_channel_is_tech", "data_channel_is_world"),names_to = "channel",values_to = "Temp") 


newData <- dataPivot %>% filter(Temp != 0) %>% dplyr::select(-Temp)

Now the individual channel columns are combined into one column named channel. The variable Temp indicates whether an article belongs to that particular channel, so for the final data set we remove any rows where it is 0. We performed the same pivot_longer() process on the days of the week.

dataPivot <- newData %>% pivot_longer(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"), names_to = "weekday",values_to = "Temp1") 

newData <- dataPivot %>% filter(Temp1 != 0) %>% dplyr::select(-Temp1)

Create Training and Testing Sets

When we run Render_Code.R, this code chunk filters data for each channel type to complete the analysis.

selectChannel <- newData %>% filter(channel == params$channel)

To make the analysis reproducible, we set the seed to 21 and created the training and testing sets with a 70% split.

set.seed(21)
trainIndex <- createDataPartition(selectChannel$shares, p = 0.7, list = FALSE)
selectTrain <- selectChannel[trainIndex, ]
selectTest <- selectChannel[-trainIndex, ]

Exploratory Data Analysis

The data is analysed in this section. This includes basic statistics, summary tables, contingency tables, and useful plots. We have tried to capture as much information as we can.

Training Data Summary

str(selectTrain)
## tibble [5,900 x 48] (S3: tbl_df/tbl/data.frame)
##  $ n_tokens_title              : num [1:5900] 10 9 12 11 11 11 12 10 10 11 ...
##  $ n_tokens_content            : num [1:5900] 231 1248 682 799 317 ...
##  $ n_unique_tokens             : num [1:5900] 0.636 0.49 0.46 0.504 0.611 ...
##  $ n_non_stop_words            : num [1:5900] 1 1 1 1 1 ...
##  $ n_non_stop_unique_tokens    : num [1:5900] 0.797 0.732 0.635 0.738 0.729 ...
##  $ num_hrefs                   : num [1:5900] 4 11 10 8 7 8 5 8 12 10 ...
##  $ num_self_hrefs              : num [1:5900] 1 0 0 6 6 0 2 0 9 0 ...
##  $ num_imgs                    : num [1:5900] 1 1 1 1 1 1 0 1 1 1 ...
##  $ num_videos                  : num [1:5900] 1 0 0 0 0 0 0 0 0 0 ...
##  $ average_token_length        : num [1:5900] 5.09 4.62 4.62 4.7 5.24 ...
##  $ num_keywords                : num [1:5900] 5 8 6 6 5 7 6 6 8 7 ...
##  $ kw_min_min                  : num [1:5900] 0 0 0 0 0 0 217 217 217 217 ...
##  $ kw_max_min                  : num [1:5900] 0 0 0 0 0 0 504 504 819 1200 ...
##  $ kw_avg_min                  : num [1:5900] 0 0 0 0 0 ...
##  $ kw_min_max                  : num [1:5900] 0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_max                  : num [1:5900] 0 0 0 0 0 0 17100 17100 17100 17100 ...
##  $ kw_avg_max                  : num [1:5900] 0 0 0 0 0 ...
##  $ kw_min_avg                  : num [1:5900] 0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_avg                  : num [1:5900] 0 0 0 0 0 ...
##  $ kw_avg_avg                  : num [1:5900] 0 0 0 0 0 ...
##  $ self_reference_min_shares   : num [1:5900] 0 0 0 101 638 0 3100 0 704 0 ...
##  $ self_reference_max_shares   : num [1:5900] 0 0 0 2600 3300 0 3100 0 3700 0 ...
##  $ self_reference_avg_sharess  : num [1:5900] 0 0 0 972 1560 ...
##  $ is_weekend                  : num [1:5900] 0 0 0 0 0 0 0 0 0 0 ...
##  $ LDA_00                      : num [1:5900] 0.04 0.025 0.0333 0.0333 0.04 ...
##  $ LDA_01                      : num [1:5900] 0.04 0.2873 0.0333 0.0335 0.04 ...
##  $ LDA_02                      : num [1:5900] 0.84 0.401 0.867 0.866 0.84 ...
##  $ LDA_03                      : num [1:5900] 0.04 0.2619 0.0333 0.0333 0.04 ...
##  $ LDA_04                      : num [1:5900] 0.04 0.025 0.0333 0.0334 0.04 ...
##  $ global_subjectivity         : num [1:5900] 0.314 0.482 0.473 0.42 0.375 ...
##  $ global_sentiment_polarity   : num [1:5900] 0.0519 0.1024 0.0622 0.1229 0.0913 ...
##  $ global_rate_positive_words  : num [1:5900] 0.039 0.0385 0.0499 0.0375 0.0379 ...
##  $ global_rate_negative_words  : num [1:5900] 0.0303 0.0208 0.0396 0.02 0.0189 ...
##  $ rate_positive_words         : num [1:5900] 0.562 0.649 0.557 0.652 0.667 ...
##  $ rate_negative_words         : num [1:5900] 0.438 0.351 0.443 0.348 0.333 ...
##  $ avg_positive_polarity       : num [1:5900] 0.298 0.404 0.343 0.395 0.316 ...
##  $ min_positive_polarity       : num [1:5900] 0.1 0.1 0.05 0.1 0.0333 ...
##  $ max_positive_polarity       : num [1:5900] 0.5 1 0.6 0.7 0.8 1 1 0.5 0.7 0.7 ...
##  $ avg_negative_polarity       : num [1:5900] -0.238 -0.415 -0.22 -0.233 -0.189 ...
##  $ min_negative_polarity       : num [1:5900] -0.5 -1 -0.6 -0.6 -0.3 ...
##  $ max_negative_polarity       : num [1:5900] -0.1 -0.1 -0.05 -0.05 -0.125 -0.1 -0.1 -0.05 -0.05 -0.05 ...
##  $ title_subjectivity          : num [1:5900] 0 0 0.75 0 0.483 ...
##  $ title_sentiment_polarity    : num [1:5900] 0 0 -0.25 0 0.417 ...
##  $ abs_title_subjectivity      : num [1:5900] 0.5 0.5 0.25 0.5 0.0167 ...
##  $ abs_title_sentiment_polarity: num [1:5900] 0 0 0.25 0 0.417 ...
##  $ shares                      : int [1:5900] 710 2200 1600 504 1800 1200 755 468 1500 10400 ...
##  $ channel                     : chr [1:5900] "data_channel_is_world" "data_channel_is_world" "data_channel_is_world" "data_channel_is_world" ...
##  $ weekday                     : chr [1:5900] "Monday" "Monday" "Monday" "Monday" ...

str() allows us to view the structure of the data. We see each variable has an appropriate data type.

selectTrain %>% dplyr::select(shares, starts_with("rate")) %>% summary()
##      shares       rate_positive_words rate_negative_words
##  Min.   :    35   Min.   :0.0000      Min.   :0.0000     
##  1st Qu.:   827   1st Qu.:0.5349      1st Qu.:0.2500     
##  Median :  1100   Median :0.6429      Median :0.3478     
##  Mean   :  2279   Mean   :0.6218      Mean   :0.3467     
##  3rd Qu.:  1900   3rd Qu.:0.7381      3rd Qu.:0.4483     
##  Max.   :141400   Max.   :1.0000      Max.   :1.0000

This summary() provides us with information about the distribution (shape) of shares (the response variable), rate_positive_words, and rate_negative_words.

If the mean is greater than the median, the distribution is right-skewed, with outliers towards the right tail.  
If the mean is less than the median, the distribution is left-skewed, with outliers towards the left tail.  
If the mean equals the median, the distribution is approximately normal (symmetric).  
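
As a quick numeric check of these rules, we can compare the mean and median of shares directly (a minimal base R sketch; the values match the summary above):

# The mean of shares (2279) exceeds the median (1100), so shares is right-skewed
mean(selectTrain$shares) > median(selectTrain$shares)
## [1] TRUE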

Training Data Visualizations

Note: Due to the extreme minimum and maximum values of our response variable, shares, we transformed the data by applying log(). This addresses the differences in magnitude within the shares variable, and the results are visually much better.

g <- ggplot(selectTrain, aes(x = rate_positive_words, y = log(shares))) + geom_point() 
g + geom_jitter(aes(color = as.factor(weekday))) + 
        labs(x = "positive words", y = "log(shares)",
        title = "Rate of Positive Words") +
  scale_colour_discrete("", breaks = c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"))

We can inspect the trend of shares as a function of the positive word rate. If the points show an upward trend, then articles with more positive words tend to be shared more often. If we see a negative trend, then articles with more positive words tend to be shared less often.

g <- ggplot(selectTrain, aes(x = shares))
g + geom_histogram(bins = sqrt(nrow(selectTrain)))  + 
          labs(title = "Histogram of Shares")
To view the true shape of the response variable, shares, we did not apply log() to transform the data.

We can inspect the shape of the response variable, shares. If the shape of the histogram is symmetrical or bell-shaped, then shares has a normal distribution with the values evenly spread, and the mean equals the median. If the shape is left-skewed, the tail containing most of the outliers extends to the left and the mean is less than the median. If the shape is right-skewed, the tail containing most of the outliers extends to the right and the mean is greater than the median.

g <- ggplot(selectTrain) 

 g + aes(x = weekday, y = log(shares)) +
  geom_boxplot(varwidth = TRUE) + 
  geom_jitter(alpha = 0.25, width = 0.2,aes(color = as.factor(weekday)))  + 
     labs(title = "Box Plot of Shares Per Day") +
   scale_colour_discrete("") +
   scale_x_discrete(limits = c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")) +
   theme(legend.position = "none")
We can inspect the distribution of shares as a function of the day of the week. Points within the body of the box fall between the 1st and 3rd quartiles. Black points outside the whiskers of the boxplot represent outliers in the data; these outliers contribute to the shape of the distribution.

Contingency Tables

kable_if(table(selectTrain$channel, selectTrain$weekday))
##                        
##                         Friday Monday Saturday Sunday Thursday Tuesday Wednesday
##   data_channel_is_world    917    946      363    389     1098    1084      1103

Articles per Channel vs Weekday - We can inspect the distribution of articles shared as a function of the day of the week. We can easily see on which days articles are more likely to be shared.

kable_if(table(selectTrain$weekday, selectTrain$num_keywords))
##            
##               3   4   5   6   7   8   9  10
##   Friday      9  49  94 165 167 162 137 134
##   Monday      9  52 128 172 173 137 106 169
##   Saturday   10  26  46  64  58  50  34  75
##   Sunday      1  24  46  72  71  66  36  73
##   Thursday   15  69 132 198 203 140 141 200
##   Tuesday    13  60 115 203 198 154 139 202
##   Wednesday  17  64 122 186 205 176 139 194

Day of the Week vs Number of Keywords - We can inspect the distribution of the number of keywords as a function of the day of the week. Across the top are the unique numbers of keywords in the channel. Shared articles are divided by the number of keywords and the day the article was shared. This suggests that the more keywords an article has, the more likely it is to be shared with others.

Correlations

We want to remove any predictor variables that are highly correlated with one another, as this can cause multicollinearity. When variables have this characteristic it can lead to skewed or misleading results. We created groupings of predictor variables to look at their relationships with our response variable, shares, and with each other.

ggcorr(selectTrain%>% dplyr::select(n_tokens_title, n_tokens_content, n_unique_tokens, n_non_stop_words, n_non_stop_unique_tokens,  shares),label_round = 2, label = "FALSE", label_size=1)
The correlation chart above does not show any strong relationships between shares and the variables displayed. We can see many other relationships, indicated by the darker orange and blue colors. For instance, n_tokens_title and n_non_stop_unique_tokens have a negative relationship.
ggcorr(selectTrain%>% dplyr::select( average_token_length, num_keywords,num_hrefs, num_self_hrefs, num_imgs, num_videos, shares),label_round = 2, label_size=3, label = "FALSE")
The correlation chart above does not show any strong relationships between shares and the variables displayed. We can see many other relationships, indicated by the darker orange and blue colors.
ggcorr(selectTrain%>% dplyr::select(kw_min_min, kw_max_min, kw_avg_min, kw_min_max, kw_max_max,  kw_avg_max, kw_min_avg, kw_max_avg, kw_avg_avg, shares),label_round = 2, label = "FALSE", label_size=3)
The correlation chart above does not show any strong relationships between shares and the variables displayed. We can see many other relationships, indicated by the darker orange and blue colors. Some correlation is expected within this grouping, since these are all keyword-based (kw_) statistics.
ggcorr(selectTrain%>% dplyr::select(self_reference_min_shares, self_reference_max_shares, self_reference_avg_sharess, global_subjectivity, global_sentiment_polarity, global_rate_positive_words, global_rate_negative_words, shares),label_round = 2, label = "FALSE", label_size=3)
The correlation chart above does not show any strong relationships between shares and the variables displayed. We can see many other relationships, indicated by the darker orange and blue colors.
ggcorr(selectTrain%>% dplyr::select(LDA_00, LDA_01, LDA_02, LDA_03, LDA_04,  shares),label_round = 2, label = "FALSE", label_size=3)
The correlation chart above does not show any strong relationships between shares and the variables displayed. We can see that many of the variables above have a mild relationship to one another.
ggcorr(selectTrain%>% dplyr::select(rate_positive_words, rate_negative_words, avg_positive_polarity, min_positive_polarity, max_positive_polarity,  avg_negative_polarity, min_negative_polarity, max_negative_polarity, shares),label_round = 2, label = "FALSE", label_size=3)
The correlation chart above does not show any strong relationships between shares and the variables displayed. We can see many other relationships, indicated by the darker orange and blue colors.
ggcorr(selectTrain%>% dplyr::select( title_subjectivity, title_sentiment_polarity, abs_title_subjectivity, abs_title_sentiment_polarity, shares),label_round = 2, label = "FALSE")
The correlation chart above does not show any strong relationships between shares and the variables displayed. We can see many other relationships, indicated by the darker orange and blue colors.

Looking at the overall results from ggcorr(), we do not see any highly correlated relationships (positive or negative) with our response variable, shares. If there were such a relationship, it would appear orange for a strong positive correlation or blue for a strong negative correlation. We do notice that many of the predictor variables are highly correlated with one another.

Because of space limitations in the correlation maps above, we decided not to display the actual correlation values. Even after shortening the variable labels, the maps were cluttered and busy.

More EDA.

The above observations are simplified further by categorizing the response variable shares into ‘Poor’ and ‘Good’. This shows how the data are split around the median of the training data: for params$channel, shares below Q2 are labelled Poor and those above Q2 are labelled Good. The summary statistics for each category can then be seen and deciphered more easily.

q<-quantile(selectTrain$shares,0.5)
cat_share<-selectTrain%>% 
  mutate(Rating=ifelse(shares<q,"Poor","Good"))

head(cat_share,5)
## # A tibble: 5 x 49
##   n_tokens_ti~1 n_tok~2 n_uni~3 n_non~4 n_non~5 num_h~6 num_s~7 num_i~8 num_v~9 avera~* num_k~* kw_mi~* kw_ma~* kw_av~*
##           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1            10     231   0.636    1.00   0.797       4       1       1       1    5.09       5       0       0       0
## 2             9    1248   0.490    1.00   0.732      11       0       1       0    4.62       8       0       0       0
## 3            12     682   0.460    1.00   0.635      10       0       1       0    4.62       6       0       0       0
## 4            11     799   0.504    1.00   0.738       8       6       1       0    4.70       6       0       0       0
## 5            11     317   0.611    1.00   0.729       7       6       1       0    5.24       5       0       0       0
## # ... with 35 more variables: kw_min_max <dbl>, kw_max_max <dbl>, kw_avg_max <dbl>, kw_min_avg <dbl>,
## #   kw_max_avg <dbl>, kw_avg_avg <dbl>, self_reference_min_shares <dbl>, self_reference_max_shares <dbl>,
## #   self_reference_avg_sharess <dbl>, is_weekend <dbl>, LDA_00 <dbl>, LDA_01 <dbl>, LDA_02 <dbl>, LDA_03 <dbl>,
## #   LDA_04 <dbl>, global_subjectivity <dbl>, global_sentiment_polarity <dbl>, global_rate_positive_words <dbl>,
## #   global_rate_negative_words <dbl>, rate_positive_words <dbl>, rate_negative_words <dbl>,
## #   avg_positive_polarity <dbl>, min_positive_polarity <dbl>, max_positive_polarity <dbl>,
## #   avg_negative_polarity <dbl>, min_negative_polarity <dbl>, max_negative_polarity <dbl>, ...

Summary Statistics.

The most basic statistic, the mean, is assessed here across the two categories. Not every variable is informative, but the effect of each parameter on the Rating can be seen: this depicts how each parameter, on average, has contributed to the shares in each category.

library(gt)

cat_share$Rating=as.factor(cat_share$Rating)

means1<-aggregate(cat_share,by=list(cat_share$Rating),mean)
means2<-cat_share %>% group_by(Rating,is_weekend) %>% summarise_all('mean')

means1
##   Group.1 n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_unique_tokens num_hrefs
## 1    Good       10.60061         601.7921       0.5033418        0.9581184                0.6549206 10.714416
## 2    Poor       10.55969         585.6426       0.5202327        0.9815739                0.6794629  9.601152
##   num_self_hrefs num_imgs num_videos average_token_length num_keywords kw_min_min kw_max_min kw_avg_min kw_min_max
## 1       2.425797 3.293171  0.6276176             4.615951     7.337785   17.78483  1137.0091   289.3925   6947.295
## 2       2.445681 2.212284  0.4376200             4.752973     7.239155   13.41881   886.8936   239.1268   6660.886
##   kw_max_max kw_avg_max kw_min_avg kw_max_avg kw_avg_avg self_reference_min_shares self_reference_max_shares
## 1   781155.3   226037.0   850.5225   4961.250   2608.637                  3534.260                  7439.772
## 2   791297.1   229782.7   792.9431   4411.802   2406.143                  1863.459                  4552.470
##   self_reference_avg_sharess is_weekend     LDA_00     LDA_01    LDA_02     LDA_03    LDA_04 global_subjectivity
## 1                   5009.340 0.17541730 0.07220912 0.05477716 0.6496203 0.07730162 0.1460918           0.4018799
## 2                   2974.334 0.06679463 0.05832096 0.05190964 0.7000415 0.06248429 0.1272436           0.4042657
##   global_sentiment_polarity global_rate_positive_words global_rate_negative_words rate_positive_words
## 1                0.08002302                 0.03179794                 0.01680024           0.6232110
## 2                0.07172992                 0.03077277                 0.01743468           0.6200187
##   rate_negative_words avg_positive_polarity min_positive_polarity max_positive_polarity avg_negative_polarity
## 1           0.3349074             0.3233019            0.08602588             0.7001245            -0.2487695
## 2           0.3615552             0.3248542            0.09010685             0.6967664            -0.2554882
##   min_negative_polarity max_negative_polarity title_subjectivity title_sentiment_polarity abs_title_subjectivity
## 1            -0.5546767           -0.09297000          0.2541826               0.03439130              0.3608695
## 2            -0.5684885           -0.09395725          0.2265738               0.02142073              0.3649850
##   abs_title_sentiment_polarity    shares channel weekday Rating
## 1                    0.1323086 3476.7527      NA      NA     NA
## 2                    0.1167734  763.3063      NA      NA     NA
means2
## # A tibble: 4 x 49
## # Groups:   Rating [2]
##   Rating is_weekend n_tokens_~1 n_tok~2 n_uni~3 n_non~4 n_non~5 num_h~6 num_s~7 num_i~8 num_v~9 avera~* num_k~* kw_mi~*
##   <fct>       <dbl>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 Good            0        10.5    598.   0.505   0.960   0.658   10.6     2.42    3.31   0.627    4.62    7.33  18.9  
## 2 Good            1        10.9    621.   0.496   0.948   0.642   11.4     2.47    3.20   0.630    4.60    7.38  12.3  
## 3 Poor            0        10.5    585.   0.521   0.982   0.680    9.61    2.45    2.22   0.444    4.75    7.27  14.3  
## 4 Poor            1        10.8    589.   0.507   0.971   0.665    9.47    2.35    2.10   0.345    4.73    6.80   0.741
## # ... with 35 more variables: kw_max_min <dbl>, kw_avg_min <dbl>, kw_min_max <dbl>, kw_max_max <dbl>,
## #   kw_avg_max <dbl>, kw_min_avg <dbl>, kw_max_avg <dbl>, kw_avg_avg <dbl>, self_reference_min_shares <dbl>,
## #   self_reference_max_shares <dbl>, self_reference_avg_sharess <dbl>, LDA_00 <dbl>, LDA_01 <dbl>, LDA_02 <dbl>,
## #   LDA_03 <dbl>, LDA_04 <dbl>, global_subjectivity <dbl>, global_sentiment_polarity <dbl>,
## #   global_rate_positive_words <dbl>, global_rate_negative_words <dbl>, rate_positive_words <dbl>,
## #   rate_negative_words <dbl>, avg_positive_polarity <dbl>, min_positive_polarity <dbl>, max_positive_polarity <dbl>,
## #   avg_negative_polarity <dbl>, min_negative_polarity <dbl>, max_negative_polarity <dbl>, ...

Scatter plots are generated to get an idea of the direction of the predictor variables and to see the direction and extent of linearity or non-linearity among the predictors.

q<-quantile(cat_share$shares,c(0.25,0.75))

name<-names(cat_share)
predictors1<-(name)[30:35]
predictors2<-(name) [36:41]
predictors3<-(name) [42:46]
response <- "Rating"

par(mfrow = c(3, 1))

cat_share %>% 
  select(c(response, predictors1)) %>% 
    select_if(~ !any(is.na(.))) %>%   
      ggpairs(aes(colour = Rating,alpha=0.5),
              upper = list(continuous = wrap("cor", title="")))
cat_share %>% 
  select(c(response, predictors2)) %>% 
    select_if(~ !any(is.na(.))) %>%   
      ggpairs(aes(colour = Rating,alpha=0.5),
              upper = list(continuous = wrap("cor", title="")))
cat_share %>% 
  select(c(response, predictors3)) %>% 
    select_if(~ !any(is.na(.))) %>%   
      ggpairs(aes(colour = Rating,alpha=0.5),
              upper = list(continuous = wrap("cor", title="")))

The next aspect is to check the shape of some predictor variables with respect to the response variable. These are mostly parameters like the number of keywords, videos, pictures, etc. This shows whether the channel’s shares are skewed by some of these factors, and to what extent. For example, shares may peak when an article has more keywords, in which case we would see a right-skewed bar plot.

bar_dat<-cat_share %>% select(c('n_tokens_title','n_tokens_content','shares',
                                'num_self_hrefs','num_imgs','num_videos','num_keywords',
                                'Rating'))

ggplot(bar_dat,aes())+
geom_density(aes(x = shares,fill=Rating,alpha=0.5))
g1<-ggplot(bar_dat,aes())+
  geom_bar(aes(x = n_tokens_title, y = shares, fill = Rating),stat = 'identity', position = "dodge")
g2<-ggplot(bar_dat,aes())+
  geom_bar(aes(x = num_self_hrefs, y = shares, fill = Rating),stat = 'identity', position = "dodge")
g3<-ggplot(bar_dat,aes())+
  geom_bar(aes(x = num_imgs, y = shares, fill = Rating),stat = 'identity', position = "dodge")
g4<-ggplot(bar_dat,aes())+
  geom_bar(aes(x = num_videos, y = shares, fill = Rating),stat = 'identity', position = "dodge")
g5<-ggplot(bar_dat,aes())+
  geom_bar(aes(x = num_keywords, y = shares, fill = Rating),stat = 'identity', position = "dodge")
g6<-ggplot(bar_dat,aes())+
  geom_boxplot(aes(y = shares))+coord_flip()
grid.arrange(g1, g2, g3,g4,g5,g6, ncol = 2, nrow = 3)

Exploratory Data Analysis (EDA) with Principal Component Analysis (PCA).

A different aspect of EDA is tried here. We are uncertain whether this will be a fruitful exercise, but the data set does have many predictor variables, and in any industry the cost of computation and labor counts.

To reduce the size of the data set more efficiently and to understand the variance between variables, we apply Principal Component Analysis to the training data to see whether the predictor variables can be combined to capture a relationship with the response variable. PCA has many uses for datasets with too many predictors: it performs dimensionality reduction, helps in understanding the relationships among variables and their impact on the response in terms of variance, and is an effective way to remove collinearity from the dataset by dropping highly correlated predictors.

Here we use prcomp() on our training data to find the principal components. Some variables have been discarded because their magnitudes were very small and they may or may not affect the model; to keep the analysis simple, the critical ones have been kept to demonstrate dimensionality reduction.

The PCA results are plotted by variance, another statistic that is critical for understanding the relationships between variables. These plots show how much each component contributes: PC1 holds the maximum variance, PC2 the next most, and so on. We removed some irrelevant variables and tested the model here.

PC_Train<-selectTrain %>% select(- c('shares','channel','LDA_00','LDA_01','LDA_02','LDA_03','LDA_04','n_tokens_content','n_non_stop_unique_tokens','weekday'))

PC_fit <- prcomp(PC_Train,  scale = TRUE)
summary(PC_fit)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     2.4028 1.88926 1.74632 1.66128 1.57685 1.49293 1.39279 1.26698 1.19016 1.09960 1.02235 1.00240
## Proportion of Variance 0.1519 0.09393 0.08025 0.07263 0.06543 0.05865 0.05105 0.04224 0.03728 0.03182 0.02751 0.02644
## Cumulative Proportion  0.1519 0.24586 0.32612 0.39875 0.46418 0.52283 0.57388 0.61612 0.65340 0.68522 0.71272 0.73917
##                           PC13    PC14    PC15    PC16    PC17   PC18    PC19    PC20   PC21    PC22    PC23    PC24
## Standard deviation     0.99510 0.96850 0.92900 0.89863 0.84830 0.8202 0.77638 0.74447 0.7136 0.70033 0.68939 0.64444
## Proportion of Variance 0.02606 0.02468 0.02271 0.02125 0.01894 0.0177 0.01586 0.01458 0.0134 0.01291 0.01251 0.01093
## Cumulative Proportion  0.76523 0.78991 0.81262 0.83387 0.85281 0.8705 0.88637 0.90096 0.9144 0.92726 0.93977 0.95070
##                           PC25    PC26    PC27    PC28    PC29    PC30    PC31    PC32    PC33    PC34    PC35    PC36
## Standard deviation     0.57476 0.56314 0.50769 0.45597 0.44156 0.35013 0.34820 0.29837 0.27657 0.23624 0.18994 0.17991
## Proportion of Variance 0.00869 0.00835 0.00678 0.00547 0.00513 0.00323 0.00319 0.00234 0.00201 0.00147 0.00095 0.00085
## Cumulative Proportion  0.95939 0.96774 0.97452 0.97999 0.98512 0.98835 0.99154 0.99388 0.99590 0.99737 0.99831 0.99917
##                           PC37      PC38
## Standard deviation     0.17798 5.354e-09
## Proportion of Variance 0.00083 0.000e+00
## Cumulative Proportion  1.00000 1.000e+00
par(mfrow = c(4, 1))

 #screeplot
screeplot(PC_fit, type = "bar")

#Proportion of Variance
plot(PC_fit$sdev^2/sum(PC_fit$sdev^2), xlab = "Principal Component",
ylab = "Proportion of Variance Explained", ylim = c(0, 1), type = 'b')

# Cumulative proportion of variance.
plot(cumsum(PC_fit$sdev^2/sum(PC_fit$sdev^2)), xlab = "Principal Component",
ylab = "Cum. Prop of Variance Explained", ylim = c(0, 1), type = 'b')

Variable Selection by Correlation Results

Before using any reduction algorithm to determine which variables belong in the model, we used the correlation tables above to reduce the predictors. If two predictors are correlated, it is not necessary to use both.
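
As a sketch of how this screening could be automated (the 0.75 cutoff below is illustrative, not a threshold we applied), caret’s findCorrelation() flags predictors whose pairwise correlations exceed a chosen cutoff:

# Flag numeric predictors whose pairwise |correlation| exceeds the cutoff
corMat <- cor(selectTrain %>% dplyr::select(where(is.numeric), -shares))
highCor <- findCorrelation(corMat, cutoff = 0.75, names = TRUE)
highCor  # candidate columns to drop before modeling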

We removed any variable that contained min and max if the predictor set also contained an avg version:

min_positive_polarity   
max_positive_polarity   
min_negative_polarity   
max_negative_polarity   
self_reference_min_shares   
self_reference_max_shares   
kw_min_avg   
kw_max_avg   
kw_min_max   
kw_max_max   
kw_min_min  
kw_max_min  

We removed all of the LDA_ variables; in the correlation chart above, they appear correlated with one another. LDA stands for Latent Dirichlet Allocation, a topic-modeling method; each variable measures an article’s closeness to one of five latent topics.

LDA_00: Closeness to LDA topic 0
LDA_01: Closeness to LDA topic 1
LDA_02: Closeness to LDA topic 2
LDA_03: Closeness to LDA topic 3
LDA_04: Closeness to LDA topic 4

We removed the following variables because the data contains the absolute value of the same information:

title_subjectivity
title_sentiment_polarity

We removed the following variables because of their strong relationships with global_rate_positive_words and global_rate_negative_words:

global_subjectivity
global_sentiment_polarity

We removed the following variables because of their strong relationship with n_tokens_content:

n_unique_tokens
n_non_stop_unique_tokens
n_non_stop_words

The final data set will contain the following columns (features):

n_tokens_title
n_tokens_content
num_hrefs
num_self_hrefs
num_imgs
num_videos
average_token_length
num_keywords
kw_avg_min
kw_avg_max
kw_avg_avg
self_reference_avg_sharess
global_rate_positive_words
global_rate_negative_words
rate_positive_words
rate_negative_words
avg_positive_polarity
avg_negative_polarity
abs_title_subjectivity
abs_title_sentiment_polarity

Out of the original 61 variables, we will be looking at using up to 20. There is a relationship between num_hrefs and num_self_hrefs, and one of them may be removed later in the analysis. We decided to use best stepwise selection to further reduce the number of variables in the model.

We will also run a separate linear regression model using the variables selected from the visual correlation results above.

Linear Regression

Linear regression is a method for understanding the relationship between a response variable Y and one or more predictor variables x1, x2, etc. The method finds the line that best fits the data, called the least squares regression line.

For Simple Linear Regression, we have one response variable and one predictor variable. It uses the formula y = b0 + b1x, where

y:  response variable
b0: intercept (baseline value, if all other values are zero)
b1: regression coefficient
x:  independent variable

For Multiple Linear Regression, we have one response variable and any number of predictor variables. It uses the formula Y = B0 + B1X1 + B2X2 + … + BpXp + E, where

Y: The response variable
Xp: The pth predictor variable
Bp: The average effect on Y of a one unit increase in Xp, holding all other predictors fixed
E: The error term

For the results of a linear regression model to be valid and reliable, the following must be true (a quick diagnostic sketch follows the list):

1. Linear relationship: There exists a linear relationship between the independent variable, x, and the dependent variable, y. 
2. Independence: The residuals are independent. 
3. Homoscedasticity: The residuals have constant variance at every level of x.
4. Normality: The residuals of the model are normally distributed.
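
These assumptions are commonly checked with residual diagnostics. The sketch below uses base R’s plot() method on a fitted lm object; num_keywords and num_videos are placeholder predictors for illustration, not our final model:

# Fit an illustrative model and draw the four standard diagnostic plots
diagFit <- lm(shares ~ num_keywords + num_videos, data = selectTrain)
par(mfrow = c(2, 2))
plot(diagFit)  # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage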

Linear Regression using Best Stepwise

The Best Stepwise method combines the backward and forward stepwise methods to find the best result. We start with no predictors, then sequentially add the predictors that contribute the most (forward step). However, after adding each new variable, the method removes any variables that no longer improve the fit of the model.

After running the Best Stepwise method, we applied the results to the test data to determine how well the model performed.

set.seed(21)

lmFit<- train(shares ~ ., data = selectTrain%>% dplyr::select(shares, n_tokens_title,
n_tokens_content,
num_hrefs,
num_self_hrefs,
num_imgs,
num_videos,
average_token_length,
num_keywords,
kw_avg_min,
kw_avg_max,
kw_avg_avg,
self_reference_avg_sharess,
global_rate_positive_words,
global_rate_negative_words,
rate_positive_words,
rate_negative_words,
avg_positive_polarity,
avg_negative_polarity,
abs_title_subjectivity,
abs_title_sentiment_polarity),
method = 'leapSeq',
preProcess = c("center", "scale"),
tuneGrid  = data.frame(nvmax = 1:20),
trControl = trainControl(method = "cv", number = 5)
)

lmFit$results
##    nvmax     RMSE    Rsquared      MAE   RMSESD  RsquaredSD    MAESD
## 1      1 5446.985 0.005638233 1921.305 749.2998 0.004614737 71.18216
## 2      2 5428.205 0.011884558 1895.926 752.4582 0.006390358 78.97026
## 3      3 5428.654 0.011853457 1897.958 746.0035 0.004575389 79.92383
## 4      4 5437.105 0.009431813 1906.083 745.5327 0.006238892 86.43348
## 5      5 5423.522 0.014840278 1914.520 746.8799 0.008246499 51.99625
## 6      6 5429.343 0.013128681 1894.194 745.7272 0.005761569 79.27376
## 7      7 5422.040 0.015886191 1913.513 728.3666 0.010246444 47.27169
## 8      8 5419.933 0.017506972 1894.067 735.4395 0.010347322 67.15073
## 9      9 5412.067 0.018800215 1897.209 737.1651 0.008742241 49.14106
## 10    10 5414.541 0.018966194 1893.877 742.1724 0.011065807 66.11113
## 11    11 5408.829 0.020331108 1892.612 735.2902 0.009574828 63.39022
## 12    12 5406.067 0.021390181 1892.164 736.6284 0.009388639 63.77681
## 13    13 5401.307 0.022938695 1892.691 733.6797 0.009154286 64.40450
## 14    14 5400.012 0.023286400 1888.370 733.3397 0.009105428 62.40108
## 15    15 5402.499 0.023086468 1894.876 732.0363 0.008879078 58.63257
## 16    16 5396.489 0.025189144 1899.651 731.5809 0.010178320 50.88701
## 17    17 5397.946 0.024865792 1902.194 734.4734 0.010471674 53.13460
## 18    18 5397.538 0.024941869 1899.098 735.9549 0.010386659 53.62358
## 19    19 5397.966 0.024766441 1902.636 735.2421 0.009836323 53.67258
## 20    20 5396.047 0.025397054 1899.863 735.3905 0.010098502 54.00733
lmFit$bestTune
##    nvmax
## 20    20
set.seed(21)

btTrain <- selectTrain%>% dplyr::select(num_videos, n_tokens_content,kw_avg_max, kw_avg_avg, rate_positive_words,shares)
                                 
btTest <- selectTest%>% dplyr::select(num_videos, n_tokens_content,kw_avg_max, kw_avg_avg, rate_positive_words,shares)

lmFit2<- train(shares ~ ., data = btTrain,
method = 'lm',
trControl = trainControl(method = "cv", number = 5)
)

lmFit2$results
##   intercept     RMSE    Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      TRUE 5435.517 0.008881754 1900.032 740.9561 0.00205911 71.75676
predBest <- predict(lmFit2, newdata = btTest)  %>% as_tibble()
LinearRegression_1 <- postResample(predBest, obs =btTest$shares)

Linear Regression on Original Variable Selection

As a comparison, we also fit a model using the variables selected from the correlation heat maps.

set.seed(21)

stTrain <- selectTrain%>% dplyr::select(shares, n_tokens_title,
n_tokens_content,
num_hrefs,
num_self_hrefs,
num_imgs,
num_videos,
average_token_length,
num_keywords,
kw_avg_min,
kw_avg_max,
kw_avg_avg,
self_reference_avg_sharess,
global_rate_positive_words,
global_rate_negative_words,
rate_positive_words,
rate_negative_words,
avg_positive_polarity,
avg_negative_polarity,
abs_title_subjectivity,
abs_title_sentiment_polarity)

                                 
stTest <- selectTest%>% dplyr::select(shares, n_tokens_title,
n_tokens_content,
num_hrefs,
num_self_hrefs,
num_imgs,
num_videos,
average_token_length,
num_keywords,
kw_avg_min,
kw_avg_max,
kw_avg_avg,
self_reference_avg_sharess,
global_rate_positive_words,
global_rate_negative_words,
rate_positive_words,
rate_negative_words,
avg_positive_polarity,
avg_negative_polarity,
abs_title_subjectivity,
abs_title_sentiment_polarity)

lmFit3<- train(shares ~ ., data = stTrain,
method = 'lm',
trControl = trainControl(method = "cv", number = 5)
)

lmFit3$results
##   intercept     RMSE   Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      TRUE 5396.047 0.02539705 1899.863 735.3905  0.0100985 54.00733
OrgBest <- predict(lmFit3, newdata = stTest)  %>% as_tibble()
LinearRegression_3 <- postResample(OrgBest, obs = stTest$shares)

Principal Component selections and Linear Regression with PCA.

The chunk of code below selects the principal components from the PCA above based on the desired cumulative variance threshold. As there are many predictor variables, we proceed with a threshold of 0.80; that is, we keep the components that together explain 80% of the variance. The number of principal components selected can be seen below once the code is run.

The selected principal components are used below in the linear regression model, after applying our PCA fit to the training and test datasets.

var_count<-cumsum(PC_fit$sdev^2/sum(PC_fit$sdev^2))

count=0

for (i in (1:length(var_count)))
{
  if (var_count[i]<0.80)
  { per=var_count[i]
    count=count+1
    }
  else if (var_count[i]>0.80)
  {
    break}
}

print(paste("Number of Principal Components optimized are",count ,"for cumuative variance of 0.80"))
## [1] "Number of Principal Components optimized are 14 for cumuative variance of 0.80"

The aim here is to test whether PCA helps us reduce the dimensions of our data and to use the result in our regression model. PCA is a very efficient method of feature extraction and can help extract maximum information at minimum cost. The principal components obtained above are applied to the training and test data to get PCA scores; we then fit a linear regression on the training scores and generate predictions for the test data. We do not expect a very efficient result, but it does give an idea of whether the predictor variables are fit for a linear model.

new_PC <- predict(PC_fit, newdata = PC_Train) %>% as_tibble() 
res<-selectTrain$shares
PC_Vars<-data.frame(new_PC[1:count],res)
head(PC_Vars,5)
##          PC1         PC2       PC3      PC4        PC5         PC6        PC7       PC8        PC9       PC10
## 1 -0.4165374 -1.40587661 -4.235722 2.210712 -1.3520815  0.79536852 -1.7644580 0.5547879 -0.5164037  0.3675021
## 2 -2.1413361 -0.64533250 -4.238154 2.239541 -0.9913070  0.01827614  0.1380015 1.7072615  0.4165305  1.1462965
## 3 -0.8483088 -1.10569669 -3.808325 2.232589 -0.7490266 -2.52925422 -1.6064907 0.8691515 -1.0359514  1.7444834
## 4 -1.0645748 -0.03155368 -4.264733 1.741831 -1.3610985  0.51192777 -0.3175866 1.2436611 -0.9842578 -0.2689401
## 5 -1.1629424  0.90793784 -3.948155 1.161591 -1.0113170 -2.32964551 -2.3125634 0.9145607 -0.4541395 -0.9311847
##         PC11       PC12       PC13       PC14  res
## 1  0.3271023  0.4190601  0.4051998 -0.1358739  710
## 2 -1.2307146  0.7480563 -0.8632653 -0.3815678 2200
## 3  0.5866187  0.3629116  0.2565374 -0.5928845 1600
## 4  0.3932674  0.1846807  0.1459019 -0.9783476  504
## 5  1.5576140 -0.8788444  0.8432663 -0.5294179 1800
fit <- train(res~ .,method='lm', data=PC_Vars,metric='RMSE')

#apply PCA fit to test data
PC_Test<-selectTest %>% select(- c('shares','channel','LDA_00','LDA_01',
                                   'LDA_02','LDA_03','LDA_04','n_tokens_content',
                                   'n_non_stop_unique_tokens','weekday'))

test_PC<-predict(PC_fit,newdata=PC_Test) %>% as_tibble()
test_res<-selectTest$shares
PC_Vars_Test<-data.frame(test_PC[1:count],test_res)
head(PC_Vars_Test,5)
##          PC1        PC2       PC3       PC4       PC5         PC6        PC7        PC8        PC9        PC10
## 1  1.5420406  0.7675858 -4.721761 1.7250503 -1.707024  1.27117371 -1.0566051  0.6312967 -0.1537278 -0.75455252
## 2 -0.6848554  4.4721316 -4.282235 0.1999377 -2.530258 -1.34440406 -3.0178101 -0.6379680 -1.1225180  0.06445948
## 3 -0.6476563 -0.5460883 -4.423745 2.0383499 -1.344147  0.92818221 -1.5070961  0.7657248  0.7509924  0.30405293
## 4 -1.4693907  1.8093767 -2.830206 4.7589276 -1.346789  1.20426323 -2.2230924  2.0394691 -0.8751516  0.99164263
## 5  0.2307231  0.2756968 -3.880048 4.1193413 -1.531708 -0.01303883 -0.5794831  2.6594987 -0.1759123  0.05901753
##         PC11       PC12        PC13       PC14 test_res
## 1  1.1258538 -0.4252144  1.26625582  0.4991164      598
## 2  1.3173614 -0.5817008  0.50268361 -0.4163106     1500
## 3 -0.2959002 -0.1285931 -0.07304338 -0.0601115      495
## 4  0.2103686  0.7430633 -0.21986833 -0.7503355      971
## 5 -0.4333876  0.7935336 -1.02013482 -2.1054736     1400
lm_predict<-predict(fit,newdata=PC_Vars_Test)%>% as_tibble()
LinearRegression_2<-postResample(lm_predict, obs = PC_Vars_Test$test_res)

Ensemble Methods

Ensemble methods are machine learning models that combine several base models to produce one optimal predictive model. Going a step beyond single decision trees, ensemble methods do not rely on one decision tree and hope for the right decision at each split; they take a sample of decision trees, calculate metrics at each split, and make a final prediction based on the aggregated results of the sampled trees.

Random Forest

Random forest is a tree-based ensemble method. It takes a desired number of bootstrapped samples from the data and builds a tree for each sample. While building each tree, at every split only a random subset of the predictors is considered as split candidates. The predictions of the individual trees are averaged to produce the final model.

The random forest model in R is created using rf as the method and mtry for the tuning grid. The tuning grid evaluates models considering from one up to nine variables at each split.

trainCtrl <- trainControl(method = 'cv', number = 5)
#trainCtrl <- trainControl(method = 'repeatedcv', number = 3, repeats =  1)

rfTrain <- selectTrain %>% dplyr::select(num_videos, num_imgs,num_keywords,
                                   global_subjectivity,global_sentiment_polarity,
                                   global_rate_positive_words,is_weekend,
                                   avg_positive_polarity,title_sentiment_polarity,shares)

rfTest <- selectTest %>% dplyr::select(num_videos, num_imgs,num_keywords,
                                   global_subjectivity,global_sentiment_polarity,
                                   global_rate_positive_words,is_weekend,
                                   avg_positive_polarity,title_sentiment_polarity,shares)

rfFit <- train(shares ~. , data = rfTrain,
method = 'rf',
trControl = trainCtrl,
preProcess = c("center", "scale"),
tuneGrid  = data.frame(mtry = 1:9)    
)


rfPredict<-predict(rfFit,newdata=rfTest) %>% as_tibble()
Random_Forest<-postResample(rfPredict, obs = rfTest$shares)

Boosted Trees

Boosting adds ensemble members sequentially, so that each new model corrects the predictions made by prior models, and outputs a weighted average of these predictions. It differs from random forest in how the decision trees are built: the trees are grown in an additive manner, not in parallel as in random forest. Boosting is based on weak learners, which have high bias and low variance, and it mainly aims to reduce bias (and eventually variance) by aggregating the outputs of many models. We used cross validation here to tune the parameters, and the tuneGrid has been kept small due to limited processing speed. The method caret uses to fit the training data is gbm. The fit is plotted using plot() to show the results of cross validation; this can be complex to interpret, but we need not worry, as caret selects the best parameters and assigns them to our model fit, which can be used directly for predictions.

BT_Train<-selectTrain %>% select(c('num_videos', 'num_imgs','num_keywords',
                                   'global_subjectivity','global_sentiment_polarity',
                                   'global_rate_positive_words','is_weekend',
                                   'avg_positive_polarity','title_sentiment_polarity','shares'))

BT_Test<-selectTest %>% select(c('num_videos', 'num_imgs','num_keywords',
                                   'global_subjectivity','global_sentiment_polarity',
                                   'global_rate_positive_words','is_weekend',
                                   'avg_positive_polarity','title_sentiment_polarity','shares'))

boostfit <- train(shares ~., data = BT_Train, method = "gbm",
trControl = trainControl(method = "repeatedcv", number = 5,
                         repeats = 3,verboseIter = FALSE,allowParallel = TRUE),
preProcess = c("scale"),
tuneGrid = expand.grid(n.trees = c(50,100,200),
           interaction.depth = 1:5,
           shrinkage = c(0.001,0.1),
           n.minobsinnode = c(3,5)),
           verbose = FALSE)

plot(boostfit)
boostfit$bestTune
##    n.trees interaction.depth shrinkage n.minobsinnode
## 35     100                 1       0.1              5
boostpredict<-predict(boostfit,newdata=BT_Test) %>% as_tibble()
Boosted_Trees<-postResample(boostpredict, obs = BT_Test$shares)

Model Comparisons.

Linear Regression and Ensemble method results are now put together here for comparison.

Several metrics come out of the postResample() results after fitting on the test data; postResample() compares the predictions with the actual test values and records these metrics.

The RMSE, or Root Mean Square Error, is the square root of the mean squared residual error. It describes the absolute fit of the model to the data: how close the true data values are to the model’s predicted values.

R-squared is the proportion of variance in the response variable that can be explained by the predictor variables. In other words, it is a measure of goodness of fit: a high R-squared means our model fits the training/test data well.

MAE, or Mean Absolute Error, is the absolute distance between the observations and the predictions, averaged over all observations.
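
For reference, the same three numbers can be recomputed by hand from the test-set predictions; the following minimal sketch matches what postResample() reports, using the stepwise model’s predictions as an example:

# Manual computation of postResample()'s regression metrics
obs  <- btTest$shares
pred <- predBest$value
c(RMSE     = sqrt(mean((obs - pred)^2)),  # root mean squared error
  Rsquared = cor(obs, pred)^2,            # caret reports squared correlation
  MAE      = mean(abs(obs - pred)))       # mean absolute error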

We will be comparing our models based on minimum RMSE.

# Some nomenclature things for sanity.

LinearRegression_Stepwise_Sel<-LinearRegression_1  
LinearRegression_Orignal_Var_Sel<-LinearRegression_3   
Linear_Regression_PCA<-LinearRegression_2 

# Getting out best model in terms of RMSE
Metrics_df<-data.frame(rbind(LinearRegression_Stepwise_Sel,
                             LinearRegression_Orignal_Var_Sel,
                             Linear_Regression_PCA, Boosted_Trees, 
                             Random_Forest))
Metrics_df
##                                      RMSE    Rsquared      MAE
## LinearRegression_Stepwise_Sel    7273.530 0.003562967 1948.549
## LinearRegression_Orignal_Var_Sel 7273.530 0.003562967 1948.549
## Linear_Regression_PCA            7261.190 0.006835496 1943.950
## Boosted_Trees                    7261.330 0.007130908 1958.206
## Random_Forest                    7271.394 0.005119243 1947.018
best_model<-Metrics_df[which.min(Metrics_df$RMSE), ]
rmse_min<-best_model$RMSE
r2max<-best_model$Rsquared

The best model for predicting the number of shares for the dataset data_channel_is_world in terms of RMSE is Linear_Regression_PCA, with an RMSE of 7261.19. This model has an R-squared of about 0.0068355, which shows that the predictors of this model do not have a significant impact on the response variable.

The prediction spread of all the models is also depicted in a box plot, with a comparison to the actual test values. The actual test values include outliers extending very high. To further improve the linear regression models, we could remove the outliers in the training and test data and repeat the regression steps. Linear regression may show more bias due to outliers in the data, which is why data cleaning comes up so often as pre-processing when working with complex datasets. However, this must be used cautiously, as valuable information may be lost, which could change the outcome of our predictions. Here, the box plot simply summarizes the predicted test values, since printing that much data is impractical.

box_df<-data.frame(Actual_Response=selectTest$shares,
                  LinearRegression_Stepwise=predBest$value,
                  LinearRegression_Orignal_Variable=OrgBest$value,
                  LinearRegression_PCA=lm_predict$value,
                  Random_Forest = rfPredict$value,
                  Boosted_Trees=boostpredict$value)

ggplot(stack(box_df), aes(x = ind, y = values)) +
  stat_boxplot(geom = "errorbar") +
  labs(x="Models", y="Response Values") +
  geom_boxplot(fill = "white", colour = "red") +coord_flip()

Conclusion

Different genres of news and information carry significant importance in every industry, and forecasting user sentiment is one of the critical requirements shaping today’s news and information content. To capture the dynamics of every type of population, understanding these parameters and their effect on online shares is very important. We hope this analysis is useful to anybody who wants to perform a step-by-step evaluation of such a large dataset and extract critical information. The code in this analysis is meant to capture both the conceptual aspects of the work and the results. The same set of algorithms can be applied directly to any dataset with similar attributes, and results can be obtained quickly. We hope this proves helpful to anybody who wants to build similar model comparisons.

Automation for Building Six Reports

selectID <- unique(newData$channel)  

output_file <- paste0(selectID, "Analysis.md")  

params = lapply(selectID, FUN = function(x){list(channel = x)})

reports <- tibble(output_file, params)

library(rmarkdown)

apply(reports, MARGIN = 1,
      FUN = function(x){
        render(input = "./Project_3.Rmd",
               output_format = "github_document", 
               output_file = x[[1]], 
               params = x[[2]])
      })