data_channel_is_entertainment Analysis
Smitali Paknaik and Paula Bailey
2022-11-16
Social and online media now permeate our society. By learning how people think and what they like, we gain valuable insight into how news and information spread through a small community or around the globe, and into how they affect the members of a community, their thinking, and their actions. This insight is valuable to product companies, advertisers, writers, influencers, and others, and it also bears on political and economic changes within a state or country. What is advertised well tends to sell well; following that theory, companies have started applying machine learning methods to historical data on consumer responses and using the results to expand their markets.
The main goal of this project is to analyze the OnlineNewsPopularity data using Exploratory Data Analysis (EDA) and to apply regression and ensemble methods to predict the response variable shares from a set of predictor variables chosen through variable selection. That is, we will try to find out which parameters affect the number of online shares and whether they can help us predict that number.
The OnlineNewsPopularity data is divided into genres, or channels, and each channel has characteristics that need to be examined separately. Some generalization of the analysis is also necessary, as it speeds up the process and automates the entire analysis at a mouse click.
Our first step is to reshape the data so that the various data channel groups are in a format where each channel can be examined separately. This is needed for proper filtering of the required channel in the automation script. The EDA is performed on each channel using graphs, tables, and heatmaps.
The next step is creating training and test sets for model fitting and prediction. Prior to model fitting, variable selection and any needed pre-processing are carried out. Two linear regression models are fit first: the data may be well explained by a set of linear equations and may not require complex algorithms that cost time and computation, so regression models are often a good starting point for prediction.
The other approach evaluated is a pair of tree-based ensemble methods: Random Forest and Boosted Trees. These methods do not impose a fixed functional form on the fit; they build trees according to the best statistic obtained at each step of model construction. Ensemble methods are generally expected to predict well, and with this data we will evaluate how they perform for each channel.
At the end of each channel analysis, all four models are compared and the best model for that channel is noted.
There are 48 parameters in the dataset. The channel parameter is a categorical variable that will be used to filter the data into a separate analysis file per channel. Its categories are: ‘lifestyle’, ‘entertainment’, ‘bus’, ‘socmed’, ‘tech’ and ‘world’. Additional information about this data can be accessed here.
The csv file is imported using read.csv(). Because the analysis is automated, each channel is fed into the R code and its analysis and predictions are written to a separate file. The channel categories are in column (wide) format, so they are converted into long format using pivot_longer(). The required category is then easily filtered by passing the channel variable into the filter code.
Irrelevant columns such as url and timedelta are removed, as they have no significance in our prediction models. The exploratory analysis and modelling were carried out on the remaining set of selected variables.
caret: To run the regression and ensemble methods with train/test split and cross validation.
dplyr: A part of the tidyverse used for manipulating data.
GGally: To create ggcorr and ggpairs correlation plots.
glmnet: To fit penalized regression models.
ggplot2: A part of the tidyverse used for creating graphics.
gridExtra: To plot with multiple grid objects.
gt: To build display tables from tabular data.
knitr: To get nice table printing formats, mainly for the contingency tables.
leaps: To identify different best models of different sizes.
markdown: To render several output formats.
MASS: To access forward and backward selection algorithms.
randomForest: To access random forest algorithms.
tidyr: A part of the tidyverse used for data cleaning.

data <- read.csv("OnlineNewsPopularity.csv") %>%
rename(
Monday = `weekday_is_monday`,
Tuesday = `weekday_is_tuesday`,
Wednesday = `weekday_is_wednesday`,
Thursday = `weekday_is_thursday`,
Friday = `weekday_is_friday`,
Saturday = `weekday_is_saturday`,
Sunday = `weekday_is_sunday`) %>%
dplyr::select(-url, -timedelta)
#check for missing values
anyNA(data)
## [1] FALSE
The UCI site mentioned that url and
timedelta are non-predictive variables, so we will remove
them from our data set during the importing process. Afterwards, we
checked to validate the data set contained no missing values.
anyNA(data) returned FALSE, so the file has no missing data.
For the automation process, it will be easier if all the channels (ie
data_channel_is_*) are in one column. We used
pivot_longer() to pivot columns: data_channel_is_lifestyle,
data_channel_is_entertainment, data_channel_is_bus,
data_channel_is_socmed, data_channel_is_tech, and data_channel_is_world
from wide to long format.
dataPivot <- data %>%
  pivot_longer(cols = c("data_channel_is_lifestyle", "data_channel_is_entertainment",
                        "data_channel_is_bus", "data_channel_is_socmed",
                        "data_channel_is_tech", "data_channel_is_world"),
               names_to = "channel", values_to = "Temp")
newData <- dataPivot %>% filter(Temp != 0) %>% dplyr::select(-Temp)

Now, the individual channel columns are combined into one column named channel. The variable Temp indicates whether an article exists for that particular channel, so for the final data set we remove any rows where it is 0. We performed the same pivot_longer() process on the days of the week.
dataPivot <- newData %>% pivot_longer(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"), names_to = "weekday", values_to = "Temp1")
newData <- dataPivot %>% filter(Temp1 != 0) %>% dplyr::select(-Temp1)

When we run Render_Code.R, the code chunk below filters the data for each channel type to complete the analysis.
selectChannel <- newData %>% filter(channel == params$channel)

To make the analysis reproducible, we set the seed to 21 and created the training and testing sets with a 70% split.
set.seed(21)
trainIndex <- createDataPartition(selectChannel$shares, p = 0.7, list = FALSE)
selectTrain <- selectChannel[trainIndex, ]
selectTest <- selectChannel[-trainIndex, ]

The data is analysed in this section. This includes basic statistics, summary tables, contingency tables, and useful plots; we have tried to capture as much information as we can.
str(selectTrain)
## tibble [4,941 x 48] (S3: tbl_df/tbl/data.frame)
## $ n_tokens_title : num [1:4941] 12 9 12 11 5 11 10 10 6 7 ...
## $ n_tokens_content : num [1:4941] 219 531 161 454 356 281 909 413 241 376 ...
## $ n_unique_tokens : num [1:4941] 0.664 0.504 0.669 0.566 0.618 ...
## $ n_non_stop_words : num [1:4941] 1 1 1 1 1 ...
## $ n_non_stop_unique_tokens : num [1:4941] 0.815 0.666 0.752 0.755 0.766 ...
## $ num_hrefs : num [1:4941] 4 9 5 5 3 5 3 6 5 3 ...
## $ num_self_hrefs : num [1:4941] 2 0 4 3 3 4 2 1 5 2 ...
## $ num_imgs : num [1:4941] 1 1 0 1 12 1 1 13 1 0 ...
## $ num_videos : num [1:4941] 0 0 6 0 1 0 1 0 0 11 ...
## $ average_token_length : num [1:4941] 4.68 4.4 4.45 4.89 4.47 ...
## $ num_keywords : num [1:4941] 5 7 10 6 10 4 5 6 5 9 ...
## $ kw_min_min : num [1:4941] 0 0 0 0 0 217 217 217 217 217 ...
## $ kw_max_min : num [1:4941] 0 0 0 0 0 593 593 598 598 1200 ...
## $ kw_avg_min : num [1:4941] 0 0 0 0 0 ...
## $ kw_min_max : num [1:4941] 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_max_max : num [1:4941] 0 0 0 0 0 17100 17100 17100 17100 17100 ...
## $ kw_avg_max : num [1:4941] 0 0 0 0 0 ...
## $ kw_min_avg : num [1:4941] 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_max_avg : num [1:4941] 0 0 0 0 0 ...
## $ kw_avg_avg : num [1:4941] 0 0 0 0 0 ...
## $ self_reference_min_shares : num [1:4941] 496 0 638 0 1700 951 20900 527 475 1100 ...
## $ self_reference_max_shares : num [1:4941] 496 0 29200 0 2500 951 20900 527 4400 1100 ...
## $ self_reference_avg_sharess : num [1:4941] 496 0 8261 0 2100 ...
## $ is_weekend : num [1:4941] 0 0 0 0 0 0 0 0 0 0 ...
## $ LDA_00 : num [1:4941] 0.5003 0.0286 0.1258 0.2003 0.4565 ...
## $ LDA_01 : num [1:4941] 0.3783 0.4193 0.0203 0.3399 0.4822 ...
## $ LDA_02 : num [1:4941] 0.04 0.4947 0.02 0.0333 0.02 ...
## $ LDA_03 : num [1:4941] 0.0413 0.0289 0.8139 0.3931 0.0213 ...
## $ LDA_04 : num [1:4941] 0.0401 0.0286 0.02 0.0333 0.02 ...
## $ global_subjectivity : num [1:4941] 0.522 0.43 0.572 0.467 0.436 ...
## $ global_sentiment_polarity : num [1:4941] 0.0926 0.1007 0.1662 0.1255 0.1767 ...
## $ global_rate_positive_words : num [1:4941] 0.0457 0.0414 0.0497 0.0441 0.0618 ...
## $ global_rate_negative_words : num [1:4941] 0.0137 0.0207 0.0186 0.0132 0.014 ...
## $ rate_positive_words : num [1:4941] 0.769 0.667 0.727 0.769 0.815 ...
## $ rate_negative_words : num [1:4941] 0.231 0.333 0.273 0.231 0.185 ...
## $ avg_positive_polarity : num [1:4941] 0.379 0.386 0.427 0.363 0.359 ...
## $ min_positive_polarity : num [1:4941] 0.1 0.1364 0.1 0.1 0.0333 ...
## $ max_positive_polarity : num [1:4941] 0.7 0.8 0.85 1 1 0.5 1 0.8 0.5 0.8 ...
## $ avg_negative_polarity : num [1:4941] -0.35 -0.37 -0.364 -0.215 -0.373 ...
## $ min_negative_polarity : num [1:4941] -0.6 -0.6 -0.8 -0.5 -0.7 ...
## $ max_negative_polarity : num [1:4941] -0.2 -0.1667 -0.125 -0.1 -0.0714 ...
## $ title_subjectivity : num [1:4941] 0.5 0 0.583 0.427 0.455 ...
## $ title_sentiment_polarity : num [1:4941] -0.188 0 0.25 0.168 0.136 ...
## $ abs_title_subjectivity : num [1:4941] 0 0.5 0.0833 0.0727 0.0455 ...
## $ abs_title_sentiment_polarity: num [1:4941] 0.188 0 0.25 0.168 0.136 ...
## $ shares : int [1:4941] 593 1200 1200 4600 631 1300 1700 455 6400 1900 ...
## $ channel : chr [1:4941] "data_channel_is_entertainment" "data_channel_is_entertainment" "data_channel_is_entertainment" "data_channel_is_entertainment" ...
## $ weekday : chr [1:4941] "Monday" "Monday" "Monday" "Monday" ...
str() allows us to view the structure of the data. We
see each variable has an appropriate data type.
selectTrain %>% dplyr::select(shares, starts_with("rate")) %>% summary()
## shares rate_positive_words rate_negative_words
## Min. : 49 Min. :0.0000 Min. :0.0000
## 1st Qu.: 832 1st Qu.:0.5789 1st Qu.:0.2000
## Median : 1200 Median :0.6875 Median :0.3000
## Mean : 3027 Mean :0.6655 Mean :0.3052
## 3rd Qu.: 2100 3rd Qu.:0.7818 3rd Qu.:0.4037
## Max. :210300 Max. :1.0000 Max. :1.0000
This summary() provides information about the distribution (shape) of the response variable shares and of rate_positive_words and rate_negative_words.
If the mean is greater than the median, the distribution is right-skewed, with outliers in the right tail.
If the mean is less than the median, the distribution is left-skewed, with outliers in the left tail.
If the mean equals the median, the distribution is approximately normal.
Note: Due to the extreme minimum and maximum values of our response variable, shares, we transformed the data by applying log(). This addresses the differences in magnitude across the shares variable, and the results are visually much better.
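As a quick numeric check of the note above, here is a minimal sketch (assuming selectTrain is already in memory) comparing the spread of shares before and after the log transform:

summary(selectTrain$shares)       # raw response, with an extreme maximum
summary(log(selectTrain$shares))  # compressed scale after log()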
g <- ggplot(selectTrain, aes(x = rate_positive_words, y = log(shares))) + geom_point()
g + geom_jitter(aes(color = as.factor(weekday))) +
  labs(x = "positive words", y = "log(shares)",
       title = "Rate of Positive Words") +
  scale_fill_discrete(breaks = c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")) +
  scale_colour_discrete("")

We can inspect the trend of shares as a function of the positive word rate. If the points show an upward trend, then articles with more positive words tend to be shared more often; if the trend is negative, articles with more positive words tend to be shared less often.
g <- ggplot(selectTrain, aes(x = shares))
g + geom_histogram(bins = sqrt(nrow(selectTrain))) +
  labs(title = "Histogram of Shares")

We can inspect the shape of the response variable, shares. If the shape of the histogram is symmetrical or bell-shaped, then shares has a normal distribution, with the values evenly spread and the mean equal to the median. If the shape is left-skewed, the tail containing most of the outliers extends to the left and the mean is less than the median. If the shape is right-skewed, the tail containing most of the outliers extends to the right and the mean is greater than the median.
g <- ggplot(selectTrain)
g + aes(x = weekday, y = log(shares)) +
  geom_boxplot(varwidth = TRUE) +
  geom_jitter(alpha = 0.25, width = 0.2, aes(color = as.factor(weekday))) +
  labs(title = "Box Plot of Shares Per Day") +
  scale_colour_discrete("") +
  scale_x_discrete(limits = c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")) +
  theme(legend.position = "none")

kable_if(table(selectTrain$channel, selectTrain$weekday))
##
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## data_channel_is_entertainment 679 952 255 382 851 911 911
Articles per Channel vs Weekday - We can inspect the distribution of shared articles as a function of the day of the week, and easily see on which days articles are most likely to be shared.
kable_if(table(selectTrain$weekday, selectTrain$num_keywords))
##
## 2 3 4 5 6 7 8 9 10
## Friday 1 11 62 117 113 135 88 72 80
## Monday 0 10 81 156 190 177 128 92 118
## Saturday 0 3 23 38 41 44 27 37 42
## Sunday 0 3 17 44 63 50 70 46 89
## Thursday 0 6 74 132 171 163 111 74 120
## Tuesday 0 11 91 146 163 160 129 95 116
## Wednesday 0 16 92 147 174 144 143 65 130
Day of the Week vs Number of Keywords - We can inspect the distribution of the number of keywords as a function of the day of the week. Across the top are the unique numbers of keywords in the channel; shared articles are divided by number of keywords and the day the article was shared. If the counts grow with the number of keywords, articles with more keywords are more likely to be shared with others.
We want to identify predictor variables that are highly correlated with one another, since multicollinearity can lead to skewed or misleading results. We created groupings of predictor variables to look at their relationships with our response variable, shares, and with each other.
ggcorr(selectTrain %>% dplyr::select(n_tokens_title, n_tokens_content, n_unique_tokens, n_non_stop_words, n_non_stop_unique_tokens, shares), label_round = 2, label = FALSE, label_size = 1)
ggcorr(selectTrain %>% dplyr::select(average_token_length, num_keywords, num_hrefs, num_self_hrefs, num_imgs, num_videos, shares), label_round = 2, label_size = 3, label = FALSE)
ggcorr(selectTrain %>% dplyr::select(kw_min_min, kw_max_min, kw_avg_min, kw_min_max, kw_max_max, kw_avg_max, kw_min_avg, kw_max_avg, kw_avg_avg, shares), label_round = 2, label = FALSE, label_size = 3)
ggcorr(selectTrain %>% dplyr::select(self_reference_min_shares, self_reference_max_shares, self_reference_avg_sharess, global_subjectivity, global_sentiment_polarity, global_rate_positive_words, global_rate_negative_words, shares), label_round = 2, label = FALSE, label_size = 3)
ggcorr(selectTrain %>% dplyr::select(LDA_00, LDA_01, LDA_02, LDA_03, LDA_04, shares), label_round = 2, label = FALSE, label_size = 3)
ggcorr(selectTrain %>% dplyr::select(rate_positive_words, rate_negative_words, avg_positive_polarity, min_positive_polarity, max_positive_polarity, avg_negative_polarity, min_negative_polarity, max_negative_polarity, shares), label_round = 2, label = FALSE, label_size = 3)
ggcorr(selectTrain %>% dplyr::select(title_subjectivity, title_sentiment_polarity, abs_title_subjectivity, abs_title_sentiment_polarity, shares), label_round = 2, label = FALSE)

Looking at the overall results from ggcorr(), we do not see any highly correlated relationships (positive or negative) with our response variable, shares. If there were such a relationship, the cell would be orange for a strong positive correlation or blue for a strong negative one. We do notice that many of the predictor variables are highly correlated with one another.
Given the size of the correlation maps above, we decided not to include the actual correlation values; even with shortened variable labels, the maps were cluttered and busy.
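As a complement to reading the heatmaps by eye, a small sketch (not part of the original analysis; caret is already loaded, the 0.9 cutoff is an illustrative choice, and where() assumes dplyr 1.0+) lists predictors whose pairwise correlation exceeds the cutoff:

# Flag numeric predictors with absolute pairwise correlation above 0.9
num_vars <- selectTrain %>% dplyr::select(where(is.numeric), -shares)
cor_mat  <- cor(num_vars)
findCorrelation(cor_mat, cutoff = 0.9, names = TRUE)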
The above observations are simplified further by categorizing the response variable shares into ‘Poor’ and ‘Good’. This shows how the data are split around the median of the training data: for params$channel, shares below Q2 are grouped as Poor and those above Q2 as Good. The summary statistics can then be seen for each category and deciphered more easily.
q<-quantile(selectTrain$shares,0.5)
cat_share<-selectTrain%>%
mutate(Rating=ifelse(shares<q,"Poor","Good"))
head(cat_share,5)
## # A tibble: 5 x 49
## n_tokens_ti~1 n_tok~2 n_uni~3 n_non~4 n_non~5 num_h~6 num_s~7 num_i~8 num_v~9 avera~* num_k~* kw_mi~* kw_ma~* kw_av~*
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 12 219 0.664 1.00 0.815 4 2 1 0 4.68 5 0 0 0
## 2 9 531 0.504 1.00 0.666 9 0 1 0 4.40 7 0 0 0
## 3 12 161 0.669 1.00 0.752 5 4 0 6 4.45 10 0 0 0
## 4 11 454 0.566 1.00 0.755 5 3 1 0 4.89 6 0 0 0
## 5 5 356 0.618 1.00 0.766 3 3 12 1 4.47 10 0 0 0
## # ... with 35 more variables: kw_min_max <dbl>, kw_max_max <dbl>, kw_avg_max <dbl>, kw_min_avg <dbl>,
## # kw_max_avg <dbl>, kw_avg_avg <dbl>, self_reference_min_shares <dbl>, self_reference_max_shares <dbl>,
## # self_reference_avg_sharess <dbl>, is_weekend <dbl>, LDA_00 <dbl>, LDA_01 <dbl>, LDA_02 <dbl>, LDA_03 <dbl>,
## # LDA_04 <dbl>, global_subjectivity <dbl>, global_sentiment_polarity <dbl>, global_rate_positive_words <dbl>,
## # global_rate_negative_words <dbl>, rate_positive_words <dbl>, rate_negative_words <dbl>,
## # avg_positive_polarity <dbl>, min_positive_polarity <dbl>, max_positive_polarity <dbl>,
## # avg_negative_polarity <dbl>, min_negative_polarity <dbl>, max_negative_polarity <dbl>, ...
The most basic statistic is assessed here: the mean across the two categories. Not every variable is informative, but the effect of each parameter on the Rating can be seen, depicting how each parameter, on average, contributes to the shares in each category.
library(gt)
cat_share$Rating=as.factor(cat_share$Rating)
means1<-aggregate(cat_share,by=list(cat_share$Rating),mean)
means2<-cat_share %>% group_by(Rating,is_weekend) %>% summarise_all('mean')
means1
## Group.1 n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_unique_tokens num_hrefs
## 1 Good 10.93121 611.5284 0.8084015 1.3825050 0.9240822 11.545527
## 2 Poor 11.02514 590.5519 0.5358806 0.9732069 0.6762772 9.699505
## num_self_hrefs num_imgs num_videos average_token_length num_keywords kw_min_min kw_max_min kw_avg_min kw_min_max
## 1 3.528032 6.710537 2.429026 4.478706 7.106958 24.76342 1133.718 301.9810 12543.10
## 2 3.458780 6.132317 2.358203 4.476010 6.766694 17.85284 1007.265 276.0858 14072.12
## kw_max_max kw_avg_max kw_min_avg kw_max_avg kw_avg_avg self_reference_min_shares self_reference_max_shares
## 1 756193.1 240961.5 1180.784 5901.366 3271.921 3083.251 9763.969
## 2 774400.4 246533.0 1059.794 5333.398 3020.036 2222.451 7482.466
## self_reference_avg_sharess is_weekend LDA_00 LDA_01 LDA_02 LDA_03 LDA_04 global_subjectivity
## 1 5679.406 0.19443340 0.06735912 0.4221998 0.08687207 0.3581465 0.06502495 0.4540505
## 2 4141.112 0.06100577 0.06428977 0.4256666 0.09460356 0.3530542 0.06238594 0.4489850
## global_sentiment_polarity global_rate_positive_words global_rate_negative_words rate_positive_words
## 1 0.113462 0.03995909 0.01847165 0.6684780
## 2 0.108851 0.04053175 0.01941977 0.6624031
## rate_negative_words avg_positive_polarity min_positive_polarity max_positive_polarity avg_negative_polarity
## 1 0.2997129 0.3664590 0.09354050 0.7980472 -0.2934460
## 2 0.3108039 0.3649256 0.09521202 0.7915279 -0.2944704
## min_negative_polarity max_negative_polarity title_subjectivity title_sentiment_polarity abs_title_subjectivity
## 1 -0.5853050 -0.1127268 0.3220161 0.06646644 0.3213594
## 2 -0.5852528 -0.1122461 0.3140612 0.06192588 0.3247682
## abs_title_sentiment_polarity shares channel weekday Rating
## 1 0.1774378 5168.1109 NA NA NA
## 2 0.1685820 807.9711 NA NA NA
means2
## # A tibble: 4 x 49
## # Groups: Rating [2]
## Rating is_weekend n_tokens_~1 n_tok~2 n_uni~3 n_non~4 n_non~5 num_h~6 num_s~7 num_i~8 num_v~9 avera~* num_k~* kw_mi~*
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Good 0 11.0 604. 0.880 1.48 0.992 11.3 3.53 6.49 2.40 4.48 7.03 25.9
## 2 Good 1 10.8 641. 0.513 0.959 0.643 12.6 3.53 7.63 2.55 4.47 7.42 20.1
## 3 Poor 0 11.0 584. 0.538 0.974 0.679 9.71 3.45 5.92 2.38 4.48 6.73 17.8
## 4 Poor 1 11.0 688. 0.503 0.959 0.635 9.51 3.67 9.39 2.09 4.44 7.26 18.1
## # ... with 35 more variables: kw_max_min <dbl>, kw_avg_min <dbl>, kw_min_max <dbl>, kw_max_max <dbl>,
## # kw_avg_max <dbl>, kw_min_avg <dbl>, kw_max_avg <dbl>, kw_avg_avg <dbl>, self_reference_min_shares <dbl>,
## # self_reference_max_shares <dbl>, self_reference_avg_sharess <dbl>, LDA_00 <dbl>, LDA_01 <dbl>, LDA_02 <dbl>,
## # LDA_03 <dbl>, LDA_04 <dbl>, global_subjectivity <dbl>, global_sentiment_polarity <dbl>,
## # global_rate_positive_words <dbl>, global_rate_negative_words <dbl>, rate_positive_words <dbl>,
## # rate_negative_words <dbl>, avg_positive_polarity <dbl>, min_positive_polarity <dbl>, max_positive_polarity <dbl>,
## # avg_negative_polarity <dbl>, min_negative_polarity <dbl>, max_negative_polarity <dbl>, ...
Scatter plots are generated to get an idea of the behavior of the predictor variables and to see the direction and extent of any linearity or non-linearity among them.
q<-quantile(cat_share$shares,c(0.25,0.75))
name<-names(cat_share)
predictors1<-(name)[30:35]
predictors2<-(name) [36:41]
predictors3<-(name) [42:46]
response <- "Rating"
par(mfrow = c(3, 1))
cat_share %>%
  select(c(response, predictors1)) %>%
  select_if(~ !any(is.na(.))) %>%
  ggpairs(aes(colour = Rating, alpha = 0.5),
          upper = list(continuous = wrap("cor", title = "")))
cat_share %>%
  select(c(response, predictors2)) %>%
  select_if(~ !any(is.na(.))) %>%
  ggpairs(aes(colour = Rating, alpha = 0.5),
          upper = list(continuous = wrap("cor", title = "")))
cat_share %>%
  select(c(response, predictors3)) %>%
  select_if(~ !any(is.na(.))) %>%
  ggpairs(aes(colour = Rating, alpha = 0.5),
          upper = list(continuous = wrap("cor", title = "")))

The next aspect is to check the shape of some predictor variables with respect to the response variable, mostly parameters like the number of keywords, videos, and images. This shows whether channel shares are skewed by any of these factors, and to what extent. For example, shares could be at their maximum when the channel has more keywords, in which case we would see a right-skewed bar plot.
bar_dat<-cat_share %>% select(c('n_tokens_title','n_tokens_content','shares',
                                'num_self_hrefs','num_imgs','num_videos','num_keywords',
                                'Rating'))
ggplot(bar_dat,aes())+
  geom_density(aes(x = shares,fill=Rating,alpha=0.5))

g1<-ggplot(bar_dat,aes())+
  geom_bar(aes(x = n_tokens_title, y = shares, fill = Rating),stat = 'identity', position = "dodge")
g2<-ggplot(bar_dat,aes())+
  geom_bar(aes(x = num_self_hrefs, y = shares, fill = Rating),stat = 'identity', position = "dodge")
g3<-ggplot(bar_dat,aes())+
  geom_bar(aes(x = num_imgs, y = shares, fill = Rating),stat = 'identity', position = "dodge")
g4<-ggplot(bar_dat,aes())+
  geom_bar(aes(x = num_videos, y = shares, fill = Rating),stat = 'identity', position = "dodge")
g5<-ggplot(bar_dat,aes())+
  geom_bar(aes(x = num_keywords, y = shares, fill = Rating),stat = 'identity', position = "dodge")
g6<-ggplot(bar_dat,aes())+
  geom_boxplot(aes(y = shares))+coord_flip()
grid.arrange(g1, g2, g3, g4, g5, g6, ncol = 2, nrow = 3)

A different aspect of EDA is tried here. We are uncertain whether this will be a fruitful exercise, but the data set does have many predictor variables, and in any industry the cost of computation and labor counts.
To find a more efficient way to reduce the size of the data set and understand the variance between variables, we use Principal Component Analysis (PCA) on the training data to see whether the predictor variables can be used together to determine some relationship with the response variable. PCA has many uses for datasets with too many predictors: it performs dimensionality reduction, helps in understanding the relationships among variables and their impact on the response in terms of variance, and is an effective way to remove collinearity from the dataset by dropping highly correlated predictors.
Here we are using prcomp on our training data to find the principal components. Some variables have been discarded because their magnitude was very small and they may or may not affect the model; to keep the analysis simple, some critical ones have been retained to demonstrate dimensionality reduction.
The PCA plots show the variance explained, another statistic that is critical to understanding the relationships between variables. These plots tell us how much each component contributes: PC1 holds the maximum variance, PC2 the next, and so on. We removed some irrelevant variables and tested the model here.
PC_Train<-selectTrain %>% select(- c('shares','channel','LDA_00','LDA_01','LDA_02','LDA_03','LDA_04','n_tokens_content','n_non_stop_unique_tokens','weekday'))
PC_fit <- prcomp(PC_Train, scale = TRUE)
summary(PC_fit)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 2.1169 1.94205 1.82600 1.63183 1.48706 1.44416 1.42111 1.3478 1.28250 1.16601 1.15679 1.01544
## Proportion of Variance 0.1179 0.09925 0.08774 0.07008 0.05819 0.05488 0.05315 0.0478 0.04328 0.03578 0.03521 0.02713
## Cumulative Proportion 0.1179 0.21718 0.30493 0.37500 0.43320 0.48808 0.54123 0.5890 0.63231 0.66809 0.70331 0.73044
## PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 1.00060 0.97748 0.91059 0.87994 0.85229 0.83414 0.80787 0.77859 0.7474 0.69815 0.67182 0.67051
## Proportion of Variance 0.02635 0.02514 0.02182 0.02038 0.01912 0.01831 0.01717 0.01595 0.0147 0.01283 0.01188 0.01183
## Cumulative Proportion 0.75679 0.78193 0.80375 0.82413 0.84325 0.86156 0.87873 0.89468 0.9094 0.92221 0.93409 0.94592
## PC25 PC26 PC27 PC28 PC29 PC30 PC31 PC32 PC33 PC34 PC35 PC36
## Standard deviation 0.66366 0.57327 0.52853 0.52462 0.44573 0.34479 0.32269 0.27952 0.27646 0.23982 0.21218 0.16442
## Proportion of Variance 0.01159 0.00865 0.00735 0.00724 0.00523 0.00313 0.00274 0.00206 0.00201 0.00151 0.00118 0.00071
## Cumulative Proportion 0.95751 0.96616 0.97351 0.98075 0.98598 0.98911 0.99185 0.99391 0.99592 0.99743 0.99862 0.99933
## PC37 PC38
## Standard deviation 0.15960 0.005237
## Proportion of Variance 0.00067 0.000000
## Cumulative Proportion 1.00000 1.000000
par(mfrow = c(4, 1))
#screeplot
screeplot(PC_fit, type = "bar")
#Proportion of Variance
plot(PC_fit$sdev^2/sum(PC_fit$sdev^2), xlab = "Principal Component",
ylab = "Proportion of Variance Explained", ylim = c(0, 1), type = 'b')
# Cumulative proportion of variance.
plot(cumsum(PC_fit$sdev^2/sum(PC_fit$sdev^2)), xlab = "Principal Component",
ylab = "Cum. Prop of Variance Explained", ylim = c(0, 1), type = 'b')Before using any reduction algorithm to determine which variables to be in the model, we used the correlation tables above to reduce the predictors. If the predictors are correlated, it’s not necessary to use both predictors.
We removed any variable containing min or max if the data also contained an avg version of the same predictor:
min_positive_polarity
max_positive_polarity
min_negative_polarity
max_negative_polarity
self_reference_min_shares
self_reference_max_shares
kw_min_avg
kw_max_avg
kw_min_max
kw_max_max
kw_min_min
kw_max_min
We removed all of the “LDA_” variables. In the correlation chart above, they appear to be correlated with one another. LDA here refers to Latent Dirichlet Allocation, a topic-modeling method; each variable measures an article's closeness to one of five topics.
LDA_00: Closeness to LDA topic 0
LDA_01: Closeness to LDA topic 1
LDA_02: Closeness to LDA topic 2
LDA_03: Closeness to LDA topic 3
LDA_04: Closeness to LDA topic 4
We removed the following variables, because the data contains the absolute value of the same information:
title_subjectivity
title_sentiment_polarity
We removed the following variables, because of their strong relationship between global_rate_positive_words and global_rate_negative_words:
global_subjectivity
global_sentiment_polarity
We removed the following variables, because of their strong relationship between n_tokens_content:
n_unique_tokens
n_non_stop_unique_tokens
n_non_stop_words
The final data set will contain the following columns (features); a sketch assembling them follows the list:
n_tokens_title
n_tokens_content
num_hrefs
num_self_hrefs
num_imgs
num_videos
average_token_length
num_keywords
kw_avg_min
kw_avg_max
kw_avg_avg
self_reference_avg_sharess
global_rate_positive_words
global_rate_negative_words
rate_positive_words
rate_negative_words
avg_positive_polarity
avg_negative_polarity
abs_title_subjectivity
abs_title_sentiment_polarity
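A minimal sketch collecting this feature list in one place so it could be reused in the model fits below (keep_vars, reducedTrain, and reducedTest are illustrative names, not objects used elsewhere in this report):

# The 20 retained features listed above, plus the response
keep_vars <- c("n_tokens_title", "n_tokens_content", "num_hrefs", "num_self_hrefs",
               "num_imgs", "num_videos", "average_token_length", "num_keywords",
               "kw_avg_min", "kw_avg_max", "kw_avg_avg", "self_reference_avg_sharess",
               "global_rate_positive_words", "global_rate_negative_words",
               "rate_positive_words", "rate_negative_words",
               "avg_positive_polarity", "avg_negative_polarity",
               "abs_title_subjectivity", "abs_title_sentiment_polarity")
reducedTrain <- selectTrain %>% dplyr::select(dplyr::all_of(c("shares", keep_vars)))
reducedTest  <- selectTest  %>% dplyr::select(dplyr::all_of(c("shares", keep_vars)))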
Out of the original 61 variables, we will be looking at using up to 20 variables. There is a relationship between num_hrefs and num_self_hrefs. One of those may be removed later in the analysis. We decided to use best stepwise to further reduce the number of variables in the model.
We will also run a separate linear regression model using the variables selected from visually correlation results (above).
Linear regression is a method for understanding the relationship between a response variable Y and one or more predictor variables x1, x2, etc. This method fits the line that best fits the data, called the least squares regression line.
For Simple Linear Regression, we have one response variable and one predictor variable. It uses the formula y = b0 + b1x, where
y: response variable
b0: intercept (baseline value, if all other values are zero)
b1: regression coefficient
x: independent variable
For Multiple Linear Regression, we have one response variable and any number of predictor variables. It uses the formula Y = B0 + B1X1 + B2X2 + … + BpXp + E, where
Y: The response variable
Xp: The pth predictor variable
Bp: The average effect on Y of a one unit increase in Xp, holding all other predictors fixed
E: The error term
For the results of a linear regression model to be valid and reliable, the following must be true (a diagnostic sketch follows the list):
1. Linear relationship: There exists a linear relationship between the independent variable, x, and the dependent variable, y.
2. Independence: The residuals are independent.
3. Homoscedasticity: The residuals have constant variance at every level of x.
4. Normality: The residuals of the model are normally distributed.
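These assumptions are usually checked on the residuals of a fitted model. A minimal sketch, with an illustrative predictor choice that is not the model fit later in this report:

# Standard residual diagnostics: residuals vs fitted (linearity, homoscedasticity),
# Q-Q plot (normality), scale-location, and residuals vs leverage
diag_fit <- lm(shares ~ n_tokens_content + num_imgs, data = selectTrain)
par(mfrow = c(2, 2))
plot(diag_fit)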
The Best Stepwise method combines the backward and forward stepwise methods to find the best result. We start with no predictors, then sequentially add the predictor that contributes the most (forward step). After each addition, any variable that no longer improves the fit of the model is removed (backward step).
After running the Best Stepwise method, we applied the results to the test data to determine how well the model performed.
set.seed(21)
lmFit<- train(shares ~ ., data = selectTrain%>% dplyr::select(shares, n_tokens_title,
n_tokens_content,
num_hrefs,
num_self_hrefs,
num_imgs,
num_videos,
average_token_length,
num_keywords,
kw_avg_min,
kw_avg_max,
kw_avg_avg,
self_reference_avg_sharess,
global_rate_positive_words,
global_rate_negative_words,
rate_positive_words,
rate_negative_words,
avg_positive_polarity,
avg_negative_polarity,
abs_title_subjectivity,
abs_title_sentiment_polarity),
method = 'leapSeq',
preProcess = c("center", "scale"),
tuneGrid = data.frame(nvmax = 1:20),
trControl = trainControl(method = "cv", number = 5)
)
lmFit$results
## nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 8691.528 0.01317443 3047.799 1404.920 0.01521344 240.4556
## 2 2 8663.686 0.03104806 3037.205 1408.140 0.04340353 251.1840
## 3 3 8687.443 0.01789690 3019.944 1454.459 0.01510084 257.6219
## 4 4 8695.803 0.01710375 3031.769 1466.932 0.01418705 275.4545
## 5 5 8717.373 0.01474396 3041.916 1476.013 0.01422552 277.9917
## 6 6 8724.324 0.01337191 3042.407 1473.543 0.01357449 272.5588
## 7 7 8724.431 0.01315512 3046.903 1466.807 0.01269102 271.5935
## 8 8 8733.472 0.01336679 3045.953 1472.548 0.01257292 272.7083
## 9 9 8736.249 0.01321570 3046.264 1477.577 0.01269465 269.9529
## 10 10 8741.898 0.01292561 3042.131 1479.939 0.01241773 267.8469
## 11 11 8752.945 0.01055014 3050.838 1456.388 0.01053229 264.3318
## 12 12 8738.087 0.01299714 3046.723 1469.307 0.01222898 260.2246
## 13 13 8739.594 0.01292161 3053.133 1470.649 0.01194432 263.0806
## 14 14 8738.625 0.01280733 3054.817 1469.724 0.01196256 265.3539
## 15 15 8741.327 0.01303178 3048.407 1476.151 0.01227460 270.1190
## 16 16 8737.866 0.01407029 3052.564 1485.954 0.01334514 263.6619
## 17 17 8742.090 0.01307932 3054.131 1475.513 0.01221078 265.7675
## 18 18 8742.494 0.01304095 3053.837 1474.310 0.01217293 267.7958
## 19 19 8741.100 0.01332990 3055.955 1477.701 0.01252078 266.6986
## 20 20 8741.723 0.01322997 3054.771 1476.220 0.01239190 267.1839
lmFit$bestTune
## nvmax
## 2 2
set.seed(21)
btTrain <- selectTrain%>% dplyr::select(num_videos, n_tokens_content,kw_avg_max, kw_avg_avg, rate_positive_words,shares)
btTest <- selectTest%>% dplyr::select(num_videos, n_tokens_content,kw_avg_max, kw_avg_avg, rate_positive_words,shares)
lmFit2<- train(shares ~ ., data = btTrain,
method = 'lm',
trControl = trainControl(method = "cv", number = 5)
)
lmFit2$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 8312.948 0.04162089 3028.061 1171.275 0.06872489 222.674
predBest <- predict(lmFit2, newdata = btTest) %>% as_tibble()
LinearRegression_1 <- postResample(predBest, obs = btTest$shares)

As a comparison, we also included a model using the variables selected from the correlation heat maps.
set.seed(21)
stTrain <- selectTrain%>% dplyr::select(shares, n_tokens_title,
n_tokens_content,
num_hrefs,
num_self_hrefs,
num_imgs,
num_videos,
average_token_length,
num_keywords,
kw_avg_min,
kw_avg_max,
kw_avg_avg,
self_reference_avg_sharess,
global_rate_positive_words,
global_rate_negative_words,
rate_positive_words,
rate_negative_words,
avg_positive_polarity,
avg_negative_polarity,
abs_title_subjectivity,
abs_title_sentiment_polarity)
stTest <- selectTest%>% dplyr::select(shares, n_tokens_title,
n_tokens_content,
num_hrefs,
num_self_hrefs,
num_imgs,
num_videos,
average_token_length,
num_keywords,
kw_avg_min,
kw_avg_max,
kw_avg_avg,
self_reference_avg_sharess,
global_rate_positive_words,
global_rate_negative_words,
rate_positive_words,
rate_negative_words,
avg_positive_polarity,
avg_negative_polarity,
abs_title_subjectivity,
abs_title_sentiment_polarity)
lmFit3<- train(shares ~ ., data = stTrain,
method = 'lm',
trControl = trainControl(method = "cv", number = 5)
)
lmFit3$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 8741.723 0.01322997 3054.771 1476.22 0.0123919 267.1839
OrgBest <- predict(lmFit3, newdata = stTest) %>% as_tibble()
LinearRegression_3 <- postResample(OrgBest, obs = stTest$shares)

The chunk of code below selects the principal components from the PCA above based on the desired cumulative variance threshold. As there are many predictor variables, we go with a threshold of 0.80; that is, we take all components that together explain 80% of the total variance. The number of principal components selected can be seen below once the code is run.
The selected principal components are used below in the linear regression model, after applying our PCA fit to the training and test data sets.
var_count<-cumsum(PC_fit$sdev^2/sum(PC_fit$sdev^2))
count=0
for (i in (1:length(var_count)))
{
  if (var_count[i]<0.80)
  {
    count=count+1
  }
  else
  {
    break
  }
}
print(paste("Number of Principal Components optimized are",count ,"for cumulative variance of 0.80"))
## [1] "Number of Principal Components optimized are 14 for cumulative variance of 0.80"
The aim here is to test whether PCA helps us reduce the dimensions of our data and to use the result in our regression model. PCA is a very efficient method of feature extraction and can help extract maximum information at minimum cost. The principal components obtained in the section above are applied to the training and test data, a linear regression is fit on the transformed training data, and predictions are made for the test data. This is not expected to give a very efficient result, but it does give an idea of whether the predictor variables are a good fit for a linear model.
new_PC <- predict(PC_fit, newdata = PC_Train) %>% as_tibble()
res<-selectTrain$shares
PC_Vars<-data.frame(new_PC[1:count],res)
head(PC_Vars,5)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## 1 0.7701039 -2.718143 2.778664 -1.806332 -0.9930002 0.47090564 -0.1413708 -1.10911273 0.8735269 0.88254413
## 2 0.2991286 -2.651044 3.000036 -2.423090 1.2981953 -0.07814979 -0.2619246 -0.78429597 0.7756287 0.01849173
## 3 1.7363815 -2.654221 2.215671 -1.998729 -1.4380806 1.90602365 -0.1001790 0.15468592 0.2450467 0.47668955
## 4 0.6406039 -3.387213 2.526563 -1.341653 -0.9752363 0.59714632 0.1004311 -0.07014842 0.9035593 -0.28156661
## 5 1.1183910 -3.748250 2.331680 -1.777817 -0.8682650 1.07699934 0.3507416 0.80899540 0.8537205 0.79998534
## PC11 PC12 PC13 PC14 res
## 1 0.51816243 -1.39456950 -0.01850893 1.4027068 593
## 2 0.21341906 -0.07211077 0.11047881 -0.4830124 1200
## 3 -0.09135378 -0.13734842 0.79214988 0.5987597 1200
## 4 -0.13085528 -1.10210871 -0.13708059 0.6503949 4600
## 5 0.02348973 0.49245277 -1.40547751 -1.8334525 631
fit <- train(res~ .,method='lm', data=PC_Vars,metric='RMSE')
#apply PCA fit to test data
PC_Test<-selectTest %>% select(- c('shares','channel','LDA_00','LDA_01',
'LDA_02','LDA_03','LDA_04','n_tokens_content',
'n_non_stop_unique_tokens','weekday'))
test_PC<-predict(PC_fit,newdata=PC_Test) %>% as_tibble()
test_res<-selectTest$shares
PC_Vars_Test<-data.frame(test_PC[1:count],test_res)
head(PC_Vars_Test,5)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## 1 0.6044592 -6.137403 0.8586422 0.71705212 -0.7300889 1.2751267 -0.09585045 0.42619642 0.8516867 -1.30328831
## 2 1.4768978 -4.887558 1.6767083 -0.71490316 -2.6449481 1.2396468 0.16668273 0.11393523 0.4164997 0.07074497
## 3 -0.4795112 -3.535795 0.4113324 -3.63412388 -2.2447241 1.4449558 0.13254606 -0.09783769 1.8038558 -0.43693979
## 4 -0.1784589 -5.206086 1.2331465 -1.63923386 1.2629670 -0.1587659 0.36912268 0.74518931 3.4870436 -0.07526952
## 5 3.3258514 -9.178730 -1.6437238 0.01654149 0.5032508 -0.6356186 0.79032895 1.83257276 3.0685082 -0.69433733
## PC11 PC12 PC13 PC14 test_res
## 1 -1.47886947 -1.7656068 0.1322932 2.0146754 2100
## 2 -0.44928498 -0.7631542 -0.2834098 0.3961791 1200
## 3 1.21990152 0.1265284 1.0388441 -0.2206698 19400
## 4 -0.08927557 -0.6959685 0.4159557 0.3578151 2700
## 5 -2.98667037 0.4848775 -0.9397003 -0.5164115 904
lm_predict<-predict(fit,newdata=PC_Vars_Test)%>% as_tibble()
LinearRegression_2<-postResample(lm_predict, obs = PC_Vars_Test$test_res)

Ensemble methods are machine learning models that combine several base models to produce one optimal predictive model. Going a step beyond a single decision tree, ensemble methods do not rely on one tree and hope for the right decision at each split; they take a sample of decision trees, calculate metrics at each split, and make a final prediction based on the aggregated results of the sampled trees.
Random forest is a tree-based method. It takes a desired number of bootstrapped samples from the data and builds a tree for each sample, considering only a random subset of predictors at each split. The predictions of the trees are averaged to produce the final model.
The random forest model is fit using rf as the caret method, with mtry as the tuning parameter. The tuning grid provides results using from one up to nine variables in the model.
trainCtrl <- trainControl(method = 'cv', number = 5)
#trainCtrl <- trainControl(method = 'repeatedcv', number = 3, repeats = 1)
rfTrain <- selectTrain %>% dplyr::select(num_videos, num_imgs,num_keywords,
global_subjectivity,global_sentiment_polarity,
global_rate_positive_words,is_weekend,
avg_positive_polarity,title_sentiment_polarity,shares)
rfTest <- selectTest %>% dplyr::select(num_videos, num_imgs,num_keywords,
global_subjectivity,global_sentiment_polarity,
global_rate_positive_words,is_weekend,
avg_positive_polarity,title_sentiment_polarity,shares)
rfFit <- train(shares ~. , data = rfTrain,
method = 'rf',
trControl = trainCtrl,
preProcess = c("center", "scale"),
tuneGrid = data.frame(mtry = 1:9)
)
rfPredict<-predict(rfFit,newdata=rfTest) %>% as_tibble()
Random_Forest<-postResample(rfPredict, obs = rfTest$shares)

Boosting adds ensemble members sequentially, so that each new model corrects the predictions made by prior models, and it outputs a weighted average of these predictions. It differs from random forest in how the decision trees are built: in this case the trees are built in an additive manner, not in parallel as in random forest. Boosting is based on weak learners, which have high bias and low variance, and it mainly reduces bias (and eventually variance) by aggregating the output metrics from many models.
We used cross validation here to tune the parameters. The tuneGrid has been kept small due to limited processing speed. The method caret uses to fit the training data is gbm. The fit is plotted with plot() to show the cross-validation results; these can be complex to interpret, but caret handles selecting the best parameters and assigns them to our model fit, which can then be used directly for predictions.
BT_Train<-selectTrain %>% select(c('num_videos', 'num_imgs','num_keywords',
'global_subjectivity','global_sentiment_polarity',
'global_rate_positive_words','is_weekend',
'avg_positive_polarity','title_sentiment_polarity','shares'))
BT_Test<-selectTest %>% select(c('num_videos', 'num_imgs','num_keywords',
'global_subjectivity','global_sentiment_polarity',
'global_rate_positive_words','is_weekend',
'avg_positive_polarity','title_sentiment_polarity','shares'))
boostfit <- train(shares ~., data = BT_Train, method = "gbm",
trControl = trainControl(method = "repeatedcv", number = 5,
repeats = 3,verboseIter = FALSE,allowParallel = TRUE),
preProcess = c("scale"),
tuneGrid = expand.grid(n.trees = c(50,100,200),
interaction.depth = 1:5,
shrinkage = c(0.001,0.1),
n.minobsinnode = c(3,5)),
verbose = FALSE)
plot(boostfit)
boostfit$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 33 200 1 0.1 3
boostpredict<-predict(boostfit,newdata=BT_Test) %>% as_tibble()
Boosted_Trees<-postResample(boostpredict, obs = BT_Test$shares)

The Linear Regression and Ensemble method results are now put together for comparison. Several metrics appear in the postResample() results, which compare our fitted predictions on the test data against the actual test values.
The RMSE, or Root Mean Square Error, is the square root of the variance of the residual error. It measures the absolute fit of the model to the data: how close the true data values are to the model's predicted values.
R-squared is the proportion of variance in the response variable that can be explained by the predictor variables. In other words, it is a measure of goodness of fit; a high R-squared means our model fits the training/test data well.
MAE, or Mean Absolute Error, is the absolute distance between the observations and the predictions, averaged over all observations.
We will be comparing our models based on minimum RMSE.
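For reference, a sketch of how these three metrics can be computed by hand for the stepwise linear model's test predictions already created above; caret's postResample() reports the same quantities, with R-squared computed as a squared correlation:

# Hand-computed versions of the postResample() metrics
obs  <- btTest$shares
pred <- predBest$value
c(RMSE     = sqrt(mean((obs - pred)^2)),
  Rsquared = cor(obs, pred)^2,
  MAE      = mean(abs(obs - pred)))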
# Some nomenclature things for sanity.
LinearRegression_Stepwise_Sel<-LinearRegression_1
LinearRegression_Orignal_Var_Sel<-LinearRegression_3
Linear_Regression_PCA<-LinearRegression_2
# Getting out best model in terms of RMSE
Metrics_df<-data.frame(rbind(LinearRegression_Stepwise_Sel,
LinearRegression_Orignal_Var_Sel,
Linear_Regression_PCA, Boosted_Trees,
Random_Forest))
Metrics_df
## RMSE Rsquared MAE
## LinearRegression_Stepwise_Sel 6494.540 0.023154900 2810.239
## LinearRegression_Orignal_Var_Sel 6494.540 0.023154900 2810.239
## Linear_Regression_PCA 6531.368 0.017059407 2779.967
## Boosted_Trees 6516.248 0.014908315 2807.884
## Random_Forest 6557.816 0.006200649 2818.416
best_model<-Metrics_df[which.min(Metrics_df$RMSE), ]
rmse_min<-best_model$RMSE
r2max<-best_model$Rsquared

The best model for predicting the number of shares for the data_channel_is_entertainment dataset, in terms of RMSE, is LinearRegression_Stepwise_Sel with an RMSE of 6494.54. This model has an R-squared of about 0.0231549, which shows that its predictors explain little of the variation in the response variable.
The prediction spread of all the models is also depicted through a box plot, in comparison with the actual test values. As we can see, the actual test values include very large outliers. To improve the linear regression models, we could remove the outliers from the training and test data and repeat the regression steps; linear regression can show more bias due to outliers, which is why data cleaning often comes up as a pre-processing step when working with complex datasets. However, this must be used cautiously, as valuable information may be lost, which may change the outcome of our predictions. This box plot simply summarizes the predicted test values, since printing that much data is impractical.
box_df<-data.frame(Actual_Response=selectTest$shares,
LinearRegression_Stepwise=predBest$value,
LinearRegression_Orignal_Variable=OrgBest$value,
LinearRegression_PCA=lm_predict$value,
Random_Forest = rfPredict$value,
Boosted_Trees=boostpredict$value)
ggplot(stack(box_df), aes(x = ind, y = values)) +
  stat_boxplot(geom = "errorbar") +
  labs(x="Models", y="Response Values") +
  geom_boxplot(fill = "white", colour = "red") + coord_flip()

Different genres of news and information have significant importance in every industry, and forecasting user sentiment is one of the critical requirements shaping today's news and information content. To capture the dynamics of every type of population, it is important to understand the parameters and their effect on online shares. We hope this analysis is useful to anybody who wants to perform a step-by-step evaluation of such a large dataset and extract critical information. The code in this analysis is meant to capture both the conceptual aspects of the work and the results. This set of algorithms can also be applied directly to any dataset with similar attributes, and results can be obtained quickly. We hope this analysis helps anybody who wants to build similar model comparisons.
selectID <- unique(newData$channel)
output_file <- paste0(selectID, "Analysis.md")
params = lapply(selectID, FUN = function(x){list(channel = x)})
reports <- tibble(output_file, params)
library(rmarkdown)
apply(reports, MARGIN = 1,
FUN = function(x){
render(input = "./Project_3.Rmd",
output_format = "github_document",
output_file = x[[1]],
params = x[[2]])
})