Project_3
Hello Readers!
I’ve been under the weather for the past week, so I hope everyone is well.
Today we completed a machine learning project to analyze an online new popularity data set from UCI. We created separate analysis based on different channels using linear regression, random forest, and boosted models.
As the first group member, the following were completed:
- Relative path to import the data.
- Subset the data to work on the data channel of interest.
- Automate the generation of the six analysis reports.
- Create summary, correlation tables and plots using the training data.
- Variable reduction using results from correlation.
- Modeling using Linear regression.
- Modeling using Random Forest.
- Created repo, updated README, and uploaded files to GitHub.
Although, I started early on the most difficult part for me - automatic partitioning of six channels, I should have attempted the single report first. Initially, I just dived into the project to create the six partitions and was unsuccessful. After creating the single report, the adaption to the whole project was much easier.
Next time, I would not abandon using parallel computing. The run time for random forest ran long, so I worked on parallel computing. In the end, I did not get the results I expected, so I abandoned the ideal for now. Looking back, I think I just had high expectations. If I’m saving a few seconds or even a minute, it’s still a reduction in time. Overall, the project takes approximately 25 minutes to run analysis on all six channels from the data.
As mentioned above, one of the most difficult parts was the automatic partitioning of the six channels. It comes from me trying to understand how R is going to run a separate render.R file to generate six reports from a different .rmd file. So, I adapted the code from the lecture, and it worked.
Lastly, handling outliers and correlation between variables. This data file contained some extreme outliers. I’m sure those are viral articles, and they contribute to the ability to make an accurate prediction. So, I personally decided not to remove any individual articles from the analysis. Then, there was the correlation. There were a lot of relationships. In my analysis, I did remove more than half of their variables, but I wondered if it would cause a problem with overfitting. The overall RMSE results from all five analysis were similar.
Being able to generate multiple reports at one time is one of the big take-way. Once a base .rmd file is created, a person could save a lot of time by automating the process.
The other takeway is in regards to machine learning. Boosted ended up being the method with the lowest RMSE. I need to practice, practice, practice with making predictions and generating meaning from the results. I guess, I need to find a post-2015 edition of the same data set to see if the models provided similar results.
Well, that’s it for today. If you are interested in see our project, it can be viewed here.
Paula