Exploring data on Halloween candy preferences¶

This project uses candy ranking data from the fivethirtyeight package for exploratory data analysis, then fits linear and logistic regression models to find insights in the data.

In [ ]:
# Load all the packages
library(dplyr)
library(tidyr)
library(ggplot2)
library(broom)
library(corrplot)
library(fivethirtyeight)
library(car)

# Load the candy_rankings dataset from the fivethirtyeight package
data(candy_rankings)
In [2]:
# Take a glimpse() at the dataset
glimpse(candy_rankings)
Rows: 85
Columns: 13
$ competitorname   <chr> "100 Grand", "3 Musketeers", "One dime", "One quarter…
$ chocolate        <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ fruity           <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE…
$ caramel          <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ peanutyalmondy   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, …
$ nougat           <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
$ crispedricewafer <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ hard             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ bar              <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, F…
$ pluribus         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE…
$ sugarpercent     <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.604, 0.31…
$ pricepercent     <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.767, 0.51…
$ winpercent       <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.…


The competitorname variable holds the name of each candy. There are 85 different candies represented; the first several are listed below.

In [3]:
head(candy_rankings$competitorname, 9)
  1. '100 Grand'
  2. '3 Musketeers'
  3. 'One dime'
  4. 'One quarter'
  5. 'Air Heads'
  6. 'Almond Joy'
  7. 'Baby Ruth'
  8. 'Boston Baked Beans'
  9. 'Candy Corn'


The dataset has several categorical variables representing descriptive characteristics of the candy, encoded as TRUE/FALSE values. The chart below shows the distribution of each.

In [4]:
# Gather the categorical variables to make them easier to plot
candy_rankings_long <- candy_rankings %>% gather("feature", "value", chocolate:pluribus)
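
Note that gather() has been superseded in newer versions of tidyr; an equivalent reshaping with pivot_longer() (assuming tidyr >= 1.0.0) would be:

In [ ]:
# Equivalent reshaping using the newer pivot_longer() interface
candy_rankings_long <- candy_rankings %>%
  pivot_longer(chocolate:pluribus, names_to = "feature", values_to = "value")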
In [5]:
# Make a bar plot showing the distribution of each variable
ggplot(candy_rankings_long, aes(value)) + geom_bar() + facet_wrap(~feature)
[Figure: faceted bar plots of TRUE/FALSE counts for each categorical feature]


The pricepercent variable records the percentile rank of each candy's price among all the candies in the dataset. The chart below ranks the candies from most to least expensive.

In [6]:
# Make a lollipop chart of pricepercent
options(repr.plot.width= 12, repr.plot.height= 15) 
ggplot(candy_rankings, aes(reorder(competitorname, pricepercent), pricepercent)) +
  geom_segment(aes(xend = reorder(competitorname, pricepercent), yend = 0)) +
  geom_point() + coord_flip() + xlab("Candy Name") + ylab("Price Percentile")
[Figure: lollipop chart of pricepercent by candy, sorted from most to least expensive]


The winpercent variable is the percentage of people who preferred this candy over another randomly chosen candy from the dataset. Below is a histogram of winpercent showing a roughly symmetrical distribution centered near 50%.
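
A quick numeric summary locates the center of the distribution directly (a small check; output omitted here):

In [ ]:
# Numeric summary of winpercent to check where the distribution is centered
summary(candy_rankings$winpercent)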

In [7]:
# Plot a histogram of winpercent
options(repr.plot.width= 6, repr.plot.height= 6) 
ggplot(candy_rankings, aes(winpercent)) + geom_histogram(bins= 10)
[Figure: histogram of winpercent]


The chart below shows the candy preferences represented by winpercent, ranked from highest to lowest.

In [8]:
# Make a lollipop chart of winpercent
options(repr.plot.width= 12, repr.plot.height= 15) 
ggplot(candy_rankings, aes(reorder(competitorname, winpercent), winpercent)) +
  geom_segment(aes(xend = reorder(competitorname, winpercent), yend = 0)) +
  geom_point() + coord_flip() + xlab("Candy Name") + ylab("Win Percent")
[Figure: lollipop chart of winpercent by candy, sorted from highest to lowest]


A correlation matrix provides insight into how the variables relate to one another. If any of the explanatory variables are highly correlated, multicollinearity may cause problems for the linear and logistic regression models.

In [9]:
# Plot the correlation matrix
options(repr.plot.width= 8, repr.plot.height= 8) 
candy_rankings %>% select(-competitorname) %>% cor() %>% corrplot()
[Figure: correlation matrix of all variables except competitorname]
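
To read the strongest relationships off numerically rather than from the plot, the correlation matrix can be reshaped and sorted. A sketch (cor() works here because the mixed logical/numeric columns are coerced to a numeric matrix):

In [ ]:
# List the five most strongly correlated variable pairs
cor_mat <- candy_rankings %>% select(-competitorname) %>% cor()
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA  # keep each pair only once
as.data.frame(as.table(cor_mat)) %>%
  filter(!is.na(Freq)) %>%
  arrange(desc(abs(Freq))) %>%
  head(5)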

Linear Regression¶

In [10]:
# Fit a linear model of winpercent explained by all variables except competitorname
win_mod <- lm(winpercent ~ . -competitorname, data= candy_rankings)


As the output below shows, only three coefficients are significant at the 0.05 level or better. These indicate the characteristics that most affect winpercent; in other words, the characteristics that drive candy preferences.

In [11]:
summary(win_mod)
Call:
lm(formula = winpercent ~ . - competitorname, data = candy_rankings)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.2244  -6.6247   0.1986   6.8420  23.8680 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           34.5340     4.3199   7.994 1.44e-11 ***
chocolateTRUE         19.7481     3.8987   5.065 2.96e-06 ***
fruityTRUE             9.4223     3.7630   2.504  0.01452 *  
caramelTRUE            2.2245     3.6574   0.608  0.54493    
peanutyalmondyTRUE    10.0707     3.6158   2.785  0.00681 ** 
nougatTRUE             0.8043     5.7164   0.141  0.88849    
crispedricewaferTRUE   8.9190     5.2679   1.693  0.09470 .  
hardTRUE              -6.1653     3.4551  -1.784  0.07852 .  
barTRUE                0.4415     5.0611   0.087  0.93072    
pluribusTRUE          -0.8545     3.0401  -0.281  0.77945    
sugarpercent           9.0868     4.6595   1.950  0.05500 .  
pricepercent          -5.9284     5.5132  -1.075  0.28578    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.7 on 73 degrees of freedom
Multiple R-squared:  0.5402,	Adjusted R-squared:  0.4709 
F-statistic: 7.797 on 11 and 73 DF,  p-value: 9.504e-09
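
The significant terms can also be pulled out programmatically with tidy() from the broom package loaded above:

In [ ]:
# Extract the coefficients significant at the 0.05 level
tidy(win_mod) %>%
  filter(term != "(Intercept)", p.value < 0.05)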


The vif() function from the car package can be used to check for multicollinearity. Values below 5 are generally taken to indicate that multicollinearity is not a problem, and all values here are below that cutoff.

In [12]:
round(vif(win_mod), 2)
       chocolate           fruity          caramel   peanutyalmondy 
            2.77             2.60             1.37             1.33 
          nougat crispedricewafer             hard              bar 
            1.83             1.56             1.29             3.54 
        pluribus     sugarpercent     pricepercent 
            1.71             1.27             1.82 
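
The same threshold check can be written as code; given the values above, this returns an empty character vector:

In [ ]:
# Flag any predictors with VIF above the conventional cutoff of 5
vifs <- vif(win_mod)
names(vifs)[vifs > 5]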


If the model is a good fit, the residuals should look like "white noise": balanced evenly around the zero line and showing no pattern. Based on the plot below, the model looks like a good fit.

In [13]:
# Plot the residuals vs the fitted values
ggplot(augment(win_mod), aes(.fitted, .resid)) + geom_point() + geom_hline(yintercept = 0)
[Figure: residuals vs. fitted values for the linear model]
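
A QQ plot of the residuals is another standard diagnostic; points falling near the reference line support the normality assumption (a sketch reusing the same augment() output):

In [ ]:
# QQ plot of the residuals to check normality
ggplot(augment(win_mod), aes(sample = .resid)) +
  stat_qq() + stat_qq_line()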

Logistic Regression¶

Below, logistic regression is used to predict whether the candy is chocolate based on the other variables (except competitorname).

A warning is produced because a few of the features (like crispedricewafer) are only ever TRUE when a candy is chocolate, as the cross-tabulation below shows. This perfect separation means we cannot draw conclusions from the affected coefficients, but the model can still be used to make predictions.
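
For example, a cross-tabulation confirms that crispedricewafer is TRUE only among chocolate candies:

In [ ]:
# Cross-tab: crispedricewafer is never TRUE for non-chocolate candies
with(candy_rankings, table(chocolate, crispedricewafer))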

The model is able to make predictions with an accuracy of 96%.

In [14]:
# Fit a glm() of chocolate
choc_mod <- glm(chocolate ~ . - competitorname, family= "binomial", data= candy_rankings)
Warning message:
"glm.fit: fitted probabilities numerically 0 or 1 occurred"


Only two of the coefficients are significant. Note the negative value for fruityTRUE; this matches the correlation matrix, where chocolate and fruity are negatively correlated (checked directly after the summary below).

In [15]:
summary(choc_mod)
Call:
glm(formula = chocolate ~ . - competitorname, family = "binomial", 
    data = candy_rankings)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)   
(Intercept)           -10.29370    4.12040  -2.498  0.01248 * 
fruityTRUE             -6.75305    2.20462  -3.063  0.00219 **
caramelTRUE            -1.85093    1.66750  -1.110  0.26700   
peanutyalmondyTRUE     -4.11907    2.98175  -1.381  0.16715   
nougatTRUE            -16.74818 3520.13323  -0.005  0.99620   
crispedricewaferTRUE   14.98331 4725.35051   0.003  0.99747   
hardTRUE                1.83504    1.80742   1.015  0.30997   
barTRUE                19.06799 3520.13379   0.005  0.99568   
pluribusTRUE            0.22804    1.45457   0.157  0.87542   
sugarpercent            0.12168    2.07707   0.059  0.95329   
pricepercent            1.76626    2.24816   0.786  0.43208   
winpercent              0.23019    0.08593   2.679  0.00739 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 116.407  on 84  degrees of freedom
Residual deviance:  25.802  on 73  degrees of freedom
AIC: 49.802

Number of Fisher Scoring iterations: 19
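
As a direct check of the negative chocolate-fruity relationship noted above (cor() needs numeric input, so the logicals are converted to 0/1):

In [ ]:
# Correlation between the chocolate and fruity indicators
with(candy_rankings, cor(as.numeric(chocolate), as.numeric(fruity)))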
In [16]:
# Make a data frame of predictions
preds <- augment(choc_mod, type.predict= "response") %>% mutate(prediction= .fitted > .5)
In [17]:
# Create the confusion matrix
conf_mat <- preds %>% select(chocolate, prediction) %>% table()
In [18]:
# Calculate the accuracy
accuracy <- sum(diag(conf_mat))/sum(conf_mat)
round(accuracy, 2)
0.96
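
Accuracy alone can mask class imbalance, so it is worth breaking the confusion matrix down by class (a sketch; table() labels both dimensions FALSE/TRUE):

In [ ]:
# Per-class rates from the confusion matrix
conf_mat["TRUE", "TRUE"] / sum(conf_mat["TRUE", ])    # sensitivity: chocolate correctly identified
conf_mat["FALSE", "FALSE"] / sum(conf_mat["FALSE", ]) # specificity: non-chocolate correctly identified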