EDA Final

Joseph Lazarus

1/6/2021

For this case study we assume that our audience is the CEO and CFO of Budweiser (our client). They have hired us to answer 7 questions and beyond those general questions we will speculate / anticipate what may be of interest to them.

We will start by importing the following data for analysis:

Beers.csv: Name: Name of the beer. Beer_ID: Unique identifier of the beer. ABV: Alcohol by volume of the beer. IBU: International Bitterness Units of the beer. Brewery_ID: Brewery id associated with the beer. Style: Style of the beer. Ounces: Ounces of beer.

Breweries.csv: Brew_ID: Unique identifier of the brewery. Name: Name of the brewery. City: City where the brewery is located. State: U.S. State where the brewery is located.

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v stringr 1.4.0
## v tidyr   1.1.3     v forcats 0.5.1
## v readr   1.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
## 
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
## 
##     units, units<-

Load Theme for plots

Load and Merge initial data sets provided

## [1] 2410    7
## [1] 558   4
## [1] 0
## [1] 0
##   Brewery_id     Beer_Name Beer_ID   ABV IBU
## 1          1  Get Together    2692 0.045  50
## 2          1 Maggie's Leap    2691 0.049  26
## 3          1    Wall's End    2690 0.048  19
## 4          1       Pumpion    2689 0.060  38
## 5          1    Stronghold    2688 0.060  25
## 6          1   Parapet ESB    2687 0.056  47
##                                 Style Ounces       Brewery_Name        City
## 1                        American IPA     16 NorthGate Brewing  Minneapolis
## 2                  Milk / Sweet Stout     16 NorthGate Brewing  Minneapolis
## 3                   English Brown Ale     16 NorthGate Brewing  Minneapolis
## 4                         Pumpkin Ale     16 NorthGate Brewing  Minneapolis
## 5                     American Porter     16 NorthGate Brewing  Minneapolis
## 6 Extra Special / Strong Bitter (ESB)     16 NorthGate Brewing  Minneapolis
##   State
## 1    MN
## 2    MN
## 3    MN
## 4    MN
## 5    MN
## 6    MN

Add Features to the individual beer level data frame

Investigate NA values to determine what needs resolution

## 
##  Variables sorted by number of missings: 
##      Variable      Count
##           IBU 0.41701245
##           ABV 0.02572614
##    Brewery_id 0.00000000
##     Beer_Name 0.00000000
##       Beer_ID 0.00000000
##         Style 0.00000000
##        Ounces 0.00000000
##  Brewery_Name 0.00000000
##          City 0.00000000
##         State 0.00000000
##         Class 0.00000000

Method #1 for imputing NA values, with Predictive Mean Mean Matching

## 
##  iter imp variable
##   1   1  ABV  IBU
##   1   2  ABV  IBU
##   1   3  ABV  IBU
##   1   4  ABV  IBU
##   1   5  ABV  IBU
##   2   1  ABV  IBU
##   2   2  ABV  IBU
##   2   3  ABV  IBU
##   2   4  ABV  IBU
##   2   5  ABV  IBU
##   3   1  ABV  IBU
##   3   2  ABV  IBU
##   3   3  ABV  IBU
##   3   4  ABV  IBU
##   3   5  ABV  IBU
##   4   1  ABV  IBU
##   4   2  ABV  IBU
##   4   3  ABV  IBU
##   4   4  ABV  IBU
##   4   5  ABV  IBU
##   5   1  ABV  IBU
##   5   2  ABV  IBU
##   5   3  ABV  IBU
##   5   4  ABV  IBU
##   5   5  ABV  IBU
##   6   1  ABV  IBU
##   6   2  ABV  IBU
##   6   3  ABV  IBU
##   6   4  ABV  IBU
##   6   5  ABV  IBU
##   7   1  ABV  IBU
##   7   2  ABV  IBU
##   7   3  ABV  IBU
##   7   4  ABV  IBU
##   7   5  ABV  IBU
##   8   1  ABV  IBU
##   8   2  ABV  IBU
##   8   3  ABV  IBU
##   8   4  ABV  IBU
##   8   5  ABV  IBU
##   9   1  ABV  IBU
##   9   2  ABV  IBU
##   9   3  ABV  IBU
##   9   4  ABV  IBU
##   9   5  ABV  IBU
##   10   1  ABV  IBU
##   10   2  ABV  IBU
##   10   3  ABV  IBU
##   10   4  ABV  IBU
##   10   5  ABV  IBU
##   11   1  ABV  IBU
##   11   2  ABV  IBU
##   11   3  ABV  IBU
##   11   4  ABV  IBU
##   11   5  ABV  IBU
##   12   1  ABV  IBU
##   12   2  ABV  IBU
##   12   3  ABV  IBU
##   12   4  ABV  IBU
##   12   5  ABV  IBU
##   13   1  ABV  IBU
##   13   2  ABV  IBU
##   13   3  ABV  IBU
##   13   4  ABV  IBU
##   13   5  ABV  IBU
##   14   1  ABV  IBU
##   14   2  ABV  IBU
##   14   3  ABV  IBU
##   14   4  ABV  IBU
##   14   5  ABV  IBU
##   15   1  ABV  IBU
##   15   2  ABV  IBU
##   15   3  ABV  IBU
##   15   4  ABV  IBU
##   15   5  ABV  IBU
##   16   1  ABV  IBU
##   16   2  ABV  IBU
##   16   3  ABV  IBU
##   16   4  ABV  IBU
##   16   5  ABV  IBU
##   17   1  ABV  IBU
##   17   2  ABV  IBU
##   17   3  ABV  IBU
##   17   4  ABV  IBU
##   17   5  ABV  IBU
##   18   1  ABV  IBU
##   18   2  ABV  IBU
##   18   3  ABV  IBU
##   18   4  ABV  IBU
##   18   5  ABV  IBU
##   19   1  ABV  IBU
##   19   2  ABV  IBU
##   19   3  ABV  IBU
##   19   4  ABV  IBU
##   19   5  ABV  IBU
##   20   1  ABV  IBU
##   20   2  ABV  IBU
##   20   3  ABV  IBU
##   20   4  ABV  IBU
##   20   5  ABV  IBU
##   21   1  ABV  IBU
##   21   2  ABV  IBU
##   21   3  ABV  IBU
##   21   4  ABV  IBU
##   21   5  ABV  IBU
##   22   1  ABV  IBU
##   22   2  ABV  IBU
##   22   3  ABV  IBU
##   22   4  ABV  IBU
##   22   5  ABV  IBU
##   23   1  ABV  IBU
##   23   2  ABV  IBU
##   23   3  ABV  IBU
##   23   4  ABV  IBU
##   23   5  ABV  IBU
##   24   1  ABV  IBU
##   24   2  ABV  IBU
##   24   3  ABV  IBU
##   24   4  ABV  IBU
##   24   5  ABV  IBU
##   25   1  ABV  IBU
##   25   2  ABV  IBU
##   25   3  ABV  IBU
##   25   4  ABV  IBU
##   25   5  ABV  IBU
##   26   1  ABV  IBU
##   26   2  ABV  IBU
##   26   3  ABV  IBU
##   26   4  ABV  IBU
##   26   5  ABV  IBU
##   27   1  ABV  IBU
##   27   2  ABV  IBU
##   27   3  ABV  IBU
##   27   4  ABV  IBU
##   27   5  ABV  IBU
##   28   1  ABV  IBU
##   28   2  ABV  IBU
##   28   3  ABV  IBU
##   28   4  ABV  IBU
##   28   5  ABV  IBU
##   29   1  ABV  IBU
##   29   2  ABV  IBU
##   29   3  ABV  IBU
##   29   4  ABV  IBU
##   29   5  ABV  IBU
##   30   1  ABV  IBU
##   30   2  ABV  IBU
##   30   3  ABV  IBU
##   30   4  ABV  IBU
##   30   5  ABV  IBU
##   31   1  ABV  IBU
##   31   2  ABV  IBU
##   31   3  ABV  IBU
##   31   4  ABV  IBU
##   31   5  ABV  IBU
##   32   1  ABV  IBU
##   32   2  ABV  IBU
##   32   3  ABV  IBU
##   32   4  ABV  IBU
##   32   5  ABV  IBU
##   33   1  ABV  IBU
##   33   2  ABV  IBU
##   33   3  ABV  IBU
##   33   4  ABV  IBU
##   33   5  ABV  IBU
##   34   1  ABV  IBU
##   34   2  ABV  IBU
##   34   3  ABV  IBU
##   34   4  ABV  IBU
##   34   5  ABV  IBU
##   35   1  ABV  IBU
##   35   2  ABV  IBU
##   35   3  ABV  IBU
##   35   4  ABV  IBU
##   35   5  ABV  IBU
##   36   1  ABV  IBU
##   36   2  ABV  IBU
##   36   3  ABV  IBU
##   36   4  ABV  IBU
##   36   5  ABV  IBU
##   37   1  ABV  IBU
##   37   2  ABV  IBU
##   37   3  ABV  IBU
##   37   4  ABV  IBU
##   37   5  ABV  IBU
##   38   1  ABV  IBU
##   38   2  ABV  IBU
##   38   3  ABV  IBU
##   38   4  ABV  IBU
##   38   5  ABV  IBU
##   39   1  ABV  IBU
##   39   2  ABV  IBU
##   39   3  ABV  IBU
##   39   4  ABV  IBU
##   39   5  ABV  IBU
##   40   1  ABV  IBU
##   40   2  ABV  IBU
##   40   3  ABV  IBU
##   40   4  ABV  IBU
##   40   5  ABV  IBU
##   41   1  ABV  IBU
##   41   2  ABV  IBU
##   41   3  ABV  IBU
##   41   4  ABV  IBU
##   41   5  ABV  IBU
##   42   1  ABV  IBU
##   42   2  ABV  IBU
##   42   3  ABV  IBU
##   42   4  ABV  IBU
##   42   5  ABV  IBU
##   43   1  ABV  IBU
##   43   2  ABV  IBU
##   43   3  ABV  IBU
##   43   4  ABV  IBU
##   43   5  ABV  IBU
##   44   1  ABV  IBU
##   44   2  ABV  IBU
##   44   3  ABV  IBU
##   44   4  ABV  IBU
##   44   5  ABV  IBU
##   45   1  ABV  IBU
##   45   2  ABV  IBU
##   45   3  ABV  IBU
##   45   4  ABV  IBU
##   45   5  ABV  IBU
##   46   1  ABV  IBU
##   46   2  ABV  IBU
##   46   3  ABV  IBU
##   46   4  ABV  IBU
##   46   5  ABV  IBU
##   47   1  ABV  IBU
##   47   2  ABV  IBU
##   47   3  ABV  IBU
##   47   4  ABV  IBU
##   47   5  ABV  IBU
##   48   1  ABV  IBU
##   48   2  ABV  IBU
##   48   3  ABV  IBU
##   48   4  ABV  IBU
##   48   5  ABV  IBU
##   49   1  ABV  IBU
##   49   2  ABV  IBU
##   49   3  ABV  IBU
##   49   4  ABV  IBU
##   49   5  ABV  IBU
##   50   1  ABV  IBU
##   50   2  ABV  IBU
##   50   3  ABV  IBU
##   50   4  ABV  IBU
##   50   5  ABV  IBU
## Warning: Number of logged events: 6

##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

##      Brewery_id Beer_Name Beer_ID ABV IBU Style Ounces Brewery_Name City State
## 2410          1         1       1   1   1     1      1            1    1     1
##               0         0       0   0   0     0      0            0    0     0
##      Class  
## 2410     1 0
##          0 0

## 
##  Variables sorted by number of missings: 
##      Variable Count
##    Brewery_id     0
##     Beer_Name     0
##       Beer_ID     0
##           ABV     0
##           IBU     0
##         Style     0
##        Ounces     0
##  Brewery_Name     0
##          City     0
##         State     0
##         Class     0

Method #2 for imputing values: Study distributions of IBU and ABV by Class of beer rather than using predictive mean for distribution of all beer style combined

## Warning: Removed 1005 rows containing non-finite values (stat_boxplot).

## Warning: Removed 62 rows containing non-finite values (stat_boxplot).

## Warning: Removed 37 rows containing missing values (geom_point).

Create State Level table of summary statistics

Add external data and features to the State level tables including, consumption, population, and consumption per capita. This will be used to explore additional analysis beyond the questions Budweiser asked us.

## Warning: NAs introduced by coercion

Create a State Mapping data frame so that all summary statistics can be geographically plotted

Question 1: How many breweries are present in each state?

Answer 1: Colorado has the most breweries with 47. Other than Colorado, states with most breweries are clustered along US border and coasts. The top 5 brewery count states account for 31% of total craft breweris in the US.

Question 2: Merge beer data with the breweries data. Print the first 6 observations and the last six observations to check the merged file. (RMD only, this does not need to be included in the presentation or the deck.)

## # A tibble: 6 x 15
## # Groups:   Class [5]
##   Brewery_id Beer_Name Beer_ID   ABV   IBU Style Ounces Brewery_Name City  State
##        <int> <chr>       <int> <dbl> <int> <chr>  <dbl> <chr>        <chr> <chr>
## 1          1 Get Toge~    2692 0.045    50 Amer~     16 "NorthGate ~ Minn~ " MN"
## 2          1 Maggie's~    2691 0.049    26 Milk~     16 "NorthGate ~ Minn~ " MN"
## 3          1 Wall's E~    2690 0.048    19 Engl~     16 "NorthGate ~ Minn~ " MN"
## 4          1 Pumpion      2689 0.06     38 Pump~     16 "NorthGate ~ Minn~ " MN"
## 5          1 Strongho~    2688 0.06     25 Amer~     16 "NorthGate ~ Minn~ " MN"
## 6          1 Parapet ~    2687 0.056    47 Extr~     16 "NorthGate ~ Minn~ " MN"
## # ... with 5 more variables: Class <chr>, ABV.pmm.imputed <dbl>,
## #   IBU.pmm.imputed <int>, IBU.class.imputed <dbl>, ABV.class.imputed <dbl>
## # A tibble: 6 x 15
## # Groups:   Class [5]
##   Brewery_id Beer_Name Beer_ID   ABV   IBU Style Ounces Brewery_Name City  State
##        <int> <chr>       <int> <dbl> <int> <chr>  <dbl> <chr>        <chr> <chr>
## 1        556 Pilsner ~      98 0.055    NA Germ~     12 Ukiah Brewi~ Ukiah " CA"
## 2        557 Heinniew~      52 0.049    NA Hefe~     12 Butternuts ~ Garr~ " NY"
## 3        557 Snapperh~      51 0.068    NA Amer~     12 Butternuts ~ Garr~ " NY"
## 4        557 Moo Thun~      50 0.049    NA Milk~     12 Butternuts ~ Garr~ " NY"
## 5        557 Porkslap~      49 0.043    NA Amer~     12 Butternuts ~ Garr~ " NY"
## 6        558 Urban Wi~      30 0.049    NA Engl~     12 Sleeping La~ Anch~ " AK"
## # ... with 5 more variables: Class <chr>, ABV.pmm.imputed <dbl>,
## #   IBU.pmm.imputed <int>, IBU.class.imputed <dbl>, ABV.class.imputed <dbl>

Question 3: Address the missing values in each column.See above for imputing using two methods. Here we will compare methods and decide which is best to use for subsequent analyses

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1042 rows containing non-finite values (stat_bin).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 62 rows containing non-finite values (stat_bin).

## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1042 rows containing non-finite values (stat_smooth).
## Warning: Removed 1042 rows containing missing values (geom_point).

Answer 3: The PMM method of imputing mimcs the overall distribution of available data points but does not seem appropriate for each invidual missing point as it replaces with values not accurate for the type of beer that has missing values. The Class-mean method seems more appropriate and will be used for subsequent distribution analysis and summary statitistics. However it does not seema appropriate for relationship analysis, so we will omit NA’s for correlations and models

Question 4: Compute the median alcohol content and international bitterness unit for each state. Plot a bar chart to compare.

Answer 4: For states with sufficiently larger sample sizes median IBU in the mid-30’s. The two outlier states with high median IBU are small samples. They only have 2 beers per state, which happen to be high IBU styles. If they brewed more beers of multiple varieties, they would likely have median IBU more similar to the rest of states.

Almost all states have a median ABV in the 5% to 6% Range. Utah, the one outlier state with low ABV has to do with a state law limiting the max ABV to 4%

Question 5: Which state has the maximum alcoholic (ABV) beer? Which state has the most bitter (IBU) beer?

## # A tibble: 1 x 6
## # Groups:   Class [1]
##   Beer_Name                         ABV Class Style      State Brewery_Name     
##   <chr>                           <dbl> <chr> <chr>      <chr> <chr>            
## 1 Lee Hill Series Vol. 5 - Belgi~ 0.128 Other Quadrupel~ " CO" Upslope Brewing ~
## # A tibble: 1 x 6
## # Groups:   Class [1]
##   Beer_Name             IBU Class Style                 State Brewery_Name      
##   <chr>               <int> <chr> <chr>                 <chr> <chr>             
## 1 Bitter Bitch Imper~   138 IPA   American Double / Im~ " OR" Astoria Brewing C~

Answer 5: Max ABV beer = 12.8% by Upslope Brewing Company, Colorado Lee Hill Series Vol. 5 - Belgian Style Quadrupel Ale Brewed with a select strain of Belgian yeast and traditional Belgian candy syrup, the beer matured over six months to mellow and soften its character.

Max IBU Beer = 138 by Astoria Brewing Company, Oregon Bitter Bitch Imperial IPA A big IPA with a huge bite. Peoples Award Winner 3 Years running at the Spring Beer & Wine Festival, Portland, OR.

Question 6: Comment on the summary statistics and distribution of the ABV variable.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00100 0.05000 0.05600 0.05977 0.06700 0.12800      62
## Warning: Removed 62 rows containing non-finite values (stat_boxplot).
## Warning: Removed 62 rows containing missing values (geom_point).

## Adding missing grouping variables: `Class`
## Warning: Removed 62 rows containing non-finite values (stat_bin).

## Adding missing grouping variables: `Class`
## Warning: Removed 62 rows containing non-finite values (stat_bin).

## Warning: Removed 62 rows containing non-finite values (stat_density).

Overall
(N=2410)
ABV
Mean (SD) 0.0598 (0.0135)
Median [Min, Max] 0.0560 [0.00100, 0.128]
Missing 62 (2.6%)

Answer 6: ABV distribution is relatively normal with a right skew IPA’s contribute most to the right tail of high ABV values, explains mean > median The minimum value is a non-alcoholic beer, excluding this one reduces range to 2.7% - 12.8%

Question 7: Is there an apparent relationship between the bitterness of the beer and its alcoholic content? Draw a scatter plot. Make your best judgment of a relationship and EXPLAIN your answer.

## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1005 rows containing non-finite values (stat_smooth).
## Warning: Removed 1005 rows containing missing values (geom_point).

## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 179 rows containing non-finite values (stat_smooth).
## Warning: Removed 179 rows containing missing values (geom_point).

## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 411 rows containing non-finite values (stat_smooth).
## Warning: Removed 411 rows containing missing values (geom_point).

## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 590 rows containing non-finite values (stat_smooth).
## Warning: Removed 590 rows containing missing values (geom_point).

Answer 7: ABV and IBU are positively correlated with 45% of the variation in ABV explained by IBU The IBU/ABV relationship is different or shifted for different styles of beer For similar alcohol level, IPA is more bitter than Ale

Question 8: Budweiser would also like to investigate the difference with respect to IBU and ABV between IPAs (India Pale Ales) and other types of Ale (any beer with “Ale” in its name other than IPA). You decide to use KNN classification to investigate this relationship. Provide statistical evidence one way or the other.

#KNN - This model omits NA values and loops to find average confusion matrix parameters over iterations and plots for optimal k value. See comments after plots where single confusion matrix created using optimal k

##      classifications
##       Ale IPA
##   Ale 367   7
##   IPA  20 253
## Confusion Matrix and Statistics
## 
##      classifications
##       Ale IPA
##   Ale 367   7
##   IPA  20 253
##                                           
##                Accuracy : 0.9583          
##                  95% CI : (0.9399, 0.9723)
##     No Information Rate : 0.5981          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9139          
##                                           
##  Mcnemar's Test P-Value : 0.02092         
##                                           
##             Sensitivity : 0.9483          
##             Specificity : 0.9731          
##          Pos Pred Value : 0.9813          
##          Neg Pred Value : 0.9267          
##              Prevalence : 0.5981          
##          Detection Rate : 0.5672          
##    Detection Prevalence : 0.5781          
##       Balanced Accuracy : 0.9607          
##                                           
##        'Positive' Class : Ale             
## 

#KNN - This model omits NA values then scales ABV/IBU parameters to reduce leverage of nearest neighbor distances due to differing scales. Then it loops to find average confusion matrix parameters over iterations and plots for optimal k value. See comments after plots where single confusion matrix created using optimal k

##      classifications
##       Ale IPA
##   Ale 137  23
##   IPA  12 113
## Confusion Matrix and Statistics
## 
##      classifications
##       Ale IPA
##   Ale 137  23
##   IPA  12 113
##                                           
##                Accuracy : 0.8772          
##                  95% CI : (0.8334, 0.9129)
##     No Information Rate : 0.5228          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.753           
##                                           
##  Mcnemar's Test P-Value : 0.09097         
##                                           
##             Sensitivity : 0.9195          
##             Specificity : 0.8309          
##          Pos Pred Value : 0.8562          
##          Neg Pred Value : 0.9040          
##              Prevalence : 0.5228          
##          Detection Rate : 0.4807          
##    Detection Prevalence : 0.5614          
##       Balanced Accuracy : 0.8752          
##                                           
##        'Positive' Class : Ale             
## 

#KNN - Unlike above models where NA values were omitted, this model uses the imputed IBU and ABV values using the Redictive Mean Matching (PMM) discussed in sections above.