The following analysis draws on information about the 27 divisions in the 2nd ward of Philadelphia. As we seek to understand the results of the 2016 general presidential election, it helps to look at the data and trends at the local level. Whoever best understands what happens locally can help sway the overall results at the state level.
This data exploration can be enhanced by looking at information from the other wards. This is a map of all the wards in Philadelphia. This is the map of the 2nd ward, which is the subject of this analysis.
Below is a summary of the information from the 27 divisions.
The original variables in this dataset are:
 Democrats
 Republicans
 Independents
 Other Party
 Total Population
 White
 Black
 Hispanic
 Other Race
 Male
 Female
 Gender Unreported
Other proportion variables will be created when appropriate.
> summary(second[,2:13])
 Dem             Rep              Ind              Other Party     Total Pop.       White
 Min.   :293.0   Min.   : 41.00   Min.   : 3.000   Min.   : 42.0   Min.   : 379.0   Min.   : 54.0
 1st Qu.:475.5   1st Qu.: 67.50   1st Qu.: 4.000   1st Qu.: 81.0   1st Qu.: 648.5   1st Qu.:166.0
 Median :547.0   Median : 80.00   Median : 9.000   Median : 93.0   Median : 711.0   Median :225.0
 Mean   :552.3   Mean   : 88.96   Mean   : 9.444   Mean   :102.0   Mean   : 752.7   Mean   :211.7
 3rd Qu.:630.0   3rd Qu.:100.50   3rd Qu.:13.000   3rd Qu.:118.5   3rd Qu.: 846.0   3rd Qu.:246.0
 Max.   :852.0   Max.   :203.00   Max.   :20.000   Max.   :212.0   Max.   :1252.0   Max.   :405.0
 Black            Hispanic        Other Race      Male            Female          Gender Unreported
 Min.   :  7.00   Min.   : 6.00   Min.   :10.00   Min.   :122.0   Min.   :148.0   Min.   :112.0
 1st Qu.: 15.00   1st Qu.:10.00   1st Qu.:20.50   1st Qu.:218.5   1st Qu.:246.5   1st Qu.:163.0
 Median : 38.00   Median :13.00   Median :24.00   Median :256.0   Median :284.0   Median :193.0
 Mean   : 58.11   Mean   :13.52   Mean   :27.11   Mean   :261.9   Mean   :288.1   Mean   :205.4
 3rd Qu.: 66.00   3rd Qu.:16.50   3rd Qu.:30.00   3rd Qu.:311.5   3rd Qu.:317.0   3rd Qu.:241.0
 Max.   :209.00   Max.   :26.00   Max.   :56.00   Max.   :413.0   Max.   :487.0   Max.   :365.0
Are there associations between some of these variables?
1.) Does the proportion of female voters in a division help explain the variation in the gross amount of Democratic voters?
A proportion variable is created for the female population in each division.
> second$FemaleProp <- second$Female/second$`Total Pop.`
> summary(lm(second$Dem ~ second$FemaleProp))
Call:
lm(formula = second$Dem ~ second$FemaleProp)
Residuals:
     Min       1Q   Median       3Q      Max
-256.340  -66.267    1.328   75.866  302.028
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)          711.4      418.8   1.699    0.102
second$FemaleProp   -415.0     1090.1  -0.381    0.707
Residual standard error: 136.8 on 25 degrees of freedom
Multiple R-squared: 0.005765, Adjusted R-squared: -0.034
F-statistic: 0.1449 on 1 and 25 DF, p-value: 0.7066
The share of females in each division’s total population explains almost none of the variability in the number of Democratic voters in each division. The coefficient of determination is almost zero and the p-value is very large.
It might be more appropriate to look at the number of Democrats in each division as a proportion rather than in total persons.
I make another variable to represent this new proportion.
> second$DemProp <- second$Dem/second$`Total Pop.`
> summary(lm(second$DemProp ~ second$FemaleProp))
Call:
lm(formula = second$DemProp ~ second$FemaleProp)
Residuals:
      Min        1Q    Median        3Q       Max
-0.075952 -0.019556  0.002954  0.029052  0.058959
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)         0.6205     0.1146   5.415 1.28e-05 ***
second$FemaleProp   0.3045     0.2983   1.021    0.317
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03744 on 25 degrees of freedom
Multiple R-squared: 0.04003, Adjusted R-squared: 0.001632
F-statistic: 1.043 on 1 and 25 DF, p-value: 0.317
The p-value roughly halves, but the coefficient of determination is still very low. The association is not even close to being statistically significant.
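The proportion regression can also be written with the data argument of lm, which keeps the coefficient labels short and makes it easy to pull the R-squared and p-value out of the fitted model instead of reading them off the printout. The following is a minimal sketch, not the original analysis: the data frame is a synthetic stand-in for second with made-up values, so its numbers will not match the output shown here.

```r
# Synthetic stand-in for the `second` data frame (27 divisions).
# All values are made up for illustration; they are NOT the ward data.
set.seed(1)
n <- 27
total  <- round(runif(n, 379, 1252))
second <- data.frame(
  `Total Pop.` = total,
  Dem          = round(total * runif(n, 0.65, 0.80)),
  Female       = round(total * runif(n, 0.33, 0.43)),
  check.names  = FALSE  # keep the column name "Total Pop." as is
)

# Proportion variables, built the same way as in the text
second$FemaleProp <- second$Female / second$`Total Pop.`
second$DemProp    <- second$Dem    / second$`Total Pop.`

# Same model as summary(lm(second$DemProp ~ second$FemaleProp)),
# written with the data argument
fit <- lm(DemProp ~ FemaleProp, data = second)
s   <- summary(fit)

r2      <- s$r.squared                          # coefficient of determination
adj_r2  <- s$adj.r.squared                      # adjusted for the number of predictors
p_slope <- coef(s)["FemaleProp", "Pr(>|t|)"]    # p-value of the slope

# With one predictor: adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - 2).
# It drops below zero whenever R^2 is very close to zero.
adj_by_hand <- 1 - (1 - r2) * (n - 1) / (n - 2)

print(c(r2 = r2, adj_r2 = adj_r2, p_slope = p_slope))
```

With a single predictor, the p-value of the slope is the same as the p-value of the overall F-statistic, which is why the two agree in each simple regression in this analysis.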
2.) Is the male proportion of the population indicative of the total Republicans in the same division?
A new variable is created to represent the male proportion of the population of each division.
> second$MaleProp <- second$Male/second$`Total Pop.`
> summary(lm(second$Rep ~ second$MaleProp))
Call:
lm(formula = second$Rep ~ second$MaleProp)
Residuals:
    Min      1Q  Median      3Q     Max
-54.447 -19.993   9.245  14.653 108.770
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)       173.71      89.37   1.944   0.0633 .
second$MaleProp  -243.12     255.59  -0.951   0.3506
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 36.46 on 25 degrees of freedom
Multiple R-squared: 0.03493, Adjusted R-squared: -0.003676
F-statistic: 0.9048 on 1 and 25 DF, p-value: 0.3506
There is no statistically significant association between the proportion of males and the number of Republicans in a division, for the same reasons stated in the first example.
Another new variable is created to represent the Republican proportion of the population of each division.
> second$RepProp <- second$Rep/second$`Total Pop.`
> summary(lm(second$RepProp ~ second$MaleProp))
Call:
lm(formula = second$RepProp ~ second$MaleProp)
Residuals:
      Min        1Q    Median        3Q       Max
-0.049654 -0.013996  0.004151  0.016322  0.057392
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      0.17521    0.06203   2.825  0.00916 **
second$MaleProp -0.16729    0.17740  -0.943  0.35472
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.02531 on 25 degrees of freedom
Multiple R-squared: 0.03435, Adjusted R-squared: -0.00428
F-statistic: 0.8892 on 1 and 25 DF, p-value: 0.3547
This time, using a proportion instead of a total does almost nothing to change the coefficient of determination or the p-value of the regression. There is no statistically significant association between the two variables at any reasonable level of significance. However, we must also consider the quality of the data we have received.
 Gender Unreported
 Min.   :112.0
 1st Qu.:163.0
 Median :193.0
 Mean   :205.4
 3rd Qu.:241.0
 Max.   :365.0
Put in percentage terms of the total population in each division, we do not know the gender of a sizeable share of voters: 22% to 32% of each division does not have an identified gender. We might have found associations between the variables in the first two examples if we had more complete data.
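The percentage calculation behind that range can be sketched as follows. This is an illustration under stated assumptions: the data frame is a synthetic stand-in for second with made-up values, and the column name UnreportedProp is hypothetical, chosen only to match the naming of the other proportion variables.

```r
# Synthetic stand-in for the `second` data frame (27 divisions).
# All values are made up for illustration; they are NOT the ward data.
set.seed(2)
n <- 27
total  <- round(runif(n, 379, 1252))
second <- data.frame(
  `Total Pop.`        = total,
  `Gender Unreported` = round(total * runif(n, 0.22, 0.32)),
  check.names         = FALSE  # keep the original column names with spaces
)

# Share of each division with no recorded gender
# (UnreportedProp is a hypothetical name, not from the original analysis)
second$UnreportedProp <- second$`Gender Unreported` / second$`Total Pop.`

# Five-number summary plus mean, as for the other variables
print(summary(second$UnreportedProp))

# Smallest and largest share across divisions, in percent
range_pct <- round(100 * range(second$UnreportedProp), 1)
print(range_pct)
```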
3.) Does a higher proportion of white voters in a division help explain the variation in the number of Independent party registrants?
A proportion is created for the white voters over the total population in the division.
> second$WhiteProp <- second$White/second$`Total Pop.`
> summary(lm(second$Ind ~ second$WhiteProp))
Call:
lm(formula = second$Ind ~ second$WhiteProp)
Residuals:
   Min     1Q Median     3Q    Max
-7.465 -3.085  0.759  2.726  9.918
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)         2.385      3.917   0.609   0.5480
second$WhiteProp   25.531     13.746   1.857   0.0751 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.92 on 25 degrees of freedom
Multiple R-squared: 0.1213, Adjusted R-squared: 0.08612
F-statistic: 3.45 on 1 and 25 DF, p-value: 0.07507
We are much closer to finding an association between the proportion of white voters and Independent registration than we were in our attempts to relate political parties and genders. However, the white proportion is still not a statistically significant indicator of Independent party registrants if we set alpha at 0.05. The p-value is 0.07507 and the coefficient of determination is very low at 12.13%.
As we did before, we can now create a new variable for the percentage of Independent party registrants out of the total number of registrants per division.
> second$IndProp <- second$Ind/second$`Total Pop.`
> summary(lm(second$IndProp ~ second$WhiteProp))
Call:
lm(formula = second$IndProp ~ second$WhiteProp)
Residuals:
       Min         1Q     Median         3Q        Max
-0.0081644 -0.0033878  0.0001117  0.0032048  0.0082518
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      0.004396   0.003759   1.170   0.2532
second$WhiteProp 0.027473   0.013191   2.083   0.0477 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.004721 on 25 degrees of freedom
Multiple R-squared: 0.1479, Adjusted R-squared: 0.1138
F-statistic: 4.338 on 1 and 25 DF, p-value: 0.04766
As a proportion, we find the association to be significant at the 0.05 level, even though the R-squared value is only 14.79%. We should now check the residual plots and consider the possibility of adding other variables to the model to improve the coefficient of determination.
The residuals versus fits plot and the normal probability plot look good. The errors appear to be normally distributed, with an approximate mean of zero and constant variance.
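The two diagnostic plots for this model can be produced along the following lines. This is a hedged sketch rather than the original code: the data frame is a synthetic stand-in for second with made-up values, and the Shapiro-Wilk test is an optional numerical check added here, not something the original analysis reports.

```r
# Synthetic stand-in for the `second` data frame (27 divisions).
# All values are made up for illustration; they are NOT the ward data.
set.seed(3)
n <- 27
total  <- round(runif(n, 379, 1252))
second <- data.frame(
  `Total Pop.` = total,
  White        = round(total * runif(n, 0.10, 0.40)),
  Ind          = rpois(n, lambda = total * 0.0125),
  check.names  = FALSE  # keep the column name "Total Pop." as is
)
second$WhiteProp <- second$White / second$`Total Pop.`
second$IndProp   <- second$Ind   / second$`Total Pop.`

fit <- lm(IndProp ~ WhiteProp, data = second)

# which = 1 draws residuals versus fitted values,
# which = 2 draws the normal Q-Q (normal probability) plot
op <- par(mfrow = c(1, 2))
plot(fit, which = 1:2)
par(op)  # restore the previous plotting layout

# Optional numerical check of normality (an addition, not in the original):
# a large p-value gives no evidence against normally distributed residuals
sw <- shapiro.test(residuals(fit))
print(sw$p.value)

# In a least-squares fit with an intercept the residuals always average
# to zero, so the plots are mainly informative about spread and shape
print(mean(residuals(fit)))
```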
4.) Does the total population of a division help explain the variation in the number of residents who register with a party other than Republican, Democrat, or “Independent”?
> summary(lm(second$`Other Party`~second$`Total Pop.`))
Call:
lm(formula = second$`Other Party` ~ second$`Total Pop.`)
Residuals:
    Min      1Q  Median      3Q     Max
-30.163 -10.212   0.958   5.314  39.011
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         -26.74068   12.18433  -2.195   0.0377 *
second$`Total Pop.`   0.17109    0.01567  10.918 5.29e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.87 on 25 degrees of freedom
Multiple R-squared: 0.8266, Adjusted R-squared: 0.8197
F-statistic: 119.2 on 1 and 25 DF, p-value: 5.295e-11
The total population does a good job of explaining the variability in the number of individuals who register as “other party”. The coefficient of determination is much larger at 82.66% (81.97% adjusted) and the predictor variable is significant at any conventional level of significance. This is our best result!
> op = par(mfrow = c(2,2))
> plot(lm(second$`Other Party`~second$`Total Pop.`))
The residuals seem to bounce randomly about the residual = 0 line, but there are three outliers flagged by R. These are divisions 15, 20, and 25. On our normal probability plot we also see that the errors are normally distributed for middle values, but not for lower and higher values. We may need to transform the variable and/or consider another regression type besides linear.
5.) Is there an association between those who register as another party and the number of individuals in a division who identify as white?
> summary(lm(second$`Other Party` ~ second$White))
Call:
lm(formula = second$`Other Party` ~ second$White)
Residuals:
    Min      1Q  Median      3Q     Max
-45.492 -13.703   3.486  14.109  60.357
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  28.81628   13.79156   2.089    0.047 *
second$White  0.34592    0.06099   5.672 6.63e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 25.21 on 25 degrees of freedom
Multiple R-squared: 0.5627, Adjusted R-squared: 0.5452
F-statistic: 32.17 on 1 and 25 DF, p-value: 6.631e-06
The number of those who identify as white does a good job of explaining the variability in the number of individuals who register as “other party”. The coefficient of determination is 56.27% and we reject the null hypothesis of no association at any conventional level of significance. This is good news.
Below are the results of the residuals versus fits plot and the normal probability plot.
Once again there are three outliers. Division 25 appears again as an outlier, but now we should further examine the data for divisions 6 and 26. The residuals versus fits plot shows a slight pattern: the residuals do not bounce randomly around the residual = 0 line. However, we should also consider the possibility that more data would eliminate this slight pattern or appearance of “nonrandomness”. The normal probability plot is once again good for middle values, but loses its utility at lower and higher values for divisions 6, 25, and 26.
We could try a transformation of this variable, possibly a squared version of the White variable. Squared and cubed versions of WhiteProp, White, and Total Pop. do not enhance the models once we look at the residuals versus fits and normal probability plots.
We can instead regress the response variable Other Party on two predictor variables. Since the total population and white variables were both found to be significant individually, we can see whether together they help to explain the variability in the number of other party registrants.
> summary(lm(second$`Other Party`~second$`Total Pop.` + second$White))
Call:
lm(formula = second$`Other Party` ~ second$`Total Pop.` + second$White)
Residuals:
    Min      1Q  Median      3Q     Max
-30.108 -10.240   0.993   5.313  39.035
Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)         -2.671e+01  1.276e+01  -2.092   0.0472 *
second$`Total Pop.`  1.708e-01  2.826e-02   6.045 3.05e-06 ***
second$White         8.462e-04  6.925e-02   0.012   0.9904
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 16.2 on 24 degrees of freedom
Multiple R-squared: 0.8266, Adjusted R-squared: 0.8122
F-statistic: 57.22 on 2 and 24 DF, p-value: 7.374e-10
When both predictor variables are included in the same model, only the total population is found to be statistically significant.
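One way to probe why the White variable loses its significance is to check how strongly the two predictors are correlated with each other and to compare the one-predictor and two-predictor models directly. The sketch below is only an illustration of that check: the data frame is a synthetic stand-in for second with made-up values, and whether the real predictors are strongly correlated is an assumption to be verified on the actual data, not a result reported here.

```r
# Synthetic stand-in for the `second` data frame (27 divisions).
# All values are made up for illustration; they are NOT the ward data.
set.seed(4)
n <- 27
total  <- round(runif(n, 379, 1252))
second <- data.frame(
  `Total Pop.`  = total,
  White         = round(total * runif(n, 0.20, 0.35)),
  `Other Party` = round(-27 + 0.17 * total + rnorm(n, sd = 16)),
  check.names   = FALSE  # keep the original column names with spaces
)

# Correlation between the two predictors; a value near 1 means they
# carry largely the same information about division size
r <- cor(second$`Total Pop.`, second$White)
print(r)

# Nested models: total population alone versus total population plus White
small <- lm(`Other Party` ~ `Total Pop.`, data = second)
full  <- lm(`Other Party` ~ `Total Pop.` + White, data = second)

# Partial F-test: does adding White reduce the residual sum of squares
# by more than chance would explain?
cmp <- anova(small, full)
print(cmp)

# For a single added term this p-value equals the t-test p-value
# of the White coefficient in the full model
p_added <- cmp$`Pr(>F)`[2]
p_white <- coef(summary(full))["White", "Pr(>|t|)"]
print(c(p_added = p_added, p_white = p_white))
```

Comparing the adjusted R-squared of the two models gives the same reading: if it does not rise when White is added, the extra predictor is not earning its place.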
