Hello World,
TL;DR: Hypothesis testing for statistical significance on a dataset with a binary outcome.
This assignment was given by Dr Ali Shekhi (one of the better professors of Statistics I have come across in my life) at the University of Limerick for the Statistics for Data Analytics module (MA6101).
“PimaIndiansDiabetes” (Pima Indians Diabetes Database) is a data set with 768 observations and 9 variables. Below is a description of this data set: A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, was tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. We used the 532 complete records after dropping the (mainly missing) data on serum insulin.
Task
You will first need to install the package called “mlbench” in order to access this data set. The package should be installed and loaded into R, and then the data set should be stored under the name “mydata”.
1. Write the code that is required to install and read the package and also to store the data set by the name “mydata”.
2. Present the variables descriptively, i.e., present mean (sd) or number (%) for each variable appropriately.
3. Compare the shape of the distribution of “glucose” in those with and without diabetes.
4. Test whether there is a statistically significant difference in the mean glucose of those with and without diabetes.
5. We want to fit a logistic regression model to the data set, with “diabetes” as the binary dependent variable and “age”, “mass”, “glucose” and “pregnant” as predictors. Write a code to fit this model and clearly explain the output.
Solutions
1. Write the code that is required to install and read the package and also to store the data set by the name “mydata”.
> install.packages("mlbench")
Installing package into ‘C:/Users/ASUS/Documents/R/win-library/3.6’
(as ‘lib’ is unspecified)
trying URL ‘https://cran.rstudio.com/bin/windows/contrib/3.6/mlbench_2.1-1.zip’
Content type ‘application/zip’ length 1061244 bytes (1.0 MB)
downloaded 1.0 MB
package ‘mlbench’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in C:\Users\ASUS\AppData\Local\Temp\RtmpeOcvQZ\downloaded_packages
> library("mlbench", lib.loc="~/R/win-library/3.6")
> data(PimaIndiansDiabetes)
> mydata <- PimaIndiansDiabetes
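When the script is rerun, reinstalling the package every time is unnecessary. A minimal sketch of an install-if-missing step follows; `ensure_package` is a hypothetical helper name, not part of the assignment, and it is demonstrated on the always-available `stats` package so that it runs without a network connection.

```r
# Hypothetical helper: install a package only when it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # TRUE when the package can be loaded after the optional install
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so no installation is triggered here.
ok <- ensure_package("stats")
print(ok)

# For this assignment the same pattern would be (not run here):
# ensure_package("mlbench")
# data("PimaIndiansDiabetes", package = "mlbench")
# mydata <- PimaIndiansDiabetes
```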
2. Present the variables descriptively, i.e., present mean (sd) or number (%) for each variable appropriately.
# calculate standard deviation and mean for all attributes
> sapply(PimaIndiansDiabetes[,1:8], sd)
   pregnant     glucose    pressure     triceps     insulin        mass    pedigree         age
  3.3695781  31.9726182  19.3558072  15.9522176 115.2440024   7.8841603   0.3313286  11.7602315
> sapply(PimaIndiansDiabetes[,1:9], mean)
   pregnant     glucose    pressure     triceps     insulin        mass    pedigree         age    diabetes
  3.8450521 120.8945312  69.1054688  20.5364583  79.7994792  31.9925781   0.4718763  33.2408854          NA
> table(mydata$diabetes)/length(mydata$diabetes)
      neg       pos
0.6510417 0.3489583
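The task asks for "mean (sd)" for numeric variables and "number (%)" for categorical ones, which the separate calls above only partly deliver. The sketch below shows one way to produce both formats in a single pass. `describe_var` is a hypothetical helper, and the `toy` data frame is made up to keep the example self-contained; with the real data one would call `sapply(mydata, describe_var)` instead.

```r
# Hypothetical helper: "mean (sd)" for numeric columns, "n (%)" per level otherwise.
describe_var <- function(x, digits = 2) {
  if (is.numeric(x)) {
    paste0(
      formatC(mean(x, na.rm = TRUE), format = "f", digits = digits),
      " (",
      formatC(sd(x, na.rm = TRUE), format = "f", digits = digits),
      ")"
    )
  } else {
    counts <- table(x)
    pct <- 100 * counts / sum(counts)
    paste(
      sprintf("%s: %d (%.1f%%)", names(counts), as.integer(counts), as.numeric(pct)),
      collapse = "; "
    )
  }
}

# Toy data frame standing in for mydata (values are made up for illustration).
toy <- data.frame(
  glucose  = c(90, 110, 150, 170),
  diabetes = factor(c("neg", "neg", "pos", "pos"))
)

result <- sapply(toy, describe_var)
print(result)
```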
3. Compare the shape of the distribution of “glucose” in those with and without diabetes.
> library(ggplot2)
> ggplot(mydata, aes(x = glucose)) +
    geom_density(aes(fill = diabetes), alpha = 0.3) +
    scale_color_manual(values = c("#800080", "#00ff00")) +
    scale_fill_manual(values = c("purple", "green")) + xlim(30, 300) +
    theme(legend.position = "bottom") +
    labs(x = "Glucose", y = "Density", title = "Density Plot of Glucose With and Without Diabetes")
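A density plot compares the two shapes visually. A few numbers per group (centre, spread and skewness) can back up what the eye sees. The sketch below is one possible way to do that in base R. The `skewness` helper is hypothetical (a simple moment-based estimate) and the toy vectors are made up; with the real data, `split(mydata$glucose, mydata$diabetes)` would replace them.

```r
# Hypothetical helper: moment-based sample skewness
# (about 0 for a symmetric sample, positive for a longer right tail).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Toy values standing in for mydata$glucose and mydata$diabetes (made up).
glucose  <- c(80, 90, 95, 100, 140, 120, 150, 160, 170, 180)
diabetes <- factor(c("neg", "neg", "neg", "neg", "neg",
                     "pos", "pos", "pos", "pos", "pos"))

# One column per group, one row per shape summary.
shape <- sapply(split(glucose, diabetes), function(g) {
  c(median = median(g), iqr = IQR(g), skewness = skewness(g))
})
print(round(shape, 2))
```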
4. Test whether there is a statistically significant difference in the mean glucose of those with and without diabetes.
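> X <- subset(mydata$glucose, mydata$diabetes == 'neg')
> Y <- subset(mydata$glucose, mydata$diabetes == 'pos')
> t.test(X, Y)

Welch Two Sample t-test

data: X and Y
95 percent confidence interval:
 -35.74707 -26.80786
sample estimates:
mean of x mean of y
 109.9800  141.2575

The 95% confidence interval for the difference in means (-35.75 to -26.81) does not contain zero, so the difference in mean glucose between those without and with diabetes is statistically significant at the 5% level.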
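To see what `t.test` does when it reports a Welch test, the statistic can be computed by hand: the difference in means divided by the unpooled standard error, with the Welch-Satterthwaite approximation for the degrees of freedom. The sketch below uses made-up toy samples, and `welch_t` is a hypothetical helper name; it is compared against the built-in function rather than the assignment data.

```r
# Hypothetical helper: Welch two-sample t-test computed from first principles.
welch_t <- function(x, y) {
  vx <- var(x) / length(x)   # squared standard error of mean(x)
  vy <- var(y) / length(y)   # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  p_value <- 2 * pt(-abs(t_stat), df)   # two-sided p-value
  c(t = t_stat, df = df, p.value = p_value)
}

# Toy samples (made up) standing in for X and Y.
x_toy <- c(95, 100, 105, 110, 120)
y_toy <- c(130, 135, 150, 155, 160, 170)

manual  <- welch_t(x_toy, y_toy)
builtin <- t.test(x_toy, y_toy)
print(manual)
```

Calling `welch_t(X, Y)` on the two glucose vectors should reproduce the statistic, degrees of freedom and p-value that `t.test(X, Y)` prints.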
5. We want to fit a logistic regression model to the data set, with “diabetes” as the binary dependent variable and “age”, “mass”, “glucose” and “pregnant” as predictors. Write a code to fit this model and clearly explain the output.
> Logistic_reg <- glm(diabetes ~ age + mass + glucose + pregnant, family = "binomial", data = mydata)
> summary(Logistic_reg)

Call:
glm(formula = diabetes ~ age + mass + glucose + pregnant, family = "binomial", data = mydata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1985  -0.7315  -0.4298   0.7692   2.8186

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.397404   0.672564 -12.486  < 2e-16 ***
age          0.012892   0.009026   1.428 0.153179
mass         0.082934   0.013770   6.023 1.72e-09 ***
glucose      0.033200   0.003365   9.865  < 2e-16 ***
pregnant     0.113520   0.031269   3.630 0.000283 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 993.48 on 767 degrees of freedom
Residual deviance: 742.10 on 763 degrees of freedom
AIC: 752.1

Number of Fisher Scoring iterations: 5
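The estimates in this output are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios: the factor by which the odds of diabetes are multiplied for a one-unit increase in a predictor, holding the others fixed. The sketch below does this arithmetic on the estimates copied from the output; with the fitted object available, `exp(coef(Logistic_reg))` would give the same quantities.

```r
# Coefficient estimates copied from the summary output (log-odds scale).
coefs <- c(age = 0.012892, mass = 0.082934, glucose = 0.033200, pregnant = 0.113520)

# Odds ratio per one-unit increase in each predictor.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# Odds ratio for a 10-unit increase in glucose: the log-odds effect scales linearly.
or_glucose_10 <- exp(10 * coefs[["glucose"]])
print(round(or_glucose_10, 3))
```

An odds ratio above 1 means the predictor is associated with higher odds of a positive diabetes outcome in this model.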
Explanation: The p-values in the output show which variables are statistically significant. The variables marked with *** are significant and play a role in whether a subject has diabetes or not. Age is not statistically significant: its p-value (0.153) is greater than 0.05. It is therefore removed from the model, as shown below.
> Logistic_reg2 <- update(Logistic_reg, ~ . - age)
> summary(Logistic_reg2)

Call:
glm(formula = diabetes ~ mass + glucose + pregnant, family = "binomial", data = mydata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1795  -0.7348  -0.4304   0.7670   2.8765

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.124024   0.638486 -12.724  < 2e-16 ***
mass         0.081551   0.013736   5.937 2.90e-09 ***
glucose      0.034162   0.003312  10.316  < 2e-16 ***
pregnant     0.137094   0.026768   5.121 3.03e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 993.48 on 767 degrees of freedom
Residual deviance: 744.12 on 764 degrees of freedom
AIC: 752.12

Number of Fisher Scoring iterations: 5
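With age removed, all remaining predictors (mass, glucose and pregnant) are statistically significant. The AIC is essentially unchanged (752.12 against 752.1), so dropping age costs almost nothing in fit.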
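Dropping age can also be checked with a likelihood-ratio test, which compares the residual deviances of the two nested models. The sketch below does the arithmetic on the deviances and degrees of freedom reported in the two summaries; the variable names are illustrative. With both fitted objects available, `anova(Logistic_reg2, Logistic_reg, test = "Chisq")` performs the same comparison.

```r
# Residual deviances and degrees of freedom copied from the two summaries.
dev_full    <- 742.10; df_full    <- 763   # model with age
dev_reduced <- 744.12; df_reduced <- 764   # model without age

# Likelihood-ratio statistic: the increase in deviance from dropping age.
lr_stat <- dev_reduced - dev_full
lr_df   <- df_reduced - df_full

# Upper-tail chi-squared probability gives the p-value.
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p.value = p_value))
```

A p-value above 0.05 here would agree with the Wald test in the first summary: there is no evidence that age improves the model once mass, glucose and pregnant are included.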
