Pima Indians Diabetes | Statistics for Data Analytics

Hello World,

TL;DR: Hypothesis testing for statistical significance on a dataset with a binary outcome.

This assignment was set by Dr Ali Shekhi (one of the better statistics professors I have come across) at the University of Limerick for the Statistics for Data Analytics module (MA6101).

“PimaIndiansDiabetes” (Pima Indians Diabetes Database) is a data set with 768 observations and 9 variables. Below is a description of this data set: A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, was tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. We used the 532 complete records after dropping the (mainly missing) data on serum insulin.

Task

You will first need to install the package called “mlbench” in order to access this data set. The package should be installed, read-into R, and then the data set should be stored by the name “mydata”.

1. Write the code that is required to install and read the package and also to store the data set by the name “mydata”.

2. Present the variables descriptively, i.e., present mean (sd) or number (%) for each variable appropriately.

3. Compare the shape of the distribution of “glucose” in those with and without diabetes.

4. Test whether there is a statistically significant difference in the mean glucose of those with and without diabetes.

5. We want to fit a logistic regression model to the data set, with “diabetes” as the binary dependent variable and “age”, “mass”, “glucose” and “pregnant” as predictors. Write code to fit this model and clearly explain the output.

Solutions

1. Write the code that is required to install and read the package and also to store the data set by the name “mydata”.

> install.packages("mlbench")

Installing package into ‘C:/Users/ASUS/Documents/R/win-library/3.6’
(as ‘lib’ is unspecified)
trying URL ‘https://cran.rstudio.com/bin/windows/contrib/3.6/mlbench_2.1-1.zip’
Content type ‘application/zip’ length 1061244 bytes (1.0 MB)
downloaded 1.0 MB

package ‘mlbench’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in C:\Users\ASUS\AppData\Local\Temp\RtmpeOcvQZ\downloaded_packages

> library("mlbench", lib.loc="~/R/win-library/3.6")
> data(PimaIndiansDiabetes)
> mydata <- PimaIndiansDiabetes
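
The transcript above is tied to one Windows machine (note the lib.loc path). A more portable sketch, assuming mlbench can be installed from a configured CRAN mirror, installs the package only when it is missing and then checks that the data set has the 768 rows and 9 columns described earlier:

```r
# Install mlbench only if it is not already available
# (assumes a CRAN mirror is configured for this R session).
if (!requireNamespace("mlbench", quietly = TRUE)) {
  install.packages("mlbench")
}
library(mlbench)

# Load the data set and store it under the name required by the task.
data(PimaIndiansDiabetes)
mydata <- PimaIndiansDiabetes

dim(mydata)  # number of rows and columns
str(mydata)  # variable names and types (diabetes is a factor)
```

Calling library(mlbench) without lib.loc lets R search its default library paths, so the same script should run on other machines.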

2. Present the variables descriptively, i.e., present mean (sd) or number (%) for each variable appropriately.

# calculate the standard deviation and mean of the eight numeric attributes

> sapply(PimaIndiansDiabetes[,1:8], sd)
   pregnant     glucose    pressure     triceps     insulin        mass    pedigree         age
  3.3695781  31.9726182  19.3558072  15.9522176 115.2440024   7.8841603   0.3313286  11.7602315

> sapply(PimaIndiansDiabetes[,1:8], mean)
   pregnant     glucose    pressure     triceps     insulin        mass    pedigree         age
  3.8450521 120.8945312  69.1054688  20.5364583  79.7994792  31.9925781   0.4718763  33.2408854

# diabetes (column 9) is a factor, so its mean is NA; it is summarised as proportions instead

> table(mydata$diabetes)/length(mydata$diabetes)
      neg       pos
0.6510417 0.3489583
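
The task asks for mean (sd) or number (%) per variable, so the separate outputs above can be pulled together into two small tables. The following is one possible sketch; the names numeric_summary and diabetes_summary are only illustrative, and the rounding to two decimals is an assumption about the desired presentation.

```r
library(mlbench)
data(PimaIndiansDiabetes)
mydata <- PimaIndiansDiabetes

# Numeric variables: report "mean (sd)", rounded to two decimals.
is_num <- sapply(mydata, is.numeric)
numeric_summary <- data.frame(
  variable = names(mydata)[is_num],
  mean_sd  = sprintf("%.2f (%.2f)",
                     sapply(mydata[is_num], mean),
                     sapply(mydata[is_num], sd)),
  row.names = NULL
)
print(numeric_summary)

# Categorical variable: report number (%).
counts <- table(mydata$diabetes)
diabetes_summary <- data.frame(
  level   = names(counts),
  n       = as.vector(counts),
  percent = round(100 * as.vector(prop.table(counts)), 1)
)
print(diabetes_summary)
```

Selecting columns with is.numeric avoids hard-coding 1:8 and keeps the factor column out of the mean and sd calculations.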

3. Compare the shape of the distribution of “glucose” in those with and without diabetes.

> library(ggplot2)
> ggplot(mydata, aes(x = glucose)) +
+   geom_density(aes(fill = diabetes), alpha = 0.3) +
+   scale_fill_manual(values = c("purple", "green")) +
+   xlim(-30, 300) +
+   theme(legend.position = "bottom") +
+   labs(x = "Glucose", y = "Density", title = "Density Plot of Glucose With and Without Diabetes")
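
A density plot shows shape visually; a few per-group statistics can back up what the eye sees. The sketch below is one way to do that in base R. The skewness helper is a hypothetical function written here for illustration (a simple moment-based estimate), not something provided by base R or mlbench.

```r
library(mlbench)
data(PimaIndiansDiabetes)
mydata <- PimaIndiansDiabetes

# Moment-based sample skewness (hypothetical helper, not part of base R).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# One column per diabetes group, one row per statistic.
glucose_shape <- sapply(
  split(mydata$glucose, mydata$diabetes),
  function(g) c(n = length(g), mean = mean(g), sd = sd(g),
                median = median(g), IQR = IQR(g), skewness = skewness(g))
)
round(glucose_shape, 2)
```

Comparing mean with median and looking at the sign of the skewness gives a rough indication of whether each group's distribution is symmetric or has a longer tail on one side.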

4. Test whether there is a statistically significant difference in the mean glucose of those with and without diabetes.

> X <- subset(mydata$glucose, mydata$diabetes == 'neg')
> Y <- subset(mydata$glucose, mydata$diabetes == 'pos')
> t.test(X, Y)

        Welch Two Sample t-test

data: X and Y
t = -13.752, df = 461.33, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -35.74707 -26.80786
sample estimates:
mean of x mean of y
 109.9800  141.2575
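
The same Welch test can also be run through the formula interface of t.test, which avoids building X and Y by hand. Storing the result makes it easy to pull out individual pieces. This is a sketch; the name tt is only illustrative.

```r
library(mlbench)
data(PimaIndiansDiabetes)
mydata <- PimaIndiansDiabetes

# Welch two-sample t-test of glucose by diabetes status
# (var.equal = FALSE is the default, so equal variances are not assumed).
tt <- t.test(glucose ~ diabetes, data = mydata)

tt$estimate   # group means (neg first, then pos)
tt$conf.int   # 95% confidence interval for the difference in means
tt$p.value    # two-sided p-value
```

In the output above, the p-value is far below 0.05 and the 95% confidence interval (-35.75 to -26.81) excludes 0, so the difference in mean glucose between the two groups is statistically significant.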

5. We want to fit a logistic regression model to the data set, with “diabetes” as the binary dependent variable and “age”, “mass”, “glucose” and “pregnant” as predictors. Write a code to fit this model and clearly explain the output.

> Logistic_reg <- glm(diabetes~age+mass+glucose+pregnant, family = "binomial", data=mydata)

> summary(Logistic_reg)
Call:
glm(formula = diabetes ~ age + mass + glucose + pregnant, family = "binomial", data = mydata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1985  -0.7315  -0.4298   0.7692   2.8186

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.397404   0.672564 -12.486  < 2e-16 ***
age          0.012892   0.009026   1.428 0.153179
mass         0.082934   0.013770   6.023 1.72e-09 ***
glucose      0.033200   0.003365   9.865  < 2e-16 ***
pregnant     0.113520   0.031269   3.630 0.000283 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 993.48  on 767  degrees of freedom
Residual deviance: 742.10  on 763  degrees of freedom
AIC: 752.1
Number of Fisher Scoring iterations: 5
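
The coefficients in the summary above are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios. The sketch below does this with Wald confidence intervals from confint.default; the choice of Wald intervals (rather than profile-likelihood ones) is an assumption made here for simplicity, and odds_ratios is an illustrative name.

```r
library(mlbench)
data(PimaIndiansDiabetes)
mydata <- PimaIndiansDiabetes

# diabetes is a factor with levels neg, pos; glm models the probability of "pos".
Logistic_reg <- glm(diabetes ~ age + mass + glucose + pregnant,
                    family = "binomial", data = mydata)

# Odds ratios with 95% Wald confidence intervals.
odds_ratios <- exp(cbind(OR = coef(Logistic_reg),
                         confint.default(Logistic_reg)))
round(odds_ratios, 3)
```

An odds ratio above 1 means that higher values of the predictor go with higher odds of a positive diabetes test, holding the other predictors fixed. For example, the glucose estimate of 0.0332 corresponds to an odds ratio of about exp(0.0332) ≈ 1.034 per unit of glucose.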

Explanation: The p-values in the Pr(>|z|) column show which predictors are statistically significant. The variables marked *** (mass, glucose and pregnant) have p-values below 0.001, so each is significantly associated with whether a subject has diabetes, after adjusting for the other predictors. The variable age is not statistically significant: its p-value (0.153) is greater than the conventional 0.05 threshold. It is therefore removed from the model, as shown below.

 

> Logistic_reg2 <- update(Logistic_reg, ~. -age)
> summary(Logistic_reg2)

Call:
glm(formula = diabetes ~ mass + glucose + pregnant, family = "binomial",
    data = mydata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1795  -0.7348  -0.4304   0.7670   2.8765

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.124024   0.638486 -12.724  < 2e-16 ***
mass         0.081551   0.013736   5.937 2.90e-09 ***
glucose      0.034162   0.003312  10.316  < 2e-16 ***
pregnant     0.137094   0.026768   5.121 3.03e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
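    Null deviance: 993.48  on 767  degrees of freedom
Residual deviance: 744.12  on 764  degrees of freedom
AIC: 752.12
Number of Fisher Scoring iterations: 5

All three remaining predictors (mass, glucose and pregnant) are now statistically significant, each with p < 0.001. Dropping age costs very little in fit: the residual deviance rises only from 742.10 to 744.12, and the AIC is essentially unchanged (752.12 versus 752.10).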
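
One way to check that removing age does not noticeably worsen the fit is a likelihood-ratio test between the two nested models. The sketch below refits both models so that it runs on its own; lrt is an illustrative name.

```r
library(mlbench)
data(PimaIndiansDiabetes)
mydata <- PimaIndiansDiabetes

# Full model and the reduced model without age.
Logistic_reg  <- glm(diabetes ~ age + mass + glucose + pregnant,
                     family = "binomial", data = mydata)
Logistic_reg2 <- update(Logistic_reg, . ~ . - age)

# Likelihood-ratio (chi-squared) test: reduced model first, full model second.
lrt <- anova(Logistic_reg2, Logistic_reg, test = "Chisq")
print(lrt)

# AIC of both models side by side.
AIC(Logistic_reg, Logistic_reg2)
```

The test statistic is the difference in residual deviance between the two models (about 2.02 on 1 degree of freedom, from the summaries above). A large p-value here would support keeping the simpler model.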
