# Pima Indians Diabetes | Statistics for Data Analytics

Hello World,

TL;DR: hypothesis testing for statistical significance on a dataset with a binary outcome.

This assignment was given by Dr Ali Shekhi (one of the best Statistics professors I have come across) at the University of Limerick, as part of the Statistics for Data Analytics module (MA6101).

“PimaIndiansDiabetes” (Pima Indians Diabetes Database) is a data set with 768 observations and 9 variables. Below is a description of this data set: A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, was tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. We used the 532 complete records after dropping the (mainly missing) data on serum insulin.

You will first need to install the package “mlbench” in order to access this data set. The package should be installed, loaded into R, and the data set stored under the name “mydata”.

1. Write the code that is required to install and read the package and also to store the data set by the name “mydata”.

2. Present the variables descriptively, i.e., present mean (sd) or number (%) for each variable appropriately.

3. Compare the shape of the distribution of “glucose” in those with and without diabetes.

4. Test whether there is a statistically significant difference in the mean glucose of those with and without diabetes.

5. We want to fit a logistic regression model to the data set, with “diabetes” as the binary dependent variable and “age”, “mass”, “glucose” and “pregnant” as predictors. Write the code to fit this model and clearly explain the output.

Solutions

1. Write the code that is required to install and read the package and also to store the data set by the name “mydata”.

```
> install.packages("mlbench")
Installing package into ‘C:/Users/ASUS/Documents/R/win-library/3.6’
(as ‘lib’ is unspecified)
trying URL ‘https://cran.rstudio.com/bin/windows/contrib/3.6/mlbench_2.1-1.zip’
Content type ‘application/zip’ length 1061244 bytes (1.0 MB)

package ‘mlbench’ successfully unpacked and MD5 sums checked
```

```
> library("mlbench", lib.loc = "~/R/win-library/3.6")
> data(PimaIndiansDiabetes)
> mydata <- PimaIndiansDiabetes
```

2. Present the variables descriptively, i.e., present mean (sd) or number (%) for each variable appropriately.

```
> # calculate the standard deviation for all numeric attributes
> sapply(mydata[, 1:8], sd)
   pregnant     glucose    pressure     triceps     insulin        mass    pedigree         age
  3.3695781  31.9726182  19.3558072  15.9522176 115.2440024   7.8841603   0.3313286  11.7602315
```

```
> # calculate the mean for all numeric attributes
> sapply(mydata[, 1:8], mean)
   pregnant     glucose    pressure     triceps     insulin        mass    pedigree         age
  3.8450521 120.8945312  69.1054688  20.5364583  79.7994792  31.9925781   0.4718763  33.2408854
```

Note that “diabetes” is a factor, not a numeric variable, so it is excluded here; applying mean() to it would return NA.

```
> table(mydata$diabetes) / length(mydata$diabetes)

      neg       pos
0.6510417 0.3489583
```
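The pieces above can also be assembled into the requested “mean (sd)” and “n (%)” summary in one pass. This is a hedged sketch (the helper formatting is my own, not part of the assignment); it assumes mlbench is installed as in Question 1:

```r
library(mlbench)
data(PimaIndiansDiabetes)
mydata <- PimaIndiansDiabetes

# Numeric variables: format as "mean (sd)"
num_summary <- sapply(mydata[, 1:8],
                      function(x) sprintf("%.2f (%.2f)", mean(x), sd(x)))

# Categorical outcome: format as "n (%)"
tab <- table(mydata$diabetes)
cat_summary <- sprintf("%s: %d (%.1f%%)",
                       names(tab), as.integer(tab), 100 * prop.table(tab))

print(num_summary)
print(cat_summary)
```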

3. Compare the shape of the distribution of “glucose” in those with and without diabetes.
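Before plotting, the shape difference can also be quantified numerically. The sketch below computes sample skewness per group; the `skewness` helper is my own (a simple moment-based version, not from the assignment), and it assumes mydata from Question 1:

```r
library(mlbench)
data(PimaIndiansDiabetes)
mydata <- PimaIndiansDiabetes

# Moment-based sample skewness: third central moment over sd cubed
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Skewness of glucose within each diabetes group
sk <- tapply(mydata$glucose, mydata$diabetes, skewness)
sk
```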

```
> library(ggplot2)
> ggplot(mydata, aes(x = glucose)) +
    geom_density(aes(fill = diabetes), alpha = 0.3) +
    scale_fill_manual(values = c("purple", "green")) +
    xlim(-30, 300) +
    theme(legend.position = "bottom") +
    labs(x = "Glucose", y = "Density",
         title = "Density Plot of Glucose With and Without Diabetes")
```

4. Test whether there is a statistically significant difference in the mean glucose of those with and without diabetes.

```
> X <- subset(mydata$glucose, mydata$diabetes == 'neg')
> Y <- subset(mydata$glucose, mydata$diabetes == 'pos')
> t.test(X, Y)

	Welch Two Sample t-test

data:  X and Y
t = -13.752, df = 461.33, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -35.74707 -26.80786
sample estimates:
mean of x mean of y
 109.9800  141.2575
```

Since the p-value is far below 0.05, we reject the null hypothesis: the mean glucose of those with diabetes differs significantly from those without.
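The Welch t-test does not assume equal variances, but it does assume the group means are approximately normally distributed. As a hedged cross-check that avoids that assumption, a Wilcoxon rank-sum test can be run on the same comparison (assumes mydata from Question 1):

```r
library(mlbench)
data(PimaIndiansDiabetes)
mydata <- PimaIndiansDiabetes

# Nonparametric comparison of glucose between the neg and pos groups
result <- wilcox.test(glucose ~ diabetes, data = mydata)
result
```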

5. We want to fit a logistic regression model to the data set, with “diabetes” as the binary dependent variable and “age”, “mass”, “glucose” and “pregnant” as predictors. Write a code to fit this model and clearly explain the output.

```
> Logistic_reg <- glm(diabetes ~ age + mass + glucose + pregnant,
                      family = "binomial", data = mydata)
> summary(Logistic_reg)

Call:
glm(formula = diabetes ~ age + mass + glucose + pregnant, family = "binomial",
    data = mydata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1985  -0.7315  -0.4298   0.7692   2.8186

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.397404   0.672564 -12.486  < 2e-16 ***
age          0.012892   0.009026   1.428 0.153179
mass         0.082934   0.013770   6.023 1.72e-09 ***
glucose      0.033200   0.003365   9.865  < 2e-16 ***
pregnant     0.113520   0.031269   3.630 0.000283 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 993.48  on 767  degrees of freedom
Residual deviance: 742.10  on 763  degrees of freedom
AIC: 752.1

Number of Fisher Scoring iterations: 5
```

Explanation: From the output we can judge each predictor's significance by its p-value. The variables marked *** are statistically significant and play a role in whether a subject has diabetes. The variable age is not statistically significant (p = 0.153 > 0.05), so we remove it and refit the model:

```
> Logistic_reg2 <- update(Logistic_reg, ~ . - age)
> summary(Logistic_reg2)

Call:
glm(formula = diabetes ~ mass + glucose + pregnant, family = "binomial",
    data = mydata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1795  -0.7348  -0.4304   0.7670   2.8765

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.124024   0.638486 -12.724  < 2e-16 ***
mass         0.081551   0.013736   5.937 2.90e-09 ***
glucose      0.034162   0.003312  10.316  < 2e-16 ***
pregnant     0.137094   0.026768   5.121 3.03e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 993.48  on 767  degrees of freedom
Residual deviance: 744.12  on 764  degrees of freedom
AIC: 752.12

Number of Fisher Scoring iterations: 5
```

All remaining predictors are now statistically significant.
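Logistic regression coefficients are on the log-odds scale, which can be hard to read directly. A hedged interpretation aid (a sketch, not part of the assignment) is to exponentiate them into odds ratios; this refits the reduced model so it is self-contained:

```r
library(mlbench)
data(PimaIndiansDiabetes)
mydata <- PimaIndiansDiabetes

# Refit the final model with age removed
Logistic_reg2 <- glm(diabetes ~ mass + glucose + pregnant,
                     family = "binomial", data = mydata)

# Odds ratio per one-unit increase in each predictor
round(exp(coef(Logistic_reg2)), 3)
```

For example, the glucose coefficient of about 0.034 corresponds to an odds ratio of roughly exp(0.034) ≈ 1.035, i.e., each additional unit of glucose multiplies the odds of diabetes by about 1.035, holding the other predictors fixed.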