Chapter 2 Setting up R

Welcome to R! For this introductory workshop, we will be using R Markdown. It might take a bit of getting used to, but it is a great way to run R code and write comments alongside it so you know what you did!

The first thing you will need to do is install the packages you need. The data set, as well as many of the commands, is in a package called survival (good name, right?). We will also be using a powerful graphing package called ggplot2 (boy, we could have up to 5 workshops on this and still only scratch the surface of what it can do!) and a few extra packages to make things look nice.

First, in the console, you will need to install the packages onto your local drive. To do so, type the commands:
install.packages("survival")
install.packages("ggplot2")
install.packages("gridExtra")
install.packages("survminer")
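
If you are not sure whether some of these packages are already installed, a minimal sketch like the following checks first and installs only what is missing. The names needed, is_installed, and to_install are illustrative only, and the interactive() guard is an assumption added here so that the install step runs only when you type it in the console, not when a document is knitted.

```r
# Packages used in this workshop
needed <- c("survival", "ggplot2", "gridExtra", "survminer")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Keep only the packages that are not installed yet
to_install <- needed[!is_installed]

# Install the missing ones, but only from an interactive console session
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Running this a second time should do nothing, because to_install will be empty once everything is in place.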

You will only need to do this once. Once the packages have been installed, they stay on your local drive.

Now let’s open a new R Markdown notebook. Go to File -> New File -> R Markdown… This will open an R Markdown file for you. Go ahead and run some of the “R Chunks” to see what is going on here. Once you are comfortable, let’s change the first chunk to do what we need it to do. We will need to load the survival and ggplot2 packages, as well as the other packages we installed, using library(). See below for code to do this. Every time you open this document and want to run the code it contains, you will need to run these commands first.

library(survival)
library(ggplot2)
library(gridExtra)
library(survminer)

Be sure to hit the little green run button in the top right-hand corner of the chunk to run this code. Feel free to put comments before or after this chunk to remind yourself what you just did!

Now let’s explore the data set we will be using (like any good data scientist, you must KNOW your data first!). Go ahead and ask for a summary of the data set lung, which comes with the survival package.

summary(lung)
##       inst            time            status           age       
##  Min.   : 1.00   Min.   :   5.0   Min.   :1.000   Min.   :39.00  
##  1st Qu.: 3.00   1st Qu.: 166.8   1st Qu.:1.000   1st Qu.:56.00  
##  Median :11.00   Median : 255.5   Median :2.000   Median :63.00  
##  Mean   :11.09   Mean   : 305.2   Mean   :1.724   Mean   :62.45  
##  3rd Qu.:16.00   3rd Qu.: 396.5   3rd Qu.:2.000   3rd Qu.:69.00  
##  Max.   :33.00   Max.   :1022.0   Max.   :2.000   Max.   :82.00  
##  NA's   :1                                                       
##       sex           ph.ecog          ph.karno        pat.karno     
##  Min.   :1.000   Min.   :0.0000   Min.   : 50.00   Min.   : 30.00  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.: 75.00   1st Qu.: 70.00  
##  Median :1.000   Median :1.0000   Median : 80.00   Median : 80.00  
##  Mean   :1.395   Mean   :0.9515   Mean   : 81.94   Mean   : 79.96  
##  3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.: 90.00   3rd Qu.: 90.00  
##  Max.   :2.000   Max.   :3.0000   Max.   :100.00   Max.   :100.00  
##                  NA's   :1        NA's   :1        NA's   :3       
##     meal.cal         wt.loss       
##  Min.   :  96.0   Min.   :-24.000  
##  1st Qu.: 635.0   1st Qu.:  0.000  
##  Median : 975.0   Median :  7.000  
##  Mean   : 928.8   Mean   :  9.832  
##  3rd Qu.:1150.0   3rd Qu.: 15.750  
##  Max.   :2600.0   Max.   : 68.000  
##  NA's   :47       NA's   :14
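
The NA's rows at the bottom of the summary can also be pulled out on their own. The following is a small sketch using base R; na_counts is just an illustrative name, not something the workshop defines.

```r
library(survival)  # provides the lung data set

# Count the missing values (NA) in each column of lung
na_counts <- colSums(is.na(lung))
na_counts

# Number of rows with no missing values in any column
sum(complete.cases(lung))
```

The counts should match the NA's entries in the summary above, and the last line shows how many patients would remain if an analysis dropped every row with a missing value.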

The summary shows how many variables you have (ten here), the range of each variable, and how many NA’s (missing values) there are. NA’s can be a problem in some analyses. Let’s take a look at a bar plot for status.

ggplot(lung, aes(x = factor(status))) + geom_bar(stat = "count", fill = "blue")
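
The bar heights are simply counts of each status code, so it can help to see the numbers as well. According to the survival package's help page for lung, status is coded 1 = censored and 2 = dead. The sketch below tabulates the codes and builds a labelled copy; status_label and the label text are my own choices for illustration.

```r
library(survival)  # provides the lung data set

# Raw counts behind the bar plot
status_counts <- table(lung$status)
status_counts

# A labelled copy of status (1 = censored, 2 = dead per the lung help page);
# status_label is an illustrative name, not a column in the original data
status_label <- factor(lung$status, levels = c(1, 2),
                       labels = c("censored", "dead"))
table(status_label)
```

Using a labelled factor like this on the x-axis instead of factor(status) would make the plot easier to read at a glance.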

Just a little more exploration (if you are doing an analysis, you should explore more than this!).

# Box plots of each numeric variable by status
p1 <- ggplot(lung, aes(x = factor(status), y = time)) + geom_boxplot()
p2 <- ggplot(lung, aes(x = factor(status), y = age)) + geom_boxplot()
p3 <- ggplot(lung, aes(x = factor(status), y = ph.karno)) + geom_boxplot()
p4 <- ggplot(lung, aes(x = factor(status), y = pat.karno)) + geom_boxplot()
p5 <- ggplot(lung, aes(x = factor(status), y = meal.cal)) + geom_boxplot()
p6 <- ggplot(lung, aes(x = factor(status), y = wt.loss)) + geom_boxplot()
# Arrange the six plots in a grid with 3 rows (and so 2 columns)
grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 3)
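
The thick line in the middle of each box is the group median, and it can be useful to have those numbers in a table as well. Here is a minimal sketch using base R's aggregate(); vars and medians_by_status are illustrative names. Note that ggplot2 drops rows with missing values (with a warning) when drawing the box plots, so na.rm = TRUE is used here to do the same.

```r
library(survival)  # provides the lung data set

# Variables shown in the six box plots
vars <- c("time", "age", "ph.karno", "pat.karno", "meal.cal", "wt.loss")

# Median of each variable within each status group, ignoring missing values
medians_by_status <- aggregate(lung[vars],
                               by = list(status = lung$status),
                               FUN = median, na.rm = TRUE)
medians_by_status
```

Each row of the result corresponds to one status group, so the two rows can be compared directly against the two boxes in each panel.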

ggplot(lung, aes(x=factor(status),fill = factor(ph.ecog))) +
  geom_bar(position="fill")
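
This last plot uses position="fill", which rescales each bar so that it shows proportions rather than counts. A rough numeric equivalent, sketched with base R, is below; counts and props are illustrative names, and useNA = "ifany" is my choice so that the one patient with a missing ph.ecog appears as a separate column, much as the plot shows NA as its own colour.

```r
library(survival)  # provides the lung data set

# Cross-tabulate status by ph.ecog, keeping missing ph.ecog as its own column
counts <- table(status = lung$status, ph.ecog = lung$ph.ecog, useNA = "ifany")

# Convert counts to within-status proportions (each row sums to 1)
props <- prop.table(counts, margin = 1)
round(props, 3)
```

Each row of props corresponds to one bar in the plot, and each column to one coloured segment of that bar.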