Packages
Packages add extra functionality to R. Think of the apps on your phone, each of which does something different.
There are several ways to install them, but the simplest is through a package called pacman. First, we need to install pacman itself:
if (!require("pacman")) install.packages("pacman")
Then, every time we need a package, we load it (installing it first if necessary) with
pacman::p_load(tidyverse,
               palmerpenguins)
Packages are installed only once (just like the apps on your phone), but you need to load them in every new R session (just as you open an app each time you want to use it).
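To make the install-once, load-every-session idea concrete, here is a rough base R sketch of what a helper like p_load does for a single package. The function name load_or_install is made up for this illustration, and pacman's real implementation is more elaborate.

```r
# Hypothetical helper (not part of pacman): install a package only if it is
# missing, then load it for the current session.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)               # needed only the first time
  }
  library(pkg, character.only = TRUE)   # needed in every new session
  invisible(pkg %in% loadedNamespaces())
}

# "stats" ships with R, so this call loads it without downloading anything
load_or_install("stats")
```

Calling the helper twice in the same session is harmless: the second call finds the package already installed and loaded.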
How to code in R using RStudio
Writing code is different from writing a letter or an email. An email is written for another person, whereas code is written to be read by a machine. Unlike a person, the machine does not interpret the intent of the writer. If I make a spelling or grammatical mistake, another person can still understand the idea; a machine that finds an error simply stops.
Mistakes in writing code are inevitable and we all make them. The important thing is that you are able to detect where the error is, and that task is simpler when you write well-formatted code. For example, look at these two pieces of code:
# Code 1
penguins %>% ggplot(aes(x=bill_length_mm,y=body_mass_g))+geom_point()
# Code 2
penguins %>%
  ggplot(aes(x = bill_length_mm,
             y = body_mass_g)) +
  geom_point()
Both produce the same result, but the format of the second one is easier to read, and if there is an error it will be easier to detect. Code is written for machines, but one of the ideas of the tidyverse is that a human can also read and interpret it.
Here are some tips for writing better code:
- Use the shortcuts:
- CTRL + ALT + I: insert a chunk of code
- CTRL + SHIFT + M: insert the pipe operator %>%
- ALT + - : insert the <- assignment operator
- Select the code and press CTRL + SHIFT + A: autoformat the code
- Select the code and press CTRL + ENTER: run the code
Practice these shortcuts until you are familiar with them. Note: apart from the one that inserts a chunk, these shortcuts are meant to be used inside a code chunk.
So, add a code chunk here and try the rest of the shortcuts.
- Comment all your code
Do your future self a favor and comment your code.
Everything after a # sign is treated as a comment and will not be executed, e.g.
x <- c(1:10) # here is my comment about this line
# I can also write here!
- Capitalization matters
Execute this code and find the errors
# mean(X)
# Mean(x)
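Once you have tried it yourself, the short sketch below shows one way to see why both lines fail. It wraps each call in tryCatch() so that the error message is stored instead of stopping the script. It assumes a fresh session in which no object called X exists; the names msg_object and msg_function are just for this illustration.

```r
x <- c(1:10)

mean(x)  # works: the function name and the object name both match exactly

# `X` (upper case) was never created, so R cannot find the object
msg_object <- tryCatch(mean(X), error = function(e) conditionMessage(e))

# `Mean` (capital M) is not a function in base R
msg_function <- tryCatch(Mean(x), error = function(e) conditionMessage(e))

msg_object
msg_function
```

Reading the two messages side by side shows that R reports a missing object and a missing function differently, which helps to locate the typo.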
Packages
The pacman package makes it easy to install and load other packages.
If it is not installed, run
install.packages("pacman")
pacman::p_load(tidyverse,
               palmerpenguins)
Dataset
Now I will load the penguins dataset into my environment (check the Environment tab in the upper right panel)
data(penguins)
EXPLORE THE DATASET
With the View() command you can open the dataset in a separate tab, as a spreadsheet, but you will not be able to modify it. And that’s a good thing, remember: don’t touch your data!
# View(penguins)
With the dim() command, you can see the DIMensions of the dataset, in rows and columns
dim(penguins)
## [1] 344 8
I can explore the first six rows of the dataset, and this is one of the first things that I do when I have some data
head(penguins)
## # A tibble: 6 x 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## 4 Adelie Torge… NA NA NA NA <NA>
## 5 Adelie Torge… 36.7 19.3 193 3450 fema…
## 6 Adelie Torge… 39.3 20.6 190 3650 male
## # … with 1 more variable: year <int>
and also the last six rows
tail(penguins)
## # A tibble: 6 x 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Chinst… Dream 45.7 17 195 3650 fema…
## 2 Chinst… Dream 55.8 19.8 207 4000 male
## 3 Chinst… Dream 43.5 18.1 202 3400 fema…
## 4 Chinst… Dream 49.6 18.2 193 3775 male
## 5 Chinst… Dream 50.8 19 210 4100 male
## 6 Chinst… Dream 50.2 18.7 198 3775 fema…
## # … with 1 more variable: year <int>
With str(), you can see the STRucture of your dataset (or data frame). Note that it indicates the class of each variable and the levels of each categorical (or factor) variable. You also see the dimensions in the first line.
It tells us that this dataset is a tibble, a special kind of data frame, with 344 rows and 8 columns
str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
str() is super useful. It is one of the first things to run on your data, and something to consult several times during your analysis.
The tidyverse version of str() is the glimpse() command
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,…
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,…
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
## $ sex <fct> male, female, female, NA, female, male, female, mal…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
The difference between str and glimpse is not much. Glimpse adapts the result to the size of the screen.
Try reducing the size of this window and running first str and then glimpse. Compare the results.
To see a summary of all the variables, we have the following command
It shows us measures of central tendency and dispersion for continuous variables and counts for categorical variables. For both, it also shows the number of missing values (NA)
summary(penguins)
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
To see the summary of a particular variable, you can select it with $ in the following way:
dataset$column
for example
summary(penguins$species)
## Adelie Chinstrap Gentoo
## 152 68 124
EXERCISES
Try getting the summary of island
# write your code here
And now the summary of body_mass_g
# write your code here
As you saw in the summary of all variables, R automatically detects the type or class of variable.
To check the class of a variable, we use the command:
class(dataset$variable_to_check)
For example:
class(penguins$year)
## [1] "integer"
EXERCISES
Try finding the classes of other variables
Sometimes a number is used to store a categorical variable, such as assigning a number to each sex. In general, I do not recommend using numbers to store nominal variables; store the variable as it is presented instead, for example male and female rather than 1 and 2.
For example, if we consult the summary of the year variable, we get the following:
summary(penguins$year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2007 2007 2008 2008 2009 2009
R recognizes year as a numerical variable. However, year is really a categorical variable here. Nobody says “I was born in 1997.4”, for example.
To change the type of variable we have to reassign the modified variable.
For example, to change the variable year from number to factor, we have to overwrite the variable year
penguins$year <- as.factor(penguins$year)
Now the year variable should be correctly formatted
summary(penguins$year)
## 2007 2008 2009
## 110 114 120
Now it’s fixed!
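The same conversion can be tried on a small made-up data frame, which makes it easier to see what changes. The object toy and its values are invented for this sketch. It also shows one detail worth knowing: to go back from a factor to numbers, convert to character first, because as.integer() applied directly to a factor returns the internal level codes.

```r
# A tiny made-up data frame, just for illustration
toy <- data.frame(year = c(2007L, 2008L, 2008L, 2009L))

class(toy$year)                    # stored as a number

toy$year <- as.factor(toy$year)    # overwrite the column with a factor version
class(toy$year)
levels(toy$year)                   # the distinct years, now stored as labels
summary(toy$year)                  # counts per year instead of mean and quartiles

# Going back to numbers: convert to character first
as.integer(as.character(toy$year)) # the original years
as.integer(toy$year)               # the level codes, not the years
```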
Exploring the dataset with tables
A simple way to explore a dataset is by means of tables. This is done with the command table(variable1, variable2)
For example, let’s look at the distribution of sex for each species
table(penguins$species,
      penguins$sex)
##
## female male
## Adelie 73 73
## Chinstrap 34 34
## Gentoo 58 61
The table() command shows the first variable in the rows and the second in the columns: table(rows, columns)
For example, if we change the order, we obtain this
table(penguins$sex,
      penguins$species)
##
## Adelie Chinstrap Gentoo
## female 73 34 58
## male 73 34 61
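One thing to keep in mind is that table() leaves out missing values by default, which is why the counts above add up to fewer than 344 penguins. The sketch below uses two short made-up vectors (not the penguins data) to show the default behavior and the useNA argument of table().

```r
# Made-up vectors, just for illustration; one value of sex is missing
species <- c("Adelie", "Adelie", "Gentoo", "Gentoo")
sex     <- c("female", "male", NA, "male")

tab_default <- table(species, sex)                  # the observation with NA is dropped
tab_with_na <- table(species, sex, useNA = "ifany") # adds an <NA> column

tab_default
tab_with_na
```

Comparing the totals of the two tables is a quick way to check how many observations are hidden by missing values.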
EXERCISE
- Try making a table to see the total number of penguins per island.
- Now by island and by year.
- Try changing the order: which one looks better?
- What happens if you make a table of a numerical variable?
CREATING GRAPHS WITH GGPLOT2
The gg in ggplot2 stands for grammar of graphics.
Just as language has rules with which to communicate any idea, graphics have a grammar. If we understand the components of a graph and how the different parts that make up a graph are related, we can create any type of visualization to communicate a result.
Essentially, to make a graphic we need three things:
- the data,
- the mapping of the variables to visual properties of the graph, for example position, size, color, or shape. The mappings are placed within the aes function (where aes stands for aesthetics), and
- a geometric shape that represents the data in the mapping. Geoms are the geometric objects (points, lines, bars, etc.) that can be placed on a graph. They are added using functions that start with geom_.
So the first thing we need is the data
penguins %>% # the data
  ggplot() # means: "hey ggplot, take this data, and wait for instructions"
Secondly, we map the variables of interest
Usually this is done this way:
penguins %>%
  ggplot(
    mapping = # we can omit this line, I wrote it here just for you to see what the mapping refers to
      aes(x = bill_length_mm, # hey ggplot, map these aesthetic variables, in the x axis this variable
          y = body_mass_g)) # and in the y axis this other
Since I personally omit the mapping, I write:
penguins %>%
  ggplot(aes(x = bill_length_mm,
             y = body_mass_g))
and the result is the same.
We can skip other things too, such as the x = and y = argument names, but I don’t recommend it. For example, you can write:
penguins %>%
  ggplot(aes(bill_length_mm, body_mass_g))
Now we should tell ggplot how to draw those x and y variables. In this case, we will choose to draw them as points.
penguins %>%
  ggplot(aes(x = bill_length_mm,
             y = body_mass_g)) +
  geom_point() # this is the geometric object that will draw the mapped variables
Next, we can add more aesthetics.
For example, we can add a color to each point to identify the sex. We add this as an additional aesthetic, see code below
penguins %>%
  drop_na() %>%
  ggplot(aes(x = bill_length_mm,
             y = body_mass_g)) +
  geom_point() +
  aes(color = sex) # this is the additional aesthetic
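Adding aes() as a separate step, as in the code above, and writing the color inside the aes() of ggplot() should give the same plot. Here is a small sketch of both forms. It uses mtcars, a dataset that ships with R, instead of the penguins, and the object names p_separate and p_inline are made up for the illustration.

```r
library(ggplot2)

# Form 1: the color aesthetic is added after the geom
p_separate <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  aes(color = factor(cyl))

# Form 2: the color aesthetic is written together with x and y
p_inline <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

p_separate
p_inline
```

The second form is the one used in the exercises that follow, since it keeps all the mappings in one place.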
EXERCISE
Now it is your turn. We could ask: what does the relationship between bill_depth_mm and flipper_length_mm look like?
# write your code here
There is a separate group of points: could it be due to a particular sex or species? Make a chart to find out!
EXERCISES
geom_point
If you have two continuous variables, geom_point is the preferred option for the graph
Try plotting the bill_length_mm in the x axis vs the bill_depth_mm
# penguins %>%
#   ggplot(aes(x = _____________,
#              y = _____________)) +
#   geom_point()
You can add more aesthetics. For example, if you want to see the previous graph split by a categorical variable such as species or sex, you can map that variable to the color of each point
# penguins %>%
#   ggplot(aes(x = _____________,
#              y = _____________,
#              color = _________)) +
#   geom_point()
We can also add several other aesthetics, such as:
- shape = for discrete or categorical variables, such as sex
- size = for continuous variables, such as body_mass_g
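As a quick illustration of these two aesthetics, the sketch below maps a categorical variable to shape and a continuous one to size. To avoid giving away the exercises, it uses mtcars, a dataset that ships with R, rather than the penguins.

```r
library(ggplot2)

p_shape_size <- ggplot(mtcars,
                       aes(x = wt,
                           y = mpg,
                           shape = factor(am), # categorical: automatic vs manual
                           size = hp)) +       # continuous: horsepower
  geom_point()

p_shape_size
```

Note that am is stored as a number in mtcars, so it is wrapped in factor() before being mapped to shape.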
We can also add more geom layers. For example, we can add a regression line to explore the correlation between the plotted variables.
Try adding geom_smooth(method = "lm") after the last geom and check the results
# penguins %>%
#   ggplot(aes(x = bill_length_mm,
#              y = bill_depth_mm)) +
#   geom_point() +
#   ________________
What can you conclude about the relationship between these two variables?
Now try disaggregating by species.
Hint: mark the species with colors
We can add more variables, depending on their nature.
For example, we can change the size of each point according to some numerical variable, such as the body_mass_g
Try completing this code
# penguins %>%
#   ggplot(aes(x = bill_length_mm,
#              y = bill_depth_mm,
#              color = species,
#              size = ___________)) +
#   geom_point()
Tuning your graph
Themes
You can change the theme of the graph, that is, its visual appearance, by adding a theme layer with a theme_*() command and choosing the theme you prefer
penguins %>%
  ggplot(aes(x = bill_length_mm,
             y = flipper_length_mm,
             color = species)) +
  geom_point() +
  theme_minimal() # this is one example
penguins %>%
  ggplot(aes(x = bill_length_mm,
             y = flipper_length_mm,
             color = species)) +
  geom_point() +
  theme_dark() # now a dark theme
Try others!
For example, try theme_linedraw() or theme_light()
penguins %>%
  ggplot(aes(x = bill_length_mm,
             y = flipper_length_mm,
             color = species)) +
  geom_point() # add a + and your theme after this line
You can also install more themes or create your own theme
Some packages with additional themes are:
ggpubr and ggthemes (I mostly use theme_minimal() and some themes from the ggpubr package)
To install and load ggthemes, use
pacman::p_load(ggthemes)
and then
penguins %>%
  ggplot(aes(x = bill_length_mm,
             y = flipper_length_mm,
             color = species)) +
  geom_point() +
  ggthemes::theme_economist() # here I say: "from the package ggthemes, use theme_economist()"
You can try these themes:
- ggthemes::theme_solarized()
- ggthemes::theme_excel()
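Creating your own theme, as mentioned above, can be as simple as starting from an existing theme and changing a few elements with theme(). The sketch below is one possible way to do it. The function name theme_my_course and the specific choices (legend at the bottom, bold title, no minor grid lines) are made up for this example, and it uses mtcars, which ships with R, instead of the penguins.

```r
library(ggplot2)

# Hypothetical custom theme: theme_minimal() plus a few personal tweaks
theme_my_course <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(legend.position = "bottom",                # move the legend below the plot
          plot.title = element_text(face = "bold"),  # bold title
          panel.grid.minor = element_blank())        # remove minor grid lines
}

p_custom <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(title = "A plot with a custom theme") +
  theme_my_course()

p_custom
```

Wrapping the tweaks in a function means the same look can be reused on every graph by adding a single line.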
Headings and labels
You can add and change the labels with labs()
penguins %>%
  ggplot(aes(x = bill_length_mm,
             y = flipper_length_mm,
             color = species)) +
  geom_point() +
  labs(
    title = "My Title",
    subtitle = "The subtitle",
    x = "the X axis",
    y = "the Y axis",
    color = "Species"
  )
Try to answer these questions:
What is the relationship between penguin body mass and flipper length?
And between flipper length and bill length?
Note
You will find this kind of notation in books and posts, where the data goes inside the ggplot command. For example:
ggplot(data = penguins,
       mapping = aes(x = species,
                     y = body_mass_g,
                     color = sex)) +
  geom_boxplot() +
  theme_minimal()
I prefer to leave the data out of the ggplot command, since it’s easier to perform some data transformation and then plot it.
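As a small illustration of that preference, the sketch below filters and recodes a dataset and sends the result straight into ggplot() in a single pipeline. It uses mtcars, a dataset that ships with R, and the threshold of wt > 3 is arbitrary, chosen only for the example.

```r
library(dplyr)
library(ggplot2)

p_heavy <- mtcars %>%
  filter(wt > 3) %>%            # keep only the heavier cars
  mutate(cyl = factor(cyl)) %>% # treat the number of cylinders as categorical
  ggplot(aes(x = cyl,
             y = mpg)) +
  geom_boxplot() +
  theme_minimal()

p_heavy
```

With the data inside ggplot(), the same transformation would have to be saved in a separate object first, or nested inside the call.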