Data Visualization with ggplot
Projects and File Management
Projects are an easy way to keep everything in one place. Creating a project will create a folder for you to keep all of your code files (scripts, .Rmd files, etc.), your data (Excel files, csv files, etc.), plots you want to save and anything else related to your project.
You can create projects by clicking the blue icon at the top right of RStudio, choosing “New Project”, and then “New Directory”. You’ll then have the chance to name your project and choose where on your computer to save it.
Packages
Packages are collections of external software usually designed around a specific theme (visualization, statistics, etc.). Packages can be installed directly from RStudio using the install.packages()
function. We’ll be using a package called tidyverse
which is actually a collection of packages. Installing tidyverse
would look like this:
install.packages("tidyverse")
Importantly, you only need to install a package once (unless you want to update it). But to use a package you have to load it every time you start a new RStudio session. This is done with the library()
function.
library(tidyverse)
We’ll only be using the readr
and ggplot2
packages but it’s easier to install all of tidyverse
.
Data Import
One common way to import data into R is as csv files. ‘csv’ stands for comma separated values and is an efficient way of storing data as text separated by commas. If you’re working with an excel document you can easily save it as a .csv file. If you already have a lot of excel files that you don’t want to convert, there are packages that will read .xls files as is (see https://readxl.tidyverse.org/).
The read_csv()
function from the readr
package (included in tidyverse
) is the easiest way to read in .csv files. (Note: read.csv()
is the Base R equivalent and will work the same, more or less. But if you’re using tidyverse functions I’d recommend using read_csv()
)
The basic format of read_csv
is to pass it either the file name of your data (saved in the same working directory that RStudio is using), or a url to data stored online, using quotes in both cases. And we assign the data to whatever name we want with <-
Here’s how this might look on your computer with a file inside your project folder.
#read csv file and save to name
mydata <- read_csv("my_csv_file.csv")
Remember that read_csv()
comes from the readr
package which is a part of tidyverse
. To use it you will have to load either readr
or tidyverse
using library()
.
Palmer Penguins
We’ll be using the Palmer Penguin data and we can import it directly from a url.
# load data from online
penguins <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv')
Artwork by @allison_horst
Let’s take a quick look at what the data looks like. We have 8 variables.
head(penguins)
## # A tibble: 6 x 8
## species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Adelie Torge~ 39.1 18.7 181 3750 male
## 2 Adelie Torge~ 39.5 17.4 186 3800 fema~
## 3 Adelie Torge~ 40.3 18 195 3250 fema~
## 4 Adelie Torge~ NA NA NA NA <NA>
## 5 Adelie Torge~ 36.7 19.3 193 3450 fema~
## 6 Adelie Torge~ 39.3 20.6 190 3650 male
## # ... with 1 more variable: year <dbl>
ggplot
ggplot2
is the most popular package for plotting in R. It’s built by the same people who make tidyverse and RStudio so it plays well with other tidyverse packages. It’s almost infinitely customizeable and there are lots of external add-ons to make interesting and vibrant visualizations.
Here is a template of what the code for a typical ggplot will look like:
#ggplot template
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(x, y))
Think of ggplots as being constructed in layers. You start with a blank canvas and then add new layers on top of each other. (The “gg” in ggplot stands for “grammar of graphics” which is short for the concept of a “layered grammar of graphics”.)
Every ggplot will start with with the ggplot()
function. Every one. Every time. Let’s see what ggplot()
gives us by itself.
ggplot()
A blank square. This is our blank canvas that I talked about above. We have the space for a plot but haven’t told ggplot what to put on it yet. Let’s take a look at our template again.
#ggplot template
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(x, y))
After ggplot()
you add layers that are called “geoms”. There’s a geom for just about every type of plot you can think of (e.g. geom_point
, geom_smooth
, geom_boxplot
). Geoms tell ggplot what you want to do with the raw data.
Every new layer of a ggplot is separated by a plus sign +
on the previous line. Be careful not to forget this.
The required arguments for a ggplot are data
and mapping
which tell ggplot which dataframe to use and which variables to use, respectively. When you specify a mapping you’ll also use another function called aes()
. The variables you use will always go inside the aes()
function. I wouldn’t worry about the specifics of aes()
right now. Just remember to always put the variables you’re plotting inside it e.g. aes(x, y)
Scatterplots
Do penguins with long flippers also have long bills? We can use geom_point()
to make a scatterplot. We have two numeric variables that we want to assign to the x- and y-axes and the names of our variables are bill_length_mm
and flipper_length_mm
.
ggplot(data = penguins)+
geom_point(mapping = aes(x = bill_length_mm, y = flipper_length_mm))
Named arguments are optional in R so you don’t have to specify things like data=
and mapping=
(just make sure your arguments are in the right place!).
# this
ggplot(data = penguins)+
geom_point(mapping = aes(x = bill_length_mm, y = flipper_length_mm))
#same as this
ggplot(penguins)+
geom_point(aes(bill_length_mm, flipper_length_mm))
Naming arguments is partially up to personal preference. As you become more familiar with functions you might choose to not include argument names to save typing and make code more concise. If you think it would be more clear for yourself or someone else reading your code, you can include some arguments. In some of the following examples I will include the x
and y
arguments for clarity.
Even better, every geom can take it’s own data and arguments OR you can specify them in your first ggplot()
line. Everything in the ggplot()
function will be inherited by every line below it. (Tip: Don’t forget commas between arguments.)
# this
ggplot(penguins)+
geom_point(aes(x = bill_length_mm, y = flipper_length_mm))
#same as this
ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm))+
geom_point()
What this means is that we can easily add extra geoms that use the same data. What if we want to add a line to our scatter plot? geom_smooth()
adds a line of best fit with standard error.
# adding a line
ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm))+
geom_point()+
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
If you want a linear relationship, you can specify method = "lm"
(short for “linear model”) inside geom_smooth()
ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm))+
geom_point()+
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Adding extra variables
There are features of a ggplot that can be changed to reflect differences between groups such as colour
and shape
. These are arguments that can be set inside of the aes()
function and provided with a categorical variable to group by.
For example, what if we wanted to colour points by species. We can set the colour
argument to be equal to the “species” variable in our data.
# colour by species
ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm, colour = species))+
geom_point()+
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Notice that we also got separate lines for each species. This is because we set colour inside the first ggplot()
line, so the grouping is inherited by all the layers below it. If we wanted to only colour the points, we can move the colour argument to geom_point()
(inside of it’s own aes()
function).
# colour by species
ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm))+
geom_point(aes(colour = species))+
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
We can do the same thing with the shape
argument.
ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm))+
geom_point(aes(colour = species, shape = species))+
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Displaying a third numeric variable
It can be harder to display three numeric variables in one plot, but one option is to vary point size by the third variable using the size
argument.
ggplot(penguins, aes(bill_length_mm, flipper_length_mm))+
geom_point(aes(size = body_mass_g))
This plot is a bit hard to read. It would be helpful if we could see where points are clustering. We can do this by changing how transparent the points are with the alpha
argument. Alpha ranges from 0 to 1 with one being solid.
If you want to change an aspect of all of the points (i.e. not based on another variable), you can set that argument outside of aes()
.
ggplot(penguins, aes(bill_length_mm, flipper_length_mm))+
geom_point(aes(size = body_mass_g), alpha = 0.4)
Themes & Styling
Labels
There are a lot of ways to add labels to your plot but one of the easiest is the labs()
function that is added on a new line of your ggplot code. Let’s use our scatterplot example from earlier.
Inside of labs()
you can specify labels for your x- and y-axes as well as a title and a subtitle if you’d like. To label your legend, you can add an argument with the same name as the aspect you grouped by (e.g. colour, shape, etc.).
Because we used both colour and shape, we’ll have to do some duplication and change the label for both.
ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm))+
geom_point(aes(colour = species, shape = species))+
geom_smooth(method = "lm")+
labs(title = "Bill & Flipper Lengths of the Palmer Penguin Species",
x = "Bill Length", y = "Flipper Length", colour = "Species", shape = "Species")
## `geom_smooth()` using formula 'y ~ x'
Themes
All of the aspects of a ggplot are customizeable including, grid lines, backgrounds, spacing, and axis ticks. While these can all be changed individually, ggplot comes with some built-in themes to make your plots look cleaner. ggplot themes act as their own function (i.e. on a new line after +) and have the format theme_*
. It’s typical to add a theme function as the last line of your ggplot code. One popular theme is theme_bw()
ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm))+
geom_point(aes(colour = species, shape = species))+
geom_smooth(method = "lm")+
labs(title = "Bill & Flipper Lengths of the Palmer Penguin Species",
x = "Bill Length", y = "Flipper Length", colour = "Species", shape = "Species")+
theme_bw()
## `geom_smooth()` using formula 'y ~ x'
A personal favorite of mine is theme_classic()
ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm))+
geom_point(aes(colour = species, shape = species))+
geom_smooth(method = "lm")+
labs(title = "Bill & Flipper Lengths of the Palmer Penguin Species",
x = "Bill Length", y = "Flipper Length", colour = "Species", shape = "Species")+
theme_classic()
## `geom_smooth()` using formula 'y ~ x'
For examples on how to change smaller aspects of a ggplot, see the documentation on the theme()
function https://ggplot2.tidyverse.org/reference/theme.html
Categorical Variables
How many observations are there for each penguin species?
geom_bar
counts the number of observations in your specified variable. Because the counting is done behind the scenes, you only need to provide one variable (on either the x or y axis; depending on the direction you want your bars to go).
If you already have a dataset of the values you want to plot (e.g. counts or means), geom_col()
plots raw values.
ggplot(penguins, aes(species))+
geom_bar()
Adding Variables
Adding additional variables works similar to what we’ve seen before; additional attributes can be specified inside the aes()
function. Importantly, “colour” will only colour the borders of the bars in a bar plot, to colour inside the bars, you’ll want to use “fill” instead.
ggplot(penguins, aes(species, fill = island))+
geom_bar()
By default, groups are stacked in bar plots. If you’d like a separate bar for each group you set the position
argument in geom_bar
to “dodge”.
ggplot(penguins, aes(species, fill = island))+
geom_bar(position = "dodge")
Faceting
We can also create separate plots within plots using ggplot functions like facet_wrap()
and facet_grid()
. Inside of facet_wrap()
you only need to provide the name of the variable you want to facet by after a tilde ~
.
ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm))+
geom_point()+
geom_smooth(method = "lm")+
facet_wrap(~species)+
theme_bw()
## `geom_smooth()` using formula 'y ~ x'
Summary Statistics
There are several geoms that will calculate summary statistics for you, like geom_boxplot()
ggplot(penguins, aes(species, body_mass_g))+
geom_boxplot()
There’s also a corresponding geom_violin()
that gives you a better idea of the distribution of your data.
ggplot(penguins, aes(species, body_mass_g))+
geom_violin()
stat_summary()
In all of the geoms we’ve used so far, there are a lot of functions doing the heavy lifting behind the scenes. In most cases, geoms will cover everything you need. But sometimes we might want more control over how our plot is made. This is where the stat_*
functions come in, primarily, stat_summary()
.
Let’s say we wanted a bar plot of the mean body mass for penguins of each species. There’s no geom that calculates a mean and displays bars. geom_bar()
only calculates counts by default. Luckily, stat_summary()
calculates mean and standard error by default.
ggplot(penguins, aes(species, body_mass_g))+
stat_summary()
## No summary function supplied, defaulting to `mean_se()`
Oops, that’s not a bar plot. There’s one more thing we have to change to get our bar plot. stat_summary()
has a geom
argument that by default is set to “pointrange” which is what we’re seeing above. The geom we want is “bar” so we’ll add an argument inside stat_summary()
. We can also add error bars with another stat_summary()
layer.
ggplot(penguins, aes(species, body_mass_g))+
stat_summary(geom = "bar")+
stat_summary(geom = "errorbar")
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
This plot isn’t that informative about our data. Let’s see if we can do a little better. We can add some transparent points to show how our data are distributed. And we’ll add some transparency to our bars so we can see the points, and add a border around them by setting the size
argument. Also, we’ll add some colour using fill
just for aesthetics.
ggplot(penguins, aes(species, body_mass_g))+
stat_summary(fun = "mean",
geom = "bar",
alpha = 0.5,
color = "black",
size = 0.5,
mapping = aes(fill=species))+
geom_point(alpha = 0.3)
This a bit better but still not as clear as it could be. We still can’t see the distribution of the individual data points that well. We can use a useful function called geom_jitter()
to fix this. geom_jitter()
adds a small amount of random noise to points so you can more easily see where points are clustering. We set how much it “jitters” the data using the height
and width
arguments. geom_jitter()
is also useful for crowded scatter plots.
ggplot(penguins, aes(species, body_mass_g))+
stat_summary(fun = "mean",
geom = "bar",
alpha = 0.5,
color = "black",
size = 0.5,
mapping = aes(fill=species))+
geom_jitter(width = 0.1, height = 0.1, alpha = 0.3)
This is better but because our data have relative high means with little variation, we have all this extra space in the bars. One option would be to zoom in on you plot using something like coord_cartesian()
which lets you set the limits of your axes. In this case, we’re changing the ylim
argument.
ggplot(penguins, aes(species, body_mass_g))+
stat_summary(fun = "mean",
geom = "bar",
alpha = 0.5,
color = "black",
size = 0.5,
mapping = aes(fill=species))+
geom_jitter(width = 0.1, height = 0.1, alpha = 0.3)+
coord_cartesian(ylim = c(2500, 6400))+
theme_bw()+
guides(fill = FALSE)+ #get rid of legend
labs(x = NULL, y = "Body Mass (g)")
This definitely looks better but maybe we could do more to represent the distribution in this data. geom_violin()
can help with this. We’ll add in some mean lines with stat_summary()
and geom “crossbar”.
ggplot(penguins, aes(species, body_mass_g))+
geom_violin(alpha = 0.1, mapping = aes(fill = species), show.legend = F)+
geom_jitter(width = 0.05, height = 0.05, alpha = 0.3)+
stat_summary(fun = "mean", geom = "crossbar", width = 0.3)+ # mean lines
theme_bw()+
labs(x = NULL, y = "Body Mass (g)")
Bonus
One of the benefits of ggplot is it’s infinite customizeability. There are a ton of add-ons that let you make more complex and more effective lots. Here’s a small-scale example using a function called geom_flat_violin()
by Ben Marwick. I source the function directly from a link where the code is hosted online but it can also be found in the PupillometryR
package.
source("https://gist.githubusercontent.com/benmarwick/2a1bb0133ff568cbe28d/raw/fb53bd97121f7f9ce947837ef1a4c65a73bffb3f/geom_flat_violin.R")
ggplot(penguins, aes(species, body_mass_g, fill=species))+
geom_boxplot(width = 0.2, outlier.shape = NA, alpha = 0.8)+
geom_flat_violin(position = position_nudge(x = 0.2, y = 0), alpha = 0.8)+
geom_jitter(width = 0.1, alpha = 0.5, size = 1)+
guides(fill = FALSE, color = FALSE)+ # get rid of legend
theme_classic()+
labs(x = NULL, y = "Body Mass (g)\n")
Plotly
Plotly is a software that works across many platforms and languages and is used to make interactive visualizations. In R there’s a package called plotly
. Mouse over the points in the plot below to see how it works!
ex <- ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm))+
geom_point(aes(colour = species))+
geom_smooth(method = "lm")+
theme_bw()
plotly::ggplotly(ex)
## `geom_smooth()` using formula 'y ~ x'
There are lots of similar add-ons for ggplot. Try gganimate
for animated plots, or the patchwork
package for combining multiple ggplots into one figure.
Resources
From Data to Viz: https://www.data-to-viz.com/
R for Data Science: https://r4ds.had.co.nz/
ggplot Textbook: https://ggplot2-book.org/
Data Sourced from: https://allisonhorst.github.io/palmerpenguins/