Data Visualization
Data visualization is the second most important skill of a data scientist (after data wrangling). By the end of this session, you will be able to use the package ggplot2
to build different data graphics and to craft effective visualizations, which answer questions like these:
Top N Customers. Which customers have the most purchasing power?
Heatmap of pruchasing habits. Which customers prefer which products?
To get to these advanced plots, you need to learn everything from the ggplot2
package:
- Learn the
anatomy
of a ggplot object - Learn the
geometries
of a ggplot object including- Scatter plots (2D Relationships)
- Line plots (Time series)
- Bar / Column plots (category vs. numeric)
- Histograms, Faceted Histograms, Density Plots (Univariate & within-feature distributions)
- Box plots & Violon plots (Distributions by category)
- Text & Label geometries (Adding textual mappings)
- Formatting a ggplot object
- Colors & Color palettes
- Aesthetic feature mappings (color, fill, size)
- Faceted plots (investigate categories)
- Position adjustments
- Scales (Continuous & discrete features)
- Labels & legends
- themes
Theory Input
Think of ggplots like building layers of a cake. Each layer is added on top. Building a plot is a 3-step process:
- Create a canvas defined by mapping to columns in your data.
- Add 1 or more geometries (geoms)
- Add formatting features (scales, themes, facets etc.)
We can specify the different parts of the plot, and combine them together using the +
operator (Note that the +
operator is similar to the %>%
pipe operator but is not interchangeable!. If you want a similar look, you can also use %+%
).
Step 0: Format data
We are working with the olist data. You can read your rds file or recreate the data again. Before we disucss the anatomy of ggplot, we have to prepare the data appropriately and get it in the right format. The key to a good ggplot is knowing how to format the data for a ggplot. Let’s start by visualizing the revenue over the month for the year 2017. So we only need the price and the date column and group and summarize those accordingly.
library(tidyverse)
library(lubridate)
order_items_tbl <- read_csv(file = "00_data/01_e-commerce/01_raw_data/olist_order_items_dataset.csv")
orders_tbl <- read_csv(file = "00_data/01_e-commerce/01_raw_data/olist_orders_dataset.csv")
# 1.0 Anatomy of a ggplot ----
# 1.1 How ggplot works ----
# Step 1: Format data ----
revenue_by_month_tbl <-
# Join tables
left_join(order_items_tbl, orders_tbl) %>%
# Replace . with _
set_names(names(.) %>%
str_replace_all("\\.", "_")) %>%
# Select year and create month column
select(price, order_purchase_timestamp) %>%
filter(year(order_purchase_timestamp) == 2017) %>%
mutate(month = month(order_purchase_timestamp)) %>%
# Group and summarize price
group_by(month) %>%
summarize(revenue = sum(price)) %>%
ungroup()
revenue_by_month_tbl
## # A tibble: 12 x 2
## month revenue
## <dbl> <dbl>
## 1 1 120313.
## 2 2 247303.
## 3 3 374344.
## 4 4 359927.
## 5 5 506071.
## 6 6 433039.
## 7 7 498031.
## 8 8 573972.
## 9 9 624402.
## 10 10 664219.
## 11 11 1010271.
## 12 12 743914.
Now that we have our data formatted, we can begin our ggplot by piping our data into the ggplot()
function. All ggplot2 plots begin with a call to ggplot()
, supplying default data and aesthethic mappings, specified by aes()
. You then add layers, scales, coords and facets with +
or %+%
.
Step 1: Build Canvas
Aesthetic mappings describe how variables/ columns in the data are mapped to visual properties (aesthetics) of geometries (e.g. scatterplot. See next step). They represent something you can see in the final plot. There are all sorts of different mappings:
- position (i.e., on the x and y axes)
- color (“outside” color)
- fill (“inside” color)
- alpha (opacity)
- shape (of points)
- line type
- size
Aesthetic mappings are set with the aes()
function and take properties of the data and use them to influence visual characteristics. All aesthetics for a plot are specified in the aes()
function call in the beginning (in the next section you will see that each geom layer can have its own aes specification). Our plot requires aes mappings for x and y. Each visual characteristic can encode an aspect of the data and be used to convey information. Thus, we can add a mapping for the revenue to a color characteristic as well:
# Step 2: Plot ----
revenue_by_year_tbl %>%
# Canvas
ggplot(aes(x = year, y = revenue, color = revenue))
If you run this, just the canvas in the viewer pane will be created. The canvas is the 1st layer that is just a blank slate with the axes. But any subsequent geoms that we add will utilize those mappings.
Note that using the aes()
function will cause the visual channel to be based on the data specified in the argument. For example, using aes(color = “blue”) won’t cause the geometry’s color to be “blue”, but will instead cause the visual channel to be mapped from the vector c(“blue”) — as if we only had a single type of engine that happened to be called “blue”. This will become more clear in the next steps.
Step 2: Geometries
The 2nd layer generates a visual depiction of the data using geometry types. Geometries
are the fundamental way to represent data in your plot. They are the actual marks we put on a plot and hence determine the plot type: Histrograms, scatter plots, box plots etc. Building on these basics, ggplot2 can be used to build almost any kind of plot you may want. The most obvious distinction between plots is what geometric objects (geoms) they include. Examples include:
- points (
geom_point
, for scatter plots, dot plots, etc) - lines (
geom_line
, for time series, trend lines, etc) - boxplot (
geom_boxplot
, for, well, boxplots!) - … and many more!
Each of these geometries will leverage the aesthetic mappings supplied although the specific visual properties that the data will map to will vary. For example, you can map data to the shape of a geom_point (e.g., if they should be circles or squares), or you can map data to the linetype of a geom_line (e.g., if it is solid or dotted), but not vice versa. Each type of geom accepts only a subset of all aesthetics. Almost all geoms require an x and y mapping at the bare minimum. Refer to the geom help pages to see what mappings each geom accepts (e.g. ?geom_line
). A plot should have at least one geom, but there is no upper limit. You can add a geom to a plot using the +
operator to create complex graphics showing multiple aspects of your data. To get a list of available geometric objects use the code below (or simply type geom_<tab>
in RStudio to see a list of functions starting with geom_):
help.search("geom_", package = "ggplot2")
Now that we know about geometric objects and aesthetic mapping, we’re ready to make our first ggplot: a line with dots. We’ll use combination of geom_line
and geom_plot
to do this, which requires aes mappings for x and y. The color for revenue is optional. We can set the size / thickness of the points / line with the size argument. Additionally, we can insert a trendline based on the dots using geom_smooth()
. The arguments lm
stands for linear regression. With se = FALSE
we remove the display of the standard errors.
revenue_by_month_tbl %>%
# Canvas
ggplot(aes(x = month, y = revenue, color = revenue)) +
# Geometries
geom_line(size = 1) +
geom_point(size = 5) +
geom_smooth(method = "lm", se = FALSE)
As mentioned earlier, if we specify an aesthetic within ggplot()
it will be passed on to each geom that follows. But each geom layer can have its own aes specification by wrapping the attributes in the geoms into aes()
. This will map these variables to other aesthetics e.g. the revenue to the size of the dots geom_point(aes(size = revenue))
. You will see that the size of the dots varies then based on the amount of revenue and we will get another legend. This allows us to only show certain characteristics for that specific layer. If you wish to apply an aesthetic property to an entire geometry, you can set that property as an argument to the geom method, outside of the aes() call: geom_point(color = "blue")
or geom_point(size = 5)
.
In summary variables are mapped to aesthetics with the aes()
function, while fixed visual cues are set outside the aes() call. This sometimes leads to confusion, as in this example:
base_plot +
# not what you want because 2 is not a variable
geom_point(aes(size = 2),
# this is fine -- turns all points red
color = "red")
Step 3: Formatting
Once we have that, we can get into the formatting:
Range of your plot
To expand the range of a plot you can use expand_limit()
. As arguments set y and/or x either to single values or a vector containing the upper and the lower limit (e.g. expand_limit(y = 0)
). If you want to zoom into a certain area use coord_cartesian(ylim = c(ymin, ymax), xlim = c(xmin, xmax))
and set the values accordingly. If you don’t wrap ylim
and xlim
into coord_cartesian
the values out of range will be dropped.
Scales
Aesthetic mapping (i.e., with aes()) only says that a variable should be mapped to an aesthetic. It doesn’t say how that should happen. For example, when mapping a variable to shape with aes(shape = categories) you don’t say what shapes should be used. Similarly, aes(color = revenue) doesn’t say what colors should be used. Describing what colors/shapes/sizes etc. to use is done by modifying the corresponding scale. In ggplot2, scales include:
- position
- color, fill, and alpha
- size
- shape
- linetype
ggplot automatically adds a particular scale for each mapping/ every aestethic to the plot to determine the range of values that the data should map to. This is the same as above with explicit scales:
revenue_by_month_tbl %>%
# Canvas
ggplot(aes(x = month, y = revenue, color = revenue)) +
# Geometries
geom_line(size = 1) +
geom_point(size = 5) +
geom_smooth(method = "lm", se = FALSE) +
# same as above, with explicit scales
scale_x_continuous() +
scale_y_continuous() +
scale_colour_continuous()
Scales are modified with a series of functions using a scale_<aesthetic>_<type>
naming scheme. Try typing scale_<tab>
to see a list of scale modification functions. A continuous scale will handle things like numeric data (where there is a continuous set of numbers), whereas a discrete scale will handle things like categories. While the default scales will work fine, it is possible to explicitly add different scales to replace the defaults. The following arguments are common to most scales in ggplot2:
- name: the first argument specifies the axis or legend title
- limits: the minimum and maximum of the scale
- breaks: the points along the scale where labels should appear
- labels: the text that appear at each break
Specific scale functions may have additional arguments; for example, the scale_color_continuous()
function has arguments low and high for setting the colors at the low and high end of the scale.
Lets do the following formatting:
- Let the y-axis start at 0.
- change the color for revenue to a red-black-gradient (from the default dark-blue to light-blue gradient)
- update the labels to the dollar format and make it to millions (using
scales::dollar_format()
)
revenue_by_month_tbl %>%
# Canvas
ggplot(aes(x = month, y = revenue, color = revenue)) +
# Geometries
geom_line(size = 1) +
geom_point(size = 5) +
geom_smooth(method = "lm", se = FALSE) +
# Formatting
expand_limits(y = 0) +
scale_color_continuous(low = "red", high = "black",
labels = scales::dollar_format(scale = 1/1e6, suffix = "M")) +
scale_y_continuous(labels = scales::dollar_format(scale = 1/1e6, suffix = "M"))
Labels
The title and axis labels can be changed using the labs()
function with title, x and y arguments. Another option is to use the ggtitle(), xlab() and ylab().
Lets do the following formatting:
- Add labels
labs(
title = "Revenue",
subtitle = "Sales are trending up and to the right!",
x = "",
y = "Sales (Millions)",
color = "Rev ($M)",
caption = "What's happening?\nSales numbers showing year-over-year growth."
)
Themes
ggplot comes with several complete themes which control all non-data display in a predfined way. Just add them as another layer. Examples are theme_bw
, theme_light()
, theme_dark()
, theme_minimal()
. See here for the list of the complete themes of ggplot2. There are also multiple other packages, that contain themes (e.g. ggthemes
). Theme elements like the legend can be adjusted with the theme()
function.
Lets do the following formatting:
- Add a theme
- Change the position and the direction of the legend
library(ggthemes)
theme_economist() +
theme(legend.position = "right", legend.direction = "vertical")
If we combine everything:
library(ggthemes)
g <- revenue_by_month_tbl %>%
# Canvas
ggplot(aes(x = month, y = revenue, color = revenue)) +
# Geometries
geom_line(size = 1) +
geom_point(size = 5) +
geom_smooth(method = "lm", se = FALSE) +
# Formatting
expand_limits(y = 0) +
scale_x_continuous(breaks = revenue_by_month_tbl$month,
labels = month(revenue_by_month_tbl$month, label = T)) +
scale_color_continuous(low = "red", high = "black",
labels = scales::dollar_format(scale = 1/1e6, suffix = "M")) +
scale_y_continuous(labels = scales::dollar_format(scale = 1/1e6, suffix = "M")) +
labs(
title = "Revenue (2017)",
subtitle = "Sales are trending up!",
x = "",
y = "Sales (Millions)",
color = "Rev ($M)",
caption = "What's happening?\nSales numbers showing month-over-month growth."
) +
theme_economist() +
theme(legend.position = "right",
legend.direction = "vertical",
axis.text.x = element_text(angle = 45))
g
By running View(g)
you see, that g is basically just a list containing all the information we just provided.
Factors
In the business case we are working with the data structure factors
. R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. Factors are also helpful for reordering character vectors to improve display. The goal of the forcats
package is to provide a suite of tools that solve common problems with factors, including changing the order of levels or the values. forcats
is part of the core tidyverse, so you can load it with library(tidyverse)
or library(forcats)
.
Take a look at the following example:
library(tidyverse)
starwars %>%
filter(!is.na(species)) %>%
count(species, sort = TRUE)
## # A tibble: 37 x 2
## species n
## <chr> <int>
## 1 Human 35
## 2 Droid 6
## 3 Gungan 3
## 4 Kaminoan 2
## 5 Mirialan 2
## 6 Twi'lek 2
## 7 Wookiee 2
## 8 Zabrak 2
## 9 Aleena 1
## 10 Besalisk 1
## # … with 27 more rows
The function fct_lump()
collapses the least/most frequent values of a factor into “other”. A positive value for the argument n
preserves the most common n values. Negative n
preserves the least common -n values. The argument w
accepts an optional numeric vector giving weights for frequency of each value (not level) in f.
starwars %>%
filter(!is.na(species)) %>%
mutate(species = as_factor(species) %>%
fct_lump(n = 3)) %>%
count(species)
## # A tibble: 4 x 2
## species n
## <fct> <int>
## 1 Human 35
## 2 Droid 6
## 3 Gungan 3
## 4 Other 39
Other useful functions are:
fct_reorder():
Reordering a factor by another variable.
f <- factor(c("a", "b", "c", "d"), levels = c("b", "c", "d", "a"))
f
## a b c d
## Levels: b c d a
fct_reorder(f, c(2,3,1,4))
## a b c d
## Levels: c a b d
fct_relevel()
: allows you to move any number of levels to any location.
fct_relevel(f, "a")
## a b c d
## Levels: a b c d
fct_relevel(f, "b", "a")
## a b c d
## Levels: b a c d
# Move to the third position
fct_relevel(f, "a", after = 2)
## a b c d
## Levels: b c a d
# Relevel to the end
fct_relevel(f, "a", after = Inf)
## a b c d
## Levels: b c d a
fct_relevel(f, "a", after = 3)
## a b c d
## Levels: b c d a
For further information, see chapter Factors from R for Data Science.
Business case
Case 1
- Question: How much purchasing power is in top 5 customer cities?
- Goal: Visualize top N customer cities in terms of revenue, include cumulative percentage.
1. Load libraries and data and join them together
# 1.0 Lollipop Chart: Top N Customers ----
library(tidyverse)
library(lubridate)
order_items_tbl <- read_rds("00_data/01_e-commerce/02_wrangled_data/order_items_tbl.rds")
orders_tbl <- read_rds("00_data/01_e-commerce/02_wrangled_data/orders_tbl.rds")
customers_tbl <- read_rds("00_data/01_e-commerce/02_wrangled_data/customers_tbl.rds")
order_lines_tbl <- order_items_tbl %>%
left_join(orders_tbl) %>%
left_join(customers_tbl)
2. Data manipluation
n <- 10
# Data Manipulation
top_customers_tbl <- order_lines_tbl %>%
# Select relevant columns
select(customer_city, price) %>%
# Collapse the least frequent values into “other”
mutate(customer_city = as_factor(customer_city) %>%
fct_lump(n = n, w = price)) %>%
# Group and summarize
group_by(customer_city) %>%
summarize(revenue = sum(price)) %>%
ungroup() %>%
# Reorder the column customer_city by revenue
mutate(customer_city = customer_city %>% fct_reorder(revenue)) %>%
# Place "Other" at the beginning
mutate(customer_city = customer_city %>% fct_relevel("Other", after = 0)) %>%
# Sort by this column
arrange(desc(customer_city)) %>%
# Add Revenue Text
mutate(revenue_text = scales::dollar(revenue, scale = 1e-6, suffix = "M")) %>%
# Add Cumulative Percent
mutate(cum_pct = cumsum(revenue) / sum(revenue)) %>%
mutate(cum_pct_text = scales::percent(cum_pct)) %>%
# Add Rank
mutate(rank = row_number()) %>%
mutate(rank = case_when(
rank == max(rank) ~ NA_integer_,
TRUE ~ rank
)) %>%
# Add Label text
mutate(label_text = str_glue("Rank: {rank}\nRev: {revenue_text}\nCumPct: {cum_pct_text}"))
3. Data visualization
# Data Visualization
top_customers_tbl %>%
# Canvas
ggplot(aes(x = revenue, y = customer_city)) +
# Geometries
geom_segment(aes(xend = 0, yend = customer_city),
color = palette_light()[1],
size = 1) +
geom_point(aes(size = revenue),
color = palette_light()[1]) +
geom_label(aes(label = label_text),
hjust = "inward",
size = 3,
color = palette_light()[1]) +
# Formatting
scale_x_continuous(labels = scales::dollar_format(scale = 1e-6, suffix = "M")) +
labs(
title = str_glue("Top {n} Customers"),
subtitle = str_glue("Start: {year(min(order_lines_tbl$order_purchase_timestamp))}
End: {year(max(order_lines_tbl$order_purchase_timestamp))}"),
x = "Revenue ($M)",
y = "Customer",
caption = str_glue("Top 3 cities contribute
24% of purchasing power.")
) +
theme_minimal() +
theme(
legend.position = "none",
plot.title = element_text(face = "bold"),
plot.caption = element_text(face = "bold.italic")
)
Case 2
- Question: Do specific customers have a purchasing preference?
- Goal: Visualize heatmap of proportion of sales by sub category for the categories fashion & furniture. Since Heatmaps are great for showing details in 3 dimensions, show the results for each customer state.
1. Load libraries and data and join them together
# 2.0 Heatmaps ----
library(tidyverse)
library(tidyquant) # For colors
order_items_tbl <- read_rds("00_data/01_e-commerce/02_wrangled_data/order_items_tbl.rds")
orders_tbl <- read_rds("00_data/01_e-commerce/02_wrangled_data/orders_tbl.rds")
products_tbl <- read_rds("00_data/01_e-commerce/02_wrangled_data/products_tbl.rds")
customers_tbl <- read_rds("00_data/01_e-commerce/02_wrangled_data/customers_tbl.rds")
# Data Manipulation
# Joing together
order_lines_tbl <- order_items_tbl %>%
left_join(orders_tbl) %>%
left_join(customers_tbl) %>%
left_join(products_tbl)
2. Data manipluation
# Select columns and filter categories
pct_sales_by_state_tbl <- order_lines_tbl %>%
select(customer_state, main_category_name, sub_category_name, price) %>%
filter(main_category_name == "fashion" | main_category_name == "moveis") %>%
# Group by category and summarize
group_by(customer_state, main_category_name, sub_category_name) %>%
summarise(total_revenue = sum(price)) %>%
ungroup() %>%
# Group by state and calculate revenue ratio
group_by(customer_state) %>%
mutate(pct = round((total_revenue / sum(total_revenue)), digits = 2)) %>%
ungroup() %>%
# Reverse order of states
mutate(customer_state = as.factor(customer_state) %>% fct_rev()) %>%
mutate(customer_state_num = as.numeric(customer_state))
3. Data visualization
# Data Visualization
pct_sales_by_state_tbl %>%
ggplot(aes(sub_category_name, customer_state)) +
# Geometries
geom_tile(aes(fill = pct)) +
geom_text(aes(label = scales::percent(pct)),
size = 3) +
facet_wrap(~ main_category_name, scales = "free_x") +
# Formatting
scale_fill_gradient(low = "white", high = palette_light()[1]) +
labs(
title = "Heatmap of Purchasing Habits",
x = "Sub Category",
y = "Customer State",
caption = str_glue(
"Customer states that prefer Fashion products:
AMAZONAS (AM)
Customer states that prefer furniture products:
All other states. Customers from Amapá (AP) so not buy fashion products at all")
) +
theme_tq() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none",
plot.title = element_text(face = "bold"),
plot.caption = element_text(face = "bold.italic")
)
Challenge
Create at least 2 plots.
- For the first one use the olist data and create a
violin plot
that shows the price distribution for whatever categories you choose. - Take the covid data from the last session and map the death / cases over the time. Show the trend for the entire world as well as for Germany and the USA (
line plot
). - Optional: Create a
worldmap
and color the countries according to the fatality (total deaths per capita). If it is easier for you, you can do it also just for the states of the USA or any other state (you need to get a different dataset though).