A very intro introduction to dplyr, from RLadies MTL
Last month, on a break during my workday, I decided to google some communities where I could practice and learn to code in R. It had been on my mind for a while, as I’ve had many side project ideas and occasionally get stuck on small errors!
~R Ladies~
I came across R Ladies - a community of women (and other folks) in Montreal that meet once a month to learn something new in R! I’ve struggled with learning R as I now recognize that there are some basics I didn’t (and still don’t) fully understand - objects, when to use tibbles, what packages are associated with tibbles only, among, I’m sure, many other things.
My first meet-up was their November 20th session on dplyr, a package I had worked with before in courses and through google/Stack Overflow solutions. Sadly, instead of knowing it as the powerful data manipulator it is, I only knew that I had to always install it.
I didn’t know what commands were associated with it or why they were grouped.
Data manipulation is the answer.
Here I share the tutorial prepared by R Ladies. In this meet-up, we learned:
- Subset data using select
- Remove observations using filter
- Group data with group_by
- Perform calculations on groups while maintaining the whole dataset with mutate
- Perform calculations on groups making a new grouped dataset with summarize
All of the materials for this session are open source - you can find them here
I installed the package and loaded in the data. In the end I used my local file path as there seemed to be errors with the data loading in from GitHub - a blog for another time.
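For completeness, here is a rough sketch of what my setup looked like (the file names below are placeholders for my local copies of the data, and read.csv is just one way to read them in):
# install.packages("dplyr")   # only needed the first time
library(dplyr)
# Placeholder paths: substitute wherever your local copies of the data live
births_2003_2006 <- read.csv("births_2003_2006.csv")
births_2007_2018 <- read.csv("births_2007_2018.csv")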
Okay - Now onto the exercises! :)
Exercises!
RLadies gave us the following tasks:
1. Bind the two data frames into a single data frame. What are the new dimensions of the data frame?
2. Delete the first column of the data with the “Notes”
3. Remove all observations where Average.LMP.Gestational.Age has missing values
4. Calculate the average birth weight by state using mutate and summarise. How are the results different?
5. What were the states with the 5 highest fertility rates in 2010 (tricky question!)
Exercise #1:
On the first exercise, I already had to ask a question… what is bind?
Using the data provided by RLadies, it was easy to understand, and I picked it up quickly.
The two datasets we loaded in, births_2003_2006 and births_2007_2018, represent data from the years listed in the names, 2003 to 2006 and 2007 to 2018.
You can imagine a dataset where there are many observations (i.e. births) for those years, as well as other data on things such as State of the birth and birth weight.
Example of the dataset births_2003_2006:
Birth | Year | Birth Weight |
---|---|---|
1 | 2003 | 7 |
2 | 2004 | 6.75 |
3 | 2005 | 7.25 |
The dataset has been broken up into two sets of years; binding puts them back together by adding all of the observations into one dataset. In the below example, we can imagine that births 4 and 5 were originally their own dataset, but using bind we added them onto the end of the first example dataset above.
Example of re-joined datasets:
Birth | Year | Birth Weight |
---|---|---|
1 | 2003 | 7 |
2 | 2004 | 6.75 |
3 | 2005 | 7.25 |
4 | 2018 | 7.5 |
5 | 2018 | 7 |
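Just to make the idea concrete, here is a tiny sketch with made-up tibbles that mirror the example tables above (the names births_a and births_b are mine, not from the exercise):
library(dplyr)
births_a <- tibble(Birth = 1:3, Year = 2003:2005, Birth.Weight = c(7, 6.75, 7.25))
births_b <- tibble(Birth = 4:5, Year = c(2018, 2018), Birth.Weight = c(7.5, 7))
bind_rows(births_a, births_b)   # stacks the rows of births_b underneath births_a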
To bind the two data frames, I first looked at their column names. If the names match, I can then start checking whether the matching columns represent the same data and can therefore be bound together.
names(births_2007_2018)
## [1] "Notes" "State"
## [3] "Year" "Mother.s.Bridged.Race"
## [5] "Births" "Birth.Rate"
## [7] "Fertility.Rate" "Average.Birth.Weight"
## [9] "Average.LMP.Gestational.Age"
names(births_2003_2006)
## [1] "Notes" "State"
## [3] "Year" "Mother.s.Bridged.Race"
## [5] "Births" "Birth.Rate"
## [7] "Fertility.Rate" "Average.Birth.Weight"
## [9] "Average.LMP.Gestational.Age"
They have the same column headers! I should have also confirmed that the data in each column are in the same format and can be bound, but for the purpose of this blog I am just skipping ahead to binding the datasets.
To bind the datasets, you use the command bind_rows followed by the two datasets that you want to bind together. I name this dataset births_2003_2018, as it now represents births from 2003 to 2018.
births_2003_2018 <- bind_rows(births_2003_2006, births_2007_2018)
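To answer the question about the new dimensions, a quick check could look like this (I am not printing the numbers here, since they depend on the exact files you load):
dim(births_2003_2018)                               # rows and columns of the bound data
nrow(births_2003_2006) + nrow(births_2007_2018)     # the row counts should add up to the total above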
Exercise #2:
For the next exercise, I removed a column from the dataset. To do this, I used the select command. The select command keeps only the columns you want to work with: you name every column you want to keep.
In this blog, we want to work with all of the data except the column “Notes”.
One option would be to use select and name every column except “Notes”. However, I thought another way might be easier…
Therefore, I had a question on the second exercise too…
Instead of naming all of the columns, select also has a trick that works in the negative: prefixing with - drops columns instead of keeping them. I used -c() to indicate that I wanted to remove only the column “Notes”.
births_2003_2018 <- select(births_2003_2018, -c("Notes"))
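For reference, here are two equivalent ways I could have written the same thing (left as comments, since the line above has already dropped the column):
# select(births_2003_2018, -Notes)   # for a single column, the c() isn't required
# select(births_2003_2018, State, Year, Mother.s.Bridged.Race, Births, Birth.Rate,
#        Fertility.Rate, Average.Birth.Weight, Average.LMP.Gestational.Age)   # or name every column to keep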
Exercise #3:
The next task was to remove observations that have missing data for the column “Average.LMP.Gestational.Age”.
The task here is to remove observations instead of columns. Observations are the rows in the dataset that represent the individual birth cases.
Where select works with columns, filter works with observations.
To remove missing data, I used the function is.na(column name).
For manipulations that don’t involve missing values, comparison operators are more common. For example, to keep only observations where a column value is greater than 6, I would have used “>” or “<”.
For missing values, however, you must use is.na(), which keeps only the missing values, or !is.na(), which keeps only non-missing values. The latter is what I use below.
If you forget, I’ve been using the dplyr cheat sheet to remember these little details!
For this exercise, I also renamed the dataset births_no_na. My naming and renaming is mostly arbitrary here; I simply felt that removing observations was worth a dataset renaming!
births_no_na <- filter(births_2003_2018, !is.na(Average.LMP.Gestational.Age))
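As an aside, a comparison-operator version of filter would look something like this (a made-up example using the “greater than 6” idea from above, not part of the exercise; heavier_births is just my own name):
heavier_births <- filter(births_2003_2018, Average.Birth.Weight > 6)   # keeps only rows with an average birth weight above 6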
Exercise #4:
The next exercise involves a few new things. In this example I used piping. Piping, to me, means chaining many different data manipulations, or commands, together to create a new dataset.
Below, I use group_by and mutate in the same pipeline to create the new dataset births_by_state.
Now what does group_by and mutate mean?
group_by groups the data so that all future operations, or calculations, will be done by the specified groups. Here, I want to know information by state. I want to know the average birth weight or the average fertility rate per state. Another potential group could be “Year”, which makes sense if I want to know information by year.
But we haven’t learned to calculate things yet!
Good news, that is what the other command, mutate, does. Mutate adds new columns, or variables, based on calculations from other columns.
In this case, where we have the data grouped, it will maintain the entire dataset and make a new column with state-level averages.
If I just wanted a table of state averages, I would use summarize instead of mutate.
In the below example, the line of code then calculates the average birth weight by state.
births_by_state <- group_by(births_no_na, State) %>%
mutate(mean_wt = mean(Average.Birth.Weight))
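For the summarise half of the exercise, the same grouping with summarize would collapse the data to one row per state instead of keeping every row (state_avg_wt is just my own name for the result):
state_avg_wt <- group_by(births_no_na, State) %>%
  summarize(mean_wt = mean(Average.Birth.Weight))   # one row per state, with its average birth weight
So the difference is exactly what the question hints at: births_by_state keeps all of the original rows plus a new mean_wt column, while state_avg_wt is a smaller table with only one row per state.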
Okay the last stretch!
Exercise #5:
Exercise 5 involves multiple steps and many of the above commands. I broke down the question into the following pieces.
First, I needed to remove all observations except those in 2010. Second, I needed to group the data by state. Third, I needed to rank the data by fertility rate.
To remove observations, I used filter!
fer_rate_2010 <- filter(births_no_na, Year==2010)
rank_fer_rate <- group_by(fer_rate_2010, State) %>%
  summarize(mean_fer_rate = mean(Fertility.Rate))
#View(rank_fer_rate)
Instead of using a command to rank the data, I opened the dataset with View and ranked it manually. Maybe there will be another blog in the future where I learn to do some command ranking! For now, oh well, manual was faster.
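For future reference (and maybe that future blog), one possible way to do the ranking with commands could be something like this, though it is not what I did at the meet-up:
top_5 <- arrange(rank_fer_rate, desc(mean_fer_rate)) %>%   # sort states from highest to lowest mean fertility rate
  head(5)                                                  # then keep the top five rows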
And that’s all! The next RLadies session is in January 2020 and I will prepare another blog to summarize what I learned.
Thank you and happy new year :) ~Meg
Resources:
Here are some of the resources I used throughout the blog to help with small reminders and little errors:
<3 Megan