Friday 16 April 2021

The problem with Purrr (for Biologists)

 

So, it took me forever to complete the DataCamp course on Functional Programming with Purrr.

https://learn.datacamp.com/courses/foundations-of-functional-programming-with-purrr

With a bit of programming background, I’m a fan of functional programming, the aim of which is to avoid copy-paste errors and to let you run the same function on various subsets of data.

I started this course ages ago, and gave up repeatedly, electing to do other courses instead (with easy, standard R code). It should have been an ideal course for me, with famous ornithologist Auriel Fournier and some bird data.

Doing this course made me realise this version of reality:

Venn diagrams of the intersection between programmers and biologists.

In essence, all the top blog posts on purrr are full of praise (this is a product of R code developer Messiah Hadley Wickham). The package implements ‘better’ versions of the ‘apply’ set of base R functions, which are ‘higher order’ functions that really are useful (if you can put in the time to master them, since they require multilevel thinking).

What is ‘better’? Well, I think there may well be two definitions, depending on where you sit on the coder-biologist spectrum. Better for a programmer means ‘readability’ and ‘succinctness’. Better for a biologist also means readability, but with trust in the final answer a lot more important. I.e. things break down at the ‘succinctness’ level, because in a line of piped code a biologist wants to know what happens at each step. A line of piped biologist code will have been built step by step to ensure that each step is doing what it should do. A programmer will weave these all together to achieve succinctness, while it may well make a lot more sense for a biologist to have ‘expanded’ code.
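To make ‘expanded’ versus ‘succinct’ concrete, here is a minimal sketch. It is not from the course: it uses the mtcars data that ships with R, and the object names (heavy_cars, by_cylinder and so on) are made up for illustration.

```r
library(dplyr)

# Expanded: one named object per step, each of which can be inspected
# (View(), head(), nrow()) before moving on to the next step.
heavy_cars  <- filter(mtcars, wt > 3)
by_cylinder <- group_by(heavy_cars, cyl)
expanded    <- summarise(by_cylinder, mean_mpg = mean(mpg))

# Piped: exactly the same steps, but with no intermediate objects to look at.
piped <- mtcars %>%
  filter(wt > 3) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))
```

Both versions should give the same table; the only difference is whether you can stop and check the data halfway through.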

“BOO!” say the programmers.

I say: “That is alright”.

The truth is, for ‘normal’ biologists, it will probably take less time to filter your data in an Excel spreadsheet and apply multiple functions across columns to get what you want compared to debugging your first attempts to code a line using apply().

What are my issues with purrr? Well, I hardly ever use lists, and the double bracket notation is just intimidating. A lack of familiarity with this data structure doesn’t help, although it is seen everywhere in R output (take the output of a GLM, for example). Actually, the simplest use of purrr is the map set of functions (whose variants return a list, a logical, numeric or character vector, or a data frame). These can be used on data frames, pretty much like the apply() family, except you get ‘standardized’ output, which is apparently why it is cool with those that code on a daily basis rather than trudge around swamps (or deserts) with binoculars (i.e. ornithologists).
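For what it is worth, here is a small sketch of those two ideas, again using mtcars rather than anything from the course. The model and the choice of columns are arbitrary assumptions, just there to have a list and a data frame to poke at.

```r
library(purrr)

# A fitted model is a list under the hood.
fit <- glm(mpg ~ wt, data = mtcars)

fit_is_list <- is.list(fit)     # TRUE: the glm object is a list of named elements
coefs <- fit[["coefficients"]]  # [[ ]] pulls ONE element out of the list
same  <- fit$coefficients       # $ is shorthand for the same extraction

# A data frame is also a list (of columns), so map_*() runs once per column.
# The suffix fixes the output type: map_dbl() must return a numeric vector.
col_means <- map_dbl(mtcars[c("mpg", "wt", "hp")], mean)

# sapply() gives the same numbers here, but its return type depends on the
# input; map_dbl() throws an error instead of quietly changing type.
base_means <- sapply(mtcars[c("mpg", "wt", "hp")], mean)
```

That ‘error instead of a surprise’ behaviour is, as far as I can tell, what people mean by standardized output.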

Using purrr will require learning yet another package’s coding style. For instance, ~.x is used … to get stuff? Well, I wish I could give you an honest, simple explanation, but I can’t. Here is an example from the course:

map_chr(sw_films, ~.x[["episode_id"]]) # readable? Not a chance.

map_chr(your_data, ~.x[["column_name"]]) # sort of this: pull "column_name" out of each element, then return the results as a character vector.
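In hindsight, the least confusing way I can put it: ~ starts a throwaway function and .x stands for ‘the element currently being processed’. The sketch below uses a made-up two-film list (films is a toy stand-in for sw_films, with invented values) and writes the same extraction three ways.

```r
library(purrr)

# Toy stand-in for sw_films: a list of films, each film itself a named list.
films <- list(
  list(title = "Film A", episode_id = "4"),
  list(title = "Film B", episode_id = "5")
)

# 1. Ordinary anonymous function: `film` is one element of `films`.
ids_long <- map_chr(films, function(film) film[["episode_id"]])

# 2. Formula shorthand: `~` starts the function, `.x` is the current element.
ids_formula <- map_chr(films, ~ .x[["episode_id"]])

# 3. Name shorthand: a string means "extract the element with this name".
ids_name <- map_chr(films, "episode_id")
```

All three should return the same character vector, which is why the course can switch between ~.x[["title"]] and plain "characters" without warning you.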

Okay, I’m losing you, so let’s take a hectic example from the course.

This is the problem question :

What is the distribution of heights of characters in each of the Star Wars films?

That is simple right? Just make a histogram on a vector (or column with numeric data from a data frame).

Using the sw_films data from the ‘repurrrsive’ package:

library(repurrrsive); library(tidyverse)

data(sw_films)

This is the ‘clue’ code provided. WTF!? Hectic tidyverse vocabulary required. But okay: we just need to fill in the blanks so how hard can it be?

# Turn data into correct dataframe format

film_by_character <- tibble(filmtitle = map____(___, ___)) %>%

    mutate(filmtitle, characters = map(___, ___)) %>%

    unnest()

 

Damn hard. Line 1: When do you use ~.x; should you use [[]] or just “variable_name”, not to mention I initially thought map_df was the appropriate map function here.

Then line 2 – I just piped some data, so that should be available right? So should that be . or .x? Neither… need the data assignment spelled out.  And I presume I’ll use the ~.x [[]] again… nope, just “variable_name” this time.

So this is the solution (well, code that doesn’t return a fatal error, we’ll ignore the new warning from unnest for now. Screw you, code evolution):

film_by_character <- tibble(filmtitle = map_chr(sw_films, ~.x[["title"]])) %>%

  mutate(filmtitle, characters = map(sw_films, "characters")) %>%

  unnest()
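For the biologists: here is the same thing in ‘expanded’ form, one step at a time, on a made-up two-film list (the titles and URLs are invented, and films stands in for sw_films). Naming the column in unnest() is my assumption about what the warning is asking for; it is not what the course code does.

```r
library(purrr)
library(tibble)
library(tidyr)

# Toy stand-in for sw_films; titles and URLs are made up for illustration.
films <- list(
  list(title = "Film A", characters = c("url/1", "url/2")),
  list(title = "Film B", characters = c("url/2", "url/3", "url/4"))
)

# Step 1: one title per film, returned as a character vector.
titles <- map_chr(films, "title")

# Step 2: one vector of character URLs per film, kept as a list.
cast <- map(films, "characters")

# Step 3: a tibble with a list-column (2 rows, one per film).
nested <- tibble(filmtitle = titles, characters = cast)

# Step 4: one row per film-character pair; naming `cols` avoids the warning.
film_by_character <- unnest(nested, cols = characters)
```

Each intermediate object can be printed and checked, which is exactly the step-by-step reassurance the piped version takes away.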

 

Geez. That was step 1. We still need to solve this:

# Pull out elements from sw_people. Create a dataframe with the "height", "mass", "name", and "url" elements from sw_people.

sw_characters <- map____(___, `[`, c(___, ___, ___, ___))

 


Thank heavens the `[` is in there, because you’d never have solved that by yourself. This is the solution.

sw_characters <- map_df(sw_people, `[`, c("height", "mass", "name", "url"))
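What is that `[` doing there? In R, square-bracket subsetting is itself a function, so x[wanted] is just another way of writing `[`(x, wanted). A rough sketch with two invented people (people is a toy stand-in for sw_people; map_df() also needs dplyr installed to bind the rows):

```r
library(purrr)

# Toy stand-in for sw_people; all values are made up.
people <- list(
  list(name = "A", height = "172", mass = "77", url = "url/1", gender = "male"),
  list(name = "B", height = "167", mass = "75", url = "url/2", gender = "n/a")
)

wanted <- c("height", "mass", "name", "url")

# Subsetting with brackets is a call to the function `[`.
one_person <- people[[1]]
same_thing <- identical(one_person[wanted], `[`(one_person, wanted))

# map_df() calls `[`(person, wanted) for each person and row-binds the results.
sw_characters <- map_df(people, `[`, wanted)

# The same extraction written with the ~ .x shorthand.
sw_characters_2 <- map_df(people, ~ .x[wanted])
```

So the backticked bracket is simply a way of handing the subsetting operator to map_df() by name.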

 

 

Step 3: join the data frames. This should be a breeze! I mean, I use the join functions all the time.

But wait, what the hell is this c(“___”) thing? Surely, we could just do a rename on the fly? This screwed me…. I couldn’t do the join. At this stage, was it the join code that was the problem or the initial tibble creation?

# Join the two new objects

character_data <- inner_join(___, ___, by = c("___" = ___)) %>%

   # Make sure the columns are numbers

    mutate(height = as.numeric(height), mass = as.numeric(mass))

 

# My incorrect solution

character_data <- inner_join(sw_characters, film_by_character, by = c("characters" = url)) %>%

    mutate(height = as.numeric(height), mass = as.numeric(mass))

 

# My cheat code to get what they wanted:
character_data <- inner_join(film_by_character, rename(sw_characters, characters=url)) %>%

  mutate(height = as.numeric(height), mass = as.numeric(mass))
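Looking back, I suspect my attempt failed for two reasons: url was not in quotes, and the tables were the wrong way round (the characters column lives in film_by_character, so that table has to come first). Below is a sketch of what I believe the course wanted, using small made-up tables in place of the real ones.

```r
library(dplyr)
library(tibble)

# Toy versions of the two tables; all values are invented.
film_by_character <- tibble(
  filmtitle  = c("Film A", "Film A", "Film B"),
  characters = c("url/1", "url/2", "url/2")
)
sw_characters <- tibble(
  name   = c("A", "B"),
  height = c("172", "167"),
  mass   = c("77", "75"),
  url    = c("url/1", "url/2")
)

# by = c("column in left table" = "column in right table"): both are strings.
character_data <- inner_join(film_by_character, sw_characters,
                             by = c("characters" = "url")) %>%
  # Make sure the columns are numbers
  mutate(height = as.numeric(height), mass = as.numeric(mass))
```

The joined table keeps the left-hand name (characters) for the key column, which is the same result my rename() cheat produces.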

 

Thank G. The last step was to make a faceted ggplot chart, which was actually easy.
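For completeness, a sketch of that last step, assuming a character_data table with filmtitle and height columns. The numbers below are invented, and the bin count is an arbitrary choice, not the course’s.

```r
library(ggplot2)

# Toy stand-in for character_data; heights are made up.
character_data <- data.frame(
  filmtitle = rep(c("Film A", "Film B"), each = 4),
  height    = c(172, 167, 96, 202, 150, 178, 165, 183)
)

# One histogram of heights per film.
p <- ggplot(character_data, aes(x = height)) +
  geom_histogram(bins = 5) +
  facet_wrap(~ filmtitle)
```

Printing p draws one panel per film title.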

My horror: that was the foundational course, so I nearly died when, on completing it, I was recommended to do Intermediate Functional Programming with purrr. Oh … my … G...

I guess I just don't have a purrrsonality that purrrs. Miaow.

 
