SABAP1 and 2 are two amazing citizen science projects: millions of bird records compiled by birdwatchers and submitted to a central database managed at the University of Cape Town.
Birdwatchers are requested to submit lists after surveying a set area (pentad) for a minimum of 2 hours, ideally covering all habitats within the 8x8 km pentad. These are known as full protocol cards.
The data can be used for many purposes, but my interest is in the information it can provide to support conservation decision making.
For this we need an index of relative abundance: how common is the bird? This may seem like a simple question, but it isn't. There are many problems to overcome when interpreting the data generated by the listing process.
In its simplest form, the basic unit of relative abundance is known as the ‘reporting rate’. This is the proportion of a set of lists that a species appears on. For instance, for a pentad, if 10 lists have been submitted, and a species is recorded on 6 of these, then the reporting rate is 6/10 = 0.6 (or 60%).
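As a minimal sketch of that calculation (the table layout and column names here are my own illustration, not the actual SABAP data format), the reporting rate per pentad could be computed like this:

```python
import pandas as pd

# Toy data: one row per full protocol card, and one row per (card, species) record.
# Column names are illustrative only, not the real SABAP schema.
cards = pd.DataFrame({
    "card_id": range(1, 11),
    "pentad": ["3355_1825"] * 10,       # 10 cards submitted for one pentad
})
records = pd.DataFrame({
    "card_id": [1, 2, 4, 6, 7, 9],      # the species appears on 6 of the 10 cards
    "species": ["African Black Oystercatcher"] * 6,
})

n_cards = cards.groupby("pentad")["card_id"].nunique()
n_with_species = (records.merge(cards, on="card_id")
                         .groupby("pentad")["card_id"].nunique())

reporting_rate = (n_with_species / n_cards).fillna(0)
print(reporting_rate)                   # 0.6, i.e. a 60% reporting rate
```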
So our first challenge: what happens if a pentad has been surveyed only once? Then the reporting rate can only be 0 (the species was not recorded) or 1 (it was). These extremes are not very useful, so the more lists for a pentad the better: the closer we come to a realistic probability that you will record the species should you visit the site. Unfortunately, most of southern Africa is either inaccessible or has few birdwatchers, so much of the region is either not visited at all or has been visited only once.
Here is a chart that illustrates this for SABAP1, conducted from 1987 to 1992, and SABAP2, which was initiated in 2007 and is ongoing. Here, the sampling unit is the Quarter Degree Grid Cell (QDGC), the unit used for SABAP1: each QDGC contains 9 pentads. The x axis (number of cards) is truncated at 50 in each case: some cells have many more cards than this.
Both SABAP1 and SABAP2 have a very large number of QDGCs sampled only once. Even though the sampling effort of SABAP2 is now higher than that of SABAP1, there is still a very large number of cells that have been visited only once (nearly 1500!).
In the paper below I focused on calculating single-index parameters of change to compare how species fared between atlas periods:
Estimating conservation metrics from atlas data: the case of southern African endemic birds
For reporting rate change, I calculate the mean reporting rate across all QDGCs for SABAP2, subtract the mean reporting rate for SABAP1, and then divide that result by the mean reporting rate for SABAP1. A positive number indicates an increase in reporting rate (suggesting increased abundance), while a negative number suggests decreased abundance. For example, if the reporting rate for SABAP2 was 100% but only 50% for SABAP1, then the change is 100% (i.e. (100 − 50) / 50), a doubling in the probability of reporting. Zero means no change. It might seem complicated, but it is the simplest way to look at change: there are many more complicated ways. However, reporting rates make sense to most people, since they are proportions and percentages. Other metrics can get rather esoteric (z-scores, standardized population change, occupancy modelling probability output). And, rather importantly, there is good agreement between these metrics in terms of what they are telling us.
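As a sketch, assuming the per-QDGC reporting rates for each atlas period are already in hand (the function and variable names are mine, not from the paper), the change metric is just:

```python
import numpy as np

def reporting_rate_change(rr_sabap1, rr_sabap2):
    """Percentage change in mean reporting rate between atlas periods.

    Both arguments are arrays of per-QDGC reporting rates, on the same
    scale (proportions or percentages), for the same set of QDGCs.
    """
    mean1 = np.mean(rr_sabap1)
    mean2 = np.mean(rr_sabap2)
    return 100 * (mean2 - mean1) / mean1

# Worked example from the text: SABAP1 = 50%, SABAP2 = 100% -> +100% (a doubling)
print(reporting_rate_change([0.5], [1.0]))
```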
To create summary reporting rate figures, I did not want to use QDGCs that had been surveyed only once, to avoid the bias of cells with reporting rates of only 0 or 1, so I filtered the data to include only QDGCs that were surveyed twice or more in both atlas periods.
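A pandas-style sketch of that filter, assuming a table with one row per QDGC and card counts for each atlas period (the values and column names are placeholders):

```python
import pandas as pd

# One row per QDGC: number of full protocol cards in each atlas period,
# plus the species' reporting rates. Values and column names are made up.
qdgc = pd.DataFrame({
    "qdgc":         ["3318CD", "3322AA", "3325BC", "3418AB"],
    "cards_sabap1": [1, 4, 12, 2],
    "cards_sabap2": [3, 1, 20, 5],
    "rr_sabap1":    [0.00, 0.25, 0.40, 0.50],
    "rr_sabap2":    [1.00, 0.00, 0.55, 0.60],
})

min_cards = 2   # keep only QDGCs surveyed at least twice in both atlas periods
filtered = qdgc[(qdgc["cards_sabap1"] >= min_cards) &
                (qdgc["cards_sabap2"] >= min_cards)]
print(filtered["qdgc"].tolist())   # the QDGCs retained for the summary statistics
```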
But what happens if I had not filtered at all, or if I had filtered more?
Here I illustrate what the difference is in overall reporting rate between atlas periods using different filters, from no filter all the way to using only the subset of QDGCs with 25 or more lists. My study subject of choice is the African Black Oystercatcher, South Africa’s Bird of the Year for 2018. This coastal bird has been recorded in 177 QDGCs over the course of the two atlas projects.
We can see in the above chart that, without applying a filter, the reporting rate change is at its lowest value (11%), while filtering the data so that at least 1 card was atlased during both projects gives the highest value (c. 18%). After that it is fairly stable at around a 15% increase, no matter what the filter.
Similarly, range change shows an odd result for unfiltered data: it is negative (implying a range contraction), but hovers around a 10% increase in range for most other filters. So it looks like filtering is a good idea, getting rid of some of the instability introduced by too many 1s and 0s. A minimum filter would be 2 or more lists for both atlas projects.
But what about sampling bias? This is a big problem with atlas data. Most atlasing is conducted around urban areas, and these are likely not representative of the wider landscape. We also need to deal with spatial autocorrelation, because sites close to each other are unlikely to be independent. So, to get a better idea of how much reporting rate has changed between atlas periods, I appeal to the central limit theorem and sub-sample randomly: 50 QDGCs drawn from the range of the Oyc, repeated 1000 times, calculating a change in reporting rate for each draw. I get an answer with a mean of 16%, plus or minus 9%. So the c. 15% increase obtained from the entire data set wasn't too bad. To put it simply, you are about 15% more likely to see an Oyc now in its range than you were during SABAP1. It isn't a huge increase, but it is on the positive side, which is good news for Oycs.
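Here is a sketch of that sub-sampling procedure. The per-QDGC reporting rates below are simulated stand-ins rather than the real Oystercatcher data, so the printed numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in reporting rates for the 177 QDGCs in the Oystercatcher's range;
# the real analysis would use the observed SABAP1 and SABAP2 values.
rr1 = rng.uniform(0.2, 0.8, size=177)
rr2 = np.clip(rr1 * 1.15 + rng.normal(0, 0.1, size=177), 0, 1)

changes = []
for _ in range(1000):
    idx = rng.choice(177, size=50, replace=False)     # 50 QDGCs drawn at random
    m1, m2 = rr1[idx].mean(), rr2[idx].mean()
    changes.append(100 * (m2 - m1) / m1)              # change for this draw

print(f"mean change: {np.mean(changes):.1f}% +/- {np.std(changes):.1f}%")
```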
For the above random sampling, 50 seems like a good minimum value: lower than this and the standard deviation starts to increase, i.e. we become less sure about where a reasonable answer lies. This is because large numbers are key to this process, and so inference for species with small ranges is a bad idea when it comes to using atlas data.
Let’s get back to that law of large numbers. The law of large numbers is a principle of probability according to which the frequencies of events with the same likelihood of occurrence even out, given enough trials or instances. As the number of samples increases, the actual ratio of outcomes will converge on the theoretical, or expected, ratio of outcomes. In other words, the law of large numbers states that as the sample size becomes very large, the sample mean approaches the population mean.
With the SABAP projects, we have large numbers. Almost certainly, some of the answers we are getting from relative abundance at the pentad level will be wrong, but using them all together we start to satisfy the law of large numbers. Here I illustrate, for a species with a theoretical relative abundance of 65%, how many pentads sampled 4 times each (giving possible counts of 0, 1, 2, 3 or 4 per pentad) we would need before the mean value becomes acceptable enough that we can be confident it represents the ‘real’ value. Note that the confidence intervals are already pretty narrow by about 30 pentads.
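A sketch of that simulation, assuming a true reporting rate of 65% and pentads with exactly 4 checklists each (the exact numbers will differ from the original figure, but the narrowing of the interval with more pentads is the point):

```python
import numpy as np

rng = np.random.default_rng(1)
true_rate = 0.65          # theoretical relative abundance
lists_per_pentad = 4      # each pentad sampled exactly 4 times

for n_pentads in (5, 10, 30, 100, 500):
    # Repeat the whole exercise many times to see how variable the estimate is.
    estimates = [
        rng.binomial(lists_per_pentad, true_rate, size=n_pentads).mean() / lists_per_pentad
        for _ in range(2000)
    ]
    lo, hi = np.percentile(estimates, [2.5, 97.5])
    print(f"{n_pentads:4d} pentads: mean estimate {np.mean(estimates):.3f}, "
          f"95% range {lo:.3f}-{hi:.3f}")
```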
And the last question I will answer in this post is: how does presence/absence data (a binomial distribution of 0s and 1s) eventually produce a value that lies between 0 and 100? Here, the central limit theorem is what matters. It states that as the sample size tends to infinity, the distribution of the sample mean approaches a normal distribution. For our example of pentads with 4 checklists, see how quickly the distribution becomes normal. Here the p value tests for a significant deviation from normality. We get lift off (i.e. normal data) once we start hitting 30 pentads. Certainly then, 30 (pentads or QDGCs) is the minimum that would be required to make inference from atlas data for a population.
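A sketch of that check, using a Shapiro-Wilk test for deviation from normality on the simulated distribution of mean reporting rates (I am assuming the test here; the original figure may have used a different one, and exact p values will depend on the simulation settings):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_rate = 0.65
lists_per_pentad = 4

for n_pentads in (5, 10, 20, 30, 50):
    # Distribution of the mean reporting rate over many repeated samples of pentads.
    means = [
        rng.binomial(lists_per_pentad, true_rate, size=n_pentads).mean() / lists_per_pentad
        for _ in range(500)
    ]
    p = stats.shapiro(means).pvalue
    verdict = "deviates from normal" if p < 0.05 else "approximately normal"
    print(f"{n_pentads:3d} pentads: Shapiro-Wilk p = {p:.3f} ({verdict})")
```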
Don’t forget though: single value conservation metrics are not the whole picture, they are simply something easy to grasp in a complex world. But more about complex modelling another day.