Mapping the spread of a migratory bird using Twitter

I like Waxwings: big, brassy, berry-eating flockers! Living on the east coast of Scotland I should have a better than average chance of seeing them, but I seem to have a knack of turning up either just before or just after the birds do.

Luckily we have @WaxwingsUK (https://twitter.com/WaxwingsUK) on Twitter to report the sightings. Having watched the bumper crop of sightings unfold since last autumn, I started to wonder whether it would be possible to visualise the spread of records over time using the info from the @WaxwingsUK timeline.

The following is my attempt. It took more than a few train journeys to work out the details, and it makes use of some excellent R scripts, functions and packages that other folks have put together (acknowledged in the code that follows). To replicate this, you will need a Twitter account and a set of API keys, registered as detailed in the links in the code below.

Stage 1: Download the content of the @WaxwingsUK Twitter timeline from autumn 2016 to the present (2017-03-13).

The first part is fairly straightforward, using the twitteR package. You will need to replace the relevant key strings with the ones that Twitter gives you when you register.

# Fire up the Quattro, sorry, I mean load up the required packages
library(twitteR)
library(lubridate)
library(ggmap)
library(stringi)
library(viridis)
library(animation)

# you also need ImageMagick installed to save animations
# https://www.imagemagick.org/

# do initial setup of Twitter oauth and store in .http-oauth file
# see details at:
# http://bigcomputing.blogspot.co.uk/2016/02/the-twitter-r-package-by-jeff-gentry-is.html

consumer_key <- "your_key"
consumer_secret <- "your_key"
access_token <- "your_key"
access_secret <- "your_key"
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

# extract tweet content
waxwing_tweets <- userTimeline("WaxwingsUK", n=3200)

# convert to a data frame and note the oldest tweet ID, to use as maxID below
tweet.df <- twListToDF(waxwing_tweets)
all_tweets <- tweet.df
last_num <- as.numeric(tweet.df$id[nrow(tweet.df)])

# iterate through timeline, extracting each set of tweets and updating last_num
# needed due to Twitter API limits, see:
# https://dev.twitter.com/rest/public/timelines
repeat{
  waxwing_tweets <- userTimeline("WaxwingsUK", maxID=as.character(last_num), n=3200)
  tweet.df <- twListToDF(waxwing_tweets)
  all_tweets <-rbind.data.frame(all_tweets, tweet.df)
  n_recs<-nrow(all_tweets)
  last_num<-as.numeric(tweet.df$id[nrow(tweet.df)])-1
  
  # we will stop at 1000 tweets as this is more than enough
  if(n_recs>1000){
    break
  }
}

# now need to extract those in 2016 or 17 based on date field (created) which
# is already in POSIXct format, so should be easy using lubridate
all_tweets$Year<-year(all_tweets$created)
all_tweets$Month<-month(all_tweets$created)

# extract all 2017 tweets
records_2017<-all_tweets[all_tweets$Year==2017, ]

# extract 2016 tweets after April, which is the break between seasons
records_2016<-all_tweets[all_tweets$Year==2016 & all_tweets$Month>4, ]

# combine the records together
records<-rbind.data.frame(records_2017, records_2016)

# remove RTs and replies, as their info is not consistently structured
# and so can't be processed easily
rts<-is.na(records$isRetweet)
records<-records[rts!=TRUE, ]
replies<-is.na(records$replyToSN)
records<-records[replies!=FALSE, ]

# just extract location string and date created columns
records<-cbind.data.frame(records$text, records$created)
names(records)<-c("location", "date")

# sort records on date for subsequent plotting
records<- records[order(records$date),] 

Stage 2: Text wrangling to extract the location and date of sightings.

Luckily @WaxwingsUK provides the records of sightings in a fairly consistent format, which makes life easier (nice @WaxwingsUK, keep it up!). R does not have the best string manipulation functionality, but we can do it without too much hair-pulling.
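
To give a feel for the wrangling below, here is roughly how the first two splits behave on a made-up record (the tweet text here is invented purely for illustration; the real format may differ slightly):

# illustrative only: an invented tweet in the usual "date: locations" style
example <- "13/03: 40+ Anytown - High St, Sometown, Othertown"

# splitting on ":" drops the leading date stamp
rest <- strsplit(example, ":")[[1]][2]
rest
# [1] " 40+ Anytown - High St, Sometown, Othertown"

# splitting the remainder on "," gives one element per reported location
strsplit(rest, ",")[[1]]
# [1] " 40+ Anytown - High St" " Sometown" " Othertown"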

# find rows with illegal characters, by conversion to control seq
illegal_chrs<-grepl("[[:cntrl:]]", stringi::stri_enc_toascii(records$location))

# remove these from the dataframe, but keep all fields
records<-records[illegal_chrs!=TRUE, ]

# convert to character
records$location<-as.character(records$location)

# split the tweet text at the : to remove the leading date info
records_locs<-strsplit(records$location, ":")

# bind the records back to a df
records_locs<-do.call(rbind.data.frame, records_locs)

# just keep the address text, convert to character vector
records_locs<-as.character(records_locs[[2]])

# now split for each location, separated by ,
records_locs<-strsplit(records_locs, ",")

# use length of list to determine number of records in each tweet
date_reps<-as.numeric(lapply(records_locs, function(x) length(x)))

# unlist records to a vector
records_locs<-unlist(records_locs)

# repeat dates appropriate times
dates<-rep(records$date, date_reps)

# some locations have more than one road/location, separated by '/'
# Google Maps API does not like these, so just use last part

# define function to split string based on last occurrence of set string
# and return the right hand portion
splitText<-function(x, string){
  pos<-gregexpr(string, x)
  posmax<-pos[[1]]
  # if not -1, i.e. string is found then
  if (posmax[1]!=-1){
    posmax<-posmax[length(posmax)]
    keep<-substr(x, posmax+1, nchar(x))
    return(keep)
  }
  else{
    # if the string is not found then return it unmodified
    return(x)
    }
}
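
# quick sanity check of splitText on a couple of made-up strings
# (illustrative only)
splitText("High Street/Market Square", "/")
# [1] "Market Square"
# returns the string unchanged when the separator is absent
splitText("Market Square", "/")
# [1] "Market Square"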

# convert records_locs to dataframe for apply
records_locs<-as.data.frame(records_locs, stringsAsFactors = FALSE)

# process out all the '/' parts of the locations using splitText function
records_locs<-apply(records_locs, 1, splitText, "/")

# split again based on '+' to remove flock size info
records_locs<-as.data.frame(records_locs, stringsAsFactors = FALSE)

# process out all the '60+ etc' parts of the text
records_locs<-apply(records_locs, 1, splitText, "\\+")

# gsub the merry hell out of the location strings to make them suitable for 
# Google Maps API call

records_locs<-gsub(" -",",", records_locs)
records_locs<-gsub(" \\(", ", ", records_locs)
records_locs<-gsub("\\)", ", UK", records_locs)
records_locs<-gsub("@", "", records_locs)
records_locs<-gsub("_", " ", records_locs)
records_locs<-gsub("/", " ", records_locs)
records_locs<-gsub(" Dr ", " Drive ", records_locs)
records_locs<-gsub(" Rd ", " Road ", records_locs)
records_locs<-gsub(" Ln ", " Lane" , records_locs)
records_locs<-gsub(" Crt ", " Court ", records_locs)
records_locs<-gsub(" St ", " Street ", records_locs)
records_locs<-gsub(" Cl ", " Close ", records_locs)
records_locs<-gsub(" car park", "", records_locs)
records_locs<-gsub(" opp", "", records_locs)
records_locs<-gsub(" private garden", "", records_locs)
records_locs<-gsub(" &amp;", " and", records_locs)

# trim whitespace, now ready for the geocoding stage
records_locs<-trimws(records_locs)

Stage 3: Geocoding the sightings.

This makes use of the geocode function in the excellent ggmap package. Even better, Shane Lynn has written a function that lets you ping the Google Maps API up to the public use limit (2,500 queries per day) and pause until you can do some more. We only have about 1,600 locations, so we can do it in one go.
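
If you want to see how much of the daily allowance is left before kicking off, ggmap provides geocodeQueryCheck(); something along these lines should work with the ggmap version current at the time of writing:

# report how many free Google geocoding queries remain today
geocodeQueryCheck()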

# Function to geocode addresses via Google Maps API, by Shane Lynn
# http://www.shanelynn.ie/massive-geocoding-with-r-and-google-maps/

# define a function that will process googles server responses for us.
getGeoDetails <- function(address){
  # use the geocode function to query Google's servers
  geo_reply = geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
  # now extract the bits that we need from the returned list
  answer <- data.frame(lat=NA, long=NA, accuracy=NA, formatted_address=NA, address_type=NA, status=NA)
  answer$status <- geo_reply$status
  
  # if we are over the query limit - want to pause for an hour
  while(geo_reply$status == "OVER_QUERY_LIMIT"){
    print("OVER QUERY LIMIT - Pausing for 1 hour at:") 
    time <- Sys.time()
    print(as.character(time))
    Sys.sleep(60*60)
    geo_reply = geocode(address, output='all', messaging=TRUE, override_limit=TRUE)
    answer$status <- geo_reply$status
  }
  
  # return NAs if we didn't get a match:
  if (geo_reply$status != "OK"){
    return(answer)
  }
  # else, extract what we need from the Google server reply into a dataframe:
  answer$lat <- geo_reply$results[[1]]$geometry$location$lat
  answer$long <- geo_reply$results[[1]]$geometry$location$lng
  if (length(geo_reply$results[[1]]$types) > 0){
    answer$accuracy <- geo_reply$results[[1]]$types[[1]]
  }
  answer$address_type <- paste(geo_reply$results[[1]]$types, collapse=',')
  answer$formatted_address <- geo_reply$results[[1]]$formatted_address
  
  return(answer)
}

# initialise a dataframe to hold the results
geocoded <- data.frame()
# find out where to start in the address list (if the script was interrupted before):
startindex <- 1
# if a temp file exists - load it up and count the rows!
tempfilename <- 'temp_geocoded.rds'
if (file.exists(tempfilename)){
  print("Found temp file - resuming from index:")
  geocoded <- readRDS(tempfilename)
  startindex <- nrow(geocoded)
  print(startindex)
}

# Start the geocoding process - address by address. geocode() function takes care of query speed limit.
for (ii in seq(startindex, length(records_locs))){
  print(paste("Working on index", ii, "of", length(records_locs)))
  # query the google geocoder - this will pause here if we are over the limit.
  result = getGeoDetails(records_locs[ii]) 
  print(result$status)
  result$index <- ii
  # append the answer to the results file.
  geocoded <- rbind(geocoded, result)
  # save temporary results as we are going along
  saveRDS(geocoded, tempfilename)
}

# geocoded now contains the lat and long (plus other things)
# of the records, in date order, so combine with other info
plotdata<-cbind.data.frame(records_locs, dates, geocoded$long, geocoded$lat)
names(plotdata)<-c("location", "date", "long", "lat")

# remove those records that did not get geocoded (NA)
noloc<-is.na(plotdata$long)
plotdata<-plotdata[noloc!=TRUE, ]

# some geocoding results are not in the UK despite best efforts so
# need to exclude these, using UK lat/long limits (only lower lat value
# needed in this case)
plotdata<-plotdata[plotdata$lat>49, ]

# we have lost 272 records along the way to failed or out-of-range geocoding
# that is not too bad really, still ~1400 left
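
# optional sanity check: how many records were dropped to failed or
# out-of-range geocoding (the exact count will vary from run to run)
length(records_locs) - nrow(plotdata)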

Stage 4: Map and animate the sightings.

Again this is based around the functions of ggmap, plus the animation package; you need to install ImageMagick first as well. The sightings are plotted cumulatively over time, with the symbol colour graded by the date of the record using the plasma scheme from the viridis package, as this shows up nicely against the Google Maps background. The saveVideo function at the end outputs an mp4 called animation.mp4 in the working directory.
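
For reference, plasma() simply returns a vector of hex colour codes running from dark purple through to yellow; a quick look at a handful over the same range used below:

# five example colours from the plasma scale, over the range used for the map
plasma(5, begin=0, end=0.9)
# returns a character vector of five hex colour codes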

# set up base map for plotting, setting centre long/lat
myLocation <- c(-2, 54.5)

# using Google roadmap as the base
waxMap <- get_map(location=myLocation, zoom=5, source="google", maptype="roadmap", crop=FALSE)

# set up colours for points, one per record, based on the viridis plasma scheme
plotdata$colours<-plasma(nrow(plotdata), begin=0, end=0.9)

# set up function to plot the map in sequence, adding one record per frame
plotmaps <- function(){
  for (m in seq(1, nrow(plotdata))){
    points<-plotdata[1:m, ]
    p<-ggmap(waxMap)+
      geom_point(aes(x = long, y = lat), data = points, alpha = 1, color=points$colours, size = 2)+
      scale_x_continuous(limits = c(-11, 4), expand = c(0, 0)) +
      scale_y_continuous(limits = c(49, 61), expand = c(0, 0))
    print(p)
  }
}

# set video options
oopt = ani.options(interval = 0.05, nmax = nrow(plotdata))  

# output video; this will throw a series of warnings over
# changing scales of axes, but it is all good
saveVideo(plotmaps(),interval = 0.05)

We can also produce a static version of the map.

p<-ggmap(waxMap)+
  geom_point(aes(x = long, y = lat), data = plotdata, alpha = 1, color=plotdata$colours, size = 2)+
  scale_x_continuous(limits = c(-11, 4), expand = c(0, 0)) +
  scale_y_continuous(limits = c(49, 61), expand = c(0, 0))
p
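
If you want to keep the static map as an image file, ggsave (available because ggmap loads ggplot2) will write it out; the filename and dimensions here are just examples:

# save the static map to a PNG (width/height in inches)
ggsave("waxwing_static_map.png", plot=p, width=6, height=8, dpi=150)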

Waxwing image CC0 gustavmelin0.
