## Summary

A note (or perhaps a cautionary tale) exploring some less-than-ideal #rstats debugging, involving dreams of fully unit-tested code that isn't riddled with bugs, and forgetting to always check the raw data very carefully first. Science isn't always about fancy models.

## What was I doing

[Working on a project](https://github.com/epiforecasts/evaluate-delta-for-forecasting) to evaluate the use of variant of concern (in this case Delta) sequences to inform short-term forecasts of COVID-19. As I have been feeling very fancy recently, I am using a new-to-me approach of having an [R package for the back-end generalisable code](https://github.com/epiforecasts/forecast.vocs) and a [`{targets}`](https://github.com/epiforecasts/evaluate-delta-for-forecasting/blob/main/_targets.R) workflow for the analysis-specific front-end.

## What went wrong

After doing model development on a [test data set from the RKI](https://github.com/epiforecasts/evaluate-delta-for-forecasting/blob/main/data-raw/observations/rki/process-obs.md), I recently moved the workflow onto sequence data extracted from [covariants.org](https://covariants.org) merged with [case data from the JHU via the ECDC forecasting hub](https://github.com/epiforecasts/evaluate-delta-for-forecasting/blob/main/data-raw/observations/covariants.org/process-obs.md), at which point problems began with very long-running models and some fit warnings (working in Stan here, so these were issues with hitting maximum tree depth and divergent transitions).

After watching my fancy `{targets}` workflow for a while, in awe of how cool it was (see [`targets::tar_watch()`](https://docs.ropensci.org/targets/reference/tar_watch.html)), I thought perhaps something was not okay. Pulling out a "successful" fit and visualising it (always a good idea) flagged some very funny looking observations early on in the data. What is happening here?

![[2021-10-01-germany-covid19-cases-and-posterior.png]]

## Checking the truth data

To check if I was just completely misremembering case notifications in Germany, I first pulled down some notification data from the WHO (via `{covidregionaldata}`).

*Note: I only check a single date here. This will come back to haunt me later.*

``` r
library(covidregionaldata)
library(data.table)

germany <- get_national_data("germany")
#> Downloading data from https://covid19.who.int/WHO-COVID-19-global-data.csv
#> Rows: 150732 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): Country_code, Country, WHO_region
#> dbl (4): New_cases, Cumulative_cases, New_deaths, Cumulative_deaths
#> date (1): Date_reported
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Cleaning data
#> Filtering data to: Germany
#> Processing data

# 7-day rolling sum of new cases (current day plus the previous 6 days)
germany <- setDT(germany)[, cases_new := Reduce(`+`, shift(cases_new, 0:6))]
germany[date == as.Date("2021-04-03")]
#>          date un_region who_region country iso_code cases_new cases_total
#> 1: 2021-04-03    Europe       EURO Germany       DE    117965     2873190
#>    deaths_new deaths_total recovered_new recovered_total hosp_new hosp_total
#> 1:        120        76895            NA              NA       NA         NA
#>    tested_new tested_total
#> 1:         NA           NA
```

Yes, quite a mismatch here so definitely a problem somewhere....
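As an aside, the `Reduce()`/`shift()` construction above is just a right-aligned 7-day rolling sum. Here is a minimal sketch, on made-up data rather than the WHO download, showing it is equivalent to `data.table::frollsum()` (which turns up again later in this story):

``` r
library(data.table)

# Made-up example data, purely to illustrate the equivalence
toy <- data.table(
  date = seq(as.Date("2021-03-01"), by = "day", length.out = 10),
  cases_new = c(5, 8, 12, 7, 9, 11, 6, 10, 13, 4)
)

# Sum of the current day and the previous six days, two ways
toy[, sum_shift := Reduce(`+`, shift(cases_new, 0:6))]
toy[, sum_froll := frollsum(cases_new, n = 7)]

# Both give NA for the first six days and identical sums afterwards
all.equal(toy$sum_shift, toy$sum_froll)
#> [1] TRUE
```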
## Debugging data processing

The slightly tricky thing about the data I am using is that it contains not only the latest notifications and sequence data but also all of the previously released sequence and notification data (as these change as new data is reported). If the code to filter for a given forecasting date is not working properly then potentially this could be causing the issue seen above.

Jumping into the `{forecast.vocs}` package to see if there were obvious bugs, I quickly spotted duplicate dates after filtering. This was likely introduced due to a slight difference between my test and actual data. Not having any unit tests for this code, it had completely slipped under the radar. After fixing the issue and adding some tests I was left with a [nice PR adding 7% unit testing coverage](https://github.com/epiforecasts/forecast.vocs/pull/57). There is of course an open question as to whether these are good (or good enough) tests as, for me, this is something of a mystic art.

```r
library(data.table)
library(testthat)
# remotes::install_github("epiforecasts/forecast.vocs")
library(forecast.vocs)

# construct test data with two availability lags for the same observations
dt <- rbind(
  update_obs_availability(latest_obs(germany_covid19_delta_obs), seq_lag = 3),
  update_obs_availability(latest_obs(germany_covid19_delta_obs), seq_lag = 1)
)

test_that("Dates are unique", {
  expect_dates_unique <- function(dt) {
    expect_equal(nrow(dt[, .(n = .N), by = c("date")][n > 1]), 0)
  }
  expect_dates_unique(filter_by_availability(dt))
  expect_dates_unique(filter_by_availability(dt, seq_date = "2021-06-12"))
  expect_dates_unique(filter_by_availability(dt, case_date = "2021-07-01"))
})

test_filter_by_availability <- function(dt, message, tar_date = max(dt$date),
                                        case_date = tar_date,
                                        seq_date = tar_date) {
  test_that(message, {
    fdt <- filter_by_availability(
      dt, date = tar_date, seq_date = seq_date, case_date = case_date
    )
    # Dates are correctly ordered to avoid downstream issues
    expect_true(
      all(fdt[, ordered := date > shift(date)][!is.na(ordered)]$ordered)
    )
    # No data beyond the sequence date is present
    expect_equal(
      nrow(fdt[seq_available > seq_date & !is.na(seq_available)]), 0
    )
    # No data beyond the case date is present
    expect_equal(
      nrow(fdt[cases_available > case_date & !is.na(cases_available)]), 0
    )
    if (case_date > seq_date) {
      # If cases are available after sequences they are present
      expect_true(nrow(fdt[cases_available > seq_date]) > 0)
    }
    # If cases were available before sequences they are still present
    if (nrow(dt[date < tar_date & is.na(seq_available)]) > 0) {
      expect_true(nrow(fdt[date < tar_date & is.na(seq_available)]) > 0)
    }
  })
}

test_filter_by_availability(
  dt,
  message = "Default settings work as expected"
)

test_filter_by_availability(
  dt,
  message = "Default settings work when setting a forecast date",
  tar_date = as.Date("2021-07-24")
)

test_filter_by_availability(
  dt,
  message = "Default settings work when setting a case available date",
  case_date = as.Date("2021-07-24")
)

test_filter_by_availability(
  dt,
  message = "Default settings work when setting a sequence available date",
  seq_date = as.Date("2021-07-24")
)

test_filter_by_availability(
  dt,
  message = "Default settings work when setting all dates",
  case_date = as.Date("2021-07-24"),
  seq_date = as.Date("2021-08-07"),
  tar_date = as.Date("2021-07-31")
)
```
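To make the duplicate-date problem concrete, here is a minimal hypothetical sketch (using made-up columns rather than the real `{forecast.vocs}` data) of how keeping multiple releases of the same observation dates produces duplicates unless the filter also keeps only the most recent release per date:

```r
library(data.table)

# Hypothetical data: two releases ("vintages") of the same observation dates,
# distinguished by when the cases became available
snapshots <- data.table(
  date = rep(seq(as.Date("2021-06-01"), by = "day", length.out = 3), 2),
  cases = c(10, 20, 30, 11, 21, 31),
  cases_available = rep(as.Date(c("2021-06-04", "2021-06-11")), each = 3)
)

# Naive filter on availability alone: both releases survive
naive <- snapshots[cases_available <= as.Date("2021-06-12")]
naive[, .N, by = date]     # every date appears twice (N = 2)

# Also keeping only the most recently available release per date
# restores one row per date
deduped <- naive[order(cases_available), .SD[.N], by = date]
deduped[, .N, by = date]   # back to one row per date (N = 1)
```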
## Checking using the original data source

Switching back to the test RKI data, armed with this new PR, everything looked fine both in terms of the observations and the resulting posterior predictions. Agreement between the WHO and RKI data sources also looks good. It seems [`{forecast.vocs}`](https://github.com/epiforecasts/forecast.vocs) may not have been the source of the issue after all.

```r
# remotes::install_github("epiforecasts/forecast.vocs")
library(forecast.vocs)

options(mc.cores = 4)

obs <- filter_by_availability(
  germany_covid19_delta_obs,
  date = as.Date("2021-07-05")
)
curr_obs <- latest_obs(germany_covid19_delta_obs)

dt <- stan_data(obs, horizon = 4)
model <- load_model(strains = 2)
inits <- stan_inits(dt, strains = 2)

fit <- stan_fit(
  data = dt, model = model, init = inits,
  adapt_delta = 0.99, max_treedepth = 15,
  refresh = 0, show_messages = FALSE
)

posterior <- summarise_posterior(fit)
posterior <- update_voc_label(posterior, "Delta")

plot_cases(posterior, curr_obs, log = TRUE)
```

![[2021-10-01-rki-germany-covid19.png]]

## Looking at the underlying observations

Something I am quite prone to is not carefully checking data. I recommend it but don't do it enough, so here we are. Diving back into the [processing document](https://github.com/epiforecasts/evaluate-delta-for-forecasting/blob/main/data-raw/observations/covariants.org/process-obs.md) for the CoVariants/JHU data and [graphing the data](https://github.com/epiforecasts/evaluate-delta-for-forecasting/blob/7d39803881e5b45e757b731e6b20fc90f926a3f4/data-raw/observations/covariants.org/process-obs.Rmd#L263) for cases in Germany against those from the RKI and WHO sources, we see some fairly stark differences that indicate data ingestion is the key problem.

![[2021-10-01-rki-vs-covariants.png]]

## Fixing the bug

Now that the issue has been localised, it turns out to be incredibly trivial and obvious. The key issue was copying and pasting cleaning code from one script to another without stopping to think that the new data set had data from multiple countries in it, so the rolling 7-day sum was being computed across country boundaries rather than within each country. The code goes from this,

```r
cases[, cases := frollsum(value, n = 7)]
```

to this,

```r
setkey(cases, location_name, date)
cases[, cases := frollsum(value, n = 7), by = c("location_name")]
```

[Here is the commit](https://github.com/epiforecasts/evaluate-delta-for-forecasting/commit/7f17d884e703e9423226b4800431431bbb44edae) fixing the issue. Finally, the first graph I should have made shows us the mismatch has been mostly cleared up. Remaining differences are for another day!

![[2021-10-01-rki-vs-covariants-cases-1.png]]

## Moral of the story?

Check the raw data carefully, check simple issues first, and don't get so overwhelmed with how fancy you are that you forget to do these things. Now back to running `targets::tar_make()` and watching the glory using `targets::tar_watch()`.

![[2021-10-01-targets.png]]

## Tags

[[Blog]] #blog/2021/ #software-engineering #rstats #software-engineering/unit-tests #data-cleaning #stream-of-consciousness