---
title: "Introduction to fluxfinder"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to fluxfinder}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The `fluxfinder` package provides functions to parse static-chamber
greenhouse gas measurement files generated by a variety of instruments; 
compute flux rates using multi-observation metadata; and generate diagnostic
metrics and plots. It's designed to be easy to integrate into scientific workflows.

## Load sample data

```{r setup}
library(fluxfinder)

# Data from a LI-7810
f <- system.file("extdata/TG10-01087.data", package = "fluxfinder")
dat <- ffi_read_LI7810(f)

# Note that the fluxfinder read functions print some info after reading
# Set "options(fluxfinder.quiet = TRUE)" to suppress such messages

# Look at a subset of the data; the full data frame has 500+ rows and 25 columns
dat[1:6, 1:9]
```

The data frame returned by `ffi_read_LI7810` is all data from the raw
[LI-7810](https://www.licor.com/env/products/trace-gas/LI-7810) file,
except that `TIMESTAMP`, `TZ` (time zone of the timestamps),
`SN` (serial number), and `MODEL` columns have been added.

The analyzer data is basically a stream of measured 
greenhouse gas concentrations:

```{r overview-plot, fig.width=7}
library(ggplot2)
ggplot(dat, aes(TIMESTAMP, CO2)) + geom_point()
```

## Match with metadata

For these data to be useful, we need to 
associate them with _metadata_ about the measurements: when they were started,
how long they lasted, plot/treatment/collar information, etc. 

```{r metadata}
# Accompanying metadata
md <- system.file("extdata/TG10-01087-metadata.csv", package = "fluxfinder")
metadat <- read.csv(md)

print(metadat)
```

**Important note**: in this sample metadata, our measurement identified is
labeled `Plot`, but this could be named, and refer to, anything: bottle,
sample, collar, etc. It's simply an identifier for _this_ measurement, i.e.
this row.

The `ffi_metadata_match` function matches up the data with metadata,
using the `TIMESTAMP` column that `ffi_read_LI7810` helpfully created
when it read the data file.

```{r matching, fig.width=7}
dat$metadat_row <- ffi_metadata_match(
  data_timestamps = dat$TIMESTAMP,
  start_dates = metadat$Date,
  start_times = metadat$Start_time,
  obs_lengths = metadat$Obs_length + 10) # 10 is expected dead band length

# Note that ffi_metadata_match() warns us that one metadata row didn't match any data

# Based on the row match information, add a "Plot" column to the data
dat$Plot <- metadat$Plot[dat$metadat_row]
metadat$metadat_row <- seq_len(nrow(metadat))

# ...and plot
p <- ggplot(dat, aes(TIMESTAMP, CO2, color = Plot)) + geom_point()
print(p)
```

Some of these are clearly not correct--the measurement time seems to be shorter
then 60 seconds for the C, D, and E plots:

```{r emphasize-problems, fig.width=7, echo=FALSE, message=FALSE}
library(lubridate)
p + annotate("rect", ymin = 455, ymax = 476, 
             xmin = ymd_hms("2022-10-27 10:39:30", tz = "EST"), 
             xmax = ymd_hms("2022-10-27 10:43:30", tz = "EST"), 
             color = "red", fill = NA, linewidth = 1.5)
```

In real life we'd want to correct the faulty metadata at its source.
Here, we'll just change the values programmatically and re-match:

```{r matching2, fig.width=7}
metadat$Obs_length[3:5] <- c(30, 45, 45)
dat$metadat_row <- ffi_metadata_match(
  data_timestamps = dat$TIMESTAMP,
  start_dates = metadat$Date,
  start_times = metadat$Start_time,
  obs_lengths = metadat$Obs_length + 10)
dat$Plot <- metadat$Plot[dat$metadat_row]

p %+% dat
```

That looks better!

## Unit conversion

We'd like our final units to be in µmol/m2/s, and so need to do some unit
conversion. (This can happen either before or after flux computation, below.)
The package provides `ffi_ppm_to_umol()` and `ffi_ppb_to_nmol()` functions
that perform this conversion using the
[Ideal Gas Law](https://en.wikipedia.org/wiki/Ideal_gas_law).

```{r units}
dat$CO2_umol <- ffi_ppm_to_umol(dat$CO2, 
                                volume = 0.1, # m3
                                temp = 24)    # degrees C

# See the message: because we didn't provide the 'atm' parameter, 
# ffi_ppm_to_umol assumed standard pressure.

# Also normalize by ground area (0.16 m2 in this example)
dat$CO2_umol_m2 <- dat$CO2_umol / 0.16
```

Note that in the example above we're using a **constant system volume and
measurement ground area**. If that's not the case, 
there should be a column in the metadata providing the changing values
(e.g. giving volume in m<sup>3</sup>) for each measurement. 
Then after calling `ffi_metadata_match()`, merge the data and metadata and
pass the appropriate column to `ffi_ppm_to_umol()`. Here's an example:

```{r}
# Let's say volume varies by measurement; this can happen if the chamber
# height changes depending on the ground vegetation in each plot
metadat$Volume <- c(0.1, 0.2, 0.1, 0.1, 0.3, 0.1, 0.1)

# Merge the data and metadata
dat_changing_vol <- merge(dat, metadat[c("Plot", "Volume")], by = "Plot", all.x = TRUE)

# Unit conversion as above, but using the changing volume information:
dat_changing_vol$CO2_umol <- ffi_ppm_to_umol(dat_changing_vol$CO2, 
                                             volume = dat_changing_vol$Volume,
                                             temp = 24)
# We still have constant ground area in this example
dat_changing_vol$CO2_umol_m2 <- dat_changing_vol$CO2_umol / 0.16

# Relative to the previous constant-volume example, our area-normalized
# amounts (µmol) have now increased for plots B and E because
# of their larger volumes:
aggregate(CO2_umol_m2 ~ Plot, data = dat, FUN = mean)
aggregate(CO2_umol_m2 ~ Plot, data = dat_changing_vol, FUN = mean)
```

## Compute fluxes

The `ffi_compute_fluxes` function provides a general-purpose tool for
computing fluxes from concentration time series, as well as associated
QA/QC information. It returns statistics for four types of models:
_linear_, _robust linear_, _polynomial_, and _HM81_, an exponential model
derived from diffusion theory, following
[Hutchinson and Mosier (1981)](http://dx.doi.org/10.2136/sssaj1981.03615995004500020017x). 

Model statistics include Akaike information criterion (`AIC`), 
R squared (`r.squared`), standard error of the residuals (`sigma`), 
and model p-value (`p.value`). For the robust linear regression only, a 
logical value `converged` is included; see the documentation for `MASS::rlm()`.

Flux (slope) statistics estimate and std.error;

For the robust linear regression model only, a logical value converged.

```{r compute-fluxes}
fluxes <- ffi_compute_fluxes(dat,
                             group_column = "Plot", 
                             time_column = "TIMESTAMP", 
                             gas_column = "CO2_umol_m2",
                             dead_band = 10) 
#Here we use a constant dead band value, but this can also vary by group; see the documentation


# By default, ffi_compute_fluxes returns a data.frame with one row per
# grouping variable value (i.e., per measurement). The first column is the
# group label; the second is the average value of the `time_column`;
# and the rest of the columns are fit statistics for a linear fit of
# concentration as a function of time, along with information about polynomial
# and robust-linear fits. See ?ffi_compute_fluxes for more details.
names(fluxes)

# For clarity, print out only a subset of the columns 
fluxes[c("Plot", "TIMESTAMP", "lin_r.squared", "lin_flux.estimate", "HM81_flux.estimate")]
```

Note that the `fluxes` extract printed above has one row per `Plot`, the
grouping variable; the mean `TIMESTAMP` of the group; model statistics
such as `lin_r.squared`; and the flux estimate. The final
column, `HM81_flux.estimate` is only numeric (i.e., not `NA`) when the data 
show evidence of a saturating curvature. So in this case we might want to
examine more carefully the data from plots A, C, and F.

Plotting our computed fluxes:

```{r plot-fluxes, fig.width=7}
ggplot(fluxes, aes(Plot, lin_flux.estimate, color = lin_r.squared)) +
  geom_point() +
  geom_linerange(aes(ymin = lin_flux.estimate - lin_flux.std.error,
                     ymax = lin_flux.estimate + lin_flux.std.error)) +
  ylab("CO2 flux (µmol/m2/s)")
```

We might want to check whether the robust-linear slope diverges 
from the linear fit slope, suggesting influential outliers, or whether the
polynomial R<sup>2</sup> is much larger, potentially indicating curvature 
of the observations due to e.g. diffusion limitations.

```{r, figures-side, fig.show="hold", out.width="46%"}
ggplot(fluxes, aes(lin_flux.estimate, rob_flux.estimate, color = Plot)) +
  geom_point() + geom_abline() + theme(legend.position = "none")
ggplot(fluxes, aes(lin_r.squared, poly_r.squared, color = Plot)) +
  geom_point() + geom_abline() + theme(legend.position="none")
```

The plot C (green) data have more scatter, and thus lower R<sup>2</sup> values and
higher uncertainty on the computed flux,
but there's no strong evidence of nonlinearity or outlier problems (although
see note above about the `HM81_estimate` field).

## A clearly nonlinear case

In our experience, static-chamber fluxes are frequently linear
([Forbrich et al. 2010](https://doi.org/10.1016/j.soilbio.2009.12.004))
even though molecular diffusion means that chamber feedbacks will inevitably
lead to curvilinear (saturating) behavior over time
([Pedersen et al. 2010](https://doi.org/10.1111/j.1365-2389.2010.01291.x)).

Still, how does one use `fluxfinder` to diagnose and work with nonlinear data?

Let's use R's built-in `Puromycin` dataset (simply because it exhibits
saturating behavior; it's unrelated to greenhouse gases) as an example:

```{r, fig.width=7}
ggplot(Puromycin, aes(conc, rate)) + geom_point() + geom_smooth(method = "lm")
```

Visually, we can see that a linear model will not be appropriate in this
case. 

```{r}
ffi_compute_fluxes(Puromycin,
                   group_column = NULL,
                   time_column = "conc",
                   gas_column = "rate")
```

```{r include=FALSE}
x <- ffi_compute_fluxes(Puromycin,
                        group_column = NULL,
                        time_column = "conc",
                        gas_column = "rate")
x <- round(x, 3)
```

From the diagnostics returned by `ffi_compute_fluxes`:

* `HM81_flux.estimate` is not `NA`, which only occurs with saturating behavior;
* The `lin_AIC` (`r x$lin_AIC`) and `rob_AIC` (`r x$rob_AIC`) [Akaike information criterion](https://en.wikipedia.org/wiki/Akaike_information_criterion) values are similar, so no indication of influential outliers;
* The `lin_r.squared` (`r x$lin_r.squared`) and `poly_r.squared` (`r x$poly_r.squared`) values are _very_ different, suggesting a failure of the linear model;
* The [root mean square error](https://en.wikipedia.org/wiki/Root_mean_square_deviation) (RMSE) of the linear model is much higher than the other models' values;
* The `HM81_r.squared` (`r x$HM81_r.squared`) and `HM81_AIC` (`r x$HM81_AIC`) are considerably higher and lower, respectively, than the linear model. 

All of these metrics point to a common conclusion: a linear model is _not_
appropriate for these data. If these were real data, we should use the
`HM81_flux.estimate` value as our flux estimate.

## Conclusion

This vignette covered `fluxfinder` basics: loading data and metadata,
matching the two, performing unit conversion, computing fluxes, and some
basic QA/QC. The test data we worked above could be fit well by linear model,
but for many reasons this might not always be true; see the vignette on
[integrating with the gasfluxes package](integrating-gasfluxes.html) for
guidance on using more sophisticated model-fitting routines.