Formatting Rbec output

Author

Shane Hogle

Published

May 24, 2024

Contains both results from hambiNMR experiment and also a test of using “Just A Plate” cleanup method during library preparation.

Setup

Libraries and global variables

library(tidyverse)
library(here)
library(fs)
library(archive)
library(scales)
source(here::here("R", "utils_generic.R"))

Global vars

data_raw <- here::here("_data_raw", "20240513_BTK_illumina_v3")
data <- here::here("data", "20240513_BTK_illumina_v3")
amplicontar <- here::here(data_raw, "rbec_output.tar.gz")

# make processed data directory if it doesn't exist
fs::dir_create(data)

# create temporary location to decompress
tmpdir <- fs::file_temp()

Read data

Untar Rbec output tarball

archive::archive_extract(
  amplicontar,
  dir = tmpdir,
  files = NULL,
  options = character(),
  strip_components = 0L
)

Read and format Rbec tables

# set up directory structure
tabdir <- here::here(tmpdir, "rbec_output")
samppaths <- fs::dir_ls(tabdir)
sampnames <- path_split(samppaths) %>% 
  map_chr(dplyr::last)

Read raw counts tables

straintabs <- paste0(samppaths, "/strain_table.txt") %>% 
  set_names(sampnames) %>% 
  map_df(
  read_tsv,
  skip = 1,
  col_names = c("strainID","count"),
  show_col_types = FALSE, 
  .id = "sample") %>% 
  completecombos(count)

Read normalized counts tables

normstraintabs <- paste0(samppaths, "/strain_table_normalized.txt") %>% 
  set_names(sampnames) %>% 
  map_df(
    read_tsv,
    skip = 1,
    col_names = c("strainID","norm_count"),
    show_col_types = FALSE, 
    .id = "sample") %>% 
  completecombos(norm_count)

Renormalizing to 16S copy number

rf1 <- straintabs %>% 
  group_by(sample) %>% 
  mutate(tot = sum(count)) %>% 
  left_join(normstraintabs) %>% 
  mutate(f = norm_count/sum(norm_count)) %>% 
  mutate(count_norm = round(tot*f)) %>% 
  dplyr::select(-count, -tot, -norm_count, -f) %>% 
  mutate(tot = sum(count_norm),
         f = count_norm/sum(count_norm)) %>%
  ungroup()

Joining with `by = join_by(sample, strainID)`

Read metadata

exp_md <- readr::read_tsv(here::here(data, "20240513_BTK_illumina_v3_metadata.tsv"))

Rows: 456 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): sample, replicate, evolution
dbl (3): plate, day, amp_conc

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Final dataframe

 full <- left_join(rf1, exp_md)

Joining with `by = join_by(sample)`

Analysis

Negative controls

First, we’ll check whether the experimental and plate negative controls look good

full %>% 
  filter(str_detect(sample, "^Neg")) %>% 
  summarize(tot = sum(count_norm), .by = c("sample", "plate", "replicate"))

This looks great. In all cases very few reads are coming off the negative controls. This means it is safe to assume there was no contamination during the library preps. More importantly it looks like the main experiment was clean because we see no growth/contamination in the R2A media blanks.

Positive controls

full %>% 
  filter(str_detect(sample, "^Pos")) %>%
  mutate(total_pos_controls = n_distinct(sample)) %>% 
  group_by(sample) %>% 
  mutate(n_sp_detected = n()) %>% 
  distinct(sample, n_sp_detected, total_pos_controls)

This is also good - we detect all 23 species in the positive controls on each plate

Export

Now we can write these files for later use

# remove positive and negative controls

full %>% 
  filter(!str_detect(sample, "^Pos|^Neg")) %>% 
  readr::write_tsv(here::here(data, "species_counts_md.tsv"))

Clean up

# remove decompressed coverage directory from temp location
fs::dir_delete(tmpdir)