Welcome to SCRuB! This is a software package for in silico removal of contamination from microbial datasets using process controls.
Please report any issues you encounter while using SCRuB on our Issues page, or email gia2105@columbia.edu.
SCRuB is currently available through our GitHub repository and can be installed using devtools:
devtools::install_github("Shenhav-and-Korem-labs/SCRuB")
torch::install_torch()
Additionally, we provide SCRuB as a QIIME2 plugin, which can be installed from a QIIME2 environment using pip:
pip install git+https://github.com/Shenhav-and-Korem-labs/q2-SCRuB.git
These tutorials demonstrate the data format needed to set up SCRuB, how to run SCRuB’s core functions, and how to interpret its results.
Why SCRuB?
Instead of trying to decide whether each taxon is categorically a
contaminant, SCRuB models the composition of each contamination source
(kit contamination, water contamination, etc.). We assume that taxa
present together in a contamination source will be introduced together
to other samples, and in similar proportions as in the contamination
source. Therefore, if a control sample contains multiple bacteria, and a
sample of interest contains only one of them and at high counts, that
bacterium is likely not a contaminant. This allows more accurate and
specific decontamination.
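The intuition in this answer can be illustrated with a small base R sketch. It is a toy calculation, not SCRuB's actual model: the taxon names and counts are made up, and the "largest contamination consistent with the control profile" rule is a simplification used only for illustration.

```r
# Toy illustration only: this is NOT SCRuB's model, just the intuition behind it.
# All taxon names and counts below are made up.
control <- c(taxon_A = 500, taxon_B = 300, taxon_C = 200)  # a process control
sample  <- c(taxon_A = 9000, taxon_B = 30, taxon_C = 20)   # a sample of interest

# Composition of the contamination source, estimated from the control alone.
source_profile <- control / sum(control)

# Contaminants are assumed to arrive together, in the source's proportions.
# The taxon with the smallest sample-to-profile ratio therefore caps how many
# reads in the sample could have come from this source.
max_contam_reads <- min(sample / source_profile)
expected_contam  <- source_profile * max_contam_reads

# Reads that the contamination source cannot account for.
excess <- sample - expected_contam
print(round(excess))
```

Here taxon_A appears in the control, yet almost all of its reads in the sample exceed what the source profile could explain, so a categorical "present in control, therefore contaminant" rule would remove it wrongly.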
Should I use SCRuB only if I believe that there is substantial contamination?
Our results
demonstrate that SCRuB will not erroneously remove taxa if provided with
unrelated controls. We therefore support incorporating SCRuB into your
day-to-day analysis pipeline.
What does SCRuB do with the locations of the samples on the wells?
SCRuB uses these
locations to handle the important and common phenomenon of well-to-well
leakage, in which material from biological samples leaks into controls
during experimental procedures. Using the locations of samples during
processing allows us to detect these cases.
What do I do if I have more than one type of control?
SCRuB supports multiple types of controls and performs serial
decontamination, using a different type of control in each round. You
can specify the order of decontamination yourself; we recommend
decontaminating in the order in which the contaminants were introduced.
We sequenced samples from two different studies on the same batch / plate / sequencing run. Should I run SCRuB just on the data from my study?
One of the key
advantages of SCRuB is that it uses the shared information across all
samples affected by a certain contamination source (e.g., an extraction
batch). We therefore recommend that you supply SCRuB with all
the relevant samples, including ones that are not related to a
particular study - SCRuB uses the information in those samples to
perform better decontamination.
Is there any benefit in providing SCRuB with more than one control sample?
Yes! SCRuB uses each
control sample as an independent realization of the contamination source
it represents. More controls allow us to infer the latent composition of
that source more accurately. We recommend at least 2 controls per
source, although more than two controls would often be appropriate.
Does SCRuB work equally well on relative abundances vs compositional counts?
While SCRuB is
expected to work with relative abundance or subsampled/rarefied counts,
we recommend using raw counts. Relative abundances discard read depth,
so all samples are given the same power to share information; while this
assumption can be reasonable, it deviates slightly from the settings in
which SCRuB was validated.
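To make the point about read depth concrete, here is a minimal base R sketch with made-up counts. It only shows what normalisation does to the input and makes no claim about how SCRuB weighs samples internally.

```r
# Made-up counts: two samples with the same composition but different depth.
counts <- rbind(
  deep    = c(taxon_A = 8000, taxon_B = 2000),
  shallow = c(taxon_A = 80,   taxon_B = 20)
)

# Raw counts retain how many reads support each sample.
read_depth <- rowSums(counts)

# Relative abundances: divide each row by its total, so every row sums to 1.
rel_abund <- counts / rowSums(counts)

# After normalisation the deep and shallow samples look identical, so the
# difference in evidence between them is no longer visible.
identical_profiles <- isTRUE(all.equal(rel_abund["deep", ], rel_abund["shallow", ]))
print(read_depth)
print(identical_profiles)
```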
How should I run SCRuB on a large dataset spanning multiple plates?
SCRuB infers and removes one
contamination source at a time. For every dataset, but particularly
large ones, we recommend that you first consider the experimental design
and the type of process controls collected. Then, run SCRuB separately
for each contamination source. For example, if you have collected and
sequenced empty collection kits, these likely apply to the entire study:
use them to SCRuB the entire dataset, providing well locations for the
plate on which the controls were located. If you have
collected and sequenced negative extraction controls, these likely apply
to a specific plate / sequencing batch. Run SCRuB using these controls
to decontaminate only the relevant plate / sequencing batch; to
accomplish this, the SCRuB function should be called separately, once
for each plate / batch.
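The per-plate pattern described above can be sketched in base R as follows. The helper `run_per_plate`, the `plate` column name, and the toy data are all hypothetical and not part of SCRuB; the `decontaminate` argument stands in for whatever function performs a single-plate run, for example a thin wrapper around the SCRuB function.

```r
# Hypothetical helper (not part of SCRuB): apply a decontamination function
# one plate / batch at a time.
#   counts        samples x taxa matrix
#   metadata      data frame, one row per sample, same row names as `counts`
#   plate_column  name of the metadata column holding the plate / batch label
#   decontaminate function(counts_subset, metadata_subset) doing one run
run_per_plate <- function(counts, metadata, plate_column, decontaminate) {
  stopifnot(identical(rownames(counts), rownames(metadata)))
  # Group sample IDs by plate, then process each group separately.
  plates <- split(rownames(metadata), metadata[[plate_column]])
  lapply(plates, function(ids) {
    decontaminate(counts[ids, , drop = FALSE], metadata[ids, , drop = FALSE])
  })
}

# Toy usage with a stand-in function that returns its input unchanged.
counts <- matrix(1:8, nrow = 4,
                 dimnames = list(paste0("s", 1:4), c("taxon_A", "taxon_B")))
metadata <- data.frame(plate = c("P1", "P1", "P2", "P2"),
                       row.names = paste0("s", 1:4))
per_plate <- run_per_plate(counts, metadata, "plate", function(x, m) x)
print(names(per_plate))
```

Returning a named list keeps each plate's result separate, which makes it straightforward to inspect a single batch before combining the outputs.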
How should I run SCRuB if my samples’ well locations changed across processing stages?
If your plate structure varies across different processing stages
(e.g., extraction vs. amplification) and you have controls relevant to
each, we recommend running the SCRuB function in multiple stages to
account for the different spatial structures. SCRuB should be run
separately for each set of relevant controls, and each sample should be
SCRuB-ed once per contamination layer.
How can I resolve issues setting up SCRuB on my machine?
While we encourage all users to contact us for assistance via our Issues
page or via email, we can offer a few general suggestions to address
some common issues: 1) To accommodate SCRuB’s dependencies, we require
R >= 3.6. If your analysis requires an older version of R, we recommend
maintaining two separate R environments on your machine. 2) A common
issue with running the torch package can be resolved by running the R
command torch::install_torch(). 3) We have implemented unit tests that
successfully deploy SCRuB on five separate machines; upon request, we
will be happy to provide assistance with information specific to your
machine.
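Before reporting a setup problem, it may help to gather the basic facts mentioned above. The sketch below is one way to do so; `check_scrub_setup` is a hypothetical helper and not part of SCRuB. It relies on base R plus `torch::torch_is_installed()`, which is only called when the torch package itself can be loaded.

```r
# Hypothetical helper (not part of SCRuB): summarise the local setup.
check_scrub_setup <- function() {
  has_torch <- requireNamespace("torch", quietly = TRUE)
  list(
    r_version       = as.character(getRversion()),
    # SCRuB's stated requirement is R >= 3.6.
    r_version_ok    = getRversion() >= "3.6.0",
    torch_installed = has_torch,
    # Checks for the backend libraries that torch::install_torch() downloads;
    # skipped (FALSE) when the torch package is not available.
    torch_backend   = has_torch && torch::torch_is_installed(),
    scrub_installed = requireNamespace("SCRuB", quietly = TRUE)
  )
}

setup <- check_scrub_setup()
str(setup)
```

Including this output in an issue report or email gives a quick picture of which of the common problems above might apply.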