Welcome to SCRuB! This is a software package for in silico removal of contamination from microbial datasets using process controls.

Please report any issues you encounter while using SCRuB on our Issues page, or email gia2105@columbia.edu.


SCRuB is currently available through our GitHub repository, and can be installed using devtools:


Additionally, we provide SCRuB as a QIIME2 plugin, which can be installed from a QIIME2 environment using pip:

pip install git+https://github.com/Shenhav-and-Korem-labs/q2-SCRuB.git


Getting started

To start using SCRuB, follow one of the tutorials below, according to your preference:
R tutorial
QIIME2 tutorial

These tutorials demonstrate the data format SCRuB requires, how to run SCRuB’s core functions, and how to interpret the results.



Why SCRuB?
Instead of trying to decide whether a taxon is categorically a contaminant, SCRuB models the composition of each contamination source (kit contamination, water contamination, etc.). We assume that taxa present together in a contamination source will be introduced together into other samples, in proportions similar to those in the contamination source. Therefore, if a control sample contains multiple bacteria, and a sample of interest contains only one of them at high counts, that bacterium is likely not a contaminant. This allows more accurate and more specific decontamination.
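The intuition can be illustrated with a toy calculation. The sketch below is not SCRuB's model. It is a simplified illustration with made-up counts and a hypothetical helper, `max_explained_by_source`, that asks how many reads of each taxon a source could account for if it contributed taxa in exactly the proportions seen in the control.

```python
def max_explained_by_source(sample, source_profile):
    """Upper bound on reads per taxon that one source can explain.

    Assumes the source adds taxa in fixed proportions, so its total
    contribution is capped by the taxon that is scarcest in the sample
    relative to its share in the source.
    """
    # Largest total contribution T such that T * share <= observed count
    # for every taxon present in the source.
    total = min(
        sample.get(taxon, 0) / share
        for taxon, share in source_profile.items()
        if share > 0
    )
    return {taxon: total * share for taxon, share in source_profile.items()}


# Made-up numbers for illustration only.
control_profile = {"taxon_A": 0.5, "taxon_B": 0.3, "taxon_C": 0.2}
sample_counts = {"taxon_A": 9000, "taxon_B": 0, "taxon_C": 0}

explained = max_explained_by_source(sample_counts, control_profile)
unexplained_A = sample_counts["taxon_A"] - explained["taxon_A"]
```

Because taxon_B and taxon_C are absent from the sample, a source with this profile cannot account for any of the taxon_A reads under the fixed-proportion assumption, so all of them remain unexplained by contamination.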

Should I use SCRuB only if I believe that there is substantial contamination?
Our results demonstrate that SCRuB will not erroneously remove taxa if provided with unrelated controls. We therefore support incorporating SCRuB into your day-to-day analysis pipeline.

What does SCRuB do with the well locations of the samples?
SCRuB uses these locations to handle the important and common phenomenon of well-to-well leakage, in which material from biological samples leaks into controls during experimental procedures. Using the locations of samples during processing allows us to detect these cases.
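To make "location" concrete: well labels such as `A1` or `C12` map to row and column coordinates, from which the distance between a control and any sample on the same plate can be computed. The sketch below is an illustration only. The helper names and the adjacency cutoff are assumptions, not SCRuB's internals.

```python
import math
import re


def well_to_coords(well):
    """Convert a label like 'B3' into zero-based (row, column) coordinates."""
    match = re.fullmatch(r"([A-Za-z])(\d+)", well.strip())
    if match is None:
        raise ValueError(f"unrecognized well label: {well!r}")
    row = ord(match.group(1).upper()) - ord("A")
    col = int(match.group(2)) - 1
    return row, col


def well_distance(well_a, well_b):
    """Euclidean distance between two wells, in units of well spacing."""
    (r1, c1), (r2, c2) = well_to_coords(well_a), well_to_coords(well_b)
    return math.hypot(r1 - r2, c1 - c2)


def neighbors(control_well, sample_wells, cutoff=1.5):
    """Samples within `cutoff` of the control (the cutoff is an illustrative choice)."""
    return [w for w in sample_wells if well_distance(control_well, w) <= cutoff]


nearby = neighbors("B2", ["A1", "A2", "B3", "D4", "H12"])
```

With a cutoff of 1.5, wells that are horizontally, vertically, or diagonally adjacent to the control count as neighbors, so here `nearby` contains A1, A2, and B3.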

What do I do if I have more than one type of control?
SCRuB supports multiple types of controls and performs serial decontamination, using a different type of control at each step. You can specify the order of decontamination yourself; we recommend decontaminating in the order in which the contaminants were introduced.

We sequenced samples from two different studies on the same batch / plate / sequencing run. Should I run SCRuB just on the data from my study?
One of the key advantages of SCRuB is that it uses the shared information across all samples affected by a certain contamination source (e.g., an extraction batch). We therefore recommend that you supply SCRuB with all the relevant samples, including ones that are not related to a particular study; SCRuB uses the information in those samples to perform better decontamination.

Is there any benefit in providing SCRuB with more than one control sample?
Yes! SCRuB uses each control sample as an independent realization of the contamination source it represents. More samples allow us to infer the latent composition of these sources more accurately. We recommend at least two controls per source, although more than two is often appropriate.

Does SCRuB work equally well on relative abundances vs compositional counts?
While SCRuB is expected to work with relative abundances or subsampled/rarefied counts, we recommend using raw counts. Using relative abundances gives all samples the same power to share information; while this assumption can be reasonable, it deviates slightly from the settings in which SCRuB was validated.
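The difference is easy to see numerically. In the sketch below (made-up counts, plain Python, not SCRuB code), two samples with very different sequencing depths become indistinguishable once converted to relative abundances. Raw counts preserve that depth information.

```python
def to_relative_abundance(counts):
    """Normalize a taxon -> count mapping so the values sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {taxon: n / total for taxon, n in counts.items()}


# Made-up samples: same composition, 100-fold difference in depth.
deep_sample = {"taxon_A": 8000, "taxon_B": 2000}
shallow_sample = {"taxon_A": 80, "taxon_B": 20}

deep_rel = to_relative_abundance(deep_sample)
shallow_rel = to_relative_abundance(shallow_sample)

# After normalizing, nothing distinguishes the 10,000-read sample
# from the 100-read sample.
same_after_normalizing = deep_rel == shallow_rel
```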

How should I run SCRuB on a large dataset spanning multiple plates?
SCRuB infers and removes one contamination source at a time. For every dataset, but particularly large ones, we recommend that you first consider the experimental design and the types of process controls collected. Then, run SCRuB separately for each contamination source. For example, if you collected and sequenced empty collection kits, these likely apply to the entire study: use them to SCRuB the entire dataset, providing well locations for the plate on which these controls were located. If you collected and sequenced negative extraction controls, these likely apply to a specific plate / sequencing batch. Use these controls to decontaminate only the relevant plate / sequencing batch; to accomplish this, call the SCRuB function separately, once for each plate / batch.
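The per-plate workflow amounts to grouping samples by plate and decontaminating each group with its own controls. The sketch below shows only that bookkeeping: `decontaminate` is a hypothetical stand-in for a call to SCRuB, and the record fields (`plate`, `is_control`) are assumed names chosen for this example.

```python
from collections import defaultdict


def split_by_plate(records):
    """Group sample records by their 'plate' field."""
    plates = defaultdict(list)
    for record in records:
        plates[record["plate"]].append(record)
    return dict(plates)


def run_per_plate(records, decontaminate):
    """Call `decontaminate(samples, controls)` once per plate.

    `decontaminate` is a placeholder for the actual decontamination call.
    """
    results = {}
    for plate, group in split_by_plate(records).items():
        controls = [r for r in group if r["is_control"]]
        samples = [r for r in group if not r["is_control"]]
        if not controls:
            raise ValueError(f"plate {plate!r} has no controls")
        results[plate] = decontaminate(samples, controls)
    return results


# Made-up metadata; the stand-in function just reports group sizes.
records = [
    {"id": "s1", "plate": "P1", "is_control": False},
    {"id": "s2", "plate": "P1", "is_control": False},
    {"id": "c1", "plate": "P1", "is_control": True},
    {"id": "s3", "plate": "P2", "is_control": False},
    {"id": "c2", "plate": "P2", "is_control": True},
]
summary = run_per_plate(records, lambda s, c: (len(s), len(c)))
```

Raising an error for a plate without controls is a design choice in this sketch: it makes a missing plate-specific control visible instead of silently skipping that plate.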

How should I run SCRuB if my samples’ well locations changed across processing stages?
If your plate structure varies across processing stages (e.g., extraction vs. amplification) and you have controls relevant to each, we recommend running the SCRuB function in multiple stages to account for the different spatial structures. SCRuB should be run separately for each set of relevant controls, and each sample should be SCRuB-ed once per contamination layer.

How can I resolve issues setting up SCRuB on my machine?
While we encourage all users to contact us for assistance via our Issues page or by email, a few general suggestions address some common issues: 1) To accommodate SCRuB’s dependencies, we require R >= 3.6. If your analysis requires an older version of R, we recommend maintaining two separate R environments on your machine. 2) A common issue with the torch package can be resolved by running the R command torch::install_torch(). 3) Our unit tests successfully deploy SCRuB on five separate machines; upon request, we are happy to help with issues specific to your machine.

