BiasAway Documentation

Welcome to BiasAway - an open-source command-line tool and web-server that provide four approaches to generate nucleotide composition-matched DNA sequences.


Introduction

BiasAway generates nucleotide composition-matched DNA sequences. It is available as open-source code on Bitbucket.

The tool provides users with four approaches to generate synthetic or genomic background sequences matching mono- or k-mer composition of user-provided foreground sequences:

  1. synthetic k-mer shuffled sequences
  2. synthetic k-mer shuffled sequences in a sliding window
  3. genomic mononucleotide distribution matched sequences
  4. genomic mononucleotide distribution within a sliding window matched sequences

The 1st approach shuffles each user-provided sequence independently while preserving the k-mer composition of the input sequence. The 2nd approach applies the same method as the 1st approach but within a sliding window along the user-provided sequences. For the 3rd and 4th approaches, the background sequences are selected from a pool of provided genomic sequences to match the mononucleotide distribution of each target sequence. The 4th approach considers the mean and standard deviation of %GC computed within the sliding window along the user-provided sequences to match, as closely as possible, the distribution for each user-provided sequence.

The sliding-window approaches were included because target sequences may have sub-regions of distinct nucleotide composition, arising from evolutionary changes such as insertion of repetitive sequences, local rearrangements, or biochemical missteps.

Installation

BiasAway is available on PyPI and through Bioconda, and the source code is available on Bitbucket. BiasAway takes care of the installation of all the required Python modules. If you already have a working installation of Python, the easiest way to install the required Python modules is to install biasaway using pip.

If you are setting up Python for the first time, we recommend installing it using the Conda or Miniconda Python distribution. This comes with several helpful scientific and data processing libraries and is available for platforms including Windows, macOS, and Linux.

You can use one of the following ways to install BiasAway.

Quick installation

Prerequisites

BiasAway requires the following Python modules: biopython, numpy, matplotlib, and seaborn.

Install biopython, numpy, matplotlib, and seaborn

BiasAway uses biopython, numpy, matplotlib, and seaborn; you can install them using pip or conda.

Note

If you install BiasAway using pip or Bioconda, the prerequisites will be installed automatically.

Install BiasAway using conda

BiasAway is available on Bioconda for installation via conda.

conda install -c bioconda biasaway

Install BiasAway using pip

BiasAway is available on PyPi for installation via pip.

pip install biasaway

Install BiasAway from source

You can install the development version by using git from our bitbucket repository at https://bitbucket.org/CBGR/biasaway.

Install development version from Bitbucket

If you have git installed, use this:

git clone https://bitbucket.org/CBGR/biasaway.git
cd biasaway
python setup.py sdist install

How to use BiasAway

Once you have installed BiasAway, you can type:

biasaway --help

It will print the main help, which lists the four subcommands/modules: k, w, g, and c.

usage: biasaway <subcommand> [options]

positional arguments <subcommand>: {k,w,g,c}

        List of subcommands
        k       k-mer shuffling
        w       k-mer shuffling within a sliding window
        g       mononucleotide distribution matched
        c       mononucleotide distribution within a sliding window matched

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

To view the help for the individual subcommands, please type:

Note

Please check BiasAway modules to see a detailed summary of available options.

To view k module help, type

biasaway k --help

To view w module help, type

biasaway w --help

To view g module help, type

biasaway g --help

To view c module help, type

biasaway c --help

BiasAway modules

BiasAway provides four modules to generate synthetic or genomic background sequences matching the mono- or k-mer composition of user-provided foreground sequences. Each module is described in its own section.

Note

BiasAway can generate distribution plots for QC. Plots provide information about distribution of %GC, dinucleotides, and lengths for the input sequences and generated sequences. Moreover, BiasAway provides the following QC metrics for comparing these distributions whenever possible: mean absolute error and goodness of fit computed as Pearson’s chi-squared statistic, log-likelihood ratio test (G-test), and the Cressie-Read power divergence.
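
The metrics named above can be illustrated with a short, self-contained sketch. It is not BiasAway's own code: the function names are hypothetical, the inputs are assumed to be two lists of binned counts with the same total, and how BiasAway bins or normalizes the distributions before comparing them is not described here. Pearson's chi-squared statistic, the G-test, and the Cressie-Read statistic are all members of the power divergence family, selected by a parameter lam (1, 0, and 2/3 respectively).

```python
import math


def mean_absolute_error(observed, expected):
    """Average absolute difference between two equally long lists of counts."""
    return sum(abs(o - e) for o, e in zip(observed, expected)) / len(observed)


def power_divergence(observed, expected, lam):
    """Power divergence statistic for counts with equal totals.

    lam=1 gives Pearson's chi-squared, lam=0 the G-test statistic,
    and lam=2/3 the value recommended by Cressie and Read.
    """
    if lam == 0:
        # Limit of the general formula as lam approaches 0 (G-test).
        return 2 * sum(o * math.log(o / e)
                       for o, e in zip(observed, expected) if o > 0)
    total = sum(o * ((o / e) ** lam - 1) for o, e in zip(observed, expected))
    return 2 * total / (lam * (lam + 1))


# Hypothetical binned counts for foreground and background sequences.
foreground_counts = [10, 20, 30]
background_counts = [20, 20, 20]

mae = mean_absolute_error(foreground_counts, background_counts)
pearson = power_divergence(foreground_counts, background_counts, 1)
g_stat = power_divergence(foreground_counts, background_counts, 0)
cressie_read = power_divergence(foreground_counts, background_counts, 2 / 3)
```

Lower values of each statistic indicate that the two distributions are closer to each other.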

Note

BiasAway also comes with a Web App available at http://biasaway.uio.no.

K-mer shuffling

Each user-provided sequence will be shuffled while keeping its k-mer composition. This module can be used for any k; for instance, use -k 1 to conserve the mononucleotide composition of the input sequences.
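
To make "keeping the k-mer composition" concrete, here is a minimal Python sketch that counts overlapping k-mers and checks that a plain shuffle preserves the k=1 composition. The function names are hypothetical and this is not BiasAway's implementation: a plain shuffle only covers the k=1 case, and preserving the composition for k greater than 1 needs a dedicated algorithm that is not shown here.

```python
import random
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shuffle_mononucleotides(seq, seed=None):
    """Shuffle the letters of seq; this preserves only the k=1 composition."""
    rng = random.Random(seed)
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


original = "ACGTACGTTTGACA"
shuffled = shuffle_mononucleotides(original, seed=42)

# The mononucleotide counts are identical before and after shuffling.
same_mono = kmer_counts(original, 1) == kmer_counts(shuffled, 1)
```

Comparing kmer_counts(original, 2) with kmer_counts(shuffled, 2) in the same way shows whether a given shuffle also happened to preserve the dinucleotide composition, which a plain shuffle does not guarantee.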

Usage:

biasaway k [options]

Note

Please scroll down to see a detailed summary of available options.

Help:

biasaway k --help

Example:

biasaway k -f path/to/FASTA/file/my_fasta_file.fa

It will output the generated sequences on stdout, keeping the dinucleotide composition of the input sequence by default (k-mer with k=2 is the default). If you wish to save the sequences in a specific file, you can type:

biasaway k -f path/to/FASTA/file/my_fasta_file.fa > path/to/output/FASTA/file/my_fasta_output.fa

Summary of options

Option                Description
-h, --help            To show the help message and exit
-f, --foreground      Foreground file in fasta format.
-k, --kmer            K-mer to be used for shuffling (default: 2 for dinucleotide shuffling)
-n, --nfold           How many background sequences per each foreground sequence will be generated (default: 1)
-e, --seed            Seed number to initialize the random number generator for reproducibility (default: integer from the current time)
-p, --plot-filename   Base filename for all the plots and related statistics looking at %GC, dinucleotide, and lengths distributions (default: not activated so no plot and statistics produced)
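
If you prefer to call the k module from a Python script rather than from the shell, the sketch below shows one way to do it with the standard subprocess module. The helper names build_k_command and run_to_file are hypothetical; only the flags -f, -k, -n, and -e come from the options listed above, and the sketch assumes the biasaway executable is on your PATH.

```python
import subprocess


def build_k_command(foreground, kmer=2, nfold=1, seed=None):
    """Assemble the argument list for 'biasaway k' from the documented options."""
    cmd = ["biasaway", "k", "-f", foreground, "-k", str(kmer), "-n", str(nfold)]
    if seed is not None:
        # A fixed seed makes the generated sequences reproducible.
        cmd += ["-e", str(seed)]
    return cmd


def run_to_file(cmd, output_path):
    """Run the command and write its stdout (the FASTA output) to output_path."""
    with open(output_path, "w") as handle:
        subprocess.run(cmd, stdout=handle, check=True)


command = build_k_command("path/to/my_fasta_file.fa", seed=7)
# run_to_file(command, "path/to/my_fasta_output.fa")  # requires BiasAway installed
```

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.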

K-mer shuffling within a sliding window

For each user-provided sequence, a window will slide along to shuffle the nucleotides within the window, keeping the local k-mer composition. As such, the generated sequences will preserve the local k-mer composition of the input sequences along them.
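
The idea of shuffling inside a sliding window can be sketched as follows. This is a simplified illustration under stated assumptions, not BiasAway's algorithm: it only preserves the mononucleotide (k=1) composition, the function name is hypothetical, and the way overlapping windows are handled (each window is shuffled in place, one after the other) is an assumption. The defaults mirror the documented window length of 100 and step of 50.

```python
import random


def window_shuffle(seq, winlen=100, step=50, seed=None):
    """Shuffle letters inside successive windows of seq (k=1 only)."""
    rng = random.Random(seed)
    letters = list(seq)
    for start in range(0, len(letters), step):
        # Shuffle the current window and write it back in place.
        window = letters[start:start + winlen]
        rng.shuffle(window)
        letters[start:start + winlen] = window
    return "".join(letters)


# A sequence with an A-rich half and a C-rich half keeps that local structure
# when the windows do not span the boundary between the two halves.
two_regions = "A" * 50 + "C" * 50
locally_shuffled = window_shuffle(two_regions, winlen=10, step=10, seed=1)
```

Because letters only move within a window, a region rich in one nucleotide stays rich in it, which is the property the sliding-window module is designed to keep.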

Usage:

biasaway w [options]

Note

Please scroll down to see a detailed summary of available options.

Help:

biasaway w --help

Example:

biasaway w -f path/to/FASTA/file/my_fasta_file.fa

It will output the generated sequences on stdout, keeping the local dinucleotide composition of the input sequences (k=2 for dinucleotide shuffling is used as default). If you wish to save the sequences in a specific file, you can type:

biasaway w -f path/to/FASTA/file/my_fasta_file.fa > path/to/output/FASTA/file/my_fasta_output.fa

Summary of options

Option                Description
-h, --help            To show the help message and exit
-f, --foreground      Foreground file in fasta format.
-k, --kmer            K-mer to be used for shuffling (default: 2 for dinucleotide shuffling)
-n, --nfold           How many background sequences per each foreground sequence will be generated (default: 1)
-w, --winlen          Window length (default: 100)
-s, --step            Sliding step (default: 50)
-e, --seed            Seed number to initialize the random number generator for reproducibility (default: integer from the current time)
-p, --plot-filename   Base filename for all the plots and related statistics looking at %GC, dinucleotide, and lengths distributions (default: not activated so no plot and statistics produced)

Genomic mononucleotide distribution matched

Given a set of available background sequences (pre-computed or provided by the user), each user-provided foreground sequence will be matched to a background sequence having the same mononucleotide composition.

The first time you run this module, you need to provide a set of potential background sequences using the --background argument. The --bgdirectory argument is required; this directory will contain the decomposition of the background sequences into dedicated files per %GC content.

If you already have such a pre-computed background directory, you can use the --bgdirectory argument alone to speed up the process.
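
The decomposition per %GC content can be pictured with the sketch below. It is an in-memory illustration with hypothetical function names, not BiasAway's code: BiasAway stores the decomposition as files in the background directory, and details such as rounding %GC to an integer bin or sampling without replacement are assumptions or omitted here.

```python
import random
from collections import defaultdict


def gc_percent(seq):
    """Return the %GC of seq rounded to the nearest integer (0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0
    return round(100 * (seq.count("G") + seq.count("C")) / len(seq))


def bin_by_gc(background):
    """Group background sequences by their integer %GC."""
    bins = defaultdict(list)
    for seq in background:
        bins[gc_percent(seq)].append(seq)
    return bins


def pick_match(foreground_seq, bins, rng):
    """Pick a random background sequence from the bin matching the foreground %GC."""
    candidates = bins.get(gc_percent(foreground_seq), [])
    return rng.choice(candidates) if candidates else None


background = ["ATAT", "ATGC", "GCGC", "AGCT"]
bins = bin_by_gc(background)
match = pick_match("CATG", bins, random.Random(0))
```

Building the bins once and reusing them is what makes later runs with an existing background directory faster.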

Usage:

biasaway g [options]

Note

Please scroll down to see a detailed summary of available options.

Help:

biasaway g --help

Example:

biasaway g -f path/to/FASTA/file/my_fasta_file.fa -b path/to/background.fa -r path/to/bgdirectory

It will output the generated sequences on stdout. If you wish to save the sequences in a specific file, you can type:

biasaway g -f path/to/FASTA/file/my_fasta_file.fa -b path/to/background.fa -r path/to/bgdirectory > path/to/output/FASTA/file/my_fasta_output.fa

Summary of options

Option                Description
-h, --help            To show the help message and exit
-f, --foreground      Foreground file in fasta format.
-n, --nfold           How many background sequences per each foreground sequence will be generated (default: 1)
-r, --bgdirectory     Background directory (must be empty if --background is used). See documentation for details.
-b, --background      Background file in fasta format. Not necessary if a background directory has already been computed previously.
-l, --length          Try to match the length as closely as possible (not set by default)
-e, --seed            Seed number to initialize the random number generator for reproducibility (default: integer from the current time)
-p, --plot-filename   Base filename for all the plots and related statistics looking at %GC, dinucleotide, and lengths distributions (default: not activated so no plot and statistics produced)

Genomic mononucleotide distribution within a sliding window matched

Given a set of available background sequences (pre-computed or provided by the user), each user-provided foreground sequence will be matched to a background sequence having a close local mononucleotide composition. Specifically, the distribution of %GC composition in a sliding window is computed for foreground and background sequences; a foreground sequence with mean m_f and standard deviation sdev_f of %GC in the sliding window is matched to a background sequence if its mean %GC m_b is such that:

m_f - N * sdev_f <= m_b <= m_f + N * sdev_f

with N equal to 2.6 by default (see the -d, --deviation option).
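
The matching condition can be written out as a short Python sketch. The function names are hypothetical and several details are assumptions rather than BiasAway's documented behaviour: sequences are assumed non-empty, only full windows are used (except when the sequence is shorter than one window), and the population standard deviation is used for sdev_f.

```python
from statistics import mean, pstdev


def window_gc(seq, winlen=100, step=50):
    """Return the %GC of each sliding window along a non-empty sequence."""
    seq = seq.upper()
    last_start = max(len(seq) - winlen, 0)
    values = []
    for start in range(0, last_start + 1, step):
        window = seq[start:start + winlen]
        values.append(100 * (window.count("G") + window.count("C")) / len(window))
    return values


def is_match(fg_seq, bg_seq, deviation=2.6, winlen=100, step=50):
    """Check m_f - N * sdev_f <= m_b <= m_f + N * sdev_f with N = deviation."""
    fg = window_gc(fg_seq, winlen, step)
    bg = window_gc(bg_seq, winlen, step)
    m_f, sdev_f = mean(fg), pstdev(fg)
    m_b = mean(bg)
    return m_f - deviation * sdev_f <= m_b <= m_f + deviation * sdev_f


# Tiny windows are used here only to keep the example readable.
accepted = is_match("GCATGCAT", "ATGCCGTA", winlen=4, step=4)
rejected = is_match("GCATGCAT", "GGGGGGGG", winlen=4, step=4)
```

A foreground sequence whose %GC varies a lot between windows has a large sdev_f and therefore accepts a wider range of background means.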

The first time you run this module, you need to provide a set of potential background sequences using the --background argument. The --bgdirectory argument is required; this directory will contain the decomposition of the background sequences into dedicated files per %GC content.

If you already have such a pre-computed background directory, you can use the --bgdirectory argument alone to speed up the process.

Usage:

biasaway c [options]

Note

Please scroll down to see a detailed summary of available options.

Help:

biasaway c --help

Example:

biasaway c -f path/to/FASTA/file/my_fasta_file.fa -b path/to/background.fa -r path/to/bgdirectory

It will output the generated sequences on stdout. If you wish to save the sequences in a specific file, you can type:

biasaway c -f path/to/FASTA/file/my_fasta_file.fa -b path/to/background.fa -r path/to/bgdirectory > path/to/output/FASTA/file/my_fasta_output.fa

Summary of options

Option                Description
-h, --help            To show the help message and exit
-f, --foreground      Foreground file in fasta format.
-n, --nfold           How many background sequences per each foreground sequence will be generated (default: 1)
-r, --bgdirectory     Background directory (must be empty if --background is used). See documentation for details.
-b, --background      Background file in fasta format. Not necessary if a background directory has already been computed previously.
-l, --length          Try to match the length as closely as possible (not set by default)
-w, --winlen          Window length (default: 100)
-s, --step            Sliding step (default: 50)
-d, --deviation       Deviation from the mean (default: 2.6 for a threshold of mean + 2.6 * stdev)
-e, --seed            Seed number to initialize the random number generator for reproducibility (default: integer from the current time)
-p, --plot-filename   Base filename for all the plots and related statistics looking at %GC, dinucleotide, and lengths distributions (default: not activated so no plot and statistics produced)

BiasAway web-server

Introduction

The BiasAway web-server provides an interactive and easy-to-use interface for users to upload FASTA files and generate background sequences. It comes with precomputed genomic partitions of 100, 250, 500, 750, and 1000 bp bins for the genomes of nine species (Arabidopsis thaliana; Caenorhabditis elegans; Danio rerio; Drosophila melanogaster; Homo sapiens; Mus musculus; Rattus norvegicus; Saccharomyces cerevisiae; and Schizosaccharomyces pombe). These background sequences are provided through Zenodo at 10.5281/zenodo.3923866. They were generated using the script at https://bitbucket.org/CBGR/biasaway_background_construction, which users can run to generate their own background sequences. The result page provides information about mononucleotide, dinucleotide, and length distributions for the provided and generated sequences for comparison.

BiasAway has four modules, each described in its own section.

BiasAway Web App

Note

The BiasAway web-application automatically generates distribution plots for QC. Plots provide information about distribution of %GC, dinucleotides, and lengths for the input sequences and generated sequences. Moreover, BiasAway provides the following QC metrics for comparing these distributions whenever possible: mean absolute error and goodness of fit computed as Pearson’s chi-squared statistic, log-likelihood ratio test (G-test), and the Cressie-Read power divergence.

Below are screenshots for individual modules.

K-mer shuffling

This module should be run when the user aims at preserving the global k-mer nucleotide frequencies of input sequences.

BiasAway - K-mer shuffling generator

K-mer shuffling within a sliding window

This module should be run when the user aims at preserving the local k-mer nucleotide frequencies of input sequences.

BiasAway - K-mer shuffling within a sliding window

Genomic mononucleotide distribution matched

This module should be run when the user aims at selecting genuine genomic background sequences from a pool of provided genomic sequences to match the distribution of mononucleotide for each target sequence.

BiasAway - %GC distribution-based background

Genomic mononucleotide distribution within a sliding window matched

This module should be run when the user aims at selecting genuine genomic background sequences from a pool of provided genomic sequences to match the local distribution of mononucleotide for each target sequence.

BiasAway - %GC distribution and %GC composition within a sliding window

Example result page and QC plots

BiasAway provides quality control (QC) plots and metrics to assess the similarity of the mono- and di-nucleotide, and length distributions for the foreground and background sequences. Specifically, four plots are provided to visualize how similar the foreground and background sequences are when considering (1) their distributions of %GC content using density plots, (2) their dinucleotide contents considering all IUPAC nucleotides using a heatmap, (3) their dinucleotide contents considering adenine, cytosine, guanine, and thymine nucleotides using a heatmap, and (4) their distributions of lengths.

BiasAway - Results page

Generation of background repositories

Modules g and c of BiasAway require the generation of a background repository for the genome of interest. This can be created with the script in our Bitbucket repository at https://bitbucket.org/CBGR/biasaway_background_construction.

Our BiasAway Web-Server contains precomputed background repositories for 9 species. The genome fasta files used to create these can be found below:

Please note that some genome fasta files are separated by chromosomes in their original repositories. In that case, please make sure to concatenate all chromosome fasta files in one single genome fasta file.
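
Concatenating per-chromosome files can be done with any tool; the Python sketch below is one possible way, using only the standard library. The function name and the file names are hypothetical, and the tiny sequences written to a temporary directory are there only to make the example runnable.

```python
import tempfile
from pathlib import Path


def concatenate_fasta(chromosome_files, genome_file):
    """Write all chromosome FASTA files, in order, into one genome FASTA file."""
    with open(genome_file, "w") as out:
        for path in chromosome_files:
            with open(path) as src:
                for line in src:
                    # Make sure a file without a trailing newline does not
                    # merge its last line with the next FASTA header.
                    out.write(line if line.endswith("\n") else line + "\n")


with tempfile.TemporaryDirectory() as tmp:
    chr1 = Path(tmp, "chr1.fa")
    chr2 = Path(tmp, "chr2.fa")
    chr1.write_text(">chr1\nACGT")  # no trailing newline on purpose
    chr2.write_text(">chr2\nGGCC\n")
    genome = Path(tmp, "genome.fa")
    concatenate_fasta([chr1, chr2], genome)
    combined = genome.read_text()
```

Reading the files line by line keeps memory use low, which matters for large genomes.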

We also provide a collection of precomputed background repositories for the nine organisms mentioned above using k-mers of size 100, 250, 500, 750, and 1000 base pairs. They can be found as individual compressed files in our Zenodo repository (10.5281/zenodo.3923866).

Availability

The BiasAway web-server is freely available at:

http://biasaway.uio.no

Support

If you have questions or find a bug in the program, please write to us at anthony.mathelier[at]ncmm.uio.no and azizk[at]stanford.edu.

You can also report issues on our Bitbucket repository.

Citation

If you used BiasAway, please cite:

  • A. Khan, R. Riudavets Puig, P. Boddie, and A. Mathelier. BiasAway: command-line and web server to generate nucleotide composition-matched DNA background sequences, 2020.
  • R. Worsley-Hunt et al. Improving analysis of transcription factor binding sites within ChIP-Seq data based on topological motif enrichment, BMC Genomics 2014; 10.1186/1471-2164-15-472