BiasAway modules

BiasAway is a software tool that generates nucleotide composition-matched DNA sequences. It is available as open-source code on Bitbucket.

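The tool provides users with four approaches to generate synthetic or genomic background sequences matching the mononucleotide or k-mer composition of user-provided foreground sequences: k-mer shuffling (biasaway k), k-mer shuffling within a sliding window (biasaway w), genomic mononucleotide distribution matching (biasaway g), and genomic mononucleotide distribution matching within a sliding window (biasaway c).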

Note

BiasAway can generate distribution plots for quality control (QC). The plots show the distributions of %GC, dinucleotides, and sequence lengths for the input sequences and the generated sequences. BiasAway also reports, whenever possible, the following QC metrics for comparing these distributions: the mean absolute error, and goodness of fit computed as Pearson's chi-squared statistic, the log-likelihood ratio test (G-test), and the Cressie-Read power divergence.
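To make the metrics named above concrete, here is a minimal sketch of how they can be computed from two binned distributions. It is not BiasAway's own code: the function names, the choice of which distribution counts as "observed", and the example bin counts are all assumptions made for illustration. The formulas assume both distributions have the same total count and no empty expected bins.

```python
import math


def mean_absolute_error(observed, expected):
    """Average absolute difference between paired bin counts."""
    return sum(abs(o - e) for o, e in zip(observed, expected)) / len(observed)


def power_divergence(observed, expected, lam):
    """Cressie-Read power divergence family for paired bin counts.

    lam = 1 gives Pearson's chi-squared statistic, lam = 0 gives the
    G-test statistic, and lam = 2/3 is the Cressie-Read recommendation.
    Assumes equal totals and no zero in `expected`.
    """
    pairs = list(zip(observed, expected))
    if lam == 0:
        # Limit of the family: G = 2 * sum(O * ln(O / E)); empty bins add 0.
        return 2.0 * sum(o * math.log(o / e) for o, e in pairs if o > 0)
    total = sum(o * ((o / e) ** lam - 1.0) for o, e in pairs)
    return 2.0 * total / (lam * (lam + 1.0))


# Hypothetical %GC histograms (same bins, same total of 100 sequences).
foreground_bins = [10, 20, 40, 20, 10]
generated_bins = [12, 18, 38, 22, 10]

mae = mean_absolute_error(generated_bins, foreground_bins)
chi2 = power_divergence(generated_bins, foreground_bins, 1)
g_stat = power_divergence(generated_bins, foreground_bins, 0)
cressie_read = power_divergence(generated_bins, foreground_bins, 2 / 3)
print(mae, chi2, g_stat, cressie_read)
```

Smaller values of each metric indicate that the generated sequences follow the input distribution more closely; identical histograms give 0 for all four.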

Note

BiasAway also comes with a Web App available at http://biasaway.uio.no.

K-mer shuffling

Each user-provided sequence will be shuffled so that its k-mer composition is preserved. This module works for any k; for instance, use -k 1 to conserve the mononucleotide composition of the input sequences.
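What "shuffled to keep its k-mer composition" means for k=2 can be illustrated with the sketch below. It uses an Euler-path shuffle in the style of Altschul and Erickson, which is one well-known way to do this; the source does not say which algorithm BiasAway uses, so treat the function names and the approach as assumptions. No claim is made here that the sketch samples all valid shuffles uniformly.

```python
import random
from collections import Counter, defaultdict


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def _all_reach(last_edge, target):
    """True if following last_edge from every nucleotide ends at target."""
    for start in last_edge:
        seen, node = set(), start
        while node != target:
            if node in seen:
                return False  # stuck in a cycle that never reaches target
            seen.add(node)
            node = last_edge[node]
    return True


def dinucleotide_shuffle(seq, rng):
    """Return a reordered sequence with the same dinucleotide counts as seq."""
    if len(seq) < 3:
        return seq
    final = seq[-1]
    # successors[a] lists every nucleotide that follows a in seq.
    successors = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        successors[a].append(b)
    # For each nucleotide except the final one, pick the successor it will
    # use last; retry until all those choices lead to the final nucleotide.
    while True:
        last_edge = {v: rng.choice(s) for v, s in successors.items() if v != final}
        if _all_reach(last_edge, final):
            break
    for v, s in successors.items():
        if v in last_edge:
            s.remove(last_edge[v])
            rng.shuffle(s)
            s.append(last_edge[v])
        else:
            rng.shuffle(s)
    # Walk from the first nucleotide, consuming successors in order.
    out = [seq[0]]
    used = Counter()
    for _ in range(len(seq) - 1):
        cur = out[-1]
        out.append(successors[cur][used[cur]])
        used[cur] += 1
    return "".join(out)


original = "ACGTTGCAACGGTCATGCA" * 3
shuffled = dinucleotide_shuffle(original, random.Random(0))
print(kmer_counts(shuffled, 2) == kmer_counts(original, 2))
```

For k=1 the same guarantee is obtained much more simply by shuffling the list of nucleotides with `random.shuffle`.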

Usage:

biasaway k [options]

Note

Please scroll down to see a detailed summary of available options.

Help:

biasaway k --help

Example:

biasaway k -f path/to/FASTA/file/my_fasta_file.fa

It will output the generated sequences on stdout, keeping the dinucleotide composition of each input sequence (k=2 is the default). If you wish to save the sequences in a specific file, you can type:

biasaway k -f path/to/FASTA/file/my_fasta_file.fa > path/to/output/FASTA/file/my_fasta_output.fa

Summary of options

Option                Description
-h, --help            Show the help message and exit
-f, --foreground      Foreground file in FASTA format
-k, --kmer            K-mer to be used for shuffling (default: 2 for dinucleotide shuffling)
-n, --nfold           Number of background sequences generated per foreground sequence (default: 1)
-e, --seed            Seed number to initialize the random number generator for reproducibility (default: integer from the current time)
-p, --plot-filename   Base filename for all the plots and related statistics on the %GC, dinucleotide, and length distributions (default: not activated, so no plots or statistics are produced)

K-mer shuffling within a sliding window

For each user-provided sequence, a window will slide along to shuffle the nucleotides within the window, keeping the local k-mer composition. As such, the generated sequences will preserve the local k-mer composition of the input sequences along them.
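The sliding-window idea can be sketched as follows for the simplest case, k=1. This is an illustration only: how BiasAway treats overlapping windows, the trailing partial window, and k greater than 1 is not described here, so those choices in the sketch are assumptions, as is the function name. The defaults mirror the documented window length (100) and step (50).

```python
import random
from collections import Counter


def window_shuffle(seq, winlen=100, step=50, rng=None):
    """Shuffle nucleotides inside each window as it slides along seq (k=1)."""
    rng = rng or random.Random()
    bases = list(seq)
    for start in range(0, len(bases), step):
        # Assumption: overlapping windows are reshuffled in place, and the
        # last window is simply shorter when it runs past the end.
        window = bases[start:start + winlen]
        rng.shuffle(window)
        bases[start:start + winlen] = window
    return "".join(bases)


# A sequence with an AT-rich half followed by a GC-rich half.
sequence = "AT" * 100 + "GC" * 100
result = window_shuffle(sequence, winlen=100, step=50, rng=random.Random(42))
print(Counter(result) == Counter(sequence))
```

Because nucleotides only move within a window, an AT-rich region stays AT-rich in the output, which a whole-sequence shuffle would not guarantee.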

Usage:

biasaway w [options]

Note

Please scroll down to see a detailed summary of available options.

Help:

biasaway w --help

Example:

biasaway w -f path/to/FASTA/file/my_fasta_file.fa

It will output the generated sequences on stdout, keeping the local dinucleotide composition of the input sequences (k=2 for dinucleotide shuffling is used as default). If you wish to save the sequences in a specific file, you can type:

biasaway w -f path/to/FASTA/file/my_fasta_file.fa > path/to/output/FASTA/file/my_fasta_output.fa

Summary of options

Option                Description
-h, --help            Show the help message and exit
-f, --foreground      Foreground file in FASTA format
-k, --kmer            K-mer to be used for shuffling (default: 2 for dinucleotide shuffling)
-n, --nfold           Number of background sequences generated per foreground sequence (default: 1)
-w, --winlen          Window length (default: 100)
-s, --step            Sliding step (default: 50)
-e, --seed            Seed number to initialize the random number generator for reproducibility (default: integer from the current time)
-p, --plot-filename   Base filename for all the plots and related statistics on the %GC, dinucleotide, and length distributions (default: not activated, so no plots or statistics are produced)

Genomic mononucleotide distribution matched

Given a set of available background sequences (pre-computed or provided by the user), each user-provided foreground sequence will be matched to a background sequence having the same mononucleotide composition.

The first time you run this module, you need to provide a set of potential background sequences using the --background argument. The --bgdirectory argument is required; the directory it names will hold the background sequences, split into dedicated files per %GC content.

If you already have such a pre-computed background directory, you can use the --bgdirectory argument alone to speed up the process.
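The matching step described above can be pictured with the sketch below, which groups background sequences by integer %GC and draws one background sequence from the same group for each foreground sequence. It works in memory rather than with a background directory, and the function names, the rounding to whole percentages, and the rule of using each background sequence at most once are assumptions for illustration, not BiasAway's documented behaviour.

```python
import random
from collections import defaultdict


def gc_percent(seq):
    """%GC of a non-empty sequence, rounded to a whole percentage."""
    seq = seq.upper()
    return round(100 * (seq.count("G") + seq.count("C")) / len(seq))


def build_gc_bins(background):
    """Group background sequences by %GC (in-memory stand-in for the per-%GC files)."""
    bins = defaultdict(list)
    for seq in background:
        bins[gc_percent(seq)].append(seq)
    return bins


def match_background(foreground, bins, rng):
    """Pick, for each foreground sequence, one unused background sequence with the same %GC."""
    matched = []
    for seq in foreground:
        candidates = bins.get(gc_percent(seq), [])
        if candidates:
            # Assumption: a background sequence is removed once it is used.
            matched.append(candidates.pop(rng.randrange(len(candidates))))
    return matched


foreground = ["ACGT", "GGCA"]
background = ["TTAA", "CAGT", "GCGA", "GCGC"]
bins = build_gc_bins(background)
print(match_background(foreground, bins, random.Random(0)))
```

Building the groups once and reusing them is the same idea as keeping a pre-computed background directory: the expensive pass over the background file does not need to be repeated.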

Usage:

biasaway g [options]

Note

Please scroll down to see a detailed summary of available options.

Help:

biasaway g --help

Example:

biasaway g -f path/to/FASTA/file/my_fasta_file.fa -b path/to/background.fa -r path/to/bgdirectory

It will output the generated sequences on stdout. If you wish to save the sequences in a specific file, you can type:

biasaway g -f path/to/FASTA/file/my_fasta_file.fa -b path/to/background.fa -r path/to/bgdirectory > path/to/output/FASTA/file/my_fasta_output.fa

Summary of options

Option                Description
-h, --help            Show the help message and exit
-f, --foreground      Foreground file in FASTA format
-n, --nfold           Number of background sequences generated per foreground sequence (default: 1)
-r, --bgdirectory     Background directory (must be empty if --background is used). See documentation for details.
-b, --background      Background file in FASTA format. Not necessary if a background directory has already been computed.
-l, --length          Try to match the length as closely as possible (not set by default)
-e, --seed            Seed number to initialize the random number generator for reproducibility (default: integer from the current time)
-p, --plot-filename   Base filename for all the plots and related statistics on the %GC, dinucleotide, and length distributions (default: not activated, so no plots or statistics are produced)

Genomic mononucleotide distribution within a sliding window matched

Given a set of available background sequences (pre-computed or provided by the user), each user-provided foreground sequence will be matched to a background sequence having a close local mononucleotide composition. Specifically, the distribution of %GC in a sliding window is computed for each foreground and background sequence; a foreground sequence with mean m_f and standard deviation sdev_f of %GC in the sliding window is matched to a background sequence if the background mean %GC, m_b, satisfies:

m_f - N * sdev_f <= m_b <= m_f + N * sdev_f

with N equal to 2.6 by default (set with -d, --deviation).
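The criterion above can be written out as a short sketch. The function names are hypothetical, and two details are assumptions because the text does not specify them: the standard deviation is taken as the population standard deviation, and only full-length windows are used (a sequence shorter than the window is treated as a single window). Sequences are assumed to be non-empty.

```python
import statistics


def window_gc(seq, winlen=100, step=50):
    """%GC of each sliding window along a non-empty sequence."""
    seq = seq.upper()
    last_start = max(len(seq) - winlen, 0)
    values = []
    for start in range(0, last_start + 1, step):
        window = seq[start:start + winlen]
        values.append(100 * (window.count("G") + window.count("C")) / len(window))
    return values


def is_match(fg_seq, bg_seq, winlen=100, step=50, n=2.6):
    """Check m_f - N * sdev_f <= m_b <= m_f + N * sdev_f."""
    fg = window_gc(fg_seq, winlen, step)
    bg = window_gc(bg_seq, winlen, step)
    m_f = statistics.mean(fg)
    sdev_f = statistics.pstdev(fg)  # assumption: population standard deviation
    m_b = statistics.mean(bg)
    return m_f - n * sdev_f <= m_b <= m_f + n * sdev_f


# A foreground with uniform 50% GC only accepts backgrounds averaging 50% GC.
print(is_match("ACGT" * 10, "GCAT" * 10, winlen=20, step=10))
print(is_match("ACGT" * 10, "AT" * 20, winlen=20, step=10))
```

A foreground sequence whose %GC varies a lot along its length has a large sdev_f and therefore accepts a wider range of background means, while a compositionally uniform foreground is matched much more strictly.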

The first time you run this module, you need to provide a set of potential background sequences using the --background argument. The --bgdirectory argument is required; the directory it names will hold the background sequences, split into dedicated files per %GC content.

If you already have such a pre-computed background directory, you can use the --bgdirectory argument alone to speed up the process.

Usage:

biasaway c [options]

Note

Please scroll down to see a detailed summary of available options.

Help:

biasaway c --help

Example:

biasaway c -f path/to/FASTA/file/my_fasta_file.fa -b path/to/background.fa -r path/to/bgdirectory

It will output the generated sequences on stdout. If you wish to save the sequences in a specific file, you can type:

biasaway c -f path/to/FASTA/file/my_fasta_file.fa -b path/to/background.fa -r path/to/bgdirectory > path/to/output/FASTA/file/my_fasta_output.fa

Summary of options

Option                Description
-h, --help            Show the help message and exit
-f, --foreground      Foreground file in FASTA format
-n, --nfold           Number of background sequences generated per foreground sequence (default: 1)
-r, --bgdirectory     Background directory (must be empty if --background is used). See documentation for details.
-b, --background      Background file in FASTA format. Not necessary if a background directory has already been computed.
-l, --length          Try to match the length as closely as possible (not set by default)
-w, --winlen          Window length (default: 100)
-s, --step            Sliding step (default: 50)
-d, --deviation       Deviation from the mean (default: 2.6, for a threshold of mean ± 2.6 * stdev)
-e, --seed            Seed number to initialize the random number generator for reproducibility (default: integer from the current time)
-p, --plot-filename   Base filename for all the plots and related statistics on the %GC, dinucleotide, and length distributions (default: not activated, so no plots or statistics are produced)