BiasAway modules
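BiasAway is a software tool that generates nucleotide composition-matched DNA sequences. It is available as open-source code on Bitbucket.
The tool provides four approaches to generate synthetic or genomic background sequences matching the mono- and k-mer composition of user-provided foreground sequences: k-mer shuffling (module k), k-mer shuffling within a sliding window (module w), genomic mononucleotide distribution matching (module g), and genomic mononucleotide distribution matching within a sliding window (module c).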
BiasAway can generate distribution plots for quality control (QC). The plots show the distributions of %GC, dinucleotides, and lengths for the input sequences and the generated sequences. Whenever possible, BiasAway also reports the following QC metrics for comparing these distributions: mean absolute error, and goodness of fit computed as Pearson's chi-squared statistic, the log-likelihood ratio test (G-test), and the Cressie-Read power divergence.
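The QC metrics named above can be illustrated with a minimal Python sketch. It assumes two lists of bin counts with equal totals (for example, a %GC histogram of the input sequences and one of the generated sequences); the function names are hypothetical and this is not BiasAway's own implementation.

```python
import math


def mean_absolute_error(observed, expected):
    """Average absolute difference between matching bins."""
    return sum(abs(o - e) for o, e in zip(observed, expected)) / len(observed)


def power_divergence(observed, expected, lam):
    """Cressie-Read power divergence family.

    lam = 1 gives Pearson's chi-squared statistic (when totals are equal),
    lam = 0 gives the G-test statistic,
    lam = 2/3 is the value recommended by Cressie and Read.
    Expected counts are assumed to be strictly positive.
    """
    if lam == 0:
        # Limit of the general formula as lam tends to 0.
        return 2.0 * sum(
            o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
        )
    total = sum(o * ((o / e) ** lam - 1.0) for o, e in zip(observed, expected))
    return 2.0 * total / (lam * (lam + 1.0))


# Illustrative counts only, not real BiasAway output.
generated = [10, 20, 30, 40]
reference = [25, 25, 25, 25]
print("MAE:", mean_absolute_error(generated, reference))
print("Pearson chi-squared:", power_divergence(generated, reference, 1))
print("G-test:", power_divergence(generated, reference, 0))
print("Cressie-Read:", power_divergence(generated, reference, 2 / 3))
```

Lower values indicate that the two distributions are closer; all three goodness-of-fit statistics are zero when the histograms are identical.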
BiasAway also comes with a Web App available at
K-mer shuffling
Each user-provided sequence is shuffled while preserving its k-mer composition. This module works for any k; for instance, use -k 1 to preserve the mononucleotide composition of the input sequences.
biasaway k [options]
Please scroll down to see a detailed summary of available options.
biasaway k --help
biasaway k -f path/to/FASTA/file/my_fasta_file.fa
This command writes the generated sequences to stdout. By default, the dinucleotide composition of each input sequence is preserved (k=2). To save the sequences to a specific file, redirect the output:
biasaway k -f path/to/FASTA/file/my_fasta_file.fa > path/to/output/FASTA/file/my_fasta_output.fa
Summary of options
Option | Description |
-h, --help | Show the help message and exit |
-f, --foreground | Foreground file in FASTA format |
-k, --kmer | K-mer to be used for shuffling (default: 2 for dinucleotide shuffling) |
-n, --nfold | How many background sequences will be generated per foreground sequence (default: 1) |
-e, --seed | Seed number to initialize the random number generator for reproducibility (default: integer from the current time) |
-p, --plot-filename | Base filename for all the plots and related statistics looking at %GC, dinucleotide, and length distributions (default: not activated, so no plots or statistics are produced) |
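To make "preserving the k-mer composition" concrete, here is a minimal Python sketch of the simplest case, k = 1, together with a helper that counts k-mers so the result can be checked. The function names are hypothetical and this is not BiasAway's code; for k of 2 or more a plain permutation does not preserve k-mer counts, and a more involved shuffling algorithm is needed.

```python
import random
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shuffle_mononucleotides(seq, seed=None):
    """Permute the letters of seq; this preserves the k = 1 composition only."""
    rng = random.Random(seed)  # a fixed seed makes the result reproducible
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


original = "ACGTACGGTTAACC"
shuffled = shuffle_mononucleotides(original, seed=1)
# Mononucleotide counts are identical before and after shuffling.
print(kmer_counts(original, 1) == kmer_counts(shuffled, 1))
```

The same `kmer_counts` helper can be used with k = 2 to check that a FASTA file produced by `biasaway k` keeps the dinucleotide counts of its input.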
K-mer shuffling within a sliding window
For each user-provided sequence, a window slides along the sequence and the nucleotides inside the window are shuffled while keeping the k-mer composition of that window. The generated sequences therefore preserve the local k-mer composition of the input sequences along their length.
biasaway w [options]
Please scroll down to see a detailed summary of available options.
biasaway w --help
biasaway w -f path/to/FASTA/file/my_fasta_file.fa
This command writes the generated sequences to stdout, keeping the local dinucleotide composition of the input sequences (the default is k=2, dinucleotide shuffling). To save the sequences to a specific file, redirect the output:
biasaway w -f path/to/FASTA/file/my_fasta_file.fa > path/to/output/FASTA/file/my_fasta_output.fa
Summary of options
Option | Description |
-h, --help | Show the help message and exit |
-f, --foreground | Foreground file in FASTA format |
-k, --kmer | K-mer to be used for shuffling (default: 2 for dinucleotide shuffling) |
-n, --nfold | How many background sequences will be generated per foreground sequence (default: 1) |
-w, --winlen | Window length (default: 100) |
-s, --step | Sliding step (default: 50) |
-e, --seed | Seed number to initialize the random number generator for reproducibility (default: integer from the current time) |
-p, --plot-filename | Base filename for all the plots and related statistics looking at %GC, dinucleotide, and length distributions (default: not activated, so no plots or statistics are produced) |
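The sliding-window idea can be sketched in Python for the k = 1 case. This is an illustration under stated assumptions, not BiasAway's implementation: windows start every `step` positions, overlapping windows are shuffled in place one after another, and any tail not covered by a full window is left unchanged. How the real tool treats overlaps and sequence ends may differ.

```python
import random
from collections import Counter


def window_shuffle(seq, winlen=100, step=50, seed=None):
    """Shuffle letters inside each sliding window (k = 1 only)."""
    rng = random.Random(seed)
    letters = list(seq)
    # Sequences shorter than winlen are handled as a single window.
    last_start = max(len(letters) - winlen, 0)
    for start in range(0, last_start + 1, step):
        window = letters[start:start + winlen]
        rng.shuffle(window)
        letters[start:start + winlen] = window
    return "".join(letters)


sequence = "ACGT" * 60
result = window_shuffle(sequence, winlen=100, step=50, seed=42)
# Letters only move within windows, so the overall composition is unchanged.
print(Counter(sequence) == Counter(result))
```

Because each letter can only move within the windows that contain it, a GC-rich stretch stays GC-rich in the output, which is the point of shuffling locally rather than over the whole sequence.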
Genomic mononucleotide distribution matched
Given a set of available background sequences (pre-computed or provided by the user), each user-provided foreground sequence will be matched to a background sequence having the same mononucleotide composition.
The first time you run this module, you need to provide a set of potential background sequences using the --background argument. The --bgdirectory argument is also required; that directory will contain the decomposition of the background sequences into dedicated files per %GC content.
If you already have such a pre-computed background directory, you can provide only the --bgdirectory argument to speed up the process.
biasaway g [options]
Please scroll down to see a detailed summary of available options.
biasaway g --help
biasaway g -f path/to/FASTA/file/my_fasta_file.fa -b path/to/background.fa -r path/to/bgdirectory
This command writes the generated sequences to stdout. To save the sequences to a specific file, redirect the output:
biasaway g -f path/to/FASTA/file/my_fasta_file.fa -b path/to/background.fa -r path/to/bgdirectory > path/to/output/FASTA/file/my_fasta_output.fa
Summary of options
Option | Description |
-h, --help | Show the help message and exit |
-f, --foreground | Foreground file in FASTA format |
-n, --nfold | How many background sequences will be generated per foreground sequence (default: 1) |
-r, --bgdirectory | Background directory (must be empty if --background is used). See documentation for details. |
-b, --background | Background file in FASTA format. Not necessary if a background directory has already been computed. |
-l, --length | Try to match the length as closely as possible (not set by default) |
-e, --seed | Seed number to initialize the random number generator for reproducibility (default: integer from the current time) |
-p, --plot-filename | Base filename for all the plots and related statistics looking at %GC, dinucleotide, and length distributions (default: not activated, so no plots or statistics are produced) |
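The matching step described in this section can be sketched in Python as follows. The sketch assumes that background sequences are grouped into integer %GC bins and that each foreground sequence draws one background sequence, without replacement, from the bin with its own %GC. The function names, the rounding to whole percentages, and the in-memory bins (rather than files in a background directory) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def gc_percent(seq):
    """Return %GC rounded to the nearest integer (seq must be non-empty)."""
    seq = seq.upper()
    return round(100 * (seq.count("G") + seq.count("C")) / len(seq))


def bin_by_gc(background):
    """Group background sequences by their integer %GC."""
    bins = defaultdict(list)
    for seq in background:
        bins[gc_percent(seq)].append(seq)
    return bins


def match_by_gc(foreground, bins, seed=None):
    """Pick one unused background sequence per foreground sequence."""
    rng = random.Random(seed)
    matched = []
    for seq in foreground:
        candidates = bins.get(gc_percent(seq), [])
        if candidates:  # skip foreground sequences with no candidate left
            matched.append(candidates.pop(rng.randrange(len(candidates))))
    return matched


bins = bin_by_gc(["AATT", "GGCC", "AGCT"])
print(match_by_gc(["ACGT"], bins, seed=0))
```

Binning once and reusing the bins mirrors the reason for keeping a pre-computed background directory: the background only has to be decomposed by %GC the first time.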
Genomic mononucleotide distribution within a sliding window matched
Given a set of available background sequences (pre-computed or provided by the user), each user-provided foreground sequence will be matched to a background sequence having a close local mononucleotide composition. Specifically, the distributions of %GC composition in a sliding window are computed for the foreground and background sequences; a foreground sequence with mean m_f and standard deviation sdev_f of %GC in the sliding window is matched to a background sequence if the background mean %GC, m_b, satisfies:
m_f - N * sdev_f <= m_b <= m_f + N * sdev_f
where N equals 2.6 by default.
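The matching condition above can be sketched in Python as follows. The window layout (starts every `step` positions, full windows only) and the use of the population standard deviation are assumptions made for this illustration, and the function names are hypothetical; BiasAway's own computation may differ in these details.

```python
import statistics


def window_gc(seq, winlen=100, step=50):
    """Return %GC for each sliding window along seq (seq must be non-empty)."""
    seq = seq.upper()
    last_start = max(len(seq) - winlen, 0)
    values = []
    for start in range(0, last_start + 1, step):
        window = seq[start:start + winlen]
        values.append(100 * sum(base in "GC" for base in window) / len(window))
    return values


def gc_profile(seq, winlen=100, step=50):
    """Mean and (population) standard deviation of the windowed %GC."""
    values = window_gc(seq, winlen, step)
    return statistics.mean(values), statistics.pstdev(values)


def is_match(fg_seq, bg_seq, n=2.6, winlen=100, step=50):
    """Check m_f - N * sdev_f <= m_b <= m_f + N * sdev_f."""
    m_f, sdev_f = gc_profile(fg_seq, winlen, step)
    m_b, _ = gc_profile(bg_seq, winlen, step)
    return m_f - n * sdev_f <= m_b <= m_f + n * sdev_f


foreground = "GC" * 50 + "AT" * 50
background = "ACGT" * 50
print(is_match(foreground, background))
```

A foreground sequence with a highly variable windowed %GC has a large sdev_f and therefore accepts a wider range of background means, while a uniform foreground sequence only matches backgrounds with nearly the same mean.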
The first time you run this module, you need to provide a set of potential background sequences using the --background argument. The --bgdirectory argument is also required; that directory will contain the decomposition of the background sequences into dedicated files per %GC content.
If you already have such a pre-computed background directory, you can provide only the --bgdirectory argument to speed up the process.
biasaway c [options]
Please scroll down to see a detailed summary of available options.
biasaway c --help
biasaway c -f path/to/FASTA/file/my_fasta_file.fa -b path/to/background.fa -r path/to/bgdirectory
This command writes the generated sequences to stdout. To save the sequences to a specific file, redirect the output:
biasaway c -f path/to/FASTA/file/my_fasta_file.fa -b path/to/background.fa -r path/to/bgdirectory > path/to/output/FASTA/file/my_fasta_output.fa
Summary of options
Option | Description |
-h, --help | Show the help message and exit |
-f, --foreground | Foreground file in FASTA format |
-n, --nfold | How many background sequences will be generated per foreground sequence (default: 1) |
-r, --bgdirectory | Background directory (must be empty if --background is used). See documentation for details. |
-b, --background | Background file in FASTA format. Not necessary if a background directory has already been computed. |
-l, --length | Try to match the length as closely as possible (not set by default) |
-w, --winlen | Window length (default: 100) |
-s, --step | Sliding step (default: 50) |
-d, --deviation | Deviation from the mean (default: 2.6 for a threshold of mean + 2.6 * stdev) |
-e, --seed | Seed number to initialize the random number generator for reproducibility (default: integer from the current time) |
-p, --plot-filename | Base filename for all the plots and related statistics looking at %GC, dinucleotide, and length distributions (default: not activated, so no plots or statistics are produced) |