KOBAS on the Command Line

Container Technologies

KOBAS is provided as a Docker container.

A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.

There are two major containerization technologies: Docker and Apptainer (Singularity).

Container images built for Docker can be run with either technology.

Running KOBAS using Docker

About Docker

  • Docker must be installed on the computer you wish to use for your analysis.
  • To run Docker you must have ‘root’ (admin) permissions (or use sudo).
  • Docker runs all containers as ‘root’, which makes it unsuitable for shared HPC systems (see Singularity below).
  • Docker can be run on your local computer, a server, a cloud virtual machine, etc.
  • For more information on installing Docker on other systems, see Installing Docker.

Getting the KOBAS container

The KOBAS tool is available as a Docker container on Docker Hub: KOBAS container

The container can be pulled with this command:

docker pull agbase/kobas:3.0.3_3

Remember

You must have root permissions or use sudo, like so:

sudo docker pull agbase/kobas:3.0.3_3

Getting the Help and Usage Statement

sudo docker run --rm agbase/kobas:3.0.3_3 -h

Tip

The /work-dir directory is built into this container and should be used to mount your data.

KOBAS can perform two tasks:

  • annotate (-a)
  • identify (enrichment) (-g)

KOBAS can also run both tasks with a single command (-j).

Annotate Example Command

sudo docker run \
--rm \
-v $(pwd):/work-dir \
agbase/kobas:3.0.3_3 \
-a \
-i GCF_001298625.1_SEUB3.0_protein.faa \
-s sce \
-t fasta:pro \
-o GCF_001298625.1
Command Explained

sudo docker run: tells docker to run

--rm: removes the container when the analysis has finished. The image will remain for future use.

-v $(pwd):/work-dir: mounts my current working directory on the host machine to ‘/work-dir’ inside the container

agbase/kobas:3.0.3_3: the name of the Docker image to use

Tip

All the options supplied after the image name are KOBAS options

-a: Tells KOBAS to run the ‘annotate’ process.

-i GCF_001298625.1_SEUB3.0_protein.faa: input file (protein FASTA).

-s sce: Enter the species code for the species of the sequences in your input file.

Note

If you don’t know the code for your species it can be found here: https://www.kegg.jp/kegg/catalog/org_list.html

If your species of interest is not available then you should choose the code for the closest-related species available

-t: input file type; in this case, protein FASTA.

-o GCF_001298625.1: prefix for the output file names

See Understanding Your Results below.
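If you keep a local copy of the KEGG organism list (for example, one saved from the page linked in the note above), the species code can be recovered with a quick grep. This is a minimal sketch: the file name org_list.tsv and the two sample rows are invented for illustration, mirroring the tab-separated layout (entry number, code, species name, lineage) of the KEGG organism list.

```shell
# Two made-up sample rows mimicking the KEGG organism list layout:
# entry number, organism code, species name, lineage (tab-separated).
printf 'T00005\tsce\tSaccharomyces cerevisiae\tEukaryotes;Fungi\n' >  org_list.tsv
printf 'T01001\thsa\tHomo sapiens\tEukaryotes;Animals\n'           >> org_list.tsv

# Case-insensitive search by species name; the code is the second column.
grep -i 'saccharomyces cerevisiae' org_list.tsv | cut -f2
```

The grep prints sce, which is the value passed to -s above (and to -b for the identify process).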

Identify Example Command

sudo docker run \
--rm \
-v $(pwd):/work-dir \
agbase/kobas:3.0.3_3 \
-g \
-f GCF_001298625.1_SEUB3.0_protein.faa \
-b sce \
-o ident_out
Command Explained

sudo docker run: tells docker to run

--rm: removes the container when the analysis has finished. The image will remain for future use.

-v $(pwd):/work-dir: mounts my current working directory on the host machine to ‘/work-dir’ in the container

agbase/kobas:3.0.3_3: the name of the Docker image to use

Tip

All the options supplied after the image name are KOBAS options

-g: Tells KOBAS to run the ‘identify’ process.

-f GCF_001298625.1_SEUB3.0_protein.faa: foreground file; this should be the output file from KOBAS ‘annotate’

-b sce: background; enter the species code for the species of the sequences in your input file.

Note

If you don’t know the code for your species it can be found here: https://www.kegg.jp/kegg/catalog/org_list.html

If your species of interest is not available then you should choose the code for the closest-related species available

-o ident_out: basename of output file

See Understanding Your Results below.

Annotate and Identify Pipeline Example Command

sudo docker run \
--rm \
-v $(pwd):/work-dir \
agbase/kobas:3.0.3_3 \
-j \
-i GCF_001298625.1_SEUB3.0_protein.faa \
-s sce \
-t fasta:pro \
-o GCF_001298625.1
Command Explained

sudo docker run: tells docker to run

--rm: removes the container when the analysis has finished. The image will remain for future use.

-v $(pwd):/work-dir: mounts my current working directory on the host machine to ‘/work-dir’ in the container

agbase/kobas:3.0.3_3: the name of the Docker image to use

Tip

All the options supplied after the image name are KOBAS options

-j: Tells KOBAS to run both the ‘annotate’ and ‘identify’ processes.

-i GCF_001298625.1_SEUB3.0_protein.faa: input file (protein FASTA)

-s sce: Enter the species code for the species of the sequences in your input file.

Note

If you don’t know the code for your species it can be found here: https://www.kegg.jp/kegg/catalog/org_list.html

If your species of interest is not available then you should choose the code for the closest-related species available

-t: input file type; in this case, protein FASTA.

-o GCF_001298625.1: basename of output files

Note

This pipeline will automatically use the output of ‘annotate’ as the -f foreground input for ‘identify’. This will also use your species option as the -b background input for ‘identify’.
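The wiring described in the note above can be pictured as two chained runs. The sketch below only assembles and prints the two equivalent docker commands (a dry run) so the hand-off between the steps is visible; the second step's output name GCF_001298625.1_identify is a made-up placeholder.

```shell
IMAGE=agbase/kobas:3.0.3_3
INPUT=GCF_001298625.1_SEUB3.0_protein.faa
SPECIES=sce
OUT=GCF_001298625.1

# Step 1: annotate the protein FASTA (-a).
step1="docker run --rm -v \$(pwd):/work-dir $IMAGE -a -i $INPUT -s $SPECIES -t fasta:pro -o $OUT"

# Step 2: identify (-g), feeding annotate's output in as the
# foreground (-f) and reusing the species code as background (-b).
step2="docker run --rm -v \$(pwd):/work-dir $IMAGE -g -f $OUT -b $SPECIES -o ${OUT}_identify"

printf '%s\n%s\n' "$step1" "$step2"
```

Nothing is executed here; the point is that -j saves you from managing the intermediate file and the background option yourself.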

See Understanding Your Results below.

Running KOBAS using Singularity

About Singularity (now Apptainer)

  • does not require ‘root’ permissions
  • runs all containers as the user that is logged into the host machine
  • HPC systems are likely to have Singularity installed already; if not, system administrators are usually willing to install it (no guarantees).
  • can be run on any machine where it is installed
  • more information about installing Singularity
  • This tool was tested using Singularity 3.10.2.

HPC Job Schedulers

Although Singularity can be installed on any computer this documentation assumes it will be run on an HPC system. The tool was tested on a Slurm system and the job submission scripts below reflect that. Submission scripts will need to be modified for use with other job scheduler systems.

Getting the KOBAS container

The KOBAS tool is available as a Docker container on Docker Hub: KOBAS container

Example Slurm script:

#!/bin/bash
#SBATCH --job-name=kobas
#SBATCH --ntasks=8
#SBATCH --time=2:00:00
#SBATCH --partition=short
#SBATCH --account=nal_genomics

module load singularity

cd /location/where/you/want/to/save/file

singularity pull docker://agbase/kobas:3.0.3_3

Running KOBAS with Data

Tip

The /work-dir directory is built into this container and should be used to mount your data.

Example Slurm Script for Annotate Process

#!/bin/bash
#SBATCH --job-name=kobas
#SBATCH --ntasks=8
#SBATCH --time=2:00:00
#SBATCH --partition=short
#SBATCH --account=nal_genomics

module load singularity

cd /directory/you/want/to/work/in

singularity run \
-B /directory/you/want/to/work/in:/work-dir \
/path/to/your/copy/kobas_3.0.3_3.sif \
-a \
-i GCF_001298625.1_SEUB3.0_protein.faa \
-s sce \
-t fasta:pro \
-o GCF_001298625.1
Command Explained

singularity run: tells Singularity to run

-B /directory/you/want/to/work/in:/work-dir: mounts your working directory on the host machine to ‘/work-dir’ in the container

/path/to/your/copy/kobas_3.0.3_3.sif: the name of the Singularity image to use

Tip

All the options supplied after the image name are KOBAS options

-a: Tells KOBAS to run the ‘annotate’ process.

-i GCF_001298625.1_SEUB3.0_protein.faa: input file (protein FASTA)

-s sce: Enter the species code for the species of the sequences in your input file.

Note

If you don’t know the code for your species it can be found here: https://www.kegg.jp/kegg/catalog/org_list.html

If your species of interest is not available then you should choose the code for the closest-related species available

-t: input file type; in this case, protein FASTA.

-o GCF_001298625.1: basename of the output files

See Understanding Your Results below.

Example Slurm Script for Identify Process

#!/bin/bash
#SBATCH --job-name=kobas
#SBATCH --ntasks=8
#SBATCH --time=2:00:00
#SBATCH --partition=short
#SBATCH --account=nal_genomics

module load singularity

cd /location/where/you/want/to/save/file

singularity pull docker://agbase/kobas:3.0.3_3

singularity run \
-B /directory/you/want/to/work/in:/work-dir \
kobas_3.0.3_3.sif \
-g \
-f GCF_001298625.1_SEUB3.0_protein.faa \
-b sce \
-o ident_out
Command Explained

singularity run: tells Singularity to run

-B /directory/you/want/to/work/in:/work-dir: mounts your working directory on the host machine to ‘/work-dir’ in the container

kobas_3.0.3_3.sif: the name of the Singularity image to use

Tip

All the options supplied after the image name are KOBAS options

-g: Tells KOBAS to run the ‘identify’ process.

-f GCF_001298625.1_SEUB3.0_protein.faa: foreground file; this should be the output file from ‘annotate’

-b sce: background; enter the species code for the species of the sequences in your input file.

Note

If you don’t know the code for your species it can be found here: https://www.kegg.jp/kegg/catalog/org_list.html

If your species of interest is not available then you should choose the code for the closest-related species available

-o ident_out: name of output file

See Understanding Your Results below.

Example Slurm Script for Annotate and Identify Pipeline

#!/bin/bash
#SBATCH --job-name=kobas
#SBATCH --ntasks=8
#SBATCH --time=2:00:00
#SBATCH --partition=short
#SBATCH --account=nal_genomics

module load singularity

cd /location/where/you/want/to/save/file

singularity pull docker://agbase/kobas:3.0.3_3

singularity run \
-B /directory/you/want/to/work/in:/work-dir \
kobas_3.0.3_3.sif \
-j \
-i GCF_001298625.1_SEUB3.0_protein.faa \
-s sce \
-t fasta:pro \
-o GCF_001298625.1
Command Explained

singularity run: tells Singularity to run

-B /directory/you/want/to/work/in:/work-dir: mounts your working directory on the host machine to ‘/work-dir’ in the container

kobas_3.0.3_3.sif: the name of the Singularity image to use

Tip

All the options supplied after the image name are KOBAS options

-j: Tells KOBAS to run both the ‘annotate’ and ‘identify’ processes.

-i GCF_001298625.1_SEUB3.0_protein.faa: input file (protein FASTA)

-s sce: Enter the species code for the species of the sequences in your input file.

Note

If you don’t know the code for your species it can be found here: https://www.kegg.jp/kegg/catalog/org_list.html

If your species of interest is not available then you should choose the code for the closest-related species available

-t: input file type; in this case, protein FASTA.

-o GCF_001298625.1: basename of the output files

Note

This pipeline will automatically use the output of ‘annotate’ as the -f foreground input for ‘identify’. This will also use your species option as the -b background input for ‘identify’.

Understanding Your Results

Annotate

If all goes well, you should get the following:

  • <species>.tsv: This is the tab-separated output from the BLAST search. It is unlikely that you will need to look at this file.
  • <basename>: KOBAS-annotate generates a text file with the name you provide. It has two sections (detailed below).
  • <basename>_KOBAS_acc_pathways.tsv: Our post-processing script creates this tab-separated file. It lists each accession from your data and all of the pathways to which they were annotated.
  • <basename>_KOBAS_pathways_acc.tsv: Our post-processing script creates this tab-separated file. It lists each pathway annotated to your data with all of the accessions annotated to that pathway.

The <basename> file has two sections. The first section looks like this:

#Query      Gene ID|Gene name|Hyperlink
XP_018220118.1      sce:YMR059W|SEN15|http://www.genome.jp/dbget-bin/www_bget?sce:YMR059W
XP_018221352.1      sce:YJR050W|ISY1, NTC30, UTR3|http://www.genome.jp/dbget-bin/www_bget?sce:YJR050W
XP_018224031.1      sce:YDR513W|GRX2, TTR1|http://www.genome.jp/dbget-bin/www_bget?sce:YDR513W
XP_018222559.1      sce:YFR024C-A|LSB3, YFR024C|http://www.genome.jp/dbget-bin/www_bget?sce:YFR024C-A
XP_018221254.1      sce:YJL070C||http://www.genome.jp/dbget-bin/www_bget?sce:YJL070C
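Because the Gene ID, gene name, and hyperlink in the second column are separated by pipes, this section can be sliced with awk. A small sketch, rebuilding three of the sample rows above into a made-up file name, first_section.tsv:

```shell
# Rebuild three sample rows (tab-separated; column 2 is ID|name|hyperlink).
printf 'XP_018220118.1\tsce:YMR059W|SEN15|http://www.genome.jp/dbget-bin/www_bget?sce:YMR059W\n'             >  first_section.tsv
printf 'XP_018221352.1\tsce:YJR050W|ISY1, NTC30, UTR3|http://www.genome.jp/dbget-bin/www_bget?sce:YJR050W\n' >> first_section.tsv
printf 'XP_018221254.1\tsce:YJL070C||http://www.genome.jp/dbget-bin/www_bget?sce:YJL070C\n'                  >> first_section.tsv

# Print each query with its gene name; rows with no gene name print blank.
awk -F'\t' '{ split($2, f, "|"); print $1 "\t" f[2] }' first_section.tsv
```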

The second section consists of records separated by lines of slashes (////) and looks like this:

////
Query:                  XP_018222878.1
Gene:                   sce:YDL220C     CDC13, EST4
Entrez Gene ID:         851306
////
Query:                  XP_018219412.1
Gene:                   sce:YOR204W     DED1, SPP81
Entrez Gene ID:         854379
Pathway:                Innate Immune System    Reactome        R-SCE-168249
                        Immune System   Reactome        R-SCE-168256
                        Neutrophil degranulation        Reactome        R-SCE-6798695
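Since each record in this section starts with a Query: line, line-oriented tools can extract fields without parsing the //// separators. A sketch that rebuilds the two sample records above (the file name second_section.txt is made up) and pairs each query with its Entrez Gene ID:

```shell
# Rebuild the two sample records shown above.
{
  echo '////'
  echo 'Query:                  XP_018222878.1'
  echo 'Gene:                   sce:YDL220C     CDC13, EST4'
  echo 'Entrez Gene ID:         851306'
  echo '////'
  echo 'Query:                  XP_018219412.1'
  echo 'Gene:                   sce:YOR204W     DED1, SPP81'
  echo 'Entrez Gene ID:         854379'
} > second_section.txt

# Remember the current query; print it when its Entrez Gene ID appears.
awk '/^Query:/ { q = $2 } /^Entrez Gene ID:/ { print q "\t" $4 }' second_section.txt
```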

<basename>_KOBAS_acc_pathways.tsv looks like this:

XP_018220118.1      BioCyc:PWY-6689
XP_018221352.1      Reactome:R-SCE-6782135,KEGG:sce03040,Reactome:R-SCE-73894,Reactome:R-SCE-5696398,Reactome:R-SCE-6782210,Reactome:R-SCE-6781827
XP_018224031.1      BioCyc:GLUT-REDOX-PWY,BioCyc:PWY3O-592
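Because the pathway list in the second column is comma-separated, a one-line awk gives the number of pathways per accession. A sketch over two of the sample rows above (acc_pathways.tsv is a stand-in for your real <basename>_KOBAS_acc_pathways.tsv):

```shell
# Rebuild two sample rows (tab-separated; column 2 is a comma-separated list).
printf 'XP_018220118.1\tBioCyc:PWY-6689\n'                                           >  acc_pathways.tsv
printf 'XP_018221352.1\tReactome:R-SCE-6782135,KEGG:sce03040,Reactome:R-SCE-73894\n' >> acc_pathways.tsv

# split() returns the number of comma-separated pathway entries.
awk -F'\t' '{ n = split($2, p, ","); print $1 "\t" n }' acc_pathways.tsv
```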

<basename>_KOBAS_pathways_acc.tsv looks like this:

BioCyc:PWY3O-0  XP_018222002.1,XP_018222589.1
KEGG:sce00440   XP_018222406.1,XP_018219751.1,XP_018222229.1
Reactome:R-SCE-416476   XP_018223583.1,XP_018221814.1,XP_018222685.1,XP_018220832.1,XP_018219073.1,XP_018218776.1,XP_018223466.1,XP_018223545.1,XP_018222256.1
Reactome:R-SCE-418346   XP_018220070.1,XP_018221774.1,XP_018221826.1,XP_018220071.1,XP_018222218.1,XP_018220541.1,XP_018219550.1
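The same trick ranks pathways by how many of your accessions they contain. A sketch over two of the sample rows above (pathways_acc.tsv is a stand-in for your real <basename>_KOBAS_pathways_acc.tsv):

```shell
# Rebuild two sample rows (tab-separated; column 2 lists the accessions).
printf 'BioCyc:PWY3O-0\tXP_018222002.1,XP_018222589.1\n'               >  pathways_acc.tsv
printf 'KEGG:sce00440\tXP_018222406.1,XP_018219751.1,XP_018222229.1\n' >> pathways_acc.tsv

# Count accessions per pathway, then sort most-populated first.
awk -F'\t' '{ n = split($2, a, ","); print n "\t" $1 }' pathways_acc.tsv | sort -rn
```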

Identify

If all goes well, you should get the following:

  • <output_file_name_you_provided>: KOBAS identify generates a text file with the name you provide. It looks like this:

##Databases: PANTHER, KEGG PATHWAY, Reactome, BioCyc
##Statistical test method: hypergeometric test / Fisher's exact test
##FDR correction method: Benjamini and Hochberg

#Term   Database        ID      Input number    Background number       P-Value Corrected P-Value       Input   Hyperlink
Metabolic pathways      KEGG PATHWAY    sce01100        714     754     0.00303590229485        0.575578081959  XP_018221856.1|XP_018220917.1|XP_018222719.1|...link
Metabolism      Reactome        R-SCE-1430728   419     438     0.0147488189928 0.575578081959  XP_018221856.1|XP_018221742.1|XP_018219354.1|XP_018221740.1|...link
Immune System   Reactome        R-SCE-168256    304     315     0.0267150787723 0.575578081959  XP_018223955.1|XP_018222962.1|XP_018223268.1|XP_018222956.1|...link
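The table above is tab-separated, with the raw P-Value in column 6 and the Corrected P-Value in column 7, so significant terms can be filtered on the command line. A sketch over two simplified sample rows (identify_out.tsv is a made-up file name, and the Input column is trimmed to a single accession):

```shell
# Rebuild two simplified rows: Term, Database, ID, Input number,
# Background number, P-Value, Corrected P-Value, Input, Hyperlink.
printf 'Metabolic pathways\tKEGG PATHWAY\tsce01100\t714\t754\t0.0030\t0.5756\tXP_018221856.1\tlink\n' >  identify_out.tsv
printf 'Metabolism\tReactome\tR-SCE-1430728\t419\t438\t0.0147\t0.5756\tXP_018221856.1\tlink\n'        >> identify_out.tsv

# Skip '#' header/comment lines; keep terms with raw P-Value below 0.01.
awk -F'\t' '!/^#/ && $6 < 0.01 { print $3 "\t" $1 }' identify_out.tsv
```

In practice you would usually filter on the Corrected P-Value (column 7) instead; swap $6 for $7 in the awk condition.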

Contact us.