KOBAS on the ARS Ceres HPC

About Ceres/Scinet

Getting the databases

To run the tool you need some public data in your working directory. The files can be downloaded directly from the KOBAS homepage. These directories are also available on Ceres/Scinet under /reference/data/iplant/2019-09-16/kobas. This directory will need to be copied to your working directory, since KOBAS requires write-access to the subdirectories within this folder.

  1. seq_pep/: species-specific BLAST databases used by KOBAS
cp -r /reference/data/iplant/2019-09-16/kobas/seq_pep/ .
chmod -R 777 seq_pep
  1. sqlite3: species-specific annotation databases used by KOBAS
cp -r /reference/data/iplant/2019-09-16/kobas/sqlite3/ .
chmod -R 777 sqlite3

Note

The above commands should result in two directories (seq_pep and sqlite3) each containing many files. There is no need to unzip the .gz files.

Running KOBAS on Ceres

Running programs on Ceres/Scinet

  • You’ll need to run KOBAS either in interactive mode or batch mode.
  • For interactive mode, use the salloc command.
  • For batch mode, you’ll need to write a batch job submission bash script.

Running KOBAS in interactive mode

Loading the module

The Scinet VRSC has installed the KOBAS program. To load the module in interactive mode, run the command

module load agbase

Getting the Help and Usage Statement

kobas -h

See Help and Usage Statement

KOBAS can perform two tasks: - annotate (-a) - identify (enrichment) (-g) KOBAS can also run both task with a single command (-j).

Annotate Example Command - interactive mode

kobas -a -i AROS_10.faa -s dme -t fasta:pro -o kobas_output -k /project/nal_genomics/monica.poelchau

Annotate Example Command - batch mode

Running programs on Ceres/Scinet in batch mode

Example batch job submission bash script (e.g. kobas-job.sh):

#! /bin/bash
module load agbase
kobas -a -i AROS_10.faa -s dme -t fasta:pro -o kobas_output -k /project/nal_genomics/monica.poelchau

Submitting the batch job:

sbatch kobas-job.sh
Command Explained

-a: Tells KOBAS to run the ‘annotate’ process.

-i AROS_10.faa: input file (peptide FASTA)

-s dme: Enter the species for the species of the sequences in your input file.

Note

If you don’t know the code for your species it can be found here: https://www.kegg.jp/kegg/catalog/org_list.html

If your species of interest is not available then you should choose the code for the closest-related species available

-t: input file type; in this case, protein FASTA.

-o kobas_output: name of output file

-k /project/nal_genomics/monica.poelchau: KOBAS HOME. The path to kobas_home, which is the parent directory of sqlite3/ and seq_pep/.

For information on output files see Understanding Your Results: Annotate

Identify Example Command - interactive mode

kobas -g -f kobas_output -b dme -k /project/nal_genomics/monica.poelchau -o ident_out

Annotate Example Command - batch mode

Running programs on Ceres/Scinet in batch mode

Example batch job submission bash script (e.g. kobas-job.sh):

#! /bin/bash
module load agbase
kobas -g -f kobas_output -b dme -k /project/nal_genomics/monica.poelchau -o ident_out

Submitting the batch job:

sbatch kobas-job.sh
Command Explained

-g: Tells KOBAS to runt he ‘identify’ process.

-f kobas_output: output file from KOBAS annotate

-b dme: background; enter the species code for the species of the sequences in your input file.

Note

If you don’t know the code for your species it can be found here: https://www.kegg.jp/kegg/catalog/org_list.html

If your species of interest is not available then you should choose the code for the closest-related species available

-o ident_out: name of output file

For information on outputs see Understanding Your Results: Identify

Annotate and Identify Pipeline Example Command - interactive mode

kobas -j -i AROS_10.faa -s dme -t fasta:pro -k /project/nal_genomics/monica.poelchau -o kobas_output

Annotate Example Command - batch mode

Running programs on Ceres/Scinet in batch mode

Example batch job submission bash script (e.g. kobas-job.sh):

#! /bin/bash
module load agbase
kobas -j -i AROS_10.faa -s dme -t fasta:pro -k /project/nal_genomics/monica.poelchau -o kobas_output

Submitting the batch job:

sbatch kobas-job.sh
Command Explained

-j: Tells KOBAS to runt he ‘annotate’ process.

-i AROS_10.faa: input file (peptide FASTA)

-s dme: Enter the species for the species of the sequences in your input file.

Note

If you don’t know the code for your species it can be found here: https://www.kegg.jp/kegg/catalog/org_list.html

If your species of interest is not available then you should choose the code for the closest-related species available

-t: input file type; in this case, protein FASTA.

-o kobas_output: basename of output files

Note

This pipeline will automatically use the output of ‘annotate’ as the -f foreground input for ‘identify. This will also use your species option as the -b background input for ‘identify’.

For more information on outputs see Understanding Your Results: Annotate and Identify

Understanding Your Results

Annotate

If all goes well, you should get the following:

  • seq_pep folder: This folder contains the BLAST database files used in your analysis.
  • sqlite3 folder: This folder contains the annotation database files used in your analysis
  • <species>.tsv: This is the tab-delimited output from the BLAST search. It is unlikely that you will need to look at this file.
  • <output_file_name_you_provided>: KOBAS-annotate generates a text file with the name you provide. It has two sections.

The first sections looks like this:

##dme       Drosophila melanogaster (fruit fly)
##Method: BLAST     Options: evalue <= 1e-05
##Summary:  87 succeed, 0 fail

#Query      Gene ID|Gene name|Hyperlink
lcl|NW_020311285.1_prot_XP_012256083.1_15   dme:Dmel_CG34349|Unc-13-4B|http://www.genome.jp/dbget-bin/www_bget?dme:Dmel_CG34349
lcl|NW_020311286.1_prot_XP_020708336.1_46   dme:Dmel_CG6963|gish|http://www.genome.jp/dbget-bin/www_bget?dme:Dmel_CG6963
lcl|NW_020311285.1_prot_XP_020707987.1_39   dme:Dmel_CG30403||http://www.genome.jp/dbget-bin/www_bget?dme:Dmel_CG30403

The second section follows a dashed line and looks like this:

--------------------

////
Query:                      lcl|NW_020311285.1_prot_XP_012256083.1_15
Gene:                       dme:Dmel_CG34349        Unc-13-4B
Entrez Gene ID:             43002
////
Query:                      lcl|NW_020311286.1_prot_XP_020708336.1_46
Gene:                       dme:Dmel_CG6963 gish
Entrez Gene ID:             49701
Pathway:                    Hedgehog signaling pathway - fly        KEGG PATHWAY    dme04341
////
Query:                      lcl|NW_020311285.1_prot_XP_020707987.1_39
Gene:                       dme:Dmel_CG30403
Entrez Gene ID:             246595
////
Query:                      lcl|NW_020311285.1_prot_XP_020707989.1_40
Gene:                       dme:Dmel_CG6148 Past1
Entrez Gene ID:             41569
Pathway:                    Endocytosis     KEGG PATHWAY    dme04144
                            Hemostasis      Reactome        R-DME-109582
                            Factors involved in megakaryocyte development and platelet production   Reactome        R-DME-98323

Identify

If all goes well, you should get the following:

  • sqlite3 folder: This folder contains the annotation database files used in your analysis
  • <output_file_name_you_provided>: KOBAS identify generates a text file with the name you provide.
##Databases: PANTHER, KEGG PATHWAY, Reactome, BioCyc
##Statistical test method: hypergeometric test / Fisher's exact test
##FDR correction method: Benjamini and Hochberg

#Term       Database        ID      Input number    Background number       P-Value Corrected P-Value       Input   Hyperlink
Hedgehog signaling pathway - fly    KEGG PATHWAY    dme04341        12      33      3.20002656734e-18       1.76001461204e-16       lcl|NW_020311286.1_prot_XP_012256678.1_51|lcl|NW_020311286.1_prot_XP_025602973.1_48|lcl|NW_020311286.1_prot_XP_012256683.1_52|lcl|NW_020311286.1_prot_XP_012256679.1_55|lcl|NW_020311286.1_prot_XP_012256674.1_54|lcl|NW_020311286.1_prot_XP_020708336.1_46|lcl|NW_020311285.1_prot_XP_012256108.1_32|lcl|NW_020311286.1_prot_XP_012256682.1_53|lcl|NW_020311286.1_prot_XP_025603025.1_47|lcl|NW_020311286.1_prot_XP_020708334.1_49|lcl|NW_020311285.1_prot_XP_012256109.1_33|lcl|NW_020311286.1_prot_XP_020708333.1_50 http://www.genome.jp/kegg-bin/show_pathway?dme04341/dme:Dmel_CG6963%09red/dme:Dmel_CG6054%09red
Hedgehog signaling pathway  PANTHER P00025  6       13      3.6166668094e-10        9.94583372585e-09       lcl|NW_020311286.1_prot_XP_025602279.1_78|lcl|NW_020311286.1_prot_XP_025602289.1_76|lcl|NW_020311286.1_prot_XP_025602264.1_79|lcl|NW_020311285.1_prot_XP_012256108.1_32|lcl|NW_020311285.1_prot_XP_012256109.1_33|lcl|NW_020311286.1_prot_XP_012256943.1_77     http://www.pantherdb.org/pathway/pathwayDiagram.jsp?catAccession=P00025
Signaling by NOTCH2 Reactome        R-DME-1980145   3       8       2.00259649553e-05       0.000275357018136       lcl|NW_020311285.1_prot_XP_012256118.1_28|lcl|NW_020311285.1_prot_XP_012256117.1_27|lcl|NW_020311285.1_prot_XP_012256119.1_26   http://www.reactome.org/cgi-bin/eventbrowser_st_id?ST_ID=R-DME-1980145

Annotate and Identify Pipeline

If all goes well, you should get the following:

  • logs folder: This folder contains the ‘conder_stderr’ and ‘condor_stdout’ files. The files record feedback, progress and, importantly, any errors the app encountered during the analysis. You won’t normally need to look at these but they are very helpful in figuring out what may have happened if your output doesn’t look like you expected.
  • sqlite3 folder: This folder contains the annotation database files used in your analysis
  • seq_pep folder: This folder contains the BLAST database files used in your analysis.
  • <species>.tsv: This is the tab-delimited output from the BLAST search. It is unlikely that you will need to look at this file.
  • <basename>_annotate_out.txt: KOBAS annotate generates a text file with the name you provide. It has two sections.

The first sections looks like this:

##dme       Drosophila melanogaster (fruit fly)
##Method: BLAST     Options: evalue <= 1e-05
##Summary:  87 succeed, 0 fail

#Query      Gene ID|Gene name|Hyperlink
lcl|NW_020311285.1_prot_XP_012256083.1_15   dme:Dmel_CG34349|Unc-13-4B|http://www.genome.jp/dbget-bin/www_bget?dme:Dmel_CG34349
lcl|NW_020311286.1_prot_XP_020708336.1_46   dme:Dmel_CG6963|gish|http://www.genome.jp/dbget-bin/www_bget?dme:Dmel_CG6963
lcl|NW_020311285.1_prot_XP_020707987.1_39   dme:Dmel_CG30403||http://www.genome.jp/dbget-bin/www_bget?dme:Dmel_CG30403

The second section follows a dashed line and looks like this:

--------------------

////
Query:                      lcl|NW_020311285.1_prot_XP_012256083.1_15
Gene:                       dme:Dmel_CG34349        Unc-13-4B
Entrez Gene ID:             43002
////
Query:                      lcl|NW_020311286.1_prot_XP_020708336.1_46
Gene:                       dme:Dmel_CG6963 gish
Entrez Gene ID:             49701
Pathway:                    Hedgehog signaling pathway - fly        KEGG PATHWAY    dme04341
////
Query:                      lcl|NW_020311285.1_prot_XP_020707987.1_39
Gene:                       dme:Dmel_CG30403
Entrez Gene ID:             246595
////
Query:                      lcl|NW_020311285.1_prot_XP_020707989.1_40
Gene:                       dme:Dmel_CG6148 Past1
Entrez Gene ID:             41569
Pathway:                    Endocytosis     KEGG PATHWAY    dme04144
                            Hemostasis      Reactome        R-DME-109582
                            Factors involved in megakaryocyte development and platelet production   Reactome        R-DME-98323mel_CG6963
lcl|NW_020311285.1_prot_XP_020707987.1_39   dme:Dmel_CG30403||http://www.genome.jp/dbget-bin/www_bget?dme:Dmel_CG30403
  • <basename>_identify_out.txt: KOBAS identify generates a text file with the name you provide.
##Databases: PANTHER, KEGG PATHWAY, Reactome, BioCyc
##Statistical test method: hypergeometric test / Fisher's exact test
##FDR correction method: Benjamini and Hochberg

#Term       Database        ID      Input number    Background number       P-Value Corrected P-Value       Input   Hyperlink
Hedgehog signaling pathway - fly    KEGG PATHWAY    dme04341        12      33      3.20002656734e-18       1.76001461204e-16       lcl|NW_020311286.1_prot_XP_012256678.1_51|lcl|NW_020311286.1_prot_XP_025602973.1_48|lcl|NW_020311286.1_prot_XP_012256683.1_52|lcl|NW_020311286.1_prot_XP_012256679.1_55|lcl|NW_020311286.1_prot_XP_012256674.1_54|lcl|NW_020311286.1_prot_XP_020708336.1_46|lcl|NW_020311285.1_prot_XP_012256108.1_32|lcl|NW_020311286.1_prot_XP_012256682.1_53|lcl|NW_020311286.1_prot_XP_025603025.1_47|lcl|NW_020311286.1_prot_XP_020708334.1_49|lcl|NW_020311285.1_prot_XP_012256109.1_33|lcl|NW_020311286.1_prot_XP_020708333.1_50 http://www.genome.jp/kegg-bin/show_pathway?dme04341/dme:Dmel_CG6963%09red/dme:Dmel_CG6054%09red
Hedgehog signaling pathway  PANTHER P00025  6       13      3.6166668094e-10        9.94583372585e-09       lcl|NW_020311286.1_prot_XP_025602279.1_78|lcl|NW_020311286.1_prot_XP_025602289.1_76|lcl|NW_020311286.1_prot_XP_025602264.1_79|lcl|NW_020311285.1_prot_XP_012256108.1_32|lcl|NW_020311285.1_prot_XP_012256109.1_33|lcl|NW_020311286.1_prot_XP_012256943.1_77     http://www.pantherdb.org/pathway/pathwayDiagram.jsp?catAccession=P00025
Signaling by NOTCH2 Reactome        R-DME-1980145   3       8       2.00259649553e-05       0.000275357018136       lcl|NW_020311285.1_prot_XP_012256118.1_28|lcl|NW_020311285.1_prot_XP_012256117.1_27|lcl|NW_020311285.1_prot_XP_012256119.1_26   http://www.reactome.org/cgi-bin/eventbrowser_st_id?ST_ID=R-DME-1980145

Contact us.