SIMAP - Similarity Matrix of Proteins

Flatfile dumps of protein domains from InterPro databases and SIMAP2GO function predictions

This folder contains tab-delimited flatfiles of the proteins in the sequence databases PDB, GenBank, RefSeq and UniProt, their GO terms as predicted by BLAST2GO (based on GO lite version 2012-02-11), and their protein domains calculated from the recent InterPro databases (InterPro release 36.0), SignalP 3.0, TargetP 1.1 and TMHMM 2.0.

The folder "environmental" contains similar information for ORFs from all environmental sequences in the Whole Genome Shotgun (wgs) section of GenBank, from the JCVI GOS project and from IMG/M.

GO annotations are also available at http://bioinfo.cipf.es/b2gfar/ for many proteomes, or for all further datasets upon request (mail to thomas.rattei@univie.ac.at).

Description of file formats:

sequences.gz (Non-redundant amino acid sequences)
Columns:
- md5 (128-bit MD5 hash of the sequence in upper case letters)
- sequence (amino acid sequence; lower case characters indicate low complexity regions)

proteins.gz (Proteins)
Columns:
- md5 (128-bit MD5 hash of the sequence in upper case letters)
- name (protein name as in the originating database)
- taxonomy ID
- database name

blast2go.gz (GO annotations)
Columns:
- md5 (128-bit MD5 hash of the sequence in upper case letters)
- GO term ID
- BLAST2GO score

features_xxx.gz (Domains from InterPro database xxx, according to the tab-delimited InterPro raw file format)
Columns:
- md5 (128-bit MD5 hash of the sequence in upper case letters)
- CRC64 (CRC64 hash of the sequence in upper case letters)
- length of the protein sequence
- name of the database
- name of the feature
- description of the feature
- begin (begin of the hit on the protein sequence)
- end (end of the hit on the protein sequence)
- e-value (if available)
- true positive flag
- date
- name of the InterPro assignment (if assigned)
- description of the InterPro assignment (if assigned)
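
Example: reading the flatfiles

The following is a minimal Python sketch of how the files described above could be read, not an official SIMAP tool. It assumes that each file is gzip-compressed text with one record per line, tab-separated columns in the order listed above, and no header line. The dictionary keys (taxonomy_id, go_term, and so on) and the function names read_flatfile and go_terms_by_md5 are illustrative choices, not names defined by SIMAP.

```python
import gzip

# Column order as given in the format description. The key names are
# illustrative; the files themselves are assumed to carry no header line.
COLUMNS = {
    "sequences": ["md5", "sequence"],
    "proteins": ["md5", "name", "taxonomy_id", "database"],
    "blast2go": ["md5", "go_term", "score"],
    "features": [
        "md5", "crc64", "length", "database", "feature_name",
        "feature_description", "begin", "end", "evalue",
        "true_positive", "date", "interpro_name", "interpro_description",
    ],
}


def read_flatfile(path, kind):
    """Yield one dict per line of a gzip-compressed, tab-delimited flatfile.

    All values are returned as strings; convert numeric columns as needed.
    """
    columns = COLUMNS[kind]
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue  # skip blank lines
            # zip() stops at the shorter input, so missing trailing
            # columns are simply absent from the resulting dict.
            yield dict(zip(columns, line.split("\t")))


def go_terms_by_md5(path):
    """Group the GO term IDs in a blast2go.gz file by sequence md5."""
    terms = {}
    for record in read_flatfile(path, "blast2go"):
        terms.setdefault(record["md5"], []).append(record["go_term"])
    return terms
```

Because every file is keyed by the md5 column, the result of go_terms_by_md5("blast2go.gz") can be combined with records from read_flatfile("proteins.gz", "proteins") to list the predicted GO terms for each protein name. For the full dumps, which may be large, iterating record by record rather than loading whole files into memory is likely the safer approach.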