EPA is a Web-based tool for accessing, assembling and annotating Expressed Sequence Tag (EST) collections, with a strong emphasis on communication between user and system. EPA graphically integrates the output of different sequence analysis algorithms, ranging from base-calling to functional annotation, providing an information-rich and interactive environment for high-quality sequence annotation.
EPA is composed of two modules that are launched successively: the clustering module (based on stackPACK(1)) and the annotation module.
In EPA, data access, processing and annotation are structured on a project basis. A project is a collection of ESTs that can be progressively extended by successive addition of new sequences.
Submission and access to the data are protected by a login/password system in order to ensure confidentiality and integrity of the data. Login and password can be obtained at URL: http://cbi.labri.fr/outils/EPA/access.php
Some information about the project, such as a brief description and the format of the sequence files, is required to obtain an account.
Once the annotation of the sequences is completed, the data can be made accessible to the scientific community (if the project's owner wishes) for consultation on the web server. Several completed projects are already viewable at: http://cbi.labri.fr/outils/SAM2/COMPLETE/index.php
This document describes the two modules of the EPA pipeline (the clustering module and the annotation module) and the interface to public data sets.
CLUSTERING MODULE
Figure 1 : Steps of the EPA pipeline from sequence submission to annotation. First, a preprocessing step is run. Then all the steps of the stackPACK(1) software are launched: clustering, creation of contigs, alignment and creation of consensus sequences. Finally, the annotation module is used to annotate all the sequences of the project.
The user is invited to create a project and submit ESTs to the EPA pipeline either as trace files (ABI or MEGABACE format) or raw sequences (in FASTA format) with the corresponding PHRED format quality files.
In the former case, an initial processing of the sequences (base-calling and trimming using PHRED(2), vector masking using cross_match, and vector removal using our own Perl script) is launched automatically.
The results, sequences in FASTA format and base quality files, are sent to the user via email and are also used automatically as inputs for the assembly process.
In the latter case, the assembly process starts directly once data integrity has been checked.
Sequences and associated base quality files are first imported into the project database.
Masking of repeated sequences, clustering and assembly are then chained automatically to generate consensus sequences using stackPACK.
Briefly, repeated sequences (e.g. low-complexity portions of sequences such as polypurine tracts, AT-rich regions and simple tandem repeats) are masked using RepeatMasker(3) to avoid problems in the clustering and assembly steps.
Groups or clusters of sequences with strong overlaps (similarity cutoff of 0.96 over a window of 100 bases) are then generated using the d2-cluster(4) clustering algorithm.
Within each cluster, contigs are established using the assembly software Phrap(5).
To account for sequence variation (alternative splicing, gene family members, allelic polymorphisms) CRAW(6) is then used to produce primary consensus and sub-consensus.
A gene index comprising the primary consensus and singletons (ESTs which could not be joined with other sequences during the clustering process) is then transferred to the annotation module.
The user receives an email summarizing the main results of the assembly procedure and is invited to start the annotation procedure.
The StackPACK annotation module (SAM) has been developed with the aim of keeping stackPACK's core database infrastructure and its assembly results, and supplementing them with a complementary infrastructure dedicated to storing and managing several types of functional annotations.
The SAM interface integrates interactive tools to visualize stackPACK's assembly, a workflow manager for launching several analysis tools (see Analysis part of the menu) and an annotation interface that mines all the data resulting from these analyses (see Annotation part of the menu).
This interface allows singletons and primary consensus sequences to be annotated precisely, in a semi-automatic way, using sequence homology tools (like BLAST(7)) and domain / repeat detection tools (see Analysis part of the menu).
Access to a project is restricted by a login/password management system based on privilege levels. Four levels of privilege exist:
consultation
consultation-annotation
consultation-annotation-delete
admin
The software administrator assigns a privilege level to each user when creating the account through the user administration interface.
The menu of SAM (Figure 2) lists the options available to the user according to his level of privileges.
Each important option of this menu is described in this document, together with the privilege level required.
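The four privilege levels listed above can be modelled as an ordered enumeration. The following is a minimal sketch, not SAM's actual implementation: it assumes that the levels are cumulative (each level includes the rights of the previous one), which the level names and the "minimum level required" wording suggest but the document does not state explicitly. The names `Privilege` and `is_allowed` are hypothetical.

```python
from enum import IntEnum


class Privilege(IntEnum):
    """Privilege levels, ordered from least to most permissive (assumed cumulative)."""
    CONSULTATION = 1
    CONSULTATION_ANNOTATION = 2
    CONSULTATION_ANNOTATION_DELETE = 3
    ADMIN = 4


def is_allowed(user_level: Privilege, minimum_required: Privilege) -> bool:
    """Return True if the user's level meets the minimum level required by an option."""
    return user_level >= minimum_required


# Example: the Heritage option requires consultation-annotation-delete.
can_use_heritage = is_allowed(Privilege.CONSULTATION_ANNOTATION,
                              Privilege.CONSULTATION_ANNOTATION_DELETE)
```

With this ordering, deciding whether a menu entry is displayed reduces to a single comparison against the minimum level documented for that entry.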
Home Page / Menu
This is the page reached by the user after entering the assigned login/password once the clustering step is completed.
Figure 2 : Home page of StackPACK Annotation Module. Available information about the project: the total number of sequences, the number of clusters, contigs and singletons, etc. Graphical bars show the distribution of annotated singletons and annotated consensus sequences. A menu gives the user access to all the components that his privilege level permits.
Figure 3 : Analysis launcher interface. A table displays all the analyses already configured. Each track represents a tool that can be launched (if it has not yet completed or if an update must be processed).
This part allows the user to manage the analyses that might be launched in order to provide functional annotations for the sequences of his project.
A set of analysis tools is already integrated in the software, and the implementation of the pipeline allows easy integration of additional tools. Here is a list of the available tools that can be launched with consensus and singleton sequences:
BLAST of the sequences against several databanks ordered in a multi-banks strategy (see BLAST multi-banks principle below).
BLAST of the sequences against one bank chosen from the bank list.
Search for protein domains with Prosite(8).
Search for transmembrane domains with TMHMM(9).
Ontology tools like Gene Ontology(10) or Funcat(11).
Search for repeated elements with SPUTNIK(12).
Links to stackPACK alignment images.
Annotation history of sequences.
The first step in launching an analysis is to create it (by selecting the dedicated tool in the Add a new Analysis menu) and then to enter its parameters in the configuration form (see Figure 4).
Figure 4 : Multi-banks analysis configuration page. For each bank of the strategy, the e-value threshold and the Blast type are chosen.
BLAST multi-banks principle:
- A hierarchical list of databanks and their dedicated e-value thresholds is entered in the configuration form.
- A BLAST against the first databank is run and, if no result satisfies the first threshold condition, the second bank is used.
- As long as there are no results, the tool moves on to the next bank, until the last one. As soon as a result is found, the tool stops and keeps this result.
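The multi-banks principle described above can be sketched as a simple fall-through loop. This is an illustrative sketch only: the `search` callback, the `(subject_id, e_value)` hit shape and the example bank names and values are assumptions, and "satisfying the threshold" is taken to mean an e-value lower than or equal to the bank's threshold.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A hit is modelled as (subject_id, e_value); this shape is an assumption.
Hit = Tuple[str, float]
# Hypothetical search callback: (query_sequence, bank_name) -> list of hits.
SearchFn = Callable[[str, str], List[Hit]]


def multi_bank_search(query: str,
                      strategy: Sequence[Tuple[str, float]],
                      search: SearchFn) -> Optional[Tuple[str, List[Hit]]]:
    """Try each (bank, e-value threshold) in order; stop at the first bank with a valid hit."""
    for bank, threshold in strategy:
        hits = [hit for hit in search(query, bank) if hit[1] <= threshold]
        if hits:
            # A result was found: keep it and do not query the remaining banks.
            return bank, sorted(hits, key=lambda hit: hit[1])
    return None  # no bank produced a hit under its threshold


def fake_search(query: str, bank: str) -> List[Hit]:
    """Stand-in for a real BLAST run, returning canned hits (illustrative values)."""
    canned = {"SPROT": [("P12345", 1e-3)], "TREMBL": [("Q99999", 1e-20)]}
    return canned.get(bank, [])


strategy = [("SPROT", 1e-10), ("TREMBL", 1e-5)]
result = multi_bank_search("ACGT", strategy, fake_search)
```

In this example the only hit in the first bank does not pass its threshold, so the search falls through to the second bank and stops there.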
When the analysis is configured and the parameters saved, two new lines appear in the analysis launcher interface: the first one launches the analysis for the contigs (using the consensus sequences) and the second one launches the analysis for the singletons.
The second step simply consists of launching the analysis from the launcher interface.
The analysis launcher allows the user to stop an analysis and relaunch it at any time, so that priority can be given to the most urgent analysis. The progress of each analysis is shown by graphical bars indicating the percentage of the analysis that has been completed.
Once completed, an analysis cannot be relaunched unless a databank update or an assembly update has been performed (see Update section).
Figure 5 : Unannotated consensus listing page. Each button gives access to the annotation interface of the unannotated consensus indicated
This part gathers all the sequences (consensus or singleton) that are not yet annotated.
Each sequence is linked to the annotation interface through a button:
(click on element to open the dedicated pop up)
Several tools and a graphical map (Figure 6) are available for the user in order to choose the best annotation.
The graphical map displays an overview of the alignment of the consensus with its ESTs and the BLAST results.
Links giving access to the results of all the analyses are provided in the analysis field and on every panel representing an analysis hit on the graphical map.
Figure 6 : Graphical part of annotation interface. A map displays the consensus aligned with its ESTs and the BLAST results. Links giving access to the HTML output of BLAST analyses and to stackPACK information are available.
The correspondence between the BLAST results of a consensus sequence and those of its ESTs is checked.
The first three hits of each EST must be the same as the first three hits of the consensus sequence.
A graphical marker near the EST panel shows this correspondence (see the graphical map in Figure 6):
Green if the results are the same
Orange if they are not the same
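The green/orange marker described above amounts to comparing the first three hits of an EST with those of its consensus. The sketch below assumes that hits are identified by their subject identifier and that "the same" means the same identifiers in the same order; the document does not say whether order matters, so this is an interpretation. The function name `hit_marker` is hypothetical.

```python
from typing import Sequence


def hit_marker(consensus_hits: Sequence[str],
               est_hits: Sequence[str],
               top_n: int = 3) -> str:
    """Return 'green' if the first top_n hits match (same order), else 'orange'."""
    same = list(est_hits[:top_n]) == list(consensus_hits[:top_n])
    return "green" if same else "orange"


# Illustrative identifiers, not taken from a real project.
consensus = ["P1", "P2", "P3", "P4"]
marker_same = hit_marker(consensus, ["P1", "P2", "P3", "P9"])
marker_diff = hit_marker(consensus, ["P1", "P3", "P2"])
```

If order should be ignored, comparing `set(...)` of the two slices instead of the lists would be the corresponding variant.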
The BLAST results of the consensus sequence are listed in a drop-down menu (see Propositions field in Figure 7). One of these results can be selected, transferred to the annotation field and then modified. Ontology searches can then be performed using the terms of the annotation field.
Figure 7 : Annotation of consensus interface. The BLAST results of the consensus listed in the drop-down menu can be transferred to the annotation field and completed there.
Once the annotation and the ontology terms have been chosen with the help of the tool results, a submit icon allows the user to validate them, selecting a high or a low quality level according to his confidence in these data. In both cases, once validated, the annotation is stored and can be retrieved later. If the user is not fully confident about its validity, a low quality must be selected, so that the analysis can be relaunched when the databank is updated in order to complete the annotation. A high quality is selected only when the annotation is definitive.
The same annotation interface (Figure 5, Figure 6, Figure 7) exists to annotate the singletons.
Each validated annotation is then stored in the database and can be retrieved and downloaded (see Tools section of the menu).
Consultation
Information on the consensus sequences and singletons already annotated can be viewed through a consultation interface offering several search criteria. Several actions, depending on the privilege level, can be performed on the annotations, from viewing them to deleting them.
Figure 8.a : Consultation page of annotated consensus (with consultation level). All the annotated sequences are listed with their annotations. In this case (consultation level), a pop-up displaying a summary concerning the annotation of each sequence is available by clicking on the link over its accession. A pop-up with more details and a graphical map of the alignment is opened when clicking on the Details button. A high quality annotation is represented by a green indicator while a low quality annotation is represented by a red indicator. Sequences can be selected according to their accession or their annotation quality and keywords matching with their annotation.
Figure 8.b : Consultation page of annotated consensus (with consultation-annotation-delete level). All the annotated sequences are listed with their annotations. A pop-up displaying a summary concerning the annotation of each sequence is available by clicking on the link over its accession. In this case (consultation-annotation-delete level), the annotation interface of one sequence is opened when clicking on the Modify button and the annotation can be erased when pushing the Delete button. A high quality annotation is represented by a green indicator while a low quality annotation is represented by a red indicator. Sequences can be selected according to their accession or their annotation quality and keywords matching with their annotation.
Figure 8.c : Consultation page of annotated consensus (with consultation-annotation-delete level, a quality chosen and an annotation keyword entered). All the annotated sequences are listed with their annotations. A pop-up displaying a summary concerning the annotation of each sequence is available by clicking on the link over its accession. In this case (consultation-annotation-delete level), the annotation interface of one sequence is opened when clicking on the Modify button and the annotation can be erased when pushing the Delete button. A high quality annotation is represented by a green indicator while a low quality annotation is represented by a red indicator. Sequences can be selected according to their accession or their annotation quality (high in this example) and keywords matching with their annotation (ribosomal in this example).
Each button gives access to a different action on one annotated consensus sequence. Depending on the user's privilege level, a button is shown or hidden. (Clickable elements open the dedicated pop-up.)
Minimum privilege level required: consultation
Depending on the user's privilege level, two windows can be opened by clicking this button:
Each modification or deletion creates an entry in the annotation history of the consensus.
The same consultation interface (Figure 8.a, Figure 8.b, Figure 8.c) exists to browse the singleton annotations.
Update
Two update cases exist, each requiring some analyses to be relaunched: the BLAST databank update, which only concerns the BLAST analyses whose databank release has been updated, and the assembly update, which concerns all the analyses when a new batch of ESTs is added to a project.
The system detects these events and adapts the update strategy to the case :
- In a BLAST databank update case, the user is allowed to launch the BLAST analyses (single bank or multi-bank) with the new release detected.
- In an assembly update case, the results of all the sequences affected by the new clustering and assembly are erased (for example, the primary consensus sequence of a contig that integrates new ESTs may change), and the launching action (see Analysis section) is made active again.
When a new release of a databank is installed, the BLAST analyses concerning this bank must be relaunched. The user is informed of this situation as soon as he reaches the home page of the project after logging in (see Figure 9).
Figure 9 : Home page when a BLAST databank update is detected. A warning message (Updated BLAST databank available) informs the user of the event.
The analysis interface allows the user to relaunch these analyses. Only unannotated sequences and sequences with a low quality annotation are resubmitted.
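The resubmission rule stated above (only unannotated and low quality annotated sequences are resubmitted after a databank update) can be expressed as a simple filter. This is a sketch under assumed data shapes: annotation quality is represented here as `"high"`, `"low"` or `None` for an unannotated sequence, and the accession names are illustrative.

```python
from typing import Dict, List, Optional


def sequences_to_resubmit(quality_by_accession: Dict[str, Optional[str]]) -> List[str]:
    """Select sequences that are unannotated (None) or annotated with low quality."""
    return [accession
            for accession, quality in quality_by_accession.items()
            if quality is None or quality == "low"]


project = {"cn1": "high", "cn2": "low", "cn3": None, "s1": "high"}
to_resubmit = sequences_to_resubmit(project)
```

Sequences with a high quality annotation are left out, consistent with the rule that a high quality is chosen only when the annotation is definitive.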
Figure 10 : Analysis launcher in databank update case. In this example, the SPROT databank is updated. The BLAST analysis on consensus sequences has already been performed with the previous release, but the update mode switches the Action to Launch (in normal mode, once an analysis has been performed, the Action column is empty).
When a BLAST in Blast databank update mode is completed, the user can visualize the list of the sequences whose hits are modified after the bank update (Figure 11).
Figure 11 : Updated sequences listing page. Each Summary button gives access to the BLAST results of one of the sequences whose hits have changed after a BLAST databank update. Delete buttons remove an entry from the update list (once it has been reannotated, for example), so the user can see which sequences remain to be reannotated.
Selecting a sequence (by clicking the button) gives access to a page (Figure 12) showing the first two BLAST hits before and after the databank update.
The user can then reannotate these sequences using the annotation interface.
Figure 12 : BLAST databank update: comparison of BLAST hits before update versus BLAST hits after update. Information such as the update date, the bank releases before/after update and the first two hits (with their expect values) before/after update is shown to provide all the data necessary to decide whether a reannotation is necessary.
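Building the list of sequences whose hits changed after a databank update can be sketched as a comparison of the first two hits before and after the update. The code below is an assumption about how such a list could be derived, not a description of SAM's internals; hits are represented as ordered lists of subject identifiers, and a sequence with no previous results is treated as changed.

```python
from typing import Dict, List


def sequences_with_changed_hits(before: Dict[str, List[str]],
                                after: Dict[str, List[str]],
                                top_n: int = 2) -> List[str]:
    """Return accessions whose first top_n hits differ between the two releases."""
    changed = []
    for accession, new_hits in after.items():
        old_hits = before.get(accession, [])  # no earlier result counts as a change
        if new_hits[:top_n] != old_hits[:top_n]:
            changed.append(accession)
    return changed


# Illustrative data: only cn2 has a different pair of top hits after the update.
hits_before = {"cn1": ["A", "B", "C"], "cn2": ["X", "Y"]}
hits_after = {"cn1": ["A", "B", "D"], "cn2": ["Z", "X"]}
updated = sequences_with_changed_hits(hits_before, hits_after)
```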
Assembly update
A new batch of ESTs can be incorporated into an existing project, launching a new processing run from base-calling to annotation. Already assembled sequences and the new ESTs are clustered together and re-assembled. Consensus sequences whose EST composition is unaffected or only slightly affected by the new assembly automatically inherit the existing annotation, while new or updated consensus sequences are identified by the system and flagged in order to undergo a new round of analysis and annotation. Throughout this annotation process, a record of all successive functional assignments is kept.
Tools
Minimum privileges level required: consultation
Figure 13 : Additional tools page. Several tools for sequence retrieval and statistical data are available on this page.
STATISTICS
Figure 14 : Annotations statistics menu. Statistics concerning several criteria of the annotations are available (statistical distribution of high quality annotated sequences / low quality annotated sequences / unannotated sequences, distribution of annotation keywords and of ontology terms).
Figure 15 : Annotation quality chart. A graphical representation of high quality annotated consensus sequences / low quality annotated consensus sequences / unannotated consensus sequences. The same representation exists for the singletons.
Figure 16 : Keywords distribution in annotations. A graphical representation of sequences containing a given keyword in their annotation is shown.
SEARCH
This part provides a tool for searching for sequences by combining several annotation criteria or ontologies. Regular expressions are used for retrieval on the annotation and comments fields.
Figure 17 : Sequence retrieving tools. Retrieving criteria : annotations and ontology terms, stackPACK accessions, EMBL accessions (if the sequences have been submitted to EMBL database), annotator name, annotation date, annotation quality, metabolic pathways involved.
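As an illustration of regular-expression retrieval on the annotation and comments fields, the sketch below filters a list of records with Python's `re` module. The record layout (dictionaries with `accession`, `annotation` and `comments` keys) and the case-insensitive matching are assumptions; the regular expression dialect actually used by the tool is not specified in this document.

```python
import re
from typing import Dict, Iterable, List, Sequence


def search_annotations(records: Iterable[Dict[str, str]],
                       pattern: str,
                       fields: Sequence[str] = ("annotation", "comments")) -> List[str]:
    """Return accessions of records where the pattern matches any of the given fields."""
    regex = re.compile(pattern, re.IGNORECASE)  # case-insensitivity is an assumption
    return [record["accession"]
            for record in records
            if any(regex.search(record.get(field) or "") for field in fields)]


records = [
    {"accession": "cn1", "annotation": "60S ribosomal protein L3", "comments": ""},
    {"accession": "cn2", "annotation": "kinase", "comments": "possible Ribosomal contamination"},
    {"accession": "s1", "annotation": "unknown"},
]
matches = search_annotations(records, r"ribosom(e|al)")
```

A pattern such as `ribosom(e|al)` retrieves sequences whose annotation or comment mentions either form of the keyword.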
DATA
Data such as the annotation list, unigene and large cluster annotations are available for download (in different formats such as FASTA, Excel or HTML) or visualization.
HERITAGE
Minimum privileges level required: consultation-annotation-delete
A project can inherit the entire set of annotations of one or several selected project(s).
Sequence accessions and clustering groups must be the same in all the projects concerned for the annotation to be conserved.
This option is very useful, for example, when a project whose sequences are already annotated consists of a subset of the ESTs of another, bigger project. If the owner of the two projects wants to carry the smaller project's annotations over to the bigger project, he just has to select the smaller project in the heritage interface of the bigger project. All the sequences for which the clustering step gives the same results (sequences with singleton status in both projects and contigs with the same EST composition in both projects) keep the same annotation as in the model project.
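The inheritance rule can be sketched by matching groups on their exact EST composition. This is a simplified model under stated assumptions: each project is represented as a mapping from a group identifier (contig or singleton) to the accessions of its ESTs, a singleton being a group with a single EST, and the function and variable names are hypothetical.

```python
from typing import Dict, Iterable


def inherit_annotations(model_groups: Dict[str, Iterable[str]],
                        model_annotations: Dict[str, str],
                        target_groups: Dict[str, Iterable[str]]) -> Dict[str, str]:
    """Copy annotations to target groups whose EST composition matches a model group."""
    # Index model groups by their (order-independent) set of EST accessions.
    by_composition = {frozenset(ests): group_id
                      for group_id, ests in model_groups.items()}
    inherited = {}
    for group_id, ests in target_groups.items():
        source = by_composition.get(frozenset(ests))
        if source is not None and source in model_annotations:
            inherited[group_id] = model_annotations[source]
    return inherited


# Illustrative projects: est3 is a singleton in the model but joins a contig in the target.
model_groups = {"cn1": ["est1", "est2"], "s1": ["est3"]}
model_annotations = {"cn1": "ribosomal protein L3", "s1": "unknown"}
target_groups = {"cnA": ["est2", "est1"], "cnB": ["est3", "est4"], "sX": ["est5"]}
inherited = inherit_annotations(model_groups, model_annotations, target_groups)
```

Only the contig with an identical EST composition keeps its annotation; the EST that changed from singleton to contig member does not transmit one.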
Figure 18 : Home page of the public version of EPA's annotation module: SAM COMPLETE. No identification is needed; the user just has to choose which of the available projects he wants to consult.
The public version of the stackPACK annotation module, S.A.M COMPLETE, is a consultation version of SAM that allows the annotations of completed projects to be browsed.
All the consultation options are available in this public version, including consultation of the annotations of consensus sequences and singletons and all the options of the Tools section (see Tools section) except New EST and Heritage.
In addition, the user can BLAST his own sequences against the databank of the project he chose (Figure 19). The databank of a project contains the sequences of all its ESTs and consensus sequences.
Figure 19 : SAM COMPLETE's BLAST interface. One or several sequence(s) can be blasted against the databank composed of all the sequences of the selected project. The sequence(s) (raw sequence or in FASTA format) can be pasted in the text area or uploaded as a file. The BLAST type and the expect value threshold are then selected and the analysis can be launched.
If any sequence of the project matches the user's sequence, all the available information about each hit can be viewed (Figure 20).
Figure 20 : SAM COMPLETE's BLAST result page. The consensus sequences, ESTs and singletons that match the query are listed in order of e-value, and all the information (annotations, sequence, clustering data, etc.) about each sequence is available from its Details page.
BLAST(7)
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
Prosite(8)
Falquet, L., Pagni, M., Bucher, P., Hulo, N., Sigrist, C.J., Hofmann, K. and Bairoch, A. (2002) Nucleic Acids Res, 30, 235-238.
TMHMM(9)
2.0c
Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001;305:567-580. doi: 10.1006/jmbi.2000.4315.
Gene Ontology(10)
http://www.geneontology.org/
Funcat (MIPS)(11)
Schoof, H., Zaccaria, P., Gundlach, H., Lemcke, K., Rudd, S., Kolesov, G., Arnold, R., Mewes, H.W. and Mayer, K.F. (2002) Nucleic Acids Res, 30, 91-93.
SPUTNIK(12)
http://abajian.net/sputnik/
Software needed for stackPACK installation:
Software | Version | Location
d2_cluster(4) and CRAW(6) | latest | Academic: Biotique Systems; Academic: University of Houston; Commercial: Electric Genetics