Input: RefSeq records from NCBI or the user's own genome sequences (GenBank format)
Clustering: DIAMOND + MCL + phylogeny-based post-processing
Fig. 1. An overview of the pan-genome analysis and visualization pipeline.
The panX analysis pipeline uses DIAMOND, MCL and phylogeny-aware post-processing to determine clusters of orthologous genes from a collection of annotated genomes. panX generates a strain/species tree based on core genome SNPs and a gene tree for each gene cluster.
panX interactive visualization:
(1) The dynamic pan-genome statistical charts allow rapid filtering and selection of gene subsets in the cluster table;
clicking a gene cluster in the cluster table loads (2) the related alignment, (3) the individual gene tree and (4) the gene presence/absence and gain/loss pattern on the strain/species tree;
(5) Selecting sequences in the alignment highlights the associated strains on the strain/species tree;
(6) (7) The strain/species tree interacts with the gene tree in various ways;
(8) Zooming into a clade on the strain/species tree filters the strains shown in the metadata table;
(9) Searching in the metadata table displays the strains that match specific meta-information.
Interactive charts: dc.js (interactive charting JS library based on crossfilter and d3). Gordon Woodhull's support and advice are highly appreciated!
Data tables: DataTables plug-in for the enhancement of data accessibility
Alignment: MSA (multiple sequence alignment JS library)
Phylogenetic tree: PhyloTree (highly flexible tree visualization JS library) crafted by Richard Neher
The all-against-all comparison of protein sequences from all strains is performed by the fast and sensitive protein alignment tool DIAMOND (Buchfink et al. 2015, Nature Methods), which applies a double indexing method to compute the lists of seeds and their locations in both queries and references.
The sequence similarity matrix, built from the bit-scores in the DIAMOND output, serves as input for the Markov Clustering Algorithm (MCL), which creates the clusters of homologous genes.
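As a rough illustration of how bit-scores can be handed from DIAMOND to MCL, the sketch below converts DIAMOND's default 12-column tabular output (query, subject, ..., e-value, bit-score) into the three-column "abc" label format that MCL can read. This is not panX's own code: the function name `diamond_to_abc` is hypothetical, and keeping only the highest bit-score per query/subject pair is an assumption made for the example.

```python
def diamond_to_abc(tabular_text):
    """Turn DIAMOND tabular hits into MCL 'abc' lines: query, subject, bit-score.

    Assumes the default 12-column tabular layout, in which the query ID is
    column 1, the subject ID is column 2 and the bit-score is column 12.
    """
    best = {}
    for line in tabular_text.splitlines():
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        key = (query, subject)
        # Assumption for this sketch: keep only the best hit per ordered pair.
        if bitscore > best.get(key, 0.0):
            best[key] = bitscore
    return [f"{q}\t{s}\t{b:g}" for (q, s), b in sorted(best.items())]


# Tiny made-up example: two hits for the same pair, one reciprocal hit.
example_hits = "\n".join([
    "geneA\tgeneB\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t250.0",
    "geneA\tgeneB\t60.0\t50\t20\t0\t1\t50\t1\t50\t1e-20\t120.0",
    "geneB\tgeneA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-49\t245.0",
])
abc_lines = diamond_to_abc(example_hits)
print("\n".join(abc_lines))
```

The resulting lines could be written to a file and clustered with a command along the lines of `mcl hits.abc --abc -I 1.5 -o clusters.txt`, where the file names and the inflation value are placeholders rather than settings taken from panX.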
Afterwards, the clusters are post-processed by splitting clusters that involve distantly related sequences and contain a large number of paralogs.
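To make the paralog criterion concrete, the sketch below counts how many strains contribute more than one gene to a cluster and flags clusters in which that fraction is high. It covers only the paralog-counting part; the phylogeny-based check for distantly related sequences is not shown. The gene ID layout `strain|locus`, the function names and the threshold `max_paralog_fraction` are all assumptions for illustration, not values or conventions confirmed by panX.

```python
from collections import Counter


def paralog_summary(cluster, sep="|"):
    """Summarise gene copies per strain in one cluster.

    Assumes each gene ID looks like 'strain|locus' (hypothetical layout).
    """
    copies = Counter(gene_id.split(sep, 1)[0] for gene_id in cluster)
    return {
        "n_genes": len(cluster),
        "n_strains": len(copies),
        # Strains that contribute more than one gene to the cluster.
        "strains_with_paralogs": sum(1 for c in copies.values() if c > 1),
        # Gene copies beyond the first one in each strain.
        "extra_copies": sum(c - 1 for c in copies.values()),
    }


def is_split_candidate(cluster, max_paralog_fraction=0.5):
    """Flag a cluster when many strains carry duplicates (illustrative threshold)."""
    summary = paralog_summary(cluster)
    if summary["n_strains"] == 0:
        return False
    fraction = summary["strains_with_paralogs"] / summary["n_strains"]
    return fraction > max_paralog_fraction


mostly_single_copy = ["s1|g1", "s1|g2", "s2|g3", "s3|g4"]
mostly_duplicated = ["s1|g1", "s1|g2", "s2|g3", "s2|g4"]
print(paralog_summary(mostly_single_copy))
print(is_split_candidate(mostly_single_copy), is_split_candidate(mostly_duplicated))
```

A cluster flagged this way would still need its gene tree inspected to decide where to split it, which is the phylogeny-aware step described above.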