Input: RefSeq records from NCBI or the user's own genome sequences (GenBank format)

Clustering: DIAMOND + MCL + phylogeny-based post-processing


Fig.1 An overview of the pan-genome analysis and visualization pipeline.

The panX analysis pipeline uses DIAMOND, MCL, and phylogeny-aware post-processing to determine clusters of orthologous genes from a collection of annotated genomes. panX generates a strain/species tree based on core genome SNPs and a gene tree for each gene cluster.
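
As a rough illustration of what "core genome SNPs" means in practice, the sketch below keeps only the variable columns of a concatenated core-gene alignment. The function name `core_snp_columns`, the input layout (a dict from strain name to aligned sequence), and the choice to ignore gaps and `N` when deciding whether a column varies are assumptions made for this example, not panX's actual code.

```python
def core_snp_columns(alignment):
    """Reduce an alignment to its variable (SNP) columns.

    alignment: dict mapping strain name -> aligned sequence (equal lengths).
    Gaps ("-") and ambiguous bases ("N") do not count as variation
    (an assumption of this sketch).
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    ignored = {"-", "N"}
    variable = [
        i for i in range(length)
        if len({s[i] for s in seqs} - ignored) > 1
    ]
    # Keep the same column order for every strain.
    return {
        name: "".join(seq[i] for i in variable)
        for name, seq in alignment.items()
    }


example = {"s1": "ACGTA", "s2": "ACTTA", "s3": "AC-TG"}
snps = core_snp_columns(example)
```

The reduced alignment could then be passed to any tree-building tool to infer the strain/species tree.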

panX interactive visualization:

(1) The dynamic pan-genome statistical charts allow rapid filtering and selection of gene subsets in the cluster table;

Clicking a gene cluster in the cluster table loads (2) the related alignment, (3) the individual gene tree, and (4) the gene presence/absence and gain/loss pattern on the strain/species tree;

(5) Selecting sequences in the alignment highlights the associated strains on the strain/species tree;

(6) (7) The strain/species tree interacts with the gene tree in various ways;

(8) Zooming into a clade on the strain/species tree filters the strains shown in the metadata table;

(9) Searching in the metadata table displays the strains that match specific meta-information.

Interactive charts: dc.js (interactive charting JS library based on crossfilter and d3). Gordon Woodhull's support and advice are highly appreciated!

Data tables: DataTables plug-in for the enhancement of data accessibility

Alignment: MSA (multiple sequence alignment JS library)

Phylogenetic tree: PhyloTree (highly flexible tree visualization JS library), crafted by Richard Neher

The all-against-all comparison of protein sequences from all strains is performed by the fast and sensitive protein alignment tool DIAMOND (Buchfink et al. 2015, Nature Methods), which applies a double-indexing method: it computes lists of seeds and their locations for both the queries and the references.

Fig.2 DIAMOND, a double-indexing approach that uses lists of seeds and their locations in both queries and references.
Image credit: Lecture notes by Daniel Huson, Tübingen University
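
The double-indexing idea can be sketched in a few lines. This is a deliberately simplified toy, not DIAMOND's implementation: it uses contiguous k-mers as seeds on the plain amino-acid alphabet, whereas DIAMOND itself works with spaced seeds on a reduced alphabet. The function names `seed_index` and `matching_seeds` and the seed length `k` are assumptions for illustration. Both queries and references are indexed as sorted seed lists, and the two lists are then traversed together to find shared seeds.

```python
def seed_index(seqs, k):
    """Sorted list of (seed, sequence id, position) for every k-mer."""
    entries = []
    for seq_id, seq in seqs.items():
        for pos in range(len(seq) - k + 1):
            entries.append((seq[pos:pos + k], seq_id, pos))
    entries.sort()
    return entries


def matching_seeds(queries, references, k=4):
    """Walk the two sorted seed lists in parallel and collect shared seeds."""
    q_idx = seed_index(queries, k)
    r_idx = seed_index(references, k)
    hits = []
    i = j = 0
    while i < len(q_idx) and j < len(r_idx):
        q_seed, r_seed = q_idx[i][0], r_idx[j][0]
        if q_seed < r_seed:
            i += 1
        elif q_seed > r_seed:
            j += 1
        else:
            # Find the run of identical seeds in each list.
            i_end = i
            while i_end < len(q_idx) and q_idx[i_end][0] == q_seed:
                i_end += 1
            j_end = j
            while j_end < len(r_idx) and r_idx[j_end][0] == q_seed:
                j_end += 1
            # Every query occurrence pairs with every reference occurrence.
            for _, q_id, q_pos in q_idx[i:i_end]:
                for _, r_id, r_pos in r_idx[j:j_end]:
                    hits.append((q_seed, q_id, q_pos, r_id, r_pos))
            i, j = i_end, j_end
    return hits


hits = matching_seeds({"q1": "MKVLA"}, {"r1": "AAMKVLT"}, k=4)
```

In a real aligner each seed hit would be the starting point for an extension step; the sketch stops at reporting the shared seed and its two locations.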

The sequence similarity matrix, built from the bit scores in the DIAMOND output, serves as input for the Markov Clustering Algorithm (MCL), which creates the clusters of homologous genes.
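
One possible way to hand bit scores to MCL is sketched below. It assumes DIAMOND was run with its default 12-column tabular output, where the bit score is the last column, and it writes MCL's label-based "abc" format (two labels and a weight per line). The function name `bitscores_to_abc` and the decision to keep the highest score per unordered gene pair are assumptions of this example; the source does not specify how panX builds the matrix.

```python
def bitscores_to_abc(diamond_lines):
    """Convert DIAMOND tabular hits into symmetric 'label label weight' lines.

    diamond_lines: iterable of tab-separated lines with 12 columns
    (query id first, subject id second, bit score last).
    """
    best = {}
    for line in diamond_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        # Treat A-B and B-A as the same edge and keep the stronger hit.
        pair = (query, subject) if query <= subject else (subject, query)
        if bitscore > best.get(pair, 0.0):
            best[pair] = bitscore
    return [f"{a}\t{b}\t{score:g}" for (a, b), score in sorted(best.items())]


example_hits = [
    "g1\tg2\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t180.5",
    "g2\tg1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t175.0",
]
abc_lines = bitscores_to_abc(example_hits)
```

The resulting lines can be written to a file and clustered with the `mcl` command using its `--abc` option; the inflation value passed with `-I` controls cluster granularity and would need tuning for the data set.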

Afterwards, the clusters are post-processed: clusters that involve distantly related sequences and contain a large number of paralogs are split.
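
A minimal sketch of how such a cluster might be flagged for splitting is given below. Everything specific in it is hypothetical: the function names, the two thresholds (`max_branch`, `max_paralog_fraction`), and the use of the longest gene-tree branch as a proxy for "distantly related" are assumptions for illustration, not the criteria panX applies.

```python
from collections import Counter


def paralog_count(gene_to_strain):
    """Number of extra gene copies, counting every copy beyond the first per strain."""
    per_strain = Counter(gene_to_strain.values())
    return sum(n - 1 for n in per_strain.values())


def needs_splitting(gene_to_strain, branch_lengths,
                    max_branch=0.5, max_paralog_fraction=0.3):
    """Flag a cluster that is both divergent and rich in paralogs.

    gene_to_strain: dict mapping gene id -> strain name for one cluster.
    branch_lengths: branch lengths of the cluster's gene tree.
    Both thresholds are hypothetical values chosen for this example.
    """
    n_strains = len(set(gene_to_strain.values()))
    if n_strains == 0:
        return False
    paralog_fraction = paralog_count(gene_to_strain) / n_strains
    longest = max(branch_lengths, default=0.0)
    return longest > max_branch and paralog_fraction > max_paralog_fraction


cluster = {"A|g1": "A", "A|g2": "A", "B|g1": "B", "B|g2": "B", "C|g1": "C"}
flagged = needs_splitting(cluster, [0.01, 0.02, 0.9])
```

A flagged cluster would then be cut along the gene tree, for example at the long branch that separates the paralogous copies.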