Summary: The Global Proteome Machine and Database (GPMDB) representational state transfer (REST) service was designed to provide simplified access to the proteomics information in GPMDB using a stable set of methods and parameters. Version 1 of this interface gives access to 25 methods for retrieving experimental information about protein post-translational modifications, amino acid variants, alternate splicing variants and protein cleavage patterns.
Availability and implementation: GPMDB data and database tables are freely available for commercial and non-commercial use. All software is also freely available, under the Artistic License. http://rest.thegpm.org/1 (GPMDB REST Service), http://wiki.thegpm.org/wiki/GPMDB_REST (Service description and help), and http://www.thegpm.org (GPM main project description and documentation). The code for the interface and an example REST client is available at ftp://ftp.thegpm.org/repos/gpmdb_rest
Contact: rbeavis@thegpm.org or david@fenyolab.org
Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: The eukaryotic transcriptome is composed of thousands of coding and long non-coding RNAs (lncRNAs). However, we lack a software platform to identify both RNA classes in a given transcriptome. Here we introduce Annocript, a pipeline that combines the annotation of protein coding transcripts with the prediction of putative lncRNAs in whole transcriptomes. It downloads and indexes the needed databases, runs the analysis and produces human readable and standard outputs together with summary statistics of the whole analysis.
Availability and implementation: Annocript is distributed under the GNU General Public License (version 3 or later) and is freely available at https://github.com/frankMusacchia/Annocript.
Motivation: Discovering the relevant therapeutic targets for drug-like molecules, or their unintended ‘off-targets’ that predict adverse drug reactions, is a daunting task by experimental approaches alone. There is thus a high demand to develop computational methods capable of detecting these potential interacting targets efficiently.
Results: As biologically annotated chemical data are becoming increasingly available, it becomes feasible to explore such existing knowledge to identify potential ligand–target interactions. Here, we introduce an online implementation of a recently published computational model for target prediction, TarPred, based on a reference library containing 533 individual targets with 179 807 active ligands. TarPred accepts interactive graphical input or input in the SMILES chemical file format. Given a query compound structure, it provides the 30 top-ranked interacting targets. For each of them, TarPred not only shows the structures of the three most similar ligands that are known to interact with the target but also highlights the disease indications associated with the target. This information is useful for understanding the mechanisms of action and toxicities of active compounds and can provide drug repositioning opportunities.
Availability and implementation: TarPred is available at: http://www.dddc.ac.cn/tarpred.
Motivation: ChIP-seq is a powerful technology for measuring protein binding or histone modification strength at the whole-genome scale. Although a number of methods are available for the analysis of single ChIP-seq datasets (e.g. ‘peak detection’), rigorous statistical methods for the quantitative comparison of multiple ChIP-seq datasets, which take into account data from control experiments, signal-to-noise ratios, biological variation and multiple-factor experimental designs, are underdeveloped.
Results: In this work, we develop a statistical method to perform quantitative comparison of multiple ChIP-seq datasets and detect genomic regions showing differential protein binding or histone modification. We first detect peaks from all datasets and then take their union to form a single set of candidate regions. The read counts from the IP experiment at the candidate regions are assumed to follow a Poisson distribution. The underlying Poisson rates are modeled as an experiment-specific function of artifacts and biological signals. We then obtain the estimated biological signals and compare them through a hypothesis testing procedure in a linear model framework. Simulations and real data analyses demonstrate that the proposed method provides more accurate and robust results than existing ones.
Availability and implementation: An R software package ChIPComp is freely available at http://web1.sph.emory.edu/users/hwu30/software/ChIPComp.html.
Contact: hao.wu@emory.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Oasis: online analysis of small RNA deep sequencing data
Summary: Oasis is a web application that allows for the fast and flexible online analysis of small-RNA-seq (sRNA-seq) data. It was designed for the end user in the lab, providing an easy-to-use web frontend including video tutorials, demo data and best practice step-by-step guidelines on how to analyze sRNA-seq data. Oasis’ exclusive selling points are a differential expression module that allows for the multivariate analysis of samples, a classification module for robust biomarker detection and an advanced programming interface that supports the batch submission of jobs. Both modules include the analysis of novel miRNAs, miRNA targets and functional analyses including GO and pathway enrichment. Oasis generates downloadable interactive web reports for easy visualization, exploration and analysis of data on a local system. Finally, Oasis’ modular workflow enables the rapid (re-)analysis of data.
Availability and implementation: Oasis is implemented in Python, R, Java, PHP, C++ and JavaScript. It is freely available at http://oasis.dzne.de.
Contact: stefan.bonn@dzne.de
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Most proteins consist of multiple domains, independent structural and evolutionary units that are often reshuffled in genomic rearrangements to form new protein architectures. Template-based modeling methods can often detect homologous templates for individual domains, but templates that could be used to model the entire query protein are often not available.
Results: We have developed a fast docking algorithm, ab initio domain assembly (AIDA), for assembling multi-domain protein structures, guided by the ab initio folding potential. This approach can be extended to discontinuous domains (i.e. domains with ‘inserted’ domains). When tested on experimentally solved structures of multi-domain proteins, the relative domain positions were accurately found among the top 5000 models in 86% of cases. The AIDA server can use domain assignments provided by the user or predict them from the provided sequence. The latter approach is particularly useful for automated protein structure prediction servers. A blind test consisting of 95 CASP10 targets shows that domain boundaries could be successfully determined for 97% of targets.
Availability and implementation: The AIDA package as well as the benchmark sets used here are available for download at http://ffas.burnham.org/AIDA/.
Contact: adam@sanfordburnham.org
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Macromolecular structures and interactions are intrinsically heterogeneous, temporally adopting a range of configurations that can confound the analysis of data from bulk experiments. To obtain quantitative insights into heterogeneous systems, an ensemble-based approach can be employed, in which predicted data computed from a collection of models is compared to the observed experimental results. By simultaneously fitting orthogonal structural data (e.g. small-angle X-ray scattering, nuclear magnetic resonance residual dipolar couplings, dipolar electron-electron resonance spectra), the range and population of accessible macromolecule structures can be probed.
Results: We have developed MESMER, software that enables the user to identify ensembles that can recapitulate experimental data by refining thousands of component collections selected from an input pool of potential structures. The MESMER suite includes a powerful graphical user interface (GUI) to streamline usage of the command-line tools, calculate data from structure libraries and perform analyses of conformational and structural heterogeneity. To allow for incorporation of other data types, modular Python plugins enable users to compute and fit data from nearly any type of quantitative experimental data.
Results: Conformational heterogeneity in three macromolecular systems was analyzed with MESMER, demonstrating the utility of the streamlined, user-friendly software.
Availability and implementation: https://code.google.com/p/mesmer/
Contact: foster.281@osu.edu or ihms.2@osu.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: A genetic variant can be represented in the Variant Call Format (VCF) in multiple different ways. Inconsistent representation of variants between variant callers and analyses will magnify discrepancies between them and complicate variant filtering and duplicate removal. We present a software tool vt normalize that normalizes representation of genetic variants in the VCF. We formally define variant normalization as the consistent representation of genetic variants in an unambiguous and concise way and derive a simple general algorithm to enforce it. We demonstrate the inconsistent representation of variants across existing sequence analysis tools and show that our tool facilitates integration of diverse variant types and call sets.
Availability and implementation: The source code is available for download at http://github.com/atks/vt. More detailed documentation is available at http://genome.sph.umich.edu/wiki/Variant_Normalization.
Contact: hmkang@umich.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
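The normalization described above for vt normalize (a concise, unambiguous representation) can be illustrated with a short sketch. It assumes the commonly described procedure of trimming shared trailing bases, extending to the left whenever an allele becomes empty, and then trimming shared leading bases. The function name `normalize_variant` and its arguments are hypothetical and are not part of the vt code base.

```python
def normalize_variant(pos, alleles, ref_seq):
    """Left-align and trim one variant (illustrative sketch, not vt itself).

    pos     -- 1-based position of the first base of every allele
    alleles -- allele strings, reference allele first
    ref_seq -- reference sequence; position p is ref_seq[p - 1]
    """
    alleles = list(alleles)
    if len(set(alleles)) < 2:
        raise ValueError("need at least two distinct alleles")
    changed = True
    while changed:
        changed = False
        # Drop the rightmost base while every allele ends with the same base.
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # An empty allele is not allowed: extend all alleles one base leftwards.
        if any(not a for a in alleles):
            if pos == 1:
                raise ValueError("cannot extend left of the reference start")
            pos -= 1
            alleles = [ref_seq[pos - 1] + a for a in alleles]
            changed = True
    # Drop shared leading bases while every allele keeps at least one base.
    while all(len(a) >= 2 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


# A CA-repeat deletion written at the right end of the repeat is shifted
# to its leftmost position.
reference = "GGGCACACACAGGG"
print(normalize_variant(8, ["CAC", "C"], reference))
```

The same function also strips redundant padding from a substitution, so two callers that pad differently end up with identical records.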
Motivation: The recent advance of single-cell technologies has brought new insights into complex biological phenomena. In particular, genome-wide single-cell measurements such as transcriptome sequencing enable the characterization of cellular composition as well as functional variation in homogenic cell populations. An important step in the single-cell transcriptome analysis is to group cells that belong to the same cell types based on gene expression patterns. The corresponding computational problem is to cluster a noisy high dimensional dataset with substantially fewer objects (cells) than the number of variables (genes).
Results: In this article, we describe a novel algorithm named shared nearest neighbor (SNN)-Cliq that clusters single-cell transcriptomes. SNN-Cliq utilizes the concept of shared nearest neighbor that shows advantages in handling high-dimensional data. When evaluated on a variety of synthetic and real experimental datasets, SNN-Cliq outperformed the state-of-the-art methods tested. More importantly, the clustering results of SNN-Cliq reflect the cell types or origins with high accuracy.
Availability and implementation: The algorithm is implemented in MATLAB and Python. The source code can be downloaded at http://bioinfo.uncc.edu/SNNCliq.
Contact: zcsu@uncc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
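The shared-nearest-neighbor idea behind SNN-Cliq can be sketched in a few lines. This is a simplified, count-based variant that assumes Euclidean distance and scores two cells by the number of neighbors their k-nearest-neighbor lists have in common. The weighting and the clustering step used by SNN-Cliq itself are not reproduced, and the function names are hypothetical.

```python
import math


def knn_sets(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighbors.append(set(others[:k]))
    return neighbors


def snn_similarity(points, k):
    """Symmetric matrix counting the k-nearest neighbors shared by each pair."""
    nn = knn_sets(points, k)
    n = len(points)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = len(nn[i] & nn[j])
    return sim


# Two well-separated toy groups of "cells" in a 2-D expression space.
cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
similarity = snn_similarity(cells, k=2)
print(similarity[0][1], similarity[0][3])
```

Pairs from the same group share neighbors, while pairs from different groups share none, which is the signal a graph-based clustering step can then exploit.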
A major roadblock towards accurate interpretation of single cell RNA-seq data is the large technical noise resulting from the small amount of input material. Existing methods mainly aim to find differentially expressed genes rather than to de-noise the single cell data directly. We present here a powerful but simple method to remove technical noise and explicitly compute the true gene expression levels based on spike-in ERCC molecules.
Availability and implementation: The software is implemented in R and a downloadable version is available at http://wanglab.ucsd.edu/star/GRM.
Contact: wei-wang@ucsd.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Hydration largely determines the solubility and aggregation of proteins and influences interactions between proteins and drug molecules. Despite the importance of hydration, determining the hydration structure of protein surfaces is still challenging from both experimental and theoretical viewpoints. The precision of experimental measurements is often affected by fluctuations and mobility of water molecules, resulting in uncertain assignment of water positions.
Results: Our method can utilize mobility as an information source for the prediction of hydration structure. The necessary information can be produced by molecular dynamics simulations accounting for all atomic interactions including water–water contacts. The predictions were validated and tested by comparison to more than 1500 crystallographic water positions in 20 hydrated protein molecules including enzymes of biomedical importance such as cyclin-dependent kinase 2. The agreement with experimental water positions was larger than 80% on average. The predictions can be particularly useful in situations where no or limited experimental knowledge is available on hydration structures of molecular surfaces.
Availability and implementation: The method is implemented in a standalone C program MobyWat released under the GNU General Public License, freely accessible with full documentation at http://www.mobywat.com.
Contact: csabahete@yahoo.com
Supplementary information: Supplementary data are available at Bioinformatics online.
Creation of defined genetic mutations is a powerful method for dissecting mechanisms of bacterial disease; however, many genetic tools are only developed for laboratory strains. We have designed a modular and general negative selection strategy based on inducible toxins that provides high selection stringency in clinical Escherichia coli and Salmonella isolates. No strain- or species-specific optimization is needed, yet this system achieves better selection stringency than all previously reported negative selection systems usable in unmodified E. coli strains. The high stringency enables use of negative instead of positive selection in phage-mediated generalized transduction and also allows transfer of alleles between arbitrary strains of E. coli without requiring phage. The modular design should also allow further extension to other bacteria. This negative selection system thus overcomes disadvantages of existing systems, enabling definitive genetic experiments in both lab and clinical isolates of E. coli and other Enterobacteriaceae.
Motivated by the high cost of human curation of biological databases, there is an increasing interest in using computational approaches to assist human curators and accelerate the manual curation process. Towards the goal of cataloging drug indications from FDA drug labels, we recently developed LabeledIn, a human-curated drug indication resource for 250 clinical drugs. Its development required over 40 h of human effort across 20 weeks, despite using well-defined annotation guidelines. In this study, we aim to investigate the feasibility of scaling drug indication annotation through a crowdsourcing technique where an unknown network of workers can be recruited through the technical environment of Amazon Mechanical Turk (MTurk). To translate the expert-curation task of cataloging indications into human intelligence tasks (HITs) suitable for the average workers on MTurk, we first simplify the complex task such that each HIT only involves a worker making a binary judgment of whether a highlighted disease, in context of a given drug label, is an indication. In addition, this study is novel in the crowdsourcing interface design where the annotation guidelines are encoded into user options. For evaluation, we assess the ability of our proposed method to achieve high-quality annotations in a time-efficient and cost-effective manner. We posted over 3000 HITs drawn from 706 drug labels on MTurk. Within 8 h of posting, we collected 18 775 judgments from 74 workers, and achieved an aggregated accuracy of 96% on 450 control HITs (where gold-standard answers are known), at a cost of $1.75 per drug label. On the basis of these results, we conclude that our crowdsourcing approach not only results in significant cost and time saving, but also leads to accuracy comparable to that of domain experts.
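As an illustration of how binary judgments from several workers might be combined and scored against control HITs, the following sketch assumes simple majority voting per HIT. The abstract does not state which aggregation rule was used, so majority voting, the tuple layout and the function names are all assumptions.

```python
from collections import Counter


def aggregate_judgments(judgments):
    """Majority vote per HIT.

    judgments -- iterable of (hit_id, worker_id, is_indication) tuples.
    Ties are resolved as "not an indication" (an arbitrary choice).
    """
    votes = {}
    for hit_id, _worker_id, is_indication in judgments:
        votes.setdefault(hit_id, Counter())[bool(is_indication)] += 1
    return {hit_id: c[True] > c[False] for hit_id, c in votes.items()}


def control_accuracy(labels, gold):
    """Fraction of control HITs whose aggregated label matches the gold answer."""
    scored = [hit_id for hit_id in gold if hit_id in labels]
    if not scored:
        raise ValueError("no control HITs were judged")
    return sum(labels[h] == gold[h] for h in scored) / len(scored)


# Toy data: three HITs, two of which are controls with known answers.
toy = [
    ("hit1", "w1", True), ("hit1", "w2", True), ("hit1", "w3", False),
    ("hit2", "w1", False), ("hit2", "w2", False), ("hit2", "w3", True),
    ("hit3", "w1", True), ("hit3", "w2", False),
]
labels = aggregate_judgments(toy)
print(labels, control_accuracy(labels, {"hit1": True, "hit2": True}))
```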
Tandem duplication is a widespread phenomenon in plant genomes and plays significant roles in evolution and adaptation to changing environments. Tandem duplicated genes related to certain functions lead to the expansion of gene families and increase gene dosage in the form of gene cluster arrays. Many tandem duplication events have been studied in plant genomes; yet, there has been surprisingly little effort to systematically integrate and present the large amount of publicly deposited tandem duplicated gene data across the plant kingdom. To address this shortcoming, we developed the first plant tandem duplicated genes database, PTGBase. It delivers the most comprehensive resource available to date, spanning 39 plant genomes, including model species and newly sequenced species alike. Across these genomes, 54 130 tandem duplicated gene clusters (129 652 genes) are presented in the database. Each tandem array, as well as its member genes, is characterized in complete detail. Tandem duplicated genes in PTGBase can be explored through browsing or searching by identifiers or keywords of functional annotation and sequence similarity. Users can download tandem duplicated gene arrays easily to any scale, up to the complete annotation data set for an entire plant genome. PTGBase will be updated regularly with newly sequenced plant species as they become available.
Background: Quantitative analysis of simple molecular networks is an important step toward understanding fundamental intracellular processes. As network motifs occurring recurrently in complex biological networks, gene auto-regulatory circuits have been extensively studied, but gene expression dynamics remain to be fully understood; for example, how promoter leakage affects expression noise is unclear. Results: In this work, we analyze a gene model with auto-regulation, where the promoter is assumed to have one active state with highly efficient transcription and one inactive state with very inefficient transcription (termed promoter leakage). We first derive the analytical distribution of the gene product, and then analyze the effects of promoter leakage on expression dynamics, including bursting kinetics. Interestingly, we find that promoter leakage always reduces expression noise and that increasing the leakage rate tends to simplify phenotypes. In addition, higher leakage results in fewer bursts. Conclusions: Our results reveal the essential role of promoter leakage in controlling expression dynamics and, in turn, phenotype. Specifically, promoter leakage is a universal mechanism for reducing expression noise, controlling phenotypes in different environments and making the gene product generate fewer bursts.
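A two-state promoter with leaky transcription can be explored numerically with a stochastic simulation. The sketch below is a deliberately reduced model: it uses the standard Gillespie algorithm, omits the auto-regulatory feedback analyzed in the paper, and all rate names (`k_on`, `k_off`, `k_active`, `k_leak`, `k_deg`) are hypothetical. It is not the authors' analytical derivation.

```python
import random


def simulate_leaky_promoter(k_on, k_off, k_active, k_leak, k_deg, t_end, seed=0):
    """Gillespie simulation of a two-state promoter with leakage.

    The promoter switches on with rate k_on and off with rate k_off.
    Product is made at rate k_active (on) or k_leak (off) and decays at
    rate k_deg per molecule. No feedback is modeled (simplifying assumption).
    Returns the time-averaged mean and variance of the product count.
    """
    rng = random.Random(seed)
    t, active, n = 0.0, False, 0
    sum_n = sum_n2 = 0.0  # time-weighted accumulators
    while True:
        switch = k_off if active else k_on
        synth = k_active if active else k_leak
        decay = k_deg * n
        total = switch + synth + decay
        wait = rng.expovariate(total) if total > 0 else float("inf")
        if t + wait >= t_end:
            # No further event inside the simulation window.
            sum_n += n * (t_end - t)
            sum_n2 += n * n * (t_end - t)
            break
        sum_n += n * wait
        sum_n2 += n * n * wait
        t += wait
        # Pick which event fires, in proportion to its rate.
        r = rng.random() * total
        if r < decay:
            n -= 1
        elif r < decay + synth:
            n += 1
        else:
            active = not active
    mean = sum_n / t_end
    variance = max(sum_n2 / t_end - mean ** 2, 0.0)
    return mean, variance


mean, variance = simulate_leaky_promoter(1.0, 1.0, 10.0, 1.0, 1.0, t_end=200.0, seed=1)
print(mean, variance)
```

Running the function with different `k_leak` values and comparing `variance / mean ** 2` gives a rough numerical view of how leakage relates to noise in this reduced model.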
Background: The exchange of metabolites and the reprogramming of metabolism in response to shifting microenvironmental conditions can drive subpopulations of cells within colonies toward divergent behaviors. Understanding the interactions of these subpopulations—their potential for competition as well as cooperation—requires both a metabolic model capable of accounting for a wide range of environmental conditions, and a detailed dynamic description of the cells’ shared extracellular space. Results: Here we show that a cell’s position within an in silico Escherichia coli colony grown on glucose minimal agar can drastically affect its metabolism: “pioneer” cells at the outer edge engage in rapid growth that expands the colony, while dormant cells in the interior separate two spatially distinct subpopulations linked by a cooperative form of acetate crossfeeding that has so far gone unnoticed. Our hybrid simulation technique integrates 3D reaction-diffusion modeling with genome-scale flux balance analysis (FBA) to describe the position-dependent metabolism and growth of cells within a colony. Our results are supported by imaging experiments involving strains of fluorescently-labeled E. coli. The spatial patterns of fluorescence within these experimental colonies identify cells with upregulated genes associated with acetate crossfeeding and are in excellent agreement with the predictions. Furthermore, the height-to-width ratios of both the experimental and simulated colonies are in good agreement over a growth period of 48 hours. Conclusions: Our modeling paradigm can accurately reproduce a number of known features of E. coli colony growth, as well as predict a novel one that had until now gone unrecognized. The acetate crossfeeding we see has a direct analogue in a form of lactate crossfeeding observed in certain forms of cancer, and we anticipate future application of our methodology to models of tissues and tumors.
Protein crosslinking has been used for decades to derive structural information about proteins and protein complexes. Only recently, however, has it become possible to map the amino acids involved in the crosslinks, with the advent of high resolution mass spectrometry (MS). Here, we present Crossfinder, which automates the search for crosslinks formed by site-specifically incorporated crosslinking amino acids in LC-MS-MS data.
Availability and Implementation: An executable version of Crossfinder for Windows machines (64-bit) is freely available to non-commercial users. It is bundled with a manual and example data.
Motivation: Most biological processes remain only partially characterized with many components still to be identified. Given that a whole genome can usually not be tested in a functional assay, identifying the genes most likely to be of interest is of critical importance to avoid wasting resources.
Results: Given a set of known functionally related genes and using a state-of-the-art approach to data integration and mining, our Functional Lists (FUN-L) method provides a ranked list of candidate genes for testing. Validation of predictions from FUN-L with independent RNAi screens confirms that FUN-L-produced lists are enriched in genes with the expected phenotypes. In this article, we describe a website front end to FUN-L.
Availability and implementation: The website is freely available to use at http://funl.org
Motivation: The combination of liquid chromatography and mass spectrometry (LC/MS) has been widely used for large-scale comparative studies in systems biology, including proteomics, glycomics and metabolomics. In almost all experimental designs, it is necessary to compare chromatograms across biological or technical replicates and across sample groups. Central to this is the peak alignment step, which is one of the most important but challenging preprocessing steps. Existing alignment tools do not take into account the structural dependencies between related peaks that coelute and are derived from the same metabolite or peptide. We propose a direct matching peak alignment method for LC/MS data that incorporates related peaks information (within each LC/MS run) and investigate its effect on alignment performance (across runs). The groupings of related peaks necessary for our method can be obtained from any peak clustering method and are built into a pair-wise peak similarity score function. The similarity score matrix produced is used by an approximation algorithm for the weighted matching problem to produce the actual alignment result.
Results: We demonstrate that related peak information can improve alignment performance. The performance is evaluated on a set of benchmark datasets, where our method performs competitively compared to other popular alignment tools.
Availability: The proposed alignment method has been implemented as a stand-alone application in Python, available for download at http://github.com/joewandy/peak-grouping-alignment.
Contact: Simon.Rogers@glasgow.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
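The final step described above, turning a pair-wise similarity score matrix into an alignment, can be sketched with a greedy matching. Greedy selection is one well-known approximation for maximum weighted matching, and it is used here as an assumption. The abstract does not say which approximation algorithm the tool uses, and the similarity function that incorporates related-peak groupings is not reproduced.

```python
def greedy_matching(scores):
    """Greedy approximation to maximum weighted bipartite matching.

    scores -- dict mapping (peak_in_run_a, peak_in_run_b) to a similarity score.
    Pairs are taken in order of decreasing score; a pair is kept only if
    neither peak has been matched yet. Non-positive scores are ignored.
    """
    used_a, used_b, matched = set(), set(), []
    for (a, b), score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if score <= 0:
            break
        if a not in used_a and b not in used_b:
            matched.append((a, b))
            used_a.add(a)
            used_b.add(b)
    return matched


# Toy similarity scores between two peaks in run A and two peaks in run B.
toy_scores = {(0, 0): 0.9, (0, 1): 0.8, (1, 0): 0.7, (1, 1): 0.1}
print(greedy_matching(toy_scores))
```

In this toy case the greedy result has total score 1.0, whereas the optimal matching (0, 1) and (1, 0) scores 1.5, which shows why the result is only an approximation. The trade-off is speed and simplicity on large score matrices.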
Summary: Whiteboard is a class library implemented in C++ that enables visualization to be tightly coupled with computation when analyzing large and complex datasets.
Availability and implementation: The C++ source code, coding samples and documentation are freely available under the Lesser General Public License from http://whiteboard-class.sourceforge.net/.
No large group of recently extinct placental mammals remains as evolutionarily cryptic as the approximately 280 genera grouped as ‘South American native ungulates’. To Charles Darwin, who first collected their remains, they included perhaps the ‘strangest animal[s] ever discovered’. Today, much like 180 years ago, it is no clearer whether they had one origin or several, arose before or after the Cretaceous/Palaeogene transition 66.2 million years ago, or are more likely to belong with the elephants and sirenians of superorder Afrotheria than with the euungulates (cattle, horses, and allies) of superorder Laurasiatheria. Morphology-based analyses have proved unconvincing because convergences are pervasive among unrelated ungulate-like placentals. Approaches using ancient DNA have also been unsuccessful, probably because of rapid DNA degradation in semitropical and temperate deposits. Here we apply proteomic analysis to screen bone samples of the Late Quaternary South American native ungulate taxa Toxodon (Notoungulata) and Macrauchenia (Litopterna) for phylogenetically informative protein sequences. For each ungulate, we obtain approximately 90% direct sequence coverage of type I collagen α1- and α2-chains, representing approximately 900 of 1,140 amino-acid residues for each subunit. A phylogeny is estimated from an alignment of these fossil sequences with collagen (I) gene transcripts from available mammalian genomes or mass spectrometrically derived sequence data obtained for this study. The resulting consensus tree agrees well with recent higher-level mammalian phylogenies. Toxodon and Macrauchenia form a monophyletic group whose sister taxon is not Afrotheria or any of its constituent clades as recently claimed, but instead crown Perissodactyla (horses, tapirs, and rhinoceroses). These results are consistent with the origin of at least some South American native ungulates from ‘condylarths’, a paraphyletic assembly of archaic placentals. 
With ongoing improvements in instrumentation and analytical procedures, proteomics may produce a revolution in systematics such as that achieved by genomics, but with the possibility of reaching much further back in time.