
EULER-AIR: checking base-assignment errors in repeat regions of fragment assembly

Read assignments in repeat regions are often ambiguous, which causes base-assignment errors in the final assembly. EULER-AIR improves read assignment using an EM procedure. Given an assembled genomic sequence and the associated trace data, EULER-AIR can (i) discover and correct base-assignment errors; (ii) provide accurate read assignments; (iii) utilize finishing reads for accurate base assignment; and (iv) provide guidance for designing finishing experiments.
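
The EM idea can be illustrated with a generic mixture model. The sketch below is not EULER-AIR's actual model: it assumes a hypothetical input `likelihoods[r][c]`, the probability of observing read `r` if it came from repeat copy `c`, and alternates between soft read assignments (E-step) and re-estimated copy weights (M-step). The function name and all numbers are illustrative assumptions.

```python
def em_read_assignment(likelihoods, iterations=50):
    """Toy EM for soft-assigning reads to repeat copies.

    likelihoods[r][c] is an assumed, precomputed P(read r | copy c).
    Returns (posteriors, weights), where posteriors[r][c] is the
    estimated probability that read r belongs to copy c.
    """
    n_copies = len(likelihoods[0])
    weights = [1.0 / n_copies] * n_copies  # start from uniform copy weights
    posteriors = []
    for _ in range(iterations):
        # E-step: posterior of each copy for each read.
        posteriors = []
        for row in likelihoods:
            joint = [w * l for w, l in zip(weights, row)]
            total = sum(joint)
            posteriors.append([j / total for j in joint])
        # M-step: each copy's weight is its average posterior over reads.
        weights = [
            sum(p[c] for p in posteriors) / len(posteriors)
            for c in range(n_copies)
        ]
    return posteriors, weights


# Hypothetical likelihoods for four reads against two repeat copies.
toy = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.1, 0.9]]
posteriors, weights = em_read_assignment(toy)
```

In this toy input the third read fits both copies equally well, so its assignment is driven entirely by the estimated copy weights; that is the kind of ambiguity the description above refers to.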

Alu Repeat Subfamily Classification

Alu repeats are the most abundant family of repeats in the human genome, with over 1 million copies comprising 10% of the genome. They have been implicated in human genetic disease and in the enrichment of gene-rich segmental duplications in the human genome, and they form a rich fossil record of primate and human history. Alu repeat elements are believed to have arisen from the replication of a small number of source elements, whose evolution over time gave rise to the 31 Alu subfamilies currently reported in Repbase Update. We apply a novel method to identify and statistically validate 213 Alu subfamilies. We build an evolutionary tree of these subfamilies and conclude that the history of Alu evolution is more complex than previous studies had indicated.

Download source

Comparative genomics motif finding

The recent discovery of the first small modulatory RNA (smRNA) presents the challenge of finding other molecules of similar length and conservation level. Unlike short interfering RNA (siRNA) and micro-RNA (miRNA), this species of RNA molecule has no known effective computational or experimental screening method, and the discovery of the one known example was partly fortuitous because it happened to be complementary to a well-studied DNA binding motif (the Neuron Restrictive Silencer Element). Existing comparative genomics approaches (e.g., phylogenetic footprinting) rely on alignments of orthologous regions across multiple genomes. This strategy, while extremely valuable, is not suitable for finding motifs with highly diverged “non-alignable” flanking regions. On this website, we demonstrate that several unusually long and well-conserved motifs can be discovered de novo through a comparative genomics approach that does not require an alignment of orthologous upstream regions.

http://bix-nrse.ucsd.edu/

Uniform Projection Motif Finder

Buhler and Tompa (2002) introduced the random projection algorithm for the motif discovery problem and demonstrated that this algorithm performs well on both simulated and biological samples. We describe a modification of the random projection algorithm, called the uniform projection algorithm, which utilizes a different choice of projections. We replace the random selection of projections by a greedy heuristic that approximately equalizes the coverage of the projections. We show that this change in selection of projections leads to improved performance on motif discovery problems. Furthermore, the uniform projection algorithm is directly applicable to other problems where the random projection algorithm has been used, including comparison of protein sequence databases.
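
A minimal sketch of the greedy idea follows. It assumes that "coverage" means how many projections include each motif position; the published heuristic may balance a different quantity. The function names `choose_uniform_projections` and `bucket_by_projection` are hypothetical and not taken from the distributed source.

```python
import random
from collections import defaultdict


def choose_uniform_projections(motif_len, proj_size, num_projections, seed=0):
    """Greedily pick position subsets so per-position coverage stays balanced."""
    rng = random.Random(seed)
    coverage = [0] * motif_len
    projections = []
    for _ in range(num_projections):
        # Prefer the least-covered positions; break ties at random.
        order = sorted(range(motif_len),
                       key=lambda p: (coverage[p], rng.random()))
        chosen = tuple(sorted(order[:proj_size]))
        for p in chosen:
            coverage[p] += 1
        projections.append(chosen)
    return projections, coverage


def bucket_by_projection(sequences, motif_len, projection):
    """Hash every l-mer by the letters at the projected positions."""
    buckets = defaultdict(list)
    for seq_index, seq in enumerate(sequences):
        for start in range(len(seq) - motif_len + 1):
            key = "".join(seq[start + p] for p in projection)
            buckets[key].append((seq_index, start))
    return buckets


projections, coverage = choose_uniform_projections(10, 4, 5)
buckets = bucket_by_projection(["ACGTACGTAC"], 10, projections[0])
```

With purely random selection some positions can be sampled many times and others never; the greedy choice here keeps the per-position counts within one of each other. As in the random projection algorithm, buckets that collect many l-mers are the candidate motif seeds.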

Download source

Spectral Networks for Identification of Proteins and Post-Translational Modifications

Spectral networks are based on a new idea that allows one to perform MS/MS database search . . . without ever comparing a spectrum against a database. Spectral networks capitalize on spectral pairs: pairs of spectra obtained from overlapping (often non-tryptic) peptides or from unmodified and modified versions of the same peptide. While seemingly redundant, spectral pairs open up computational avenues that were never explored before. Pairing a spectrum of a modified peptide with a spectrum of the unmodified peptide allows one to separate the prefix and suffix ladders, to greatly reduce the number of noise peaks, and to generate a small number of peptide reconstructions that are likely to contain the correct one. The MS/MS database search is thus reduced to extremely fast pattern matching (rather than time-consuming matching of spectra against databases). In addition to speed, this approach provides a new paradigm for identifying post-translational modifications and highly modified peptides.
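
To make the ladder-separation idea concrete, here is a toy sketch for the simplest kind of spectral pair: a peptide and the same peptide extended by one residue at its C-terminus. Under that assumption, prefix (b-ion-like) peaks of the shorter peptide reappear unshifted in the longer spectrum, while suffix (y-ion-like) peaks reappear shifted by the mass difference. The helper names, the rounded residue masses, the ion offsets, and the noise-free spectra are all simplifying assumptions; real spectra need intensity handling and proper scoring.

```python
# Rounded monoisotopic residue masses (Da) for three amino acids.
RESIDUE_MASS = {"G": 57.02, "A": 71.04, "S": 87.03}


def ladders(peptide):
    """Idealized prefix (b-like) and suffix (y-like) peak masses."""
    masses = [RESIDUE_MASS[r] for r in peptide]
    prefix = [round(sum(masses[:i]) + 1.01, 2) for i in range(1, len(masses))]
    suffix = [round(sum(masses[i:]) + 19.02, 2) for i in range(1, len(masses))]
    return prefix, suffix


def split_by_shift(short_peaks, long_peaks, delta, tol=0.02):
    """Split peaks of the shorter spectrum by how they recur in the longer one."""
    def has_peak(peaks, mass):
        return any(abs(p - mass) <= tol for p in peaks)

    unshifted = [m for m in short_peaks if has_peak(long_peaks, m)]
    shifted = [m for m in short_peaks if has_peak(long_peaks, m + delta)]
    return unshifted, shifted


short_prefix, short_suffix = ladders("GAS")
long_prefix, long_suffix = ladders("GASG")
delta = RESIDUE_MASS["G"]  # mass difference between the two peptides

unshifted, shifted = split_by_shift(short_prefix + short_suffix,
                                    long_prefix + long_suffix, delta)
# unshifted recovers the prefix ladder of "GAS"; shifted recovers its suffix ladder.
```

Neither spectrum is compared against a database here; the pair alone is enough to tell which peaks belong to which ladder in this idealized case.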
