Editor: 星野哀 | 2019-07-16
arXiv:1703.10966v1 [q-bio.QM] 31 Mar 2017

and offer physical insights to mutagenesis. Empirical models are another class of methods that utilize empirical functions and potential terms to describe mutation impacts. Model parameters are fit with a given set of experimental data and the resulting model is used to predict new mutation-induced folding free energy changes. The last class of approaches is knowledge-based methods that invoke modern machine learning techniques to uncover hidden relationships between protein stability and protein structure as well as sequence. A major advantage of knowledge-based mutation predictors is their ability to handle increasingly large and diverse mutation data sets. However, the performance of these approaches depends heavily on the training sets, and their results usually cannot be easily interpreted in physical terms. A common challenge for all existing mutation impact prediction models is achieving accurate and reliable predictions of membrane protein stability changes upon mutation. As recently noted by Kroncke et al., currently there is no reliable method for the prediction of membrane protein mutation impacts.19 The membrane protein mutation data set studied by these authors has fewer than
250 data entries, which is too few for most knowledge-based methods, and involves
7 membrane protein families, which is too diverse for typical physics-based methods.

Figure 1: An illustration of topological invariants (top row), basic simplexes (middle row) and simplicial complex construction at a given filtration radius (bottom row). Top row: a point, a circle, an empty sphere and a torus are displayed from left to right. The Betti-0, Betti-1 and Betti-2 numbers are, respectively, 1, 0 and 0 for the point; 1, 1 and 0 for the circle; 1, 0 and 1 for the empty sphere; and 1, 2 and 1 for the torus. Two auxiliary rings are added to the torus to explain Betti-1 = 2. Middle row: four typical simplexes are illustrated. Bottom row: a set of ten points (left chart) at a given filtration radius (middle chart) and the corresponding simplicial complexes (right chart), where there are one 0-simplex, three 1-simplexes, one 2-simplex and one 3-simplex.

A key feature of all existing structure-based mutation impact predictors is that they either fully or partially rely on direct geometric descriptions, which reside in excessively high-dimensional spaces, resulting in a large number of degrees of freedom. In practice, the geometry can easily be oversimplified. Mathematically, topology, in contrast to geometry, concerns the connectivity of different components in a space,20 and offers the ultimate level of abstraction of data. However, conventional topology incurs too much reduction of geometric information to be practically useful in biomolecular analysis.
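As a small illustration of the topological invariants named in the Figure 1 caption, the sketch below computes Betti numbers of a simplicial complex from the ranks of its boundary matrices, using the identity Betti-k = (number of k-simplexes) - rank(boundary k) - rank(boundary k+1). It is not the method of this paper: the function names (`rank`, `closure`, `boundary_matrix`, `betti_numbers`) are hypothetical, and the circle and the empty sphere are assumed to be modeled by a hollow triangle and a hollow tetrahedron, respectively. The torus is omitted because it needs a larger triangulation.

```python
from fractions import Fraction
from itertools import combinations


def rank(matrix):
    """Rank of a matrix (list of rows) by Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    rk = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rk, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for r in range(rk + 1, len(rows)):
            if rows[r][col] != 0:
                factor = rows[r][col] / rows[rk][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rk])]
        rk += 1
    return rk


def closure(top_simplices):
    """All faces of the given simplexes, grouped by dimension."""
    by_dim = {}
    for simplex in top_simplices:
        s = tuple(sorted(simplex))
        for size in range(1, len(s) + 1):
            for face in combinations(s, size):
                by_dim.setdefault(size - 1, set()).add(face)
    return {dim: sorted(faces) for dim, faces in by_dim.items()}


def boundary_matrix(k_simplices, km1_simplices):
    """Matrix of the boundary map from k-chains to (k-1)-chains."""
    index = {s: i for i, s in enumerate(km1_simplices)}
    mat = [[0] * len(k_simplices) for _ in km1_simplices]
    for j, s in enumerate(k_simplices):
        for i in range(len(s)):
            face = s[:i] + s[i + 1:]  # drop the i-th vertex
            mat[index[face]][j] = (-1) ** i
    return mat


def betti_numbers(top_simplices, max_dim=2):
    """Betti-0 .. Betti-max_dim of the complex generated by top_simplices."""
    cx = closure(top_simplices)
    ranks = {0: 0}  # the boundary of a 0-chain is zero
    for k in range(1, max_dim + 2):
        ranks[k] = rank(boundary_matrix(cx[k], cx[k - 1])) if k in cx else 0
    return [len(cx.get(k, [])) - ranks[k] - ranks[k + 1]
            for k in range(max_dim + 1)]


point = [(0,)]
circle = [(0, 1), (1, 2), (0, 2)]           # hollow triangle
sphere = list(combinations(range(4), 3))    # hollow tetrahedron

print(betti_numbers(point))   # [1, 0, 0]
print(betti_numbers(circle))  # [1, 1, 0]
print(betti_numbers(sphere))  # [1, 0, 1]
```

Exact rational arithmetic is used so that the ranks are not affected by floating-point tolerance. For larger complexes, a dedicated persistent homology library would be the more practical choice.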
Persistent homology, a new branch of algebraic topology, retains partial geometric information in topological description, and thus bridges the gap between geometry and topology.21,22 It has been applied to biomolecular characterization, identification and analysis.23-27 However, conventional persistent homology makes no distinction between different atoms in a biomolecule, which results in a heavy loss of biological information and limits its performance in protein classification.28 In the present work, we introduce element specific persistent homology (ESPH), interactive persistent homology and binned barcode representation to retain essential biological information in the topological simplification of biological complexity. We further integrate ESPH and machine learning to analyze and predict protein mutation impacts. The essential idea of our topological mutation predictor (T-MP) is to use ESPH to transform the biomolecular data in the high-dimensional space with full biological complexity to a space of fewer dimensions and simplified biological complexity, and to use machine learning to deal with massive and diverse data sets. A