A Topological Data Analysis of the Protein Structure

Persistent homology is a tool from topological data analysis (TDA), a family of methods that has so far been very successful in biology, a field where metrics are used only to measure similarity. The success of persistent homology rests on one key point: it embeds the geometric details and focuses on the global shape of the data. A large body of work already exists in the field, and we investigate this point here. In this work we support the assumption that topology embeds geometry by displaying the structure of COILED SERINE, a protein estimated to constitute 3-5 percent of the encoded residues in most genomes, and by giving a substitute for the optimal characteristic distance used in the flexibility-rigidity index, a classic method for simulating molecular movement and flexible behavior through atomic rigidity functions. We also analyze patterns in the binding site of the beta sheet generated from the PDB file 2JOX. We detect and briefly describe the different patterns obtained with javaplex, using barcodes and linear statistical results as summary statistics.


Introduction
Interpreting statistical summaries is the main building block of any conclusion drawn from an applied mathematical result, and a well-defined figure that reflects the starting hypothesis is key to any valuable scientific work. For that reason we investigate recent work on topological data analysis (TDA), which is currently regarded as a well-suited tool, on both the algorithmic and the axiomatic level, for approaching a convincing answer. We address the following points to examine our hypothesis:
• Categories: parametric vs. non-parametric approaches in TDA.
• Parametric models, the first methodology in TDA: modeling and replicating statistical topology.
• The space of persistence diagrams: first definition and the Wasserstein distance.
• The norm of a persistence diagram.
The different uses of TDA in molecular biology can be summarized in the following points:
• protein structure prediction.
• protein structure analysis.
• protein-ligand binding : integration of "element specific persistent homology" also called "multicomponent persistent homology" for protein-ligand binding affinity prediction.
• protein binding site analysis.
The different statistical methods can be summarized as:
• linear statistical approach using persistent homology.
• topological descriptors for an unsupervised learning approach.
A linear statistical approach to persistent homology is regarded as the most accessible way for non-theoreticians to derive interesting results in protein structure prediction and analysis. TDA was first used for this purpose via persistent homology in [1]. By comparing topological and geometrical methods through the construction of different physical models, Xia, K. and collaborators obtained results that support the topology-function paradigm over the geometry-function one. A detailed description of the method, using persistence landscapes as the statistical summary, was given in [2] for the analysis of protein binding. Persistence landscapes were introduced because barcodes offer only a restricted theoretical frame for defining a statistical observation: the statistical study involves computing Fréchet means, and a bijection between Fréchet means and the set of barcodes appears difficult to realize. Barcodes nevertheless remain the most popular statistical summary of persistent homology. Accumulated bar length is another way to exploit persistent homology, as shown in [3], where it is used to determine the exponential kernel in molecular dynamics simulations. Highlighting the topological signature of an atom in a macromolecule, or "weighted persistent homology" [4] (ASPH), can be even more precise at the macromolecular level.
1.1. Persistent homology, clustering an atoms point cloud.
Proteins are the main building components of all cellular tissues in all living organisms. This definition holds thanks to Anfinsen's dogma [5], but it faces a real challenge in the complexity of the protein folding path. The analysis of protein structure and the development of summary statistics for an accurate structure-function relationship have advanced considerably in recent decades thanks to the enormous amount of data generated by X-ray crystallography. This availability of data gave rise to a new paradigm, "the complexity of data", and computational topology appears to answer many of the resulting questions [6]. The XYZ distribution can be trusted, since all configurations follow physical laws, but a better way of linking atoms at the macroscopic level is needed to capture the other aspects of a protein. The use of persistent homology for detecting and analyzing the protein folding path was investigated with a topological feature vector in [7]. Persistent homology is chosen for its capacity to neglect metric details and to capture voids, cavities and holes at different scales by means of a filtration parameter [8] [9], which is precisely what is required of the mathematical tools used to analyze protein structure and binding sites. Most mathematical models used to study protein characteristics such as flexibility, folding and structure are geometry based, which increases the complexity of the algorithms. Several tools compute the associated network metrics, such as VisANT [10], CentiScaPe [11], CentiLib [12] and Visone [2], but none of these models and tools captures the dynamical nature of the protein, which the filtration parameter does [13].
As already mentioned, we analyze the topological structure of COILED SERINE and show how the optimal characteristic distance used in the flexibility-rigidity index (a classic method for simulating molecular movement and flexible behavior through atomic rigidity functions) can be replaced by simple topological descriptors. We also analyze patterns in the binding site of the beta sheet generated from the PDB file 2JOX, and we detect and briefly describe the patterns obtained with javaplex, using barcodes and linear statistical diagrams as summary statistics. The results illustrate the dynamical nature of the filtration parameter. The protocol starts with a point cloud; topology lets us extract the algebraic invariant hidden behind the final shape. The elements we filter are the homology groups. We investigate one level of organization of the protein defined above, the secondary structure, and first reconstruct the beta sheet and the alpha helix. The observations on which we use statistics to visualize a previously theoretically justified parameter (FRI) are the (x, y) pairs indicating the lifetime of each homology group, and we reduce dimensions until we obtain an XY graph.
This paper is organized in four sections. Section 1 is this introduction. In section 2 we summarize the mathematical material required, especially the tools of persistent homology. In section 3 we present the details of our topological approach to analyzing the structure of the COILED SERINE protein.
Finally, in section 4 we make some conclusions and discuss some further possible research issues.

Mathematical Background
As mentioned above, in this section we summarize the tools used in our topological approach to the structure of the molecule defined previously. We give the key notions of simplicial homology and more details about persistent homology. For more details about simplicial homology we refer the reader to [14]. The references [15] and [16] are considered by most topological data analysts to be essential reading on persistent homology.
2.1. Simplicial homology.
Homology is the part of algebraic topology that is effectively computable; its main application here is dimensionality reduction through tools such as persistent homology.

Definition 2.3. A simplicial complex $K$ is a finite set of simplices satisfying the following conditions:
(1) For all simplices $A \in K$ with $\alpha$ a face of $A$, we have $\alpha \in K$.
(2) For all simplices $A, B \in K$, the intersection $A \cap B$ is either empty or a face of both $A$ and $B$.

Writing $C_p(K)$ for the group of $p$-chains of $K$, the boundary operator $\partial_p : C_p(K) \to C_{p-1}(K)$ is well defined at the level of generators as follows: to any $p$-simplex $\sigma = [e_0, e_1, \ldots, e_p]$ we associate the $(p-1)$-chain
$$\partial_p \sigma = \sum_{i=0}^{p} (-1)^i [e_0, \ldots, \hat{e}_i, \ldots, e_p],$$
where $\hat{e}_i$ is omitted.

Thus, we obtain the chain complex
$$\cdots \to C_{p+1}(K) \xrightarrow{\partial_{p+1}} C_p(K) \xrightarrow{\partial_p} C_{p-1}(K) \to \cdots \to C_0(K) \to 0,$$
where each arrow denotes the corresponding boundary map. Elements of $Z_p(K) = \ker \partial_p$ are called $p$-cycles, while those of $B_p(K) = \operatorname{Im} \partial_{p+1}$ are called boundaries. The following fundamental result states that any boundary is a cycle.

Theorem 2.1. The boundary of a boundary vanishes, that is,
$$\partial_p \circ \partial_{p+1} = 0.$$

The $p$-th simplicial homology group is defined to be the quotient group
$$H_p(K) = Z_p(K) / B_p(K).$$
It measures the obstruction for a cycle to be a boundary. The $p$-th Betti number is its rank:
$$\beta_p = \operatorname{rank} H_p(K).$$

For any topological space $X$, one way to define its homology is the following. First, one calls a $p$-simplex of $X$ any continuous map
$$\sigma : [e_0, e_1, \ldots, e_p] \to X,$$
and denotes by $K_p(X)$ the $\mathbb{Z}$-module spanned by these $p$-simplices. In this way one associates to any topological space $X$ a simplicial complex $K(X)$, unique up to homeomorphism. Second, one defines the faces
$$\sigma_i = \sigma|_{[e_0, \ldots, \hat{e}_i, \ldots, e_p]},$$
where $\hat{e}_i$ is omitted. Finally, one defines the boundaries $\partial_p : K_p(X) \to K_{p-1}(X)$ as follows:
$$\partial_p \sigma = \sum_{i=0}^{p} (-1)^i \sigma_i.$$
Hence the simplicial homology of $X$ is none other than that of $K(X)$. Mathematically speaking,
$$H_p(X) = H_p(K(X)).$$

The simplicial homology of a topological space is a homotopy invariant; in other words, two homotopy-equivalent topological spaces have the same homology. The converse is false in general; however, homology can be used to prove that two topological spaces are not homotopy equivalent whenever they do not have the same homology. The key contribution of simplicial homology is to count the holes of a given dimension in a topological space. Connected components correspond to dimension 0.
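To make the definitions above concrete, the following minimal sketch computes Betti numbers as the dimension of the cycles minus the dimension of the boundaries, using boundary matrices. It is an illustration only, not the computation performed in this paper. Two assumptions are ours: coefficients are taken in the two-element field rather than in $\mathbb{Z}$ to keep the linear algebra short (the result agrees with the integral Betti numbers when the homology has no 2-torsion), and the function names `boundary_matrix`, `rank_gf2` and `betti_numbers` are hypothetical.

```python
from itertools import combinations


def boundary_matrix(p_simplices, faces):
    """Boundary matrix modulo 2: rows index (p-1)-faces, columns index p-simplices."""
    index = {face: i for i, face in enumerate(faces)}
    matrix = [[0] * len(p_simplices) for _ in faces]
    for j, simplex in enumerate(p_simplices):
        # Each face is obtained by omitting one vertex of the simplex.
        for face in combinations(simplex, len(simplex) - 1):
            matrix[index[face]][j] = 1
    return matrix


def rank_gf2(matrix):
    """Rank of a 0/1 matrix by Gaussian elimination modulo 2."""
    rows = [row[:] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(len(rows)):
            if r != rank and rows[r][col]:
                rows[r] = [a ^ b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def betti_numbers(simplices):
    """Betti numbers (mod 2) of a complex given as sorted vertex tuples, closed under faces."""
    by_dim = {}
    for simplex in simplices:
        by_dim.setdefault(len(simplex) - 1, []).append(tuple(simplex))
    top = max(by_dim)
    # ranks[p] is the rank of the boundary map from p-chains to (p-1)-chains.
    ranks = {0: 0, top + 1: 0}
    for p in range(1, top + 1):
        ranks[p] = rank_gf2(boundary_matrix(by_dim[p], by_dim[p - 1]))
    # dim ker(d_p) - dim im(d_{p+1}) = (#p-simplices - rank d_p) - rank d_{p+1}
    return [len(by_dim[p]) - ranks[p] - ranks[p + 1] for p in range(top + 1)]
```

For the hollow triangle `[(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]`, which is a combinatorial circle, `betti_numbers` returns `[1, 1]`: one connected component and one loop.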
For example:
• for a point: $H_0(\mathrm{pt}) = \mathbb{Z}$, while $H_p(\mathrm{pt}) = 0$ for $p > 0$;
• for a sphere $S^n$ with $n \geq 1$: $H_0(S^n) = H_n(S^n) = \mathbb{Z}$, while $H_p(S^n) = 0$ otherwise;
• for a torus $T^2$: $H_0(T^2) = \mathbb{Z}$, $H_1(T^2) = \mathbb{Z}^2$, $H_2(T^2) = \mathbb{Z}$, while $H_p(T^2) = 0$ for $p > 2$.

2.2. Persistent homology.
Theoretically, the term persistence was first introduced in [11], where it was given an abstract definition as a natural extension of homology to filtered simplicial complexes. For applied purposes, persistent homology works as a statistical tool intended to reconstruct the manifold supporting the point cloud mentioned in the introduction; the manifold is the hidden space from which the data has been sampled. The result that makes computation feasible is that the persistent homology of a filtered complex is nothing but the ordinary homology of a graded module over a polynomial ring [17]. Another explicit description of persistent homology, through the visualization of barcodes, can be found in [18]. We suggest here a concise definition via the classification theorem.

Remark 2.1 (Persistence modules). We apply the "homology functor" to the filtered chain complexes [2], so we get our category of "homology groups", which can be viewed as
$$H_k(K^0) \to H_k(K^1) \to \cdots \to H_k(K^n),$$
where $\to$ denotes the map induced by the inclusion map.
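Theorem 2.2. For a finite persistence module $C$ with field $F$ coefficients,
$$H_*(C; F) \cong \bigoplus_i x^{t_i} \cdot F[x] \oplus \left( \bigoplus_j x^{r_j} \cdot \left( F[x] / (x^{s_j} \cdot F[x]) \right) \right),$$
where the exponents $t_i$, $r_j$ and $s_j$ are the quantification of the filtration parameter over the field. A clear description can be found in [13].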
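Definition 2.6. The $p$-persistence $k$-th homology group of the $l$-th complex $K^l$ of the filtration is
$$H_k^{l,p} = Z_k^l / (B_k^{l+p} \cap Z_k^l).$$

To visualize the method efficiently one needs a metric. To that end, let us assume that our point set $X$ carries a metric and define the following complex.

Definition 2.7. The open Vietoris-Rips complex at scale $r$ is the simplicial complex whose vertices are the points of $X$ and whose $p$-simplices are the subsets of $p + 1$ points of $X$ with diameter less than $r$.

To compare the intervals obtained along the filtration, so that its continuity property can be expressed, one needs a distance on the set of persistence diagrams. One way to obtain it is the Wasserstein distance between two diagrams $D_1$ and $D_2$,
$$W_q(D_1, D_2) = \left( \inf_{M : D_1 \to D_2} \sum_{x \in D_1} \| x - M(x) \|_\infty^q \right)^{1/q},$$
where $M$ ranges over the bijections between the two diagrams, each completed with the points of the diagonal.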
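Data often comes with noise, since we sample from an unknown space (a probability distribution). For that reason, a useful result for checking and correcting the final results at the computing stage is the stability theorem.

Theorem 2.3. Let $f, g : K \to \mathbb{R}$ be monotone functions. Then for each homology dimension $k$ we have
$$W_\infty(\mathrm{Dgm}_k(f), \mathrm{Dgm}_k(g)) \leq \| f - g \|_\infty,$$
where $\mathrm{Dgm}_k$ denotes the persistence diagram in dimension $k$ and $W_\infty$ is the bottleneck distance, the limiting case of the Wasserstein distance.

The theorem is called a stability theorem because the map sending a function to its persistence diagram does not expand distances: a small perturbation of the data produces only a small change in the associated persistence diagram. This guarantees theoretically that the mapping between data and persistence diagrams is well defined and continuous.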
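The distance between persistence diagrams discussed above can be evaluated by hand for very small diagrams. The sketch below is a brute-force illustration under our own assumptions: diagrams are lists of (birth, death) pairs, the ground metric is the sup norm, a point matched to the diagonal costs half its lifetime, and the name `wasserstein_distance` and the exponent parameter `q` are hypothetical. It enumerates all matchings, so it is only usable for a handful of points and is not a substitute for a dedicated library.

```python
from itertools import permutations


def wasserstein_distance(diagram_x, diagram_y, q=2):
    """q-Wasserstein distance between two small persistence diagrams (brute force)."""
    # None stands for "the diagonal": each diagram receives one diagonal slot
    # per point of the other diagram, so that any point may remain unmatched.
    left = list(diagram_x) + [None] * len(diagram_y)
    right = list(diagram_y) + [None] * len(diagram_x)

    def cost(a, b):
        if a is None and b is None:
            return 0.0  # diagonal matched to diagonal is free
        if a is None:
            return (b[1] - b[0]) / 2.0  # sup-norm distance from b to the diagonal
        if b is None:
            return (a[1] - a[0]) / 2.0
        return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

    best = min(
        sum(cost(a, b) ** q for a, b in zip(left, perm))
        for perm in permutations(right)
    )
    return best ** (1.0 / q)
```

For instance, with `q=1` the diagrams `[(0.0, 2.0)]` and `[(0.0, 3.0)]` are at distance 1: matching the two points costs 1, whereas sending both to the diagonal would cost 1 + 1.5.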

Topological data analysis of the protein
The most popular use of topological data analysis (TDA for short) is clustering through persistent homology, since this was the immediate extension of applied statistics in TDA. This comes from the intrinsic properties of a point cloud, although the axiomatic presentation seems to conceal richer strategies [19]. The field of application that has so far been most successful is molecular biology, because geometric representations become inadequate for serious investigations of behaviors such as the flexibility and folding of proteins, in addition to requiring expensive and complicated computations. An interesting application is protein binding analysis [20].
Before presenting the parameters used to generate a suitable filtration, one needs to understand in depth the notion of a protein, what makes it such an interesting concept, and how modern models have been shaped by the accumulation of results and surveys. With the evolution of mathematical tools and computing power, many theoretical hypotheses have been turned into well-defined quantitative results. The first step toward the definition and analysis of protein structure is marked by the 1972 Nobel Prize awarded to Christian Anfinsen for his work on the connection between the amino acid sequence and the biologically active conformation. Anfinsen's work gives the defining notion of a protein as a concept, as well as a hypothesis to be investigated: all research in protein analysis questions the relation between the amino acid sequence and the active conformation. It was not until 1994 that the critical assessment of structure prediction became an established enterprise. The challenge begins with structure prediction itself: the only way to carry out the calculations was through quantum mechanics, which is not sustainable. For that reason, once a substantial amount of data had been gathered, the only way to complete the databases was to address the structure prediction question. This required an understanding of the folding path, which naturally led to work on protein flexibility and rigidity using mathematical and statistical methods rather than experimental geometric ones. To illustrate, consider the geometric representation shown in the figure below. This is the best visualization of a protein that can be produced, and it clearly shows many complicated patterns, even at a heavy computational cost; there is thus no practical geometric representation, at least for a well-folded protein. Since the shape of a protein dictates its function, one can tell from the figure why the geometry-function paradigm is not practical for learning purposes. This raises the question of how the transition to the theoretically supported topology-function paradigm can take place.

Topological descriptors of an already defined beta sheet.
This section is devoted to an application: the protocol is statistical inference for observations that are barcodes. We use existing data from the freely available Protein Data Bank, then read the results as in any statistical study; the data is rearranged when the barcodes appear noisy and difficult to compute. We consider Gaussian noise when setting up our point cloud data set, which is contained in $\mathbb{R}^3$.

Let us illustrate the visualization with a simple example. Our point cloud is the set of atoms in 3-dimensional space read from the PDB file with ID 1COS; each atom is associated with the same radius in the distance-based filtration. The stream is constructed for the point cloud given by the xyz coordinates of the all-atom representation. The data set is small, so a Vietoris-Rips simplicial complex is largely sufficient to decipher the topological descriptors and there is no need for a landmark selector; we simply build a Vietoris-Rips stream. A better filtration could be chosen, but because of limited computing power we keep the value 8.
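The Vietoris-Rips construction used above can be written down directly from its definition. The following sketch is a brute-force illustration in Python, not the javaplex stream used in this work; the function name `vietoris_rips` and the parameters `r` (scale) and `max_dim` (largest simplex dimension kept) are our own hypothetical choices. It keeps every subset of points whose pairwise distances are all less than `r`.

```python
from itertools import combinations
from math import dist


def vietoris_rips(points, r, max_dim=2):
    """Simplices (as sorted index tuples) whose vertex sets have diameter less than r."""
    n = len(points)
    # close[i][j] is True when points i and j are at distance < r.
    close = [[dist(points[i], points[j]) < r for j in range(n)] for i in range(n)]
    simplices = [(i,) for i in range(n)]
    for size in range(2, max_dim + 2):
        for subset in combinations(range(n), size):
            # A subset has diameter < r exactly when all its pairs are close.
            if all(close[i][j] for i, j in combinations(subset, 2)):
                simplices.append(subset)
    return simplices
```

The enumeration grows combinatorially with the number of points, which is one reason why a bounded filtration value and a small `max_dim` are used in practice. On the four corners of a unit square with `r = 1.1`, the complex consists of the four vertices and the four sides only, since the diagonals are longer than `r`.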
We obtain the topological representation of our data in the form of a barcode, which can be called a topological descriptor. For the alpha carbon model we consider only the alpha carbon atoms; to capture the structure of the backbone we construct a Vietoris-Rips stream.
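Restricting a structure to its alpha carbon atoms amounts to filtering the ATOM records of the PDB file. The sketch below shows one way to do this in Python, relying on the fixed-column layout of the PDB format (atom name in columns 13-16, coordinates in columns 31-54). The function name `read_alpha_carbons` is hypothetical, and the sketch ignores details such as alternate locations or multiple models, which would need handling for a real entry.

```python
def read_alpha_carbons(pdb_lines):
    """Return the (x, y, z) coordinates of the C-alpha atoms found in PDB ATOM records."""
    coords = []
    for line in pdb_lines:
        # Fixed-column PDB layout (1-based): record name in columns 1-6,
        # atom name in 13-16, x in 31-38, y in 39-46, z in 47-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            coords.append((x, y, z))
    return coords
```

Since an open text file iterates over its lines, the function can be called on a file handle for a locally downloaded entry; the resulting list of coordinates is the alpha carbon point cloud. Only ATOM records are read, so heteroatoms such as calcium ions, which share the name CA, are not picked up.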
As the filtration begins, a 3-dimensional ball grows around each $C_\alpha$ atom. When the balls start to overlap they form one-dimensional loops, built from one-dimensional simplices; the first Betti number is then computed and depicted in the barcode as four short bars from 0.2 to 1.5 Angstrom, as mentioned in the literature. We assume that the four levels fully define the structure of the main building component (the protein).
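The way bars appear and disappear along the filtration is easiest to see in dimension 0, where a bar ends each time two connected components merge. The sketch below computes this dimension-0 barcode with a union-find structure over the edges sorted by length. It is a stand-in written for illustration, not the javaplex computation used here, and it does not produce the one-dimensional bars discussed above; the name `h0_barcode` is hypothetical. Deaths are recorded at the edge length, so under a convention based on ball radii the values would be halved.

```python
from itertools import combinations
from math import dist, inf


def h0_barcode(points):
    """Dimension-0 barcode of the Vietoris-Rips filtration of a point cloud."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    bars = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            bars.append((0.0, length))  # one component dies when two merge
    if points:
        bars.append((0.0, inf))  # the last component never dies
    return bars
```

For three collinear points at positions 0, 1 and 5, the bars are `(0, 1)`, `(0, 4)` and one infinite bar: the first two points merge at scale 1 and the third joins them at scale 4.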
3.0.2. Parameters used to generate a suitable filtration.
Different probabilistic methods and tools are used to simulate molecular behaviour emerging from the atomistic level. No theoretical frame is given, which means that results can only be found and interpreted through a learning process; a clear description can be found in [21]. Many powerful models have been proposed to analyze the main properties of proteins, such as folding and flexibility, among them molecular non-linear dynamics (MND) and the flexibility-rigidity index (FRI). The fundamental assumption of these methods is that a computable model can only be reached through physical laws followed directly by their mathematical description, which is a demanding and complicated route. That said, the accumulated results provide a large amount of data that can be condensed into creative statistical tools. We use the main models, MND and FRI, to support this viewpoint, which amounts to providing a correlation-matrix-based filtration for the persistent homology analysis of proteins. A simple example defining distance matrices for use in persistent homology can be found in [7].

One of the techniques used in flexibility analysis is molecular non-linear dynamics. We denote the coordinates of the atoms in the molecule studied by $r_1, r_2, \ldots, r_i, \ldots, r_N$, where $r_i \in \mathbb{R}^3$ is the position vector of the $i$-th atom. The Euclidean distance $r_{ij}$ between the $i$-th and $j$-th atoms can be calculated. We can then construct the topological connectivity matrix, which serves as the input for our "barcode statistical inference", with monotonically decreasing radial basis functions:
$$C_{ij} = \omega_{ij} \Phi(r_{ij}, \eta_{ij}), \quad i \neq j,$$
where $\omega_{ij}$ is associated with atomic types, $\eta_{ij}$ is the atomic-type related characteristic distance and $\Phi(r_{ij}, \eta_{ij})$ is a radial basis correlation kernel.
A generalized exponential kernel has the form
$$\Phi(r, \eta) = e^{-(r/\eta)^k},$$
and the Lorentz type of kernel is
$$\Phi(r, \eta) = \frac{1}{1 + (r/\eta)^\nu}.$$
The parameters $k$, $\nu$ and $\eta$ are adjustable. We usually search over a reasonable range of parameters to find the best fit by comparison with experimental B-factors [6]. It is assumed that each particle in a protein can be viewed as a non-linear oscillator whose dynamics can be represented by a non-linear equation. The interactions between particles are represented by the correlation matrix $(C_{ij})$. Therefore, for the whole protein of $N$ particles, we set up a non-linear dynamical system as
$$\frac{du}{dt} = F(u) + E u,$$
where $u = (u_1, u_2, \ldots, u_i, \ldots, u_N)^T$ is an array of state functions for the $N$ non-linear oscillators ($T$ denotes the transpose), $u_j = (u_{j1}, u_{j2}, \ldots, u_{jn})^T$ is an $n$-dimensional non-linear function for the $j$-th oscillator, $F(u) = (F(u_1), F(u_2), \ldots, F(u_N))^T$ is an array of non-linear functions of the $N$ oscillators, and
$$E = \varepsilon C \otimes \Gamma.$$
Here, $\varepsilon$ is the overall coupling strength, $C = (C_{ij})_{i,j=1,2,\ldots,N}$ is an $N \times N$ correlation matrix, and $\Gamma$ is an $n \times n$ linking matrix.
The transverse stability of the MND system gradually increases as the protein folds from disordered conformations to its well-defined native structure.
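The two kernels and the resulting connectivity matrix are simple to evaluate numerically. The sketch below is a minimal illustration under stated assumptions: the matrix entries are taken to be $\omega \, \Phi(r_{ij}, \eta)$ for $i \neq j$, with a single weight and a single characteristic distance for all atom pairs, and the diagonal is left at zero as a placeholder. The function names `exponential_kernel`, `lorentz_kernel` and `connectivity_matrix`, and the default exponents, are our own hypothetical choices.

```python
from math import dist, exp


def exponential_kernel(r, eta, k=2.0):
    """Generalized exponential kernel exp(-(r / eta) ** k)."""
    return exp(-((r / eta) ** k))


def lorentz_kernel(r, eta, nu=2.0):
    """Lorentz-type kernel 1 / (1 + (r / eta) ** nu)."""
    return 1.0 / (1.0 + (r / eta) ** nu)


def connectivity_matrix(coords, eta, kernel=exponential_kernel, weight=1.0):
    """Matrix with entries weight * kernel(r_ij, eta) for i != j and 0 on the diagonal."""
    n = len(coords)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            value = weight * kernel(dist(coords[i], coords[j]), eta)
            matrix[i][j] = matrix[j][i] = value  # the matrix is symmetric
    return matrix
```

Both kernels equal 1 at distance 0 and decrease monotonically with distance, so nearby atoms are strongly correlated and distant atoms weakly. Searching over `eta` and the exponent, as described above, amounts to calling `connectivity_matrix` over a grid of parameter values.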
Figure 16. Behaviour of the folding process through filtration

3.1. Persistent homology analysis of the characteristic distance.
We consider a folding protein that consists of $N$ particles and has spatio-temporal complexity $\mathbb{R}^{3N} \times \mathbb{R}^+$. We assume that our system can be described as a set of $N$ non-linear oscillators of dimension $\mathbb{R}^{nN} \times \mathbb{R}^+$, where $n$ is the dimensionality of a single non-linear oscillator. The kernels are those given by the equations $\Phi(r, \eta) = e^{-(r/\eta)^k}$ and $\Phi(r, \eta) = \frac{1}{1 + (r/\eta)^\nu}$. Persistent homology can provide a quantitative prediction of the optimal characteristic distances in MND and FRI. The optimal characteristic distance varies from protein to protein. An adequate filtration process is the essence of persistent homology analysis; for that purpose a filtration matrix based on a modification of the correlation matrix of MND is proposed:
$$M_{ij} = \begin{cases} 1 - \Phi(r_{ij}, \eta_{ij}), & i \neq j, \\ 0, & i = j, \end{cases}$$
where $0 \leq \Phi(r_{ij}, \eta_{ij}) \leq 1$ is defined previously, using the exponential kernel with parameter $k = 2$. As we slightly vary the filtration parameter of the alpha carbon (AC) point cloud of the alpha helix from the PDB entry 1COS, the formation of the simplicial complex, that is the topological connectivity, changes too.
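The correlation-based filtration can be sketched numerically as follows. We take the filtration matrix to have entries $1 - \Phi(r_{ij}, \eta)$ off the diagonal and $0$ on it, with the exponential kernel and exponent 2; this explicit form, the use of a single characteristic distance `eta` for all pairs, and the name `filtration_matrix` are assumptions of this sketch rather than a transcription of the implementation used in this work.

```python
from math import dist, exp


def filtration_matrix(coords, eta, k=2.0):
    """Matrix with entries 1 - exp(-(r_ij / eta) ** k) off the diagonal and 0 on it."""
    n = len(coords)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            phi = exp(-((dist(coords[i], coords[j]) / eta) ** k))
            # Strongly correlated (nearby) atoms receive small filtration values.
            matrix[i][j] = matrix[j][i] = 1.0 - phi
    return matrix
```

The entries lie in the interval from 0 to 1 and increase with distance, so the matrix can play the role of a distance matrix in a Vietoris-Rips type filtration. Changing `eta` rescales which pairs of atoms connect early, which is how the characteristic distance influences the resulting barcodes.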

Conclusion and discussion
This work shows a simple application of persistent homology. Its main focus is to present a road map for becoming familiar with the axiomatic idea, while still reaching a notable result. It was beyond the scope of this work to justify theoretically the use of statistical tests on the set of barcodes, but the application shows that the method can go beyond a simple statistical approach: instead of conducting a molecular dynamics simulation, it is easier to use existing information from models to construct a quantified sequence of barcodes and then to look for its convergence limit. Interesting contributions can be found in the literature, but none fully exploits persistent homology beyond its role as a statistical tool. An interesting attempt using dynamical distances was made by Peter Bubenik and collaborators; it could not theoretically justify barcodes as a statistical observation, but it gave rise to a new functional tool, persistence landscapes.

Conflicts of Interest:
The authors declare that there are no conflicts of interest regarding the publication of this paper.

Figure 4. A Vietoris-Rips illustration

The lifetime of each homology group, that is the algebraic length of the intervals $(l, p)$ together with the values of $k$, can be summarized and visualized using barcodes. Since $\mathbb{R}$ is the natural set for describing an interval for analytical purposes, one needs to define homology on vector spaces in order to use a field $F$; this gives a clear definition ready to be used for applied purposes.

Figure 5. Illustration of the birth and death of a feature of the data through barcode visualization

If our topological space $X$ is a totally bounded metric space, one can write the barcode as $\mathrm{barc}^{VR}_k(X, F)$. To separate interleaving components one also needs to calculate the distance between barcodes; this is the purpose of Definition 2.9.

Figure 6. A persistence diagram as displayed by the software

As can be seen from Figure 5, each barcode can be represented by a persistence diagram.

Figure 11. Representation of a folded protein with PDB ID 1IJ3

Figure 13. Descriptors of the alpha carbon distribution of an alpha helix generated from the protein with PDB ID 1COS

Figure 14. Topological descriptors of the beta sheet all-atom point cloud

Figure 15. Topological descriptors of the beta sheet alpha carbon point cloud

Figure 21. Connectivity patterns of the alpha carbon point cloud