Protein Docking From Scratch


During the last semester I took a seminar on artificial intelligence but ended up choosing protein docking as my topic. From a mathematical point of view it could have become interesting. However, just implementing the potential and its gradient, which is mathematically pretty trivial, was much more work than I expected. In my seminar report I elaborated on the obstacles, ranging from trivial but time-devouring ones like parsing PDB files to “harder” ones like determining the parameters of the potential, which is itself an optimization problem. In this post I give a brief overview of the topics covered in the report.

Sections that might be interesting for people who want to get into protein docking

Terminology

The first section introduces basic terminology, so that a reader should be able to get the gist of a text on protein docking. It starts by defining polypeptide chains and the four levels of protein structure (primary, secondary, tertiary and quaternary), goes on to introduce the coordinate systems that are frequently used, and concludes with a definition of what protein docking is about.

Lennard-Jones potential

The second section introduces the Lennard-Jones potential, gives a physical interpretation, discusses its feasibility, and covers the estimation of its parameters. I think the latter is particularly interesting. Even though I knew I was only dealing with a model of reality and not “reality itself”, I thought that, for example, setting the parameter accounting for the Pauli repulsion to the value the physicists “derived” from experiments (or calculations) would be fine. However, one should keep in mind that this is only a model, and it works best when fitted to the data, even if the fit contradicts physicists’ intuition (e.g. two carbon atoms in two different positions ending up with two different Pauli-repulsion parameters). The underlying physics provides a good initial guess for the model, but it does not have to be the final answer!
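For readers who have not seen it before, here is a minimal sketch of the potential. It assumes the common 12-6 form V(r) = 4*epsilon*((sigma/r)^12 - (sigma/r)^6) and a single (epsilon, sigma) pair for all atom pairs, which is a simplification; the report may parametrize it differently, and the names lennard_jones and total_energy are made up for this post.

```python
import math

def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """12-6 Lennard-Jones energy of two atoms at distance r > 0.

    The (sigma/r)**12 term models the short-range (Pauli) repulsion,
    the (sigma/r)**6 term the longer-range attraction.
    """
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def total_energy(coords_a, coords_b, epsilon=1.0, sigma=1.0):
    """Sum of pair energies between the atoms of two molecules.

    Assumption for illustration: every pair shares the same parameters.
    """
    return sum(
        lennard_jones(math.dist(p, q), epsilon, sigma)
        for p in coords_a
        for q in coords_b
    )

# The energy is zero at r = sigma and minimal (-epsilon) at r = 2**(1/6) * sigma.
r_min = 2.0 ** (1.0 / 6.0)
print(lennard_jones(1.0), lennard_jones(r_min))
```

Fitting the model then means treating epsilon and sigma (possibly one pair per atom type) as unknowns and optimizing them against data instead of fixing them to textbook values.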

Sections that might be interesting for mathematicians

Coordinate conversion

People who are into numerical mathematics are very likely to tell you otherwise, but in my opinion there is no really interesting math behind it.
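To give an idea of what such a conversion looks like, here is a small sketch. It assumes that the conversion in question goes from Cartesian coordinates to internal coordinates (bond lengths, bond angles, torsion angles), which this post does not spell out, and it only shows one building block: the torsion (dihedral) angle defined by four consecutive atoms. The function names are made up for illustration.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def dihedral(p0, p1, p2, p3):
    """Torsion angle in radians, in (-pi, pi], for four points in 3D.

    Uses atan2 instead of acos, so the result stays well conditioned
    for angles near 0 and pi.
    """
    u1, u2, u3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(u1, u2), _cross(u2, u3)
    y = math.sqrt(_dot(u2, u2)) * _dot(u1, n2)
    x = _dot(n1, n2)
    return math.atan2(y, x)

# Four points in one plane with the outer atoms on the same side: angle 0.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))
```

The choice of atan2 over acos is exactly the kind of detail the numerical people care about.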

Computation of the gradient

Again, from a mathematical point of view the definition of a gradient is simple. However, applied mathematicians dislike difference quotients, not only because of their inherent susceptibility to cancellation, but also because of their cost in the multivariate case: every partial derivative needs at least one additional function evaluation, which adds up quickly for functions that take a lot of time or computational resources to evaluate.
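Both problems are easy to see in a few lines. The sketch below is not taken from the report; it uses the sine function as a stand-in for an expensive potential, and the function names are made up. The first function shows the cost (one extra evaluation per input), the second one the cancellation.

```python
import math

def forward_difference_gradient(f, x, h=1e-8):
    """Approximate the gradient of f at x with forward difference quotients.

    Needs len(x) + 1 evaluations of f: one at x, one per coordinate.
    """
    fx = f(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += h
        grad.append((f(shifted) - fx) / h)
    return grad

def derivative_error(h, x=1.0):
    """Absolute error of the forward difference of sin at x (exact: cos(x))."""
    approx = (math.sin(x + h) - math.sin(x)) / h
    return abs(approx - math.cos(x))

# Shrinking h reduces the truncation error at first, but for very small h
# the subtraction sin(x + h) - sin(x) cancels most significant digits.
for h in (1e-1, 1e-4, 1e-8, 1e-12, 1e-15):
    print(f"h = {h:.0e}   error = {derivative_error(h):.2e}")
```

The error is smallest somewhere in the middle of that range and gets worse again towards the smallest step sizes, so there is no step size that makes the difference quotient both cheap and accurate.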

As a result, automatic differentiation was introduced. This section analyses the forward as well as the backward mode, based on Neumaier’s book on numerical analysis. The backward mode is particularly interesting, as it is the way to go for potentials depending on many inputs. Naturally, it has different names in different areas; the machine learning community, for example, loves to post new “graphical tutorials” on backpropagation, which is nothing but backward-mode automatic differentiation, which in turn is basically a clever use of the chain rule.
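To back up the “clever use of the chain rule” claim, here is a toy backward mode for scalars. It is a sketch for illustration only: it is neither the presentation from Neumaier’s book nor how any real library is implemented, it supports just +, -, * and constant powers, and the names Var and backward are made up.

```python
class Var:
    """Scalar node in a computation graph (toy backward-mode AD)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs (input node, local partial derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    __radd__ = __add__

    def __sub__(self, other):
        other = _lift(other)
        return Var(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        other = _lift(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __rmul__ = __mul__

    def __pow__(self, n):  # n is a plain number, not a Var
        return Var(self.value ** n, ((self, n * self.value ** (n - 1)),))

def _lift(x):
    """Wrap plain numbers so they can take part in the graph."""
    return x if isinstance(x, Var) else Var(float(x))

def backward(output):
    """Fill in .grad for every node that output depends on."""
    order, seen = [], set()

    def visit(node):  # depth-first search: inputs are appended before users
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # users first, then their inputs
        for parent, local in node.parents:
            parent.grad += local * node.grad  # chain rule

x, y = Var(3.0), Var(2.0)
f = x * y + x ** 2
backward(f)
print(x.grad, y.grad)  # df/dx = y + 2x = 8.0, df/dy = x = 3.0
```

One forward pass records the graph and one backward pass yields all partial derivatives at once, no matter how many inputs there are. That is why this mode suits a potential with thousands of coordinates and a single scalar output.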

At this point I would like to thank Dougal Maclaurin and David Duvenaud, not only for writing the AutoGrad library (a Python implementation of backward-mode automatic differentiation), but also for their quick support when I encountered difficulties using it.

Sections that might be interesting for people who like to implement stuff

Implementation

This was definitely the part that took the most time and was the least interesting (again, from a mathematical point of view). However, it may be useful as a starting point for people who want to play around with PDB files: I found the documentation to be pretty horrible, so in this section of the report I tried to summarize the parts that I needed.
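As a taste of what this involves, here is a minimal sketch of reading atom coordinates. It is not the parser from the report. It relies on the fact that ATOM records in PDB files are fixed-width, so fields have to be cut out by column position rather than split on whitespace. The two sample records are made up, and only a subset of the fields is extracted.

```python
from typing import Iterable, List, NamedTuple

class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    element: str

def parse_atom_line(line: str) -> Atom:
    """Cut one ATOM record into fields by (0-based) column slices."""
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain_id=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),  # coordinates are given in Angstrom
        y=float(line[38:46]),
        z=float(line[46:54]),
        element=line[76:78].strip(),
    )

def parse_atoms(lines: Iterable[str]) -> List[Atom]:
    """Keep only ATOM records; everything else (REMARK, HETATM, ...) is skipped."""
    return [parse_atom_line(line) for line in lines if line.startswith("ATOM  ")]

# Made-up sample records, laid out in the fixed-width ATOM format.
SAMPLE = """\
REMARK made-up sample records
ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N
ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C
"""

for atom in parse_atoms(SAMPLE.splitlines()):
    print(atom.name, atom.res_name, atom.x, atom.y, atom.z)
```

Splitting on whitespace looks tempting but breaks as soon as two neighbouring fields fill their columns completely and run into each other, which is why the slices are used here.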

Results

I should really work on the results, but limited time and computational resources were (and still are) a real showstopper. Maybe I will update them at some point, but honestly, most of my other tasks have a higher priority.