Mathematical investigations

The past few weeks I have been reading various papers in rigidity theory and writing a few proofs.  Although Laman’s Theorem (from 1970) gives necessary and sufficient conditions for 2D bar-joint framework rigidity, no one has yet discovered necessary and sufficient conditions for bar-joint framework rigidity in 3D or higher.  However, molecules can also be modeled with body-bar-hinge frameworks, consisting of rigid “bodies” held together by bars and hinges, which can flex.  Tay’s Theorem from 1984 characterizes 3D rigidity in body-bar-hinge frameworks, and it is the basis for the combinatorial pebble game algorithms that KINARI runs.  My ultimate goal is to understand how these algorithms work and prove their correctness, but I’m starting out by learning about 2D rigidity.  Right now I’m trying to prove a bijection between Henneberg constructions and Laman (also known as (2,3)-tight) graphs.
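As a note to myself: Laman's condition says a graph on n vertices is generically minimally rigid in 2D exactly when it has 2n − 3 edges and no subgraph on n′ ≥ 2 vertices has more than 2n′ − 3 edges, and this can be tested with the (2,3) pebble game of Lee and Streinu. Here is a minimal Python sketch I wrote while learning the algorithm (function names are mine, and this is not the KINARI implementation):

```python
def _bring_pebble(out, pebbles, root, forbidden):
    """DFS from root along directed edges; if a vertex with a free pebble is
    found, move that pebble to root and reverse the edges on the search path."""
    seen = {root} | forbidden
    parent = {}
    stack = [root]
    while stack:
        x = stack.pop()
        for y in list(out[x]):
            if y in seen:
                continue
            if pebbles[y] > 0:          # found a free pebble at y
                pebbles[y] -= 1
                pebbles[root] += 1
                prev, cur = x, y        # reverse the path root -> ... -> x -> y
                while True:
                    out[prev].remove(cur)
                    out[cur].add(prev)
                    if prev == root:
                        return True
                    prev, cur = parent[prev], prev
            seen.add(y)
            parent[y] = x
            stack.append(y)
    return False

def is_laman(n, edges):
    """True iff the simple graph on vertices 0..n-1 is Laman ((2,3)-tight)."""
    if len(edges) != 2 * n - 3:
        return False
    out = {v: set() for v in range(n)}   # pebble-game edge orientation
    pebbles = {v: 2 for v in range(n)}   # each vertex starts with 2 pebbles
    for u, v in edges:
        # an edge may be inserted only when u and v hold 4 pebbles together
        while pebbles[u] + pebbles[v] < 4:
            if not (_bring_pebble(out, pebbles, u, {v})
                    or _bring_pebble(out, pebbles, v, {u})):
                return False             # edge would violate (2,3)-sparsity
        pebbles[u] -= 1                  # inserting the edge costs one pebble
        out[u].add(v)
    return True
```

For example, a triangle is Laman, while K4 plus a pendant edge has the right edge count for n = 5 but fails sparsity on the K4.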

Initial Entry

(Sorry, reposting this because I accidentally left it as just a comment the first time)

1. What motivated you to participate in the 4CBC fellowship program?
I have been working with Professor Ileana Streinu for the past two years as a STRIDE scholar, and I participated in the 4CBC fellowship program last year. I find the research project very interesting, so I will continue to work for Professor Streinu in the coming academic year (my Junior year) and hopefully do an honors thesis with her in my senior year.

2. What skills or new insights do you hope to acquire during this fellowship?
Up until now, my work with Professor Streinu has been focused on data parsing and data analysis; it was much more about programming than about math. Now I’m going to focus on the more theoretical aspect of the project, the mathematics of rigidity theory.

3. Which of your two research disciplines approaches the project from a perspective that is the most challenging to you?
I don’t know biology (I’ve only taken BIO 150).

4. What course work was important to your decision to undertake this research fellowship?
If I had to name a course, it would be Graph Theory (MTH 255).

5. What courses do you plan to take to deepen your understanding of the research project?
Algorithms (CSC 252), Advanced Linear Algebra (MTH 333), Combinatorics (MTH 254), Topics in Computational Biology: Bio-Geometry of Proteins (CSC 334)

6. What are your professional goals after you complete the fellowship?
My goal is to go to graduate school in math — probably pure math. After that, I hope to become a professor and do research on both pure and applied topics in Graph Theory.

Ambiguous positions

The JMol program to color atoms inside the unit cell differently from those outside it was completed yesterday morning.

Professor Streinu and I had another discussion yesterday.  I might have mentioned this before, but in case I haven’t, there are many Cif files that have ambiguous atom positions.  There are two main ways that ambiguous atom positions can occur.

The first way is that several experiments were performed to gather the data for a given crystal, and each experiment showed an atom (a certain oxygen, for instance) in a slightly different position, with coordinates differing from the others by only an epsilon.  The scientist included all the different positions in the same file, but they all belong to the same atom rather than to different atoms.  We need a script so that the computer will recognize these positions as all belonging to that same oxygen atom (based on the closeness of the positions, the fact that the atoms in these positions all have the same bonds, etc.) and won’t mistake them for different atoms.
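I haven't written this script, but the core grouping step might look something like the greedy clustering below (the epsilon threshold and the names are my placeholders, and the real script would also have to compare bonds):

```python
def group_duplicate_positions(positions, eps=0.01):
    """Greedily cluster 3D fractional coordinates: a position joins the first
    cluster whose representative is within eps in every component; otherwise
    it starts a new cluster. Returns clusters as lists of input indices."""
    clusters = []   # each entry: (representative position, [member indices])
    for i, p in enumerate(positions):
        for rep, members in clusters:
            if all(abs(p[k] - rep[k]) <= eps for k in range(3)):
                members.append(i)
                break
        else:
            clusters.append((p, [i]))
    return [members for _, members in clusters]
```

So two reported positions for the same oxygen would land in one cluster, and a genuinely different atom elsewhere in the cell would not.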

I haven’t written that script yet, because it will be a bit more complicated.  However, I have written a script to solve the other main issue regarding ambiguous atoms.  In this other issue, there are many positions which are actually identical (perhaps with only a tiny epsilon difference), but the elements occupying these identical positions are different.  To my understanding, this is supposed to represent the variability of the crystal in nature; a silicon might be found in that position in 70% of the cases, but an aluminum might be found in 30% of the cases.  Professor Streinu needs these files to be split into multiple files, each with a different configuration for the ambiguous atoms.  Thus, if a file has 3 ambiguous positions, and there are 2 elements listed for each position, my script needs to generate 2^3 = 8 different files, each with a different combination of atoms occupying the ambiguous positions.
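The enumeration at the heart of this is just a Cartesian product over the ambiguous positions. My actual script builds nested loops, but the same thing can be written with itertools (a simplified sketch; the real script also rewrites the Cif text for each configuration):

```python
from itertools import product

def configurations(ambiguous_sites):
    """ambiguous_sites holds one list of candidate elements per ambiguous
    position.  Returns every full assignment (the Cartesian product), so a
    file with 3 positions and 2 candidates each yields 2**3 = 8 of them."""
    return list(product(*ambiguous_sites))
```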

First, I made a script to catch all the Cif files that have ambiguous positions.  I ran it over a couple hundred files, but then I gave up, because there were some terrible formatting issues in several cif files (random newline characters scattered through the file as if a monkey attacked the keyboard) which were becoming extremely tedious and time-consuming to correct manually.  I’m thinking I might write another script to catch files with these weird newline characters and set them aside to be dealt with later.

I’ve also made a script to process the ambiguous Cifs and create the various configuration files for them.  I first tested it on a small number of files (as usual) to see that it performs well.  However, when I proceeded to test it on all of the COD’s ambiguous files (or rather, as many as I have collected thus far), my terminal eventually crashed.  From looking at the output I received, it appears that most ambiguous Cif files (which usually have only a small number of atoms) have fewer than 10 configurations and are easy to process; however, there are a couple that have, say, 100 atoms total and 25 ambiguous positions, with 3 possibilities for each position.  That means 3 to the 25th power (roughly 8.5 × 10^11) different configurations!  In fact, I could see that some Cifs got over 10,000 different configuration files in my output folder.  Thus, I think the problem is that my computer is being overwhelmed by having to generate all those configurations; my program is designed to keep adding as many nested for loops as necessary, but I think I need to put some sort of ceiling on that to avoid unmanageable situations like this.  Huge files will just have to be processed by some other means.  I still managed to process over 100 files, though, so at least we have something.
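The ceiling I have in mind could be as simple as multiplying out the per-position choice counts before generating anything, and skipping any file over a cutoff (the names and the cutoff value here are placeholders, not what the script actually uses yet):

```python
from math import prod

MAX_CONFIGS = 1000   # arbitrary cutoff; huge files get handled some other way

def config_count(ambiguous_sites):
    """Number of configurations = product of the per-position choice counts,
    computed without generating any of them."""
    return prod(len(choices) for choices in ambiguous_sites)

def should_process(ambiguous_sites):
    """Only generate configuration files when the count is manageable."""
    return config_count(ambiguous_sites) <= MAX_CONFIGS
```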

Pipeline all set. Next goal: JMol visualization

As of Wednesday, the pipeline has been all set up and integrated into KINARI. In other words, you can now go on the KINARI website, click a few buttons, and the website will use my code to produce a Q-Graph or a VQ-Graph, as well as several intermediate files and an error report file showing the traceback if something went wrong. Of course, there will be many edits and changes as time goes on.

Professor Streinu also had a long discussion with me on Wednesday afternoon about the next task I have to accomplish and about the long-term goals for this project.  The next task (which I’ve been working on since Thursday) is to write code that creates a new Cif file, containing the atom and bond locations predicted by Systre, which JMol can visualize.

This is a complicated task for a number of reasons.  First of all, atom and bond locations alone are not enough information for JMol to visualize the crystal.  Each atom location must have an “atom site label”: the element name and serial number for that atom.  (For example, C11 would be the eleventh carbon atom.)  However, Systre only provides atom locations; it doesn’t provide elements.  I told Professor Streinu about this issue, and she found a way to get the information.  If we load the original Cif file into JMol, there is a command we can use to get JMol to print a list of atom coordinates and corresponding elements (for the atoms currently being visualized) into the JMol console.  It’s simple enough to make my Python code write a short JMol script (JMol has its own scripting language) which will execute the command, transfer the printed output from the JMol console to standard output (i.e., the terminal you’re currently using), and write the printed output to a text file, whose extension we chose to be “.atom”.
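The Python side of this is just writing out a small .spt file and then launching JMol on it. A sketch of the script-writing step is below; note that DUMP_COMMAND is a stand-in (I'm not reproducing the actual console command here), and the JMol command-line flags in the comment are illustrative, not necessarily the ones we use:

```python
def write_atom_dump_script(cif_path, spt_path):
    """Write a small JMol script (.spt) that loads a Cif file and then runs
    the console command that dumps atom coordinates and elements."""
    DUMP_COMMAND = "dumpAtomTable"  # placeholder, not an actual JMol command
    script = 'load "{}"\n{}\n'.format(cif_path, DUMP_COMMAND)
    with open(spt_path, "w") as f:
        f.write(script)
    # The pipeline then launches JMol headless on spt_path and captures the
    # console output into the .atom file, e.g. via subprocess (flags illustrative):
    #   subprocess.run(["java", "-jar", "Jmol.jar", "-n", "-s", spt_path], ...)
    return script
```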

Next, the .atom file and the quotient graph file are parsed.  One interesting complication was that the atom coordinates were calculated by different methods in the two different files, so they each had their own different set of round-off errors, making it difficult for the computer to recognize when the coordinates in one file matched the coordinates in the other.  I created a percent-tolerance function to take care of this issue.
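The percent-tolerance comparison is essentially a per-component relative closeness test; a minimal version of the idea (the default tolerance here is illustrative, and the tiny absolute floor is my addition so near-zero coordinates still compare sensibly):

```python
import math

def coords_match(p, q, pct=0.5):
    """True if corresponding components of p and q agree to within pct
    percent of the larger value (plus a small absolute floor)."""
    return all(math.isclose(a, b, rel_tol=pct / 100.0, abs_tol=1e-6)
               for a, b in zip(p, q))
```

With this, two coordinate triples computed by different methods match as long as their round-off errors stay under the chosen percentage.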

The code now works, and Professor Streinu is integrating it into KINARI.  My next task is to create a python script which will read the custom-made Cif file, determine which atoms are inside the unit cell and which ones are outside, and color the atoms and bonds within the unit cell a different color.

Another task I should start working on is putting together datasets to be uploaded into KINARI.  The datasets should have a theme, for example “all Cifs whose graphs are disconnected” or “all Cifs which contain fewer than 5 atoms.”  Professor Streinu plans that during the first few weeks of fall I will be doing a lot of testing on as much of the KINARI pipeline as we will have completed by then, which will hopefully include not just my pipeline and JMol visualization scripts, but also some code that Aymeric wrote, using a linear algebra module for C++, that computes the degrees of freedom.  I will write code that generates an error-analysis report for each survey run, including information such as which files succeeded and which ones failed, the stage in the KINARI pipeline at which they failed, and the error message explaining why.  (Similar to the surveys I’ve run in the past, but this time on a much more complex pipeline.)  Professor Streinu also said that the next step, after some rigorous testing has been done, will be to write a short paper about our progress.

More Summaries

I further modified Tori’s code yesterday. It can now process not only PDB files, but also Cif files, Q-Graphs, VQ-Graphs, ErCif files, Systre-input files, and Systre-output files.  An ErCif file is produced after the Q-Graph pipeline is executed; it contains the traceback error message that was produced if the Cif file failed, or simply the word “SUCCESS” if the file succeeded.  Systre-input files contain some data from a Cif file and some outside information (valence), reformatted in the way that Systre requires.  The Systre-output files are what Olaf’s script produces.

I also picked out a sample data set of 20 COD files and 20 Zeolite files, sorted into directories labeled small, medium, and large (based on how many atoms the files contain). Professor Streinu wants to upload these files to KINARI for testing.

VQ and Q Graph Creator programs, and more KINARI summaries

I created two new programs yesterday and added two new functions to Tori’s KINARI-summary-creating program.  The first program creates a Q-Graph in a plain text file with keywords, rather than with the curly brackets and lists that my mathematicaPrep.py program uses, and the second does the same thing for a VQ-Graph.  The two new functions I added to Tori’s program create summaries for a Q-Graph file and a VQ-Graph file, respectively.

VQ-Graphs, new code from Olaf, KINARI summaries, and more

Several things happened during the last few days.

As I described previously, my mathematicaPrep.py program produces the Cif file’s quotient graph (among other things) in a Mathematica-ready format.  On Friday, however, Professor Streinu told me that it would also be helpful to her to have a program that produces a vertex-quotient graph in addition to a program that produces a quotient graph.  I already explained (in my post from May 24) how a quotient graph lists edges; a vertex-quotient graph, on the other hand, assigns ALL atoms an index number (even those that are outside the unit cell), and it simply lists edges as the pair of index numbers for the atoms participating in a given bond, with no translation provided.  For example, 3 5 would mean that atom #3 and atom #5 are bonded.  So, on Friday I wrote a program to convert Q-Graphs into VQ-Graphs.

Also, Professor Streinu told me to email Olaf a sample of the Cif files that got Systre-related errors and ask him if he knows how to fix these errors.  He emailed me back on Saturday, and sent me a revised version of his unique_nets.py code.  He says this new version of the code should fix the error about array indices, but unfortunately it cannot do much to accommodate disconnectedness.  However, he also mentioned that the Systre GUI can handle a moderate amount of disconnectedness, so we may just have to use the GUI for disconnected graphs.  I was busy writing other code yesterday, but hopefully today I will get a chance to run my scanCifErrors.py program — with Olaf’s revised code — over the Zeolite database again, and over some COD sample sets.  I’m keeping my fingers crossed for more successes!

Professor Streinu talked with me yesterday.  I think our discussion helped me a lot to get a better understanding of the bigger picture for the project, and how all the smaller pieces I’m working on fit together.  We want the website she’s developing, KINARI, to be able to process crystals as well as proteins, even though the ultimate goal is to research protein folding.  The reason for this is that the rigidity algorithms Professor Streinu is developing need to be tested on real data, but PDB files are way too large to be tested on, so she wants to do her initial testing on Cif files, which are much smaller.

The user-interface for KINARI displays a JMol image of the molecule uploaded by the user.  Previous students have created JMol scripts to make JMol produce this image, but for PDB files.  However, JMol makes its own predictions about where bonds ought to be, and it performs minimizations that we don’t want.  Professor Streinu wants me to find a way to make JMol display only the atom coordinates and bond locations calculated by Systre, without anything extra.  Our current plan is to write a program which creates a new Cif file using the Systre calculations, and to turn off certain JMol parameters.  This means I’ll have to convert the Cartesian coordinates produced by Systre into fractional coordinates before they go into the Cif file.
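The Cartesian-to-fractional conversion amounts to inverting the cell matrix built from the six lattice parameters. A self-contained sketch in plain Python (using the standard crystallographic convention with a along x and b in the xy-plane; function names are mine):

```python
from math import cos, sin, radians, sqrt

def cell_matrix(a, b, c, alpha, beta, gamma):
    """Unit-cell matrix M whose columns are the cell vectors; angles in
    degrees.  v is the unit-cell volume factor."""
    ca, cb, cg = (cos(radians(x)) for x in (alpha, beta, gamma))
    sg = sin(radians(gamma))
    v = sqrt(1 - ca*ca - cb*cb - cg*cg + 2*ca*cb*cg)
    return [[a, b*cg, c*cb],
            [0, b*sg, c*(ca - cb*cg)/sg],
            [0, 0,    c*v/sg]]

def cartesian_to_fractional(xyz, M):
    """Solve M * frac = xyz by inverting the 3x3 cell matrix (cofactors)."""
    (m11, m12, m13), (m21, m22, m23), (m31, m32, m33) = M
    det = (m11*(m22*m33 - m23*m32) - m12*(m21*m33 - m23*m31)
           + m13*(m21*m32 - m22*m31))
    inv = [[(m22*m33 - m23*m32)/det, (m13*m32 - m12*m33)/det, (m12*m23 - m13*m22)/det],
           [(m23*m31 - m21*m33)/det, (m11*m33 - m13*m31)/det, (m13*m21 - m11*m23)/det],
           [(m21*m32 - m22*m31)/det, (m12*m31 - m11*m32)/det, (m11*m22 - m12*m21)/det]]
    return tuple(sum(inv[i][k] * xyz[k] for k in range(3)) for i in range(3))
```

For an orthorhombic cell with a = 2, b = 3, c = 4, the Cartesian point (1, 1.5, 2) comes out as the cell center (0.5, 0.5, 0.5).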

Lastly, KINARI is designed so that whenever a user uploads a file to be processed, KINARI immediately produces a summary for the user, describing certain quick facts about the file and stating the estimated amount of time it will take for the file to be processed.  Tori wrote a long script that produces this kind of summary for PDB files as a part of her senior thesis.  Professor Streinu gave me this code yesterday, and I modified it (and added some new fields to it) so that it can produce a similar summary, but with other kinds of information, for a Cif file.

Zeolite Outcomes

I finally finished running my program on the Zeolite database. There are 231 files in the database, and 208 of them were successful (meaning that they produced an output file with the necessary input data for Mathematica).  That gives us a 90% success rate.

Once again, the few files that failed all had that same Java error about array indices. However, at least the couple that I checked can still be visualized in JMol; that’s worrisome.  Furthermore, I don’t think the problem causing the error for those files is incorrect valence, because the files that failed (at least the couple that I checked) all seemed to contain only oxygen and silicon.  I’ll have to examine them more thoroughly, though.

One file in the database, SFV.cif, is EXTREMELY large, and the first time I ran my script I stopped execution at that point because it was taking so long (several minutes) to get through that one file that I thought there was a while loop whose exit condition was never being satisfied.  It’s one of those tricky things: it takes so long to run the script over the whole database that I don’t like quitting it too freely, but on the other hand I don’t want to be caught in a never-ending while loop of doom and have my computer crash again.

A few programming pointers

(This is just a few more things from my discussion with Professor Streinu, but I thought my last post was getting too long so I’m putting it separately.)  Professor Streinu ran my programs on her computer, and noticed all my debug output that prints to the screen (I forgot to comment it out).  She told me I should not get in the habit of using debug output that prints to the screen and then commenting it out each time I think I’m done debugging, because it’s tedious to do that.  Instead, she told me I should make the program produce a “tmp” directory for all temporary, junk files (such as the debug output contained in a log file, and the other intermediary files produced by my pipeline) so that it’s all in one place and can be easily deleted by the user, but is still available to us for analysis and debugging.
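In Python this habit is easy to adopt with the logging module pointed at a tmp directory (the directory and logger names here are illustrative):

```python
import logging
import os

def setup_debug_log(tmp_dir="tmp"):
    """Route debug output to tmp/debug.log instead of printing it to the
    screen; the tmp directory can hold the pipeline's other junk files too."""
    os.makedirs(tmp_dir, exist_ok=True)
    logger = logging.getLogger("pipeline")
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(os.path.join(tmp_dir, "debug.log"))
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Then calls like `log.debug(...)` replace the print statements, and nothing needs to be commented out afterward; the user can just delete tmp/.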

I also told Professor Streinu that even though the temporary while-loop in my program prevented it from doing too many files at once, my computer still started heating up and making “huffing and puffing” noises (from the fan) when it processed 1,000 files or more.  I really don’t want it to crash from too much data processing, like it did last semester.  Professor Streinu said I should look into ways to execute the loop more slowly.  I took her advice, and I found a way to make the computer “take breaks” at certain intervals during a long loop.  I designed it so that my computer will pause execution after every 100 files and sleep for 5 seconds.  This seems to be helping a little with the overheating and noise problem.
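The "take breaks" trick is just a modulo check inside the loop (the batch size and pause length are the values I happened to pick, and the function name is mine):

```python
import time

def process_with_breaks(files, process, batch=100, pause=5):
    """Run process() on each file, sleeping `pause` seconds after every
    `batch` files so the machine gets a chance to cool down."""
    for i, f in enumerate(files, start=1):
        process(f)
        if i % batch == 0:
            time.sleep(pause)
```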

From COD to Zeolites

Professor Streinu and I met to discuss the outcomes of the tests I ran on sample sets of the COD. I described all the errors I had encountered, and how common they each were.  Professor Streinu confirmed that files which contain question marks in the place of key data points and cannot be visualized by JMol are indeed useless. She also confirmed that files with improper formatting were useless, because it is a waste of time to manually fix them.

As for the errors produced by Systre, it was more complicated. Systre’s inability to analyze disconnected graphs annoyed her, because there are many crystals that have a disconnected structure (graphite, for instance, comes in separate sheets).  Perhaps there is some way to force Systre to analyze each connected component of a disconnected graph individually?  This issue may be something I will eventually ask Olaf about.
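If we do end up splitting graphs ourselves before handing them to Systre, finding the connected components of a bond list is a short BFS (a sketch; the names are mine, and the real pipeline's graphs are stored differently):

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Return the vertex sets of the connected components of an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:                     # breadth-first flood from start
            x = queue.popleft()
            if x in comp:
                continue
            comp.add(x)
            queue.extend(adj[x] - comp)
        seen |= comp
        components.append(comp)
    return components
```

Each component's edges could then be written out as a separate Systre input.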

The other Systre error — which, as I said in my last post, was BY FAR the most common, affecting 96% of the files tested — was the error regarding array indices being out of bounds.  Initially, I had no idea what was causing this error, even after looking at the piece of Systre’s source code that it was referencing.  However, Olaf did make a note in the source code about his array representing a system of equations, so I was curious.  I took one cif file that was successful — the carbon diamond — and ran it again, but this time I changed the valence dictionary within my program so that it said carbon had a valence of 3 instead of its correct valence, 4.  The result was not success, but rather this same error message about array indices.  So, I now suspect that at least some of the cif files that received this error message were receiving it due to incorrect valences.  Perhaps the valences of each atom act as scalar multipliers in Olaf’s system of equations?

In any case, I have little faith in the valences I used in my program’s valence dictionary, because it only lists one valence for each element (but in reality, many elements have several possible valences), and I collected those valences from random online sources.  So, over the weekend I did a little research to try to find a more thorough and trustworthy table of valences.  I was also searching for some sort of algorithm for predicting valence in those cases where it is ambiguous.  I came across a model called the “Bond Valence Model,” which posits an inverse power relationship between bond valence and empirically determined bond length, described in a publication from 2002.  I also found a table of valences (written in easy-to-parse COD format!) for all the element-to-element bond pairs in the COD files, published in 2013 by the same author.  I showed this table, as well as a 2009 review article about the model and its uses, to Professor Streinu.  She was excited to see the article and table and told me to send them to her immediately.
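The inverse-power form of the bond valence relation is simple to evaluate once the tabulated parameters are known; here is a sketch (R0 and N are tabulated per bond pair, and the numeric values in my test below are made up purely for illustration):

```python
def bond_valence(R, R0, N):
    """Bond valence s for an observed bond length R, given the tabulated
    reference length R0 and exponent N: s = (R / R0) ** (-N), so shorter
    bonds carry more valence and R = R0 gives exactly 1."""
    return (R / R0) ** (-N)

def valence_sum(bond_lengths, R0, N):
    """An atom's valence is estimated as the sum of its bond valences."""
    return sum(bond_valence(R, R0, N) for R in bond_lengths)
```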

Nevertheless, for files containing many atoms of ambiguous valence, it would be inefficient to have the program check every possible combination of valences.  So, Professor Streinu decided that we should set aside the COD for now and instead test my program on a smaller, cleaner database she discovered: the Database of Zeolite Structures.  This database’s files contain only a few elements (O, Si, and Al, which all have fairly consistent valences), and it has been more thoroughly checked over and curated, so it ought to produce a lot of successful output files she can use in her Mathematica program.  So that’s my next task.