Calculating the Assembly Index of a molecule is an NP-Hard problem. We've developed a hierarchy of algorthmic approaches to this calculation. Use the descriptions below to understand the outputs from Molecular-Assembly.com.
Molecular Assembly indices are a fundamental quantity in Assembly Theory because they measure
the amount of information required to make a molecule. The Molecular Assembly Index (MA) of a
molecule is defined as the fewest number of steps required to make the molecular graph
by recursively using previously made structures. Let's unpack that statement a bit.
First let's get a feel for this process. Instead of thinking
about molecules, let's consider the Assembly Index of a word, like "ABRACADABRA". This example
is
worked out in the image on this page. The word contains eleven letters in total, but is built of
four unique letters ("A", "B", "C", "R"). We'll refer to the unique letters in the word as the
"basic building blocks." To start the assembly process we can take a basic building block, in
this case let's start with "A" which we can join with another
building block, "B", to make "AB", that's one step. We can repeat this process a couple times
to make "ABRA", and then a few steps later we will have made "ABRACAD". So far the number of
steps we have taken is six, which is the length of the word we've made so far. To finish the word
we need to add four more letters, but we've previously made the remaining piece we need ("ABRA"),
so we can add all of those in one step, resulting in an Assembly Index of 7. That last step,
where we reused "ABRA", is what we mean when we say "recursively using previously made
structures." We could have taken a longer path to make the final word, for example by reusing
"AB" but adding "R" and "A" at the end in seperate steps. That would have taken a total of 9
steps. When we say the Assembly index is the "fewest number of steps" what we mean is that
there is no way to make the final target using fewer steps - although there might be pathways
that require the same number of steps (for example you can build ABRACADABRA from right to left
instead of left to right in the same number of steps).
We when talk about the Molecular Assembly Index (MA) of a molecule we are talking about the
steps required to make its graph structure, not the molecule itself. That means that we are not
concerned with whether or not the operations we perform in this process could be physically
done in a lab or a natural setting. Therefore some of the joining operations could seem
counterintuitive to chemists or folks who make real molecules for a living. The number of steps
required to make the graph will never be less than the number of individual mechanistic steps
required to make the actual molecule. This is why we emphasize that we are calculating the fewest
number of steps to make the molecular graph, not the molecule. When we compute the MA of a
molecule we ignore the Hydrogen atoms. This is done primarily for computational simplicity
because it significantly reduces the number of atoms in a molecule and has an insignificant
effect on the structure of the compounds. Finally when we consider the basic building blocks of
a molecule, we use the unique bonds in the molecule, not the atoms, as the components. That
means that joining operations are defined by super imposing two (or more) atoms in different
bonds.
Computing the MA of a compound is an NP-Hard problem,
meaning that the computational time required to find the answer increases significantly with the size of
the molecule. This is one reason we have decided to have Molecular-Assembly.com look-up precomputed MA values for
molecules rather than compute them de-novo. In order to understand the data that this website
provides it is important to understand the different algorithms we have developed to calculate
MA. We detail the those algorithms below.
Although we aim to find the shortest pathway required to construct a graph from two-bond fragments, the
graph assembly algorithm does not directly calculate the assembly index by searching through the space of
possible pathways. Rather, it attempts to find duplicatable subgraphs in the molecule-graph and iteratively
removes subgraphs.
At each step of the process, we find a duplicatable subgraph within the molecule graph. We then remove this
duplicatable subgraph from the structure by deleting all edges (but not nodes) from the original graph and
disconnect the replicated structure associated with the duplicatable subgraph from the remnant structure.
We continue this process until no possible duplicatable subgraphs remain.
The maximum possible assembly index for a graph is N–1, where N is the number of edges in the graph. Such
a graph has no duplicatable subgraphs with at least two bonds. In order to calculate the assembly index using
this duplicate-finding method, for each duplicatable subgraph deleted, we subtract from the N–1 upper bound
the value k–1 where k is the number of edges in the subgraph. This is because each duplicate of size k represents
a saving of k–1 bonds over a one-bond fragment during the forward assembly process. The sum of all k–1 values for
all possible duplicates S can be subtracted from N–1 to obtain the assembly index for a particular molecule.
The graph assembly algorithm seeks to find the shortest assembly pathway and does so by finding the pathway with
the largest value of S. Broadly, it does so by first finding all possible duplicatable subgraphs in the initial
graph. All the duplicate graphs are grown and classified as unique via the VF2 graph isomorphism. Those duplicated
subgraphs are sorted and fragmented according to duplicates in reverse order of size using a modified disjoint-set
data structure. Finally, the fragments are recursively hashed into a Hash assembly state and pruned with a tight
branch and bound strategy. This process is done until convergence, and at each step, the best construction pathway
is stored for retrieval. Note that if the process is stopped early, then the assembly index is labeled as approximate,
even if it is the actual value.
For more details on the algorithm and the pathway generation see the paper:
Rapid Exploration of the Assembly Chemical Space of Molecular Graphs
and the associated supplement.