The recent issue of the Journal of Organic Chemistry (JOC, 2008, 73(12)) has a few particularly interesting articles.

The article by Lipkus et al., entitled “Structural Diversity of Organic Chemistry. A Scaffold Analysis of the CAS Registry” (JOC, 2008, 73, 4443-4451), is a particularly ambitious piece of work that only CAS could do. It describes a scaffold survey of more than 24 million organic compounds in the CAS Registry.

The data set was limited to carbon-based structures containing only the elements H, B, Si, N, P, As, O, S, Se, Te, and the halogens. The work was further limited to framework structures containing rings or linked rings. Acyclic compounds were not included because the framework definition used in the search algorithm does not apply to them. Multicomponent substances and polymers were excluded as well.

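Lipkus and coworkers found that half of the compounds analyzed are described by only 143 framework shapes (graphs). The remaining half are described by 836,565 graphs. Taking those two figures together, 143 of 836,708 shapes, or roughly 0.02%, account for half of the compounds.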

One of the key conclusions is quoted here:

“It is not surprising that some frameworks occur much more frequently than others. However, the extreme unevenness in the way frameworks are distributed among organic compounds is somewhat surprising. This is particularly true at the graph level, where it is found that only 143 framework shapes can describe half of the compounds. The fact that both graph and hetero frameworks have very topheavy distributions tells us that the exploration of organic chemistry space has tended to concentrate on relatively small numbers of structural motifs.”

Lipkus and coworkers conclude that cost minimization is one of the drivers of this uneven distribution, “… shaping the known universe of organic chemistry.” They base this conclusion on the power law that describes the distribution: a linear log-log relationship that is indicative of what they call the “rich-get-richer process”.
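
To make the "linear log-log relationship" concrete: a power law y = C·x^(-a) becomes a straight line with slope -a when log(y) is plotted against log(x). The sketch below fits that slope by ordinary least squares. The data are synthetic and the exponent 1.5 is made up for illustration; neither comes from the paper, and `loglog_slope` is a hypothetical helper name.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope and intercept of log(y) versus log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic data only (an assumption, not the CAS data): y = 1e6 * x^-1.5
ranks = list(range(1, 1001))
counts = [1e6 * r ** -1.5 for r in ranks]

slope, intercept = loglog_slope(ranks, counts)
print(f"fitted slope = {slope:.3f}")
```

On real frequency data the points would scatter around the line, and a straight-looking log-log plot is suggestive of a power law rather than proof of one.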

If I understand this correctly, a relatively small number of easily made or commercially available early precursors are built on ring graphs that, through modification, propagate into more complex analogs that retain the original graph. This multiplies the frequency of a given graph.
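
A toy simulation shows how this kind of propagation can produce a top-heavy distribution. This is only a sketch of a generic rich-get-richer model, not the authors' analysis: each new compound either introduces a brand-new framework with probability `p_new`, or reuses the framework of a randomly chosen existing compound. The function names, the value of `p_new`, and the number of compounds are all assumptions chosen for illustration.

```python
import random
from collections import Counter

def simulate_frameworks(n_compounds, p_new, seed=0):
    """Toy rich-get-richer model (hypothetical, not the paper's method).

    Picking an existing compound uniformly at random and copying its
    framework means frameworks are reused in proportion to how many
    compounds already carry them.
    """
    rng = random.Random(seed)
    assignments = [0]  # framework id of each compound; start with one compound
    next_id = 1
    for _ in range(n_compounds - 1):
        if rng.random() < p_new:
            assignments.append(next_id)  # a new framework appears
            next_id += 1
        else:
            assignments.append(rng.choice(assignments))  # reuse a known one
    return Counter(assignments)

def frameworks_covering(counts, fraction):
    """Smallest number of most-frequent frameworks covering `fraction` of compounds."""
    total = sum(counts.values())
    covered = 0
    for k, (_, c) in enumerate(counts.most_common(), start=1):
        covered += c
        if covered >= fraction * total:
            return k
    return len(counts)

counts = simulate_frameworks(n_compounds=50_000, p_new=0.05)
k = frameworks_covering(counts, 0.5)
print(f"{len(counts)} frameworks in total; the top {k} cover half of the compounds")
```

Even in this crude model, a small minority of frameworks ends up covering half of the simulated compounds, which is the qualitative pattern the paper reports for the Registry.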

The cost minimization aspect comes from the benefits of familiar chemistry and the commercial availability of a fairly limited set of ring graphs. Adding more rings usually means adding molecular weight and introducing synthesis and separation problems.

The authors conclude that the lopsided distribution of organic compounds toward only 143 graphs constitutes a bottleneck in drug discovery. They further suggest that more exploration in other areas of chemistry space may be worthwhile.