An algorithm to identify functional groups in organic molecules – Journal of Cheminformatics
Identification and extraction of functional groups
The majority of FGs contain heteroatoms. Therefore our approach is based on processing heteroatoms and their environment with the addition of some other functionalities, like multiple carbon–carbon bonds.
The algorithm is outlined below:
1. Mark all heteroatoms in a molecule, including halogens.
2. Also mark the following carbon atoms:
   - atoms connected by a non-aromatic double or triple bond to any heteroatom
   - atoms in non-aromatic carbon–carbon double or triple bonds
   - acetal carbons, i.e. sp3 carbons connected to two or more oxygens, nitrogens or sulfurs; these O, N or S atoms must have only single bonds
   - all atoms in oxirane, aziridine and thiirane rings (such rings are traditionally considered to be functional groups due to their high reactivity).
3. Merge all connected marked atoms into a single FG.
4. Extract each FG together with its connected unmarked carbon atoms; these carbon atoms are not part of the FG itself, but form its environment.
The algorithm described above iterates only through non-aromatic atoms. Aromatic heteroatoms are collected as single atoms, not as part of a larger system. They are extended to a larger FG only when an aliphatic functionality is connected to them (for example an acyl group connected to a pyrrole nitrogen). Heteroatoms in heterocycles are traditionally not considered to be “classical” FGs by themselves but simply part of the whole heterocyclic ring. The rationale for this treatment is the enormous diversity of heterocyclic systems. For example, in our previous study [12] nearly 600,000 different heterocycles consisting of 1–3 fused 5- and 6-membered rings were enumerated.
After marking all atoms that are part of FGs as described above, the identified FGs are extracted together with their environment, i.e. the connected carbon atoms, with the type of each carbon (aliphatic or aromatic) preserved.
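The marking, merging and extraction steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the molecule is passed as a list of element symbols plus `(i, j, order)` bond triples with implicit hydrogens, aromatic bonds are encoded with the made-up order `1.5`, and the function name `find_functional_groups` is hypothetical. Keeping aromatic heteroatoms as single atoms is approximated here by never merging across an aromatic bond.

```python
from collections import defaultdict

AROMATIC = 1.5            # assumed encoding of aromatic bonds in this sketch
ONS = {"O", "N", "S"}


def find_functional_groups(elements, bonds):
    """Return a list of (sorted FG atom indices, environment) pairs.

    elements: element symbol of every heavy atom (hydrogens are implicit).
    bonds: (i, j, order) triples; order is 1, 2, 3 or AROMATIC.
    environment: {carbon index: "aliphatic" or "aromatic"}.
    """
    adj = defaultdict(dict)
    for i, j, order in bonds:
        adj[i][j] = order
        adj[j][i] = order
    aromatic = {i for i in adj if AROMATIC in adj[i].values()}

    def only_single_bonds(i):
        return all(order == 1 for order in adj[i].values())

    # Step 1: mark every heteroatom (any non-carbon heavy atom, halogens included).
    marked = {i for i, el in enumerate(elements) if el != "C"}

    # Step 2, first two cases: carbons in non-aromatic double or triple bonds,
    # whether the partner is a heteroatom or another carbon.
    for i, j, order in bonds:
        if order in (2, 3):
            marked.update(k for k in (i, j) if elements[k] == "C")

    # Step 2, acetal case: sp3 carbon with two or more O/N/S neighbours
    # that themselves carry only single bonds.
    for c, el in enumerate(elements):
        if el == "C" and only_single_bonds(c):
            hetero = [n for n in adj[c]
                      if elements[n] in ONS and only_single_bonds(n)]
            if len(hetero) >= 2:
                marked.add(c)

    # Step 2, small-ring case: O, N or S bonded to two mutually bonded carbons
    # (oxirane, aziridine, thiirane).
    for h, el in enumerate(elements):
        if el in ONS:
            carbons = [n for n in adj[h] if elements[n] == "C"]
            for a in carbons:
                for b in carbons:
                    if a < b and b in adj[a]:
                        marked.update((a, b))

    # Step 3: merge connected marked atoms (not across aromatic bonds).
    groups, seen = [], set()
    for start in sorted(marked):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            atom = stack.pop()
            if atom in group:
                continue
            group.add(atom)
            stack.extend(n for n, order in adj[atom].items()
                         if n in marked and order != AROMATIC)
        seen |= group
        # Step 4: unmarked neighbours are carbons; record their type.
        env = {n: ("aromatic" if n in aromatic else "aliphatic")
               for atom in group for n in adj[atom] if n not in marked}
        groups.append((sorted(group), env))
    return groups


# Ethyl acetate, CH3-C(=O)-O-CH2-CH3, atoms numbered left to right.
ethyl_acetate = find_functional_groups(
    ["C", "C", "O", "O", "C", "C"],
    [(0, 1, 1), (1, 2, 2), (1, 3, 1), (3, 4, 1), (4, 5, 1)],
)
# Phenol: ring carbons 0-5, oxygen 6 attached to carbon 0.
phenol = find_functional_groups(
    ["C"] * 6 + ["O"],
    [(k, (k + 1) % 6, AROMATIC) for k in range(6)] + [(0, 6, 1)],
)
print(ethyl_acetate)
print(phenol)
```

In this sketch the ester comes out as one group (carbonyl carbon plus both oxygens) with two aliphatic environment carbons, while the phenol oxygen forms a single-atom group with one aromatic environment carbon. A production version would more likely rely on a cheminformatics toolkit for aromaticity perception and ring detection.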
We do not claim that this algorithm provides an ultimate definition of FGs. Every medicinal chemist probably has a slightly different understanding of what a FG is. In particular, the definition of activated sp3 carbons may create some discussion. In the present algorithm we restricted our definition to classical acetal, thioacetal or aminal centers (i.e. sp3 carbons having at least 2 oxygens, sulfurs or nitrogens as neighbors) and did not consider other similar systems, such as alpha-substituted carbonyls or carbons connected to S=O or similar bonds. During the program development phase various such options were tested, and this “strict” definition provided the most satisfactory results. Extending FGs to alpha-substituted carbonyls (i.e. a heteroatom or halogen in the alpha position to a carbonyl) and similar systems more than tripled the number of FGs identified, generating many large and rare FGs. Since our major interest was in comparing various molecular datasets and not in reactivity estimation, we implemented this strict definition of acetal carbons. To assess the possible reactivity of molecules, various substructure filters are available, for example the already mentioned PAINS [9] or Eli Lilly rules [10].
To better illustrate the algorithm, some examples of FGs identified for a few simple molecules are shown in Fig. 1.
Fig. 1
Example of functional groups identified. Groups are color coded according to their type
Generalization of functional groups
FGs, particularly those with several connection points, may be present in numerous forms differing in their environment. The attachment points may be unsubstituted (i.e. the valences are filled by hydrogens) or connected to aliphatic or aromatic carbons, with a large number of possible combinations. A simple amide group with 3 connection points may form 18 such variations (the two connections on nitrogen are considered to be symmetrical here). As another example, a list of 20 ureas with different environments extracted from the ChEMBL database (vide infra) is shown in Fig. 2. For more complex groups the number of possible variations is considerably larger.
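One way to arrive at the figure of 18 amide variations quoted above is to assume that each of the three attachment points can carry a hydrogen, an aliphatic carbon or an aromatic carbon, and that the two positions on nitrogen are interchangeable. The short enumeration below is only a sketch of that counting argument; the substituent labels are illustrative.

```python
from itertools import combinations_with_replacement, product

# Assumed substituent types at each attachment point (labels are illustrative).
SUBSTITUENTS = ("H", "C_aliphatic", "C_aromatic")

# The carbonyl carbon has one attachment point: 3 choices.
carbonyl_side = SUBSTITUENTS

# The nitrogen has two symmetrical attachment points, so order does not matter:
# unordered pairs with repetition give 6 choices.
nitrogen_side = list(combinations_with_replacement(SUBSTITUENTS, 2))

amide_variants = list(product(carbonyl_side, nitrogen_side))
print(len(nitrogen_side), len(amide_variants))  # 6 18
```

Without the symmetry assumption on nitrogen the same enumeration would give 3 × 3 × 3 = 27 forms.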
Fig. 2
Various forms of the urea functionality differing in the environment patterns. The numbers in the corner indicate the number of molecules in ChEMBL in which this particular group is present and the percentage
In most cases, however, it is not necessary to go into such a level of detail. When studying frequency statistics of FGs in chemical databases, one is usually interested in the percentage of molecules with, say, urea or sulfonamide functionalities and not in the environment details. It would therefore be desirable to merge FGs based on the important “central” moiety. One needs to be careful here, however. In some special cases, particularly for smaller FGs, the differences in the environment are very important, for example to distinguish between alcohols and phenols or amines and anilines. To account for these different scenarios, the generalization scheme described below was developed:
1. Environments on carbon atoms are deleted; the only exception is substituents on carbonyl carbons, which are retained (to distinguish between aldehydes and ketones).
2. All free valences on heteroatoms are filled by “R atoms” (such an atom may represent hydrogen or carbon), with the exception of:
   - hydrogens on –OH groups
   - hydrogens on simple amines and thiols (i.e. FGs with just a single central N or S atom), which are not replaced; this makes it possible to distinguish secondary from tertiary amines, and thiols from sulfides.
3. All remaining environment carbons (on heteroatoms and carbonyls) are replaced by “R atoms”; the exceptions are environments on single-atom N or O FGs with one carbon connected, where this carbon is retained together with its type (aliphatic or aromatic); this makes it possible to distinguish between amines and anilines, and between alcohols and phenols.
This scheme provides a good balance between preserving sufficient, chemically meaningful details on one side and generalization on the other side. Examples of generalized FGs created by this procedure are shown in the following section.
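The three generalization rules can be illustrated with a small sketch. It is one possible reading of the scheme rather than the published code, and it relies on a hypothetical representation: each FG atom is an `FGAtom` record holding its element, its hydrogen count, the types of its environment carbons and a `carbonyl` flag, and the output labels (`"H"`, `"R"`, `"C_aliphatic"`, `"C_aromatic"`) are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class FGAtom:
    """Hypothetical record for one atom of an extracted FG."""
    element: str
    hydrogens: int = 0
    env: tuple = ()          # types of attached environment carbons
    carbonyl: bool = False   # True for a carbon double bonded to oxygen


def generalize(atoms):
    """Return (element, sorted substituent labels) for every FG atom."""
    single_hetero = len(atoms) == 1 and atoms[0].element != "C"
    result = []
    for a in atoms:
        if a.element == "C":
            # Rule 1: drop environments on carbons, except on carbonyl carbons,
            # where H is kept and carbons become R (aldehyde vs ketone).
            subs = ["H"] * a.hydrogens + ["R"] * len(a.env) if a.carbonyl else []
        else:
            # Rule 2: hydrogens on heteroatoms become R, except on -OH and on
            # single-atom N or S groups (simple amines and thiols).
            keep_h = a.element == "O" or (single_hetero and a.element in ("N", "S"))
            # Rule 3: environment carbons become R, except on single-atom
            # N or O groups with exactly one carbon, which keeps its type.
            keep_c = single_hetero and a.element in ("N", "O") and len(a.env) == 1
            subs = ["H" if keep_h else "R"] * a.hydrogens
            subs += ["C_" + t if keep_c else "R" for t in a.env]
        result.append((a.element, tuple(sorted(subs))))
    return result


phenol = generalize([FGAtom("O", 1, ("aromatic",))])
ethanol = generalize([FGAtom("O", 1, ("aliphatic",))])
secondary_amine = generalize([FGAtom("N", 1, ("aliphatic", "aliphatic"))])
tertiary_amine = generalize([FGAtom("N", 0, ("aliphatic",) * 3)])
# N-methylacetamide: carbonyl carbon, carbonyl oxygen, amide nitrogen.
amide = generalize([
    FGAtom("C", 0, ("aliphatic",), carbonyl=True),
    FGAtom("O"),
    FGAtom("N", 1, ("aliphatic",)),
])
print(phenol, ethanol, secondary_amine, tertiary_amine, amide, sep="\n")
```

Under these assumptions, phenol and ethanol stay distinct because the single carbon on oxygen keeps its type, secondary and tertiary amines differ in the retained hydrogen, and the amide nitrogen is reduced to two R atoms, so differently substituted amides collapse into one generalized group.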