OpenSMILES Specification


6. Proposed Extensions

6.1 External R-Groups

Daylight proposed, and OpenEye actually implemented, an extension that specifies bonds to external R-groups. An external R-group is written as an ampersand '&' followed by a ring-closure specification (either a single digit, or '%' and two digits). Unlike a ring closure, however, the bond is to an external, unspecified R-group. Example: "n1c(&1)c(&2)cccc1" - 2,3-substituted pyridine.
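As an illustration of the syntax just described, the following minimal Python sketch pulls the external R-group labels out of a SMILES string. The function name `external_rgroup_labels` and the regular expression are assumptions made for this example; they are not taken from the Daylight proposal or from OpenEye's implementation.

```python
import re
from typing import List

# Hypothetical pattern: '&' followed by '%' and two digits, or by one digit.
RGROUP_PATTERN = re.compile(r"&(%\d{2}|\d)")


def external_rgroup_labels(smiles: str) -> List[int]:
    """Return the external R-group labels in the order they appear."""
    labels = []
    for match in RGROUP_PATTERN.finditer(smiles):
        token = match.group(1)              # e.g. "1" or "%12"
        labels.append(int(token.lstrip("%")))
    return labels


print(external_rgroup_labels("n1c(&1)c(&2)cccc1"))
```

The sketch only locates the labels; it does not check that the rest of the string is valid SMILES.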

6.2 Polymers and Crystals

Daylight (Weininger) proposed, but never implemented, an extension for crystals and polymers. This proposal also uses the ampersand '&' character (which may conflict with the R-group proposal above), with the added rule that a number appearing more than once creates a repeating unit.

c1ccccc1C&1&1 polystyrene
C&1&1&1&1 diamond
c&1&1&1 graphite
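The "number appears more than once" rule can be sketched in Python as a simple count of '&' labels. Because the proposal was never implemented, this reflects only one reading of the text above; the function name `repeating_labels` is hypothetical.

```python
import re
from collections import Counter


def repeating_labels(smiles):
    """Map each '&' label that occurs more than once to its occurrence count."""
    # Same label syntax as a ring closure: one digit, or '%' and two digits.
    tokens = re.findall(r"&(%\d{2}|\d)", smiles)
    counts = Counter(int(token.lstrip("%")) for token in tokens)
    return {label: n for label, n in counts.items() if n > 1}


for example in ("c1ccccc1C&1&1", "C&1&1&1&1", "c&1&1&1"):
    print(example, repeating_labels(example))
```

Under this reading, a string whose labels each occur once, such as "n1c(&1)c(&2)cccc1", yields an empty result and would be treated as plain external R-groups.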

6.3 '@' for Cis/Trans around Double Bonds

The '/' and '\' marks for cis/trans bonds seem simple on the surface but are problematic for complex systems. For example, in a long series of conjugated double bonds, changing the configuration of one bond can require rewriting dozens of bond symbols.

More importantly, there is a theoretical flaw in the use of '/' and '\'. In a cyclic polyene (a ring of alternating double and single bonds) with an even number of double bonds, it is not possible to write a valid SMILES. (Recall that '/' and '\' reverse sense if moved from the left to the right of the atom, thus "C/1=C/CCCCCCC1" represents a cis configuration even though '/' appears twice.)

C/1=C/C=C\C=C/C=C\1 Illegal SMILES

The SMILES above is illegal because the first and second "C1" have opposite bond symbols '/' versus '\', so the single bond that was "broken" to create the ring has to be both "up" and "down". This logical flaw follows from using "up" and "down" in a linear notation, when in fact the atoms form a circle.

The proposed syntax for cis/trans configurations uses the '@' symbol on the allenal atoms (the atoms at each end of the double bond). For example:

F[C@@H]=[C@H]F trans-difluoroethene
F[C@H]=[C@@H]F trans-difluoroethene
F[C@H]=[C@H]F cis-difluoroethene
F[C@@H]=[C@@H]F cis-difluoroethene

Interpretation of '@' and '@@' follows the tetrahedral convention: The atoms, as encountered in the SMILES string, are either in anticlockwise '@' or clockwise '@@' order as viewed on the page. Since cis/trans configurations are planar, they can also be "viewed from underneath the page", which results in the two valid SMILES shown for each compound, above.

Note that in all cases, cis and trans are easy for the chemist to distinguish visually: A trans form always has opposite "clock-ness" (@,@@ or @@,@), and the cis form always has the same "clock-ness" (@,@ or @@,@@) for the allenal atoms.
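The "clock-ness" rule above can be sketched in a few lines of Python. This is only an illustration for the simple acyclic case: the function name `classify_double_bond` and the regular expression are assumptions for this example, and the sketch does not handle ring-closure digits, which change the order in which neighbors are encountered.

```python
import re

# Two bracket atoms joined by '=', capturing each atom's '@' or '@@' mark.
DOUBLE_BOND = re.compile(
    r"\[[A-Z][a-z]?(@{1,2})H?\]=\[[A-Z][a-z]?(@{1,2})H?\]"
)


def classify_double_bond(smiles):
    """Return 'cis', 'trans', or None if no marked double bond is found."""
    match = DOUBLE_BOND.search(smiles)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Same marks on both atoms mean cis; opposite marks mean trans.
    return "cis" if first == second else "trans"


for example in ("F[C@@H]=[C@H]F", "F[C@H]=[C@@H]F",
                "F[C@H]=[C@H]F", "F[C@@H]=[C@@H]F"):
    print(example, classify_double_bond(example))
```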

This proposed form of cis/trans specification using '@' and '@@' does not suffer from the theoretical flaw illustrated above:

[C@H]1=[C@@H][C@@H]=[C@@H][C@@H]=[C@@H][C@@H]=[C@@H]1 cyclooctatetraene

Note that the first allenal carbon must be represented as '@' since the '1' follows the H, whereas the rest of the allenal carbons use '@@' to characterize the cis configuration of each bond. Since this is a specification on the atom, rather than the single bond, no conflict arises at the ring-closure bond.

6.5 Radicals

This section needs considerable work. The following text is courtesy of Chris Morley, who commented: "I guess the last paragraph doesn't look too good in a formal specification. There are two reasons for the frailty: lack of proof that the radical and aromatic uses can always be unambiguous (I doubt anybody has tried); and a known deficiency in the parser." However, it is a good starting point.

A single lowercase symbol is interpreted as a radical center. CCc is an alternative to CC[CH2] and is the 1-propyl radical; CcC or C[CH]C is the 2-propyl radical; Co is the methoxy radical. An odd number of adjacent lowercase symbols is a delocalised conjugated radical, so Cccccc is CC=CC=C[CH2] or CC=C[CH]C=C or C[CH]C=CC=C. Lowercase "c" or "n" can be used in a ring: C1cCCCC1 is the cyclohexyl radical.
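The odd-count rule can be approximated with a short Python sketch that looks for runs of lowercase symbols in the string. It rests on several assumptions: adjacency is judged from the text alone (ring-closure digits are dropped, branches are not followed), only the symbols c, n, o and s are considered, and the function name `radical_runs` is hypothetical.

```python
import re


def radical_runs(smiles):
    """Return the odd-length runs of adjacent lowercase symbols."""
    # Blank out bracket atoms such as [CH2] so their letters are not counted.
    text = re.sub(r"\[[^\]]*\]", " ", smiles)
    # Drop ring-closure digits so "c1ccccc1" counts as one run of six.
    text = re.sub(r"[%\d]", "", text)
    runs = re.findall(r"[cnos]+", text)
    return [run for run in runs if len(run) % 2 == 1]


for example in ("CCc", "CcC", "Co", "Cccccc", "C1cCCCC1", "c1ccccc1"):
    print(example, radical_runs(example))
```

A count this naive would also flag an aromatic ring with an odd number of atoms, such as c1ccoc1, so it is a sketch of the rule rather than a usable parser.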

The use of the non-aromatic lowercase symbol is a shortened form with improved intelligibility that allows the use of implicit hydrogen in radicals. However, it is intended only for simple, unambiguous molecules and is not reliable when combined with aromatic atoms.

6.6 Twisted SMILES

An interesting extension that specifies conformational information via bond dihedral angles and bond lengths was proposed by McLeod and Peters:

http://www.daylight.com/meetings/mug03/McLeod/MUG03McLeodPeters.pdf