OpenSMILES Specification


4. Writing SMILES: Normalizations

4.1 What is Normalization?

A wide variety of SMILES strings are acceptable as input. For example, all of the following represent ethanol:

CCO ethanol
OCC ethanol
C(O)C ethanol
[CH3][CH2][OH] ethanol
[H][C]([H])([H])C([H])([H])[O][H] ethanol

However, it is desirable to write SMILES in more standard forms; the first two forms above are preferred by most chemists, and require fewer bytes to store on a computer. Several levels of normalization of SMILES are recommended for systems that generate SMILES strings. Although these are not mandatory in any sense, they should be considered guidelines for software engineers creating SMILES systems.

4.2 No Normalization

The simplest "normalization" is no normalization. SMILES can be written in any form whatsoever, as long as they meet the rules for SMILES. Some examples of systems that might produce un-normalized SMILES are:

4.3 Standard Form

The "standard form" of a SMILES is designed to produce a compact SMILES, and one that is human readable (for smaller molecules).

In addition, a normalized SMILES has the important property that it matches itself as a SMARTS string. This is a very important feature of normalized SMILES in cheminformatics systems.

Note: In the examples below, the "Wrong" SMILES are all valid SMILES, but are "wrong" in the sense that they are not the preferred form for standard normalization.

4.3.1 Atoms

Correct | Wrong | Normalization Rule
CC | [CH3][CH3] | Write atoms in the "organic subset" as bare atomic symbols whenever possible.
[CH3-] | [CH3-1] | If the charge is +1 or -1, leave off the digit.
C[13CH](C)C | C[13CH1](C)C | If the hydrogen count is 1, leave off the digit.
[CH3-] | [C-H3] | Always write the atom properties in the order: Chirality, hydrogen-count, charge.
C[C@H](Br)Cl | C[CH@](Br)Cl | (same rule as the previous row)
[CH3-] | [H][C-]([H])[H] | Represent hydrogens as a property of the heavy atom rather than as explicit atoms, unless other rules (e.g. [2H]) require that the hydrogen be explicit.
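
The bracket-atom rules above can be sketched as a small writer function. This is only an illustration: the function name `bracket_atom` and its parameters are assumptions, not part of the specification, and the sketch does not decide whether an atom needs brackets at all (the "organic subset" rule).

```python
def bracket_atom(symbol, isotope=None, chirality="", hcount=0, charge=0):
    """Write one bracket atom in standard form (hypothetical helper).

    Property order: isotope, symbol, chirality, hydrogen count, charge.
    """
    parts = ["["]
    if isotope is not None:
        parts.append(str(isotope))
    parts.append(symbol)
    parts.append(chirality)            # "@", "@@", or "" when not chiral
    if hcount == 1:
        parts.append("H")              # hydrogen count of 1: no digit
    elif hcount > 1:
        parts.append(f"H{hcount}")
    if charge == 1:
        parts.append("+")              # charge of +1: no digit
    elif charge == -1:
        parts.append("-")              # charge of -1: no digit
    elif charge > 1:
        parts.append(f"+{charge}")
    elif charge < -1:
        parts.append(f"-{-charge}")
    parts.append("]")
    return "".join(parts)


print(bracket_atom("C", hcount=3, charge=-1))        # methyl anion
print(bracket_atom("C", isotope=13, hcount=1))       # labelled CH
print(bracket_atom("C", chirality="@", hcount=1))    # chiral CH
```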

4.3.2 Bonds

Correct | Wrong | Normalization Rule
CC | C-C | Only write '-' (single bond) when it is between two aromatic atoms. Never write the ':' (aromatic bond) symbol. Bonds are single or aromatic by default (as appropriate).
c1ccccc1 | c:1:c:c:c:c:c:1 | (same rule as the previous row)
c1ccccc1-c2ccccc2 | c1ccccc1c2ccccc2 | (same rule as the previous row)
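
As a rough sketch of the bond rule, a writer can pick the symbol from the bond order and the aromaticity of the two atoms. The function name `bond_symbol` and its arguments are hypothetical, chosen only for this illustration, and only single, double, triple, and aromatic bonds are handled.

```python
def bond_symbol(order, bond_is_aromatic, a_is_aromatic, b_is_aromatic):
    """Return the bond symbol to write between two atoms (hypothetical helper)."""
    if bond_is_aromatic:
        return ""                      # never write ':'
    if order == 1:
        # A single bond is implicit, except between two aromatic atoms,
        # where leaving it out would be read as an aromatic bond.
        return "-" if (a_is_aromatic and b_is_aromatic) else ""
    if order == 2:
        return "="
    if order == 3:
        return "#"
    raise ValueError(f"unsupported bond order: {order}")


# Biphenyl: the bond joining the two rings is single, between aromatic atoms.
print("c1ccccc1" + bond_symbol(1, False, True, True) + "c2ccccc2")
```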

4.3.3 Cycles

Correct | Wrong | Normalization Rule
c1ccccc1C2CCCC2 | c1ccccc1C1CCCC1 | Don't reuse ring-closure digits.
c1ccccc1C2CCCC2 | c0ccccc0C1CCCC1 | Begin ring numbering with 1, not zero (or any other number)
CC1=CCCCC1 | CC=1CCCCC=1 | Avoid making a ring-closure on a double or triple bond. For the ring-closure digits, choose a single bond whenever possible.
C1CC2CCCCC2CC1 | C12(CCCCC1)CCCCC2 | Avoid starting a ring system on an atom that is in two or more rings, such that two ring-closure bonds will be on the same atom.
C1CCCCC1 | C%01CCCCC%01 | Use the simpler single-digit form for rnums less than 10.
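
Two of the ring-numbering rules (start at 1, and use the single-digit form below 10) are easy to show in code. The sketch below is illustrative only: `ring_closure_label` and `ring_labels` are hypothetical names, and the upper limit of 99 is an assumption based on the two-digit `%nn` form.

```python
def ring_closure_label(n):
    """Format ring-closure number n in its shortest form (hypothetical helper)."""
    if not 1 <= n <= 99:
        raise ValueError("ring-closure number must be in 1..99")
    return str(n) if n < 10 else f"%{n}"


def ring_labels():
    """Yield fresh labels 1, 2, 3, ... so that no number is reused."""
    for n in range(1, 100):
        yield ring_closure_label(n)


labels = ring_labels()
first, second = next(labels), next(labels)
# Phenylcyclopentane with a distinct number for each ring.
print(f"c{first}ccccc{first}C{second}CCCC{second}")
```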

4.3.4 Starting Atom and Branches

Correct | Wrong | Normalization Rule
OCc1ccccc1 | c1cc(CO)ccc1 | Start on a terminal atom if possible.
CC(C)CCCCCC | CC(CCCCCC)C | Try to make "side chains" short; pick the longest chains as the "main branch" of the SMILES.
OCCC | CCCO | Start on a heteroatom if possible.
CC | C1.C1 | Only use dots for disconnected components.
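
One possible way to apply the two starting-atom rules is to score each atom and pick the best. The sketch below makes several assumptions that the rules themselves do not state: that "terminal" outranks "heteroatom" when the two conflict, that ties go to the lowest atom index, and that the molecule is given as a list of element symbols plus a neighbor table. The name `choose_start` is hypothetical.

```python
def choose_start(symbols, neighbors):
    """Pick the index of the atom to start writing from (hypothetical helper).

    symbols   -- element symbols of the heavy atoms, e.g. ["C", "C", "O"]
    neighbors -- dict mapping atom index to a list of bonded atom indices
    """
    def score(i):
        terminal = len(neighbors[i]) <= 1     # at most one heavy-atom neighbor
        hetero = symbols[i] not in ("C", "H")
        # Assumed priority: terminal first, then heteroatom, then low index.
        return (terminal, hetero, -i)

    return max(range(len(symbols)), key=score)


# 1-Propanol, atoms indexed along the chain C-C-C-O.
symbols = ["C", "C", "C", "O"]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(choose_start(symbols, neighbors))   # index of the terminal oxygen
```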

4.3.5 Aromaticity

Correct | Wrong | Normalization Rule
c1ccccc1 | C1=CC=CC=C1 | Write the aromatic form in preference to the Kekulé form.

4.3.6 Chirality

Correct | Wrong | Normalization Rule
BrC(Br)C | Br[C@H](Br)C | Remove chiral markings for atoms that are not chiral.
FC(F)=CF | F/C(/F)=C/F | Remove cis/trans markings for double bonds that are not cis or trans.

4.4 Canonical SMILES

A Canonical SMILES is one that follows the Standard Form above, and additionally, always writes the atoms and bonds of any particular molecule in the exact same order, regardless of the source of the molecule or its history in the computer. Here are a few examples of Canonical versus non-Canonical SMILES:

Canonical SMILES | Non-canonical | Name
Oc1ccccc1 | c1ccccc1O | phenol

The primary use of Canonical SMILES is in cheminformatics systems. A molecule's structure, when expressed as a canonical SMILES, will always yield the same SMILES string, which allows a chemical database system to:

Canonical SMILES should not be considered a universal, global identifier (such as a permanent name that spans the WWW). Two systems that produce canonical SMILES may use different rules in their code, or the same system may be improved or have bugs fixed as time passes, thus changing the SMILES it produces. A Canonical SMILES is primarily useful in a single database, or a system of related databases or information, in which all molecules were created using a single canonicalizer.

The rules (algorithms) by which the canonical ordering of the atoms in a SMILES are generated are quite complex, and beyond the scope of this document. There are many chemistry and mathematical graph-theory papers describing the canonical labeling of a graph, and writing a canonical SMILES string. See the Appendix for further information.

Those considering Canonical SMILES for a database system should also investigate InChI, a canonical naming system for chemicals that is an approved IUPAC naming convention.

4.5 SMILES Files

A SMILES file consists of zero or more SMILES strings, one per line, optionally followed by at least one whitespace character (space or tab), and other data. There can be no leading whitespace before the SMILES string on a line. The optional whitespace character and the data that follows it are not part of the SMILES specification, and interpretation of this data is up to applications that use the SMILES file. Each line of the file is terminated by either a single LF character, or by a CR/LF pair of characters (commonly called the "Unix" and "Windows" line terminators, respectively). A SMILES parser must accept either line terminator. A blank line in the SMILES file, or a line that begins with a whitespace character, should be completely ignored by a SMILES parser.
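
The file rules above can be sketched as a short line reader. This is a minimal illustration, not a reference implementation: the name `read_smiles_lines` is hypothetical, the trailing data is returned as an uninterpreted string, and the SMILES strings themselves are not validated.

```python
import re


def read_smiles_lines(text):
    """Yield (smiles, data) pairs from the contents of a SMILES file."""
    for line in text.split("\n"):
        # Accept both LF and CR/LF line terminators.
        if line.endswith("\r"):
            line = line[:-1]
        # Ignore blank lines and lines that begin with whitespace.
        if not line or line[0] in " \t":
            continue
        # The SMILES ends at the first run of spaces or tabs; the rest is data.
        parts = re.split(r"[ \t]+", line, maxsplit=1)
        smiles = parts[0]
        data = parts[1] if len(parts) > 1 else ""
        yield smiles, data


sample = "CCO ethanol\r\nOCC\n\n c1ccccc1 ignored\nc1ccccc1\tbenzene ring\n"
for smiles, data in read_smiles_lines(sample):
    print(repr(smiles), repr(data))
```

Splitting on `"\n"` and then trimming a trailing `"\r"` keeps the two accepted terminators explicit, rather than relying on a splitter that also treats other characters as line breaks.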