A wide variety of SMILES strings are acceptable as input. For example, all of the following represent ethanol:
CCO | ethanol |
OCC | ethanol |
C(O)C | ethanol |
[CH3][CH2][OH] | ethanol |
[H][C]([H])([H])C([H])([H])[O][H] | ethanol |
However, it is desirable to write SMILES in more standard forms; the first two forms above are preferred by most chemists, and require fewer bytes to store on a computer. Several levels of normalization of SMILES are recommended for systems that generate SMILES strings. Although these are not mandatory in any sense, they should be considered guidelines for software engineers creating SMILES systems.
The simplest "normalization" is no normalization. SMILES can be written in any form whatsoever, as long as they meet the rules for SMILES. Some examples of systems that might produce un-normalized SMILES are:
The "standard form" of a SMILES is designed to produce a compact SMILES, and one that is human readable (for smaller molecules).
In addition, a normalized SMILES matches itself when used as a SMARTS string, which is an important property in cheminformatics systems.
Note: In the tables below, the "Wrong" SMILES examples are all valid SMILES, but are "wrong" in the sense that they are not the preferred form for standard normalization.
Correct | Wrong | Normalization Rule |
CC | [CH3][CH3] | Write atoms in the "organic subset" as bare atomic symbols whenever possible. |
[CH3-] | [CH3-1] | If the charge is +1 or -1, leave off the digit. |
C[13CH](C)C | C[13CH1](C)C | If the hydrogen count is 1, leave off the digit. |
[CH3-] | [C-H3] | Always write the atom properties in the order: Chirality, hydrogen-count, charge. |
C[C@H](Br)Cl | C[CH@](Br)Cl | |
[CH3-] | [H][C-]([H])[H] | Represent hydrogens as a property of the heavy atom rather than as explicit atoms, unless other rules (e.g. [2H]) require that the hydrogen be explicit. |
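The two digit rules in the table above (drop the digit for a charge of +1 or -1, and for a hydrogen count of 1) can be sketched as a small text rewrite. This is only an illustration under stated assumptions: the function name `normalize_bracket_atoms` and the simplified bracket-atom pattern are hypothetical, the pattern recognizes only atoms whose properties are already in the order chirality, hydrogen-count, charge, and any bracket atom it does not recognize (for example [C-H3]) is left unchanged rather than reordered.

```python
import re

# Assumed, simplified bracket-atom pattern:
# [isotope? symbol chirality? hcount? charge? class?]
BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|[a-z][a-z]?|\*)"
    r"(?P<chiral>@@?)?"
    r"(?P<hcount>H\d?)?"
    r"(?P<charge>[+-]\d*)?"
    r"(?P<cls>:\d+)?\]"
)


def _rewrite(match):
    hcount = match.group("hcount") or ""
    if hcount == "H1":
        hcount = "H"            # hydrogen count of 1: leave off the digit
    charge = match.group("charge") or ""
    if charge in ("+1", "-1"):
        charge = charge[0]      # charge of +1 or -1: leave off the digit
    return "[{}{}{}{}{}{}]".format(
        match.group("isotope") or "",
        match.group("symbol"),
        match.group("chiral") or "",
        hcount,
        charge,
        match.group("cls") or "",
    )


def normalize_bracket_atoms(smiles):
    """Apply the two digit rules to every recognized bracket atom."""
    return BRACKET_ATOM.sub(_rewrite, smiles)


print(normalize_bracket_atoms("[CH3-1]"))
print(normalize_bracket_atoms("C[13CH1](C)C"))
```

A real normalizer would work on a parsed molecule rather than on text, since rules such as preferring the organic subset or folding explicit hydrogens into the heavy atom need valence information that a regular expression cannot see.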
Correct | Wrong | Normalization Rule |
CC | C-C | Only write '-' (single bond) when it is between two aromatic atoms. Never write the ':' (aromatic bond) symbol. Bonds are single or aromatic by default (as appropriate). |
c1ccccc1 | c:1:c:c:c:c:c:1 | |
c1ccccc1-c2ccccc2 | c1ccccc1c2ccccc2 | |
Correct | Wrong | Normalization Rule |
c1ccccc1C2CCCC2 | c1ccccc1C1CCCC1 | Don't reuse ring-closure digits. |
c1ccccc1C2CCCC2 | c0ccccc0C1CCCC1 | Begin ring numbering with 1, not zero (or any other number). |
CC1=CCCCC1 | CC=1CCCCC=1 | Avoid making a ring-closure on a double or triple bond. For the ring-closure digits, choose a single bond whenever possible. |
C1CC2CCCCC2CC1 | C12(CCCCC1)CCCCC2 | Avoid starting a ring system on an atom that is in two or more rings, such that two ring-closure bonds will be on the same atom. |
C1CCCCC1 | C%01CCCCC%01 | Use the simpler single-digit form for ring-closure numbers (rnums) less than 10. |
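Three of the rules in the table above (no reuse of ring-closure digits, numbering that begins at 1, and the single-digit form below 10) can be sketched as one pass over the string. This is a minimal illustration, not a full SMILES parser: the function name `renumber_ring_closures` is hypothetical, the input is assumed to be a valid SMILES, and the sketch assumes that every digit outside square brackets is a ring-closure number. It does not change which bond carries the ring closure, so the rules about double bonds and ring-system starting atoms are not addressed.

```python
def renumber_ring_closures(smiles):
    """Renumber ring closures 1, 2, 3, ... in order of first appearance."""
    out = []
    open_rings = {}     # original ring number -> newly assigned number
    next_number = 1
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Copy bracket atoms verbatim; digits inside are not ring closures.
            end = smiles.index("]", i)
            out.append(smiles[i:end + 1])
            i = end + 1
            continue
        if ch in "0123456789":
            original, width = int(ch), 1
        elif ch == "%":
            # Two-digit form: '%' followed by exactly two digits.
            original, width = int(smiles[i + 1:i + 3]), 3
        else:
            out.append(ch)
            i += 1
            continue
        if original in open_rings:
            new = open_rings.pop(original)      # this token closes a ring
        else:
            new = next_number                   # this token opens a ring
            next_number += 1
            open_rings[original] = new
        if new > 99:
            raise ValueError("more than 99 rings; numbers would need reuse")
        out.append(str(new) if new < 10 else "%{}".format(new))
        i += width
    return "".join(out)


print(renumber_ring_closures("c0ccccc0C1CCCC1"))
print(renumber_ring_closures("C%01CCCCC%01"))
```

Because a fresh number is assigned to every ring, the sketch fails for molecules with more than 99 rings, where reuse becomes unavoidable.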
Correct | Wrong | Normalization Rule |
OCc1ccccc1 | c1cc(CO)ccc1 | Start on a terminal atom if possible. |
CC(C)CCCCCC | CC(CCCCCC)C | Try to make "side chains" short; pick the longest chains as the "main branch" of the SMILES. |
OCCC | CCCO | Start on a heteroatom if possible. |
CC | C1.C1 | Only use dots for disconnected components. |
Correct | Wrong | Normalization Rule |
c1ccccc1 | C1=CC=CC=C1 | Write the aromatic form in preference to the Kekulé form. |
Correct | Wrong | Normalization Rule |
BrC(Br)C | Br[C@H](Br)C | Remove chiral markings for atoms that are not chiral. |
FC(F)=CF | F/C(/F)=C/F | Remove cis/trans markings for double bonds that are not cis or trans. |
A Canonical SMILES is one that follows the Standard Form above, and additionally, always writes the atoms and bonds of any particular molecule in the exact same order, regardless of the source of the molecule or its history in the computer. Here are a few examples of Canonical versus non-Canonical SMILES:
Canonical SMILES | Non-canonical | Name |
OCC | CCO, C(C)O | ethanol |
Oc1ccccc1 | c1ccccc1O, c1(O)ccccc1, c1(ccccc1)O | phenol |
The primary use of Canonical SMILES is in cheminformatics systems. A molecule's structure, when expressed as a canonical SMILES, will always yield the same SMILES string, which allows a chemical database system to:
Canonical SMILES should not be considered a universal, global identifier (such as a permanent name that spans the WWW). Two systems that produce canonical SMILES may use different rules in their code, or the same system may be improved or have bugs fixed as time passes, thus changing the SMILES it produces. A Canonical SMILES is primarily useful in a single database, or a system of related databases or information, in which all molecules were created using a single canonicalizer.
The rules (algorithms) that generate the canonical ordering of the atoms in a SMILES are quite complex and beyond the scope of this document. There are many chemistry and mathematical graph-theory papers describing the canonical labeling of a graph and the writing of a canonical SMILES string. See the Appendix for further information.
Those considering Canonical SMILES for a database system should also investigate InChI, a canonical naming system for chemicals that is an approved IUPAC naming convention.
A SMILES file consists of zero or more SMILES strings, one per line, each optionally followed by at least one whitespace character (space or tab) and other data. There can be no leading whitespace before the SMILES string on a line. The optional whitespace character and the data that follows it are not part of the SMILES specification, and interpretation of this data is up to applications that use the SMILES file. Each line of the file is terminated by either a single LF character or a CR/LF pair of characters (commonly called the "Unix" and "Windows" line terminators, respectively). A SMILES parser must accept either line terminator. A blank line in the SMILES file, or a line that begins with a whitespace character, should be completely ignored by a SMILES parser.
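The file rules above can be sketched as a short reader. This is a minimal illustration: the function name `parse_smiles_file` and the choice to return (SMILES, data) tuples are assumptions, not part of the specification, and the SMILES strings themselves are not validated.

```python
import re


def parse_smiles_file(text):
    """Return a list of (smiles, data) tuples from the contents of a SMILES file."""
    records = []
    for line in text.split("\n"):
        # Accept both LF and CR/LF line terminators.
        if line.endswith("\r"):
            line = line[:-1]
        # Ignore blank lines and lines that begin with a space or tab.
        if not line or line[0] in " \t":
            continue
        # The SMILES ends at the first run of spaces or tabs;
        # whatever follows is application-defined data.
        parts = re.split(r"[ \t]+", line, maxsplit=1)
        smiles = parts[0]
        data = parts[1] if len(parts) > 1 else ""
        records.append((smiles, data))
    return records


sample = "CCO ethanol\r\nOCC\n\n  ignored line\nc1ccccc1\tbenzene ring\n"
for smiles, data in parse_smiles_file(sample):
    print(smiles, "|", data)
```

The data after the SMILES is returned untouched so that the calling application can interpret it however it needs to.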
Copyright © 2007, Craig A. James
Content is available under GNU Free Documentation License 1.2