OpenSMILES Specification


Richard Apodaca, Noel O'Boyle, Andrew Dalke, John van Drie, Peter Ertl,
Geoff Hutchison, Craig A. James, Greg Landrum, Chris Morley, Egon Willighagen, Hans De Winter

1. Introduction

"... we cannot improve the language of any science, without, at the same time improving the science itself; neither can we, on the other hand, improve a science, without improving the language or nomenclature which belongs to it ..."
Antoine Lavoisier, 1787

1.1 Purpose

This document formally defines an open specification version of the SMILES language, a typographical line notation for specifying chemical structure. It is hosted under the banner of the Blue Obelisk project, with the intent to solicit contributions and comments from the entire computational chemistry community.

1.2 Motivation

SMILES was originally developed as a proprietary specification by Daylight Chemical Information Systems. Since the introduction of SMILES in the late 1980s, it has become widely accepted as a de facto standard for exchange of molecular structures. Many independent SMILES software packages have been written in C, C++, Java, Python, LISP, and probably even FORTRAN.

At this point in the history of SMILES, it is appropriate for the chemistry community to develop a new, non-proprietary specification for the SMILES language. Daylight's SMILES Theory Manual has long been the "gold standard" for the SMILES language, but as a proprietary specification, it limits the universal adoption of SMILES and has no mechanism for contributions from the chemistry community. We salute Daylight for their past contributions, and for the excellent SMILES documentation they have provided free of charge for the past two decades.

1.3 Audience

This document is intended for developers designing or improving a SMILES parser or writer. Readers are expected to be acquainted with SMILES. Due to the formality of this document, it is not a good tutorial for those trying to learn SMILES. This document is written with precision as the primary goal; readability is secondary.

1.4 What is a Molecule? The Valence Model of Chemistry

Before defining the SMILES language, it is important to state the physical model on which it is based: the valence model of chemistry, which uses a mathematician's graph to represent a molecule. In a chemical graph, the nodes are atoms, and the edges are semi-rigid bonds that can be single, double, or triple according to the rules of valence bond theory.
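
As an illustration only, the graph model described above can be sketched as a small data structure. The class name `Molecule` and its methods are hypothetical and are not defined by this specification; the sketch assumes atoms are identified by list index and that every bond has an integer order of 1, 2, or 3.

```python
from dataclasses import dataclass, field


@dataclass
class Molecule:
    """Hypothetical sketch: a molecule as an undirected graph.

    Atoms are nodes (element symbols stored by index); bonds are edges
    keyed by the unordered pair of atom indices they connect.
    """
    atoms: list = field(default_factory=list)  # element symbols, e.g. "C"
    bonds: dict = field(default_factory=dict)  # frozenset({i, j}) -> order

    def add_atom(self, symbol):
        """Append an atom and return its index."""
        self.atoms.append(symbol)
        return len(self.atoms) - 1

    def add_bond(self, i, j, order=1):
        """Connect atoms i and j with a single, double, or triple bond."""
        if order not in (1, 2, 3):
            raise ValueError("bond order must be 1, 2, or 3")
        if i == j:
            raise ValueError("an atom cannot be bonded to itself")
        self.bonds[frozenset((i, j))] = order

    def bond_order_sum(self, i):
        """Sum of the orders of all bonds that include atom i."""
        return sum(order for pair, order in self.bonds.items() if i in pair)


# Carbon dioxide: two oxygen atoms, each double-bonded to a central carbon.
co2 = Molecule()
o1 = co2.add_atom("O")
c = co2.add_atom("C")
o2 = co2.add_atom("O")
co2.add_bond(o1, c, order=2)
co2.add_bond(c, o2, order=2)
```

Here `co2.bond_order_sum(c)` gives the total bond order on the carbon atom, which is the quantity that valence rules constrain. Restricting bond orders to integers is a deliberate simplification that mirrors the valence model itself.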

This simple mental model has little resemblance to the underlying quantum-mechanical reality of electrons, protons and neutrons, yet it has proved to be a remarkably useful approximation of how atoms behave in close proximity to one another. However, the valence model is an imperfect representation of molecular structure, and the SMILES language inherits these imperfections. Chemical bonds are often tautomeric, aromatic or otherwise fractional rather than neat integer multiples. Delocalized bonds, bond-centered bonds, hydrogen bonds and various other inter-atom forces that are well characterized by a quantum-mechanical description simply don't fit into the valence model.

"If you can build a molecule from a modeling kit, you can name it."
-- McLeod and Peters

McLeod and Peters' quip captures the limits of SMILES well: when a molecule can't be built from a modeling kit, the deficiencies of SMILES and other connection-table formats become apparent.