Robert M. Hanson
Department of Chemistry
St. Olaf College
5/19/2010
This document describes a specification for an extension of SMILES and SMARTS for use in 3D molecular atom search and selection as well as biomolecular sequence and cross-link searching. This specification is implemented in Jmol 12.0. It is really a set of specifications:
Besides a presentation of general considerations, a detailed specification of the syntax is given, and the term "aromatic" is defined.
format
$ load 1crn.pdb; print {*}.find("SMILES",true)
//* Jmol bioSMILES 12.0.RC19_dev  2010-06-06 14:24 1 *//
~p~TTC:1C:2PSIVARSNFNVC:3RLPGTPEAIC:3ATYTGC:2IIIPGATC:1PGDYAN
$ load 1blu.pdb; print {*}.find("SMILES",true)
//* Jmol bioSMILES 12.0.RC19_dev  2010-06-06 14:24 1 *//
~p~ALMITDECINCDVCEPECPNGAISQGDETYVIEPSLCTECVGHYETSQCVEVCPVDCIIKDPSHEETEDELRAK
  YERITG.
//* FS4 *//[Fe@@]123[S]4[Fe@@]56[S]7[Fe@]84[S]3[Fe@@]97[S]26.
  [CYS.SG#16//* 8 *//]8.[CYS.SG#16//* 53 *//]1.[CYS.SG#16//* 14 *//]9.[CYS.SG#16//* 11 *//]5.
//* FS4 *//[Fe@@]%10%11%12[S]%13[Fe@@]%14%15[S]%16[Fe@]%17%13[S]%12[Fe@@]%18%16[S]%11%15.
  [CYS.SG#16//* 37 *//]%17.[CYS.SG#16//* 18 *//]%10.[CYS.SG#16//* 49 *//]%18.[CYS.SG#16//* 40 *//]%14.
//* HOH *//[O]
$ load 1d66.pdb;print {*}.find("SMILES", true)
//* Jmol bioSMILES 12.0.RC19_dev  2010-06-06 14:24 1 *//
//* chain D dna *// ~d~CCGGAGGACAGTCCTCCGG.
//* chain E dna *// ~d~CCGGAGGACTGTCCTCCGG.
//* chain A protein *// ~p~EQACDICRLKKLKCSKEKPKCAKCLKNNWECRYSPKTKRSPLTRAHLTEV
  ESRLERL.
//* chain B protein *// ~p~EQACDICRLKKLKCSKEKPKCAKCLKNNWECRYSPKTKRSPLTRAHLTEV
  ESRLERL.
//* CD *//[Cd]1234[Cd]567[CYS.SG#16//* 28:B *//]4.[CYS.SG#16//* 11:B *//]15.
  [CYS.SG#16//* 31:B *//]2.[CYS.SG#16//* 21:B *//]7.[CYS.SG#16//* 14:B *//]6.[CYS.SG#16//* 38:B *//]3.
//* CD *//[Cd]89%10%11[Cd]%12%13%14[CYS.SG#16//* 28:A *//]%10.[CYS.SG#16//* 14:A *//]%13.
  [CYS.SG#16//* 31:A *//]9.[CYS.SG#16//* 21:A *//]%14.[CYS.SG#16//* 11:A *//]8%12.[CYS.SG#16//* 38:A *//]%11.
//* HOH *//[O]
$ load 1d66.pdb;calculate hbonds;print {*}.find("SMILES", true)
108 hydrogen bonds
//* Jmol bioSMILES 12.0.RC19_dev  2010-06-06 14:24 1 *//
//* chain D dna *// ~d~C:1C:2G:3G:4A:5G:6G:7A:8C:9A:%10G:%11T:%12C:%13C:%14T:%15
  C:%16C:%17G:%18G:%19.
//* chain E dna *// ~d~C:%19C:%18G:%17G:%16A:%15G:%14G:%13A:%12C:%11T:%10G:9T:8
  C:7C:6T:5C:4C:3G:2G:1.
//* chain A protein *// ~p~E:%20QA:%20C:%21D:%22:%23I:%24C:%25:%26R:%21:%27L:%23
  K:%24K:%27:%25L:%26KCSK:%28EK:%28PKC:%29A:%30K:%31C:%32:%33L:%29:%34K:%30
  N:%31N:%34:%32W:%33E:%35CR:%35:%22YSPKTKRSP:%36LT:%36:%37R:%38A:%39H:%40L:%37:%41
  T:%38:%42E:%39:%43:%44V:%40:%45E:%41:%46S:%43:%42R:%44L:%45:%47:%48E:%46R:%47
  L:%48.
//* chain B protein *// ~p~EQAC:%49D:%50:%51I:%52C:%53:%54R:%49:%55L:%51K:%52
  K:%55:%53L:%54KCSK:%56EK:%56PKC:%57:%58A:%59K:%60C:%57:%61:%62L:%58:%63K:%59
  N:%60N:%61:%63W:%62ECR:%50YSPKTKRSP:%64LT:%64:%65R:%66A:%67H:%68L:%65:%69
  T:%66:%70E:%67:%71V:%68:%72E:%69:%73S:%70:%74R:%71L:%72:%75E:%73R:%74L:%75.
//* CD *//[Cd]%76%77%78%79[Cd]%80%81%82[CYS.SG#16//* 28:B *//]%79.[CYS.SG#16//* 11:B *//]%76%80.
  [CYS.SG#16//* 31:B *//]%77.[CYS.SG#16//* 21:B *//]%82.[CYS.SG#16//* 14:B *//]%81.[CYS.SG#16//* 38:B *//]%78.
//* CD *//[Cd]%83%84%85%86[Cd]%87%88%89[CYS.SG#16//* 28:A *//]%85.[CYS.SG#16//* 14:A *//]%88.
  [CYS.SG#16//* 31:A *//]%84.[CYS.SG#16//* 21:A *//]%89.[CYS.SG#16//* 11:A *//]%83%87.[CYS.SG#16//* 38:A *//]%86.
//* HOH *//[O]
//* chain 9 rna *// ~r~UUAG:%(792)G:%(793)C:%(794)G:%(795)G:%(796)C:%(797)CAC AG:%(798)C:%(799)G:%(800)G:%(801)U:%(802)G:%(803)G:%(804)G:%(805)GUUGCCUC:%(806) C:%(807)C:%(808)G:%(809)U:%(810)ACCC:%(811)AUCCCG:%(811)AACA:%(810)C:%(809) G:%(808)G:%(807)AAG:%(806)AU:%(812)AA:%(812)GC:%(805)C:%(804)C:%(803)A:%(802) C:%(801)C:%(800)AG:%(799)C:%(798)GUUC:%(813)C:%(814)G:%(815)G:%(816)G:%(817) GAGUAC:%(818)U:%(819)G:%(820)G:%(821)A:%(822)G:%(823)UG:%(824)C:%(825)GCG AG:%(825)C:%(824)C:%(823)U:%(822)C:%(821)U:%(820)G:%(819)G:%(818)GAAAC:%(817) C:%(816)C:%(815)G:%(814)G:%(813)UUCG:%(797)C:%(796)C:%(795)G:%(794)C:%(793) C:%(792)ACC.
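As a rough illustration of how the cross-link pointers in a bioSMILES string pair up, the following Python sketch reads a sequence such as the 1crn output above and reports which sequence positions share a connection number. The function name `cross_links` is hypothetical and not part of Jmol. The sketch assumes single-letter residue codes only (no bracketed [RES.ATOM] nodes), and it counts positions within the string, which need not equal the residue numbers in the PDB file.

```python
import re

# One token is either a single-letter residue code or a cross-link pointer
# of the form :n, :%nn, or :%(n).
_TOKEN = re.compile(r'([A-Za-z*])|:(?:%\((\d+)\)|%(\d\d)|(\d))')


def cross_links(sequence):
    """Return (position, position) pairs of cross-linked residues.

    Positions are 1-based and counted along the sequence string.
    Characters that are not residue codes or pointers are skipped.
    """
    text = "".join(sequence.split())                 # drop all white space
    text = re.sub(r'^~(?:[pnrd]~)?', '', text)       # drop the bioCode prefix
    open_links = {}                                  # pointer number -> position
    pairs = []
    position = 0
    for match in _TOKEN.finditer(text):
        if match.group(1):                           # a residue code
            position += 1
            continue
        number = int(match.group(2) or match.group(3) or match.group(4))
        if number in open_links:                     # second occurrence closes the link
            pairs.append((open_links.pop(number), position))
        else:                                        # first occurrence opens it
            open_links[number] = position
    return pairs


crambin = "~p~TTC:1C:2PSIVARSNFNVC:3RLPGTPEAIC:3ATYTGC:2IIIPGATC:1PGDYAN"
print(sorted(cross_links(crambin)))
```

For the 1crn string this pairs positions 3 and 40, 4 and 32, and 16 and 26, which are the three cysteine cross-links marked :1, :2, and :3 in the output.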
All single-component aspects of Daylight SMILES are implemented, including aromaticity and atom- and bond-based stereochemistry ("chirality").
      Var x = '$R1="[CH3,NH2]";$R2="[OH]"; {a}[$R1]' // select aromatic atoms attached to CH3 or NH2  
      select within(SMARTS,@x)
 
Note that these variables are any string whatsoever, not just atom sets. The syntax is simply:
      Var x = '$R1="[CH3,NH2]";$R2="[$($R1),OH]"; {a}[$R1]' // select aromatic atoms attached to CH3, NH2, or OH  
      select within(SMARTS,@x)
 
 
      Var x = '$R1="[CH3,NH2]";$R2="[OH]";  {a}[$([$R1]),$([$R2])]' // aromatic attached to CH3, NH2, or OH
      select within(SMARTS,@x)
    Note that $(...) need not be within [...], and 
    wherever it is, it always means "just the first atom".
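The substitution step can be pictured with a small Python sketch. It is not Jmol's code: the function name `expand_variables` is hypothetical, and it assumes that a reference is written as a bracketed `[$NAME]` and is replaced textually by the quoted definition, which is how the grammar note on variable references reads.

```python
import re

# One definition: $LABEL="value" followed by optional comments and ";"
_DEFINITION = re.compile(r'\$([A-Z][^=($]*)="([^"]*)"[^;]*;')


def expand_variables(pattern):
    """Strip the variable definitions and substitute each [$NAME] reference."""
    text = "".join(pattern.split())          # all white space is removed first
    definitions = {}
    position = 0
    while True:
        match = _DEFINITION.match(text, position)
        if match is None:
            break
        definitions[match.group(1)] = match.group(2)
        position = match.end()
    body = text[position:]
    # Repeat so that one definition may refer to another; the number of
    # passes is bounded to avoid looping on a self-referential definition.
    for _ in range(len(definitions) + 1):
        expanded = body
        for name, value in definitions.items():
            expanded = expanded.replace("[$" + name + "]", value)
        if expanded == body:
            break
        body = expanded
    return body


print(expand_variables('$R1="[CH3,NH2]";$R2="[OH]";  {a}[$([$R1]),$([$R2])]'))
```

Under these assumptions the last example above expands to `{a}[$([CH3,NH2]),$([OH])]` before the SMARTS pattern itself is parsed.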
| [Element] | capitalized - standard notation Na, Si, etc. -- specific non-aromatic atom | 
| [element] | uncapitalized - specific aromatic atom (as for standard notation, no limitations) | 
| * | any atom | 
| A | any non-aromatic atom | 
| a | any aromatic atom | 
| # | atomic number | 
| (integer) | mass number -- Note, however, that [H1] is [*H1], "any atom with one attached hydrogen", not unlabeled hydrogen, [1H]. | 
| D | degree - total number of connections | 
| H | exact hydrogen count | 
| h | "implicit" hydrogen count (atoms are not in structure) | 
| R | in the specified number of rings | 
| r | in ring of a given size | 
| v | valence (total bond order) | 
| X | calculated connectivity, including implicit hydrogens | 
| x | number of ring bonds | 
| @ | stereochemistry | 
| d | non-hydrogen degree -- number of non-hydrogen connections | 
| = | Jmol atom index, for example: [=23] | 
| [number]? | mass number or undefined (so, for example, [C12?] means any carbon that isn't explicitly C13 or C14) | 
| [$n(pattern)] | A specific number of occurrences of pattern. For example, C[$3(C=C)]C is synonymous with CC=CC=CC=CC. | 
| [$min-max(pattern)] | A variable number of occurrences of pattern. For example: A[$0-2(C:G)]A is synonymous with AA or AC(:G)A or AC(:G)C(:G)A. | 
| pattern1 || pattern2 | "||" indicates "or" and allows searching for multiple patterns, which may overlap. For example: select search("c{O} || c{C}"). Note that the "||" syntax is an alternative to using "[,]", in this case being equivalent to (and slightly slower than) select search("c{[O,C]}"). | 
| (.measure) | The extension capitalizes on the fact that in a standard SMARTS string, period "." cannot
ever appear immediately following an open parenthesis "(". Using this fact, the format involves the following: 
  "(." [single character type - "d" (distance), "a" (angle), or "t" (torsion)] [optional numeric identifier] 
           ":" [optional "!" (not)] [minimum value] { "," | "-" } [maximum value] ")"
This extension must appear immediately following an element symbol or a bracketed atom expression.  
The separators "," or "-" between minimum and maximum values are equivalent.
For example, the following will find all aliphatic carbon-carbon bonds that are between 1.5 and 1.6 angstroms long.
select search("C(.d:1.5-1.6)C")
The following will select for all trans-diaxial methyl groups on a cyclohexane ring, finding all torsions that are outside
the range -160 to 160 degrees:
select search("[CH3](.t:!-160,160)CC[CH3]")
The default in terms of specifying which atoms are involved is simply "the next N-1 atoms," where N is 2, 3, or 4. 
For more complicated patterns, one can designate the specific atoms in the measurement using a numeric 
identifier after the measurement type. The following will
target the bond angle across the carbonyl group in the backbone of a peptide:
select search("[*.CA](.a1:105-110)C(.a1)(=O)N(.a1)")
Designations can overlap; one simply adds whatever (.xn) designation is wanted after the desired atoms:
select search("C(.a1:105,108)C(.a1)(.a2:110,130)C(.a1)(.a2)C(.a2)")
In Jmol, this capability is extended to the measure command for easy access to SMARTS-based measurements:
select *
measure search("C(.a1:110,130)C(.a1)(=O)C(.a1)")
Note that the atoms in no way have to be connected. The only restriction is that the three markers for an angle
or the four markers for a torsion will be identified in order from left to right within the SMARTS string. The following,
for example, will find all carbonyl oxygen atoms that are within 5 angstroms of each other:
select search("{O}(.d1:0,5)=C.{O}(.d1)=C")
Here "." indicates "not bonded." {O} specifies that although we want to find the entire set, we only
want to select the oxygen atoms. The close of the selection brace may appear before or after the (.x) designation. | 
| [residue.atomName#atomicNumber] | residue and atom name, with optional atomic number, for example [CYS.SG#16] or [ALA.CA]. "0" for atomName indicates the "lead" atom -- for nucleic acids the phosphorus atom (or in some cases a terminal oxygen or hydrogen atom), and for proteins the alpha carbon. | 
| ~...~... | bioSEQUENCE using single-letter or [RES] codes. | 
| %(n) | ring branching where n may be larger than 99. | 
| [*.ATOMNAME], [RESIDUE.*], [*.*] | Wildcards for residues and atom names | 
| [RES.ATOMNAME]+[RES.ATOMNAME] | atoms in adjacent residues, for example [ALA.CA]+[GLY.N] | 
| [RES.ATOMNAME]:[RES.ATOMNAME] | atoms in cross-linked residues, for example [CYS.CA]:[CYS.CA] | 
| ~...~... | bioSEQUENCE notation using single-letter or [RES] codes, including logic: select search("~A:[C,T]") | 
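To make the (.measure) format in the table above concrete, here is a minimal Python sketch that pulls the measurement designations out of a pattern string. The names `Measure` and `find_measures` are hypothetical, and the sketch only reads the syntax; it does not evaluate any distances, angles, or torsions.

```python
import re
from typing import NamedTuple, Optional


class Measure(NamedTuple):
    kind: str                  # "d" (distance), "a" (angle), or "t" (torsion)
    ident: Optional[int]       # optional numeric identifier
    negate: bool               # True when "!" precedes the range
    minimum: Optional[float]   # None when the marker carries no range
    maximum: Optional[float]


_NUMBER = r'-?\d+(?:\.\d+)?'
_MEASURE = re.compile(
    r'\(\.([dat])(\d*)'                                        # "(." type, identifier
    r'(?::(!?)(' + _NUMBER + r')[,-](' + _NUMBER + r'))?\)'    # optional ":" range
)


def find_measures(pattern):
    """Return every (.x) designation found in the pattern, left to right."""
    found = []
    for match in _MEASURE.finditer(pattern):
        kind, ident, bang, low, high = match.groups()
        found.append(Measure(
            kind,
            int(ident) if ident else None,
            bang == "!",
            float(low) if low else None,
            float(high) if high else None,
        ))
    return found


for measure in find_measures("[*.CA](.a1:105-110)C(.a1)(=O)N(.a1)"):
    print(measure)
```

Only the first marker of a numbered group carries the range; the later `(.a1)` markers simply tag the remaining atoms, so they come back with `minimum` and `maximum` set to `None`.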
 
      # note: prior to parsing, all white space is removed
       
   [smilesDef] == [preface] [smiles]
   [preface] == { [flagDefs] | NULL } 
   [flagDefs] == { [flagDef] | [flagDef] [flagDefs] }
   [flagDef] == "/" [processingFlags] "/"
   [processingFlags] == { [processingFlag] | [processingFlag] [processingFlags] }
   [processingFlag] == { "noAromatic" | "noStereo" } (case-insensitive)
      # note: the noAromatic flag indicates to not distinguish between
      #       aromatic/aliphatic searches -- "C" and "c"
      # note: the noStereo flag turns off all stereochemical testing
      # note: thus, both "/noAromatic//noStereo/" and "/noAromatic noStereo/" are valid 
   [smiles] == { [entity] | [entity] "." [entity] }
   [entity] == { [bioSequence] | [molecularSequence] }
   [molecularSequence] == [node][connections] 
   [node] == { [atomExpression] | [connectionPointer] }
   [atomExpression] == { [unbracketedAtomType] 
                             | "[" [bracketedExpression] "]" }
   
   [unbracketedAtomType] == [atomType] 
                                 & ! { "Ac" | "Ba" | "Ca" | "Na" | "Pa" | "Sc"
                                     | "ac" | "ba" | "ca" | "na" | "pa" | "sc" }
      # note: Brackets are required for these elements: [Na], [Ca], etc.
      #       These elements Xy are instead interpreted as "X" "y", a single-letter
      #       element followed by an aromatic atom. 
      
   [atomType] == { [validElementSymbol] | [aromaticType] }
   [validElementSymbol] == (see Elements.java; 
                            including Xx and only through element 109)
   [aromaticType] == { [validElementSymbol].toLowerCase() }
       
   [bracketedExpression] == { "[" [atomPrimitives] "]" } 
   
   [atomPrimitives] == { [atom] | [atom] [atomModifiers] }
   [atom] == { [isotope] [atomType] | [atomType] } 
   [isotope] == [digits]
       # note -- isotope mass must come before the element symbol. 
   [digits] == { [digit] | [digit] [digits] }
   [digit] == { "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" }
   [atomModifiers] == { [atomModifier] | [atomModifier] [atomModifiers] }
   [atomModifier] == { [charge] | [stereochemistry] | [H_Prop] }
   [charge] == { "-" [digits] | "+" [digits] | [plusSet] | [minusSet] }
   [plusSet] == { "+" | "+" [plusSet] }
   [minusSet] == { "-" | "-" [minusSet] }
   [stereochemistry] == { "@"           # anticlockwise
                              | "@@"    # clockwise
                              | "@" [stereochemistryDescriptor] 
                              | "@@" [stereochemistryDescriptor] }
   [stereochemistryDescriptor] == [stereoClass] [stereoOrder]
   [stereoClass] == { "AL" | "TH" | "SP" | "TP" | "OH" }
   [stereoOrder] == [digits]
   
   [connectionPointer] == { "%" [digit][digit] | [digit] | "%(" [digits] ")"}
      # note: all connectionPointers must have a second matching connectionPointer 
      #       and must be preceded by an atomExpression for the 
      #       first occurrence and either an atomExpression or a bond
      #       for the second occurrence
      # note: Jmol bioSMARTS extends the possible number of rings to > 100 by 
      #       allowing %(n)
   [connections] == { [connection] | NULL }
   [connection] == { [branch] | [bond] [node] } [connections]
   [branch] == { "(" { [smiles] | [bond] [smiles] } ")" | "()" }
      # note: empty parentheses "()" are ignored in SMILES and bioSMILES
   [bond] == { "-" | "=" | "#" | "." | "/" | "\\" | ":" | NULL }
      # note: Jmol will not match two totally independent molecular pieces. For example,
      #       Jmol will not match [Na+].[Cl-]. However, "." can be used to clarify a
      #       structure that has "ring" bond notation:
      #       CC1CCC.C1CC   is a valid structure.
      # note: bioSEQUENCE uses ":" to indicate "cross-linked", which is the default for branches
   [bioSequence] == [bioCode] [bioNode] [connections]
   [bioCode] == { "~" | "~" [bioType] "~" }
      # note: The "~" must be the first character in a component and must be repeated 
      #       for each component (separated by ".")
   [bioType] == { "p" | "n" | "r" | "d" }
      # note: protein, nucleic, RNA, DNA
   [bioNode] == { "[" [bioResidueName] "." [bioAtomName] "]" 
                 | "[" [bioResidueName] "." [bioAtomName] "#" [atomicNumber] "]" 
                 | [bioResidueCode] } 
   [atomicNumber] == [digits]
   [bioResidueName] == { "ARG" | "GLY" ... } (case-insensitive) 
   [bioAtomName] == {"C" | "CA" | "N" ... } (case-insensitive)
   [bioResidueCode] == { "A" | "R" | "G" ... } (case-sensitive)
      # note: In a BioSEQUENCE, residues are designated using standard 1-letter-code group names
      #       or bracketed residues [xxx] with optional atoms specified: [ARG], [CYS.SG]. 
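
The [preface] rules above can be illustrated with a short Python sketch that peels the processing flags off the front of a pattern. The function name `split_preface` is hypothetical. The sketch relies on the stated rule that white space is removed before parsing, which is why "/noAromatic noStereo/" is read as two flags run together.

```python
_KNOWN_FLAGS = ("noaromatic", "nostereo")


def split_preface(pattern):
    """Return (set of lower-cased flags, remaining pattern)."""
    text = "".join(pattern.split())              # remove all white space
    flags = set()
    while text.startswith("/"):
        end = text.find("/", 1)
        if end < 0:
            raise ValueError("unterminated flag definition")
        chunk = text[1:end].lower()              # flags are case-insensitive
        while chunk:
            for flag in _KNOWN_FLAGS:
                if chunk.startswith(flag):
                    flags.add(flag)
                    chunk = chunk[len(flag):]
                    break
            else:
                raise ValueError("unknown processing flag: " + chunk)
        text = text[end + 1:]
    return flags, text


print(split_preface("/noAromatic//noStereo/C=C"))
print(split_preface("/noAromatic noStereo/c1ccccc1"))
```

Both spellings of the preface yield the same pair of flags, and the remainder of the string is what would be handed on as [smiles].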
 
 ######## GENERAL ########
      # note: prior to parsing, all white space is removed
   [smartDef] == [preface] [smartsSet]
   [preface] == { [flagDefs] [variableDefs] | [variableDefs] | NULL } 
   [flagDefs] == { [flagDef] | [flagDef] [flagDefs] }
   [flagDef] == "/" [processingFlags] "/"
   [processingFlags] == { [processingFlag] | [processingFlag] [processingFlags] }
   [processingFlag] == { "noAromatic" | "noStereo" } (case-insensitive)
      # note: the noAromatic flag indicates to not distinguish between
      #       aromatic/aliphatic searches -- "C" and "c"
      # note: the noStereo flag turns off all stereochemical testing
      # note: thus, both "/noAromatic//noStereo/" and "/noAromatic noStereo/" are valid 
   [variableDefs] == [variableDef] | [variableDef] [variableDefs]
   [variableDef] ==  "$" [label] "=" "\"" [smarts] "\"" [comments] ";"
   [label] == 'A-Z' [any characters other than "=", "(", or "$"]
   [comments] == [any characters other than ";"]
      # note: Variable definitions must be parsed first. 
      #       After that, all variable references [$XXXX] are replaced
      
   [smartsSet] == { [smarts] | [smarts] "||" [smartsSet] }
      # note: Jmol adds the "or" operation "||", for example: "C=O || C=N"
      #       which, in this case, could also be written as "C=[O,N]"
      #       Jmol preprocesses these sets, evaluates them independently, and then
      #       combines them.
      
   [smarts] == { [node3D] [connections] | [bioSequence] } 
   [connections] == { [connection] | NULL }
   [connection] == { [branch] | [bondExpression] [node3D] } [connections]
   [branch] == { "(" { [smarts] | [bondExpression] [smarts] } ")" | "()" }
      # note: Default bonding for a branch is single for SMARTS or cross-linked (:) for bioSEQUENCE
      # note: "()" is ignored in SMARTS and indicates "not cross-linked" in bioSEQUENCE
   
 ######## ATOMS ########
    
   [node3D] == { [atomExpression] | [atomExpression] "(." [measure] ")" | [connectionPointer] }
   [atomExpression] == { [unbracketedAtomType]
                             | [bracketedExpression] 
                             | [multipleExpression]
                             | [nestedExpression] }
   
   
   [unbracketedAtomType] == [atomType] 
                                 & ! { "Ac" | "Ba" | "Ca" | "Na" | "Pa" | "Sc"
                                     | "ac" | "ba" | "ca" | "na" | "pa" | "sc" }
      # note: Brackets are required for these elements: [Na], [Ca], etc.
      #       These elements Xy are instead interpreted as "X" "y", a single-letter
      #       element followed by an aromatic atom. 
      # note: in a bioSEQUENCE, all atom types are 1-letter code group names
      
   [atomType] == { [validElementSymbol] | "A" | [aromaticType] | "*" }
   [validElementSymbol] == (see Elements.java; 
                            including Xx and only through element 109)
   [aromaticType] == { "a" | [validElementSymbol].toLowerCase() }
       
   [bracketedExpression] == "[" { [atomOrSet] | [atomOrSet] ";" [atomAndSet] } "]" 
   
   [atomOrSet] == { [atomAndSet] | [atomAndSet] "," [atomAndSet] }
   [atomAndSet] == { [atomPrimitives] | [atomPrimitives] "&" [atomAndSet]
                              | "!" [atomPrimitive] 
                              | "!" [atomPrimitive] "&" [atomAndSet] }
                              
 ######## ATOM PRIMITIVES ########
   [atomPrimitives] == { [atomPrimitive] | [atomPrimitive] [atomPrimitives] }
       # note -- if & is not used, certain combinations of primitive descriptors
       #         are not allowed. Specifically, combinations that together
       #         form the symbol for an element will be read as the element (Ar, Rh, etc.)
       #         when NOT followed by a digit and no element has already been defined 
       #         So, for example, [Ar] is argon, [Ar3] is [A&r3], [ORh] is [O&R&h],  
       #         but [Ard2] is [Ar&d2] -- "argon with two non-hydrogen connections"
       #         Also, "!" may not be used with implied "&". 
       #         Thus, [!a], [!a&!h2], and [h2&!a] are all valid, but [!ah2] is invalid.             
   [atomPrimitive] == { [isotope] | [atomType] | [charge] | [stereochemistry]
                              | [a_Prop] | [A_Prop] | [D_Prop] | [H_Prop] | [h_Prop] 
                              | [R_Prop] | [r_Prop] | [v_Prop] | [X_Prop]
                              | [x_Prop] | [nestedExpression] }
   [isotope] == [digits] | [digits] "?"
       # note -- isotope mass may come before or after element symbol, 
       #         EXCEPT "H1" which must be parsed as "an atom with a single H" 
   [digits] == { [digit] | [digit] [digits] }
   [digit] == { "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" }
   [charge] == { "-" [digits] | "+" [digits] | [plusSet] | [minusSet] }
   [plusSet] == { "+" | "+" [plusSet] }
   [minusSet] == { "-" | "-" [minusSet] }
   [stereochemistry] == { "@"           # anticlockwise
                              | "@@"    # clockwise
                              | "@" [stereochemistryDescriptor] 
                              | "@@" [stereochemistryDescriptor] }
   [stereochemistryDescriptor] == [stereoClass] [stereoOrder]
   [stereoClass] == { "AL" | "TH" | "SP" | "TP" | "OH" }
   [stereoOrder] == [digits]
       # note -- "?" here (unspecified) is not relevant in 3D-SEARCH 
   
   [A_Prop] == "#" [digits]           # elemental atomic number
   [a_Prop] == "=" [digits]           # atom index (starts with 0)
   [D_Prop] == { "D" [digits] | "D" } # degree -- total number of connections 
                                      #   excludes implicit H atoms; default 1
   [d_Prop] == { "d" [digits] | "d" } # degree -- non-hydrogen connections
                                      #   default 1 
   [H_Prop] == { "H" [digits] | "H" } # exact hydrogen count 
                                      #   excludes implicit H atoms
   [h_Prop] == { "h" [digits] | "h" } # implicit hydrogens -- "h" indicates "at least one"
                                      #   (see note below)
   [R_Prop] == { "R" [digits] | "R" } # ring membership; e.g. "R2" indicates "in two rings"
                                      #   "R" indicates "in a ring" 
                                      #   "!R" or "R0" indicates "not in any ring"
   [r_Prop] == { "r" [digits] | "r" } # in ring of size [digits]; "r" indicates "in a ring"
   [v_Prop] == { "v" [digits] | "v" } # valence -- total bond order (counting double as 2, e.g.)
   [X_Prop] == { "X" [digits] | "X" } # connectivity -- total number of connections
                                      #   includes implicit H atoms
   [x_Prop] == { "x" [digits] | "x" } # ring connectivity -- total ring connections
   
 ######## Nested and Multiple Expressions ########
 
   [nestedExpression] == "$(" [atomExpression] ")"
      # note: nestedExpressions return only the first atom as a match, 
      #       not all atoms in the expression.
   [multipleExpression] == { "[$" [nTimes] "(" [orExpression] ")]" 
                             | "[$" [nMinimum] "-" [nMaximum] "(" [orExpression] ")]" }  
   [orExpression] == { [atomExpression] 
                       | [atomExpression] "|" [orExpression] 
                       | [atomExpression] "||" [orExpression] }
      # note: "|" and "||" are synonymous in this inner context; "|" is preferred simply
      #       for readability (whereas "||" is required for the [smartsSet] context). 
      # note: This syntax is carefully written to exclude $(xxx) by itself, which
      #       is a nestedExpression, not a multipleExpression. The difference is that
      #       the nestedExpression only returns the first atom, while the multipleExpression
      #       returns all atoms. To return only the first atom within this context 
      #       it is necessary to use a nested expression within the multiple expression.
      #       For example: "CC[$2( $(C=O) | $(C=N) )]"
      #       is the same as "CC$(C=[O,N])$(C=[O,N])", although Jmol preprocesses it as
      #          "CC$(C=O)$(C=O)||CC$(C=O)$(C=N)||CC$(C=N)$(C=O)||CC$(C=N)$(C=N)"
      
   [nTimes] == [digits]
   [nMinimum] == [digits]
   [nMaximum] == [digits]
      # note: multipleExpressions allow for searching a given number of expressions or 
      #       a variable number of expressions (including 0, perhaps)
      #       Jmol pre-processes these expressions and turns them into a set:
      #       pattern1 || pattern2 || pattern3....
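
The preprocessing described in the notes above, where a multipleExpression is turned into a set of patterns joined by "||", can be sketched in Python as follows. This is an illustration only and not Jmol's implementation: the function names are hypothetical, only the first multipleExpression in a pattern is expanded, and repeats are joined by plain concatenation, which suits molecular patterns such as C[$3(C=C)]C.

```python
import itertools
import re

_OPEN = re.compile(r'\[\$(\d+)(?:-(\d+))?\(')    # "[$n(" or "[$min-max("


def _split_alternatives(inner):
    """Split on top-level "|" or "||", ignoring those inside () or []."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(inner):
        ch = inner[i]
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        if ch == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            if inner[i + 1:i + 2] == "|":        # treat "||" like "|"
                i += 1
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts


def expand_multiple(pattern):
    """Expand the first [$n(...)] or [$min-max(...)] into explicit patterns."""
    text = "".join(pattern.split())              # remove all white space
    match = _OPEN.search(text)
    if match is None:
        return [text]
    depth, j = 1, match.end()
    while depth and j < len(text):               # find the matching ")"
        if text[j] == "(":
            depth += 1
        elif text[j] == ")":
            depth -= 1
        j += 1
    if depth or text[j:j + 1] != "]":
        raise ValueError("unbalanced multiple expression")
    alternatives = _split_alternatives(text[match.end():j - 1])
    low = int(match.group(1))
    high = int(match.group(2) or match.group(1))
    head, tail = text[:match.start()], text[j + 1:]
    return [head + "".join(combo) + tail
            for count in range(low, high + 1)
            for combo in itertools.product(alternatives, repeat=count)]


print("||".join(expand_multiple("CC[$2( $(C=O) | $(C=N) )]")))
```

With the example from the note, the joined result is the same four-pattern set listed there.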
 ######## BioSEQUENCE ########
   [bioSequence] == [bioCode] [bioNode] [connections]
   [bioCode] == { "~" | "~" [bioType] "~" }
      # note: The "~" must be the first character in a component and must be repeated 
      #       for each component (separated by ".")
   [bioType] == { "p" | "n" | "r" | "d" }
      # note: protein, nucleic, RNA, DNA
   [bioNode] == { "[" [bioResidueName] "]" 
                 | "[" [bioResidueName] "." [bioAtomName] "]" 
                 | "[" [bioResidueName] "." [bioAtomName] [A_Prop] "]" 
                 | [bioResidueCode] } 
   [bioResidueName] == { "*" | "ARG" | "GLY" ... } (case-insensitive) 
   [bioAtomName] == { "*" | "0" | "C" | "CA" | "N" ... } (case-insensitive)
      # note: "0" indicates the "lead atom":
      #   nucleic: P if present, or H5T if present, or O5'/O5*
      #   protein: CA
      #   carbohydrate: the first atom of the group listed in the model file
   [bioResidueCode] == { "*" | "A" | "R" | "G" ... } (case-sensitive)
      # note: wildcard or standard group 1-letter-code
      #       or, in the case of RNA or DNA:
      #         "N" (any residue; same as "*"), 
      #         "R" (any purine -- A or G)
      #         "Y" (any pyrimidine -- C or T or U)
 ######## CONNECTIONS (aka "rings") ########
   [connectionPointer] == { [digit] | "%" [digit][digit] | "%(" [digits] ")" }
      # note: All connectionPointers must have a second matching connectionPointer 
      #       and must be preceded by an atomExpression for the 
      #       first occurrence and either an atomExpression or a bondExpression
      #       for the second occurrence. The matching connectionPointers may be
      #       in different "components" (separated by "."), in which case they
      #       represent general connections and not necessarily rings.
 ######## BONDS ########
   [bondExpression] == { [bondOrSet] | [bondOrSet] ";" [bondAndSet] } 
   
   [bondOrSet] == { [bondAndSet] | [bondAndSet] "," [bondAndSet] }
   [bondAndSet] == { [bondPrimitives] | [bondPrimitives] "&" [bondAndSet]
                              | "!" [bondPrimitive] 
                              | "!" [bondPrimitive] "&" [bondAndSet] }
                                              
 ######## BOND PRIMITIVES ########
                              
   [bondPrimitives] == { [bondPrimitive] | [bondPrimitive] [bondPrimitives] }       
   [bondPrimitive] == { "-" | "=" | "#" | "." | "/" | "\\" | ":" | "~" | "@" | "+" | "^" | NULL }
      # note: Not all bondExpressions are valid. Stereochemistry should not 
      #       be mixed with the others, as it represents a single bond always.
      #       In addition, "." ("no bond") cannot be mixed with any bond type.
      #       Nothing would be retrieved by "-&=", as a bond cannot be both single
      #       and double. However, "-@" is potentially very useful -- "ring single-bonds"
      #       or "=&!@" -- "doubly-bonded atoms where the double bond is not in a ring"
      # note: Jmol will not match two totally independent molecular pieces. For example,
      #       Jmol will not match [Na+].[Cl-]
      # note: "+" indicates "adjacent biomolecular groups in a chain"
      # note: a bioSEQUENCE ends with "." or the end of the string. A new bioSEQUENCE
      #       can continue with "~" immediately following this "." 
      # note: For a SMARTS search, "." indicates the start of a new subset, not necessarily a
      #       new component.
      # note: "^" indicates atropisomer bond with positive dihedral angle
      
 ######## MEASURES ########
   
   [measure] == { [measureId] | [measureId] ":" [range] | [measureId] ":!" [range] }
   [measureId] == { [measureCode] | [measureCode] [digits] }
   [measureCode] == { "d" | "a" | "t" }
   [range] == [minimumValue] { "," | "-" } [maximumValue]
   [minimumValue] == [decimalNumber]
   [maximumValue] == [decimalNumber]
We define "aromatic" here strictly in terms of geometry - a flat ring with trigonal planar geometry for all atoms in the ring. No consideration of bond order is used, because for the sorts of models that can be loaded into Jmol, many do not assume a bonding scheme (PDB, GAUSSIAN, etc.).
Given a ring of N atoms...
                  1
                /   \
               2     6 -- 6a
               |     |
         5a -- 5     4
                \   /
                  3  
    
    with arbitrary order and up to N substituents...
-- Bob Hanson last updated 4/10/2012 : fix for [$(...)n] and [$(...)min-max]