ICM Manual v.3.9
by Ruben Abagyan, Eugene Raush and Max Totrov
Copyright © 2020, Molsoft LLC
Jun 5 2024

Regular expressions (regexp)

[ Regexp syntax ]

Functions supporting regular expressions:

See regexp syntax.

ICM regular expression syntax

[ Simple expressions | Shortcuts | Regexp back references | Greedy matching ]

Simple expressions

  • . any character except a newline (to match any character at all, use (.|\n) or put (?n) at the beginning of the expression)
  • ^ the beginning of the line
  • $ the end of the line
  • [abc] any character from the list
  • [^abc] any character NOT in the list
  • [a-z] a range, e.g. [0-9] or [0-9A-Z]
  • \c backslash suppresses special meaning of a character
  • \\ backslash itself
  • (string) enclose a simple expression in parentheses to write repetitions, back-references, or field=number expressions in the Split, Match and Replace functions.
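The simple expressions above can be tried out in any Perl-style regexp engine. The sketch below uses Python's `re` module as a stand-in, not ICM itself, so it only illustrates the pattern syntax. The helper name `matches` is hypothetical, and ICM's own functions may behave differently in details.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# '.' matches any character except a newline
dot_plain = matches(r"a.c", "abc")
dot_newline = matches(r"a.c", "a\nc")

# (.|\n) matches any character, newline included
any_char = matches(r"a(.|\n)c", "a\nc")

# ^ and $ anchor the match to the beginning and end of the line;
# [0-9A-Z] is a combined range
upper_code = matches(r"^[0-9A-Z]+$", "A1B2")
lower_code = matches(r"^[0-9A-Z]+$", "a1b2")

# [^abc] needs at least one character that is NOT a, b or c
has_other = matches(r"[^abc]", "abcd")

# a backslash removes the special meaning of '.'; '\\' is a literal backslash
literal_dot = matches(r"1\.5", "125")
literal_backslash = matches(r"\\", "a\\b")

print(dot_plain, dot_newline, any_char, upper_code, lower_code,
      has_other, literal_dot, literal_backslash)
```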

Inline modifiers of regular expressions:

  • (?i) ignore case until the end of the enclosing group, e.g. 'aBc' ~ '(?i)abc'; 'a((?i)bc)d' matches 'aBCd', 'abcd' and 'aBcd', but not 'Abcd' or 'abcD'
  • (?-i) match case-sensitively until the end of the enclosing group, e.g. 'a(?i)bc(?-i)d' matches 'aBCd', but not 'Abcd' or 'abcD'
  • (?n) let the dot '.' match the newline character as well: "1bc\nd2" ~ '(?n)1.*2'
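As a rough illustration of the inline modifiers, here is a sketch in Python's `re` module, used as a stand-in for ICM. Two spellings differ and are assumptions of this example rather than ICM syntax: Python writes a group-scoped modifier as `(?i:...)`, and its counterpart of ICM's `(?n)` is `(?s)`.

```python
import re

# Case is ignored only inside the scoped group, as in ICM's 'a((?i)bc)d'
scoped = re.compile(r"a(?i:bc)d")
candidates = ["aBCd", "abcd", "aBcd", "Abcd", "abcD"]
accepted = [s for s in candidates if scoped.fullmatch(s)]

# A modifier at the very beginning applies to the whole expression
whole = re.fullmatch(r"(?i)abc", "aBc") is not None

# Without the modifier '.' stops at the newline; with (?s) it crosses it
text = "1bc\nd2"
without_modifier = re.search(r"1.*2", text)
with_modifier = re.search(r"(?s)1.*2", text)

print(accepted)
print(whole, without_modifier is None, with_modifier is not None)
```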


Shortcuts

  • \d matches a digit ( '[0-9]' ). '\d+' matches one or more digits.
  • \D matches a NON-digit. '\D+' matches the text between numbers
  • \w matches a character in a word ( [a-zA-Z_] ). '\w+' matches a word
  • \W matches a NON-word character. '\W+' matches the space between words
  • \s matches a whitespace character, or a separator ( [ \r\t\n\f] )
  • \S matches a non-separator symbol
  • \b matches a word boundary, i.e. a boundary between \w and \W symbols. For example, '\bedge\b' matches inside 'the edge' and does not match inside 'the hedge'
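The shortcuts can be illustrated with a short sketch in Python's `re` module, used as a stand-in for ICM. The sample strings are made up for this example. Note that Python's `\w` also accepts digits, unlike the `[a-zA-Z_]` definition given above, so `\w` is left out here.

```python
import re

text = "chain A: 120 residues, chain B: 98 residues"

# \d+ picks out each run of digits
numbers = re.findall(r"\d+", text)

# \s+ treats any run of whitespace characters as a single separator
fields = re.split(r"\s+", "a  b\tc\nd")

# \S+ is the complement: each run of non-separator symbols
tokens = re.findall(r"\S+", "  x=1   y=2 ")

# \b requires a word boundary on both sides of 'edge'
in_edge = re.search(r"\bedge\b", "the edge") is not None
in_hedge = re.search(r"\bedge\b", "the hedge") is not None

print(numbers, fields, tokens, in_edge, in_hedge)
```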

Repetitions and back-references

(a and b are simple regular expressions, e.g. a DNA base [ACTG] or ([hp]anky.*)):

  • a? - nothing or a single occurrence of a
  • a* - nothing or any number of repetitions of a
  • a+ - one or more repetitions of a
  • a{n,m} - matches a from n to m times
  • a|b - matches a or b
  • ab - matches a followed by b
  • (a)\1 - \1 is a back-reference: matches a, then matches exactly the same string. Back-references can go from \1 to \9.
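A minimal sketch of the repetition operators and a back-reference, again using Python's `re` module as a stand-in for ICM. The DNA strings are invented for illustration.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# a? : nothing or a single occurrence
optional = [full(r"AC?T", s) for s in ("AT", "ACT", "ACCT")]

# a* : any number of repetitions, including none; a+ : at least one
star = [full(r"AC*T", s) for s in ("AT", "ACCCT")]
plus = [full(r"AC+T", s) for s in ("AT", "ACCCT")]

# a{n,m} : from n to m repetitions of a DNA base
bounded = [full(r"[ACTG]{2,4}", s) for s in ("A", "ACT", "ACTGA")]

# a|b : either alternative
either = [full(r"ACT|TGA", s) for s in ("ACT", "TGA", "ACA")]

# (a)\1 : two bases followed by exactly the same two bases
repeat = re.search(r"([ACTG]{2})\1", "TTGAGAC")

print(optional, star, plus, bounded, either)
print(repeat.group(0), repeat.group(1))
```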

A problem with the POSIX repetitions

Imagine that you want to match the text between two tags, e.g. <i>one</i>, in a text that contains two items of the same kind (<i>one</i> and <i>two</i>). Unfortunately, we cannot simply use <i>.*</i> to match <i>one</i>, because the POSIX standard looks for the MAXIMAL LENGTH match: the expression takes the first <i> and the last </i> as its flanking tags and matches everything from <i>one</i> through <i>two</i>.
A straightforward solution to this problem is to define the word between the tags more strictly, by saying that the italicized word must not contain the '<' symbol.

ICM followed Perl in using the question mark (?) after the repetition symbol to enforce the minimal match. The minimal match expressions will look like this (a is a simple regular expression, like a character or a string in parentheses ):

  • a?? - nothing or a single occurrence of a, preferring nothing
  • a*? - nothing or any number of repetitions of a, as few as possible (e.g. Match(s,'tag(.*?)endtag':n))
  • a+? - one or more repetitions of a, as few as possible

Applied to the text '<i>one</i> and <i>two</i>':

  • '<i>.*</i>' - matches the entire text, with '.*' covering 'one</i> and <i>two'
  • '<i>[^<]*</i>' - explicitly prohibits a tag inside; matches only the first word
  • '<i>.*?</i>' - the '*?' expression enforces the smallest match
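The difference between maximal and minimal matching can be checked with a short sketch in Python's `re` module, used as a stand-in for ICM. Python follows the same Perl convention for `*?`.

```python
import re

text = "<i>one</i> and <i>two</i>"

# Maximal match: '.*' runs from the first <i> to the last </i>
greedy = re.search(r"<i>.*</i>", text).group(0)

# Excluding '<' inside the tags stops the match at the first closing tag
no_tag_inside = re.search(r"<i>[^<]*</i>", text).group(0)

# Minimal match: '*?' takes as few characters as possible
minimal = re.search(r"<i>.*?</i>", text).group(0)

# With a capturing group, the minimal form extracts every tagged word
words = re.findall(r"<i>(.*?)</i>", text)

print(greedy)
print(no_tag_inside)
print(minimal)
print(words)
```

The negated character class and the minimal match agree on this input. They differ once the tagged text may itself contain a '<', which only the minimal form tolerates.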

