[NAME]
ALL.dao.type.string.pattern

[TITLE]
String Pattern Matching

[DESCRIPTION]


 0.1  Introduction 

Dao has built-in support for regular expression based string pattern matching. A regular
expression (regex) is a string representing a pattern (a set of rules) from which a set
of strings can be constructed. The pattern is used to search a string for sub-strings
that can be constructed from it, namely sub-strings that match the pattern. A number of
operations can be performed on the resulting sub-strings, including extraction,
replacement and splitting.

 0.2  Character Class 

A character class is used to identify a set of characters. 
  *  x : ordinary characters represent themselves, excluding the magic characters
     ^$|()%.[]*+-?{}<>;
  *  . : a dot represents any character;
  *  %a : all alphabetic characters;
  *  %s : all white space characters;
  *  %k : all control characters;
  *  %p : all punctuation characters;
  *  %d : all digits;
  *  %x : all hexadecimal digits;
  *  %c : all lower case characters;
  *  %e : all CJK (Chinese, Japanese, Korean) characters;
  *  %w : all alphabetic characters, digits and the character _;
  *  %A : non alphabetic characters, complement of %a;
  *  %S : non white space characters, complement of %s;
  *  %K : non control characters, complement of %k;
  *  %P : non punctuation characters, complement of %p;
  *  %D : non digits, complement of %d;
  *  %X : non hexadecimal digits, complement of %x;
  *  %C : upper case characters;
  *  %E : non CJK (Chinese, Japanese, Korean) characters;
  *  %W : complement of %w;
  *  %x : represents character x, where x is any non-alphanumeric character; x may also
     be an alphabetic character if it is not one of the character class symbols or b or
     B.
  *  [set] : represents the union of all characters in set; a range of characters
     starting from a character x up to another character y can be included in set as
     x-y; the above character classes can also be included in set;
  *  [^set] : complement of [set]; 
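
As a minimal sketch of how these classes combine, the following uses the extract() method
described later in this document to pull character runs out of a sample string. The
sample string and variable names are only illustrative, and the comments describe what
each pattern targets under the rules above rather than verified output.

```dao
var text = "id=42; name=Dao_7"

# %d+ : runs of one or more digits
var digits = text.extract( "%d+" )

# %w+ : runs of letters, digits and the character _
var words = text.extract( "%w+" )

# [%a_]+ : a set combining the class %a with the literal character _
var names = text.extract( "[%a_]+" )

# [^%s;=]+ : a complement set, runs free of white space, ; and =
var fields = text.extract( "[^%s;=]+" )

io.writeln( digits, words, names, fields )
```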


 0.3  Pattern Item 

A pattern item can be 
  *  a single character class;
  *  ^ : match at the beginning of a string;
  *  $ : match at the end of a string;
  *  %n : match n-th captured sub string; n can be one or more digits;
  *  {{verbatim}} : match verbatim text without escapes;
  *  %bxy : match a balanced pair of characters x and y; here balance means that,
     starting from the same matched position, the matched sub string should contain the
     same, and the minimum possible, number of x and y; the same as that in Lua;
  *  %B{pattern1}{pattern2} : match a balanced pair of patterns pattern1 and pattern2;
     here balance has the same meaning as in %bxy;

A pattern item e can be optionally skipped or matched repeatedly, as indicated by:
  *  e? : match zero times or once;
  *  e* : match zero or more times;
  *  e+ : match once or more;
  *  e{n} : match exactly n times;
  *  e{n,} : match at least n times;
  *  e{,n} : match at most n times;
  *  e{n,m} : match at least n times and at most m times; 
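
The repetition forms above can be tried out with a short sketch such as the following,
which relies on the fetch() method described later in this document. The sample string is
hypothetical, and the comments state the intent of each pattern rather than verified
output.

```dao
var S = "a1 b22 c333 d4444"

var r1 = S.fetch( "%a%d{3}" )    # a letter followed by exactly three digits
var r2 = S.fetch( "%d{2,}" )     # a run of at least two digits
var r3 = S.fetch( "%a%d{2,3}" )  # a letter followed by two or three digits
var r4 = S.fetch( "%a%d?" )      # a letter optionally followed by one digit

io.writeln( r1, r2, r3, r4 )
```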


 0.4  Grouping and Captures 

In a pattern, one or more pattern items can be grouped together by parentheses to form
sub patterns (groups). Alternative patterns in a group can be separated by |, and the
group can be optionally skipped if an empty alternative pattern is specified, as in
(|pattern) or (pattern|). When a string is matched to a pattern, the sub strings that
match the groups of sub patterns can be captured for other use. Captures are numbered
according to the order of their left parentheses. For example, in pattern
(%a+)%s*(%d+(%a+)), the first (%a+) has group number 1, (%d+(%a+)) has group number 2,
and the second (%a+) has group number 3. For convenience, the whole pattern has group
number 0.
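
The group numbering can be inspected with a small sketch like the one below, which feeds
the pattern from the paragraph above to the capture() and fetch() methods described later
in this document. The input string is an assumption chosen for illustration.

```dao
var S = "width 12px"
var groups = S.capture( "(%a+)%s*(%d+(%a+))" )

# Following the numbering rule above:
#   groups[0] : the whole match (group number 0)
#   groups[1] : the first (%a+)
#   groups[2] : (%d+(%a+))
#   groups[3] : the inner (%a+)
io.writeln( groups )

# fetch() can return a single group directly, here group number 3.
io.writeln( S.fetch( "(%a+)%s*(%d+(%a+))", 3 ) )
```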

In case there are multiple possible ways of matching a substring starting from the same
position, the matching length is calculated as the sum of the lengths of the sub-matches
of all groups (including group number 0) in the pattern, and the matching that gives the
maximum matching length is returned as the result. In this way, one can put a deeper
nesting of parentheses around a group if one wants that group to have higher priority in
matching. For example, when 1a2 is matched to pattern (%d%w*)(%w*%d), there are two
possible ways of matching, namely, 1a matching to (%d%w*) and 2 matching to (%w*%d), or
1 matching to (%d%w*) and a2 matching to (%w*%d). But if an extra pair of parentheses is
added to one of the groups, for example as (%d%w*)((%w*%d)), then the matching becomes
unique: it is the second way of matching, where the letter a is matched in the last
group.
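
The 1a2 example can be reproduced with a sketch along these lines, using the capture()
method described later in this document. No output is shown here; the point is only to
compare which group receives the letter a in each case.

```dao
var S = "1a2"

# Two equally long ways of matching exist for this pattern.
var plain = S.capture( "(%d%w*)(%w*%d)" )

# The extra parentheses add a nested group around (%w*%d), so the text it matches
# is counted once more in the matching length.
var nested = S.capture( "(%d%w*)((%w*%d))" )

io.writeln( plain )
io.writeln( nested )
```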

 0.5   String Matching Methods  

Like in Lua, the regular expression matching functionalities are accessed through various
string methods. The regular expression patterns are stored in strings, and passed to
these string methods. Each pattern string corresponds to an internal representation of a
regular expression, which is compiled from the pattern string the first time it is used.
Though strings that represent the same pattern can be passed multiple times to these
methods, they are compiled only once in one process (virtual machine process). So the
overhead of compiling a regular expression can normally be ignored.

The following methods are provided: 
     
     fetch( invar self: string, pattern: string, group = 0, start = 0, end = 0 )
         => string
     match( invar self: string, pattern: string, group = 0, start = 0, end = 0 )
         => tuple<start:int,end:int>|none
     change( invar self: string, pattern: string, target: string, index = 0, 
         start = 0, end = -1 ) => string
     capture( invar self: string, pattern: string, start = 0, end = 0 ) => list<string>
     extract( invar self: string, pattern: string, 
         mtype: enum<both,matched,unmatched> = $matched ) => list<string>
     scan( invar self: string, pattern: string, start = 0, end = 0 )
         [start: int, end: int, state: enum<unmatched,matched> => none|@V]
         => list<@V>
     


 0.5.1   fetch(invar self:string,pattern:string,group=0,start=0,end=0)=>string  
     
     fetch( invar self: string, pattern: string, group = 0, start = 0, end = 0 )
         => string
     
Fetch the substring that matches the "group"-th group of pattern "pattern".
Only the region between "start" (inclusive) and "end" (exclusive) is searched.
When the "end" parameter is not used explicitly, the region will range from "start" to
the end of the string.

Examples, 
     
     var S1 = "ABC123DEF456GHI"
     var S2 = S1.fetch( "%d+" )          # S2 = "123"
     var S3 = S1.fetch( "%d+(%a+)", 1 )  # S3 = "DEF"
     var S4 = S1.fetch( "%d+", 0, 6 )    # S4 = "456"
     


 0.5.2   match(invar self:string,pattern:string,group=0,start=0,end=0)=>...  
     
     match( invar self: string, pattern: string, group = 0, start = 0, end = 0 )
         => tuple<start:int,end:int>|none
     
Match part of this string to pattern "pattern".
If matched, the indexes of the first and the last byte of the matched substring will be
returned as a tuple. If not matched, "none" is returned.
Parameters "start" and "end" have the same meaning as in string::fetch().

Examples, 
     
     var S1 = "ABC123DEF(456)GHI"
     var M2 = S1.match( "%d+" )          # M2 = (start=3,end=5); substring: "123"
     var M3 = S1.match( "%b()" )         # M3 = (start=9,end=13); substring: "(456)"
     var M4 = S1.match( "%b{}" )         # M4 = none;
     var M5 = S1.match( "%d+(%a+)", 1 )  # M5 = (start=6,end=8); substring: "DEF"
     


 0.5.3   change(invar self:string,pat:string,tar:string,index=0,start=0,end=0)=>string  
     
     change( invar self: string, pattern: string, target: string, index = 0, 
         start = 0, end = 0 ) => string
     
Change the part(s) of the string that match pattern "pattern" to "target", and return a
new string.
The target string "target" can contain back references from pattern "pattern".
If "index" is zero, all matched parts are changed; otherwise, only the "index"-th match
is changed.
Parameters "start" and "end" have the same meaning as in string::fetch().

Examples, 
     
     var S1 = "ABC123DEF456GHI"
     var S2 = S1.change( "%d+", ";" )          # S2 = "ABC;DEF;GHI"
     var S3 = S1.change( "(%d+)", "<%1>", 1 )  # S3 = "ABC<123>DEF456GHI"
     


 0.5.4   capture(invar self:string,pattern:string,start=0,end=0)=>list<string>  
     
     capture( invar self: string, pattern: string, start = 0, end = 0 ) => list<string>
     
Match pattern "pattern" to the string, and capture all the substrings that match each
of the groups of "pattern". Return these substrings as a list, in which the i-th string
corresponds to the i-th pattern group.
Note that the pattern groups are indexed starting from one, and index zero is reserved
for the whole pattern.
Parameters "start" and "end" have the same meaning as in string::fetch().

Examples, 
     
     var S1 = "ABC123DEF456GHI"
     var L1 = S1.capture( "%d+" )       # L1 = { "123" }
     var L2 = S1.capture( "%d+(%a+)" )  # L2 = { "123DEF", "DEF" }
     


 0.5.5   extract(invar self:string,pattern:string,mtype:enum<...>=$matched)=>...  
     
     extract( invar self: string, pattern: string, 
         mtype: enum<both,matched,unmatched> = $matched ) => list<string>
     
Extract the substrings that match the pattern, the substrings between the matched ones,
or both, and return them as a list.

Examples, 
     
     var S1 = "ABC123DEF456GHI"
     var L1 = S1.extract( "%d+" )   # L1 = { "123", "456" }
     var L2 = S1.extract( "%d+", $unmatched )  # L2 = { "ABC", "DEF", "GHI" }
     
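
The three values of "mtype" can be compared side by side with a sketch like the following.
The $both mode is not covered by the examples above, so the order in which matched and
unmatched pieces are returned is left unstated here; printing the result is the simplest
way to check it.

```dao
var S1 = "ABC123DEF456GHI"

var matched   = S1.extract( "%d+", $matched )    # only the digit runs
var unmatched = S1.extract( "%d+", $unmatched )  # only the pieces between digit runs
var both      = S1.extract( "%d+", $both )       # matched and unmatched pieces together

io.writeln( matched )
io.writeln( unmatched )
io.writeln( both )
```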


 0.5.6   scan(invar self:string,pattern:string,start=0,end=0)[...]=>list<@V>  
     
     scan( invar self: string, pattern: string, start = 0, end = 0 )
         [start: int, end: int, state: enum<unmatched,matched> => none|@V]
         => list<@V>
     
Scan the string with pattern "pattern", and invoke the attached code section for each
matched substring and each substring between matches.
The start and end index, as well as the state of matching or not matching, can be passed
to the code section.
Parameters "start" and "end" have the same meaning as in string::fetch().

Examples, 
     
     var S1 = "ABC123DEF"
     S1.scan( "%d+" ) { [start, end, state]
         io.writeln( start, end, S1[start:end], state )
     }
     # Output:
     # 0 2 ABC $unmatched(0)
     # 3 5 123 $matched(1)
     # 6 8 DEF $unmatched(0)