Prev - DUP - Next - TOC

Regular expressions


I'll try to make a more interesting puzzle. This time, we test whether or not a string matches another string, called a pattern.

To make patterns useful, we give some characters a special meaning inside them. The following are the special characters.

    [ ]     range specification (e.g., [a-z] means a letter
            in the range from a to z)
    \w      letter, digit, or underscore; same as [0-9A-Za-z_]
    \W      any character not matched by \w
    \s      blank character; same as [ \t\n\r\f]
    \S      non-space character
    \d      digit character; same as [0-9]
    \D      non-digit character
    \b      word boundary (outside of range specification)
    \B      non-word boundary
    \b      backspace (0x08) (inside of range specification)
    *       zero or more repetitions of the preceding expression
    +       one or more repetitions of the preceding expression
    {m,n}   at least m, but not more than n, repetitions
            of the preceding expression
    ?       at least 0, but not more than 1, repetition
            of the preceding expression
    |       either the preceding or the following expression
    ( )     grouping
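
To get a feel for the repetition characters in the table, here is a small sketch. The helper name `matched_part' is only an illustration (it is not part of ruby); it relies on the fact that indexing a string with a regular expression returns the first matching substring, or nil when nothing matches.

```ruby
# Hypothetical helper: return the part of str that the pattern
# matches first, or nil when there is no match.
def matched_part(str, pattern)
  str[pattern]
end

p matched_part("fooo", /fo*/)        # "fooo"   (* takes zero or more o's)
p matched_part("f", /fo+/)           # nil      (+ needs at least one o)
p matched_part("fooo", /fo?/)        # "fo"     (? takes at most one o)
p matched_part("foooo", /fo{2,3}/)   # "fooo"   ({2,3} takes two or three o's)
p matched_part("hotdog", /cat|dog/)  # "dog"    (| accepts either side)
p matched_part("ababab!", /(ab)+/)   # "ababab" (( ) groups before repeating)
p matched_part("x = 42", /\d+/)      # "42"
```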

For example, `^f[a-z]+' means "an `f' at the start of a line, followed by one or more letters in the range from `a' to `z'". Patterns built from special characters like these are called `regular expressions'. Regular expressions are useful for finding strings, so they are used very often in the UNIX environment. A typical example is `grep'.
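
As a rough idea of what a grep-like tool does with such a pattern, the sketch below keeps only the strings that match `^f[a-z]+'. The helper name `grep_lines' and the sample words are made up for this illustration.

```ruby
# Hypothetical grep-like helper: keep only the lines that match.
def grep_lines(lines, pattern)
  lines.select { |line| line =~ pattern }
end

words = ["foo", "bar", "fizz", "f", "Foo", "f00", "afoo"]

# "f" has no letter after it, "Foo" starts with a capital,
# "f00" has digits after the f, and "afoo" does not start with f.
p grep_lines(words, /^f[a-z]+/)   # ["foo", "fizz"]
```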

To understand regular expressions, let's make a little program. Store the following program in a file named `regx.rb' and then execute it.
Note: This program works only on UNIX because it uses reverse video escape sequences.

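 st = "\033[7m"   # escape sequence: start reverse video
 en = "\033[m"    # escape sequence: back to normal

 while true
   print "str> "
   STDOUT.flush
   str = gets
   break if not str
   str.chomp!
   print "pat> "
   STDOUT.flush
   re = gets
   break if not re
   re.chomp!
   # Compile the pattern text; a plain string would be matched literally.
   str.gsub! Regexp.new(re), "#{st}\\&#{en}"
   print str, "\n"
 end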
 print "\n"

This program asks for two inputs, a string and then a pattern, and shows the parts of the string that match the pattern in reverse video. Don't mind the details now; they will be explained later.

 str> foobar
 pat> ^fo+
 foobar
 ~~~

# `foo' is shown in reverse video; the ``~~~'' is just for text-based browsers.
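
If your terminal does not understand reverse video, the same idea can be tried with plain brackets. This is only a sketch: the name `mark_matches' is invented here, and it assumes the pattern arrives as text and is compiled with Regexp.new.

```ruby
# Hypothetical variant of the program above: wrap every match in
# square brackets instead of switching to reverse video.
def mark_matches(str, pat)
  # Regexp.new turns the pattern text into a regular expression;
  # the block receives each matched substring in turn.
  str.gsub(Regexp.new(pat)) { |m| "[#{m}]" }
end

puts mark_matches("foobar", "^fo+")           # [foo]bar
puts mark_matches("abc012dbcd555", "\\d+")    # abc[012]dbcd[555]
```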

Let's try several inputs.

 str> abc012dbcd555
 pat> \d
 abc012dbcd555
    ~~~    ~~~

This program detects multiple matches.
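
Strictly speaking, `\d' matches one digit at a time, so the two highlighted runs above are six separate matches sitting side by side. A quick way to see this is to collect every match into an array; the helper name `all_matches' is just for this sketch.

```ruby
# Hypothetical helper: collect every match of the pattern in str.
def all_matches(str, pattern)
  str.scan(pattern)
end

p all_matches("abc012dbcd555", /\d/)    # ["0", "1", "2", "5", "5", "5"]
p all_matches("abc012dbcd555", /\d+/)   # ["012", "555"]
```

With `\d+' each run of digits becomes a single match.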

 str> foozboozer
 pat> f.*z
 foozboozer
 ~~~~~~~~

Here `.' stands for any single character. `fooz' alone isn't matched but `foozbooz' is, since a regular expression matches the longest possible substring.
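
To check where such a match starts and ends, one can ask the match result for its offsets. The helper `match_span' below is a made-up name for this sketch; it returns the matched text with its start position and the position just past its end.

```ruby
# Hypothetical helper: return [matched text, start, end] for the
# first match, or nil when the pattern does not match at all.
def match_span(str, pattern)
  m = pattern.match(str)
  m && [m[0], m.begin(0), m.end(0)]
end

# The match starts at the first f and runs on to the last z it can reach.
p match_span("foozboozer", /f.*z/)   # ["foozbooz", 0, 8]

# The match always begins at the leftmost position that works.
p match_span("foozboozer", /o+z/)    # ["ooz", 1, 4]
```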

The next pattern is too difficult to recognize at a glance.

 str> Wed Feb  7 08:58:04 JST 1996
 pat> [0-9]+:[0-9]+(:[0-9]+)?
 Wed Feb  7 08:58:04 JST 1996
            ~~~~~~~~
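
Read piece by piece, the pattern says: some digits, a colon, some digits, and then optionally another colon with digits. The sketch below pulls the time out of a string with it; the constant `TIME_PATTERN', the helper `extract_time', and the second and third sample strings are invented for this illustration.

```ruby
# The pattern from the example above: hours, minutes, optional seconds.
TIME_PATTERN = /[0-9]+:[0-9]+(:[0-9]+)?/

# Hypothetical helper: return the first time-like substring, or nil.
def extract_time(str)
  str[TIME_PATTERN]
end

p extract_time("Wed Feb  7 08:58:04 JST 1996")   # "08:58:04"
p extract_time("Meet at 9:30 tomorrow")          # "9:30"
p extract_time("no clock here")                  # nil
```

The second call still matches because the `( )?' part may be absent.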

In ruby, a regular expression is written between slashes (`/'). Also, some methods, such as `match', convert a string into a regular expression automatically.

 ruby> "abcdef" =~ /d/
 3
 ruby> "abcdef".match("d")
 #<MatchData "d">
 ruby> "aaaaaa" =~ /d/
 nil
 ruby> "aaaaaa".match("d")
 nil

`=~' is the matching operator for regular expressions; it returns the position of the match, or nil when nothing matches.
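
Because a position counts as true and a failed match gives nil in current ruby, `=~' fits naturally into a condition. A minimal sketch, with `describe_match' as a made-up helper name:

```ruby
# Hypothetical helper: report where the pattern matches, if anywhere.
def describe_match(str, pattern)
  pos = (str =~ pattern)
  if pos
    "matched at position #{pos}"
  else
    "no match"
  end
end

puts describe_match("abcdef", /d/)   # matched at position 3
puts describe_match("aaaaaa", /d/)   # no match
```

Note that position 0 also counts as true in ruby, so a match at the very start of the string is reported correctly.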

