Musings of an OS Plumber: KSH93 Extended Patterns

According to David Korn, a shell is primarily a string processing language. Pattern matching is an important component of any such language, and Korn Shell 93 (ksh93) has excellent support for extended patterns as well as regular expressions. Extended patterns can be thought of as a class of extended regular expressions. Both bash and zsh have something similar, but not as comprehensive. As usual, however, extended patterns are documented quite tersely in the ksh93 man page.

The purpose of this post is to explain, with some examples, how to use the power of extended patterns in your ksh93 scripts. It is assumed that you are reasonably familiar with basic regular expressions (BREs) as implemented in sed, grep or awk. If you need an introductory tutorial on regular expressions, there is one at IBM developerWorks.

The following table shows the basic pattern matching operators in ksh93.

?(pattern)	Match if found 0 or 1 times
*(pattern)	Match if found 0 or more times
+(pattern)	Match if found 1 or more times
@(pattern1|...)	Match if any one of the patterns is found
!(pattern)	Match anything that does not match the pattern

Note that an operator must precede a pattern in ksh93, whereas in egrep, sed and awk the operator is placed after the pattern.

Here is an example of how to use the above operators to modify the contents of the string str:

str="Joe Mike and Dave are all good friends"

print ${str//a?(re)/_}
# output: Joe Mike _nd D_ve _ _ll good friends

print ${str//g*(o)/_}
# output: Joe Mike and Dave are all _d friends

print ${str//+(o)/_}
# output: J_ Mike and Dave are all g_d friends

print ${str//@(Joe|Mike|Dave)/_}
# output: _ _ and _ are all good friends

print ${str//@(Joe|Mike|g*(o))/_}
# output: _ _ and Dave are all _d friends

print ${str//!(Joe)/_}
# output: _

print ${str//!(Joe|Mike|Dave)/_}
# output: _

In the above example, I have shown the expected output as a comment below each print statement. Here is another example which should further clarify your understanding of these pattern operators. Note particularly the output of the last print statement.

str='An extended pattern expression'

print ${str//e/#}
print ${str//[^e]/#}
print ${str//+(e)/#}
print ${str//-(e)/#}
print ${str//?(e)/#}
print ${str//*(e)/#}
print ${str//!(e)/#}

which produces the following output.

An #xt#nd#d patt#rn #xpr#ssion
###e##e##e######e###e###e#####
An #xt#nd#d patt#rn #xpr#ssion
An extended pattern expression
#A#n# #x#t#n#d#d# #p#a#t#t#r#n# #x#p#r#s#s#i#o#n#
#A#n# #x#t#n#d#d# #p#a#t#t#r#n# #x#p#r#s#s#i#o#n#
#
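The same operators are just as useful for testing a whole string inside [[ ... ]], where the pattern has to match the entire value rather than a substring. As a minimal sketch (the is_integer helper and the sample values are made up for illustration and are not part of the tested examples), here is an input check built from ?(...) and +(...):

```shell
# Hypothetical helper: succeeds if $1 is an optionally signed decimal integer.
function is_integer {
    # ?([-+]) allows at most one leading sign; +([0-9]) needs one or more digits.
    [[ $1 == ?([-+])+([0-9]) ]]
}

for candidate in 42 -7 +15 3.14 12abc
do
    if is_integer "$candidate"
    then
        printf '%s: integer\n' "$candidate"
    else
        printf '%s: not an integer\n' "$candidate"
    fi
done
```

Because the pattern must cover the whole string, 12abc is rejected even though it starts with digits.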

The following table shows the pattern matching interval quantifiers.

{n}(pattern)	Match if found exactly n times where n is a non-negative integer
{n,m}(pattern)	Match if found between n and m times where n and m are non-negative integers and n <= m

Here is an example of how to use the above interval quantifiers to match various strings. Every test succeeds, so none of the print $? fallbacks runs and only an empty line is printed.

print $(
   [[ aaaa == {4}(a) ]] || print $?
   [[ aaaa == {,4}(a) ]] || print $?
   [[ aaaa == {3,}(a) ]] || print $?
   [[ aaaa == {2,4}(a) ]] || print $?
   [[ abc == {1,4}(ab)c ]] || print $?
   [[ abcabc == {,2}(abc) ]] || print $?
   [[ abababcc == {1,4}(ab){1,2}(c) ]] || print $?
   [[ abc == {1,4}(ab){1,2}(c) ]] || print $?
   [[ abcdcdabcd == {3,6}(ab|cd) ]] || print $?
   [[ abcdcdabcde == {5}(ab|cd)e ]] || print $?
)
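Interval quantifiers are handy for checking fixed-width fields. The sketch below is an illustration rather than one of the tested examples: the variable names, the looks_like helper and the sample values are all hypothetical. It keeps each pattern in a variable and leaves that variable unquoted on the right-hand side of == so that ksh93 treats its value as a pattern.

```shell
# Hypothetical patterns; they check the shape of a value, not its range.
iso_date='{4}([0-9])-{2}([0-9])-{2}([0-9])'   # four digits, two digits, two digits
clock_time='{1,2}([0-9]):{2}([0-9])'          # one or two digits, colon, two digits

# looks_like STRING PATTERN: succeeds if the whole string matches the pattern.
function looks_like {
    [[ $1 == $2 ]]    # $2 is deliberately unquoted so it is used as a pattern
}

looks_like 2008-12-10 "$iso_date" && printf '%s\n' '2008-12-10 looks like a date'
looks_like 08-12-10 "$iso_date"   || printf '%s\n' '08-12-10 does not'
looks_like 9:05 "$clock_time"     && printf '%s\n' '9:05 looks like a time'
```

These patterns would still accept a value such as 2008-99-99; range checks need arithmetic on the individual fields.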

By default an extended pattern attempts to match the longest possible string consistent with generating the longest overall match. This is known as a greedy or maximal match. A non-greedy (or minimal) match is one that matches the shortest possible string. Perl was the first scripting language to popularize non-greedy matching. By the way, ksh93 and zsh are the only shells that support non-greedy matching.

You can use the '-' qualifier to tell the shell that you want non-greedy matching, as shown in the table below.

?-(pattern)	Shortest match if found 0 or 1 times
*-(pattern)	Shortest match if found 0 or more times
+-(pattern)	Shortest match if found 1 or more times
@-(pattern1|...)	Shortest match if any of the patterns found
{n,m}-(pattern)	Shortest match if found between n and m times

Alternatively, you can use the ~(-g) sub-pattern to tell ksh93 that you want non-greedy matching. The following examples show both methods.

str="bcdabdcbabcd"

print "    Greedy: ${str/+(*ab)/_}"
print "Non-greedy: ${str/+-(*ab)/_}"

str="heleelloo hello"

print "    Greedy: ${str//he*l/_}"
print "Non-greedy: ${str//~(-g)he*l/_}"
print "    Greedy: ${str//?(he*ll)/_}"
print "Non-greedy: ${str//~(-g)?(he*ll)/_}"
print "Non-greedy: ${str//?-(he*ll)/_}"
print "    Greedy: ${str//+(he*ll)/_}"
print "Non-greedy: ${str//+-(he*ll)/_}"
print "    Greedy: ${str//*(he*ll)/_}"
print "Non-greedy: ${str//*-(he*ll)/_}"
print "    Greedy: ${str//{1,2}(he*ll)/_}"
print "Non-greedy: ${str//~(-g){1,2}(he*l)/_}"
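A common practical use of the '-' qualifier is removing delimited pieces of text one at a time. The following is a sketch only: the markup string and the variable names are invented for illustration, and it was not part of the tested examples. It contrasts *(?) with *-(?) when stripping angle-bracket tags.

```shell
markup='<b>bold</b> and <i>italic</i>'

# Greedy: *(?) takes as many characters as it can, so a single match
# runs from the first < to the last > in the string.
greedy="${markup//<*(?)>/}"

# Non-greedy: *-(?) takes as few characters as it can, so the match
# ends at the nearest > and every tag is removed separately.
lazy="${markup//<*-(?)>/}"

printf 'greedy: [%s]\n' "$greedy"
printf 'lazy:   [%s]\n' "$lazy"
```

The intent is that the greedy form removes everything while the non-greedy form leaves the text between the tags in place.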

A pattern-list is a list of one or more patterns separated from each other by either a & or a |. A & (denoting logical AND) means that all patterns must be matched whereas | (denoting logical OR) means that only one pattern need be matched. Composite patterns can also be created as shown below.

?(pattern-list)	Optionally matches any one of the patterns
*(pattern-list)	Matches zero or more occurrences of the patterns.
+(pattern-list)	Matches one or more occurrences of the patterns.
{n}(pattern-list)	Matches exactly n occurrences of the patterns.
{m,n}(pattern-list)	Matches m to n occurrences of the patterns. If m is omitted, 0 is used. If n is omitted at least m occurrences are matched.
@(pattern-list)	Matches exactly one of the patterns.
!(pattern-list)	Matches anything except one of the patterns.

Again, by default, matching is greedy. Each pattern in the pattern-list attempts to match the longest string possible consistent with generating the longest overall match. If more than one match is possible, the match starting closest to the beginning of the string will be chosen. However, for each of the above compound patterns a - can be inserted in front of the ( to specify that the shortest match to the specified pattern-list should be used.
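To make the difference between | and & concrete, here is a small sketch. The pattern variables, the matches helper and the sample words are hypothetical and serve only as an illustration; the patterns are held in variables and expanded unquoted so that ksh93 treats them as patterns.

```shell
# Hypothetical pattern-lists showing OR, AND and negation.
either='@(*.txt|*.log)'                  # ends in .txt OR ends in .log
both='@(*[[:alpha:]]*&*[[:digit:]]*)'    # contains a letter AND contains a digit
neither='!(*.txt|*.log)'                 # matches neither of the two patterns

# matches STRING PATTERN: succeeds if the whole string matches the pattern.
function matches { [[ $1 == $2 ]]; }

for word in report.txt build7 notes
do
    if matches "$word" "$either";  then printf '%s: ends in .txt or .log\n' "$word"; fi
    if matches "$word" "$both";    then printf '%s: has a letter and a digit\n' "$word"; fi
    if matches "$word" "$neither"; then printf '%s: is not a .txt or .log name\n' "$word"; fi
done
```

With &, every pattern in the list is tested against the same string, which is why build7 satisfies the second pattern and notes does not.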

Finer grained control of extended pattern matching is possible using sub-patterns of the form ~(options:pattern-list) where :pattern-list is optional and options consists of one or more of the following option flags:

+	Enable following options (default)
-	Disable following options
E	Remainder of the pattern uses ERE pattern syntax
F	Remainder of the pattern uses fgrep-like pattern syntax
G	Remainder of the pattern uses BRE pattern syntax
K	Remainder of the pattern uses ksh93 pattern syntax (default)
i	Case-insensitive match
g	Greedy match (default)
l	Left-anchor the pattern
r	Right-anchor the pattern

If both options and :pattern-list are specified, then the specified options apply only to :pattern-list. Otherwise, the specified options remain in effect until disabled by a subsequent ~(...) sub-pattern or at the end of the sub-pattern containing ~(...).
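As a brief illustration of two of these flags, the sketch below uses ~(i) for a case-insensitive test and ~(E) to switch the rest of the pattern to ERE syntax. The variable names, the matches helper and the sample values are assumptions made for this example, which was not part of the tested set.

```shell
# Hypothetical patterns using ~(options) sub-patterns.
yes_word='~(i)@(y|yes)'               # i: ignore case for the rest of the pattern
decimal='~(E)^[0-9]+([.][0-9]+)?$'    # E: the rest is an ERE, anchored explicitly

# matches STRING PATTERN: succeeds if the string matches the pattern.
function matches { [[ $1 == $2 ]]; }

matches YES "$yes_word" && printf '%s\n' 'YES accepted'
matches 3.14 "$decimal" && printf '%s\n' '3.14 is a decimal number'
matches 3. "$decimal"   || printf '%s\n' '3. rejected'
```

Here the options are given without a :pattern-list, so they stay in effect for the remainder of each pattern.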

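ksh93 provides a way to translate extended patterns into regular expressions and vice versa by means of two printf format specifiers: %R converts a shell pattern to an extended regular expression, and %P converts an extended regular expression to a shell pattern.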

$ printf "%R\n" '*[!0-9]*'
[^0-9]
$ printf "%P\n" "([0-9]+\.){3}"
*{3}(+([0-9])\.)*
$

I hope that I have inspired you to go away and experiment on your own with some of the more advanced features of extended patterns in ksh93. Once you master the syntax, you can significantly reduce the need for your scripts to invoke external utilities such as sed or awk simply to parse text strings.

Enjoy!

P.S. All the examples included in this post were tested on ksh93t+ 12/10/2008.
