[svlug] extended grep reg exp

Mark S Bilk mark at cosmicpenguin.com
Tue Aug 12 13:43:10 PDT 2003


In-Reply-To: <IKEJKFOFPOCBCEMIGDBIMEGHFEAA.rhxk at earthlink.net>; from rhxk at earthlink.net on Tue, Aug 12, 2003 at 01:21:03PM -0700
Organization: http://www.cosmicpenguin.com/911

On Tue, Aug 12, 2003 at 01:21:03PM -0700, Robert Khachikyan wrote:
>I've read the doc for grep extensively and google searched
>it and still couldn't find what i was looking for...on top
>of that, i left my reg exp book @ home....so here it is.
>
>I have a big file that has
>
>3918400 bla bla bla
>3918401 bla bla bla
>3918402 bla bla bla
>3918403 bla bla bla
>3918404 bla bla bla
>...
>3945785 bla bla bla
>3945786 bla bla bla
>3945787 bla bla bla
>...
>
>you get the idea. I want to grep a portion of it out.
>Let's say from 3918403 -> 3928404 (10001 lines).

If the lines are truly numbered like that, sequentially and
with no gaps, you could use head and tail to pick out a desired
span.  You can even calculate the beginning and ending line
numbers (positions of lines in the file, not the numbers in 
the text) in a script.  $((  )) evaluates an arithmetic expression
in bash.  
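
A minimal sketch of the head/tail idea, assuming the file really is
numbered sequentially with no gaps.  The temporary file, the variable
names and the small span are made up for illustration; substitute the
real file and numbers.

```shell
# Build a small stand-in for the big file (hypothetical sample data).
sample=$(mktemp)
awk 'BEGIN { for (i = 3918400; i <= 3918420; i++) print i, "bla bla bla" }' > "$sample"

first_label=3918400   # number printed on the first line of the file
want_from=3918403     # first number wanted
want_to=3918407       # last number wanted

# Convert the numbers in the text into line positions in the file.
start=$(( want_from - first_label + 1 ))
count=$(( want_to - want_from + 1 ))

# tail skips to the starting position, head keeps the wanted count.
span=$(tail -n +"$start" "$sample" | head -n "$count")
printf '%s\n' "$span"

rm -f "$sample"
```

For the span in the question, want_to would be 3928404 and count
would come out as 10002, since both ends are included.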

>To my knowledge, grep's regular expression works with
>searching for the last character of the string(*[0-9]).
>This would return only 10 lines...what if I want to
>do a crazy grep like this?
>
>i thought 'egrep -E 39[18403-28404] file' would do, but
>it comes back with no match...

I don't recall any regexp for a numerical span.  The square brackets
and dash denote a span of single ASCII characters -- [b-e] matches 
b, c, d, or e.  I've included below a good help file for sed; it's worth 
keeping.  grep uses the same basic regexp language as sed; egrep uses the
extended syntax, in which +, ?, |, ( ) and { } are written without the
backslash.
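
To illustrate the point about brackets, here is a rough sketch.  The
first command shows [b-e] acting on single characters.  The second
swaps in awk, which is not mentioned above, because it can compare the
first field as a number; the sample data is made up.

```shell
# [b-e] is a set of single characters, not a numeric span.
letters=$(printf 'a\nc\nf\n' | grep '[b-e]')
printf '%s\n' "$letters"

# Numeric comparison on the first field (awk is an assumption of this
# sketch, not something the help file covers).
picked=$(awk 'BEGIN { for (i = 3918400; i <= 3918410; i++) print i, "bla bla bla" }' |
         awk '$1 >= 3918403 && $1 <= 3918406')
printf '%s\n' "$picked"
```

Unlike head and tail, the numeric test does not care whether the
numbering has gaps.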

*********************************************************************

Regular expressions
  -------------------

To know how to use sed, people should understand regular expressions 
(RE for short).

This is a brief résumé of regular expressions used in SED.

c       a single char, if not special, is matched against text.

*       matches a sequence of zero or more repetitions of previous char,
        grouped RE, or class.

\+      as *, but matches one or more.

\?      as *, but only matches zero or one.

\{i\}   as *, but matches exactly <i> sequences (a number, between
        0 and some limit -- in Henry Spencer's regexp(3) library,
        this limit is 255)

\{i,j\} matches between <i> and <j>, inclusive, sequences.

\{i,\}  matches more than or equal to <i> sequences.

\{,j\}  matches at most (or equal) <j> sequences.

\(RE\)  groups RE as a whole, this is used to: 

                - apply postfix operators, like `\(abcd\)*'
                  this will search for zero or more whole
                  sequences of "abcd", if `abcd*', it would
                  search for "abc" followed by zero or more "d"s

                - use back references (see below)

.       match any character 

^       match the null string at beginning of line, i.e. what
        appears after ^ must appear at the beginning of line

        e.g. `^#include' will match only lines where "#include" is
        the first thing on line, but if there are one or two spaces
        before, the match fails

$       the same as ^, but refers to end of line

\c      matches character `c' -- used to match special chars, 
        referred above (and some more below)

[list]  matches any single char in list. e.g. `[aeiou]' matches 
        all vowels

[^list] matches any single char NOT in list

        a list may be composed by <char1>-<char2>, and means
        all chars between (inclusive) <char1> and <char2>

        to include `]' in the list, make it the first char
        to include `-' in the list, make it the first or last

RE1\|RE2
        matches RE1 or RE2

\1 \2 \3 \4 \5 \6 \7 \8 \9, => \i
        matches the <i>th \(\) reference on RE, this is called
        back reference, and usually it is (very) slow
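
A few of the operators above can be tried with grep, which accepts the
same basic syntax.  This is only a sketch on made-up input; -x asks
grep to match the whole line, so a partial match does not count.

```shell
# \(ab\)* applies * to the whole group; ab* applies it to "b" only.
grouped=$(printf 'abab\nabbb\n' | grep -x '\(ab\)*')   # abab
single=$(printf 'abab\nabbb\n' | grep -x 'ab*')        # abbb

# \{3\} demands exactly three repetitions of the previous item.
three=$(printf 'aa\naaa\naaaa\n' | grep -x 'a\{3\}')   # aaa

# \1 must repeat whatever the first group matched.
doubled=$(printf 'abab\nabba\n' | grep -x '\(ab\)\1')  # abab

printf '%s\n' "$grouped" "$single" "$three" "$doubled"
```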

Notes:
  ------
        - some implementations of sed, may not have all REs mentioned,
          notably `\+', `\?' and `\|'

        - the RE is greedy, i.e. the match that starts earliest in the
          text is selected, and if two or more matches start at the
          same place, the longest of them is selected
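
Both notes can be seen in a short sketch.  The inputs are invented,
and the interval forms are offered as stand-ins for `\+' and `\?' on
the assumption that the sed at hand supports \{i,j\}.

```shell
# Greedy: <.*> runs from the first "<" to the LAST ">" on the line.
greedy=$(printf '%s\n' 'a<b>c<d>e' | sed 's/<.*>/X/')            # aXe

# \{1,\} can stand in for \+ where \+ is not available.
plus=$(printf '%s\n' 'aaab' | sed 's/a\{1,\}/X/')                # Xb

# \{0,1\} can stand in for \? in the same way.
opt=$(printf '%s\n' 'color colour' | sed 's/colou\{0,1\}r/C/g')  # C C

printf '%s\n' "$greedy" "$plus" "$opt"
```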

Examples:
  ---------

        `abcdef'        matches "abcdef"
        `a*b'           matches zero or more "a"s followed by a single "b",
                          like "b" or "aaaaaab"
        `a\?b'          matches "b" or "ab"
        `a\+b\+'        matches one or more "a"s followed by one or more
                          "b"s, the minimum match will be "ab", but
                          "aaaab" or "abbbbb" or "aaaaaabbbbbbb" also
                          match
        `.*'            all chars on line, of all lines (including empty
                          ones)
        `.\+'           all chars on line, but only on lines containing
                          at least one char (i.e. empty lines will not
                          be matched)

        `^main.*(.*)'   search for a line containing "main" as the first
                          thing on the line, that line must also
                          contain an opening and closing parenthesis
                          being the open paren preceded and followed
                          by any number of chars (including none)

        `^#'            all lines beginning with "#" (shell and 
                        make comments)

        `\\$'           all lines ending with a single `\' (there are
                          two for escaping `\') -- line continuation
                          in C and make, and shell, etc...

        `[a-zA-Z_]'     any single letter or an underscore

        `[^     ]\+'    (a tab and a space) -- one or more sequences
                          of any char that isn't a space or tab.
                          Usually this means a word

        `^.*A.*$'       match a whole line that contains an "A" anywhere
                          on it

        `A.\{9\}$'      match an "A" that is exactly the tenth character
                          from the end of the line

        `^.\{,15\}A'    match up to the last "A" within the first 16
                          chars of the line
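
Several of these examples can be checked by counting matching lines
with grep -c.  The sample file below is invented for the purpose, and
`.\{1,\}' is used in place of `.\+' so the sketch does not depend on
an optional operator.

```shell
# Hypothetical sample: six lines, one of them empty.
sample=$(mktemp)
printf '%s\n' '#include <stdio.h>' '  #include <x.h>' \
    'main(argc, argv)' '' 'foo \' 'xxA123456789' > "$sample"

inc=$(grep -c '^#include' "$sample")      # indented #include is skipped
mainfn=$(grep -c '^main.*(.*)' "$sample") # "main" first, then ( ... )
cont=$(grep -c '\\$' "$sample")           # lines ending in a backslash
nonempty=$(grep -c '.\{1,\}' "$sample")   # lines with at least one char
tenth=$(grep -c 'A.\{9\}$' "$sample")     # "A" tenth from the end

echo "$inc $mainfn $cont $nonempty $tenth"   # 1 1 1 5 1
rm -f "$sample"
```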


  ========================================================================

Substitution
  ------------

This command is so often used that it deserves a whole section!

(2)s/RE/<replacement>/[flags] -- (s)ubstitute, substitute

        - on specified lines, text matched by RE, if any, is replaced
          by <replacement>

        - if replacement is done, the flag that permits the `test' 
          command to be performed is set (more about this on
          `t' command)

        - the `/' separator, in fact could be ANY character. Usually
          it is `/' due to the fact that almost every program with
          regular expressions can use it. Exceptions are
          grep and lex, that don't use any char as a delimiter.

        - <replacement> is raw text. The only exceptions are: 

                &       it is replaced by all text matched by RE
                        Being so, then
                                s/RE/&/ 
                        is a null op, whatever the RE, except for
                        setting the test flag

                \d      where `d' is a digit (see below for more),
                        is replaced by the d-th grouped \(\) sub-RE

                        some implementations of sed (more precisely, 
                        some implementations of regex(3) library, that
                        some implementations of sed use), limit `d'
                        to be a single digit (1-9). Others, such as GNU
                        sed (2.05 at least) accept a valid number.

                        GNU sed also accepts and understands `\0'
                        as a `&'. i.e. the whole matched RE.
                        I don't know if this behavior is standard.

                        If there isn't a d-th grouped \(\), then
                        \d is replaced by the null string.

                \c      where `c' is any char except digits, quote `c'

                Note that besides the above, _all_ other text is raw,
                so `\n' or `\t' doesn't work as one might expect. To 
                insert a newline for instance, one must do

                        s/foo/bar-on-this-line\
                        foo-on-next/

        - <flags> are optional, and can be combined

                g       replace all occurrences of RE by <replacement>
                        (the default is to replace only the first)

                p       write the pattern space only if the substitution
                        was successful

                w <file>
                        work as `p' flag, but the pattern space is written
                        to <file>

                d       where `d' is a digit, replace the d-th occurrence,
                        if any, of RE by <replacement>
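
The points above can be tried out with a few one-liners.  This is a
sketch on made-up input, not part of the help file; the variable names
are only for illustration.

```shell
# & stands for the whole matched text.
amp=$(printf '%s\n' 'hello world' | sed 's/world/[&]/')      # hello [world]

# \1 and \2 refer to the first and second \(...\) groups.
swap=$(printf '%s\n' 'john smith' |
       sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/')                  # smith john

# Without g only the first match changes; with g, all of them.
first=$(printf '%s\n' 'a-a-a' | sed 's/a/X/')                # X-a-a
all=$(printf '%s\n' 'a-a-a' | sed 's/a/X/g')                 # X-X-X

# A numeric flag picks one occurrence.
second=$(printf '%s\n' 'a-a-a' | sed 's/a/X/2')              # a-X-a

# Any character can serve as the separator.
path=$(printf '%s\n' '/usr/bin' | sed 's|/usr|/opt|')        # /opt/bin

# With -n, the p flag prints only lines where a substitution happened.
hits=$(printf '%s\n' 'foo' 'bar' | sed -n 's/foo/baz/p')     # baz

# A newline in the replacement: backslash, then a real line break.
two=$(printf '%s\n' 'foo' | sed 's/foo/bar\
baz/')

printf '%s\n' "$amp" "$swap" "$first" "$all" "$second" "$path" "$hits" "$two"
```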


*********************************************************************



