Binary Regular Expression Matching (bregexp)

cthulhu@diku.dk (Stefan Krabbe)
Mon, 20 Mar 1995 05:58:32 GMT

          From comp.compilers


Newsgroups: comp.compilers
From: cthulhu@diku.dk (Stefan Krabbe)
Summary: Where can I find a binary pattern matching program
Keywords: DFA, question, lex
Organization: Department of Computer Science, U of Copenhagen
Date: Mon, 20 Mar 1995 05:58:32 GMT

Question: Do you know of a regexp function that works on binary files,
preferably written in C, that I can grab? I'm looking for something
almost like the library function regexp(3) from 4.3 BSD, or
regcomp(3C) from HP-UX.


I'm looking for these features in the regexp() function:


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
1-It must be able to search binary data! A binary regexp-string could be


              "\377\373." (note: I speak C here. \377 is octal and equal to 255)


    meaning: match byte == 255, followed by byte == 251, followed by any byte.
    It would be nice if it could match the '\0' byte too, like:


              "asd\000asd".


2-I'd like to be able to specify my own whitespace (record separator).
    That way I can make it a normal text-regexp if I set whitespace to newline.
    It should also be possible to set whitespace to NOWHITESPACE, i.e.
    the regexp-function would have to look through an entire binary
    file, if that was what I wanted.
    Usually the strings/files that must be searched for a match
    will be about 3-300 bytes long, but it would be nice if I could specify
    a maximum match-length, especially when whitespace (record separators) can
    be turned off.


3-I'd like it to be possible to include subexpressions in a regexp.
    Let's say that a subexpression must be enclosed between the character
    pairs \( and \), like in sed.
    Example:


              regexp =
              "[mM]y name is \([A-Za-z]*\) and I need a bregexp."


4-If I get a match, I'd like to know the offsets to the start of the
    match and the end of the match. I'd also like offsets to
    subexpression-matches (see below).


    Example:
    If the text (in this case it's not binary) we try to match is:


              "Hello there, my name is Stefan and I need a bregexp."


    and we use the above regexp, then I'd like the first offsets to be


              "Hello there, my name is Stefan and I need a bregexp."
                            ^                                     ^
                        start                                     end


    and the second offsets (for the subexpression) to be


              "Hello there, my name is Stefan and I need a bregexp."
                                       ^    ^
                                   start    end
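
For comparison, features 3 and 4 are roughly what the POSIX regcomp()/regexec() pair already gives for ordinary text. The sketch below assumes a POSIX <regex.h> is available; match_offsets is a hypothetical helper name, not part of any library. It compiles the pattern as a basic regular expression (so \( and \) group, as in sed) and reports offsets through regmatch_t. Note that this interface takes NUL-terminated strings, so it does not cover feature 1.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: match `text` against the basic regular
 * expression `pattern`. On a match, m[0] holds the offsets of the
 * whole match and m[1..nmatch-1] those of the \( \) subexpressions.
 * Returns 0 on match, 1 on no match, -1 if the pattern is invalid.
 */
int match_offsets(const char *pattern, const char *text,
                  regmatch_t *m, size_t nmatch)
{
    regex_t re;
    int rc;

    /* cflags == 0 selects basic regular expressions. */
    if (regcomp(&re, pattern, 0) != 0)
        return -1;

    rc = regexec(&re, text, nmatch, m, 0);
    regfree(&re);
    return rc == 0 ? 0 : 1;
}
```

Called with the regexp and text from the example and nmatch set to 2, m[0] runs from offset 13 to 52 and m[1] from 24 to 30. rm_so is the offset of the first matched byte and rm_eo is the offset one past the last matched byte.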


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
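
To make feature 1 concrete: the usual C string interfaces stop at the first '\0', so both the pattern and the data need explicit lengths. The following is only a minimal sketch of that idea, handling literal bytes and an any-byte wildcard, not a full regexp engine. The names find_bytes and ANY_BYTE are invented for illustration; the pattern is an array of ints so that the wildcard cannot collide with a real byte value such as '.'.

```c
#include <stddef.h>

#define ANY_BYTE (-1)   /* hypothetical wildcard marker, plays the role of '.' */

/*
 * Search buf[0..blen) for the first place where pat[0..plen) matches.
 * Each pattern element is either a byte value 0..255 or ANY_BYTE.
 * Lengths are passed explicitly, so '\0' is just another byte.
 * Returns the offset of the match start, or -1 if there is none.
 */
long find_bytes(const unsigned char *buf, size_t blen,
                const int *pat, size_t plen)
{
    size_t i, j;

    for (i = 0; i + plen <= blen; i++) {
        for (j = 0; j < plen; j++) {
            if (pat[j] != ANY_BYTE && buf[i + j] != (unsigned char)pat[j])
                break;          /* mismatch at this start offset */
        }
        if (j == plen)
            return (long)i;     /* every pattern element matched */
    }
    return -1;
}
```

With this representation, "\377\373." becomes the three elements { 0377, 0373, ANY_BYTE }, and "asd\000asd" becomes seven literal elements with a 0 in the middle. The end of the match is the returned offset plus plen.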
Well, that's it. I hope you can tell me where it is. Someone
must have made it already... no?


Best Regards
    Stefan - cthulhu@diku.dk
--

