Re: Precedence Rules for '$' and '^'

Chris F Clark <cfc@shell01.TheWorld.com>
Mon, 17 Sep 2007 00:22:39 -0400

          From comp.compilers

Related articles
Precedence Rules for '$' and '^' jamin.hanson@googlemail.com (2007-09-12)
Re: Precedence Rules for '$' and '^' jo@durchholz.org (Joachim Durchholz) (2007-09-13)
Re: Precedence Rules for '$' and '^' jo@durchholz.org (Joachim Durchholz) (2007-09-13)
Re: Precedence Rules for '$' and '^' jamin.hanson@googlemail.com (2007-09-14)
Re: Precedence Rules for '$' and '^' rsc@swtch.com (Russ Cox) (2007-09-14)
Re: Precedence Rules for '$' and '^' jo@durchholz.org (Joachim Durchholz) (2007-09-15)
Re: Precedence Rules for '$' and '^' cfc@shell01.TheWorld.com (Chris F Clark) (2007-09-17)
Re: Precedence Rules for '$' and '^' jamin.hanson@googlemail.com (2007-09-17)

From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: Mon, 17 Sep 2007 00:22:39 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 07-09-035 07-09-056
Keywords: lex
Posted-Date: 18 Sep 2007 08:09:42 EDT

I doubt that there is a completely general consensus on what the ^ and
$ meta-characters do. Moreover, if one uses regular expressions for
searching (or for matching data with an internal record structure),
the question of what counts as a record boundary comes more into play.


However, I think that understanding what Perl does is a good starting
point. One important factor to consider is that Perl describes them
as applying at "record" boundaries, where the normal record is
synonymous with a "line", meaning that the newline character(s) (which
is its own can of worms) is the boundary marker. However, one can use
Perl with other definitions of records. I'm not sure how far Perl
takes the concept of records, but I do know that it is more than just
lines in a file. I don't recall off-hand whether there are ways to
compare a variable holding a bag-of-records to a pattern. However,
given the nature of Perl, I'd be more surprised if you couldn't do it
than if you could.


So, what "should" ^ mean? It should match the start of a "record".
For a text file composed of lines, it means the start of the file, and
also after each occurrence of a newline. In a CSV (comma separated
value) file, it should be the start of the file, and after each
comma. In a file of fixed length records, it should be the start of
the file and each offset that is a multiple of the record length (note,
there will be no character in the stream marking the boundary in that
case). Similarly, if one has a file of variable length records, where
each record contains a specific length field, you figure out where
your record boundaries are and make the ^ point to the start of each
record--and there may be no "character" explicitly representing the
end of one record and start of the next. For example, in network
protocols, a packet boundary (which has no character representing it
at all) is generally considered a record boundary. If that is your
problem domain, you probably want your meanings of ^ and $ to include
packet boundaries.


Similarly, the $ represents the end of a "record". In the text file
case, this is the position just before the newline. In the CSV case,
the position just before the comma. And so forth. The end of
file is also usually considered the end of a record.
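
To make the record view above concrete, here is a rough Python sketch
that computes where ^ and $ would land for a few record structures.
The function names, the sample layouts, and the 1-byte length field
are all assumptions made for illustration, not anything Perl or lex
defines.

```python
def separated_records(data, sep="\n"):
    """Return (start, end) offsets of records delimited by `sep`.

    `start` is where ^ would match and `end` is where $ would match
    under the record-oriented reading; the separator itself belongs
    to neither record.
    """
    spans = []
    start = 0
    while True:
        idx = data.find(sep, start)
        if idx == -1:
            # End of file also ends the last record.
            spans.append((start, len(data)))
            return spans
        spans.append((start, idx))
        start = idx + len(sep)


def fixed_records(data, size):
    """Fixed-length records: boundaries fall at multiples of `size`,
    with no character in the stream marking them."""
    return [(i, min(i + size, len(data))) for i in range(0, len(data), size)]


def length_prefixed_records(data):
    """Variable-length records, assuming (hypothetically) a 1-byte
    length field followed by that many payload bytes."""
    spans = []
    pos = 0
    while pos < len(data):
        n = data[pos]                      # length field
        spans.append((pos + 1, pos + 1 + n))
        pos += 1 + n
    return spans
```

With this sketch, separated_records("a,b,,c", ",") reports an empty
record between the two adjacent commas, and in the fixed-length case
the end of one record and the start of the next share the same offset.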


Now, as Russ Cox mentioned, the pattern ^$ matches a record that has
no characters in between the start and end; thus it must be an empty
record.
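
As a quick check of the ^$ case in one real engine, the following uses
Python's re module, where re.MULTILINE makes ^ match at the start of
the string and after each newline, and $ match at the end of the
string and before each newline. The sample text is made up.

```python
import re

# Three line-records; the middle one is empty.
text = "first\n\nthird"

# Offsets at which ^$ matches, i.e. where an empty record sits.
empty = [m.start() for m in re.finditer(r"^$", text, flags=re.MULTILINE)]
print(empty)  # [6]
```

Other engines may differ in corner cases, such as a newline at the very
end of the input.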


However, I would interpret the pattern $^ as matching the end of one
record and beginning of the next, thus, it matches the boundaries
between records, and does not match the beginning of the first record
(because there is no previous record that the $ is the end of) nor the
end of the last record (for the converse reason). I would not argue
with those that want to see the end-of-record character(s), e.g. "\n"
or "," in between the $ and ^ for the above case, as in $\n^ for
matching all record separating newlines that are not at the
end-of-file.
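
A small Python sketch of that reading of $^, again only as a model of
the interpretation described here and not of any particular regex
engine. The function name and the (start, end) span representation are
hypothetical.

```python
def inter_record_boundaries(spans):
    """Pair the end of each record ($) with the start of the next (^).

    `spans` is a list of (start, end) offsets, one per record, in
    order. The first record's start and the last record's end are
    deliberately left out, since they have no neighbour to pair with.
    """
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
    ]


# Records of "ab\ncd\nef" when newline is the separator.
spans = [(0, 2), (3, 5), (6, 8)]
print(inter_record_boundaries(spans))  # [(2, 3), (5, 6)]
```

Each pair brackets exactly the separator, which is why writing the
pattern as $\n^ amounts to the same thing for newline-separated text.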


I would also allow some of Russ's more complex cases, such as ^$$,
which would mean start-of-record, end-of-record, end-of-record
(i.e. two empty records, the same as ^$^$, since omitting a ^ after
a $ is unimportant, and also the same as ^^$ with a similar
rationale). I'm assuming that one doesn't have a hierarchical record
system with "nested" records, in which two consecutive starts might be
sensible in an entirely different way (start of a containing record
and start of a nested record).


If you follow that rationale, you will see that the pattern 'abc^def'
is actually well-defined, and means 'abc' start-of-record 'def'.
Again, assuming flat record structures, that's the same as 'abc$^def'
or 'abc$def'. I believe I have used the last one in emacs and gotten
the expected results, where it dealt properly with DOS \r\n terminated
lines mixed with Unix \n terminated ones.
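
One way to get that behaviour from a conventional engine is to rewrite
the mid-pattern boundary as an explicit separator. The Python sketch
below assumes a record boundary is either \n or \r\n; the helper name
is hypothetical, and this is not a claim about how emacs implements it.

```python
import re

# Assumed definition of a record boundary for text files:
# a Unix newline or a DOS carriage-return/newline pair.
BOUNDARY = r"(?:\r?\n)"


def record_pattern(before, after):
    """Build a regex for `before` end-of-record/start-of-record `after`,
    i.e. the reading of 'abc$^def' (or 'abc$def') described above."""
    return re.compile(re.escape(before) + BOUNDARY + re.escape(after))


pat = record_pattern("abc", "def")
print(bool(pat.search("xabc\ndef")))    # True  (Unix line ending)
print(bool(pat.search("xabc\r\ndef")))  # True  (DOS line ending)
print(bool(pat.search("abcdef")))       # False (no record boundary)
```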


Hope this helps (or at least makes sense),
-Chris


*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)

