Subject: | Re: What stage should entities be resolved? |
From: | Christopher F Clark <christopher.f.clark@compiler-resources.com> |
Newsgroups: | comp.compilers |
Date: | Sat, 12 Mar 2022 14:11:21 +0200 |
Organization: | Compilers Central |
References: | 22-03-019 22-03-025 |
Keywords: | parse, design |
Posted-Date: | 14 Mar 2022 11:36:04 EDT |
Contrary to what one might assume from my previous posting on this
topic, I agree with Dodi.
Sometimes, the right answer is another phase. To keep your lexer
simple, it can be useful to have a separate phase that deals with
"character" issues, whether that is transforming UTF-8 extensions into
unique code points (or actual characters representing glyphs possibly
accented, i.e. resolving the combining code points into canonical
versions) or turning sequences like &amp; or \n or whatever into single
tokens (or characters). That *can* make the whole process simpler and
faster.
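
As a rough illustration of such a separate "character" phase, here is a
minimal Python sketch. The names character_phase, resolve_entities and
the tiny ENTITIES table are hypothetical, invented for this example; a
real entity table would be far larger, and whether entities should be
resolved this early at all depends on the language being processed.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny entity table (an assumption for this sketch).
ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}

# Matches named (&amp;), decimal (&#38;) and hexadecimal (&#x26;) references.
_ENTITY_RE = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def resolve_entities(text):
    """Replace each entity reference with the character it names."""
    def replace(match):
        name = match.group(1)
        try:
            if name.startswith("#x"):
                return chr(int(name[2:], 16))
            if name.startswith("#"):
                return chr(int(name[1:]))
        except (ValueError, OverflowError):
            return match.group(0)  # out-of-range code point: leave as written
        # Unknown names are left untouched rather than guessed at.
        return ENTITIES.get(name, match.group(0))
    return _ENTITY_RE.sub(replace, text)


def character_phase(raw):
    """Bytes in, cleaned-up text out; the lexer only ever sees the result."""
    text = raw.decode("utf-8")                 # UTF-8 bytes -> code points
    text = unicodedata.normalize("NFC", text)  # fold combining sequences
    return resolve_entities(text)              # &amp; and friends -> characters
```

The substitution is a single pass, so text produced by one replacement
is not scanned again: "&amp;lt;" comes out as "&lt;", not "<". The lexer
that follows can then be written as if entities and combining sequences
did not exist.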
For example, years ago when working on a C compiler for Honeywell when
the first ANSI standard was still new, the standard had 8 stages (if I
recall correctly) that described the lexing process. We decided that
the best way to assure faithfulness to the standard was to implement
the 8 stages exactly as specified, at least in the first version.
That way we had a reliable model of the desired behavior that we could
track back to the standard. Moreover, by having them as separate
pieces of code, it was easy to turn them off (e.g. trigraphs in C were
an ANSI invention and some C programs used ??? not as a trigraph but
as a way of emphasis). Similarly, some pre-ANSI C dialects supported
nested comments and you might want to change that phase.
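
To make the "easy to turn them off" point concrete, here is a small
Python sketch of phases kept as separate functions and selected from a
list. It is not the Honeywell compiler's code; run_phases, ANSI_PHASES
and LEGACY_PHASES are names made up for this example, and only two of
the early phases (trigraph replacement and line splicing) are shown.

```python
# The nine trigraph sequences defined by ANSI C.
TRIGRAPHS = {
    "??=": "#", "??(": "[", "??/": "\\", "??)": "]", "??'": "^",
    "??<": "{", "??!": "|", "??>": "}", "??-": "~",
}


def replace_trigraphs(text):
    """Replace each trigraph sequence with its single character."""
    out = []
    i = 0
    while i < len(text):
        tri = text[i:i + 3]
        if tri in TRIGRAPHS:
            out.append(TRIGRAPHS[tri])
            i += 3
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def splice_lines(text):
    """Delete each backslash immediately followed by a newline."""
    return text.replace("\\\n", "")


def run_phases(text, phases):
    """Apply each phase in order; every phase maps text to text."""
    for phase in phases:
        text = phase(text)
    return text


# Hypothetical configurations: dropping a phase is just leaving it out.
ANSI_PHASES = [replace_trigraphs, splice_lines]
LEGACY_PHASES = [splice_lines]
```

With ANSI_PHASES, the input "??=define X 1 ??/" followed by a newline
and "+ 2" becomes "#define X 1 + 2". With LEGACY_PHASES, a string such
as "what??!" passes through untouched, whereas the ANSI list would turn
it into "what|".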
You do want each phase to generally build larger and larger
structures, i.e. you don't want your parser very often dealing with
strings as individual characters. Still, the exact number of phases
and the content of each phase can vary. One size rarely fits all.
--
******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------