Re: String tokenizer place in Chomsky hierarchy?

Mitch <>
Fri, 11 Apr 2008 07:17:34 -0700 (PDT)

          From comp.compilers

Related articles
String tokenizer place in Chomsky hierarchy? (Tegiri Nenashi) (2008-04-08)
Re: String tokenizer place in Chomsky hierarchy? (Mitch) (2008-04-11)
Re: String tokenizer place in Chomsky hierarchy? (Hans-Peter Diettrich) (2008-04-11)
Re: String tokenizer place in Chomsky hierarchy? (2008-04-15)
Re: String tokenizer place in Chomsky hierarchy? (Rock Brentwood) (2008-04-25)
| List of all articles for this month |

From: Mitch <>
Newsgroups: comp.compilers,comp.theory
Date: Fri, 11 Apr 2008 07:17:34 -0700 (PDT)
Organization: Compilers Central
References: 08-04-030
Keywords: lex
Posted-Date: 11 Apr 2008 16:44:14 EDT

On Apr 8, 1:44 pm, Tegiri Nenashi <> wrote:
> String tokenizer is the simplest parser of all. In java, arguably, it
> is much more frequently used than regular expressions. Yet I fail to
> see any parsing theory book ever mentioning it.

'String tokenizer', used by software engineering people, is a synonym
for 'lexical analyzer', which is the term tending to be used by the
parsing/theory crowd. Just a difference in language use.

> In my practical experience it is easer write a scanner on string
> tokenizer foundation rather than to use off the shellf reg exp engine.

A string tokenizer (or lexical analyzer) may use regular expressions
to define the tokens, the off-the-shelf tool you want to use is
something like lex, flex, javacc.

> I'm interested however what is the language theory perspective onto
> this phenomenon.
> Formally, a set of terminals is partitioned into a separator or a set
> of separators, and the rest of terminals. Then, string tokenizer
> translates a given word into a set (or list) of words.

If all you're doing is managing delimiters (as opposed to keywords and
floating point), then that's modeled by a pretty simple regular
expression and so I can see that it'd be much easier to write one
yourself than use an off-the-shelf package. (but doesn't even C's
'strtok' take care of that?

> Here we have the first technical difficulty, what exactly this
> translation is? Is it a mapping from Monoid to Idempotent Semiring?

The map is from monoid to monoid; the implementation of the map uses a
semiring (the set described using a regexp).

> does this 8(!) rule grammar present any new insight to
> what string tokenizor is?

I don't think so.

> Besides, assuming this grammar produces an unambigious parse tree,
> there is still an extra step of extracting the set of words from the
> tree.

A regular language (described by a reg exp) always has an unambiguous
parse tree, and that tree is really an unnested list. As to extra
step, that is just a matter of how you implement your parsing. You
could extract each token as you go along, rather than producing a tree
then scanning it a second time to produce the separate tokens.

I think the idea of tokenizing via regular expressions is used because
it is so easy to do that part of the parsing on-line, as each
character comes in (no backtracking, you don't have to complete the
token parsing before you know you have the correct token. (of course
this doesn't apply to extended regular expressions with variables).


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.