Related articles:
DFAC, DFA lexer generator for C/C++ paul@paulbmann.com (Paul Mann) (2010-05-13)
Re: DFAC, DFA lexer generator for C/C++ paul@paulbmann.com (Paul Mann) (2010-07-01)
Re: DFAC, DFA lexer generator for C/C++ paul@paulbmann.com (Paul Mann) (2010-07-07)
From: Paul Mann <paul@paulbmann.com>
Newsgroups: comp.compilers
Date: Wed, 7 Jul 2010 13:20:35 -0700 (PDT)
Organization: Compilers Central
References: 10-05-074 10-07-002
Keywords: lex, tools
Posted-Date: 08 Jul 2010 10:27:05 EDT
In case anybody is confused, let me say:
DFAC is a complete package. It includes the 'dfac.exe' program
and C/C++ source code for 'main.cpp', 'lexer.cpp' and 'parser.cpp'.
Also included is a C Lexical Grammar, which is a good example of
the lexical-grammar notation used by DFAC. When compiled, the
program is a good speed tester for lexical analyzers.
The lexical-grammar notation differs from regular expressions
because it was borrowed from the kind of BNF used to define parsers.
Other compiler-compiler systems (ANTLR, SableCC) use this kind of
notation for defining lexers. Here is a small example:
<identifier> => IDENTIFIER   // A defined constant (returns a number
                             // to the parser).
<identifier> -> letter (letter|digit)*
letter       -> 'a'..'z' | 'A'..'Z' | '_'
digit        -> '0'..'9'
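For readers more used to code than to grammar notation, the three rules above can be read as the following hand-written C++ sketch. This is not what DFAC generates (DFAC builds a DFA from the grammar); the function names here are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// letter -> 'a'..'z' | 'A'..'Z' | '_'
static bool is_letter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

// digit -> '0'..'9'
static bool is_digit(char c) {
    return c >= '0' && c <= '9';
}

// <identifier> -> letter (letter|digit)*
// Returns the length of the longest identifier that starts at `pos`,
// or 0 if no identifier starts there.
std::size_t match_identifier(const std::string& text, std::size_t pos) {
    assert(pos <= text.size());  // caller must pass a valid offset
    if (pos >= text.size() || !is_letter(text[pos])) {
        return 0;
    }
    std::size_t end = pos + 1;
    while (end < text.size() &&
           (is_letter(text[end]) || is_digit(text[end]))) {
        ++end;
    }
    return end - pos;
}
```

For example, match_identifier("count1 = 0", 0) returns 6, while a string that starts with a digit returns 0 because the first symbol must be a letter.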
Here is the C Lexical Grammar included in the product.
(You might want to view this in a fixed font, such as Courier).
/* C Lexical Grammar by CompilerWare, July 2010. */
// Tokens:
<eof> => T_EOF
<identifier> => T_IDENTIFIER
<number> => T_NUMBER
<literal> => T_LITERAL
<string> => T_STRING
`auto` => T_AUTO
`break` => T_BREAK
`case` => T_CASE
`cdecl` => T_CDECL
`char` => T_CHAR
`const` => T_CONST
`continue` => T_CONTINUE
`default` => T_DEFAULT
`do` => T_DO
`double` => T_DOUBLE
`else` => T_ELSE
`enum` => T_ENUM
`extern` => T_EXTERN
`far` => T_FAR
`float` => T_FLOAT
`for` => T_FOR
`goto` => T_GOTO
`huge` => T_HUGE
`if` => T_IF
`int` => T_INT
`interrupt` => T_INTERUPT
`long` => T_LONG
`near` => T_NEAR
`pascal` => T_PASCAL
`register` => T_REGISTER
`return` => T_RETURN
`short` => T_SHORT
`signed` => T_SIGNED
`sizeof` => T_SIZEOF
`static` => T_STATIC
`struct` => T_STRUCT
`switch` => T_SWITCH
`typedef` => T_TYPEDEF
`union` => T_UNION
`unsigned` => T_UNSIGNED
`void` => T_VOID
`volatile` => T_VOLATILE
`while` => T_WHILE
'+' => T_PLUS
'-' => T_MINUS
'*' => T_ASTERISK
'/' => T_SLASH
'%' => T_PERCENT
',' => T_COMMA
';' => T_SEMICOLON
'=' => T_EQUALS
'{' => T_LEFTBRACE
'}' => T_RIGHTBRACE
':' => T_COLON
'(' => T_LPAREN
')' => T_RPAREN
'[' => T_LBRACKET
']' => T_RBRACKET
'...' => T_ELIPSIS
'!' => T_EXCLAMATION
'^' => T_BITEXOR
'|' => T_BITOR
'&' => T_BITAND
'*=' => T_MULEQ
'/=' => T_DIVEQ
'%=' => T_MODEQ
'+=' => T_ADDEQ
'-=' => T_SUBEQ
'<<=' => T_SHLEQ
'>>=' => T_SHREQ
'&=' => T_ANDEQ
'^=' => T_EXOREQ
'|=' => T_OREQ
'++' => T_PLUSPLUS
'--' => T_MINUSMINUS
'~' => T_TILDE
'.' => T_DOT
'->' => T_ARROW
'#' => T_HASHMARK
'\' => T_BACKSLASH
'?' => T_QUESTION
'||' => T_OR
'&&' => T_AND
'==' => T_EQ
'!=' => T_NOTEQ
'<' => T_LT
'>' => T_GT
'<=' => T_LTEQ
'>=' => T_GTEQ
'<<' => T_SHL
'>>' => T_SHR
<whitespace> [] // ignore this
<comment1> []
<comment2> []
// Lexical rules:
<eof>        -> \z
<identifier> -> letter (letter|digit)*
<number>     -> digits
             -> float
<literal>    -> ''' lchar+ '''
lchar        -> '\' '\'
             -> '\' 't'
             -> '\' 'n'
             -> '\' '''
             -> '\' '0'
             -> lany
<string>     -> '"' '"'
             -> '"' schar+ '"'
schar        -> '\' '\'
             -> '\' 't'
             -> '\' 'n'
             -> '\' '"'
             -> '\' '0'
             -> sany
<whitespace> -> space+
float        -> rational
             -> digits exp
             -> rational exp
rational     -> digits '.'
             -> '.' digits
             -> digits '.' digits
exp          -> 'e' digits
             -> 'E' digits
             -> 'e' '-' digits
             -> 'E' '-' digits
             -> 'e' '+' digits
             -> 'E' '+' digits
<comment1>   -> '/' '*' EndInAst '/'
EndInAst     -> '*'+
             -> NA+ '*'+
             -> EndInAst NANS '*'+
             -> EndInAst NANS NA+ '*'+
NA           -> 0..127 - \z - '*'
NANS         -> 0..127 - \z - '*' - '/'
<comment2>   -> '/' '/'
             -> '/' '/' NEOL+
NEOL         -> 32..127 | \t
digits       -> digit+
letter       -> 'a'..'z' | 'A'..'Z' | '_'
digit        -> '0'..'9'
lany         -> any - ''' - '\' - \n
sany         -> any - '"' - '\' - \n
space        -> \t | \f | \n | ' '
any          -> 0..127 - \z
\t           -> 9
\n           -> 10
\v           -> 11
\f           -> 12
\r           -> 13
\z           -> 26 // End-of-file character
/* End of C Lexical Grammar. */
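The <comment1> rule is the least obvious one in the grammar, so here is a rough C++ sketch of the two-state machine that the EndInAst productions describe: after the opening '/' '*', the comment body must end in one or more '*' immediately followed by '/'. This is my own hand-written reading of the rule, not DFAC output, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// <comment1> -> '/' '*' EndInAst '/'
// Returns the length of the block comment that starts at `pos`,
// or 0 if no complete comment starts there.
std::size_t match_block_comment(const std::string& text, std::size_t pos) {
    assert(pos <= text.size());  // caller must pass a valid offset
    if (pos + 1 >= text.size() || text[pos] != '/' || text[pos + 1] != '*') {
        return 0;
    }
    // Body:  the last symbol seen was not '*'.
    // Stars: the last symbol seen was '*' (the '*'+ at the end of EndInAst).
    enum class State { Body, Stars };
    State state = State::Body;
    for (std::size_t i = pos + 2; i < text.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (c > 127 || c == 26) {
            return 0;  // NA and NANS are drawn from 0..127 - \z
        }
        if (c == '*') {
            state = State::Stars;
        } else if (c == '/' && state == State::Stars) {
            return i + 1 - pos;  // '*'+ followed by '/' closes the comment
        } else {
            state = State::Body;  // NANS after stars, or NA in the body
        }
    }
    return 0;  // input ended before the closing "*/"
}
```

Note that "/*/" is not accepted: EndInAst needs at least one '*' of its own before the closing '/', so the shortest comment is "/**/".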
The main reason I use angle brackets for <identifier> is
consistency: the parser grammar uses the same notation, and the
LALR parser generator synchronizes very well with the DFAC lexer
generator.
For more information and to download the product, see:
http://compilerware.com
Paul B Mann