Related articles
Parsing a simple BASIC language paul.dunn4@ntlworld.com (paul.dunn4) (2001-04-04)
Re: Parsing a simple BASIC language barry_j_kelly@hotmail.com (Barry Kelly) (2001-04-10)
Re: Parsing a simple BASIC language stephan@pcrm.win.tue.nl (2001-04-10)
Re: Parsing a simple BASIC language christl@fmi.uni-passau.de (2001-04-12)
Re: Parsing a simple BASIC language paul.dunn4@ntlworld.com (Dunny) (2001-04-12)
Re: Parsing a simple BASIC language barry_j_kelly@hotmail.com (Barry Kelly) (2001-04-14)
Re: Parsing a simple BASIC language marcov@toad.stack.nl (2001-04-18)
Re: Parsing a simple BASIC language michael@moria.de (2001-04-18)
[2 later articles]
From: "Barry Kelly" <barry_j_kelly@hotmail.com>
Newsgroups: comp.compilers
Date: 10 Apr 2001 01:15:43 -0400
Organization: Ireland On-Line Customer
References: 01-04-014
Keywords: parse, Basic
Posted-Date: 10 Apr 2001 01:15:43 EDT
"paul.dunn4" <paul.dunn4@ntlworld.com> wrote in message
> I have recently begun a new project, [...] I am writing a Sinclair
> Spectrum BASIC interpreter - [...] what has got me stumped is parsing
> the text entered by the user for errors.
> The problem that I have is one of speed.
> I would like to create a simple syntax highlighter which displays a
> syntax template (such as FOR <variable> = <Numexpr> TO <numexpr> ) and
> highlights errors in red as you type. Unfortunately, it slows down to
> the point of unusability if you type over about 150 chars in the input
> string.
> 1) How to parse a simple BASIC language (like Sinclair Basic) quickly,
> 2) Any better method than the one I'm using here. I used BNF for
> everything, but that was even slower than the method I am using.
I'm not familiar with Sinclair Basic, but I have written many parsers
in Object Pascal (Delphi). Might I suggest that you use a simple
recursive descent parser (and change your grammar to LL(1), with
context if necessary), and begin highlighting in red at the beginning
of the first invalid token found by the parser?
For speed, the method I have found to work efficiently in Object
Pascal for lexical analysis looks like this. It uses a little hack,
similar to Classes.pas's TParser, which defines TToken as a character
type; this makes tokens like ':' and ';' easy to handle:
---8<---
// class outline edited for clarity
type
  TToken = #0..#255;
  // use a hash lookup at the end of NextToken to check if
  // a tokIdentifier is actually a tokKeyword
  TKeyword = (keyNone, keyBlah, keyFor, keyEtc);
  TLexer = class
  private
    FSource: string;
    FCurrentPosition: Integer;
  public
    constructor Create(const ASource: string);
    procedure NextToken;
    property Token: TToken;
    property TokenStart: Integer;
    property TokenLength: Integer;
    property TokenAsString: string;
    property TokenAsInteger: Integer;
    property TokenAsFloat: TFloat;
    property TokenAsWhatever: TWhatever;
    property TokenAsKeyword: TKeyword;
  end;

const
  tokEOF = TToken(0);
  tokInteger = TToken(1);
  tokFloat = TToken(2);
  tokWhatever = TToken(3);
  tokKeyword = TToken(4);
  tokIdentifier = TToken(5);
--->8---
The job of the NextToken procedure is:
beginning at the CurrentPosition:
skip whitespace
determine the token type from its first character
read in that token type fully
leave the CurrentPosition 1 after the end of the token just read
set TokenAsString, TokenAsInteger, Token, etc as appropriate.
I'm sure you're familiar with writing native code regexp matchers, but I
have seen many people using Object Pascal do it inefficiently, so I'm going
to be fairly verbose, edited a little for clarity:
---8<---
procedure TLexer.NextToken;
const
  WhiteSpace = [#1..#32];
  Alpha = ['a'..'z', 'A'..'Z', '_'];
  Numeric = ['0'..'9'];
  AlphaNumeric = Alpha + Numeric;
var
  currPos: Integer; // allows register variable
  // another optimization, if Delphi doesn't do automatic
  // loop variable induction, is to use a PChar for currPos
  // instead; it'll mean slightly more pointer arithmetic
  // to figure out TokenStart, and you should use the
  // SetString function instead of the Copy function.
begin
  currPos := FCurrentPosition;
  // skip whitespace (#0 is excluded from the set, so this
  // also stops at end of source)
  while FSource[currPos] in WhiteSpace do
    Inc(currPos);
  // determine the token type from its first character
  case FSource[currPos] of
    #0:
      begin
        // a #0 (null character) means end of source
        Token := tokEOF;
        // we don't touch the current position, so repeated
        // NextToken calls keep setting the token to tokEOF
        TokenStart := currPos;
        TokenLength := 1;
      end;
    'a'..'z', 'A'..'Z', '_': // Alpha (set constants can't be case labels)
      begin
        // an identifier
        TokenStart := currPos;
        while FSource[currPos] in AlphaNumeric do
          Inc(currPos);
        // the current position is now at the first
        // character that ISN'T part of the identifier;
        // extract token (from start, length currPos - start)
        TokenAsString := Copy(FSource, TokenStart,
          currPos - TokenStart);
        // set token type
        Token := tokIdentifier;
        TokenLength := currPos - TokenStart;
      end;
    '0'..'9': // Numeric
      begin
        // a number
        TokenStart := currPos;
        while FSource[currPos] in Numeric do
          Inc(currPos);
        // extract token (from start, length currPos - start)
        TokenAsString := Copy(FSource, TokenStart,
          currPos - TokenStart);
        TokenAsInteger := StrToInt(TokenAsString);
        TokenAsFloat := TokenAsInteger;
        // set token type
        Token := tokInteger;
        TokenLength := currPos - TokenStart;
      end;
  else
    // some operator like + or whatever
    // an easy cheat is to store the token as a character
    // and use characters below 32 (usually whitespace;
    // platform-dependent) to represent token types like
    // integer, float, identifier, etc.
    Token := FSource[currPos];
    TokenStart := currPos;
    TokenLength := 1;
    // VERY IMPORTANT to leave the current position
    // AT ONE AFTER the current token
    Inc(currPos);
  end;
  FCurrentPosition := currPos;
end;
--->8---
Make sure you initialize FCurrentPosition to 1 (not the default 0) in the
constructor, since Delphi strings are 1-based. Keep a copy of ASource in
something like FBuffer if you use PChars instead of indices; otherwise the
refcount of the string might go to zero when the constructor returns.
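A minimal constructor along those lines might look like this (assuming
an FSource field and the string-index approach; appending the #0
sentinel is one way to guarantee NextToken sees an end-of-source
marker):
---8<---
constructor TLexer.Create(const ASource: string);
begin
  inherited Create;
  // keep our own copy of the source, with a #0 sentinel so
  // NextToken stops cleanly without a length check
  FSource := ASource + #0;
  FCurrentPosition := 1; // Delphi strings are 1-based
end;
--->8---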
Recursive descent parsers are trivial to write. Essentially, you start by
writing a class that takes a TLexer parameter in its constructor; the
constructor should probably call NextToken on the lexer so the first token
is ready before parsing begins. The class should have a public method that
corresponds to a starting grammar rule, like ReadStatement, or whatever.
Each grammar rule should eat whatever it expects in the token stream,
collect information about identifiers, etc., and at the end do a case
statement that decides which rule to continue with. For left-associative
rules like subtraction or division (which need to be bracketed from the
left), use a while or repeat loop instead of recursing into the same
function.
An example (I'm just checking for consistency):
---8<---
procedure TParser.ReadForLoop;
begin
  EatKeyword(keyFor);      // should be a helper method
  EatToken(tokIdentifier); // just checking for consistency
  // easy enough to add more support if necessary
  EatToken('=');           // this is possible with the 'hack'
  ReadExpression;          // change to return the type of the
                           // expression if you like
  EatKeyword(keyTo);
  ReadExpression;
  if Lexer.TokenAsKeyword = keyStep then
  begin
    Lexer.NextToken;
    ReadExpression;
  end;
end;
--->8---
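The loop-instead-of-recursion advice for left-associative rules might
be sketched like this (ReadTerm is an assumed helper for the next
precedence level):
---8<---
procedure TParser.ReadExpression;
begin
  ReadTerm; // parse the left operand
  // loop rather than recurse, so a - b - c groups as (a - b) - c
  while Lexer.Token in ['+', '-'] do
  begin
    Lexer.NextToken; // eat the operator
    ReadTerm;        // parse the right operand
  end;
end;
--->8---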
Of course, the Eat* methods should raise exceptions if the token in the
lexer doesn't correspond to the expected token.
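As a sketch, an Eat* helper might look like this (EParseError is an
assumed exception class; any Exception descendant carrying the error
position will do):
---8<---
procedure TParser.EatToken(AToken: TToken);
begin
  if Lexer.Token <> AToken then
    raise EParseError.CreateFmt(
      'Unexpected token at position %d', [Lexer.TokenStart]);
  Lexer.NextToken; // consume the token we just checked
end;
--->8---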
If you add methods to the lexer and parser to reset them, they should
easily be fast enough to handle thousands of characters without any
noticeable delay. Catch any exceptions thrown by the Eat* methods (or
Expected* methods, which are similar to Eat* but don't call
Lexer.NextToken) and begin your highlighting with the information in the
lexer state. You shouldn't need to reparse the input until the user
presses return or modifies something before the error position.
This recursive descent parser technique was inspired by Jack Crenshaw
(of "Let's Build a Compiler" fame).
One final note for speed: make sure you're using a hash table for keyword
lookup.
I've got a fairly fast hashtable on
http://codecentral.borland.com/codecentral/ccweb.exe/listing?id=15171.
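For illustration only, the keyword check at the end of NextToken could
be sketched with a sorted TStringList and its Find method (a real hash
table, like the one linked above, avoids the binary search and is
faster; FKeywords and LookupKeyword are assumed names):
---8<---
function TLexer.LookupKeyword(const AIdent: string): TKeyword;
var
  idx: Integer;
begin
  // FKeywords is a sorted TStringList built once in the
  // constructor, whose Objects[] hold TKeyword ordinals
  if FKeywords.Find(AIdent, idx) then
    Result := TKeyword(Integer(FKeywords.Objects[idx]))
  else
    Result := keyNone; // a plain identifier, not a keyword
end;
--->8---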
Much of this goes without saying, but I want to make sure.
-- Barry
barry_j_kelly@hotmail.com