From: Jerry <awanderin@gmail.com>
Newsgroups: comp.compilers
Date: Thu, 20 Feb 2020 23:38:51 -0700
Organization: A noiseless patient Spider
References: 20-02-015
Keywords: lex
Posted-Date: 21 Feb 2020 11:33:07 EST
Maury Markowitz <maury.markowitz@gmail.com> writes:
> I'm trying to write a lex/yacc (flex/bison) interpreter for classic BASICs
> like the original DEC/MS, HP/DG etc. I have it mostly working for a good chunk
> of 101 BASIC Games (DEF FN is the last feature to add).
>
> Then I got to Super Star Trek. To save memory, SST removes most spaces, so
> lines look like this:
>
> 100FORI=1TO10
>
> Here are my current patterns that match bits of this line:
>
> FOR { return FOR; }
>
> [:,;()\^=+\-*/\<\>] { return yytext[0]; }
>
> [0-9]*[0-9.][0-9]*([Ee][-+]?[0-9]+)? {
>     yylval.d = atof(yytext);
>     return NUMBER;
> }
>
> "FN"?[A-Za-z@][A-Za-z0-9_]*[\$%\!#]? {
> yylval.s = g_string_new(yytext);
> return IDENTIFIER;
> }
>
> These correctly pick out some parts, numbers and = for instance, so it sees:
>
> 100 FORI = 1 TO 10
>
> The problem is the FORI part. Some BASICs allow variable names with more
> than two characters, so in theory FORI could be a variable. Those BASICs
> outlaw that in their tokenizers: the scan stops as soon as it sees a
> leading keyword, so FORI always parses as FOR. In lex, FORI is a longer
> match than FOR, so it returns a variable token called FORI.
>
> Is there a way to represent this in lex? Over on Stack Overflow the only
> suggestion seemed to be to use trailing context on the keywords, but that
> appears to require modifying every one of the simple keyword patterns with
> some extra (and ugly) syntax. Likewise, one might modify the variable-name
> pattern, but I'm not sure how one says "everything that doesn't start with
> one of these other 110 patterns".
>
> Is there a canonical cure for this sort of problem that isn't worse than the
> disease?
> [Having written Fortran parsers, not that I've ever found. I did a prepass
> over each statement to figure out whether it was an assignment or something
> else, then the lexing was straightforward if not pretty. -John]
I have an unfinished compiler for Applesoft BASIC (the Apple II BASIC
written by Microsoft) that I wrote in Python using PLY (a Yacc-like
tool). Since, as you discovered, the BASICs of the 1970s and 1980s often
didn't use spaces to separate tokens, I tossed the PLY lexer module out
the window and wrote the lexer entirely myself to interface with the
parser generator. I don't have the code up on any website, so I'll paste
it here; if it provides you with any insight, then great.
Essentially, at every point in the input the tokenizer tries to find a
token, which means skipping spaces and searching the keyword table. For
certain tokens ("ATN", "AT", "TO"), spaces _are_ significant, and the
original parser (in 6502 code) had special-case code to handle them. My
lexer mimics what the original did; the first batch of examples after
the listing shows this in action.
What my lexer does that the original did not do is convert numbers and
strings into lexemes. The original interpreter rescanned these at
run-time every time it saw them; I wanted to hand the parser entire
numbers and strings (the second batch of examples shows the effect).
Here is the code:
-----
#!/usr/bin/env python3
"""Applesoft compiler lexical analyzer."""
import re
def escape_chars(text):
    """Escape regular-expression special characters."""
    for char in '$^+(*=':
        text = text.replace(char, '\\' + char)
    return text
regular_keywords = [
    'END', 'FOR', 'NEXT', 'DATA', 'INPUT', 'DEL', 'DIM', 'READ', 'GR',
    'TEXT', 'PR#', 'IN#', 'CALL', 'PLOT', 'HLIN', 'VLIN', 'HGR2',
    'HGR', 'HCOLOR=', 'HPLOT', 'DRAW', 'XDRAW', 'HTAB', 'HOME', 'ROT=',
    'SCALE=', 'SHLOAD', 'TRACE', 'NOTRACE', 'NORMAL', 'INVERSE',
    'FLASH', 'COLOR=', 'POP', 'VTAB', 'HIMEM:', 'LOMEM:', 'ONERR',
    'RESUME', 'RECALL', 'STORE', 'SPEED=', 'LET', 'GOTO', 'RUN', 'IF',
    'RESTORE', '&', 'GOSUB', 'RETURN', 'REM', 'STOP', 'ON', 'WAIT',
    'LOAD', 'SAVE', 'DEF', 'POKE', 'PRINT', 'CONT', 'LIST', 'CLEAR',
    'GET', 'NEW', 'TAB(', 'TO', 'FN', 'SPC(', 'THEN', 'NOT', 'STEP',
    '+', '-', '*', '/', '^', 'AND', 'OR', '>', '=', '<', 'SGN', 'INT',
    'ABS', 'USR', 'FRE', 'SCRN(', 'PDL', 'POS', 'SQR', 'RND', 'LOG',
    'EXP', 'COS', 'SIN', 'TAN', 'PEEK', 'LEN', 'STR$', 'VAL',
    'ASC', 'CHR$', 'LEFT$', 'RIGHT$', 'MID$']
# AT keyword is handled specially
# "ATN" ==> <ATN> but "AT N" ==> <AT> "N"
# "ATO" ==> "A" <TO> but "AT O" ==> <AT> "O"
special_keywords_re = re.compile(
    r'(A *T(?![NO]))|(A *TN)|(A(?= *T *O))', re.IGNORECASE)
keywords_re = re.compile('({})'.format(
    '|'.join([' *'.join(escape_chars(char) for char in kw)
              for kw in regular_keywords])), re.IGNORECASE)
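# Note: each keyword above is compiled with optional spaces between its
# characters, e.g. 'FOR' becomes the pattern 'F *O *R', so 'F O R'
# matches just as well as 'FOR'.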
integer_re = re.compile(r'[0-9][0-9 ]*')
float_re = re.compile(r'[\d ]*\.[\d ]*(E *[-+]?[\d ]*)?', re.IGNORECASE)
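# Note: the number patterns above allow embedded spaces ('1 0' is the
# integer 10); filter_text() strips them before conversion.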
data_re = re.compile(r'[^":\n]*("[^\n]*")?[^":\n]*(?=[:\n])')
remark_re = re.compile(r'[^\n]*')
variable_re = re.compile(r'([A-Za-z] *[A-Za-z0-9 ]*[\$%]?)')
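# Variable names may likewise contain embedded spaces, and take an
# optional type suffix: $ for strings, % for integers.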
string_re = re.compile(r'"([^"]*)" *')
space_re = re.compile(r' +')
keyword_map = {
    '&': 'AMPER',
    '*': 'Times',
    '+': 'Plus',
    '-': 'Minus',
    '/': 'Divide',
    '<': 'Less',
    '=': 'Equal',
    '>': 'Greater',
    'CHR$': 'CHR',
    'COLOR=': 'COLOR',
    'HCOLOR=': 'HCOLOR',
    'HIMEM:': 'HIMEM',
    'IN#': 'IN',
    'LEFT$': 'LEFT',
    'LOMEM:': 'LOMEM',
    'MID$': 'MID',
    'PR#': 'PR',
    'RIGHT$': 'RIGHT',
    'ROT=': 'ROT',
    'SCALE=': 'SCALE',
    'SCRN(': 'SCRN',
    'SPC(': 'SPC',
    'SPEED=': 'SPEED',
    'STR$': 'STR',
    'TAB(': 'TAB',
    '^': 'Power'}
tokens = ('END', 'FOR', 'NEXT', 'DATA', 'INPUT', 'DEL', 'DIM', 'READ', 'GR',
          'TEXT', 'PR', 'IN', 'CALL', 'PLOT', 'HLIN', 'VLIN', 'HGR2', 'HGR',
          'HCOLOR', 'HPLOT', 'DRAW', 'XDRAW', 'HTAB', 'HOME', 'ROT', 'SCALE',
          'SHLOAD', 'TRACE', 'NOTRACE', 'NORMAL', 'INVERSE', 'FLASH', 'COLOR',
          'POP', 'VTAB', 'HIMEM', 'LOMEM', 'ONERR', 'RESUME', 'RECALL',
          'STORE', 'SPEED', 'LET', 'GOTO', 'RUN', 'IF', 'RESTORE', 'AMPER',
          'GOSUB', 'RETURN', 'REM', 'STOP', 'ON', 'WAIT', 'LOAD', 'SAVE',
          'DEF', 'POKE', 'PRINT', 'CONT', 'LIST', 'CLEAR', 'GET', 'NEW',
          'TAB', 'TO', 'FN', 'SPC', 'THEN', 'AT', 'NOT', 'STEP', 'AND', 'OR',
          'SGN', 'INT', 'ABS', 'USR', 'FRE', 'SCRN', 'PDL', 'POS', 'SQR',
          'RND', 'LOG', 'EXP', 'COS', 'SIN', 'TAN', 'ATN', 'PEEK', 'LEN',
          'STR', 'VAL', 'ASC', 'CHR', 'LEFT', 'RIGHT', 'MID',
          'Greater', 'Equal', 'Less', 'Plus', 'Minus', 'Times', 'Divide',
          'Power', 'LParen', 'RParen', 'String', 'Newline', 'Variable',
          'Colon', 'Comma', 'Semi', 'Float', 'Integer',)
literals_re = re.compile(r'([:;,\(\)\n])')
literals_map = {
    ':': 'Colon',
    ';': 'Semi',
    ',': 'Comma',
    '(': 'LParen',
    ')': 'RParen',
    '\n': 'Newline'}
class Token:
    """PLY Token: expected lexer return value for parser."""
    def __init__(self, token_type=None, value=None):
        self.type = token_type
        self.value = value
        self.lineno = None
        self.lexpos = None

    def __repr__(self):
        if self.value is not None:
            return '%s[%s]' % (self.type, repr(self.value))
        else:
            return self.type

    def __eq__(self, other):
        return self.type == other.type and self.value == other.value
def filter_text(text):
    """Upper-case and filter out spaces from text."""
    return text.replace(' ', '').upper()
class Lexer:
    """
    Lexer emulates the quirky Applesoft tokenizer as closely as possible.

    Thus, it diverges from what a normal lexer would do. In
    particular, spaces are completely insignificant, except within
    strings, and except that "AT" followed by a space and then an
    "N" is parsed as "AT N", not "ATN". Fun? ;)
    """
    def __init__(self, debug=False):
        self.s = ''  # input source string
        self.ix = 0
        self.DEBUG = debug

    def input(self, input_str):
        """Feed the lexical analyzer."""
        self.s = input_str
        self.ix = 0

    def debug(self, token):
        """Generate debug output."""
        if self.DEBUG:
            print(token)
    def find_keyword(self, offset=0):
        """
        Find a keyword.

        Handle the AT keyword in the same way as Applesoft:
        - "ATN"  ==> ATN
        - "AT N" ==> AT N
        - "ATO"  ==> A TO (this special case returns Variable 'A')
        - "AT O" ==> AT O

        Return Token and input-size consumed on success.
        Return None, 0 if no token found.
        """
        t = Token()
        m_keyword = keywords_re.match(self.s[self.ix + offset:])
        if m_keyword:
            t.type = filter_text(m_keyword.group(0))
            length = m_keyword.end()
            if t.type in keyword_map:
                t.value = t.type = keyword_map[t.type]
            elif t.type == 'DATA':
                m_data = data_re.match(self.s[self.ix + offset + length:])
                assert m_data, 'DATA statement is munged'
                t.value = m_data.group(0)
                length += m_data.end()
            elif t.type == 'REM':
                m_remark = remark_re.match(self.s[self.ix + offset + length:])
                assert m_remark, 'REM statement is munged'
                t.value = m_remark.group(0)
                length += m_remark.end()
            return t, length
        m_special_keyword = special_keywords_re.match(
            self.s[self.ix + offset:])
        if m_special_keyword:
            if m_special_keyword.group(1):    # AT
                t.type = 'AT'
            elif m_special_keyword.group(2):  # ATN
                t.type = 'ATN'
            elif m_special_keyword.group(3):  # ATO ==> A (TO is parsed later)
                t.type = 'Variable'
                t.value = 'A'
            else:
                assert False, 'broken special_keyword: {}'.format(
                    m_special_keyword.groups())
            length = m_special_keyword.end()
            return t, length
        return None, 0
    def handle_literal(self, match):
        """Handle literal token."""
        t = Token(literals_map[match.group(1)])
        self.ix += match.end()
        self.debug(t)
        return t

    def handle_string(self, match):
        """Handle string token."""
        t = Token('String', match.group(1))
        self.ix += match.end()
        self.debug(t)
        return t
    def handle_float(self, match_float):
        """
        Handle floating-point token.

        Applesoft allows odd floats that must be converted:
          .     ==> 0.0
          .E    ==> 0.0
          123.E ==> 123.0
        """
        f = filter_text(match_float.group(0))
        if f.endswith('E'):
            f = f[:-1]
        if f == '.':
            f = '0.0'
        t = Token('Float', float(f))
        self.ix += match_float.end()
        self.debug(t)
        return t
    def handle_integer(self, match_int):
        """Handle integer token."""
        t = Token('Integer', int(filter_text(match_int.group(0))))
        self.ix += match_int.end()
        self.debug(t)
        return t
    def token(self):
        """
        Return a token to the parser.

        The Applesoft parser tokenizes keywords, identifies string
        literals, and leaves everything else as plain text (numbers,
        variables, and punctuation). This lexer treats the keywords
        the same way, but also creates special tokens for strings,
        integers, and floating-point numbers.

        Look for things in this order:
        - keywords
        - special keywords (AT, ATN, and the ATO split)
        - punctuation (literals)
        - strings
        - floats
        - integers
        - variables
        """
        m = space_re.match(self.s[self.ix:])
        if m:
            self.ix += m.end()
        if self.ix >= len(self.s):
            return None
        t, length = self.find_keyword()
        if t:
            self.ix += length
            self.debug(t)
            return t
        m = literals_re.match(self.s[self.ix:])
        if m:
            return self.handle_literal(m)
        m = string_re.match(self.s[self.ix:])
        if m:
            return self.handle_string(m)
        m_float = float_re.match(self.s[self.ix:])
        if m_float:
            return self.handle_float(m_float)
        m_int = integer_re.match(self.s[self.ix:])
        if m_int:
            return self.handle_integer(m_int)
        t = Token()
        m = variable_re.match(self.s[self.ix:])
        if m:
            t.type = 'Variable'
            # If a keyword is found in any suffix of the variable
            # name, exclude that suffix.
            for i in range(1, m.end()):
                token_ahead, length = self.find_keyword(i)
                if token_ahead:
                    t.value = filter_text(self.s[self.ix: self.ix + i])
                    assert t.value, 'Bad variable parse: {}'.format(
                        self.s[self.ix: self.ix + m.end()])
                    self.ix += i
                    self.debug(t)
                    return t
            t.value = filter_text(m.group(0))
            self.ix += m.end()
            self.debug(t)
            return t
        t.type = 'Char'
        t.value = self.s[self.ix]
        self.ix += 1
        self.debug(t)
        return t
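-----
To give a feel for it, here is a quick driver that can be pasted below
the listing. The show() helper is just throwaway demonstration code,
not part of the lexer; the output I'd expect from each line is noted in
the comments:

def show(line):
    """Tokenize one line of BASIC and print the token stream."""
    lexer = Lexer()
    lexer.input(line)
    toks = []
    while True:
        tok = lexer.token()
        if tok is None:
            break
        toks.append(repr(tok))
    print(' '.join(toks))

# Integer[100] FOR Variable['I'] Equal['Equal'] Integer[1] TO Integer[10] Newline
show('100FORI=1TO10\n')
# HLIN Integer[1] Comma Integer[2] AT Integer[5] Newline
show('HLIN1,2AT5\n')
# Variable['X'] Equal['Equal'] ATN LParen Integer[1] RParen Newline
show('X=ATN(1)\n')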
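And since the lexer, unlike the original interpreter, builds whole
number and string lexemes, embedded spaces are filtered out before
conversion:

# PRINT String['HELLO'] Colon Variable['X'] Equal['Equal'] Integer[10] Newline
show('PRINT "HELLO":X=1 0\n')
# Variable['Y'] Equal['Equal'] Float[5.0] Newline
show('Y=.5E1\n')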
--
Jerry awanderin at gmail dot com