Re: Universal Character Names

"Quinn Tyler Jackson" <qjackson@wave.home.com>
13 Oct 1998 02:16:01 -0400

From comp.compilers


From:	"Quinn Tyler Jackson" <qjackson@wave.home.com>
Newsgroups:	comp.compilers
Date:	13 Oct 1998 02:16:01 -0400
Organization:	Compilers Central
References:	98-10-068
Keywords:	i18n, lex
X-MimeOLE:	Produced By Microsoft MimeOLE V4.72.3110.3

>universal-character-name:
> \u hex-quad
> \U hex-quad hex-quad
>
>hex-quad:
> hex-digit hex-digit hex-digit hex-digit
>
>Needless to say this makes the old regexp for identifiers:
>
>[a-zA-Z_]+[a-zA-Z0-9_]*
>
>obsolete. How would you modify it to handle UCN?

Here's what I was able to come up with based on the description you gave:

([a-zA-Z_]|((\\u[a-zA-Z0-9]{4})|(\\U[a-zA-Z0-9]{8})))([a-zA-Z0-9_]|((\\u[a-zA-Z0-9]{4})|(\\U[a-zA-Z0-9]{8})))*

Since REs like this are nasty, to say the least, I prefer to write these beasts out as if
I were coding:

        (
                [a-zA-Z_]
                |
                (
                        (
                                \\u[a-zA-Z0-9]{4}
                        )
                        |
                        (
                                \\U[a-zA-Z0-9]{8}
                        )
                )
        )
        (
                [a-zA-Z0-9_]
                |
                (
                        (
                                \\u[a-zA-Z0-9]{4}
                        )
                        |
                        (
                                \\U[a-zA-Z0-9]{8}
                        )
                )
        )*
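
For anyone who wants to try the laid-out form directly, here is a minimal
sketch in Python, whose re.VERBOSE flag ignores whitespace and allows comments
inside the pattern. The names IDENTIFIER and is_identifier are made up for
illustration. One deliberate change, which is my assumption rather than part
of the RE above: the digit class is narrowed from [a-zA-Z0-9] to [0-9a-fA-F],
since the quoted grammar says hex-digit and [a-zA-Z0-9] would also accept
letters such as g through z.

```python
import re

# Hypothetical name; the hex class [0-9a-fA-F] is an assumption (see above).
IDENTIFIER = re.compile(
    r"""
    (?: [a-zA-Z_]                # ordinary first character
      | \\u [0-9a-fA-F]{4}       # or a short UCN: backslash, u, one hex-quad
      | \\U [0-9a-fA-F]{8}       # or a long UCN: backslash, U, two hex-quads
    )
    (?: [a-zA-Z0-9_]             # ordinary following character
      | \\u [0-9a-fA-F]{4}
      | \\U [0-9a-fA-F]{8}
    )*
    """,
    re.VERBOSE,
)


def is_identifier(text: str) -> bool:
    """Return True if the whole string is an identifier under this sketch."""
    return IDENTIFIER.fullmatch(text) is not None
```

The UCNs are matched as they are spelled in source text (a literal backslash
followed by u or U), so test inputs are easiest to write as raw strings, for
example is_identifier(r"x\u00C0y").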

Your particular RE parser may use a bang for a pipe. If your RE parser doesn't deal with
the {n} postfix, you'll have to make these replacements:

                                \\u[a-zA-Z0-9]{4}

becomes:

                                \\u[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]

and

                                \\U[a-zA-Z0-9]{8}

becomes:

                                \\U[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]

Since my eyes are tired, I leave such replacement within the full re
itself as an exercise for the reader... (Indeed, I may have missed a
"p" or "q" somewhere in that beastie...)
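
For readers who would rather not do that exercise by hand, the {n} replacement
can be done mechanically. The following is a rough sketch in Python; the
function name expand_counted_repeats is hypothetical, and it only handles the
simple case used here: a bracketed class with no "]" inside it, followed by an
exact count {n}. Ranges such as {2,4} and counts applied to groups are left
untouched.

```python
import re

# Matches a simple bracket class followed by an exact repeat count, e.g. [a-z]{4}.
_COUNTED_CLASS = re.compile(r"(\[[^\]]*\])\{(\d+)\}")


def expand_counted_repeats(pattern: str) -> str:
    """Rewrite each [class]{n} in pattern as n literal copies of [class]."""
    return _COUNTED_CLASS.sub(
        lambda m: m.group(1) * int(m.group(2)),  # repeat the class n times
        pattern,
    )
```

Applied to the one-line RE given earlier, this should produce the long-hand
form that a parser without {n} support can read.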

--
Quinn Tyler Jackson

email: qjackson@wave.home.com
url: http://www.qtj.net/~quinn/
ftp: qtj.net
