RE: Integer sizes and DFAs

Christopher F Clark <christopher.f.clark@compiler-resources.com>
Sun, 27 Mar 2022 00:54:35 +0200

          From comp.compilers

Related articles
RE: Integer sizes and DFAs christopher.f.clark@compiler-resources.com (Christopher F Clark) (2022-03-27)
Re: Integer sizes and DFAs gah4@u.washington.edu (gah4) (2022-03-26)
Re: Integer sizes and DFAs gah4@u.washington.edu (gah4) (2022-03-26)
RE: Integer sizes and DFAs christopher.f.clark@compiler-resources.com (Christopher F Clark) (2022-03-27)
| List of all articles for this month |

From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Date: Sun, 27 Mar 2022 00:54:35 +0200
Organization: Compilers Central
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="67075"; mail-complaints-to="abuse@iecc.com"
Keywords: design, performance
Posted-Date: 26 Mar 2022 19:42:51 EDT

Kaz, I don't know whether you simply didn't read what I wrote or chose
to ignore it.


> In certain common situations we model elementary operations as being
> O(1), with certain justifications. That simplifies the analysis, while
> leaving its results still relevant and true for practical applications
> running on present hardware.


And, my point was 2**32 is large enough to be considered arbitrarily large with
respect to most DFAs. Not quite the human genome, see extended analysis
below. Here was my first analysis.


me > Note how that even works for the DNA DFAs, assuming 32 bit integers
me > and a byte addressed machine. With that model, you can fit 10 million
me > (1E7) states in the 2 billion byte (2GB) address space (each state has
me > 4, 4 byte entries or 16 bytes).


> Each 32 bit state has 4 byte entries only if it doesn't express that
> it branches to other states for certain input symbols.


A state isn't 32 bits, a transition is. A transition (in a DFA) is
just the address of another state. A state in a DFA for DNA has 4 such
transitions (one for A, T, C, and G--the 4 DNA bases). Those are the
only legal symbols in DNA. So a state is 4 * 4 bytes, each transition
is 4 bytes and there are 4 of them per state.


Thus, I explained what the justification is and how that allows more
than 100 million states in the combined DFA. That's without any fancy
tricks.


Now, I just looked up the size of the human genome. it is 3 billion,
so that's a little more than another order or magnitude bigger, so you
definitely need slightly bigger integers (or to do some compression,
although 64 bit integers are overkill, but convenient to work with and
with those you could easily fit the genome in an 128 GB SSD, which
would be addressible by 64 bit integer/pointers). And, for most other
applications even 1 million states are more than sufficient. Thus,
saying a transition can happen in O(1) time is a perfectly reasonable
assumption. You can layout nearly any DFA we expect to see in memory
and directly address it. Thus, the RAM model holds.


By the way with 40 bit integers, you could fit it in 60GB (20 bytes
per state and 3G states), We don't quite have laptops with that much
DRAM memory yet.


--
******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.