Re: Optimization techniques and undefined behavior

Christian Gollwitzer <auriocus@gmx.de>
Thu, 2 May 2019 22:22:29 +0200

          From comp.compilers

Related articles
[13 earlier articles]
Re: Optimization techniques and undefined behavior martin@gkc.org.uk (Martin Ward) (2019-05-02)
Re: Optimization techniques and undefined behavior david.brown@hesbynett.no (David Brown) (2019-05-02)
Re: Optimization techniques and undefined behavior david.brown@hesbynett.no (David Brown) (2019-05-02)
Re: Optimization techniques and undefined behavior 847-115-0292@kylheku.com (Kaz Kylheku) (2019-05-02)
Re: Optimization techniques and undefined behavior bc@freeuk.com (Bart) (2019-05-02)
Re: Optimization techniques and undefined behavior bc@freeuk.com (Bart) (2019-05-02)
Re: Optimization techniques and undefined behavior auriocus@gmx.de (Christian Gollwitzer) (2019-05-02)
Re: Optimization techniques and undefined behavior bc@freeuk.com (Bart) (2019-05-03)
Re: Optimization techniques and undefined behavior martin@gkc.org.uk (Martin Ward) (2019-05-03)
Re: Optimization techniques and undefined behavior anw@cuboid.co.uk (Andy Walker) (2019-05-03)
Re: Optimization techniques and undefined behavior david.brown@hesbynett.no (David Brown) (2019-05-03)
Re: Optimization techniques and undefined behavior bc@freeuk.com (Bart) (2019-05-03)
Re: Optimization techniques and undefined behavior bc@freeuk.com (Bart) (2019-05-03)
[14 later articles]
| List of all articles for this month |
From: Christian Gollwitzer <auriocus@gmx.de>
Newsgroups: comp.compilers
Date: Thu, 2 May 2019 22:22:29 +0200
Organization: A noiseless patient Spider
References: <72d208c9-169f-155c-5e73-9ca74f78e390@gkc.org.uk> 19-04-021 19-04-023 19-04-037 19-04-039 19-04-042 19-04-044 19-04-047 19-05-004 19-05-008
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="53323"; mail-complaints-to="abuse@iecc.com"
Keywords: errors, arithmetic
Posted-Date: 02 May 2019 18:08:12 EDT

Am 02.05.19 um 16:51 schrieb David Brown:
> On 01/05/2019 14:53, Bart wrote:
>> On 30/04/2019 13:46, David Brown wrote:
>>> It is really incredibly simple (at least in this case).  Do the
>>> multiplications using types that won't overflow.  That might be an
>>> unsigned type if its range is suitable (not because it has defined
>>> overflow behaviour, but use it if it has enough range) or a bigger
>>> signed integer type.
>>
>> That's just kicking the can further down the road.
>>
>
> Yes (especially for the use of a larger signed type).  That's fine.  C
> let's you easily kick the can a /long/ way down the road.  How often do
> you need integers that will overflow a 64-bit type?
>
>> If you have two unknown values A and B, and need to multiply, you won't
>> know if the result will overflow.
>>
> First off, how often do you actually have unknown values to deal with?
> Usually you know something at least.


Well exactly that was my point. In order to parse the file format, you
get unknown values, and similar errors led to buffer overflows in the past:


https://www.kb.cert.org/vuls/id/523889/


That one was really nasty, because libpng is contained in a lot of
security critical software (browsers, web servers handling uploads...).
So, this IS a real problem, maybe just not common in the code that you
write.




> bytes_per_pixel is not going to be more than, say, 16 - that's for
> 32-bit per colour, including alpha channel.  No picture format is going
> to have more.  Width and height will also be limited - the highest
> resolution image sensors are in the range of 120 megapixels.  If you are
> talking about a panorama picture sewn together from 5 billion such
> camera pictures, you might have a point - and that's just using 64-bit
> types.




To get back to reality. I chose the PGM format just as an example,
because it is so simple that we could actually post the source code of
the parser in this message. However, in my previous job, I had to deal
with multidimensional scientific data sets. The uncompressed file
formats typically have a binary (or even ASCII) header and a block of
memory (the so called "raw" format), and there may be up to 5 dimensions.


A format close to PGm is NRRD:
http://teem.sourceforge.net/nrrd/format.html Software that reads/writes
this can be found here: http://fiji.sc/


The total data set size can be as large as your computer's main memory,
we had machines with up to 1 TB RAM and data sets regularly with up to
100 GB in size.


So let's say you have a 4D data set, then you get, as input


x
y
z
w
datatype


In some formats, if it's only 2 dimensions, then the 3rd and 4th are
simply set to 1. Also, there may be oddly shaped "images" with 100,000
pixels in one dimension and only 2 or 3 in another dimension. So the
"real" task is:


* Read in x, y, z, w and the data type identifier. Each of x,y,z,w must
be checked to be within the range of size_t.


* Then compute the number of bytes by multiplying these numbers. If it
is smaller then the number of bytes in main memory, alloc memory and
load the file.


The easiest way to do that is with unlimited integers, like in Python:


memsize = x*y*z*w*sizeof(datatype)


if memsize > mainmemory:
error "Out of memory"


In C++, you need to jump through strange hoops to detect the overflow.
You simply can't limit it such that it will not overflow for 64 bit -
because that would mean you restrict each dimension to 65535, which is
definitely NOT enough. Instead, what would be needed is a function
safe_multiply(x,y) which returns x*y if there is no overflow and throws
an error otherwise, so you could write:


try {
size_t memsize = safe_multiply(x,y,z,w,sizeof(datatype);
} catch (overflow_error & err) {
std::cerr<<"Not enough memory";
}




Even that safe_multiply() is extremely hard to write, unless your
compiler supports 128 bit math as an extension.


Christian


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.