Performance of floating point instructions

Re: Performance of floating point instructions

Eero Tamminen
Karma: 161
2010-03-10 16:47 UTC
Hi,

Hamalainen Kimmo (Nokia-D/Helsinki) wrote:
> On Wed, 2010-03-10 at 12:57 +0100, ext Alberto Mardegan wrote:
>> Kimmo Hämäläinen wrote:
>>> You can also put the CPU to a "fast floats" mode, see hd_fpu_set_mode()
>>> in
>>> http://maemo.gitorious.org/fremantle-hildon-desktop/hildon-desktop/blobs/master/src/main.c

Not the libosso osso_fpu_set_mode() function?


- Eero

Re: Performance of floating point instructions

Alberto Mardegan
Karma: 410
2010-03-10 17:19 UTC
Eero Tamminen wrote:
> Hamalainen Kimmo (Nokia-D/Helsinki) wrote:
>> On Wed, 2010-03-10 at 12:57 +0100, ext Alberto Mardegan wrote:
>>> Kimmo Hämäläinen wrote:
>>>> You can also put the CPU to a "fast floats" mode, see hd_fpu_set_mode()
>>>> in
>>>> http://maemo.gitorious.org/fremantle-hildon-desktop/hildon-desktop/blobs/master/src/main.c
>>>>
>
> Not the libosso osso_fpu_set_mode() function?

I can't find this in libosso.h. :-(
I'll copy Kimmo's code.
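
For readers without access to that file, here is a minimal sketch of what such a "fast floats" switch might look like on an ARM VFP target. It is not copied from hildon-desktop; the names runfast_fpscr() and enable_fast_float_mode() are made up for illustration, and the FPSCR bit positions (FZ = bit 24, DN = bit 25, trap enables in bits 8-12 and 15) should be checked against the ARM manual before relying on them. On non-ARM builds the function does nothing and returns 0.

```c
#include <assert.h>
#include <stdint.h>

/* FPSCR bit masks (ARM VFP); positions are assumptions, verify them. */
#define FPSCR_FZ        (UINT32_C(1) << 24)   /* flush denormals to zero */
#define FPSCR_DN        (UINT32_C(1) << 25)   /* default NaN mode */
#define FPSCR_TRAP_MASK ((UINT32_C(1) << 15) | (UINT32_C(0x1f) << 8))

/* Pure helper: compute the "fast" FPSCR value from the current one. */
uint32_t runfast_fpscr(uint32_t fpscr)
{
    fpscr |= FPSCR_FZ | FPSCR_DN;   /* enable flush-to-zero and default NaN */
    fpscr &= ~FPSCR_TRAP_MASK;      /* disable all FP exception traps */
    return fpscr;
}

/* Hypothetical name. Returns 1 if the mode was switched, 0 if this
 * build has no VFP hardware support to switch. */
int enable_fast_float_mode(void)
{
    /* Sanity check on the helper: no trap enable bit may survive. */
    assert((runfast_fpscr(UINT32_C(0xffffffff)) & FPSCR_TRAP_MASK) == 0);
#if defined(__arm__) && defined(__VFP_FP__) && !defined(__SOFTFP__)
    uint32_t fpscr;
    __asm__ volatile("fmrx %0, fpscr" : "=r"(fpscr));
    fpscr = runfast_fpscr(fpscr);
    __asm__ volatile("fmxr fpscr, %0" : : "r"(fpscr));
    return 1;
#else
    return 0;
#endif
}
```

Calling enable_fast_float_mode() once at startup should be enough for a single-threaded program; whether other threads pick up the setting is something I have not checked.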


--
http://www.mardy.it <- geek in un lingua international!

Re: Performance of floating point instructions

Alberto Mardegan
Karma: 410
2010-03-10 17:21 UTC
Eero Tamminen wrote:
>> Is there any performance penalty if this switch is done often?
>
> Why you would switch it off?
>
> Operations on "fast floats" aren't IEEE compatible, but as far as
> I've understood, they should differ only for numbers that are very close
> to zero, close enough that repeating your algorithm few more times would
> produce divide by zero even with IEEE semantics (i.e. if "fast float"
> causes you issues, it's indicating that there's most likely some issue
> in your algorithm).

Ok, I thought the precision loss would be more noticeable, but as we are
talking about latitude and longitude (and anyway the GPS accuracy is not
so great), I guess I don't have any need to turn it off.
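
To put a rough number on that: the spacing between adjacent 32-bit floats depends only on the magnitude of the value, so the coarsest step for a coordinate in degrees can be worked out directly. The sketch below does this without libm; the helper names and the round figure of 111320 metres per degree of latitude are assumptions for illustration. This is ordinary float rounding and has nothing to do with the fast mode, which per the quote above only affects values very close to zero.

```c
#include <assert.h>
#include <float.h>

/* Approximate metres per degree of latitude (assumed round figure). */
#define METRES_PER_DEGREE 111320.0

/* Spacing between adjacent floats around x (one ulp).
 * Valid for finite values with |x| >= 1. */
float float_spacing(float x)
{
    float mag = x < 0.0f ? -x : x;
    float pow2 = 1.0f;

    assert(mag >= 1.0f && mag <= FLT_MAX);
    /* Find the largest power of two not exceeding |x|. */
    while (pow2 * 2.0f <= mag)
        pow2 *= 2.0f;
    return pow2 * FLT_EPSILON;
}

/* Ground distance that one float step represents at this coordinate. */
double float_resolution_metres(float degrees)
{
    return (double)float_spacing(degrees) * METRES_PER_DEGREE;
}
```

Under these assumptions float_resolution_metres(180.0f) comes out at about 1.7 m and float_resolution_metres(45.0f) at about 0.4 m, which is in the same range as typical GPS error.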

Anyway, I'm doing some benchmarks, I'll post the results soon.


--
http://www.mardy.it <- geek in un lingua international!

Re: Performance of floating point instructions

Alberto Mardegan
Karma: 410
2010-03-10 18:29 UTC
Alberto Mardegan wrote:
> Does one have any figure about how the performance of the FPU is,
> compared to integer operations?

I added some profiling to the code and measured the time spent in a
function that operates on an array of points (whose coordinates are
integers), transforms each of them into geographic coordinates
(latitude and longitude, floating point), and calculates the distance
from the previous point.

http://vcs.maemo.org/git?p=maemo-mapper;a=shortlog;h=refs/heads/gps_control
map_path_calculate_distances() is in path.c,
calculate_distance() is in utils.c,
unit2latlon() is a pointer to unit2latlon_google() in tile_source.c
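
The profiling code itself is not shown here; a minimal sketch of one way to take such a measurement, assuming a POSIX system with clock_gettime(), could look like this. The names elapsed_ms(), time_call_ms() and dummy_workload() are hypothetical, and dummy_workload() merely stands in for map_path_calculate_distances().

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Difference between two timestamps in milliseconds. */
double elapsed_ms(const struct timespec *start, const struct timespec *end)
{
    assert(start != NULL && end != NULL);
    return (double)(end->tv_sec - start->tv_sec) * 1000.0
         + (double)(end->tv_nsec - start->tv_nsec) / 1e6;
}

/* Time a single call of fn(arg) with the monotonic clock. */
double time_call_ms(void (*fn)(void *), void *arg)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ms(&t0, &t1);
}

/* Placeholder workload; arg points to the iteration count. */
void dummy_workload(void *arg)
{
    volatile float acc = 0.0f;   /* volatile so the loop is not removed */
    int n = *(int *)arg;

    for (int i = 0; i < n; i++)
        acc += (float)i * 0.5f;
}
```

With a resolution of whole milliseconds, the 430-point runs below are too short to compare reliably; repeating the call many times and dividing would give steadier numbers.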


The output (application compiled with -O0):


double:

map_path_calculate_distances: 110 ms for 8250 points
map_path_calculate_distances: 5 ms for 430 points

map_path_calculate_distances: 109 ms for 8250 points
map_path_calculate_distances: 5 ms for 430 points


float:

map_path_calculate_distances: 60 ms for 8250 points
map_path_calculate_distances: 3 ms for 430 points

map_path_calculate_distances: 60 ms for 8250 points
map_path_calculate_distances: 3 ms for 430 points


float with fast FPU mode:

map_path_calculate_distances: 50 ms for 8250 points
map_path_calculate_distances: 2 ms for 430 points

map_path_calculate_distances: 50 ms for 8250 points
map_path_calculate_distances: 2 ms for 430 points


So it seems that there's a huge improvement when switching from
doubles to floats, although I wonder whether it's because of the FPU or just
because the amount of data passed around is smaller.
On the other hand, the improvement obtained by enabling the fast FPU
mode is rather small -- but that might be because FPU
operations are not a major player in this piece of code.
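
Put in per-point terms (plain arithmetic on the 8250-point figures above, nothing newly measured): 110 ms, 60 ms and 50 ms work out to roughly 13.3, 7.3 and 6.1 microseconds per point, i.e. about 1.8x from double to float and 1.2x from float to float with fast mode. The two helper names below are made up for the illustration.

```c
#include <assert.h>

/* Average cost of one point in microseconds. */
double us_per_point(double total_ms, int points)
{
    assert(points > 0);
    return total_ms * 1000.0 / points;
}

/* How many times faster the "after" run is than the "before" run. */
double speedup(double before_ms, double after_ms)
{
    assert(after_ms > 0.0);
    return before_ms / after_ms;
}
```

For example, us_per_point(110.0, 8250) gives the double figure and speedup(60.0, 50.0) the gain from the fast mode.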

One curious thing is that while making these changes, I forgot to change
the math functions to their float versions, so that instead of using:

float x, y;
x = sinf(y);

I was using:

float x, y;
x = sin(y);

The timings obtained this way are surprisingly (at least to me) bad:

map_path_calculate_distances: 552 ms for 8250 points
map_path_calculate_distances: 92 ms for 430 points

map_path_calculate_distances: 552 ms for 8250 points
map_path_calculate_distances: 91 ms for 430 points

Much worse than the double version. The only reason I can think of is
the conversion from float to double and back, but is it really
that expensive?
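
For clarity, this is what the two calls mean at the language level: with a float argument, sin() converts it to double, runs the double-precision routine and the result is converted back on assignment, while sinf() stays in float throughout. The sketch below only makes those types visible; it says nothing about why the slowdown is as large as measured. The function names are hypothetical, and sizeof does not evaluate its operand, so no math library call is actually made.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Size of the value produced by sin(y) when y is a float. */
size_t result_size_sin(float y)
{
    return sizeof(sin(y));    /* y is promoted; the result is a double */
}

/* Size of the value produced by sinf(y) when y is a float. */
size_t result_size_sinf(float y)
{
    return sizeof(sinf(y));   /* argument and result are both float */
}

/* Aborts if the types are not what the C standard prescribes. */
void check_sin_types(float y)
{
    assert(result_size_sin(y) == sizeof(double));
    assert(result_size_sinf(y) == sizeof(float));
}
```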

Anyway, I'll stick to using 32bit floats. :-)

--
http://www.mardy.it <- geek in un lingua international!

Re: Performance of floating point instructions

Bernd Stramm
Karma: 9
2010-03-10 18:52 UTC
On Wed, 2010-03-10 at 20:29 +0200, Alberto Mardegan wrote:
> Alberto Mardegan wrote:
> > Does one have any figure about how the performance of the FPU is,
> > compared to integer operations?
>
> I added some profiling to the code, and I measured the time spent by a
> function which is operating on an array of points (whose coordinates are
> integers) and trasforming each of them into a geographic coordinates
> (latitude and longitude, floating point) and calculating the distance
> from the previous point.
>
> http://vcs.maemo.org/git?p=maemo-mapper;a=shortlog;h=refs/heads/gps_control
> map_path_calculate_distances() is in path.c,
> calculate_distance() is in utils.c,
> unit2latlon() is a pointer to unit2latlon_google() in tile_source.c
>
>
> The output (application compiled with -O0):
>
>
> double:
>
> map_path_calculate_distances: 110 ms for 8250 points
> map_path_calculate_distances: 5 ms for 430 points
>
> map_path_calculate_distances: 109 ms for 8250 points
> map_path_calculate_distances: 5 ms for 430 points
>
>
> float:
>
> map_path_calculate_distances: 60 ms for 8250 points
> map_path_calculate_distances: 3 ms for 430 points
>
> map_path_calculate_distances: 60 ms for 8250 points
> map_path_calculate_distances: 3 ms for 430 points
>
>
> float with fast FPU mode:
>
> map_path_calculate_distances: 50 ms for 8250 points
> map_path_calculate_distances: 2 ms for 430 points
>
> map_path_calculate_distances: 50 ms for 8250 points
> map_path_calculate_distances: 2 ms for 430 points
>
>
> So, it seems that there's a huge improvements when switching from
> doubles to floats; although I wonder if it's because of the FPU or just
> because the amount of data passed around is smaller.

Right, is your experiment actually measuring floating point performance,
or is that swamped out by memory accesses, or some bus transfers or
something like that?

> On the other hand, the improvements obtained by enabling the fast FPU
> mode is rather small -- but that might be due to the fact that the FPU
> operations are not a major player in this piece of code.
>
> One curious thing is that while making these changes, I forgot to change
> the math functions to there float version, so that instead of using:
>
> float x, y;
> x = sinf(y);
>
> I was using:
>
> float x, y;
> x = sin(y);
>
> The timings obtained this way are surprisingly (at least to me) bad:
>
> map_path_calculate_distances: 552 ms for 8250 points
> map_path_calculate_distances: 92 ms for 430 points
>
> map_path_calculate_distances: 552 ms for 8250 points
> map_path_calculate_distances: 91 ms for 430 points
>
> Much worse than the double version. The only reason I can think of, is
> the conversion from float to double and vice versa, but is it really
> that expensive?
>
> Anyway, I'll stick to using 32bit floats. :-)
>

It is often hard to tell how much difference optimizing a particular
operation makes. If the setup is cheaper for the slower operation, do
you gain anything by using faster ops? Hard to measure sometimes.

Like racing, it's not how fast you go, it's when you get there.

Bernd




Re: Performance of floating point instructions

Laurent Desnogues

2010-03-10 18:54 UTC
On Wed, Mar 10, 2010 at 7:29 PM, Alberto Mardegan
<mardy@users.sourceforge.net> wrote:
> Alberto Mardegan wrote:
>>
>> Does one have any figure about how the performance of the FPU is, compared
>> to integer operations?
>
> I added some profiling to the code, and I measured the time spent by a
> function which is operating on an array of points (whose coordinates are
> integers) and trasforming each of them into a geographic coordinates
> (latitude and longitude, floating point) and calculating the distance from
> the previous point.
>
> http://vcs.maemo.org/git?p=maemo-mapper;a=shortlog;h=refs/heads/gps_control
> map_path_calculate_distances() is in path.c,
> calculate_distance() is in utils.c,
> unit2latlon() is a pointer to unit2latlon_google() in tile_source.c
>
>
> The output (application compiled with -O0):
>
>
> double:
>
> map_path_calculate_distances: 110 ms for 8250 points
> map_path_calculate_distances: 5 ms for 430 points
>
> map_path_calculate_distances: 109 ms for 8250 points
> map_path_calculate_distances: 5 ms for 430 points
>
>
> float:
>
> map_path_calculate_distances: 60 ms for 8250 points
> map_path_calculate_distances: 3 ms for 430 points
>
> map_path_calculate_distances: 60 ms for 8250 points
> map_path_calculate_distances: 3 ms for 430 points
>
>
> float with fast FPU mode:
>
> map_path_calculate_distances: 50 ms for 8250 points
> map_path_calculate_distances: 2 ms for 430 points
>
> map_path_calculate_distances: 50 ms for 8250 points
> map_path_calculate_distances: 2 ms for 430 points
>
>
> So, it seems that there's a huge improvements when switching from doubles to
> floats; although I wonder if it's because of the FPU or just because the
> amount of data passed around is smaller.
> On the other hand, the improvements obtained by enabling the fast FPU mode
> is rather small -- but that might be due to the fact that the FPU operations
> are not a major player in this piece of code.

The "fast" mode only gains 1 or 2 cycles per FP instruction.
The FPU on Cortex-A8 is not pipelined and the fast mode
can't change that :-)

> One curious thing is that while making these changes, I forgot to change the
> math functions to there float version, so that instead of using:
>
> float x, y;
> x = sinf(y);
>
> I was using:
>
> float x, y;
> x = sin(y);
>
> The timings obtained this way are surprisingly (at least to me) bad:
>
> map_path_calculate_distances: 552 ms for 8250 points
> map_path_calculate_distances: 92 ms for 430 points
>
> map_path_calculate_distances: 552 ms for 8250 points
> map_path_calculate_distances: 91 ms for 430 points
>
> Much worse than the double version. The only reason I can think of, is the
> conversion from float to double and vice versa, but is it really that
> expensive?

This looks odd given that the 2 additional instructions
take 5 and 7 cycles.

> Anyway, I'll stick to using 32bit floats. :-)

As long as it fits your needs that seems wise :)


Laurent

Re: Performance of floating point instructions

Siarhei Siamashka
Karma: 270
2010-03-10 19:34 UTC
On Wednesday 10 March 2010, Alberto Mardegan wrote:
> Alberto Mardegan wrote:
> > Does one have any figure about how the performance of the FPU is,
> > compared to integer operations?
>
> I added some profiling to the code, and I measured the time spent by a
> function which is operating on an array of points (whose coordinates are
> integers) and trasforming each of them into a geographic coordinates
> (latitude and longitude, floating point) and calculating the distance
> from the previous point.
>
> http://vcs.maemo.org/git?p=maemo-mapper;a=shortlog;h=refs/heads/gps_control
> map_path_calculate_distances() is in path.c,
> calculate_distance() is in utils.c,
> unit2latlon() is a pointer to unit2latlon_google() in tile_source.c
>
>
> The output (application compiled with -O0):

Using an optimized build (-O2 or -O3) may sometimes change the overall picture
quite dramatically. It makes almost no sense to benchmark -O0 code, because in
this case all the local variables are kept in memory and are read/written
before/after each operation. It's substantially different from normal code.
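
A small loop is enough to see the effect. The function below is only an illustration (it is not from maemo-mapper, and the name is made up): at -O0 GCC typically keeps acc, i and the arguments in stack slots and reloads and stores them around every operation, while at -O2 they normally stay in registers.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of v[i]^2 over n elements. */
float sum_of_squares(const float *v, size_t n)
{
    float acc = 0.0f;

    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        acc += v[i] * v[i];   /* at -O0: load acc, load v[i], store acc */
    return acc;
}
```

Compiling it once with "gcc -O0 -S" and once with "gcc -O2 -S" and comparing the two assembly listings shows how much of the -O0 loop is memory traffic rather than floating point work.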

--
Best regards,
Siarhei Siamashka

Re: Performance of floating point instructions

Siarhei Siamashka
Karma: 270
2010-03-10 19:54 UTC
On Wednesday 10 March 2010, Laurent Desnogues wrote:
> On Wed, Mar 10, 2010 at 7:29 PM, Alberto Mardegan
> > So, it seems that there's a huge improvements when switching from doubles
> > to floats; although I wonder if it's because of the FPU or just because
> > the amount of data passed around is smaller.
> > On the other hand, the improvements obtained by enabling the fast FPU
> > mode is rather small -- but that might be due to the fact that the FPU
> > operations are not a major player in this piece of code.
>
> The "fast" mode only gains 1 or 2 cycles per FP instruction.
> The FPU on Cortex-A8 is not pipelined and the fast mode
> can't change that :-)

It's probably
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344j/ch16s07s01.html
vs.
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344j/BCGEIHDJ.html

I wonder why the compiler does not use real NEON instructions with the
-ffast-math option; it should be quite useful even for scalar code.

something like:

vld1.32 {d0[0]}, [r0]
vadd.f32 d0, d0, d0
vst1.32 {d0[0]}, [r0]

instead of:

flds s0, [r0]
fadds s0, s0, s0
fsts s0, [r0]

for:

*float_ptr = *float_ptr + *float_ptr;

At least NEON is pipelined and should be a lot faster on more complex code
examples where it can actually benefit from pipelining. On x86, SSE2 is used
quite nicely for floating point math.

--
Best regards,
Siarhei Siamashka

Re: Performance of floating point instructions

Laurent Desnogues

2010-03-10 20:01 UTC
On Wed, Mar 10, 2010 at 8:54 PM, Siarhei Siamashka
<siarhei.siamashka@gmail.com> wrote:
[...]
> I wonder why the compiler does not use real NEON instructions with -ffast-math
> option, it should be quite useful even for scalar code.
>
> something like:
>
> vld1.32  {d0[0]}, [r0]
> vadd.f32 d0, d0, d0
> vst1.32  {d0[0]}, [r0]
>
> instead of:
>
> flds     s0, [r0]
> fadds    s0, s0, s0
> fsts     s0, [r0]
>
> for:
>
> *float_ptr = *float_ptr + *float_ptr;
>
> At least NEON is pipelined and should be a lot faster on more complex code
> examples where it can actually benefit from pipelining. On x86, SSE2 is used
> quite nicely for floating point math.

Even if fast-math is known to break some rules, it only
breaks C rules IIRC. OTOH, NEON FP has no support
for NaN and other nice things from IEEE754.

Anyway you're perhaps looking for -mfpu=neon, no?


Laurent

Re: Performance of floating point instructions

Laurent GUERBY
Karma: 69
2010-03-10 20:31 UTC
On Wed, 2010-03-10 at 21:54 +0200, Siarhei Siamashka wrote:
> I wonder why the compiler does not use real NEON instructions with -ffast-math
> option, it should be quite useful even for scalar code.
>
> something like:
>
> vld1.32 {d0[0]}, [r0]
> vadd.f32 d0, d0, d0
> vst1.32 {d0[0]}, [r0]
>
> instead of:
>
> flds s0, [r0]
> fadds s0, s0, s0
> fsts s0, [r0]
>
> for:
>
> *float_ptr = *float_ptr + *float_ptr;
>
> At least NEON is pipelined and should be a lot faster on more complex code
> examples where it can actually benefit from pipelining. On x86, SSE2 is used
> quite nicely for floating point math.

Hi,

Please open a report on http://gcc.gnu.org/bugzilla with your test
sources and command line, so that at least GCC developers will notice
there's interest :).

GCC comes with some builtins for NEON; they're defined in arm_neon.h,
see below.

Sincerely,

Laurent


typedef struct float32x2x2_t
{
  float32x2_t val[2];
} float32x2x2_t;

...

__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
vpadd_f32 (float32x2_t __a, float32x2_t __b)
{
  return (float32x2_t)__builtin_neon_vpaddv2sf (__a, __b, 3);
}
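
As an illustration of how these intrinsics can be used for the scalar case discussed in this thread (*float_ptr = *float_ptr + *float_ptr), here is a minimal sketch. The function name double_in_place() is made up; vld1_dup_f32, vadd_f32 and vst1_lane_f32 are intrinsics from arm_neon.h. On targets without NEON the code falls back to plain C, so it builds anywhere.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Doubles the float that p points to. */
void double_in_place(float *p)
{
    assert(p != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x2_t v = vld1_dup_f32(p);   /* load *p into both lanes */
    v = vadd_f32(v, v);                /* NEON add on the 64-bit register */
    vst1_lane_f32(p, v, 0);            /* store lane 0 back to *p */
#else
    *p = *p + *p;                      /* portable fallback */
#endif
}
```

With GCC on 32-bit ARM this needs -mfpu=neon (plus a matching -mfloat-abi setting) for the NEON path to be compiled in. Whether it is actually faster than the VFP version for a given piece of code has to be measured.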



