Performance of floating point instructions

Re: Performance of floating point instructions

Siarhei Siamashka
Karma: 270
2010-03-10 20:33 UTC
On Wednesday 10 March 2010, Laurent Desnogues wrote:
> On Wed, Mar 10, 2010 at 8:54 PM, Siarhei Siamashka
> <siarhei.siamashka@gmail.com> wrote:
> [...]
>
> > I wonder why the compiler does not use real NEON instructions with
> > -ffast-math option, it should be quite useful even for scalar code.
> >
> > something like:
> >
> > vld1.32  {d0[0]}, [r0]
> > vadd.f32 d0, d0, d0
> > vst1.32  {d0[0]}, [r0]
> >
> > instead of:
> >
> > flds     s0, [r0]
> > fadds    s0, s0, s0
> > fsts     s0, [r0]
> >
> > for:
> >
> > *float_ptr = *float_ptr + *float_ptr;
> >
> > At least NEON is pipelined and should be a lot faster on more complex
> > code examples where it can actually benefit from pipelining. On x86, SSE2
> > is used quite nicely for floating point math.
>
> Even if fast-math is known to break some rules, it only
> breaks C rules IIRC.

If that's the case, some other option would be handy. Or even a new custom
data type like float_neon (or any other name). It is probably even possible
with C++ and operator overloading.

> OTOH, NEON FP has no support
> for NaN and other nice things from IEEE754.
>
> Anyway you're perhaps looking for -mfpu=neon, no?

I lost my faith in gcc long ago :) So I'm not really looking for anything.

--
Best regards,
Siarhei Siamashka

Re: Performance of floating point instructions

Siarhei Siamashka
Karma: 270
2010-03-10 20:54 UTC
On Wednesday 10 March 2010, Laurent Desnogues wrote:
> Even if fast-math is known to break some rules, it only
> breaks C rules IIRC. OTOH, NEON FP has no support
> for NaN and other nice things from IEEE754.

And I just checked the gcc man page to verify this.

-ffast-math
Sets -fno-math-errno, -funsafe-math-optimizations,
-ffinite-math-only, -fno-rounding-math, -fno-signaling-nans
and -fcx-limited-range.

-ffinite-math-only
Allow optimizations for floating-point arithmetic that assume that arguments
and results are not NaNs or +-Infs.

This option is not turned on by any -O option since it can result in
incorrect output for programs which depend on an exact implementation of
IEEE or ISO rules/specifications for math functions. It may, however, yield
faster code for programs that do not require the guarantees of these
specifications.

So it looks like -ffast-math already assumes no support for NaNs. Even if
there are other nice IEEE754 features preventing NEON from being used
with -ffast-math, it would make sense to introduce an appropriate new
option relaxing that requirement.

--
Best regards,
Siarhei Siamashka

Re: Performance of floating point instructions

Siarhei Siamashka
Karma: 270
2010-03-10 22:32 UTC
On Wednesday 10 March 2010, Laurent GUERBY wrote:
> On Wed, 2010-03-10 at 21:54 +0200, Siarhei Siamashka wrote:
> > I wonder why the compiler does not use real NEON instructions with
> > -ffast-math option, it should be quite useful even for scalar code.
> >
> > something like:
> >
> > vld1.32 {d0[0]}, [r0]
> > vadd.f32 d0, d0, d0
> > vst1.32 {d0[0]}, [r0]
> >
> > instead of:
> >
> > flds s0, [r0]
> > fadds s0, s0, s0
> > fsts s0, [r0]
> >
> > for:
> >
> > *float_ptr = *float_ptr + *float_ptr;
> >
> > At least NEON is pipelined and should be a lot faster on more complex
> > code examples where it can actually benefit from pipelining. On x86, SSE2
> > is used quite nicely for floating point math.
>
> Hi,
>
> Please open a report on http://gcc.gnu.org/bugzilla with your test
> sources and command line, at least GCC developpers will notice there's
> interest :).

This sounds reasonable :)

> GCC comes with some builtins for neon, they're defined in arm_neon.h
> see below.

This does not sound like a good idea. If the code has to be modified and
changed into something nonportable, there are way better options than
intrinsics.

Regarding the use of NEON instructions via C++ operator overloading:
a test program is attached.

# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -ffast-math \
#     -o neon_float neon_float.cpp

=== ieee754 floats ===

real 0m3.396s
user 0m3.391s
sys 0m0.000s

=== runfast floats ===

real 0m2.285s
user 0m2.273s
sys 0m0.008s

=== NEON C++ wrapper ===

real 0m1.312s
user 0m1.313s
sys 0m0.000s

But the quality of generated code is quite bad. That's also something to be
reported to gcc bugzilla :)

--
Best regards,
Siarhei Siamashka

#include <stdio.h>
#include <arm_neon.h>

#if 1
class fast_float
{
    float32x2_t data;
public:
    /* vdup_n_f32 avoids reading the uninitialized 'data' member that
       vset_lane_f32 on a freshly constructed object would touch */
    fast_float(float x) { data = vdup_n_f32(x); }
    fast_float(const fast_float &x) { data = x.data; }
    fast_float(const float32x2_t &x) { data = x; }
    operator float () const { return vget_lane_f32(data, 0); }

    friend fast_float operator+(const fast_float &a, const fast_float &b);
    friend fast_float operator*(const fast_float &a, const fast_float &b);

    const fast_float &operator+=(fast_float a)
    {
        data = vadd_f32(data, a.data);
        return *this;
    }
};

fast_float operator+(const fast_float &a, const fast_float &b)
{
    return vadd_f32(a.data, b.data);
}

fast_float operator*(const fast_float &a, const fast_float &b)
{
    return vmul_f32(a.data, b.data);
}
#else
typedef float fast_float;
#endif

float f(float *a, float *b)
{
    int i;
    fast_float accumulator = 0;
    for (i = 0; i < 1024; i += 16)
    {
        accumulator += (fast_float)a[i + 0] * (fast_float)b[i + 0];
        accumulator += (fast_float)a[i + 1] * (fast_float)b[i + 1];
        accumulator += (fast_float)a[i + 2] * (fast_float)b[i + 2];
        accumulator += (fast_float)a[i + 3] * (fast_float)b[i + 3];
        accumulator += (fast_float)a[i + 4] * (fast_float)b[i + 4];
        accumulator += (fast_float)a[i + 5] * (fast_float)b[i + 5];
        accumulator += (fast_float)a[i + 6] * (fast_float)b[i + 6];
        accumulator += (fast_float)a[i + 7] * (fast_float)b[i + 7];
        accumulator += (fast_float)a[i + 8] * (fast_float)b[i + 8];
        accumulator += (fast_float)a[i + 9] * (fast_float)b[i + 9];
        accumulator += (fast_float)a[i + 10] * (fast_float)b[i + 10];
        accumulator += (fast_float)a[i + 11] * (fast_float)b[i + 11];
        accumulator += (fast_float)a[i + 12] * (fast_float)b[i + 12];
        accumulator += (fast_float)a[i + 13] * (fast_float)b[i + 13];
        accumulator += (fast_float)a[i + 14] * (fast_float)b[i + 14];
        accumulator += (fast_float)a[i + 15] * (fast_float)b[i + 15];
    }
    return accumulator;
}

volatile float dummy;
float buf1[1024];
float buf2[1024];

int main()
{
    int i;
    int tmp;
    /* enable RunFast mode in the FPSCR */
    __asm__ volatile(
        "fmrx %[tmp], fpscr\n"
        "orr %[tmp], %[tmp], #(1 << 24)\n" /* flush-to-zero */
        "orr %[tmp], %[tmp], #(1 << 25)\n" /* default NaN */
        "bic %[tmp], %[tmp], #((1 << 15) | (1 << 12) | (1 << 11) | (1 << 10) | (1 << 9) | (1 << 8))\n" /* clear exception bits */
        "fmxr fpscr, %[tmp]\n"
        : [tmp] "=r" (tmp)
    );
    for (i = 0; i < 1024; i++)
    {
        buf1[i] = buf2[i] = i % 16;
    }
    for (i = 0; i < 100000; i++)
    {
        dummy = f(buf1, buf2);
    }
    printf("%f\n", (double)dummy);
    return 0;
}


Re: Performance of floating point instructions

Laurent GUERBY
Karma: 69
2010-03-10 23:18 UTC
On Thu, 2010-03-11 at 00:32 +0200, Siarhei Siamashka wrote:
> On Wednesday 10 March 2010, Laurent GUERBY wrote:
> > GCC comes with some builtins for neon, they're defined in arm_neon.h
> > see below.
>
> This does not sound like a good idea. If the code has to be modified and
> changed into something nonportable, there are way better options than
> intrinsics.

I have no idea whether this comes from a standard, but ARM seems to imply
that arm_neon.h is supposed to be supported by various toolchains:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/ch01s04s02.html
<<
GCC and RVCT support the same NEON intrinsic syntax, making C or C++
code portable between the toolchains. To add support for NEON
intrinsics, include the header file arm_neon.h. Example 1.3 implements
the same functionality as the assembler examples, using intrinsics in C
code instead of assembler instructions.
>>

(nice test :)

> But the quality of generated code is quite bad. That's also something to be
> reported to gcc bugzilla :)

Seems that in some limited cases GCC is making progress on neon:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43001

I'm building current SVN g++ for ARM to see what it does with your code
(GCC 4.4.1 gets it to run in 1.5s on an 800 MHz Efika MX box).

Sincerely,

Laurent




Re: Performance of floating point instructions

Tor Arntsen
Karma: 359
2010-03-11 10:27 UTC
[float/double/vfp/neon etc.]

There's some information collected at
http://pandorawiki.org/Floating_Point_Optimization
There are also several long threads over at the Pandora forum about
everything discussed in this thread.

-Tor

Re: Performance of floating point instructions

Alberto Mardegan
Karma: 410
2010-03-11 15:52 UTC
Siarhei Siamashka wrote:
>> The output (application compiled with -O0):
>
> Using an optimized build (-O2 or -O3) may sometimes change the overall picture
> quite dramatically. It makes almost no sense benchmarking -O0 code, because in
> this case all the local variables are kept in memory and are read/written
> before/after each operation. It's substantially different from normal code.

Right. Just to complete the picture, here's the same data with -O2:

float (fast mode enabled):
map_path_calculate_distances: 40 ms for 8250 points
map_path_calculate_distances: 2 ms for 430 points

double (fast mode enabled):
map_path_calculate_distances: 93 ms for 8250 points
map_path_calculate_distances: 4 ms for 430 points

(I'm not posting the same data with fast mode disabled, as it cannot be
worse than the -O0 case, which is anyway not too far from these values.)
The relative performance seems to be about the same. But then of course,
the bottleneck might not be the FPU but the data transfers.

Ciao,
Alberto

--
http://www.mardy.it <- geek in un lingua international!

Re: Performance of floating point instructions

Eero Tamminen
Karma: 161
2010-03-12 14:05 UTC
Hi,

ext Alberto Mardegan wrote:
> Right. Just to complete the picture, here's the same data with -O2:
>
> float (fast mode enabled):
> map_path_calculate_distances: 40 ms for 8250 points
> map_path_calculate_distances: 2 ms for 430 points
>
> double (fast mode enabled):

Note that fast mode affects only floats.


> map_path_calculate_distances: 93 ms for 8250 points
> map_path_calculate_distances: 4 ms for 430 points
>
> (I'm not posting the same data with fast mode disabled, as it cannot be
> worse than the -O0 case, which is anyway not too far from these values)
> The relative preformance seems to be about the same. But then of course,
> it might not be because of the FPU, but of the data transfers.


- Eero

Re: Performance of floating point instructions

Eero Tamminen
Karma: 161
2010-10-29 09:14 UTC
Hi,

(I resurrected this old thread because there was a comment on the
meego-dev mailing list about the possibility of the RunFast float mode
being enabled by default on MeeGo...)

ext Alberto Mardegan wrote:
> Eero Tamminen wrote:
>> Hamalainen Kimmo (Nokia-D/Helsinki) wrote:
>>> On Wed, 2010-03-10 at 12:57 +0100, ext Alberto Mardegan wrote:
>> Not the libosso osso_fpu_set_mode() function?
>
> I can't find this in libosso.h. :-(

It's defined in osso-fpu.h (since summer 2009):
http://maemo.gitorious.org/fremantle-hildon-desktop/libosso/blobs/master/src/osso-fpu.h


If somebody's using a lot of floats (RunFast mode affects only floats)
and they're a bottleneck, it would be interesting to know how much this
(setting the RunFast mode at program start) helps.


- Eero

Re: Performance of floating point instructions

Alberto Mardegan
Karma: 410
2010-10-29 11:04 UTC
On 10/29/2010 12:14 PM, Eero Tamminen wrote:
> Hi,
>
> (I resurrected this old thread because there was a comment on the
> meego-dev mailing list about the possibility of the RunFast float mode
> being enabled by default on MeeGo...)
>
> ext Alberto Mardegan wrote:
>> Eero Tamminen wrote:
>>> Hamalainen Kimmo (Nokia-D/Helsinki) wrote:
>>>> On Wed, 2010-03-10 at 12:57 +0100, ext Alberto Mardegan wrote:
>>> Not the libosso osso_fpu_set_mode() function?
>>
>> I can't find this in libosso.h. :-(
>
> It's defined in osso-fpu.h (since summer 2009):
> http://maemo.gitorious.org/fremantle-hildon-desktop/libosso/blobs/master/src/osso-fpu.h
>
>
>
> If somebody's using a lot of floats (RunFast mode affects only floats)
> and they're a bottleneck, it would be interesting to know how much this
> (setting the RunFast mode at program start) helps.

The discussion continued in the same thread:
http://lists.maemo.org/pipermail/maemo-developers/2010-March/025203.html
and
http://lists.maemo.org/pipermail/maemo-developers/2010-March/025218.html

Ciao,
Alberto

--
http://blog.mardy.it <-- geek in un lingua international!

RE: Performance of floating point instructions

<pedro.larroy at nokia.com>

2010-10-29 12:28 UTC
Are we using ARM's VFP?

That would speed things up tremendously.

Pedro.

-----Original Message-----
From: maemo-developers-bounces@maemo.org [mailto:maemo-developers-bounces@maemo.org] On Behalf Of ext Eero Tamminen
Sent: Friday, October 29, 2010 11:15
To: Maemo Mailing List
Subject: Re: Performance of floating point instructions

Hi,

(I resurrected this old thread because there was a comment on the
meego-dev mailing list about the possibility of the RunFast float mode
being enabled by default on MeeGo...)

ext Alberto Mardegan wrote:
> Eero Tamminen wrote:
>> Hamalainen Kimmo (Nokia-D/Helsinki) wrote:
>>> On Wed, 2010-03-10 at 12:57 +0100, ext Alberto Mardegan wrote:
>> Not the libosso osso_fpu_set_mode() function?
>
> I can't find this in libosso.h. :-(

It's defined in osso-fpu.h (since summer 2009):
http://maemo.gitorious.org/fremantle-hildon-desktop/libosso/blobs/master/src/osso-fpu.h


If somebody's using a lot of floats (RunFast mode affects only floats)
and they're a bottleneck, it would be interesting to know how much this
(setting the RunFast mode at program start) helps.


- Eero