Performance of floating point instructions

Re: Performance of floating point instructions

Siarhei Siamashka
Karma: 270
2010-03-10 20:33 UTC
On Wednesday 10 March 2010, Laurent Desnogues wrote:
> On Wed, Mar 10, 2010 at 8:54 PM, Siarhei Siamashka
> <siarhei.siamashka@gmail.com> wrote:
> [...]
>
> > I wonder why the compiler does not use real NEON instructions with
> > -ffast-math option, it should be quite useful even for scalar code.
> >
> > something like:
> >
> > vld1.32  {d0[0]}, [r0]
> > vadd.f32 d0, d0, d0
> > vst1.32  {d0[0]}, [r0]
> >
> > instead of:
> >
> > flds     s0, [r0]
> > fadds    s0, s0, s0
> > fsts     s0, [r0]
> >
> > for:
> >
> > *float_ptr = *float_ptr + *float_ptr;
> >
> > At least NEON is pipelined and should be a lot faster on more complex
> > code examples where it can actually benefit from pipelining. On x86, SSE2
> > is used quite nicely for floating point math.
>
> Even if fast-math is known to break some rules, it only
> breaks C rules IIRC.

If that's the case, some other option would be handy. Or even a new custom
data type like float_neon (or any other name). It is probably even possible
with C++ and operator overloading.

> OTOH, NEON FP has no support
> for NaN and other nice things from IEEE754.
>
> Anyway you're perhaps looking for -mfpu=neon, no?

I lost my faith in gcc long ago :) So I'm not really looking for anything.

--
Best regards,
Siarhei Siamashka

Re: Performance of floating point instructions

Siarhei Siamashka
Karma: 270
2010-03-10 20:54 UTC
On Wednesday 10 March 2010, Laurent Desnogues wrote:
> Even if fast-math is known to break some rules, it only
> breaks C rules IIRC. OTOH, NEON FP has no support
> for NaN and other nice things from IEEE754.

And I just checked the gcc man page to verify this.

-ffast-math
Sets -fno-math-errno, -funsafe-math-optimizations,
-ffinite-math-only, -fno-rounding-math, -fno-signaling-nans
and -fcx-limited-range.

-ffinite-math-only
Allow optimizations for floating-point arithmetic that assume that arguments
and results are not NaNs or +-Infs.

This option is not turned on by any -O option since it can result in
incorrect output for programs which depend on an exact implementation of
IEEE or ISO rules/specifications for math functions. It may, however, yield
faster code for programs that do not require the guarantees of these
specifications.

So it looks like -ffast-math already assumes no support for NaNs. Even if
there are other nice IEEE754 features preventing NEON from being used
with -ffast-math, it would make sense to introduce an appropriate new
option relaxing that requirement.

--
Best regards,
Siarhei Siamashka

Re: Performance of floating point instructions

Siarhei Siamashka
Karma: 270
2010-03-10 22:32 UTC
On Wednesday 10 March 2010, Laurent GUERBY wrote:
> On Wed, 2010-03-10 at 21:54 +0200, Siarhei Siamashka wrote:
> > I wonder why the compiler does not use real NEON instructions with
> > -ffast-math option, it should be quite useful even for scalar code.
> >
> > something like:
> >
> > vld1.32 {d0[0]}, [r0]
> > vadd.f32 d0, d0, d0
> > vst1.32 {d0[0]}, [r0]
> >
> > instead of:
> >
> > flds s0, [r0]
> > fadds s0, s0, s0
> > fsts s0, [r0]
> >
> > for:
> >
> > *float_ptr = *float_ptr + *float_ptr;
> >
> > At least NEON is pipelined and should be a lot faster on more complex
> > code examples where it can actually benefit from pipelining. On x86, SSE2
> > is used quite nicely for floating point math.
>
> Hi,
>
> Please open a report on http://gcc.gnu.org/bugzilla with your test
> sources and command line, at least GCC developpers will notice there's
> interest :).

This sounds reasonable :)

> GCC comes with some builtins for neon, they're defined in arm_neon.h
> see below.

This does not sound like a good idea. If the code has to be modified and
changed into something nonportable, there are way better options than
intrinsics.

Regarding the use of NEON instructions via C++ operator overloading:
a test program is attached.

# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -ffast-math \
#     -o neon_float neon_float.cpp

=== ieee754 floats ===

real 0m3.396s
user 0m3.391s
sys 0m0.000s

=== runfast floats ===

real 0m2.285s
user 0m2.273s
sys 0m0.008s

=== NEON C++ wrapper ===

real 0m1.312s
user 0m1.313s
sys 0m0.000s

But the quality of generated code is quite bad. That's also something to be
reported to gcc bugzilla :)

--
Best regards,
Siarhei Siamashka

#include <stdio.h>
#include <arm_neon.h>

#if 1
class fast_float
{
    float32x2_t data;
public:
    /* vdup_n_f32 avoids reading the uninitialized 'data' member that
       vset_lane_f32 on a freshly constructed object would touch */
    fast_float(float x) { data = vdup_n_f32(x); }
    fast_float(const fast_float &x) { data = x.data; }
    fast_float(const float32x2_t &x) { data = x; }
    operator float () const { return vget_lane_f32(data, 0); }

    friend fast_float operator+(const fast_float &a, const fast_float &b);
    friend fast_float operator*(const fast_float &a, const fast_float &b);

    const fast_float &operator+=(fast_float a)
    {
        data = vadd_f32(data, a.data);
        return *this;
    }
};

fast_float operator+(const fast_float &a, const fast_float &b)
{
    return vadd_f32(a.data, b.data);
}

fast_float operator*(const fast_float &a, const fast_float &b)
{
    return vmul_f32(a.data, b.data);
}
#else
typedef float fast_float;
#endif

float f(float *a, float *b)
{
    int i;
    fast_float accumulator = 0;
    for (i = 0; i < 1024; i += 16)
    {
        accumulator += (fast_float)a[i + 0] * (fast_float)b[i + 0];
        accumulator += (fast_float)a[i + 1] * (fast_float)b[i + 1];
        accumulator += (fast_float)a[i + 2] * (fast_float)b[i + 2];
        accumulator += (fast_float)a[i + 3] * (fast_float)b[i + 3];
        accumulator += (fast_float)a[i + 4] * (fast_float)b[i + 4];
        accumulator += (fast_float)a[i + 5] * (fast_float)b[i + 5];
        accumulator += (fast_float)a[i + 6] * (fast_float)b[i + 6];
        accumulator += (fast_float)a[i + 7] * (fast_float)b[i + 7];
        accumulator += (fast_float)a[i + 8] * (fast_float)b[i + 8];
        accumulator += (fast_float)a[i + 9] * (fast_float)b[i + 9];
        accumulator += (fast_float)a[i + 10] * (fast_float)b[i + 10];
        accumulator += (fast_float)a[i + 11] * (fast_float)b[i + 11];
        accumulator += (fast_float)a[i + 12] * (fast_float)b[i + 12];
        accumulator += (fast_float)a[i + 13] * (fast_float)b[i + 13];
        accumulator += (fast_float)a[i + 14] * (fast_float)b[i + 14];
        accumulator += (fast_float)a[i + 15] * (fast_float)b[i + 15];
    }
    return accumulator;
}

volatile float dummy;
float buf1[1024];
float buf2[1024];

int main()
{
    int i;
    int tmp;
    /* enable RunFast mode in the FPSCR */
    __asm__ volatile(
        "fmrx %[tmp], fpscr\n"
        "orr %[tmp], %[tmp], #(1 << 24)\n" /* flush-to-zero */
        "orr %[tmp], %[tmp], #(1 << 25)\n" /* default NaN */
        "bic %[tmp], %[tmp], #((1 << 15) | (1 << 12) | (1 << 11) | (1 << 10) | (1 << 9) | (1 << 8))\n" /* clear exception bits */
        "fmxr fpscr, %[tmp]\n"
        : [tmp] "=r" (tmp)
    );
    for (i = 0; i < 1024; i++)
    {
        buf1[i] = buf2[i] = i % 16;
    }
    for (i = 0; i < 100000; i++)
    {
        dummy = f(buf1, buf2);
    }
    printf("%f\n", (double)dummy);
    return 0;
}


Re: Performance of floating point instructions

Laurent GUERBY
Karma: 69
2010-03-10 23:18 UTC
On Thu, 2010-03-11 at 00:32 +0200, Siarhei Siamashka wrote:
> On Wednesday 10 March 2010, Laurent GUERBY wrote:
> > GCC comes with some builtins for neon, they're defined in arm_neon.h
> > see below.
>
> This does not sound like a good idea. If the code has to be modified and
> changed into something nonportable, there are way better options than
> intrinsics.

I have no idea whether this comes from a standard, but ARM seems to imply
that arm_neon.h is supposed to be supported by various toolchains:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/ch01s04s02.html
<<
GCC and RVCT support the same NEON intrinsic syntax, making C or C++
code portable between the toolchains. To add support for NEON
intrinsics, include the header file arm_neon.h. Example 1.3 implements
the same functionality as the assembler examples, using intrinsics in C
code instead of assembler instructions.
>>

(nice test :)

> But the quality of generated code is quite bad. That's also something to be
> reported to gcc bugzilla :)

Seems that in some limited cases GCC is making progress on neon:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43001

I'm building current SVN g++ for ARM to see what it does with your code
(GCC 4.4.1 gets it to run in 1.5s on an 800 MHz Efika MX box).

Sincerely,

Laurent




Re: Performance of floating point instructions

Tor Arntsen
Karma: 359
2010-03-11 10:27 UTC
[float/double/vfp/neon etc.]

There's some information collected at
http://pandorawiki.org/Floating_Point_Optimization
There are also several long threads over at the Pandora forum about
everything discussed in this thread.

-Tor

Re: Performance of floating point instructions

Alberto Mardegan
Karma: 410
2010-03-11 15:52 UTC
Siarhei Siamashka wrote:
>> The output (application compiled with -O0):
>
> Using an optimized build (-O2 or -O3) may sometimes change the overall picture
> quite dramatically. It makes almost no sense benchmarking -O0 code, because in
> this case all the local variables are kept in memory and are read/written
> before/after each operation. It's substantially different from normal code.

Right. Just to complete the picture, here's the same data with -O2:

float (fast mode enabled):
map_path_calculate_distances: 40 ms for 8250 points
map_path_calculate_distances: 2 ms for 430 points

double (fast mode enabled):
map_path_calculate_distances: 93 ms for 8250 points
map_path_calculate_distances: 4 ms for 430 points

(I'm not posting the same data with fast mode disabled, as it cannot be
worse than the -O0 case, which is anyway not too far from these values.)
The relative performance seems to be about the same. But then of course,
the bottleneck might not be the FPU but the data transfers.

Ciao,
Alberto

--
http://www.mardy.it <- geek in un lingua international!

Re: Performance of floating point instructions

Eero Tamminen
Karma: 161
2010-03-12 14:05 UTC
Hi,

ext Alberto Mardegan wrote:
> Right. Just to complete the picture, here's the same data with -O2:
>
> float (fast mode enabled):
> map_path_calculate_distances: 40 ms for 8250 points
> map_path_calculate_distances: 2 ms for 430 points
>
> double (fast mode enabled):

Note that fast mode affects only floats.


> map_path_calculate_distances: 93 ms for 8250 points
> map_path_calculate_distances: 4 ms for 430 points
>
> (I'm not posting the same data with fast mode disabled, as it cannot be
> worse than the -O0 case, which is anyway not too far from these values)
> The relative preformance seems to be about the same. But then of course,
> it might not be because of the FPU, but of the data transfers.


- Eero

Re: Performance of floating point instructions

Eero Tamminen
Karma: 161
2010-10-29 09:14 UTC
Hi,

(I resurrected this old thread because there was a comment on the
meego-dev mailing list about the possibility of the RunFast float mode
being enabled by default on MeeGo...)

ext Alberto Mardegan wrote:
> Eero Tamminen wrote:
>> Hamalainen Kimmo (Nokia-D/Helsinki) wrote:
>>> On Wed, 2010-03-10 at 12:57 +0100, ext Alberto Mardegan wrote:
>> Not the libosso osso_fpu_set_mode() function?
>
> I can't find this in libosso.h. :-(

It's defined in osso-fpu.h (since summer 2009):
http://maemo.gitorious.org/fremantle-hildon-desktop/libosso/blobs/master/src/osso-fpu.h


If somebody's using a lot of floats (RunFast mode affects only floats)
and they're a bottleneck, it would be interesting to know how much this
(setting the RunFast mode at program start) helps.


- Eero

Re: Performance of floating point instructions

Alberto Mardegan
Karma: 410
2010-10-29 11:04 UTC
On 10/29/2010 12:14 PM, Eero Tamminen wrote:
> Hi,
>
> (I resurrected this old thread because there was a comment on the
> meego-dev mailing list about the possibility of the RunFast float mode
> being enabled by default on MeeGo...)
>
> ext Alberto Mardegan wrote:
>> Eero Tamminen wrote:
>>> Hamalainen Kimmo (Nokia-D/Helsinki) wrote:
>>>> On Wed, 2010-03-10 at 12:57 +0100, ext Alberto Mardegan wrote:
>>> Not the libosso osso_fpu_set_mode() function?
>>
>> I can't find this in libosso.h. :-(
>
> It's defined in osso-fpu.h (since summer 2009):
> http://maemo.gitorious.org/fremantle-hildon-desktop/libosso/blobs/master/src/osso-fpu.h
>
>
>
> If somebody's using a lot of floats (RunFast mode affects only floats)
> and they're a bottleneck, it would be interesting to know how much this
> (setting the RunFast mode at program start) helps.

The discussion continued in the same thread:
http://lists.maemo.org/pipermail/maemo-developers/2010-March/025203.html
and
http://lists.maemo.org/pipermail/maemo-developers/2010-March/025218.html

Ciao,
Alberto

--
http://blog.mardy.it <-- geek in un lingua international!

RE: Performance of floating point instructions

<pedro.larroy at nokia.com>

2010-10-29 12:28 UTC
Are we using ARM's VFP?

That would speed things up tremendously.

Pedro.

-----Original Message-----
From: maemo-developers-bounces@maemo.org [mailto:maemo-developers-bounces@maemo.org] On Behalf Of ext Eero Tamminen
Sent: Friday, October 29, 2010 11:15
To: Maemo Mailing List
Subject: Re: Performance of floating point instructions

Hi,

(I resurrected this old thread because there was a comment on the
meego-dev mailing list about the possibility of the RunFast float mode
being enabled by default on MeeGo...)

ext Alberto Mardegan wrote:
> Eero Tamminen wrote:
>> Hamalainen Kimmo (Nokia-D/Helsinki) wrote:
>>> On Wed, 2010-03-10 at 12:57 +0100, ext Alberto Mardegan wrote:
>> Not the libosso osso_fpu_set_mode() function?
>
> I can't find this in libosso.h. :-(

It's defined in osso-fpu.h (since summer 2009):
http://maemo.gitorious.org/fremantle-hildon-desktop/libosso/blobs/master/src/osso-fpu.h


If somebody's using a lot of floats (RunFast mode affects only floats)
and they're a bottleneck, it would be interesting to know how much this
(setting the RunFast mode at program start) helps.


- Eero