Performance of floating point instructions
Re: Performance of floating point instructions

Sivan Greenberg
Hi Alberto!
On Wed, Mar 10, 2010 at 9:55 AM, Alberto Mardegan <
mardy@users.sourceforge.net> wrote:
> Hi all,
> in maemo-mapper I have a lot of code involved in doing transformations
> from latitude/longitude to Mercator coordinates (used in google maps, for
> example), calculation of distances, etc.
>
> I'm trying to use integer arithmetic as much as possible, but sometimes
> it's a bit impractical, and I wonder if it's really worth the trouble.
>
> Does anyone have any figures on how the performance of the FPU compares
> to integer operations?
>
> A practical question: should I use this way of computing the square root:
>
>
> http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Binary_numeral_system_.28base_2.29
>
> (but operating on 32 or even 64 bits), or would I be better off using sqrtf()
> or sqrt()?
>
>
> Does anyone know any tricks to optimize certain operations on arrays of
> data?
>
Basically, what we did with ThinX OS is have a full-blown soft-float
toolchain, which then used GCC's already proven and highly optimized
stack floating point operations. However, Maemo is not soft-float, so I'd
recommend experimenting with rebuilding Mapper using such a soft-float
enabled toolchain, statically linked to avoid glitches with the system's
libc, or with a separate LD_LIBRARY_PATH to avoid memory hogging, and
seeing where it gets you.
IMHO this is the best way to do FP optimization. We have experimented with
it a lot, including sqrtf and friends, to no significant improvement.
Sivan
Re: Performance of floating point instructions

Ove Kaaven
Alberto Mardegan wrote:
> Does anyone know any tricks to optimize certain operations on arrays of
> data?
The answer to that is, obviously, to use the Cortex-A-series SIMD
engine, NEON.
Supposedly you may be able to make gcc generate NEON instructions with
-mfpu=neon -ffast-math -ftree-vectorize (and perhaps -mfloat-abi=softfp,
but that's the default in the Fremantle SDK anyway), but it's still not
very good at it, so writing the asm by hand is still better... and I'm
not sure if it can automatically vectorize library calls like sqrt.
RE: Performance of floating point instructions
2010-03-10 09:50 UTC
> > in maemo-mapper I have a lot of code involved in doing
> > transformations from latitude/longitude to Mercator
> > coordinates (used in google maps, for example), calculation
> > of distances, etc.
> >
> > I'm trying to use integer arithmetic as much as
> > possible, but sometimes it's a bit impractical, and I wonder
> > if it's really worth the trouble.
Is the code slow at the moment and is it specifically the fp stuff that's
slowing it down? If not, I'd say it's probably not worth the effort unless
you're doing this for fun/out of interest.
> > Does anyone have any figures on how the performance of
> > the FPU compares to integer operations?
> >
> > A practical question: should I use this way of
> > computing the square root:
> >
> > http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Binary_numeral_system_.28base_2.29
> >
> > (but operating on 32 or even 64 bits), or would I be
> > better off using sqrtf() or sqrt()?
I'd suggest writing some benchmark code for the functions you wish to
compare.
> > Does anyone know any tricks to optimize certain
> > operations on arrays of data?
There are SIMD extensions
(http://www.arm.com/products/processors/technologies/dsp-simd.php).
> Basically, what we did with ThinX OS is have a full-blown
> soft-float toolchain, which then used GCC's already proven and
> highly optimized stack floating point operations.
> However, Maemo is not soft-float, so I'd recommend
> experimenting with rebuilding Mapper using such a soft-float
> enabled toolchain, statically linked to avoid glitches with the
> system's libc, or with a separate LD_LIBRARY_PATH to avoid
> memory hogging, and see where it gets you.
Soft-float is significantly slower than using VFP hard-float (using the
mfpu etc. flags on GCC, on the N900 and the N8x0 for that matter); there
should be emails containing benchmarks on the list from a long while back,
otherwise I can dig them out again. But Alberto's situation is slightly
different: his integer-only code need not deal with arbitrary fp numbers
(as the soft-float code must), since he knows what his inputs' ranges
will be. He should therefore be able to write more efficient, specialised
fixed-point integer functions that avoid conversion to and from fp form
and that trim significant figures to the minimum he requires.
Cheers,
Simon
Re: Performance of floating point instructions

Laurent Desnogues
On Wed, Mar 10, 2010 at 10:46 AM, Ove Kaaven <ovek@arcticnet.no> wrote:
> Alberto Mardegan wrote:
>> Does anyone know any tricks to optimize certain operations on arrays of
>> data?
>
> The answer to that is, obviously, to use the Cortex-A-series SIMD
> engine, NEON.
>
> Supposedly you may be able to make gcc generate NEON instructions with
> -mfpu=neon -ffast-math -ftree-vectorize (and perhaps -mfloat-abi=softfp,
> but that's the default in the Fremantle SDK anyway), but it's still not
> very good at it, so writing the asm by hand is still better... and I'm
> not sure if it can automatically vectorize library calls like sqrt.
One has to be careful with that approach: a Cortex-A9 SoC won't
necessarily come with a NEON SIMD unit, as it's optional. So it'd
be better to also include code that doesn't assume a NEON unit
is present.
Laurent
Re: Performance of floating point instructions
2010-03-10 10:25 UTC
On Wednesday, 10 March 2010 at 11:14:14, Laurent Desnogues wrote:
> One has to be careful with that approach: Cortex-A9 SoC won't
> necessarily come with a NEON SIMD unit, as it's optional. So it'd
> be better to also include code that doesn't assume one has a
> NEON unit.
Or if someone tries to run a new version of maemo-mapper on an N8x0, for example.
Regards,
--
JID: hrw@jabber.org
Website: http://marcin.juszkiewicz.com.pl/
LinkedIn: http://www.linkedin.com/in/marcinjuszkiewicz
Re: Performance of floating point instructions
2010-03-10 11:39 UTC
On Wed, 2010-03-10 at 10:46 +0100, ext Ove Kaaven wrote:
> Alberto Mardegan wrote:
> > Does anyone know any tricks to optimize certain operations on arrays of
> > data?
>
> The answer to that is, obviously, to use the Cortex-A-series SIMD
> engine, NEON.
>
> Supposedly you may be able to make gcc generate NEON instructions with
> -mfpu=neon -ffast-math -ftree-vectorize (and perhaps -mfloat-abi=softfp,
> but that's the default in the Fremantle SDK anyway), but it's still not
> very good at it, so writing the asm by hand is still better... and I'm
> not sure if it can automatically vectorize library calls like sqrt.
You can also put the CPU into a "fast floats" mode, see hd_fpu_set_mode()
in
http://maemo.gitorious.org/fremantle-hildon-desktop/hildon-desktop/blobs/master/src/main.c
The N900 also has support for NEON instructions.
-Kimmo
Re: Performance of floating point instructions
2010-03-10 11:57 UTC
Kimmo Hämäläinen wrote:
> You can also put the CPU into a "fast floats" mode, see hd_fpu_set_mode()
> in
> http://maemo.gitorious.org/fremantle-hildon-desktop/hildon-desktop/blobs/master/src/main.c
>
> N900 has support for NEON instructions also.
This sounds interesting!
Is there any performance penalty if this switch is done often?
Ciao,
Alberto
--
http://www.mardy.it <-- geek in un lingua international!
Re: Performance of floating point instructions
2010-03-10 12:53 UTC
On Wed, 2010-03-10 at 12:57 +0100, ext Alberto Mardegan wrote:
> Kimmo Hämäläinen wrote:
> > You can also put the CPU into a "fast floats" mode, see hd_fpu_set_mode()
> > in
> > http://maemo.gitorious.org/fremantle-hildon-desktop/hildon-desktop/blobs/master/src/main.c
> >
> > N900 has support for NEON instructions also.
>
> This sounds interesting!
>
> Is there any performance penalty if this switch is done often?
IIRC, there was not. Leonid Moiseichuk was testing this about a year
ago, and he noticed almost 50% speed-up for floats. Notice that this
affects only floats, not doubles, and that there is a small accuracy
penalty.
-Kimmo
Re: Performance of floating point instructions
2010-03-10 16:20 UTC
Hi,
ext Alberto Mardegan wrote:
> Kimmo Hämäläinen wrote:
>> You can also put the CPU into a "fast floats" mode, see hd_fpu_set_mode()
>> in
>> http://maemo.gitorious.org/fremantle-hildon-desktop/hildon-desktop/blobs/master/src/main.c
>>
>> N900 has support for NEON instructions also.
>
> This sounds interesting!
>
> Is there any performance penalty if this switch is done often?
Why would you switch it off?
Operations on "fast floats" aren't IEEE-compatible, but as far as
I've understood, they should differ only for numbers that are very close
to zero: close enough that repeating your algorithm a few more times
would produce a divide by zero even with IEEE semantics (i.e. if "fast
float" causes you issues, it most likely indicates an issue in your
algorithm).
- Eero
in maemo-mapper I have a lot of code involved in doing transformations from
latitude/longitude to Mercator coordinates (used in google maps, for example),
calculation of distances, etc.
I'm trying to use integer arithmetic as much as possible, but sometimes it's a
bit impractical, and I wonder if it's really worth the trouble.
Does anyone have any figures on how the performance of the FPU compares to
integer operations?
A practical question: should I use this way of computing the square root:
http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Binary_numeral_system_.28base_2.29
(but operating on 32 or even 64 bits), or would I be better off using sqrtf() or sqrt()?
Does anyone know any tricks to optimize certain operations on arrays of data?
Ciao,
Alberto
--
http://www.mardy.it <-- geek in un lingua international!