BeagleBone Black: slow as a dog

All benchmarks are artificial, but this one had me scratching my head. One hears that the BeagleBone Black is screamingly fast compared to the Raspberry Pi; faster, newer processor, blahdeblah, mcbtyc, etc. I found the opposite to be true.

So I buy one at the exceptionally soggy Toronto Mini Maker Faire. Props to the CircuitCo folks: the boards are easy to set up, since a single mini-USB cable provides both power and a virtual network shell. And BoneScript (an Arduino-like JavaScript library) is very clever indeed. But I need to see if this thing has any grunt, and so I need a benchmark.

After hearing about the business-card raytracer, I thought it would be perfect. I compiled it on both machines with:

g++ -Ofast card.cpp -o card

and then ran it with:

time ./card > /dev/null

The results are … surprising:

  • Raspberry Pi: 4 min 15 s
  • BeagleBone Black: 12 min 39 s → about 3× slower

(In comparison, my i7 quad-core laptop runs it in 8½ seconds.)

I don’t have any explanation for why the BBB is so much slower. It’s almost as if the compiler isn’t fully optimizing under Ångström Linux.

Raspberry Pi: system info

$ uname -a
Linux rpi 3.6.11+ #538 PREEMPT Fri Aug 30 20:42:08 BST 2013 armv6l GNU/Linux

$ cat /proc/cpuinfo 
Processor    : ARMv6-compatible processor rev 7 (v6l)
BogoMIPS    : 697.95
Features    : swp half thumb fastmult vfp edsp java tls 
CPU implementer    : 0x41
CPU architecture: 7
CPU variant    : 0x0
CPU part    : 0xb76
CPU revision    : 7

Hardware    : BCM2708
Revision    : 000f

BeagleBone Black: system info

# uname -a
Linux beaglebone 3.8.13 #1 SMP Tue Jun 18 02:11:09 EDT 2013 armv7l GNU/Linux
# cat /proc/cpuinfo 
processor    : 0
model name    : ARMv7 Processor rev 2 (v7l)
BogoMIPS    : 297.40
Features    : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls 
CPU implementer    : 0x41
CPU architecture: 7
CPU variant    : 0x3
CPU part    : 0xc08
CPU revision    : 2

Hardware    : Generic AM33XX (Flattened Device Tree)
Revision    : 0000

Both boards are running at stock speed.

Update: I’ve tried an external power supply and checked that the processor was running at full speed. It made no difference. I suspect that Raspbian uses hard-float (armhf) floating point by default, while Ångström needs to be told to use it.


15 thoughts on “BeagleBone Black: slow as a dog”

  1. I ran this test on my original BeagleBone (not the Black), with the clock forced to 720 MHz, on Ubuntu 13.10 Saucy:

    ubuntu@arm:~/temp$ g++ -Ofast card.cpp -o card
    ubuntu@arm:~/temp$ time ./card > /dev/null

    real 11m7.456s
    user 11m4.244s
    sys 0m0.125s

    uname -a output:

    3.8.13-bone28 #1 SMP Fri Sep 13 01:11:14 UTC 2013 armv7l armv7l armv7l GNU/Linux

    cat /proc/cpuinfo output:
    processor : 0
    model name : ARMv7 Processor rev 2 (v7l)
    BogoMIPS : 181.83
    Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls
    CPU implementer : 0x41
    CPU architecture: 7
    CPU variant : 0x3
    CPU part : 0xc08
    CPU revision : 2

    Hardware : Generic AM33XX (Flattened Device Tree)
    Revision : 0000
    Serial : 0000000000000000

    Interesting how a slower CPU with slower RAM can run a little faster. Perhaps you are right about the distro?

  2. Also, -march=native only reduced execution time by a few seconds; its impact is negligible.

    I forgot to include my g++ -v output:

    Using built-in specs.
    COLLECT_GCC=g++
    COLLECT_LTO_WRAPPER=/usr/lib/gcc/arm-linux-gnueabihf/4.8/lto-wrapper
    Target: arm-linux-gnueabihf
    Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.8.1-10ubuntu8' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-libitm --disable-libquadmath --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-armhf/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-armhf --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-armhf --with-arch-directory=arm --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --enable-multilib --disable-sjlj-exceptions --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=hard --with-mode=thumb --disable-werror --enable-checking=release --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
    Thread model: posix
    gcc version 4.8.1 (Ubuntu/Linaro 4.8.1-10ubuntu8)

  3. BeagleBones don’t run at top speed if powered only over USB. The power adapter must be connected.

  4. Hmm, I tried it with an external power supply and the frequency set to the maximum (1000 MHz), and I got the same result: 12 min. It is weird.

    Is your Raspberry Pi overclocked? Maybe a PREEMPT Linux kernel gives better results… I don’t know.

  5. I booted Debian from an SD card and tried it again.

    System information:
    Linux debian-armhf 3.8.13-bone30 #1 SMP Thu Nov 14 02:59:07 UTC 2013 armv7l GNU/Linux

    processor : 0
    model name : ARMv7 Processor rev 2 (v7l)
    BogoMIPS : 663.07
    Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls
    CPU implementer : 0x41
    CPU architecture: 7
    CPU variant : 0x3
    CPU part : 0xc08
    CPU revision : 2

    Hardware : Generic AM33XX (Flattened Device Tree)
    Revision : 0000
    Serial : 0000000000000000

    Compiled with:
    g++ -Ofast card.cpp -o card -mfloat-abi=hard

    Result:
    real 4m55.239s
    user 4m55.117s
    sys 0m0.008s

  6. Yes, I’ve now found out that Ångström only runs soft float, and to get peak performance you need a command line like:

    g++ -o card-bbb -march=armv7-a -mtune=cortex-a8 -mfloat-abi=hard -mfpu=neon -ffast-math -O3 card.cpp -lm

  7. My BBB reports BogoMIPS of 990+, and with a single-cable connection the same exercise yields a time of:
    real 12m56.424s
    user 12m41.154s
    sys 0m0.191s

  8. Ooh, painful; I managed to put a dev system onto an Intel Galileo. I may be doing something wrong:

    time ./card > /dev/null

    real 13m19.794s
    user 13m11.310s
    sys 0m3.560s

    =======
    # uname -a
    Linux clanton 3.8.7-yocto-standard #1 Tue Oct 1 00:09:01 IST 2013 i586 GNU/Linux
    # cat /proc/cpuinfo
    processor : 0
    vendor_id : GenuineIntel
    cpu family : 5
    model : 9
    model name : 05/09
    stepping : 0
    cpu MHz : 399.100
    cache size : 0 KB
    fdiv_bug : no
    hlt_bug : no
    f00f_bug : yes
    coma_bug : no
    fpu : yes
    fpu_exception : yes
    cpuid level : 7
    wp : yes
    flags : fpu vme pse tsc msr pae cx8 apic pge pbe nx smep
    bogomips : 798.20
    clflush size : 32
    cache_alignment : 32
    address sizes : 32 bits physical, 32 bits virtual
    power management:

  9. Note that in this case, the Raspberry Pi’s ARM11 CPU is actually superior!

    The ARM11 has a nice, pipelined VFP implementation. The Cortex-A8 only has a stripped-down VFP-lite configuration that is not pipelined. You can force the A8 to use the NEON unit for some VFP instructions (the so-called RunFast mode), but that has its limits and isn’t trivial to use from C.

    So even with hard float configured correctly, the Cortex-A8 is simply unlikely to outperform the ARM11 clock for clock.
