Tag: benchmark

  • MicroPython Benchmarks

    Somewhat predictably, my Parallel MicroPython Benchmarking thing got out of hand, and I’ve been scrabbling around jamming the benchmark code on every MicroPython board I can find.

    So despite WordPress’s best efforts in thwarting me from having a table here, my results are as follows, from fastest to slowest:

    Board                         | Interpreter                | CPU @ Frequency / MHz | Time / s
    ------------------------------|----------------------------|-----------------------|---------
    DevEBox STM32H7xx             | micropython 1.20.0         | STM32H743VIT6 @ 400   | 3.7
    Metro M7                      | micropython 1.24.1         | MIMXRT1011DAE5A @ 500 | 4.3
    S3 PRO                        | micropython 1.25.0.preview | ESP32S3 @ 240         | 8.9
    Raspberry Pi Pico 2 W         | micropython 1.25.0.preview | RP2350 @ 150          | 10.3
    ItsyBitsy M4 Express          | micropython 1.24.1         | SAMD51G19A @ 120      | 12.3
    pyboard v1.1                  | micropython 1.24.1         | STM32F405RG @ 168     | 13.0
    C3 mini                       | micropython 1.25.0.preview | ESP32-C3FH4 @ 160     | 13.2
    HUZZAH32 – ESP32              | micropython 1.24.1         | ESP32 @ 160           | 15.4
    S2 mini                       | micropython 1.25.0.preview | ESP32-S2FN4R2 @ 160   | 17.4
    Raspberry Pi Pico W           | micropython 1.24.1         | RP2040 @ 125          | 19.8
    WeAct BlackPill STM32F411CEU6 | micropython 1.24.0.preview | STM32F411CE @ 96      | 21.4
    W600-PICO                     | micropython 1.25.0.preview | W600-B8 @ 80          | 30.7
    LOLIN D1 mini                 | micropython 1.24.1         | ESP8266 @ 80          | 45.6

    Yes, I was very surprised that the DevEBox STM32H7 at 400 MHz was faster than the 500 MHz MIMXRT1011 in the Metro M7. What was even more impressive is that the STM32H7 board was doing all the calculations in double precision, while all the others were working in single.

    As for the other boards, the ESP32 variants performed solidly, but the ESP8266 in last place should be retired. The Raspberry Pi Pico 2 W was fairly nippy, but the original Raspberry Pi Pico is still a lowly Cortex-M0+, no matter how fast you clock it. The STM32F4 boards were slower than I expected them to be, frankly. And yay! to the plucky little W600: it comes in second last, but it’s the cheapest thing out there.
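
    One rough way to separate raw clock speed from per-cycle efficiency is to multiply each run time by the clock frequency, giving a "megacycles spent" figure where lower is better. The sketch below does that for a handful of rows copied from the results table. The metric is only an illustration, not something the benchmark reports, and it ignores differences in float precision, memory speed and firmware version.

```python
# Rough per-clock comparison using figures copied from the results table.
# "Megacycles" (MHz x seconds) is an illustrative metric, not part of the
# benchmark itself; lower means fewer clock cycles spent on the same job.
RESULTS = [
    # (board, CPU, MHz, seconds)
    ("DevEBox STM32H7xx", "STM32H743VIT6", 400, 3.7),
    ("Metro M7", "MIMXRT1011DAE5A", 500, 4.3),
    ("Raspberry Pi Pico 2 W", "RP2350", 150, 10.3),
    ("pyboard v1.1", "STM32F405RG", 168, 13.0),
    ("Raspberry Pi Pico W", "RP2040", 125, 19.8),
    ("LOLIN D1 mini", "ESP8266", 80, 45.6),
]


def megacycles(mhz, seconds):
    """Clock cycles (in millions) that elapsed during the run."""
    return mhz * seconds


def ranked_by_cycles(results):
    """Return (board, megacycles) pairs, most cycle-efficient first."""
    scored = [
        (board, megacycles(mhz, secs)) for board, _cpu, mhz, secs in results
    ]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    for board, cycles in ranked_by_cycles(RESULTS):
        print("%-24s %7.0f Mcycles" % (board, cycles))
```

    On these numbers the Pico 2 W's RP2350 gets through the job in only slightly more cycles than the STM32H7, while the original Pico's RP2040 needs about 60% more cycles than the RP2350, which fits the Cortex-M0+ remark above.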

    All of these benchmarks were made with the same code, but with two lines changed:

    1. The I2C specification, which is a minor syntax change for each board;
    2. The input trigger pin. Some boards like these as numbers, some take them as strings. Pro tip for W600 users: don’t use D0 for an input that’s tied to ground, unless you want the board to go into bootloader mode …

    I’d hoped to run these tests on the little SAMD21 microcontrollers (typically 48 MHz Cortex-M0+), but they don’t have enough memory for MicroPython’s framebuf module, so it’s omitted from the build. They would likely have been very slow, though.

    In the spirit of fairness, I also benchmarked CircuitPython on an Arduino Nano RP2040 Connect, which has the same processor as a Raspberry Pi Pico:

    Board                       | Interpreter         | CPU @ Frequency / MHz | Time / s
    ----------------------------|---------------------|-----------------------|---------
    Arduino Nano RP2040 Connect | circuitpython 9.2.3 | RP2040 @ 125          | 18.0

    So it’s about 10% quicker than MicroPython, but I had to muck around for ages fighting with CircuitPython’s all-over-the-shop documentation and ninny syntax changes. For those that like that sort of thing, I guess that’s the sort of thing they like.

  • Parallel MicroPython Benchmarking

    On the left, a Raspberry Pi Pico 2 W. On the right, a Raspberry Pi Pico. Each is connected to its own small OLED screen. When a button is pressed, both boards calculate and display the Mandelbrot set, along with the time taken. Needless to say, the Pico 2 W is quite a bit quicker.
    two small OLED screens side by side on a breadboard. They're the type that are surplus from pulse oximeter machines, so the top 16 pixels are yellow, and the rest of the rows are blue. The left screen displays: "micropython 1.25.0.preview RP2350 150 MHz 128*64; 120", while the screen on the right shows "micropython 1.24.1 RP2040 125 MHz 128*64; 120"
    the before screens …
    The same two OLED screens, this time showing a complete Mandelbrot set and an elapsed time for each microcontroller. Pico 2 comes in at 10.3 seconds, original Pico at 19.8 seconds

    Stuff I found out setting this up:

    • Some old OLEDs, like these surplus pulse oximeter ones, don’t have pull-up resistors on their data lines. I’ve carefully hidden the pull-ups I added behind the displays, but they’re there.
    • Some MicroPython ports don’t include the complex type, so I had to replace the elegant z→z²+C mapping with some ugly code.
    • Some MicroPython ports don’t have os.uname(), but sys.implementation seems to cover most of the data I need.
    • On some boards, machine.freq() is an integer value representing the CPU frequency. On others, it’s a tuple. Aargh.
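
    To show that nothing is lost by dropping the complex type, here is a small desktop-Python sketch that runs the z→z²+C iteration both ways and checks they agree. The function names are illustrative and not taken from the benchmark; the real-arithmetic version mirrors the update used in the source below.

```python
# Desktop (CPython) check that the real-arithmetic update used in the
# benchmark matches the complex mapping z -> z*z + c.
# escape_complex / escape_real are illustrative names, not from the source.


def escape_complex(c, maxit):
    """Iteration count at which |z| > 2, or None if it never escapes."""
    z = 0j
    for k in range(maxit):
        z = z * z + c
        if abs(z) > 2.0:
            return k
    return None


def escape_real(cr, ci, maxit):
    """Same test using two floats, for ports without the complex type."""
    zr = 0.0
    zi = 0.0
    for k in range(maxit):
        # (zr + i*zi)^2 + c = (zr^2 - zi^2 + cr) + i*(2*zr*zi + ci)
        zr, zi = zr * zr - zi * zi + cr, 2.0 * zr * zi + ci
        if zr * zr + zi * zi > 4.0:  # |z|^2 > 4 avoids a square root
            return k
    return None


if __name__ == "__main__":
    for point in (0j, 1 + 0j, -1 + 0j, 0.5 + 0.5j):
        print(point, escape_complex(point, 120),
              escape_real(point.real, point.imag, 120))
```

    Comparing |z|² against 4 rather than |z| against 2 is the same escape test without the square root, which matters on boards with slow floating point.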

    These displays came from the collection of the late Tom Luff, a Toronto maker who passed away late 2024 after a long illness. Tom had a huge component collection, and my way of remembering him is to show off his stuff being used.

    Source:

    # benchmark Mandelbrot set (aka Brooks-Matelski set) on OLED
    # scruss, 2025-01
    # MicroPython
    # -*- coding: utf-8 -*-
    
    from machine import Pin, I2C, idle, reset, freq
    
    # from os import uname
    from sys import implementation
    from ssd1306 import SSD1306_I2C
    from time import ticks_ms, ticks_diff
    
    # %%% These are the only things you should edit %%%
    startpin = 16  # pin for trigger configured with external pulldown
    # I2C connection for display
    i2c = I2C(1, freq=400000, scl=19, sda=18, timeout=50000)
    # %%% Stop editing here - I mean it!!!1! %%%
    
    
    # maps value between istart..istop to range ostart..ostop
    def valmap(value, istart, istop, ostart, ostop):
        return ostart + (ostop - ostart) * (
            (value - istart) / (istop - istart)
        )
    
    
    WIDTH = 128
    HEIGHT = 64
    TEXTSIZE = 8  # 16x8 text chars
    maxit = 120  # DO NOT CHANGE!
    # value of 120 gives roughly 10 second run time for Pico 2W
    
    # get some information about the board
    # thanks to projectgus for the sys.implementation tip
    if type(freq()) is int:
        f_mhz = freq() // 1_000_000
    else:
        # STM32 has freq return a tuple
        f_mhz = freq()[0] // 1_000_000
    sys_id = (
        implementation.name,
        ".".join([str(x) for x in implementation.version]).rstrip(
            "."
        ),  # version
        implementation._machine.split()[-1],  # processor
        "%d MHz" % (f_mhz),  # frequency
        "%d*%d; %d" % (WIDTH, HEIGHT, maxit),  # run parameters
    )
    
    p = Pin(startpin, Pin.IN)
    
    # displays I have are yellow/blue, have no pull-up resistors
    #  and have a confusing I2C address on the silkscreen
    oled = SSD1306_I2C(WIDTH, HEIGHT, i2c)
    oled.contrast(31)
    oled.fill(0)
    # display system info
    ypos = (HEIGHT - TEXTSIZE * len(sys_id)) // 2
    for s in sys_id:
        ts = s[: WIDTH // TEXTSIZE]
        xpos = (WIDTH - TEXTSIZE * len(ts)) // 2
        oled.text(ts, xpos, ypos)
        ypos = ypos + TEXTSIZE
    
    oled.show()
    
    while p.value() == 0:
        # wait for button press
        idle()
    
    oled.fill(0)
    oled.show()
    start = ticks_ms()
    # NB: oled.pixel() is *slow*, so only refresh once per row
    for y in range(HEIGHT):
        # complex range reversed because display axes wrong way up
        cc = valmap(float(y + 1), 1.0, float(HEIGHT), 1.2, -1.2)
        for x in range(WIDTH):
            cr = valmap(float(x + 1), 1.0, float(WIDTH), -2.8, 2.0)
            # can't use complex type as small boards don't have it (dammit)
            zr = 0.0
            zc = 0.0
            for k in range(maxit):
                t = zr
                zr = zr * zr - zc * zc + cr
                zc = 2 * t * zc + cc
                if zr * zr + zc * zc > 4.0:
                    oled.pixel(x, y, k % 2)  # set pixel if escaped
                    break
        oled.show()
    elapsed = ticks_diff(ticks_ms(), start) / 1000
    elapsed_str = "%.1f s" % elapsed
    # oled.text(" " * len(elapsed_str), 0, HEIGHT - TEXTSIZE)
    oled.rect(
        0, HEIGHT - TEXTSIZE, TEXTSIZE * len(elapsed_str), TEXTSIZE, 0, True
    )
    
    oled.text(elapsed_str, 0, HEIGHT - TEXTSIZE)
    oled.show()
    
    # we're done, so clear screen and reset after the button is pressed
    while p.value() == 0:
        idle()
    oled.fill(0)
    oled.show()
    reset()
    
    

    (also here: benchmark Mandelbrot set (aka Brooks-Matelski set) on OLED – MicroPython)
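
    The board-identification part of the listing can be tried on a desktop by passing in the values that machine.freq() and sys.implementation.version would return. This is only a sketch: the helper names are made up for illustration, and the sample tuples in the usage lines are assumed shapes that match the strings shown on the displays, not values read from real boards.

```python
# Desktop-testable sketch of the board-identification logic in the listing.
# The helper names and the sample values passed in are illustrative only.


def freq_mhz(freq_value):
    """Normalise machine.freq()'s result: an int on some ports, a tuple on STM32."""
    if isinstance(freq_value, int):
        return freq_value // 1_000_000
    # tuple form: the first element is taken as the CPU clock, as in the listing
    return freq_value[0] // 1_000_000


def version_string(version):
    """Join a sys.implementation.version tuple, dropping the trailing dot
    left behind when the last field is an empty string."""
    return ".".join(str(part) for part in version).rstrip(".")


if __name__ == "__main__":
    print(freq_mhz(125_000_000))                   # plain integer form
    print(freq_mhz((168_000_000, 168_000_000)))    # tuple form (sample values)
    print(version_string((1, 24, 1, "")))          # release build (assumed shape)
    print(version_string((1, 25, 0, "preview")))   # preview build (assumed shape)
```

    Taking the values as parameters keeps the helpers free of the machine module, so the same code runs unchanged on a board or on a laptop.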

    I will add more tests as I get to wiring up the boards. I have so many (too many?) MicroPython boards!

    Results are here: MicroPython Benchmarks

  • Raspberry Pi Zero 2 W: initial performance

    Running A Pi Pie Chart turned out some useful performance numbers. It’s almost, but not quite, a Raspberry Pi 3B in a Raspberry Pi Zero form factor.

    32-bit mode

    Running stock Raspberry Pi OS with desktop, compiled with stock options:

    pie chart comparing multi-thread numeric performance of Raspberry Pi Zero 2 W: slightly faster than a Raspberry Pi 2B
    multi-thread results
    pie chart comparing single-thread numeric performance of Raspberry Pi Zero 2 W: slightly faster than a Raspberry Pi 2B
    single-thread results
    time ./pichart-openmp -t "Zero 2W, OpenMP"
    pichart -- Raspberry Pi Performance OPENMP version 36
    
    Prime Sieve          P=14630843 Workers=4 Sec=2.18676 Mops=427.266
    Merge Sort           N=16777216 Workers=8 Sec=1.9341 Mops=208.186
    Fourier Transform    N=4194304 Workers=8 Sec=3.10982 Mflops=148.36
    Lorenz 96            N=32768 K=16384 Workers=4 Sec=4.56845 Mflops=705.102
    
    The Zero 2W, OpenMP has Raspberry Pi ratio=8.72113
    Making pie charts...done.
    
    real	8m20.245s
    user	15m27.197s
    sys	0m3.752s
    
    -----------------------------
    
    time ./pichart-serial -t "Zero 2W, Serial"
    pichart -- Raspberry Pi Performance Serial version 36
    
    Prime Sieve          P=14630843 Workers=1 Sec=8.77047 Mops=106.531
    Merge Sort           N=16777216 Workers=2 Sec=7.02049 Mops=57.354
    Fourier Transform    N=4194304 Workers=2 Sec=8.58785 Mflops=53.724
    Lorenz 96            N=32768 K=16384 Workers=1 Sec=17.1408 Mflops=187.927
    
    The Zero 2W, Serial has Raspberry Pi ratio=2.48852
    Making pie charts...done.
    
    real	7m50.524s
    user	7m48.854s
    sys	0m1.370s

    64-bit mode

    Running stock/beta 64-bit Raspberry Pi OS with desktop. Curiously, these ran out of memory (at least, in oom-kill’s opinion) with the desktop running, so I had to run them from the console. This also meant it was harder to capture the program run times.

    The firmware required to run in this mode should be in the official distribution by now.

    pie chart comparing 64 bit multi-thread numeric performance of Raspberry Pi Zero 2 W: slightly faster than a Raspberry Pi 2B
    multi-thread, 64 bit: no, I can’t explain why Lorenz is better than a 3B+
    pie chart comparing 64 bit single-thread numeric performance of Raspberry Pi Zero 2 W: slightly faster than a Raspberry Pi 2B
    single thread, again with the bump in Lorenz performance
    pichart -- Raspberry Pi Performance OPENMP version 36
    
    Prime Sieve          P=14630843 Workers=4 Sec=1.78173 Mops=524.395
    Merge Sort           N=16777216 Workers=8 Sec=1.83854 Mops=219.007
    Fourier Transform    N=4194304 Workers=4 Sec=2.83797 Mflops=162.572
    Lorenz 96            N=32768 K=16384 Workers=4 Sec=2.66808 Mflops=1207.32
    
    The Zero2W-64bit has Raspberry Pi ratio=10.8802
    Making pie charts...done.
    
    -------------------------
    
    pichart -- Raspberry Pi Performance Serial version 36
    
    Prime Sieve          P=14630843 Workers=1 Sec=7.06226 Mops=132.299
    Merge Sort           N=16777216 Workers=2 Sec=6.75762 Mops=59.5851
    Fourier Transform    N=4194304 Workers=2 Sec=7.73993 Mflops=59.6095
    Lorenz 96            N=32768 K=16384 Workers=1 Sec=9.00538 Mflops=357.7
    
    The Zero2W-64bit has Raspberry Pi ratio=3.19724
    Making pie charts...done.

    The main reason for the Raspberry Pi Zero 2 W appearing slower than the 3B and 3B+ is likely that it uses LPDDR2 memory instead of LPDDR3. 64-bit mode provides a useful performance increase, offset by increased memory use. I found desktop apps to be almost unusably swappy in 64-bit mode, but there might be some tweaking I can do to avoid this.
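
    To put a number on that increase, the sketch below compares the overall "Raspberry Pi ratio" figures quoted in the runs above. The dictionary layout and the function name are illustrative; only the four ratios come from the pichart output.

```python
# Speed-ups computed from the pichart "Raspberry Pi ratio" figures above.
# The dictionary layout and function name are illustrative.
RATIO = {
    ("32-bit", "openmp"): 8.72113,
    ("32-bit", "serial"): 2.48852,
    ("64-bit", "openmp"): 10.8802,
    ("64-bit", "serial"): 3.19724,
}


def gain(new, old):
    """Fractional improvement of `new` over `old` (0.25 means 25% faster)."""
    return new / old - 1.0


if __name__ == "__main__":
    for mode in ("openmp", "serial"):
        g = gain(RATIO[("64-bit", mode)], RATIO[("32-bit", mode)])
        print("64-bit vs 32-bit, %s: %+.0f%%" % (mode, 100 * g))
    for bits in ("32-bit", "64-bit"):
        scale = RATIO[(bits, "openmp")] / RATIO[(bits, "serial")]
        print("%s OpenMP vs serial: %.1fx" % (bits, scale))
```

    By the overall ratio, 64-bit mode comes out roughly a quarter faster in both the multi-threaded and single-threaded runs, and the four cores give about a 3.4× to 3.5× gain over a single thread.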

    Unlike the single core Raspberry Pi Zero, the Raspberry Pi Zero 2 W can be made to go into thermal throttling if you’re really, really determined. Like “3 or more cores running flat-out”-determined. In my testing, two cores at 100% (as you might get in emulation) won’t put it into thermal throttling, even in the snug official case closed up tight. More on this later.

    (And a great big raspberry blown at Make, who leaked the Raspberry Pi Zero 2 W release a couple of days ago. Not classy.)

  • bench64: a new BASIC benchmark index for 8-bit computers

    Nobody asked for this. Nobody needs this. But here we are …

    commodore 64 screen shot showing benchmark results:

    basic bench index
    >i good. ntsc c64=100

    1/8 - for:
     60 s; 674.5 /s; i= 100
    2/8 - goto:
     60 s; 442.3 /s; i= 100
    3/8 - gosub:
     60 s; 350.8 /s; i= 100
    4/8 - if:
     60 s; 242.9 /s; i= 100
    5/8 - fn:
     60 s; 60.7 /s; i= 100
    6/8 - maths:
     60 s; 6.4 /s; i= 100
    7/8 - string:
     60 s; 82.2 /s; i= 100
    8/8 - array:
     60 s; 27.9 /s; i= 100

    overall index= 100

    ready.
    bench64 running on the reference system, an NTSC Commodore 64c

    Inspired by J. G. Harston’s clever but domain-specific ClockSp benchmark, I set out to write a BASIC benchmark suite that was:

    1. more portable;
    2. based on a benchmark system that more people might own;
    3. and a bunch of other less important ideas.

    Since I already had a Commodore 64, and seemingly several million other people did too, it seemed like a fair choice to use as the reference system. But the details, so many details …

    basic bench index
    >i good. ntsc c64=100

    1/8 - for:
     309.5 s; 130.8 /s; i= 19
    2/8 - goto:
     367.8 s; 72.1 /s; i= 16
    3/8 - gosub:
     340.9 s; 61.7 /s; i= 18
    4/8 - if:
     181.8 s; 80.1 /s; i= 33
    5/8 - fn:
     135.3 s; 26.9 /s; i= 44
    6/8 - maths:
     110.1 s; 3.5 /s; i= 54
    7/8 - string:
     125.8 s; 39.2 /s; i= 48
    8/8 - array:
     103 s; 16.3 /s; i= 58

    overall index= 29
    It was entirely painful running the same code on a real ZX Spectrum at under ⅓ the speed of a C64
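
    The index values in these two runs look consistent with a simple rule: every test takes 60 s on the reference C64, so a test's index is 100 × 60 / (time taken), and the overall index is 100 × 480 / (total time). That rule is inferred from the printed numbers rather than taken from the bench64 source, so treat the sketch below as a plausibility check only; the names in it are illustrative.

```python
# Plausibility check of the index arithmetic, inferred from the screenshots
# (an assumption; not taken from the bench64 source code).
REFERENCE_SECONDS = 60.0  # every test takes 60 s on the reference NTSC C64

# Times in seconds from the ZX Spectrum run shown above
SPECTRUM_TIMES = {
    "for": 309.5, "goto": 367.8, "gosub": 340.9, "if": 181.8,
    "fn": 135.3, "maths": 110.1, "string": 125.8, "array": 103.0,
}


def index_for(seconds):
    """Index for one test: 100 means as fast as the reference machine."""
    return round(100 * REFERENCE_SECONDS / seconds)


def overall_index(times):
    """Overall index from the total run time across all tests."""
    total_reference = REFERENCE_SECONDS * len(times)
    return round(100 * total_reference / sum(times.values()))


if __name__ == "__main__":
    for name, seconds in SPECTRUM_TIMES.items():
        print("%-7s i= %d" % (name, index_for(seconds)))
    print("overall index= %d" % overall_index(SPECTRUM_TIMES))
```

    With the Spectrum times this reproduces every index on the screen, including the overall 29. Runs that finish in a few seconds will only match approximately, presumably because the displayed times are rounded to 0.1 s.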

    (I mean: who knew that Commodore PET BASIC could run faster or slower depending on how you numbered your lines? Not me. Until today, that is.)

    While the benchmark doesn’t scale well for BASIC running on modern computers (comparisons between a simple 8-bit processor at a few MHz and a wildly complex multi-core modern CPU at many GHz just aren’t applicable), it turns out I may have one of the fastest 8-bit BASIC computers around in the matchbox-sized shape of the MinZ v1.1 (36.864 MHz Z180, CP/M 2.2, BBC BASIC [Z80] v3):

    BASIC BENCH INDEX
    >I GOOD. NTSC C64=100
    
    1/8 - FOR:
     3.2 S; 12778 /S; I= 1895 
    2/8 - GOTO:
     6.1 S; 4324.5 /S; I= 978 
    3/8 - GOSUB:
     3.1 S; 6789 /S; I= 1935 
    4/8 - IF:
     2.9 S; 4966.9 /S; I= 2046 
    5/8 - FN:
     3.5 S; 1030.6 /S; I= 1698 
    6/8 - MATHS:
     1.5 S; 255.3 /S; I= 4000 
    7/8 - STRING:
     2.6 S; 1871.6 /S; I= 2279 
    8/8 - ARRAY:
     3.1 S; 540.3 /S; I= 1935 
    
    OVERALL INDEX= 1839 
    

    That’s more than 9× the speed of a BBC Micro Model B.

    GitHub link: bench64 – a new BASIC benchmark index for 8-bit computers.


  • BeagleBone Black: slow as a dog

    All benchmarks are artificial, but this one had me scratching my head. One hears that the BeagleBone Black is screamingly fast compared to the Raspberry Pi; faster, newer processor, blahdeblah, mcbtyc, etc. I found the opposite is true.

    So I buy one at the exceptionally soggy Toronto Mini Maker Faire. Props to the CircuitCo folks, they are easy to set up: just a mini-USB cable provides power and virtual network shell. And BoneScript — an Arduino-like JavaScript library — is very clever indeed. But I need to see if this thing has any grunt, and so I need a benchmark.

    After hearing about the business-card raytracer, I thought it would be perfect. I compiled it on both machines with:

    g++  -Ofast   card.cpp   -o card

    and then ran it with:

    time ./card > /dev/null

    The results are … surprising:

    • Raspberry Pi: 4′ 15″
    • BeagleBone Black: 12′ 39″ → 3× slower

    (In comparison, my i7 quad-core laptop runs it in 8½ seconds.)
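
    For anyone checking the arithmetic, this short sketch converts the quoted timings to seconds and works out the ratios. The function and variable names are illustrative.

```python
# Convert the quoted minutes/seconds timings to seconds and compare.
# Function and variable names are illustrative.


def to_seconds(minutes, seconds):
    """Total seconds for a time given as minutes and seconds."""
    return 60 * minutes + seconds


rpi = to_seconds(4, 15)    # Raspberry Pi: 4' 15"
bbb = to_seconds(12, 39)   # BeagleBone Black: 12' 39"
laptop = 8.5               # i7 quad-core laptop, in seconds

bbb_vs_rpi = bbb / rpi         # how many times longer the BBB takes
rpi_vs_laptop = rpi / laptop   # how many times longer the Pi takes

if __name__ == "__main__":
    print("BBB / RPi:    %.2fx" % bbb_vs_rpi)
    print("RPi / laptop: %.0fx" % rpi_vs_laptop)
```

    That works out to the BeagleBone Black taking just under 3× as long as the Raspberry Pi, which in turn takes about 30× as long as the laptop.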

    I don’t have any explanation why the BBB is so much slower. It’s almost as if the compiler isn’t fully optimizing under Ångström Linux.

    Raspberry Pi: system info

    $ uname -a
    Linux rpi 3.6.11+ #538 PREEMPT Fri Aug 30 20:42:08 BST 2013 armv6l GNU/Linux
    
    $ cat /proc/cpuinfo 
    Processor    : ARMv6-compatible processor rev 7 (v6l)
    BogoMIPS    : 697.95
    Features    : swp half thumb fastmult vfp edsp java tls 
    CPU implementer    : 0x41
    CPU architecture: 7
    CPU variant    : 0x0
    CPU part    : 0xb76
    CPU revision    : 7
    
    Hardware    : BCM2708
    Revision    : 000f

    BeagleBone Black: system info

    # uname -a
    Linux beaglebone 3.8.13 #1 SMP Tue Jun 18 02:11:09 EDT 2013 armv7l GNU/Linux
    # cat /proc/cpuinfo 
    processor    : 0
    model name    : ARMv7 Processor rev 2 (v7l)
    BogoMIPS    : 297.40
    Features    : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls 
    CPU implementer    : 0x41
    CPU architecture: 7
    CPU variant    : 0x3
    CPU part    : 0xc08
    CPU revision    : 2
    
    Hardware    : Generic AM33XX (Flattened Device Tree)
    Revision    : 0000

    Both boards are running at stock speed.

    Update: I’ve tried with an external power supply, and checked that the processor was running at full speed. It made no difference. I suspect that Raspbian enables armhf floating point by default, while Ångström needs to be told to use it.