telmomoya: October 2016

AMP with Raspberry Pi: Cookbook

To boot Linux on a Raspberry Pi 2 or 3 with 3 cores and reserve half of the RAM for bare-metal apps, you must use U-Boot, not the default RasPi bootloader.

Prepare an SD card with Minibian or Raspbian.
Get the boot files and examples with:
git clone https://github.com/telmomoya/AMP

Copy the files provided in the repository's boot folder to the SD card boot partition (the root of the FAT partition):
boot.scr
u-boot.bin
uboot.env

(More details on these files here)

Add to config.txt a line with:
kernel=u-boot.bin

Boot Linux (during boot only 3 berries are drawn, one per active core).

Blinking LED

This example blinks an LED connected to GPIO16.

In the amp-test folder you can find source and binary files for this first example.

Test it with enough privileges (root or sudo):

cd amp-test
./loadmetal up-metal.bin
./devmem 0x400000bc w 0x20000000

You can change the blink timing with:
./devmem 0x200000a0 b 0x0a

0x200000a0 is the physical address of the "delay" variable, so writing to that address from Linux changes the value seen by the remote (bare-metal) process: IPC!

See the other posts for detailed information about each step:

Booting RasPi Linux with 3 cores and partial RAM

Compiling, linking and loading the bare-metal executables from Linux (relocated to upper memory)

Starting and stopping bare-metal execution on Core 3 (LCM: Life Cycle Management in AMP terminology)

Sharing some memory areas to send and receive data between Linux and the bare-metal app (remote process)

Example

AMP with Raspberry Pi: Step 4 - Shared memory for Inter-Process Communication

To establish a simple IPC, I will write from Linux to a memory address used by the bare-metal app.
+Linux mmap() on /dev/mem can reach beyond the memory assigned to the kernel (the lower 512 MB imposed by the boot args), and devmem is mmap() based, so I'll use it.
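As a rough illustration of what an mmap() based tool does for a single byte, here is a minimal sketch. It is not the devmem source: the function names and the `dev` parameter are my own, chosen so the same code can be tried on an ordinary file before pointing it at /dev/mem (which needs root).

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map the page of `dev` that contains `addr` and store the offset of
 * `addr` inside that page in *delta. Returns MAP_FAILED on error. */
static void *map_page(const char *dev, off_t addr, size_t *page, off_t *delta)
{
    long psz = sysconf(_SC_PAGESIZE);
    assert(psz > 0);
    *page = (size_t)psz;

    off_t base = addr & ~((off_t)psz - 1);   /* mmap offsets must be page aligned */
    *delta = addr - base;

    int fd = open(dev, O_RDWR | O_SYNC);
    if (fd < 0)
        return MAP_FAILED;
    void *map = mmap(NULL, *page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, base);
    close(fd);                                /* the mapping outlives the descriptor */
    return map;
}

/* Write one byte at `addr`. Returns 0 on success, -1 on error. */
int poke_byte(const char *dev, off_t addr, uint8_t value)
{
    size_t page;
    off_t delta;
    void *map = map_page(dev, addr, &page, &delta);
    if (map == MAP_FAILED)
        return -1;
    ((volatile uint8_t *)map)[delta] = value;
    munmap(map, page);
    return 0;
}

/* Read one byte at `addr` into *value. Returns 0 on success, -1 on error. */
int peek_byte(const char *dev, off_t addr, uint8_t *value)
{
    size_t page;
    off_t delta;
    void *map = map_page(dev, addr, &page, &delta);
    if (map == MAP_FAILED)
        return -1;
    *value = ((volatile uint8_t *)map)[delta];
    munmap(map, page);
    return 0;
}
```

On the board, something like poke_byte("/dev/mem", 0x200000a0, 0x0a) would play the role of a byte-wide devmem write to that address.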

First, create a variable in the bare-metal app:
Edit armc-03.c and add a global variable:

volatile char delay=0x54;

Also change the delay loops to use it:

for(tim = 0; tim < delay * 10000; tim++)

Compile armc-03.c and link it (the upper 512 MB starts at 0x20000000):

gcc -c armc-03.c -o arm-03.o
ld -Ttext 0x20000000 arm-03.o -o up-metal2.elf
objcopy up-metal2.elf -O binary up-metal2.img

Place the obtained binary in memory (0x20000000)
root@minibian:~/code#./loadmetal up-metal2.img

Write the start address (0x20000000) to Core 3 mailbox 3:
./devmem 0x400000bc w 0x20000000

Blink starts, with blink speed managed by the "delay" variable.

To determine the memory address of "delay", look at the ELF symbol table:

nm up-metal2.elf

200000ac B __bss_end__
200000ac B _bss_end__
200000a1 D __bss_start
200000a1 D __bss_start__
200000a0 D __data_start
200000a0 D delay
200000a1 D _edata
200000ac B _end
200000ac B __end__
200000a8 B gpio
20000000 T main
00080000 N _stack
U _start
200000a4 B tim

The global (volatile) "delay" variable is located at 0x200000a0.
If we read it we get the value coded in the source (0x54):

./devmem 0x200000a0 b

Or we can change it in order to change blink speed:

./devmem 0x200000a0 b 0x0a
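If the address lookup needs to be scripted rather than read off the listing by eye, one line of nm output can be parsed as in the following sketch. The function name is hypothetical, and it only assumes the three-column "address type name" layout seen in the listing above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one line of nm output such as "200000a0 D delay".
 * If the line names `symbol`, store its address in *addr and return 1.
 * Return 0 for any other line, including undefined symbols ("U _start"),
 * which carry no address column. */
int nm_line_address(const char *line, const char *symbol, unsigned long *addr)
{
    unsigned long value;
    char type;
    char name[128];

    assert(line != NULL && symbol != NULL && addr != NULL);

    if (sscanf(line, "%lx %c %127s", &value, &type, name) != 3)
        return 0;
    if (strcmp(name, symbol) != 0)
        return 0;

    *addr = value;
    return 1;
}
```

Feeding it each line of the nm output and stopping at the first match gives the address to pass to devmem.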

+ volatile is needed to prevent compiler optimizations (our program never changes the value of "delay", so one possible optimization would be to turn it into a constant).
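A minimal sketch of the pattern, assuming a delay loop shaped like the one shown earlier (the function name and the returned counter are additions for illustration, not part of armc-03.c):

```c
#include <assert.h>

/* In the real app this variable sits at a fixed physical address and is
 * rewritten by Linux behind the compiler's back, hence volatile. */
volatile unsigned char delay = 0x54;

/* One half-period of the blink. Because `delay` is volatile it is re-read
 * on every iteration, so a value written from the other side takes effect
 * even in the middle of a wait. Returns the number of iterations spun. */
unsigned long spin_delay(void)
{
    volatile unsigned long tim;   /* volatile so the empty loop is not optimized away */
    unsigned long spins = 0;

    for (tim = 0; tim < (unsigned long)delay * 10000UL; tim++)
        spins++;
    return spins;
}
```

Without volatile on `delay`, the compiler would be free to evaluate delay * 10000 once, or fold it to a constant, and later writes from Linux would go unnoticed.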

AMP with Raspberry Pi: Step 3 - Life Cycle Management for remote process

Running Bare Metal App
When the Linux kernel boots, all unassigned CPUs (here Core 3) remain in a loop polling their mailbox 3 for a non-zero value, which is taken as the address to jump to (read more here). Writing the value 0x20000000 to that mailbox (Core3_MBOX3_SET register = 0x400000BC) makes Core 3 jump to the executable that we loaded at that position.
For physical memory reads and writes from Linux I used devmem, so get it from http://free-electrons.com/pub/mirror/devmem2.c

Now start the loaded blinking app by writing 0x20000000 to Core 3 mailbox 3:

./devmem 0x400000bc w 0x20000000
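The two mailbox addresses used in these posts (0x400000BC to set, 0x400000FC to read and clear) fit a regular layout. The sketch below computes them for any core and mailbox. The base and offsets are my reading of the BCM2836 ARM-local peripherals layout and should be treated as an assumption; only the two Core 3 values are confirmed by the text here.

```c
#include <assert.h>
#include <stdint.h>

#define ARM_LOCAL_BASE  0x40000000u  /* ARM-local peripherals (assumed base) */
#define MBOX_SET_OFFSET 0x80u        /* Core 0 mailbox 0, write-to-set */
#define MBOX_CLR_OFFSET 0xC0u        /* Core 0 mailbox 0, read / write-1-to-clear */

/* Address of the write-to-set register of mailbox `mbox` on core `core`. */
uint32_t mbox_set_addr(unsigned core, unsigned mbox)
{
    assert(core < 4 && mbox < 4);
    return ARM_LOCAL_BASE + MBOX_SET_OFFSET + 0x10u * core + 4u * mbox;
}

/* Address of the read / write-1-to-clear register of the same mailbox. */
uint32_t mbox_clr_addr(unsigned core, unsigned mbox)
{
    assert(core < 4 && mbox < 4);
    return ARM_LOCAL_BASE + MBOX_CLR_OFFSET + 0x10u * core + 4u * mbox;
}
```

For Core 3, mailbox 3 this gives the set address written with devmem above and the read/clear address polled by the startup code.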

Connect an LED to GPIO16.

LCM
Brian's tutorials include a C-startup assembler file, "armc-0x-start.S", which is the first code to be executed.

For that purpose armc-0x-start.S includes this section directive:
.section ".text.startup"
That section is the first one placed by the linker script used: rpi.x

In those examples we lose control over execution, because the C-startup code (armc-0x-start.S) branches to the kernel_main function and never returns.

To keep control over bare-metal execution, some changes were made to armc-0x-start.S:

.section ".text.startup"

.global _start
.global _get_stack_pointer

_start:
// Clear CORE3_MBOX3
ldr r1,=0x400000FC
ldr r3,=0xffffffff
str r3, [r1]

// Set the stack pointer, which progresses downwards through memory
// Set it at 768MB (0x30000000), inside the upper RAM left to bare-metal,
// which we know our application will not crash into
ldr sp, =(768 * 1024 * 1024) //SP to 0x30000000

// Run the c startup function - should not return and will call kernel_main
bl _cstartup
bl kernel_main //changed from b to bl

// Check CORE3_MBOX3 for jump address (not zero)
_check_loop:
ldr r1,=0x400000FC
ldr r1, [r1]
mov r3, #0
cmp r1, r3
beq _check_loop
bx r1

_get_stack_pointer:
// Return the stack pointer value
str sp, [sp]
ldr r0, [sp]

// Return from the function
mov pc, lr

By replacing the "b kernel_main" branch with "bl" (branch with link), kernel_main can return.
When kernel_main returns, the core loops (_check_loop) waiting for a non-zero value in mailbox 3.
Mailbox 3 is cleared beforehand, when this code starts.
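For readers more at home in C than in ARM assembly, the wait loop can be modelled roughly as below. This is a sketch, not a replacement for the startup code: `mbox` stands in for the Core 3 mailbox 3 read/clear register, and `max_polls` is a hypothetical bound added only so the function can be exercised on a PC (0 means poll forever, as the assembly does).

```c
#include <assert.h>
#include <stdint.h>

/* Poll `mbox` until it holds a non-zero value and return that value,
 * which the assembly then uses as a jump address ("bx r1").
 * Returns 0 if `max_polls` is non-zero and that many polls all read 0. */
uint32_t wait_for_entry(volatile uint32_t *mbox, unsigned long max_polls)
{
    unsigned long polls = 0;

    assert(mbox != 0);

    for (;;) {
        uint32_t addr = *mbox;      /* ldr r1, [r1] */
        if (addr != 0)              /* cmp r1, r3 / beq _check_loop */
            return addr;
        if (max_polls != 0 && ++polls >= max_polls)
            return 0;
    }
}
```

On the real core the returned value would be cast to a function pointer and called, which is what restarts execution at whatever address Linux wrote to the mailbox.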

For a tentative LCM implementation see this example.

A serious LCM requires interrupts or some other exception mechanism in order to take control.

AMP with Raspberry Pi: Step 2 - Compiling, Linking and Loading a Bare Metal App

Bare Metal Coding

Following Brian's excellent bare metal tutorials I obtained a binary image of LED-blinking code.
+Source files for this step are from Brian's github repository.
Compile armc-03.c with linker options that relocate the binary to the upper 512 MB, that is 0x20000000:

gcc -c armc-03.c -o armc-03.o
gcc -nostartfiles -g -Wl,-Ttext=0x20000000 -Wl,-verbose -Wl,-T,rpi.x armc-03.o -o up-metal.elf
objcopy up-metal.elf -O binary up-metal.img

The rpi.x linker script file is the one included with armc-06.

Check the size of the resulting img file: it must be 152 bytes (after all, it only blinks an LED).
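Most of those few bytes are GPIO register accesses. As background, the register arithmetic for a given pin can be sketched as follows. The GPIO base address and register offsets are not taken from this post; they are what I understand the BCM2836/BCM2837 layout to be, so verify them against the peripheral documentation before relying on them.

```c
#include <assert.h>
#include <stdint.h>

#define GPIO_BASE 0x3F200000u   /* GPIO block as seen by the ARM (assumption) */

/* Address of the GPFSELn register holding the 3 function bits of `pin`
 * (10 pins per register). */
uint32_t gpfsel_addr(unsigned pin)
{
    assert(pin < 54);
    return GPIO_BASE + 4u * (pin / 10u);
}

/* Bit position of the 3-bit function field of `pin` inside its GPFSELn. */
unsigned gpfsel_shift(unsigned pin)
{
    assert(pin < 54);
    return 3u * (pin % 10u);
}

/* Address of the GPSETn register that drives `pin` high (32 pins per register). */
uint32_t gpset_addr(unsigned pin)
{
    assert(pin < 54);
    return GPIO_BASE + 0x1Cu + 4u * (pin / 32u);
}

/* Address of the GPCLRn register that drives `pin` low. */
uint32_t gpclr_addr(unsigned pin)
{
    assert(pin < 54);
    return GPIO_BASE + 0x28u + 4u * (pin / 32u);
}

/* Bit to write into GPSETn / GPCLRn for `pin`. */
uint32_t gpio_mask(unsigned pin)
{
    assert(pin < 54);
    return 1u << (pin % 32u);
}
```

For GPIO16 this selects the second function-select register with the field at bit 18, and bit 16 of the first set and clear registers.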

Loader
To place the bare-metal executable at 0x20000000 from Linux I wrote a simple mmap() based loader; invoke it with the binary filename as its parameter.

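loadmetal source code

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define LOAD_ADDRESS 0x20000000UL   /* start of the upper 512 MB */

int main (int argc, char * argv [])
{
    int fd_mem;
    void *load_address;
    long fileLen;
    FILE *file;

    if (argc < 2) {
        fprintf (stderr, "Usage: %s <binary file>\n", argv[0]);
        return 1;
    }

    printf ("Opening %s\n", argv[1]);
    file = fopen(argv[1], "rb");
    if (file == NULL) { perror("fopen"); return 1; }

    /* Get file length */
    fseek(file, 0, SEEK_END);
    fileLen = ftell(file);
    fseek(file, 0, SEEK_SET);
    printf ("File length %ld\n", fileLen);

    /* Map physical address of RAM to virtual address segment with Read/Write Access */
    printf ("Opening Mem %lx\n", LOAD_ADDRESS);
    fd_mem = open("/dev/mem", O_RDWR);
    if (fd_mem < 0) { perror("open /dev/mem"); return 1; }

    load_address = mmap(NULL, (size_t)fileLen, PROT_READ|PROT_WRITE, MAP_SHARED, fd_mem, LOAD_ADDRESS);
    if (load_address == MAP_FAILED) { perror("mmap"); return 1; }

    /* Read file contents straight into the mapped memory */
    if (fread(load_address, (size_t)fileLen, 1, file) != 1) { perror("fread"); return 1; }

    fclose(file);
    munmap(load_address, (size_t)fileLen);
    close(fd_mem);
    return 0;
}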

Now go to Step 3 for Life Cycle Management

AMP with Raspberry Pi: Step 1 - Booting Linux with 3 cores

All this work was done on a Raspberry Pi 3 (a 1.2 GHz 64-bit quad-core ARMv8 CPU) and later on a Raspberry Pi 2.
I used the Raspberry Pi itself as the development platform, but you can cross-compile if you want.

Operating System:
I used Minibian, a reduced Debian Linux (no GUI). To get it, visit https://minibianpi.wordpress.com/

To get an AMP environment we need to boot Linux with at most three cores and reserve one for bare-metal. RAM also needs to be split, i.e. the lower 512 MB for the OS and the upper 512 MB for bare-metal.
The Linux kernel accepts boot-time parameters that can be used to override its default use of the hardware.
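The split point used throughout these posts follows directly from the 512 MB figure. As a small worked example (the macro and function names are mine, and no upper limit is modelled because the top of usable RAM depends on the GPU memory split):

```c
#include <assert.h>
#include <stdint.h>

#define MIB            (1024u * 1024u)
#define LINUX_RAM_SIZE (512u * MIB)      /* lower half, left to the OS */
#define METAL_BASE     LINUX_RAM_SIZE    /* first byte above the OS: 0x20000000 */

/* Return 1 if physical address `phys` lies above the RAM given to Linux. */
int above_linux_ram(uint32_t phys)
{
    return phys >= METAL_BASE;
}
```

So a bare-metal image linked at 0x20000000 starts exactly at the first byte above the memory given to the OS.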

The "maxcpus" and "mem" boot parameters will get the job done, but the standard Raspberry Pi boot process involves the GPU bootloader, the ARM bootloader, and a config.txt with some configuration options (not fully compliant Linux boot parameters). Setting maxcpus=3 and mem=512 in config.txt results in a system that boots with 3 active cores but is very unstable; it even crashes when the Ethernet cable is connected. And the "mem" parameter has no effect (Linux gets all the RAM).
This is where U-Boot comes to help.

Bootloader:
The following step is based on Tim's post about "Booting a Raspberry Pi2, with u-boot and HYP enabled"

NOTE: The latest updates are mandatory for the Raspberry Pi 3! Be sure to run:
apt-get install rpi-update
rpi-update

U-boot is a flexible bootloader intended for embedded systems. Clone and compile it:

git clone git://git.denx.de/u-boot.git
cd u-boot
make rpi_2_defconfig
make all

Copy u-boot.bin to your SD and change config.txt to read:

kernel=u-boot.bin

Using CH340 from ex-arduino nano

Now you need a serial console: boot, press any key to get the U-Boot prompt, then set and save the environment variables:

setenv machid 0x00000c42

setenv bootargs earlyprintk console=tty0 console=ttyAMA0 root=/dev/mmcblk0p2 rootfstype=ext4 rootwait noinitrd mem=512M maxcpus=3

saveenv
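Once Linux is up, it can be handy to confirm that these arguments really reached the kernel, for example by checking the contents of /proc/cmdline. The sketch below parses the two values out of a command-line string. The function names are hypothetical, and the suffix handling only covers the K/M/G forms.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Find "key=" at the start of a word in `cmdline` and return a pointer
 * to the value that follows it, or NULL if the key is absent. */
static const char *find_arg(const char *cmdline, const char *key)
{
    size_t klen = strlen(key);
    const char *p = cmdline;

    assert(cmdline != NULL && klen > 0);

    while ((p = strstr(p, key)) != NULL) {
        int at_word_start = (p == cmdline) || (p[-1] == ' ');
        if (at_word_start && p[klen] == '=')
            return p + klen + 1;
        p += klen;
    }
    return NULL;
}

/* Value of maxcpus=, or -1 if the argument is absent. */
long cmdline_maxcpus(const char *cmdline)
{
    const char *v = find_arg(cmdline, "maxcpus");
    return v ? strtol(v, NULL, 10) : -1;
}

/* Value of mem= in bytes (K, M and G suffixes), or 0 if absent. */
unsigned long long cmdline_mem_bytes(const char *cmdline)
{
    const char *v = find_arg(cmdline, "mem");
    char *end;
    unsigned long long n;

    if (v == NULL)
        return 0;
    n = strtoull(v, &end, 0);
    switch (*end) {
    case 'G': case 'g': n <<= 10; /* fall through */
    case 'M': case 'm': n <<= 10; /* fall through */
    case 'K': case 'k': n <<= 10; break;
    default: break;
    }
    return n;
}
```

Reading /proc/cmdline into a buffer and passing it to both functions should report 3 CPUs and 536870912 bytes if the U-Boot environment was saved correctly.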

Create a boot.scr.source file containing:
fatload mmc 0:1 ${fdt_addr_r} bcm2710-rpi-3-b.dtb
fatload mmc 0:1 ${kernel_addr_r} kernel7.img
bootz ${kernel_addr_r} - ${fdt_addr_r}

And do:
mkimage -A arm -O linux -T script -C none -n boot.scr -d boot.scr.source boot.scr

Move the resulting boot.scr file to the SD root.
+If you don't have a serial adapter, you can get example files from the boot folder in the repository.

Now Linux boots with 3 cores (it draws only 3 berries during boot) and half the memory (512 MB).
Test it with:
free
cat /proc/cpuinfo

The remaining hardware resources are available to run a bare-metal app simultaneously: AMP!

Read Step 2 for Bare Metal Coding

+If you have experience generating a Raspberry Pi uboot.env file from Linux with fw_setenv, please let me know. That would avoid needing the serial adapter for subsequent bootargs changes.

Asymmetric Multi Processing (AMP) with Raspberry Pi

AMP for the masses

Looking for a performance upgrade for my "ARM based 6510 ICE" I decided to use a Raspberry Pi bare-metal app. I started by reading Brian's excellent tutorials and running some tests. But swapping the SD card between PC and RasPi did not suit me as a development process. Searching for a bootloader I found David's one and tested it with a homemade serial level adapter (using the CH340 from a dead Arduino Nano board). That was a better development mechanism, but still tedious.
Thinking about a better option, AMP came to mind: if the RasPi has 4 ARM cores, why not use only one core for bare metal and the remaining ones for development?
The idea was to boot Linux on Cores 0, 1 and 2 with half the RAM (the lower 512 MB), leaving Core 3 and the upper 512 MB for bare-metal. Compilation and linking are done in Linux (no more cross compiling!) and a loader app (also on Linux) puts the binary image in the upper memory used by the bare-metal core. Then Core 3 must start executing that binary.
In this post you have simple instructions to obtain a working AMP system.

Read the following posts for details about each step involved:

About AMP

Asymmetric Multi Processing (AMP) refers to heterogeneous cores, heterogeneous software for each core or both. A Multi Core system has multiple CPUs, each of which may be a different architecture (heterogeneous multicore) or can be the same (homogeneous multicore).

Also, each core in a multicore system can run the same or different software. If cores are different (heterogeneous) software will be different too, resulting in Asymmetric Multi Processing (AMP).

Most homogeneous systems use a Symmetric Multi Processing (SMP) software architecture, where a single operating system instance treats all processors equally, but an AMP architecture is also possible using heterogeneous software.

In many cases an AMP system is desirable to get real-time (deterministic) operation without losing the benefits of an OS, for example: bare-metal or RTOS apps running on some cores and the remaining cores running Linux.

An AMP system involves:

Multicore system (homogeneous or heterogeneous)
Heterogeneous software
Separate address space (program and data)
Communication facility between the CPUs

Read more about Asymmetric Multi Processing (AMP):

https://en.wikipedia.org/wiki/Asymmetric_multiprocessing

OpenAMP

http://www.multicore-association.org/workgroup/oamp.php

ToDo

In an AMP system shared resources are bottlenecks. In my examples the bare-metal apps use only a GPIO port and Linux uses all other resources, which avoids access conflicts. More complex applications need a better resource access scheme.
Look at OpenAMP for a standard framework for LCM, IPC and resource sharing.

Life Cycle Management for bare metal must be improved; one idea is to use a non-returning interrupt such as reset. That needs MMU management in order to remap the exception vectors of individual cores.

Cross debugging is also desirable, for example using OpenOCD from Linux to contact a remote gdb stub on bare-metal. rpi_stub could be used, replacing UART comms with mailboxes or shared memory IPC.