To boot Linux on a Raspberry Pi 2 or 3 with 3 cores and reserve half of the RAM for bare-metal apps you must use U-Boot, not the default RasPi bootloader.
Prepare an SD card with Minibian or Raspbian.
Get the boot files and examples with: git clone https://github.com/telmomoya/AMP
Copy the files provided in the repository's boot folder to the SD card boot folder (the FAT root): boot.scr u-boot.bin uboot.env
Boot Linux (during boot you should see only 3 berries = 3 cores).
Log in and go to the examples folders.
Blinking Led
This example blinks an LED connected to GPIO16.
In the amp-test folder you can find source and binary files for this first example.
Test it with enough privileges (root or sudo):
cd amp-test
./loadmetal up-metal.bin
./devmem 0x400000bc w 0x20000000
You can change the blink timing with:
./devmem 0x200000a0 b 0x0a
0x200000a0 is the physical address of a "delay" variable, so writing to that address from Linux changes the value seen by the remote (bare-metal) process: IPC!
View other posts for detailed info about each step:
To establish a simple IPC I will write from Linux to a memory address used by the bare-metal app.
+Linux mmap() can access memory beyond what was assigned to the kernel (the lower 512 MB imposed by the boot args), and devmem is mmap() based, so I'll use it.
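The mmap() access that devmem relies on can be sketched as follows. This is a minimal illustration, not the devmem source: the function name poke_byte is made up here, and passing "/dev/mem" as the path (which needs root) is an assumption about how it would be used on the Pi.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Write one byte at offset `phys` of the file `path` through mmap().
 * With path = "/dev/mem" (assumption, needs root) `phys` is a physical address.
 * Returns 0 on success, -1 on error. */
int poke_byte(const char *path, off_t phys, uint8_t value)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    off_t base = phys & ~((off_t)page - 1);   /* mmap offsets must be page aligned */

    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0)
        return -1;

    void *map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, base);
    close(fd);                                /* the mapping survives close() */
    if (map == MAP_FAILED)
        return -1;

    ((volatile uint8_t *)map)[phys - base] = value;
    munmap(map, (size_t)page);
    return 0;
}
```

Under these assumptions, poke_byte("/dev/mem", 0x200000a0, 0x0a) would do the same job as ./devmem 0x200000a0 b 0x0a.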
First, create a variable in the bare-metal app:
Edit armc-03.c and add a global variable:
volatile char delay=0x54;
also change the delay loops in order to use it:
for(tim = 0; tim < delay * 10000; tim++)
Compile armc-03.c and link (the upper 512 MB starts at 0x20000000).
Place the obtained binary in memory (0x20000000):
root@minibian:~/code# ./loadmetal up-metal2.img
Write the start address (0x20000000) to Core 3 mailbox 3:
./devmem 0x400000bc w 0x20000000
Blinking starts, with the blink speed controlled by the "delay" variable.
To determine the memory address of "delay", look at the ELF symbol table:
nm up-metal2.elf
200000ac B __bss_end__
200000ac B _bss_end__
200000a1 D __bss_start
200000a1 D __bss_start__
200000a0 D __data_start
200000a0 D delay
200000a1 D _edata
200000ac B _end
200000ac B __end__
200000a8 B gpio
20000000 T main
00080000 N _stack
         U _start
200000a4 B tim
The global (volatile) "delay" variable is located at 0x200000a0
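If the lookup has to be repeated after every rebuild, it could be automated. The sketch below is only an illustration: find_symbol is a hypothetical helper, and it assumes the nm output has already been read into a string with one "address type name" entry per line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Search nm-style text ("<hex address> <type> <name>" per line) for `symbol`.
 * Returns 1 and stores the address in *addr when found, 0 otherwise.
 * Lines without an address (undefined symbols) are skipped. */
int find_symbol(const char *nm_text, const char *symbol, unsigned long *addr)
{
    assert(nm_text && symbol && addr);
    const char *line = nm_text;
    while (*line) {
        unsigned long a;
        char type, name[64];
        if (sscanf(line, "%lx %c %63s", &a, &type, name) == 3 &&
            strcmp(name, symbol) == 0) {
            *addr = a;
            return 1;
        }
        const char *nl = strchr(line, '\n');
        if (!nl)
            break;
        line = nl + 1;                 /* continue with the next line */
    }
    return 0;
}
```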
If we read it we get the coded value (0x54)
./devmem 0x200000a0 b
Or we can change it in order to change blink speed:
./devmem 0x200000a0 b 0x0a
+ volatile is needed to prevent compiler optimizations (our program never changes the value of "delay", so a possible optimization would be to turn it into a constant).
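A small host-side sketch of the same idea, assuming nothing about the real armc-03.c beyond the loop shown above: wait_once is a made-up name, and the iteration counter exists only so the effect can be observed.

```c
#include <assert.h>

/* Shared with the outside world: Linux may overwrite this byte at any time. */
volatile unsigned char delay = 0x54;

/* Busy-wait as in the blink loop and return the number of iterations run.
 * Because `delay` is volatile it is re-read from memory on every pass,
 * so a value written from outside takes effect immediately. */
unsigned long wait_once(void)
{
    volatile unsigned long tim;
    unsigned long n = 0;
    for (tim = 0; tim < (unsigned long)delay * 10000UL; tim++)
        n++;
    return n;
}
```

With the initial value 0x54 the loop runs 840000 times; after writing 0x0a it runs 100000 times, which is why the LED blinks faster.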
Running Bare Metal App
When the Linux kernel boots, all unassigned CPUs (here Core 3) remain in a loop polling their mailbox 3 for a non-zero value, which is taken as the address to jump to (read more here). Writing the value 0x20000000 to that mailbox (Core3_MBOX3_SET register = 0x400000BC) makes Core 3 jump to the executable that we loaded at that position.
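The register addresses used in these posts follow a regular pattern in the BCM2836 per-core register block. The sketch below computes them; the function and macro names are invented for illustration, and the offsets are taken from my reading of the BCM2836 ARM-local peripherals document, so check them against it.

```c
#include <assert.h>
#include <stdint.h>

#define ARM_LOCAL_BASE 0x40000000u  /* per-core ("ARM local") register block */

/* Write-set register of mailbox `mbox` (0..3) of core `core` (0..3). */
uint32_t mbox_set_addr(unsigned core, unsigned mbox)
{
    assert(core < 4 && mbox < 4);
    return ARM_LOCAL_BASE + 0x80u + 0x10u * core + 4u * mbox;
}

/* Read / write-high-to-clear register of the same mailbox. */
uint32_t mbox_rdclr_addr(unsigned core, unsigned mbox)
{
    assert(core < 4 && mbox < 4);
    return ARM_LOCAL_BASE + 0xC0u + 0x10u * core + 4u * mbox;
}
```

For core 3, mailbox 3 this gives 0x400000BC for the set register (the address written with devmem) and 0x400000FC for the read/clear register.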
For physical memory read/write from Linux I used devmem, so get it from http://free-electrons.com/pub/mirror/devmem2.c
Now start the loaded blinking app writing 0x20000000 to Core 3 MailBox3:
./devmem 0x400000bc w 0x20000000
Connect a LED to GPIO16.
LCM
Brian's tutorials include a C-startup assembler file, "armc-0x-start.S", which is the first code to be executed.
For that, armc-0x-start.S includes this label: .section ".text.startup"
That section is the first one placed by the linker script used: rpi.x
In those examples we lose control over execution, as the C-startup file start.S branches to the kernel_main function and never returns.
In order to keep control over bare-metal execution, some changes were made to armc-0x-start.S:
.section ".text.startup"
.global _start
.global _get_stack_pointer
_start:
    // Clear CORE3_MBOX3
    ldr r1, =0x400000FC
    ldr r3, =0xffffffff
    str r3, [r1]
    // Set the stack pointer, which progresses downwards through memory
    // Set it at 768MB which we know our application will not crash into
    // and we also know will be available to the ARM CPU. No matter what
    // settings we use to split the memory between the GPU and ARM CPU
    ldr sp, =(768 * 1024 * 1024)    // SP to 0x30000000
    // Run the c startup function - should not return and will call kernel_main
    bl _cstartup
    bl kernel_main                  // changed from b to bl
    // Check CORE3_MBOX3 for jump address (not zero)
_check_loop:
    ldr r1, =0x400000FC
    ldr r1, [r1]
    mov r3, #0
    cmp r1, r3
    beq _check_loop
    bx r1
_get_stack_pointer:
    // Return the stack pointer value
    str sp, [sp]
    ldr r0, [sp]
    // Return from the function
    mov pc, lr
By replacing the "b" branch to kernel_main with "bl" (branch with link), kernel_main can return.
When kernel_main returns, the core loops (_check_loop) looking for a non-zero value in mailbox 3.
Before that (when this code starts), mailbox 3 is cleared.
For a tentative LCM implementation see this example.
A serious LCM requires interrupts or some other exception mechanism in order to take control.
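For readers more comfortable with C than ARM assembly, the clear-then-poll behaviour can be modelled like this. It is only a host-side sketch: both function names are invented, and a plain variable stands in for the Core 3 mailbox 3 register at 0x400000FC, so the hardware write-to-clear is imitated with a mask.

```c
#include <assert.h>
#include <stdint.h>

/* Poll a mailbox word until it holds a non-zero jump address, then return it.
 * On the real core the returned value would be branched to (bx r1). */
uint32_t wait_for_jump_address(volatile uint32_t *mbox)
{
    assert(mbox != 0);
    uint32_t target;
    while ((target = *mbox) == 0)
        ;                           /* spin: nothing posted yet */
    return target;
}

/* Clear every bit, as the start code does by storing 0xffffffff to the
 * write-to-clear register. On a plain variable this is an AND-NOT. */
void clear_mailbox(volatile uint32_t *mbox)
{
    assert(mbox != 0);
    *mbox &= ~0xffffffffu;
}
```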
Following Brian's excellent bare metal tutorials I obtained a binary image for an LED blinking code.
+Source files for this step are from Brian's github repository.
Compile armc-03.c with linker options to relocate the binary to the upper 512 MB, that is 0x20000000:
gcc -c armc-03.c -o armc-03.o
ld -Ttext 0x20000000 -nostartfiles -g -Wl,-verbose -Wl,-T,rpi.x armc-03.o -o up-metal.elf
objcopy up-metal.elf -O binary up-metal.img
The rpi.x linker script file is taken from armc-06.
Check the size of the resulting img file; it must be 152 bytes (after all, it only blinks an LED).
Loader
To place the bare-metal executable at 0x20000000 from Linux I wrote a simple mmap() based loader. Invoke it with the binary filename as its parameter.
loadmetal source code
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
int main (int argc, char * argv [])
{
int fd_mem;
void *load_address;
unsigned long fileLen;
FILE *file;
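if (argc < 2) { fprintf(stderr, "Usage: %s file\n", argv[0]); return 1; }
printf ("Opening %s\n",argv[1]);
file=fopen(argv[1],"rb");
if (file == NULL) { perror("fopen"); return 1; }
//Get file length
fseek(file, 0, SEEK_END);
fileLen=ftell(file);
fseek(file, 0, SEEK_SET);
printf ("File length %lu\n",fileLen);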
/* Map Physical address of RAM to virtual address segment with Read/Write Access */
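The listing stops at the mapping step. As a rough idea of what an mmap() based loader of this kind has to do, here is a self-contained sketch. It is not the original loadmetal code: load_image is a hypothetical function, and using "/dev/mem" with offset 0x20000000 on the Pi is an assumption based on the description above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy the file `image` to offset `phys` of `mem_path` through mmap().
 * `phys` must be page aligned. Returns bytes copied, or -1 on error. */
long load_image(const char *mem_path, off_t phys, const char *image)
{
    assert(mem_path && image);
    FILE *file = fopen(image, "rb");
    if (!file)
        return -1;
    fseek(file, 0, SEEK_END);
    long len = ftell(file);                 /* file length in bytes */
    fseek(file, 0, SEEK_SET);
    if (len <= 0) { fclose(file); return -1; }

    int fd = open(mem_path, O_RDWR | O_SYNC);
    if (fd < 0) { fclose(file); return -1; }

    /* Map the destination range with read/write access */
    void *dst = mmap(NULL, (size_t)len, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, phys);
    close(fd);
    if (dst == MAP_FAILED) { fclose(file); return -1; }

    size_t got = fread(dst, 1, (size_t)len, file);   /* copy image into place */
    fclose(file);
    munmap(dst, (size_t)len);
    return got == (size_t)len ? len : -1;
}
```

Under those assumptions, load_image("/dev/mem", 0x20000000, "up-metal.img") would place the image where Core 3 is later told to jump.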
All this work was done on a Raspberry Pi 3, which has a 1.2 GHz 64-bit quad-core ARMv8 CPU, and later on a Raspberry Pi 2.
I used the Raspberry Pi itself as the development platform, but you can cross-compile if you want.
Operating System:
I used Minibian, a reduced Debian Linux (no GUI). To get it, visit https://minibianpi.wordpress.com/
To get an AMP environment we need to boot Linux with at most three cores and reserve one for bare-metal. RAM also needs to be split, i.e. the lower 512 MB for the OS and the upper 512 MB for bare-metal.
The Linux kernel accepts boot-time parameters that can be used to force the kernel to override its default use of the hardware.
With the "maxcpus" and "mem" boot parameters we will get the job done, but... the standard Raspberry Pi boot process involves the GPU bootloader, the ARM bootloader, and a config.txt with some possible configuration options (not really fully compliant Linux boot parameters). Setting maxcpus=3 and mem=512 in config.txt results in a system that boots with 3 active cores but is very unstable; it even crashes when the Ethernet cable is connected. And the "mem" parameter has no effect (Linux gets all the RAM).
Here U-Boot comes to help us.
NOTE: the latest updates are mandatory for Raspberry Pi 3!! Be sure to run:
apt-get install rpi-update
rpi-update
U-Boot is a flexible bootloader intended for embedded systems. Clone and compile it:
git clone git://git.denx.de/u-boot.git
cd u-boot
make rpi_2_defconfig
make all
Copy u-boot.bin to your SD card and change config.txt to read: kernel=u-boot.bin
Using CH340 from ex-arduino nano
Now you need a serial console to boot and (by pressing any key) get to the U-Boot prompt, where you set and save the environment variables:
setenv machid 0x00000c42
setenv bootargs earlyprintk console=tty0 console=ttyAMA0 root=/dev/mmcblk0p2 rootfstype=ext4 rootwait noinitrd mem=512M maxcpus=3
saveenv
+If you have experience generating a Raspberry Pi uboot.env file from Linux with fw_setenv, please let me know. That would avoid needing the serial adapter for subsequent bootargs changes.
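Once Linux is up, one way to confirm that maxcpus=3 took effect is to read /sys/devices/system/cpu/online, which holds a list such as "0-2". The helper below counts the CPUs in such a list; count_cpus is a hypothetical name and reading the file is left to the caller, so treat it as a sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Count CPUs in a kernel-style list such as "0-2" or "0,2-3"
 * (the format of /sys/devices/system/cpu/online).
 * Returns -1 on malformed input. */
int count_cpus(const char *list)
{
    assert(list != NULL);
    int total = 0;
    const char *p = list;
    while (*p && *p != '\n') {
        char *end;
        long first = strtol(p, &end, 10);
        if (end == p)
            return -1;                      /* no number where one was expected */
        long last = first;
        if (*end == '-') {                  /* a range "first-last" */
            p = end + 1;
            last = strtol(p, &end, 10);
            if (end == p || last < first)
                return -1;
        }
        total += (int)(last - first + 1);
        if (*end != ',' && *end != '\0' && *end != '\n')
            return -1;
        p = (*end == ',') ? end + 1 : end;
    }
    return total;
}
```

On the AMP setup described here the expected result for the online list is 3, with Core 3 left out.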
AMP for the masses
Looking for a performance upgrade for my "ARM based 6510 ICE" I decided to use a Raspberry Pi bare metal app. I started by reading Brian's excellent tutorials and running some tests. But swapping the SD card between PC and RasPi did not suit me as a development process. Searching for a bootloader I found David's one and tested it with a homemade serial level adapter (using a CH340 from a dead Arduino Nano board). That was a better development mechanism, but also tedious.
Thinking about a better option, AMP came to mind: if the RasPi has 4 ARM cores, why not use only one core for bare metal and the remaining ones for development?
The idea was to boot Linux with Cores 0, 1 and 2 and half of the RAM (lower 512 MB), leaving Core 3 and the upper 512 MB for bare-metal. Compilation and linking are done in Linux (no more cross compiling!) and a loader app (also on Linux) puts the binary image in the upper memory used by the bare-metal core. Then Core 3 must start executing that binary.
In this post you have simple instructions to obtain a working AMP system.
Read the following posts for details about each step involved:
Asymmetric Multi Processing (AMP) refers to heterogeneous cores, heterogeneous software on each core, or both. A multicore system has multiple CPUs, which may have different architectures (heterogeneous multicore) or the same one (homogeneous multicore).
Also, each core in a multicore system can run the same or different software. If the cores are different (heterogeneous) the software will be different too, resulting in Asymmetric Multi Processing (AMP).
Most homogeneous systems use a Symmetric Multi Processing (SMP) software architecture, where a single operating system instance treats all processors equally, but an AMP architecture is also possible using heterogeneous software.
In many cases an AMP system is desirable to get real-time (deterministic) operation without losing the benefits of an OS, for example: bare-metal or RTOS apps running on some cores and the remaining cores running Linux.
An AMP system involves:
Multicore system (homogeneous or heterogeneous)
Heterogeneous software
Separate address space (program and data)
Communication facility between the CPUs
Read more about Asymmetric Multi Processing (AMP):
In an AMP system shared resources are bottlenecks. In my examples bare-metal apps only use a GPIO port and Linux uses all other resources, avoiding access conflicts. In more complex applications better resource access management must be implemented.
Look at OpenAMP for a standard framework for LCM, IPC and resource sharing.
Life Cycle Management for bare metal must be improved; one idea is to use a non-returning interrupt, like reset. It needs MMU management in order to remap the exception vectors of an individual core.
Cross debugging is also desirable, for example using OpenOCD from Linux to contact a remote gdb stub on bare-metal. rpi_stub may be usable, replacing UART comms with mailboxes or shared memory IPC.
Hi, today I'm presenting a new video about the ARM powered C64 in Dual Core mode (6510 & Z80) playing a SID file. That is, while the Z80 executes CP/M, the 6510 executes an irq-based SID player.
Tools used:
sidreloc to relocate the SID player & data from $1000 to $6000, avoiding overlaps with CP/M
PSID64 for prg generation (-n: no interface)
bin2hex for .h header file generation, to include from C code
Also, some 6502/10 asm code for SID player initialization and irq vector redirection:
Exploring the reconfigurability of the ARM-powered C64, I added a Z80 emulator to the existing 6510 emulator. And for dynamic testing, what better than cartridgeless C64 CP/M?
So, a heterogeneous multi-software-core C64 is obtained. Of course, only non-parallel concurrency is obtained, as only one hardware core (ARM) is available.
For photos and videos a very visual (and retro) effect was included, setting the screen border color according to the working core (Light Blue: 6510, Red: Z80).
C64 CP/M Background
In the 80's Commodore developed a CP/M cartridge that contained a Z80, to give the C64 access to the software available for CP/M. See more about this at Ruud Baltissen's site.
As the original cartridge shares the buses between the 6510 and the Z80 (and also the VIC), not allowing simultaneous operation of the processors, the ARM based, non-parallel dual core presented here is enough for C64 CP/M execution.
Some CP/M BIOS portions, like disk access, take advantage of the C64 kernal (ROM) and were written for the 6510, and the CP/M core, running on the Z80, calls them continually (as the border colors in the video show).
6510 core:
In a previous post an ARM based C64 was presented, with a C coded 6502 emulator modified for 6510-like operation. It's based on Mike Chambers' great fake6502 emulator.
Z80 emulator
Looking for a free, portable, C coded Z80 emulator I found Marcel de Kogel's Z80emu: "written in pure C, which can be used on just about every 32+ bit system". It was easily integrated into the existing project IDE: a bare-metal LPC1769 Eclipse workspace.
For ARM compilation the "low-endian CPU" option must be declared in Z80.h, in the "Machine dependent definitions" section.
Use of the Z80 in the C64 cartridge is limited: as IORQ and interrupts are not used, only memory access must be implemented.
Z80 memory access
User must provide Z80emu with the Z80_RDMEM() and Z80_WRMEM() functions. As buses are shared with 6510 CPU, the same memory access C functions used by 6510 emulator are used by Z80 emulator: externalread() and externalwrite()
/****************************************************************************/
/* Read a byte from given memory location */
/****************************************************************************/
unsigned Z80_RDMEM(dword A){ return externalread((A&0xffff)+0x1000); }
/****************************************************************************/
/* Write a byte to given memory location */
/****************************************************************************/
void Z80_WRMEM(dword A,byte V){ externalwrite(((A&0xffff)+0x1000), V&0xff); }
Note the 0x1000 term added to the addresses, recreating the 74LS283 adder included in the CP/M cartridge for the address-space shift. This is needed because of the conflict between the 6510's I/O port and the Z80's reset vector, both located at 0x0000.
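The address shift can be isolated in a tiny helper to make the wrap-around explicit. This is a sketch with an invented function name; it assumes the adder only affects the top address nibble, so the carry out of bit 15 is dropped, which is also what happens when the sum is passed to a function taking a uint16_t address.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the cartridge adder: 0x1000 is added to the Z80 address and
 * the result is kept to 16 bits (the carry out of A15 is discarded). */
uint16_t z80_to_c64_addr(uint16_t z80_addr)
{
    return (uint16_t)(z80_addr + 0x1000u);
}
```

So the Z80 reset vector at 0x0000 lands at C64 address 0x1000, away from the 6510 I/O port, while Z80 addresses 0xF000-0xFFFF wrap around to 0x0000-0x0FFF.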
Core switching
Without a core scheduler, the C64 CP/M cartridge implements a simple scheme. The Z80 is enabled or disabled by writing a byte with LSB = 0 or 1 to an address in the range $DE00-$DEFF.
So, core switching is entrusted to software. Look how CP/M does it in this 6510 assembler code, part of the C64 CP/M BIOS: http://www.z80.eu/c64/BIOS65.ASM
Note the MODESW definitions, which differ due to the address shift.
This functionality was implemented as follows:
A catch in the externalwrite() function changes the value of a "processor flag" variable (like the flip-flop in the cartridge)
In the main loop, according to the "processor flag" variable, one of the following actions is performed:
Z80 instruction execution
6510 instruction execution and 6510 interrupt treatment
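The scheme above can be sketched in a few lines of C. All names here are illustrative rather than taken from the project, the two counters stand in for the real emulator step functions, and the polarity (LSB = 0 selects the Z80) is my reading of the description above.

```c
#include <assert.h>
#include <stdint.h>

enum cpu { CPU_6510 = 0, CPU_Z80 = 1 };

static enum cpu active_cpu = CPU_6510;        /* the "processor flag" */
static unsigned long z80_steps, m6510_steps;  /* stand-ins for emulator steps */

/* Catch for the write path: any write to $DE00-$DEFF selects the core.
 * Assumption: LSB = 0 enables the Z80, LSB = 1 gives control to the 6510. */
void mode_switch_catch(uint16_t address, uint8_t value)
{
    if ((address & 0xFF00u) == 0xDE00u)
        active_cpu = (value & 1u) ? CPU_6510 : CPU_Z80;
}

/* One pass of the main loop: run a single instruction on the active core. */
void main_loop_step(void)
{
    if (active_cpu == CPU_Z80)
        z80_steps++;      /* here: execute one Z80 instruction */
    else
        m6510_steps++;    /* here: execute one 6510 instruction, then service IRQs */
}
```

Because the flag is only touched from the write catch, whichever core is running hands over control simply by storing to $DExx, just as the cartridge's flip-flop would.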
CP/M loading
For previous C64 IEC testing Uno2IEC was used, but this simple Arduino based drive emulator does not support the sector reads and writes needed for CP/M disk access.
With no disk drive or highly compatible device (SD2IEC, uIEC, etc.) available, another solution is necessary, described in its own post: Software-core C64 diskless CP/M boot
A software core for the C64 is possible. Unlike other implementations based on programmable logic (FPGA) and soft-cores, this is a 32-bit microcontroller running a 6510 emulator. So I call it a software-core, not a soft-core. For easy reconfigurability, a portable microprocessor emulator programmed in C is used, resulting in a "High-Level Language In-Circuit Emulator". This is a space-time use of an emulator: space because it is in-circuit, and time because of real-time operation (software running synchronized to an external clock). Watch the videos of a C64 with the 6510 microprocessor replaced by a 32-bit microcontroller: an ARM Cortex M3 LPC1769. They show games running and IEC operation (via Uno2IEC).
As you can see, emulation is not limited to the 6510 processor only. Why not implement other CPU emulators for a heterogeneous multi-core C64? In the next post, see 6510 and Z80 emulators on ARM for cartridgeless C64 CP/M: Dual Core C64
miker00lz's EhBasic@Arduino
Inspiration:
In Sept 2015 I saw miker00lz’s post at the Arduino forum about running his 6502 emulator (fake6502) on Arduino (https://forum.arduino.cc/index.php?topic=193216.0). That was interesting, but I thought it would be better if the microcontroller handled external SRAM, CIA or SID.
ARM:
EhBasic@LPC1769 debug console
Thinking that at some point the Arduino would limit performance I chose a more powerful platform: an NXP LPC1769, ARM Cortex-M3. The LPC also has the 5V tolerant GPIO pins needed by MOS chips. The first test was with a 32K x 8 SRAM (HM62256) and fake6502 running EhBasic.
Then GPIO pins were connected to the SRAM as the 6502 buses, really with an incomplete address bus (A0 to A13) because only 14 contiguous GPIO pins were available: P1.18 to P1.31.
LPCXpresso 1769 Pinout
62256 (32K x 8) SRAM Pinout
And memory read and write functions for fake6502 were written to manage the GPIO pins connected as address, data and control buses.
//MEMORY READ
uint8_t read6502(uint16_t address){
uint8_t value;
FIO2SET = (1 << 10);// Drive R/W* high (P2.10): read
FIO1PINH = address << 2;// Put A0-A13 on P1.18-P1.31 (address bus)
FIO2SET = (1 << 11);// Drive CS high (P2.11)
value = FIO2PIN0;// Read the data bus (P2.0-P2.7)
FIO2CLR = (1 << 11);// Drive CS low (P2.11)
return (value);
}
//MEMORY WRITE
void write6502(uint16_t address, uint8_t value){
FIO1PINH = address << 2;// Put A0-A13 on P1.18-P1.31 (address bus)
FIO2PIN0 = value;// Write the data bus (P2.0-P2.7)
FIO2CLR = (1 << 10);// Drive R/W* low (P2.10): write
FIO2SET = (1 << 11);// Drive CS high (P2.11)
FIO2CLR = (1 << 11);// Drive CS low (P2.11)
}
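The "address << 2" in these routines follows from how the half-word port registers are laid out: bit n of the upper half-word of a port corresponds to pin 16+n, so A0 has to land on bit 2 to reach P1.18. A small host-side check of that packing, with an invented function name:

```c
#include <assert.h>
#include <stdint.h>

/* Pack A0..A13 into the upper half-word of port 1, where bit n of the
 * half-word is pin P1.(16+n); A0 therefore lands on bit 2 (P1.18)
 * and A13 on bit 15 (P1.31). */
uint16_t addr_to_p1_high(uint16_t address)
{
    assert(address < 0x4000u);      /* only 14 address lines are wired */
    return (uint16_t)(address << 2);
}
```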
The pins were connected directly and the emulator ran asynchronously. As on the Arduino, the EhBasic code was held as a C constant array in the microcontroller's flash program memory.
Comparing ROM Dump with ROM Read
C64:
Then I wanted to reproduce 6502 operation reading the program from a ROM, that is, a stored program external to the microcontroller. A first option was to write EhBasic to a 29FXXX DIP Flash, but I preferred to use C64 ROMs, so I pulled a socketed 901225 Character ROM chip from a C64, connected it to the microcontroller like the SRAM, and read it.
And then, one more step: what about connecting the microcontroller to the C64 in place of the original 6510, and testing ROM, CIAs and SID with the address decoding provided by the original PLA?
It seems trivial but presents several difficulties, because in the C64 the 6510 microprocessor shares the buses with the VIC video chip, the system RAM is DRAM refreshed by the VIC chip, and we need to emulate the 6510’s I/O port.
The 6510’s I/O port is mapped at address $0001 and, in the C64, bits 0, 1 and 2 are connected to the LORAM, HIRAM and CHAREN signals. The PLA uses these signals to switch between ROM and RAM for the $A000-$BFFF, $D000-$DFFF and $E000-$FFFF memory areas.
This operation was implemented with 3 GPIO pins and catches in the write function:
// 6510 I/O Port
if (address==0x1) {
    if ((value & 0x1)==0){
        FIO2CLR = (1 << 11); // Drive LORAM low
    } else {
        FIO2SET = (1 << 11); // Drive LORAM high
    }
    if ((value & 0x2)==0){
        FIO2CLR = (1 << 12); // Drive HIRAM low
    } else {
        FIO2SET = (1 << 12); // Drive HIRAM high
    }
    if ((value & 0x4)==0){
        FIO2CLR = (1 << 13); // Drive CHAREN low
    } else {
        FIO2SET = (1 << 13); // Drive CHAREN high
    }
}
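To see what the PLA does with those three lines, the following sketch models the CPU's view of memory for a given port value. It is an illustration based on the standard C64 memory map with no cartridge inserted (GAME and EXROM high), not code from this project, and the names are made up.

```c
#include <assert.h>
#include <stdint.h>

enum target { T_RAM, T_BASIC_ROM, T_KERNAL_ROM, T_CHAR_ROM, T_IO };

/* What the CPU sees at `address` for a given value of the $0001 port,
 * assuming no cartridge (GAME and EXROM both high). */
enum target cpu_view(uint16_t address, uint8_t port)
{
    int loram  = port & 0x1;
    int hiram  = port & 0x2;
    int charen = port & 0x4;

    if (address >= 0xA000u && address <= 0xBFFFu)
        return (loram && hiram) ? T_BASIC_ROM : T_RAM;
    if (address >= 0xD000u && address <= 0xDFFFu) {
        if (!loram && !hiram)
            return T_RAM;               /* all-RAM configuration */
        return charen ? T_IO : T_CHAR_ROM;
    }
    if (address >= 0xE000u)
        return hiram ? T_KERNAL_ROM : T_RAM;
    return T_RAM;                       /* everything else is always RAM */
}
```

With all three bits set (the state after reset) BASIC, I/O and KERNAL are visible; clearing CHAREN swaps the I/O area for the character ROM.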
VIC Out
For the first tests I removed the VIC chip from the board, but that left me without a CPU clock and without DRAM refresh, so the emulator had to use microcontroller RAM. With catches in the memory write function I redirected RAM writes to the screen area ($0400-$07FF) to the microcontroller debug console and received the C64 startup message. The pending address lines A14 and A15 were also implemented, with two additional GPIOs: P4.28 and P4.29
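Bytes written to screen RAM are screen codes rather than ASCII, so a redirect like this needs a small translation before printing. The sketch below shows one way to do it; the function names are invented, it assumes the default screen at $0400 with 40 x 25 characters, and it only maps the codes that have an obvious ASCII counterpart.

```c
#include <assert.h>
#include <stdint.h>

#define SCREEN_BASE 0x0400u
#define SCREEN_SIZE 1000u     /* 40 columns x 25 rows */

/* Convert a C64 screen code (upper-case/graphics set) to printable ASCII.
 * Codes that are not mapped here are shown as '.'. */
char screen_code_to_ascii(uint8_t code)
{
    code &= 0x7Fu;                          /* bit 7 only selects reverse video */
    if (code == 0)   return '@';
    if (code <= 26)  return (char)('A' + code - 1);
    if (code == 27)  return '[';
    if (code == 29)  return ']';
    if (code >= 32 && code <= 63) return (char)code;  /* space, digits, punctuation */
    return '.';
}

/* Returns 1 and fills row/column when `address` is inside screen RAM. */
int screen_position(uint16_t address, unsigned *row, unsigned *col)
{
    assert(row && col);
    if (address < SCREEN_BASE || address >= SCREEN_BASE + SCREEN_SIZE)
        return 0;
    *row = (address - SCREEN_BASE) / 40u;
    *col = (address - SCREEN_BASE) % 40u;
    return 1;
}
```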
ROM Read
C64 Boot
VIC: As the VIC provides the CLK for the CPU, up to here the emulator ran asynchronously. In order to place the VIC on the board and share the buses, the emulator needs tri-state capability and synchronization to the system clock.
In the C64, bus access is driven by the VIC, with its BA signal connected to the 6510’s AEC pin. All microprocessor read and write operations take place when AEC=1 and must be enabled by the RDY signal too. When AEC=0 the VIC uses the buses and the microprocessor pins must go to the third state. For more details please see: http://www.zimmers.net/cbmpics/cbm/c64/vic-ii.txt
Interface: That operation was implemented with a custom interface adapter built from discrete logic: 3 x 74HC245 octal 3-state noninverting bus transceivers for the address and data buses and a 74HC00 for R/W. Two remaining 74HC00 gates were used to provide the required delay between the 6510 clock input and output signals.
The read and write functions were also rewritten to take the AEC and RDY states into account. As the 6510 can only access the buses when AEC=HIGH, all read and write operations are synchronized to that signal and not to Clk (Fi2).
//MEMORY READ (SHARED BUS)
uint8_t externalread(uint16_t address) {
uint8_t value;
while ((FIO2PIN1&1)==0) {};// Wait while AEC=0 -> Fi0(P2.8)=0
while ((FIO2PIN1&1)==1) {};// Wait while AEC=1 -> Fi0(P2.8)=1
FIO1PINH =(address << 2);// Put A0-A13 on P1.18-P1.31 (address bus)
FIO4PINH= (address >> 2);// A14 and A15 on P4.28 and P4.29
while ((FIO0PIN3&16)==0){};// If RDY=0 (P0.28) wait here
while ((FIO2PIN1&1)==0) {};// Wait while AEC=0 -> Fi0(P2.8)=0
while ((FIO2PIN1&1)==1) {};// Wait while AEC=1 -> Fi0(P2.8)=1
value = FIO2PIN0; // Read the data bus (P2.0-P2.7)
FIO2PIN0 = value; // Write the data bus (P2.0-P2.7)
FIO2DIR0 = 0x00;// Data bus as input
FIO2CLR= (1 << 10);// Drive W low (P2.10): read
return value;
}
Timings:
The VIC chip was connected to the board and several timing adjustments were made to the read/write routines. Incorrect timings produced DRAM corruption during VIC refresh and access.
Ram Corruption
Error: 38911./909 Bytes Free
Finally I got the system running, a C64 with a software-emulated microprocessor!
Testing:
Shared bus access works fine, even with the lower performance obtained: about 60% of the original 1 MHz 6510.
Games and the default IEC bus were tested using Uno2IEC, a 1541 IEC interface emulator using Arduino.
Of course better performance can be achieved using an assembly emulator, like a6502 (https://github.com/BigEd/a6502), but the idea is a "High-Level Language In-Circuit Emulator".