Simtec Hydra ARM Multiprocessor System

Introduction

Hydra is a hardware add-on for the Acorn Computers RiscPC, an ARM based desktop computer, which converts it into an 
affordable asymmetric parallel processing system.  RiscPC machines have the ability to support more than one processor.  As 
standard they have two processor slots: one is normally occupied by an ARM processor card (the primary processor), and the 
other is free, allowing the addition of a second ARM, Intel, Motorola or other secondary processor.  While the design of the 
primary processor card may be relatively simple, the second processor card must incorporate a certain amount of arbitration logic 
to enable it to share the bus with the primary processor.  Although there are different design requirements for primary and 
secondary processor cards, the two processor slots on a standard RiscPC are electrically identical.
The Hydra card interfaces with the RiscPC via one of the processor slots.  It duplicates both of the original slots and combines 
additional slots with the arbitration logic needed to support a further four ARM processor cards.
Because the Hydra design integrates the arbitration logic with the base board, ordinary ARM610 and ARM710 processor cards can be 
used.  This makes it possible to add up to four off-the-shelf ARM processor cards to any RiscPC system.  Indeed, the Hydra card is 
not limited to just ARM processor cards: anything which appears to the system to be an ARM card can be used.  This opens up the 
possibility of adding alternative high speed I/O cards which access memory or other expansion cards directly.


The Hydra API (Application Programming Interface)

With four slave processor cards fitted, a RiscPC with Hydra has, in theory, five times the processing power of a standard RiscPC. 
Unfortunately the operating system, RISC OS, is not a multiprocessor OS and has no way of taking advantage of this increased 
processing power.  One way to make effective use of Hydra is to switch to an operating system which does support 
multiprocessing, such as RiscBSD, Helios or Taos.  This has the advantage that any application software which can multithread 
will automatically take advantage of any available processors.  However, for the ordinary RISC OS user, the easiest way to 
harness the power of Hydra is to use application software, written to enhance parts of RISC OS, which uses the Hydra API.  As the 
API exists independently of RISC OS, any MP aware applications will make use of the new resources and ordinary applications 
will run unaffected.


Design philosophy

RISC OS is a robust, compact, efficient ROM based operating system with support for installable filesystems, fast bitmap and 
graphics operations, and anti-aliased font rendering.  It has a desktop environment (the Wimp) which allows multiple co-operating 
tasks to share the machine.  RISC OS was designed to run on a single processor.  As such there is no interface to support the 
creation of threads or manage their execution.  The Hydra API is designed to provide some of the benefits of multithreading with 
as little as possible of the overhead.  After all, the main reason for using Hydra is to enhance the computer's 
performance, so it is not appropriate for the software to impose a heavy performance burden.

The Hydra API provides calls to:
- Set up the areas of memory containing code and data which a thread will use.
- Move additional areas of memory in and out of the address space of the slave processors.
- Schedule the thread for execution.
- Monitor the progress of a scheduled thread.

Threads are written in ARM assembler and run in 32 bit mode.  They see an operating system interface which is a subset of RISC OS, 
supporting screen and keyboard I/O, file operations and certain utility functions.  In addition there is a generic interface which 
allows a thread to issue a call to any RISC OS SWI.  SWIs generated on a slave processor are either performed locally or passed 
to the Master processor for execution.  In this way, all filing operations are performed by only one processor, so filing system 
consistency is guaranteed.


Architecture

The Hydra API is implemented by a relocatable module which runs on the RISC OS host and a small kernel which is run by each 
slave.  Code (kernel and user) is shared between slaves.  Data areas can be shared or unique.  When Hydra starts, the kernel code 
is loaded into shared memory and the slave processors are reset under control of the host.  Memory is then allocated to hold the 
level 1 and level 2 page tables for each installed slave.  At the end of the boot sequence the kernel enters a command processing loop.
As an aid to software development, each slave processor can receive keyboard input and send character based output to a virtual 
terminal which is provided by the HydraTerm application.  This allows trace information and notifications of exceptions to be 
displayed.  The kernel also supports a limited command line interface (CLI) allowing memory and registers to be dumped and 
disassembled, and code to be executed.  Each slave reads and processes commands until a thread is scheduled for it, whereupon it 
abandons whatever command it was executing and enters the thread code at the specified address.  Any calls which the thread 
makes to the standard character I/O SWIs (OS_ReadC, OS_ReadLine, OS_WriteC etc.) are routed to the virtual terminal.  It is not 
anticipated that end users will interact with Hydra via this interface.  When a thread signifies that it has terminated (by calling 
OS_Exit), the next pending thread is executed.  If no thread is waiting, control returns to the interactive command line.


Scheduling Threads

As described above, threads are allocated to processors on a first-come, first-served basis.  This simple queuing mechanism allows 
Hydra to be shared between a number of client applications and allows for solutions which scale well whatever number of slave 
processors are fitted.  Let's assume that a hypothetical application has a time consuming task which can be split to run in parallel 
on a number of processors.  A naive approach might be to split the task into four threads, each of which would take N seconds to 
execute.  On a system with one slave the four threads would execute sequentially, taking a total of 4N seconds.  On a four slave 
system the threads would execute concurrently, taking N seconds.  However, on a three slave system the first three threads 
would execute immediately, leaving the fourth thread to execute on its own after the first three had completed, taking a total of 
2N seconds.  A better approach would be to split the task into twelve threads, each taking N/3 seconds.  On a four slave system each 
processor would execute three of the threads, taking N seconds; on a three slave system each processor would handle four of the 
threads, taking 4N/3 seconds, and so on.  This approach also scales better to future systems which may support more than four 
slave processors.
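
The arithmetic above can be checked with a short sketch.  It is an illustration only, not part of the Hydra API: the function name 
makespan is invented here, and it assumes equal length threads, no scheduling overhead and no contention for the memory bus.

```python
from fractions import Fraction

def makespan(total_work, n_threads, n_slaves):
    """Elapsed time when total_work (in units of N seconds) is split into
    n_threads equal threads served first come, first served by n_slaves."""
    per_thread = Fraction(total_work, n_threads)
    # Each slave runs one thread at a time, so the queue drains in "rounds".
    rounds = -(-n_threads // n_slaves)  # integer ceiling division
    return rounds * per_thread

# The whole task is 4N seconds of work; compare a 4-way and a 12-way split.
for slaves in (1, 2, 3, 4):
    print(slaves, "slave(s):",
          "4 threads ->", makespan(4, 4, slaves), "N,",
          "12 threads ->", makespan(4, 12, slaves), "N")
```

With three slaves the four thread split takes 2N while the twelve thread split takes 4N/3, which is the improvement described above.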


Memory map

The memory map for a slave processor looks a little like the memory map of a RISC OS machine:

Address                Allocation
00000000 - 00007FFF    Kernel internal use, vector tables, communication queues and stacks (unique to each slave)
00008000 - 037FFFFF    Available to user programs. Memory in this region is allocated by the client application.
03800000 - 0380FFFF    Kernel code (read only, shared between all slaves, may be less than 64k in practice)
03810000 - 03FFFFFF    One to one mapping with I/O space in the host's address space
04000000 +             Level 1 and level 2 page tables and other memory management workspace. The size of this
                       area depends on the amount of physical RAM in the system.
80000000 - FFFFFFFF    One to one mapping with physical memory, which by default is not accessible to prevent a
                       rogue slave from corrupting RISC OS or other processors' workspace.
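
As a sketch of how the map above might be used, for example in a debugging aid, the following looks up the region that contains a 
given slave address.  The names REGIONS and region_for are invented for this illustration.  The upper bound of the page table area 
is an assumption: the map only gives its start (04000000), so everything below 80000000 is treated as belonging to it.

```python
# (first address, last address, description), taken from the memory map.
REGIONS = [
    (0x00000000, 0x00007FFF, "kernel internal use (unique to each slave)"),
    (0x00008000, 0x037FFFFF, "user programs"),
    (0x03800000, 0x0380FFFF, "kernel code (read only, shared)"),
    (0x03810000, 0x03FFFFFF, "I/O space (one to one with host)"),
    # Assumed upper bound: the real size depends on the physical RAM fitted.
    (0x04000000, 0x7FFFFFFF, "page tables and memory management workspace"),
    (0x80000000, 0xFFFFFFFF, "physical memory (not accessible by default)"),
]

def region_for(address):
    """Return the description of the region containing a 32 bit address."""
    for first, last, description in REGIONS:
        if first <= address <= last:
            return description
    raise ValueError("address out of 32 bit range: %r" % (address,))

print(region_for(0x00008000))
print(region_for(0x03800000))
```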


How it works

The Hydra arbitration logic multiplexes the processors onto the memory bus and ensures that only one processor talks to the 
memory bus at any one time.  Any processor requiring a memory cycle is guaranteed access to the bus by a rotating priority 
encoder in which the last processor to use the bus has the lowest priority.  The bus is given to each requesting processor in turn; 
if no other processor needs it, it stays with the current owner.
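
The rotating scheme can be modelled in a few lines.  This is a software sketch of the behaviour described above, not a description 
of the actual logic: the function name next_owner, the numbering of processors from 0 and the default count of five (one master 
plus four slaves) are assumptions made for the example.

```python
def next_owner(current, requests, n_processors=5):
    """Choose the next bus owner.

    current  -- index of the processor that used the bus last
    requests -- set of processor indices that want a memory cycle
    The search starts just after the current owner, so the last user is
    examined last (lowest priority).  With no requests the bus stays put.
    """
    for step in range(1, n_processors + 1):
        candidate = (current + step) % n_processors
        if candidate in requests:
            return candidate
    return current

# With every processor requesting, ownership rotates in turn.
owner = 0
order = []
for _ in range(5):
    owner = next_owner(owner, {0, 1, 2, 3, 4})
    order.append(owner)
print(order)
```

Because the previous owner is always considered last, no processor that keeps requesting the bus can be starved by another.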

When a slave processor is reset, an external memory modifier unit is enabled to force it to execute its reset code from a fixed area 
of memory; otherwise it would execute the RISC OS reset code and crash the already running RISC OS.  Once the processor is 
initialised and running useful code, the modifier unit is disabled and the processor addresses are output normally.
There is also logic to halt a processor, so that when a task is complete a processor can shut itself down and wait in suspended 
animation until un-halted or reset by the Master processor.
There is an extensive interrupt structure allowing slaves to send IRQs or FIQs to each other and to signal the Master processor 
through the interrupt structure of the podule bus.
Wherever possible, registers have hardware interlocks which prevent one processor from interfering with bits that control the 
others.  In some cases registers are context sensitive and will only set or enable particular bits of a register dependent on which 
processor is accessing them.  A processor can be identified by reading the ID_status register, whose contents reflect the physical 
socket number that the processor is connected to.  This enables the controlling software to compute which register bits belong to 
that processor.  A HardwareVer register holds the current revision number of the arbitration logic.

For those who feel a need to access the hardware directly, below is a register list of the Hydra card.  Please note that some of these 
registers and their operation will change, but every effort will be made to keep them backwards compatible.  Currently there are 
16 write and 8 read registers, each 4 bits wide, addressed physically from &3800000, and a 4 MB block of address space set aside 
at &3C00000 for local Slave memory.

Addr  Register      Settings                                                     Reset State  Flags  R/W

&00   FIQ_set       1 sets bits in reg. 0 no change. 1(n) asserts FIQ P(n)       0000         (-MS)  W
&04   FIQ_clr       1 clears bits in reg. 0 no change.                                        (-MS)  W
&08   ForceFIQ_clr  1 clears bits every slave FIQ reg. 0 no change. 1(n)                      (-M-)  W
&10   MMU_LSN       Writes D[3:0] to A[24:21] of MMU                             0000         (A--)  W
&14   MMU_MSN       Writes D[3:0] to A[28:25] of MMU                             0000         (A--)  W
&18   MMU_set       1 sets bits in reg. 0 no change. 1(n) enables MMU for P(n)   0000         (A--)  W
&1C   MMU_clr       1 clears bits in reg. 0 no change.                                        (-MS)  W
&20   IRQ_set       1 sets bits in reg. 0 no change. 1(n) asserts IRQ P(n)       0000         (-MS)  W
&24   IRQ_clr       1 clears bits in reg. 0 no change.                                        (-MS)  W
&28   ForceIRQ_clr  1 clears bits every slave IRQ reg. 0 no change. 1(n)                      (-M-)  W
&30   Reset         Writes D[3:0] to reg. 1(n) to assert RST(n).                 0000         (-M-)  W
&34   X86_killer    Writes D[3:0] to reg. 0 is disabled 1111 max (SEQ 15/16ths)  0000         (A--)  W
&38   Halt_set      1 sets bits in reg. 0 no change. 1(n) to halt P(n).          1111         (-MS)  W
&3C   Halt_clr      1 clears bits in reg. 0 no change.                                        (-M-)  W
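
The MMU_LSN and MMU_MSN entries each carry four address bits.  The sketch below shows how an address would be split into the two 
nibbles, assuming that the registers together supply A[28:21] to the memory modifier unit as the table suggests.  The function names 
and the example address are chosen purely for illustration; the address is not a documented reset location.

```python
def mmu_nibbles(address):
    """Split an address into the values for MMU_LSN and MMU_MSN."""
    lsn = (address >> 21) & 0xF  # D[3:0] for MMU_LSN, address bits A[24:21]
    msn = (address >> 25) & 0xF  # D[3:0] for MMU_MSN, address bits A[28:25]
    return lsn, msn

def mmu_address(lsn, msn):
    """Rebuild address bits A[28:21] from the two nibbles (2 MB granularity)."""
    return ((msn & 0xF) << 25) | ((lsn & 0xF) << 21)

lsn, msn = mmu_nibbles(0x03C00000)
print("MMU_LSN = &%X, MMU_MSN = &%X" % (lsn, msn))
```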

Status Registers:

Addr  Register      Settings                                                                     Reset State  Flags  R/W

&00   FIQ_status    D(n)=1 if P(n) set interrupt. For D(n) <self> =1 then Master set interrupt.               (-MS)  R
&04   FIQ_readback  D(0:3) returns data written to FIQ_set reg.                                               (-MS)  R
&08   HardwareVer   D(0:3) with hardware id number (current version returns 1)                                (A--)  R
&18   MMU_status    D(n)=1 then MMU enabled for P(n).                                            0000         (A--)  R
&1D   ID_status     D[3:0]: Master=X0XX, P(0)=X100, P(1)=X101, P(2)=X110, P(3)=X111.                          (A--)  R
&20   IRQ_status    D(n)=1 if P(n) set interrupt. For D(n) <self> =1 then Master set interrupt.               (-MS)  R
&24   IRQ_readback  D(0:3) returns data written to IRQ_set reg.                                               (-MS)  R
&30   RST_status    D(n)=1 then P(n) is still under RESET                                                     (A--)  R
&38   Halt_status   D(n)=1 then P(n) is halted                                                                (A--)  R

Access Flags:	A - Any processor, M - Master only,  S - Slave only
*NOTE:	MS - Master and Slave have context sensitive access  
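
As an illustration of how software might interpret ID_status, the sketch below decodes the 4 bit value using the bit patterns in the 
status register list.  It assumes that the pattern X111 identifies slave 3, and that bit 3 (shown as X) can be ignored.  The function 
name is invented for the example.

```python
def decode_id_status(value):
    """Return "master" or the slave number (0-3) for an ID_status reading."""
    value &= 0xF
    if (value & 0b0100) == 0:   # Master reads back X0XX
        return "master"
    return value & 0b0011       # Slaves read back X1nn, nn = socket number

for reading in (0b0000, 0b0100, 0b0101, 0b0110, 0b0111):
    print(format(reading, "04b"), "->", decode_id_status(reading))
```

Once a slave knows its own number n, bit D(n) is the one that refers to it in registers such as Halt_status and RST_status.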

Obsolete Registers:

&08   PFIQ_set      1 sets bits in reg. 0 no change. 1(n) asserts PFIQ to master.  0000         (--S)  W
&0C   PFIQ_clr      1 clears bits in reg. 0 no change.                                          (-M-)  W
&28   PIRQ_set      1 sets bits in reg. 0 no change. 1(n) asserts PIRQ to master.  0000         (--S)  W
&2C   PIRQ_clr      1 clears bits in reg. 0 no change.                                          (-M-)  W
&10   PFIQ_status   D(n)=1 if P(n) set interrupt.                                               (A--)  R
&28   PIRQ_status   D(n)=1 if P(n) set interrupt.                                               (A--)  R


Inter-processor interrupts

WARNING - THESE REGISTERS HAVE CHANGED!
BEFORE USING THESE INTERRUPT REGISTERS,  CONTACT SIMTEC FOR THE LATEST DOCUMENTATION

The Hydra card supports two identical interrupt structures, one for IRQs and the other for FIQs.  In each case it is possible for a 
slave to assert the interrupt line of one or more other slaves simultaneously by writing to the IRQ_set or FIQ_set register.
Slaves signal the Master processor by writing to the same registers, which assert the appropriate podule interrupt lines.
The IRQ mechanism is used by the API for message passing and should not be used by user code.  However, FIQs may be freely 
used.  The default owner of the FIQ vector is the register snapshot routine used by the debugger.

Bits are set by writing to a set register with the required bits set to 1, and cleared by writing to the paired clear register 
with bits set in the positions where bits are to be cleared.  In this way, if other processors set additional interrupt bits, they won't 
be accidentally cleared by the interrupted processor, as writing a zero to any of the registers has no effect.
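
The benefit of paired set and clear registers can be seen in a small software model.  This is a sketch of the behaviour described 
above, with an invented class name; it does not access the hardware.

```python
class PairedRegister:
    """Model of a 4 bit register with separate set and clear addresses."""

    def __init__(self, reset_state=0):
        self.bits = reset_state & 0xF

    def write_set(self, mask):
        # A 1 sets the bit; a 0 leaves it unchanged.
        self.bits |= mask & 0xF

    def write_clr(self, mask):
        # A 1 clears the bit; a 0 leaves it unchanged.
        self.bits &= ~mask & 0xF

irq = PairedRegister()
irq.write_set(0b0001)   # first sender posts an interrupt
irq.write_set(0b0100)   # a second sender posts another before it is serviced
irq.write_clr(0b0001)   # recipient acknowledges only the first
print(format(irq.bits, "04b"))   # the second interrupt is still pending
```

With a single read-modify-write register the acknowledgement could overwrite the second sender's bit; here no read is needed at all.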

Inter slave and master to slave interrupts:  FIQ_set: D(0:3) & IRQ_set: D(0:3)

          Set register:                 Status register:

Bits:     D0     D1     D2     D3       D0     D1     D2     D3

Master:   M>S0   M>S1   M>S2   M>S3     S0>M   S1>M   S2>M   S3>M

Slave0:   S0>M   S0>S1  S0>S2  S0>S3    M>S0   S1>S0  S2>S0  S3>S0
Slave1:   S1>S0  S1>M   S1>S2  S1>S3    S0>S1  M>S1   S2>S1  S3>S1
Slave2:   S2>S0  S2>S1  S2>M   S2>S3    S0>S2  S1>S2  M>S2   S3>S2
Slave3:   S3>S0  S3>S1  S3>S2  S3>M     S0>S3  S1>S3  S2>S3  M>S3


A slave can send an interrupt to other slaves by writing the appropriate bits to the set register: D(n) will send an interrupt to slave 
processor n.  When a slave reads the status register, a vertical slice is read, with a bit set for every processor that has posted it an 
interrupt: slave 0 sets D(0), slave 1 sets D(1) etc.  As sending an interrupt to oneself has no purpose, the otherwise 
redundant diagonal bits are used to store the interrupt bits written by the Master processor to the slaves.  When writing to the 
set register, the Master sets the flags M>S0, M>S1, M>S2 and M>S3, one for each slave.
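
The bit assignments in the table can be expressed as two small functions.  This is a sketch based on the layout printed here, which 
the warning at the top of this section says has since changed, so treat it as an illustration of the scheme rather than a reference.  
The names MASTER, set_bit and status_source are invented; slaves are numbered 0 to 3.

```python
MASTER = "M"

def set_bit(sender, recipient):
    """Bit number D(n) the sender writes in FIQ_set or IRQ_set."""
    if sender == recipient:
        raise ValueError("a processor cannot interrupt itself")
    if sender == MASTER:
        return recipient      # M>S(n) uses D(n)
    if recipient == MASTER:
        return sender         # a slave uses its own diagonal bit
    return recipient          # S(m)>S(n) uses D(n)

def status_source(reader, bit):
    """Who posted the interrupt seen as D(bit) in the reader's status register."""
    if reader == MASTER:
        return bit            # D(n) set means slave n interrupted the Master
    return MASTER if bit == reader else bit

print(set_bit(1, MASTER))     # slave 1 signals the Master with D1
print(status_source(2, 2))    # D2 in slave 2's status register is the Master
```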

FIQ_readback: D(0:3) & IRQ_readback: D(0:3)

It is possible for a processor to examine whether an interrupt has been cleared by the recipient by reading the readback registers.  
They return the bitfield in the same format as the set registers.  Because continuous polling of a register is bus-inefficient,  it is 
expected that an acknowledge interrupt will be returned to the sender after the interrupt is serviced.

Slave to Master interrupts are performed by writing to the 'redundant' bit that corresponds to the slave itself.
When a slave interrupts the master it sets its flag bit in the 4 bit IRQ or FIQ set register.

Once an interrupt bit is set, it can only be cleared by the recipient of the interrupt or by a system reset.  In case interrupts are sent 
to a processor that is not fitted or not running, the ForceFIQ_clr and ForceIRQ_clr registers allow the Master to clear all interrupts 
destined for a particular slave.  D(0) clears all interrupts to slave 0, D(1) to slave 1 etc.


Simtec Hydra Multiprocessor hardware overview				Iss B 17th May 1996
