# Linux on the Cell processor

Linux Kernel Hacking Free Course, IV edition

#### Paolo Palana

System Programming Research Group, University of Rome Tor Vergata

palana@sprg.uniroma2.it

April 23, 2008



#### What is Cell?

Cell is a multiprocessor system on a single chip, developed by IBM in collaboration with Sony and Toshiba.





- Many other multiprocessor architectures exist today:
  - Intel Core Duo
  - Intel Xeon
  - AMD Athlon 64 X2
  - AMD Opteron
- All of these are homogeneous architectures
- Cell has a non-homogeneous architecture:
  - One general-purpose processor (PPE)
  - Eight special-purpose processors (SPEs)
- To fully exploit the Cell architecture, a new programming approach is required



## **Architectural Overview**





#### The Power Processor Element:

- The main processor: it runs both the operating system and general-purpose applications, and it spawns tasks on the SPEs
- A dual-threaded general-purpose processor
- Based on a 64-bit RISC architecture conforming to the PowerPC Architecture version 2.02
- Has vector/SIMD multimedia extensions



#### PPE simple block diagram





Image taken from CBE Programming Tutorial v. 3

### Synergistic Processor Element (SPE)

Each SPE is:

- A slave processor: it executes tasks spawned by the PPE
- Based on a 128-bit RISC architecture specialized for compute-intensive SIMD applications



#### SPE simple block diagram





Image taken from CBE Programming Tutorial v. 3



#### Synergistic Processor Unit (SPU)

- Handles instruction execution and control
- Single (unified) register file with 128 registers
- Unified 256 KB local memory for instructions and data, called the Local Store (LS)
- New SIMD (Single Instruction, Multiple Data) instruction set



### Local Store (LS)

- Each SPE is an independent processor with its own program counter
- The SPU fetches instructions from its own Local Store, and loads and stores data there





#### Memory Flow Controller (MFC)

- It is the interface between the SPE and the other system processors
- Contains a DMA controller to support DMA transfers
- To support the DMA controller, the MFC maintains a queue of DMA commands
- After a DMA command has been queued, the SPU can continue to execute instructions while the MFC processes the DMA command

#### DMA transfers

- Each DMA transfer can move up to 16 KB
- The SPU associated with an MFC can issue a DMA list of up to 2048 DMA transfers
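
The 16 KB limit means that a larger buffer has to be split into several DMA commands. The following sketch, in portable C, only works out that arithmetic. It is not MFC or libspe2 code, and the names `DMA_MAX_BYTES`, `DMA_LIST_MAX`, `dma_chunk_count`, `dma_chunk_size` and `fits_in_one_dma_list` are hypothetical helpers introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Limits quoted on the slide; the macro names are hypothetical. */
#define DMA_MAX_BYTES ((size_t)16 * 1024) /* largest single DMA transfer */
#define DMA_LIST_MAX  ((size_t)2048)      /* most transfers in one DMA list */

/* Number of transfers needed to move `total` bytes. */
size_t dma_chunk_count(size_t total)
{
    return (total + DMA_MAX_BYTES - 1) / DMA_MAX_BYTES;
}

/* Size in bytes of the i-th transfer (0-based) when moving `total` bytes. */
size_t dma_chunk_size(size_t total, size_t i)
{
    size_t offset = i * DMA_MAX_BYTES;
    assert(offset < total); /* i must index an existing chunk */
    size_t left = total - offset;
    return left < DMA_MAX_BYTES ? left : DMA_MAX_BYTES;
}

/* Nonzero if the whole buffer can be described by a single DMA list. */
int fits_in_one_dma_list(size_t total)
{
    return dma_chunk_count(total) <= DMA_LIST_MAX;
}
```

Under these two limits, a single DMA list can describe at most 2048 x 16 KB = 32 MB of data.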



# High level programming





- The SPEs and the PPE are independent processors
- To fully exploit Cell performance you must write two different programs:
  - PPE program: a program running on the PowerPC core that offloads tasks to the SPEs
  - SPE program: a program running on an SPE that uses the SPU instruction set





## Creating an SPE thread from the PPE and libspe2

- A PPE program spawns a task on an SPE by creating a thread on the SPE. It uses the following functions:
  - spe\_context\_create() creates a context for the SPE thread
  - spe\_program\_load() loads an SPE program into the context
  - spe\_context\_run() executes a context on a physical SPE
- The functions above are in libspe2, an implementation of the SPE Runtime Management Library developed by IBM under the GPL license and downloadable from http://sourceforge.net/projects/libspe





#### SPE program

- Conceived and compiled for execution on the SPE
- It can use SPE vector data types and SIMD instructions
- SIMD instructions are exposed through the SPU C/C++ language extensions and are called intrinsics
- An SPE program moves data between main memory and the Local Store through DMA transfers





- Use vector data types instead of scalars
- Perform loop unrolling
- Use double buffering





- The SPE has a vector architecture. The SPU loads and stores one quadword (16 bytes) at a time
- Scalar types are stored in the left-most word of the register (the *Preferred Slot*)
- Scalar types should be avoided as much as possible, because operations on them are inefficient
- For example, a loaded scalar must be rotated into the preferred slot
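
To make the difference between scalar and vector code concrete, here is a minimal sketch that runs on an ordinary host. It assumes the GCC/Clang `vector_size` extension as a stand-in for the SPU `vector float` type, so it is not SPU code. The names `v4f`, `madd_scalar` and `madd_vector` are hypothetical. On the SPU, the multiply-add in the loop body would typically be expressed with an intrinsic such as `spu_madd`.

```c
#include <assert.h>

/* Four floats packed in one 16-byte (quadword) vector.
 * GCC/Clang extension standing in for the SPU `vector float` type. */
typedef float v4f __attribute__((vector_size(16)));

/* Scalar version: y[i] = a[i] * x[i] + y[i], one element per iteration. */
void madd_scalar(const float *a, const float *x, float *y, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        y[i] = a[i] * x[i] + y[i];
}

/* Vector version: the same operation, four elements per iteration.
 * n4 is the number of 4-float vectors, not the number of floats. */
void madd_vector(const v4f *a, const v4f *x, v4f *y, int n4)
{
    assert(n4 >= 0);
    for (int i = 0; i < n4; i++)
        y[i] = a[i] * x[i] + y[i];
}
```

The vector loop runs a quarter as many iterations as the scalar one, and each iteration works on a full quadword, which is the unit the SPU loads and stores.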



- Loop unrolling is a common technique for increasing performance
- SPEs have 128 registers
- Loop unrolling can improve register utilization
- PROBLEM:
  - Loop unrolling increases code size
  - Data and code must fit in the 256 KB Local Store
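
As an illustration of the technique, the sketch below unrolls a simple reduction by four in portable C. The function names `sum_simple` and `sum_unrolled4` are hypothetical, and the unroll factor is an arbitrary choice for the example, not a recommendation from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Plain loop: one accumulator, so every addition depends on the previous one. */
float sum_simple(const float *v, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Unrolled by four with independent accumulators: more registers are in use
 * and the four additions in the body do not depend on each other. */
float sum_unrolled4(const float *v, size_t n)
{
    assert(n % 4 == 0); /* this sketch does not handle a remainder loop */
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        acc0 += v[i];
        acc1 += v[i + 1];
        acc2 += v[i + 2];
        acc3 += v[i + 3];
    }
    return (acc0 + acc1) + (acc2 + acc3);
}
```

The unrolled body is about four times larger than the original one, which is the code-size cost mentioned above. Because the additions are reordered, the floating-point result can differ slightly from the plain loop for general inputs.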



## Double buffering (1/2)



- The SPU moves data from/to main memory only with DMA transfers
- The communication bus between the SPEs and the PPE is a bottleneck
- In the Cell architecture DMA transfers are asynchronous
- This feature allows the programmer to schedule the transfers so that memory access latency is hidden by overlapping the transfer into one buffer with the computation on another



Image taken from CBE Programming Tutorial v. 3
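
The control flow of double buffering can be sketched in portable C as follows. This is only a simulation under stated assumptions: `fetch_chunk` is a hypothetical stand-in that uses a synchronous `memcpy`, whereas on the SPE it would be an asynchronous DMA command handled by the MFC. The names `CHUNK`, `consume_chunk` and `sum_double_buffered` are also hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4096 /* floats per buffer: 16 KB, the maximum size of one DMA transfer */

/* Stand-in for starting a DMA "get" from main memory into a local buffer.
 * Here it is a synchronous memcpy; on the SPE it would complete in the background. */
static void fetch_chunk(float *local, const float *main_mem, size_t count)
{
    memcpy(local, main_mem, count * sizeof *local);
}

/* The computation applied to a chunk that is already in a local buffer. */
static double consume_chunk(const float *local, size_t count)
{
    double s = 0.0;
    for (size_t i = 0; i < count; i++)
        s += local[i];
    return s;
}

/* Sum n floats from main_mem using two local buffers in ping-pong fashion. */
double sum_double_buffered(const float *main_mem, size_t n)
{
    static float buf[2][CHUNK];
    size_t nchunks = (n + CHUNK - 1) / CHUNK;
    double total = 0.0;
    int cur = 0;

    assert(main_mem != NULL || n == 0);
    if (nchunks == 0)
        return 0.0;

    /* Prime the pipeline: fetch the first chunk. */
    fetch_chunk(buf[cur], main_mem, n < CHUNK ? n : CHUNK);

    for (size_t c = 0; c < nchunks; c++) {
        size_t count = (c + 1 < nchunks) ? (size_t)CHUNK : n - c * CHUNK;
        int next = cur ^ 1;

        /* Start fetching chunk c+1 into the other buffer. */
        if (c + 1 < nchunks) {
            size_t off = (c + 1) * CHUNK;
            size_t next_count = (n - off < CHUNK) ? n - off : (size_t)CHUNK;
            fetch_chunk(buf[next], main_mem + off, next_count);
        }

        /* Compute on chunk c. With real asynchronous DMA the code would first
         * wait for the transfer of chunk c to complete. */
        total += consume_chunk(buf[cur], count);
        cur = next;
    }
    return total;
}
```

With a synchronous `memcpy` there is no real overlap. The point of the sketch is the buffer rotation, where the transfer for chunk c+1 is issued before the computation on chunk c starts.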



# Cell and the Linux Kernel





- The Cell processor is fully supported by the Linux kernel
- Cell is a PowerPC-based architecture
- In /<path\_linux\_source>/arch/powerpc/platforms you can find two folders (among many others) named:
  - cell
  - ps3
- The first folder contains the code that supports native Cell
- The second folder contains the code that supports the Cell on the Sony PlayStation 3

#### Differences between native Cell and Cell on PS3

- On native Cell the Linux kernel runs directly on the hardware
- On the PS3 the Linux kernel runs in a virtualized environment



#### Why different platforms?

- Why are there different platforms for native Cell and Cell on the PS3?
- The virtualization layer imposes different low-level interactions between the hardware devices and the kernel



#### Kernel execution overview on native Cell

[Figure: block diagram of the Linux kernel stack on native Cell. The Linux kernel runs directly on the hardware.]



#### Kernel execution overview on PlayStation3

[Figure: block diagram of the Linux kernel stack on the PlayStation 3. A hypervisor layer sits between the Linux kernel and the hardware.]

Image taken from http://www.kernel.org/pub/linux/kernel/people/geoff/cell/

#### How libspe2 creates an SPE context - spu\_create() syscall

- The spe\_context\_create() function of libspe2 creates an SPE context
- An SPE context is, essentially, a directory in the spufs pseudo file system (see later)
- spe\_context\_create() creates an entry in spufs through the spu\_create() system call and memory-maps some of the files created by spu\_create(), for example the mem file (see later)
- The spu\_create() system call creates an SPU context in kernel memory and returns an open file descriptor for the directory (in /spu) associated with it



#### spufs

- Similar to procfs and sysfs
- Purely virtual file system
- By convention mounted on /spu
- Directories in /spu represent SPE contexts, whose properties are exposed as regular files
- These contexts are accessed through file operations such as open, read, write, etc.



#### Examples of files in a spufs sub-directory

- mem: the local memory of an SPE context. Mainly used to load the executable of the program to be run on the SPE
- regs: the general purpose registers of an SPE. Normally they cannot be accessed directly, but they can be saved in a context in kernel memory





#### The SPE context

- It is a data structure that represents an SPE task
- A context has all the properties of a physical SPE
- The kernel can use this structure to save the state of an SPE thread
- Context switching on an SPE is very inefficient





- The spe\_program\_load() function of libspe2 loads an SPE ELF object file into an SPE context
- This function does not invoke any system call
- Instead, it uses a memory mapping of the mem file
- Thus, the SPE ELF object file is loaded into the context directly from user space



#### Running an SPE program - spu\_run() syscall

- The spe\_context\_run() function runs an SPE program previously loaded into an SPE context
- It invokes the spu\_run() system call
- spu\_run() starts the execution of the SPE thread. The PPE thread that called spu\_run() blocks in that system call
- Each SPE thread is associated with one PPE thread





- Cell is increasingly used in the academic and scientific world
- With the PS3, Cell is incredibly low cost
- The University of Massachusetts Dartmouth uses a cluster of sixteen PS3s for astrophysics analysis



## Performance scaling with implementation

2048x2048 float matrix multiplication on a single SPE

| Implementation                              | Execution time (ms) | Change vs. scalar |
|---------------------------------------------|---------------------|-------------------|
| Scalar                                      | 338687.230514       | baseline          |
| Vectorized                                  | 336059.746404       | -0.77%            |
| Vectorized, unrolled                        | 280815.662356       | -17%              |
| Vectorized, unrolled, with spu_madd         | 262594.693659       | -23%              |
| Vectorized, spu_madd                        | 75076.210915        | -78%              |
| Vectorized, spu_madd, spu_gcc -O3           | 18072.911028        | -95%              |
| Vectorized, spu_madd, with double buffering | 10509.868133        | -97%              |




- Cell is a very interesting and (potentially) powerful architecture
- Each SPE is capable of 25.6 GFLOPS in integer and single precision arithmetic
- Fully exploiting Cell capabilities is not easy
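
The 25.6 GFLOPS figure can be reproduced with a back-of-the-envelope calculation. The clock frequency (3.2 GHz) and the issue rate (one 4-wide single-precision multiply-add per cycle, counted as two floating-point operations per lane) are assumptions that are not stated on the slides:

$$
4\ \text{lanes} \times 2\ \frac{\text{flop}}{\text{lane} \cdot \text{cycle}} \times 3.2 \times 10^{9}\ \frac{\text{cycles}}{\text{s}} = 25.6 \times 10^{9}\ \frac{\text{flop}}{\text{s}} = 25.6\ \text{GFLOPS}
$$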

