RISC-V PMP and Zephyr RTOS — a new user(space) has entered

Introduction

Adoption of the RISC-V open standard ISA continues to grow, along with market acceptance of solutions based on RISC-V designs. With this adoption come more and more use cases and requirements for bringing products to market, including security requirements.

One of those new use cases is the growing use of RISC-V with the Zephyr open source real-time operating system (RTOS).  Until recently, Zephyr only supported basic features for MCU-class RISC-V SoCs, but demand for security features such as hardware memory protection and stack protection has been growing.  

The RISC-V architecture supports these features in hardware, but support was missing in Zephyr.  To meet that demand, BayLibre recently implemented memory protection for the RISC-V architecture in Zephyr.

In this article, we’ll give a brief introduction to some Zephyr features and related RISC-V hardware features before diving into how we used those features to implement memory protection, stack protection and user-mode threads for the 32-bit RISC-V architecture (aka RV32).

Overview of generic Zephyr features

Before getting into the details of RISC-V, this section will give a very brief overview of the Zephyr features and terminology used throughout the article.  For more detailed documentation, please see the Zephyr project documentation; the User Mode and Memory Protection sections in particular provide very useful background information.

Privilege modes

By default, Zephyr has a single privilege mode called kernel mode. This is the privilege level of the OS kernel itself.  For hardware platforms without an unprivileged mode, threads will also run in privileged mode along with the kernel.  On these platforms, a buggy or malicious thread could corrupt other threads or the kernel itself.  This is obviously undesirable.

User mode threads

For hardware architectures that support additional modes with reduced privileges, Zephyr offers the option to run threads at a reduced privilege level, called user mode.

The primary goal of this work was to take advantage of the RISC-V hardware privilege levels to run threads at the lowest privilege mode, thus keeping threads isolated from other threads and from the kernel.  
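
To make this concrete, here is a minimal sketch of creating a user mode thread with Zephyr’s thread API; the K_USER option requests the reduced privilege level. (The stack size and priority are arbitrary here, and exact API shapes vary slightly across Zephyr releases.)

    #include <zephyr.h>

    #define USER_STACK_SIZE 1024

    K_THREAD_STACK_DEFINE(user_stack, USER_STACK_SIZE);
    static struct k_thread user_thread;

    /* The entry point runs in U-mode: direct access to kernel memory
     * would fault here. */
    static void user_entry(void *p1, void *p2, void *p3)
    {
        ARG_UNUSED(p1);
        ARG_UNUSED(p2);
        ARG_UNUSED(p3);
    }

    void start_user_thread(void)
    {
        k_thread_create(&user_thread, user_stack,
                        K_THREAD_STACK_SIZEOF(user_stack),
                        user_entry, NULL, NULL, NULL,
                        5 /* priority */, K_USER, K_NO_WAIT);
    }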

Stack protection

Zephyr supports hardware stack protection when the underlying hardware provides it.  This optional feature (CONFIG_HW_STACK_PROTECTION) detects stack buffer overflows while the system is running in supervisor mode.  It catches cases where the entire kernel stack buffer has overflowed, but not overflows of individual stack frames.  However, Zephyr also supports optional compiler-assisted stack canaries (CONFIG_STACK_CANARIES) that cover individual frames.

Stack separation

Along with stack overflow protection, Zephyr provides the ability to have per-thread stacks separated from the kernel stack.  When combined with a memory protection unit (MPU), this provides MPU-backed userspace.

Memory Domains 

When the underlying hardware provides memory protection, unprivileged user threads may only access the required memory regions and nothing else.  The minimum regions for a user thread are:

  • program text and read-only data
  • its own stack

In addition, Zephyr provides a memory domain API for granting a user thread access to additional blocks of memory.  Conceptually, a memory domain is a collection of memory partitions, and the maximum number of partitions in a domain is limited by the number of regions the MPU can define.  Many RISC-V cores have a very limited number of MPU regions, so care must be taken not to consume too many of them on these platforms; this is also why it is important to minimize the number of boot-time MPU regions.
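
As an illustrative sketch of the memory domain API (the partition attribute macros are architecture-specific, and header paths and exact signatures vary across Zephyr releases, so treat the details below as assumptions), granting a user thread access to an extra buffer might look like this:

    #include <zephyr.h>
    #include <app_memory/app_memdomain.h>

    /* Size equals alignment so the buffer can be covered by a single
     * NAPOT PMP entry (see the PMP section below). */
    static uint8_t shared_buf[256] __aligned(256);

    K_MEM_PARTITION_DEFINE(shared_part, shared_buf, sizeof(shared_buf),
                           K_MEM_PARTITION_P_RW_U_RW);

    static struct k_mem_domain app_domain;
    static struct k_mem_partition *app_parts[] = { &shared_part };

    void grant_shared_buf(k_tid_t tid)
    {
        /* One partition: one PMP entry if NAPOT, two if arbitrary. */
        k_mem_domain_init(&app_domain, ARRAY_SIZE(app_parts), app_parts);
        k_mem_domain_add_thread(&app_domain, tid);
    }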

Overview of relevant RISC-V hardware features

In this section, we’ll introduce the RISC-V hardware capabilities that will enable us to implement the memory protection and userspace features needed.

Privilege modes in hardware

The RISC-V architecture defines three primary privilege modes, from most privileged to least:

  • Machine mode (M)
  • Supervisor mode (S)  (optional)
  • User mode (U) (optional)

Note that machine mode, more commonly called M-mode, is the only mandatory mode; the others are optional.  This means that by default, the Zephyr port for RISC-V must run in M-mode.  Commonly used RISC-V cores for embedded applications, such as the E family of cores from SiFive, provide M and U modes but no S mode.  The primary project described below was based on an E31 core from this family.

Before this work, the RISC-V port for Zephyr only supported M-mode, so a big chunk of the work needed was the infrastructure for switching back and forth between privilege modes.  Switching from a lower privilege mode to a higher one happens through explicit calls to a higher privilege level (e.g. system calls) or when code performs an operation that is not permitted at its current privilege level, causing an exception.  Either way, the CPU traps to a higher privilege level where the event can be properly handled.  The implementation of this switching is described in more detail below.

Physical Memory Protection (PMP)

The Zephyr documentation for memory protection refers to processors with memory protection units (MPUs).  On RISC-V designs the MPU is an optional part of the spec; when present, it is provided by hardware called the Physical Memory Protection (PMP) unit.  The PMP provides per-CPU control registers that allow physical memory access privileges (read, write, execute) to be specified for each physical memory region.  PMP checks are applied to all memory accesses when the CPU is in supervisor (S) or user (U) mode.

Optionally, PMP checks may additionally apply to machine (M) mode accesses, in which case the PMP registers themselves are locked, so that even M-mode software cannot change them without a system reset. In effect, PMP can grant permissions to S and U modes, which by default have none, and can revoke permissions from M-mode, which by default has full permissions.

The granularity of PMP access control settings is platform-specific, and within a platform it may vary by physical memory region, but the standard PMP encoding supports regions as small as four bytes.

The RISC-V ISA specification for PMP defines up to 16 PMP entries, but the actual number of entries present in each design is vendor-defined and varies from platform to platform.  For example, the SiFive E31 core used for this project only has 8 entries.  The limited number of PMP entries leads to complexities and trade-offs in the implementation choices, which will be discussed in more detail later.

Each PMP entry is defined by a config register and an address register.  To further strain the already limited supply of PMP entries, two entries must be used to define an arbitrary memory range: one for the start address and another for the end address.  Using two slots for each memory range consumes the limited number of PMP slots very quickly.

PMP entries can be used more efficiently if the memory regions to be protected are naturally aligned powers of two, referred to as NAPOT in the RISC-V spec.  With NAPOT memory regions, a single PMP entry can define each region.  The limited number of PMP entries therefore calls for careful consideration of the memory layout and the regions to be protected.
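
To make the encodings concrete, here is a sketch of the per-entry config bits and the NAPOT address encoding from the privileged spec (the helper names are ours, not Zephyr’s):

    #include <stdint.h>
    #include <stddef.h>

    /* Per-entry PMP config bits, per the RISC-V privileged spec. */
    #define PMP_R       0x01  /* readable */
    #define PMP_W       0x02  /* writable */
    #define PMP_X       0x04  /* executable */
    #define PMP_A_TOR   0x08  /* top-of-range: pairs with the previous pmpaddr */
    #define PMP_A_NA4   0x10  /* naturally aligned 4-byte region */
    #define PMP_A_NAPOT 0x18  /* naturally aligned power-of-two region */
    #define PMP_L       0x80  /* lock the entry and apply it to M-mode too */

    /* NAPOT encoding: pmpaddr holds (base >> 2), with a trailing-ones
     * pattern encoding the size, so one entry covers the whole region.
     * Requires size to be a power of two >= 8 and base aligned to size. */
    static inline uintptr_t pmp_napot_addr(uintptr_t base, size_t size)
    {
        return (base >> 2) | ((size - 1) >> 3);
    }

    /* Program entry 0 as a read/execute NAPOT region.  CSR numbers are
     * encoded in the instruction, so each entry is written explicitly;
     * for brevity this clobbers the other three config fields packed
     * into pmpcfg0. */
    static inline void pmp_set_entry0_rx(uintptr_t base, size_t size)
    {
        __asm__ volatile("csrw pmpaddr0, %0"
                         :: "r" (pmp_napot_addr(base, size)));
        __asm__ volatile("csrw pmpcfg0, %0"
                         :: "r" ((uintptr_t)(PMP_R | PMP_X | PMP_A_NAPOT)));
    }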

The full details of the PMP entries can be found in the RISC-V ISA specification, Volume II: Privileged Architecture.

New features / What we added

Now that we have an overview of relevant Zephyr features and RISC-V hardware capabilities, we can get into the implementation details for how these features were added for a 32-bit E31 core from SiFive. 

User mode threads /  CONFIG_USERSPACE

The Zephyr architecture porting guide describes the details of the APIs needed to support user mode threads.  Here we’ll describe the implementation of the main parts of this new functionality.  

Privilege mode detection

First, Zephyr provides an API call for detecting whether the current privilege level is user mode.  

arch_is_user_context():

Return non-zero if the CPU is currently running in user mode.

This will be called from different places, and can be called from any privilege mode.  Unfortunately, RISC-V does not have a CPU register readable from all privilege modes to get this information, so two possible solutions were evaluated.

  1. Use an ECALL (Environment Call) instruction or a machine-mode-only instruction.  Either option would trap to machine mode when executed in user mode, where the originating privilege level could be checked.  This approach would add complexity to the kernel fault handlers, but more importantly it would add significant overhead: this function is called often, so a solution that traps to machine mode on every call is bad for overall performance.
  2. Declare a global variable, protected by the PMP and made read-only for user mode, which is updated every time the privilege mode changes.  This approach has no performance overhead, but it does consume one of our limited PMP entries.  The smallest region the RISC-V PMP can describe is 4 bytes, so a 4-byte region in RAM is reserved for this variable.

After experimentation, the second solution was chosen since it was secure and significantly faster.
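
A minimal sketch of the chosen approach (the names are ours; the real implementation updates the flag on every privilege transition and, as noted in the Future Work section, needs rework for multi-CPU systems):

    #include <stdbool.h>
    #include <stdint.h>

    /* A 4-byte flag in RAM, aligned so a single NA4 PMP entry can make
     * it read-only from user mode while M-mode updates it on every
     * privilege transition. */
    volatile uint32_t is_user_mode __attribute__((aligned(4)));

    static inline bool arch_is_user_context(void)
    {
        /* A plain load: no trap needed, readable from any mode. */
        return is_user_mode != 0;
    }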

User syscalls

Zephyr provides a set of system call APIs that allow user mode threads to invoke kernel mode functionality with arguments. For RISC-V, these are built on top of the ECALL instruction, the RISC-V instruction used to trigger a change in privilege mode.  The bulk of the work for adding user mode threads was in changing the kernel mode ECALL handler to handle the transitions between privilege modes.
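
For illustration, a one-argument syscall wrapper could look like the sketch below.  The register assignments (a0 for the argument and return value, a7 for the call ID) follow a common RISC-V convention and are an assumption here, not necessarily Zephyr’s exact ABI:

    #include <stdint.h>

    static inline uintptr_t demo_syscall1(uintptr_t arg1, uintptr_t call_id)
    {
        register uintptr_t a0 __asm__("a0") = arg1;     /* argument in, return value out */
        register uintptr_t a7 __asm__("a7") = call_id;  /* syscall ID (assumed register) */

        /* ECALL traps to the M-mode handler, which dispatches on the ID. */
        __asm__ volatile("ecall" : "+r" (a0) : "r" (a7) : "memory");

        return a0;
    }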

Before we started this work, the core infrastructure for system calls was already in place, since system calls are used from kernel mode as well.  In kernel mode, the system call logic is also used for context switching and IRQ offload.

However, to support user mode, quite a bit of additional functionality was needed.  Focusing on just the logic involved for the newly added user-mode threads, here’s a high-level overview of the steps involved for handling a system call from user mode:

User mode:

  1. User mode thread invokes a system call function
  2. Syscall wrapper prepares the arguments
  3. Syscall wrapper issues the ECALL instruction, which traps to M-mode

Machine mode: ECALL exception handler

  1. Save the exception stack frame (ESF) and SoC context
  2. Clear the is_user flag
  3. Kernel ECALL?
    1. Is this a return from a user-mode syscall? -> return_from_syscall
  4. Handle the user ECALL
  5. Load the syscall arguments from the ESF
  6. Switch to the privileged (kernel) stack
  7. Validate the syscall ID
  8. do_syscall()
  9. The syscall return performs a (nested) ECALL (back to step 1)

Machine mode: return_from_syscall

  • Set the is_user flag
  • Restore registers (ESF)
  • Restore the thread stack pointer
  • Return to user via the MRET instruction (see the sketch below)
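
As a sketch of that final step (the CSR names are from the privileged spec; in practice this lives in the assembly trap-exit path, and the register and stack restore is omitted here):

    #include <stdint.h>

    #define MSTATUS_MPP_MASK (3UL << 11)  /* "previous privilege" field of mstatus */

    static inline void demo_return_to_user(uintptr_t user_pc)
    {
        unsigned long mstatus;

        __asm__ volatile("csrr %0, mstatus" : "=r" (mstatus));
        mstatus &= ~MSTATUS_MPP_MASK;                        /* MPP = 00: U-mode */
        __asm__ volatile("csrw mstatus, %0" :: "r" (mstatus));
        __asm__ volatile("csrw mepc, %0" :: "r" (user_pc));  /* where MRET lands */
        __asm__ volatile("mret");                            /* switch mode and jump */
    }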

Note that the ECALL handler is used for all interrupt and exception handling for the RISC-V port of Zephyr.  This leads to some complexity in the code, but it’s well worth it for the amount of code reuse.  RISC-V has special “machine trap delegation” registers (medeleg / mideleg) which could be useful for separating kernel and user mode exception and interrupt handling.  This will be discussed later in the Future Work section.

Physical Memory Protection (PMP)

In this section, we’ll describe in more detail how the PMP is configured.  But first, here’s an overview of the memory regions that need to be protected.

User regions, per-thread

Each user mode thread should only be allowed access to the minimum required memory regions.  By default, these regions are:

  • program executable “text” and read-only .data section (RX, per-thread)
  • thread stack (includes TLS for thread data & bss sections) (RW, per-thread)

The text and read-only data sections are adjacent, so they are protected by a single PMP region with read (R) and execute (X) privileges.  Next is the thread stack, which also holds thread-local storage (TLS), i.e. the thread’s data (.tdata) and bss (.tbss) sections, and is marked read-write.

Recall from the PMP overview above that two PMP entries are required to define an arbitrary memory region.  For these two regions, then, we are already using 4 of the 8 available PMP slots.  If we can restrict the read-only region and the thread stack to naturally aligned power-of-two (NAPOT) regions, we can reduce this to one entry each.  This can be decided at build time by enabling CONFIG_PMP_POWER_OF_TWO_ALIGNMENT, which turns on CONFIG_MPU_REQUIRES_POWER_OF_TWO_ALIGNMENT.

User regions, shared

In addition to the per-thread regions, one additional region is required to protect the variable holding the current privilege mode.

  • current privilege mode (read-only, all threads)

As discussed above in the Privilege mode detection section, a global variable was created to keep track of the current privilege mode.  This variable needs to be readable from all privilege modes but read-only from user mode; since it occupies only 4 bytes, a single PMP entry is enough to protect it.

How many PMP entries are needed?

To summarize, in order to protect the minimum regions for user-mode threads, 5 PMP entries are required for the general case, but this could be reduced to 3 for the special case of NAPOT regions for read-only memory and thread stack.

Memory Domains

If the memory domain API is enabled, additional PMP entries are required to protect each partition of a memory domain.  However, the number of entries available for memory domains is limited by the total number of PMP entries, which is at most 16 but more commonly 8.

Considering that up to 5 entries may already be used for user-mode thread protection, that leaves only 3 available PMP entries for memory domain partitions.   If the memory partitions themselves are NAPOT, then up to 3 can be defined.  But if they are not NAPOT, 2 PMP entries per region are required, meaning that only a single memory partition can be defined.

The RISC-V architecture implementation will define a maximum number of PMP entries available for memory domains, and this value will be returned by arch_mem_domain_max_partitions_get().  This API call is used by core kernel code to ensure that dynamic requests for memory domains do not exceed the maximum number of MPU entries that can be managed by the underlying hardware. 
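
As an illustration of how that maximum could fall out of the PMP entry budget described above (the constants and their names are ours, not Zephyr symbols):

    /* On an 8-entry PMP (e.g. SiFive E31), the general case reserves
     * 2 entries for text/rodata, 2 for the thread stack and 1 for the
     * is_user flag, and a non-NAPOT partition costs 2 entries. */
    #define PMP_ENTRY_COUNT      8
    #define PMP_RESERVED_ENTRIES 5
    #define PMP_ENTRIES_PER_PART 2

    int arch_mem_domain_max_partitions_get(void)
    {
        return (PMP_ENTRY_COUNT - PMP_RESERVED_ENTRIES)
               / PMP_ENTRIES_PER_PART;  /* = 1 */
    }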

Future work / Issues / Tricky stuff

While working on this project, we ran into several challenges and areas for potential future work.  Some of those areas have already been mentioned above, but a few others will be described in this section.

Syscalls

In order to support memory-protected userspace, we added quite a bit of complexity to the system call handler for ECALL.  The extra complexity comes from the fact that this main trap handler must handle exceptions and interrupts from both kernel space and userspace.  A potential improvement would be to split this handler using the RISC-V machine trap delegation registers (medeleg and mideleg).  This has not yet been explored, but it could improve the readability and maintainability of the exception and interrupt handling in the RISC-V port of Zephyr.

Hardware Stack Protection

In this article, we only discussed isolation of the thread stacks.  Using the same underlying PMP hardware, hardware stack overflow protection was implemented for the kernel stack as well, but discussion of this implementation will be left for a future article.

QEMU support

While developing the software for this effort, the RISC-V support in QEMU was often useful for adding new features.  However, we discovered that the PMP emulation in QEMU was not fully functional.  We have proposed some modifications to QEMU to improve this functionality, but more work is needed to make the PMP support fully functional in QEMU.

RISC-V 64-bit support

Many 64-bit RISC-V (aka RV64) platforms support supervisor (S) mode in addition to machine and user modes.  It is worth exploring whether Zephyr could usefully support S-mode on these cores the same way the Linux port to RV64 does: OpenSBI in M-mode, Zephyr in S-mode and threads in U-mode.

Multi-core

This work adding RISC-V PMP support to Zephyr was developed primarily on single-core embedded processors (e.g. the SiFive E family).  Some features, such as the is_user flag that was added, need additional work to support SoCs with multiple CPUs and to ensure they are safe on symmetric multi-processing (SMP) platforms.

BayLibre & Zephyr

In addition to the RISC-V PMP work done for this project, BayLibre is active in the Zephyr community where we are collaborators and maintainers of core parts of Zephyr.  

While the solution discussed in this article targeted the SiFive E34 MCU with a floating point unit, BayLibre has also implemented support in Zephyr for more capable 64-bit RISC-V MPUs (microprocessor-class devices).  Speaking of 64-bit, BayLibre also implemented the 64-bit support for Armv8 cores.

If your team needs help with RISC-V, Zephyr or any other embedded software such as Linux or Android, don’t hesitate to reach out to the experts at BayLibre.