\textbf{\large $\lambda$-NIC: Interactive Serverless Compute on Programmable SmartNICs}

Sean Choi, Muhammad Shahbaz, Balaji Prabhakar, and Mendel Rosenblum

\textit{Stanford University}

\textbf{Abstract—}There is a growing interest in serverless compute, a cloud computing model that automates infrastructure resource-allocation and management while billing customers only for the resources they use. Workloads like stream processing benefit from high elasticity and fine-grain pricing of these serverless frameworks. However, so far, limited concurrency and high latency of server CPUs prohibit many interactive workloads (e.g., web servers and database clients) from taking advantage of serverless compute to achieve high performance.

In this paper, we argue that server CPUs are ill-suited to run serverless workloads (i.e., lambdas) and present $\lambda$-NIC, an open-source framework that runs interactive workloads directly on a SmartNIC; more specifically, an ASIC-based NIC that consists of a dense grid of Network Processing Unit (NPU) cores. $\lambda$-NIC leverages the SmartNIC\textquotesingle s proximity to the network and its vast array of NPU cores to simultaneously run thousands of lambdas on a single NIC with strict tail-latency guarantees. To ease the development and deployment of lambdas, $\lambda$-NIC exposes an event-based programming abstraction, \textit{Match+Lambda}, and a machine model that allows developers to compose and execute lambdas on SmartNICs easily. Our evaluation shows that $\lambda$-NIC achieves up to 880× and 736× improvements in workloads\textquotesingle{} response latency and throughput, respectively, while significantly reducing host CPU and memory usage.

\section{Introduction}

Serverless compute is emerging as an attractive cloud computing model that lets developers focus only on the core applications—building the workloads as small, fine-grained custom programs (i.e., lambdas)—without having to worry about the infrastructure they run on. Cloud providers dynamically provision, deploy, patch, and monitor the infrastructure and its resources (e.g., compute, storage, memory, and network) for these workloads; with tenants only paying for the resources they consume at millisecond increments. Serverless compute lowers the barrier to entry, especially, for organizations lacking expertise, manpower, and budget to efficiently manage the underlying infrastructure resources.

Today, most major cloud providers offer some form of serverless framework, such as Amazon Lambda [1], Google Cloud Functions [2], and Microsoft Azure Functions [3]. These frameworks rely on server virtualization (i.e., virtual machines (VMs) [4]) and containers (e.g., Docker [5]) to deploy, execute, and scale the lambdas. These technologies are designed to maximize utilization of the providers\textquotesingle{} physical infrastructure, while presenting each tenant with its own view of an isolated machine. However, in serverless compute, where server management is hidden from tenants, these virtualization technologies become redundant, unnecessarily bloating the code size of lambdas and causing processing delays and memory overheads [6]. Such increased overheads also limit the concurrency of these workloads, raising the overall cost of running them in a data center.

The cloud-computing industry is now realizing these issues and some providers, such as Google and Cloudflare, have already started developing alternative frameworks (like Isolate [7]) to remove these layers (e.g., containers) and run lambdas directly on a bare-metal server [7]. However, these bare-metal alternatives are inherently limited by the design restrictions of CPU-based architectures, which are exacerbated when running at the scale of cloud data centers [8]. CPUs are designed to process a sequence of instructions quickly. They are not designed to run thousands of such small, discrete functions in parallel: a typical server CPU has on the order of 4 to 28 cores that can run up to 56 simultaneous threads [9]. Each lambda interrupts a CPU core to store the state (e.g., registers and memory) of the currently running function and load a new one, wasting tens of milliseconds worth of CPU cycles per context switch and increasing the overall costs for the cloud providers [10]. Thus, with ever-increasing network speeds (100/400G NICs are on the horizon), these overheads quickly add up, limiting throughput and leading to long-tail latency on the order of 100s of milliseconds [11].

Recently, cloud providers are deploying SmartNICs to reduce load on host CPUs [10]. So far, these attempts have been limited to offloading ad-hoc application tasks (e.g., TCP offload, VXLAN tunneling or overlay networking) to accelerate the network processing [12], [13], [14]. However, modern SmartNICs, more specifically ASIC-based NICs, are more flexible, consisting of hundreds of RISC processors (i.e., NPUs) [15] each with local instruction store and memory. These NICs can run many discrete functions in parallel at high speed and low latency—unlike GPUs and FPGAs, which are optimized to accelerate specific workloads [16], [10], [17].

Serverless workloads by design are small, presenting unique opportunities for SmartNICs to accelerate them, while also achieving strict tail-latency guarantees. However, the main shortcoming of SmartNICs is their programming complexity. Programming SmartNICs is a formidable undertaking that requires intimate familiarity with the NIC\textquotesingle s system and resource architecture (e.g., memory hierarchy, multicore parallelism, and pipelining). Developers need to carefully partition, schedule, and arbitrate these resources to maximize the performance of their applications, which runs counter to the motivation behind serverless compute (i.e., where developers are unaware of the architectural details of the underlying infrastructure). Furthermore, each application has to explicitly handle packet processing as there is no notion of a network stack in SmartNICs.

In this paper, we present λ-NIC, a framework for running interactive serverless workloads entirely on SmartNICs. λ-NIC supports a new programming abstraction (Match+Lambda) along with a machine model (an extension of P4’s match-action abstraction [18] with more sophisticated actions) and helps address the shortcomings of SmartNICs in five key ways. First, users provide their lambdas, which λ-NIC compiles and then, at runtime, selects for execution by matching on the headers of the incoming requests’ packets. Second, users program their lambdas assuming a flat memory model; λ-NIC analyzes the memory-access patterns (i.e., read, write, or both) and sizes, and optimally maps these lambdas across the different memory hierarchies of the NICs while ensuring that memory accesses are isolated. Third, λ-NIC infers which packet headers are used by each lambda and automatically generates the corresponding parser for the headers, thus eliminating the need for manually specifying packet-processing logic within these lambdas. Fourth, instead of partitioning and scheduling a single lambda across multiple NPUs, λ-NIC assumes a run-to-completion (RTC) model, exploiting the fact that lambdas are small and can run within a single NPU. The vast array of NPU cores and short service times of lambdas further mitigate head-of-line blocking and performance issues that lead to high tail latency. Lastly, serverless functions mostly communicate using independent, mutually-exclusive request-response pairs and do not need the strict in-order, streaming delivery semantic provided by TCP [19]. λ-NIC, therefore, employs a weakly-consistent delivery semantic [20], [19], alongside RDMA, for communication between serverless workloads, processing requests directly within the NIC cores without involving the host CPU.

We begin with background on the current state of the art in cloud-computing frameworks and SmartNICs (§II), followed by the challenges and motivations behind λ-NIC (§III). We then describe the overall architecture of λ-NIC (§IV) and its implementation (§V), followed by an extensive evaluation of the system (§VI). Finally, we conclude by discussing λ-NIC’s limitations and future work (§VII), and comparisons with related implementations (§VIII).

II. BACKGROUND

We now discuss the latest advancements in cloud-computing frameworks and programmable SmartNICs, which are the core building blocks of λ-NIC.

A. Cloud Computing Frameworks

Fig. 1 illustrates four different cloud-computing frameworks in use today. Server virtualization is the foremost technology underneath cloud computing that allows a bare-metal server to host multiple VMs, each with its own independent, isolated environment containing a separate Operating System (OS), libraries, and applications. It arrived at a time when advances in hardware made it difficult for a single application to efficiently consume the entire bare-metal server resources, and having multiple applications co-existing on a single server raised various issues (such as isolation and contention for resources). However, as trends shifted from monoliths to building applications as microservices (to increase manageability, resiliency, and scalability), the overheads of having a separate OS for each microservice were no longer negligible. This gave rise to containers [5], which isolate applications’ binaries and libraries while sharing an OS and allowing communication with other containers using an overlay network that is set up using virtual switches (like Open vSwitch (OVS) [21]).

Still, with growing complexity and scale of cloud workloads, it became daunting for many users to provision and manage infrastructure resources with tasks requiring fine-grain allocation of resources under changing workload demands. Early solutions settled on over-provisioning these resources, incurring added cost for idle resources. More recently, serverless compute [22] has emerged as a favorable compute model that alleviates such operational and financial burdens from the users by letting them specify only the workloads, their memory and timing constraints, and when and which events (e.g., API calls, database updates, incoming data streams) to trigger them. In response, cloud providers independently provision infrastructure resources, deploying a set of containers on-demand to serve workloads’ requests. These containers are quickly taken down once the workloads complete and users are charged only for the time a container is executed. Serverless workloads are therefore short-lived with strict compute-time and memory limits (up to 15 minutes and 3 GB respectively, for Amazon Lambda [23]).

a) Serverless frameworks: Figure 2 depicts a typical serverless compute framework, which consists of the following entities. The workload manager compiles users’ lambdas to executable binaries, which, along with other data dependencies, are stored in shared storage (e.g., Amazon S3 [24], Google Cloud [25], or Microsoft Azure Storage [26]). Workers are dynamically provisioned servers that run the compiled binaries, typically as containers (e.g., Docker) managed by orchestration engines (e.g., Kubernetes [27]). The gateway proxies users’ requests (or events) to the appropriate lambdas, and upon completion, the results from the lambdas are written to the storage or forwarded to other services for further execution.

Fig. 2: An overview of a general serverless compute framework that executes users’ workloads and serves requests to such workloads ($R_i$ represents a request to workload $W_j$).

Running lambdas within containers running atop an OS provides memory, compute, and file-system isolation using OS-based mechanisms (e.g., cgroups [28] and namespaces [29]). However, this approach incurs additional processing and networking overheads. Therefore, as an alternative, Isolate functions [7] run workloads directly on the bare-metal server itself (Figure 1), while providing all the benefits of a typical serverless framework (e.g., resource isolation between workloads). Although still in early development, these Isolate functions are showing promising results: up to 3× improvement in request latency while consuming 90% less memory than containers with faster startup times.

B. Programmable SmartNICs

In addition to handling basic networking tasks, SmartNICs can offload more general tasks that a CPU normally handles (e.g., checksum, TCP offload, and more). Based on their architecture and processing capabilities, these SmartNICs come in three different types: FPGA-, ASIC-, and SoC-based.

FPGA-based NICs [30] are dominated by the on-chip interconnect overhead [31] that significantly limits the lookup tables (LUTs) and memory (e.g., SRAM) resources available for executing lambda functions; today’s large FPGAs can barely support a small number of processing cores (< 10 or so). SoC-based NICs (e.g., Mellanox Bluefield [32] and Broadcom Stingray [33]) are easier to program as they run a Linux-like OS on embedded cores (like ARM); however, similar to server CPUs, they are susceptible to high tail latency due to context-switch and network-stack overheads. Therefore, it is questionable whether these SoC-based NICs can support speeds of 100 Gbps and above with low latency [10].

ASIC-based NICs consist of an ASIC that can sustain traffic rates of 100 Gbps+; contain hundreds of non-cache-coherent multi-threaded RISC cores (e.g., NPU, ARM, or RISC-V), operating at GHz speeds, along with specialized hardware functions (e.g., lookup, load balancing, queuing, and more); and are capable of running embarrassingly parallel workloads with low latency. Furthermore, recent advances in the design of these SmartNICs (e.g., Netronome Agilio [15] and Marvell LiquidIO [34]) make it easier for users to customize the NIC’s data-plane logic (e.g., parse, match, and action) using, for example, P4 [35] and Micro-C [36] programs, thus exposing dataflow and C-like abstractions a typical programmer is familiar with, without the need for an OS.

In §VI, we demonstrate how the unique characteristics of serverless workloads (i.e., short-lived with strict compute and memory limits) make ASIC-based SmartNICs a viable execution platform to accelerate lambda functions.

III. Motivation & Challenges

a) Low latency and high throughput lambdas: The key tenet of serverless compute is that it establishes a clear demarcation between users and infrastructure providers; users only specify lambdas that the providers efficiently execute on their infrastructure. Yet, all modern serverless frameworks are based on technologies (i.e., VMs and containers) that were designed to give users explicit control over the underlying infrastructure from the get-go. This control—in the form of compute and network (physical and overlay) virtualization—adds a significant overhead to lambdas. For interactive lambdas, with strict tail latency SLOs, eliminating such computational and networking overheads is becoming crucial [37].

The modern server architecture (with CPUs and GPUs) further adds to these overheads. CPUs are Von Neumann machines designed to efficiently execute a long sequence of instructions. However, they perform poorly when executing a large number of small lambdas where significant time is wasted context switching between them. Similarly, GPUs are Single-Instruction-Multiple-Data (SIMD) machines that serve as look-aside accelerators [38] in a typical server, controlled by the primary CPU. Although, in recent years, these GPUs have shown orders of magnitude improvements in accelerating machine-learning workloads [39, 40] (which by nature are long-running, batch jobs), they perform poorly for low-latency, interactive tasks. Even with technologies (such as GPUDirect RDMA [41] and RDMA-over-Converged-Ethernet (RoCE) [42]) that can bypass a CPU and push data directly into the GPU or main memory, the requests for lambdas still have to traverse a NIC—adding non-negligible delays in the order of sub-microseconds. λ-NIC eliminates both these virtualization and architectural overheads by running lambdas directly on the vast array of NIC-resident RISC cores.

b) A domain-specific processor for lambdas: Until now, cloud providers have almost always relied on newer, faster CPUs to improve applications’ performance. More recently, they have started looking into other fine-grain, domain-specific processors, like look-aside or bump-in-the-wire accelerators (e.g., GPUs or FPGAs [10]). This is because, with Moore’s Law slowing down [43], CPUs today are no longer a viable solution to meet the ever-rising performance demands of users’ lambdas in a cost-efficient way, as has been demonstrated by both GPUs (for improving machine-learning training and inference throughput [40]) and FPGAs (for accelerating search indexes [17] and host networking [10]). We believe that ASIC-based SmartNICs present the same opportunity for accelerating lambdas with orders of magnitude improvement in performance-per-watt at one-tenth of the hardware cost [44, 45], compared to server CPUs and GPUs [10].

A. Key Challenges

The embarrassingly-parallel and independent nature of lambdas takes away much of the complexity that arises when synchronizing state between lambdas [14], making them ideal candidates for SmartNICs with hundreds of cores. Still, executing them on these NICs is not a panacea and comes with its own unique challenges:

a) Programming SmartNICs: Due to their non-cache-coherent design, programming ASIC-based SmartNICs has always been considered hard [46]; non-coherency requires developers to program each NPU core separately, forcing them to manually handle synchronization between individual lambdas. With lambdas, however, this is no longer an issue as lambdas do not share state and can run independently. Still, the lack of an OS layer in these NICs (though useful in reducing unwanted processing) puts the onus of mapping and placing these lambdas, across various clusters of cores and memory hierarchy, on the developers, requiring them to have low-level knowledge of the NIC architecture, firmware, and the specialized languages it supports. To take this burden away from developers, we need a high-level abstraction and a framework that can automatically and efficiently compile, optimize, and deploy lambdas across a collection of these SmartNICs.

b) Offloading lambdas: NPUs are optimized for packet forwarding, and they typically do not support features (e.g., floating-point operations, dynamic-memory allocation, recursion, and reliable transport) that a general-purpose CPU supports [47], [48]. A serverless framework, therefore, must be able to compile workloads that rely on these features by, for example, transforming floating-point operations to fixed-point arithmetic [49], dynamic-memory allocations to explicit memory calls, and recursions to iterations, as well as employing other forms of reliable (or weakly-consistent) delivery protocols (e.g., RoCEv2 [42], Lightweight Transport Layer [20], or R2P2 [19]). To achieve high throughput and low latency, emerging workloads are also lowering their dependency on these features. For example, deep-learning training and inference have been shown to perform well with lower, fixed bit-width integers [49], [50]. Moreover, serverless request-response (RPC) pairs are mostly independent and mutually-exclusive, and do not need TCP’s strict, reliable, and in-order streaming delivery of messages [19].
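
As a rough illustration of two of these transformations, the following C sketch replaces floating-point scaling with fixed-point arithmetic and rewrites a recursive sum as a loop. The Q16.16 format, the type name `fix16_t`, and all helper names are assumptions chosen for illustration; they are not taken from λ-NIC's compiler.

```c
#include <stdint.h>

/* Q16.16 fixed-point: 16 integer bits, 16 fractional bits (assumed format). */
typedef int32_t fix16_t;
#define FIX16_ONE ((fix16_t)1 << 16)

/* Convert a small integer to Q16.16. */
fix16_t fix16_from_int(int32_t x) { return (fix16_t)(x * FIX16_ONE); }

/* Multiply two Q16.16 values; the 64-bit intermediate avoids overflow. */
fix16_t fix16_mul(fix16_t a, fix16_t b) {
    return (fix16_t)(((int64_t)a * (int64_t)b) >> 16);
}

/* Integer part of a non-negative Q16.16 value. */
int32_t fix16_to_int(fix16_t x) { return x >> 16; }

/* Recursive form, which an NPU toolchain may reject:
 *   uint32_t sum(const uint8_t *p, uint32_t n) {
 *       return n == 0 ? 0 : p[0] + sum(p + 1, n - 1);
 *   }
 * Equivalent iterative form with a fixed stack footprint: */
uint32_t sum_iter(const uint8_t *p, uint32_t n) {
    uint32_t total = 0;
    uint32_t i;
    for (i = 0; i < n; i++) {
        total += p[i];
    }
    return total;
}
```

For example, multiplying `fix16_from_int(3)` by `FIX16_ONE / 2` (that is, 0.5) uses only integer multiply and shift instructions, which is the property that matters on cores without a floating-point unit.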

c) Ensuring security under multi-tenancy: Lambdas run alongside other workloads (e.g., microservices) in a data center and share infrastructure resources. When using SmartNICs: (1) A serverless framework should ensure that lambdas running on the NICs do not interfere with each other or degrade the network performance between the NIC and host CPUs that are running traditional workloads. (2) The framework should reserve ample SmartNIC resources (i.e., cores and memory) for basic NIC operations (e.g., TCP/IP offload and checksums) while maximally consuming remaining resources for lambdas. (3) Lambdas should execute in their own isolated sandboxes and the framework should restrict them from accessing each other’s working sets. (4) Lastly, the framework should be robust against security attacks (e.g., DDoS) both from outside actors and malicious tenants.

IV. λ-NIC Overview

λ-NIC adds a new backend to existing serverless frameworks with its own programming abstraction, called Match+Lambda (§IV-A), and the accompanying machine model (§IV-B) that makes it easier to program and deploy lambdas directly on a SmartNIC.

A. Match+Lambda Abstraction

λ-NIC implements a Match+Lambda programming abstraction that extends the traditional Match+Action Table (MAT) abstraction [51] with more complicated actions (lambdas).

a) Programming lambdas: In λ-NIC, users provide one or more lambdas written in a restricted C-like language, called Micro-C [36].\(^1\) Listing 1 shows the signature of the top-level function that each lambda must begin with; it takes two arguments: headers and match_data. The number and structure of all the supported headers (i.e., the EXTRACTED_HEADERS_T data structure) and function parameters (i.e., MATCH_DATA_T), in λ-NIC, are defined a priori. The lambdas operate directly on these parameters and headers without having to parse packets, which is done at the parse stage (Figure 3). Furthermore, these functions can have both local objects as well as global objects that persist state across runs.

```c
int function_name(EXTRACTED_HEADERS_T *headers, MATCH_DATA_T *match_data) {
    // local/global memory and objects.
    return return_value;
}
```

Listing 1: Signature of the top-level function in Micro-C for the Match+Lambda abstraction.

Listing 2 shows a real-world example of a lambda running as a web server. The function reads the server address from the headers variable (the hdr_get_serverHdr call). It then copies the requested web content from memory into the header location pointed to by the server address (the memcpy call), before returning.

```c
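#define MEM_PER_LAMBDA 20

// Global object: persists state across runs of the lambda.
uint8_t memory[MEM_PER_LAMBDA * 3];

int web_server(EXTRACTED_HEADERS_T *headers, MATCH_DATA_T *match_data) {
    // Read the server header from the already-parsed headers.
    // SERVER_HDR_T is an assumed name for this header's type.
    SERVER_HDR_T *serverHdr = hdr_get_serverHdr(headers);
    // Copy the requested web content to the location given by the server address.
    memcpy(serverHdr->address, memory, MEM_PER_LAMBDA);
    return RETURN_FORWARD;
}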
```

Listing 2: An example of a web-server lambda.

\(^1\)We use Micro-C as it is the native language of the SmartNICs we have for the evaluation. The Micro-C language can support a large class of serverless functions (§II-A); however, λ-NIC is not limited to this language and can work with more feature-rich languages supported by other SmartNICs.

b) Expressing match: The user further specifies the corresponding P4 code\(^2\) for the match stage (Listing 3). During compilation, the workload manager assigns a unique identifier (ID) to each of these lambdas, shares this mapping with the gateway, and populates the ID variables (e.g., `WEB_SERVER_ID`, `OTHER_LAMBDA_ID`) in the P4 code. For each incoming request, the gateway inserts the ID of the destined lambda as a new header. The match stage of λ-NIC (as defined in the P4 code) checks the ID listed in the new header and calls the matching lambda (implemented as an extern in P4 [18]) or sends the packet to the host OS in cases where no matching ID is found.

```p4
control ingress {
  if (valid(lambda_hdr)) {
    if (lambda_hdr.wId == WEB_SERVER_ID) {
      apply(web_server);
      apply(return_web_server_results);
    } else if (lambda_hdr.wId == OTHER_LAMBDA_ID) {
      apply(other_lambda);
      apply(return_other_lambda_results);
    } else {
      apply(send_pkt_to_host);
    }
  }
}
```

Listing 3: Snippet of a P4 code for the match stage.

In the end, the workload manager pairs the lambdas (Micro-C code) and match stage (P4 code) into a single Match+Lambda program, and prepends it with a generic P4 packet-parsing logic. It then compiles and transforms this program into a format that the target SmartNIC can execute (\(\S\)), while ensuring fair allocation of resources and isolation between lambda workloads.

B. Abstract Machine Model

In λ-NIC, users write their Match+Lambda workloads against an abstract machine model (Figure 3). In this model: (1) lambdas are independent programs that do not share state and are isolated from each other; only a matching rule can invoke these functions. (2) The match stage serves as a scheduler (analogous to the OS networking stack) that forwards packets to the matching lambda or the host OS. Finally, (3) a parser handles packet operations (like header identification), and lambdas operate directly on the parsed headers.

These properties of the Match+Lambda machine model make it easier for software developers to express serverless functions by separating out the parsing and matching logic, as well as for hardware designers to efficiently support the model on their target SmartNICs (§V demonstrates such an implementation using Netronome SmartNICs). Thus, the abstract machine model enables unique optimizations that let lambdas run in parallel without any interference from each other.
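
To make the three roles in the machine model concrete, the following host-side C sketch simulates a parser, a match stage, and two lambdas. It is only an illustration under stated assumptions: the wire format (a marker byte followed by a 16-bit ID), the `headers_t` layout, the ID values, and every function name are hypothetical, and sending packets without a lambda header to the host is an assumed policy.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parsed-header layout; the real structure is target-defined. */
typedef struct {
    int      has_lambda_hdr; /* set by the parser if the lambda header is present */
    uint16_t lambda_id;      /* ID inserted by the gateway */
    uint8_t  payload[32];
} headers_t;

enum { VERDICT_FORWARD = 0, VERDICT_TO_HOST = 1 };
enum { WEB_SERVER_ID = 1, ECHO_ID = 2 }; /* assumed IDs */

/* Two toy lambdas; each sees only the parsed headers it is handed. */
int web_server(headers_t *hdr) { hdr->payload[0] = 'W'; return VERDICT_FORWARD; }
int echo(headers_t *hdr)       { (void)hdr;             return VERDICT_FORWARD; }

/* Parse stage: assumed wire format is [0xA5][id hi][id lo][payload...]. */
void parse(const uint8_t *pkt, size_t len, headers_t *hdr) {
    size_t i;
    hdr->has_lambda_hdr = (len >= 3 && pkt[0] == 0xA5);
    hdr->lambda_id = hdr->has_lambda_hdr ? (uint16_t)((pkt[1] << 8) | pkt[2]) : 0;
    for (i = 0; i < sizeof hdr->payload; i++) {
        hdr->payload[i] = (i + 3 < len) ? pkt[i + 3] : 0;
    }
}

/* Match stage: select the lambda by ID, or fall back to the host OS. */
int match_and_run(headers_t *hdr) {
    if (!hdr->has_lambda_hdr) return VERDICT_TO_HOST;
    if (hdr->lambda_id == WEB_SERVER_ID) return web_server(hdr);
    if (hdr->lambda_id == ECHO_ID)       return echo(hdr);
    return VERDICT_TO_HOST;
}
```

Because the lambdas never touch raw packet bytes and share no state, the match stage is the only place where a request is bound to a function, which is what lets each lambda run independently.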

1) **Design Characteristics:** The abstract machine model has the following three design characteristics:

   a) **D1: Run-to-completion execution:** λ-NIC executes each lambda to completion. The machine model exposes a dense array of discrete, non-coherent processing threads (Figure 3), having their own instruction and data store in the memory. The lambdas execute in the context of a given thread, maximally utilizing the resources of that thread only (e.g., CPU and memory). Given the short service times and strict memory footprints of serverless functions, modern SmartNICs hold ample resources per thread to execute these lambdas [36].

   Having a large number of parallel threads further mitigates issues related to head-of-line blocking, where lambdas wait behind other lambdas to finish or context switch, as in the case of server CPUs. These issues severely affect the performance of lambdas at the tail and require more complicated scheduling policies (like preemption or core allocation [37]). With λ-NIC, however, this is not the case, as lambdas, even at the tail, can run to completion without degradation in performance (\(\S\)).

   Moreover, the highly-parallel nature and run-to-completion characteristic of the machine model ensure strong **performance isolation** between different lambdas running as separate threads, and λ-NIC implements weighted-fair-queueing (WFQ) to route requests between these threads. We leave it as future work to explore more sophisticated resource-allocation mechanisms (e.g., DRF [52]) to further improve the performance of λ-NIC.

   b) **D2: Flat memory access:** The abstract machine model lets users write lambdas assuming a flat (virtual) memory address space. All objects (local and global) on the thread stack are allocated from within that address space. This has the advantage that users do not have to worry about the complex memories, and their structure and hierarchies, present in modern SmartNICs [36]. Each of these memories comes with its own performance benefits and is necessary to reach high speeds in these NICs; however, this makes it the responsibility of the programmers to efficiently utilize these memories. λ-NIC’s machine model takes this burden away by exposing a single, uniform memory to the user.

   When deploying to a particular SmartNIC, the compiler (or the workload manager) can take into account the NIC’s specifications and perform target-specific optimizations to effectively utilize its memory resources (\(\S\)). The users can also provide pragmas (specifying which objects are read more frequently) to guide the compiler in allocating objects to memories based on their access needs; it can place small or hot objects in core-local memories, and large or less frequently used ones in external, shared memories.

   Furthermore, having a virtual memory space per lambda can let the compiler enforce policies for **data isolation**, since virtual spaces do not interfere and are isolated from each other. The compiler can insert static and dynamic assertions to ensure that a lambda does not access the physical memory of other lambdas on the target SmartNIC.

\(^2\)We use P4 as it is the most widely used data-plane language [35].
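
One way to picture such a dynamic assertion is a bounds check on every access whose address cannot be proven safe at compile time. The C sketch below assumes fixed-size, contiguous per-lambda regions; the region size, the number of lambdas, and the names `translate` and `checked_write` are hypothetical and do not describe λ-NIC's actual compiler output.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REGION_SIZE 64u /* bytes of physical memory per lambda (assumed) */
#define NUM_LAMBDAS 4u  /* number of co-resident lambdas (assumed) */

/* One shared physical memory, carved into fixed per-lambda regions. */
static uint8_t physical_mem[REGION_SIZE * NUM_LAMBDAS];

/* Translate a lambda-relative (virtual) range to a physical pointer.
 * Returns NULL if the range would leave the lambda's own region. */
uint8_t *translate(uint32_t lambda, uint32_t vaddr, uint32_t len) {
    if (lambda >= NUM_LAMBDAS) return NULL;
    if (vaddr > REGION_SIZE || len > REGION_SIZE - vaddr) return NULL;
    return &physical_mem[lambda * REGION_SIZE + vaddr];
}

/* Checked write: the kind of guard a compiler could insert before a store
 * whose target address it cannot prove in-bounds statically. */
int checked_write(uint32_t lambda, uint32_t vaddr, const void *src, uint32_t len) {
    uint8_t *dst = translate(lambda, vaddr, len);
    if (dst == NULL) return -1;
    memcpy(dst, src, len);
    return 0;
}
```

The length comparison is written as `len > REGION_SIZE - vaddr` rather than `vaddr + len > REGION_SIZE` so that a very large `len` cannot wrap around and slip past the check.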

c) D3: Network transport: In λ-NIC, the primary mode of communication between the gateway, lambdas, and external services (e.g., storage) is via Remote Procedure Calls (RPCs). These RPCs are small, typically single-packet, request-response messages [19]. The parser, in the abstract machine model (Figure 3), decomposes these messages into headers and forwards them to the match stage, which further directs them to a matching lambda (by looking up the lambda ID associated with each message). Multi-packet RPCs, depending upon their size, can either be processed directly by the parse and match stage or pushed into the memory over RDMA (i.e., RoCEv2 [42]). (In the latter case, an event RPC triggers the lambda to start reading data from the desired memory location.)

Already, modern datacenter applications (like Amazon DynamoDB [53] and deep learning [49], [50]) are choosing to do away with the strict, reliable, and in-order guarantees provided by TCP, which are far stronger and more computationally intensive than what applications need. Instead, these applications are being designed to work with weaker guarantees to achieve low tail latency [19]. λ-NIC exploits these facts and assumes a weakly-consistent delivery semantic for RPCs that are processed by the parse and match stage. A sender (the gateway or external services) tracks the outgoing RPCs to lambdas, and is responsible for resending a message in case of timeouts or packet drops. λ-NIC, on the other hand, performs packet reordering at the SmartNIC for multi-packet RPCs.\(^3\)
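
The receiver-side reordering described above can be sketched in C as placing each fragment at the offset given by its sequence number, so arrival order and sender retransmissions do not matter. The fragment count, fragment size, and the names `rpc_buf_t`, `rpc_init`, and `rpc_add` are assumptions for illustration only, not λ-NIC's wire format or code.

```c
#include <stdint.h>
#include <string.h>

#define MAX_FRAGS 4u /* fragments per RPC (assumed) */
#define FRAG_SIZE 8u /* payload bytes per fragment (assumed) */

typedef struct {
    uint8_t  data[MAX_FRAGS * FRAG_SIZE]; /* reassembled message */
    uint32_t received;                    /* bitmask of fragment indices seen */
    uint32_t total;                       /* number of fragments expected */
} rpc_buf_t;

void rpc_init(rpc_buf_t *r, uint32_t total) {
    memset(r, 0, sizeof *r);
    r->total = total;
}

/* Store one fragment by sequence number. Returns 1 once every fragment has
 * arrived, 0 while fragments are still pending, and -1 for a sequence
 * number outside the expected range. */
int rpc_add(rpc_buf_t *r, uint32_t seq, const uint8_t *payload) {
    if (seq >= r->total || r->total > MAX_FRAGS) return -1;
    memcpy(&r->data[seq * FRAG_SIZE], payload, FRAG_SIZE);
    r->received |= (1u << seq); /* a duplicate simply sets the same bit again */
    return r->received == ((1u << r->total) - 1u);
}
```

Since a duplicate fragment overwrites identical bytes and sets an already-set bit, a sender that retransmits after a timeout does not need any extra coordination with the receiver in this sketch.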

V. IMPLEMENTATION

In this section, we present an implementation of λ-NIC on P4-enabled Netronome SmartNICs (Figure 4). These NICs contain hundreds of RISC cores grouped together into islands. Each core has its own instruction store and local memory and is capable of executing multiple threads concurrently; each island also has a Cluster Target Memory (CTM) [36]. There are also on-chip internal memories (IMEMs) and an external memory (EMEM) shared between all islands and their cores, and a dedicated scheduler unit that directs incoming packets to cores. The architecture of Netronome SmartNICs therefore has the necessary elements to efficiently implement λ-NIC’s Match+Lambda abstract machine model: cores can execute lambdas to completion (D1), data can reside in different memories (e.g., local, CTM, or more) based on their usage patterns (D2), and the scheduler can direct RPCs to lambdas (D3).

A more programmable scheduler (e.g., RMT/PIFO-based [54]) can execute the parse and match stage of the machine model directly, with cores only processing the lambda logic. However, the scheduler inside the Netronome NICs we used for our evaluation is not programmable; it is [...] workload manager can compose these stages with the logic already running on the core, removing the unused headers and duplicate match fields from the final code. Furthermore, the P4 tables are converted into if-else sequences, which the NIC core can execute more efficiently. Transforming tables into if-else sequences also helps reduce the total number of instructions in the final binary running on a core.

\(^3\)We measured that Netronome SmartNICs can reorder four 100 B packets using 120 unrolled instructions, which is only 1.3% of the instructions used by our benchmark lambdas (§VI-D0c).

\(^4\)The other approach is to pipeline these stages and run them on separate cores; we intend to look into this as future work.

\(^5\)At present, the binary running on SmartNICs must be swapped with a new one each time, resulting in downtimes. However, this constraint is expected to disappear in the next-generation NICs (§VII).
c) Memory stratification: Based on the access patterns, the workload manager can choose the most efficient memory for an object at compile time. It can also look at the object size or hints from the user (as pragmas) to decide whether to put the object in a local memory, CTM, IMEM or EMEM.
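
The placement decision described above can be illustrated with a minimal sketch. It considers only object size and an optional user hint (access patterns are left out), and the tier capacities are hypothetical placeholders rather than Netronome specifications; `MemObject` and `place` are names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Memory tiers ordered from closest to the core to farthest.
# Capacities (bytes) are assumed values for illustration only.
TIERS = [
    ("local", 4 * 1024),
    ("CTM", 256 * 1024),
    ("IMEM", 4 * 1024 * 1024),
    ("EMEM", 2 * 1024 * 1024 * 1024),
]


@dataclass
class MemObject:
    name: str
    size: int                   # object size in bytes
    hint: Optional[str] = None  # user pragma naming a tier, if any


def place(obj: MemObject) -> str:
    """Return the tier for obj: a valid hint wins, else the closest tier that fits."""
    capacities = dict(TIERS)
    if obj.hint is not None:
        if obj.hint not in capacities:
            raise ValueError(f"unknown memory tier: {obj.hint}")
        if obj.size > capacities[obj.hint]:
            raise ValueError(f"{obj.name} does not fit in {obj.hint}")
        return obj.hint
    for tier, capacity in TIERS:
        if obj.size <= capacity:
            return tier
    raise ValueError(f"{obj.name} does not fit in any memory tier")
```

Under these assumed capacities, a 64-byte counter lands in local memory, while a 1 MiB image buffer falls through to IMEM.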

VI. EVALUATION

We now compare the performance of λ-NIC with bare-metal and container backends both in isolation, when executing a single lambda (§VI-C1), and in a shared setting, when running multiple lambdas together (§VI-C2). We also evaluate the impact of λ-NIC on resource utilization and startup times, as well as the effectiveness of the λ-NIC compiler in optimizing lambda program size when running on the NIC cores (§VI-D).

A. Test Methodology

1) Baseline framework: To evaluate λ-NIC against existing serverless backends, we select OpenFaaS [56] as our baseline serverless compute framework, which closely resembles the architecture depicted in Figure 2. We choose OpenFaaS for its simplicity, ease of deployment, extensive feature set, and greater adoption by the community. It is written in Golang and includes: (1) a Web UI, (2) an autoscaler to scale lambdas as demands change, (3) a Prometheus-based [58] monitoring engine to analyze system state, and (4) a gateway with a NAT to proxy users’ requests to the appropriate lambdas. Each of these components and lambdas runs as a Docker [5] container managed by Kubernetes [27].

a) Adding a bare-metal backend: For evaluating emerging runtimes like Isolate [7], we add support for a bare-metal backend to OpenFaaS. It is implemented as a Python service that runs on a bare-metal server as a standalone process, launching lambdas as new threads to serve users’ requests. The service relies on a Raft-based distributed key-value store, called etcd [59], to sync lambda-related states (e.g., number of active lambdas, their placement and load balancing policies) with the gateway to correctly proxy requests. Our goal, using the bare-metal backends, is to analyze how performance of lambdas improves in the absence of the container stack.
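
The thread-per-request pattern of the bare-metal backend can be sketched as follows. This is a toy stand-in, not the actual service: the class name `BareMetalBackend` and its methods are hypothetical, and the etcd synchronization with the gateway is omitted.

```python
import threading
from typing import Callable, Dict, List

Handler = Callable[[bytes], bytes]


class BareMetalBackend:
    """Toy worker that runs every request in a freshly launched thread."""

    def __init__(self) -> None:
        self._lambdas: Dict[str, Handler] = {}

    def register(self, name: str, handler: Handler) -> None:
        """Deploy a lambda under the given name."""
        self._lambdas[name] = handler

    def serve(self, name: str, payload: bytes) -> bytes:
        """Launch the named lambda in a new thread and wait for its reply."""
        handler = self._lambdas[name]  # KeyError if the lambda is not deployed
        result: List[bytes] = []
        worker = threading.Thread(target=lambda: result.append(handler(payload)))
        worker.start()
        worker.join()
        if not result:
            raise RuntimeError(f"lambda {name!r} failed")
        return result[0]
```

For example, after `backend.register("echo", lambda p: p.upper())`, a call to `backend.serve("echo", b"hi")` returns the uppercased payload.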

b) Introducing λ-NIC extensions: λ-NIC is built as an extension to OpenFaaS, inheriting all of OpenFaaS’s core features with additional support for running lambdas on the SmartNICs. With the extension, OpenFaaS can simultaneously deploy lambdas to containers, bare-metal, and SmartNIC backends. We also augment etcd to share state and manage λ-NIC deployments across multiple worker nodes.

2) Testbed setup: Our evaluation testbed consists of a cluster of five servers (Figure 5), each housing two Intel Xeon Gold 5117 processors with 14 physical cores, running at 2.0 GHz with 32 GiB DDR4 2666 MT/s Dual-Ranked RAM and 120 GiB SATA SSD. One of the servers (M1) acts as a master node running Kubernetes services, the gateway, the workload manager, a memcached server [60], the web interface, and the monitoring engine. M1 comes with a Broadcom 57412 2x10 Gb and 2x1 Gb Quad-Port NIC, which is used for management traffic. The other servers (M2–5) are worker nodes equipped with a Netronome Agilio CX 2x10 Gb SmartNIC [15] having 56 RISC cores (8 threads and 16 K instructions per core) running at 633 MHz with 2 GiB of on-board RAM. All servers connect to an Arista DCS-7124S switch over a 10 Gbps link. The backends communicate over an overlay network using the Calico [61] networking plugin for high-performance switching and policy management across nodes inside a Kubernetes cluster.

B. Benchmark Workloads

We evaluate λ-NIC on three distinct interactive lambdas (i.e., web server, key-value client, and image transformer), each reflecting a popular use case [62], [63], [64].

a) Web server: A common usage pattern for lambdas is to serve web contents [63], such as text or HTML pages, similar to traditional web servers (like nginx [65]). These workloads are typically self-contained and do not need information from external sources (e.g., data stores) to service a request. For the evaluation, we implement a lambda that returns text responses based on the incoming requests.

b) Key-value client: Next, we consider workloads with external dependencies, needing information from remote services. These workloads query users’ data from external storage, e.g., databases or key-value stores (such as memcached [60]), customize the retrieved data, and finally send the processed data back to the user. Moreover, these workloads generate extensive intra-data center requests and typically have strict tail-latency requirements to meet the SLOs. For the evaluation, we implement key-value client lambdas that generate write (SET) and read (GET) requests to a memcached server.
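
To make the SET and GET requests concrete, the sketch below builds and parses messages in memcached's classic text protocol. Which wire protocol the benchmark lambdas use is not stated here, so the choice of the text protocol is an assumption, and the helper names are hypothetical.

```python
from typing import Optional


def build_set(key: str, value: bytes, flags: int = 0, exptime: int = 0) -> bytes:
    """Encode a text-protocol SET: 'set <key> <flags> <exptime> <bytes>' plus data."""
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode("ascii")
    return header + value + b"\r\n"


def build_get(key: str) -> bytes:
    """Encode a text-protocol GET for a single key."""
    return f"get {key}\r\n".encode("ascii")


def parse_get_response(data: bytes) -> Optional[bytes]:
    """Return the value from a single-key GET reply, or None on a miss."""
    if data == b"END\r\n":
        return None
    header, rest = data.split(b"\r\n", 1)
    parts = header.split()
    if len(parts) < 4 or parts[0] != b"VALUE":
        raise ValueError("malformed GET response")
    length = int(parts[3])  # byte count announced in the VALUE line
    return rest[:length]
```

A client lambda in this style would send `build_set(...)` or `build_get(...)` to the memcached server and pass the reply through `parse_get_response` before customizing the data.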

c) Image transformer: Finally, we evaluate workloads performing real-time, interactive processing of large data (i.e., image processing or stream processing) [66], where the data span multiple packets and must be stored in memory. These workloads customize the requested data, and either immediately return a response or store results in memory for further processing [67]. For the evaluation, we implement a lambda that transforms an RGBA image to grayscale.
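
A grayscale transform of this kind can be sketched in a few lines. The weights below are the common BT.601 luma coefficients in integer form; the coefficients used by the actual lambda are not given here, so they are an assumption, as is the function name.

```python
def rgba_to_grayscale(pixels: bytes) -> bytes:
    """Convert packed RGBA bytes to one gray byte per pixel (alpha is ignored)."""
    if len(pixels) % 4 != 0:
        raise ValueError("RGBA data must be a multiple of 4 bytes")
    out = bytearray(len(pixels) // 4)
    for i in range(0, len(pixels), 4):
        r, g, b = pixels[i], pixels[i + 1], pixels[i + 2]
        # Integer form of 0.299*R + 0.587*G + 0.114*B, avoiding floating point.
        out[i // 4] = (299 * r + 587 * g + 114 * b) // 1000
    return bytes(out)
```

Integer arithmetic is used deliberately, since the text later notes that floating-point support is among the features that more general-purpose SmartNICs would add.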

C. System Performance

We now discuss the performance of λ-NIC, in terms of latency and throughput, compared to the bare-metal and container backends. We evaluate two cases: (1) when a single lambda runs on a backend in isolation, and (2) when multiple lambdas concurrently run, contending for shared resources (i.e., cores and memory).

1) Performance in Isolation: We first look at the latency and throughput of a lambda in isolation.

   a) Latency: We measure the latency of each backend, which is the time it takes for a gateway to send a request to a node running a single thread of the pre-loaded (or warm) lambda and receive a response back (Figure 6). For web server and key-value client lambdas, λ-NIC outperforms containers by 880x and bare-metal by 30x in average latency (completing requests in under 100 ns), while still achieving 5x to 3x improvements for the data-intensive image-transformer lambda. Improvements are more visible at the tail (i.e., 99th percentile), where λ-NIC achieves 5x to 24x better tail latency than bare-metal for the three benchmark lambdas. For the key-value client lambdas, λ-NIC even improves upon reported latencies in a highly-optimized cloud-scale data center [60] by three orders of magnitude. Furthermore, the container and bare-metal backends exhibit longer tail latency, specifically for short-lived web server and key-value lambdas. This is likely an artifact of miscellaneous software overheads (e.g., context switching, cache management, and network stack).
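
For readers reproducing such measurements, average and tail latency can be derived from raw per-request samples as sketched below. The nearest-rank percentile definition is an assumption; the text does not state which estimator was used.

```python
from typing import Sequence


def mean(samples: Sequence[float]) -> float:
    """Arithmetic mean of the latency samples."""
    if not samples:
        raise ValueError("no samples")
    return sum(samples) / len(samples)


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not samples or not 1 <= p <= 100:
        raise ValueError("need samples and 1 <= p <= 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) without floats
    return ordered[rank - 1]
```

Tail latency in the sense used here is then `percentile(samples, 99)`.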

   b) Throughput: We see similar improvements in throughput, measured as requests serviced per second, for the three lambdas when running on λ-NIC. We carry out two separate experiments: (1) closed-loop testing, with a sender generating requests one after the other, and (2) parallel testing with 56 requests (the maximum number of threads that can run simultaneously on our tested server CPU) to stress the backends under concurrent load. λ-NIC outperforms both container and bare-metal backends (Figure 7), showing 27x to 736x improvements over the two backends for the web server and key-value client lambdas, and 5x to 15x improvements for the image-transformer lambda.
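
The two experiment styles can be sketched with a small harness. This is an illustration of the methodology, not the tooling used for the evaluation: `send` is a hypothetical callable that issues one request and blocks until the response arrives, and the function names are invented here.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def closed_loop_throughput(send: Callable[[], None], n_requests: int) -> float:
    """Issue requests strictly one after the other; return requests per second."""
    start = time.perf_counter()
    for _ in range(n_requests):
        send()
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return n_requests / elapsed


def parallel_throughput(send: Callable[[], None], n_requests: int,
                        concurrency: int = 56) -> float:
    """Keep up to `concurrency` requests in flight; return requests per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(send) for _ in range(n_requests)]
        for future in futures:
            future.result()  # re-raise any request failure
    elapsed = max(time.perf_counter() - start, 1e-9)
    return n_requests / elapsed
```

The default concurrency of 56 mirrors the number of parallel requests used in the second experiment.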

2) Performance Under Resource Contention: In a real setting, a serverless backend will typically run multiple lambdas at the same time. These lambdas will contend for their fair share of resources (i.e., CPU and memory), leading to added delays due to, for example, context switching of lambdas and movement of data to and from CPU and memory.

   a) Effects of context switching: In the previous experiment, we measured the performance of each backend when running a single lambda in isolation. Now, we evaluate a setup with three distinct web-server lambdas running on a single backend at once. We generate requests for each of these workloads in a round-robin fashion, causing the processor to context switch between lambdas when servicing each incoming request. With multiple lambdas running concurrently, the bare-metal backend suffers a larger increase in tail latency than λ-NIC: +26.7x (2.01 ms to 53.6 ms) vs. +6.49x (89 µs to 578 µs); the latency of containers was even worse. These experiments demonstrate that, compared with container and bare-metal backends (with server CPUs), λ-NIC is less susceptible to context switching and performs better under resource contention by virtue of a vast array of on-chip NPU cores and the lack of operating system and container software.
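
The contention factors follow directly from the reported tail latencies, as the short check below recomputes. Only the four latency values come from the text; the helper name is hypothetical.

```python
def slowdown(isolated_s: float, contended_s: float) -> float:
    """Factor by which tail latency grows when lambdas contend for a backend."""
    return contended_s / isolated_s


# Reported tail latencies in seconds (isolated, contended).
bare_metal = slowdown(2.01e-3, 53.6e-3)
lambda_nic = slowdown(89e-6, 578e-6)

print(f"bare-metal: {bare_metal:.1f}x, lambda-NIC: {lambda_nic:.2f}x")
```

Note that the contended λ-NIC tail latency (578 µs) is still lower than the isolated bare-metal tail latency (2.01 ms).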

D. Other Metrics

   a) Resource utilization: We also compare the memory and CPU usage at the host and the SmartNIC when running a single data-intensive image-transformer instance in isolation. Table I shows the additional resources utilized by each backend when servicing 56 concurrent requests. Containers have the largest memory footprint and consume an order of magnitude more host memory and CPU cycles than the bare-metal backend. On the other hand, as expected, λ-NIC’s impact on the host memory and CPU is negligible, and it consumes roughly the same amount of NIC memory for the image-transformer workload as the bare-metal backend.
TABLE I: Additional resources utilized by each serverless backend for the image-transformer workload.

<table>
<thead>
<tr>
<th>Workload Size (MiB)</th>
<th>Host CPU (Avg. %)</th>
<th>Host Memory (MiB)</th>
<th>NIC Memory (MiB)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>λ-NIC</td>
<td>Bare Metal</td>
<td>Container</td>
</tr>
<tr>
<td></td>
<td>+0.1</td>
<td>+9.2</td>
<td>+13.7</td>
</tr>
</tbody>
</table>

TABLE II: Factors affecting startup times.

<table>
<thead>
<tr>
<th></th>
<th>λ-NIC</th>
<th>Bare Metal</th>
<th>Container</th>
</tr>
</thead>
<tbody>
<tr>
<td>Workload Size (MiB)</td>
<td>11.0</td>
<td>17.0</td>
<td>153.0</td>
</tr>
<tr>
<td>Startup Time (s)</td>
<td>19.8</td>
<td>5.0</td>
<td>31.7</td>
</tr>
</tbody>
</table>

VII. DISCUSSION

a) Choice of hardware: λ-NIC is not limited to NPU-based SmartNICs. In fact, λ-NIC’s abstract machine model can run on other SmartNICs (with varying benefits) that have more general-purpose processors: either FPGA- or SoC-based SmartNICs, or some form of ASIC with ARM cores. These alternatives can further extend the processing capabilities of λ-NIC, providing support for more features (such as floating point operations, deeper instruction store, and dynamic memory allocation) to run more complicated workloads.

b) Hot swapping workloads: For each new lambda, λ-NIC needs to recompile and swap the firmware with the one currently running on the SmartNIC. Present versions of Netronome SmartNICs do not support hot swapping or hitless updates [36], resulting in downtime each time a new firmware is loaded. Hitless updates are not technically challenging, as devices like FPGAs [30] and programmable switch fabrics [51] already support them (using partial reconfiguration and versioning techniques). We believe this limitation will go away in future versions of SmartNICs as well, allowing λ-NIC to load new lambdas without causing downtime.

c) Optimizer effectiveness: We now report the results of our compiler optimizations. The number of instructions in the naïve implementation (consisting of two key-value clients, a web server, and an image-transformer lambda) is gradually reduced by applying the following target-specific optimizations (§V-A). First, we perform lambda coalescing for the two distinct key-value clients. We coalesce these lambdas because they contain equivalent logic to generate a new packet to query memcached, which we can combine and reuse. We further coalesce the web server and image-transformer lambdas, which share a response pattern that does not query external services, by combining their reply logic. Next, we apply match reduction. The naïve implementation adds a separate table for managing routes for each lambda. We combine these tables into one, and use individual parameter values (defined as P4 metadata) for route management. Finally, we do memory stratification to place variables into appropriate memories based on their sizes. For example, the image variable within the image-transformer lambda is mapped to IMEM, whereas the web server results are mapped to CTM inside the island. These optimizations bring the total number of instructions of the final binary down to 8,050 (a reduction of 9.56% from the naïve implementation), thereby improving latency by 6.3 µs (on average) or letting additional lambdas fit within the program-size constraints of the Netronome SmartNIC.
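
One way to picture the table-merging step of match reduction is shown below. This is only an interpretation of the description above: it models each per-lambda route table as a dictionary and uses a numeric lambda ID as the per-lambda parameter. The data layout and the function name are hypothetical and do not reflect the compiler's internals.

```python
from typing import Dict, Tuple

RouteTable = Dict[str, int]  # destination -> output port (illustrative layout)


def merge_route_tables(
    per_lambda: Dict[str, RouteTable],
) -> Tuple[Dict[str, int], Dict[Tuple[int, str], int]]:
    """Fold one route table per lambda into a single table keyed by (lambda ID, destination)."""
    lambda_ids: Dict[str, int] = {}
    merged: Dict[Tuple[int, str], int] = {}
    for lambda_id, name in enumerate(sorted(per_lambda)):
        lambda_ids[name] = lambda_id
        for destination, port in per_lambda[name].items():
            merged[(lambda_id, destination)] = port
    return lambda_ids, merged
```

With a single merged table, the lookup logic is emitted once and only the lambda ID differs between lambdas, which is the kind of duplication the optimization aims to remove.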

d) Security and reliable transport: The serverless framework (i.e., gateway, workload manager, and worker nodes) typically runs within a trusted domain of a provider or a tenant, such that any malicious attempt to trigger the lambdas will be blocked by the gateway. In addition, each lambda operates within its own memory region on the NIC and is restricted from accessing other lambdas’ data. λ-NIC enforces this policy using compile-time assertions; in the future, we plan to explore enforcing assertions at runtime for dynamically-allocated memory (once it becomes available in the upcoming SmartNICs). For transport, λ-NIC relies on RDMA for multi-packet messages. However, with recent focus on terminating the entire transport layer on the NIC [71], λ-NIC’s serverless functions can instead operate on complete messages rather than individual packets.
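
The kind of check that such compile-time assertions perform can be sketched as follows: statically known regions must not overlap, and every access must stay inside its lambda's region. The `Region` type and both functions are hypothetical illustrations, not λ-NIC's compiler code.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    owner: str  # lambda that owns the region
    base: int   # start address
    size: int   # length in bytes


def assert_disjoint(regions: List[Region]) -> None:
    """Fail if any two statically allocated regions overlap."""
    ordered = sorted(regions, key=lambda r: r.base)
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.base + prev.size > cur.base:
            raise AssertionError(f"{prev.owner} overlaps {cur.owner}")


def assert_in_region(region: Region, offset: int, length: int) -> None:
    """Fail if an access of `length` bytes at `offset` leaves the region."""
    if offset < 0 or length < 0 or offset + length > region.size:
        raise AssertionError(f"out-of-bounds access in region of {region.owner}")
```

Because all sizes and offsets are known at compile time in this model, both checks can run before the firmware is ever loaded.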

VIII. RELATED WORK

a) NIC offloading technologies: Offloads for network features, such as L3/L4 checksum computation, large send and receive offload (LSO, LRO), and Receive Side Scaling (RSS) have been around for decades. More recent work, however, is looking into offloading more complex, application-related functions to modern, programmable SmartNICs [10]. For example, Microsoft is using FPGA-based SmartNICs to offload hypervisor switching tasks [10], which were previously handled by CPU-based software switches (like Open vSwitch (OVS) [21]). HyperLoop [72] provides methods for accelerating replicated storage operations using RDMA on NICs. λ-NIC can assist these works by providing a framework to easily deploy general compute on a cluster of nodes hosting SmartNICs. Both Floem [12] and iPipe [14] provide a framework to enable easier development of NIC-assisted applications. However, these frameworks can offload only a portion of these applications to a NIC as a bump-in-the-wire, and need CPUs to do the remaining processing. In contrast, λ-NIC runs complete workloads on the NIC, mitigating the effects of any CPU-related overheads.

b) In-network computing: Orthogonal to NIC offloading technologies, there is a recent focus on moving various application tasks inside the network. P4 [35] and RMT [51] have provided the initial building blocks: a data-plane language and an architecture for programmable network devices, which enabled developers to run various applications in-network. For example, SilkRoad [73] and HULA [74] present methods for offloading load balancers, NetCache [69] implements a key-value store, and NetPaxos [75] runs the Paxos consensus algorithm inside switches. Tokusashi et al. [76] further demonstrate that in-network computing not only improves performance but is also more power efficient. λ-NIC, alongside these networking devices, can provide a more programmable environment for accelerated application processing. In fact, SmartNICs have more memory and a less-restricted programming model, which can help alleviate the limitations present in these switches.

c) Improving serverless compute: Serverless compute is a relatively new idea and many of its details are not yet disclosed by the cloud providers. Thus, most of the recent work focuses on reverse engineering existing frameworks to study their internals and to educate the public. For example, OpenLambda [77] provides an open-source serverless compute framework that closely resembles the ones deployed by the cloud providers. Glockson et al. [78] proposed another framework with support for edge deployments. λ-NIC complements these efforts by presenting a high-performance, open-source serverless compute framework for testing and developing lambdas on SmartNICs.

IX. CONCLUSION

Server CPUs are not the right architecture for serverless compute. The particular characteristics of the serverless workloads (i.e., fine-grain functions with short service times, and memory), as well as the slowdown of Moore’s Law and Dennard scaling, demand a radically different architecture for accelerating serverless functions (or lambdas): an architecture that can execute many of these lambdas in parallel with minimal contention and context-switching overhead. In this paper, we present λ-NIC, a serverless framework along with an abstraction, Match+Lambda, and a machine model for executing thousands of lambdas on an ASIC-based SmartNIC. These SmartNICs host a vast array of RISC cores with their own instruction store and local memory, and can execute embarrassingly parallel workloads with very low latency. λ-NIC’s Match+Lambda abstraction hides the architectural complexities of these SmartNICs and efficiently compiles, optimizes, and deploys serverless functions on these NICs, improving average latency by 880x and throughput by 736x. With new and emerging developments in both SmartNICs and serverless frameworks, we believe offloading lambdas to SmartNICs will become a common practice, inspiring even more complex real-world workloads to run on λ-NIC.

ACKNOWLEDGMENT

We thank the anonymous ICDCS reviewers for their valuable feedback that helped improve the quality of this paper. This research is supported by The Stanford Platform Lab.

REFERENCES


