VPP 105: Memory Management & DPDK APIs


Welcome to the 5th part of our VPP Guide series! Today, we will be asking practical questions about the various technologies and libraries used in VPP – their usage, advantages, and management. Let’s jump right into it and ask ourselves:

Why does DPDK use Hugepages?

Hugepages

Hugepages are one of the techniques used in virtual memory management. In a standard environment, the kernel allocates virtual memory for each process in blocks called "pages"; on Linux, the default page size is 4 kB. When a process wants to access its memory, the CPU has to find where this virtual memory maps to – this is the task of the Memory Management Unit (MMU) and its page table lookup. Using the page table structure, the CPU can translate virtual addresses to physical addresses.

For example, when a process needs 1 GB of memory, this translates to more than 260,000 page-table entries (1 GB / 4 kB = 262,144) that the CPU may have to look up. Naturally, this slows things down. Fortunately, modern CPUs support bigger pages – so-called Hugepages (typically 2 MB or 1 GB). The same 1 GB needs only 512 entries with 2 MB hugepages, so far fewer lookups are required and performance increases.

The Memory Management Unit uses an additional hardware cache – the Translation Lookaside Buffer (TLB). When an address is translated from virtual to physical memory, the translation is calculated by the MMU and the mapping is stored in the TLB. The next access to the same page is then served first from the TLB (which is fast) and only on a miss by the MMU page walk.

As the TLB is a hardware cache, it has a limited number of entries, so a large number of pages leads to TLB misses and slows the application down. Combining the TLB with Hugepages reduces both the number of entries needed and the time it takes to translate a virtual page address into a physical one, which again increases performance.

This is the reason why DPDK, and VPP as well, uses Hugepages for the large memory pool allocations used for packet buffers. With Hugepage-backed allocations, fewer pages and fewer lookups are needed and memory management is more effective, so performance increases.
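
For illustration, here is a minimal, generic Linux sketch (not VPP or DPDK code) of requesting hugepage-backed memory with mmap() and MAP_HUGETLB; it assumes hugepages have already been reserved on the system (e.g. via vm.nr_hugepages):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

#define BUF_SIZE (2UL * 1024 * 1024)	/* one 2 MB hugepage */

int
main (void)
{
  /* Ask the kernel for memory backed by a hugepage instead of 4 kB pages. */
  void *buf = mmap (NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (buf == MAP_FAILED)
    {
      perror ("mmap with MAP_HUGETLB");
      return 1;
    }

  /* The whole 2 MB region is covered by a single TLB entry. */
  memset (buf, 0, BUF_SIZE);
  munmap (buf, BUF_SIZE);
  return 0;
}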

Cache prefetching

Cache prefetching is another technique used by VPP to boost execution performance. Prefetching data from its original location in slower memory into a faster local memory before it is actually needed significantly increases performance. CPUs have fast, local cache memory in which prefetched data is held until it is required. Examples of CPU caches with a specific function are the D-cache (data cache), the I-cache (instruction cache) and the TLB (translation lookaside buffer) used by the MMU. Separate D- and I-caches make it possible to fetch instructions and data in parallel; moreover, instructions and data have different access patterns.
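
As a generic illustration of the idea (VPP wraps this in its own CLIB_PREFETCH / vlib_prefetch_buffer_header macros), a sketch using the GCC/Clang __builtin_prefetch intrinsic might look like this:

#include <stddef.h>

/* While element i is being processed, element i + 4 is requested into the
 * cache, so the memory latency overlaps with useful work. */
long
sum_with_prefetch (const int *data, size_t n)
{
  long sum = 0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      if (i + 4 < n)
	__builtin_prefetch (&data[i + 4], 0 /* read */, 3 /* high locality */);
      sum += data[i];
    }
  return sum;
}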

Cache prefetching is used mainly in nodes when processing packets. In VPP, each node has a registered function responsible for incoming traffic handling. An example of registration (abf and flowprobe nodes):

/*
 * Copyright (c) 2017 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

VLIB_REGISTER_NODE (abf_ip4_node) = {
  .function = abf_input_ip4,
  .name = "abf-input-ip4",
...
};

VLIB_REGISTER_NODE (flowprobe_ip4_node) = {
  .function = flowprobe_ip4_node_fn,
  .name = "flowprobe-ip4",
...
};

In the abf processing function, we can see single-loop handling – it loops over the packets and handles them one by one.

/*
 * Copyright (c) 2017 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

abf_input_inline (vlib_main_t * vm,
      vlib_node_runtime_t * node,
      vlib_frame_t * frame, fib_protocol_t fproto)
{
...
      while (n_left_from > 0 && n_left_to_next > 0)
  {
...	
    abf_next_t next0 = ABF_NEXT_DROP;
    vlib_buffer_t *b0;
    u32 bi0, sw_if_index0;
...
    bi0 = from[0];
    to_next[0] = bi0;
    from += 1;
    to_next += 1;
    n_left_from -= 1;
    n_left_to_next -= 1;

    b0 = vlib_get_buffer (vm, bi0);
    sw_if_index0 = vnet_buffer (b0)->sw_if_index[VLIB_RX];

    ASSERT (vec_len (abf_per_itf[fproto]) > sw_if_index0);
    attachments0 = abf_per_itf[fproto][sw_if_index0];
...
    /* verify speculative enqueue, maybe switch current next frame */
    vlib_validate_buffer_enqueue_x1 (vm, node, next_index,
             to_next, n_left_to_next, bi0,
             next0);
  }
      vlib_put_next_frame (vm, node, next_index, n_left_to_next);
    }

In the flowprobe node, we can see a dual/single loop using prefetching, which can significantly increase performance. The first loop

( while (n_left_from >= 4 ... ) )

processes buffers b0 and b1 (and, moreover, prefetches the next two buffers), and in the second loop

( while (n_left_from > 0 ... ) )

the remaining packets are processed.

/*
 * Copyright (c) 2018 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

flowprobe_node_fn (vlib_main_t * vm,
       vlib_node_runtime_t * node, vlib_frame_t * frame,
       flowprobe_variant_t which)
{
...
      /*
      * While we have at least 4 vector elements (pkts) to process..
      */

      while (n_left_from >= 4 && n_left_to_next >= 2)
  {
...
    /* Prefetch next iteration. */
    {
      vlib_buffer_t *p2, *p3;

      p2 = vlib_get_buffer (vm, from[2]);
      p3 = vlib_get_buffer (vm, from[3]);

      vlib_prefetch_buffer_header (p2, LOAD);
      vlib_prefetch_buffer_header (p3, LOAD);

      CLIB_PREFETCH (p2->data, CLIB_CACHE_LINE_BYTES, STORE);
      CLIB_PREFETCH (p3->data, CLIB_CACHE_LINE_BYTES, STORE);
    }
...
    /* speculatively enqueue b0 and b1 to the current next frame */
    b0 = vlib_get_buffer (vm, bi0);
    b1 = vlib_get_buffer (vm, bi1);


    /* verify speculative enqueues, maybe switch current next frame */
    vlib_validate_buffer_enqueue_x2 (vm, node, next_index,
             to_next, n_left_to_next,
             bi0, bi1, next0, next1);
  }
      /*
      * Clean up 0...3 remaining packets at the end of the frame
      */
      while (n_left_from > 0 && n_left_to_next > 0)
  {
    u32 bi0;
    vlib_buffer_t *b0;
    u32 next0 = FLOWPROBE_NEXT_DROP;
    u16 len0;

    /* speculatively enqueue b0 to the current next frame */
    bi0 = from[0];
    to_next[0] = bi0;
    from += 1;
    to_next += 1;
    n_left_from -= 1;
    n_left_to_next -= 1;

    b0 = vlib_get_buffer (vm, bi0);

    vnet_feature_next (&next0, b0);

    len0 = vlib_buffer_length_in_chain (vm, b0);
    ethernet_header_t *eh0 = vlib_buffer_get_current (b0);
    u16 ethertype0 = clib_net_to_host_u16 (eh0->type);

    if (PREDICT_TRUE ((b0->flags & VNET_BUFFER_F_FLOW_REPORT) == 0))
      {
        flowprobe_trace_t *t = 0;
        if (PREDICT_FALSE ((node->flags & VLIB_NODE_FLAG_TRACE)
         && (b0->flags & VLIB_BUFFER_IS_TRACED)))
    t = vlib_add_trace (vm, node, b0, sizeof (*t));

        add_to_flow_record_state (vm, node, fm, b0, timestamp, len0,
          flowprobe_get_variant
          (which, fm->context[which].flags,
           ethertype0), t);
      }

    /* verify speculative enqueue, maybe switch current next frame */
    vlib_validate_buffer_enqueue_x1 (vm, node, next_index,
             to_next, n_left_to_next,
             bi0, next0);
  }

VPP I/O Request Handling

Why is polling faster than IRQs? How do the hardware/software IRQs work?

I/O device (NIC) event handling is a significant part of VPP. The CPU does not know when an I/O event will occur, yet it has to respond to it. There are two different approaches – IRQs and polling – which differ from each other in many aspects.

From the CPU's point of view, IRQs seem better: the device disturbs the CPU only when it needs servicing, instead of the CPU constantly checking the device status as with polling. But from an efficiency point of view, interrupts are inefficient when the device keeps interrupting the CPU repeatedly, and polling is inefficient when the device is rarely ready for servicing.

In packet processing in VPP, traffic is expected to be continuous. In such a case the number of interrupts would rise rapidly, while the device would be ready for servicing practically all the time. Polling is therefore more efficient for packet processing, and that is why VPP uses polling when processing incoming packets.
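
To make the difference concrete, here is a hedged sketch of a DPDK-style polling receive loop (not VPP's actual input node): the core keeps asking the NIC queue for a burst of packets instead of waiting for an interrupt.

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void
poll_rx_queue (uint16_t port_id, uint16_t queue_id)
{
  struct rte_mbuf *pkts[BURST_SIZE];

  for (;;)
    {
      /* Returns immediately with 0..BURST_SIZE packets; no interrupt involved. */
      uint16_t n_rx = rte_eth_rx_burst (port_id, queue_id, pkts, BURST_SIZE);
      uint16_t i;

      for (i = 0; i < n_rx; i++)
	rte_pktmbuf_free (pkts[i]);	/* a real application would process the packet here */
    }
}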

VPP & DPDK

What API does DPDK offer? How does VPP use this library?

DPDK networking drivers are classified in two categories:

  • physical for real devices
  • virtual for emulated devices

The DPDK ethdev layer exposes APIs for using the networking functions of these devices. For a full list of the supported features and APIs, see the NIC features overview in the DPDK documentation.

In VPP, DPDK support has been moved from the core to a plugin to simplify enabling/disabling and handling of DPDK interfaces. To keep all DPDK-relevant information in one place, the DPDK device implementation (src/plugins/dpdk/device/dpdk.h) has a structure with DPDK data:

/* SPDX-License-Identifier: BSD-3-Clause
 * Copyright(c) 2010-2014 Intel Corporation
 */

typedef struct
{
...
  struct rte_eth_conf port_conf;
  struct rte_eth_txconf tx_conf;
...
  struct rte_flow_error last_flow_error;
...
  struct rte_eth_link link;
...
  struct rte_eth_stats stats;
  struct rte_eth_stats last_stats;
  struct rte_eth_xstat *xstats;
...
} dpdk_device_t;

containing all the DPDK structs that VPP needs to store DPDK-related information.

DPDK APIs are used in the DPDK plugin only. Here is a list of DPDK features and their APIs used in VPP, with a few examples of usage.

Speed Capabilities / Runtime Rx / Tx Queue Setup

Supports getting the speed capabilities that the current device is capable of, and Rx queue setup after the device has started.

API: rte_eth_dev_info_get()

/*
 * Copyright (c) 2017 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

dpdk_device_setup (dpdk_device_t * xd)
{
  dpdk_main_t *dm = &dpdk_main;
...
  struct rte_eth_dev_info dev_info;
...
  if (xd->flags & DPDK_DEVICE_FLAG_ADMIN_UP)
    {
      vnet_hw_interface_set_flags (dm->vnet_main, xd->hw_if_index, 0);
      dpdk_device_stop (xd);
    }

  /* Enable flow director when flows exist */
  if (xd->pmd == VNET_DPDK_PMD_I40E)
    {
      if ((xd->flags & DPDK_DEVICE_FLAG_RX_FLOW_OFFLOAD) != 0)
  xd->port_conf.fdir_conf.mode = RTE_FDIR_MODE_PERFECT;
      else
  xd->port_conf.fdir_conf.mode = RTE_FDIR_MODE_NONE;
    }

  rte_eth_dev_info_get (xd->port_id, &dev_info);

Link Status

Supports getting the link speed, duplex mode and link-state (up/down).

API: rte_eth_link_get_nowait()

/*
 *Copyright (c) 2015 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

dpdk_update_link_state (dpdk_device_t * xd, f64 now)
{
  vnet_main_t *vnm = vnet_get_main ();
  struct rte_eth_link prev_link = xd->link;
...
  /* only update link state for PMD interfaces */
  if ((xd->flags & DPDK_DEVICE_FLAG_PMD) == 0)
    return;

  xd->time_last_link_update = now ? now : xd->time_last_link_update;
  clib_memset (&xd->link, 0, sizeof (xd->link));
  rte_eth_link_get_nowait (xd->port_id, &xd->link);

Lock-Free Tx Queue

If a PMD advertises the DEV_TX_OFFLOAD_MT_LOCKFREE capability, multiple threads can invoke rte_eth_tx_burst() concurrently on the same Tx queue without a software lock.

API: rte_eth_tx_burst()

/*
 * Copyright (c) 2015 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

static clib_error_t *
dpdk_lib_init (dpdk_main_t * dm)
{
...
  dpdk_device_t *xd;
...
      if (xd->pmd == VNET_DPDK_PMD_FAILSAFE)
  {
    /* failsafe device numerables are reported with active device only,
     * need to query the mtu for current device setup to overwrite
     * reported value.
     */
    uint16_t dev_mtu;
    if (!rte_eth_dev_get_mtu (i, &dev_mtu))
      {
        mtu = dev_mtu;
        max_rx_frame = mtu + sizeof (ethernet_header_t);

        if (dpdk_port_crc_strip_enabled (xd))
    {
      max_rx_frame += 4;
    }
      }
  }
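
The excerpt above shows device initialization in the DPDK plugin; the transmit path itself goes through rte_eth_tx_burst(). A minimal, hedged sketch of a burst transmit (not VPP's actual tx function) could look like this:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Send up to n_pkts mbufs on one Tx queue and free whatever the NIC
 * did not accept. In VPP the equivalent logic lives in the DPDK plugin's
 * device output code. */
static uint16_t
send_burst (uint16_t port_id, uint16_t queue_id,
	    struct rte_mbuf **pkts, uint16_t n_pkts)
{
  uint16_t n_sent = rte_eth_tx_burst (port_id, queue_id, pkts, n_pkts);
  uint16_t i;

  for (i = n_sent; i < n_pkts; i++)
    rte_pktmbuf_free (pkts[i]);

  return n_sent;
}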

Promiscuous Mode

Supports enabling/disabling promiscuous mode for a port.

API: rte_eth_promiscuous_enable(), rte_eth_promiscuous_disable(), rte_eth_promiscuous_get()
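
Since no VPP excerpt is shown for this feature, here is a minimal hedged sketch of how these calls are typically used (the API functions are the ones listed above, the helper itself is made up):

#include <rte_ethdev.h>

static int
set_promisc (uint16_t port_id, int enable)
{
  if (enable)
    rte_eth_promiscuous_enable (port_id);
  else
    rte_eth_promiscuous_disable (port_id);

  /* Returns 1 when promiscuous mode is enabled, 0 when disabled. */
  return rte_eth_promiscuous_get (port_id);
}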

Allmulticast Mode

Supports enabling/disabling receiving multicast frames.

API: rte_eth_allmulticast_enable(), rte_eth_allmulticast_disable(), rte_eth_allmulticast_get()

/*
 * Copyright (c) 2017 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

dpdk_device_stop (dpdk_device_t * xd)
{
  if (xd->flags & DPDK_DEVICE_FLAG_PMD_INIT_FAIL)
    return;

  rte_eth_allmulticast_disable (xd->port_id);
  rte_eth_dev_stop (xd->port_id);
...

Unicast MAC Filter

Supports adding MAC addresses to enable white-list filtering to accept packets.

API: rte_eth_dev_default_mac_addr_set(), rte_eth_dev_mac_addr_add(), rte_eth_dev_mac_addr_remove(), rte_eth_macaddr_get()
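
A hedged sketch of using these calls (not VPP code; the rte_ether_addr type name assumes a recent DPDK release, older releases used struct ether_addr):

#include <rte_ethdev.h>
#include <rte_ether.h>

static int
add_extra_mac (uint16_t port_id, struct rte_ether_addr *extra)
{
  struct rte_ether_addr primary;

  /* Read the port's primary MAC address ... */
  rte_eth_macaddr_get (port_id, &primary);

  /* ... and add one more unicast address to the filter list (pool 0). */
  return rte_eth_dev_mac_addr_add (port_id, extra, 0);
}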

VLAN Filter

Supports filtering of a VLAN Tag identifier.

API: rte_eth_dev_vlan_filter()

VLAN Offload

Supports VLAN offload to hardware.

API: rte_eth_dev_set_vlan_offload(), rte_eth_dev_get_vlan_offload()

/*
 * Copyright (c) 2015 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

dpdk_subif_add_del_function (vnet_main_t * vnm,
           u32 hw_if_index,
           struct vnet_sw_interface_t *st, int is_add)
{
...
  dpdk_device_t *xd = vec_elt_at_index (xm->devices, hw->dev_instance);
  int r, vlan_offload;
...
  vlan_offload = rte_eth_dev_get_vlan_offload (xd->port_id);
  vlan_offload |= ETH_VLAN_FILTER_OFFLOAD;

  if ((r = rte_eth_dev_set_vlan_offload (xd->port_id, vlan_offload)))
    {
      xd->num_subifs = prev_subifs;
      err = clib_error_return (0, "rte_eth_dev_set_vlan_offload[%d]: err %d",
             xd->port_id, r);
      goto done;
    }

  if ((r =
       rte_eth_dev_vlan_filter (xd->port_id,
        t->sub.eth.outer_vlan_id, is_add)))
    {
      xd->num_subifs = prev_subifs;
      err = clib_error_return (0, "rte_eth_dev_vlan_filter[%d]: err %d",
             xd->port_id, r);
      goto done;
    }

Basic Stats

Supports basic statistics such as: ipackets, opackets, ibytes, obytes, imissed, ierrors, oerrors, rx_nombuf, and per-queue stats: q_ipackets, q_opackets, q_ibytes, q_obytes, q_errors.

API: rte_eth_stats_get(), rte_eth_stats_reset()

/*
 * Copyright (c) 2015 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

dpdk_update_counters (dpdk_device_t * xd, f64 now)
{
  vlib_simple_counter_main_t *cm;
  vnet_main_t *vnm = vnet_get_main ();
  u32 thread_index = vlib_get_thread_index ();
  u64 rxerrors, last_rxerrors;

  /* only update counters for PMD interfaces */
  if ((xd->flags & DPDK_DEVICE_FLAG_PMD) == 0)
    return;

  xd->time_last_stats_update = now ? now : xd->time_last_stats_update;
  clib_memcpy_fast (&xd->last_stats, &xd->stats, sizeof (xd->last_stats));
  rte_eth_stats_get (xd->port_id, &xd->stats);

Extended Stats

Supports extended statistics; the available counters change from driver to driver.

API: rte_eth_xstats_get(), rte_eth_xstats_reset(), rte_eth_xstats_get_names(), rte_eth_xstats_get_by_id(), rte_eth_xstats_get_names_by_id(), rte_eth_xstats_get_id_by_name()

Module EEPROM Dump

Supports getting information and data from the EEPROM of a pluggable module (e.g. SFP/QSFP).

API: rte_eth_dev_get_module_info(), rte_eth_dev_get_module_eeprom()
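
Again no VPP excerpt is shown, so here is a hedged sketch of the typical call sequence (field names follow the rte_eth_dev_module_info / rte_dev_eeprom_info structures in recent DPDK releases):

#include <stdlib.h>
#include <rte_ethdev.h>

static int
dump_module_eeprom (uint16_t port_id)
{
  struct rte_eth_dev_module_info minfo;
  struct rte_dev_eeprom_info einfo;
  int rv;

  /* First ask the driver how large the module EEPROM is ... */
  rv = rte_eth_dev_get_module_info (port_id, &minfo);
  if (rv != 0)
    return rv;

  /* ... then read its contents into a buffer of that size. */
  einfo.offset = 0;
  einfo.length = minfo.eeprom_len;
  einfo.data = calloc (1, minfo.eeprom_len);
  if (einfo.data == NULL)
    return -1;

  rv = rte_eth_dev_get_module_eeprom (port_id, &einfo);
  free (einfo.data);
  return rv;
}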


VPP Library (vlib)

What functionality does vlib offer?

Vlib is a vector processing library. It also handles various application management functions:

  • buffer, memory, and graph node management and scheduling
  • reliable multicast support
  • ultra-lightweight cooperative multi-tasking threads
  • physical memory, and Linux epoll support
  • maintaining and exporting counters
  • thread management
  • packet tracing.

Vlib also implements the debug CLI.
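
As an aside, a debug CLI command is registered with the VLIB_CLI_COMMAND macro; a minimal sketch with a made-up "show demo" command might look like this:

static clib_error_t *
show_demo_command_fn (vlib_main_t * vm,
		      unformat_input_t * input, vlib_cli_command_t * cmd)
{
  /* Print something to the CLI session. */
  vlib_cli_output (vm, "demo counter: %u", 42);
  return 0;
}

VLIB_CLI_COMMAND (show_demo_command, static) = {
  .path = "show demo",
  .short_help = "show demo",
  .function = show_demo_command_fn,
};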

In VPP (vlib), a vector is an instance of the vlib_frame_t type:

/*
 * Copyright (c) 2015 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

typedef struct vlib_frame_t
{
  /* Frame flags. */
  u16 flags;

  /* Number of scalar bytes in arguments. */
  u8 scalar_size;

  /* Number of bytes per vector argument. */
  u8 vector_size;

  /* Number of vector elements currently in frame. */
  u16 n_vectors;

  /* Scalar and vector arguments to next node. */
  u8 arguments[0];
} vlib_frame_t;

As shown, vectors are dynamically resized arrays with user-defined “headers”. Many data structures in VPP (buffers, hash, heap, pool) are vectors with different headers.

The memory layout looks like this:

© Copyright 2018, Linux Foundation


User header (optional, uword aligned)
                  Alignment padding (if needed)
                  Vector length in elements
User's pointer -> Vector element 0
                  Vector element 1
                  ...
                  Vector element N-1

Vectors are not only used in vppinfra data structures (hash, heap, pool, …) but also in vlib – in nodes, buffers, processes and more.
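
A minimal sketch of the vppinfra vector API (vppinfra/vec.h) shows the basic idea: a vector is addressed by a plain pointer to element 0 and grows as elements are appended.

#include <vppinfra/vec.h>

static void
vec_demo (void)
{
  u32 *v = 0;			/* the null pointer is a valid empty vector */
  u32 i;

  for (i = 0; i < 10; i++)
    vec_add1 (v, i);		/* append one element, growing the vector as needed */

  for (i = 0; i < vec_len (v); i++)
    v[i] *= 2;			/* elements are accessed like an ordinary array */

  vec_free (v);			/* release the memory; v is reset to 0 */
}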

Buffers

Vlib buffers are used to reach high performance in packet processing. To achieve this, buffers are allocated and freed N at a time rather than one at a time and, except when directly processing a specific buffer (its packet data in a given node), the code works with buffer indices instead of buffer pointers (a minimal allocation sketch follows the buffer layout below). Vlib buffers have the structure of a vector:

/*
 * Copyright (c) 2015 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

/** VLIB buffer representation. */
typedef union
{
  struct
  {
    CLIB_CACHE_LINE_ALIGN_MARK (cacheline0);

    /** signed offset in data[], pre_data[] that we are currently
      * processing. If negative current header points into predata area.  */
    i16 current_data;

    /** Nbytes between current data and the end of this buffer.  */
    u16 current_length;
...
    /** Opaque data used by sub-graphs for their own purposes. */
    u32 opaque[10];
...
    /**< More opaque data, see ../vnet/vnet/buffer.h */
    u32 opaque2[14];

    /** start of third cache line */
      CLIB_CACHE_LINE_ALIGN_MARK (cacheline2);

    /** Space for inserting data before buffer start.  Packet rewrite string
      * will be rewritten backwards and may extend back before
      * buffer->data[0].  Must come directly before packet data.  */
    u8 pre_data[VLIB_BUFFER_PRE_DATA_SIZE];

    /** Packet data */
    u8 data[0];
  };
#ifdef CLIB_HAVE_VEC128
  u8x16 as_u8x16[4];
#endif
#ifdef CLIB_HAVE_VEC256
  u8x32 as_u8x32[2];
#endif
#ifdef CLIB_HAVE_VEC512
  u8x64 as_u8x64[1];
#endif
} vlib_buffer_t;

Each vlib_buffer_t (packet buffer) carries the buffer metadata, which describes the current packet-processing state.

  • u8 data[0]: Ordinarily, hardware devices use data as the DMA target but there are exceptions. Do not access data directly, use vlib_buffer_get_current.
  • u32 opaque[10]: primary vnet-layer opaque data
  • u32 opaque2[14]: secondary vnet-layer opaque data
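
As noted above, buffers are allocated and freed in batches and referenced by index. A hedged sketch of that pattern (using vlib_buffer_alloc / vlib_buffer_free from vlib/buffer_funcs.h):

static void
buffer_batch_demo (vlib_main_t * vm)
{
  u32 buffers[256];
  u32 n_alloc, i;

  /* Allocate up to 256 buffer indices at once; the return value is the
   * number actually allocated. */
  n_alloc = vlib_buffer_alloc (vm, buffers, 256);

  for (i = 0; i < n_alloc; i++)
    {
      /* Only translate an index into a pointer when the buffer is actually touched. */
      vlib_buffer_t *b = vlib_get_buffer (vm, buffers[i]);
      b->current_length = 0;	/* ... fill in packet data here ... */
    }

  /* Free the whole batch in one call. */
  vlib_buffer_free (vm, buffers, n_alloc);
}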

There are several functions for getting data from a vector or buffer (see vlib/node_funcs.h and related headers):

Get a pointer to frame vector data

/*
 * Copyright (c) 2015 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

always_inline void *
vlib_frame_vector_args (vlib_frame_t * f)
{
  return (void *) f + vlib_frame_vector_byte_offset (f->scalar_size);
}

Get pointer to scalar data

/*
 * Copyright (c) 2015 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

always_inline void *
vlib_frame_scalar_args (vlib_frame_t * f)
{
  return vlib_frame_vector_args (f) - f->scalar_size;
}

Translate the buffer index into buffer pointer

/*
 * Copyright (c) 2015 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

always_inline vlib_buffer_t *
vlib_get_buffer (vlib_main_t * vm, u32 buffer_index)
{
  vlib_buffer_main_t *bm = vm->buffer_main;
  vlib_buffer_t *b;

  b = vlib_buffer_ptr_from_index (bm->buffer_mem_start, buffer_index, 0);
  vlib_buffer_validate (vm, b);
  return b;
}

Get the pointer to current (packet) data from a buffer to process

/*
 * Copyright (c) 2015 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

always_inline void *
vlib_buffer_get_current (vlib_buffer_t * b)
{
  /* Check bounds. */
  ASSERT ((signed) b->current_data >= (signed) -VLIB_BUFFER_PRE_DATA_SIZE);
  return b->data + b->current_data;
}

Get vnet primary buffer metadata in the reserved opaque field

/*
 * Copyright (c) 2015 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

#define vnet_buffer(b) ((vnet_buffer_opaque_t *) (b)->opaque)

An example of retrieving vnet buffer data:

/*
 * Copyright (c) 2017 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

add_to_flow_record_state (vlib_main_t * vm, vlib_node_runtime_t * node,
        flowprobe_main_t * fm, vlib_buffer_t * b,
        timestamp_nsec_t timestamp, u16 length,
        flowprobe_variant_t which, flowprobe_trace_t * t)
{
...
  u32 rx_sw_if_index = vnet_buffer (b)->sw_if_index[VLIB_RX];

Get vnet secondary buffer metadata in the reserved opaque2 field

/*
 * Copyright (c) 2017 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

#define vnet_buffer2(b) ((vnet_buffer_opaque2_t *) (b)->opaque2)

Let’s take a look at the flowprobe node processing function. Vlib functions always start with the vlib_ prefix.

/*
 * Copyright (c) 2017 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

flowprobe_node_fn (vlib_main_t * vm,
       vlib_node_runtime_t * node, vlib_frame_t * frame,
       flowprobe_variant_t which)
{
  u32 n_left_from, *from, *to_next;
  flowprobe_next_t next_index;
  flowprobe_main_t *fm = &flowprobe_main;
  timestamp_nsec_t timestamp;

  unix_time_now_nsec_fraction (&timestamp.sec, &timestamp.nsec);
  
// access frame vector data
  from = vlib_frame_vector_args (frame);
  n_left_from = frame->n_vectors;
  next_index = node->cached_next_index;

  while (n_left_from > 0)
    {
      u32 n_left_to_next;

      // get pointer to next vector data
      vlib_get_next_frame (vm, node, next_index, to_next, n_left_to_next);

// dual loop – we are processing two buffers and prefetching next two buffers
      while (n_left_from >= 4 && n_left_to_next >= 2)
  {
    u32 next0 = FLOWPROBE_NEXT_DROP;
    u32 next1 = FLOWPROBE_NEXT_DROP;
    u16 len0, len1;
    u32 bi0, bi1;
    vlib_buffer_t *b0, *b1;

    /* Prefetch next iteration. */
             // prefetching buffers p2 and p3 while b0 and b1 are being processed
    {
      vlib_buffer_t *p2, *p3;

      p2 = vlib_get_buffer (vm, from[2]);
      p3 = vlib_get_buffer (vm, from[3]);

      vlib_prefetch_buffer_header (p2, LOAD);
      vlib_prefetch_buffer_header (p3, LOAD);

      CLIB_PREFETCH (p2->data, CLIB_CACHE_LINE_BYTES, STORE);
      CLIB_PREFETCH (p3->data, CLIB_CACHE_LINE_BYTES, STORE);
    }
/* speculatively enqueue b0 and b1 to the current next frame */
// frame contains buffer indices (bi0, bi1) instead of pointers
    to_next[0] = bi0 = from[0];
    to_next[1] = bi1 = from[1];
    from += 2;
    to_next += 2;
    n_left_from -= 2;
    n_left_to_next -= 2;

// translate buffer index to buffer pointer
    b0 = vlib_get_buffer (vm, bi0);
    b1 = vlib_get_buffer (vm, bi1);
// select next node based on feature arc
    vnet_feature_next (&next0, b0);
    vnet_feature_next (&next1, b1);

    len0 = vlib_buffer_length_in_chain (vm, b0);
// get current data (header) from the packet to process
// here we are at L2, so we get the ethernet header; if we were
// at L3, for example, we could retrieve the L3 header instead, i.e.
// ip4_header_t *ip0 = (ip4_header_t *) vlib_buffer_get_current (b0);
    ethernet_header_t *eh0 = vlib_buffer_get_current (b0);
    u16 ethertype0 = clib_net_to_host_u16 (eh0->type);

    if (PREDICT_TRUE ((b0->flags & VNET_BUFFER_F_FLOW_REPORT) == 0))
      add_to_flow_record_state (vm, node, fm, b0, timestamp, len0,
              flowprobe_get_variant
              (which, fm->context[which].flags,
               ethertype0), 0);
...
/* verify speculative enqueue, maybe switch current next frame */
    vlib_validate_buffer_enqueue_x1 (vm, node, next_index,
             to_next, n_left_to_next,
             bi0, next0);
  }

      vlib_put_next_frame (vm, node, next_index, n_left_to_next);
    }
  return frame->n_vectors;
}

Nodes

As we said, vlib is also designed for graph node management. When creating a new feature, one has to initialize it using the VLIB_INIT_FUNCTION macro and register its graph node with a vlib_node_registration_t, most often constructed via the VLIB_REGISTER_NODE macro. At runtime, the framework processes the set of such registrations into a directed graph.

/*
 * Copyright (c) 2016 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

static clib_error_t *
flowprobe_init (vlib_main_t * vm)
{
  /* ... initialize things ... */
 
  return 0;
}

VLIB_INIT_FUNCTION (flowprobe_init);

...

VLIB_REGISTER_NODE (flowprobe_l2_node) = {
  .function = flowprobe_l2_node_fn,
  .name = "flowprobe-l2",
  .vector_size = sizeof (u32),
  .format_trace = format_flowprobe_trace,
  .type = VLIB_NODE_TYPE_INTERNAL,
  .n_errors = ARRAY_LEN(flowprobe_error_strings),
  .error_strings = flowprobe_error_strings,
  .n_next_nodes = FLOWPROBE_N_NEXT,
  .next_nodes = FLOWPROBE_NEXT_NODES,
};

VLIB_REGISTER_NODE (flowprobe_walker_node) = {
  .function = flowprobe_walker_process,
  .name = "flowprobe-walker",
  .type = VLIB_NODE_TYPE_INPUT,
  .state = VLIB_NODE_STATE_INTERRUPT,
};

The type member in the node registration specifies the purpose of the node:

  • VLIB_NODE_TYPE_PRE_INPUT – run before all other node types
  • VLIB_NODE_TYPE_INPUT – run as often as possible, after pre_input nodes
  • VLIB_NODE_TYPE_INTERNAL – only when explicitly made runnable by adding pending frames for processing
  • VLIB_NODE_TYPE_PROCESS – only when explicitly made runnable; a minimal sketch of such a node follows below.
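
For completeness, here is a minimal hedged sketch of a VLIB_NODE_TYPE_PROCESS node (the names are made up): a cooperative process that suspends itself and wakes up once a second.

static uword
demo_process_fn (vlib_main_t * vm, vlib_node_runtime_t * rt, vlib_frame_t * f)
{
  while (1)
    {
      /* Yield back to the scheduler and resume roughly one second later. */
      vlib_process_suspend (vm, 1.0);
      /* periodic work would go here */
    }
  return 0;
}

VLIB_REGISTER_NODE (demo_process_node) = {
  .function = demo_process_fn,
  .name = "demo-process",
  .type = VLIB_NODE_TYPE_PROCESS,
};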

The initialization of a feature is executed at some point during the application’s startup. However, constraints can be used to specify an order (when one feature has to be initialized after/before another one). To hook a feature into a specific feature arc, the VNET_FEATURE_INIT macro can be used.

/*
 * Copyright (c) 2016 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

VNET_FEATURE_INIT (ip4_nat44_ed_hairpin_src, static) = {
  .arc_name = "ip4-output",
  .node_name = "nat44-ed-hairpin-src",
  .runs_after = VNET_FEATURES ("acl-plugin-out-ip4-fa"),
};

VNET_FEATURE_INIT (ip4_nat_hairpinning, static) =
{
  .arc_name = "ip4-local",
  .node_name = "nat44-hairpinning",
  .runs_before = VNET_FEATURES("ip4-local-end-of-arc"),
};

Since VLIB_NODE_TYPE_INPUT nodes are the starting point of the graph, they are responsible for generating packets from some source, like a NIC or PCAP file, and injecting them into the rest of the graph.

When registering a node, one can provide a .next_nodes parameter with an indexed list of the downstream nodes in the graph. For example, the flowprobe node below:

/*
 * Copyright (c) 2017 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

...
.next_nodes = FLOWPROBE_NEXT_NODES,
...

#define FLOWPROBE_NEXT_NODES {				\
    [FLOWPROBE_NEXT_DROP] = "error-drop",		\
    [FLOWPROBE_NEXT_IP4_LOOKUP] = "ip4-lookup",		\
}

vnet_feature_next is commonly used to select the next node. This selection is based on the feature mechanism, as in the flowprobe example above:

/*
 * Copyright (c) 2017 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

flowprobe_node_fn (vlib_main_t * vm,
       vlib_node_runtime_t * node, vlib_frame_t * frame,
       flowprobe_variant_t which)
{
...
    b0 = vlib_get_buffer (vm, bi0);
    b1 = vlib_get_buffer (vm, bi1);
        // select next node based on feature arc
    vnet_feature_next (&next0, b0);
    vnet_feature_next (&next1, b1);

The graph node dispatcher pushes the work-vector through the directed graph, subdividing it as needed until the original work-vector has been completely processed.

Graph node dispatch functions call vlib_get_next_frame to set (u32 *)to_next to the right place in the vlib_frame_t, corresponding to the ith arc (known as next0) from the current node, to the indicated next node.

Before a dispatch function returns, it’s required to call vlib_put_next_frame for all of the graph arcs it actually used. This action adds a vlib_pending_frame_t to the graph dispatcher’s pending frame vector.

/*
 * Copyright (c) 2015 Cisco and/or its affiliates. Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
 */

      vlib_put_next_frame (vm, node, next_index, n_left_to_next);
    }
  return frame->n_vectors;
}

Pavel Kotúček

Thank you for reading through the 5th part of our PANTHEON.tech VPP Guide! As always, feel free to contact us if you are interested in customized solutions!


You can contact us at https://pantheon.tech/

Explore our Pantheon GitHub.

Watch our YouTube Channel.