Certification Zone - Switch Platform Architecture: A Model

You can look at a switch abstractly as a relay. Relays are devices with at least two interfaces, which accept data on one interface and send it out another. A range-extending repeater, operating at the physical layer, is the simplest type of relay, with only one input and one output.

Ethernet hubs are still relays, although they copy the data onto an internal shared medium and fan the contents of that medium to all other ports. You really can't get a good sense of relays until layer 2, when the platform software has to make a decision regarding which egress interface to use.

While Cisco likes to talk about frames vs. packets vs. segments vs. messages, doing so is not correct OSI terminology. OSI formalism sometimes is very pedantic, but some of its terminology can be very precise and unambiguous.

OSI documents speak not of specifically named units at every layer (e.g., frame at layer 2), but of Protocol Data Units (PDUs). At a specific layer, you speak of Transport PDUs or Data Link PDUs. Another useful concept, especially when dealing with protocol encapsulation, is relative layer naming: seen from layer N, the layer above is called (N+1) and the layer below is (N-1). The network layer, for example, receives (N+1)PDUs from Transport and sends out (N-1)PDUs to Data Link.

A relay, which is a term from the formal (yes, that's the way it's spelled), is a device (or software function) with at least two interfaces. It receives PDUs on one interface and de-encapsulates them until it has the information on which it will make forwarding decisions. Ignoring devices such as multilayer switches, devices such as bridges and LAN and WAN switches accept physical layer bits, build them into Data Link PDUs, and make forwarding decisions on information at Data Link.

Routers receive bits, form frames, and extract Network PDUs from the Data Link PDUs. After examining Network Layer information, they internally forward Network PDUs to an outgoing interface, and then encapsulate these into Data Link PDUs and then Physical Layer information.
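The relay behavior of a router can be sketched in a few lines of Python. This is only an illustrative model: the Packet and Frame classes, their field names, and the dictionary used as a forwarding table are assumptions made for the example, not any real platform's data structures.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    """Network PDU (hypothetical model)."""
    dst_ip: str
    payload: bytes

@dataclass
class Frame:
    """Data Link PDU that encapsulates a Network PDU."""
    dst_mac: str
    packet: Packet

def route(frame, forwarding_table, next_hop_mac):
    """De-encapsulate, decide at the network layer, re-encapsulate."""
    packet = frame.packet                         # extract the Network PDU
    egress = forwarding_table[packet.dst_ip]      # forwarding decision
    new_frame = Frame(next_hop_mac[egress], packet)  # build a new Data Link PDU
    return egress, new_frame

table = {"10.1.1.1": "Ethernet1"}
macs = {"Ethernet1": "00:00:0c:aa:bb:cc"}
incoming = Frame("00:00:0c:11:22:33", Packet("10.1.1.1", b"hi"))
egress, outgoing = route(incoming, table, macs)
```

Note that the Network PDU passes through unchanged, while the Data Link PDU is discarded on ingress and rebuilt on egress. A bridge or LAN switch would stop one layer lower and make its decision on the frame itself.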

To make any of these forwarding decisions, the relay must first have an association between destination (and possibly other) information in the PDU at which it makes decisions, and information about the appropriate outgoing interface. The process of learning these associations is path determination. In bridges and LAN switches, path determination involves the spanning tree protocol, VLAN protocols, and source routing. In routers, path determination involves static and dynamic routing, as well as the up/down state of hardware interfaces.

Practical Issues: What Are Ports?

Ports, in general, are the physical connectors to which you can connect clients, servers, or switches to a switch. There are virtual ports, but they are beyond the scope of this discussion.

Through manual configuration, autoconfiguration, and hardware mechanisms, a port can take on many roles.

Don't confuse the physical port types in Table 12 with the spanning tree port types in Table 35. A port can have both a physical type and a spanning tree type.

Management

In a relay, the management function is concerned with building the forwarding "map", whether that is a spanning tree at OSI Layer 2, a routing table at Layer 3, or content switching tables at higher layers. Other functions include exception processing such as ICMP, running routing and spanning tree protocols, etc.

Management obviously includes the automated management functions (e.g., TFTP, logging) and the human interface.

Hardware

Management functions usually are implemented in general-purpose processors. As performance requirements grew more stringent, the processor often was a Reduced Instruction Set Computer (RISC) design rather than a Complex Instruction Set Computer (CISC) design.

Under some conditions, forwarding uses the same processor as is used for management.

Software

Management is primarily a software function. Clearly, this is where the human interface lives, whether it is textual or Web-oriented, and whichever of the different switch operating systems provides it.

Control

Control software runs management functions, including the human interface, as well as topology learning with spanning tree and dynamic routing protocols.

Forwarding Tables and Populating Them

On router platforms, forwarding tables began with the routing table, the one you see with show ip route. This table, more formally called the Routing Information Base (RIB), is optimized for adding and deleting routes. That optimization benefits control, but not forwarding efficiency.

In contrast, the tables used in the high-speed forwarding path are optimized for fast lookup, and are populated from data in the RIB. While the generic computer science term for this fast-lookup data structure is the Forwarding Information Base (FIB), Cisco uses the terms cache and FIB a bit differently. The first cache example was the fast switching cache, a data structure in main RAM that has fewer entries than the RIB. A fast lookup algorithm, such as hashing, is used.

First-generation fast lookup tables had to be rebuilt whenever an entry was added or deleted. Partial updating was not practical. You could, as a result, see drops in performance whenever there was a "cache fault", or an attempt to look up a destination not present in the cache. For fast switching and its equivalent hardware assisted variants, autonomous switching (AGS+ and early 7000) and silicon switching (7000 with RSP), cache faults could significantly affect performance. These distributed caches were quite small, either 512 or 1024 entries. This small number of entries worked acceptably in an enterprise, which typically has a moderate number of frequently used routes, but was a severe performance limitation in ISP routers.
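To make the idea of a cache fault concrete, here is a minimal sketch of a small, demand-filled cache in front of a RIB. The class names, the 512-entry default, and the evict-the-oldest policy are assumptions for illustration only; as noted above, real first-generation caches were rebuilt rather than updated entry by entry.

```python
import ipaddress
from collections import OrderedDict

class Rib:
    """Routing table model: easy to update, slow longest-prefix lookup."""
    def __init__(self):
        self.routes = {}                      # network -> egress interface

    def add(self, prefix, egress):
        self.routes[ipaddress.ip_network(prefix)] = egress

    def lookup(self, dst):
        addr = ipaddress.ip_address(dst)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]

class FastCache:
    """Small per-destination cache, filled on demand from the RIB."""
    def __init__(self, rib, max_entries=512):
        self.rib = rib
        self.max_entries = max_entries
        self.entries = OrderedDict()
        self.faults = 0

    def forward(self, dst):
        if dst in self.entries:
            return self.entries[dst]          # fast path: one hash lookup
        self.faults += 1                      # cache fault: take the slow path
        egress = self.rib.lookup(dst)
        if len(self.entries) >= self.max_entries:
            self.entries.popitem(last=False)  # simplification: evict the oldest
        self.entries[dst] = egress
        return egress
```

With a handful of busy destinations, the fault counter stops growing after the first packet to each. With more active destinations than cache entries, as in an ISP router, nearly every lookup can fall through to the slow path.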

Distributed switching on VIPs was a major performance advance, because the VIP FIB has a one-to-one correspondence with the RIB. With this correspondence, there never will be a cache fault.

Forwarding

At a general level, let's consider the forwarding modes, also called switching paths, in Cisco platforms.

Ingress Buffering and Processing

As long as the fabric is non-blocking, there is no need for input buffering. It is possible that buffers will be required when doing traffic shaping at the ingress.

At the most basic, ingress processing looks up the destination address in the frame or packet header, selects the egress interface, and moves the frame or packet to the fabric. If the fabric is blocking, the packet may go into a buffer.

In all cases where I am familiar with router or switch internals, the ingress processor prefixes the frame or packet with an internal header used by the fabric to send it to an appropriate egress interface(s). Such headers are never seen outside the platform.
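Since these internal headers are proprietary and platform-specific, the following is purely a hypothetical layout (a 16-bit egress port bitmap plus a one-byte priority), meant only to show the prefix-on-ingress, strip-on-egress pattern.

```python
import struct

# Hypothetical internal header: 2-byte egress port bitmap, 1-byte priority.
HEADER = struct.Struct("!HB")

def add_fabric_header(frame: bytes, egress_ports, priority=0) -> bytes:
    """Ingress side: prefix the frame with fabric routing information."""
    bitmap = 0
    for port in egress_ports:
        bitmap |= 1 << port
    return HEADER.pack(bitmap, priority) + frame

def strip_fabric_header(data: bytes):
    """Egress side: recover the port list and priority, return the bare frame."""
    bitmap, priority = HEADER.unpack_from(data)
    ports = [p for p in range(16) if bitmap & (1 << p)]
    return ports, priority, data[HEADER.size:]

tagged = add_fabric_header(b"frame-bytes", egress_ports=[1, 5], priority=3)
ports, priority, frame = strip_fabric_header(tagged)
```

A bitmap is used here so that one header can name several egress interfaces, which is what a multicast or flooded frame needs.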

Pattern Recognition

Ingress processing, in the real world, gets complicated by frequent requirements to recognize patterns in the packet or frame, patterns other than the destination. Among the most common is what we generically call an access control list (ACL), which checks certain fields, usually with a mask that indicates whether the value of a bit is to be checked, or if the pattern will accept any bit value in that position (i.e., wild card).

When you consider wild cards as well as a bit being one or zero, you introduce ternary logic, a step beyond a simple binary on-or-off decision.
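A ternary comparison is easy to express with a value and a mask. The sketch below assumes the Cisco ACL convention in which a 1 bit in the wildcard mask means "don't care"; the function names are made up for the example.

```python
import ipaddress

def ternary_match(value: int, pattern: int, care_mask: int) -> bool:
    """True when value equals pattern in every bit position the mask cares about."""
    return (value & care_mask) == (pattern & care_mask)

def ip_to_int(text: str) -> int:
    return int(ipaddress.IPv4Address(text))

def ace_matches(addr: str, pattern: str, wildcard: str) -> bool:
    """ACL-style match: wildcard bits set to 1 are ignored."""
    care_mask = ~ip_to_int(wildcard) & 0xFFFFFFFF
    return ternary_match(ip_to_int(addr), ip_to_int(pattern), care_mask)

# Pattern 192.168.1.0 with wildcard 0.0.0.255: the last octet may be anything.
in_subnet = ace_matches("192.168.1.77", "192.168.1.0", "0.0.0.255")
outside = ace_matches("192.168.2.77", "192.168.1.0", "0.0.0.255")
```

Each bit position therefore has three possible states in the pattern: must be one, must be zero, or ignored.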

Cisco now describes the individual lines in an ACL as access control entries (ACE). You can recognize patterns, at L2 and L3, for various reasons, including security filtering, special routing (e.g., source routing) or QoS recognition and marking.

Advances in Forwarding Tables: CAM and TCAM

One of the challenges to wire-speed forwarding is how quickly destination information can be retrieved from an address table. In L2 switches, this historically was the job of the Content Addressable Memory (CAM), and now the job of the Ternary Content Addressable Memory (TCAM). The TCAM has both L2 and L3 fast lookup capability, as opposed to the Forwarding Information Bases in router Versatile Interface Processors (VIP) or the forwarding part of a Route Switch Processor (RSP).

Router FIBs, however, hold considerably more routes than a TCAM, a necessity for service providers.

Early switches used a CAM to look up destination MAC addresses. The CAM had far fewer entries than most router caches or FIBs, which often was acceptable given the scope in which a switch worked.

In a CAM, you must match on every bit of a MAC address, even if some of them, such as the first 24 bits of vendor ID, are not significant for the particular lookup.

Introducing Ternary Tables

TCAMs, however, can "wildcard" fields. This gives several advantages over a CAM, including longest-match selection for ACLs and CEF (i.e., in L3 forwarding), a single lookup of fixed latency, and the ability to ignore fields. TCAMs are used in the 6500, 4000 and 3550 series.
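The difference between the two memories can be modeled in software. In this sketch a CAM is simply a dictionary (exact match on the whole key), while the hypothetical Tcam class holds ordered value/mask entries and returns the first hit. Keeping longer masks first is one common way to get longest-match behavior; real hardware compares all rows in parallel rather than looping.

```python
import ipaddress

# CAM model: exact match on every bit of the key.
cam = {"00:00:0c:11:22:33": "Fa0/1"}

class Tcam:
    """Ordered (value, mask, result) entries; the first match wins."""
    def __init__(self):
        self.entries = []

    def add_prefix(self, prefix, result):
        net = ipaddress.ip_network(prefix)
        self.entries.append((int(net.network_address), int(net.netmask), result))
        # Longer masks first, so the first hit is also the longest match.
        self.entries.sort(key=lambda e: bin(e[1]).count("1"), reverse=True)

    def lookup(self, addr):
        key = int(ipaddress.ip_address(addr))
        for value, mask, result in self.entries:
            if (key & mask) == value:      # masked-out bits are "don't care"
                return result
        return None

tcam = Tcam()
tcam.add_prefix("0.0.0.0/0", "default")
tcam.add_prefix("10.0.0.0/8", "Gi0/1")
tcam.add_prefix("10.1.0.0/16", "Gi0/2")
```

Looking up 10.1.2.3 here matches all three entries, and the /16 wins because it sits first in the table.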

There is a platform-dependent number of templates and number of entries per template type; the TCAM is partitioned into regions of templates.

In the 4000 and basic 6500, there is a single centralized forwarding table. The central forwarding engine is the limit to forwarding performance.

Switches with 100-Mbps rates and above use distributed forwarding, which allows the forwarding speeds of multiple forwarding engines to be added. Distributed switching is present in the 3550 and in the 6500 with DFC.

Templates

Switch Database Management for TCAMs was introduced on the 3550. Originally, there were four templates: access, default, routing, and VLAN. Each partitions the TCAM to favor a different mix of resources.

Notice that the default template is optimized to support a large number of MAC addresses in the MAC table, and a large number of IP routes in the routing table. The trade-off is fewer resources for IGMP groups, QoS, and security-related access control entries (lines in access lists).

The routing template offers support for twice as many routes (16,000 versus 8,000), but far fewer access control entries and QoS entries. In contrast, the VLAN template disables routing entirely, and focuses all resources towards L2 and VLAN support.

As Chuck Larrieu put it in his 3550 Tutorial, "While it is unlikely that any CCIE Lab scenario would stress any of these settings, it is possible that a Candidate might be asked to 'assure that SVI support is maximized' or 'ensure that L3 functionality is not compromised by L2 considerations'." It is equally possible that a candidate for a written exam -- CCIE or CCNP -- might be asked a similar question. It's likely that the template model will spread to platforms other than the 3550.

Forwarding models

Demand-based forwarding requires the first packet of a flow to go through the "slow" or "software" path, which then populates a high-speed table. You will see this in the Supervisor 1A/MSFC on the 6500.

Topology-based forwarding, on the 6500 with Supervisor 2, the 4000 with Supervisor 3, and the 3550, breaks the dependence on software lookup.

Fabric

The fabric interconnects the input and output interfaces. There are three main types of fabric: shared bus, shared memory, and crossbar.

A given switch will have one or more types of fabric. Indeed, on high-performance switches such as the 6500, the highest-speed fabric is a separate card, not just part of the backplane.

Don't make the mistake I did, early in my career, and equate the backplane with the fabric. The backplane tends to be passive or nearly so. The active fabric will be on the supervisor card (or integrated equivalent), and sometimes on a separate plug-in card. Indeed, a single platform can have more than one fabric.

Notes on the fabric speeds table:
* Cisco specifications are not always clear whether the stated bandwidth is unidirectional or the sum of the two directions.
[1] Depends on platform model.
[2] Total bandwidth for the stack.

Shared Bus

Most lower-performance devices use a shared bus as the fabric. A single bus allows a connection between two interfaces, with all interfaces contending for the bus. Don't fall into salesdroid traps and assume faster is always better. Shared bus is the cheapest solution, and thus appropriate for workgroup and other small switches where cost is more important than performance.

The fabric is usually built into the backplane. Some devices, such as the 5500 switch, may have several busses bridged into one, and the throughput figure is the sum of the bus speeds.

Shared Memory

Shared memory systems keep the frame or packet in memory until the last egress interface is finished with it. Memory management can be simple or difficult, depending on whether or not there are requirements for QoS and/or multicast.

QoS requires static buffer allocation in the shared memory. When you are multicasting, unless there are enough concurrent ports to the memory to service all egress ports in the multicast group simultaneously, the packet or frame has to stay in memory until the last egress port transmits it.
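The "hold it until the last egress port is done" rule amounts to reference counting. A minimal sketch, with a hypothetical SharedBuffer class standing in for whatever buffer descriptor a real platform uses:

```python
class SharedBuffer:
    """One stored copy of a frame, released when every egress port has sent it."""
    def __init__(self, frame, egress_ports):
        self.frame = frame
        self.pending = set(egress_ports)   # ports that still have to transmit

    def transmitted(self, port):
        """Record that a port has finished; return True when the buffer is free."""
        self.pending.discard(port)
        return not self.pending

buf = SharedBuffer(b"multicast-frame", egress_ports=[1, 2, 3])
```

For a unicast frame the set has one member and the buffer is freed at once. For multicast, the slowest egress port decides how long the memory stays occupied, which is part of why memory management gets harder.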

Crossbar

Crossbar designs are a full mesh, allowing concurrent communications between any pair of interfaces. Obviously, there is no contention for unicast forwarding.

Crossbars are the fastest fabric technology. There may be several cooperating crossbars within a large switch or router, as the ASICs involved are typically not much larger than 16x16.

Multicasting on crossbars can be a challenge, since the one-to-one relationship inherent to a crossbar is not a good fit to the one-to-many of multicast involving multiple egress interfaces. Crossbar works perfectly well in the middle of a multicast tree, where you have a single egress interface for a multicast group address. Shared memory fabrics may work better for multiple-egress-interface multicasting.
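The one-to-one nature of a crossbar can be illustrated with a simple greedy scheduler. This is a toy model that assumes each ingress and each egress can take part in only one connection per cycle; real crossbar arbiters are considerably more sophisticated.

```python
def schedule_crossbar(requests):
    """Grant one-to-one connections for a single fabric cycle.

    requests: list of (ingress, egress) pairs wanting the fabric this cycle.
    Returns (granted, deferred); deferred pairs must wait for a later cycle.
    """
    busy_in, busy_out = set(), set()
    granted, deferred = [], []
    for ingress, egress in requests:
        if ingress in busy_in or egress in busy_out:
            deferred.append((ingress, egress))
        else:
            busy_in.add(ingress)
            busy_out.add(egress)
            granted.append((ingress, egress))
    return granted, deferred

# Three unicast flows between distinct ports all cross in the same cycle.
unicast = schedule_crossbar([(0, 4), (1, 5), (2, 6)])
# One multicast frame from port 0 to three egress ports needs three cycles here.
multicast = schedule_crossbar([(0, 4), (0, 5), (0, 6)])
```

Under this model the multicast case is served one egress at a time, which is the mismatch described above. With a single egress interface for the group, it behaves exactly like unicast.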

Egress Processing

In most switches and routers, the bulk of the processing is done at the ingress. Such functions as egress QoS, data link protocol conversion, etc., do take place in the egress card.

When the egress port connects to a server that is incapable of wire-speed operation, output buffering may be needed to avoid drops. In such cases, the amount of output buffering designed into the switch involves delicate tradeoffs. Too little buffering causes data drops, but too much buffering can cause unacceptable delay.
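The delay side of this tradeoff is simple arithmetic: a full buffer takes its size in bits divided by the link rate to drain. The buffer sizes below are arbitrary examples, not figures for any particular switch.

```python
def max_queuing_delay_ms(buffer_bytes: int, link_bps: float) -> float:
    """Worst-case time, in milliseconds, for a full output buffer to drain."""
    return buffer_bytes * 8 / link_bps * 1000

# A 1,000,000-byte buffer in front of a 100 Mbps port: about 80 ms.
delay_fast_ethernet = max_queuing_delay_ms(1_000_000, 100e6)
# The same buffer in front of a 10 Mbps port: about 800 ms.
delay_ethernet = max_queuing_delay_ms(1_000_000, 10e6)
```

A buffer that looks harmless on a fast port can add most of a second of delay on a slow one, which is unacceptable for voice and similar traffic.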

QoS at the Switch

The discussion of QoS here is less to get into the various ways of enforcing QoS, such as shaping, policing, and queuing, and more to discuss how QoS requirements affect switch architecture.

When you do not implement a QoS marking mechanism, the DSCP fields of packets and frames are trusted, and those fields are used to sort the data units into appropriate queues. In switches, the default means of servicing queues is round-robin. Most switches support four queues, either in partitioned main memory or in dedicated memory.

You can enable QoS marking and have the option of resetting the DSCP field to a new value, or you can set up new mappings between DSCP values and queues. See Figure 1 for the default mappings from DSCP to queue.

In switches with four queues, transmit queue 3 can be taken out of the round robin rotation and designated to follow strict priority queuing. This function, disabled by default, is intended for low-volume, delay-sensitive traffic such as voice and network control information. Be very conservative in assigning traffic to this queue, or you may starve the other queues.
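A small model shows how a strict priority queue interacts with round robin, and why it can starve the others. The TxScheduler class is an illustrative sketch only; the queue numbering follows the text, and everything else is an assumption.

```python
from collections import deque

class TxScheduler:
    """Four transmit queues; queue 3 may optionally be strict priority."""
    def __init__(self, strict_priority=False):
        self.queues = {q: deque() for q in (1, 2, 3, 4)}
        self.strict_priority = strict_priority
        self.order = deque((1, 2, 3, 4))          # round-robin rotation

    def enqueue(self, queue, frame):
        self.queues[queue].append(frame)

    def dequeue(self):
        if self.strict_priority and self.queues[3]:
            return self.queues[3].popleft()       # always served first
        for _ in range(len(self.order)):
            q = self.order[0]
            self.order.rotate(-1)                 # advance the rotation
            if self.strict_priority and q == 3:
                continue                          # queue 3 is outside the rotation
            if self.queues[q]:
                return self.queues[q].popleft()
        return None                               # nothing to send

sched = TxScheduler(strict_priority=True)
sched.enqueue(1, "data-a")
sched.enqueue(3, "voice-1")
sched.enqueue(3, "voice-2")
first = sched.dequeue()
```

As long as queue 3 has anything in it, dequeue never reaches the rotation, so a steady stream of traffic into queue 3 leaves queues 1, 2, and 4 unserved.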

You can find the transmit queue and priority assignment for an interface with the show run interface command.

A special bandwidth subcommand of tx-queue, not to be confused with interface bandwidth, can allocate a guaranteed minimum bandwidth to each of the four queues. At present, this is only available on non-blocking Gigabit Ethernet interfaces. For a 4000-specific example of such ports, see Table 22.

If you enable global QoS without bandwidth statements, each queue will get 250 Mbps. Do be aware that the switch does not check for consistency among the assignments, and it will let you oversubscribe (e.g., assign 250 Mbps to queues 1 and 2 and 500 Mbps to queues 3 and 4).
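Because the switch will not do the arithmetic for you, it is worth checking a planned allocation yourself. A minimal sketch, assuming a Gigabit Ethernet port (1000 Mbps); the function name is hypothetical.

```python
def check_tx_bandwidth(assignments_mbps, link_mbps=1000):
    """Return (total assigned, True if the port is oversubscribed)."""
    total = sum(assignments_mbps.values())
    return total, total > link_mbps

# Default behavior: four queues at 250 Mbps each exactly fill the port.
default_plan = check_tx_bandwidth({1: 250, 2: 250, 3: 250, 4: 250})
# The oversubscribed example from the text: 1500 Mbps promised on a 1000 Mbps port.
bad_plan = check_tx_bandwidth({1: 250, 2: 250, 3: 500, 4: 500})
```

In the second case the guarantees cannot all be honored at once, since the queues together are promised half again as much bandwidth as the interface has.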

As long as a transmit queue is below the preconfigured share and shaping values, it is considered high priority and served by the priority queuing discipline. Queues that do meet the share and shape values will be serviced after the high priority queues. Only if no high priority queues exist will strict round robin be observed.

Interfacing: the GBIC (Gigabit Ethernet Interface Converter)

Cisco standardizes the Gigabit Ethernet ports on switches, and assumes you will connect a Gigabit Ethernet Interface Converter (GBIC) to the ports to interface the port to the specific GE technology. There are GBICs for short- and long-wave optical GE, for long-haul systems, for coarse and dense wavelength division multiplexing on optical transmission systems, for switch stacking, for GE over copper, and a constantly growing list of optical and electrical media.

Port type	Attributes
Static	No filtering and may be assigned to a VLAN based on physical port ID.
Dynamic	Assigned to a VLAN based on frame contents and the definitions in the VLAN Management Policy Server (VMPS)
Secure	Has a MAC address filter
Trunk	Runs 802.1q, 802.1v or ISL
Source SPAN	Source of traffic to be sent to the SPAN monitoring port
Destination SPAN	Port associated with SPAN analysis (e.g., RMON)

Switching mode	Speed	FIB:RIB Relationship
"Software"	Slowest but most intelligent	RIB and FIB are the same.
"Hardware" -- CAM for L2	Default mode and most common at layer 2	May be centralized or distributed. Uses Content Addressable Memory requiring an exact match
"Hardware" -- TCAM for L2 and L3	Good compromise between speed and intelligence	May be centralized or distributed. Uses one or more Ternary Content Addressable Memories

Switching mode	Speed	FIB:RIB Relationship
Process switching	Slowest but most intelligent	RIB and FIB are the same.
Fast switching	Default mode, faster than process	FIB is in RAM, and is smaller than the RIB.
Autonomous, silicon, optimum	Fast, hardware-assisted and platform-dependent	FIB is in special hardware, and is much smaller than the RIB.
Express	Fastest, especially when distributed into multiple Versatile Interface Processors	FIB is a full copy of the RIB.

TCAM resource	Access	Default	Routing	VLAN
Unicast MAC addresses	1024	5120	5120	8192
IGMP groups	2048	1024	1024	1024
QoS access control entries (ACE)	1024	1024	512	1024
Security ACE	2048	1024	512	0
Unicast routes	2048	8192	16384	0
Multicast routes	2048	1024	1024	0

Platform	Fabric Speeds in Gbps *
Platform	Shared bus	Shared Memory	Crossbar
2900		8.8
2955		13.6
3550		8.8, 13.6, 24 [1]
3750	32 [2]
4000		32
4500		28, 64
5000	1.2
5500	3.6
6000		32
6500			256
