The MPLS WG Archive


draft-ietf-mpls-lsp-ping-01.txt - ECMP considerations

  • From: Curtis Villamizar <curtis@fictitious.org>
  • Date: Tue, 10 Dec 2002 16:01:24 -0500
  • cc: curtis@fictitious.org


This is a consolidation of some of what I wrote as a participant in an
overly long thread (apologies offered for the length of that thread)
in which ECMP (among other things) was discussed.

Curtis


\begin{aside}

  An early email in this thread stated:

      That is not to say that no legitimate issues have been brought up.
      For example: 1) it is not clear to me how to handle IP src/dst
      hash based or MPLS label hash based load balancing, as would occur
      in LDP/ECMP, 2) I'm not clear on the usage of the FEC stack.  I
      think 1) may be a deficiency, 2) is either a clarity or a feature
      with no clear purpose.  Perhaps an example (sent to the list if
      not in the document) would clear up the purpose of FEC stack.

    [Correction: "a clarity" should be "a clarity issue".]

  Note that a reread of the example in 3.1 cleared up 2) for me so just
  the multipath (MP) issue remains.  [This was my fault.  I was trying
  to read more functionality into the FEC label stack than was intended
  resulting in my confusion.  There is no requirement that RFCs be clear
  to rushed and careless readers so this is fine.]

\end{aside}

Rather than just paste the past mail in here, I'll summarize.  I've
added detail to this.


There are a number of MP cases to consider:

  cases:

  1.  IP traffic at an ingress LSR is put on multiple LSPs.

  2.  An LSP splits traffic at one or more midpoints based on labels
      beneath the top label.

  3.  An LSP which is known to contain IP traffic (either through
      signaling or empirically) and that traffic path splits at one or
      more midpoints based on IP header information.

  4.  Traffic is spread over more than one resource (for example:
      forwarding ASIC, switch fabric channel, external link) where the
      component path needs to be exposed for a thorough test.

It is worth checking to see if all of these cases exist and looking at
the requirements of those known to exist before adding more complexity
to MPLS-ping than is needed.

  1.  ECMP over LDP is implemented and known to be deployed.
      Multipath over MPLS/TE is implemented and known to be deployed.

  2.  Label stack based load split is the default behavior for LDP
      midpoints that support ECMP and allow branching at a midpoint.
      No equivalent exists in MPLS/TE (the traffic must be split at
      the ingress, even if loose hops are specified).

  3.  Some router hardware is capable of looking underneath the label
      stack either a) only if the depth is one or b) regardless of
      depth, and hash based on the src/dst of the packet.  I do not
      know if anyone takes advantage of this in a deployed network.
      Cases where this could be applied are:

   a. At an LDP node doing ECMP as an ingress to IP traffic, traffic
      coming in as MPLS could also be split in the same way as the
      ingress IP traffic if the incoming LSP is known to contain IP.

   b. At an LDP node not doing ECMP per se, but using a MPLS/TE tunnel
      split among more than one MPLS/TE LSP, incoming MPLS traffic
      could be split in the same way as the ingress IP traffic if the
      incoming LSP is known to contain IP.

   c. A PSC hop in a hierarchical MPLS/TE could be a hop over more
      than one LSP with same egress and an incoming LSP that is known
      to contain IP (in this case L3PID is available) may be split in
      the same way as IP traffic that enters at this ingress.

  4a. Many routers today perform some operations in parallel or
      support multiple parallel data paths.  Where the paths are
      deterministic, these paths could be exposed to the diagnostic
      and individually checked.  Whether any vendor would do this is
      questionable.

  4b. Many routers support a form of link bundling in which load split
      can be done by either MPLS label, or by IP header.  Load split
      of MPLS traffic by IP header over this type of link is deployed.


It might be better if responses which argue the existence or
non-existence of one of the above cases be separated from responses to
the technical part below.

In ping mode, you start with a TTL=255 and either the egress responds,
or it doesn't.  If it does, then at least one path in a MP works.

For MP, periodic traceroute is needed.  Ideally, traceroute would
yield the full set of paths.

In either ping mode or traceroute mode the ingress must know:

  problems:

  1.  An address on the egress router to be placed in the IP header.

  2.  Whether there are multiple paths, and if so how to exercise each
      one.

The use of a 127/8 address solves problem 1) for the case of LDP where
the egress is not known, for the case of a hop over a MPLS-ping
incapable LSR where the following LSR is not yet known, and for the
case where the packet emerges at an unexpected LSR.

Problem 2) for case 1 (top of message) where the LSR is the ingress to
IP traffic is trivially solved.  The ingress LSR must be able to force
a packet into any one of the available LSPs.  The fixed 127.0.0.1
address would be fine since some minimal programming effort is already
needed to force a packet with a 127/8 destination into the LSP.

[Note that problem 2 is an issue if any of cases 2-4 exist in real
routers.  As I said earlier, discussion of whether these cases exist
should be separated from discussion of the protocol bits suggested
here.]

Problem 2) for cases 2-4 can be handled with a simple extension.  We
could solve this by adding a "Downstream Multipath Mapping" TLV as
type 5 and a "Multipath Exercise" TLV as type 6 (or squeeze them into 3
and 4 and renumber the current 3 and 4).

  Downstream Multipath Mapping TLV

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  number of multipaths         |  type of multipath            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  Multipath Exercise TLV                                       |
      .                                                               .
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  Downstream Mapping TLV                                       |
      .                                                               .
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      ..... more Multipath Exercise / Downstream Mapping TLV pairs ....
      .                                                               .
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

    number of multipaths - This gives the number of multipaths and
    determines the number of Multipath Exercise / Downstream Mapping
    TLV pairs to follow unless TLV or packet size limits cause
    truncation.

    type of multipath - This gives the type of multipath split
    performed at the branchpoint.  It also determines the format of
    the Multipath Exercise TLVs.  Supported types are:

        type of multipath      meaning
	-----------------      -------
			1      Per packet split (use of this form of
			       multipath is highly discouraged).
			2      Seeded unspecified hash based on
			       underlying labels.
			3      Seeded unspecified hash based on IP
			       source and destination.
			4      Seeded unspecified hash based on IP
			       source and destination for stack depth
			       of one, based on underlying labels for
			       stack depth greater than one.
			5      Other or unspecified.

    Any "type of multipath" which is unrecognized must be treated like
    type 5 and any "Multipath Exercise TLV" provided with an
    unrecognized "type of multipath" must be ignored.
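
As a rough illustration of the first word of this TLV and of the
"treat unrecognized types like type 5" rule, here is a minimal Python
sketch.  It assumes the two fields are 16 bits each in network byte
order, as the diagram suggests.  The enclosing TLV type/length header
is not shown in the diagram and is left out.  The function and
constant names are made up for this example.

```python
import struct

MP_TYPE_OTHER = 5                 # "Other or unspecified"
KNOWN_MP_TYPES = {1, 2, 3, 4, 5}  # types listed in the table above
TYPES_WITH_EXERCISE = {2, 3, 4}   # types 1 and 5 carry no Multipath Exercise TLV


def pack_mp_mapping_header(num_multipaths, mp_type):
    """Pack the first 32-bit word: two 16-bit fields, network byte order."""
    return struct.pack("!HH", num_multipaths, mp_type)


def parse_mp_mapping_header(data):
    """Return (number of multipaths, effective type, use_exercise_tlvs)."""
    num_multipaths, mp_type = struct.unpack_from("!HH", data, 0)
    if mp_type not in KNOWN_MP_TYPES:
        # Unrecognized types are treated like type 5, and any Multipath
        # Exercise TLVs that came with them are ignored.
        return num_multipaths, MP_TYPE_OTHER, False
    return num_multipaths, mp_type, mp_type in TYPES_WITH_EXERCISE
```

A receiver that sees an unknown type still learns how many branches
exist; it just cannot steer a probe onto a chosen branch.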

  Multipath Exercise TLV

    Omitted for multipath type 1 (non-deterministic) and for type 5.

    For multipath types 2-4:

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | Hash Key Type | Depth Limit   |  MP Index     |reserved (zero)|
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  IP Address or Next Label                                     |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      ....

    Hash Key Type:			   IP Address or Next Label
    --------------			   ------------------------
		 1   label		   1 or more labels
		 2   IP address		   1 or more IP addresses
		 3   label range	   1 or more low/high label pairs
		 4   IP address range	   1 or more low/high address pairs
		 5   no more labels	   (nothing)
		 6   All IP addresses	   (nothing)
		 7   no match		   (nothing)

    Depth Limit - applicable only to a label stack, the maximum number
    of labels considered in the hash or zero for unspecified or
    unlimited.

    MP Index - used to number the MP paths and refer to them.

    IP Address or Next Label - an IP address from the range 127/8 or
    a next label which will exercise this path is given.

    Use of Hash Key Type and IP Address or Next Label:
    --------------------------------------------------
    type 1 - a list of single labels is provided, any one of which will
	     cause the hash to match this MP path.
    type 2 - a list of single IP addresses is provided, any one of
	     which will cause the hash to match this MP path.
    type 3 - a list of label ranges is provided, any one of which will
	     cause the hash to match this MP path.
    type 4 - a list of IP address ranges is provided, any one of which
	     will cause the hash to match this MP path.
    type 5 - if no more labels are provided on the stack, this MP path
	     will apply (can only appear once).
    type 6 - Any IP address matches.  Underlying labels may go
	     elsewhere, but all IP takes only one MP path (can only
	     appear once).
    type 7 - no matches are possible given the set of "Multipath
	     Exercise TLV" provided by prior hops.
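
A minimal Python sketch of how the value part of a Multipath Exercise
TLV might be built from the field layout and the Hash Key Type table
above.  Two things are assumptions of this example and not part of
the proposal: each label is carried right-aligned in its own 32-bit
word, and ranges are passed as a flat low, high, low, high list.  The
function name is hypothetical.

```python
import ipaddress
import struct

LOOPBACK_NET = ipaddress.IPv4Network("127.0.0.0/8")
LABEL_TYPES = {1, 3}     # label, label range
ADDRESS_TYPES = {2, 4}   # IP address, IP address range
RANGE_TYPES = {3, 4}     # entries come in low/high pairs
EMPTY_TYPES = {5, 6, 7}  # no more labels, all IP addresses, no match


def pack_exercise_value(hash_key_type, mp_index, entries=(), depth_limit=0):
    """Build the header word followed by zero or more 32-bit entries."""
    entries = list(entries)
    if hash_key_type in EMPTY_TYPES:
        if entries:
            raise ValueError("hash key types 5-7 carry no entries")
    elif hash_key_type in LABEL_TYPES | ADDRESS_TYPES:
        if not entries:
            raise ValueError("at least one entry is required")
        if hash_key_type in RANGE_TYPES and len(entries) % 2:
            raise ValueError("ranges must be given as low/high pairs")
    else:
        raise ValueError("unknown hash key type")

    words = []
    for entry in entries:
        if hash_key_type in ADDRESS_TYPES:
            address = ipaddress.IPv4Address(entry)
            if address not in LOOPBACK_NET:
                raise ValueError("addresses must come from 127/8")
            words.append(address.packed)
        else:
            if not 0 <= entry < 2 ** 20:
                raise ValueError("labels are 20 bits wide")
            # Assumption: one label per 32-bit word, right-aligned.
            words.append(struct.pack("!I", entry))

    # Hash Key Type, Depth Limit, MP Index, reserved (zero).
    header = struct.pack("!BBBB", hash_key_type, depth_limit, mp_index, 0)
    return header + b"".join(words)
```

For example, pack_exercise_value(2, 1, ["127.0.0.1"]) describes MP
path 1 as reachable with destination 127.0.0.1, and
pack_exercise_value(7, 3) reports that nothing offered so far reaches
MP path 3.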

    If prior hops provide a "Downstream Multipath Mapping TLV" the
    labels and IP addresses should be picked from the set provided in
    the prior "Multipath Exercise TLV", or a "Hash Key Type" of 7
    should be used.

    An ingress may choose to reduce the size of a "Downstream
    Multipath Mapping TLV" when copying into the next Echo Request as
    long as the Hash Key Type matching the label or IP address used to
    exercise the current MP is still present.
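
To make the "pick from the set provided by prior hops" rule concrete,
here is a small Python sketch of what a replying LSR might do with the
addresses it was offered.  The toy_branch_of function is a stand-in
for the LSR's private hash (it is invented for this example), and the
result is shown as a plain dictionary rather than encoded TLVs.

```python
HASH_KEY_IP = 2        # "IP address"
HASH_KEY_NO_MATCH = 7  # "no match"


def answer_for_offered(offered, num_branches, branch_of):
    """Split the addresses offered by the prior hop among local branches.

    Returns {MP Index: (Hash Key Type, addresses)}.  A branch that none
    of the offered addresses reaches is reported with Hash Key Type 7.
    """
    per_branch = {index: [] for index in range(num_branches)}
    for address in offered:
        per_branch[branch_of(address)].append(address)
    return {
        index: (HASH_KEY_IP, addrs) if addrs else (HASH_KEY_NO_MATCH, [])
        for index, addrs in per_branch.items()
    }


def toy_branch_of(address):
    """Hypothetical two-way hash: parity of the last octet."""
    return int(address.rsplit(".", 1)[1]) % 2
```

With offered = ["127.0.0.2", "127.0.0.4"] and two branches, both
addresses land on MP Index 0 and MP Index 1 comes back as type 7, so
the ingress has no way to exercise that second branch from the list
it was given.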

This solves the problem if there is only one branch point.  If there
is more than one branch point, then this is not satisfactory, since
the short list offered at the first branch point may contain nothing
that reaches a given branch further downstream.  This could be solved
by adding to the "type of multipath" types in which the hash algorithm
is specified and the seed is provided rather than a short list of
addresses or labels.  The ingress can then determine what addresses or
labels will take each branch and either exercise the branch or
determine that a branch will never be used.
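
A sketch of what the ingress could do if each branch point reported a
known hash algorithm and its seed.  No vendor hash is known here, so
keyed BLAKE2b followed by a modulo is used purely as a placeholder;
the seeds, the (seed, number of branches) description of a hop, and
the function names are all assumptions of this example.

```python
import hashlib
import ipaddress


def branch_of(seed, num_branches, src, dst):
    """Placeholder for a published hash: keyed BLAKE2b of src/dst, modulo."""
    key = ipaddress.IPv4Address(src).packed + ipaddress.IPv4Address(dst).packed
    digest = hashlib.blake2b(key, key=seed, digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_branches


def find_exercise_addresses(src, hops, limit=4096):
    """Map each reachable sequence of branch choices to one 127/8 address.

    hops is a list of (seed, num_branches), one entry per branch point.
    A sequence missing from the result was not reached by any of the
    first `limit` addresses tried.
    """
    base = int(ipaddress.IPv4Address("127.0.0.1"))
    found = {}
    for offset in range(limit):
        dst = str(ipaddress.IPv4Address(base + offset))
        path = tuple(branch_of(seed, n, src, dst) for seed, n in hops)
        found.setdefault(path, dst)
    return found
```

With two branch points, for example hops = [(b"seed-a", 2),
(b"seed-b", 3)], the ingress computes locally which destination
address follows each combination of branches, instead of depending on
a list handed back by the first hop.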

It is uncertain whether router makers will reveal their hash functions
(I don't know if my employer will, though I personally can't see why
not), but if they do, the extension mechanism is there to support it.

The only other solution is to add the ability to specify which path to
check and to trace the control plane only.  This would not be a
satisfactory solution since a clear goal is to verify forwarding.

...  Appendix of sorts - Some notes on multipath algorithms ...

RFC2991 is informational and points to some requirements.  It is not
normative, but if a router doesn't meet the requirements the
consequences (microflow reorder or poor load split) are clear and
ample indication of how to avoid this is given.

 requirements:

  1.  Any multipath implementation must avoid reordering microflows.

  2.  Some algorithms split traffic better than others.  Using the
      whole address rather than part of it still doesn't split
      microflows and balances load better.

  3.  Some algorithms produce a delay discontinuity for fewer
      microflows when a member is added or deleted.

The third criterion in RFC2991 is ignored by current practice.
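
To put a rough number on the third requirement, the Python sketch
below counts how many microflows change links when a fifth link is
added to four, for hash and modulo and for highest random weight
(HRW), one of the alternatives RFC2991 describes.  BLAKE2b is only a
stand-in for a router's hash, and the 10000 synthetic flow keys are
made up for the example.

```python
import hashlib


def flow_hash(data):
    """Stand-in hash over a flow key (for example src and dst addresses)."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def modulo_choice(flow, num_links):
    """Hash and modulo: link index is hash(flow) mod number of links."""
    return flow_hash(flow) % num_links


def hrw_choice(flow, num_links):
    """Highest random weight: the link whose (flow, link) hash is largest."""
    return max(range(num_links), key=lambda link: flow_hash(flow + bytes([link])))


def moved_fraction(choice, flows, before, after):
    """Fraction of flows whose link changes when the link count changes."""
    moved = sum(1 for flow in flows if choice(flow, before) != choice(flow, after))
    return moved / len(flows)


# Synthetic 8-byte flow keys, purely illustrative.
FLOWS = [index.to_bytes(8, "big") for index in range(10000)]
```

With a uniform hash, going from 4 to 5 links moves about 4/5 of the
flows under hash and modulo but only about 1/5 under HRW, which is
the minimum needed to load the new link evenly.
moved_fraction(modulo_choice, FLOWS, 4, 5) and
moved_fraction(hrw_choice, FLOWS, 4, 5) show the difference.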

 key points:

  1.  Per packet (send every other packet to a different interface)
      load split reorders the packets within microflows and is bad.
      [This also impacts L2VPN if reordering at egress is unsupported.]

  2.  The granularity of per IP prefix load split is too coarse and
      the resulting load split can be terrible.  For example, all
      traffic to 18/8 (MIT) going to one interface might make for a
      very uneven load split on a Boston circuit.

Current practice dates back to the late 1980s, possibly earlier, when
the PC-RT based T1-NSFNET routers did multipath by src/dst based hash.
The method of hash and modulo or a specific hash designed to produce
an even distribution over a number range has been used by numerous
routers and remains in widespread use.
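
For readers who have not seen it written down, a minimal Python sketch
of src/dst hash and modulo next to a per-prefix split.  The hash
(BLAKE2b), the fixed /8 prefix length, and the synthetic destination
list are assumptions for illustration only; real routers use their
own hash functions and their routing table prefixes.

```python
import hashlib
import ipaddress


def link_by_src_dst(src, dst, num_links):
    """Hash the whole src/dst pair, then take it modulo the link count."""
    key = ipaddress.IPv4Address(src).packed + ipaddress.IPv4Address(dst).packed
    digest = hashlib.blake2b(key, digest_size=4).digest()
    return int.from_bytes(digest, "big") % num_links


def link_by_prefix(dst, num_links, prefix_len=8):
    """Per-prefix split: every destination in the prefix gets one link."""
    prefix = int(ipaddress.IPv4Address(dst)) >> (32 - prefix_len)
    return prefix % num_links


# 1000 synthetic destinations inside 18/8.
DESTINATIONS = ["18.0.%d.%d" % (i // 250, i % 250 + 1) for i in range(1000)]
```

Every packet of a given src/dst pair gets the same answer from
link_by_src_dst, so microflows are not reordered, while the many
pairs headed into 18/8 are spread over the available links.
link_by_prefix sends all of them to a single link.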