The MPLS WG Archive


[mpls] working group last call on draft-ietf-mpls-oam-frmwk-01.txt

  • From: neil.2.harrison@bt.com
  • Date: Thu, 30 Dec 2004 17:49:24 -0000
  • Cc: mpls@ietf.org

<snipped>
Adrian and draft authors,
> 
> Section 3.3 Availability
> The definition of availability is fine. But what is meant by
> the following?
>     MPLS has several forwarding modes (depending on the control plane
>     used). As such more than one model may be defined.
> Does this say that there may be more than one definition of 
> availability? Or more than one way of computing it?

The definition of availability is not fine.....indeed there is no
*definition* given.  I think folks need to start being a little more
rigorous here.  I apologise for repeatedly raising the point that there
are 3 network modes, but it is rather important and we can't simply apply
the same treatment to each (and this goes beyond OAM).

There is also an OAM ordering principle here, viz:
Mode=>defects=>(un)availability=>PM.  Which means that:
-	there is no basis for defining PM until we have defined
entry/exit criteria for unavailability
-	there is no basis for defining unavailability until we have
defined entry/exit criteria for defects
-	there is no basis for defining defects until we have identified
the mode we are dealing with

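And just as a reminder....
cl-ps (connectionless packet-switched) - breaks only
co-cs (connection-oriented circuit-switched) - breaks and swaps (but ONLY
between exactly alike trail entities)
co-ps (connection-oriented packet-switched) - breaks, swaps (any trail
entities) and misbranching {see Note} (from_any into_any trail entities).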
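The mode-to-defect mapping above can be written down as a small lookup table, which makes the ordering principle concrete: the defect set (and hence the unavailability and PM definitions built on it) is keyed by mode first. A minimal Python sketch; the `Mode` names and the defect labels are hypothetical, chosen only for illustration.

```python
from enum import Enum


class Mode(Enum):
    CL_PS = "connectionless packet-switched"
    CO_CS = "connection-oriented circuit-switched"
    CO_PS = "connection-oriented packet-switched"


# Connectivity defect classes each mode can exhibit, per the reminder list.
# The string labels are illustrative, not taken from any standard.
DEFECTS = {
    Mode.CL_PS: frozenset({"break"}),
    Mode.CO_CS: frozenset({"break", "swap"}),  # swaps only between alike trail entities
    Mode.CO_PS: frozenset({"break", "swap", "misbranch"}),
}


def possible_defects(mode: Mode) -> frozenset:
    """Defect classes needing entry/exit criteria before (un)availability and PM."""
    return DEFECTS[mode]


for m in Mode:
    print(m.name, sorted(possible_defects(m)))
```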

There is a common CV requirement theme that runs through all 3 modes.
CV 'comes for free' in the cl-ps mode as each pkt contains an SA (source address)....this
is why one can only have breaks here (though to test connectivity
independently of the traffic behaviour one would have to consciously
inject a test probe, which is what ICMP echo request/response does).
However, we must also consciously add the CV function (ie SA
information) in the co-cs and co-ps modes.  In the co-cs mode the frame
structure provides the obvious vehicle for this, eg the J1 (path trace) byte in SDH
VC4 trail O/H.  In the co-ps mode we need to determine the insertion rate of
a CV function/pkt into the data-plane.
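For the co-ps mode, a rough sketch of the source side: CV packets carrying the trail source identifier are merged into the data stream at a fixed period, independently of whether user traffic is flowing. The packet types, `source_id`, `period` and `duration` are hypothetical; the period is deliberately left as a parameter since, as noted, the insertion rate still has to be determined.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple, Union


@dataclass(frozen=True)
class CvPacket:
    source_id: str  # hypothetical trail source identifier: the 'SA' the CV function adds
    seq: int


@dataclass(frozen=True)
class DataPacket:
    payload: bytes


Timed = Tuple[float, Union[CvPacket, DataPacket]]


def insert_cv(traffic: Iterable[Tuple[float, DataPacket]], source_id: str,
              period: float, duration: float) -> Iterator[Timed]:
    """Merge one CV packet every `period` seconds into a time-ordered data stream."""
    if period <= 0:
        raise ValueError("period must be positive")
    seq = 0
    for t, pkt in traffic:
        # seq * period (rather than a running sum) avoids floating-point drift
        while seq * period <= t and seq * period < duration:
            yield seq * period, CvPacket(source_id, seq)
            seq += 1
        yield t, pkt
    # keep sending CV after the traffic stops: CV does not depend on user traffic
    while seq * period < duration:
        yield seq * period, CvPacket(source_id, seq)
        seq += 1


stream = [(0.5, DataPacket(b"x")), (1.5, DataPacket(b"y"))]
for t, p in insert_cv(stream, "LSR-A", period=1.0, duration=3.0):
    print(t, p)
```

With no user traffic at all the generator still emits CV packets, which is the point of consciously adding the function rather than relying on the traffic itself.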

{Note - Adrian, I noted you also remarked (snipped here) you did not
like the term 'misbranching'.  I can understand this, and there are
probably better words to capture the intention here.  Misbranching is a
shorthand way of capturing this type of behaviour:  Pkts in LSP1 from A to
B get unintentionally replicated (in some node between A and B) and
merged into the pkts going from C to D of LSP 2 say.  LSP1 is blissfully
unaware there is a security/integrity problem as its traffic is still
arriving OK at B.  The problem however should be detected at the sink of
LSP2 (by observing the SA of both LSP1 and LSP2 arriving).  A spin on
this is that LSP1 pkts simply get misdirected into LSP2, so the same
mismerging still results at LSP2's sink, but LSP1 now also shows a
break.....but it's not a simple break as LSP1's traffic is not
black-holing.}
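A minimal sketch of how a trail sink might classify what its CV function saw over one detection window, mapping onto the breaks, swaps and misbranching listed earlier. The window itself, the function name and the result labels are assumptions for illustration, not taken from Y.1711.

```python
from typing import Iterable


def classify_window(expected: str, seen: Iterable[str]) -> str:
    """Classify the CV source IDs a trail sink received in one detection window."""
    sources = set(seen)
    if not sources:
        return "break"     # nothing arrived: loss of connectivity
    if expected not in sources:
        return "swap"      # only a foreign source arrived: misconnection
    if len(sources) > 1:
        return "mismerge"  # own source plus a foreign one: misbranched traffic merged in
    return "ok"


# Scenario from the note: LSP1 runs A -> B, LSP2 runs C -> D.
# Replication case: sink B still sees only A, sink D sees both C and A.
print(classify_window("A", ["A", "A", "A"]))       # ok
print(classify_window("C", ["C", "A", "C", "A"]))  # mismerge
# Misdirection case: sink B now sees nothing, sink D still sees both.
print(classify_window("A", []))                    # break
```

In the misdirection case sink B on its own can only report a break; the fact that LSP1's traffic is not black-holing only shows up as a mismerge at LSP2's sink D.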

In both co modes the defect detection/handling must be unidirectional
and the consequent actions must be taken at the trail sink.  This is
important so that:
-	there is no ambiguity of the direction that has failed
-	we can correctly apply FDI to any affected co clients in the
syntax required by the client mode to suppress alarm storms
-	we can suppress traffic to avoid security/integrity breaches.
Note:  In the co-cs mode FDI is an all 1s signal which *replaces* the
traffic, so FDI also fulfils the security traffic suppression
requirement.  Not so in the co-ps mode; here FDI and traffic
suppression are separate actions that both need to be taken.
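To illustrate the difference in consequent actions at the trail sink between the two co modes, a rough sketch that treats a co-cs frame as raw bytes and co-ps traffic as a list of packets. The function names and the `fdi_packet` placeholder are hypothetical, and emitting one FDI packet per call is a simplification of periodic FDI.

```python
from typing import List

ALL_ONES = 0xFF


def co_cs_sink_output(frame: bytes, defect: bool) -> bytes:
    """co-cs: FDI is an all-1s signal that replaces the traffic, so a single
    action both signals downstream and suppresses the traffic."""
    return bytes([ALL_ONES]) * len(frame) if defect else frame


def co_ps_sink_output(packets: List[bytes], defect: bool, fdi_packet: bytes) -> List[bytes]:
    """co-ps: FDI packets do not displace user packets, so traffic
    suppression and FDI insertion are two separate actions."""
    if not defect:
        return list(packets)
    forwarded: List[bytes] = []   # action 1: suppress user traffic (security/integrity)
    forwarded.append(fdi_packet)  # action 2: send FDI to co clients (alarm suppression)
    return forwarded


print(co_cs_sink_output(b"\x01\x02\x03", defect=True))
print(co_ps_sink_output([b"p1", b"p2"], defect=True, fdi_packet=b"FDI"))
```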

Doing the OAM as noted above means that both p2p and p2mp trail
constructions can be handled correctly and scalably in the same way, ie
for the p2mp case only the sinks that are downstream of a failure in the
trail construction will be required to take the consequent actions.
This means the availability definition (which I'll return to shortly) at
the sink can be the same in both these cases.
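A minimal sketch of the p2mp point: given a tree of (hypothetical) node names and a failed link, only the sinks in the subtree below that link would detect the defect and act on it, while every sink applies the same per-sink processing. For simplicity it assumes the sinks are the leaves of the tree.

```python
from typing import Dict, List, Set, Tuple


def affected_sinks(children: Dict[str, List[str]], failed_link: Tuple[str, str]) -> Set[str]:
    """Return the leaf sinks downstream of a failed link in a p2mp tree."""
    parent, child = failed_link
    if child not in children.get(parent, []):
        raise ValueError(f"{failed_link} is not a link in the tree")
    affected: Set[str] = set()
    stack = [child]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if not kids:
            affected.add(node)  # a leaf, i.e. a sink that would see the defect
        stack.extend(kids)
    return affected


tree = {"R": ["A", "B"], "A": ["S1", "S2"], "B": ["S3"]}
print(sorted(affected_sinks(tree, ("R", "A"))))  # ['S1', 'S2']
```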

I have no idea how one can handle mp2p constructions in any meaningful
way wrt the above observations, ie how would one define any defects here
in terms of entry/exit criteria and sensible consequent actions?  And if
we can't do this then I can't see how we can define (un)availability (or
indeed PM) for such constructs.  A similar observation would apply to
the use of ECMP.


In addition to the CV function and the FDI function there is one further
basic OAM function that can be defined.  This is the BDI (Backward
Defect Indication).  This function is not mandatory for base defect
detection/handling (unlike CV and FDI for the reasons noted above), but
it is very useful when one wants to have visibility of the defect and
availability behaviour of both directions from a single end.  Folks
familiar with Y.1711 will be aware that both near-end and far-end state
processing is discussed that covers this aspect.
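A toy model of how BDI gives one end visibility of both directions. The `TrailEnd` type and `exchange_bdi` function are hypothetical, and the model assumes the BDI signal gets through on the return direction.

```python
from dataclasses import dataclass


@dataclass
class TrailEnd:
    name: str
    near_end_defect: bool = False  # defect detected on the direction arriving here
    far_end_defect: bool = False   # learned from BDI sent back by the peer


def exchange_bdi(a: TrailEnd, b: TrailEnd) -> None:
    """Each end signals BDI towards its peer while it has a near-end defect.

    Simplification: BDI is assumed to get through on the return direction;
    if that direction has also failed, the receiving end already has its
    own near-end defect.
    """
    a.far_end_defect = b.near_end_defect
    b.far_end_defect = a.near_end_defect


a, b = TrailEnd("A"), TrailEnd("B")
b.near_end_defect = True  # the A -> B direction has failed, detected at sink B
exchange_bdi(a, b)
print(a.near_end_defect, a.far_end_defect)  # False True
```

End A now sees the state of both directions without querying B, which is what makes BDI useful even though it is not needed for base defect handling.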

When we define SLAs these have 2 parts:  An (un)availability part that
defines the fraction of total time that the network (for which we can
also read as 'service') is ('down' or) 'up', and a PM (Perf Mon) part
that defines the (limiting degree of) transfer performance impairments
whilst the network (service) is in the 'up' state.  It is therefore
essential that we have agreed definitions of what constitutes entry/exit
of the unavailable state.  Further, in order that we can compare apples
with apples we require a strong degree of harmonisation across ALL layer
networks in this respect.  Remember, we can create a client/server
relationship where layer network X is carried over layer network Y (as
is the case with PWs) and it is therefore obvious (I hope) why such
harmonisation is important.
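The two-part SLA can be sketched as follows: availability is the fraction of 'up' seconds, and the PM part (packet loss is used here purely as an example metric) is computed over the 'up' seconds only. The per-second record format is an assumption.

```python
from typing import List, Tuple


def sla_report(up: List[bool], sent: List[int], lost: List[int]) -> Tuple[float, float]:
    """Return (availability fraction, loss ratio measured only while 'up').

    `up`, `sent` and `lost` are hypothetical per-second records of equal length.
    """
    if not up or not (len(up) == len(sent) == len(lost)):
        raise ValueError("need equal-length, non-empty per-second records")
    availability = sum(up) / len(up)
    sent_up = sum(s for u, s in zip(up, sent) if u)
    lost_up = sum(n for u, n in zip(up, lost) if u)
    loss_ratio = lost_up / sent_up if sent_up else 0.0
    return availability, loss_ratio


up = [True] * 8 + [False] * 2
sent = [100] * 10
lost = [1] * 8 + [100] * 2
print(sla_report(up, sent, lost))
```

With these sample records, folding the two down seconds into the loss metric would give 208/1000 = 0.208 instead of 0.01, mixing the (un)availability part of the SLA into the PM part.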

Traditionally, unavailability has been defined to occur at the start of
a period of 10 consecutive seconds of severe degradation (for the co-cs
mode this is defined in terms of 10 consecutive SESs (Severely Errored
Second)).  In Y.1711 I have used this same base definition but made
things far simpler, based on empirical evidence of what SESs looked like
in real networks when I last measured these (though I should add this
was quite a few years ago).
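The traditional rule can be sketched per second: an unavailable period begins at the first of 10 consecutive SESs, so entry is backdated to the start of the run. The message only states the entry criterion; the exit rule used here (10 consecutive non-SES seconds, also backdated) is the usual symmetric counterpart and is an assumption of this sketch, as is the per-second boolean input.

```python
from typing import List


def unavailable_seconds(ses: List[bool], n: int = 10) -> List[bool]:
    """Mark each second unavailable: a period starts at the first of n
    consecutive SESs and ends at the first of n consecutive non-SESs."""
    out = [False] * len(ses)
    unavailable = False
    i = 0
    while i < len(ses):
        j = i
        while j < len(ses) and ses[j] == ses[i]:
            j += 1  # find the run of identical values starting at i
        run = j - i
        if not unavailable and ses[i] and run >= n:
            unavailable = True   # backdated to the first SES of the run
        elif unavailable and not ses[i] and run >= n:
            unavailable = False  # backdated to the first non-SES of the run
        for k in range(i, j):
            out[k] = unavailable
        i = j
    return out


ses = [False] * 3 + [True] * 10 + [False] * 4 + [True] + [False] * 10
print(sum(unavailable_seconds(ses)))
```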

The vast majority of p2p applications require connectivity to be
available in both directions, else they will not work.  This means that
in the p2p case if EITHER direction fails then BOTH directions are
deemed unavailable, ie unavailability is an OR function of directivity
failure.  An interesting implication of this is that any PM SLA
recording for p2p constructs should be suspended in both directions when
we have an unavailability event in either direction.  There are some
rather subtle issues here, including timing/synchronisation at both ends
of the trail and 'packets in flight', that can add significant
complications. I could explain the rationale here but it would take a
while so let me simply define the end-result that one finds in Y.1711
wrt handling this and defining availability......PM purists may not
completely agree with what I have done, but I again stress that my main
goal was simplification.
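A sketch of the OR rule for p2p trails, given per-second unavailability flags for each direction. It assumes the two records are already time-aligned, i.e. it deliberately ignores the timing/synchronisation and packets-in-flight issues mentioned above.

```python
from typing import List, Tuple


def p2p_availability(fwd_unavail: List[bool],
                     bwd_unavail: List[bool]) -> Tuple[List[bool], List[bool]]:
    """Return (unavailable, pm_enabled) for a p2p trail.

    The trail is unavailable in BOTH directions whenever EITHER direction is,
    and PM SLA recording is suspended in both directions for those seconds.
    """
    if len(fwd_unavail) != len(bwd_unavail):
        raise ValueError("records must cover the same seconds")
    unavailable = [f or b for f, b in zip(fwd_unavail, bwd_unavail)]
    pm_enabled = [not u for u in unavailable]
    return unavailable, pm_enabled


print(p2p_availability([False, True, True, False], [False, False, True, True]))
```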

I have defined a short-break parameter in Y.1711 (which is 2nd only in
importance to unavailability in terms of customer/service impact) that
is purposely aligned with the 3s entry period of a defect (noting all
the defect cases are also harmonised in terms of entry/exit criteria,
again for simplification purposes, on a 3s period).  In order to make
both availability and PM SLA processing very simple, Y.1711 requires
that near-end PM SLA recording be turned off when a defect state is
entered.  Further, and again for simplification reasons based on what we
saw empirically for real error events, there are no PM metrics included
in the definition of defect entry/exit criteria or indeed the
unavailability entry/exit criteria....the latter is simply based on a
defect state being continuously present/absent for 10 consecutive
seconds.  One could use PM metrics (such as some weighted combination of
delay/loss/error metrics) in these entry/exit criteria, but IMO the
'accuracy' benefits this might bring (where 'accuracy' can actually be
distorted by including these) are simply not worth the
complications/cost involved, and I think there are smarter ways to
handle these state changes.  

In Y.1711 any PM would be recommenced on either (i) (near-end only) exit
from the defect state IF we are still in the available state when this
happens (noting that this also records a short break event), or (ii)
(near-end and far-end) exit from the unavailable state if this state has
indeed been entered (where a short-break event would not be recorded on
entry).
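Pulling the Y.1711-style rules above together, a rough per-second sketch for one direction. It takes the defect state as input (i.e. after the 3 s entry/exit filtering, which is not modelled), treats unavailability entry/exit as 10 consecutive seconds of defect presence/absence backdated to the start of the run, counts a short break only when a defect state is exited while still available, and counts PM seconds only while available and defect-free. Function and key names are hypothetical, and the near-end/far-end split is not modelled.

```python
from typing import Dict, List, Union


def y1711_style_accounting(defect: List[bool],
                           unavail_secs: int = 10) -> Dict[str, Union[int, List[bool]]]:
    """Per-second PM / short-break / unavailability accounting for one direction."""
    n = len(defect)
    unavailable = [False] * n
    pm_seconds = short_breaks = 0
    available = True
    i = 0
    while i < n:
        j = i
        while j < n and defect[j] == defect[i]:
            j += 1  # find the run of identical defect states starting at i
        run = j - i
        if defect[i]:
            # near-end PM recording is off for the whole defect run
            if available and run >= unavail_secs:
                available = False      # enter unavailability: no short break recorded
            elif available and j < n:
                short_breaks += 1      # defect exited while still available
        else:
            if not available and run >= unavail_secs:
                available = True       # exit from the unavailable state
            if available:
                pm_seconds += run      # PM (re)commences
        if not available:
            for k in range(i, j):
                unavailable[k] = True
        i = j
    return {"pm_seconds": pm_seconds, "short_breaks": short_breaks,
            "unavailable": unavailable}


d = [False] * 5 + [True] * 3 + [False] * 5 + [True] * 12 + [False] * 10 + [True] * 2
r = y1711_style_accounting(d)
print(r["pm_seconds"], r["short_breaks"], sum(r["unavailable"]))
```

A defect still present at the end of the record is not counted as a short break, since the short break is only recorded on exit from the defect state.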

This allows us to differentiate and measure PM, short-breaks and
unavailability in the most simple but consistent and complete manner
possible.  Much more thought has gone into Y.1711 (especially on
making stuff simple) than might be apparent to a casual reader.

One final point on OAM activation/deactivation:  The base defect
detection/handling CV function must be activated/deactivated in concert
with whatever mechanism is used to set-up/take-down the trail
(LSP)......so this could be either signalling or management
provisioning.  However, I have noted that (the lightweight) BFD OAM
assumes (the heavyweight) LSP-Ping OAM is used to activate/deactivate
BFD.  Ignoring the need to detect defects and take consequent actions at
the trail sink, this is clearly wrong since it would still need
synchronising to the signalling/provisioning process anyway.
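A minimal sketch of the coupling argued for here: CV activation and deactivation happen as part of the same set-up/take-down operation, rather than being driven by a separate OAM exchange. The `Lsp` class is hypothetical and stands in for whatever signalling or management provisioning process manages the trail.

```python
class Lsp:
    """Hypothetical trail object whose base CV function is switched on and off
    by the same operation that sets the trail up or takes it down."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.up = False
        self.cv_active = False

    def set_up(self) -> None:
        self.up = True
        self.cv_active = True   # activated as part of set-up, not by a separate OAM tool

    def take_down(self) -> None:
        self.cv_active = False  # deactivated as part of the same take-down step
        self.up = False


lsp = Lsp("LSP1")
lsp.set_up()
print(lsp.up, lsp.cv_active)
lsp.take_down()
print(lsp.up, lsp.cv_active)
```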

regards, Neil  








_______________________________________________
mpls mailing list
mpls@lists.ietf.org
https://www1.ietf.org/mailman/listinfo/mpls