The MPLS WG Archive


[mpls] working group last call on draft-ietf-mpls-oam-frmwk-01.txt

  • From: neil.2.harrison@bt.com
  • Date: Thu, 30 Dec 2004 21:53:04 -0000
  • Cc: mpls@ietf.org

Hi Peter.....

> 
> Neil,
> 
> Two questions:
> 
> 1) You say that PM can not be defined without defining entry 
> and exit criteria for availability. However, in Y.1711 you 
> specify that PM SLA recording be turned off at the entry of 
> the defect state. Doesn't this prove that PM is dependent on 
> an unambiguous definition of defect states, but independent 
> of the definition of availability?
NH=> In my opinion, and from a practical point of view, 'yes' is the
trite answer.....but only with how I have specified defects here, so let
me explain this a bit further.  Note - the ITU Recs that deal with this
all define that a 10s period is to be used for unavailability entry/exit
and the turning off/on of PM, eg see G.826.

My main objectives were simplicity whilst concentrating on what is
operationally important based on many years of studying real transport
networks.  Now in a bit more detail wrt your question:

True PM purists won't like my proposals, such as (i) not including PM in
the entry and exit criteria of availability or (ii) the fact I suggested
stopping the PM counters at the end of 3s defect evaluation period.  We
don't have to do either of these things of course...we can go for full
'by the book' processing with the added complexity and loss of
information this might result in.  Let me try and explain what I mean by
this.

Nearly all (close to 100%) of SES events I ever saw were accompanied by
defect conditions in transport networks.  But SES came in 2 flavours
(you may recall the reasons for this that I explained in a long mail to
you some time ago which I won't repeat here):  (i) few ms events at an
effective error density 0.5 (ii) few seconds events at an effective
error density 0.5.  Most <1s events self healed, so did a few 1-2s
events, but few >3s events ever did (ergo why 3s in Y.1711 for
defects).....and there was invariably a (server layer) defect present
for all this time, meaning there would be FDI inserted into whatever the
clients were (assuming they had an FDI).  In other words just looking
for defects was sufficient to measure SES and thus was the simple way to
measure unavailability in transport networks.

The few seconds SES events (which we called short-breaks) were however
rather important.  These were 3 orders of magnitude bigger than the few
ms SES and they knocked over applications.  This would not have been
such a problem if they were rare compared to the few ms SES events.  But
they were not....I was very surprised to see a roughly 50/50 split of
these.  But we had no method of measuring these...and if they just got
dumped in the same SES bucket with singleton PM SES events we lost sight
of them.

It was so we could also distinguish these, and not lose sight of them,
that I have suggested these be measured as a parameter in their own
right (albeit harmonised with the manner in which defect entry events
were defined).  In other words a defect lasting 3-9s would register a
short-break event....else it will 'get lost' in the PM measurements (and
it obviously will not factor as an unavailability event).  However,
letting these short-break events get captured by the PM process will
distort the metric objectives PM is intended to measure.  And it is this
latter point that is the practical reason for saying there is little
point in continuing to measure PM metrics after a period of about 3s.
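
The three buckets described above can be sketched as follows. This is only an illustration, not anything taken from Y.1711: the function names are hypothetical, durations are assumed to be measured from the onset of the impairment, and the treatment of the exact 3s and 10s boundaries is an assumption.

```python
from collections import Counter

DEFECT_ENTRY_S = 3        # defect evaluation period quoted in the message
UNAVAILABLE_ENTRY_S = 10  # unavailability entry period quoted in the message


def classify_event(duration_s: float) -> str:
    """Bucket one impairment by how long it persisted (hypothetical helper)."""
    if duration_s < DEFECT_ENTRY_S:
        return "pm"            # clears before a defect is declared
    if duration_s < UNAVAILABLE_ENTRY_S:
        return "short-break"   # the 3-9s case, counted in its own right
    return "unavailable"       # persists for 10s or more


def tally(durations_s):
    """Count events per bucket so short-breaks are not hidden among PM events."""
    return Counter(classify_event(d) for d in durations_s)
```

For example, `tally([0.005, 4, 4, 12])` keeps the two 4s events visible as short-breaks instead of merging them with the few-ms event.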

However, you imply stopping PM measurements on defect entry (any?).  But
this depends critically on how one defines the defect entry criteria.  If
this means the 3s as noted above then I'd be happy with that, but if you were
to imply defects detected in <<1s periods (as you can easily do on
transport networks) I would not, as I believe this should still be
counted in the PM category.

Note also that we should be a bit careful not to act too fast on <<1s
events as we can trigger prot-sw on (PM) error events that would
otherwise go away.

Finally here, please note we don't have to make the simplifications I
suggested.  The 'by the book' approach means one has to let all events
lasting <9s get registered as PM....though as I note this will distort
PM and we will lose sight of short-break events....and simply use the
10s unavailable state entry/exit criteria to turn on/off PMs.  I'd
settle for this, though I don't think it's the smartest approach.

> 
> 2) Isn't availability only relevant with respect to SLAs? In 
> most networks, services are carried over multiple networks 
> layers. Eg. an ethernet service can be carried over an 
> MPLS-over-SONET-over-WDM core network. Defect conditions need 
> to be defined for each network layer, but availability is 
> only relevant for the end-to-end Ethernet service. Therefore, 
> it seems to me that the SLA itself can provide the definition 
> of availability (including a description of the verification 
> process) and that there is not necessarily a need for a tight 
> coupling with the defect definitions in the underlying network layers.

NH=> If I have SDHoverMPLS which is the layer you would apply this
reasoning to?  And what about MPLSoverSDH?...does this change things?

Please also note that the SLA will apply to all layer networks.  If you
were leasing capacity off some SP in some layer network X I'd assume you
want an SLA for that, and not just the services you were then running
over that.  And if there was no alignment at all between how the leasing
SP specified his SLA and you specified your SLA then what would you do
about that?

My simple approach (because there are currently no rules for XoverY or
how many XoverY can be allowed in some global HRX) is to aim for as much
harmonisation as possible....some folks might call it 'backwards
compatibility', I'm just trying to be pragmatic.

Now let me ask you a question if I may.  I have made a fair stab at
defining quite a lot of stuff as simply as I can in Y.1711 (which would
apply to both p2p and p2mp).  So can you please explain to me how
defects, availability and all this PM stuff is going to work on mp2p
constructs?

regards, Neil 
> 
> Peter
> 
> 
> 
> mpls-bounces@lists.ietf.org wrote:
> > <snipped>
> > Adrian and draft authors,
> >> 
> >> Section 3.3 Availability
> >> The definition of availability is fine. But what is meant by the 
> >> following?
> >>     MPLS has several forwarding modes (depending on the control
> >>     plane used). As such more than one model may be defined.
> >> Does this say that there may be more than one definition of 
> >> availability? Or more than one way of computing it?
> > 
> > The definition of availability is not fine.....indeed there is no
> > *definition* given.  I think folks need to start being a little more
> > rigorous here.  I apologise for continuing to raise the point that
> > there are 3 network modes, but it is rather important and we can't
> > simply apply the same treatment to each (and this is more than OAM).
> > 
> > There is also an OAM ordering principle here, viz: 
> > Mode=>defects=>(un)availability=>PM.  Which means that:
> > -	there is no basis for defining PM until we have defined
> > entry/exit criteria for unavailability
> > -	there is no basis for defining unavailability until we have
> > defined entry/exit criteria for defects
> > -	there is no basis for defining defects until we have identified
> > the mode we are dealing with
> > 
> > And just to remind....
> > cl-ps - breaks only
> > co-cs - breaks and swaps (but ONLY between exactly alike trail
> > entities)
> > co-ps - breaks, swaps (any trail entities) and misbranching
> > {see Note} (from_any into_any trail entities).
> > 
> > There is a common CV requirement theme that runs through all 3 modes.
> > CV 'comes for free' in the cl-ps mode as each pkt contains a
> > SA....this is why one can only have breaks here (though to test
> > connectivity independently of the traffic behaviour one would have to
> > consciously inject a test probe, which is what ICMP echo
> > request/response does). However, we must consciously also add the CV
> > function (ie SA information) in the co-cs and co-ps modes.  In the
> > co-cs mode the frame structure provides the obvious vehicle for this,
> > eg the J0 byte in SDH VC4 trail O/H.  In the co-ps we need to
> > determine the insertion rate of a CV function/pkt into the data-plane.
> > 
> > {Note - Adrian, I noted you also remarked (snipped here) you did not
> > like the term 'misbranching'.  I can understand this, and there are
> > probably better words to capture the intention here.  Misbranching is
> > a shorthand way of capturing this type of behaviour:  Pkts in LSP1
> > from A to B get unintentionally replicated (in some node between A
> > and B) and merged into the pkts going from C to D of LSP2 say.  LSP1
> > is blissfully unaware there is a security/integrity problem as its
> > traffic is still arriving OK at B.  The problem however should be
> > detected at the sink of LSP2 (by observing the SA of both LSP1 and
> > LSP2 arriving).  A spin on this is that LSP1 pkts simply get
> > misdirected into LSP2, so the same mismerging still results at LSP2's
> > sink, but LSP1 now also shows a break.....but it's not a simple break
> > as LSP1's traffic is not black-holing.}
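
A minimal sketch of how a trail sink might tell these cases apart from the SAs it observes in one CV evaluation window. The function name and result labels are hypothetical and are not the Y.1711 defect names; the sketch assumes the sink knows the single SA it expects.

```python
def classify_cv(expected_sa: str, observed_sas) -> str:
    """Classify one evaluation window at a trail sink (illustrative labels)."""
    observed = set(observed_sas)
    if not observed:
        return "break"      # nothing arrived: simple loss of connectivity
    if observed == {expected_sa}:
        return "ok"         # only the expected source is arriving
    if expected_sa in observed:
        return "mismerge"   # expected source plus at least one unexpected source
    return "swap"           # only unexpected sources are arriving
```

In the note's example, the sink of LSP2 would see the SAs of both LSP1 and LSP2 and report a mismerge, while the sink of LSP1 would see either "ok" (replication case) or "break" (misdirection case).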
> > 
> > In both co modes the defect detection/handling must be unidirectional
> > and the consequent actions must be taken at the trail sink.  This is
> > important so that:
> > -	there is no ambiguity of the direction that has failed
> > -	we can correctly apply FDI to any affected co clients in the
> > syntax required by client mode to suppress alarm storms
> > -	we can suppress traffic to avoid security/integrity breaches.
> > Note:  In the co-cs mode FDI is an all 1s signal which *replaces* the
> > traffic, so FDI also fulfils the security traffic suppression
> > requirement.  Not so in the co-ps mode, here both FDI and traffic
> > suppression are separate actions that both need to be taken.
> > 
> > Doing the OAM as noted above means that both p2p and p2mp trail
> > constructions can be handled correctly and scalably in the same way,
> > ie for the p2mp case only the sinks that are downstream of a failure
> > in the trail construction will be required to take the consequent
> > actions. This means the availability definition (which I'll return to
> > shortly) at the sink can be the same in both these cases.
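
The p2mp point can be illustrated with a small tree walk: only the sinks below the failure point need to act. The tree representation (a dict mapping each node to its downstream children) and the function name are assumptions made for this sketch.

```python
def affected_sinks(children: dict, failed_node: str) -> list:
    """Return the sinks (leaves) at or downstream of failed_node in a p2mp tree."""
    stack = [failed_node]
    sinks = []
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if kids:
            stack.extend(kids)   # keep walking downstream
        else:
            sinks.append(node)   # a leaf is a trail sink
    return sorted(sinks)


# Hypothetical tree: root feeds branch nodes a and b, which feed sinks s1..s3.
tree = {"root": ["a", "b"], "a": ["s1", "s2"], "b": ["s3"]}
```

With this tree, a failure at `a` affects only `s1` and `s2`; the sink `s3` takes no consequent action.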
> > 
> > I have no idea how one can handle mp2p constructions in any meaningful
> > way wrt the above observations, ie how would one define any defects
> > here in terms of entry/exit criteria and sensible consequent actions?
> > And if we can't do this then I can't see how we can define
> > (un)availability (or indeed PM) for such constructs.  A similar
> > observation would apply to the use of ECMP.
> > 
> > 
> > In addition to the CV function and the FDI function there is one
> > further basic OAM function that can be defined.  This is the BDI
> > (Backward Defect Indication).  This function is not mandatory for base
> > defect detection/handling (unlike CV and FDI for the reasons noted
> > above), but it is very useful when one wants to have visibility of the
> > defect and availability behaviour of both directions from a single
> > end.  Folks familiar with Y.1711 will be aware that both near-end and
> > far-end state processing is discussed that covers this aspect.
> > 
> > When we define SLAs these have 2 parts:  An (un)availability part that
> > defines the fraction of total time that the network (for which we can
> > also read as 'service') is ('down' or) 'up', and a PM (Perf Mon) part
> > that defines the (limiting degree of) transfer performance impairments
> > whilst the network (service) is in the 'up' state.  It is therefore
> > essential that we have agreed definitions of what constitutes
> > entry/exit of the unavailable state.  Further, in order that we can
> > compare apples with apples we require a strong degree of harmonisation
> > across ALL layer networks in this respect.  Remember, we can create a
> > client/server relationship where layer network X is carried over layer
> > network Y (as is the case with PWs) and it is therefore obvious (I
> > hope) why such harmonisation is important.
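
The availability part of such an SLA reduces to a simple ratio once the unavailable time has been agreed. A minimal sketch, with a hypothetical function name and with both arguments assumed to be in seconds over the same observation window:

```python
def availability(total_s: float, unavailable_s: float) -> float:
    """Fraction of the observation window spent in the 'up' state."""
    if total_s <= 0:
        raise ValueError("observation window must be positive")
    if not 0 <= unavailable_s <= total_s:
        raise ValueError("unavailable time must lie within the window")
    return 1.0 - unavailable_s / total_s
```

The value of `unavailable_s` depends entirely on the entry/exit criteria used, which is why two layers with different criteria cannot be compared directly.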
> > 
> > Traditionally, unavailability has been defined to occur at the start
> > of a period of 10 consecutive seconds of severe degradation (for the
> > co-cs mode this is defined in terms of 10 consecutive SESs (Severely
> > Errored Second)).  In Y.1711 I have used this same base definition but
> > made things far simpler, based on empirical evidence of what SESs
> > looked like in real networks when I last measured these (though I
> > should add this was quite a few years ago).
> > 
> > The vast majority of p2p applications require connectivity to be
> > available in both directions, else they will not work.  This means
> > that in the p2p case if EITHER direction fails then BOTH directions
> > are deemed unavailable, ie unavailability is an OR function of
> > directivity failure.  An interesting implication of this is that any
> > PM SLA recording for p2p constructs should be suspended in both
> > directions when we have an unavailability event in either direction.
> > There are some rather subtle issues here, including
> > timing/synchronisation at both ends of the trail and 'packets in
> > flight', that can add significant complications. I could explain the
> > rationale here but it would take a while so let me simply define the
> > end-result that one finds in Y.1711 wrt handling this and defining
> > availability......PM purists may not completely agree with what I have
> > done, but I again stress that my main goal was simplification.
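
The OR rule for p2p trails can be written down directly. This sketch uses hypothetical function names and ignores the timing/synchronisation and 'packets in flight' issues mentioned above; it assumes each end already knows the unavailability state of both directions.

```python
def p2p_unavailable(a_to_b_unavailable: bool, b_to_a_unavailable: bool) -> bool:
    """A p2p trail is unavailable if EITHER direction is unavailable."""
    return a_to_b_unavailable or b_to_a_unavailable


def pm_recording_allowed(a_to_b_unavailable: bool, b_to_a_unavailable: bool) -> bool:
    """PM SLA recording is suspended in BOTH directions while the trail is unavailable."""
    return not p2p_unavailable(a_to_b_unavailable, b_to_a_unavailable)
```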
> > 
> > I have defined a short-break parameter in Y.1711 (which is 2nd only in
> > importance to unavailability in terms of customer/service impact) that
> > is purposely aligned with the 3s entry period of a defect (noting all
> > the defect cases are also harmonised in terms of entry/exit criteria,
> > again for simplification purposes, on a 3s period).  In order to make
> > both availability and PM SLA processing very simple, Y.1711 requires
> > that near-end PM SLA recording be turned-off when a defect state is
> > entered.  Further, and again for simplification reasons based on what
> > we saw empirically for real error events, there are no PM metrics
> > included in the definition of defect entry/exit criteria or indeed the
> > unavailability entry/exit criteria....the latter is simply based on a
> > defect state being continuously present/absent for 10 consecutive
> > seconds.  One could use PM metrics (such as some weighted combination
> > of delay/loss/error metrics) in these entry/exit criteria, but IMO the
> > 'accuracy' benefits this might bring (where 'accuracy' can actually be
> > distorted by including these) are simply not worth the
> > complications/cost involved, and I think there are smarter ways to
> > handle these state changes.
> > 
> > In Y.1711 any PM would be recommenced on either (i) (near-end only)
> > exit from the defect state IF we are still in the available state when
> > this happens (noting that this also records a short break event), or
> > (ii) (near-end and far-end) exit from the unavailable state if this
> > state has indeed been entered (where a short-break event would not be
> > recorded on entry).
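
The near-end behaviour described in the last two paragraphs can be sketched as a small per-second state machine. This is an illustration under stated assumptions, not the Y.1711 state machine itself: the class and field names are hypothetical, the input is the already-integrated defect state sampled once per second (the 3s defect entry/exit integration is assumed to happen upstream), far-end processing is omitted, and the sketch does not backdate unavailability to the start of the 10s period.

```python
from dataclasses import dataclass

UNAVAILABLE_THRESHOLD_S = 10  # consecutive seconds of defect present/absent


@dataclass
class NearEndMonitor:
    available: bool = True
    pm_recording: bool = True
    short_breaks: int = 0
    defect_run: int = 0  # consecutive seconds with the defect present
    clear_run: int = 0   # consecutive seconds with the defect absent

    def step(self, defect: bool) -> None:
        """Advance one second given the current defect state."""
        if defect:
            self.defect_run += 1
            self.clear_run = 0
            self.pm_recording = False  # PM is turned off on defect entry
            if self.available and self.defect_run >= UNAVAILABLE_THRESHOLD_S:
                self.available = False  # enter the unavailable state
        else:
            self.clear_run += 1
            if self.available:
                if self.defect_run > 0:
                    self.short_breaks += 1  # defect exit while still available
                self.pm_recording = True    # case (i): resume PM
            elif self.clear_run >= UNAVAILABLE_THRESHOLD_S:
                self.available = True       # exit the unavailable state
                self.pm_recording = True    # case (ii): resume PM, no short-break
            self.defect_run = 0
```

Feeding it a 5s defect records one short-break and resumes PM immediately on exit; a 12s defect enters the unavailable state and PM resumes only after 10 consecutive defect-free seconds.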
> > 
> > This allows us to differentiate and measure PM, short-breaks and
> > unavailability in the most simple but consistent and complete manner
> > possible.  There is much more thought gone into Y.1711 (especially on
> > making stuff simple) than might be apparent to a casual reader.
> > 
> > One final point on OAM activation/deactivation:  The base defect
> > detection/handling CV function must be activated/deactivated in
> > concert with whatever mechanism is used to set-up/take-down the trail
> > (LSP)......so this could be either signalling or management
> > provisioning.  However, I have noted that (the lightweight) BFD OAM
> > assumes (the heavyweight) LSP-Ping OAM is used to activate/deactivate
> > BFD.  Ignoring the need to detect defects and take consequent actions
> > at the trail sink, this is clearly wrong since it would still need
> > synchronising to the signalling/provisioning process anyway.
> > 
> > regards, Neil
> > 
> > _______________________________________________
> > mpls mailing list
> > mpls@lists.ietf.org https://www1.ietf.org/mailman/listinfo/mpls
> 
