[mpls] working group last call on draft-ietf-mpls-oam-frmwk-01.txt
Neil,
Two questions:
1) You say that PM cannot be defined without defining entry and exit criteria for availability. However, in Y.1711 you specify that PM SLA recording be turned off at the entry of the defect state. Doesn't this prove that PM is dependent on an unambiguous definition of defect states, but independent of the definition of availability?
2) Isn't availability only relevant with respect to SLAs? In most networks, services are carried over multiple network layers. E.g., an Ethernet service can be carried over an MPLS-over-SONET-over-WDM core network. Defect conditions need to be defined for each network layer, but availability is only relevant for the end-to-end Ethernet service. Therefore, it seems to me that the SLA itself can provide the definition of availability (including a description of the verification process) and that there is not necessarily a need for a tight coupling with the defect definitions in the underlying network layers.
Peter
mpls-bounces@lists.ietf.org wrote:
> <snipped>
> Adrian and draft authors,
>>
>> Section 3.3 Availability
>> The definition of availability is fine. But what is meant by
>> the following?
>> MPLS has several forwarding modes (depending on the control plane
>> used). As such more than one model may be defined.
>> Does this say that there may be more than one definition of
>> availability? Or more than one way of computing it?
>
> The definition of availability is not fine.....indeed there is no
> *definition* given. I think folks need to start being a little more
> rigorous here. I apologise for repeatedly raising the point that there
> are 3 network modes, but it is rather important and we can't simply
> apply the same treatment to each (and this is more than OAM).
>
> There is also an OAM ordering principle here, viz:
> Mode=>defects=>(un)availability=>PM. Which means that:
> - there is no basis for defining PM until we have defined
> entry/exit criteria for unavailability
> - there is no basis for defining unavailability until we have
> defined entry/exit criteria for defects
> - there is no basis for defining defects until we have identified
> the mode we are dealing with
>
> And just to remind....
> cl-ps - breaks only
> co-cs - breaks and swaps (but ONLY between exactly alike trail
> entities)
> co-ps - breaks, swaps (any trail entities) and misbranching
> {see Note} (from_any into_any trail entities).
>
> There is a common CV requirement theme that runs through all 3 modes.
> CV 'comes for free' in the cl-ps mode as each pkt contains a
> SA....this is why one can only have breaks here (though to test
> connectivity independently of the traffic behaviour one would have to
> consciously inject a test probe, which is what ICMP echo
> request/response does). However, we must consciously also add the CV
> function (ie SA information) in the co-cs and co-ps modes. In the
> co-cs mode the frame structure provides the obvious vehicle for this,
> eg the J0 byte in SDH VC4 trail O/H. In the co-ps we need to
> determine the insertion rate of a CV function/pkt into the data-plane.
>
> {Note - Adrian, I noted you also remarked (snipped here) you did not
> like the term 'misbranching'. I can understand this, and there are
> probably better words to capture the intention here. Misbranching is
> a shorthand way of capturing this type behaviour: Pkts in LSP1 from
> A to B get unintentionally replicated (in some node between A and B)
> and merged into the pkts going from C to D of LSP 2 say. LSP1 is
> blissfully unaware there is a security/integrity problem as its
> traffic is still arriving OK at B. The problem however should be
> detected at the sink of LSP2 (by observing the SA of both LSP1 and
> LSP2 arriving). A spin on this is that LSP1 pkts simply get
> misdirected into LSP2, so the same mismerging still results at LSP2's
> sink, but LSP1 now also shows a break.....but it's not a simple break
> as LSP1's traffic is not black-holing.}
>
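The sink-side check described in the note above can be sketched roughly as follows. This is only an illustration: the function name, the state labels and the idea of collecting the SAs seen during one observation interval are assumptions made for the example, not anything taken from Y.1711 or the draft.

```python
def classify_sink(expected_sa, observed_sas):
    """Classify what a trail sink sees during one observation interval.

    expected_sa  -- the source address (SA) this sink should receive from
    observed_sas -- iterable of SAs actually seen in CV pkts in the interval
    All names and labels here are hypothetical.
    """
    observed = set(observed_sas)
    has_expected = expected_sa in observed
    has_unexpected = bool(observed - {expected_sa})

    if has_expected and not has_unexpected:
        return "ok"        # only the intended source is arriving
    if has_expected and has_unexpected:
        return "mismerge"  # another trail's traffic is merged into ours
    if has_unexpected:
        return "swap"      # only a foreign source is arriving
    return "break"         # nothing is arriving at all


# LSP1 runs A->B and LSP2 runs C->D; LSP1 pkts are misdirected into LSP2.
print(classify_sink("C", ["C", "A"]))  # what the sink of LSP2 (at D) sees
print(classify_sink("A", []))          # what the sink of LSP1 (at B) sees
```

In the misdirection case the two sinks report different conditions for the same underlying fault, which is why the detection has to be done per sink.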
> In both co modes the defect detection/handling must be unidirectional
> and the consequent actions must be taken at the trail sink. This is
> important so that:
> - there is no ambiguity of the direction that has failed
> - we can correctly apply FDI to any affected co clients in the
> syntax required by client mode to suppress alarm storms
> - we can suppress traffic to avoid security/integrity breaches.
> Note: In the co-cs mode FDI is an all 1s signal which *replaces* the
> traffic, so FDI also fulfils the security traffic suppression
> requirement. Not so in the co-ps mode, here both FDI and traffic
> suppression are separate actions that both need to be taken.
>
> Doing the OAM as noted above means that both p2p and p2mp trail
> constructions can be handled correctly and scalably in the same way,
> ie for the p2mp case only the sinks that are downstream of a failure
> in the trail construction will be required to take the consequent
> actions. This means the availability definition (which I'll return to
> shortly) at the sink can be the same in both these cases.
>
> I have no idea how one can handle mp2p constructions in any meaningful
> way wrt the above observations, ie how would one define any defects
> here in terms of entry/exit criteria and sensible consequent actions?
> And if we can't do this then I can't see how we can define
> (un)availability (or indeed PM) for such constructs. A similar
> observation would apply to the use of ECMP.
>
>
> In addition to the CV function and the FDI function there is one
> further basic OAM function that can be defined. This is the BDI
> (Backward Defect Indication). This function is not mandatory for
> base defect detection/handling (unlike CV and FDI for the reasons
> noted above), but it is very useful when one wants to have visibility
> of the defect and availability behaviour of both directions from a
> single end. Folks familiar with Y.1711 will be aware that both
> near-end and far-end state processing is discussed that covers this
> aspect.
>
> When we define SLAs these have 2 parts: An (un)availability part that
> defines the fraction of total time that the network (for which we can
> also read 'service') is ('down' or) 'up', and a PM (Perf Mon) part
> that defines the (limiting degree of) transfer performance impairments
> whilst the network (service) is in the 'up' state. It is therefore
> essential that we have agreed definitions of what constitutes
> entry/exit of the unavailable state. Further, in order that we can
> compare apples with apples we require a strong degree of
> harmonisation across ALL layer networks in this respect. Remember,
> we can create a client/server relationship where layer network X is
> carried over layer network Y (as is the case with PWs) and it is
> therefore obvious (I hope) why such harmonisation is important.
>
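A minimal sketch of the two SLA parts described above, assuming one record per second with an up/down flag and packet counters. The record layout, the function name and the use of packet loss as the PM metric are all assumptions for illustration.

```python
def sla_report(seconds):
    """Return (availability, loss_ratio) for a list of per-second records.

    Each record is a tuple (up, sent, lost):
      up   -- True if the service was in the 'up' state for that second
      sent -- pkts sent in that second
      lost -- pkts lost in that second
    PM (here: loss ratio) is only accumulated over 'up' seconds.
    """
    total = len(seconds)
    up_seconds = [s for s in seconds if s[0]]
    availability = len(up_seconds) / total if total else None

    sent = sum(s[1] for s in up_seconds)
    lost = sum(s[2] for s in up_seconds)
    loss_ratio = lost / sent if sent else None
    return availability, loss_ratio


records = [(True, 100, 1), (True, 100, 0), (False, 100, 100), (True, 100, 1)]
availability, loss_ratio = sla_report(records)
print(availability, loss_ratio)
```

The 100 pkts lost during the 'down' second count against the availability part only; they do not enter the loss ratio.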
> Traditionally, unavailability has been defined to occur at the start
> of a period of 10 consecutive seconds of severe degradation (for the
> co-cs mode this is defined in terms of 10 consecutive SESs (Severely
> Errored Second)). In Y.1711 I have used this same base definition
> but made things far simpler, based on empirical evidence of what SESs
> looked like in real networks when I last measured these (though I
> should add this was quite a few years ago).
>
> The vast majority of p2p applications require connectivity to be
> available in both directions, else they will not work. This means
> that in the p2p case if EITHER direction fails then BOTH directions
> are deemed unavailable, ie unavailability is an OR function of
> directivity failure. An interesting implication of this is that any
> PM SLA recording for p2p constructs should be suspended in both
> directions when we have an unavailability event in either direction.
> There are some rather subtle issues here, including
> timing/synchronisation at both ends of the trail and 'packets in
> flight', that can add significant complications. I could explain the
> rationale here but it would take a while so let me simply define the
> end-result that one finds in Y.1711 wrt handling this and defining
> availability......PM purists may not completely agree with what I
> have done, but I again stress that my main goal was simplification.
>
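The OR rule for p2p trails described above can be illustrated with a short sketch. It assumes the per-second unavailability of each direction is already known and time-aligned, so it deliberately ignores the synchronisation and 'packets in flight' issues mentioned; the function name is hypothetical.

```python
def p2p_unavailable(forward, reverse):
    """Combine per-second unavailability of the two directions of a p2p trail.

    forward, reverse -- lists of booleans, True = that direction unavailable.
    The trail as a whole is unavailable if EITHER direction is, and PM SLA
    recording would be suspended in both directions for those seconds.
    """
    if len(forward) != len(reverse):
        raise ValueError("both directions must cover the same seconds")
    return [f or r for f, r in zip(forward, reverse)]


fwd = [False, True, False, False]
rev = [False, False, True, False]
print(p2p_unavailable(fwd, rev))
```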
> I have defined a short-break parameter in Y.1711 (which is 2nd only in
> importance to unavailability in terms of customer/service impact) that
> is purposely aligned with the 3s entry period of a defect (noting all
> the defect cases are also harmonised in terms of entry/exit criteria,
> again for simplification purposes, on a 3s period). In order to make
> both availability and PM SLA processing very simple, Y.1711 requires
> that near-end PM SLA recording be turned off when a defect state is
> entered. Further, and again for simplification reasons based on what
> we saw empirically for real error events, there are no PM metrics
> included in the definition of defect entry/exit criteria or indeed the
> unavailability entry/exit criteria....the latter is simply based on a
> defect state being continuously present/absent for 10 consecutive
> seconds. One could use PM metrics (such as some weighted combination
> of delay/loss/error metrics) in these entry/exit criteria, but IMO the
> 'accuracy' benefits this might bring (where 'accuracy' can actually be
> distorted by including these) are simply not worth the
> complications/cost involved, and I think there are smarter ways to
> handle these state changes.
>
> In Y.1711 any PM would be recommenced on either (i) (near-end only)
> exit from the defect state IF we are still in the available state
> when this happens (noting that this also records a short break
> event), or (ii) (near-end and far-end) exit from the unavailable
> state if this state has indeed been entered (where a short-break
> event would not be recorded on entry).
>
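The state handling described above can be sketched as a small per-second classifier. This is a simplified, near-end-only illustration and not the Y.1711 text: it takes the defect state as an input (the 3s defect entry/exit criteria are assumed to be applied already), it assumes the trail starts in the available state, and it assumes the 10 qualifying seconds are counted as part of the new state on both entry and exit. All names are hypothetical.

```python
def classify_seconds(defect, n=10):
    """Return (unavailable, pm_on, short_breaks) for a per-second defect series.

    defect -- list of booleans, True = defect state present in that second
    n      -- consecutive seconds needed to enter/exit the unavailable state
    """
    unavailable = [False] * len(defect)
    state_unavail = False  # assumption: start in the available state
    run = 0                # consecutive seconds contradicting current state

    for t, d in enumerate(defect):
        # A defect second contradicts 'available'; a clear second
        # contradicts 'unavailable'.
        run = run + 1 if d != state_unavail else 0
        if run == n:
            state_unavail = not state_unavail
            # Assumption: the state change is backdated to the start of
            # the run of n seconds.
            for k in range(t - n + 1, t):
                unavailable[k] = state_unavail
            run = 0
        unavailable[t] = state_unavail

    # PM is recorded only when there is no defect and the trail is available.
    pm_on = [not d and not u for d, u in zip(defect, unavailable)]

    # A short break is a defect state that clears while still available.
    short_breaks = sum(
        1
        for t in range(1, len(defect))
        if defect[t - 1] and not defect[t] and not unavailable[t - 1]
    )
    return unavailable, pm_on, short_breaks


series = ([False] * 5 + [True] * 4 + [False] * 5 + [True] * 12
          + [False] * 3 + [True] * 2 + [False] * 12)
unavailable, pm_on, short_breaks = classify_seconds(series)
print(sum(unavailable), sum(pm_on), short_breaks)
```

In the example series the 4s defect is recorded as a short break, while the 12s defect starts an unavailable period that only ends once the defect state has been absent for 10 consecutive seconds.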
> This allows us to differentiate and measure PM, short-breaks and
> unavailability in the most simple but consistent and complete manner
> possible. Much more thought has gone into Y.1711 (especially on
> making stuff simple) than might be apparent to a casual reader.
>
> One final point on OAM activation/deactivation: The base defect
> detection/handling CV function must be activated/deactivated in
> concert with whatever mechanism is used to set-up/take-down the trail
> (LSP)......so this could be either signalling or management
> provisioning. However, I have noted that (the lightweight) BFD OAM
> assumes (the heavyweight) LSP-Ping OAM is used to activate/deactivate
> BFD. Ignoring the need to detect defects and take consequent actions
> at the trail sink, this is clearly wrong since it would still need
> synchronising to the signalling/provisioning process anyway.
>
> regards, Neil
>
> _______________________________________________
> mpls mailing list
> mpls@lists.ietf.org
> https://www1.ietf.org/mailman/listinfo/mpls