The MPLS WG Archive
MPLSOAM BOF meeting draft minutes
Hi Curtis...I thought you said you were ceasing to add to this thread? A few nits below.

Curtis observed 18 December 2001 15:02:
> The rate should be configurable. If you want to fix it at one second
> in your network you can still do so. We will never get useful
> performance measurement through sending probe packets unless our
> performance criteria are abysmally poor. The T3-NSFNET circa 1994 used
> 10^-4 packet loss (0.01% loss) as the target. That would require at
> least a multiple of 10,000 seconds, or almost three hours, to have any
> confidence at all of catching loss at that rate. Even 0.1%
> loss (fair IP service, not good, not terrible either) would require
> measurement over a multiple of 1,000 seconds (which is actually done
> for some SLA checks). Low rate ping or OAM CV is just a connectivity
> check. Probe counting is a poor measurement of performance.

NH=> I am really confused here why performance metrics (pkt loss in this case) are being lumped in with defect handling. The order of attack our Ops people want is:
- automatic/simple defect handling (and our management wants to see reduced Opex through de-skilling/automation);
- availability measurements (which are based on *agreed/standardised* defect persistencies...ergo defects need agreement/standardising 1st)....this is *the* number one metric in the eyes of customers....and I am aware of customer criticism of operators here in the way operators provide woolly SLA definitions (simply because they have no choice given the current lack of standards in this area, and esp if they use LDP mp2p LSPs as a server LSP level for BGP4 client LSPs wrt rfc2547 VPNs);
- performance metrics measured on an ad hoc basis (measuring these continuously is very rarely sensible or cost-effective...any technology). Perf metrics only apply to the available state....when the LSP is down, performance aggregation must cease.
So as you can surely see (even if you don't agree that operators really want this...though I keep telling you we do!), if I want to offer consistent SLA metrics across/between vendors/operators there is a logical ordering of the way things have to be addressed.

Let me now return to your pkt loss problem. If I want to measure a given loss ratio with fair statistical confidence I have to observe of the order of >=10 loss events, so that is 10 times the inverse of the loss ratio in total pkts sent, ie for a 10^-4 loss ratio we require of the order of 10^5 pkts to be observed. We can't relate this to some time period, as it's a function of the pkt rate.....which may even be zero, irrespective of the potential network pkt rate, if the customer is quiescent. But one of the things I might want to do is to be sure I am not accruing a 'false' loss metric because I failed to stop pkt counting on a change of availability status.

For example, let's take the simplest customer service....some constant pkt rate connection, say, with no quiescent behaviour. Let's assume I have some availability SLA and some simple pkt loss SLA of 10^-6, say, with that customer. Let's also assume I make a 1s error of judgement in stopping the pkt loss counters when the connection goes unavailable (I am assuming this has been defined of course). If the customer pkt rate is X pkts/s, then I have incorrectly accrued X pkts as 'lost' when in fact they should not have been counted as such. I therefore now need a period (when available again) of 10^6 s of 'pkt loss-free time' to recover from this error of judgement (ie X/(X x 10^6) = 10^-6, to get back to seeing a <= 10^-6 pkt loss SLA objective). 10^6 s is 11.6 days....which is starting to look a bit silly, and at a 10^-9 loss rate we are talking of the order of 32 loss-free years, which is clearly an absurdity for an SLA.
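The arithmetic above can be cross-checked with a minimal Python sketch. The function names are purely illustrative (they come from no draft or tool), and the sketch assumes a constant, non-zero pkt rate, which as noted does not hold for a quiescent customer. The default of 10 loss events is the rule of thumb used above; Curtis's figures correspond to a single expected loss at one probe per second.

```python
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def packets_needed(loss_ratio, min_loss_events=10):
    """Packets to observe so that about min_loss_events losses are expected."""
    return min_loss_events / loss_ratio


def observation_seconds(loss_ratio, pkts_per_second, min_loss_events=10):
    """Time to observe packets_needed() packets at a constant, non-zero rate."""
    return packets_needed(loss_ratio, min_loss_events) / pkts_per_second


def recovery_seconds(loss_ratio, miscounted_seconds=1.0):
    """Loss-free time needed to absorb packets wrongly counted as lost.

    At X pkts/s, X * miscounted_seconds packets are wrongly counted as lost.
    Over T further loss-free seconds the accrued ratio is roughly
    (X * miscounted_seconds) / (X * T), ignoring the miscounted interval
    itself in the denominator. X cancels, so T >= miscounted_seconds / loss_ratio.
    """
    return miscounted_seconds / loss_ratio


if __name__ == "__main__":
    # Probe case: one probe per second, a single expected loss at 10^-4.
    hours = observation_seconds(1e-4, 1.0, min_loss_events=1) / SECONDS_PER_HOUR
    print(f"10^-4 at 1 probe/s, 1 expected loss: {hours:.2f} hours")

    # Ten-loss-event rule of thumb for the same loss ratio.
    print(f"10^-4, >=10 loss events: {packets_needed(1e-4):.0f} pkts")

    # Recovery time after a 1 s error in stopping the loss counters.
    print(f"10^-6 SLA: {recovery_seconds(1e-6) / SECONDS_PER_DAY:.1f} days")
    print(f"10^-9 SLA: {recovery_seconds(1e-9) / SECONDS_PER_YEAR:.1f} years")
```

Note that the pkt rate X does not appear in recovery_seconds at all: the recovery time depends only on the SLA loss ratio and on how long the counters were wrongly left running.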
I am starting to digress here, but since you raised the issue of QoS metrics I discussed the above so that you can see these 2 points being brought out:
- a pkt loss metric, and the measurement of such, can become a meaningless quantity unless it is related, in a well-defined consistent manner, to a prior understanding of the available state....ergo the need to sort out defects, then availability, then QoS in that order....no other ordering makes much sense;
- to keep things as simple as possible, one should not use any QoS metrics in some OR'd combinatorial sense as a criterion for defect (or availability) entry/exit, ie customer activity should be decoupled from these. Ignoring this requirement invites costs/complexities that are unmanageable IMO....and this is why it was one of the key principles laid down for OAM in the original harrison draft, now expired....though this, along with other key principles, can be found in draft Rec Y.1710.

>
> > - you can decouple source/sink processing (might not be obvious
> > why.....but here are some pointers: CV is unidirectional (for good reasons
> > as explained why by several people already). CV generation is trivially
> > easy. Some LSPs don't need CV *monitoring*, ie unimportant ones (possibly
> > potential candidates for Ping). CV can detect more than simple breaks, like
> > various misbranching/merging/config cases and immediately identify
> > offender.....and if you have some important customers whose traffic
> > integrity needs protection this might be considered quite important)
>
> Your first point is regarding efficiency. There is no strong argument
> for putting the load on the egress. Not having any feedback to sender
> is a poor design. We can agree to disagree on this.

NH=> No, you still don't get it I am afraid. Defects on CO trails can be unidirectional and their detection/handling needs to follow this, ie form follows function.
The egress is the only point that can correctly execute the required behaviour. Let's take an example to ram home the point. Suppose I have an ATM_over_MPLS service (this will be an important requirement along the road to convergence on MPLS, so we have to get this right). If I get an MPLS layer failure I have to tell the ATM layer of this so that I don't get unnecessary alarms raised against the ATM layer networks (VP or VC level). This is important even in a single management domain, but it is essential must-have behaviour when the ATM client belongs to a different management domain than the MPLS server. For unidirectional failures this has to use an FDI mapping (from the MPLS layer at the point of defect detection) to ATM AIS cells (their FDI). This cannot be done (and is a meaningless action anyway) by telling the MPLS ingress. The MPLS ingress can, however, be told by the MPLS LSP trail termination point where the defect is being handled, via the BDI mechanism (useful for single-ended monitoring of both-direction availability or, if needed, as a trigger for prot-sw)....note this functionality also recurses in the ATM layer(s), but here it *must* be handled by their trail termination functions. This is all very well known stuff in transport networks and if we are going to use MPLS in such a vein (ie X_over_MPLS) MPLS has to have the right OAM functional behaviour. BTW - LSP-Ping does not help you with these requirements.

>
> Your second point is regarding another type of failure. One traffic
> flow may show up for no good reason at the egress of the wrong LSP
> carrying very important traffic. If the former LSP is not monitored,
> no MPLS-OAM traffic is sent and it is not detected

NH=> Not so hasty....please recall my earlier point about decoupling source generation and sink processing of CV. One may still generate CV on non-important LSPs (since it's a trivial thing to do) even if it is not processed at their sink.

> and no LSP Ping is done so it is not detected.
> If that LSP is monitored, MPLS-OAM
> detects the error as traffic showing up at an egress where it doesn't
> belong and LSP Ping detects it as the wrong egress in the return
> address on the ICMP reply. LSP Ping may need a less frequent UDP
> probe that identifies the LSP and sends that back to the ingress to
> ensure that the packet not only reaches the correct egress LSR but
> also emerges on the correct LSP.

NH=> I agree that LSP-ping may indeed be used to help *diagnose* (but not automatically handle) such mismerging defects in an IP network. But, as noted above, it's of no use for the more general case of X_over_MPLS. Further, we want the same solution irrespective of the client layer supported or any control-plane used (including none). LSP ping is restrictive in this sense, and I would prefer not to have a range of different solutions that are application/case specific....though I would not wish to stop those who think this is a good idea.

>
> Current hardware does not remember the prior label after it is popped,
> so there may be a problem identifying the LSP after inspecting what is
> underneath it.

NH=> Not for CV....this carries the unique LSP TTSI (Trail Termination Source Identifier).

> For that we may need to limit TTL, recognize that the
> bottom label is special (whether MPLS-OAM or IPv4 Explicit Null) and
> process according to the banished TTL expired rules.
>
> Maybe what we really need is to resurrect Ron Bonica's MPLS ICMP TTL
> exceeded draft and specify that for non-IP LSP, an IPv4 Explicit Null
> is used in the bottom (with an ICMP echo reply).

NH=> Is this the one that was thrown out on the basis of layer violations? I can't understand why you are so desperate to cling to an IP layer OAM function like ICMP for applications for which it is clearly not appropriate, and for which it does not provide the required functionality. Why can't you just accept that CV is a good solution?

> You keep sending
> larger TTL getting ICMP TTL exceeded until you get an ICMP echo reply.
> This would trace the data path including labels for non-IP LSPs but
> may raise objections on religious grounds. Then again, we could just
> do it. [Ron - would you mind adding this usage and respinning as
> draft-ietf-mpls-icmp-03.txt? We can try for updated radioactive
> historic banished RFC status.]
>
> > - CV rate needs to be fixed *if* you want standardised defects
> > (entry/exit and their consequent action handling, like sending FDI to
> > suppress client layer alarms....did you see my analysis of ATMoverMPLS last
> > night?) and, based on this, standardised availability criteria....if you
> > vary the rate these metrics vary too....ergo no consistent behaviour from
> > LSP to LSP and no chance to create consistent SLAs. Some operators might
> > find some of these reasons important for certain customers and none of this
> > has been considered so far in Ping.
>
> I'd hate to be taking corrective action based on a one packet per
> second probe.

NH=> I don't understand this comment. Is it perhaps somehow related to the requirement to take corrective actions (like prot-sw) as fast as SDH (I hear some want to do this)....now this really is quite silly, as one will be taking actions for error burst events (even lower level prot-sw events) that should be ignored. At the MPLS level I would say prot-sw of the order of 3s is quite sufficient....there are very few real applications that need action faster than this.

> If the send rate is configurable there is nothing to
> stop you from setting it to one second everywhere in **your** network.

NH=> Gee thanks.....but I don't want the solution, period; it does not meet our requirements irrespective of this.

>
> > - CV needs to be unidirectional (many reasons) but in context of
> > previous point availability/QoS have to be monitored in a unidirectional sense.
>
> It doesn't **need** to be unidirectional.
> We can agree to disagree as
> to whether the advantages of feedback to the sender are more important
> than probe traffic being unidirectional.

NH=> I have tried to explain, but if you don't agree with it then that's that I guess. So you stick with your solution and let us have ours. At least that seems fair.

>
> > - etc
> > So it depends on application. I would never consider monitoring all the
> > PSTN connections in our telephony network....I'd use a reactive tool like
> > Ping (we actually use a form of sampling in practice to spot latent faults).
> > This is good enough, and appropriate, for this network. On more important
> > connections (like current leased lines/VPNs) we want the ability to monitor
> > the trails since they have stringent SLAs associated with them. I would
> > never consider *sink* processing of CV on all the LSPs that carry
> > BE/Internet traffic.....but I might want it on important LSPs, like VPNs,
> > LSP transit services, XoverMPLS emulation services, etc. CV has been
> > designed very much with how operators have needed to provide services to
> > important customers/applications. It's horses for courses....does that help
> > or not? Maybe we should have both mechanisms and let customers (ie
> > operators) choose which one they want?
>
> Monitoring every active PSTN pair passing through a major junction or
> every pair of IP addresses on the planet is infeasible so you don't need
> to mention that you don't plan to do it.
>
> Most IP providers today that use MPLS for BE Internet traffic **do**
> currently send pings through all of those LSPs to make sure they stay
> up. There are only a small number of those. It is VPN and XoverMPLS
> that generate a large number of labels. It may be those VPN and
> XoverMPLS LSPs that need to be monitored end to end. If you plan to
> succeed with the service you need to expect a large number of them.
> Specifying protocols with the ability to scale is an explicit goal
> stated by the IAB.

NH=> I agree.....but I thought we'd done that?

regards, Neil