The MPLS WG Archive: draft-nadeau-ip-basedtool-requirements-01
Dave,
Good comments. Rather than repeat yours where I agree, I'll just augment
and add some new ones.....please see in-line below.
regards, Neil
> -----Original Message-----
> From: David Allan [mailto:dallan@nortelnetworks.com]
> Sent: 12 November 2002 18:42
> To: Thomas D. Nadeau; mmorrow@cisco.com; swallow@cisco.com
> Cc: mpls@UU.NET
> Subject: draft-nadeau-ip-basedtool-requirements-01
>
>
> Tom/Monique/George:
>
> Interesting document, a "few" comments though ;-)
NH=> 'Abstract': Great to see others now taking fault-management of MPLS
more seriously. Thanks.
NH=> 'Section 1, Introduction': I know quite a bit of what is in here is
drawn from the requirements/principles given in Y.1710. Reference to this
work out of courtesy would seem in order.
>
> Section 2:
> "Any error condition that prevents an LSR"....do you mean LSP here?
NH=> The focus should be on traffic-affecting defects. Foremost here are
all the potential defects that could affect the data-plane LSP, and these
should be:
- identified 1st
- then defined wrt entry/exit criteria and consequent actions that
need to be applied, esp at the sink points of the LSPs.
The need for this in its own right is obvious I think, but if one then wants
consistent behaviour across vendors/operators and also an ability to define
SLAs (for both availability and QoS metrics) this is a fundamental
requirement. This aspect is still missing.
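A minimal sketch of what "entry/exit criteria and consequent actions" could look like for one defect type at an LSP sink. The thresholds, the class name and the action names are illustrative assumptions, not values taken from the draft or from Y.1710/Y.1711:

```python
from dataclasses import dataclass, field


@dataclass
class DefectMonitor:
    """Tracks one defect type at an LSP sink via entry/exit criteria."""
    entry_threshold: int = 3  # assumed: consecutive missed probes to enter defect
    exit_threshold: int = 3   # assumed: consecutive good probes to exit defect
    in_defect: bool = False
    actions: list = field(default_factory=list)  # consequent actions, in order
    _missed: int = 0
    _good: int = 0

    def observe(self, probe_received: bool) -> None:
        """Feed one expected-probe interval into the state machine."""
        if probe_received:
            self._good += 1
            self._missed = 0
            if self.in_defect and self._good >= self.exit_threshold:
                self.in_defect = False
                self.actions.append("clear_alarm")  # hypothetical action name
        else:
            self._missed += 1
            self._good = 0
            if not self.in_defect and self._missed >= self.entry_threshold:
                self.in_defect = True
                self.actions.append("raise_alarm")  # hypothetical action name
```

If every vendor used the same thresholds and actions, two sinks fed the same probe sequence would declare and clear the defect at the same instants, which is the consistency being asked for.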
>
> Section 3:
NH=> I only counted 4 operators in the acknowledgements who have contributed
(BT being one of them). I would hesitate to say we have 'much experience
running MPLS networks' in all cases....I suspect the primary experience of
most operators is with rfc2547 and LDP, and low/zero experience in the
RSVP-TE and/or PWE3 XoverMPLS areas.
>
> 'a' this munges together detection and diagnosis. Shouldn't
> path liveliness
> and path tracing be separated out as separate requirements?
> (They seem to be
> synonymous in this document).
NH=>It's not just detection and diagnosis (though I agree these are distinct
issues).....between these 2 we need to have automatic consequent actions
where appropriate.
>
> 'b' Are you inheriting all the requirements of the GTTP requirements
> document by reference? There is a lot of stuff in there that
> is more generic
> and may not be applicable. Is there a specific difference
> between tunnel
> trace and path trace in this document?
NH=> I'd like some clarification here too. LSP PING has changed
significantly from its initial incarnation and now includes path tracing
capabilities (including attempting to exercise ECMP subnetworks). So how does
this now sit with the other 'TUNTRACE' and 'GTTP' references that are given,
eg will the latter pair exclude the MPLS case so as not to create overlap
with LSP PING, or is something else intended?
>
> 'c' The phrase "automation of path tracing..." are you really
> discussing
> misbranching/mismerging detection? My read of this is that I
> am required to
> periodically perform traceroute on LSPs. I doubt that is the
> actual intent.
NH=> I was also unclear as to intent here. Surely path trace is an
'on-demand' diagnostic function? The thing we need running continuously is
automatic defect detection and handling.
>
> 'd' [LSPPING] for CE to CE verification. Besides violating the
> "non-prescriptive" claim in the introduction, I didn't think
> MPLS went CE to
> CE ;-) The wording seems to confuse path tracing as a
> response to failure
> detection, and the mechanism for failure detection as
> outlined in 'b'. Also
> some requirements seem to do with the tools, and others with
> respect to box
> behaviors (e.g. suggesting a nodal response to failure
> detection), it might
> be cleaner simply to stick to what the tools should do, and
> let folks roll
> up the system behaviors into applications themselves.
NH=> IP layer PING is testing the IP layer. That is fine and we want this.
But we also want mechanisms for monitoring the MPLS layer(s) that are
independent of the client, and consistent across all clients, eg PWE3 case.
Take the logic further....I would not be expecting to use an IP PING
mechanism to test a SDH VC4 path. Each layer needs independent fault
handling mechanisms, else no layer independence (which impacts consistency
of usage for any client and ability to evolve layer networks in isolation).
>
> 'e' If I understood 'b' correctly, is this not the same thing?
>
> 'f' Requirements 'f', 'l' and 'm' seem to be largely
> overlapping, although I
> cannot actually parse 'l', and may be interested in latency
> for reasons
> other than SLA.
NH=> Agree with goals stated on being able to say something about QoS
performance, but unclear how to get there. I cannot see how we can progress
to any meaningful discussion of QoS metrics unless we have first
agreed/specified:
- defects (entry/exit criteria and consequent actions)
- up/down transition states.....since all QoS metrics only have any
currency when the entity considered is 'up'.
These aspects are all addressed for the case of p2p LSPs in Y.1711.
**BTW - what is happening wrt the various MIBs here?**....I come back on
this important observation later.
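To illustrate the point that QoS metrics only have currency while the entity is 'up', here is a rough sketch that computes availability over all measurement intervals but loss only over the 'up' ones. The interval tuple layout and the function name are assumptions made for the example:

```python
def availability_and_loss(intervals):
    """intervals: list of (is_up, pkts_sent, pkts_received) per interval.

    Availability is taken over all intervals; the loss ratio is taken
    only over intervals in which the LSP was declared 'up'.
    """
    if not intervals:
        return 0.0, 0.0
    up = [(sent, rcvd) for is_up, sent, rcvd in intervals if is_up]
    availability = len(up) / len(intervals)
    sent = sum(s for s, _ in up)
    received = sum(r for _, r in up)
    loss_ratio = (sent - received) / sent if sent else 0.0
    return availability, loss_ratio
```

Without agreed up/down transitions, packets dropped during an outage would be counted as QoS loss by one party and as unavailability by another.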
>
> 'g' what is a "device" in this context, and that it only "may" take
> corrective action runs counter to the first sentence
> suggesting the network
> MUST self-heal. The general requirement is rather high level.
> I've preferred
> the wording that "the network detect faults, and may
> facilitate automated
> response to restore service (e.g. via fault notification or
> whatever)".
NH=> Agree with above observations. Consequent actions are more than just
invoking restoration.....one has to specify all the defects and decide on
the appropriate consequent actions in each case to make the required
behaviour clear. For example, a simple break is quite different (in
required consequent actions) to traffic leaking (ie replicated) from one LSP
into some other LSP....in latter case there is no observable defect on the
'offending' LSP, but it does create a potential security compromise for that
traffic.
>
> 'h' Agree
>
> 'i' What OAM functions are common to how both P2P and MP2P
> are instrumented?
> Presumably one solution that meets the requirement is to
> overlay P2P on
> MP2P....
NH=> This is a fundamental point....worth explaining so everyone can grasp
the inherent problem we have to live with here. co pkt-sw networks can have
connectivity failure modes that are not relevant to cnls networks....the
*network-unique* address in each/every cnls IP pkt provides the protection,
but this is not present with only *link-connection-unique* labels which is
the way co pkt-sw networks do forwarding. 'Merging' as a concept does not
even exist in IP/cnls networks, here we have multiplexing, ie each/every pkt
is always uniquely identifiable as to source and can therefore always be
readily selected from any mixed stream of IP pkts. mp2p merging is a weird
thing to do when using a co pkt-sw forwarding paradigm (ie where only
link-connection-unique labels are used)....and from a fault-management or
QoS measurement perspective p2p is OK (=trivial) and p2mp is OK (=reasonably
simple) but mp2p is 'difficult'. If you do mp2p using a co pkt-sw paradigm
you *must* run some form of p2p above this to remove the effect of the
merging. In a PWE3 XoverMPLS sense this is why one must have 2 labels in
Martini et al, and even in rfc2547 L3 VPNs there are both p2p BGP4
instantiated LSPs (to resolve to VPN....*not* the specific PE source, since
it has been decided that the label issued by given PE to all other PEs is
the same) and above this there must be IP pkts (to resolve to the precise
destination required within identified VPN). If one could somehow monitor
the correct connectivity and the QoS of these 'always must have' higher
level p2p constructs one could then have some form of detecting failures at
any server level....even to the optics/duct, though 'client proxying for
missing server OAM functionality' should not be a general rule...and thus
one would only need to have ad hoc on-demand tools to investigate the mp2p
constructs once the higher p2p constructs had detected *and* handled (ie
apply appropriate consequent actions to) the defect. {Note - if L1 shows a
fault then obviously no need to even investigate the mp2p anyway} Worth
thinking about IMO....though I suspect not simple to address in all cases.
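A toy sketch of why a p2p construct is needed above an mp2p LSP. All label values and packet counts are made up for illustration; 'merged' stands for the mp2p label seen at the egress and 'p2p' for an inner label assumed to be unique per ingress:

```python
from collections import Counter


def counts_at_egress(arrivals, label_field):
    """Count arriving packets by the chosen label, as the egress sees them."""
    return Counter(pkt[label_field] for pkt in arrivals)


# Hypothetical arrivals at the egress of one mp2p LSP fed by two ingresses.
arrivals = (
    [{"merged": 17, "p2p": 101}] * 98    # from ingress A, which sent 100
    + [{"merged": 17, "p2p": 102}] * 50  # from ingress B, which sent 50
)

by_merged = counts_at_egress(arrivals, "merged")  # one aggregate count only
by_p2p = counts_at_egress(arrivals, "p2p")        # one count per ingress
```

With only the merged label the egress sees a single total and cannot attribute the two lost packets to either ingress; with the inner p2p label it can compare each ingress's sent count against its own received count.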
>
> 'j' In general agree, with the proviso that some
> synchronization of tool
> usage/frequency is required for availability metrics, e.g.
> the network needs
> to function as a system and the OAM functions harmonized
> across the system.
NH=>Understand sentiment. Also depends on whether we are discussing ad hoc
diagnostic tools or continuous defect detection/handling tools. However, if
one wants consistent metrics (say availability) this will be hard to achieve
unless the behaviour of the defect detection mechanism is specified to be
the same in all cases. So where strong SLAs are *not* required then
variable OAM behaviour may be fine, but where strong SLAs are required (be
they directly with a customer or perhaps another operator in some
client/server or tandemed interworking case) then it seems important that
all parties would want to have a consistent behavioural view. Currently the
pressure is on to get 'anything' from an Ops-only perspective.....but it's
unlikely to remain this way IMO, ie more consistency will be required over
time, so we need to be thinking about this aspect.
>
> 'k' The wording seems to undermine requirements 'h' and 'r'.
> IMHO this seems
> to highlight one problem with instrumenting load balancing
> (as per ECMP
> etc.) in general. The approach is that all paths need to be
> tested, the
> specific implication is that they all need to be able to be
> tested from any
> point. That I do not think is sustainable. If OAM probes are
> to follow the
> same path as the user traffic, then ECMP should function
> independently of
> both the OAM probes and the user traffic. Otherwise we're
> into the "monkeys
> and typewriters" verification model as the OAM probes try various
> permutations to impersonate user traffic's ECMP
> characteristics (which by
> definition is proprietary and therefore unknowable to the
> probe origin).
NH=> Well if you don't have explicitly controlled routes and you need to
load share (to get a form of TE to avoid the SPF IGP route choice) this is
an unavoidable consequence. If we were doing consistent defect detection of
some higher level p2p entities this may not be such a big problem.....that
is, if fault shows up on higher level entity then start investigating ECMP
behaviour using ad hoc on-demand tools. However, if we don't/can't do this
then one will have some issues here as you note Dave.
>
> 'l' See 'f'.
>
> 'm' This one is not clear to me, and sort of suggests using
> bursts of OAM
> packets to characterize LSP performance, a clarification please..?
NH=> I read this as 'compare pkts 'in' with pkts 'out' at source/sink of LSP
against some metric, eg delay, loss, errors'. I can see this is obvious in
p2p case, also obvious in p2mp case, but less obvious in mp2p case.....is
this what you meant by asking for further clarification, because I'd like to
understand this aspect?
>
> 'n' The actual example is rather trivial as all the LSPs
> discussed terminate
> in a single box. What is required is alarm suppression in all clients
> regardless of where the client LSP, PW, VCC, VPC etc. terminates.
NH=> Correct.
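A minimal sketch of alarm suppression across client layers wherever they terminate: only defects with no defective server beneath them are reported. The entity names and the `server_of` mapping are hypothetical, and the client/server relationships are assumed to be acyclic:

```python
def alarms_to_report(defects, server_of):
    """Return the defects that are root causes.

    defects:   set of entities currently in defect
    server_of: maps each client entity to the server entity carrying it
    """
    reported = set()
    for entity in defects:
        parent = server_of.get(entity)
        suppressed = False
        # Walk down the server chain; any defective server suppresses us.
        while parent is not None:
            if parent in defects:
                suppressed = True
                break
            parent = server_of.get(parent)
        if not suppressed:
            reported.add(entity)
    return reported
```

For example, with a VCC carried on a PW carried on an LSP, a defect on the LSP would leave only the LSP alarm standing, even if the PW and VCC terminate in different boxes.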
>
> 'o' This seems orthogonal to the other requirements, and seems
> relatively
> hard not to satisfy. In other words, why is it here?
NH=> I see what you mean Dave, but I think the MIB relationships are
actually very important. I have some concerns they may not have taken all
the users' requirements into account....after all they seem to have been done
*before* the OAM work has been concluded! Further, there are different
user-perspectives on the element collected statistics, for example:
- Ops would be concerned with day-day fault management....so the
nature of the network embedded OAM functions determine the efficacy of the
element/MIB parameters collected and acted on; that is, if the inherent
network functions provide poor information then any subsequent NMS/OSS
manipulation cannot improve it
- Design/planning functions work over longer periods and are concerned
with TE and growing/changing network topologies according to utilisation
observed and forecasts (new traffic) given
- Products/services functions need to bill and correlate SLA
availability/QoS performance over some period against which the metric
objectives are assessed
>
> 'p' Seems prescriptive. IMHO detection of, and limitation of
> damage/disruption done by DOS attacks is the requirement. "How" is for
> another day.
>
> 'q' OAM interworking for fault notification is only a part of
> the problem.
> The client may be monitored, therefore timeliness of this
> stuff matters as it
> either needs to harmonize with the client OAM, or the client
> OAM needs to be
> configured to "hold off" before reacting. Which also implies
> a requirement
> for bounded detection times, which is one nit I have with
> current proposals
> for instrumentation of ECMP and other load sharing mechanisms
> (one offshoot
> being the "guidelines for load balancing" draft to be
> presented later on the
> WG agenda).
NH=> The requirement as stated is generally OK with me, and it's important.
However, the key to being able to do this properly is having harmonised and
well specified defect handling behaviour across all layers independently.
It's no good specifying apples at layer X if some (client) layer Y is
measuring oranges.....and esp if the relationship between the 2 is variable
(see comments against point 'j' above for example) and not consistent for
all parties even at layer Y.
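A rough sketch of the "hold off" relationship discussed under 'q', assuming the server layer can state upper bounds on its detection and restoration times. The function and parameter names are illustrative only:

```python
def hold_off_is_safe(hold_off_s, server_detect_bound_s, server_restore_bound_s):
    """True if the client waits long enough for the server layer to
    detect and repair a defect before the client reacts itself."""
    return hold_off_s >= server_detect_bound_s + server_restore_bound_s


def client_reacts(defect_duration_s, hold_off_s):
    """The client acts only on defects that outlast its hold-off period."""
    return defect_duration_s >= hold_off_s
```

If the server layer's detection time is unbounded, no finite hold-off passes the first check, which is why bounded detection times matter for interworking.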
>
> 'r' This is a rather long winded way to say that "liveliness
> must test the
> actual forwarding path (proxy verification of what systems "think" is
> happening is insufficient)".
NH=> The way this is effectively worded in Y.1710 is to require that the
fault-handling of the traffic's data-plane has to be independent of any
control-plane protocols (inc case of 'none').
>
> And finally, nothing in the document leaps out at me as
> justifying the title
> (except perhaps the use of SNMP ;-)
>
> cheers
> Dave
>
>
>