The MPLS WG Archive





[mpls] working group last call on draft-ietf-mpls-oam-frmwk-01.txt

  • From: neil.2.harrison@bt.com
  • Date: Tue, 4 Jan 2005 10:41:15 -0000
  • Cc: mpls@ietf.org

Hi Peter.....

> 
> Thanks, Neil.
> 
> A few comments in-line.
NH=> Thanks....some attempted responses to these also in-line.
> 
> Peter
> 
> 
> neil.2.harrison@bt.com wrote:
> > Hi Peter.....
> > 
> >> 
> >> Neil,
> >> 
> >> Two questions:
> >> 
> >> 1) You say that PM cannot be defined without defining entry and
> >> exit criteria for availability. However, in Y.1711 you specify that
> >> PM SLA recording be turned off at the entry of the defect state.
> >> Doesn't this prove that PM is dependent on an unambiguous definition
> >> of defect states, but independent of the definition of availability?
> > NH=> In my opinion, and from a practical point of view, 'yes' is the
> > trite answer.....but only with how I have specified defects here, so
> > let me explain this a bit further.  Note - if you look at the ITU Recs
> > that deal with this they all define that a 10s period is to be used
> > for unavailability entry/exit and the turning off/on of PM, eg see
> > G.826.
> > 
> > My main objectives were simplicity whilst concentrating on what is
> > operationally important based on many years of studying real
> > transport networks.  Now in a bit more detail wrt your question:
> > 
> > True PM purists won't like my proposals, such as (i) not including PM
> > in the entry and exit criteria of availability or (ii) the fact I
> > suggested stopping the PM counters at the end of the 3s defect
> > evaluation period.  We don't have to do either of these things of
> > course...we can go for full 'by the book' processing with the added
> > complexity and loss of information this might result in.  Let me try
> > and explain what I mean by this.
> > 
> > Nearly all (close to 100%) of SES events I ever saw were accompanied
> > by defect conditions in transport networks.  But SES came in 2
> > flavours (you may recall the reasons for this that I explained in a
> > long mail to you some time ago which I won't repeat here):  (i) few ms
> > events at an effective error density 0.5 (ii) few seconds events at an
> > effective error density 0.5.  Most <1s events self healed, so did a
> > few 1-2s events, but few >3s events ever did (ergo why 3s in Y.1711
> > for defects).....and there was invariably a (server layer) defect
> > present for all this time, meaning there would be FDI inserted into
> > whatever the clients were (assuming they had an FDI).  In other words
> > just looking for defects was sufficient to measure SES and thus was
> > the simple way to measure unavailability in transport networks.
> > 
> > The few seconds SES events (which we called short-breaks) were however
> > rather important.  These were 3 orders of magnitude bigger than the
> > few ms SES and they knocked over applications.  This would not have
> > been such a problem if they were rare compared to the few ms SES
> > events.  But they were not....I was very surprised to see a roughly
> > 50/50 split of these.  But we had no method of measuring these...and
> > if they just got dumped in the same SES bucket with singleton PM SES
> > events we lost sight of them.
> 
> You observed this behavior in an SDH network, right?
NH=> Not specifically.  PDH and SDH networks really are just that, ie
'networks'.....we 'network' (switch) the trails of the various layer
networks involved here (eg E1, VC12, etc).  However, below these there
are the (p2p) section layer transmission technologies.  These can be
electrical, optical or radio in nature.  So what I was measuring was
principally the error events originally sourced from the line systems.
However, if I was observing these via an E4, VC4, E1 or whatever layer
network above these then the error events are likely to be modified by
the framing structures that are used.  For example, due to the way
justification is defined for the PDH layer networks it means that quite
short error events (10s of us) are highly likely to cause justification
corruption (ie leads to a bit slip) and a non-linear extension of the
error event (as the various nested layer networks lose/recover framing).
This non-linearity is not a problem, it's just something to be aware of.

> It would be interesting to find out if the same behavior occurs in
> networks that rely on packet switching more than on SDH (e.g.
> ethernet-based metro networks).
NH=> I agree.  I had to stop the program of research I had in place in
the late 80s/early 90s for various reasons (eg job changes).  Then we
were trying to mathematically quantify the behaviour of our transport
networks so we could (i) better understand what was happening and (ii)
define more meaningful SLAs.  We succeeded and got some very useful data.
However, since I stopped driving this work no one else here has
continued it (at least to the detail we were looking at) and I am not
aware of other operators doing similar studies.  So I don't really know
what real error events look like these days.  It would be useful if we
had this information.  Is anyone reading this able to say anything on
this matter?
> 
> 
> > 
> > It was so we could also distinguish these, and not lose sight of them,
> > that I have suggested these be measured as a parameter in their own
> > right (albeit harmonised with the manner in which defect entry events
> > were defined).  In other words a defect lasting 3-9s would register a
> > short-break event....else it will 'get lost' in the PM measurements
> > (and it obviously will not factor as an unavailability event).
> > However, letting these short-break events get captured by the PM
> > process will distort the metric objectives PM is intended to measure.
> > And it is this latter point that is the practical reason for saying
> > there is little point in continuing to measure PM metrics after a
> > period of about 3s.
> > 
> > However, you imply stopping PM measurements on defect entry (any?).
> > But this depends critically on how one defines the defect entry
> > criteria.  If this means the 3s as noted above then I'd be happy with
> > that, but if you were to imply defects detected in <<1s periods (as
> > you can easily do on transport networks) I would not, as I believe
> > this should still be counted in the PM category.
> 
> Good point. I was thinking of defects that would be detected 
> through MPLS CV or BFD.
> 
> How do you deal with server-layer defects? Do they have any 
> impact on MPLS PM, or would the LSP end-point wait for Loss 
> of CV to occur?
NH=> If we have a set of nested layer networks then there are 2
inheritance behaviours we should be aware of:
-	there is a 'static' 2-dimensional topological inheritance, ie
one cannot improve on the disjoint connectivity of the duct
graph.....this is critical to network survivability design (esp
control/management plane networks) and impacts the availability metric.
-	there is a 'dynamic' 1-dimensional nested trail performance
inheritance, due to the fact that a link-connection in layer network N
is supported by a lower level trail in layer network N-1....and this
recurses to the duct.

Now from the 2nd point above, any impairments that arise in layer
network M say get inherited (and possibly/likely(?) distorted) in layer
network M+1.  Of course layer network M+1 will have its own type of
impairments, and these will (in some way) be additive to those inherited
from the lower layer network M.  So, starting from the duct, we can
observe that what an application sees PM-wise will be some form of PM
addition across all the layer networks involved down to the duct.

There are very good reasons why co-cs mode technologies usually sit at
the bottom of the stack, ie these have the 'best' PM behaviour, eg low
fixed abs delay with very low variance (due to the jitter/wander of the
physics here), and clearly there is no such thing as 'congestion'
impacting either loss or delay (which is why you will often see me
remark that the co-cs mode has no QoS classes).  Clearly this is not
true of any packet-switching technologies....be these co-ps or cl-ps in
mode....and of course network load is a critical factor here (though one
can have CAC to control this with the co-ps mode since this has trails
like the co-cs mode).  So the degree of PM one sees is a function of the
mode considered, network loading (if pkt-sw) and the type and number of
XoverY client/server modal nestings allowed.

So to answer your question......
-	all layer networks at and above that where an error or defect
event originates will 'see' that event in some manner
-	we should arrange that defects are detected and acted on in some
sensible manner in nested layer networks, ie faster closer to the duct,
slower closer to the application (which is why I express concern about
folks wanting to act too fast too high in the stack)

This of course assumes one has some logical design paradigm on what
constitutes sensible X/Y client/server layer network relationships (and
how many of these are allowed per some HRX).  However, there are no
formal rules here, at least at the present time.
> 
> 
> > 
> > Note also that we should be a bit careful not to act too fast on <<1s
> > events as we can trigger prot-sw on (PM) error events that would
> > otherwise go away.
> >
> > Finally here, please note we don't have to make the simplifications I
> > suggested.  The 'by the book' approach means one has to let all events
> > lasting <9s get registered as PM....though as I note this will distort
> > PM and we will lose sight of short-break events....and simply use the
> > 10s unavailable state entry/exit criteria to turn on/off PMs.  I'd
> > settle for this, though I don't think it's the smartest approach.
> 
> 
> I am not in favor of this 'by the book' approach.
NH=> Yes, I also have some concerns with this.  I gave my reasons for a
pragmatic departure whilst trying to be at least consistent with how
networks are currently specified.
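
As a rough illustration of the pragmatic scheme discussed in this thread
(events shorter than 3s left in PM, 3-9s defects registered as
short-breaks, 10s or more treated as unavailability), a minimal Python
sketch might look like the following.  The function name, the labels and
the exact handling of the boundaries are assumptions made for
illustration only; they are not taken from Y.1711 or G.826.

```python
def classify_event(duration_s, defect_threshold_s=3.0,
                   unavailability_threshold_s=10.0):
    """Classify one error/defect event by its duration in seconds.

    Hypothetical helper: the thresholds default to the 3s and 10s
    figures mentioned in the thread, and the labels are illustrative.
    """
    if duration_s < 0:
        raise ValueError("duration must be non-negative")
    if duration_s < defect_threshold_s:
        # Short events stay in the PM category.
        return "pm"
    if duration_s < unavailability_threshold_s:
        # Defects of roughly 3-9s are counted separately as short-breaks.
        return "short-break"
    # Anything reaching the 10s criterion counts as unavailable time.
    return "unavailable"


if __name__ == "__main__":
    for d in (0.005, 2.0, 5.0, 12.0):
        print(d, classify_event(d))
```

Keeping the two thresholds as parameters makes it easy to compare this
with the 'by the book' approach, where only the 10s criterion is used.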

> My concern 
> is that I think that many service providers will run MPLS 
> connectivity verification at rates (much) slower than once 
> per second.
NH=> I have seen folks argue for very fast (ie must beat SDH 50ms) and,
as you note here, very slow defect detection/handling behaviour.  My
view however is simply this:  As fast as possible when close to the
duct, and as slow as possible when close to the application......though
this does imply one has some view of which client/server relationships
are sensible and allowed (else there is no logical ordering).  The 3s
defect detection periods in Y.1711 were (IMO at least) a practically
sensible choice.  Remember this however:  If one varies the CV rate
there is no possibility of consistent defect definition and
detection/handling, eg imagine an LSPA with a CV rate of 1/min
interfering with an LSPB with a CV rate of 1/s.  In simple terms, the
LSPB with the 1/s CV rate (and all clients supported by this) will see
a cycle of in/out of defect state every minute.  So be careful.  We
need some common sense here.
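
The LSPA/LSPB example can be illustrated with a toy simulation.  The
sketch below assumes, as a simplification, that the LSPB sink is in a
defect state during any second in which an unexpected CV packet (leaked
from LSPA) has been seen within the last 3 seconds.  The function names
and this windowing model are hypothetical; this is not the Y.1711 state
machine.

```python
def defect_timeline(interferer_period_s, duration_s, window_s=3):
    """Per-second defect state at the sink of the affected LSP.

    Assumed model: an unexpected CV packet arrives every
    `interferer_period_s` seconds (starting at t=0), and the sink is in
    defect whenever one arrived within the last `window_s` seconds.
    """
    arrivals = set(range(0, duration_s, interferer_period_s))
    timeline = []
    for t in range(duration_s):
        in_defect = any((t - k) in arrivals for k in range(window_s))
        timeline.append(in_defect)
    return timeline


def count_defect_entries(timeline):
    """Count transitions from 'no defect' to 'defect'."""
    entries = 0
    previous = False
    for state in timeline:
        if state and not previous:
            entries += 1
        previous = state
    return entries


if __name__ == "__main__":
    # Interferer with a CV rate of 1/min, observed for 5 minutes.
    timeline = defect_timeline(interferer_period_s=60, duration_s=300)
    print("defect entries:", count_defect_entries(timeline))
    print("seconds in defect:", sum(timeline))
```

Under these assumptions the sink enters the defect state once per minute
and leaves it again a few seconds later, which is the in/out cycle
described above.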

> In that case, a ten-second availability 
> definition is useless. Therefore, I would prefer it if we 
> could take availability out of the equation and relate PM to 
> defects only.
NH=> I can see the thrust of your argument.  But we do need to take a
holistic view from duct to application, ie we have to consider the
specification of all the layer networks involved due to:
-	inheritance 
-	the setting of mtce thresholds
-	consistency of SLAs between parties....IMO network builder
services will always be required and we can't simply disregard this.

If we don't have some unifying view (ie across all layer networks) of
what constitutes available/unavailable time then I cannot see how we can
ever have any meaningful basis for SLAs.  And I note here that arbitrary
XoverY is making this even more of a requirement and not less of one.
> 
> 
> > 
> >> 
> >> 2) Isn't availability only relevant with respect to SLAs? In most
> >> networks, services are carried over multiple network layers. Eg. an
> >> ethernet service can be carried over an MPLS-over-SONET-over-WDM core
> >> network. Defect conditions need to be defined for each network layer,
> >> but availability is only relevant for the end-to-end Ethernet
> >> service. Therefore, it seems to me that the SLA itself can provide
> >> the definition of availability (including a description of the
> >> verification process) and that there is not necessarily a need for a
> >> tight coupling with the defect definitions in the underlying network
> >> layers.
> > 
> > NH=> If I have SDHoverMPLS which is the layer you would apply this
> > reasoning to?  And what about MPLSoverSDH?...does this change things?
> >
> > Please also note that the SLA will apply to all layer networks.
> 
> I disagree. A Service Level Agreement applies to a service 
> (as the name implies) not to network layers. 
NH=> If only this were the case!  You do realise operators currently
sell 'technology' as 'services', eg FR, E1/T1 leased-lines, ATM, etc?
And please do not overlook the fact that one man's service is another
man's layer network.....this is the essence of all client/server
relationships and it is a crucial aspect of all network builder
services.
> 
> Assume for example the following service guarantee for an IP 
> or Ethernet VPN service:
> 
> - every 15 minutes, generate a Ping request from each CE 
> router to each peer CE router in the same VPN
> 
> - the number of failed ping attempts will not be more than X 
> per Y hours
> 
> Such an SLA is completely independent of the network layers 
> over which the VPN service is carried.
NH=> So how are these specified then?  And what about an SDHoverPSN PW
here.....how is the SDH 'service' specified in this case?
> 
> 
> >  If you were leasing capacity off some SP in some layer network X I'd
> > assume you want an SLA for that, and not just the services you were
> > then running over that. And if there was no alignment at all between
> > how the leasing SP specified his SLA and you specified your SLA then
> > what would you do about that?
> 
> SLAs need to be aligned, but that does not mean that there 
> must be a consistent definition of availability.
> 
> Obviously, it would be unwise if a service provider would 
> offer his customers more stringent SLAs than he himself would 
> get from a backbone provider. However, there is no problem if 
> the customer SLAs are less stringent than the backbone 
> provider's SLA. Therefore, I don't see why Ethernet services 
> would need to inherit the availability definition used in SDH 
> networks.
NH=> You seem to be implying some implicit ordering of what can be the
client and what can be the server (which I would agree with).  But what
about a SDHoverPSN PW here?
> 
> > 
> > My simple approach (because there are currently no rules for XoverY or
> > how many XoverY can be allowed in some global HRX) is to aim for as
> > much harmonisation as possible....some folks might call it 'backwards
> > compatibility', I'm just trying to be pragmatic.
> >
> > Now let me ask you a question if I may.  I have made a fair stab at
> > defining quite a lot of stuff as simply as I can in Y.1711 (which
> > would apply to p2p and p2mp).  So can you please explain to me how
> > defects, availability and all this PM stuff is going to work on mp2p
> > constructs?
> 
> As far as I can tell, most of the fault detection mechanisms 
> work as well in mp2p constructs as in p2p constructs. As long 
> as Connectivity Verification messages carry identifiers that 
> uniquely identify a source-destination pair, all possible 
> errors (loss of CV, mismerge, misbranching) can be detected.
NH=> If you studied how mp2p is defined in func arch terms then you
would note that demerging has to be performed by something above the
mp2p level.   A cl-ps client can always effect such recovery due to the
fact each/every pkt carries a SA/DA pair (and of course the cl-ps mode
cannot merge in the 1st place).  A non-cl-ps client however requires
that one run a p2p LSP above the mp2p construction to effect the
demerging.  So here the p2p label is proxying for the role of the SA.
So one can run the OAM on the p2p construct and define defects here in a
sensible way......but I don't know how to do this for the mp2p construct
itself.

regards, Neil

> Also, ping, traceroute and loopback will still work.
> 
> The only mechanism that does not work well in mp2p constructs 
> is AIS insertion due to server layer defects.
> 
> However, Service Providers can avoid mp2p constructs if they 
> set up LSPs with RSVP-TE. Since it is likely that services 
> with stringent quality requirements are carried over 
> traffic-engineered LSPs, whereas services with less stringent 
> requirements are carried over autonomously-created LSPs 
> (through LDP), I don't see the problem with mp2p.
> 
> 
> > 
> > regards, Neil
> >> 
> >> Peter
> >> 
> [snipped till the end]
> 
