Fighting the Restart Avalanche

4.4.6 Fighting the Restart Avalanche

Let’s suppose that a large number of gateways are powered on simultaneously. If they were to all initiate a RestartInProgress transaction, the Call Agent would very likely be swamped, leading to message losses and network congestion during the critical period of service restoration. In order to prevent such avalanches, the following behavior is REQUIRED:

1) When a gateway is powered on, it MUST initiate a restart timer to a random value, uniformly distributed between 0 and a maximum waiting delay (MWD). Care should be taken to avoid synchronicity of the random number generation between multiple gateways that would use the same algorithm.

2) The gateway MUST then wait for either the end of this timer, the reception of a command from the Call Agent, or the detection of local user activity, such as an off-hook transition on a residential gateway.

3) When the timer elapses, when a command is received, or when an activity is detected, the gateway MUST initiate the restart procedure.
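The three steps above can be sketched as follows. This is a minimal illustration, not a normative implementation: the function names and the use of a threading.Event (`wake_event`) to signal a Call Agent command or local activity are assumptions made for the example. The operating system's entropy source is used so that gateways running the same code do not draw synchronized values.

```python
import random
import threading

DEFAULT_MWD_SECONDS = 600.0  # default the text gives for residential gateways


def pick_restart_delay(mwd_seconds, rng=None):
    """Return a delay drawn uniformly from [0, mwd_seconds]."""
    if mwd_seconds < 0:
        raise ValueError("MWD must be non-negative")
    # SystemRandom reads OS entropy, so identical gateways do not share a
    # seed (one possible way to avoid synchronized random draws).
    rng = rng or random.SystemRandom()
    return rng.uniform(0.0, mwd_seconds)


def wait_before_restart(wake_event, mwd_seconds=DEFAULT_MWD_SECONDS, rng=None):
    """Block until the restart timer elapses or wake_event is set.

    wake_event is a hypothetical threading.Event that other gateway code
    sets when a Call Agent command arrives or local activity (for example
    an off-hook transition) is detected.
    Returns "event" or "timer" to say which condition ended the wait.
    In either case the caller then initiates the restart procedure.
    """
    delay = pick_restart_delay(mwd_seconds, rng)
    woke_early = wake_event.wait(timeout=delay)
    return "event" if woke_early else "timer"
```

A caller would create one event per gateway, hand it to the command and line-monitoring code, and start the restart procedure as soon as `wait_before_restart` returns.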

The restart procedure simply requires the endpoint to guarantee that the first

* non-audit command, or

* non-restart response (i.e., error codes other than 405, 501, and 520) to a non-audit command

that the Call Agent sees from this endpoint is a "restart" RestartInProgress command. The endpoint is free to take full advantage of piggybacking to achieve this. Endpoints that are considered in-service will have a RestartMethod of "restart", whereas endpoints considered out-of-service will have a RestartMethod of "forced" (also see Section 4.4.5). Commands rejected due to an endpoint not yet having completed the restart procedure SHOULD use error code 405 (endpoint "restarting").
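As a rough illustration of the rules in this paragraph, the sketch below picks the RestartMethod from the endpoint's service state and decides how to answer a command that arrives before the restart procedure has completed. It is deliberately simplified: it always rejects non-audit commands with 405 and does not model the alternative of piggybacking the RestartInProgress with a normal response. The function names and the `"process"` return value are assumptions made for the example; AUEP and AUCX are the MGCP audit verbs.

```python
AUDIT_VERBS = {"AUEP", "AUCX"}  # AuditEndpoint, AuditConnection


def restart_method(in_service):
    """RestartMethod carried in the RestartInProgress for this endpoint."""
    return "restart" if in_service else "forced"


def disposition_while_restarting(verb, restart_complete):
    """Decide how to treat an incoming command.

    Returns "process" when the command may be handled normally, or the
    integer error code 405 when it is rejected because the endpoint has
    not yet completed the restart procedure.
    """
    if restart_complete or verb.upper() in AUDIT_VERBS:
        return "process"
    return 405  # endpoint "restarting"
```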

The restart procedure is complete once a success response has been received. If an error response is received, the subsequent behavior depends on the error code in question:

* If the error code indicates a transient error (4xx), then the restart procedure MUST be initiated again (as a new transaction).

* If the error code is 521, then the endpoint is redirected, and the restart procedure MUST be initiated again (as a new transaction). The 521 response MUST have included a NotifiedEntity which then is the "notified entity" towards which the restart is initiated. If it did not include a NotifiedEntity, the response is treated as any other permanent error (see below).

* If the error is any other permanent error (5xx), and the endpoint is not able to rectify the error, then the endpoint no longer initiates the restart procedure on its own (until rebooted/restarted) unless otherwise specified. If a command is received for the endpoint, the endpoint MUST initiate the restart procedure again.
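The response handling just described can be summarized as a small decision function. This is a sketch under stated assumptions: success is taken to mean a 2xx response code, the action labels are invented for the example, and any code outside the classes mentioned in the text is simply reported as "ignore".

```python
def next_restart_action(code, notified_entity=None):
    """Map a RestartInProgress response to the endpoint's next step.

    Returns (action, target) where target is the new notified entity for
    a redirect, or None to keep the current one.
    """
    if 200 <= code <= 299:
        return ("complete", None)            # restart procedure is done
    if 400 <= code <= 499:
        return ("retry", None)               # transient error: new transaction
    if code == 521 and notified_entity:
        return ("retry", notified_entity)    # redirected to another Call Agent
    if 500 <= code <= 599:
        # Permanent error (including 521 without a NotifiedEntity): stop
        # retrying on our own until rebooted or until a command arrives.
        return ("halt_until_command", None)
    return ("ignore", None)                  # not covered by the text


# Example: a redirect to a hypothetical Call Agent address.
action, target = next_restart_action(521, "ca2@ca.example.net:2727")
```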

Note that if the RestartInProgress is piggybacked with the response (R) to a command received while restarting, then retransmission of the RestartInProgress does not require piggybacking of the response R. However, while the endpoint is restarting, a resend of the response R does require the RestartInProgress to be piggybacked to ensure in-order delivery of the two.

Should the gateway enter the "disconnected" state while carrying out the restart procedure, the disconnected procedure specified in Section 4.4.7 MUST be carried out, except that a "restart" rather than "disconnected" message is sent during the procedure.

Each endpoint in a gateway will have a provisionable Call Agent, i.e., "notified entity", to direct the initial restart message towards. When the collection of endpoints in a gateway is managed by more than one Call Agent, the above procedure MUST be performed for each collection of endpoints managed by a given Call Agent. The gateway MUST take full advantage of wild-carding to minimize the number of RestartInProgress messages generated when multiple endpoints in a gateway restart and the endpoints are managed by the same Call Agent. Note that during startup, it is possible for endpoints to start out as being out-of-service, and then become in-service as part of the gateway initialization procedure. A gateway may thus choose to send first a "forced" RestartInProgress for all its endpoints, and subsequently a "restart" RestartInProgress for the endpoints that come in-service. Alternatively, the gateway may simply send "restart" RestartInProgress for only those endpoints that are in-service, and "forced" RestartInProgress for the specific endpoints that are out-of-service. Wild-carding MUST still be used to minimize the number of messages sent though.
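One way to picture the second alternative is to group endpoints by Call Agent and RestartMethod, so that each group can be covered by as few wild-carded RestartInProgress messages as the endpoint naming scheme allows. The sketch below only performs the grouping; the tuple layout, the function name and the example endpoint and Call Agent names are assumptions made for illustration, and building the actual wild-carded endpoint names is left out.

```python
from collections import defaultdict


def plan_restart_messages(endpoints):
    """Group endpoints by (call_agent, restart_method).

    endpoints is an iterable of (name, call_agent, in_service) tuples.
    Returns a dict mapping each (call_agent, method) pair to the list of
    endpoint names that one wild-carded message (or a few) should cover.
    """
    groups = defaultdict(list)
    for name, call_agent, in_service in endpoints:
        method = "restart" if in_service else "forced"
        groups[(call_agent, method)].append(name)
    return dict(groups)


# Hypothetical gateway with three lines managed by two Call Agents.
plan = plan_restart_messages([
    ("aaln/1", "ca1.example.net", True),
    ("aaln/2", "ca1.example.net", True),
    ("aaln/3", "ca2.example.net", False),
])
```

In this example the two in-service lines managed by the first Call Agent fall into one group, so a single wild-carded message could cover both.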

The value of MWD is a configuration parameter that depends on the type of the gateway. The following reasoning can be used to determine the value of this delay on residential gateways.

Call agents are typically dimensioned to handle the peak hour traffic load, during which, on average, 10% of the lines will be busy, placing calls whose average duration is typically 3 minutes. The processing of a call typically involves 5 to 6 MGCP transactions between each endpoint and the Call Agent. This simple calculation shows that the Call Agent is expected to handle 5 to 6 transactions for each endpoint, every 30 minutes on average, or, to put it otherwise, about one transaction per endpoint every 5 to 6 minutes on average. This suggests that a reasonable value of MWD for a residential gateway would be 10 to 12 minutes. In the absence of explicit configuration, residential gateways should adopt a value of 600 seconds for MWD.

The same reasoning suggests that the value of MWD should be much shorter for trunking gateways or for business gateways, because they handle a large number of endpoints, and also because the usage rate of these endpoints is much higher than 10% during the peak busy hour, a typical value being 60%. These endpoints, during the peak hour, are thus expected to contribute about one transaction per minute to the Call Agent load. A reasonable algorithm is to make the value of MWD per "trunk" endpoint six times shorter than the MWD per residential gateway, and also inversely proportional to the number of endpoints that are being restarted. For example, MWD should be set to 2.5 seconds for a gateway that handles a T1 line, or to 60 milliseconds for a gateway that handles a T3 line.
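The load estimates used in this reasoning can be reproduced with a few lines of arithmetic. The sketch below computes the average time between transactions for one endpoint from the busy fraction, the call duration and the number of transactions per call. The function name is an assumption, and the trunk figures assume the same 3-minute calls and 5 to 6 transactions per call as the residential case, which the text implies but does not state explicitly.

```python
def minutes_between_transactions(busy_fraction, call_minutes, transactions_per_call):
    """Average minutes between MGCP transactions for a single endpoint."""
    if not 0 < busy_fraction <= 1:
        raise ValueError("busy_fraction must be in (0, 1]")
    # An endpoint busy 10% of the time with 3-minute calls starts a call
    # roughly every 3 / 0.10 = 30 minutes.
    minutes_between_calls = call_minutes / busy_fraction
    return minutes_between_calls / transactions_per_call


# Residential line: 10% busy, 3-minute calls, 5 to 6 transactions per call.
residential = [minutes_between_transactions(0.10, 3, n) for n in (5, 6)]

# Trunk endpoint: 60% busy, same call length and transaction count assumed.
trunk = [minutes_between_transactions(0.60, 3, n) for n in (5, 6)]
```

The residential values come out at about 6 and 5 minutes, and the trunk values at about 1 minute and 50 seconds, matching the "one transaction every 5 to 6 minutes" and "about one transaction per minute" figures above.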
