      1 CDDL HEADER START
      2 
      3 The contents of this file are subject to the terms of the
      4 Common Development and Distribution License (the "License").
      5 You may not use this file except in compliance with the License.
      6 
      7 You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
      8 or http://www.opensolaris.org/os/licensing.
      9 See the License for the specific language governing permissions
     10 and limitations under the License.
     11 
     12 When distributing Covered Code, include this CDDL HEADER in each
     13 file and include the License file at usr/src/OPENSOLARIS.LICENSE.
     14 If applicable, add the following below this CDDL HEADER, with the
     15 fields enclosed by brackets "[]" replaced with your own identifying
     16 information: Portions Copyright [yyyy] [name of copyright owner]
     17 
     18 CDDL HEADER END
     19 
     20 Copyright 2007 Sun Microsystems, Inc.  All rights reserved.
     21 Use is subject to license terms.
     22 
     23 Architectural Overview for the DHCP agent
     24 Peter Memishian
     25 ident	"%Z%%M%	%I%	%E% SMI"
     26 
     27 INTRODUCTION
     28 ============
     29 
     30 The Solaris DHCP agent (dhcpagent) is a DHCP client implementation
     31 compliant with RFCs 2131, 3315, and others.  The major forces shaping
     32 its design were:
     33 
     34 	* Must be capable of managing multiple network interfaces.
     35 	* Must consume little CPU, since it will always be running.
     36 	* Must have a small memory footprint, since it will always be
     37 	  running.
     38 	* Must not rely on any shared libraries outside of /lib, since
     39 	  it must run before all filesystems have been mounted.
     40 
     41 When a DHCP agent implementation is only required to control a single
     42 interface on a machine, the problem is expressed well as a simple
     43 state-machine, as shown in RFC2131.  However, when a DHCP agent is
     44 responsible for managing more than one interface at a time, the
     45 problem becomes much more complicated.
     46 
     47 This can be resolved using threads or with an event-driven model.
     48 Given that DHCP's behavior can be expressed concisely as a state
     49 machine, the event-driven model is the closest match.
     50 
     51 While tried-and-true, that model is subtle and easy to get wrong.
     52 Indeed, much of the agent's code is there to manage the complexity of
     53 programming in an asynchronous event-driven paradigm.
     54 
     55 THE BASICS
     56 ==========
     57 
     58 The DHCP agent consists of roughly 30 source files, most with a
     59 companion header file.  While the largest source file is around 1700
     60 lines, most are much shorter.  The source files can largely be broken
     61 up into three groups:
     62 
     63 	* Source files that, along with their companion header files,
     64 	  define an abstract "object" that is used by other parts of
     65 	  the system.  Examples include "packet.c", which along with
     66 	  "packet.h" provide a Packet object for use by the rest of
     67 	  the agent; and "async.c", which along with "async.h" defines
     68 	  an interface for managing asynchronous transactions within
     69 	  the agent.
     70 
     71 	* Source files that implement a given state of the agent; for
     72 	  instance, there is a "request.c" which comprises all of
     73 	  the procedural "work" which must be done while in the
     74 	  REQUESTING state of the agent.  By encapsulating states in
     75 	  files, it becomes easier to debug errors in the
     76 	  client/server protocol and adapt the agent to new
     77 	  constraints, since all the relevant code is in one place.
     78 
     79 	* Source files that, along with their companion header files,
     80 	  encapsulate a given task or related set of tasks.  The
     81 	  difference between this and the first group is that the
     82 	  interfaces exported from these files do not operate on
     83 	  an "object", but rather perform a specific task.  Examples
     84 	  include "defaults.c", which provides a useful interface
     85 	  to /etc/default/dhcpagent file operations.
     86 
     87 OVERVIEW
     88 ========
     89 
     90 Here we discuss the essential objects and subtle aspects of the
     91 DHCP agent implementation.  Note that there is of course much more
     92 that is not discussed here, but after this overview you should be able 
     93 to fend for yourself in the source code.
     94 
     95 For details on the DHCPv6 aspects of the design, and how this relates
     96 to the implementation present in previous releases of Solaris, see the
     97 README.v6 file.
     98 
     99 Event Handlers and Timer Queues
    100 -------------------------------
    101 
    102 The most important object in the agent is the event handler, whose
    103 interface is in libinetutil.h and whose implementation is in
    104 libinetutil.  The event handler is essentially an object-oriented
    105 wrapper around poll(2): other components of the agent can register to
    106 be called back when specific events on file descriptors happen -- for
    107 instance, to wait for requests to arrive on its IPC socket, the agent
    108 registers a callback function (accept_event()) that will be called
    109 back whenever a new connection arrives on the file descriptor
    110 associated with the IPC socket.  When the agent initially begins in
    111 main(), it registers a number of events with the event handler, and
    112 then calls iu_handle_events(), which proceeds to wait for events to
    113 happen -- this function does not return until the agent is shut down
    114 via signal.
    115 
    116 When the registered events occur, the callback functions are called
    117 back, which in turn might lead to additional callbacks being
    118 registered -- this is the classic event-driven model.  (As an aside,
    119 note that programming in an event-driven model means that callbacks
    120 cannot block, or else the agent will become unresponsive.)
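
The registration-and-dispatch pattern described above can be sketched
as follows.  This is a minimal illustration, not the libinetutil
implementation: every name here (mini_eh_t, mini_eh_register(),
mini_eh_poll_once(), count_cb()) is hypothetical, and the fixed-size
table is an assumption made to keep the example short.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define	MINI_EH_MAX	8	/* arbitrary table size for this sketch */

/* Callback: gets the ready descriptor, its events, and the context. */
typedef void mini_eh_cb_t(int fd, short revents, void *arg);

typedef struct {
	struct pollfd	fds[MINI_EH_MAX];
	mini_eh_cb_t	*cbs[MINI_EH_MAX];
	void		*args[MINI_EH_MAX];
	nfds_t		count;
} mini_eh_t;

/* Register `cb' for `events' on `fd'; returns 0, or -1 if full. */
static int
mini_eh_register(mini_eh_t *eh, int fd, short events, mini_eh_cb_t *cb,
    void *arg)
{
	assert(cb != NULL);
	if (eh->count == MINI_EH_MAX)
		return (-1);
	eh->fds[eh->count].fd = fd;
	eh->fds[eh->count].events = events;
	eh->fds[eh->count].revents = 0;
	eh->cbs[eh->count] = cb;
	eh->args[eh->count] = arg;
	eh->count++;
	return (0);
}

/*
 * One pass of the event loop: wait up to `timeout_ms' and call the
 * callback of every ready descriptor.  Returns the number of callbacks
 * dispatched, or -1 if poll(2) failed.
 */
static int
mini_eh_poll_once(mini_eh_t *eh, int timeout_ms)
{
	int	dispatched = 0;
	nfds_t	i;

	if (poll(eh->fds, eh->count, timeout_ms) == -1)
		return (-1);
	for (i = 0; i < eh->count; i++) {
		if (eh->fds[i].revents != 0) {
			eh->cbs[i](eh->fds[i].fd, eh->fds[i].revents,
			    eh->args[i]);
			dispatched++;
		}
	}
	return (dispatched);
}

/* Example callback: consume one byte and count the call via `arg'. */
static void
count_cb(int fd, short revents, void *arg)
{
	char	c;

	(void) revents;
	if (read(fd, &c, 1) == 1)
		(*(int *)arg)++;
}
```

A real loop would call the one-pass function repeatedly until told to
stop; note that count_cb() performs a single non-looping read so that
it returns promptly, in keeping with the no-blocking rule.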
    121 
    122 A special kind of "event" is a timeout.  Since there are many timers
    123 which must be maintained for each DHCP-controlled interface (such as a
    124 lease expiration timer, time-to-first-renewal (t1) timer, and so
    125 forth), an object-oriented abstraction to timers called a "timer
    126 queue" is provided, whose interface is in libinetutil.h with a
    127 corresponding implementation in libinetutil.  The timer queue allows
    128 callback functions to be "scheduled" for callback after a certain
    129 amount of time has passed.
    130 
    131 The event handler and timer queue objects work hand-in-hand: the event
    132 handler is passed a pointer to a timer queue in iu_handle_events() --
    133 from there, it can use the iu_earliest_timer() routine to find the
    134 timer which will next fire, and use this to set its timeout value in
    135 its call to poll(2).  If poll(2) returns due to a timeout, the event
    136 handler calls iu_expire_timers() to fire all timers that have expired
    137 (note that more than one may have expired if, for example, multiple
    138 timers were set to expire at the same time).
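
As a rough illustration of how a timer queue can feed the poll(2)
timeout, consider the sketch below.  It is not the libinetutil code:
the names (mini_tq_t, mini_tq_schedule(), mini_tq_poll_timeout(),
mini_tq_expire(), bump_cb()) are hypothetical, times are plain
millisecond counts supplied by the caller, and a fixed-size array
stands in for whatever structure the real timer queue uses.

```c
#include <assert.h>
#include <stddef.h>

#define	MINI_TQ_MAX	8	/* arbitrary capacity for this sketch */

typedef void mini_tq_cb_t(void *arg);

typedef struct {
	long long	fire_at;	/* absolute time in milliseconds */
	mini_tq_cb_t	*cb;
	void		*arg;
	int		active;
} mini_timer_t;

typedef struct {
	mini_timer_t	timers[MINI_TQ_MAX];
} mini_tq_t;

/* Schedule cb(arg) for time `fire_at'; returns a slot id, or -1. */
static int
mini_tq_schedule(mini_tq_t *tq, long long fire_at, mini_tq_cb_t *cb,
    void *arg)
{
	int	i;

	assert(cb != NULL);
	for (i = 0; i < MINI_TQ_MAX; i++) {
		if (!tq->timers[i].active) {
			tq->timers[i].fire_at = fire_at;
			tq->timers[i].cb = cb;
			tq->timers[i].arg = arg;
			tq->timers[i].active = 1;
			return (i);
		}
	}
	return (-1);
}

/*
 * Timeout to hand to poll(2): milliseconds until the earliest timer,
 * 0 if one is already due, or -1 (wait forever) if none is scheduled.
 */
static int
mini_tq_poll_timeout(const mini_tq_t *tq, long long now)
{
	long long	best = 0;
	int		i, found = 0;

	for (i = 0; i < MINI_TQ_MAX; i++) {
		if (!tq->timers[i].active)
			continue;
		if (!found || tq->timers[i].fire_at < best) {
			best = tq->timers[i].fire_at;
			found = 1;
		}
	}
	if (!found)
		return (-1);
	return (best <= now ? 0 : (int)(best - now));
}

/* Fire every timer due at or before `now'; returns how many fired. */
static int
mini_tq_expire(mini_tq_t *tq, long long now)
{
	int	i, fired = 0;

	for (i = 0; i < MINI_TQ_MAX; i++) {
		if (tq->timers[i].active && tq->timers[i].fire_at <= now) {
			/* Clear the slot first; the callback may reschedule. */
			tq->timers[i].active = 0;
			tq->timers[i].cb(tq->timers[i].arg);
			fired++;
		}
	}
	return (fired);
}

/* Example callback: increment the counter passed as context. */
static void
bump_cb(void *arg)
{
	(*(int *)arg)++;
}
```

Two timers scheduled for the same instant are both fired by a single
call to the expire routine, which is the case the parenthetical note
above describes.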
    139 
    140 Although it is possible to instantiate more than one timer queue or
    141 event handler object, it doesn't make a lot of sense -- these objects
    142 are really "singletons".  Accordingly, the agent has two global
    143 variables, `eh' and `tq', which store pointers to the global event
    144 handler and timer queue.
    145 
    146 Network Interfaces
    147 ------------------
    148 
    149 For each network interface managed by the agent, there is a set of
    150 associated state that describes both its general properties (such as
    151 the maximum MTU) and its connections to DHCP-related state (the
    152 protocol state machines).  This state is stored in a pair of
    153 structures called `dhcp_pif_t' (the IP physical interface layer or
    154 PIF) and `dhcp_lif_t' (the IP logical interface layer or LIF).  Each
    155 dhcp_pif_t represents a single physical interface, such as "hme0," for
    156 a given IP protocol version (4 or 6), and has a list of dhcp_lif_t
    157 structures representing the logical interfaces (such as "hme0:1") in
    158 use by the agent.
    159 
    160 This split is important because of differences between IPv4 and IPv6.
    161 For IPv4, each DHCP state machine manages a single IP address and
    162 associated configuration data.  This corresponds to a single logical
    163 interface, which must be specified by the user.  For IPv6, however,
    164 each DHCP state machine manages a group of addresses, and is
    165 associated with a DUID value rather than with just an interface.
    166 
    167 Thus, DHCPv6 behaves more like in.ndpd in its creation of "ADDRCONF"
    168 interfaces.  The agent automatically plumbs logical interfaces when
    169 needed and removes them when the addresses expire.
    170 
    171 The state for a given session is stored separately in `dhcp_smach_t'.
    172 This state machine then points to the main LIF used for I/O, and to a
    173 list of `dhcp_lease_t' structures representing individual leases, and
    174 each of those points to a list of LIFs corresponding to the individual
    175 addresses being managed.
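
The relationships among these structures can be pictured with the
simplified sketch below.  The ex_* types are hypothetical stand-ins
that keep only the linkage described above; the real dhcp_pif_t,
dhcp_lif_t, dhcp_lease_t, and dhcp_smach_t carry far more state and
may link their lists differently.

```c
#include <assert.h>
#include <stddef.h>

/* Logical interface: one per managed address. */
typedef struct ex_lif {
	const char	*name;		/* e.g. "hme0:1" */
	struct ex_lif	*next;		/* next LIF in the same lease */
} ex_lif_t;

/* Physical interface for one IP version, owning its LIFs. */
typedef struct ex_pif {
	const char	*name;		/* e.g. "hme0" */
	int		isv6;		/* nonzero for IPv6 */
	ex_lif_t	*lifs;
} ex_pif_t;

/* One lease, pointing at the LIFs for its addresses. */
typedef struct ex_lease {
	ex_lif_t	*lifs;
	struct ex_lease	*next;
} ex_lease_t;

/* One session: the LIF used for I/O plus the list of leases. */
typedef struct ex_smach {
	ex_lif_t	*main_lif;
	ex_lease_t	*leases;
} ex_smach_t;

/* Count the addresses (LIFs) managed across all leases of `sm'. */
static size_t
ex_count_addrs(const ex_smach_t *sm)
{
	const ex_lease_t	*lease;
	const ex_lif_t		*lif;
	size_t			n = 0;

	assert(sm != NULL);
	for (lease = sm->leases; lease != NULL; lease = lease->next) {
		for (lif = lease->lifs; lif != NULL; lif = lif->next)
			n++;
	}
	return (n);
}
```

For IPv4 the count would be one per state machine; for IPv6 a single
state machine may hold several leases with several addresses each.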
    176 
    177 One point that was brushed over in the preceding discussion of event
    178 handlers and timer queues was context.  Recall that the event-driven
    179 nature of the agent requires that functions cannot block, lest they
    180 starve out others and impact the observed responsiveness of the agent.
    181 As an example, consider the process of extending a lease: the agent
    182 must send a REQUEST packet and wait for an ACK or NAK packet in
    183 response.  This is done by sending a REQUEST and then returning to the
    184 event handler that waits for an ACK or NAK packet to arrive on the
    185 file descriptor associated with the interface.  Note, however, that
    186 when the ACK or NAK does arrive and the callback function is called
    187 back, it must know which state machine this packet is for (it must get
    188 back its context).  This could be handled through an ad-hoc mapping of
    189 file descriptors to state machines, but a cleaner approach is to have
    190 the event handler's register function (iu_register_event()) take in an
    191 opaque context pointer, which will then be passed back to the
    192 callback.  In the agent, the context pointer used depends on the
    193 nature of the event: events on LIFs use the dhcp_lif_t pointer, events
    194 on the state machine use dhcp_smach_t, and so on.
    195 
    196 Note that there is nothing that guarantees the pointer passed into
    197 iu_register_event() or iu_schedule_timer() will still be valid when
    198 the callback is called back (for instance, the memory may have been
    199 freed in the meantime).  To solve this problem, all of the data
    200 structures used in this way are reference counted.  For more details
    201 on how the reference count scheme is implemented, see the closing
    202 comments in interface.h regarding memory management.
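
The general idea of protecting a context pointer with a reference
count can be sketched as follows.  This is only one plausible shape
for such a scheme, with hypothetical names (ex_obj_t, ex_obj_hold(),
ex_obj_release(), ex_callback()); the agent's actual rules are the
ones documented in interface.h.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
	unsigned int	hold_count;	/* outstanding references */
	int		removed;	/* set once logically deleted */
} ex_obj_t;

/* Allocate an object holding one reference for its creator. */
static ex_obj_t *
ex_obj_create(void)
{
	ex_obj_t	*obj = calloc(1, sizeof (*obj));

	if (obj != NULL)
		obj->hold_count = 1;
	return (obj);
}

/* Take a reference, e.g. just before registering a callback. */
static void
ex_obj_hold(ex_obj_t *obj)
{
	obj->hold_count++;
}

/* Drop a reference; frees the object and returns 1 on the last one. */
static int
ex_obj_release(ex_obj_t *obj)
{
	assert(obj->hold_count > 0);
	if (--obj->hold_count == 0) {
		free(obj);
		return (1);
	}
	return (0);
}

/*
 * Callback body: the hold taken at registration time guarantees that
 * `obj' is still valid memory here, even if the owner has meanwhile
 * removed it.  Returns 1 if the callback did its work, 0 if skipped.
 */
static int
ex_callback(ex_obj_t *obj)
{
	int	usable = !obj->removed;

	/* ... real work would go here, only when `usable' is set ... */
	(void) ex_obj_release(obj);	/* drop the registration's hold */
	return (usable);
}
```

The key point is ordering: the hold is taken when the pointer is
handed to the event handler or timer queue, and dropped only after
the callback has finished looking at the object.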
    203 
    204 Transactions
    205 ------------
    206 
    207 Many operations performed via DHCP must be performed in groups -- for
    208 instance, acquiring a lease requires several steps: sending a
    209 DISCOVER, collecting OFFERs, selecting an OFFER, sending a REQUEST,
    210 and receiving an ACK, assuming everything goes well.  Note however
    211 that due to the event-driven model the agent operates in, these
    212 operations are not inherently "grouped" -- instead, the agent sends a
    213 DISCOVER, goes back into the main event loop, waits for events
    214 (perhaps even requests on the IPC channel to begin acquiring a lease
    215 on another state machine), eventually checks to see if an acceptable
    216 OFFER has come in, and so forth.  To some degree, the notion of the
    217 state machine's current state (SELECTING, REQUESTING, etc) helps
    218 control the potential chaos of the event-driven model (for instance,
    219 if while the agent is waiting for an OFFER on a given state machine,
    220 an IPC event comes in requesting that the leases be RELEASED, the
    221 agent knows to send back an error since the state machine must be in
    222 at least the BOUND state before a RELEASE can be performed.)
    223 
    224 However, states are not enough -- for instance, suppose that the agent
    225 begins trying to renew a lease.  This is done by sending a REQUEST
    226 packet and waiting for an ACK or NAK, which might never come.  If,
    227 while waiting for the ACK or NAK, the user sends a request to renew
    228 the lease as well, then if the agent were to send another REQUEST,
    229 things could get quite complicated (and this is only the beginning of
    230 this rathole).  To protect against this, two objects exist:
    231 `async_action' and `ipc_action'.  These objects are related, but
    232 independent of one another; the more essential object is the
    233 `async_action', which we will discuss first.
    234 
    235 In short, an `async_action' represents a pending transaction (aka
    236 asynchronous action), of which each state machine can have at most
    237 one.  The `async_action' structure is embedded in the `dhcp_smach_t'
    238 structure, which is fine since there can be at most one pending
    239 transaction per state machine.  Typical "asynchronous transactions"
    240 are START, EXTEND, and INFORM, since each consists of a sequence of
    241 packets that must be done without interruption.  Note that not all
    242 DHCP operations are "asynchronous" -- for instance, a DHCPv4 RELEASE
    243 operation is synchronous (not asynchronous) since after the RELEASE is
    244 sent no reply is expected from the DHCP server, but DHCPv6 Release is
    245 asynchronous, as all DHCPv6 messages are transactional.  Some
    246 operations, such as status query, are synchronous and do not affect
    247 the system state, and thus do not require sequencing.
    248 
    249 When the agent realizes it must perform an asynchronous transaction,
    250 it calls async_start() to open the transaction.  If one is already
    251 pending, then the new transaction must fail (the details of failure
    252 depend on how the transaction was initiated, which is described in
    253 more detail later when the `ipc_action' object is discussed).  If
    254 there is no pending asynchronous transaction, the operation succeeds.
    255 
    256 When the transaction is complete, either async_finish() or
    257 async_cancel() must be called to complete or cancel the asynchronous
    258 action on that state machine.  If the transaction is unable to
    259 complete within a certain amount of time (more on this later), a timer
    260 should be used to cancel the operation.
    261 
    262 The notion of asynchronous transactions is complicated by the fact
    263 that they may originate from both inside and outside of the agent.
    264 For instance, a user initiates an asynchronous START transaction when
    265 he performs an `ifconfig hme0 dhcp start', but the agent will
    266 internally need to perform asynchronous EXTEND transactions to extend
    267 the lease before it expires.  Note that user-initiated actions always
    268 have priority over internal actions: the former will cancel the
    269 latter, if necessary.
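
The "at most one pending transaction" rule, together with the
priority of user requests over internal ones, can be modeled roughly
as below.  This is an interpretation of the prose above, not the code
in async.c: the ex_* names, the enum, and the choice to reject a user
request when another user request is pending are all assumptions.

```c
#include <assert.h>

typedef enum { TXN_NONE, TXN_START, TXN_EXTEND, TXN_INFORM } ex_txn_t;

typedef struct {
	ex_txn_t	pending;	/* TXN_NONE when idle */
	int		from_user;	/* nonzero if started via IPC */
} ex_async_t;

/*
 * Try to open a transaction.  Returns 1 on success and 0 if the new
 * transaction must fail because another one is pending.
 */
static int
ex_async_start(ex_async_t *aa, ex_txn_t type, int from_user)
{
	assert(type != TXN_NONE);
	if (aa->pending != TXN_NONE) {
		/*
		 * A user request may displace an internal transaction,
		 * but never the other way around.
		 */
		if (!from_user || aa->from_user)
			return (0);
		aa->pending = TXN_NONE;		/* cancel internal one */
	}
	aa->pending = type;
	aa->from_user = from_user;
	return (1);
}

/* Close the pending transaction, whether it finished or was cancelled. */
static void
ex_async_finish(ex_async_t *aa)
{
	aa->pending = TXN_NONE;
	aa->from_user = 0;
}
```

Because the structure holds exactly one slot, embedding it in the
per-state-machine structure enforces the one-transaction limit by
construction.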
    270 
    271 This leads us into the `ipc_action' object.  An `ipc_action'
    272 represents the IPC-related pieces of an asynchronous transaction that
    273 was started as a result of a user request, as well as the `BUSY' state
    274 of the administrative interface.  Only IPC-generated asynchronous
    275 transactions have a valid `ipc_action' object.  Note that since there
    276 can be at most one asynchronous action per state machine, there can
    277 also be at most one `ipc_action' per state machine (this means it can
    278 also conveniently be embedded inside the `dhcp_smach_t' structure).
    279 
    280 One of the main purposes of the `ipc_action' object is to time out user
    281 events.  When the user specifies a timeout value as an argument to
    282 ifconfig, he is specifying an `ipc_action' timeout; in other words,
    283 how long he is willing to wait for the command to complete.  When this
    284 time expires, the ipc_action is terminated, as well as the
    285 asynchronous operation.
    286 
    287 The API provided for the `ipc_action' object is quite similar to the
    288 one for the `async_action' object: when an IPC request comes in for an 
    289 operation requiring asynchronous operation, ipc_action_start() is
    290 called.  When the request completes, ipc_action_finish() is called.
    291 If the user times out before the request completes, then
    292 ipc_action_timeout() is called.
    293 
    294 Packet Management
    295 -----------------
    296 
    297 Another complicated area is packet management: building, manipulating,
    298 sending and receiving packets.  These operations are all encapsulated
    299 behind a dozen or so interfaces (see packet.h) that abstract the
    300 unimportant details away from the rest of the agent code.  In order to
    301 send a DHCP packet, code first calls init_pkt(), which returns a
    302 dhcp_pkt_t initialized suitably for transmission.  Note that currently
    303 init_pkt() returns a dhcp_pkt_t that is actually allocated as part of
    304 the `dhcp_smach_t', but this may change in the future.  After calling
    305 init_pkt(), the add_pkt_opt*() functions are used to add options to
    306 the DHCP packet.  Finally, send_pkt() and send_pkt_v6() can be used to
    307 transmit the packet to a given IP address.
    308 
    309 The send_pkt() function handles the details of packet timeout and
    310 retransmission.  The last argument to send_pkt() is a pointer to a
    311 "stop function."  If this argument is passed as NULL, then the packet
    312 will only be sent once (it won't be retransmitted).  Otherwise, the
    313 stop function will be called back before each retransmission.
    314 The callback may alter dsm_send_timeout if necessary
    315 to place a cap on the next timeout; this is done for DHCPv6 in
    316 stop_init_reboot() in order to implement the CNF_MAX_RD constraint.
    317 
    318 The return value from this function indicates whether to continue
    319 retransmission or not, which allows the send_pkt() caller to control
    320 the retransmission policy without making it have to deal with the
    321 retransmission mechanism.  See request.c for an example of this in
    322 action.
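
The separation of retransmission policy (the stop function) from
mechanism (the sender) can be simulated as below.  This is a
synchronous toy, whereas the agent drives retransmission from timers;
the names, the doubling backoff, the stop function's signature, and
the convention that a nonzero return means "stop" are assumptions
made for the illustration and are not taken from packet.c.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Policy hook: called before each retransmission with the number of
 * packets sent so far.  It may lower *timeout to cap the next wait,
 * and returns nonzero to stop retransmitting.
 */
typedef int ex_stop_func_t(unsigned int n_sent, unsigned int *timeout);

/*
 * Mechanism: "send" once, then keep retransmitting with a doubling
 * timeout until the stop function says to stop.  A NULL stop function
 * means send exactly once.  `max_sends' bounds the simulation.
 * Returns the number of sends; *last_timeout gets the final timeout.
 */
static unsigned int
ex_send(ex_stop_func_t *stop, unsigned int first_timeout,
    unsigned int max_sends, unsigned int *last_timeout)
{
	unsigned int	timeout = first_timeout;
	unsigned int	sent = 1;	/* the initial transmission */

	assert(last_timeout != NULL);
	while (stop != NULL && sent < max_sends) {
		timeout *= 2;		/* assumed backoff policy */
		if (stop(sent, &timeout))
			break;
		sent++;			/* the retransmission */
	}
	*last_timeout = timeout;
	return (sent);
}

/* Example policy: cap the timeout at 16 and give up after 4 sends. */
static int
ex_stop_after_four(unsigned int n_sent, unsigned int *timeout)
{
	if (*timeout > 16)
		*timeout = 16;
	return (n_sent >= 4);
}
```

The sender never needs to know why retransmission ends; a different
state can supply a different policy without touching the mechanism.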
    323 
    324 The recv_pkt() function is simpler but still complicated by the fact
    325 that one may want to receive several different types of packets at
    326 once.  The caller registers an event handler on the file descriptor,
    327 and then calls recv_pkt() to read in the packet along with meta
    328 information about the message (the sender and interface identifier).
    329 				
    330 For IPv6, packet reception is done with a single socket, using
    331 IPV6_PKTINFO to determine the actual destination address and receiving
    332 interface.  Packets are then matched against the state machines on the
    333 given interface through the transaction ID.
    334 
    335 For IPv4, due to oddities in the DHCP specification (discussed in
    336 PSARC/2007/571), a special IP_DHCPINIT_IF socket option must be used
    337 to allow unicast DHCP traffic to be received on an interface during
    338 lease acquisition.  Since the IP_DHCPINIT_IF socket option can only
    339 enable one interface at a time, one socket must be used per interface.
    340 
    341 Time
    342 ----
    343 
    344 The notion of time is an exceptionally subtle area.  You will notice
    345 five ways that time is represented in the source: as lease_t's,
    346 uint32_t's, time_t's, hrtime_t's, and monosec_t's.  Each of these
    347 types serves a slightly different function.
    348 
    349 The `lease_t' type is the simplest to understand; it is the unit of
    350 time in the CD_{LEASE,T1,T2}_TIME options in a DHCP packet, as defined
    351 by RFC2131. This is defined as a positive number of seconds (relative
    352 to some fixed point in time) or the value `-1' (DHCP_PERM) which
    353 represents infinity (i.e., a permanent lease).  The lease_t should be
    354 used either when dealing with actual DHCP packets that are sent on the
    355 wire or for variables which follow the exact definition given in the
    356 RFC.
    357 
    358 The `uint32_t' type is also used to represent a relative time in
    359 seconds.  However, here the value `-1' is not special and of course
    360 this type is not tied to any definition given in RFC2131.  Use this
    361 for representing "offsets" from another point in time that are not
    362 DHCP lease times.
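
The special role of `-1' in lease times, as opposed to plain offsets,
can be shown with a small sketch.  The type and macro names below are
hypothetical stand-ins (an unsigned 32-bit lease type whose all-ones
value means "permanent" is an assumption consistent with the
description above, not a quote of the agent's headers).

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for lease_t: seconds, with all-ones meaning "forever". */
typedef uint32_t ex_lease_time_t;
#define	EX_DHCP_PERM	((ex_lease_time_t)-1)

/*
 * Returns 1 if a lease of `lease' seconds that began at `start' has
 * expired by `now' (both in seconds on the same monotonic clock).
 * A permanent lease never expires.
 */
static int
ex_lease_expired(uint64_t start, ex_lease_time_t lease, uint64_t now)
{
	assert(now >= start);
	if (lease == EX_DHCP_PERM)
		return (0);
	return (now - start >= lease);
}
```

A plain uint32_t offset would skip the permanence check entirely,
which is why the two uses are kept as distinct types.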
    363 
    364 The `time_t' type is the natural Unix type for representing time since
    365 the epoch.  Unfortunately, it is affected by stime(2) or adjtime(2)
    366 and since the DHCP client is used during system installation (and thus
    367 when time is typically being configured), the time_t cannot be used in
    368 general to represent an absolute time since the epoch.  For instance,
    369 if a time_t were used to keep track of when a lease began, and then a
    370 minute later stime(2) was called to adjust the system clock forward a
    371 year, then the lease would appear to have expired a year ago even
    372 though it has only been a minute.  For this reason, time_t's should
    373 only be used either when wall time must be displayed (such as in
    374 a DHCP_STATUS IPC transaction) or when a time meaningful across reboots
    375 must be obtained (such as when caching an ACK packet at system
    376 shutdown).
    377 
    378 The `hrtime_t' type returned from gethrtime() works around the
    379 limitations of the time_t in that it is not affected by stime(2) or
    380 adjtime(2), with the disadvantage that it represents time from some
    381 arbitrary time in the past and in nanoseconds.  The timer queue code
    382 deals with hrtime_t's directly since that particular piece of code is
    383 meant to be fairly independent of the rest of the DHCP client.
    384 
    385 However, dealing with nanoseconds is error-prone when all the other
    386 time types are in seconds.  As a result, yet another time type, the
    387 `monosec_t' was created to represent a monotonically increasing time
    388 in seconds, and is really no more than (hrtime_t / NANOSEC).  Note
    389 that this unit is typically used where time_t's would've traditionally
    390 been used.  The function monosec() in util.c returns the current
    391 monosec, and monosec_to_time() can convert a given monosec to wall
    392 time, using the system's current notion of time.
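
The arithmetic behind these conversions is small enough to sketch.
The code below is not util.c: it uses clock_gettime(CLOCK_MONOTONIC)
as a portable stand-in for gethrtime(), the ex_* names are
hypothetical, and the wall-time conversion takes the current clock
readings as parameters (an assumption made so the arithmetic can be
checked in isolation).

```c
#define	_POSIX_C_SOURCE	200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define	EX_NANOSEC	1000000000LL

typedef int64_t ex_hrtime_t;	/* nanoseconds since an arbitrary point */
typedef int64_t ex_monosec_t;	/* seconds since that same point */

/* Portable stand-in for gethrtime(): a monotonic nanosecond clock. */
static ex_hrtime_t
ex_gethrtime(void)
{
	struct timespec	ts;
	int		rc = clock_gettime(CLOCK_MONOTONIC, &ts);

	assert(rc == 0);
	(void) rc;
	return ((ex_hrtime_t)ts.tv_sec * EX_NANOSEC + ts.tv_nsec);
}

/* A monosec is just the high-resolution time truncated to seconds. */
static ex_monosec_t
ex_hr_to_monosec(ex_hrtime_t hr)
{
	return (hr / EX_NANOSEC);
}

/* Current monotonic time in seconds. */
static ex_monosec_t
ex_monosec(void)
{
	return (ex_hr_to_monosec(ex_gethrtime()));
}

/*
 * Convert monotonic time `abs' to wall time, given simultaneous
 * readings of the monotonic clock (`mono_now') and the wall clock
 * (`wall_now').  The result reflects the system's current notion of
 * time, so it shifts if the wall clock is later adjusted.
 */
static time_t
ex_monosec_to_time(ex_monosec_t abs, ex_monosec_t mono_now, time_t wall_now)
{
	return (wall_now - (time_t)(mono_now - abs));
}
```

In use, a caller would pass ex_monosec() and time(NULL) as the two
"now" readings; they are parameters here only for testability.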
    393 
    394 One additional limitation of the `hrtime_t' and `monosec_t' types is
    395 that they are unaware of the passage of time across checkpoint/resume
    396 events (e.g., those generated by sys-suspend(1M)).  For example, if
    397 gethrtime() returns time T, and then the machine is suspended for 2
    398 hours, and then gethrtime() is called again, the time returned is not
    399 T + (2 * 60 * 60 * NANOSEC), but rather approximately still T.
    400 
    401 To work around this (and other checkpoint/resume related problems),
    402 when a system is resumed, the DHCP client makes the pessimistic
    403 assumption that all finite leases have expired while the machine was
    404 suspended and must be obtained again.  This is known as "refreshing"
    405 the leases, and is handled by refresh_smachs().
    406 
    407 Note that it might appear that a more intelligent approach would be to
    408 record the time(2) when the system is suspended, compare that against
    409 the time(2) when the system is resumed, and use the delta between them
    410 to decide which leases have expired.  Sadly, this cannot be done since
    411 through at least Solaris 10, it is not possible for userland programs
    412 to be notified of system suspend events.
    413 
    414 Configuration
    415 -------------
    416 
    417 For the most part, the DHCP client only *retrieves* configuration data
    418 from the DHCP server, leaving the configuration to scripts (such as
    419 boot scripts), which themselves use dhcpinfo(1) to retrieve the data
    420 from the DHCP client.  This is desirable because it keeps the mechanism
    421 of retrieving the configuration data decoupled from the policy of using
    422 the data.
    423 
    424 However, unless used in "inform" mode, the DHCP client *does*
    425 configure each IP interface enough to allow it to communicate with
    426 other hosts.  Specifically, the DHCP client configures the interface's
    427 IP address, netmask, and broadcast address using the information
    428 provided by the server.  Further, for IPv4 logical interface 0
    429 ("hme0"), any provided default routes are also configured.
    430 
    431 For IPv6, only the IP addresses are set.  The netmask (prefix) is then
    432 set automatically by in.ndpd, and routes are discovered in the usual
    433 way by router discovery or routing protocols.  DHCPv6 doesn't set
    434 routes.
    435 
    436 Since logical interfaces cannot be specified as output interfaces in
    437 the kernel forwarding table, and in most cases, logical interfaces
    438 share a default route with their associated physical interface, the
    439 DHCP client does not automatically add or remove default routes when
    440 IPv4 leases are acquired or expired on logical interfaces.
    441 
    442 Event Scripting
    443 ---------------
    444 
    445 The DHCP client supports user program invocations on DHCP events.  The
    446 supported events are BOUND, EXTEND, EXPIRE, DROP, RELEASE, and INFORM
    447 for DHCPv4, and BOUND6, EXTEND6, EXPIRE6, DROP6, LOSS6, RELEASE6, and
    448 INFORM6 for DHCPv6.  The user program runs asynchronously to the DHCP
    449 client so that the main event loop stays active to process other
    450 events, including events triggered by the user program (for example,
    451 when it invokes dhcpinfo).
    452 
    453 The user program execution is part of the transaction of a DHCP command.
    454 For example, if the user program is not enabled, the transaction of the
    455 DHCP command START is considered over when an ACK is received and the
    456 interface is configured successfully.  If the user program is enabled,
    457 it is invoked after the interface is configured successfully, and the
    458 transaction is considered over only when the user program exits.  The
    459 event scripting implementation makes use of the asynchronous operations
    460 discussed in the "Transactions" section.
    461 
    462 An upper bound of 58 seconds is imposed on how long the user program
    463 can run. If the user program does not exit after 55 seconds, the signal
    464 SIGTERM is sent to it. If it still does not exit after an additional 3
    465 seconds, the signal SIGKILL is sent to it.  Since the event handler is
    466 a wrapper around poll(), the DHCP client cannot directly observe the
    467 completion of the user program.  Instead, the DHCP client creates a
    468 child "helper" process to synchronously monitor the user program (this
    469 process is also used to send the aforementioned signals to the process,
    470 if necessary).  The DHCP client and the helper process share a pipe
    471 which is included in the set of poll descriptors monitored by the DHCP
    472 client's event handler.  When the user program exits, the helper process
    473 passes the user program exit status to the DHCP client through the pipe,
    474 informing the DHCP client that the user program has finished.  When the
    475 DHCP client is asked to shut down, it will wait for any running instances
    476 of the user program to complete.
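
The helper-process arrangement can be sketched as follows.  This is an
illustration under stated assumptions rather than the agent's code:
the ex_* names are hypothetical, the user program is run through
/bin/sh, the helper polls with waitpid(2) every 10 milliseconds, and
the two time limits are parameters in milliseconds (55000 and 3000
would correspond to the limits described above).

```c
#define	_POSIX_C_SOURCE	200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Helper body: run `cmd', send SIGTERM after `term_ms' and SIGKILL
 * after a further `kill_ms', then write the wait status to `wfd'.
 * Never returns.
 */
static void
ex_helper(const char *cmd, int term_ms, int kill_ms, int wfd)
{
	struct timespec	nap = { 0, 10 * 1000 * 1000 };	/* 10 ms */
	int		status = -1, elapsed = 0;
	int		sent_term = 0, sent_kill = 0;
	pid_t		pid = fork();

	if (pid == 0) {
		(void) execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
		_exit(127);
	}
	while (pid > 0 && waitpid(pid, &status, WNOHANG) == 0) {
		(void) nanosleep(&nap, NULL);
		elapsed += 10;
		if (!sent_term && elapsed >= term_ms) {
			(void) kill(pid, SIGTERM);
			sent_term = 1;
		} else if (!sent_kill && elapsed >= term_ms + kill_ms) {
			(void) kill(pid, SIGKILL);
			sent_kill = 1;
		}
	}
	(void) write(wfd, &status, sizeof (status));
	_exit(0);
}

/*
 * Start the helper.  Returns the read end of a pipe on which the
 * program's wait status (an int) will arrive, or -1 on failure.  The
 * caller would register this descriptor with its event handler and
 * must eventually reap the helper process.
 */
static int
ex_run_script(const char *cmd, int term_ms, int kill_ms)
{
	int	p[2];
	pid_t	helper;

	assert(cmd != NULL);
	if (pipe(p) == -1)
		return (-1);
	helper = fork();
	if (helper == -1) {
		(void) close(p[0]);
		(void) close(p[1]);
		return (-1);
	}
	if (helper == 0) {
		(void) close(p[0]);
		ex_helper(cmd, term_ms, kill_ms, p[1]);
	}
	(void) close(p[1]);
	return (p[0]);
}
```

Because the status arrives as data on a descriptor, the main loop can
learn of the program's exit through poll(2) like any other event,
without ever blocking in waitpid(2) itself.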
    477