/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

/*
 * Virtual Device Labels
 * ---------------------
 *
 * The vdev label serves several distinct purposes:
 *
 *	1. Uniquely identify this device as part of a ZFS pool and confirm its
 *	   identity within the pool.
 *
 *	2. Verify that all the devices given in a configuration are present
 *	   within the pool.
 *
 *	3. Determine the uberblock for the pool.
 *
 *	4. In case of an import operation, determine the configuration of the
 *	   toplevel vdev of which it is a part.
 *
 *	5. If an import operation cannot find all the devices in the pool,
 *	   provide enough information to the administrator to determine which
 *	   devices are missing.
 *
 * It is important to note that while the kernel is responsible for writing the
 * label, it only consumes the information in the first three cases.  The
 * latter information is only consumed in userland when determining the
 * configuration to import a pool.
 *
 *
 * Label Organization
 * ------------------
 *
 * Before describing the contents of the label, it's important to understand how
 * the labels are written and updated with respect to the uberblock.
 *
 * When the pool configuration is altered, either because it was newly created
 * or a device was added, we want to update all the labels such that we can deal
 * with fatal failure at any point.  To this end, each disk has two labels which
 * are updated before and after the uberblock is synced.  Assuming we have
 * labels and an uberblock with the following transaction groups:
 *
 *              L1          UB          L2
 *           +------+    +------+    +------+
 *           |      |    |      |    |      |
 *           | t10  |    | t10  |    | t10  |
 *           |      |    |      |    |      |
 *           +------+    +------+    +------+
 *
 * In this stable state, the labels and the uberblock were all updated within
 * the same transaction group (10).  Each label is mirrored and checksummed, so
 * that we can detect when we fail partway through writing the label.
 *
 * In order to identify which labels are valid, the labels are written in the
 * following manner:
 *
 *	1. For each vdev, update 'L1' to the new label
 *	2. Update the uberblock
 *	3. For each vdev, update 'L2' to the new label
 *
 * Given arbitrary failure, we can determine the correct label to use based on
 * the transaction group.  If we fail after updating L1 but before updating the
 * UB, we will notice that L1's transaction group is greater than the uberblock,
 * so L2 must be valid.  If we fail after writing the uberblock but before
 * writing L2, we will notice that L2's transaction group is less than L1, and
 * therefore L1 is valid.
 *
 * Another added complexity is that not every label is updated when the config
 * is synced.  If we add a single device, we do not want to have to re-write
 * every label for every device in the pool.  This means that both L1 and L2 may
 * be older than the pool uberblock, because the necessary information is stored
 * on another vdev.
 *
 *
 * On-disk Format
 * --------------
 *
 * The vdev label consists of two distinct parts, and is wrapped within the
 * vdev_label_t structure.  The label includes 8k of padding to permit legacy
 * VTOC disk labels; this padding is otherwise ignored.
 *
 * The first half of the label is a packed nvlist which contains pool wide
 * properties, per-vdev properties, and configuration information.  It is
 * described in more detail below.
 *
 * The latter half of the label consists of a redundant array of uberblocks.
 * These uberblocks are updated whenever a transaction group is committed,
 * or when the configuration is updated.  When a pool is loaded, we scan each
 * vdev for the 'best' uberblock.
 *
 *
 * Configuration Information
 * -------------------------
 *
 * The nvlist describing the pool and vdev contains the following elements:
 *
 *	version		ZFS on-disk version
 *	name		Pool name
 *	state		Pool state
 *	txg		Transaction group in which this label was written
 *	pool_guid	Unique identifier for this pool
 *	vdev_tree	An nvlist describing the vdev tree.
 *
 * Each leaf device label also contains the following:
 *
 *	top_guid	Unique ID for top-level vdev in which this is contained
 *	guid		Unique ID for the leaf vdev
 *
 * The 'vs' configuration follows the format described in 'spa_config.c'.
 */
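
/*
 * Illustrative sketch, not part of this file: the recovery rule described
 * under "Label Organization" can be read as a small pure function.  The names
 * ex_label_choice_t and ex_choose_label() are hypothetical, and the sketch
 * assumes the caller has already read the three transaction group numbers
 * from disk.  It ignores the case where the labels are legitimately older
 * than the uberblock because the configuration lives on another vdev.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: which of the two labels to trust. */
typedef enum { EX_USE_L1, EX_USE_L2 } ex_label_choice_t;

static ex_label_choice_t
ex_choose_label(uint64_t l1_txg, uint64_t ub_txg, uint64_t l2_txg)
{
	/* Labels are never written with a txg below the one they replace. */
	assert(l1_txg >= l2_txg);

	/* Failed after writing L1 but before the uberblock: L1 is too new. */
	if (l1_txg > ub_txg)
		return (EX_USE_L2);

	/* Failed after the uberblock but before L2: L2 is stale. */
	if (l2_txg < l1_txg)
		return (EX_USE_L1);

	/* Stable state: both labels agree, so either one will do. */
	return (EX_USE_L1);
}
```

/*
 * For example, with L1 at t11 and both the uberblock and L2 at t10, the
 * sketch picks L2; once the uberblock also reaches t11 it picks L1.
 */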

#include <sys/zfs_context.h>
#include <sys/spa.h>
#include <sys/spa_impl.h>
#include <sys/dmu.h>
#include <sys/zap.h>
#include <sys/vdev.h>
#include <sys/vdev_impl.h>
#include <sys/uberblock_impl.h>
#include <sys/metaslab.h>
#include <sys/zio.h>
#include <sys/fs/zfs.h>

/*
 * Basic routines to read and write from a vdev label.
 * Used throughout the rest of this file.
 */
uint64_t
vdev_label_offset(uint64_t psize, int l, uint64_t offset)
{
	ASSERT(offset < sizeof (vdev_label_t));
	ASSERT(P2PHASE_TYPED(psize, sizeof (vdev_label_t), uint64_t) == 0);

	return (offset + l * sizeof (vdev_label_t) + (l < VDEV_LABELS / 2 ?
	    0 : psize - VDEV_LABELS * sizeof (vdev_label_t)));
}

/*
 * Returns the number of the vdev label that contains the given offset,
 * or -1 if the offset does not fall within any label.
 */
int
vdev_label_number(uint64_t psize, uint64_t offset)
{
	int l;

	if (offset >= psize - VDEV_LABEL_END_SIZE) {
		offset -= psize - VDEV_LABEL_END_SIZE;
		offset += (VDEV_LABELS / 2) * sizeof (vdev_label_t);
	}
	l = offset / sizeof (vdev_label_t);
	return (l < VDEV_LABELS ? l : -1);
}
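
/*
 * Illustrative sketch, not part of this file: a user-space rendering of the
 * two routines above, useful for checking the arithmetic by hand.  The
 * constants are assumptions made for the example (four labels of 256K each,
 * with the trailing label area equal to two labels); the real values come
 * from the ZFS headers.  All ex_* and EX_* names are hypothetical.
 */

```c
#include <assert.h>
#include <stdint.h>

#define	EX_LABEL_SIZE	(256ULL << 10)	/* assumed sizeof (vdev_label_t) */
#define	EX_LABELS	4		/* assumed VDEV_LABELS */
#define	EX_END_SIZE	((EX_LABELS / 2) * EX_LABEL_SIZE)

/* Map (label number, offset within label) to a device offset. */
static uint64_t
ex_label_offset(uint64_t psize, int l, uint64_t offset)
{
	assert(l >= 0 && l < EX_LABELS);
	assert(offset < EX_LABEL_SIZE);
	assert(psize % EX_LABEL_SIZE == 0);

	/* The first half of the labels sit at the front, the rest at the end. */
	return (offset + l * EX_LABEL_SIZE + (l < EX_LABELS / 2 ?
	    0 : psize - EX_LABELS * EX_LABEL_SIZE));
}

/* Inverse mapping: device offset to label number, or -1 if in no label. */
static int
ex_label_number(uint64_t psize, uint64_t offset)
{
	uint64_t l;

	if (offset >= psize - EX_END_SIZE) {
		offset -= psize - EX_END_SIZE;
		offset += (EX_LABELS / 2) * EX_LABEL_SIZE;
	}
	l = offset / EX_LABEL_SIZE;
	return (l < EX_LABELS ? (int)l : -1);
}
```

/*
 * Under these assumptions, on a 1G device labels 0 and 1 start at offsets 0
 * and 256K, while labels 2 and 3 start 512K and 256K before the end.
 */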

static void
vdev_label_read(zio_t *zio, vdev_t *vd, int l, void *buf, uint64_t offset,
	uint64_t size, zio_done_func_t *done, void *private, int flags)
{
	ASSERT(spa_config_held(zio->io_spa, SCL_STATE_ALL, RW_WRITER) ==
	    SCL_STATE_ALL);
	ASSERT(flags & ZIO_FLAG_CONFIG_WRITER);

	zio_nowait(zio_read_phys(zio, vd,
	    vdev_label_offset(vd->vdev_psize, l, offset),
	    size, buf, ZIO_CHECKSUM_LABEL, done, private,
	    ZIO_PRIORITY_SYNC_READ, flags, B_TRUE));
}

static void
vdev_label_write(zio_t *zio, vdev_t *vd, int l, void *buf, uint64_t offset,
	uint64_t size, zio_done_func_t *done, void *private, int flags)
{
	ASSERT(spa_config_held(zio->io_spa, SCL_ALL, RW_WRITER) == SCL_ALL ||
	    (spa_config_held(zio->io_spa, SCL_CONFIG | SCL_STATE, RW_READER) ==
	    (SCL_CONFIG | SCL_STATE) &&
	    dsl_pool_sync_context(spa_get_dsl(zio->io_spa))));
	ASSERT(flags & ZIO_FLAG_CONFIG_WRITER);

	zio_nowait(zio_write_phys(zio, vd,
	    vdev_label_offset(vd->vdev_psize, l, offset),
	    size, buf, ZIO_CHECKSUM_LABEL, done, private,
	    ZIO_PRIORITY_SYNC_WRITE, flags, B_TRUE));
}

/*
 * Generate the nvlist representing this vdev's config.
 */
nvlist_t *
vdev_config_generate(spa_t *spa, vdev_t *vd, boolean_t getstats,
    boolean_t isspare, boolean_t isl2cache)
{
	nvlist_t *nv = NULL;

	VERIFY(nvlist_alloc(&nv, NV_UNIQUE_NAME, KM_SLEEP) == 0);

	VERIFY(nvlist_add_string(nv, ZPOOL_CONFIG_TYPE,
	    vd->vdev_ops->vdev_op_type) == 0);
	if (!isspare && !isl2cache)
		VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_ID, vd->vdev_id)
		    == 0);
	VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_GUID, vd->vdev_guid) == 0);

	if (vd->vdev_path != NULL)
		VERIFY(nvlist_add_string(nv, ZPOOL_CONFIG_PATH,
		    vd->vdev_path) == 0);

	if (vd->vdev_devid != NULL)
		VERIFY(nvlist_add_string(nv, ZPOOL_CONFIG_DEVID,
		    vd->vdev_devid) == 0);

	if (vd->vdev_physpath != NULL)
		VERIFY(nvlist_add_string(nv, ZPOOL_CONFIG_PHYS_PATH,
		    vd->vdev_physpath) == 0);

	if (vd->vdev_fru != NULL)
		VERIFY(nvlist_add_string(nv, ZPOOL_CONFIG_FRU,
		    vd->vdev_fru) == 0);

	if (vd->vdev_nparity != 0) {
		ASSERT(strcmp(vd->vdev_ops->vdev_op_type,
		    VDEV_TYPE_RAIDZ) == 0);

		/*
		 * Make sure someone hasn't managed to sneak a fancy new vdev
		 * into a crufty old storage pool.
		 */
		ASSERT(vd->vdev_nparity == 1 ||
		    (vd->vdev_nparity <= 2 &&
		    spa_version(spa) >= SPA_VERSION_RAIDZ2) ||
		    (vd->vdev_nparity <= 3 &&
		    spa_version(spa) >= SPA_VERSION_RAIDZ3));

		/*
		 * Note that we'll add the nparity tag even on storage pools
		 * that only support a single parity device -- older software
		 * will just ignore it.
		 */
		VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_NPARITY,
		    vd->vdev_nparity) == 0);
	}

	if (vd->vdev_wholedisk != -1ULL)
		VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_WHOLE_DISK,
		    vd->vdev_wholedisk) == 0);

	if (vd->vdev_not_present)
		VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_NOT_PRESENT, 1) == 0);

	if (vd->vdev_isspare)
		VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_IS_SPARE, 1) == 0);

	if (!isspare && !isl2cache && vd == vd->vdev_top) {
		VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_METASLAB_ARRAY,
		    vd->vdev_ms_array) == 0);
		VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_METASLAB_SHIFT,
		    vd->vdev_ms_shift) == 0);
		VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_ASHIFT,
		    vd->vdev_ashift) == 0);
		VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_ASIZE,
		    vd->vdev_asize) == 0);
		VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_IS_LOG,
		    vd->vdev_islog) == 0);
	}

	if (vd->vdev_dtl_smo.smo_object != 0)
		VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_DTL,
		    vd->vdev_dtl_smo.smo_object) == 0);

	if (vd->vdev_crtxg)
		VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_CREATE_TXG,
		    vd->vdev_crtxg) == 0);

	if (getstats) {
		vdev_stat_t vs;
		vdev_get_stats(vd, &vs);
		VERIFY(nvlist_add_uint64_array(nv, ZPOOL_CONFIG_STATS,
		    (uint64_t *)&vs, sizeof (vs) / sizeof (uint64_t)) == 0);
	}

	if (!vd->vdev_ops->vdev_op_leaf) {
		nvlist_t **child;
		int c;

		ASSERT(!vd->vdev_ishole);

		child = kmem_alloc(vd->vdev_children * sizeof (nvlist_t *),
		    KM_SLEEP);

		for (c = 0; c < vd->vdev_children; c++)
			child[c] = vdev_config_generate(spa, vd->vdev_child[c],
			    getstats, isspare, isl2cache);

		VERIFY(nvlist_add_nvlist_array(nv, ZPOOL_CONFIG_CHILDREN,
		    child, vd->vdev_children) == 0);

		for (c = 0; c < vd->vdev_children; c++)
			nvlist_free(child[c]);

		kmem_free(child, vd->vdev_children * sizeof (nvlist_t *));

	} else {
		const char *aux = NULL;

		if (vd->vdev_offline && !vd->vdev_tmpoffline)
			VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_OFFLINE,
			    B_TRUE) == 0);
		if (vd->vdev_faulted)
			VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_FAULTED,
			    B_TRUE) == 0);
		if (vd->vdev_degraded)
			VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_DEGRADED,
			    B_TRUE) == 0);
		if (vd->vdev_removed)
			VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_REMOVED,
			    B_TRUE) == 0);
		if (vd->vdev_unspare)
			VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_UNSPARE,
			    B_TRUE) == 0);
		if (vd->vdev_ishole)
			VERIFY(nvlist_add_uint64(nv, ZPOOL_CONFIG_IS_HOLE,
			    B_TRUE) == 0);

		switch (vd->vdev_stat.vs_aux) {
		case VDEV_AUX_ERR_EXCEEDED:
			aux = "err_exceeded";
			break;

		case VDEV_AUX_EXTERNAL:
			aux = "external";
			break;
		}

		if (aux != NULL)
			VERIFY(nvlist_add_string(nv, ZPOOL_CONFIG_AUX_STATE,
			    aux) == 0);
	}

	return (nv);
}

/*
 * Generate a view of the top-level vdevs.  If we currently have holes
 * in the namespace, then generate an array which contains a list of holey
 * vdevs.  Additionally, add the number of top-level children that currently
 * exist.
 */
void
vdev_top_config_generate(spa_t *spa, nvlist_t *config)
{
	vdev_t *rvd = spa->spa_root_vdev;
	uint64_t *array;
	uint_t idx;

	array = kmem_alloc(rvd->vdev_children * sizeof (uint64_t), KM_SLEEP);

	idx = 0;
	for (int c = 0; c < rvd->vdev_children; c++) {
		vdev_t *tvd = rvd->vdev_child[c];

		if (tvd->vdev_ishole)
			array[idx++] = c;
	}

	if (idx) {
		VERIFY(nvlist_add_uint64_array(config, ZPOOL_CONFIG_HOLE_ARRAY,
		    array, idx) == 0);
	}

	VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_VDEV_CHILDREN,
	    rvd->vdev_children) == 0);

	kmem_free(array, rvd->vdev_children * sizeof (uint64_t));
}
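
/*
 * Illustrative sketch, not part of this file: the hole-collection loop above,
 * reduced to plain arrays so it can be exercised outside the kernel.  The
 * function name ex_collect_holes() and the use of an int flag array in place
 * of the vdev_ishole field are assumptions made for the example.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Write the index of every entry flagged as a hole into 'out', which must
 * have room for 'n' entries.  Returns the number of indices written.
 */
static size_t
ex_collect_holes(const int *ishole, size_t n, uint64_t *out)
{
	size_t idx = 0;

	assert(ishole != NULL && out != NULL);

	for (size_t c = 0; c < n; c++) {
		if (ishole[c])
			out[idx++] = c;
	}

	return (idx);
}
```

/*
 * As in the routine above, the output buffer is sized for the worst case
 * (every child is a hole) and only the first 'idx' entries are meaningful.
 */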

nvlist_t *
vdev_label_read_config(vdev_t *vd)
{
	spa_t *spa = vd->vdev_spa;
	nvlist_t *config = NULL;
	vdev_phys_t *vp;
	zio_t *zio;
	int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL |
	    ZIO_FLAG_SPECULATIVE;

	ASSERT(spa_config_held(spa, SCL_STATE_ALL, RW_WRITER) == SCL_STATE_ALL);

	if (!vdev_readable(vd))
		return (NULL);

	vp = zio_buf_alloc(sizeof (vdev_phys_t));

retry:
	for (int l = 0; l < VDEV_LABELS; l++) {

		zio = zio_root(spa, NULL, NULL, flags);

		vdev_label_read(zio, vd, l, vp,
		    offsetof(vdev_label_t, vl_vdev_phys),
		    sizeof (vdev_phys_t), NULL, NULL, flags);

		if (zio_wait(zio) == 0 &&
		    nvlist_unpack(vp->vp_nvlist, sizeof (vp->vp_nvlist),
		    &config, 0) == 0)
			break;

		if (config != NULL) {
			nvlist_free(config);
			config = NULL;
		}
	}

	if (config == NULL && !(flags & ZIO_FLAG_TRYHARD)) {
		flags |= ZIO_FLAG_TRYHARD;
		goto retry;
	}

	zio_buf_free(vp, sizeof (vdev_phys_t));

	return (config);
}

/*
 * Determine if a device is in use.  The 'spare_guid' parameter will be filled
 * in with the device guid if this spare is active elsewhere on the system.
 */
static boolean_t
vdev_inuse(vdev_t *vd, uint64_t crtxg, vdev_labeltype_t reason,
    uint64_t *spare_guid, uint64_t *l2cache_guid)
{
	spa_t *spa = vd->vdev_spa;
	uint64_t state, pool_guid, device_guid, txg, spare_pool;
	uint64_t vdtxg = 0;
	nvlist_t *label;

	if (spare_guid)
		*spare_guid = 0ULL;
	if (l2cache_guid)
		*l2cache_guid = 0ULL;

	/*
	 * Read the label, if any, and perform some basic sanity checks.
	 */
	if ((label = vdev_label_read_config(vd)) == NULL)
		return (B_FALSE);

	(void) nvlist_lookup_uint64(label, ZPOOL_CONFIG_CREATE_TXG,
	    &vdtxg);

	if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_STATE,
	    &state) != 0 ||
	    nvlist_lookup_uint64(label, ZPOOL_CONFIG_GUID,
	    &device_guid) != 0) {
		nvlist_free(label);
		return (B_FALSE);
	}

	if (state != POOL_STATE_SPARE && state != POOL_STATE_L2CACHE &&
	    (nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_GUID,
	    &pool_guid) != 0 ||
	    nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_TXG,
	    &txg) != 0)) {
		nvlist_free(label);
		return (B_FALSE);
	}

	nvlist_free(label);

	/*
	 * Check to see if this device indeed belongs to the pool it claims to
	 * be a part of.  The only way this is allowed is if the device is a hot
	 * spare (which we check for later on).
	 */
	if (state != POOL_STATE_SPARE && state != POOL_STATE_L2CACHE &&
	    !spa_guid_exists(pool_guid, device_guid) &&
	    !spa_spare_exists(device_guid, NULL, NULL) &&
	    !spa_l2cache_exists(device_guid, NULL))
		return (B_FALSE);

	/*
	 * If the transaction group is zero, then this is an initialized (but
	 * unused) label.  This is only an error if the create transaction
	 * on-disk is the same as the one we're using now, in which case the
	 * user has attempted to add the same vdev multiple times in the same
	 * transaction.
	 */
	if (state != POOL_STATE_SPARE && state != POOL_STATE_L2CACHE &&
	    txg == 0 && vdtxg == crtxg)
		return (B_TRUE);

	/*
	 * Check to see if this is a spare device.  We do an explicit check for
	 * spa_has_spare() here because it may be on our pending list of spares
	 * to add.  We also check if it is an l2cache device.
	 */
	if (spa_spare_exists(device_guid, &spare_pool, NULL) ||
	    spa_has_spare(spa, device_guid)) {
		if (spare_guid)
			*spare_guid = device_guid;

		switch (reason) {
		case VDEV_LABEL_CREATE:
		case VDEV_LABEL_L2CACHE:
			return (B_TRUE);

		case VDEV_LABEL_REPLACE:
			return (!spa_has_spare(spa, device_guid) ||
			    spare_pool != 0ULL);

		case VDEV_LABEL_SPARE:
			return (spa_has_spare(spa, device_guid));
		}
	}

	/*
	 * Check to see if this is an l2cache device.
	 */
	if (spa_l2cache_exists(device_guid, NULL))
		return (B_TRUE);

	/*
	 * If the device is marked ACTIVE, then this device is in use by another
	 * pool on the system.
	 */
	return (state == POOL_STATE_ACTIVE);
}

/*
 * Initialize a vdev label.  We check to make sure each leaf device is not in
 * use, and writable.  We put down an initial label which we will later
 * overwrite with a complete label.  Note that it's important to do this
 * sequentially, not in parallel, so that we catch cases of multiple use of the
 * same leaf vdev in the vdev we're creating -- e.g. mirroring a disk with
 * itself.
 */
int
vdev_label_init(vdev_t *vd, uint64_t crtxg, vdev_labeltype_t reason)
{
	spa_t *spa = vd->vdev_spa;
	nvlist_t *label;
	vdev_phys_t *vp;
	char *pad2;
	uberblock_t *ub;
	zio_t *zio;
	char *buf;
	size_t buflen;
	int error;
	uint64_t spare_guid, l2cache_guid;
	int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL;

	ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);

	for (int c = 0; c < vd->vdev_children; c++)
		if ((error = vdev_label_init(vd->vdev_child[c],
		    crtxg, reason)) != 0)
			return (error);

	/* Track the creation time for this vdev */
	vd->vdev_crtxg = crtxg;

	if (!vd->vdev_ops->vdev_op_leaf)
		return (0);

	/*
	 * Dead vdevs cannot be initialized.
	 */
	if (vdev_is_dead(vd))
		return (EIO);

	/*
	 * Determine if the vdev is in use.
	 */
	if (reason != VDEV_LABEL_REMOVE &&
	    vdev_inuse(vd, crtxg, reason, &spare_guid, &l2cache_guid))
		return (EBUSY);

	/*
	 * If this is a request to add or replace a spare or l2cache device
	 * that is in use elsewhere on the system, then we must update the
	 * guid (which was initialized to a random value) to reflect the
	 * actual GUID (which is shared between multiple pools).
	 */
	if (reason != VDEV_LABEL_REMOVE && reason != VDEV_LABEL_L2CACHE &&
	    spare_guid != 0ULL) {
		uint64_t guid_delta = spare_guid - vd->vdev_guid;

		vd->vdev_guid += guid_delta;

		for (vdev_t *pvd = vd; pvd != NULL; pvd = pvd->vdev_parent)
			pvd->vdev_guid_sum += guid_delta;

		/*
		 * If this is a replacement, then we want to fall through to
		 * the rest of the code.  If we're adding a spare, then it's
		 * already labeled appropriately and we can just return.
		 */
		if (reason == VDEV_LABEL_SPARE)
			return (0);
		ASSERT(reason == VDEV_LABEL_REPLACE);
	}

	if (reason != VDEV_LABEL_REMOVE && reason != VDEV_LABEL_SPARE &&
	    l2cache_guid != 0ULL) {
		uint64_t guid_delta = l2cache_guid - vd->vdev_guid;

		vd->vdev_guid += guid_delta;

		for (vdev_t *pvd = vd; pvd != NULL; pvd = pvd->vdev_parent)
			pvd->vdev_guid_sum += guid_delta;

		/*
		 * If this is a replacement, then we want to fall through to
		 * the rest of the code.  If we're adding an l2cache, then it's
		 * already labeled appropriately and we can just return.
		 */
		if (reason == VDEV_LABEL_L2CACHE)
			return (0);
		ASSERT(reason == VDEV_LABEL_REPLACE);
	}

	/*
	 * Initialize its label.
	 */
	vp = zio_buf_alloc(sizeof (vdev_phys_t));
	bzero(vp, sizeof (vdev_phys_t));

	/*
	 * Generate a label describing the pool and our top-level vdev.
	 * We mark it as being from txg 0 to indicate that it's not
	 * really part of an active pool just yet.  The labels will
	 * be written again with a meaningful txg by spa_sync().
	 */
	if (reason == VDEV_LABEL_SPARE ||
	    (reason == VDEV_LABEL_REMOVE && vd->vdev_isspare)) {
		/*
		 * For inactive hot spares, we generate a special label that
		 * identifies as a mutually shared hot spare.  We write the
		 * label if we are adding a hot spare, or if we are removing an
		 * active hot spare (in which case we want to revert the
		 * labels).
		 */
		VERIFY(nvlist_alloc(&label, NV_UNIQUE_NAME, KM_SLEEP) == 0);

		VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_VERSION,
		    spa_version(spa)) == 0);
		VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_POOL_STATE,
		    POOL_STATE_SPARE) == 0);
		VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_GUID,
		    vd->vdev_guid) == 0);
	} else if (reason == VDEV_LABEL_L2CACHE ||
	    (reason == VDEV_LABEL_REMOVE && vd->vdev_isl2cache)) {
		/*
		 * For level 2 ARC devices, add a special label.
		 */
		VERIFY(nvlist_alloc(&label, NV_UNIQUE_NAME, KM_SLEEP) == 0);

		VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_VERSION,
		    spa_version(spa)) == 0);
		VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_POOL_STATE,
		    POOL_STATE_L2CACHE) == 0);
		VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_GUID,
		    vd->vdev_guid) == 0);
	} else {
		label = spa_config_generate(spa, vd, 0ULL, B_FALSE);

		/*
		 * Add our creation time.  This allows us to detect multiple
		 * vdev uses as described above, and automatically expires if we
		 * fail.
		 */
		VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_CREATE_TXG,
		    crtxg) == 0);
	}

	buf = vp->vp_nvlist;
	buflen = sizeof (vp->vp_nvlist);

	error = nvlist_pack(label, &buf, &buflen, NV_ENCODE_XDR, KM_SLEEP);
	if (error != 0) {
		nvlist_free(label);
		zio_buf_free(vp, sizeof (vdev_phys_t));
		/* EFAULT means nvlist_pack ran out of room */
		return (error == EFAULT ? ENAMETOOLONG : EINVAL);
	}

	/*
	 * Initialize uberblock template.
	 */
	ub = zio_buf_alloc(VDEV_UBERBLOCK_RING);
	bzero(ub, VDEV_UBERBLOCK_RING);
	*ub = spa->spa_uberblock;
	ub->ub_txg = 0;

	/* Initialize the 2nd padding area. */
	pad2 = zio_buf_alloc(VDEV_PAD_SIZE);
	bzero(pad2, VDEV_PAD_SIZE);

	/*
	 * Write everything in parallel.
	 */
retry:
	zio = zio_root(spa, NULL, NULL, flags);

	for (int l = 0; l < VDEV_LABELS; l++) {

		vdev_label_write(zio, vd, l, vp,
		    offsetof(vdev_label_t, vl_vdev_phys),
		    sizeof (vdev_phys_t), NULL, NULL, flags);

		/*
		 * Skip the 1st padding area.
		 * Zero out the 2nd padding area, which might contain
		 * leftover data from a previous filesystem format.
		 */
		vdev_label_write(zio, vd, l, pad2,
		    offsetof(vdev_label_t, vl_pad2),
		    VDEV_PAD_SIZE, NULL, NULL, flags);

		vdev_label_write(zio, vd, l, ub,
		    offsetof(vdev_label_t, vl_uberblock),
		    VDEV_UBERBLOCK_RING, NULL, NULL, flags);
	}

	error = zio_wait(zio);

	if (error != 0 && !(flags & ZIO_FLAG_TRYHARD)) {
		flags |= ZIO_FLAG_TRYHARD;
		goto retry;
	}

	nvlist_free(label);
	zio_buf_free(pad2, VDEV_PAD_SIZE);
	zio_buf_free(ub, VDEV_UBERBLOCK_RING);
	zio_buf_free(vp, sizeof (vdev_phys_t));

	/*
	 * If this vdev hasn't been previously identified as a spare, then we
	 * mark it as such only if a) we are labeling it as a spare, or b) it
	 * exists as a spare elsewhere in the system.  Do the same for
	 * level 2 ARC devices.
	 */
	if (error == 0 && !vd->vdev_isspare &&
	    (reason == VDEV_LABEL_SPARE ||
	    spa_spare_exists(vd->vdev_guid, NULL, NULL)))
		spa_spare_add(vd);

	if (error == 0 && !vd->vdev_isl2cache &&
	    (reason == VDEV_LABEL_L2CACHE ||
	    spa_l2cache_exists(vd->vdev_guid, NULL)))
		spa_l2cache_add(vd);

	return (error);
}
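
/*
 * Illustrative sketch, not part of this file: the guid_delta adjustment used
 * in vdev_label_init() when a spare or l2cache device takes over a shared
 * GUID.  The ex_vdev_t type and ex_set_guid() are hypothetical stand-ins for
 * vdev_t and the inline code above.  The sketch assumes that each guid_sum
 * holds the vdev's own guid plus the guids of all of its descendants.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ex_vdev {
	uint64_t guid;		/* this vdev's guid */
	uint64_t guid_sum;	/* assumed: own guid plus all descendants */
	struct ex_vdev *parent;	/* NULL at the root */
} ex_vdev_t;

/* Change vd's guid and keep every guid_sum on the path to the root valid. */
static void
ex_set_guid(ex_vdev_t *vd, uint64_t new_guid)
{
	/* Unsigned subtraction wraps modulo 2^64, so a smaller guid works. */
	uint64_t guid_delta = new_guid - vd->guid;

	assert(vd != NULL);

	vd->guid += guid_delta;

	/* The walk starts at vd itself, so its own sum is adjusted too. */
	for (ex_vdev_t *pvd = vd; pvd != NULL; pvd = pvd->parent)
		pvd->guid_sum += guid_delta;
}
```

/*
 * Applying one delta along the parent chain avoids recomputing each sum from
 * the children; siblings of the changed vdev are left untouched.
 */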

/*
 * ==========================================================================
 * uberblock load/sync
 * ==========================================================================
 */

/*
 * Consider the following situation: txg is safely synced to disk.  We've
 * written the first uberblock for txg + 1, and then we lose power.  When we
 * come back up, we fail to see the uberblock for txg + 1 because, say,
 * it was on a mirrored device and the replica to which we wrote txg + 1
 * is now offline.  If we then make some changes and sync txg + 1, and then
 * the missing replica comes back, then for a few seconds we'll have two
 * conflicting uberblocks on disk with the same txg.  The solution is simple:
 * among uberblocks with equal txg, choose the one with the latest timestamp.
 */
static int
vdev_uberblock_compare(uberblock_t *ub1, uberblock_t *ub2)
{
	if (ub1->ub_txg < ub2->ub_txg)
		return (-1);
	if (ub1->ub_txg > ub2->ub_txg)
		return (1);

	if (ub1->ub_timestamp < ub2->ub_timestamp)
		return (-1);
	if (ub1->ub_timestamp > ub2->ub_timestamp)
		return (1);

	return (0);
}
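
/*
 * Illustrative sketch, not part of this file: picking the 'best' uberblock
 * from a flat array using the same ordering (txg first, then timestamp).
 * The ex_ub_t type, ex_ub_compare() and ex_ub_best() are hypothetical; the
 * max_txg parameter is assumed to play the role of a load-time cap on
 * acceptable transaction groups.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ex_ub {
	uint64_t txg;
	uint64_t timestamp;
} ex_ub_t;

/* Order by txg, breaking ties with the later timestamp. */
static int
ex_ub_compare(const ex_ub_t *a, const ex_ub_t *b)
{
	if (a->txg != b->txg)
		return (a->txg < b->txg ? -1 : 1);
	if (a->timestamp != b->timestamp)
		return (a->timestamp < b->timestamp ? -1 : 1);
	return (0);
}

/* Return the best candidate with txg <= max_txg, or NULL if none. */
static const ex_ub_t *
ex_ub_best(const ex_ub_t *ubs, size_t n, uint64_t max_txg)
{
	const ex_ub_t *best = NULL;

	assert(ubs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (ubs[i].txg > max_txg)
			continue;
		if (best == NULL || ex_ub_compare(&ubs[i], best) > 0)
			best = &ubs[i];
	}

	return (best);
}
```

/*
 * With two candidates at the same txg, the one carrying the later timestamp
 * wins, which is the conflict case described in the comment above.
 */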
    806 
    807 static void
    808 vdev_uberblock_load_done(zio_t *zio)
    809 {
    810 	spa_t *spa = zio->io_spa;
    811 	zio_t *rio = zio->io_private;
    812 	uberblock_t *ub = zio->io_data;
    813 	uberblock_t *ubbest = rio->io_private;
    814 
    815 	ASSERT3U(zio->io_size, ==, VDEV_UBERBLOCK_SIZE(zio->io_vd));
    816 
    817 	if (zio->io_error == 0 && uberblock_verify(ub) == 0) {
    818 		mutex_enter(&rio->io_lock);
    819 		if (ub->ub_txg <= spa->spa_load_max_txg &&
    820 		    vdev_uberblock_compare(ub, ubbest) > 0)
    821 			*ubbest = *ub;
    822 		mutex_exit(&rio->io_lock);
    823 	}
    824 
    825 	zio_buf_free(zio->io_data, zio->io_size);
    826 }

/*
 * Read every uberblock slot in every label of every readable leaf under vd,
 * leaving the best uberblock found in *ubbest.  The top-level call passes
 * the root vdev with a NULL zio; the recursive calls share the root zio
 * created here.
 */
void
vdev_uberblock_load(zio_t *zio, vdev_t *vd, uberblock_t *ubbest)
{
	spa_t *spa = vd->vdev_spa;
	vdev_t *rvd = spa->spa_root_vdev;
	int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL |
	    ZIO_FLAG_SPECULATIVE | ZIO_FLAG_TRYHARD;

	if (vd == rvd) {
		ASSERT(zio == NULL);
		spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
		zio = zio_root(spa, NULL, ubbest, flags);
		bzero(ubbest, sizeof (uberblock_t));
	}

	ASSERT(zio != NULL);

	for (int c = 0; c < vd->vdev_children; c++)
		vdev_uberblock_load(zio, vd->vdev_child[c], ubbest);

	if (vd->vdev_ops->vdev_op_leaf && vdev_readable(vd)) {
		for (int l = 0; l < VDEV_LABELS; l++) {
			for (int n = 0; n < VDEV_UBERBLOCK_COUNT(vd); n++) {
				vdev_label_read(zio, vd, l,
				    zio_buf_alloc(VDEV_UBERBLOCK_SIZE(vd)),
				    VDEV_UBERBLOCK_OFFSET(vd, n),
				    VDEV_UBERBLOCK_SIZE(vd),
				    vdev_uberblock_load_done, zio, flags);
			}
		}
	}

	if (vd == rvd) {
		(void) zio_wait(zio);
		spa_config_exit(spa, SCL_ALL, FTAG);
	}
}

/*
 * On success, increment root zio's count of good writes.
 * We only get credit for writes to known-visible vdevs; see spa_vdev_add().
 */
static void
vdev_uberblock_sync_done(zio_t *zio)
{
	uint64_t *good_writes = zio->io_private;

	if (zio->io_error == 0 && zio->io_vd->vdev_top->vdev_ms_array != 0)
		atomic_add_64(good_writes, 1);
}

/*
 * Write the uberblock to all labels of all leaves of the specified vdev.
 */
static void
vdev_uberblock_sync(zio_t *zio, uberblock_t *ub, vdev_t *vd, int flags)
{
	uberblock_t *ubbuf;
	int n;

	for (int c = 0; c < vd->vdev_children; c++)
		vdev_uberblock_sync(zio, ub, vd->vdev_child[c], flags);

	if (!vd->vdev_ops->vdev_op_leaf)
		return;

	if (!vdev_writeable(vd))
		return;

	n = ub->ub_txg & (VDEV_UBERBLOCK_COUNT(vd) - 1);

	ubbuf = zio_buf_alloc(VDEV_UBERBLOCK_SIZE(vd));
	bzero(ubbuf, VDEV_UBERBLOCK_SIZE(vd));
	*ubbuf = *ub;

	for (int l = 0; l < VDEV_LABELS; l++)
		vdev_label_write(zio, vd, l, ubbuf,
		    VDEV_UBERBLOCK_OFFSET(vd, n), VDEV_UBERBLOCK_SIZE(vd),
		    vdev_uberblock_sync_done, zio->io_private,
		    flags | ZIO_FLAG_DONT_PROPAGATE);

	zio_buf_free(ubbuf, VDEV_UBERBLOCK_SIZE(vd));
}
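
/*
 * Illustrative sketch, not part of the original source: the slot
 * computation in vdev_uberblock_sync() masks the txg with (count - 1),
 * which equals txg % count only when count is a power of two, so
 * successive txgs land in successive slots and wrap around.  That
 * VDEV_UBERBLOCK_COUNT() always yields a power of two is an assumption
 * here, since its definition is not shown in this part of the file.
 * demo_ub_slot() is a hypothetical helper.
 */
#include <assert.h>
#include <stdint.h>

static uint64_t
demo_ub_slot(uint64_t txg, uint64_t count)
{
	/* The mask is only a valid modulus for a nonzero power of two. */
	assert(count != 0 && (count & (count - 1)) == 0);
	return (txg & (count - 1));
}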

/*
 * Write the uberblock to every vdev in svd[], then flush those vdevs.
 * Returns 0 if at least one write was counted as good, EIO otherwise.
 */
int
vdev_uberblock_sync_list(vdev_t **svd, int svdcount, uberblock_t *ub, int flags)
{
	spa_t *spa = svd[0]->vdev_spa;
	zio_t *zio;
	uint64_t good_writes = 0;

	zio = zio_root(spa, NULL, &good_writes, flags);

	for (int v = 0; v < svdcount; v++)
		vdev_uberblock_sync(zio, ub, svd[v], flags);

	(void) zio_wait(zio);

	/*
	 * Flush the uberblocks to disk.  This ensures that the odd labels
	 * are no longer needed (because the new uberblocks and the even
	 * labels are safely on disk), so it is safe to overwrite them.
	 */
	zio = zio_root(spa, NULL, NULL, flags);

	for (int v = 0; v < svdcount; v++)
		zio_flush(zio, svd[v]);

	(void) zio_wait(zio);

	return (good_writes >= 1 ? 0 : EIO);
}

/*
 * On success, increment the count of good writes for our top-level vdev.
 */
static void
vdev_label_sync_done(zio_t *zio)
{
	uint64_t *good_writes = zio->io_private;

	if (zio->io_error == 0)
		atomic_add_64(good_writes, 1);
}

/*
 * If there weren't enough good writes, indicate failure to the parent.
 */
static void
vdev_label_sync_top_done(zio_t *zio)
{
	uint64_t *good_writes = zio->io_private;

	if (*good_writes == 0)
		zio->io_error = EIO;

	kmem_free(good_writes, sizeof (uint64_t));
}

/*
 * We ignore errors for log and cache devices; simply free the private data.
 */
static void
vdev_label_sync_ignore_done(zio_t *zio)
{
	kmem_free(zio->io_private, sizeof (uint64_t));
}

/*
 * Write all even or odd labels to all leaves of the specified vdev.
 */
static void
vdev_label_sync(zio_t *zio, vdev_t *vd, int l, uint64_t txg, int flags)
{
	nvlist_t *label;
	vdev_phys_t *vp;
	char *buf;
	size_t buflen;

	for (int c = 0; c < vd->vdev_children; c++)
		vdev_label_sync(zio, vd->vdev_child[c], l, txg, flags);

	if (!vd->vdev_ops->vdev_op_leaf)
		return;

	if (!vdev_writeable(vd))
		return;

	/*
	 * Generate a label describing the top-level config to which we belong.
	 */
	label = spa_config_generate(vd->vdev_spa, vd, txg, B_FALSE);

	vp = zio_buf_alloc(sizeof (vdev_phys_t));
	bzero(vp, sizeof (vdev_phys_t));

	buf = vp->vp_nvlist;
	buflen = sizeof (vp->vp_nvlist);

	if (nvlist_pack(label, &buf, &buflen, NV_ENCODE_XDR, KM_SLEEP) == 0) {
		for (; l < VDEV_LABELS; l += 2) {
			vdev_label_write(zio, vd, l, vp,
			    offsetof(vdev_label_t, vl_vdev_phys),
			    sizeof (vdev_phys_t),
			    vdev_label_sync_done, zio->io_private,
			    flags | ZIO_FLAG_DONT_PROPAGATE);
		}
	}

	zio_buf_free(vp, sizeof (vdev_phys_t));
	nvlist_free(label);
}

/*
 * Write the even (l == 0) or odd (l == 1) labels of every vdev on the
 * config dirty list, then flush those vdevs.  A vdev reports EIO if none
 * of its label writes succeeded, except that errors on log and aux vdevs
 * are ignored.
 */
int
vdev_label_sync_list(spa_t *spa, int l, uint64_t txg, int flags)
{
	list_t *dl = &spa->spa_config_dirty_list;
	vdev_t *vd;
	zio_t *zio;
	int error;

	/*
	 * Write the new labels to disk.
	 */
	zio = zio_root(spa, NULL, NULL, flags);

	for (vd = list_head(dl); vd != NULL; vd = list_next(dl, vd)) {
		uint64_t *good_writes = kmem_zalloc(sizeof (uint64_t),
		    KM_SLEEP);

		ASSERT(!vd->vdev_ishole);

		zio_t *vio = zio_null(zio, spa, NULL,
		    (vd->vdev_islog || vd->vdev_aux != NULL) ?
		    vdev_label_sync_ignore_done : vdev_label_sync_top_done,
		    good_writes, flags);
		vdev_label_sync(vio, vd, l, txg, flags);
		zio_nowait(vio);
	}

	error = zio_wait(zio);

	/*
	 * Flush the new labels to disk.
	 */
	zio = zio_root(spa, NULL, NULL, flags);

	for (vd = list_head(dl); vd != NULL; vd = list_next(dl, vd))
		zio_flush(zio, vd);

	(void) zio_wait(zio);

	return (error);
}

/*
 * Sync the uberblock and any changes to the vdev configuration.
 *
 * The order of operations is carefully crafted to ensure that
 * if the system panics or loses power at any time, the state on disk
 * is still transactionally consistent.  The in-line comments below
 * describe the failure semantics at each stage.
 *
 * Moreover, vdev_config_sync() is designed to be idempotent: if it fails
 * at any time, you can just call it again, and it will resume its work.
 */
int
vdev_config_sync(vdev_t **svd, int svdcount, uint64_t txg, boolean_t tryhard)
{
	spa_t *spa = svd[0]->vdev_spa;
	uberblock_t *ub = &spa->spa_uberblock;
	vdev_t *vd;
	zio_t *zio;
	int error;
	int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL;

	/*
	 * Normally, we don't want to try too hard to write every label and
	 * uberblock.  If there is a flaky disk, we don't want the rest of the
	 * sync process to block while we retry.  But if we can't write a
	 * single label out, we should retry with ZIO_FLAG_TRYHARD before
	 * bailing out and declaring the pool faulted.
	 */
	if (tryhard)
		flags |= ZIO_FLAG_TRYHARD;

	ASSERT(ub->ub_txg <= txg);

	/*
	 * If this isn't a resync due to I/O errors,
	 * and nothing changed in this transaction group,
	 * and the vdev configuration hasn't changed,
	 * then there's nothing to do.
	 */
	if (ub->ub_txg < txg &&
	    uberblock_update(ub, spa->spa_root_vdev, txg) == B_FALSE &&
	    list_is_empty(&spa->spa_config_dirty_list))
		return (0);

	if (txg > spa_freeze_txg(spa))
		return (0);

	ASSERT(txg <= spa->spa_final_txg);

	/*
	 * Flush the write cache of every disk that's been written to
	 * in this transaction group.  This ensures that all blocks
	 * written in this txg will be committed to stable storage
	 * before any uberblock that references them.
	 */
	zio = zio_root(spa, NULL, NULL, flags);

	for (vd = txg_list_head(&spa->spa_vdev_txg_list, TXG_CLEAN(txg)); vd;
	    vd = txg_list_next(&spa->spa_vdev_txg_list, vd, TXG_CLEAN(txg)))
		zio_flush(zio, vd);

	(void) zio_wait(zio);

	/*
	 * Sync out the even labels (L0, L2) for every dirty vdev.  If the
	 * system dies in the middle of this process, that's OK: all of the
	 * even labels that made it to disk will be newer than any uberblock,
	 * and will therefore be considered invalid.  The odd labels (L1, L3),
	 * which have not yet been touched, will still be valid.  We flush
	 * the new labels to disk to ensure that all even-label updates
	 * are committed to stable storage before the uberblock update.
	 */
	if ((error = vdev_label_sync_list(spa, 0, txg, flags)) != 0)
		return (error);

	/*
	 * Sync the uberblocks to all vdevs in svd[].
	 * If the system dies in the middle of this step, there are two cases
	 * to consider, and the on-disk state is consistent either way:
	 *
	 * (1)	If none of the new uberblocks made it to disk, then the
	 *	previous uberblock will be the newest, and the odd labels
	 *	(which had not yet been touched) will be valid with respect
	 *	to that uberblock.
	 *
	 * (2)	If one or more new uberblocks made it to disk, then they
	 *	will be the newest, and the even labels (which had all
	 *	been successfully committed) will be valid with respect
	 *	to the new uberblocks.
	 */
	if ((error = vdev_uberblock_sync_list(svd, svdcount, ub, flags)) != 0)
		return (error);

	/*
	 * Sync out odd labels for every dirty vdev.  If the system dies
	 * in the middle of this process, the even labels and the new
	 * uberblocks will suffice to open the pool.  The next time
	 * the pool is opened, the first thing we'll do -- before any
	 * user data is modified -- is mark every vdev dirty so that
	 * all labels will be brought up to date.  We flush the new labels
	 * to disk to ensure that all odd-label updates are committed to
	 * stable storage before the next transaction group begins.
	 */
	return (vdev_label_sync_list(spa, 1, txg, flags));
}
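
/*
 * Illustrative sketch, not part of the original source: a toy model of the
 * write ordering in vdev_config_sync(), useful for checking that a crash
 * after any stage leaves at least one usable set of labels.  The model
 * treats a label as valid when its txg does not exceed the newest
 * uberblock's txg, which is one reading of the in-line comments above
 * rather than the actual validation code.  All demo_* names are
 * hypothetical.
 */
#include <assert.h>
#include <stdint.h>

typedef struct demo_disk {
	uint64_t	even_label_txg;	/* txg stored in labels L0 and L2 */
	uint64_t	odd_label_txg;	/* txg stored in labels L1 and L3 */
	uint64_t	ub_txg;		/* txg of the newest uberblock */
} demo_disk_t;

/* A label newer than every uberblock is considered invalid. */
static int
demo_label_valid(uint64_t label_txg, uint64_t ub_txg)
{
	return (label_txg <= ub_txg);
}

/*
 * Apply the first 'stages' stages of a sync of txg, as if the system died
 * right afterwards: 1 = even labels, 2 = uberblock, 3 = odd labels.
 */
static void
demo_sync_stages(demo_disk_t *d, uint64_t txg, int stages)
{
	assert(stages >= 0 && stages <= 3);

	if (stages >= 1)
		d->even_label_txg = txg;
	if (stages >= 2)
		d->ub_txg = txg;
	if (stages >= 3)
		d->odd_label_txg = txg;
}

/* The pool can be opened if either the even or the odd labels are valid. */
static int
demo_pool_openable(const demo_disk_t *d)
{
	return (demo_label_valid(d->even_label_txg, d->ub_txg) ||
	    demo_label_valid(d->odd_label_txg, d->ub_txg));
}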