Home | History | Annotate | Download | only in os
      1 /*
      2  * CDDL HEADER START
      3  *
      4  * The contents of this file are subject to the terms of the
      5  * Common Development and Distribution License (the "License").
      6  * You may not use this file except in compliance with the License.
      7  *
      8  * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
      9  * or http://www.opensolaris.org/os/licensing.
     10  * See the License for the specific language governing permissions
     11  * and limitations under the License.
     12  *
     13  * When distributing Covered Code, include this CDDL HEADER in each
     14  * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
     15  * If applicable, add the following below this CDDL HEADER, with the
     16  * fields enclosed by brackets "[]" replaced with your own identifying
     17  * information: Portions Copyright [yyyy] [name of copyright owner]
     18  *
     19  * CDDL HEADER END
     20  */
     21 
     22 /*
     23  * Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
     24  * Use is subject to license terms.
     25  */
     26 
     27 /*
     28  * Big Theory Statement for turnstiles.
     29  *
     30  * Turnstiles provide blocking and wakeup support, including priority
     31  * inheritance, for synchronization primitives (e.g. mutexes and rwlocks).
     32  * Typical usage is as follows:
     33  *
     34  * To block on lock 'lp' for read access in foo_enter():
     35  *
     36  *	ts = turnstile_lookup(lp);
     37  *	[ If the lock is still held, set the waiters bit
     38  *	turnstile_block(ts, TS_READER_Q, lp, &foo_sobj_ops);
     39  *
     40  * To wake threads waiting for write access to lock 'lp' in foo_exit():
     41  *
     42  *	ts = turnstile_lookup(lp);
     43  *	[ Either drop the lock (change owner to NULL) or perform a direct
     44  *	[ handoff (change owner to one of the threads we're about to wake).
     45  *	[ If we're going to wake the last waiter, clear the waiters bit.
     46  *	turnstile_wakeup(ts, TS_WRITER_Q, nwaiters, new_owner or NULL);
     47  *
     48  * turnstile_lookup() returns holding the turnstile hash chain lock for lp.
     49  * Both turnstile_block() and turnstile_wakeup() drop the turnstile lock.
     50  * To abort a turnstile operation, the client must call turnstile_exit().
     51  *
     52  * Requirements of the client:
     53  *
     54  * (1)  The lock's waiters indicator may be manipulated *only* while
     55  *	holding the turnstile hash chain lock (i.e. under turnstile_lookup()).
     56  *
     57  * (2)	Once the lock is marked as having waiters, the owner may be
     58  *	changed *only* while holding the turnstile hash chain lock.
     59  *
     60  * (3)	The caller must never block on an unheld lock.
     61  *
     62  * Consequences of these assumptions include the following:
     63  *
     64  * (a) It is impossible for a lock to be unheld but have waiters.
     65  *
     66  * (b)	The priority inheritance code can safely assume that an active
     67  *	turnstile's ts_inheritor never changes until the inheritor calls
     68  *	turnstile_pi_waive().
     69  *
     70  * These assumptions simplify the implementation of both turnstiles and
     71  * their clients.
     72  *
     73  * Background on priority inheritance:
     74  *
     75  * Priority inheritance allows a thread to "will" its dispatch priority
     76  * to all the threads blocking it, directly or indirectly.  This prevents
     77  * situations called priority inversions in which a high-priority thread
     78  * needs a lock held by a low-priority thread, which cannot run because
     79  * of medium-priority threads.  Without PI, the medium-priority threads
     80  * can starve out the high-priority thread indefinitely.  With PI, the
     81  * low-priority thread becomes high-priority until it releases whatever
     82  * synchronization object the real high-priority thread is waiting for.
     83  *
     84  * How turnstiles work:
     85  *
     86  * All active turnstiles reside in a global hash table, turnstile_table[].
     87  * The address of a synchronization object determines its hash index.
     88  * Each hash chain is protected by its own dispatcher lock, acquired
     89  * by turnstile_lookup().  This lock protects the hash chain linkage, the
     90  * contents of all turnstiles on the hash chain, and the waiters bits of
     91  * every synchronization object in the system that hashes to the same chain.
     92  * Giving the lock such broad scope simplifies the interactions between
     93  * the turnstile code and its clients considerably.  The blocking path
     94  * is rare enough that this has no impact on scalability.  (If it ever
     95  * does, it's almost surely a second-order effect -- the real problem
     96  * is that some synchronization object is *very* heavily contended.)
     97  *
     98  * Each thread has an attached turnstile in case it needs to block.
     99  * A thread cannot block on more than one lock at a time, so one
    100  * turnstile per thread is the most we ever need.  The first thread
    101  * to block on a lock donates its attached turnstile and adds it to
    102  * the appropriate hash chain in turnstile_table[].  This becomes the
    103  * "active turnstile" for the lock.  Each subsequent thread that blocks
    104  * on the same lock discovers that the lock already has an active
    105  * turnstile, so it stashes its own turnstile on the active turnstile's
    106  * freelist.  As threads wake up, the process is reversed.
    107  *
    108  * turnstile_block() puts the current thread to sleep on the active
    109  * turnstile for the desired lock, walks the blocking chain to apply
    110  * priority inheritance to everyone in its way, and yields the CPU.
    111  *
    112  * turnstile_wakeup() waives any priority the owner may have inherited
    113  * and wakes the specified number of waiting threads.  If the caller is
    114  * doing direct handoff of ownership (rather than just dropping the lock),
    115  * the new owner automatically inherits priority from any existing waiters.
    116  */
    117 
    118 #include <sys/param.h>
    119 #include <sys/systm.h>
    120 #include <sys/thread.h>
    121 #include <sys/proc.h>
    122 #include <sys/debug.h>
    123 #include <sys/cpuvar.h>
    124 #include <sys/turnstile.h>
    125 #include <sys/t_lock.h>
    126 #include <sys/disp.h>
    127 #include <sys/sobject.h>
    128 #include <sys/cmn_err.h>
    129 #include <sys/sysmacros.h>
    130 #include <sys/lockstat.h>
    131 #include <sys/lwp_upimutex_impl.h>
    132 #include <sys/schedctl.h>
    133 #include <sys/cpu.h>
    134 #include <sys/sdt.h>
    135 #include <sys/cpupart.h>
    136 
    137 extern upib_t upimutextab[UPIMUTEX_TABSIZE];
    138 
    139 #define	IS_UPI(sobj)	\
    140 	((uintptr_t)(sobj) - (uintptr_t)upimutextab < sizeof (upimutextab))
    141 
    142 /*
    143  * The turnstile hash table is partitioned into two halves: the lower half
    144  * is used for upimutextab[] locks, the upper half for everything else.
    145  * The reason for the distinction is that SOBJ_USER_PI locks present a
    146  * unique problem: the upimutextab[] lock passed to turnstile_block()
    147  * cannot be dropped until the calling thread has blocked on its
    148  * SOBJ_USER_PI lock and willed its priority down the blocking chain.
    149  * At that point, the caller's t_lockp will be one of the turnstile locks.
    150  * If mutex_exit() discovers that the upimutextab[] lock has waiters, it
    151  * must wake them, which forces a lock ordering on us: the turnstile lock
    152  * for the upimutextab[] lock will be acquired in mutex_vector_exit(),
    153  * which will eventually call into turnstile_pi_waive(), which will then
    154  * acquire the caller's thread lock, which in this case is the turnstile
    155  * lock for the SOBJ_USER_PI lock.  In general, when two turnstile locks
    156  * must be held at the same time, the lock order must be the address order.
    157  * Therefore, to prevent deadlock in turnstile_pi_waive(), we must ensure
    158  * that upimutextab[] locks *always* hash to lower addresses than any
    159  * other locks.  You think this is cheesy?  Let's see you do better.
    160  */
    161 #define	TURNSTILE_HASH_SIZE	128		/* must be power of 2 */
    162 #define	TURNSTILE_HASH_MASK	(TURNSTILE_HASH_SIZE - 1)
    163 #define	TURNSTILE_SOBJ_HASH(sobj)	\
    164 	((((ulong_t)sobj >> 2) + ((ulong_t)sobj >> 9)) & TURNSTILE_HASH_MASK)
    165 #define	TURNSTILE_SOBJ_BUCKET(sobj)		\
    166 	((IS_UPI(sobj) ? 0 : TURNSTILE_HASH_SIZE) + TURNSTILE_SOBJ_HASH(sobj))
    167 #define	TURNSTILE_CHAIN(sobj)	turnstile_table[TURNSTILE_SOBJ_BUCKET(sobj)]
    168 
    169 typedef struct turnstile_chain {
    170 	turnstile_t	*tc_first;	/* first turnstile on hash chain */
    171 	disp_lock_t	tc_lock;	/* lock for this hash chain */
    172 } turnstile_chain_t;
    173 
    174 turnstile_chain_t	turnstile_table[2 * TURNSTILE_HASH_SIZE];
    175 
    176 static	lock_t	turnstile_loser_lock;
    177 
    178 /*
    179  * Make 'inheritor' inherit priority from this turnstile.
    180  */
    181 static void
    182 turnstile_pi_inherit(turnstile_t *ts, kthread_t *inheritor, pri_t epri)
    183 {
    184 	ASSERT(THREAD_LOCK_HELD(inheritor));
    185 	ASSERT(DISP_LOCK_HELD(&TURNSTILE_CHAIN(ts->ts_sobj).tc_lock));
    186 
    187 	if (epri <= inheritor->t_pri)
    188 		return;
    189 
    190 	if (ts->ts_inheritor == NULL) {
    191 		ts->ts_inheritor = inheritor;
    192 		ts->ts_epri = epri;
    193 		disp_lock_enter_high(&inheritor->t_pi_lock);
    194 		ts->ts_prioinv = inheritor->t_prioinv;
    195 		inheritor->t_prioinv = ts;
    196 		disp_lock_exit_high(&inheritor->t_pi_lock);
    197 	} else {
    198 		/*
    199 		 * 'inheritor' is already inheriting from this turnstile,
    200 		 * so just adjust its priority.
    201 		 */
    202 		ASSERT(ts->ts_inheritor == inheritor);
    203 		if (ts->ts_epri < epri)
    204 			ts->ts_epri = epri;
    205 	}
    206 
    207 	if (epri > DISP_PRIO(inheritor))
    208 		thread_change_epri(inheritor, epri);
    209 }
    210 
    211 /*
    212  * If turnstile is non-NULL, remove it from inheritor's t_prioinv list.
    213  * Compute new inherited priority, and return it.
    214  */
    215 static pri_t
    216 turnstile_pi_tsdelete(turnstile_t *ts, kthread_t *inheritor)
    217 {
    218 	turnstile_t **tspp, *tsp;
    219 	pri_t new_epri = 0;
    220 
    221 	disp_lock_enter_high(&inheritor->t_pi_lock);
    222 	tspp = &inheritor->t_prioinv;
    223 	while ((tsp = *tspp) != NULL) {
    224 		if (tsp == ts)
    225 			*tspp = tsp->ts_prioinv;
    226 		else
    227 			new_epri = MAX(new_epri, tsp->ts_epri);
    228 		tspp = &tsp->ts_prioinv;
    229 	}
    230 	disp_lock_exit_high(&inheritor->t_pi_lock);
    231 	return (new_epri);
    232 }
    233 
    234 /*
    235  * Remove turnstile from inheritor's t_prioinv list, compute
    236  * new priority, and change the inheritor's effective priority if
    237  * necessary. Keep in synch with turnstile_pi_recalc().
    238  */
    239 static void
    240 turnstile_pi_waive(turnstile_t *ts)
    241 {
    242 	kthread_t *inheritor = ts->ts_inheritor;
    243 	pri_t new_epri;
    244 
    245 	ASSERT(inheritor == curthread);
    246 
    247 	thread_lock_high(inheritor);
    248 	new_epri = turnstile_pi_tsdelete(ts, inheritor);
    249 	if (new_epri != DISP_PRIO(inheritor))
    250 		thread_change_epri(inheritor, new_epri);
    251 	ts->ts_inheritor = NULL;
    252 	if (DISP_MUST_SURRENDER(inheritor))
    253 		cpu_surrender(inheritor);
    254 	thread_unlock_high(inheritor);
    255 }
    256 
    257 /*
    258  * Compute caller's new inherited priority, and change its effective
    259  * priority if necessary. Necessary only for SOBJ_USER_PI, because of
    260  * its interruptibility characteristic.
    261  */
    262 void
    263 turnstile_pi_recalc(void)
    264 {
    265 	kthread_t *inheritor = curthread;
    266 	pri_t new_epri;
    267 
    268 	thread_lock(inheritor);
    269 	new_epri = turnstile_pi_tsdelete(NULL, inheritor);
    270 	if (new_epri != DISP_PRIO(inheritor))
    271 		thread_change_epri(inheritor, new_epri);
    272 	if (DISP_MUST_SURRENDER(inheritor))
    273 		cpu_surrender(inheritor);
    274 	thread_unlock(inheritor);
    275 }
    276 
    277 /*
    278  * Grab the lock protecting the hash chain for sobj
    279  * and return the active turnstile for sobj, if any.
    280  */
    281 turnstile_t *
    282 turnstile_lookup(void *sobj)
    283 {
    284 	turnstile_t *ts;
    285 	turnstile_chain_t *tc = &TURNSTILE_CHAIN(sobj);
    286 
    287 	disp_lock_enter(&tc->tc_lock);
    288 
    289 	for (ts = tc->tc_first; ts != NULL; ts = ts->ts_next)
    290 		if (ts->ts_sobj == sobj)
    291 			break;
    292 
    293 	return (ts);
    294 }
    295 
    296 /*
    297  * Drop the lock protecting the hash chain for sobj.
    298  */
    299 void
    300 turnstile_exit(void *sobj)
    301 {
    302 	disp_lock_exit(&TURNSTILE_CHAIN(sobj).tc_lock);
    303 }
    304 
    305 /*
    306  * When we apply priority inheritance, we must grab the owner's thread lock
    307  * while already holding the waiter's thread lock.  If both thread locks are
    308  * turnstile locks, this can lead to deadlock: while we hold L1 and try to
    309  * grab L2, some unrelated thread may be applying priority inheritance to
    310  * some other blocking chain, holding L2 and trying to grab L1.  The most
    311  * obvious solution -- do a lock_try() for the owner lock -- isn't quite
    312  * sufficient because it can cause livelock: each thread may hold one lock,
    313  * try to grab the other, fail, bail out, and try again, looping forever.
    314  * To prevent livelock we must define a winner, i.e. define an arbitrary
    315  * lock ordering on the turnstile locks.  For simplicity we declare that
    316  * virtual address order defines lock order, i.e. if L1 < L2, then the
    317  * correct lock ordering is L1, L2.  Thus the thread that holds L1 and
    318  * wants L2 should spin until L2 is available, but the thread that holds
    319  * L2 and can't get L1 on the first try must drop L2 and return failure.
    320  * Moreover, the losing thread must not reacquire L2 until the winning
    321  * thread has had a chance to grab it; to ensure this, the losing thread
    322  * must grab L1 after dropping L2, thus spinning until the winner is done.
    323  * Complicating matters further, note that the owner's thread lock pointer
    324  * can change (i.e. be pointed at a different lock) while we're trying to
    325  * grab it.  If that happens, we must unwind our state and try again.
    326  *
    327  * On success, returns 1 with both locks held.
    328  * On failure, returns 0 with neither lock held.
    329  */
    330 static int
    331 turnstile_interlock(lock_t *wlp, lock_t *volatile *olpp)
    332 {
    333 	ASSERT(LOCK_HELD(wlp));
    334 
    335 	for (;;) {
    336 		volatile lock_t *olp = *olpp;
    337 
    338 		/*
    339 		 * If the locks are identical, there's nothing to do.
    340 		 */
    341 		if (olp == wlp)
    342 			return (1);
    343 		if (lock_try((lock_t *)olp)) {
    344 			/*
    345 			 * If 'olp' is still the right lock, return success.
    346 			 * Otherwise, drop 'olp' and try the dance again.
    347 			 */
    348 			if (olp == *olpp)
    349 				return (1);
    350 			lock_clear((lock_t *)olp);
    351 		} else {
    352 			hrtime_t spin_time = 0;
    353 			/*
    354 			 * If we're grabbing the locks out of order, we lose.
    355 			 * Drop the waiter's lock, and then grab and release
    356 			 * the owner's lock to ensure that we won't retry
    357 			 * until the winner is done (as described above).
    358 			 */
    359 			if (olp >= (lock_t *)turnstile_table && olp < wlp) {
    360 				lock_clear(wlp);
    361 				lock_set((lock_t *)olp);
    362 				lock_clear((lock_t *)olp);
    363 				return (0);
    364 			}
    365 			/*
    366 			 * We're grabbing the locks in the right order,
    367 			 * so spin until the owner's lock either becomes
    368 			 * available or spontaneously changes.
    369 			 */
    370 			spin_time =
    371 			    LOCKSTAT_START_TIME(LS_TURNSTILE_INTERLOCK_SPIN);
    372 			while (olp == *olpp && LOCK_HELD(olp)) {
    373 				if (panicstr)
    374 					return (1);
    375 				SMT_PAUSE();
    376 			}
    377 			LOCKSTAT_RECORD_TIME(LS_TURNSTILE_INTERLOCK_SPIN,
    378 			    olp, spin_time);
    379 		}
    380 	}
    381 }
    382 
    383 /*
    384  * Block the current thread on a synchronization object.
    385  *
    386  * Turnstiles implement both kernel and user-level priority inheritance.
    387  * To avoid missed wakeups in the user-level case, lwp_upimutex_lock() calls
    388  * turnstile_block() holding the appropriate lock in the upimutextab (see
    389  * the block comment in lwp_upimutex_lock() for details).  The held lock is
    390  * passed to turnstile_block() as the "mp" parameter, and will be dropped
    391  * after priority has been willed, but before the thread actually sleeps
    392  * (this locking behavior leads to some subtle ordering issues; see the
    393  * block comment on turnstile hashing for details).  This _must_ be the only
    394  * lock held when calling turnstile_block() with a SOBJ_USER_PI sobj; holding
    395  * other locks can result in panics due to cycles in the blocking chain.
    396  *
    397  * turnstile_block() always succeeds for kernel synchronization objects.
    398  * For SOBJ_USER_PI locks the possible errors are EINTR for signals, and
    399  * EDEADLK for cycles in the blocking chain. A return code of zero indicates
    400  * *either* that the lock is now held, or that this is a spurious wake-up, or
    401  * that the lock can never be held due to an ENOTRECOVERABLE error.
    402  * It is up to lwp_upimutex_lock() to sort this all out.
    403  */
    404 
    405 int
    406 turnstile_block(turnstile_t *ts, int qnum, void *sobj, sobj_ops_t *sobj_ops,
    407     kmutex_t *mp, lwp_timer_t *lwptp)
    408 {
    409 	kthread_t *owner;
    410 	kthread_t *t = curthread;
    411 	proc_t *p = ttoproc(t);
    412 	klwp_t *lwp = ttolwp(t);
    413 	turnstile_chain_t *tc = &TURNSTILE_CHAIN(sobj);
    414 	int error = 0;
    415 	int loser = 0;
    416 
    417 	ASSERT(DISP_LOCK_HELD(&tc->tc_lock));
    418 	ASSERT(mp == NULL || IS_UPI(mp));
    419 	ASSERT((SOBJ_TYPE(sobj_ops) == SOBJ_USER_PI) ^ (mp == NULL));
    420 
    421 	thread_lock_high(t);
    422 
    423 	if (ts == NULL) {
    424 		/*
    425 		 * This is the first thread to block on this sobj.
    426 		 * Take its attached turnstile and add it to the hash chain.
    427 		 */
    428 		ts = t->t_ts;
    429 		ts->ts_sobj = sobj;
    430 		ts->ts_next = tc->tc_first;
    431 		tc->tc_first = ts;
    432 		ASSERT(ts->ts_waiters == 0);
    433 	} else {
    434 		/*
    435 		 * Another thread has already donated its turnstile
    436 		 * to block on this sobj, so ours isn't needed.
    437 		 * Stash it on the active turnstile's freelist.
    438 		 */
    439 		turnstile_t *myts = t->t_ts;
    440 		myts->ts_free = ts->ts_free;
    441 		ts->ts_free = myts;
    442 		t->t_ts = ts;
    443 		ASSERT(ts->ts_sobj == sobj);
    444 		ASSERT(ts->ts_waiters > 0);
    445 	}
    446 
    447 	/*
    448 	 * Put the thread to sleep.
    449 	 */
    450 	ASSERT(t != CPU->cpu_idle_thread);
    451 	ASSERT(CPU_ON_INTR(CPU) == 0);
    452 	ASSERT(t->t_wchan0 == NULL && t->t_wchan == NULL);
    453 	ASSERT(t->t_state == TS_ONPROC);
    454 
    455 	if (SOBJ_TYPE(sobj_ops) == SOBJ_USER_PI) {
    456 		curthread->t_flag |= T_WAKEABLE;
    457 	}
    458 	CL_SLEEP(t);		/* assign kernel priority */
    459 	THREAD_SLEEP(t, &tc->tc_lock);
    460 	t->t_wchan = sobj;
    461 	t->t_sobj_ops = sobj_ops;
    462 	DTRACE_SCHED(sleep);
    463 
    464 	if (lwp != NULL) {
    465 		lwp->lwp_ru.nvcsw++;
    466 		(void) new_mstate(t, LMS_SLEEP);
    467 		if (SOBJ_TYPE(sobj_ops) == SOBJ_USER_PI) {
    468 			lwp->lwp_asleep = 1;
    469 			lwp->lwp_sysabort = 0;
    470 			/*
    471 			 * make wchan0 non-zero to conform to the rule that
    472 			 * threads blocking for user-level objects have a
    473 			 * non-zero wchan0: this prevents spurious wake-ups
    474 			 * by, for example, /proc.
    475 			 */
    476 			t->t_wchan0 = (caddr_t)1;
    477 		}
    478 	}
    479 	ts->ts_waiters++;
    480 	sleepq_insert(&ts->ts_sleepq[qnum], t);
    481 
    482 	if (SOBJ_TYPE(sobj_ops) == SOBJ_MUTEX &&
    483 	    SOBJ_OWNER(sobj_ops, sobj) == NULL)
    484 		panic("turnstile_block(%p): unowned mutex", (void *)ts);
    485 
    486 	/*
    487 	 * Follow the blocking chain to its end, willing our priority to
    488 	 * everyone who's in our way.
    489 	 */
    490 	while (t->t_sobj_ops != NULL &&
    491 	    (owner = SOBJ_OWNER(t->t_sobj_ops, t->t_wchan)) != NULL) {
    492 		if (owner == curthread) {
    493 			if (SOBJ_TYPE(sobj_ops) != SOBJ_USER_PI) {
    494 				panic("Deadlock: cycle in blocking chain");
    495 			}
    496 			/*
    497 			 * If the cycle we've encountered ends in mp,
    498 			 * then we know it isn't a 'real' cycle because
    499 			 * we're going to drop mp before we go to sleep.
    500 			 * Moreover, since we've come full circle we know
    501 			 * that we must have willed priority to everyone
    502 			 * in our way.  Therefore, we can break out now.
    503 			 */
    504 			if (t->t_wchan == (void *)mp)
    505 				break;
    506 
    507 			if (loser)
    508 				lock_clear(&turnstile_loser_lock);
    509 			/*
    510 			 * For SOBJ_USER_PI, a cycle is an application
    511 			 * deadlock which needs to be communicated
    512 			 * back to the application.
    513 			 */
    514 			thread_unlock_nopreempt(t);
    515 			mutex_exit(mp);
    516 			setrun(curthread);
    517 			swtch(); /* necessary to transition state */
    518 			curthread->t_flag &= ~T_WAKEABLE;
    519 			if (lwptp->lwpt_id != 0)
    520 				(void) lwp_timer_dequeue(lwptp);
    521 			setallwatch();
    522 			lwp->lwp_asleep = 0;
    523 			lwp->lwp_sysabort = 0;
    524 			return (EDEADLK);
    525 		}
    526 		if (!turnstile_interlock(t->t_lockp, &owner->t_lockp)) {
    527 			/*
    528 			 * If we failed to grab the owner's thread lock,
    529 			 * turnstile_interlock() will have dropped t's
    530 			 * thread lock, so at this point we don't even know
    531 			 * that 't' exists anymore.  The simplest solution
    532 			 * is to restart the entire priority inheritance dance
    533 			 * from the beginning of the blocking chain, since
    534 			 * we *do* know that 'curthread' still exists.
    535 			 * Application of priority inheritance is idempotent,
    536 			 * so it's OK that we're doing it more than once.
    537 			 * Note also that since we've dropped our thread lock,
    538 			 * we may already have been woken up; if so, our
    539 			 * t_sobj_ops will be NULL, the loop will terminate,
    540 			 * and the call to swtch() will be a no-op.  Phew.
    541 			 *
    542 			 * There is one further complication: if two (or more)
    543 			 * threads keep trying to grab the turnstile locks out
    544 			 * of order and keep losing the race to another thread,
    545 			 * these "dueling losers" can livelock the system.
    546 			 * Therefore, once we get into this rare situation,
    547 			 * we serialize all the losers.
    548 			 */
    549 			if (loser == 0) {
    550 				loser = 1;
    551 				lock_set(&turnstile_loser_lock);
    552 			}
    553 			t = curthread;
    554 			thread_lock_high(t);
    555 			continue;
    556 		}
    557 
    558 		/*
    559 		 * We now have the owner's thread lock.  If we are traversing
    560 		 * from non-SOBJ_USER_PI ops to SOBJ_USER_PI ops, then we know
    561 		 * that we have caught the thread while in the TS_SLEEP state,
    562 		 * but holding mp.  We know that this situation is transient
    563 		 * (mp will be dropped before the holder actually sleeps on
    564 		 * the SOBJ_USER_PI sobj), so we will spin waiting for mp to
    565 		 * be dropped.  Then, as in the turnstile_interlock() failure
    566 		 * case, we will restart the priority inheritance dance.
    567 		 */
    568 		if (SOBJ_TYPE(t->t_sobj_ops) != SOBJ_USER_PI &&
    569 		    owner->t_sobj_ops != NULL &&
    570 		    SOBJ_TYPE(owner->t_sobj_ops) == SOBJ_USER_PI) {
    571 			kmutex_t *upi_lock = (kmutex_t *)t->t_wchan;
    572 
    573 			ASSERT(IS_UPI(upi_lock));
    574 			ASSERT(SOBJ_TYPE(t->t_sobj_ops) == SOBJ_MUTEX);
    575 
    576 			if (t->t_lockp != owner->t_lockp)
    577 				thread_unlock_high(owner);
    578 			thread_unlock_high(t);
    579 			if (loser)
    580 				lock_clear(&turnstile_loser_lock);
    581 
    582 			while (mutex_owner(upi_lock) == owner) {
    583 				SMT_PAUSE();
    584 				continue;
    585 			}
    586 
    587 			if (loser)
    588 				lock_set(&turnstile_loser_lock);
    589 			t = curthread;
    590 			thread_lock_high(t);
    591 			continue;
    592 		}
    593 
    594 		turnstile_pi_inherit(t->t_ts, owner, DISP_PRIO(t));
    595 		if (t->t_lockp != owner->t_lockp)
    596 			thread_unlock_high(t);
    597 		t = owner;
    598 	}
    599 
    600 	if (loser)
    601 		lock_clear(&turnstile_loser_lock);
    602 
    603 	/*
    604 	 * Note: 't' and 'curthread' were synonymous before the loop above,
    605 	 * but now they may be different.  ('t' is now the last thread in
    606 	 * the blocking chain.)
    607 	 */
    608 	if (SOBJ_TYPE(sobj_ops) == SOBJ_USER_PI) {
    609 		ushort_t s = curthread->t_oldspl;
    610 		int timedwait = 0;
    611 		uint_t imm_timeout = 0;
    612 		clock_t tim = -1;
    613 
    614 		thread_unlock_high(t);
    615 		if (lwptp->lwpt_id != 0) {
    616 			/*
    617 			 * We enqueued a timeout.  If it has already fired,
    618 			 * lwptp->lwpt_imm_timeout has been set with cas,
    619 			 * so fetch it with cas.
    620 			 */
    621 			timedwait = 1;
    622 			imm_timeout =
    623 			    atomic_cas_uint(&lwptp->lwpt_imm_timeout, 0, 0);
    624 		}
    625 		mutex_exit(mp);
    626 		splx(s);
    627 
    628 		if (ISSIG(curthread, JUSTLOOKING) ||
    629 		    MUSTRETURN(p, curthread) || imm_timeout)
    630 			setrun(curthread);
    631 		swtch();
    632 		curthread->t_flag &= ~T_WAKEABLE;
    633 		if (timedwait)
    634 			tim = lwp_timer_dequeue(lwptp);
    635 		setallwatch();
    636 		if (ISSIG(curthread, FORREAL) || lwp->lwp_sysabort ||
    637 		    MUSTRETURN(p, curthread))
    638 			error = EINTR;
    639 		else if (imm_timeout || (timedwait && tim == -1))
    640 			error = ETIME;
    641 		lwp->lwp_sysabort = 0;
    642 		lwp->lwp_asleep = 0;
    643 	} else {
    644 		thread_unlock_nopreempt(t);
    645 		swtch();
    646 	}
    647 
    648 	return (error);
    649 }
    650 
    651 /*
    652  * Remove thread from specified turnstile sleep queue; retrieve its
    653  * free turnstile; if it is the last waiter, delete the turnstile
    654  * from the turnstile chain and if there is an inheritor, delete it
    655  * from the inheritor's t_prioinv chain.
    656  */
    657 static void
    658 turnstile_dequeue(kthread_t *t)
    659 {
    660 	turnstile_t *ts = t->t_ts;
    661 	turnstile_chain_t *tc = &TURNSTILE_CHAIN(ts->ts_sobj);
    662 	turnstile_t *tsfree, **tspp;
    663 
    664 	ASSERT(DISP_LOCK_HELD(&tc->tc_lock));
    665 	ASSERT(t->t_lockp == &tc->tc_lock);
    666 
    667 	if ((tsfree = ts->ts_free) != NULL) {
    668 		ASSERT(ts->ts_waiters > 1);
    669 		ASSERT(tsfree->ts_waiters == 0);
    670 		t->t_ts = tsfree;
    671 		ts->ts_free = tsfree->ts_free;
    672 		tsfree->ts_free = NULL;
    673 	} else {
    674 		/*
    675 		 * The active turnstile's freelist is empty, so this
    676 		 * must be the last waiter.  Remove the turnstile
    677 		 * from the hash chain and leave the now-inactive
    678 		 * turnstile attached to the thread we're waking.
    679 		 * Note that the ts_inheritor for the turnstile
    680 		 * may be NULL. If one exists, its t_prioinv
    681 		 * chain has to be updated.
    682 		 */
    683 		ASSERT(ts->ts_waiters == 1);
    684 		if (ts->ts_inheritor != NULL) {
    685 			(void) turnstile_pi_tsdelete(ts, ts->ts_inheritor);
    686 			/*
    687 			 * If we ever do a "disinherit" or "unboost", we need
    688 			 * to do it only if "t" is a thread at the head of the
    689 			 * sleep queue. Since the sleep queue is prioritized,
    690 			 * the disinherit is necessary only if the interrupted
    691 			 * thread is the highest priority thread.
    692 			 * Otherwise, there is a higher priority thread blocked
    693 			 * on the turnstile, whose inheritance cannot be
    694 			 * disinherited. However, disinheriting is explicitly
    695 			 * not done here, since it would require holding the
    696 			 * inheritor's thread lock (see turnstile_unsleep()).
    697 			 */
    698 			ts->ts_inheritor = NULL;
    699 		}
    700 		tspp = &tc->tc_first;
    701 		while (*tspp != ts)
    702 			tspp = &(*tspp)->ts_next;
    703 		*tspp = ts->ts_next;
    704 		ASSERT(t->t_ts == ts);
    705 	}
    706 	ts->ts_waiters--;
    707 	sleepq_dequeue(t);
    708 	t->t_sobj_ops = NULL;
    709 	t->t_wchan = NULL;
    710 	t->t_wchan0 = NULL;
    711 	ASSERT(t->t_state == TS_SLEEP);
    712 }
    713 
    714 /*
    715  * Wake threads that are blocked in a turnstile.
    716  */
    717 void
    718 turnstile_wakeup(turnstile_t *ts, int qnum, int nthreads, kthread_t *owner)
    719 {
    720 	turnstile_chain_t *tc = &TURNSTILE_CHAIN(ts->ts_sobj);
    721 	sleepq_t *sqp = &ts->ts_sleepq[qnum];
    722 
    723 	ASSERT(DISP_LOCK_HELD(&tc->tc_lock));
    724 
    725 	/*
    726 	 * Waive any priority we may have inherited from this turnstile.
    727 	 */
    728 	if (ts->ts_inheritor != NULL) {
    729 		turnstile_pi_waive(ts);
    730 	}
    731 	while (nthreads-- > 0) {
    732 		kthread_t *t = sqp->sq_first;
    733 		ASSERT(t->t_ts == ts);
    734 		ASSERT(ts->ts_waiters > 1 || ts->ts_inheritor == NULL);
    735 		DTRACE_SCHED1(wakeup, kthread_t *, t);
    736 		turnstile_dequeue(t);
    737 		CL_WAKEUP(t); /* previous thread lock, tc_lock, not dropped */
    738 		/*
    739 		 * If the caller did direct handoff of ownership,
    740 		 * make the new owner inherit from this turnstile.
    741 		 */
    742 		if (t == owner) {
    743 			kthread_t *wp = ts->ts_sleepq[TS_WRITER_Q].sq_first;
    744 			kthread_t *rp = ts->ts_sleepq[TS_READER_Q].sq_first;
    745 			pri_t wpri = wp ? DISP_PRIO(wp) : 0;
    746 			pri_t rpri = rp ? DISP_PRIO(rp) : 0;
    747 			turnstile_pi_inherit(ts, t, MAX(wpri, rpri));
    748 			owner = NULL;
    749 		}
    750 		thread_unlock_high(t);		/* drop run queue lock */
    751 	}
    752 	if (owner != NULL)
    753 		panic("turnstile_wakeup: owner %p not woken", (void *)owner);
    754 	disp_lock_exit(&tc->tc_lock);
    755 }
    756 
    757 /*
    758  * Change priority of a thread sleeping in a turnstile.
    759  */
    760 void
    761 turnstile_change_pri(kthread_t *t, pri_t pri, pri_t *t_prip)
    762 {
    763 	sleepq_t *sqp = t->t_sleepq;
    764 
    765 	sleepq_dequeue(t);
    766 	*t_prip = pri;
    767 	sleepq_insert(sqp, t);
    768 }
    769 
    770 /*
    771  * We don't allow spurious wakeups of threads blocked in turnstiles
    772  * for synch objects whose sobj_ops vector is initialized with the
    773  * following routine (e.g. kernel synchronization objects).
    774  * This is vital to the correctness of direct-handoff logic in some
    775  * synchronization primitives, and it also simplifies the PI logic.
    776  */
    777 /* ARGSUSED */
    778 void
    779 turnstile_stay_asleep(kthread_t *t)
    780 {
    781 }
    782 
    783 /*
    784  * Wake up a thread blocked in a turnstile. Used to enable interruptibility
    785  * of threads blocked on a SOBJ_USER_PI sobj.
    786  *
    787  * The implications of this interface are:
    788  *
    789  * 1. turnstile_block() may return with an EINTR.
    790  * 2. When the owner of an sobj releases it, but no turnstile is found (i.e.
    791  *    no waiters), the (prior) owner must call turnstile_pi_recalc() to
    792  *    waive any priority inherited from interrupted waiters.
    793  *
    794  * When a waiter is interrupted, disinheriting its willed priority from the
    795  * inheritor would require holding the inheritor's thread lock, while also
    796  * holding the waiter's thread lock which is a turnstile lock. If the
    797  * inheritor's thread lock is not free, and is also a turnstile lock that
    798  * is out of lock order, the waiter's thread lock would have to be dropped.
    799  * This leads to complications for the caller of turnstile_unsleep(), since
    800  * the caller holds the waiter's thread lock. So, instead of disinheriting
    801  * on waiter interruption, the owner is required to follow rule 2 above.
    802  *
    803  * Avoiding disinherit on waiter interruption seems acceptable because
    804  * the owner runs at an unnecessarily high priority only while sobj is held,
    805  * which it would have done in any case, if the waiter had not been interrupted.
    806  */
    807 void
    808 turnstile_unsleep(kthread_t *t)
    809 {
    810 	turnstile_dequeue(t);
    811 	THREAD_TRANSITION(t);
    812 	CL_SETRUN(t);
    813 }
    814