Diffstat (limited to 'doc/design-thoughts')
-rw-r--r-- | doc/design-thoughts/binding-possibilities.txt | 167 |
-rw-r--r-- | doc/design-thoughts/connection-reuse.txt | 224 |
-rw-r--r-- | doc/design-thoughts/http_load_time.url | 5 |
-rw-r--r-- | doc/design-thoughts/pool-debugging.txt | 243 |
-rw-r--r-- | doc/design-thoughts/thread-group.txt | 655 |
5 files changed, 1294 insertions, 0 deletions
diff --git a/doc/design-thoughts/binding-possibilities.txt b/doc/design-thoughts/binding-possibilities.txt new file mode 100644 index 0000000..3f5e432 --- /dev/null +++ b/doc/design-thoughts/binding-possibilities.txt @@ -0,0 +1,167 @@ +2013/10/10 - possibilities for setting source and destination addresses + + +When establishing a connection to a remote device, this device is designated +as a target, which designates an entity defined in the configuration. A given +target appears only once in a configuration, and multiple targets may share +the same settings if needed. + +The following types of targets are currently supported : + + - listener : all connections with this type of target come from clients ; + - server : connections to such targets are for "server" lines ; + - peer : connections to such targets address "peer" lines in "peers" + sections ; + - proxy : these targets are used by "dispatch", "option transparent" + or "option http_proxy" statements. + +A connection may not be reused between two different targets, even if all +parameters seem similar. One of the reasons is that some parameters are specific +to the target and are not easy or not cheap to compare (eg: bind to interface, +mss, ...). + +A number of source and destination addresses may be set for a given target. + + - listener : + - the "from" address:port is set by accept() + + - the "to" address:port is set if conn_get_to_addr() is called + + - peer : + - the "from" address:port is not set + + - the "to" address:port is static and dependent only on the peer + + - server : + - the "from" address may be set alone when "source" is used with + a forced IP address, or when "usesrc clientip" is used. + + - the "from" port may be set only combined with the address when + "source" is used with IP:port, IP:port-range or "usesrc client" is + used. 
Note that in this case, both the address and the port may be + 0, meaning that the kernel will pick the address or port and that + the final value might not match the one explicitly set (eg: + important for logging). + + - the "from" address may be forced from a header which implies it + may change between two consecutive requests on the same connection. + + - the "to" address and port are set together when connecting to a + regular server, or by copying the client's IP address when + "server 0.0.0.0" is used. Note that the destination port may be + an offset applied to the original destination port. + + - proxy : + - the "from" address may be set alone when "source" is used with a + forced IP address or when "usesrc clientip" is used. + + - the "from" port may be set only combined with the address when + "source" is used with IP:port or with "usesrc client". There is + no ip:port range for a proxy as of now. Same comment applies as + above when port and/or address are 0. + + - the "from" address may be forced from a header which implies it + may change between two consecutive requests on the same connection. + + - the "to" address and port are set together, either by configuration + when "dispatch" is used, or dynamically when "transparent" is used + (1:1 with client connection) or "option http_proxy" is used, where + each client request may lead to a different destination address. + + +At the moment, there are some limits in what might happen between multiple +concurrent requests to a same target. + + - peers parameter do not change, so no problem. + + - server parameters may change in this way : + - a connection may require a source bound to an IP address found in a + header, which will fall back to the "source" settings if the address + is not found in this header. This means that the source address may + switch between a dynamically forced IP address and another forced + IP and/or port range. 
+ + - if the element is not found (eg: header), the remaining "forced" + source address might very well be empty (unset), so the connection + reuse is acceptable when switching in that direction. + + - it is not possible to switch between client and clientip or any of + these and hdr_ip() because they're exclusive. + + - using a source address/port belonging to a port range is compatible + with connection reuse because there is a single range per target, so + switching from a range to another range means we remain in the same + range. + + - destination address may currently not change since the only possible + case for dynamic destination address setting is the transparent mode, + reproducing the client's destination address. + + - proxy parameters may change in this way : + - a connection may require a source bound to an IP address found in a + header, which will fall back to the "source" settings if the address + is not found in this header. This means that the source address may + switch between a dynamically forced IP address and another forced + IP and/or port range. + + - if the element is not found (eg: header), the remaining "forced" + source address might very well be empty (unset), so the connection + reuse is acceptable when switching in that direction. + + - it is not possible to switch between client and clientip or any of + these and hdr_ip() because they're exclusive. + + - proxies do not support port ranges at the moment. + + - destination address might change in the case where "option http_proxy" + is used. 
+ +So, for each source element (IP, port), we want to know : + - if the element was assigned by static configuration (eg: ":80") + - if the element was assigned from a connection-specific value (eg: usesrc clientip) + - if the element was assigned from a configuration-specific range (eg: 1024-65535) + - if the element was assigned from a request-specific value (eg: hdr_ip(xff)) + - if the element was not assigned at all + +For the destination, we want to know : + - if the element was assigned by static configuration (eg: ":80") + - if the element was assigned from a connection-specific value (eg: transparent) + - if the element was assigned from a request-specific value (eg: http_proxy) + +We don't need to store the information about the origin of the dynamic value +since we have the value itself. So in practice we have : + - default value, unknown (not yet checked with getsockname/getpeername) + - default value, known (check done) + - forced value (known) + - forced range (known) + +We can't do that on an ip:port basis because the port may be fixed regardless +of the address and conversely. + +So that means : + + enum { + CO_ADDR_NONE = 0, /* not set, unknown value */ + CO_ADDR_KNOWN = 1, /* not set, known value */ + CO_ADDR_FIXED = 2, /* fixed value, known */ + CO_ADDR_RANGE = 3, /* from assigned range, known */ + } conn_addr_values; + + unsigned int new_l3_src_status:2; + unsigned int new_l4_src_status:2; + unsigned int new_l3_dst_status:2; + unsigned int new_l4_dst_status:2; + + unsigned int cur_l3_src_status:2; + unsigned int cur_l4_src_status:2; + unsigned int cur_l3_dst_status:2; + unsigned int cur_l4_dst_status:2; + + unsigned int new_family:2; + unsigned int cur_family:2; + +Note: this obsoletes CO_FL_ADDR_FROM_SET and CO_FL_ADDR_TO_SET. These flags +must be changed to individual l3+l4 checks ORed between old and new values, +or better, set to cur only which will inherit new. + +In the connection, these values may be merged in the same word as err_code. 
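Editor's sketch (not part of the patch): the status enum and 2-bit fields from binding-possibilities.txt can be packed into one word roughly as follows. The enum tag conn_addr_status, the struct name conn_addr_state and the helper conn_addr_is_known() are hypothetical names introduced only for illustration, and the fourth "cur" layer-3 field is assumed to be meant as cur_l3_dst_status. The value names and field widths come from the note itself.

```c
#include <stdbool.h>

/* Status values as listed in the design note; each fits in 2 bits. */
enum conn_addr_status {
	CO_ADDR_NONE  = 0, /* not set, unknown value */
	CO_ADDR_KNOWN = 1, /* not set, known value */
	CO_ADDR_FIXED = 2, /* fixed value, known */
	CO_ADDR_RANGE = 3, /* from assigned range, known */
};

/* Hypothetical container: 10 fields of 2 bits each, i.e. 20 bits in total,
 * which leaves room to share a 32-bit word with something like err_code.
 */
struct conn_addr_state {
	unsigned int new_l3_src_status:2;
	unsigned int new_l4_src_status:2;
	unsigned int new_l3_dst_status:2;
	unsigned int new_l4_dst_status:2;

	unsigned int cur_l3_src_status:2;
	unsigned int cur_l4_src_status:2;
	unsigned int cur_l3_dst_status:2; /* assumed spelling */
	unsigned int cur_l4_dst_status:2;

	unsigned int new_family:2;
	unsigned int cur_family:2;
};

/* Hypothetical helper: every status except CO_ADDR_NONE means the element's
 * value is already known and needs no getsockname()/getpeername() call.
 */
static inline bool conn_addr_is_known(unsigned int status)
{
	return status != CO_ADDR_NONE;
}
```

Keeping layer 3 (address) and layer 4 (port) in separate fields mirrors the note's remark that the port may be fixed regardless of the address and conversely.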
diff --git a/doc/design-thoughts/connection-reuse.txt b/doc/design-thoughts/connection-reuse.txt new file mode 100644 index 0000000..4eb22f6 --- /dev/null +++ b/doc/design-thoughts/connection-reuse.txt @@ -0,0 +1,224 @@ +2015/08/06 - server connection sharing + +Improvements on the connection sharing strategies +------------------------------------------------- + +4 strategies are currently supported : + - never + - safe + - aggressive + - always + +The "aggressive" and "always" strategies take into account the fact that the +connection has already been reused at least once or not. The principle is that +second requests can be used to safely "validate" connection reuse on newly +added connections, and that such validated connections may be used even by +first requests from other sessions. A validated connection is a connection +which has already been reused, hence proving that it definitely supports +multiple requests. Such connections are easy to verify : after processing the +response, if the txn already had the TX_NOT_FIRST flag, then it was not the +first request over that connection, and it is validated as safe for reuse. +Validated connections are put into a distinct list : server->safe_conns. + +Incoming requests with TX_NOT_FIRST first pick from the regular idle_conns +list so that any new idle connection is validated as soon as possible. + +Incoming requests without TX_NOT_FIRST only pick from the safe_conns list for +strategy "aggressive", guaranteeing that the server properly supports connection +reuse, or first from the safe_conns list, then from the idle_conns list for +strategy "always". + +Connections are always stacked into the list (LIFO) so that there are higher +chances to convert recent connections and to use them. This will first optimize +the likelihood that the connection works, and will avoid TCP metrics from being +lost due to an idle state, and/or the congestion window to drop and the +connection going to slow start mode. 
+ + +Handling connections in pools +----------------------------- + +A per-server "pool-max" setting should be added to permit disposing unused idle +connections not attached anymore to a session for use by future requests. The +principle will be that attached connections are queued from the front of the +list while the detached connections will be queued from the tail of the list. + +This way, most reused connections will be fairly recent and detached connections +will most often be ignored. The number of detached idle connections in the lists +should be accounted for (pool_used) and limited (pool_max). + +After some time, a part of these detached idle connections should be killed. +For this, the list is walked from tail to head and connections without an owner +may be evicted. It may be useful to have a per-server pool_min setting +indicating how many idle connections should remain in the pool, ready for use +by new requests. Conversely, a pool_low metric should be kept between eviction +runs, to indicate the lowest amount of detached connections that were found in +the pool. + +For eviction, the principle of a half-life is appealing. The principle is +simple : over a period of time, half of the connections between pool_min and +pool_low should be gone. Since pool_low indicates how many connections were +remaining unused over a period, it makes sense to kill some of them. + +In order to avoid killing thousands of connections in one run, the purge +interval should be split into smaller batches. Let's call N the ratio of the +half-life interval and the effective interval. + +The algorithm consists in walking over them from the end every interval and +killing ((pool_low - pool_min) + 2 * N - 1) / (2 * N). It ensures that half +of the unused connections are killed over the half-life period, in N batches +of population/2N entries at most. + +Unsafe connections should be evicted first. There should be quite few of them +since most of them are probed and become safe. 
Since detached connections are +quickly recycled and attached to a new session, there should not be too many +detached connections in the pool, and those present there may be killed really +quickly. + +Another interesting point of pools is that when a pool-max is not null, then it +makes sense to automatically enable pretend-keep-alive on non-private connections +going to the server in order to be able to feed them back into the pool. With +the "aggressive" or "always" strategies, it can allow clients making a single +request over their connection to share persistent connections to the servers. + + + +2013/10/17 - server connection management and reuse + +Current state +------------- + +At the moment, a connection entity is needed to carry any address +information. This means in the following situations, we need a server +connection : + +- server is elected and the server's destination address is set + +- transparent mode is elected and the destination address is set from + the incoming connection + +- proxy mode is enabled, and the destination's address is set during + the parsing of the HTTP request + +- connection to the server fails and must be retried on the same + server using the same parameters, especially the destination + address (SN_ADDR_SET not removed) + + +On the accepting side, we have further requirements : + +- allocate a clean connection without a stream interface + +- incrementally set the accepted connection's parameters without + clearing it, and keep track of what is set (eg: getsockname). 
+ +- initialize a stream interface in established mode + +- attach the accepted connection to a stream interface + + +This means several things : + +- the connection has to be allocated on the fly the first time it is + needed to store the source or destination address ; + +- the connection has to be attached to the stream interface at this + moment ; + +- it must be possible to incrementally set some settings on the + connection's addresses regardless of the connection's current state + +- the connection must not be released across connection retries ; + +- it must be possible to clear a connection's parameters for a + redispatch without having to detach/attach the connection ; + +- we need to allocate a connection without an existing stream interface + +So on the accept() side, it looks like this : + + fd = accept(); + conn = new_conn(); + get_some_addr_info(&conn->addr); + ... + si = new_si(); + si_attach_conn(si, conn); + si_set_state(si, SI_ST_EST); + ... + get_more_addr_info(&conn->addr); + +On the connect() side, it looks like this : + + si = new_si(); + while (!properly_connected) { + if (!(conn = si->end)) { + conn = new_conn(); + conn_clear(conn); + si_attach_conn(si, conn); + } + else { + if (connected) { + f = conn->flags & CO_FL_XPRT_TRACKED; + conn->flags &= ~CO_FL_XPRT_TRACKED; + conn_close(conn); + conn->flags |= f; + } + if (!correct_dest) + conn_clear(conn); + } + set_some_addr_info(&conn->addr); + si_set_state(si, SI_ST_CON); + ... + set_more_addr_info(&conn->addr); + conn->connect(); + if (must_retry) { + close_conn(conn); + } + } + +Note: we need to be able to set the control and transport protocols. +On outgoing connections, this is set once we know the destination address. +On incoming connections, this is set the earliest possible (once we know +the source address). + +The problem analysed below was solved on 2013/10/22 + +| ==> the real requirement is to know whether a connection is still valid or not +| before deciding to close it. 
CO_FL_CONNECTED could be enough, though it +| will not indicate connections that are still waiting for a connect to occur. +| This combined with CO_FL_WAIT_L4_CONN and CO_FL_WAIT_L6_CONN should be OK. +| +| Alternatively, conn->xprt could be used for this, but needs some careful checks +| (it's used by conn_full_close at least). +| +| Right now, conn_xprt_close() checks conn->xprt and sets it to NULL. +| conn_full_close() also checks conn->xprt and sets it to NULL, except +| that the check on ctrl is performed within xprt. So conn_xprt_close() +| followed by conn_full_close() will not close the file descriptor. +| Note that conn_xprt_close() is never called, maybe we should kill it ? +| +| Note: at the moment, it's problematic to leave conn->xprt to NULL before doing +| xprt_init() because we might end up with a pending file descriptor. Or at +| least with some transport not de-initialized. We might thus need +| conn_xprt_close() when conn_xprt_init() fails. +| +| The fd should be conditioned by ->ctrl only, and the transport layer by ->xprt. +| +| - conn_prepare_ctrl(conn, ctrl) +| - conn_prepare_xprt(conn, xprt) +| - conn_prepare_data(conn, data) +| +| Note: conn_xprt_init() needs conn->xprt so it's not a problem to set it early. +| +| One problem might be with conn_xprt_close() not being able to know if xprt_init() +| was called or not. That's where it might make sense to only set ->xprt during init. +| Except that it does not fly with outgoing connections (xprt_init is called after +| connect()). +| +| => currently conn_xprt_close() is only used by ssl_sock.c and decides whether +| to do something based on ->xprt_ctx which is set by ->init() from xprt_init(). +| So there is nothing to worry about. We just need to restore conn_xprt_close() +| and rely on ->ctrl to close the fd instead of ->xprt. +| +| => we have the same issue with conn_ctrl_close() : when is the fd supposed to be +| valid ? On outgoing connections, the control is set much before the fd... 
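Editor's sketch (not part of the patch): the half-life eviction rule from the "Handling connections in pools" part of connection-reuse.txt amounts to a rounded-up division of the excess (pool_low - pool_min) by 2*N. The function name pool_evict_batch() is hypothetical, and the guards for N == 0 and for pool_low <= pool_min are assumptions added so the arithmetic cannot underflow or divide by zero.

```c
/* Number of detached idle connections to kill during one eviction run.
 *
 *   pool_low : lowest number of detached connections seen since the last run
 *   pool_min : number of idle connections that should remain in the pool
 *   n        : ratio of the half-life interval to the effective interval
 *
 * Implements ((pool_low - pool_min) + 2 * N - 1) / (2 * N) from the note,
 * i.e. the excess divided by 2N, rounded up.
 */
static inline unsigned int pool_evict_batch(unsigned int pool_low,
                                            unsigned int pool_min,
                                            unsigned int n)
{
	/* assumed guards: nothing to evict, or no valid batch ratio */
	if (n == 0 || pool_low <= pool_min)
		return 0;

	return ((pool_low - pool_min) + 2 * n - 1) / (2 * n);
}
```

For instance, with pool_low = 100, pool_min = 20 and N = 4, each run would kill 10 connections, so about 40 (half of the 80 in excess) would be gone after the 4 runs of one half-life period if pool_low did not move in between.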
diff --git a/doc/design-thoughts/http_load_time.url b/doc/design-thoughts/http_load_time.url new file mode 100644 index 0000000..f178e46 --- /dev/null +++ b/doc/design-thoughts/http_load_time.url @@ -0,0 +1,5 @@ +Excellent paper about page load time for keepalive on/off, pipelining, +multiple host names, etc... + +http://www.die.net/musings/page_load_time/ + diff --git a/doc/design-thoughts/pool-debugging.txt b/doc/design-thoughts/pool-debugging.txt new file mode 100644 index 0000000..106e41c --- /dev/null +++ b/doc/design-thoughts/pool-debugging.txt @@ -0,0 +1,243 @@ +2022-02-22 - debugging options with pools + +Two goals: + - help developers spot bugs as early as possible + + - make the process more reliable in field, by killing sick ones as soon as + possible instead of letting them corrupt data, cause trouble, or even be + exploited. + +An allocated object may exist in 5 forms: + - in use: currently referenced and used by haproxy, 100% of its size are + dedicated to the application which can do absolutely anything with it, + but it may never touch anything before nor after that area. + + - in cache: the object is neither referenced nor used anymore, but it sits + in a thread's cache. The application may not touch it at all anymore, and + some parts of it could even be unmapped. Only the current thread may safely + reach it, though others might find/release it when under thread isolation. + The thread cache needs some LRU linking that may be stored anywhere, either + inside the area, or outside. The parts surrounding the <size> parts remain + invisible to the application layer, and can serve as a protection. + + - in shared cache: the object is neither referenced nor used anymore, but it + may be reached by any thread. Some parts of it could be unmapped. Any + thread may pick it but only one may find it, hence once grabbed, it is + guaranteed no other one will find it. 
The shared cache needs to set up a + linked list and a single pointer needs to be stored anywhere, either inside + or outside the area. The parts surrounding the <size> parts remain + invisible to the application layer, and can serve as a protection. + + - in the system's memory allocator: the object is not known anymore from + haproxy. It may be reassigned in parts or totally to other pools or other + subsystems (e.g. crypto library). Some or all of it may be unmapped. The + areas surrounding the <size> parts are also part of the object from the + library's point of view and may be delivered to other areas. Tampering + with these may cause any other part to malfunction in dirty ways. + + - in the OS only: the memory allocator gave it back to the OS. + +The following options need to be configurable: + - detect improper initialization: this is done by poisonning objects before + delivering them to the application. + + - help figure where an object was allocated when in use: a pointer to the + call place will help. Pointing to the last pool_free() as well for the + same reasons when dealing with a UAF. + + - detection of wrong pointer/pool when in use: a pointer to the pool before + or after the area will definitely help. + + - detection of overflows when in use: a canary at the end of the area + (closest possible to <size>) will definitely help. The pool above can do + that job. Ideally, we should fill some data at the end so that even + unaligned sizes can be checked (e.g. a buffer that gets a zero appended). + If we just align on 2 pointers, writing the same pointer twice at the end + may do the job, but we won't necessarily have our bytes. Thus a particular + end-of-string pattern would be useful (e.g. ff55aa01) to fill it. + + - detection of double free when in cache: similar to detection of wrong + pointer/pool when in use: the pointer at the end may simply be changed so + that it cannot match the pool anymore. 
By using a pointer to the caller of + the previous free() operation, we have the guarantee to see different + pointers, and this pointer can be inspected to figure where the object was + previously freed. An extra check may even distinguish a perfect double-free + (same caller) from just a wrong free (pointer differs from pool). + + - detection of late corruption when in cache: keeping a copy of the + checksum of the whole area upon free() will do the job, but requires one + extra storage area for the checksum. Filling the area with a pattern also + does the job and doesn't require extra storage, but it loses the contents + and can be a bit slower. Sometimes losing the contents can be a feature, + especially when trying to detect late reads. Probably that both need to + be implemented. Note that if contents are not strictly needed, storing a + checksum inside the area does the job. + + - preserve total contents in cache for debugging: losing some precious + information can be a problem. + + - pattern filling of the area helps detect use-after-free in read-only mode. + + - allocate cold first helps with both cases above. + +Uncovered: + - overflow/underflow when in cache/shared/libc: it belongs to use-after-free + pattern and such an error during regular use ought to be caught while the + object was still in use. + + - integrity when in libc: not under our control anymore, this is a libc + problem. + +Arbitrable: + - integrity when in shared cache: unlikely to happen only then if it could + have happened in the local cache. Shared cache not often used anymore, thus + probably not worth the effort + + - protection against double-free when in shared cache/libc: might be done for + a cheap price, probably worth being able to quickly tell that such an + object left the local cache (e.g. the mark points to the caller, but could + possibly just be incremented, hence still point to the same code location+1 + byte when released. 
Calls are 4 bytes min on RISC, 5 on x86 so we do have + some margin by having a caller's location be +0,+1,+2 or +3. + + - underflow when in use: hasn't been really needed over time but may change. + + - detection of late corruption when in shared cache: checksum or area filling + are possible, but is this as relevant as it used to considering the less + common use of the shared cache ? + +Design considerations: + - object allocation when in use must remain minimal + + - when in cache, there are 2 lists which the compiler expect to be at least + aligned each (e.g. if/when we start to use DWCAS). + + - the original "pool debugging" feature covers both pool tracking, double- + free detection, overflow detection and caller info at the cost of a single + pointer placed immediately after the area. + + - preserving the contents might be done by placing the cache links and the + shared cache's list outside of the area (either before or after). Placing + it before has the merit that the allocated object preserves the 4-ptr + alignment. But when a larger alignment is desired this often does not work + anymore. Placing it after requires some dynamic adjustment depending on the + object's size. If any protection is installed, this protection must be + placed before the links so that the list doesn't get randomly corrupted and + corrupts adjacent elements. Note that if protection is desired, the extra + waste is probably less critical. + + - a link to the last caller might have to be stored somewhere. Without + preservation the free() caller may be placed anywhere while the alloc() + caller may only be placed outside. With preservation, again the free() + caller may be placed either before the object or after the mark at the end. + There is no particular need that both share the same location though it may + help. Note that when debugging is enabled, the free() caller doesn't need + to be duplicated and can continue to serve as the double-free detection. 
+ Thus maybe in the end we only need to store the caller to the last alloc() + but not the free() since if we want it it's available via the pool debug. + + - use-after-free detection: contents may be erased on free() and checked on + alloc(), but they can also be checksummed on free() and rechecked on + alloc(). In the latter case we need to store a checksum somewhere. Note + that with pure checksum we don't know what part was modified, but seeing + previous contents can be useful. + +Possibilities: + +1) Linked lists inside the area: + + V size alloc + ---+------------------------------+-----------------+-- + in use |##############################| (Pool) (Tracer) | + ---+------------------------------+-----------------+-- + + ---+--+--+------------------------+-----------------+-- + in cache |L1|L2|########################| (Caller) (Sum) | + ---+--+--+------------------------+-----------------+-- +or: + ---+--+--+------------------------+-----------------+-- + in cache |L1|L2|###################(sum)| (Caller) | + ---+--+--+------------------------+-----------------+-- + + ---+-+----------------------------+-----------------+-- + in global |N|XXXX########################| (Caller) | + ---+-+----------------------------+-----------------+-- + + +2) Linked lists before the area leave room for tracer and pool before + the area, but the canary must remain at the end, however the area will + be more difficult to keep aligned: + + V head size alloc + ----+-+-+------------------------------+-----------------+-- + in use |T|P|##############################| (canary) | + ----+-+-+------------------------------+-----------------+-- + + --+-----+------------------------------+-----------------+-- + in cache |L1|L2|##############################| (Caller) (Sum) | + --+-----+------------------------------+-----------------+-- + + ------+-+------------------------------+-----------------+-- + in global |N|##############################| (Caller) | + 
------+-+------------------------------+-----------------+-- + + +3) Linked lists at the end of the area, might be shared with extra data + depending on the state: + + V size alloc + ---+------------------------------+-----------------+-- + in use |##############################| (Pool) (Tracer) | + ---+------------------------------+-----------------+-- + + ---+------------------------------+--+--+-----------+-- + in cache |##############################|L1|L2| (Caller) (Sum) + ---+------------------------------+--+--+-----------+-- + + ---+------------------------------+-+---------------+-- + in global |##############################|N| (Caller) | + ---+------------------------------+-+---------------+-- + +This model requires a little bit of alignment at the end of the area, which is +not incompatible with pattern filling and/or checksumming: + - preserving the area for post-mortem analysis means nothing may be placed + inside. In this case it could make sense to always store the last releaser. + - detecting late corruption may be done either with filling or checksumming, + but the simple fact of assuming a risk of corruption that needs to be + chased means we must not store the lists nor caller inside the area. + +Some models imply dedicating some place when in cache: + - preserving contents forces the lists to be prefixed or appended, which + leaves unused places when in use. Thus we could systematically place the + pool pointer and the caller in this case. + + - if preserving contents is not desired, almost everything can be stored + inside when not in use. Then each situation's size should be calculated + so that the allocated size is known, and entries are filled from the + beginning while not in use, or after the size when in use. + + - if poisonning is requested, late corruption might be detected but then we + don't want the list to be stored inside at the risk of being corrupted. 
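Editor's sketch (not part of the patch): models 1 and 3 in pool-debugging.txt place a pool pointer right after the <size> bytes while the object is in use, so that the same word serves as an overflow canary and as a wrong-pool check, and is replaced by the caller of free() once released. The code below is a minimal illustration under those assumptions. struct pool and the pool_*_dbg() names are hypothetical and are not HAProxy's API, and the object is deliberately not handed back to libc on release, since the note keeps it in a cache where the mark can still be inspected.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical minimal pool descriptor: only the object size matters here. */
struct pool {
	size_t size;
};

/* Allocate <size> bytes plus one trailing pointer, and store the pool's
 * address immediately after the area ("in use" form: (Pool) mark).
 */
static void *pool_alloc_dbg(struct pool *p)
{
	unsigned char *area = malloc(p->size + sizeof(p));

	if (!area)
		return NULL;
	memcpy(area + p->size, &p, sizeof(p)); /* unaligned-safe store */
	return area;
}

/* Return 1 if the trailing mark still designates <p>, 0 otherwise. A zero
 * result means an overflow past <size>, a wrong pool, or an object that was
 * already released.
 */
static int pool_check_dbg(const struct pool *p, const void *ptr)
{
	const struct pool *mark;

	memcpy(&mark, (const unsigned char *)ptr + p->size, sizeof(mark));
	return mark == p;
}

/* Replace the mark with the releaser's address ("in cache" form: (Caller)),
 * so that a second release no longer matches the pool. The memory itself is
 * kept; a real cache would link it into its lists at this point.
 */
static void pool_mark_released_dbg(const struct pool *p, void *ptr,
                                   const void *caller)
{
	memcpy((unsigned char *)ptr + p->size, &caller, sizeof(caller));
}
```

A caller would run pool_check_dbg() before every release and abort on a zero result. Because the stored value is compared byte for byte, even a one-byte overflow on an unaligned size is caught.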
+ +Maybe just implement a few models: + - compact/optimal: put l1/l2 inside + - detect late corruption: fill/sum, put l1/l2 out + - preserve contents: put l1/l2 out + - corruption+preserve: do not fill, sum out + - poisonning: not needed on free if pattern filling is done. + +try2: + - poison on alloc to detect missing initialization: yes/no + (note: nothing to do if filling done) + - poison on free to detect use-after-free: yes/no + (note: nothing to do if filling done) + - check on alloc for corruption-after-free: yes/no + If content-preserving => sum, otherwise pattern filling; in + any case, move L1/L2 out. + - check for overflows: yes/no: use a canary after the area. The + canary can be the pointer to the pool. + - check for alloc caller: yes/no => always after the area + - content preservation: yes/no + (disables filling, moves lists out) + - improved caller tracking: used to detect double-free, may benefit + from content-preserving but not only. diff --git a/doc/design-thoughts/thread-group.txt b/doc/design-thoughts/thread-group.txt new file mode 100644 index 0000000..e845230 --- /dev/null +++ b/doc/design-thoughts/thread-group.txt @@ -0,0 +1,655 @@ +Thread groups +############# + +2021-07-13 - first draft +========== + +Objective +--------- +- support multi-socket systems with limited cache-line bouncing between + physical CPUs and/or L3 caches + +- overcome the 64-thread limitation + +- Support a reasonable number of groups. I.e. if modern CPUs arrive with + core complexes made of 8 cores, with 8 CC per chip and 2 chips in a + system, it makes sense to support 16 groups. + + +Non-objective +------------- +- no need to optimize to the last possible cycle. I.e. some algos like + leastconn will remain shared across all threads, servers will keep a + single queue, etc. Global information remains global. + +- no stubborn enforcement of FD sharing. Per-server idle connection lists + can become per-group; listeners can (and should probably) be per-group. 
+ Other mechanisms (like SO_REUSEADDR) can already overcome this. + +- no need to go beyond 64 threads per group. + + +Identified tasks +================ + +General +------- +Everywhere tid_bit is used we absolutely need to find a complement using +either the current group or a specific one. Thread debugging will need to +be extended as masks are extensively used. + + +Scheduler +--------- +The global run queue and global wait queue must become per-group. This +means that a task may only be queued into one of them at a time. It +sounds like tasks may only belong to a given group, but doing so would +bring back the original issue that it's impossible to perform remote wake +ups. + +We could probably ignore the group if we don't need to set the thread mask +in the task. The task's thread_mask is never manipulated using atomics so +it's safe to complement it with a group. + +The sleeping_thread_mask should become per-group. Thus possibly that a +wakeup may only be performed on the assigned group, meaning that either +a task is not assigned, in which case it will be self-assigned (like today), +otherwise the tg to be woken up will be retrieved from the task itself. + +Task creation currently takes a thread mask of either tid_bit, a specific +mask, or MAX_THREADS_MASK. How to create a task able to run anywhere +(checks, Lua, ...) ? + +Profiling -> completed +--------- +There should be one task_profiling_mask per thread group. Enabling or +disabling profiling should be made per group (possibly by iterating). +-> not needed anymore, one flag per thread in each thread's context. + +Thread isolation +---------------- +Thread isolation is difficult as we solely rely on atomic ops to figure +who can complete. Such operation is rare, maybe we could have a global +read_mostly flag containing a mask of the groups that require isolation. +Then the threads_want_rdv_mask etc can become per-group. 
However setting +and clearing the bits will become problematic as this will happen in two +steps hence will require careful ordering. + +FD +-- +Tidbit is used in a number of atomic ops on the running_mask. If we have +one fdtab[] per group, the mask implies that it's within the group. +Theoretically we should never face a situation where an FD is reported nor +manipulated for a remote group. + +There will still be one poller per thread, except that this time all +operations will be related to the current thread_group. No fd may appear +in two thread_groups at once, but we can probably not prevent that (e.g. +delayed close and reopen). Should we instead have a single shared fdtab[] +(less memory usage also) ? Maybe adding the group in the fdtab entry would +work, but when does a thread know it can leave it ? Currently this is +solved by running_mask and by update_mask. Having two tables could help +with this (each table sees the FD in a different group with a different +mask) but this looks overkill. + +There's polled_mask[] which needs to be decided upon. Probably that it +should be doubled as well. Note, polled_mask left fdtab[] for cacheline +alignment reasons in commit cb92f5cae4. + +If we have one fdtab[] per group, what *really* prevents from using the +same FD in multiple groups ? _fd_delete_orphan() and fd_update_events() +need to check for no-thread usage before closing the FD. This could be +a limiting factor. Enabling could require to wake every poller. + +Shouldn't we remerge fdinfo[] with fdtab[] (one pointer + one int/short, +used only during creation and close) ? + +Other problem, if we have one fdtab[] per TG, disabling/enabling an FD +(e.g. pause/resume on listener) can become a problem if it's not necessarily +on the current TG. We'll then need a way to figure that one. It sounds like +FDs from listeners and receivers are very specific and suffer from problems +all other ones under high load do not suffer from. 
Maybe something specific +ought to be done for them, if we can guarantee there is no risk of accidental +reuse (e.g. locate the TG info in the receiver and have a "MT" bit in the +FD's flags). The risk is always that a close() can result in instant pop-up +of the same FD on any other thread of the same process. + +Observations: right now fdtab[].thread_mask more or less corresponds to a +declaration of interest, it's very close to meaning "active per thread". It is +in fact located in the FD while it ought to do nothing there, as it should be +where the FD is used as it rules accesses to a shared resource that is not +the FD but what uses it. Indeed, if neither polled_mask nor running_mask have +a thread's bit, the FD is unknown to that thread and the element using it may +only be reached from above and not from the FD. As such we ought to have a +thread_mask on a listener and another one on connections. These ones will +indicate who uses them. A takeover could then be simplified (atomically set +exclusivity on the FD's running_mask, upon success, takeover the connection, +clear the running mask). Probably that the change ought to be performed on +the connection level first, not the FD level by the way. But running and +polled are the two relevant elements, one indicates userland knowledge, +the other one kernel knowledge. For listeners there's no exclusivity so it's +a bit different but the rule remains the same that we don't have to know +what threads are *interested* in the FD, only its holder. + +Not exact in fact, see FD notes below. + +activity +-------- +There should be one activity array per thread group. The dump should +simply scan them all since the cumuled values are not very important +anyway. + +applets +------- +They use tid_bit only for the task. It looks like the appctx's thread_mask +is never used (now removed). Furthermore, it looks like the argument is +*always* tid_bit. + +CPU binding +----------- +This is going to be tough. 
It will be needed to detect that threads overlap +and are not bound (i.e. all threads on same mask). In this case, if the number +of threads is higher than the number of threads per physical socket, one must +try hard to evenly spread them among physical sockets (e.g. one thread group +per physical socket) and start as many threads as needed on each, bound to +all threads/cores of each socket. If there is a single socket, the same job +may be done based on L3 caches. Maybe it could always be done based on L3 +caches. The difficulty behind this is the number of sockets to be bound: it +is not possible to bind several FDs per listener. Maybe with a new bind +keyword we can imagine to automatically duplicate listeners ? In any case, +the initially bound cpumap (via taskset) must always be respected, and +everything should probably start from there. + +Frontend binding +---------------- +We'll have to define a list of threads and thread-groups per frontend. +Probably that having a group mask and a same thread-mask for each group +would suffice. + +Threads should have two numbers: + - the per-process number (e.g. 1..256) + - the per-group number (1..64) + +The "bind-thread" lines ought to use the following syntax: + - bind 45 ## bind to process' thread 45 + - bind 1/45 ## bind to group 1's thread 45 + - bind all/45 ## bind to thread 45 in each group + - bind 1/all ## bind to all threads in group 1 + - bind all ## bind to all threads + - bind all/all ## bind to all threads in all groups (=all) + - bind 1/65 ## rejected + - bind 65 ## OK if there are enough + - bind 35-45 ## depends. Rejected if it crosses a group boundary. + +The global directive "nbthread 28" means 28 total threads for the process. The +number of groups will sub-divide this. E.g. 4 groups will very likely imply 7 +threads per group. At the beginning, the nbgroup should be manual since it +implies config adjustments to bind lines. 
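The "nbthread 28" with 4 groups example can be sketched in C as below. This is only an illustration: thr_split(), thr_join() and struct thr_pos are hypothetical names, and the sketch assumes equal-sized, contiguous groups and the 1-based numbering used in the "bind" examples above, which the notes do not mandate.

```c
#define MAX_THREADS_PER_GROUP 64  /* per-group limit stated in the notes */

/* Hypothetical pair: 1-based group number and 1-based thread number in it */
struct thr_pos {
    int grp;
    int ltid;
};

/* Returns the number of threads per group, or -1 if the split is not even
 * (assumption of this sketch) or exceeds the per-group limit.
 */
static int thr_per_group(int nbthread, int nbgroup)
{
    if (nbgroup < 1 || nbthread < nbgroup || nbthread % nbgroup)
        return -1;
    if (nbthread / nbgroup > MAX_THREADS_PER_GROUP)
        return -1;
    return nbthread / nbgroup;
}

/* Maps a 1-based process-wide thread number to its group and local number.
 * Returns 0 on success, -1 if the thread number is out of range.
 */
static int thr_split(int gtid, int nbthread, int nbgroup, struct thr_pos *out)
{
    int per_grp = thr_per_group(nbthread, nbgroup);

    if (per_grp < 0 || gtid < 1 || gtid > nbthread)
        return -1;
    out->grp  = (gtid - 1) / per_grp + 1;
    out->ltid = (gtid - 1) % per_grp + 1;
    return 0;
}

/* Reverse mapping: returns the process-wide thread number, or -1 when the
 * group/thread pair does not exist (e.g. "1/65" is always rejected).
 */
static int thr_join(int grp, int ltid, int nbthread, int nbgroup)
{
    int per_grp = thr_per_group(nbthread, nbgroup);

    if (per_grp < 0 || grp < 1 || grp > nbgroup || ltid < 1 || ltid > per_grp)
        return -1;
    return (grp - 1) * per_grp + ltid;
}
```

With 28 threads in 4 groups, process-wide thread 10 becomes thread 3 of group 2, and the pair 1/65 is rejected regardless of the configuration.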
+ +There should be a trivial way to map a global thread to a group and local ID +and to do the opposite. + + +Panic handler + watchdog +------------------------ +Will probably depend on what's done for thread_isolate + +Per-thread arrays inside structures +----------------------------------- +- listeners have a thr_conn[] array, currently limited to MAX_THREADS. Should + we simply bump the limit ? +- same for servers with idle connections. +=> doesn't seem very practical. +- another solution might be to point to dynamically allocated arrays of + arrays (e.g. nbthread * nbgroup) or a first level per group and a second + per thread. +=> dynamic allocation based on the global number + +Other +----- +- what about dynamic thread start/stop (e.g. for containers/VMs) ? + E.g. if we decide to start $MANY threads in 4 groups, and only use + one, in the end it will not be possible to use less than one thread + per group, and at most 64 will be present in each group. + + +FD Notes +-------- + - updt_fd_polling() uses thread_mask to figure where to send the update, + the local list or a shared list, and which bits to set in update_mask. + This could be changed so that it takes the update mask in argument. The + call from the poller's fork would just have to broadcast everywhere. + + - pollers use it to figure whether they're concerned or not by the activity + update. This looks important as otherwise we could re-enable polling on + an FD that changed to another thread. + + - thread_mask being a per-thread active mask looks more exact and is + precisely used this way by _update_fd(). In this case using it instead + of running_mask to gauge a change or temporarily lock it during a + removal could make sense. + + - running should be conditioned by thread. Polled not (since deferred + or migrated). In this case testing thread_mask can be enough most of + the time, but this requires synchronization that will have to be + extended to tgid.. 
But migration seems a different beast that we shouldn't + care about here: if first performed at the higher level it ought to + be safe. + +In practice the update_mask can be dropped to zero by the first fd_delete() +as the only authority allowed to fd_delete() is *the* owner, and as soon as +all running_mask are gone, the FD will be closed, hence removed from all +pollers. This will be the only way to make sure that update_mask always +refers to the current tgid. + +However, it may happen that a takeover within the same group causes a thread +to read the update_mask late, while the FD is being wiped by another thread. +That other thread may close it, causing another thread in another group to +catch it, and change the tgid and start to update the update_mask. This means +that it would be possible for a thread entering do_poll() to see the correct +tgid, then the fd would be closed, reopened and reassigned to another tgid, +and the thread would see its bit in the update_mask, being confused. Right +now this should already happen when the update_mask is not cleared, except +that upon wakeup a migration would be detected and that would be all. + +Thus we might need to set the running bit to prevent the FD from migrating +before reading update_mask, which also implies closing on fd_clr_running() == 0 :-( + +Also even fd_update_events() leaves a risk of updating update_mask after +clearing running, thus affecting the wrong one. Probably that update_mask +should be updated before clearing running_mask there. Also, how about not +creating an update on a close ? Not trivial if done before running, unless +thread_mask==0. + +Note that one situation that is currently visible is that a thread closes a +file descriptor that it's the last one to own and to have an update for. In +fd_delete_orphan() it does call poller.clo() but this one is not sufficient +as it doesn't drop the update_mask nor does it clear the polled_mask. 
The +typical problem that arises is that the close() happens before processing +the last update (e.g. a close() just after a partial read), thus it still +has *at least* one bit set for the current thread in both update_mask and +polled_mask, and it is present in the update_list. Not handling it would +mean that the event is lost on update() from the concerned threads and +that some resource might leak. Handling it means zeroing the update_mask +and polled_mask, and deleting the update entry from the update_list, thus +losing the update event. And as indicated above, if the FD switches twice +between 2 groups, the finally called thread does not necessarily know that +the FD isn't the same anymore, thus it's difficult to decide whether to +delete it or not, because deleting the event might in fact mean deleting +something that was just re-added for the same thread with the same FD but +a different usage. + +Also it really seems unrealistic to scan a single shared update_list like +this using write operations. There should likely be one per thread-group. +But in this case there is no more choice than deleting the update event +upon fd_delete_orphan(). This also means that poller->clo() must do the +job for all of the group's threads at once. This would mean a synchronous +removal before the close(), which doesn't seem ridiculously expensive. It +just requires that any thread of a group may manipulate any other thread's +status for an FD and a poller. + +Note about our currently supported pollers: + + - epoll: our current code base relies on the modern version which + automatically removes closed FDs, so we don't have anything to do + when closing and we don't need the update. + + - kqueue: according to https://www.freebsd.org/cgi/man.cgi?query=kqueue, just + like epoll, a close() implies a removal. Our poller doesn't perform + any bookkeeping either so it's OK to directly close. 
+
+  - evports: https://docs.oracle.com/cd/E86824_01/html/E54766/port-dissociate-3c.html
+    says the same, i.e. close() implies a removal of all events. No local
+    processing nor bookkeeping either, we can close.
+
+  - poll: the fd_evts[] array is global, thus shared by all threads. As such,
+    a single removal is needed to flush it for all threads at once. The
+    operation is already performed like this.
+
+  - select: works exactly like poll() above, hence already handled.
+
+As a preliminary conclusion, it's safe to delete the event and reset
+update_mask just after calling poller->clo(). If extremely unlucky (changing
+thread mask due to takeover ?), the same FD may appear at the same time:
+  - in one or several thread-local fd_updt[] arrays. These ones are just work
+    queues, there's nothing to do to ignore them, just leave the holes with an
+    outdated FD which will be ignored once met. As a bonus, poller->clo() could
+    check if the last fd_updt[] points to this specific FD and decide to kill
+    it.
+
+  - in the global update_list. In this case, fd_rm_from_fd_list() already
+    performs an attachment check, so it's safe to always call it before closing
+    (since no one else may be in the process of changing anything).
+
+
+###########################################################
+
+Current state:
+
+
+Mux / takeover / fd_delete() code                |||  poller code
+-------------------------------------------------|||---------------------------------------------------
+                                                 \|/
+mux_takeover():                                   | fd_set_running():
+   if (fd_takeover()<0)                           |    old = {running, thread};
+     return fail;                                 |    new = {tid_bit, tid_bit};
+   ...                                            |
+fd_takeover():                                    |    do {
+   atomic_or(running, tid_bit);                   |       if (!(old.thread & tid_bit))
+   old = {running, thread};                       |          return -1;
+   new = {tid_bit, tid_bit};                      |       new = { running | tid_bit, old.thread };
+   if (owner != expected) {                       |    } while (!dwcas({running, thread}, &old, &new));
+      atomic_and(running, ~tid_bit);              |
+      return -1; // fail                          | fd_clr_running():
+   }                                              |    return atomic_and_fetch(running, ~tid_bit);
+                                                  |
+   while (old == {tid_bit, !=0 }) {               | poll():
+      if (dwcas({running, thread}, &old, &new)) { |    if (!owner)
+         atomic_and(running, ~tid_bit);           |       continue;
+         return 0; // success                     |
+      }                                           |    if (!(thread_mask & tid_bit)) {
+   }                                              |       epoll_ctl_del();
+                                                  |       continue;
+   atomic_and(running, ~tid_bit);                 |    }
+   return -1; // fail                             |
+                                                  |    // via fd_update_events()
+fd_delete():                                      |    if (fd_set_running() != -1) {
+   atomic_or(running, tid_bit);                   |       iocb();
+   atomic_store(thread, 0);                       |       if (fd_clr_running() == 0 && !thread_mask)
+   if (fd_clr_running(fd) == 0)                   |          fd_delete_orphan();
+        fd_delete_orphan();                       |    }
+
+
+The idle_conns_lock prevents the connection from being *picked* and released
+while someone else is reading it. What it does is guarantee that on idle
+connections, the caller of the IOCB will not dereference the task's context
+while the connection is still in the idle list, since it might be picked then
+freed at the same instant by another thread. As soon as the IOCB manages to
+get that lock, it removes the connection from the list so that it cannot be
+taken over anymore. Conversely, the mux's takeover() code runs under that
+lock so that if it frees the connection and task, this will appear atomic
+to the IOCB. The timeout task (which is another entry point for connection
+deletion) does the same. Thus, when coming from the low-level (I/O or timeout):
+  - the task always exists, but its ctx, checked under the lock, validates it;
+    removing the conn from the list prevents takeover().
+  - t->context is stable, except during changes under takeover lock.
So + h2_timeout_task may well run on a different thread than h2_io_cb(). + +Coming from the top: + - takeover() done under lock() clears task's ctx and possibly closes the FD + (unless some running remains present). + +Unlikely but currently possible situations: + - multiple pollers (up to N) may have an idle connection's FD being + polled, if the connection was passed from thread to thread. The first + event on the connection would wake all of them. Most of them would + see fdtab[].owner set (the late ones might miss it). All but one would + see that their bit is missing from fdtab[].thread_mask and give up. + However, just after this test, others might take over the connection, + so in practice if terribly unlucky, all but 1 could see their bit in + thread_mask just before it gets removed, all of them set their bit + in running_mask, and all of them call iocb() (sock_conn_iocb()). + Thus all of them dereference the connection and touch the subscriber + with no protection, then end up in conn_notify_mux() that will call + the mux's wake(). + + - multiple pollers (up to N-1) might still be in fd_update_events() + manipulating fdtab[].state. The cause is that the "locked" variable + is determined by atleast2(thread_mask) but that thread_mask is read + at a random instant (i.e. it may be stolen by another one during a + takeover) since we don't yet hold running to prevent this from being + done. Thus we can arrive here with thread_mask==something_else (1bit), + locked==0 and fdtab[].state assigned non-atomically. + + - it looks like nothing prevents h2_release() from being called on a + thread (e.g. from the top or task timeout) while sock_conn_iocb() + dereferences the connection on another thread. Those killing the + connection don't yet consider the fact that it's an FD that others + might currently be waking up on. + +################### + +pb with counter: + +users count doesn't say who's using the FD and two users can do the same +close in turn. 
The thread_mask should define who's responsible for closing
+the FD, and all those with a bit in it ought to do it.
+
+
+2021-08-25 - update with minimal locking on tgid value
+==========
+
+  - tgid + refcount at once using CAS
+  - idle_conns lock during updates
+  - update:
+    if tgid differs => close happened, thus drop update
+    otherwise normal stuff. Lock tgid until running if needed.
+  - poll report:
+    if tgid differs => closed
+    if thread differs => stop polling (migrated)
+    keep tgid lock until running
+  - test on thread_id (refcount in the upper bits, tgid in the lower 16):
+    if ((xadd(&tgid, 65536) & 0xffff) != my_tgid) {
+        // was closed
+        sub(&tgid, 65536)
+        return -1
+    }
+    if !(thread_id & tidbit) => migrated/closed
+    set_running()
+    sub(&tgid, 65536)
+  - note: either fd_insert() or the final close() ought to set
+    polled and update to 0.
+
+2021-09-13 - tid / tgroups etc.
+==========
+
+  * tid currently is the thread's global ID. It's essentially used as an index
+    for arrays. It must be clearly stated that it works this way.
+
+  * tasklets use the global thread id, and __tasklet_wakeup_on() must use a
+    global ID as well. It is essential that tinfo[] provides instant access to
+    local/global bits/indexes/arrays.
+
+  - tid_bit makes no sense process-wide, so it must be redefined to represent
+    the thread's tid within its group. The name is not ideal, but there are
+    286 occurrences of it that are not going to be changed that fast.
+    => now we have ltid and ltid_bit in thread_info. thread-local tid_bit still
+       not changed though. If renamed we must make sure the older one vanishes.
+       Why not use "ptid, ptid_bit" for the process-wide tid and "gtid,
+       gtid_bit" for the group-wide ones ? This removes the ambiguity on "tid"
+       which is half the time not the one we expect.
+
+  * just like "ti" is the thread_info, we need to have "tg" pointing to the
+    thread_group.
+
+  - other less commonly used elements should be retrieved from ti->xxx. E.g.
+    the thread's local ID.
+ + - lock debugging must reproduce tgid + + * task profiling must be made per-group (annoying), unless we want to add a + per-thread TH_FL_* flag and have the rare places where the bit is changed + iterate over all threads if needed. Sounds preferable overall. + + * an offset might be placed in the tgroup so that even with 64 threads max + we could have completely separate tid_bits over several groups. + => base and count now + +2021-09-15 - bind + listen() + rx +========== + + - thread_mask (in bind_conf->rx_settings) should become an array of + MAX_TGROUP longs. + - when parsing "thread 123" or "thread 2/37", the proper bit is set, + assuming the array is either a contiguous bitfield or a tgroup array. + An option RX_O_THR_PER_GRP or RX_O_THR_PER_PROC is set depending on + how the thread num was parsed, so that we reject mixes. + - end of parsing: entries translated to the cleanest form (to be determined) + - binding: for each socket()/bind()/listen()... just perform one extra dup() + for each tgroup and store the multiple FDs into an FD array indexed on + MAX_TGROUP. => allows to use one FD per tgroup for the same socket, hence + to have multiple entries in all tgroup pollers without requiring the user + to duplicate the bind line. + +2021-09-15 - global thread masks +========== + +Some global variables currently expect to know about thread IDs and it's +uncertain what must be done with them: + - global_tasks_mask /* Mask of threads with tasks in the global runqueue */ + => touched under the rq lock. Change it per-group ? What exact use is made ? + + - sleeping_thread_mask /* Threads that are about to sleep in poll() */ + => seems that it can be made per group + + - all_threads_mask: a bit complicated, derived from nbthread and used with + masks and with my_ffsl() to wake threads up. Should probably be per-group + but we might miss something for global. + + - stopping_thread_mask: used in combination with all_threads_mask, should + move per-group. 
+ + - threads_harmless_mask: indicates all threads that are currently harmless in + that they promise not to access a shared resource. Must be made per-group + but then we'll likely need a second stage to have the harmless groups mask. + threads_idle_mask, threads_sync_mask, threads_want_rdv_mask go with the one + above. Maybe the right approach will be to request harmless on a group mask + so that we can detect collisions and arbiter them like today, but on top of + this it becomes possible to request harmless only on the local group if + desired. The subtlety is that requesting harmless at the group level does + not mean it's achieved since the requester cannot vouch for the other ones + in the same group. + +In addition, some variables are related to the global runqueue: + __decl_aligned_spinlock(rq_lock); /* spin lock related to run queue */ + struct eb_root rqueue; /* tree constituting the global run queue, accessed under rq_lock */ + unsigned int grq_total; /* total number of entries in the global run queue, atomic */ + static unsigned int global_rqueue_ticks; /* insertion count in the grq, use rq_lock */ + +And others to the global wait queue: + struct eb_root timers; /* sorted timers tree, global, accessed under wq_lock */ + __decl_aligned_rwlock(wq_lock); /* RW lock related to the wait queue */ + struct eb_root timers; /* sorted timers tree, global, accessed under wq_lock */ + + +2022-06-14 - progress on task affinity +========== + +The particularity of the current global run queue is to be usable for remote +wakeups because it's protected by a lock. There is no need for a global run +queue beyond this, and there could already be a locked queue per thread for +remote wakeups, with a random selection at wakeup time. It's just that picking +a pending task in a run queue among a number is convenient (though it +introduces some excessive locking). A task will either be tied to a single +group or will be allowed to run on any group. 
As such it's pretty clear that we
+don't need a global run queue. When a run-anywhere task expires, either it runs
+on the current group's runqueue with any thread, or a target thread is selected
+during the wakeup and it's directly assigned.
+
+A global wait queue seems important for scheduled repetitive tasks however. But
+maybe it's more a task for a cron-like job and there's no need for the task
+itself to wake up anywhere, because once the task wakes up, it must be tied to
+one (or a set of) thread(s). One difficulty if the task is temporarily assigned
+a thread group is that it's impossible to know where it's running when trying
+to perform a second wakeup or when trying to kill it. Maybe we'll need to have
+two tgid for a task (desired, effective). Or maybe we can restrict the ability
+of such a task to stay in the wait queue in case of wakeup, though that sounds
+difficult. Other approaches would be to set the GID to the current one when
+waking up the task, and to have a flag (or sign on the GID) indicating that the
+task is still queued in the global timers queue. We already have TASK_SHARED_WQ
+so it seems that another similar flag such as TASK_WAKE_ANYWHERE could make
+sense. But when is TASK_SHARED_WQ really used, except for the "anywhere" case ?
+All calls to task_new() use either 1<<thr, tid_bit, all_threads_mask, or come
+from appctx_new which does exactly the same. The only real user of a non-global,
+non-unique task_new() call is debug_parse_cli_sched() which purposely allows
+an arbitrary mask.
+
+  +----------------------------------------------------------------------------+
+  | => we don't need one WQ per group, only a global and N local ones, hence   |
+  |    the TASK_SHARED_WQ flag can continue to be used for this purpose.       |
+  +----------------------------------------------------------------------------+
+
+Having TASK_SHARED_WQ should indicate that a task will always be queued to the
+shared queue and will always have a temporary gid and thread mask in the run
+queue.
+
+Going further, as we don't have any single case of a task bound to a small set
+of threads, we could decide to wake up only expired tasks for ourselves by
+looking them up using eb32sc and adopting them. Thus, there's no more need for
+a shared runqueue nor a global_rqueue_ticks counter, and we can simply have
+the ability to wake up a remote task. The task's thread_mask will then change
+so that it's only a thread ID, except when the task has TASK_SHARED_WQ, in
+which case it corresponds to the running thread. That's very close to what is
+already done with tasklets in fact.
+
+
+2021-09-29 - group designation and masks
+==========
+
+Neither FDs nor tasks will belong to incomplete subsets of threads spanning
+over multiple thread groups. In addition there may be a difference between
+configuration and operation (for FDs). This allows the following rules to be
+set:
+
+  group  mask   description
+    0     0     bind_conf: groups & thread not set. bind to any/all
+                task: it would be nice to mean "run on the same as the caller".
+
+    0    xxx    bind_conf: thread set but not group: thread IDs are global
+                FD/task: group 0, mask xxx
+
+   G>0    0     bind_conf: only group is set: bind to all threads of group G
+                FD/task: mask 0 not permitted (= not owned). May be used to
+                mention "any thread of this group", though already covered by
+                G/xxx like today.
+
+   G>0   xxx    bind_conf: Bind to these threads of this group
+                FD/task: group G, mask xxx
+
+It looks like keeping groups starting at zero internally complicates everything
+though. But forcing it to start at 1 might also require that we rescan all tasks
+to replace 0 with 1 upon startup. This would also allow group 0 to be special and
+be used as the default group for any new thread creation, so that group0.count
+would keep the number of unassigned threads. Let's try:
+
+  group  mask   description
+    0     0     bind_conf: groups & thread not set. bind to any/all
+                task: "run on the same group & thread as the caller".
+
+    0    xxx    bind_conf: thread set but not group: thread IDs are global
+                FD/task: invalid. Or maybe for a task we could use this to
+                mean "run on current group, thread XXX", which would cover
+                the need for health checks (g/t 0/0 while sleeping, 0/xxx
+                while running) and have wake_expired_tasks() detect 0/0 and
+                wake them up to a random group.
+
+   G>0    0     bind_conf: only group is set: bind to all threads of group G
+                FD/task: mask 0 not permitted (= not owned). May be used to
+                mention "any thread of this group", though already covered by
+                G/xxx like today.
+
+   G>0   xxx    bind_conf: Bind to these threads of this group
+                FD/task: group G, mask xxx
+
+With a single group declared in the config, group 0 would implicitly find the
+first one.
+
+
+The problem with the approach above is that a task queued in one group+thread's
+wait queue could very well receive a signal from another thread and/or group,
+and that there is no indication about where the task is queued, nor how to
+dequeue it. Thus it seems that it's up to the application itself to unbind/
+rebind a task. This contradicts the principle of leaving a task waiting in a
+wait queue and waking it anywhere.
+
+Another possibility might be to decide that a task having a defined group but
+a mask of zero is shared and will always be queued into its group's wait queue.
+However, upon expiry, the scheduler would notice the thread-mask 0 and would
+broadcast it to any group.
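The bind_conf column of the group/mask tables above is the same in both variants, so it can be sketched as a small expansion routine. This is a sketch under assumptions, not the actual parser: bind_expand() is a hypothetical name, MAX_TGROUP is given an assumed value of 16, groups are assumed to be equal-sized, and global thread IDs are limited to the first 64 threads because a single 64-bit mask is used as input.

```c
#include <stdint.h>

#define MAX_TGROUP 16  /* assumed value, the notes only name the constant */

/* Expands a configured (group, mask) pair into one local thread mask per
 * group, following the bind_conf rows of the table:
 *   0/0   -> all threads of all groups
 *   0/xxx -> thread IDs are global (bit t is process-wide thread t+1)
 *   G/0   -> all threads of group G
 *   G/xxx -> these threads of group G
 * out[g-1] receives the mask of group g. Returns 0, or -1 if invalid.
 */
static int bind_expand(unsigned int grp, uint64_t mask,
                       unsigned int nbgroup, unsigned int per_grp,
                       uint64_t out[MAX_TGROUP])
{
    uint64_t all;
    unsigned int g, t;

    if (nbgroup < 1 || nbgroup > MAX_TGROUP ||
        per_grp < 1 || per_grp > 64 || grp > nbgroup)
        return -1;

    all = (per_grp == 64) ? ~(uint64_t)0 : (((uint64_t)1 << per_grp) - 1);
    for (g = 0; g < nbgroup; g++)
        out[g] = 0;

    if (grp == 0 && mask == 0) {
        for (g = 0; g < nbgroup; g++)
            out[g] = all;
        return 0;
    }

    if (grp == 0) {
        /* global IDs: move each bit to its group and local position */
        for (t = 0; t < 64; t++) {
            if (!(mask & ((uint64_t)1 << t)))
                continue;
            if (t >= nbgroup * per_grp)
                return -1;
            out[t / per_grp] |= (uint64_t)1 << (t % per_grp);
        }
        return 0;
    }

    if (mask & ~all)
        return -1; /* thread beyond the group's size */
    out[grp - 1] = mask ? mask : all;
    return 0;
}
```

The FD/task column is deliberately left out since the notes still hesitate between several meanings for the 0/xxx and G/0 combinations.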
+
+Right now in the code we have:
+  - 18 calls of task_new(tid_bit)
+  - 17 calls of task_new_anywhere()
+  - 2 calls with a single bit
+
+Thus it looks like "task_new_anywhere()", "task_new_on()" and
+"task_new_here()" would be sufficient.
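A minimal sketch of how these three entry points could relate to each other, assuming the task stores a single thread ID instead of a mask (as suggested in the 2022-06-14 notes) and that a negative ID means "any thread". The sketch_ prefix, struct sketch_task and cur_tid are hypothetical stand-ins, not the real task API.

```c
#include <stdlib.h>

/* Hypothetical, reduced task: only the affinity field is modelled */
struct sketch_task {
    int tid;  /* thread ID the task is bound to, or -1 for any thread */
};

/* Stand-in for the calling thread's ID; it would be thread-local in a
 * real multi-threaded program.
 */
static int cur_tid = 0;

/* Creates a task bound to thread <thr>, or to any thread if <thr> < 0.
 * Returns NULL on allocation failure.
 */
static struct sketch_task *sketch_task_new_on(int thr)
{
    struct sketch_task *t = calloc(1, sizeof(*t));

    if (t)
        t->tid = (thr < 0) ? -1 : thr;
    return t;
}

/* Covers the task_new(tid_bit) calls: bound to the calling thread */
static struct sketch_task *sketch_task_new_here(void)
{
    return sketch_task_new_on(cur_tid);
}

/* Covers the run-anywhere calls: no thread is chosen at creation time */
static struct sketch_task *sketch_task_new_anywhere(void)
{
    return sketch_task_new_on(-1);
}
```

In this arrangement the "here" and "anywhere" variants are thin wrappers, so only the "on" variant needs to know how affinity is encoded.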