IN-Operator Implementation Notes
================================

## Definitions:

An IN operator has one of the following formats:

>     x IN (y1,y2,y3,...,yN)
>     x IN (subquery)

The "x" is referred to as the LHS (left-hand side). The list or subquery on
the right is called the RHS (right-hand side). If the RHS is a list, it must
be a non-empty list. But if the RHS is a subquery, it can be an empty set.

The LHS can be a scalar (a single quantity) or a vector (a list of two or
more values) or a subquery that returns one or more columns. We use the term
"vector" to mean an actual list of values or a subquery that returns two or
more columns. An isolated value or a subquery that returns a single column
is called a scalar.

The RHS can be a subquery that returns a single column, a subquery that
returns two or more columns, or a list of scalars. A list of vectors on the
RHS is not currently supported.

The number of columns for the LHS must match the number of columns for the
RHS. If the RHS is a list of values, then the LHS must be a scalar. If the
RHS is a subquery returning N columns, then the LHS must be a vector of
size N.

NULL values can occur in either or both of the LHS and RHS. If the LHS
contains only NULL values, then we say that it is a "total-NULL". If the LHS
contains some NULL values and some non-NULL values, then it is a
"partial-NULL". For a scalar, there is no difference between a partial-NULL
and a total-NULL. The RHS is a partial-NULL if any row contains a NULL
value. The RHS is a total-NULL if it contains one or more rows that contain
only NULL values. The LHS is called "non-NULL" if it contains no NULL
values. The RHS is called "non-NULL" if it contains no NULL values in any
row.

The result of an IN operator is one of TRUE, FALSE, or NULL. A NULL result
means that it cannot be determined whether the LHS is contained in the RHS
due to the presence of NULL values.
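The NULL terminology above can be sketched in a few lines of Python. This is only an illustration of the definitions, not how the implementation represents values: it assumes that a row is a tuple, that `None` stands in for SQL NULL, and that a scalar is a tuple of length one. The function names are hypothetical.

```python
from typing import Iterable, Optional, Sequence

# Assumption: a row is a sequence of values and None stands in for SQL NULL.
Row = Sequence[Optional[object]]


def classify_lhs(lhs: Row) -> str:
    """Classify the LHS as non-NULL, partial-NULL, or total-NULL."""
    nulls = sum(1 for v in lhs if v is None)
    if nulls == 0:
        return "non-NULL"
    if nulls == len(lhs):
        # A scalar NULL lands here: for a scalar, partial-NULL and
        # total-NULL are the same thing.
        return "total-NULL"
    return "partial-NULL"


def rhs_is_partial_null(rhs: Iterable[Row]) -> bool:
    """True if any row of the RHS contains a NULL value."""
    return any(v is None for row in rhs for v in row)


def rhs_is_total_null(rhs: Iterable[Row]) -> bool:
    """True if at least one RHS row consists only of NULL values."""
    return any(len(row) > 0 and all(v is None for v in row) for row in rhs)


def rhs_is_non_null(rhs: Iterable[Row]) -> bool:
    """True if no row of the RHS contains a NULL value."""
    return not rhs_is_partial_null(rhs)
```

Note that, by the definitions as written, an RHS that is a total-NULL is also a partial-NULL, since a row made up only of NULLs certainly contains a NULL. An empty RHS (possible only for a subquery) is non-NULL under this sketch.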
In some contexts (for example, when the IN operator occurs in a WHERE
clause) the system only needs a binary result: TRUE or NOT-TRUE. One can
also define a binary result of FALSE and NOT-FALSE, but it turns out that no
extra optimizations are possible in that case. So if the FALSE/NOT-FALSE
binary is needed, we have to compute the three-state TRUE/FALSE/NULL result
and then combine the TRUE and NULL values into NOT-FALSE.

A "NOT IN" operator is computed by first computing the equivalent IN
operator, then interchanging the TRUE and FALSE results. A NULL result is
left unchanged.

## Simple Full-Scan Algorithm

The following algorithm always computes the correct answer. However, this
algorithm is suboptimal, especially if there are many rows on the RHS.

  1.  Set the null-flag to false
  2.  For each row in the RHS: