summaryrefslogtreecommitdiffstats
path: root/src/arrow/docs/source/java/vector.rst
diff options
context:
space:
mode:
Diffstat (limited to 'src/arrow/docs/source/java/vector.rst')
-rw-r--r--src/arrow/docs/source/java/vector.rst288
1 files changed, 288 insertions, 0 deletions
diff --git a/src/arrow/docs/source/java/vector.rst b/src/arrow/docs/source/java/vector.rst
new file mode 100644
index 000000000..ece07d0a7
--- /dev/null
+++ b/src/arrow/docs/source/java/vector.rst
@@ -0,0 +1,288 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+===========
+ValueVector
+===========
+
+:class:`ValueVector` interface (which called Array in C++ implementation and
+the :doc:`the specification <../format/Columnar>`) is an abstraction that is used to store a
+sequence of values having the same type in an individual column. Internally, those values are
+represented by one or several buffers, the number and meaning of which depend on the vector’s data type.
+
+There are concrete subclasses of :class:`ValueVector` for each primitive data type
+and nested type described in the specification. There are a few differences in naming
+with the type names described in the specification:
+Table with non-intuitive names (BigInt = 64 bit integer, etc).
+
+It is important that vector is allocated before attempting to read or write,
+:class:`ValueVector` "should" strive to guarantee this order of operation:
+create > allocate > mutate > set value count > access > clear (or allocate to start the process over).
+We will go through a concrete example to demonstrate each operation in the next section.
+
+Vector Life Cycle
+=================
+
+As discussed above, each vector goes through several steps in its life cycle,
+and each step is triggered by a vector operation. In particular, we have the following vector operations:
+
+1. **Vector creation**: we create a new vector object by, for example, the vector constructor.
+The following code creates a new ``IntVector`` by the constructor:
+
+.. code-block:: Java
+
+ RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+ ...
+ IntVector vector = new IntVector("int vector", allocator);
+
+By now, a vector object is created. However, no underlying memory has been allocated, so we need the
+following step.
+
+2. **Vector allocation**: in this step, we allocate memory for the vector. For most vectors, we
+have two options: 1) if we know the maximum vector capacity, we can specify it by calling the
+``allocateNew(int)`` method; 2) otherwise, we should call the ``allocateNew()`` method, and a default
+capacity will be allocated for it. For our running example, we assume that the vector capacity never
+exceeds 10:
+
+.. code-block:: Java
+
+ vector.allocateNew(10);
+
+3. **Vector mutation**: now we can populate the vector with values we desire. For all vectors, we can populate
+vector values through vector writers (An example will be given in the next section). For primitive types,
+we can also mutate the vector by the set methods. There are two classes of set methods: 1) if we can
+be sure the vector has enough capacity, we can call the ``set(index, value)`` method. 2) if we are not sure
+about the vector capacity, we should call the ``setSafe(index, value)`` method, which will automatically
+take care of vector reallocation, if the capacity is not sufficient. For our running example, we know the
+vector has enough capacity, so we can call
+
+.. code-block:: Java
+
+ vector.set(/*index*/5, /*value*/25);
+
+4. **Set value count**: for this step, we set the value count of the vector by calling the
+``setValueCount(int)`` method:
+
+.. code-block:: Java
+
+ vector.setValueCount(10);
+
+After this step, the vector enters an immutable state. In other words, we should no longer mutate it.
+(Unless we reuse the vector by allocating it again. This will be discussed shortly.)
+
+5. **Vector access**: it is time to access vector values. Similarly, we have two options to access values:
+1) get methods and 2) vector reader. Vector reader works for all types of vectors, while get methods are
+only available for primitive vectors. A concrete example for vector reader will be given in the next section.
+Below is an example of vector access by get method:
+
+.. code-block:: Java
+
+ int value = vector.get(5); // value == 25
+
+6. **Vector clear**: when we are done with the vector, we should clear it to release its memory. This is done by
+calling the ``close()`` method:
+
+.. code-block:: Java
+
+ vector.close();
+
+Some points to note about the steps above:
+
+* The steps are not necessarily performed in a linear sequence. Instead, they can be in a loop. For example,
+ when a vector enters the access step, we can also go back to the vector mutation step, and then set value
+ count, access vector, and so on.
+
+* We should try to make sure the above steps are carried out in order. Otherwise, the vector
+ may be in an undefined state, and some unexpected behavior may occur. However, this restriction
+ is not strict. That means it is possible that we violates the order above, but still get
+ correct results.
+
+* When mutating vector values through set methods, we should prefer ``set(index, value)`` methods to
+ ``setSafe(index, value)`` methods whenever possible, to avoid unnecessary performance overhead of handling
+ vector capacity.
+
+* All vectors implement the ``AutoCloseable`` interface. So they must be closed explicitly when they are
+ no longer used, to avoid resource leak. To make sure of this, it is recommended to place vector related operations
+ into a try-with-resources block.
+
+* For fixed width vectors (e.g. IntVector), we can set values at different indices in arbitrary orders.
+ For variable width vectors (e.g. VarCharVector), however, we must set values in non-decreasing order of the
+ indices. Otherwise, the values after the set position will become invalid. For example, suppose we use the
+ following statements to populate a variable width vector:
+
+.. code-block:: Java
+
+ VarCharVector vector = new VarCharVector("vector", allocator);
+ vector.allocateNew();
+ vector.setSafe(0, "zero");
+ vector.setSafe(1, "one");
+ ...
+ vector.setSafe(9, "nine");
+
+Then we set the value at position 5 again:
+
+.. code-block:: Java
+
+ vector.setSafe(5, "5");
+
+After that, the values at positions 6, 7, 8, and 9 of the vector will become invalid.
+
+Building ValueVector
+====================
+
+Note that the current implementation doesn't enforce the rule that Arrow objects are immutable.
+:class:`ValueVector` instances could be created directly by using new keyword, there are
+set/setSafe APIs and concrete subclasses of FieldWriter for populating values.
+
+For example, the code below shows how to build a :class:`BigIntVector`, in this case, we build a
+vector of the range 0 to 7 where the element that should hold the fourth value is nulled
+
+.. code-block:: Java
+
+ try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+ BigIntVector vector = new BigIntVector("vector", allocator)) {
+ vector.allocateNew(8);
+ vector.set(0, 1);
+ vector.set(1, 2);
+ vector.set(2, 3);
+ vector.setNull(3);
+ vector.set(4, 5);
+ vector.set(5, 6);
+ vector.set(6, 7);
+ vector.set(7, 8);
+ vector.setValueCount(8); // this will finalizes the vector by convention.
+ ...
+ }
+
+The :class:`BigIntVector` holds two ArrowBufs. The first buffer holds the null bitmap, which consists
+here of a single byte with the bits 1|1|1|1|0|1|1|1 (the bit is 1 if the value is non-null).
+The second buffer contains all the above values. As the fourth entry is null, the value at that position
+in the buffer is undefined. Note compared with set API, setSafe API would check value capacity before setting
+values and reallocate buffers if necessary.
+
+Here is how to build a vector using writer
+
+.. code-block:: Java
+
+ try (BigIntVector vector = new BigIntVector("vector", allocator);
+ BigIntWriter writer = new BigIntWriterImpl(vector)) {
+ writer.setPosition(0);
+ writer.writeBigInt(1);
+ writer.setPosition(1);
+ writer.writeBigInt(2);
+ writer.setPosition(2);
+ writer.writeBigInt(3);
+ // writer.setPosition(3) is not called which means the forth value is null.
+ writer.setPosition(4);
+ writer.writeBigInt(5);
+ writer.setPosition(5);
+ writer.writeBigInt(6);
+ writer.setPosition(6);
+ writer.writeBigInt(7);
+ writer.setPosition(7);
+ writer.writeBigInt(8);
+ }
+
+There are get API and concrete subclasses of :class:`FieldReader` for accessing vector values, what needs
+to be declared is that writer/reader is not as efficient as direct access
+
+.. code-block:: Java
+
+ // access via get API
+ for (int i = 0; i < vector.getValueCount(); i++) {
+ if (!vector.isNull(i)) {
+ System.out.println(vector.get(i));
+ }
+ }
+
+ // access via reader
+ BigIntReader reader = vector.getReader();
+ for (int i = 0; i < vector.getValueCount(); i++) {
+ reader.setPosition(i);
+ if (reader.isSet()) {
+ System.out.println(reader.readLong());
+ }
+ }
+
+Building ListVector
+===================
+
+A :class:`ListVector` is a vector that holds a list of values for each index. Working with one you need to handle the same steps as mentioned above (create > allocate > mutate > set value count > access > clear), but the details of how you accomplish this are slightly different since you need to both create the vector and set the list of values for each index.
+
+For example, the code below shows how to build a :class:`ListVector` of int's using the writer :class:`UnionListWriter`. We build a vector from 0 to 9 and each index contains a list with values [[0, 0, 0, 0, 0], [0, 1, 2, 3, 4], [0, 2, 4, 6, 8], …, [0, 9, 18, 27, 36]]. List values can be added in any order so writing a list such as [3, 1, 2] would be just as valid.
+
+.. code-block:: Java
+
+ try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+ ListVector listVector = ListVector.empty("vector", allocator)) {
+ UnionListWriter writer = listVector.getWriter();
+ for (int i = 0; i < 10; i++) {
+ writer.startList();
+ writer.setPosition(i);
+ for (int j = 0; j < 5; j++) {
+ writer.writeInt(j * i);
+ }
+ writer.setValueCount(5);
+ writer.endList();
+ }
+ listVector.setValueCount(10);
+ }
+
+:class:`ListVector` values can be accessed either through the get API or through the reader class :class:`UnionListReader`. To read all the values, first enumerate through the indexes, and then enumerate through the inner list values.
+
+.. code-block:: Java
+
+ // access via get API
+ for (int i = 0; i < listVector.getValueCount(); i++) {
+ if (!listVector.isNull(i)) {
+ ArrayList<Integer> elements = (ArrayList<Integer>) listVector.getObject(i);
+ for (Integer element : elements) {
+ System.out.println(element);
+ }
+ }
+ }
+
+ // access via reader
+ UnionListReader reader = listVector.getReader();
+ for (int i = 0; i < listVector.getValueCount(); i++) {
+ reader.setPosition(i);
+ while (reader.next()) {
+ IntReader intReader = reader.reader();
+ if (intReader.isSet()) {
+ System.out.println(intReader.readInteger());
+ }
+ }
+ }
+
+Slicing
+=======
+
+Similar with C++ implementation, it is possible to make zero-copy slices of vectors to obtain a vector
+referring to some logical sub-sequence of the data through :class:`TransferPair`
+
+.. code-block:: Java
+
+ IntVector vector = new IntVector("intVector", allocator);
+ for (int i = 0; i < 10; i++) {
+ vector.setSafe(i, i);
+ }
+ vector.setValueCount(10);
+
+ TransferPair tp = vector.getTransferPair(allocator);
+ tp.splitAndTransfer(0, 5);
+ IntVector sliced = (IntVector) tp.getTo();
+ // In this case, the vector values are [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] and the sliceVector values are [0, 1, 2, 3, 4].