Adding upstream version 18.2.2.upstream/18.2.2

Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-21 11:54:28 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-21 11:54:28 +0000
commit: e6918187568dbd01842d8d1d2c808ce16a894239 (patch)
tree: 64f88b554b444a49f656b6c656111a145cbbaa28 /src/arrow/docs/source/java
parent: Initial commit. (diff)
download: ceph-e6918187568dbd01842d8d1d2c808ce16a894239.tar.xz
ceph-e6918187568dbd01842d8d1d2c808ce16a894239.zip
6 files changed, 693 insertions, 0 deletions
diff --git a/src/arrow/docs/source/java/algorithm.rst b/src/arrow/docs/source/java/algorithm.rst
new file mode 100644
index 000000000..f838398af
--- /dev/null
+++ b/src/arrow/docs/source/java/algorithm.rst
@@ -0,0 +1,92 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Java Algorithms
+===============
+
+Arrow's Java library provides algorithms for some commonly used
+operations. The algorithms are provided in the ``org.apache.arrow.algorithm``
+package of the ``algorithm`` module.
+
+Comparing Vector Elements
+-------------------------
+
+Comparing vector elements is the basis for many algorithms. Vector
+elements can be compared in one of two ways:
+
+1. **Equality comparison**: there are two possible results for this type of comparison: ``equal`` and ``unequal``.
+Currently, this type of comparison is supported through the ``org.apache.arrow.vector.compare.VectorValueEqualizer``
+interface.
+
+2. **Ordering comparison**: there are three possible results for this type of comparison: ``less than``, ``equal to``
+and ``greater than``. This comparison is supported by the abstract class ``org.apache.arrow.algorithm.sort.VectorValueComparator``.
+
+We provide default implementations to compare vector elements. However, users can also define
+their own customized comparisons.
+
+Vector Element Search
+---------------------
+
+A search algorithm tries to find a particular value in a vector. When successful, a vector index is 
+returned; otherwise, a ``-1`` is returned. The following search algorithms are provided:
+
+1. **Linear search**: this algorithm simply traverses the vector from the beginning, until a match is 
+found, or the end of the vector is reached. So it takes ``O(n)`` time, where ``n`` is the number of elements
+in the vector.  This algorithm is implemented in ``org.apache.arrow.algorithm.search.VectorSearcher#linearSearch``.
+
+2. **Binary search**: this represents a more efficient search algorithm, as it runs in ``O(log(n))`` time. 
+However, it is only applicable to sorted vectors. To get a sorted vector,
+one can use one of our sorting algorithms, which will be discussed in the next section. This algorithm
+is implemented in ``org.apache.arrow.algorithm.search.VectorSearcher#binarySearch``.
+
+3. **Parallel search**: when the vector is large, it takes a long time to traverse the elements to search
+for a value. To make this process faster, one can split the vector into multiple partitions, and perform the 
+search for each partition in parallel. This is supported by ``org.apache.arrow.algorithm.search.ParallelSearcher``.
+
+4. **Range search**: for many scenarios, there can be multiple matching values in the vector. 
+If the vector is sorted, the matching values reside in a contiguous region in the vector. The
+range search algorithm tries to find the upper/lower bound of the region in ``O(log(n))`` time. 
+An implementation is provided in ``org.apache.arrow.algorithm.search.VectorRangeSearcher``.
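+
+The idea behind range search can be sketched in plain Java over a sorted ``int`` array. This is only an
+illustration of the two binary searches involved and does not use the Arrow API; the class and method names
+(``RangeSearchSketch``, ``firstOccurrence``, ``lastOccurrence``) are hypothetical.
+
```java
/** Plain-Java sketch of range search over a sorted int array (not the Arrow API). */
class RangeSearchSketch {

  /** Returns the index of the first element equal to target, or -1 if there is none. */
  static int firstOccurrence(int[] sorted, int target) {
    int lo = 0;
    int hi = sorted.length - 1;
    int result = -1;
    while (lo <= hi) {
      int mid = lo + (hi - lo) / 2;
      if (sorted[mid] < target) {
        lo = mid + 1;
      } else {
        if (sorted[mid] == target) {
          result = mid; // remember the match, then keep looking to the left
        }
        hi = mid - 1;
      }
    }
    return result;
  }

  /** Returns the index of the last element equal to target, or -1 if there is none. */
  static int lastOccurrence(int[] sorted, int target) {
    int lo = 0;
    int hi = sorted.length - 1;
    int result = -1;
    while (lo <= hi) {
      int mid = lo + (hi - lo) / 2;
      if (sorted[mid] > target) {
        hi = mid - 1;
      } else {
        if (sorted[mid] == target) {
          result = mid; // remember the match, then keep looking to the right
        }
        lo = mid + 1;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    int[] sorted = {1, 3, 3, 3, 7, 9};
    System.out.println(firstOccurrence(sorted, 3)); // 1
    System.out.println(lastOccurrence(sorted, 3));  // 3
    System.out.println(firstOccurrence(sorted, 4)); // -1
  }
}
```
+
+Both bounds are found in ``O(log(n))`` comparisons, and every index between them holds a matching value.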
+
+Vector Sorting
+--------------
+
+Given a vector, a sorting algorithm turns it into a sorted one. The sorting criteria must
+be specified by some ordering comparison operation. The sorting algorithms can be
+classified into the following categories:
+
+1. **In-place sorter**: an in-place sorter performs the sorting by manipulating the original
+vector, without creating any new vector. So it just returns the original vector after the sorting operations.
+Currently, we have ``org.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter`` for in-place
+sorting in ``O(nlog(n))`` time. As the name suggests, it only supports fixed width vectors. 
+
+2. **Out-of-place sorter**: an out-of-place sorter does not mutate the original vector. Instead,
+it copies vector elements to a new vector in sorted order, and returns the new vector.
+We have ``org.apache.arrow.algorithm.sort.FixedWidthOutOfPlaceVectorSorter``
+and ``org.apache.arrow.algorithm.sort.VariableWidthOutOfPlaceVectorSorter``
+for fixed width and variable width vectors, respectively. Both algorithms run in ``O(nlog(n))`` time. 
+
+3. **Index sorter**: this sorter does not actually sort the vector. Instead, it returns an integer
+vector, whose values are the indices of the vector elements in sorted order. With the index vector, one can
+easily construct a sorted vector. In addition, some other tasks can be easily achieved, like finding the ``k``-th
+smallest value in the vector. Index sorting is supported by ``org.apache.arrow.algorithm.sort.IndexSorter``, 
+which runs in ``O(nlog(n))`` time. It is applicable to vectors of any type. 
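+
+The index sorting idea can be illustrated with a plain-Java sketch over an ``int`` array. It does not use
+``IndexSorter`` or any other Arrow class; the names ``IndexSortSketch``, ``sortedIndices`` and ``kthSmallest``
+are hypothetical and exist only for this example.
+
```java
import java.util.Arrays;
import java.util.Comparator;

/** Plain-Java sketch of index sorting (not the Arrow API). */
class IndexSortSketch {

  /** Returns the indices of values in ascending order of the values; values itself is not modified. */
  static int[] sortedIndices(int[] values) {
    Integer[] boxed = new Integer[values.length];
    for (int i = 0; i < boxed.length; i++) {
      boxed[i] = i;
    }
    // Sort the indices by the value each index points to.
    Arrays.sort(boxed, Comparator.comparingInt(i -> values[i]));
    int[] indices = new int[boxed.length];
    for (int i = 0; i < indices.length; i++) {
      indices[i] = boxed[i];
    }
    return indices;
  }

  /** Returns the k-th smallest value (k is 1-based) by looking it up through the index array. */
  static int kthSmallest(int[] values, int k) {
    return values[sortedIndices(values)[k - 1]];
  }

  public static void main(String[] args) {
    int[] values = {40, 10, 30, 20};
    System.out.println(Arrays.toString(sortedIndices(values))); // [1, 3, 2, 0]
    System.out.println(kthSmallest(values, 2));                 // 20
  }
}
```
+
+The original data stays untouched, which is why the same approach can work for elements of any type.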
+
+Other Algorithms
+----------------
+
+Other algorithms include vector deduplication, dictionary encoding, etc., in the ``algorithm`` module.
diff --git a/src/arrow/docs/source/java/index.rst b/src/arrow/docs/source/java/index.rst
new file mode 100644
index 000000000..65a7a3a4f
--- /dev/null
+++ b/src/arrow/docs/source/java/index.rst
@@ -0,0 +1,31 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Java Implementation
+===================
+
+This is the documentation of the Java API of Apache Arrow. For more details
+on the Arrow format and other language bindings see the :doc:`parent documentation <../index>`.
+
+.. toctree::
+   :maxdepth: 2
+
+   vector
+   vector_schema_root
+   ipc
+   algorithm
+   Reference (javadoc) <reference/index>
diff --git a/src/arrow/docs/source/java/ipc.rst b/src/arrow/docs/source/java/ipc.rst
new file mode 100644
index 000000000..7cab480c4
--- /dev/null
+++ b/src/arrow/docs/source/java/ipc.rst
@@ -0,0 +1,187 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+===========================
+Reading/Writing IPC formats
+===========================
+Arrow defines two types of binary formats for serializing record batches:
+
+* **Streaming format**: for sending an arbitrary number of record
+  batches. The format must be processed from start to end, and does not support
+  random access
+
+* **File or Random Access format**: for serializing a fixed number of record
+  batches. It supports random access, and thus is very useful when used with
+  memory maps
+
+Writing and Reading Streaming Format
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+First, let's populate a :class:`VectorSchemaRoot` with a small batch of records
+
+.. code-block:: Java
+
+    BitVector bitVector = new BitVector("boolean", allocator);
+    VarCharVector varCharVector = new VarCharVector("varchar", allocator);
+    for (int i = 0; i < 10; i++) {
+      bitVector.setSafe(i, i % 2 == 0 ? 0 : 1);
+      varCharVector.setSafe(i, ("test" + i).getBytes(StandardCharsets.UTF_8));
+    }
+    bitVector.setValueCount(10);
+    varCharVector.setValueCount(10);
+
+    List<Field> fields = Arrays.asList(bitVector.getField(), varCharVector.getField());
+    List<FieldVector> vectors = Arrays.asList(bitVector, varCharVector);
+    VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors);
+
+Now, we can begin writing a stream containing some number of these batches. For this we use :class:`ArrowStreamWriter`
+(the DictionaryProvider, used for any vectors that are dictionary encoded, is optional and can be null)::
+
+    ByteArrayOutputStream out = new ByteArrayOutputStream();
+    ArrowStreamWriter writer = new ArrowStreamWriter(root, /*DictionaryProvider=*/null, Channels.newChannel(out));
+
+
+Here we used an in-memory stream, but this could have been a socket or some other IO stream. Then we can do
+
+.. code-block:: Java
+
+    writer.start();
+    // write the first batch
+    writer.writeBatch();
+
+    // write another four batches.
+    for (int i = 0; i < 4; i++) {
+      // populate VectorSchemaRoot data and write the next batch
+      BitVector childVector1 = (BitVector)root.getVector(0);
+      VarCharVector childVector2 = (VarCharVector)root.getVector(1);
+      childVector1.reset();
+      childVector2.reset();
+      // ... populate the vectors here; the data could be different for each batch
+      writer.writeBatch();
+    }
+
+    // end
+    writer.end();
+
+Note that the :class:`VectorSchemaRoot` in the writer is a container that holds one batch at a time; batches flow
+through the :class:`VectorSchemaRoot` as part of a pipeline. We therefore need to populate the data before each call
+to ``writeBatch``, so that each new batch overwrites the previous one.
+
+Now the :class:`ByteArrayOutputStream` contains the complete stream, which consists of 5 record batches.
+We can read such a stream with :class:`ArrowStreamReader`. Note that the :class:`VectorSchemaRoot` within the
+reader is loaded with new values on every call to ``loadNextBatch()``
+
+.. code-block:: Java
+
+    try (ArrowStreamReader reader = new ArrowStreamReader(new ByteArrayInputStream(out.toByteArray()), allocator)) {
+      Schema schema = reader.getVectorSchemaRoot().getSchema();
+      for (int i = 0; i < 5; i++) {
+        // This will be loaded with new values on every call to loadNextBatch
+        VectorSchemaRoot readBatch = reader.getVectorSchemaRoot();
+        reader.loadNextBatch();
+        // ... do something with readBatch
+      }
+
+    }
+
+Here we also give a simple example with dictionary encoded vectors
+
+.. code-block:: Java
+
+    DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
+    // create dictionary and provider
+    final VarCharVector dictVector = new VarCharVector("dict", allocator);
+    dictVector.allocateNewSafe();
+    dictVector.setSafe(0, "aa".getBytes());
+    dictVector.setSafe(1, "bb".getBytes());
+    dictVector.setSafe(2, "cc".getBytes());
+    dictVector.setValueCount(3);
+
+    Dictionary dictionary =
+        new Dictionary(dictVector, new DictionaryEncoding(1L, false, /*indexType=*/null));
+    provider.put(dictionary);
+
+    // create vector and encode it
+    final VarCharVector vector = new VarCharVector("vector", allocator);
+    vector.allocateNewSafe();
+    vector.setSafe(0, "bb".getBytes());
+    vector.setSafe(1, "bb".getBytes());
+    vector.setSafe(2, "cc".getBytes());
+    vector.setSafe(3, "aa".getBytes());
+    vector.setValueCount(4);
+
+    // get the encoded vector
+    IntVector encodedVector = (IntVector) DictionaryEncoder.encode(vector, dictionary);
+
+    // create VectorSchemaRoot
+    List<Field> fields = Arrays.asList(encodedVector.getField());
+    List<FieldVector> vectors = Arrays.asList(encodedVector);
+    VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors);
+
+    // write data
+    ByteArrayOutputStream out = new ByteArrayOutputStream();
+    ArrowStreamWriter writer = new ArrowStreamWriter(root, provider, Channels.newChannel(out));
+    writer.start();
+    writer.writeBatch();
+    writer.end();
+
+    // read data
+    try (ArrowStreamReader reader = new ArrowStreamReader(new ByteArrayInputStream(out.toByteArray()), allocator)) {
+      reader.loadNextBatch();
+      VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
+      // get the encoded vector
+      IntVector intVector = (IntVector) readRoot.getVector(0);
+
+      // get dictionaries and decode the vector
+      Map<Long, Dictionary> dictionaryMap = reader.getDictionaryVectors();
+      long dictionaryId = intVector.getField().getDictionary().getId();
+      VarCharVector varCharVector =
+          (VarCharVector) DictionaryEncoder.decode(intVector, dictionaryMap.get(dictionaryId));
+
+    }
+
+Writing and Reading Random Access Files
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The :class:`ArrowFileWriter` has the same API as :class:`ArrowStreamWriter`
+
+.. code-block:: Java
+
+    ByteArrayOutputStream out = new ByteArrayOutputStream();
+    ArrowFileWriter writer = new ArrowFileWriter(root, null, Channels.newChannel(out));
+    writer.start();
+    // write the first batch
+    writer.writeBatch();
+    // write another four batches.
+    for (int i = 0; i < 4; i++) {
+      // ... populate the vectors for the next batch
+      writer.writeBatch();
+    }
+    writer.end();
+
+The difference between :class:`ArrowFileReader` and :class:`ArrowStreamReader` is that the input source
+must have a ``seek`` method for random access. Because we have access to the entire payload, we know the
+number of record batches in the file, and can read any at random
+
+.. code-block:: Java
+
+    try (ArrowFileReader reader = new ArrowFileReader(
+        new ByteArrayReadableSeekableByteChannel(out.toByteArray()), allocator)) {
+
+      // read the 4-th batch
+      ArrowBlock block = reader.getRecordBlocks().get(3);
+      reader.loadRecordBatch(block);
+      VectorSchemaRoot readBatch = reader.getVectorSchemaRoot();
+    }
diff --git a/src/arrow/docs/source/java/reference/index.rst b/src/arrow/docs/source/java/reference/index.rst
new file mode 100644
index 000000000..523ac0c7f
--- /dev/null
+++ b/src/arrow/docs/source/java/reference/index.rst
@@ -0,0 +1,21 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Java Reference (javadoc)
+========================
+
+Stub page for the Java reference docs; actual source is located in the java/ directory.
diff --git a/src/arrow/docs/source/java/vector.rst b/src/arrow/docs/source/java/vector.rst
new file mode 100644
index 000000000..ece07d0a7
--- /dev/null
+++ b/src/arrow/docs/source/java/vector.rst
@@ -0,0 +1,288 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+===========
+ValueVector
+===========
+
+The :class:`ValueVector` interface (which is called Array in the C++ implementation and
+in :doc:`the specification <../format/Columnar>`) is an abstraction that is used to store a
+sequence of values having the same type in an individual column. Internally, those values are
+represented by one or several buffers, the number and meaning of which depend on the vector’s data type.
+
+There are concrete subclasses of :class:`ValueVector` for each primitive data type
+and nested type described in the specification. A few of the class names differ
+from the type names described in the specification; for example, ``BigInt`` denotes
+a 64-bit integer.
+
+It is important that a vector is allocated before attempting to read or write.
+:class:`ValueVector` "should" strive to guarantee this order of operation:
+create > allocate > mutate > set value count > access > clear (or allocate to start the process over).
+We will go through a concrete example to demonstrate each operation in the next section.
+
+Vector Life Cycle
+=================
+
+As discussed above, each vector goes through several steps in its life cycle,
+and each step is triggered by a vector operation. In particular, we have the following vector operations:
+
+1. **Vector creation**: we create a new vector object by, for example, the vector constructor.
+The following code creates a new ``IntVector`` by the constructor:
+
+.. code-block:: Java
+
+    RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    ...
+    IntVector vector = new IntVector("int vector", allocator);
+
+At this point, a vector object has been created. However, no underlying memory has been allocated, so we need the
+following step.
+
+2. **Vector allocation**: in this step, we allocate memory for the vector. For most vectors, we
+have two options: 1) if we know the maximum vector capacity, we can specify it by calling the
+``allocateNew(int)`` method; 2) otherwise, we should call the ``allocateNew()`` method, and a default
+capacity will be allocated for it. For our running example, we assume that the vector capacity never
+exceeds 10:
+
+.. code-block:: Java
+
+    vector.allocateNew(10);
+
+3. **Vector mutation**: now we can populate the vector with values we desire. For all vectors, we can populate
+vector values through vector writers (An example will be given in the next section). For primitive types,
+we can also mutate the vector by the set methods. There are two classes of set methods: 1) if we can
+be sure the vector has enough capacity, we can call the ``set(index, value)`` method. 2) if we are not sure
+about the vector capacity, we should call the ``setSafe(index, value)`` method, which will automatically
+take care of vector reallocation, if the capacity is not sufficient. For our running example, we know the
+vector has enough capacity, so we can call
+
+.. code-block:: Java
+
+    vector.set(/*index*/5, /*value*/25);
+
+4. **Set value count**: for this step, we set the value count of the vector by calling the
+``setValueCount(int)`` method:
+
+.. code-block:: Java
+
+    vector.setValueCount(10);
+
+After this step, the vector enters an immutable state. In other words, we should no longer mutate it.
+(Unless we reuse the vector by allocating it again. This will be discussed shortly.)
+
+5. **Vector access**: it is time to access vector values. Similarly, we have two options to access values:
+1) get methods and 2) vector reader. Vector reader works for all types of vectors, while get methods are
+only available for primitive vectors. A concrete example for vector reader will be given in the next section.
+Below is an example of vector access by get method:
+
+.. code-block:: Java
+
+    int value = vector.get(5);  // value == 25
+
+6. **Vector clear**: when we are done with the vector, we should clear it to release its memory. This is done by
+calling the ``close()`` method:
+
+.. code-block:: Java
+
+    vector.close();
+
+Some points to note about the steps above:
+
+* The steps are not necessarily performed in a linear sequence. Instead, they can be in a loop. For example,
+  when a vector enters the access step, we can also go back to the vector mutation step, and then set value
+  count, access vector, and so on.
+
+* We should try to make sure the above steps are carried out in order. Otherwise, the vector
+  may be in an undefined state, and some unexpected behavior may occur. However, this restriction
+  is not strict. That means it is possible that we violate the order above, but still get
+  correct results.
+
+* When mutating vector values through set methods, we should prefer ``set(index, value)`` methods to
+  ``setSafe(index, value)`` methods whenever possible, to avoid unnecessary performance overhead of handling
+  vector capacity.
+
+* All vectors implement the ``AutoCloseable`` interface. So they must be closed explicitly when they are
+  no longer used, to avoid resource leaks. To make sure of this, it is recommended to place vector related operations
+  into a try-with-resources block.
+
+* For fixed width vectors (e.g. IntVector), we can set values at different indices in arbitrary orders.
+  For variable width vectors (e.g. VarCharVector), however, we must set values in non-decreasing order of the
+  indices. Otherwise, the values after the set position will become invalid. For example, suppose we use the
+  following statements to populate a variable width vector:
+
+.. code-block:: Java
+
+    VarCharVector vector = new VarCharVector("vector", allocator);
+    vector.allocateNew();
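+    // VarCharVector values are set as bytes, not as String objects
+    vector.setSafe(0, "zero".getBytes(StandardCharsets.UTF_8));
+    vector.setSafe(1, "one".getBytes(StandardCharsets.UTF_8));
+    ...
+    vector.setSafe(9, "nine".getBytes(StandardCharsets.UTF_8));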
+
+Then we set the value at position 5 again:
+
+.. code-block:: Java
+
+    vector.setSafe(5, "5".getBytes(StandardCharsets.UTF_8));
+
+After that, the values at positions 6, 7, 8, and 9 of the vector will become invalid.
+
+Building ValueVector
+====================
+
+Note that the current implementation doesn't enforce the rule that Arrow objects are immutable.
+:class:`ValueVector` instances can be created directly with the ``new`` keyword, and there are
+set/setSafe APIs and concrete subclasses of FieldWriter for populating values.
+
+For example, the code below shows how to build a :class:`BigIntVector`. In this case, we build a
+vector with indices 0 to 7 holding the values 1 to 8, where the fourth element is null
+
+.. code-block:: Java
+
+    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+      BigIntVector vector = new BigIntVector("vector", allocator)) {
+      vector.allocateNew(8);
+      vector.set(0, 1);
+      vector.set(1, 2);
+      vector.set(2, 3);
+      vector.setNull(3);
+      vector.set(4, 5);
+      vector.set(5, 6);
+      vector.set(6, 7);
+      vector.set(7, 8);
+      vector.setValueCount(8); // this finalizes the vector by convention.
+      ...
+    }
+
+The :class:`BigIntVector` holds two ArrowBufs. The first buffer holds the null bitmap, which consists
+here of a single byte with the bits 1|1|1|1|0|1|1|1 (the bit is 1 if the value is non-null).
+The second buffer contains all the above values. As the fourth entry is null, the value at that position
+in the buffer is undefined. Note that, compared with the set API, the setSafe API checks the value capacity
+before setting values and reallocates buffers if necessary.
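+
+The bit pattern above can be reproduced with a short plain-Java sketch. It assumes the least-significant-bit
+numbering that the Arrow columnar format uses for validity bitmaps, and it works on a plain ``byte[]`` rather
+than on an ArrowBuf; the class and method names are hypothetical.
+
```java
/** Plain-Java sketch of a validity bitmap with least-significant-bit numbering (not the Arrow API). */
class ValidityBitmapSketch {

  /** Packs one validity flag per value into bytes; bit (i % 8) of byte (i / 8) belongs to value i. */
  static byte[] pack(boolean[] valid) {
    byte[] bitmap = new byte[(valid.length + 7) / 8];
    for (int i = 0; i < valid.length; i++) {
      if (valid[i]) {
        bitmap[i / 8] |= (byte) (1 << (i % 8));
      }
    }
    return bitmap;
  }

  /** Returns true if the value at index is non-null according to the bitmap. */
  static boolean isSet(byte[] bitmap, int index) {
    return (bitmap[index / 8] & (1 << (index % 8))) != 0;
  }

  public static void main(String[] args) {
    // Eight values where the fourth one (index 3) is null, as in the BigIntVector example.
    boolean[] valid = {true, true, true, false, true, true, true, true};
    byte[] bitmap = pack(valid);
    // Printed with the most significant bit first, so index 3 is the fifth digit from the left.
    System.out.println(Integer.toBinaryString(bitmap[0] & 0xFF)); // 11110111
    System.out.println(isSet(bitmap, 3)); // false
    System.out.println(isSet(bitmap, 4)); // true
  }
}
```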
+
+Here is how to build a vector using a writer
+
+.. code-block:: Java
+
+    try (BigIntVector vector = new BigIntVector("vector", allocator);
+      BigIntWriter writer = new BigIntWriterImpl(vector)) {
+      writer.setPosition(0);
+      writer.writeBigInt(1);
+      writer.setPosition(1);
+      writer.writeBigInt(2);
+      writer.setPosition(2);
+      writer.writeBigInt(3);
+      // writer.setPosition(3) is not called, which means the fourth value is null.
+      writer.setPosition(4);
+      writer.writeBigInt(5);
+      writer.setPosition(5);
+      writer.writeBigInt(6);
+      writer.setPosition(6);
+      writer.writeBigInt(7);
+      writer.setPosition(7);
+      writer.writeBigInt(8);
+    }
+
+There are get APIs and concrete subclasses of :class:`FieldReader` for accessing vector values. Note that
+going through a writer/reader is not as efficient as direct access
+
+.. code-block:: Java
+
+    // access via get API
+    for (int i = 0; i < vector.getValueCount(); i++) {
+      if (!vector.isNull(i)) {
+        System.out.println(vector.get(i));
+      }
+    }
+
+    // access via reader
+    BigIntReader reader = vector.getReader();
+    for (int i = 0; i < vector.getValueCount(); i++) {
+      reader.setPosition(i);
+      if (reader.isSet()) {
+        System.out.println(reader.readLong());
+      }
+    }
+
+Building ListVector
+===================
+
+A :class:`ListVector` is a vector that holds a list of values for each index. When working with one, you need to follow the same steps as mentioned above (create > allocate > mutate > set value count > access > clear), but the details of how you accomplish this are slightly different since you need to both create the vector and set the list of values for each index.
+
+For example, the code below shows how to build a :class:`ListVector` of ints using the writer :class:`UnionListWriter`. We build a vector with indices 0 to 9, where each index contains a list of values: [[0, 0, 0, 0, 0], [0, 1, 2, 3, 4], [0, 2, 4, 6, 8], …, [0, 9, 18, 27, 36]]. List values can be added in any order, so writing a list such as [3, 1, 2] would be just as valid.
+
+.. code-block:: Java
+  
+  try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
+    ListVector listVector = ListVector.empty("vector", allocator)) {
+    UnionListWriter writer = listVector.getWriter();
+    for (int i = 0; i < 10; i++) {
+       writer.startList();
+       writer.setPosition(i);
+       for (int j = 0; j < 5; j++) {
+           writer.writeInt(j * i);
+       }
+       writer.setValueCount(5);
+       writer.endList();
+    }
+    listVector.setValueCount(10);
+  }    
+
+:class:`ListVector` values can be accessed either through the get API or through the reader class :class:`UnionListReader`. To read all the values, first enumerate through the indexes, and then enumerate through the inner list values.
+
+.. code-block:: Java
+
+  // access via get API
+  for (int i = 0; i < listVector.getValueCount(); i++) {
+     if (!listVector.isNull(i)) {
+         ArrayList<Integer> elements = (ArrayList<Integer>) listVector.getObject(i);
+         for (Integer element : elements) {
+             System.out.println(element);
+         }
+     }
+  }
+
+  // access via reader
+  UnionListReader reader = listVector.getReader();
+  for (int i = 0; i < listVector.getValueCount(); i++) {
+     reader.setPosition(i);
+     while (reader.next()) {
+         IntReader intReader = reader.reader();
+         if (intReader.isSet()) {
+             System.out.println(intReader.readInteger());
+         }
+     }
+  }
+
+Slicing
+=======
+
+Similar to the C++ implementation, it is possible to make zero-copy slices of vectors to obtain a vector
+referring to some logical sub-sequence of the data through :class:`TransferPair`
+
+.. code-block:: Java
+
+    IntVector vector = new IntVector("intVector", allocator);
+    for (int i = 0; i < 10; i++) {
+      vector.setSafe(i, i);
+    }
+    vector.setValueCount(10);
+
+    TransferPair tp = vector.getTransferPair(allocator);
+    tp.splitAndTransfer(0, 5);
+    IntVector sliced = (IntVector) tp.getTo();
+    // In this case, the vector values are [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] and the sliced values are [0, 1, 2, 3, 4].
diff --git a/src/arrow/docs/source/java/vector_schema_root.rst b/src/arrow/docs/source/java/vector_schema_root.rst
new file mode 100644
index 000000000..7f787d9d5
--- /dev/null
+++ b/src/arrow/docs/source/java/vector_schema_root.rst
@@ -0,0 +1,74 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+================
+VectorSchemaRoot
+================
+A :class:`VectorSchemaRoot` is a container that can hold batches; batches flow through :class:`VectorSchemaRoot`
+as part of a pipeline. Note this is different from other implementations (e.g., in C++ and Python,
+a :class:`RecordBatch` is a collection of equal-length vector instances and is created anew for each batch).
+
+The recommended usage for :class:`VectorSchemaRoot` is to create a single :class:`VectorSchemaRoot`
+based on the known schema and to populate data over and over into the same VectorSchemaRoot in a stream
+of batches, rather than creating a new :class:`VectorSchemaRoot` instance each time
+(see `Flight <https://github.com/apache/arrow/tree/master/java/flight/src/main/java/org/apache/arrow/flight>`_ or
+``ArrowFileWriter`` for better understanding). Thus at any one point a VectorSchemaRoot may have data or
+may have no data (say it was transferred downstream or not yet populated).
+
+
+Here is an example of building a :class:`VectorSchemaRoot`
+
+.. code-block:: Java
+
+    BitVector bitVector = new BitVector("boolean", allocator);
+    VarCharVector varCharVector = new VarCharVector("varchar", allocator);
+    bitVector.allocateNew();
+    varCharVector.allocateNew();
+    for (int i = 0; i < 10; i++) {
+      bitVector.setSafe(i, i % 2 == 0 ? 0 : 1);
+      varCharVector.setSafe(i, ("test" + i).getBytes(StandardCharsets.UTF_8));
+    }
+    bitVector.setValueCount(10);
+    varCharVector.setValueCount(10);
+
+    List<Field> fields = Arrays.asList(bitVector.getField(), varCharVector.getField());
+    List<FieldVector> vectors = Arrays.asList(bitVector, varCharVector);
+    VectorSchemaRoot vectorSchemaRoot = new VectorSchemaRoot(fields, vectors);
+
+The vectors within a :class:`VectorSchemaRoot` can be loaded/unloaded via :class:`VectorLoader` and :class:`VectorUnloader`.
+:class:`VectorLoader` and :class:`VectorUnloader` handle converting between :class:`VectorSchemaRoot` and :class:`ArrowRecordBatch`
+(the representation of a RecordBatch :doc:`IPC <../format/IPC.rst>` message). An example is shown below
+
+.. code-block:: Java
+
+    // create a VectorSchemaRoot root1 and convert its data into recordBatch
+    VectorSchemaRoot root1 = new VectorSchemaRoot(fields, vectors);
+    VectorUnloader unloader = new VectorUnloader(root1);
+    ArrowRecordBatch recordBatch = unloader.getRecordBatch();
+
+    // create a VectorSchemaRoot root2 and load the recordBatch
+    VectorSchemaRoot root2 = VectorSchemaRoot.create(root1.getSchema(), allocator);
+    VectorLoader loader = new VectorLoader(root2);
+    loader.load(recordBatch);
+
+A new :class:`VectorSchemaRoot` could be sliced from an existing instance with zero-copy
+
+.. code-block:: Java
+
+    // 0 is the start index (inclusive) and 5 is the length of the slice.
+    VectorSchemaRoot newRoot = vectorSchemaRoot.slice(0, 5);
+
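+What "zero-copy" means for a slice can be illustrated with a plain-Java sketch. It uses ``java.nio.IntBuffer``
+as a stand-in for Arrow buffers and makes no claim about how ``VectorSchemaRoot.slice`` is implemented;
+the class and method names are hypothetical.
+
```java
import java.nio.IntBuffer;

/** Plain-Java sketch of a zero-copy slice: a view that shares memory with its source (not the Arrow API). */
class ZeroCopySliceSketch {

  /** Returns a view over data[start, start + length) without copying any elements. */
  static IntBuffer slice(int[] data, int start, int length) {
    return IntBuffer.wrap(data, start, length).slice();
  }

  public static void main(String[] args) {
    int[] data = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    IntBuffer view = slice(data, 0, 5);
    System.out.println(view.remaining()); // 5

    // Because the view shares the backing array, a write to the source is visible through it.
    data[2] = 42;
    System.out.println(view.get(2)); // 42
  }
}
```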