author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 11:54:28 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-21 11:54:28 +0000 |
commit | e6918187568dbd01842d8d1d2c808ce16a894239 (patch) | |
tree | 64f88b554b444a49f656b6c656111a145cbbaa28 /src/arrow/docs/source/java | |
parent | Initial commit. (diff) | |
Adding upstream version 18.2.2. (upstream/18.2.2)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'src/arrow/docs/source/java')
-rw-r--r-- | src/arrow/docs/source/java/algorithm.rst | 92 | ||||
-rw-r--r-- | src/arrow/docs/source/java/index.rst | 31 | ||||
-rw-r--r-- | src/arrow/docs/source/java/ipc.rst | 187 | ||||
-rw-r--r-- | src/arrow/docs/source/java/reference/index.rst | 21 | ||||
-rw-r--r-- | src/arrow/docs/source/java/vector.rst | 288 | ||||
-rw-r--r-- | src/arrow/docs/source/java/vector_schema_root.rst | 74 |
6 files changed, 693 insertions, 0 deletions
diff --git a/src/arrow/docs/source/java/algorithm.rst b/src/arrow/docs/source/java/algorithm.rst new file mode 100644 index 000000000..f838398af --- /dev/null +++ b/src/arrow/docs/source/java/algorithm.rst @@ -0,0 +1,92 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +Java Algorithms +=============== + +Arrow's Java library provides algorithms for some commonly used +functionality. The algorithms are provided in the ``org.apache.arrow.algorithm`` +package of the ``algorithm`` module. + +Comparing Vector Elements +------------------------- + +Comparing vector elements is the basis for many algorithms. Vector +elements can be compared in one of two ways: + +1. **Equality comparison**: there are two possible results for this type of comparison: ``equal`` and ``unequal``. +Currently, this type of comparison is supported through the ``org.apache.arrow.vector.compare.VectorValueEqualizer`` +interface. + +2. **Ordering comparison**: there are three possible results for this type of comparison: ``less than``, ``equal to``, +and ``greater than``. This comparison is supported by the abstract class ``org.apache.arrow.algorithm.sort.VectorValueComparator``.
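The two comparison styles can be sketched in plain Java. This is only an illustration over ``long[]`` arrays: it does not use ``VectorValueEqualizer`` or ``VectorValueComparator``, and the class and method names below are hypothetical.

```java
/** Plain-Java illustration only; the class and method names are hypothetical, not Arrow API. */
final class ElementComparisonSketch {

    /** Equality comparison: exactly two outcomes, equal (true) or unequal (false). */
    static boolean valuesEqual(long[] left, int leftIndex, long[] right, int rightIndex) {
        return left[leftIndex] == right[rightIndex];
    }

    /**
     * Ordering comparison: three outcomes, reported as a negative number (less than),
     * zero (equal to), or a positive number (greater than).
     */
    static int compareValues(long[] left, int leftIndex, long[] right, int rightIndex) {
        return Long.compare(left[leftIndex], right[rightIndex]);
    }

    public static void main(String[] args) {
        long[] a = {10L, 20L, 30L};
        long[] b = {30L, 20L, 10L};
        System.out.println(valuesEqual(a, 1, b, 1));        // true: 20 == 20
        System.out.println(compareValues(a, 0, b, 0) < 0);  // true: 10 < 30
        System.out.println(compareValues(a, 2, b, 2) > 0);  // true: 30 > 10
    }
}
```

Searching only needs the two-outcome form, while sorting needs the three-outcome form, which is why the two are kept separate.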
+ +We provide default implementations to compare vector elements. However, users can also define ways +for customized comparisons. + +Vector Element Search +--------------------- + +A search algorithm tries to find a particular value in a vector. When successful, a vector index is +returned; otherwise, a ``-1`` is returned. The following search algorithms are provided: + +1. **Linear search**: this algorithm simply traverses the vector from the beginning, until a match is +found, or the end of the vector is reached. So it takes ``O(n)`` time, where ``n`` is the number of elements +in the vector. This algorithm is implemented in ``org.apache.arrow.algorithm.search.VectorSearcher#linearSearch``. + +2. **Binary search**: this represents a more efficient search algorithm, as it runs in ``O(log(n))`` time. +However, it is only applicable to sorted vectors. To get a sorted vector, +one can use one of our sorting algorithms, which will be discussed in the next section. This algorithm +is implemented in ``org.apache.arrow.algorithm.search.VectorSearcher#binarySearch``. + +3. **Parallel search**: when the vector is large, it takes a long time to traverse the elements to search +for a value. To make this process faster, one can split the vector into multiple partitions, and perform the +search for each partition in parallel. This is supported by ``org.apache.arrow.algorithm.search.ParallelSearcher``. + +4. **Range search**: for many scenarios, there can be multiple matching values in the vector. +If the vector is sorted, the matching values reside in a contiguous region in the vector. The +range search algorithm tries to find the upper/lower bound of the region in ``O(log(n))`` time. +An implementation is provided in ``org.apache.arrow.algorithm.search.VectorRangeSearcher``. + +Vector Sorting +-------------- + +Given a vector, a sorting algorithm turns it into a sorted one. The sorting criteria must +be specified by some ordering comparison operation. 
The sorting algorithms can be +classified into the following categories: + +1. **In-place sorter**: an in-place sorter performs the sorting by manipulating the original +vector, without creating any new vector. So it just returns the original vector after the sorting operations. +Currently, we have ``org.apache.arrow.algorithm.sort.FixedWidthInPlaceVectorSorter`` for in-place +sorting in ``O(nlog(n))`` time. As the name suggests, it only supports fixed width vectors. + +2. **Out-of-place sorter**: an out-of-place sorter does not mutate the original vector. Instead, +it copies vector elements to a new vector in sorted order, and returns the new vector. +We have ``org.apache.arrow.algorithm.sort.FixedWidthOutOfPlaceVectorSorter`` +and ``org.apache.arrow.algorithm.sort.VariableWidthOutOfPlaceVectorSorter`` +for fixed width and variable width vectors, respectively. Both algorithms run in ``O(nlog(n))`` time. + +3. **Index sorter**: this sorter does not actually sort the vector. Instead, it returns an integer +vector, which corresponds to the indices of vector elements in sorted order. With the index vector, one can +easily construct a sorted vector. In addition, some other tasks can be easily achieved, like finding the ``k``-th +smallest value in the vector. Index sorting is supported by ``org.apache.arrow.algorithm.sort.IndexSorter``, +which runs in ``O(nlog(n))`` time. It is applicable to vectors of any type. + +Other Algorithms +---------------- + +Other algorithms include vector deduplication, dictionary encoding, etc., in the ``algorithm`` module. diff --git a/src/arrow/docs/source/java/index.rst b/src/arrow/docs/source/java/index.rst new file mode 100644 index 000000000..65a7a3a4f --- /dev/null +++ b/src/arrow/docs/source/java/index.rst @@ -0,0 +1,31 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +..
distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +Java Implementation +=================== + +This is the documentation of the Java API of Apache Arrow. For more details +on the Arrow format and other language bindings see the :doc:`parent documentation <../index>`. + +.. toctree:: + :maxdepth: 2 + + vector + vector_schema_root + ipc + algorithm + Reference (javadoc) <reference/index> diff --git a/src/arrow/docs/source/java/ipc.rst b/src/arrow/docs/source/java/ipc.rst new file mode 100644 index 000000000..7cab480c4 --- /dev/null +++ b/src/arrow/docs/source/java/ipc.rst @@ -0,0 +1,187 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. 
specific language governing permissions and limitations +.. under the License. + +=========================== +Reading/Writing IPC formats +=========================== +Arrow defines two types of binary formats for serializing record batches: + +* **Streaming format**: for sending an arbitrary number of record + batches. The format must be processed from start to end, and does not support + random access + +* **File or Random Access format**: for serializing a fixed number of record + batches. It supports random access, and thus is very useful when used with + memory maps + +Writing and Reading Streaming Format +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +First, let's populate a :class:`VectorSchemaRoot` with a small batch of records + +.. code-block:: Java + + BitVector bitVector = new BitVector("boolean", allocator); + VarCharVector varCharVector = new VarCharVector("varchar", allocator); + for (int i = 0; i < 10; i++) { + bitVector.setSafe(i, i % 2 == 0 ? 0 : 1); + varCharVector.setSafe(i, ("test" + i).getBytes(StandardCharsets.UTF_8)); + } + bitVector.setValueCount(10); + varCharVector.setValueCount(10); + + List<Field> fields = Arrays.asList(bitVector.getField(), varCharVector.getField()); + List<FieldVector> vectors = Arrays.asList(bitVector, varCharVector); + VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors); + +Now, we can begin writing a stream containing some number of these batches. For this we use :class:`ArrowStreamWriter` +(the DictionaryProvider, used for any vectors that are dictionary encoded, is optional and can be null):: + + ByteArrayOutputStream out = new ByteArrayOutputStream(); + ArrowStreamWriter writer = new ArrowStreamWriter(root, /*DictionaryProvider=*/null, Channels.newChannel(out)); + + +Here we used an in-memory stream, but this could have been a socket or some other IO stream. Then we can do + +.. code-block:: Java + + writer.start(); + // write the first batch + writer.writeBatch(); + + // write another four batches.
+ for (int i = 0; i < 4; i++) { + // repopulate the VectorSchemaRoot and write the next batch + BitVector childVector1 = (BitVector)root.getVector(0); + VarCharVector childVector2 = (VarCharVector)root.getVector(1); + childVector1.reset(); + childVector2.reset(); + // ... populate the vectors here; the data can differ for each batch + writer.writeBatch(); + } + + // end + writer.end(); + +Note that the :class:`VectorSchemaRoot` in the writer is a container that can hold batches: batches flow through +:class:`VectorSchemaRoot` as part of a pipeline, so we need to populate data before each ``writeBatch`` call, and later batches +overwrite previous ones. + +Now the :class:`ByteArrayOutputStream` contains the complete stream, which contains 5 record batches. +We can read such a stream with :class:`ArrowStreamReader`. Note that the :class:`VectorSchemaRoot` within the +reader will be loaded with new values on every call to ``loadNextBatch()`` + +.. code-block:: Java + + try (ArrowStreamReader reader = new ArrowStreamReader(new ByteArrayInputStream(out.toByteArray()), allocator)) { + Schema schema = reader.getVectorSchemaRoot().getSchema(); + for (int i = 0; i < 5; i++) { + // This will be loaded with new values on every call to loadNextBatch + VectorSchemaRoot readBatch = reader.getVectorSchemaRoot(); + reader.loadNextBatch(); + // ... do something with readBatch + } + + } + +Here we also give a simple example with dictionary encoded vectors + +..
code-block:: Java + + DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider(); + // create dictionary and provider + final VarCharVector dictVector = new VarCharVector("dict", allocator); + dictVector.allocateNewSafe(); + dictVector.setSafe(0, "aa".getBytes()); + dictVector.setSafe(1, "bb".getBytes()); + dictVector.setSafe(2, "cc".getBytes()); + dictVector.setValueCount(3); + + Dictionary dictionary = + new Dictionary(dictVector, new DictionaryEncoding(1L, false, /*indexType=*/null)); + provider.put(dictionary); + + // create vector and encode it + final VarCharVector vector = new VarCharVector("vector", allocator); + vector.allocateNewSafe(); + vector.setSafe(0, "bb".getBytes()); + vector.setSafe(1, "bb".getBytes()); + vector.setSafe(2, "cc".getBytes()); + vector.setSafe(3, "aa".getBytes()); + vector.setValueCount(4); + + // get the encoded vector + IntVector encodedVector = (IntVector) DictionaryEncoder.encode(vector, dictionary); + + // create VectorSchemaRoot + List<Field> fields = Arrays.asList(encodedVector.getField()); + List<FieldVector> vectors = Arrays.asList(encodedVector); + VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors); + + // write data + ByteArrayOutputStream out = new ByteArrayOutputStream(); + ArrowStreamWriter writer = new ArrowStreamWriter(root, provider, Channels.newChannel(out)); + writer.start(); + writer.writeBatch(); + writer.end(); + + // read data + try (ArrowStreamReader reader = new ArrowStreamReader(new ByteArrayInputStream(out.toByteArray()), allocator)) { + reader.loadNextBatch(); + VectorSchemaRoot readRoot = reader.getVectorSchemaRoot(); + // get the encoded vector + IntVector intVector = (IntVector) readRoot.getVector(0); + + // get dictionaries and decode the vector + Map<Long, Dictionary> dictionaryMap = reader.getDictionaryVectors(); + long dictionaryId = intVector.getField().getDictionary().getId(); + VarCharVector varCharVector = + (VarCharVector) 
DictionaryEncoder.decode(intVector, dictionaryMap.get(dictionaryId)); + + } + +Writing and Reading Random Access Files +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The :class:`ArrowFileWriter` has the same API as :class:`ArrowStreamWriter` + +.. code-block:: Java + + ByteArrayOutputStream out = new ByteArrayOutputStream(); + ArrowFileWriter writer = new ArrowFileWriter(root, null, Channels.newChannel(out)); + writer.start(); + // write the first batch + writer.writeBatch(); + // write another four batches. + for (int i = 0; i < 4; i++) { + // ... populate the vectors here + writer.writeBatch(); + } + writer.end(); + +The difference between :class:`ArrowFileReader` and :class:`ArrowStreamReader` is that the input source +must have a ``seek`` method for random access. Because we have access to the entire payload, we know the +number of record batches in the file, and can read any of them at random + +.. code-block:: Java + + try (ArrowFileReader reader = new ArrowFileReader( + new ByteArrayReadableSeekableByteChannel(out.toByteArray()), allocator)) { + + // read the fourth batch (index 3) + ArrowBlock block = reader.getRecordBlocks().get(3); + reader.loadRecordBatch(block); + VectorSchemaRoot readBatch = reader.getVectorSchemaRoot(); + } diff --git a/src/arrow/docs/source/java/reference/index.rst b/src/arrow/docs/source/java/reference/index.rst new file mode 100644 index 000000000..523ac0c7f --- /dev/null +++ b/src/arrow/docs/source/java/reference/index.rst @@ -0,0 +1,21 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +..
Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +Java Reference (javadoc) +======================== + +Stub page for the Java reference docs; actual source is located in the java/ directory. diff --git a/src/arrow/docs/source/java/vector.rst b/src/arrow/docs/source/java/vector.rst new file mode 100644 index 000000000..ece07d0a7 --- /dev/null +++ b/src/arrow/docs/source/java/vector.rst @@ -0,0 +1,288 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +=========== +ValueVector +=========== + +The :class:`ValueVector` interface (which is called Array in the C++ implementation and +in :doc:`the specification <../format/Columnar>`) is an abstraction that is used to store a +sequence of values having the same type in an individual column. Internally, those values are +represented by one or several buffers, the number and meaning of which depend on the vector’s data type.
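To make the buffer layout concrete, here is a minimal sketch in plain Java of how a nullable 64-bit integer column could be held in two buffers: a validity bitmap and a value buffer. It assumes the least-significant-bit numbering that the Arrow columnar specification uses for validity bitmaps. The class and its methods are hypothetical and are not how the Arrow vectors are actually implemented.

```java
/** Hypothetical two-buffer column; an illustration only, not an Arrow class. */
final class TwoBufferColumnSketch {
    final byte[] validity; // one bit per slot, 1 = non-null, bit (i % 8) of byte (i / 8)
    final long[] values;   // one 64-bit slot per element; content is undefined for null slots

    TwoBufferColumnSketch(int capacity) {
        validity = new byte[(capacity + 7) / 8];
        values = new long[capacity];
    }

    /** Stores a value and marks the slot as non-null. */
    void set(int index, long value) {
        validity[index >> 3] |= (byte) (1 << (index & 7));
        values[index] = value;
    }

    /** Marks the slot as null by clearing its validity bit; the value slot is left untouched. */
    void setNull(int index) {
        validity[index >> 3] &= (byte) ~(1 << (index & 7));
    }

    boolean isNull(int index) {
        return (validity[index >> 3] & (1 << (index & 7))) == 0;
    }

    long get(int index) {
        if (isNull(index)) {
            throw new IllegalStateException("value at index " + index + " is null");
        }
        return values[index];
    }

    public static void main(String[] args) {
        TwoBufferColumnSketch column = new TwoBufferColumnSketch(8);
        for (int i = 0; i < 8; i++) {
            if (i != 3) {
                column.set(i, i + 1);
            }
        }
        // Printed from most to least significant bit; the cleared bit belongs to index 3.
        System.out.println(Integer.toBinaryString(column.validity[0] & 0xFF)); // 11110111
    }
}
```

Other data types need a different set of buffers; a variable width column, for instance, additionally needs an offsets buffer.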
+ +There are concrete subclasses of :class:`ValueVector` for each primitive data type +and nested type described in the specification. There are a few differences in naming +with the type names described in the specification: +Table with non-intuitive names (BigInt = 64 bit integer, etc). + +It is important that vector is allocated before attempting to read or write, +:class:`ValueVector` "should" strive to guarantee this order of operation: +create > allocate > mutate > set value count > access > clear (or allocate to start the process over). +We will go through a concrete example to demonstrate each operation in the next section. + +Vector Life Cycle +================= + +As discussed above, each vector goes through several steps in its life cycle, +and each step is triggered by a vector operation. In particular, we have the following vector operations: + +1. **Vector creation**: we create a new vector object by, for example, the vector constructor. +The following code creates a new ``IntVector`` by the constructor: + +.. code-block:: Java + + RootAllocator allocator = new RootAllocator(Long.MAX_VALUE); + ... + IntVector vector = new IntVector("int vector", allocator); + +By now, a vector object is created. However, no underlying memory has been allocated, so we need the +following step. + +2. **Vector allocation**: in this step, we allocate memory for the vector. For most vectors, we +have two options: 1) if we know the maximum vector capacity, we can specify it by calling the +``allocateNew(int)`` method; 2) otherwise, we should call the ``allocateNew()`` method, and a default +capacity will be allocated for it. For our running example, we assume that the vector capacity never +exceeds 10: + +.. code-block:: Java + + vector.allocateNew(10); + +3. **Vector mutation**: now we can populate the vector with values we desire. For all vectors, we can populate +vector values through vector writers (An example will be given in the next section). 
For primitive types, +we can also mutate the vector by the set methods. There are two classes of set methods: 1) if we can +be sure the vector has enough capacity, we can call the ``set(index, value)`` method. 2) if we are not sure +about the vector capacity, we should call the ``setSafe(index, value)`` method, which will automatically +take care of vector reallocation, if the capacity is not sufficient. For our running example, we know the +vector has enough capacity, so we can call + +.. code-block:: Java + + vector.set(/*index*/5, /*value*/25); + +4. **Set value count**: for this step, we set the value count of the vector by calling the +``setValueCount(int)`` method: + +.. code-block:: Java + + vector.setValueCount(10); + +After this step, the vector enters an immutable state. In other words, we should no longer mutate it. +(Unless we reuse the vector by allocating it again. This will be discussed shortly.) + +5. **Vector access**: it is time to access vector values. Similarly, we have two options to access values: +1) get methods and 2) vector reader. Vector reader works for all types of vectors, while get methods are +only available for primitive vectors. A concrete example for vector reader will be given in the next section. +Below is an example of vector access by get method: + +.. code-block:: Java + + int value = vector.get(5); // value == 25 + +6. **Vector clear**: when we are done with the vector, we should clear it to release its memory. This is done by +calling the ``close()`` method: + +.. code-block:: Java + + vector.close(); + +Some points to note about the steps above: + +* The steps are not necessarily performed in a linear sequence. Instead, they can be in a loop. For example, + when a vector enters the access step, we can also go back to the vector mutation step, and then set value + count, access vector, and so on. + +* We should try to make sure the above steps are carried out in order. 
Otherwise, the vector + may be in an undefined state, and some unexpected behavior may occur. However, this restriction + is not strict. That means it is possible that we violate the order above, but still get + correct results. + +* When mutating vector values through set methods, we should prefer ``set(index, value)`` methods to + ``setSafe(index, value)`` methods whenever possible, to avoid unnecessary performance overhead of handling + vector capacity. + +* All vectors implement the ``AutoCloseable`` interface. So they must be closed explicitly when they are + no longer used, to avoid resource leaks. To make sure of this, it is recommended to place vector related operations + into a try-with-resources block. + +* For fixed width vectors (e.g. IntVector), we can set values at different indices in arbitrary orders. + For variable width vectors (e.g. VarCharVector), however, we must set values in non-decreasing order of the + indices. Otherwise, the values after the set position will become invalid. For example, suppose we use the + following statements to populate a variable width vector: + +.. code-block:: Java + + VarCharVector vector = new VarCharVector("vector", allocator); + vector.allocateNew(); + vector.setSafe(0, "zero".getBytes(StandardCharsets.UTF_8)); + vector.setSafe(1, "one".getBytes(StandardCharsets.UTF_8)); + // ... + vector.setSafe(9, "nine".getBytes(StandardCharsets.UTF_8)); + +Then we set the value at position 5 again: + +.. code-block:: Java + + vector.setSafe(5, "5".getBytes(StandardCharsets.UTF_8)); + +After that, the values at positions 6, 7, 8, and 9 of the vector will become invalid. + +Building ValueVector +==================== + +Note that the current implementation doesn't enforce the rule that Arrow objects are immutable. +:class:`ValueVector` instances can be created directly with the ``new`` keyword; there are +set/setSafe APIs and concrete subclasses of FieldWriter for populating values.
+ +For example, the code below shows how to build a :class:`BigIntVector`. In this case, we build a +vector with indices 0 to 7 holding the values 1 to 8, where the fourth element is set to null + +.. code-block:: Java + + try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE); + BigIntVector vector = new BigIntVector("vector", allocator)) { + vector.allocateNew(8); + vector.set(0, 1); + vector.set(1, 2); + vector.set(2, 3); + vector.setNull(3); + vector.set(4, 5); + vector.set(5, 6); + vector.set(6, 7); + vector.set(7, 8); + vector.setValueCount(8); // this will finalize the vector by convention. + // ... + } + +The :class:`BigIntVector` holds two ArrowBufs. The first buffer holds the null bitmap, which consists +here of a single byte with the bits 1|1|1|1|0|1|1|1 (the bit is 1 if the value is non-null; the bits are shown from most to least significant, so the cleared bit is the one for index 3). +The second buffer contains all the above values. As the fourth entry is null, the value at that position +in the buffer is undefined. Note that, compared with the set API, the setSafe API checks the value capacity before setting +values and reallocates buffers if necessary. + +Here is how to build a vector using a writer + +.. code-block:: Java + + try (BigIntVector vector = new BigIntVector("vector", allocator); + BigIntWriter writer = new BigIntWriterImpl(vector)) { + writer.setPosition(0); + writer.writeBigInt(1); + writer.setPosition(1); + writer.writeBigInt(2); + writer.setPosition(2); + writer.writeBigInt(3); + // writer.setPosition(3) is not called which means the fourth value is null. + writer.setPosition(4); + writer.writeBigInt(5); + writer.setPosition(5); + writer.writeBigInt(6); + writer.setPosition(6); + writer.writeBigInt(7); + writer.setPosition(7); + writer.writeBigInt(8); + } + +There are get APIs and concrete subclasses of :class:`FieldReader` for accessing vector values. Note +that the writer/reader is not as efficient as direct access + +..
code-block:: Java + + // access via get API + for (int i = 0; i < vector.getValueCount(); i++) { + if (!vector.isNull(i)) { + System.out.println(vector.get(i)); + } + } + + // access via reader + BigIntReader reader = vector.getReader(); + for (int i = 0; i < vector.getValueCount(); i++) { + reader.setPosition(i); + if (reader.isSet()) { + System.out.println(reader.readLong()); + } + } + +Building ListVector +=================== + +A :class:`ListVector` is a vector that holds a list of values for each index. Working with one you need to handle the same steps as mentioned above (create > allocate > mutate > set value count > access > clear), but the details of how you accomplish this are slightly different since you need to both create the vector and set the list of values for each index. + +For example, the code below shows how to build a :class:`ListVector` of int's using the writer :class:`UnionListWriter`. We build a vector from 0 to 9 and each index contains a list with values [[0, 0, 0, 0, 0], [0, 1, 2, 3, 4], [0, 2, 4, 6, 8], …, [0, 9, 18, 27, 36]]. List values can be added in any order so writing a list such as [3, 1, 2] would be just as valid. + +.. code-block:: Java + + try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE); + ListVector listVector = ListVector.empty("vector", allocator)) { + UnionListWriter writer = listVector.getWriter(); + for (int i = 0; i < 10; i++) { + writer.startList(); + writer.setPosition(i); + for (int j = 0; j < 5; j++) { + writer.writeInt(j * i); + } + writer.setValueCount(5); + writer.endList(); + } + listVector.setValueCount(10); + } + +:class:`ListVector` values can be accessed either through the get API or through the reader class :class:`UnionListReader`. To read all the values, first enumerate through the indexes, and then enumerate through the inner list values. + +.. 
code-block:: Java + + // access via get API + for (int i = 0; i < listVector.getValueCount(); i++) { + if (!listVector.isNull(i)) { + ArrayList<Integer> elements = (ArrayList<Integer>) listVector.getObject(i); + for (Integer element : elements) { + System.out.println(element); + } + } + } + + // access via reader + UnionListReader reader = listVector.getReader(); + for (int i = 0; i < listVector.getValueCount(); i++) { + reader.setPosition(i); + while (reader.next()) { + IntReader intReader = reader.reader(); + if (intReader.isSet()) { + System.out.println(intReader.readInteger()); + } + } + } + +Slicing +======= + +Similar to the C++ implementation, it is possible to make zero-copy slices of vectors to obtain a vector +referring to some logical sub-sequence of the data through :class:`TransferPair` + +.. code-block:: Java + + IntVector vector = new IntVector("intVector", allocator); + for (int i = 0; i < 10; i++) { + vector.setSafe(i, i); + } + vector.setValueCount(10); + + TransferPair tp = vector.getTransferPair(allocator); + tp.splitAndTransfer(0, 5); + IntVector sliced = (IntVector) tp.getTo(); + // In this case, the vector values are [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] and the sliced values are [0, 1, 2, 3, 4]. diff --git a/src/arrow/docs/source/java/vector_schema_root.rst b/src/arrow/docs/source/java/vector_schema_root.rst new file mode 100644 index 000000000..7f787d9d5 --- /dev/null +++ b/src/arrow/docs/source/java/vector_schema_root.rst @@ -0,0 +1,74 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +..
Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +================ +VectorSchemaRoot +================ +A :class:`VectorSchemaRoot` is a container that can hold batches; batches flow through :class:`VectorSchemaRoot` +as part of a pipeline. Note this is different from other implementations (e.g. in C++ and Python, +a :class:`RecordBatch` is a collection of equal-length vector instances and is created anew for each batch). + +The recommended usage for :class:`VectorSchemaRoot` is creating a single :class:`VectorSchemaRoot` +based on the known schema and populating data over and over into the same VectorSchemaRoot in a stream +of batches rather than creating a new :class:`VectorSchemaRoot` instance each time +(see `Flight <https://github.com/apache/arrow/tree/master/java/flight/src/main/java/org/apache/arrow/flight>`_ or +``ArrowFileWriter`` for better understanding). Thus at any one point a VectorSchemaRoot may have data or +may have no data (say it was transferred downstream or not yet populated). + + +Here is an example of building a :class:`VectorSchemaRoot` + +.. code-block:: Java + + BitVector bitVector = new BitVector("boolean", allocator); + VarCharVector varCharVector = new VarCharVector("varchar", allocator); + bitVector.allocateNew(); + varCharVector.allocateNew(); + for (int i = 0; i < 10; i++) { + bitVector.setSafe(i, i % 2 == 0 ?
0 : 1); + varCharVector.setSafe(i, ("test" + i).getBytes(StandardCharsets.UTF_8)); + } + bitVector.setValueCount(10); + varCharVector.setValueCount(10); + + List<Field> fields = Arrays.asList(bitVector.getField(), varCharVector.getField()); + List<FieldVector> vectors = Arrays.asList(bitVector, varCharVector); + VectorSchemaRoot vectorSchemaRoot = new VectorSchemaRoot(fields, vectors); + +The vectors within a :class:`VectorSchemaRoot` can be loaded/unloaded via :class:`VectorLoader` and :class:`VectorUnloader`. +:class:`VectorLoader` and :class:`VectorUnloader` handle converting between :class:`VectorSchemaRoot` and :class:`ArrowRecordBatch` (the +representation of a RecordBatch :doc:`IPC <../format/IPC>` message). An example is shown below + +.. code-block:: Java + + // create a VectorSchemaRoot root1 and convert its data into recordBatch + VectorSchemaRoot root1 = new VectorSchemaRoot(fields, vectors); + VectorUnloader unloader = new VectorUnloader(root1); + ArrowRecordBatch recordBatch = unloader.getRecordBatch(); + + // create a VectorSchemaRoot root2 and load the recordBatch + VectorSchemaRoot root2 = VectorSchemaRoot.create(root1.getSchema(), allocator); + VectorLoader loader = new VectorLoader(root2); + loader.load(recordBatch); + +A new :class:`VectorSchemaRoot` can be sliced from an existing instance with zero-copy + +.. code-block:: Java + + // 0 is the start index (inclusive) and 5 is the length, so the slice covers rows 0 through 4. + VectorSchemaRoot newRoot = vectorSchemaRoot.slice(0, 5); + |
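The ``(start, length)`` convention and the idea of a slice that shares storage with its source can be illustrated with the JDK's ``java.nio.IntBuffer``. This is an analogy in plain Java under stated assumptions, not Arrow's implementation: the class and method names are hypothetical, and Arrow manages its buffers through allocators and reference counting rather than NIO views.

```java
import java.nio.IntBuffer;

/** Hypothetical helper showing a (start, length) slice that shares storage with its source. */
final class ZeroCopySliceSketch {

    /** Returns a view of {@code length} elements beginning at {@code start}; no data is copied. */
    static IntBuffer slice(IntBuffer source, int start, int length) {
        IntBuffer view = source.duplicate(); // own position and limit, shared content
        view.limit(start + length);
        view.position(start);
        return view.slice();                 // index 0 of the result is index start of the source
    }

    public static void main(String[] args) {
        IntBuffer column = IntBuffer.wrap(new int[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9});
        IntBuffer firstFive = slice(column, 0, 5);
        System.out.println(firstFive.remaining()); // 5
        column.put(2, 42);                         // write through the source ...
        System.out.println(firstFive.get(2));      // 42: ... and the slice sees it, since nothing was copied
    }
}
```

Because the second argument is a length rather than an end index, ``slice(source, 3, 4)`` covers source indices 3 through 6.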