Diffstat (limited to 'doc/overview')
-rw-r--r-- | doc/overview/doc-orcus.rst | 176 |
-rw-r--r-- | doc/overview/doc-user.rst | 574 |
-rw-r--r-- | doc/overview/index.rst | 95 |
-rw-r--r-- | doc/overview/json.rst | 353 |
-rw-r--r-- | doc/overview/yaml.rst | 8 |
5 files changed, 1206 insertions, 0 deletions
diff --git a/doc/overview/doc-orcus.rst b/doc/overview/doc-orcus.rst new file mode 100644 index 0000000..0577d4e --- /dev/null +++ b/doc/overview/doc-orcus.rst @@ -0,0 +1,176 @@ + +.. highlight:: cpp + +Use orcus's spreadsheet document class +====================================== + +If you want to use orcus' :cpp:class:`~orcus::spreadsheet::document` as your +document store, you can use the :cpp:class:`~orcus::spreadsheet::import_factory` +class that orcus provides which already implements all necessary interfaces. +The example code shown below illustrates how to do this: + +.. literalinclude:: ../../doc_example/spreadsheet_doc_1.cpp + :language: C++ + +This example code loads a file saved in the Open Document Spreadsheet format +stored in a directory whose path is to be defined in the environment variable +named ``INPUTDIR``. In this example, we don't check for the validity of ``INPUTDIR`` +for brevity's sake. + +The input file consists of the following content on its first sheet. + +.. figure:: /_static/images/overview/doc-content.png + +While it is not clear from this screenshot, cell C2 contains the formula +**CONCATENATE(A2, " ", B2)** to concatenate the content of A2 and B2 with a +space between them. Cells C3 through C7 also contain similar formula +expressions. + +Let's walk through this code step by step. First, we need to instantiate the +document store. Here we are using the concrete :cpp:class:`~orcus::spreadsheet::document` +class available in orcus. We then immediately pass this document to the +:cpp:class:`~orcus::spreadsheet::import_factory` instance also from orcus: + +.. literalinclude:: ../../doc_example/spreadsheet_doc_1.cpp + :language: C++ + :start-after: //!code-start: instantiate + :end-before: //!code-end: instantiate + :dedent: 4 + +The next step is to create the loader instance and pass the factory to it: + +.. 
literalinclude:: ../../doc_example/spreadsheet_doc_1.cpp + :language: C++ + :start-after: //!code-start: loader + :end-before: //!code-end: loader + :dedent: 4 + +In this example we are using the :cpp:class:`~orcus::orcus_ods` filter class +because the document we are loading is of Open Document Spreadsheet type, but +the process is the same for other document types, the only difference being +the name of the class. Once the filter object is constructed, we'll simply +load the file by calling its :cpp:func:`~orcus::orcus_ods::read_file` method +and passing the path to the file as its argument: + +.. literalinclude:: ../../doc_example/spreadsheet_doc_1.cpp + :language: C++ + :start-after: //!code-start: read-file + :end-before: //!code-end: read-file + :dedent: 4 + +Once this call returns, the document has been fully populated. What the rest +of the code does is to access the content of the first row of the first sheet of +the document. First, you need to get a reference to the internal cell value +store that we call *model context*: + +.. literalinclude:: ../../doc_example/spreadsheet_doc_1.cpp + :language: C++ + :start-after: //!code-start: model-context + :end-before: //!code-end: model-context + :dedent: 4 + +Since the content of cell A1 is a string, to get the value you need to first +get the ID of the string: + +.. literalinclude:: ../../doc_example/spreadsheet_doc_1.cpp + :language: C++ + :start-after: //!code-start: string-id + :end-before: //!code-end: string-id + :dedent: 4 + +Once you have the ID of the string, you can pass that to the model to get the +actual string value and print it to the standard output: + +.. literalinclude:: ../../doc_example/spreadsheet_doc_1.cpp + :language: C++ + :start-after: //!code-start: print-string + :end-before: //!code-end: print-string + :dedent: 4 + +Here we assume that the string value exists for the given ID. 
If you +pass a string ID value to the :cpp:func:`get_string` method and there isn't a string +value associated with it, you'll get a null pointer returned from the call. + +The reason you need this two-step process to get a string value is +that all the string values stored in the cells are pooled at the document +model level, and the cells themselves only store the ID values as integers. + +You may also have noticed that the types surrounding the :cpp:class:`ixion::model_context` +class are all in the :cpp:any:`ixion` namespace. This is because orcus' own +:cpp:class:`~orcus::spreadsheet::document` class uses the formula engine and the +document model from the `ixion library <https://gitlab.com/ixion/ixion>`_ to handle +calculation of the formula cells stored in the document, and the formula engine +requires all cell values to be stored in the :cpp:class:`ixion::model_context` +instance. + +.. note:: The :cpp:class:`~orcus::spreadsheet::document` class in orcus uses + the formula engine from the `ixion library <https://gitlab.com/ixion/ixion>`_ + to calculate the results of the formula cells stored in the document. + +The rest of the code basically repeats the same process for cells B1 and C1: + +.. literalinclude:: ../../doc_example/spreadsheet_doc_1.cpp + :language: C++ + :start-after: //!code-start: rest + :end-before: //!code-end: rest + :dedent: 4 + +and generates the following output: + +.. code-block:: text + + A1: Number + B1: String + C1: Formula + +Accessing the numeric cell values is a bit simpler since the values are +stored directly with the cells. Using the document from the above example +code, the following code block: + +.. literalinclude:: ../../doc_example/spreadsheet_doc_1_num_and_formula.cpp + :language: C++ + :start-after: //!code-start: print-numeric-cells + :end-before: //!code-end: print-numeric-cells + :dedent: 4 + +will access the cells from A2 through A7 and print out their numeric values. 
+You should see the following output generated from this code block: + +.. code-block:: text + + A2: 1 + A3: 2 + A4: 3 + A5: 4 + A6: 5 + A7: 6 + +Handling formula cells is a bit more complex, since each formula cell +contains two things: 1) the formula expression which is stored as tokens +internally, and 2) the cached result of the formula. The following code +illustrates how to retrieve the cached formula results of cells C2 through +C7: + +.. literalinclude:: ../../doc_example/spreadsheet_doc_1_num_and_formula.cpp + :language: C++ + :start-after: //!code-start: print-formula-cells + :end-before: //!code-end: print-formula-cells + :dedent: 4 + +For each cell, this code first accesses the stored formula cell instance, gets +a reference to its cached result, then obtains its string result value to print +it out to the standard output. Running this block of code will produce the +following output: + +.. code-block:: text + + C2: 1 Andy + C3: 2 Bruce + C4: 3 Charlie + C5: 4 David + C6: 5 Edward + C7: 6 Frank + +.. warning:: In production code, you should probably check the formula cell + pointer, which may be null if the cell at the specified + position is not a formula cell. diff --git a/doc/overview/doc-user.rst b/doc/overview/doc-user.rst new file mode 100644 index 0000000..a1292e5 --- /dev/null +++ b/doc/overview/doc-user.rst @@ -0,0 +1,574 @@ + +.. highlight:: cpp + +Use a user-defined custom document class +======================================== + +In this section we will demonstrate how you can use orcus to populate your own +custom document model by implementing your own set of interface classes and +passing it to the orcus import filter. The first example code shown below is +the *absolute* minimum that you need to implement in order for the orcus +filter to function properly: + +.. 
literalinclude:: ../../doc_example/spreadsheet_doc_2.cpp + :language: C++ + +Just like the example we used in the previous section, we are also loading a +document saved in the Open Document Spreadsheet format via +:cpp:class:`~orcus::orcus_ods`. The document being loaded is named +multi-sheets.ods, and contains three sheets which are named **'1st +Sheet'**, **'2nd Sheet'**, and **'3rd Sheet'** in this exact order. When you +compile and execute the above code, you should get the following output: + +.. code-block:: text + + append_sheet: sheet index: 0; sheet name: 1st Sheet + append_sheet: sheet index: 1; sheet name: 2nd Sheet + append_sheet: sheet index: 2; sheet name: 3rd Sheet + +One primary role the import factory plays is to provide the orcus import +filter with the ability to create and insert a new sheet into the document. As +illustrated in the above code, it also provides access to existing sheets by +name or by position. Every import factory implementation must be a +derived class of the :cpp:class:`orcus::spreadsheet::iface::import_factory` +interface base class. At a minimum, it must implement + +* the :cpp:func:`~orcus::spreadsheet::iface::import_factory::append_sheet` + method which inserts a new sheet and returns access to it, + +* two variants of the :cpp:func:`~orcus::spreadsheet::iface::import_factory::get_sheet` + method which return access to an existing sheet, and + +* the :cpp:func:`~orcus::spreadsheet::iface::import_factory::finalize` method + which gets called exactly once at the very end of the import, to give the + implementation a chance to perform post-import tasks. + +in order for the code to be buildable. Now, since all of the sheet accessor +methods return null pointers in this code, the import filter has no way of +populating the sheet data. To actually receive the sheet data from the import +filter, you must have these methods return valid pointers to sheet accessors. +The next example shows how that can be done. 
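As a rough, orcus-independent illustration of the factory contract described above (append a sheet, look it up by name or by position, finalize once), here is a minimal sketch. The types ``demo_sheet`` and ``demo_factory`` are hypothetical stand-ins invented for this illustration; they do not derive from the orcus interface classes, and the real method signatures in the orcus headers may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a sheet accessor; NOT the orcus import_sheet interface.
struct demo_sheet
{
    std::size_t index;
    std::string name;
};

// Hypothetical stand-in for an import factory. The method names mirror the
// text above, but the signatures are assumptions made for this sketch.
class demo_factory
{
    // unique_ptr keeps each sheet at a stable address as the vector grows.
    std::vector<std::unique_ptr<demo_sheet>> m_sheets;
    bool m_finalized = false;

public:
    // Insert a new sheet and return access to it.
    demo_sheet* append_sheet(std::size_t index, std::string_view name)
    {
        m_sheets.push_back(
            std::make_unique<demo_sheet>(demo_sheet{index, std::string{name}}));
        return m_sheets.back().get();
    }

    // Look up an existing sheet by name; returns null when there is none.
    demo_sheet* get_sheet(std::string_view name)
    {
        for (auto& sh : m_sheets)
            if (sh->name == name)
                return sh.get();
        return nullptr;
    }

    // Look up an existing sheet by its 0-based position.
    demo_sheet* get_sheet(std::size_t index)
    {
        return index < m_sheets.size() ? m_sheets[index].get() : nullptr;
    }

    // Called once at the very end of the import.
    void finalize() { m_finalized = true; }

    bool finalized() const { return m_finalized; }
};
```

Unlike the minimal example above, the accessors in this sketch return non-null pointers for sheets that exist, which is the property a filter would need in order to deliver cell data.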
+ + +Implement sheet interface +------------------------- + +In this section we will expand on the code in the previous section to +implement the sheet accessor interface, in order to receive cell values +in each individual sheet. In this example, we will define a structure +to hold a cell value, and store these values in a 2-dimensional array for each +sheet. First, let's define the cell value structure: + +.. literalinclude:: ../../doc_example/spreadsheet_doc_2_sheets_no_string_pool.cpp + :language: C++ + :start-after: //!code-start: cell_value + :end-before: //!code-end: cell_value + +As we will be handling only three cell types, i.e. empty, numeric, and string +cell types, this structure will work just fine. We will also define a namespace +alias called ``ss`` for convenience. This will be used in later code. + +Next, we'll define a sheet class called ``my_sheet`` that stores the cell values +in a 2-dimensional array, and implements all required interfaces as a child class +of :cpp:class:`~orcus::spreadsheet::iface::import_sheet`. + +At a minimum, the sheet accessor class must implement the following virtual +methods to satisfy the interface requirements of +:cpp:class:`~orcus::spreadsheet::iface::import_sheet`. + +* :cpp:func:`~orcus::spreadsheet::iface::import_sheet::set_auto` - This is a + setter method for a cell whose type is undetermined. The implementor must + determine the value type of this cell, from the raw string value of the + cell. This method is used when loading a CSV document, for instance. + +* :cpp:func:`~orcus::spreadsheet::iface::import_sheet::set_string` - This is a + setter method for a cell that stores a string value. All cell string values + are expected to be pooled for the entire document, and this method only + receives a string index into a centrally-managed string table. The document + model is expected to implement a central string table that can translate an + index into its actual string value. 
+ +* :cpp:func:`~orcus::spreadsheet::iface::import_sheet::set_value` - This is a + setter method for a cell that stores a numeric value. + +* :cpp:func:`~orcus::spreadsheet::iface::import_sheet::set_bool` - This is a + setter method for a cell that stores a boolean value. Note that not all + format types use this method, as some formats store boolean values as + numeric values. + +* :cpp:func:`~orcus::spreadsheet::iface::import_sheet::set_date_time` - This + is a setter method for a cell that stores a date time value. As with + boolean value type, some format types may not use this method as they store + date time values as numeric values, typically as days since epoch. + +* :cpp:func:`~orcus::spreadsheet::iface::import_sheet::set_format` - This is a + setter method for applying cell formats. Just like the string values, cell + format properties are expected to be stored in a document-wide cell format + properties table, and this method only receives an index into the table. + +* :cpp:func:`~orcus::spreadsheet::iface::import_sheet::get_sheet_size` - This + method is expected to return the dimension of the sheet which the loader may + need in some operations. + +For now, we'll only implement +:cpp:func:`~orcus::spreadsheet::iface::import_sheet::set_string`, +:cpp:func:`~orcus::spreadsheet::iface::import_sheet::set_value`, and +:cpp:func:`~orcus::spreadsheet::iface::import_sheet::get_sheet_size`, and +leave the rest empty. + +Here is the actual code for class ``my_sheet``: + +.. literalinclude:: ../../doc_example/spreadsheet_doc_2_sheets_no_string_pool.cpp + :language: C++ + :start-after: //!code-start: my_sheet + :end-before: //!code-end: my_sheet + +Note that this class receives its sheet index value from the caller upon +instantiation. A sheet index is a 0-based value and represents its position +within the sheet collection. 
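To make the storage idea concrete, the following is a minimal sketch of a sheet that keeps its cells in a row-major grid and records either a numeric value or a string index per cell. The names ``grid_cell``, ``grid_cell_type`` and ``grid_sheet``, the fixed 100 x 10 grid size, and the simplified method signatures are all assumptions made for this illustration; the sketch does not derive from the orcus ``import_sheet`` interface and is not the ``my_sheet`` class from the example file.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cell types, mirroring the three types handled in this section.
enum class grid_cell_type { empty, numeric, string };

// Hypothetical cell record: 'index' is meaningful for string cells,
// 'value' for numeric cells.
struct grid_cell
{
    grid_cell_type type = grid_cell_type::empty;
    std::size_t index = 0;
    double value = 0.0;
};

// Hypothetical sheet storing its cells in a fixed-size, row-major grid.
class grid_sheet
{
    static constexpr std::size_t row_count = 100; // arbitrary size for the sketch
    static constexpr std::size_t col_count = 10;

    std::vector<grid_cell> m_cells;
    std::size_t m_sheet_index; // 0-based position within the sheet collection

    grid_cell& cell(std::size_t row, std::size_t col)
    {
        assert(row < row_count && col < col_count);
        return m_cells[row * col_count + col];
    }

public:
    explicit grid_sheet(std::size_t sheet_index) :
        m_cells(row_count * col_count), m_sheet_index(sheet_index) {}

    // A string cell only stores an index into a document-wide string table.
    void set_string(std::size_t row, std::size_t col, std::size_t string_index)
    {
        grid_cell& c = cell(row, col);
        c.type = grid_cell_type::string;
        c.index = string_index;
    }

    // A numeric cell stores its value directly.
    void set_value(std::size_t row, std::size_t col, double v)
    {
        grid_cell& c = cell(row, col);
        c.type = grid_cell_type::numeric;
        c.value = v;
    }

    // Bounds-checked read access; throws std::out_of_range past the grid end.
    const grid_cell& at(std::size_t row, std::size_t col) const
    {
        return m_cells.at(row * col_count + col);
    }

    std::size_t sheet_index() const { return m_sheet_index; }
};
```

The flat vector stands in for the 2-dimensional array mentioned in the text; either layout works as long as each cell can be reached from its row and column.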
+ +Finally, we will modify the ``my_import_factory`` class to store and manage a +collection of ``my_sheet`` instances and to return a pointer to the +correct sheet accessor instance as needed. + +.. literalinclude:: ../../doc_example/spreadsheet_doc_2_sheets_no_string_pool.cpp + :language: C++ + :start-after: //!code-start: my_import_factory + :end-before: //!code-end: my_import_factory + +Let's put it all together and run this code: + +.. literalinclude:: ../../doc_example/spreadsheet_doc_2_sheets_no_string_pool.cpp + :language: C++ + +We'll be loading the same document we loaded in the previous example, but this +time we will receive its cell values. Let's go through each sheet one at a +time. + +Data on the first sheet looks like this: + +.. figure:: /_static/images/overview/multi-sheets-sheet1.png + +It consists of 4 columns, with each column having a header row followed by +exactly ten rows of data. The first and fourth columns contain numeric data, +while the second and third columns contain string data. + +When you run the above code to load this sheet, you'll get the following output: + +.. 
code-block:: text + + (sheet: 0; row: 0; col: 0): string index = 0 + (sheet: 0; row: 0; col: 1): string index = 0 + (sheet: 0; row: 0; col: 2): string index = 0 + (sheet: 0; row: 0; col: 3): string index = 0 + (sheet: 0; row: 1; col: 0): value = 1 + (sheet: 0; row: 1; col: 1): string index = 0 + (sheet: 0; row: 1; col: 2): string index = 0 + (sheet: 0; row: 1; col: 3): value = 35 + (sheet: 0; row: 2; col: 0): value = 2 + (sheet: 0; row: 2; col: 1): string index = 0 + (sheet: 0; row: 2; col: 2): string index = 0 + (sheet: 0; row: 2; col: 3): value = 56 + (sheet: 0; row: 3; col: 0): value = 3 + (sheet: 0; row: 3; col: 1): string index = 0 + (sheet: 0; row: 3; col: 2): string index = 0 + (sheet: 0; row: 3; col: 3): value = 6 + (sheet: 0; row: 4; col: 0): value = 4 + (sheet: 0; row: 4; col: 1): string index = 0 + (sheet: 0; row: 4; col: 2): string index = 0 + (sheet: 0; row: 4; col: 3): value = 65 + (sheet: 0; row: 5; col: 0): value = 5 + (sheet: 0; row: 5; col: 1): string index = 0 + (sheet: 0; row: 5; col: 2): string index = 0 + (sheet: 0; row: 5; col: 3): value = 88 + (sheet: 0; row: 6; col: 0): value = 6 + (sheet: 0; row: 6; col: 1): string index = 0 + (sheet: 0; row: 6; col: 2): string index = 0 + (sheet: 0; row: 6; col: 3): value = 90 + (sheet: 0; row: 7; col: 0): value = 7 + (sheet: 0; row: 7; col: 1): string index = 0 + (sheet: 0; row: 7; col: 2): string index = 0 + (sheet: 0; row: 7; col: 3): value = 80 + (sheet: 0; row: 8; col: 0): value = 8 + (sheet: 0; row: 8; col: 1): string index = 0 + (sheet: 0; row: 8; col: 2): string index = 0 + (sheet: 0; row: 8; col: 3): value = 66 + (sheet: 0; row: 9; col: 0): value = 9 + (sheet: 0; row: 9; col: 1): string index = 0 + (sheet: 0; row: 9; col: 2): string index = 0 + (sheet: 0; row: 9; col: 3): value = 14 + (sheet: 0; row: 10; col: 0): value = 10 + (sheet: 0; row: 10; col: 1): string index = 0 + (sheet: 0; row: 10; col: 2): string index = 0 + (sheet: 0; row: 10; col: 3): value = 23 + +There is a couple of things worth 
pointing out. First, the cell data +flows left to right first then top to bottom second. Second, for this +particular sheet and for this particular format, implementing just the +two setter methods, namely +:cpp:func:`~orcus::spreadsheet::iface::import_sheet::set_string` and +:cpp:func:`~orcus::spreadsheet::iface::import_sheet::set_value` are +enough to receive all cell values. However, we are getting a string +index value of 0 for all string cells. This is because orcus expects +the backend document model to implement the shared strings interface +which is responsible for providing correct string indices to the import +filter, and we have not yet implemented one. Let's fix that. + + +Implement shared strings interface +---------------------------------- + +The first thing to do is define some types: + +.. literalinclude:: ../../doc_example/spreadsheet_doc_2_sheets_with_string_pool.cpp + :language: C++ + :start-after: //!code-start: types + :end-before: //!code-end: types + +Here, we define ``ss_type`` to be the authoritative store for the shared +string values. The string values will be stored as std::string type, and we +use std::deque here to avoid re-allocation of internal buffers as the size +of the container grows. + +Another type we define is ``ss_hash_type``, which will be the hash map type +for storing string-to-index mapping entries. Here, we are using std::string_view +instead of std::string so that we can simply reference the string values stored in +the first container. + +The shared string interface is designed to handle both unformatted and +formatted string values. The following two methods: + +* :cpp:func:`~orcus::spreadsheet::iface::import_shared_strings::add` +* :cpp:func:`~orcus::spreadsheet::iface::import_shared_strings::append` + +are for unformatted string values. The +:cpp:func:`~orcus::spreadsheet::iface::import_shared_strings::add` method is +used when passing a string value that may or may not already exist in the +shared string pool. 
The +:cpp:func:`~orcus::spreadsheet::iface::import_shared_strings::append` method, +on the other hand, is used only when the string value being passed is a +brand-new string not yet stored in the string pool. When implementing the +:cpp:func:`~orcus::spreadsheet::iface::import_shared_strings::append` method, +you may skip checking for the existence of the string value in the pool before +inserting it. Both of these methods are expected to return a non-negative integer +value as the index of the string being passed. + +The following eight methods: + +* :cpp:func:`~orcus::spreadsheet::iface::import_shared_strings::set_segment_bold` +* :cpp:func:`~orcus::spreadsheet::iface::import_shared_strings::set_segment_font` +* :cpp:func:`~orcus::spreadsheet::iface::import_shared_strings::set_segment_font_color` +* :cpp:func:`~orcus::spreadsheet::iface::import_shared_strings::set_segment_font_name` +* :cpp:func:`~orcus::spreadsheet::iface::import_shared_strings::set_segment_font_size` +* :cpp:func:`~orcus::spreadsheet::iface::import_shared_strings::set_segment_italic` +* :cpp:func:`~orcus::spreadsheet::iface::import_shared_strings::append_segment` +* :cpp:func:`~orcus::spreadsheet::iface::import_shared_strings::commit_segments` + +are for receiving formatted string values. Conceptually, a formatted string +consists of a series of string segments, where each segment may have +different formatting attributes applied to it. These ``set_segment_*`` +methods are used to set the individual formatting attributes for the current +string segment, and the string value for the current segment is passed through +the +:cpp:func:`~orcus::spreadsheet::iface::import_shared_strings::append_segment` +call. The order in which the ``set_segment_*`` methods are called is not +specified, and not all of them may be called, but they are guaranteed to be +called before the +:cpp:func:`~orcus::spreadsheet::iface::import_shared_strings::append_segment` +method gets called. 
The implementation should keep a buffer to store the +formatting attributes for the current segment and apply each attribute to the +buffer as one of the ``set_segment_*`` methods gets called. When the +:cpp:func:`~orcus::spreadsheet::iface::import_shared_strings::append_segment` +gets called, the implementation should apply the formatting attribute set +currently in the buffer to the current segment, and reset the buffer for the +next segment. When all of the string segments and their formatting attributes +are passed, +:cpp:func:`~orcus::spreadsheet::iface::import_shared_strings::commit_segments` +gets called, signaling the implementation that now it's time to commit the +string to the document model. + +As we are going to ignore the formatting attributes in our current example, +the following code will do: + +.. literalinclude:: ../../doc_example/spreadsheet_doc_2_sheets_with_string_pool.cpp + :language: C++ + :start-after: //!code-start: my_shared_strings + :end-before: //!code-end: my_shared_strings + +Note that some import filters may use the +:cpp:func:`~orcus::spreadsheet::iface::import_shared_strings::append_segment` +and +:cpp:func:`~orcus::spreadsheet::iface::import_shared_strings::commit_segments` +combination even for unformatted strings. Because of this, you still need to +implement these two methods even if raw string values are all you care about. + +Note also that the container storing the string values is held by reference. The +source container will be owned by ``my_import_factory``, which will also be the +owner of the ``my_shared_strings`` instance. Shown below is the modified +version of ``my_import_factory`` that provides the shared string interface: + +.. 
literalinclude:: ../../doc_example/spreadsheet_doc_2_sheets_with_string_pool.cpp + :language: C++ + :start-after: //!code-start: my_import_factory + :end-before: //!code-end: my_import_factory + +The shared string store is also passed to each sheet instance, and we'll use +that to fetch the string values from their respective string indices. + +Let's put this all together: + +.. literalinclude:: ../../doc_example/spreadsheet_doc_2_sheets_with_string_pool.cpp + :language: C++ + +The sheet class is largely unchanged except for one thing; it now takes a +reference to the string pool and print the actual string value alongside the +string index associated with it. When you execute this code, you'll see the +following output when loading the same sheet: + +.. code-block:: text + + (sheet: 0; row: 0; col: 0): string index = 0 (ID) + (sheet: 0; row: 0; col: 1): string index = 1 (First Name) + (sheet: 0; row: 0; col: 2): string index = 2 (Last Name) + (sheet: 0; row: 0; col: 3): string index = 3 (Age) + (sheet: 0; row: 1; col: 0): value = 1 + (sheet: 0; row: 1; col: 1): string index = 5 (Thia) + (sheet: 0; row: 1; col: 2): string index = 6 (Beauly) + (sheet: 0; row: 1; col: 3): value = 35 + (sheet: 0; row: 2; col: 0): value = 2 + (sheet: 0; row: 2; col: 1): string index = 9 (Pepito) + (sheet: 0; row: 2; col: 2): string index = 10 (Resun) + (sheet: 0; row: 2; col: 3): value = 56 + (sheet: 0; row: 3; col: 0): value = 3 + (sheet: 0; row: 3; col: 1): string index = 13 (Emera) + (sheet: 0; row: 3; col: 2): string index = 14 (Gravey) + (sheet: 0; row: 3; col: 3): value = 6 + (sheet: 0; row: 4; col: 0): value = 4 + (sheet: 0; row: 4; col: 1): string index = 17 (Erinn) + (sheet: 0; row: 4; col: 2): string index = 18 (Flucks) + (sheet: 0; row: 4; col: 3): value = 65 + (sheet: 0; row: 5; col: 0): value = 5 + (sheet: 0; row: 5; col: 1): string index = 21 (Giusto) + (sheet: 0; row: 5; col: 2): string index = 22 (Bambury) + (sheet: 0; row: 5; col: 3): value = 88 + (sheet: 0; row: 6; 
col: 0): value = 6 + (sheet: 0; row: 6; col: 1): string index = 25 (Neall) + (sheet: 0; row: 6; col: 2): string index = 26 (Scorton) + (sheet: 0; row: 6; col: 3): value = 90 + (sheet: 0; row: 7; col: 0): value = 7 + (sheet: 0; row: 7; col: 1): string index = 29 (Ervin) + (sheet: 0; row: 7; col: 2): string index = 30 (Foreman) + (sheet: 0; row: 7; col: 3): value = 80 + (sheet: 0; row: 8; col: 0): value = 8 + (sheet: 0; row: 8; col: 1): string index = 33 (Shoshana) + (sheet: 0; row: 8; col: 2): string index = 34 (Bohea) + (sheet: 0; row: 8; col: 3): value = 66 + (sheet: 0; row: 9; col: 0): value = 9 + (sheet: 0; row: 9; col: 1): string index = 37 (Gladys) + (sheet: 0; row: 9; col: 2): string index = 38 (Somner) + (sheet: 0; row: 9; col: 3): value = 14 + (sheet: 0; row: 10; col: 0): value = 10 + (sheet: 0; row: 10; col: 1): string index = 41 (Ephraim) + (sheet: 0; row: 10; col: 2): string index = 42 (Russell) + (sheet: 0; row: 10; col: 3): value = 23 + +The string indices now increment nicely, and their respective string values +look correct. + +Now, let's turn our attention to the second sheet, which contains formulas. +First, here is what the second sheet looks like: + +.. figure:: /_static/images/overview/multi-sheets-sheet2.png + +It contains a simple table extending from A1 to C9. It consists of three +columns and the first row is a header row. Cells in the first and second +columns contain simple numbers and the third column contains formulas that +simply add the two numbers to their left in the same row. When loading this +sheet using the last code we used above, you'll see the following output: + +.. 
code-block:: text + + (sheet: 1; row: 0; col: 0): string index = 44 (X) + (sheet: 1; row: 0; col: 1): string index = 45 (Y) + (sheet: 1; row: 0; col: 2): string index = 46 (X + Y) + (sheet: 1; row: 1; col: 0): value = 18 + (sheet: 1; row: 1; col: 1): value = 79 + (sheet: 1; row: 2; col: 0): value = 48 + (sheet: 1; row: 2; col: 1): value = 55 + (sheet: 1; row: 3; col: 0): value = 99 + (sheet: 1; row: 3; col: 1): value = 35 + (sheet: 1; row: 4; col: 0): value = 41 + (sheet: 1; row: 4; col: 1): value = 69 + (sheet: 1; row: 5; col: 0): value = 5 + (sheet: 1; row: 5; col: 1): value = 18 + (sheet: 1; row: 6; col: 0): value = 46 + (sheet: 1; row: 6; col: 1): value = 69 + (sheet: 1; row: 7; col: 0): value = 36 + (sheet: 1; row: 7; col: 1): value = 67 + (sheet: 1; row: 8; col: 0): value = 78 + (sheet: 1; row: 8; col: 1): value = 2 + +Everything looks fine except that the formula cells in C2:C9 are not loaded at +all. This is because, in order to receive formula cell data, you must +implement the required :cpp:class:`~orcus::spreadsheet::iface::import_formula` +interface, which we will cover in the next section. + + +Implement formula interface +--------------------------- + +In this section we will extend the code from the previous section in order to +receive and process formula cell values from the sheet. We will need to make +quite a few changes. Let's go over this one thing at a time. First, we are +adding a new cell value type ``formula``: + +.. literalinclude:: ../../doc_example/spreadsheet_doc_2_sheets_with_formula.cpp + :language: C++ + :start-after: //!code-start: cell_value_type + :end-before: //!code-end: cell_value_type + +which should not come as a surprise. 
+ +We are not making any change to the ``cell_value`` struct itself, but we are +re-using its ``index`` member for a formula cell value such that, if the cell +stores a formula, the index will refer to its actual formula data which will +be stored in a separate data store, much like how strings are stored +externally and referenced by their indices in the ``cell_value`` instances. + +We are also adding a brand-new class called ``cell_grid``, to add an extra +layer over the raw cell value array: + +.. literalinclude:: ../../doc_example/spreadsheet_doc_2_sheets_with_formula.cpp + :language: C++ + :start-after: //!code-start: cell_grid + :end-before: //!code-end: cell_grid + +Each sheet instance will own one instance of ``cell_grid``, and the formula +interface class instance will hold a reference to it and use it to insert +formula cell values into it. The same sheet instance will also hold a formula +value store, and pass its reference to the formula interface class. + +The formula interface class must implement the following methods: + +* :cpp:func:`~orcus::spreadsheet::iface::import_formula::set_position` +* :cpp:func:`~orcus::spreadsheet::iface::import_formula::set_formula` +* :cpp:func:`~orcus::spreadsheet::iface::import_formula::set_shared_formula_index` +* :cpp:func:`~orcus::spreadsheet::iface::import_formula::set_result_string` +* :cpp:func:`~orcus::spreadsheet::iface::import_formula::set_result_value` +* :cpp:func:`~orcus::spreadsheet::iface::import_formula::set_result_empty` +* :cpp:func:`~orcus::spreadsheet::iface::import_formula::set_result_bool` +* :cpp:func:`~orcus::spreadsheet::iface::import_formula::commit` + +Depending on the type of a formula cell, and depending on the format of the +document, some methods may not be called. The +:cpp:func:`~orcus::spreadsheet::iface::import_formula::set_position` method +always gets called regardless of the formula cell type, to specify the +position of the formula cell. 
The +:cpp:func:`~orcus::spreadsheet::iface::import_formula::set_formula` gets +called for a formula cell that does not share its formula expression with any +other formula cells, or a formula cell that shares its formula expression with +a group of other formula cells and is the primary cell of that group. If it's +the primary cell of a group of formula cells, the +:cpp:func:`~orcus::spreadsheet::iface::import_formula::set_shared_formula_index` +method also gets called to receive the identifier value of that group. All +formula cells belonging to the same group receive the same identifier value +via +:cpp:func:`~orcus::spreadsheet::iface::import_formula::set_shared_formula_index`, +but only the primary cell of a group receives the formula expression string +via :cpp:func:`~orcus::spreadsheet::iface::import_formula::set_formula`. The +rest of the methods - +:cpp:func:`~orcus::spreadsheet::iface::import_formula::set_result_string`, +:cpp:func:`~orcus::spreadsheet::iface::import_formula::set_result_value`, +:cpp:func:`~orcus::spreadsheet::iface::import_formula::set_result_empty` and +:cpp:func:`~orcus::spreadsheet::iface::import_formula::set_result_bool` - are +called to deliver the cached formula cell value when applicable. + +The :cpp:func:`~orcus::spreadsheet::iface::import_formula::commit` method gets +called at the very end to let the implementation commit the formula cell data +to the backend document store. + +Without further ado, here is the formula interface implementation that we will +use: + +.. literalinclude:: ../../doc_example/spreadsheet_doc_2_sheets_with_formula.cpp + :language: C++ + :start-after: //!code-start: my_formula + :end-before: //!code-end: my_formula + +and here is the definition of the ``formula`` struct that stores a formula expression +string as well as its grammar type: + +.. 
literalinclude:: ../../doc_example/spreadsheet_doc_2_sheets_with_formula.cpp + :language: C++ + :start-after: //!code-start: formula + :end-before: //!code-end: formula + +Note that since we are loading an OpenDocument Spreadsheet file (.ods) which +does not support shared formulas, we do not need to handle the +:cpp:func:`~orcus::spreadsheet::iface::import_formula::set_shared_formula_index` +method. Likewise, we are leaving the ``set_result_*`` methods unhandled for +now. + +This interface class also stores references to ``cell_grid`` and +``std::vector<formula>`` instances, both of which are passed from the parent +sheet instance. + +We also need to make a few changes to the sheet interface class to provide a formula interface +and add a formula value store: + +.. literalinclude:: ../../doc_example/spreadsheet_doc_2_sheets_with_formula.cpp + :language: C++ + :start-after: //!code-start: my_sheet + :end-before: //!code-end: my_sheet + +We've added the +:cpp:func:`~orcus::spreadsheet::iface::import_sheet::get_formula` method which +returns a pointer to the ``my_formula`` class instance defined above. The +rest of the code is unchanged. + +Now let's see what happens when loading the same sheet from the previous +section: + +.. 
code-block:: text + + (sheet: 1; row: 0; col: 0): string index = 44 (X) + (sheet: 1; row: 0; col: 1): string index = 45 (Y) + (sheet: 1; row: 0; col: 2): string index = 46 (X + Y) + (sheet: 1; row: 1; col: 0): value = 18 + (sheet: 1; row: 1; col: 1): value = 79 + (sheet: 1; row: 2; col: 0): value = 48 + (sheet: 1; row: 2; col: 1): value = 55 + (sheet: 1; row: 3; col: 0): value = 99 + (sheet: 1; row: 3; col: 1): value = 35 + (sheet: 1; row: 4; col: 0): value = 41 + (sheet: 1; row: 4; col: 1): value = 69 + (sheet: 1; row: 5; col: 0): value = 5 + (sheet: 1; row: 5; col: 1): value = 18 + (sheet: 1; row: 6; col: 0): value = 46 + (sheet: 1; row: 6; col: 1): value = 69 + (sheet: 1; row: 7; col: 0): value = 36 + (sheet: 1; row: 7; col: 1): value = 67 + (sheet: 1; row: 8; col: 0): value = 78 + (sheet: 1; row: 8; col: 1): value = 2 + (sheet: 1; row: 1; col: 2): formula = [.A2]+[.B2] (ods) + (sheet: 1; row: 2; col: 2): formula = [.A3]+[.B3] (ods) + (sheet: 1; row: 3; col: 2): formula = [.A4]+[.B4] (ods) + (sheet: 1; row: 4; col: 2): formula = [.A5]+[.B5] (ods) + (sheet: 1; row: 5; col: 2): formula = [.A6]+[.B6] (ods) + (sheet: 1; row: 6; col: 2): formula = [.A7]+[.B7] (ods) + (sheet: 1; row: 7; col: 2): formula = [.A8]+[.B8] (ods) + (sheet: 1; row: 8; col: 2): formula = [.A9]+[.B9] (ods) + +Looks like we are getting the formula cell values this time around. + +One thing to note is that the formula expression strings you see here follow +the syntax defined in the OpenFormula specifications, which is the formula syntax +used in the OpenDocument Spreadsheet format. + + +Implement more interfaces +------------------------- + +This section has covered only a part of the available spreadsheet interfaces +you can implement in your code. Refer to the :ref:`spreadsheet-interfaces` +section to see the complete list of interfaces. 
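As a compact recap of the formula import flow described earlier in this document, the sketch below mimics the call sequence (position, formula expression, commit) with plain C++ and no orcus dependency. Every name in it (``grammar_t``, ``formula``, ``toy_formula_importer``, ``import_demo``) is a hypothetical stand-in rather than part of the orcus API, and storing the cell position inside ``formula`` is a simplification made only for this sketch.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical grammar tag; the real example stores orcus's own grammar type.
enum class grammar_t { unknown, ods, xlsx };

// Simplified formula record. Keeping row/col here is an assumption of this
// sketch, not something the original example is known to do.
struct formula
{
    int row;
    int col;
    std::string expression;
    grammar_t grammar;
};

// Stand-in for a formula interface implementation: it buffers the data for
// one formula cell and pushes it to the store only when commit() is called.
class toy_formula_importer
{
    std::vector<formula>& m_store;
    formula m_current{};

public:
    explicit toy_formula_importer(std::vector<formula>& store) : m_store(store) {}

    void set_position(int row, int col)
    {
        m_current.row = row;
        m_current.col = col;
    }

    void set_formula(grammar_t grammar, std::string expression)
    {
        m_current.grammar = grammar;
        m_current.expression = std::move(expression);
    }

    // Called last for each formula cell; resets the buffer for the next one.
    void commit()
    {
        m_store.push_back(m_current);
        m_current = formula{};
    }
};

// Plays the role of the import filter by issuing the calls in order.
std::vector<formula> import_demo()
{
    std::vector<formula> store;
    toy_formula_importer importer(store);

    importer.set_position(1, 2);
    importer.set_formula(grammar_t::ods, "[.A2]+[.B2]");
    importer.commit();

    importer.set_position(2, 2);
    importer.set_formula(grammar_t::ods, "[.A3]+[.B3]");
    importer.commit();

    return store;
}
```

Calling ``import_demo()`` returns two records, one per committed formula cell. Buffering until ``commit()`` keeps a half-described cell from ever reaching the store, which is the reason the commit step exists in this sketch.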
diff --git a/doc/overview/index.rst b/doc/overview/index.rst new file mode 100644 index 0000000..0b95f8b --- /dev/null +++ b/doc/overview/index.rst @@ -0,0 +1,95 @@ + +.. highlight:: cpp + +Overview +======== + +Composition of the library +-------------------------- + +The primary goal of the orcus library is to provide a framework to import the +contents of documents stored in various spreadsheet or spreadsheet-like +formats. The library also provides several low-level parsers that can be used +independently of the spreadsheet-related features if so desired. In addition, +the library also provides support for some hierarchical documents, such as JSON +and YAML, which were a later addition to the library. + +You can use this library either through its C++ API, Python API, or CLI. However, +not all three methods equally expose all features of the library, and the C++ API +is more complete than the other two. + +The library is physically split into four parts: + + 1. the parser part that provides the aforementioned low-level parsers, + 2. the filter part that provides higher level import filters for spreadsheet + and hierarchical documents that internally use the low-level parsers, + 3. the spreadsheet document model part that includes the document model suitable + for storing spreadsheet document contents, and + 4. CLI for loading and converting spreadsheet and hierarchical documents. + +If you need to just use the parser part of the library, you need to only link +against the ``liborcus-parser`` library file. If you need to use the import +filter part, link against both the ``liborcus-parser`` and the ``liborcus`` +libraries. Likewise, if you need to use the spreadsheet document model part, +link against the aforementioned two plus the ``liborcus-spreadsheet-model`` +library. 
+ +Also note that the spreadsheet document model part has an additional dependency on +the `ixion library <https://gitlab.com/ixion/ixion>`_ for handling formula +re-calculations on document load. + + +Loading spreadsheet documents +----------------------------- + +The orcus library's primary aim is to provide a framework to import the contents +of documents stored in various spreadsheet, or spreadsheet-like formats. It +supports two primary use cases. The first use case is where the client +program does not have its own document model, but needs to import data from a +spreadsheet-like document file and access its content without implementing its +own document store from scratch. In this particular use case, you can simply +use the :cpp:class:`~orcus::spreadsheet::document` class to get it populated, +and access its content through its API afterward. + +The second use case, which is a bit more advanced, is where the client program +already has its own internal document model, and needs to use orcus +to populate its document model. In this particular use case, you can +implement your own set of classes that support necessary interfaces, and pass +them to the orcus import filter. + +For each document type that orcus supports, there is a top-level import filter +class that serves as an entry point for loading the content of a document you +wish to load. You don't pass your document to this filter directly; instead, +you wrap your document with what we call an **import factory**, then pass this +factory instance to the loader. This import factory is then required to +implement necessary interfaces that the filter class uses in order for it +to pass data to the document as the file is getting parsed. + +When using orcus's own document model, you can simply use orcus's own import +factory implementation to wrap its document. When using your own document +model, on the other hand, you'll need to implement your own set of interface +classes to wrap your document with. 
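The relationship between the document, the import factory and the filter can be pictured with the following dependency-free sketch. It is only an illustration of the wrapping pattern: ``sheet_sink``, ``factory_iface``, ``my_document``, ``my_sheet``, ``my_factory``, ``toy_filter`` and ``load_demo`` are all hypothetical names invented here, and the real orcus interfaces are considerably richer.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interfaces that the filter talks to (NOT the orcus ones).
struct sheet_sink
{
    virtual ~sheet_sink() = default;
    virtual void set_value(int row, int col, double value) = 0;
};

struct factory_iface
{
    virtual ~factory_iface() = default;
    virtual sheet_sink* append_sheet(const std::string& name) = 0;
};

// The client's own document model: sheet name -> (row, col) -> value.
using cell_map = std::map<std::pair<int, int>, double>;

struct my_document
{
    std::map<std::string, cell_map> sheets;
};

// Adapter that forwards cell values into one sheet of the document.
class my_sheet : public sheet_sink
{
    cell_map& m_cells;

public:
    explicit my_sheet(cell_map& cells) : m_cells(cells) {}

    void set_value(int row, int col, double value) override
    {
        m_cells[std::make_pair(row, col)] = value;
    }
};

// The "import factory": wraps the document and hands out sheet adapters.
class my_factory : public factory_iface
{
    my_document& m_doc;
    std::vector<std::unique_ptr<my_sheet>> m_sheets;

public:
    explicit my_factory(my_document& doc) : m_doc(doc) {}

    sheet_sink* append_sheet(const std::string& name) override
    {
        m_sheets.push_back(std::make_unique<my_sheet>(m_doc.sheets[name]));
        return m_sheets.back().get();
    }
};

// Stand-in for an import filter. It never sees my_document, only the factory.
class toy_filter
{
    factory_iface& m_factory;

public:
    explicit toy_filter(factory_iface& factory) : m_factory(factory) {}

    // Pretends to parse a file by emitting two hard-coded cell values.
    void read_fake_file()
    {
        sheet_sink* sheet = m_factory.append_sheet("Sheet1");
        sheet->set_value(0, 0, 18.0);
        sheet->set_value(0, 1, 79.0);
    }
};

my_document load_demo()
{
    my_document doc;
    my_factory factory(doc);
    toy_filter filter(factory);
    filter.read_fake_file();
    return doc;
}
```

After ``load_demo()`` returns, the document holds one sheet with two cells. Because the filter depends only on the interfaces, the same filter code could populate a completely different document model, which is the point of the indirection.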
+ +The following sections describe how to load a spreadsheet document by using 1) +orcus's own spreadsheet document class, and 2) a user-defined custom document +class. + +.. toctree:: + :maxdepth: 1 + + doc-orcus.rst + doc-user.rst + + +Loading hierarchical documents +------------------------------ + +The orcus library also includes support for hierarchical document types such +as JSON and YAML. The following sections delve more into the support for +these types of documents. + +.. toctree:: + :maxdepth: 1 + + json.rst + yaml.rst diff --git a/doc/overview/json.rst b/doc/overview/json.rst new file mode 100644 index 0000000..0e252f9 --- /dev/null +++ b/doc/overview/json.rst @@ -0,0 +1,353 @@ + +.. highlight:: cpp + +JSON +==== + +The JSON part of orcus consists of a low-level parser class that handles +parsing of JSON strings, and a high-level document class that stores parsed +JSON structures as a node tree. + +There are two approaches to processing JSON strings using the orcus library. +One approach is to utilize the :cpp:class:`~orcus::json::document_tree` class +to load and populate the JSON structure tree via its +:cpp:func:`~orcus::json::document_tree::load()` method and traverse the tree +through its :cpp:func:`~orcus::json::document_tree::get_document_root()` method. +This approach is ideal if you want a quick way to parse and access the content +of a JSON document with minimal effort. + +Another approach is to use the low-level :cpp:class:`~orcus::json_parser` +class directly by providing your own handler class to receive callbacks from +the parser. This method requires a bit more effort on your part to provide +and populate your own data structure, but if you already have a data structure +to store the content of JSON, then this approach is ideal. The +:cpp:class:`~orcus::json::document_tree` class internally uses +:cpp:class:`~orcus::json_parser` to parse JSON contents. 
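To make the second approach more concrete, here is a minimal, dependency-free sketch of the handler-callback pattern. It is not the orcus parser: ``toy_handler``, ``toy_array_parser``, ``collecting_handler`` and ``collect_numbers`` are hypothetical names, and the toy parser only understands a flat array of numbers such as ``[1, 2.5, 3]`` (it also tolerates a trailing comma, which real JSON does not allow).

```cpp
#include <cctype>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base handler whose callbacks all do nothing. Deriving from it
// lets a handler implement only the callbacks it cares about.
struct toy_handler
{
    void begin_array() {}
    void end_array() {}
    void number(double) {}
};

// Toy parser for a flat array of numbers. The handler type is a template
// parameter, so callbacks are resolved at compile time in this sketch.
template<typename Handler>
class toy_array_parser
{
    std::string m_src;
    Handler& m_handler;

    static void skip_ws(const char*& p)
    {
        while (std::isspace(static_cast<unsigned char>(*p)))
            ++p;
    }

public:
    toy_array_parser(std::string src, Handler& handler) :
        m_src(std::move(src)), m_handler(handler) {}

    void parse()
    {
        const char* p = m_src.c_str();
        skip_ws(p);
        if (*p != '[')
            throw std::runtime_error("expected '['");
        ++p;
        m_handler.begin_array();
        skip_ws(p);

        while (*p != ']')
        {
            char* end = nullptr;
            double v = std::strtod(p, &end);
            if (end == p)
                throw std::runtime_error("expected a number");
            m_handler.number(v); // hand the value to the client's handler
            p = end;
            skip_ws(p);
            if (*p == ',')
            {
                ++p;
                skip_ws(p);
            }
            else if (*p != ']')
                throw std::runtime_error("expected ',' or ']'");
        }

        m_handler.end_array();
    }
};

// Client handler that stores numbers in its own data structure.
struct collecting_handler : toy_handler
{
    std::vector<double> values;

    void number(double v) { values.push_back(v); }
};

std::vector<double> collect_numbers(const std::string& json)
{
    collecting_handler handler;
    toy_array_parser<collecting_handler> parser(json, handler);
    parser.parse();
    return handler.values;
}
```

With this sketch, ``collect_numbers("[1, 2.5, 3]")`` fills the vector directly from the callbacks without building any intermediate tree, which is the trade-off the low-level approach offers.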
+ + +Populating a document tree from JSON string +------------------------------------------- + +The following code snippet shows an example of how to populate an instance of +:cpp:class:`~orcus::json::document_tree` from a JSON string, and navigate its +content tree afterward. + +.. literalinclude:: ../../doc_example/json_doc_1.cpp + :language: C++ + +You'll see the following output when executing this code: + +.. code-block:: text + + name: John Doe + occupation: Software Engineer + score: + - 89 + - 67 + - 90 + + +Using the low-level parser +-------------------------- + +The following code snippet shows how to use the low-level :cpp:class:`~orcus::json_parser` +class by providing your own handler class and passing it as a template argument: + +.. literalinclude:: ../../doc_example/json_parser_1.cpp + :language: C++ + +The parser constructor expects the char array, its length, and the handler +instance. The base handler class :cpp:class:`~orcus::json_handler` implements +all required handler methods. By inheriting from it, you only need to +implement the handler methods you need. In this example, we are only +implementing the :cpp:func:`~orcus::json_handler::object_key`, +:cpp:func:`~orcus::json_handler::string`, and :cpp:func:`~orcus::json_handler::number` +methods to process object key values, string values and numeric values, +respectively. Refer to the :cpp:class:`~orcus::json_handler` class definition +for all available handler methods. + +Executing this code will generate the following output: + +.. code-block:: text + + JSON string: {"key1": [1,2,3,4,5], "key2": 12.3} + object key: key1 + number: 1 + number: 2 + number: 3 + number: 4 + number: 5 + object key: key2 + number: 12.3 + + +Building a document tree directly +--------------------------------- + +You can also create and populate a JSON document tree directly without needing +to parse a JSON string. This approach is ideal if you want to create a JSON +tree from scratch and export it as a string. 
The following series of code +snippets demonstrates exactly how to build JSON document trees directly and +export their contents as JSON strings. + +The first example shows how to initialize the tree with a simple array: + +.. literalinclude:: ../../doc_example/json_doc_2.cpp + :language: C++ + :start-after: //!code-start: root list + :end-before: //!code-end: root list + +You can simply specify the content of the array via an initializer list and +assign it to the document. The :cpp:func:`~orcus::json::document_tree::dump()` +method then turns the content into a single string instance, which looks like +the following: + +.. code-block:: text + + [ + 1, + 2, + "string value", + false, + null + ] + +If you need to build an array of arrays, do the following: + +.. literalinclude:: ../../doc_example/json_doc_2.cpp + :language: C++ + :start-after: //!code-start: list nested + :end-before: //!code-end: list nested + +This will create an array of two nested child arrays with three values each. +Dumping the content of the tree as a JSON string will produce something like +the following: + +.. code-block:: text + + [ + [ + true, + false, + null + ], + [ + 1.1, + 2.2, + "text" + ] + ] + +Creating an object can be done by nesting one or more key-value pairs, each of +which is surrounded by a pair of curly braces, inside another pair of curly +braces. For example, the following code: + +.. literalinclude:: ../../doc_example/json_doc_2.cpp + :language: C++ + :start-after: //!code-start: list object + :end-before: //!code-end: list object + +produces the following output: + +.. code-block:: text + + { + "key1": 1.2, + "key2": "some text" + } + +indicating that the tree consists of a single object having two key-value +pairs. + +You may notice that this syntax is identical to the syntax for +creating an array of arrays as shown above. 
In fact, in order for this to be +an object, each of the inner sequences must have exactly two values, and its +first value must be a string value. Failing that, it will be interpreted as +an array of arrays. + +As with arrays, nesting of objects is also supported. The following code: + +.. literalinclude:: ../../doc_example/json_doc_2.cpp + :language: C++ + :start-after: //!code-start: list object 2 + :end-before: //!code-end: list object 2 + +creates a root object having two key-value pairs, one of which contains +another object having three key-value pairs, as evident in the following output +generated by this code: + +.. code-block:: text + + { + "parent1": { + "child1": true, + "child2": false, + "child3": 123.4 + }, + "parent2": "not-nested" + } + +There is one caveat that you need to be aware of because of this special +object creation syntax. When you have a nested array that contains exactly +two values and the first value is a string value, you must explicitly declare +that as an array by using an :cpp:class:`~orcus::json::array` class instance. +For instance, this code: + +.. literalinclude:: ../../doc_example/json_doc_2.cpp + :language: C++ + :start-after: //!code-start: array ambiguous + :end-before: //!code-end: array ambiguous + +is intended to be an object containing an array. However, because the supposed +inner array contains exactly two values and the first value is a string +value, which could be interpreted as a key-value pair for the outer object, it +ends up being too ambiguous and a :cpp:class:`~orcus::json::key_value_error` +exception gets thrown as a result. + +To work around this ambiguity, you need to declare the inner array to be +explicit by using an :cpp:class:`~orcus::json::array` instance: + +.. 
literalinclude:: ../../doc_example/json_doc_2.cpp + :language: C++ + :start-after: //!code-start: array explicit + :end-before: //!code-end: array explicit + +This code now correctly generates a root object containing one key-value pair +whose value is an array: + +.. code-block:: text + + { + "array": [ + "one", + 987 + ] + } + +A similar ambiguity issue arises when you want to construct a tree consisting +only of an empty root object. You may be tempted to write something like +this: + +.. literalinclude:: ../../doc_example/json_doc_2.cpp + :language: C++ + :start-after: //!code-start: object ambiguous a + :end-before: //!code-end: object ambiguous a + +However, this will result in leaving the tree entirely unpopulated, i.e. the +tree will not even have a root node! If you continue on and try to get a root +node from this tree, you'll get a :cpp:class:`~orcus::json::document_error` +thrown as a result. If you inspect the error message stored in the exception: + +.. literalinclude:: ../../doc_example/json_doc_2.cpp + :language: C++ + :start-after: //!code-start: object ambiguous b + :end-before: //!code-end: object ambiguous b + +you will get + +.. code-block:: text + + json::document_error: document tree is empty + +giving you further proof that the tree is indeed empty! The solution here is +to directly assign an instance of :cpp:class:`~orcus::json::object` to the +document tree, which will initialize the tree with an empty root object. The +following code: + +.. literalinclude:: ../../doc_example/json_doc_2.cpp + :language: C++ + :start-after: //!code-start: object explicit 1 + :end-before: //!code-end: object explicit 1 + +will therefore generate + +.. code-block:: text + + { + } + +You can also use the :cpp:class:`~orcus::json::object` class instances to +indicate empty objects anywhere in the tree. For instance, this code: + +.. 
literalinclude:: ../../doc_example/json_doc_2.cpp + :language: C++ + :start-after: //!code-start: object explicit 2 + :end-before: //!code-end: object explicit 2 + +is intended to create an array containing three empty objects as its elements, +and that's exactly what it does: + +.. code-block:: text + + [ + { + }, + { + }, + { + } + ] + +So far all the examples have shown how to initialize the document tree as the +tree itself is being constructed. But our next example shows how to add +new key-value pairs to existing objects after the document tree instance has +been initialized. + +.. literalinclude:: ../../doc_example/json_doc_2.cpp + :language: C++ + :start-after: //!code-start: root object add child + :end-before: //!code-end: root object add child + +This code first initializes the tree with an empty object, then retrieves the +root empty object and assigns several key-value pairs to it. When converting +the tree content to a string and inspecting it you'll see something like the +following: + +.. code-block:: text + + { + "child array": [ + 1.1, + 1.2, + true + ], + "child1": 1, + "child3": [ + true, + false + ], + "child2": "string", + "child object": { + "key1": 100, + "key2": 200 + } + } + +The next example shows how to append values to an existing array after the +tree has been constructed. Let's take a look at the code: + +.. literalinclude:: ../../doc_example/json_doc_2.cpp + :language: C++ + :start-after: //!code-start: root array add child + :end-before: //!code-end: root array add child + +Like the previous example, this code first initializes the tree but this time +with an empty array as its root, retrieves the root array, then appends +several values to it via its :cpp:func:`~orcus::json::node::push_back` method. + +When you dump the content of this tree as a JSON string you'll get something +like this: + +.. 
code-block:: text + + [ + -1.2, + "string", + true, + null, + { + "key1": 1.1, + "key2": 1.2 + } + ] + diff --git a/doc/overview/yaml.rst b/doc/overview/yaml.rst new file mode 100644 index 0000000..4109cf2 --- /dev/null +++ b/doc/overview/yaml.rst @@ -0,0 +1,8 @@ + +.. highlight:: cpp + +YAML +==== + +TBD + |