.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

**********
Timestamps
**********

Arrow/Pandas Timestamps
=======================

Arrow timestamps are stored as a 64-bit integer with column metadata to
associate a time unit (e.g. milliseconds, microseconds, or nanoseconds) and an
optional time zone. Pandas (``Timestamp``) uses a 64-bit integer representing
nanoseconds and an optional time zone.
Python/Pandas timestamp types without an associated time zone are referred to
as "Time Zone Naive". Python/Pandas timestamp types with an associated time
zone are referred to as "Time Zone Aware".


Timestamp Conversions
=====================

Pandas/Arrow ⇄ Spark
--------------------

Spark stores timestamps as 64-bit integers representing microseconds since
the UNIX epoch. It does not store any metadata about time zones with its
timestamps.

Spark interprets timestamps using the *session local time zone* (i.e.
``spark.sql.session.timeZone``). If that time zone is undefined, Spark falls
back to the default system time zone. For simplicity's sake, in the examples
below the session local time zone is always defined.

This implies a few things when round-tripping timestamps:

#. Time zone information is lost (all timestamps that result from
   converting from Spark to Arrow/Pandas are "time zone naive").
#. Timestamps are truncated to microseconds.
#. The session time zone might have unintuitive impacts on the
   translation of timestamp values.

Pandas to Spark (through Apache Arrow)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following cases assume the Spark configuration
``spark.sql.execution.arrow.enabled`` is set to ``"true"`` and that the
examples run in a PySpark shell, which provides the ``spark`` and
``sqlContext`` entry points:

::

    >>> from datetime import datetime, timedelta, timezone
    >>> import pandas as pd
    >>> from pandas import Timestamp

    >>> pdf = pd.DataFrame({'naive': [datetime(2019, 1, 1, 0)],
    ...                     'aware': [Timestamp(year=2019, month=1, day=1,
    ...                               nanosecond=500, tz=timezone(timedelta(hours=-8)))]})
    >>> pdf
           naive                               aware
    0 2019-01-01 2019-01-01 00:00:00.000000500-08:00

    >>> spark.conf.set("spark.sql.session.timeZone", "UTC")
    >>> utc_df = sqlContext.createDataFrame(pdf)
    >>> utc_df.show()
    +-------------------+-------------------+
    |              naive|              aware|
    +-------------------+-------------------+
    |2019-01-01 00:00:00|2019-01-01 08:00:00|
    +-------------------+-------------------+

Note that the aware timestamp is shifted on conversion to reflect the time
assuming UTC (it represents the same instant in time). Naive timestamps are
treated as being in the session local time zone and are likewise converted to
UTC. Recall that internally, the schema of a Spark DataFrame does not store
any time zone information with its timestamps.
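To convince yourself that the shifted display really is the same instant, you
can check it in pandas alone, without Spark (a minimal sketch reusing the
``pdf`` frame from above). ``tz_convert`` changes only the time zone in which
the value is displayed, not the underlying instant:

::

    >>> pdf['aware'][0].tz_convert('UTC')
    Timestamp('2019-01-01 08:00:00.000000500+0000', tz='UTC')
    >>> pdf['aware'][0].tz_convert('UTC').timestamp() == pdf['aware'][0].timestamp()
    True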
Now if the session time zone is set to US Pacific Time (PST), we don't
see any shift in the display of the aware timestamp (it
still represents the same instant in time):

::

    >>> spark.conf.set("spark.sql.session.timeZone", "US/Pacific")
    >>> pst_df = sqlContext.createDataFrame(pdf)
    >>> pst_df.show()
    +-------------------+-------------------+
    |              naive|              aware|
    +-------------------+-------------------+
    |2019-01-01 00:00:00|2019-01-01 00:00:00|
    +-------------------+-------------------+

Looking again at ``utc_df.show()``, we see one of the impacts of the session
time zone. Because the naive timestamp was initially converted assuming UTC,
the instant it reflects is actually earlier than the naive timestamp in the
PST-converted data frame:

::

    >>> utc_df.show()
    +-------------------+-------------------+
    |              naive|              aware|
    +-------------------+-------------------+
    |2018-12-31 16:00:00|2019-01-01 00:00:00|
    +-------------------+-------------------+

Spark to Pandas
~~~~~~~~~~~~~~~

We can observe what happens when converting back to Arrow/Pandas. Assume the
session time zone is still PST:

::

    >>> pst_df.show()
    +-------------------+-------------------+
    |              naive|              aware|
    +-------------------+-------------------+
    |2019-01-01 00:00:00|2019-01-01 00:00:00|
    +-------------------+-------------------+

    >>> pst_df.toPandas()
           naive      aware
    0 2019-01-01 2019-01-01
    >>> pst_df.toPandas().info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1 entries, 0 to 0
    Data columns (total 2 columns):
    naive    1 non-null datetime64[ns]
    aware    1 non-null datetime64[ns]
    dtypes: datetime64[ns](2)
    memory usage: 96.0 bytes

Notice that, in addition to being "time zone naive", the 'aware' value now
maps to a different epoch offset. Spark does the conversion by first
converting to the session time zone (or the system local time zone if the
session time zone isn't set) and then localizing to remove the time zone
information. This results in a timestamp that is 8 hours earlier than the
original:

::

    >>> pst_df.toPandas()['aware'][0]
    Timestamp('2019-01-01 00:00:00')
    >>> pdf['aware'][0]
    Timestamp('2019-01-01 00:00:00.000000500-0800', tz='UTC-08:00')
    >>> (pst_df.toPandas()['aware'][0].timestamp()-pdf['aware'][0].timestamp())/3600
    -8.0
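Spark's "convert to the session time zone, then drop the time zone" behavior
can be imitated in pandas alone (a sketch for illustration only; note that
pandas keeps the 500 nanoseconds that Spark's microsecond representation
truncates):

::

    >>> pdf['aware'][0].tz_convert('US/Pacific').tz_localize(None)
    Timestamp('2019-01-01 00:00:00.000000500')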
The same type of conversion happens with the data frame converted while
the session time zone was UTC. In this case the naive and aware values even
represent different instants in time from each other (the naive value differs
because of the change in session time zone between creating the two data
frames):

::

    >>> utc_df.show()
    +-------------------+-------------------+
    |              naive|              aware|
    +-------------------+-------------------+
    |2018-12-31 16:00:00|2019-01-01 00:00:00|
    +-------------------+-------------------+

    >>> utc_df.toPandas()
                    naive      aware
    0 2018-12-31 16:00:00 2019-01-01

Note that the surprising shift of the aware values doesn't happen
when the session time zone is UTC (but the timestamps
still become "time zone naive"):

::

    >>> spark.conf.set("spark.sql.session.timeZone", "UTC")
    >>> pst_df.show()
    +-------------------+-------------------+
    |              naive|              aware|
    +-------------------+-------------------+
    |2019-01-01 08:00:00|2019-01-01 08:00:00|
    +-------------------+-------------------+

    >>> pst_df.toPandas()['aware'][0]
    Timestamp('2019-01-01 08:00:00')
    >>> pdf['aware'][0]
    Timestamp('2019-01-01 00:00:00.000000500-0800', tz='UTC-08:00')
    >>> (pst_df.toPandas()['aware'][0].timestamp()-pdf['aware'][0].timestamp())/3600
    0.0
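Because only the time zone metadata (and sub-microsecond precision) is lost in
that last round trip, the original instant can be recovered by re-localizing
the naive result as UTC (a sketch reusing the frames from above):

::

    >>> restored = pst_df.toPandas()['aware'][0].tz_localize('UTC')
    >>> restored.tz_convert(timezone(timedelta(hours=-8)))
    Timestamp('2019-01-01 00:00:00-0800', tz='UTC-08:00')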