Posted to dev@orc.apache.org by "Tadahito Kobayashi (JIRA)" <ji...@apache.org> on 2019/02/13 05:47:00 UTC
[jira] [Created] (ORC-468) Fix incorrect documentation for nanoseconds stream encoding
Tadahito Kobayashi created ORC-468:
--------------------------------------
Summary: Fix incorrect documentation for nanoseconds stream encoding
Key: ORC-468
URL: https://issues.apache.org/jira/browse/ORC-468
Project: ORC
Issue Type: Bug
Components: documentation
Reporter: Tadahito Kobayashi
According to the ORC specification, "1000 nanoseconds would be serialized as 0x0b and 100000 would be serialized as 0x0d."
However, the actual encoding results are formatNano(1000) = 0x0a and formatNano(100000) = 0x0c.
How about changing the document as below?
"Because the number of nanoseconds often has a large number of trailing zeros, the number has trailing decimal zero digits removed and the last three bits are used to record how many zeros were removed {color:#FF0000}if there are two or more trailing zeros{color}. Thus 1000 nanoseconds would be serialized as {color:#FF0000}0x0a{color} and 100000 would be serialized as {color:#FF0000}0x0c{color}."
Below are my test and its output, which confirm the nanosecond encodings.
{code:java}
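// This is ORC's serialization code from ColumnWriter.cc; ORC encodes nanoseconds with this function.
// https://github.com/apache/orc/blob/master/c%2B%2B/src/ColumnWriter.cc#L1669
#include <cstdint>
#include <cstdio>

static int64_t formatNano(int64_t nanos) {
  if (nanos == 0) {
    return 0;
  }
  else if (nanos % 100 != 0) {
    // Fewer than two trailing zeros: nothing is removed and the low three bits stay 0.
    return (nanos) << 3;
  }
  else {
    // Two or more trailing zeros: strip them and store (zeros removed - 1) in the low three bits.
    nanos /= 100;
    int64_t trailingZeros = 1;
    while (nanos % 10 == 0 && trailingZeros < 7) {
      nanos /= 10;
      trailingZeros += 1;
    }
    return (nanos) << 3 | trailingZeros;
  }
}

int main()
{
  for (int nano = 1; nano <= 1000000; nano *= 10) {
    // Cast to match the %02x conversion; every value printed here fits in an unsigned int.
    printf("formatNano(%d) = 0x%02x\n", nano, static_cast<unsigned int>(formatNano(nano)));
  }
  return 0;
}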
{code}
The result:
{code:java}
formatNano(1) = 0x08
formatNano(10) = 0x50
formatNano(100) = 0x09
formatNano(1000) = 0x0a
formatNano(10000) = 0x0b
formatNano(100000) = 0x0c
formatNano(1000000) = 0x0d
{code}
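As a cross-check of the results above, here is a minimal sketch of the inverse mapping. The encoder is repeated from the report so the block compiles on its own. The names parseNano and checkRoundTrip are made up for this illustration and are not copied from the ORC code base. The decoding rule is inferred from formatNano: a non-zero value z in the low three bits is read as "z + 1 decimal zeros were removed".

```cpp
#include <cassert>
#include <cstdint>

// Encoder as quoted in the report (ColumnWriter.cc).
static int64_t formatNano(int64_t nanos) {
  if (nanos == 0) {
    return 0;
  } else if (nanos % 100 != 0) {
    return nanos << 3;
  } else {
    nanos /= 100;
    int64_t trailingZeros = 1;
    while (nanos % 10 == 0 && trailingZeros < 7) {
      nanos /= 10;
      trailingZeros += 1;
    }
    return (nanos << 3) | trailingZeros;
  }
}

// Hypothetical decoder (assumption: inverse inferred from formatNano above).
static int64_t parseNano(int64_t serialized) {
  int64_t zeros = serialized & 0x7;   // low three bits
  int64_t result = serialized >> 3;   // remaining significant digits
  if (zeros != 0) {
    // A stored value z means z + 1 trailing zeros were stripped.
    for (int64_t i = 0; i <= zeros; ++i) {
      result *= 10;
    }
  }
  return result;
}

// Hypothetical helper: checks that every power of ten from 1 to 1000000
// survives an encode/decode round trip.
static void checkRoundTrip() {
  for (int64_t nano = 1; nano <= 1000000; nano *= 10) {
    assert(parseNano(formatNano(nano)) == nano);
  }
}
```

Under this reading, parseNano(0x0a) gives 1000 and parseNano(0x0c) gives 100000, which agrees with the proposed wording rather than the 0x0b and 0x0d currently in the document.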