Posted to dev@storm.apache.org by Kristopher Kane <kk...@gmail.com> on 2017/08/01 20:53:57 UTC

Re: Writing ORC files with Storm via Java API

For ORC specifically, I would ONLY create an ORC HDFS file from a batch of
tuples and create/flush/close the ORC file in one go. Adjust batch sizes
and the message timeout to what makes sense for your case. Yes, you will
likely end up with many small files in HDFS, but since this is ORC, the
assumption is that you will be consuming them via Hive. If that is the
case, you can use Hive to concatenate the ORC files at the partition level
(e.g. ALTER TABLE events PARTITION (dt='2017-08-01') CONCATENATE).
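
A minimal sketch of that batch-per-file pattern with the ORC core Java API
(the schema, field names, and output path below are made-up placeholders,
and bufferedTuples stands for whatever batch your bolt has accumulated):

    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;
    import org.apache.storm.tuple.Tuple;

    void writeBatch(List<Tuple> bufferedTuples, String file) throws Exception {
        TypeDescription schema =
                TypeDescription.fromString("struct<id:bigint,msg:string>");
        Writer writer = OrcFile.createWriter(new Path(file),
                OrcFile.writerOptions(new Configuration()).setSchema(schema));
        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector id = (LongColumnVector) batch.cols[0];
        BytesColumnVector msg = (BytesColumnVector) batch.cols[1];
        for (Tuple t : bufferedTuples) {
            int row = batch.size++;
            id.vector[row] = t.getLongByField("id");
            msg.setVal(row, t.getStringByField("msg")
                    .getBytes(StandardCharsets.UTF_8));
            if (batch.size == batch.getMaxSize()) { // row batch full, hand off
                writer.addRowBatch(batch);
                batch.reset();
            }
        }
        if (batch.size > 0) {
            writer.addRowBatch(batch);
        }
        writer.close(); // writes the footer; only now is the file readable
        // ack the whole tuple batch here, after close() has returned
    }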

Other container formats need the same care, but will need their own
post-processing of small files.

Avro, by contrast, is a container format that doesn't need a footer (the
schema is written once in the file header and records are appended in
sync-marked blocks), and is thus ideal for schema + record acknowledgement
processing.
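
For comparison, a sketch of that per-record pattern with Avro's
DataFileWriter (the schema and output path are again placeholders):

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"msg\",\"type\":\"string\"}]}");
    FileSystem fs = FileSystem.get(new Configuration());
    DataFileWriter<GenericRecord> out =
            new DataFileWriter<GenericRecord>(
                    new GenericDatumWriter<GenericRecord>(schema));
    out.create(schema, fs.create(new Path("/data/avro/part-0001.avro")));

    GenericRecord rec = new GenericData.Record(schema);
    rec.put("id", 42L);
    rec.put("msg", "hello");
    out.append(rec);
    out.flush(); // block + sync marker are on the stream; safe to ack
    // ...keep appending/flushing per tuple until the file is rotated...
    out.close();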

Kris

On Mon, Jul 31, 2017 at 10:40 AM, Bobby Evans <ev...@yahoo-inc.com.invalid>
wrote:

> It should be possible to make this work, but it is not going to be
> simple.  The real issue is the format of the ORC file.  It is not written
> one record at a time the way CSV and the other supported formats are.
> Sadly, that is currently a baked-in assumption of the AbstractHdfsBolt:
> https://github.com/apache/storm/blob/master/external/storm-hdfs/src/main/java/org/apache/storm/hdfs/bolt/format/RecordFormat.java
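> That interface is essentially a one-tuple-to-bytes mapping (shown below as
> it stands in storm-hdfs), which is exactly what makes it a poor fit for a
> footered, columnar format like ORC:
>
>     public interface RecordFormat extends Serializable {
>         byte[] format(Tuple tuple); // one tuple in, one serialized record out
>     }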
> So to support it we would need to make some modifications; not impossible,
> just not a drop-in replacement.  If this is something you want to tackle
> and contribute back, I think we would all love it.  You might also run
> into issues with the format's metadata being written at the end of the
> file:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
> I am not totally sure how easy it is to recover an ORC file whose footer
> is missing because a worker crashed.  You might end up with data loss in
> some cases if you are not extremely careful.  To truly fix it, you might
> also need to modify the ORC APIs themselves to support storing/recovering
> the footer metadata in an external location, and then store that metadata
> in ZK on each flush until the file is rotated.
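>
> To make that idea concrete, a rough sketch of the ZooKeeper side using
> Curator; the znode layout is made up, and serializeFooterMetadata() is a
> hypothetical hook that the ORC writer API does not expose today:
>
>     import org.apache.curator.framework.CuratorFramework;
>     import org.apache.curator.framework.CuratorFrameworkFactory;
>     import org.apache.curator.retry.ExponentialBackoffRetry;
>
>     CuratorFramework zk = CuratorFrameworkFactory.newClient(
>             "zk1:2181", new ExponentialBackoffRetry(1000, 3));
>     zk.start();
>
>     // hypothetical: snapshot what the writer would put in the footer
>     byte[] footerMeta = serializeFooterMetadata(writer);
>     String node = "/orc-recovery/part-0001.orc";
>     if (zk.checkExists().forPath(node) == null) {
>         zk.create().creatingParentsIfNeeded().forPath(node, footerMeta);
>     } else {
>         zk.setData().forPath(node, footerMeta);
>     }
>     // on file rotation: zk.delete().forPath(node);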
>
> The Trident HdfsState
> https://github.com/apache/storm/blob/master/external/storm-hdfs/src/main/java/org/apache/storm/hdfs/trident/HdfsState.java
> might be a more appropriate place to start, as the updated state is
> written out in micro-batches, but you still have to deal with the footer
> issue, as Trident really cares about exactly-once processing.
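>
> For orientation, the stock HdfsState wiring follows the pattern below
> (paths, fields, and sizes are placeholders); an ORC-capable variant would
> have to replace the RecordFormat and rotation pieces:
>
>     import org.apache.storm.hdfs.trident.HdfsState;
>     import org.apache.storm.hdfs.trident.HdfsStateFactory;
>     import org.apache.storm.hdfs.trident.HdfsUpdater;
>     import org.apache.storm.hdfs.trident.format.DefaultFileNameFormat;
>     import org.apache.storm.hdfs.trident.format.DelimitedRecordFormat;
>     import org.apache.storm.hdfs.trident.format.FileNameFormat;
>     import org.apache.storm.hdfs.trident.rotation.FileSizeRotationPolicy;
>     import org.apache.storm.trident.TridentState;
>     import org.apache.storm.trident.state.StateFactory;
>     import org.apache.storm.tuple.Fields;
>
>     Fields hdfsFields = new Fields("id", "msg");
>     FileNameFormat fileNameFormat = new DefaultFileNameFormat()
>             .withPath("/tmp/trident").withExtension(".txt");
>     HdfsState.Options options = new HdfsState.HdfsFileOptions()
>             .withFileNameFormat(fileNameFormat)
>             .withRecordFormat(new DelimitedRecordFormat().withFields(hdfsFields))
>             .withRotationPolicy(new FileSizeRotationPolicy(
>                     5.0f, FileSizeRotationPolicy.Units.MB))
>             .withFsUrl("hdfs://namenode:8020");
>     StateFactory factory = new HdfsStateFactory().withOptions(options);
>     TridentState state = stream.partitionPersist(
>             factory, hdfsFields, new HdfsUpdater(), new Fields());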
>
> So overall it is not a simple problem, and relying on an external service
> like Hive (i.e. the Hive Streaming API) would make it a lot simpler.
>
>
> - Bobby
>
>
> On Tuesday, July 25, 2017, 8:38:42 AM CDT, Igor Kuzmenko <f1sherox@gmail.com> wrote:
>
> Is there any implementation of a Storm bolt that can write files to HDFS
> in ORC format without using the Hive Streaming API?
> I've found the Java API for writing ORC files <https://github.com/apache/orc>
> and I'm wondering whether any existing bolts use it, or whether there are
> any plans to create one?
>