Posted to user@storm.apache.org by Filli Alem <Al...@ti8m.ch> on 2015/06/02 11:42:08 UTC

How to write Avro objects to HDFS?

Hi,
I'm struggling with writing Avro objects to HDFS. Is this possible yet? If so, how?
I'm able to read messages from Kafka and output them to the console, but I have no idea how to write them.

I found this pull request, but it doesn't seem to be in the code base yet:
https://patch-diff.githubusercontent.com/raw/apache/storm/pull/347.patch

Any help is much appreciated.
Alem


RE: How to write Avro objects to HDFS?

Posted by Filli Alem <Al...@ti8m.ch>.
I would be able to handle the Parquet/Avro combination, but how do I use Parquet with Storm?



Re: How to write Avro objects to HDFS?

Posted by Mike Thomsen <mi...@gmail.com>.
Parquet appears to have its own API for that. You'll have to look at how it handles Avro; I believe I saw it listed as a supported serialization type.
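
For reference, the piece that does this is the parquet-avro module. Here is a minimal sketch of writing Avro GenericRecords to a Parquet file; the schema, path, and values below are made up for illustration, and older Parquet releases use the package parquet.avro instead of org.apache.parquet.avro:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;

public class ParquetAvroSketch {
    public static void main(String[] args) throws Exception {
        // Toy schema: a record with an id and a body field.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"body\",\"type\":\"string\"}]}");

        // With an hdfs:// URI (and the Hadoop config on the classpath)
        // this writes straight to HDFS; a local path works for testing.
        AvroParquetWriter<GenericRecord> writer =
            new AvroParquetWriter<GenericRecord>(new Path("events.parquet"), schema);

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 1L);
        record.put("body", "hello");
        writer.write(record);
        writer.close();   // data only becomes durable on close
    }
}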


RE: How to write Avro objects to HDFS?

Posted by Filli Alem <Al...@ti8m.ch>.
Hey Mike,

Thanks for your quick response!

I looked into the Parquet + Avro solution; it's a possibility for us to try.
I still have the same problem though: how can I serialize with Parquet?

Thanks
Alem


Re: How to write Avro objects to HDFS?

Posted by Mike Thomsen <mi...@gmail.com>.
You can take the patch I wrote and apply it to a copied version of the HDFS bolt from storm-hdfs. Then you just need to add this to main() in your topology, where "conf" is the topology Config object:

Map<String, Object> hdfsConfig = new HashMap<String, Object>();
// Filesystem implementations to use for file:// and hdfs:// URIs.
hdfsConfig.put("fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem");
hdfsConfig.put("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
// Register Avro serialization (alongside plain Java serialization) so the
// sequence file writer knows how to serialize Avro records.
hdfsConfig.put("io.serializations", "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.avro.hadoop.io.AvroSerialization");
// Hand the Hadoop overrides to the patched bolt via the topology config.
conf.put("storm.hdfs.config", hdfsConfig);

I would caution you not to go this route. HDFS sequence files are really not a good match for Storm + Avro. You can easily end up with duplicates in them if you're not careful: processing Avro data is a lot more CPU-intensive than typical Storm workloads, so tuples can hit the timeout and get replayed, and with Storm's at-least-once delivery every replay gets written again. So you'll want to make sure you give yourself some extra room in the timeouts and max pending tuples.
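
For example, in the topology's main() (a sketch against the Config API; the numbers are arbitrary starting points, not recommendations):

Config conf = new Config();
// Give CPU-heavy Avro processing more time before Storm times the
// tuple out and replays it (the default is 30 seconds).
conf.setMessageTimeoutSecs(120);
// Cap the number of un-acked tuples in flight per spout task so a
// slow writer can't fall arbitrarily far behind.
conf.setMaxSpoutPending(500);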

My understanding is that Apache Parquet supports Avro, and it seems to be a lot better than HDFS sequence files. It's worth a look before you get deep into this.
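
If you do go the Parquet route, wiring it into Storm is straightforward in outline. A rough sketch of a terminal bolt, not production code: file rotation and batching are omitted, and the field name "record", the output path, and the "avro.schema.json" config key are all made up for illustration:

import java.io.IOException;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;

public class AvroParquetBolt extends BaseRichBolt {
    private OutputCollector collector;
    private transient AvroParquetWriter<GenericRecord> writer;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        try {
            // "avro.schema.json" is a hypothetical config key holding the schema.
            Schema schema = new Schema.Parser().parse((String) stormConf.get("avro.schema.json"));
            // One file per task; a real bolt would rotate files over time.
            Path out = new Path("hdfs:///data/events-" + context.getThisTaskId() + ".parquet");
            writer = new AvroParquetWriter<GenericRecord>(out, schema);
        } catch (IOException e) {
            throw new RuntimeException("could not open Parquet writer", e);
        }
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            // Assumes the upstream component emits a GenericRecord in field "record".
            writer.write((GenericRecord) tuple.getValueByField("record"));
            collector.ack(tuple);
        } catch (IOException e) {
            collector.fail(tuple);   // let Storm replay it
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt: nothing to emit.
    }
}

One caveat with this sketch: Parquet buffers row groups in memory and only makes data durable when the file is closed, so per-tuple acking as shown can still lose data on a crash. A real bolt would rotate and close files periodically.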
