Posted to user@flink.apache.org by Martin Neumann <mn...@sics.se> on 2016/02/12 00:20:37 UTC

streaming using DeserializationSchema

Hej,

I have a streaming program that reads data from Kafka, where the data is in
Avro. I have my own DeserializationSchema to deal with it.

For testing purposes I want to read a dump from HDFS instead. Is there a way
to use the same DeserializationSchema to read from an Avro file stored on
HDFS?

cheers Martin

Re: streaming using DeserializationSchema

Posted by Martin Neumann <mn...@sics.se>.
I ended up not using the DeserializationSchema and instead going with an
AvroInputFormat when reading from a file. I would have preferred to keep the
code simpler, but the map solution was a lot more complicated: my raw data is
in Avro binary format, so I cannot just read it as text and map it later.
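
A minimal sketch of that approach, inside a main() (MyRecord stands in for an
Avro-generated class and the path is made up; note the AvroInputFormat package
has moved between Flink versions):

    import org.apache.flink.api.java.io.AvroInputFormat;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Read the Avro binary dump directly; no DeserializationSchema involved.
    AvroInputFormat<MyRecord> avroInput =
        new AvroInputFormat<>(new Path("hdfs:///dumps/records.avro"), MyRecord.class);

    DataStream<MyRecord> stream = env.createInput(avroInput);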

cheers Martin


Re: streaming using DeserializationSchema

Posted by Nick Dimiduk <nd...@apache.org>.
My input file contains newline-delimited JSON records, one per text line.
The records on the Kafka topic are JSON blobs encoded to UTF-8 and written
as bytes.
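
For the byte-array part, a sketch of the file-side source, inside a main()
(MyJsonSchema and MyEvent stand in for the existing DeserializationSchema and
its record type; the path is made up):

    import java.nio.charset.StandardCharsets;

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Same schema instance the Kafka consumer would use.
    final MyJsonSchema schema = new MyJsonSchema();

    DataStream<MyEvent> events = env
        .readTextFile("hdfs:///dumps/events.json")   // one JSON record per line
        .map(new MapFunction<String, MyEvent>() {
            @Override
            public MyEvent map(String line) throws Exception {
                // Kafka hands the schema raw bytes, so re-encode each line the same way
                return schema.deserialize(line.getBytes(StandardCharsets.UTF_8));
            }
        });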

Re: streaming using DeserializationSchema

Posted by Martin Neumann <mn...@sics.se>.
I'm trying the same thing now.

I guess you need to read the file as byte arrays somehow to make it work.
Which read function did you use? The mapper is not hard to write, but the
byte-array handling gives me a headache.

cheers Martin

Re: streaming using DeserializationSchema

Posted by Nick Dimiduk <nd...@apache.org>.
Hi Martin,

I have the same use case. I wanted to be able to load dumps of data in the
same format as on the Kafka queue. I created a new application main; call it
the "job" as opposed to the "flow". I refactored my flow-building code a bit
so that all of it can be reused via a factory method. I then implemented a
MapFunction that simply calls my existing deserializer. Create a new
DataStream from the flat file and tack on the MapFunction step. The resulting
DataStream is type-compatible with the Kafka consumer that starts the "flow"
application, so I pass it into the same factory method. Tweak the
ParameterTool options for the "job" application, et voilà!

Sorry I don't have example code for you; this would be a good example to
contribute back to the community's example library though.
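
That said, a rough, untested sketch of the overall shape (every name below is
illustrative: MyEvent, MyAvroSchema, the topic, and the paths; the
FlinkKafkaConsumer08 class is the Kafka 0.8 connector of that era):

    import java.util.Properties;

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.utils.ParameterTool;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;

    public class EventPipeline {

        // Shared factory method: everything downstream of the source lives here,
        // so both entry points run exactly the same topology.
        static void buildFlow(DataStream<MyEvent> input) {
            input.print();   // stand-in for the real flow
        }

        public static void main(String[] args) throws Exception {
            final ParameterTool params = ParameterTool.fromArgs(args);
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            // Same DeserializationSchema in both modes.
            final MyAvroSchema schema = new MyAvroSchema();

            DataStream<MyEvent> source;
            if (params.has("input")) {
                // "job" mode: replay a dump, re-running the existing deserializer
                source = env.readTextFile(params.get("input"))
                        .map(new MapFunction<String, MyEvent>() {
                            @Override
                            public MyEvent map(String line) throws Exception {
                                return schema.deserialize(line.getBytes());
                            }
                        });
            } else {
                // "flow" mode: the normal Kafka source
                Properties kafkaProps = new Properties();
                kafkaProps.setProperty("bootstrap.servers", "localhost:9092");
                kafkaProps.setProperty("zookeeper.connect", "localhost:2181");
                kafkaProps.setProperty("group.id", "event-pipeline");
                source = env.addSource(
                    new FlinkKafkaConsumer08<>("events", schema, kafkaProps));
            }

            buildFlow(source);
            env.execute("event pipeline");
        }
    }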

Good luck!
-n

Re: streaming using DeserializationSchema

Posted by Martin Neumann <mn...@sics.se>.
It's not only about testing; I will also need to run things against
different datasets. I want to reuse as much of the code as possible to load
the same data from a file instead of Kafka.

Is there a simple way of loading the data from a file using the same
conversion classes that I would use to transform it when reading from Kafka,
or do I have to write a new Avro deserializer (InputFormat)?

Re: streaming using DeserializationSchema

Posted by Gyula Fóra <gy...@gmail.com>.
Hey,

A very simple thing you could do is set up a small Kafka producer in a Java
program that feeds the data into a topic. This also has the added benefit
that you are actually testing against Kafka.
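
A minimal sketch of such a feeder (broker address, topic name, and file path
are all made up, and it assumes line-oriented records; binary Avro would need
proper framing):

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DumpFeeder {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.ByteArraySerializer");

            KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props);
            // One record per line of the dump.
            for (String line : Files.readAllLines(Paths.get("dump.txt"))) {
                producer.send(new ProducerRecord<byte[], byte[]>("events", line.getBytes()));
            }
            producer.close();
        }
    }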

Cheers,
Gyula
