Posted to user@flume.apache.org by Alan Miller <Al...@synopsys.com> on 2013/02/06 18:58:48 UTC

streaming Avro to HDFS

Hi, I'm just getting started with Flume and trying to understand the flow of things.

I have Avro binary data files being generated on remote nodes, and I want to use
Flume (1.2.0) to stream them to my HDFS cluster at a central location. It seems I can
stream the data, but the resulting files on HDFS appear corrupt.  Here's what I did:

For my "master" (on the NameNode of my Hadoop cluster)  I started this:
flume-ng agent -f agent.conf  -Dflume.root.logger=DEBUG,console -n agent
With this config:
agent.channels = memory-channel
agent.sources = avro-source
agent.sinks = hdfs-sink

agent.channels.memory-channel.type = memory
agent.channels.memory-channel.capacity = 1000
agent.channels.memory-channel.transactionCapacity = 100

agent.sources.avro-source.channels = memory-channel
agent.sources.avro-source.type = avro
agent.sources.avro-source.bind = 10.10.10.10
agent.sources.avro-source.port = 41414

agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.channel = memory-channel
agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode1:9000/flume

On a remote node I streamed a test file like this:
flume-ng avro-client -H 10.10.10.10 -p 41414 -F /tmp/test.avro

I can see the master is writing to HDFS:
  ......
  13/02/06 09:37:55 INFO hdfs.BucketWriter: Creating hdfs://namenode1:9000/flume/FlumeData.1360172273684.tmp
  13/02/06 09:38:25 INFO hdfs.BucketWriter: Renaming hdfs://namenode1:9000/flume/FlumeData.1360172273684.tmp
  to hdfs://namenode1:9000/flume/FlumeData.1360172273684

But the data doesn't seem right. The original file is 4551 bytes, but the file written to
HDFS is only 219 bytes:
  [localhost] $ ls -l FlumeData.1360172273684 /tmp/test.avro
  -rwxr-xr-x 1 amiller amiller  219 Feb  6 18:51 FlumeData.1360172273684
  -rwxr-xr-x 1 amiller amiller 4551 Feb 6 12:00 /tmp/test.avro

  [localhost] $ avro cat /tmp/test.avro
  {"system_model": null, "nfsv4": null, "ip": null, "site": null, "nfsv3": null, "export": null, "ifnet": [{"send_bps": 1234, "recv_bps": 5678, "name": "eth0"}, {"send_bps": 100, "recv_bps": 200, "name": "eth1"}, {"send_bps": 0, "recv_bps": 0, "name": "eth2"}], "disk": null, "hostname": "localhost", "total_mem": null, "ontapi_version": null, "serial_number": null, "cifs": null, "cpu_model": null, "volume": null, "time_stamp": 1357639723, "aggregate": null, "num_cpu": null, "cpu_speed_mhz": null, "hostid": null, "kernel_version": null, "qtree": null, "processor": null}

  [localhost] $ hadoop fs -copyToLocal /flume/FlumeData.1360172273684 .
  [localhost] $ avro cat FlumeData.1360172273684
  panic: ord() expected a character, but string of length 0 found

Alan



RE: streaming Avro to HDFS

Posted by Alan Miller <Al...@synopsys.com>.
Thanks again Hari. After I figured out the dependencies I was able to build a myapp.jar
that runs the MyApp.class example and streams 10 “Hello”s to my remote namenode.

Do I understand this correctly: MyApp is triggering Events that send strings to the Avro source,
and I’d have to implement a custom RpcClient (like this one) that sends Avro records to my Avro source?
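
For what it's worth, such a sender might look roughly like the sketch below (assuming the Flume 1.x RpcClientFactory/EventBuilder API and Avro's DataFileStream/DataFileWriter; the class name AvroFileSender, the host, port, and file path are placeholders). Each record is re-wrapped in its own small Avro container so that every event body is valid Avro on its own:

// Hypothetical sketch, not code from this thread: read an Avro container file
// and send each record to the Flume Avro source as one event whose body is
// itself a complete Avro container (schema + one record).
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class AvroFileSender {
  public static void main(String[] args) throws Exception {
    RpcClient client = RpcClientFactory.getDefaultInstance("10.10.10.10", 41414);
    try (DataFileStream<GenericRecord> in = new DataFileStream<>(
             new FileInputStream(new File("/tmp/test.avro")),
             new GenericDatumReader<GenericRecord>())) {
      Schema schema = in.getSchema();
      for (GenericRecord record : in) {
        // Re-serialize the record into its own container so the event body
        // can be read back with standard Avro tools.
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, body);
        writer.append(record);
        writer.close();
        client.append(EventBuilder.withBody(body.toByteArray()));  // one container per event
      }
    } finally {
      client.close();
    }
  }
}

The NettyServer/BucketWriter output that follows appears to be from the MyApp run mentioned above.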

13/02/07 05:20:30 INFO ipc.NettyServer: [id: 0x3b926e90, /10.1.1.1:44477 => /10.10.10.10:41414] OPEN
13/02/07 05:20:30 INFO ipc.NettyServer: [id: 0x3b926e90, /10.1.1.1:44477 => /10.10.10.10:41414] BOUND: /10.10.10.10:41414
13/02/07 05:20:30 INFO ipc.NettyServer: [id: 0x3b926e90, /10.1.1.1:44477 => /10.10.10.10:41414] CONNECTED: /10.1.1.1:44477
13/02/07 05:20:32 INFO ipc.NettyServer: [id: 0x3b926e90, /10.1.1.1:44477 :> /10.10.10.10:41414] DISCONNECTED
13/02/07 05:20:32 INFO ipc.NettyServer: [id: 0x3b926e90, /10.1.1.1:44477 :> /10.10.10.10:41414] UNBOUND
13/02/07 05:20:32 INFO ipc.NettyServer: [id: 0x3b926e90, /10.1.1.1:44477 :> /10.10.10.10:41414] CLOSED
13/02/07 05:20:32 INFO ipc.NettyServer: Connection to /10.1.1.1:44477 disconnected.
13/02/07 05:20:33 INFO hdfs.BucketWriter: Creating hdfs://namenode1:9000/flume/FlumeData.1360243232430.tmp
13/02/07 05:21:03 INFO hdfs.BucketWriter: Renaming hdfs://namenode1:9000/flume/FlumeData.1360243232430.tmp to hdfs://namenode1:9000/flume/FlumeData.1360243232430

Alan

From: Hari Shreedharan [mailto:hshreedharan@cloudera.com]
Sent: Wednesday, February 06, 2013 7:59 PM
To: user@flume.apache.org
Subject: Re: streaming Avro to HDFS

Here you are: http://flume.apache.org/FlumeDeveloperGuide.html#client


Hari

--
Hari Shreedharan


Re: streaming Avro to HDFS

Posted by Hari Shreedharan <hs...@cloudera.com>.
Here you are: http://flume.apache.org/FlumeDeveloperGuide.html#client  


Hari  

--  
Hari Shreedharan
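
The example in that section boils down to something like the following (a paraphrased sketch, not the guide's exact code; the host, port, and message text are placeholders):

// Paraphrased sketch of a minimal RpcClient-based sender in the spirit of the
// Developer Guide's client example; not copied from the linked page.
import java.nio.charset.Charset;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class MyApp {
  public static void main(String[] args) {
    RpcClient client = RpcClientFactory.getDefaultInstance("10.10.10.10", 41414);
    try {
      for (int i = 0; i < 10; i++) {
        Event event = EventBuilder.withBody("Hello Flume! " + i, Charset.forName("UTF-8"));
        client.append(event);  // each append() delivers one event to the Avro source
      }
    } catch (EventDeliveryException e) {
      System.err.println("Event delivery failed: " + e.getMessage());
    } finally {
      client.close();
    }
  }
}

Each append() delivers the given bytes as one event; what ends up in HDFS then depends on how the sink serializes those event bodies.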


On Wednesday, February 6, 2013 at 10:20 AM, Alan Miller wrote:

> Thanks Hari,
>   
> Are there any links to examples of how to use the RpcClient?
>   
> Alan
>   


RE: streaming Avro to HDFS

Posted by Alan Miller <Al...@synopsys.com>.
Thanks Hari,

Are there any links to examples of how to use the RpcClient?

Alan

From: Hari Shreedharan [mailto:hshreedharan@cloudera.com]
Sent: Wednesday, February 06, 2013 7:16 PM
To: user@flume.apache.org
Subject: Re: streaming Avro to HDFS

Alan,

I think this is probably because the AvroClient is not really very "smart." It is mainly useful for testing the AvroSource. The AvroClient reads the file passed in and sends one line per event (in 1.2.0; in 1.3.0+ there is also an option to send all files in a directory). So the events are not really sent as Avro files, and since you are using the text serializer they are dumped as-is. Since events can arrive out of order, your data is likely to be invalid Avro. Also, the newline character used to split the events may actually have been part of the real Avro serialization, so removing it simply made the output invalid Avro.

My advice would be to use the RpcClient to read the file and send the data in a valid format, by making sure one Avro "container" is in one event.


Hari

--
Hari Shreedharan


Re: streaming Avro to HDFS

Posted by Hari Shreedharan <hs...@cloudera.com>.
Alan,

I think this is probably because the AvroClient is not really very "smart." It is mainly useful for testing the AvroSource. The AvroClient reads the file passed in and sends one line per event (in 1.2.0; in 1.3.0+ there is also an option to send all files in a directory). So the events are not really sent as Avro files, and since you are using the text serializer they are dumped as-is. Since events can arrive out of order, your data is likely to be invalid Avro. Also, the newline character used to split the events may actually have been part of the real Avro serialization, so removing it simply made the output invalid Avro.

My advice would be to use the RpcClient to read the file and send the data in a valid format, by making sure one Avro "container" is in one event.
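
For example, one simple way to keep one container per event would be to ship the entire Avro file as a single event body (a sketch only, assuming the file comfortably fits in one event; the class name WholeFileSender, host, port, and path are placeholders):

// Hypothetical sketch: send a whole Avro container file as one Flume event.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class WholeFileSender {
  public static void main(String[] args) throws Exception {
    byte[] body = readFully(new File("/tmp/test.avro"));
    RpcClient client = RpcClientFactory.getDefaultInstance("10.10.10.10", 41414);
    try {
      client.append(EventBuilder.withBody(body));  // the whole container is one event
    } finally {
      client.close();
    }
  }

  private static byte[] readFully(File f) throws IOException {
    byte[] buf = new byte[(int) f.length()];
    try (FileInputStream in = new FileInputStream(f)) {
      int off = 0;
      while (off < buf.length) {
        int n = in.read(buf, off, buf.length - off);
        if (n < 0) throw new IOException("Unexpected end of file: " + f);
        off += n;
      }
    }
    return buf;
  }
}

For the bytes to reach HDFS unmodified, the HDFS sink would presumably also need to write raw event bodies rather than the default SequenceFile wrapping (for example hdfs.fileType = DataStream); treat that as an assumption to verify against the HDFS sink documentation.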


Hari  

--  
Hari Shreedharan

