Posted to user@flume.apache.org by Sunita Arvind <su...@gmail.com> on 2013/07/18 19:38:45 UTC

Seeking advice over choice of language and implementation

Hello friends,

I am new to Flume and have written a Python script to fetch some data from
social media. The response is JSON. I am seeking help on the following issues:
1. I am finding it hard to make Python and Flume talk. Is it just my
ignorance, or is it indeed a long route? AFAIK, I need to understand the
Thrift API, Avro, etc. to achieve this. I also read about pipes. Would that
be a simple implementation?

2. I am equally comfortable (or uncomfortable) in Java, hence I am wondering
whether it is better to rewrite my application in Java so that I can
integrate it with Flume more easily. Are there any advantages to having a
Java application, given that all of Hadoop is Java?

3. I need to schedule the agent to run on a daily basis. Which of the above
approaches would help me achieve this easily?

4. Going by this thread -
http://mail-archives.apache.org/mod_mbox/flume-user/201306.mbox/%3CA7B08BAB-C8B8-4B55-B3EC-A80AB4EBB438@gmail.com%3E
- it looks like we need to clean up disk space manually even with Flume. I
am not clear on the advantages Flume would give me over a simple cron job
that does the task. I could write statements like "hadoop fs -put <location
of output file on local> <location on hdfs>" in the cron job instead, as
sketched below.
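
This is roughly what I have in mind, shown as a Python sketch rather than a
raw crontab entry (the paths and file names are made up for illustration; a
cron entry would invoke this script once a day):

# cron_push.py - hypothetical daily job: after the fetch script runs,
# push the day's output file to HDFS instead of going through Flume.
import datetime
import subprocess

local_file = "/data/social/out-%s.json" % datetime.date.today().isoformat()
hdfs_dir = "/user/sunita/social/"

# Equivalent of:
#   hadoop fs -put <location of output file on local> <location on hdfs>
subprocess.check_call(["hadoop", "fs", "-put", local_file, hdfs_dir])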

Appreciate your help and guidance

regards,
Sunita

Fwd: Seeking advice over choice of language and implementation

Posted by Sunita Arvind <su...@gmail.com>.
Thanks for your inputs, Ashish and Hari.
Ashish, I'm attempting something similar (using WebHDFS) to what you
mentioned inline under the 3rd point (whether to consider Flume for a daily
batch job).
Let me know if you have any idea about the error. I'll update the group if
setting the flag

dfs.webhdfs.enabled = true

helps.

regards
Sunita


Re: Seeking advice over choice of language and implementation

Posted by Sunita Arvind <su...@gmail.com>.
Thank you, Israel,

I will attempt option 1 and share my experiences.

In the meantime, I tried a workaround: using WebHDFS to write the files
directly to HDFS from a Python daemon (using this library -
https://github.com/carlosmarin/webhdfs-py/blob/master/webhdfs/webhdfs.py).
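
Under the hood, the library call should amount to the standard two-step
WebHDFS CREATE, roughly like this minimal sketch using plain requests (the
host, port, user, and paths are placeholders, not my actual setup):

import requests  # assuming the requests library is available

namenode = "http://namenode.example.com:50070"  # placeholder host and port
path = "/user/sunita/social/out.json"           # placeholder HDFS path

# Step 1: CREATE is an HTTP PUT op; the namenode replies with a 307
# redirect to a datanode rather than accepting the data itself.
url = namenode + "/webhdfs/v1" + path + "?op=CREATE&user.name=sunita"
r = requests.put(url, allow_redirects=False)
datanode_url = r.headers["Location"]

# Step 2: PUT the file contents to the datanode URL from the redirect.
with open("/data/social/out.json", "rb") as f:
    resp = requests.put(datanode_url, data=f)
print(resp.status_code)  # expect 201 Created on success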

However, with this, I am getting an error:
07/19/2013 06:05:59 PM - webhdfs - DEBUG - HTTP Response: 404, Not Found

If I copy and paste the resulting URL into the browser address bar, I get
something like this:

{"RemoteException":{"exception":"IllegalArgumentException","javaClassName":"java.lang.IllegalArgumentException","message":"Invalid value for webhdfs parameter \"op\": No enum const class org.apache.hadoop.hdfs.web.resources.GetOpParam$Op.CREATE"}}

I have no idea what this means. I am wondering whether it means HDFS is not
configured with dfs.webhdfs.enabled = true. (I do not have permission to
check or change this; I am requesting access from the admin.) Let me know
your thoughts.
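
A quick way to check whether WebHDFS is responding at all might be a plain
read-only request (again just a sketch; host and port are placeholders):

import requests  # assuming the requests library is available

# LISTSTATUS is a GET op, so a 200 response here would suggest that
# WebHDFS is enabled and reachable on this host and port.
r = requests.get("http://namenode.example.com:50070/webhdfs/v1/?op=LISTSTATUS")
print(r.status_code, r.text[:200])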

regards
Sunita

On Fri, Jul 19, 2013 at 12:51 AM, Israel Ekpo <is...@aicer.org> wrote:

> Sunita,
>
> Depending on your level of comfort, you can do one of the following:
>
> 1. Use Python to fetch your data and then send the events via HTTP to the
> Flume HTTP Source [1]
> 2. Use Java to create a custom source [6] in Flume that handles the data
> fetching and then puts it in a channel [3] so that it can be funneled into
> the sinks [4] and [5]
>
> Option 1 would be easier for you since you can get the data in Python and
> just stream it down via HTTP to Flume.
>
> Option 2 will be more involved since you need to write code that
> communicates with external endpoints.
>
> References
> [1] http://goo.gl/5lHlg
> [2] http://goo.gl/GnVbE
> [3] http://goo.gl/t31Xh
> [4] http://goo.gl/G9xS8
> [5] http://goo.gl/Wn4W5
> [6] http://goo.gl/Q0yyn
>
>
> *Author and Instructor for the Upcoming Book and Lecture Series*
> *Massive Log Data Aggregation, Processing, Searching and Visualization
> with Open Source Software*
> *http://massivelogdata.com*
>
>
> On 18 July 2013 13:38, Sunita Arvind <su...@gmail.com> wrote:
>
>> Hello friends,
>>
>> I am new to Flume and have written a Python script to fetch some data
>> from social media. The response is JSON. I am seeking help on the
>> following issues:
>> 1. I am finding it hard to make Python and Flume talk. Is it just my
>> ignorance, or is it indeed a long route? AFAIK, I need to understand the
>> Thrift API, Avro, etc. to achieve this. I also read about pipes. Would
>> that be a simple implementation?
>>
>> 2. I am equally comfortable (or uncomfortable) in Java, hence I am
>> wondering whether it is better to rewrite my application in Java so that
>> I can integrate it with Flume more easily. Are there any advantages to
>> having a Java application, given that all of Hadoop is Java?
>>
>> 3. I need to schedule the agent to run on a daily basis. Which of the
>> above approaches would help me achieve this easily?
>>
>> 4. Going by this thread -
>> http://mail-archives.apache.org/mod_mbox/flume-user/201306.mbox/%3CA7B08BAB-C8B8-4B55-B3EC-A80AB4EBB438@gmail.com%3E
>> - it looks like we need to clean up disk space manually even with Flume.
>> I am not clear on the advantages Flume would give me over a simple cron
>> job that does the task. I could write statements like "hadoop fs -put
>> <location of output file on local> <location on hdfs>" in the cron job
>> instead.
>>
>> Appreciate your help and guidance
>>
>> regards,
>> Sunita
>>
>
>

Re: Seeking advice over choice of language and implementation

Posted by Israel Ekpo <is...@aicer.org>.
Sunita,

Depending on your level of comfort, you can do one of the following:

1. Use Python to fetch your data and then send the events via HTTP to the
Flume HTTP Source [1]
2. Use Java to create a custom source [6] in Flume that handles the data
fetching and then puts it in a channel [3] so that it can be funneled into
the sinks [4] and [5]

Option 1 would be easier for you since you can get the data in Python and
just stream it down via HTTP to Flume.

Option 2 will be more involved since you need to write code that
communicates with external endpoints.
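
To make option 1 concrete, here is a minimal sketch (the host and port are
placeholders; it assumes an HTTP source configured with the default
JSONHandler, which accepts a JSON array of events that each carry a
"headers" map and a string "body"):

import json
import requests  # assuming the requests library is available

# Placeholder endpoint: a Flume agent whose HTTP source listens on 5140.
FLUME_HTTP = "http://flume-agent.example.com:5140"

def send_to_flume(records):
    # Wrap each record as a Flume event in the JSONHandler format.
    events = [{"headers": {"source": "social"}, "body": json.dumps(r)}
              for r in records]
    resp = requests.post(FLUME_HTTP, data=json.dumps(events),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()  # the source returns 200 once events are committed

send_to_flume([{"user": "example", "text": "hello"}])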

References
[1] http://goo.gl/5lHlg
[2] http://goo.gl/GnVbE
[3] http://goo.gl/t31Xh
[4] http://goo.gl/G9xS8
[5] http://goo.gl/Wn4W5
[6] http://goo.gl/Q0yyn


*Author and Instructor for the Upcoming Book and Lecture Series*
*Massive Log Data Aggregation, Processing, Searching and Visualization with
Open Source Software*
*http://massivelogdata.com*


On 18 July 2013 13:38, Sunita Arvind <su...@gmail.com> wrote:

> Hello friends,
>
> I am new to Flume and have written a Python script to fetch some data from
> social media. The response is JSON. I am seeking help on the following issues:
> 1. I am finding it hard to make Python and Flume talk. Is it just my
> ignorance, or is it indeed a long route? AFAIK, I need to understand the
> Thrift API, Avro, etc. to achieve this. I also read about pipes. Would that
> be a simple implementation?
>
> 2. I am equally comfortable (or uncomfortable) in Java, hence I am wondering
> whether it is better to rewrite my application in Java so that I can
> integrate it with Flume more easily. Are there any advantages to having a
> Java application, given that all of Hadoop is Java?
>
> 3. I need to schedule the agent to run on a daily basis. Which of the
> above approaches would help me achieve this easily?
>
> 4. Going by this thread -
> http://mail-archives.apache.org/mod_mbox/flume-user/201306.mbox/%3CA7B08BAB-C8B8-4B55-B3EC-A80AB4EBB438@gmail.com%3E
> - it looks like we need to clean up disk space manually even with Flume. I
> am not clear on the advantages Flume would give me over a simple cron job
> that does the task. I could write statements like "hadoop fs -put <location
> of output file on local> <location on hdfs>" in the cron job instead.
>
> Appreciate your help and guidance
>
> regards,
> Sunita
>

Re: Seeking advice over choice of language and implementation

Posted by Hari Shreedharan <hs...@cloudera.com>.
Avro's Python client is unlikely to work, because the Avro Netty RPC does
not have a Python implementation (and is not compatible with the HTTP
transceiver). At this point, either the HTTP Source or Thrift RPC is your
best option.

Like Ashish said, Flume is meant for streaming jobs rather than batch
jobs. It would work for sure, but you may have other options.

Thanks,
Hari


On Fri, Jul 19, 2013 at 7:14 AM, Ashish <pa...@gmail.com> wrote:

>
>
>
> On Thu, Jul 18, 2013 at 11:08 PM, Sunita Arvind <su...@gmail.com> wrote:
>
>> Hello friends,
>>
>> I am new to Flume and have written a Python script to fetch some data
>> from social media. The response is JSON. I am seeking help on the
>> following issues:
>> 1. I am finding it hard to make Python and Flume talk. Is it just my
>> ignorance, or is it indeed a long route? AFAIK, I need to understand the
>> Thrift API, Avro, etc. to achieve this. I also read about pipes. Would
>> that be a simple implementation?
>>
>
> Python would work fine. As said, you can use the HTTP Source.
> Alternatively, you can also use the Avro source with Avro's Python client.
>
>
>>
>> 2. I am equally comfortable (or uncomfortable) in Java, hence I am
>> wondering whether it is better to rewrite my application in Java so that
>> I can integrate it with Flume more easily. Are there any advantages to
>> having a Java application, given that all of Hadoop is Java?
>>
>
> The advantage would be that you can use Flume's Client SDK, reducing a
> lot of work. IMHO, it doesn't matter to Flume who is pushing the data.
>
>
>>
>> 3. I need to schedule the agent to run on a daily basis. Which of the
>> above approaches would help me achieve this easily?
>>
>
> Looks like you have a batch job that executes at a fixed point in time
> during the day. If that's the case, please take another look at whether
> you need Flume. Flume can definitely be used, but you could load the data
> onto HDFS directly. Again, I cannot conclude based on the information
> provided.
>
>
>>
>> 4. Going by this thread -
>> http://mail-archives.apache.org/mod_mbox/flume-user/201306.mbox/%3CA7B08BAB-C8B8-4B55-B3EC-A80AB4EBB438@gmail.com%3E
>> - it looks like we need to clean up disk space manually even with Flume.
>> I am not clear on the advantages Flume would give me over a simple cron
>> job that does the task. I could write statements like "hadoop fs -put
>> <location of output file on local> <location on hdfs>" in the cron job
>> instead.
>>
>
> The ML thread you pointed to relates to the RollingFileSink, not the
> HDFS sink, so it is not valid in the context of the HDFS sink.
>
> HTH !
>
>
>>
>> Appreciate your help and guidance
>>
>> regards,
>> Sunita
>>
>
>
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>

Re: Seeking advice over choice of language and implementation

Posted by Ashish <pa...@gmail.com>.
On Thu, Jul 18, 2013 at 11:08 PM, Sunita Arvind <su...@gmail.com> wrote:

> Hello friends,
>
> I am new to Flume and have written a Python script to fetch some data from
> social media. The response is JSON. I am seeking help on the following issues:
> 1. I am finding it hard to make Python and Flume talk. Is it just my
> ignorance, or is it indeed a long route? AFAIK, I need to understand the
> Thrift API, Avro, etc. to achieve this. I also read about pipes. Would that
> be a simple implementation?
>

Python would work fine. As said, you can use the HTTP Source. Alternatively,
you can also use the Avro source with Avro's Python client.


>
> 2. I am equally comfortable (or uncomfortable) in Java, hence I am wondering
> whether it is better to rewrite my application in Java so that I can
> integrate it with Flume more easily. Are there any advantages to having a
> Java application, given that all of Hadoop is Java?
>

The advantage would be that you can use Flume's Client SDK, reducing a lot
of work. IMHO, it doesn't matter to Flume who is pushing the data.


>
> 3. I need to schedule the agent to run on a daily basis. Which of the
> above approaches would help me achieve this easily?
>

Looks like you have a batch job that executes at a fixed point in time
during the day. If that's the case, please take another look at whether you
need Flume. Flume can definitely be used, but you could load the data onto
HDFS directly. Again, I cannot conclude based on the information provided.


>
> 4. Going by this thread -
> http://mail-archives.apache.org/mod_mbox/flume-user/201306.mbox/%3CA7B08BAB-C8B8-4B55-B3EC-A80AB4EBB438@gmail.com%3E
> - it looks like we need to clean up disk space manually even with Flume. I
> am not clear on the advantages Flume would give me over a simple cron job
> that does the task. I could write statements like "hadoop fs -put <location
> of output file on local> <location on hdfs>" in the cron job instead.
>

The ML thread you pointed to relates to the RollingFileSink, not the HDFS
sink, so it is not valid in the context of the HDFS sink.

HTH !


>
> Appreciate your help and guidance
>
> regards,
> Sunita
>



-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal