Posted to user@flume.apache.org by Rahul Patodi <pa...@gmail.com> on 2012/06/07 15:49:59 UTC

Error in flume sink (data is getting copied again and again to hbase)

I have configured the flume hbase() sink
and I have managed to get it working:
data is getting copied from a file on the local hard disk to HBase.

but

*The same data is getting copied again and again to the HBase table (I saw
this by using VERSIONS).*
(I am not making any changes to the source file.)

My configurations:
Collector Source: collectorSource(35853)
Collector Sink: {regexAll("(\\w+)\\t+(\\w+)\\t+(\\w+)", "row", "data1",
"data2") => hbase("ft02", "%{row}", "cf1", "col", "%{data1}", "cf2",
"coll", "%{data2}")}

Agent Source: tail("/tmp/test03")
Agent Sink: agentSink("localhost",35853)
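
For completeness, these source/sink pairs would be applied through the
flume shell roughly like this (agent01 and collector01 are placeholder
node names I am using here; the master web UI works equally well):

  exec config collector01 'collectorSource(35853)' '{regexAll("(\\w+)\\t+(\\w+)\\t+(\\w+)", "row", "data1", "data2") => hbase("ft02", "%{row}", "cf1", "col", "%{data1}", "cf2", "coll", "%{data2}")}'
  exec config agent01 'tail("/tmp/test03")' 'agentSink("localhost",35853)'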


Any help is appreciated!

-- 
*Regards*,
Rahul Patodi

Re: Error in flume sink (data is getting copied again and again to hbase)

Posted by JS Jang <js...@gruter.com>.
Hello Rahul,

My apologies for my typo in your name in my previous reply.
How about trying a fanout to console to narrow down the cause and see
whether it comes from the agent or the collector? Say:

Try 1. tail("/tmp/test03") | [agentSink("localhost",35853), console]
to check for data duplication in the agent (or you can use "dump" in the
collector).

Try 2. {regexAll("(\\w+)\\t+(\\w+)\\t+(\\w+)", "row", "data1", "data2")
=> [hbase("ft02", "%{row}", "cf1", "col", "%{data1}", "cf2", "coll",
"%{data2}"), console]}
to check for data duplication before the hbase sink in the collector.
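
For example, applied through the flume shell, try 1 would be roughly
(assuming your agent's logical node is named agent01, which is just a
placeholder):

  exec config agent01 'tail("/tmp/test03")' '[agentSink("localhost",35853), console]'

If duplicates already show up on that console, the agent side is the
culprit; otherwise apply try 2 on the collector node in the same way.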

On 6/14/12 10:18 PM, Rahul Patodi wrote:
> {regexAll("(\\w+)\\t+(\\w+)\\t+(\\w+)", "row", "data1", "data2") =>
> hbase("ft02", "%{row}", "cf1", "col", "%{data1}", "cf2", "coll",
> "%{data2}")}


-- 
----------------------------
장정식 / jsjang@gruter.com
Gruter Inc., Principal, R&D Team
www.gruter.com
Cloud, Search and Social
----------------------------


Re: Error in flume sink (data is getting copied again and again to hbase)

Posted by Rahul Patodi <pa...@gmail.com>.
Thanks for the reply.
I am using flume-0.9.4-cdh3u3.
I am using the hbase() sink.
No node is getting restarted or reconfigured,
and there is no delay (as I am using a single node setup).

Still I am getting the same problem: data is continuously transmitted from
the source to the HBase table (I can see this on the console).





On Thu, Jun 7, 2012 at 7:40 PM, JS Jang <js...@gruter.com> wrote:

>  Hello Pahul,
>
> There can be several causes of data duplication, such as:
> 1. In End-to-End mode, delayed acks from the master can cause data to be
> re-transmitted by the agent.
> 2. The tail source in the agent node reads data from the start of the file
> unless you set the startFromEnd parameter to true; whenever the logical
> node is re-configured or restarted, it tails from the start of the file.
> If you use multiconfig, it may cause a "refreshAll" of all nodes the
> master manages.
> 3. There was a bug where the same logical node was sometimes started twice
> in the 0.9.4 github version, which was fixed in the latest version of CDH3.
>
> To test: for 1, you can try agentBESink instead of agentSink (which, as far
> as I know, is otherwise the same as agentE2ESink); for 2, you can try
> setting the startFromEnd parameter to true.
>
> Hope this is helpful.
>
> JS
>
>
> On 6/7/12 10:49 PM, Rahul Patodi wrote:
>
>  I have configured the flume hbase() sink
> and I have managed to get it working:
> data is getting copied from a file on the local hard disk to HBase.
>
> but
>
> *The same data is getting copied again and again to the HBase table (I saw
> this by using VERSIONS).*
> (I am not making any changes to the source file.)
>
> My configurations:
> Collector Source: collectorSource(35853)
> Collector Sink: {regexAll("(\\w+)\\t+(\\w+)\\t+(\\w+)", "row", "data1",
> "data2") => hbase("ft02", "%{row}", "cf1", "col", "%{data1}", "cf2",
> "coll", "%{data2}")}
>
> Agent Source: tail("/tmp/test03")
> Agent Sink: agentSink("localhost",35853)
>
>
> Any help is appreciated!
>
> --
> *Regards*,
> Rahul Patodi
>
>
>
> --
> ----------------------------
> 장정식 / jsjang@gruter.com
> Gruter Inc., Principal, R&D Team
> www.gruter.com
> Cloud, Search and Social
> ----------------------------
>
>

Re: Error in flume sink (data is getting copied again and again to hbase)

Posted by JS Jang <js...@gruter.com>.
Hello Pahul,

There can be several causes of data duplication, such as:
1. In End-to-End mode, delayed acks from the master can cause data to be
re-transmitted by the agent.
2. The tail source in the agent node reads data from the start of the file
unless you set the startFromEnd parameter to true; whenever the logical
node is re-configured or restarted, it tails from the start of the file.
If you use multiconfig, it may cause a "refreshAll" of all nodes the
master manages.
3. There was a bug where the same logical node was sometimes started twice
in the 0.9.4 github version, which was fixed in the latest version of CDH3.

To test: for 1, you can try agentBESink instead of agentSink (which, as far
as I know, is otherwise the same as agentE2ESink); for 2, you can try
setting the startFromEnd parameter to true.
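
For instance, you could combine both tests on the agent side with something
like this via the flume shell (agent01 being a placeholder for your agent's
logical node name, with the same file and port as in your configuration):

  exec config agent01 'tail("/tmp/test03", startFromEnd=true)' 'agentBESink("localhost",35853)'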

Hope this is helpful.

JS

On 6/7/12 10:49 PM, Rahul Patodi wrote:
> I have configured the flume hbase() sink
> and I have managed to get it working:
> data is getting copied from a file on the local hard disk to HBase.
>
> but
>
> *The same data is getting copied again and again to the HBase table (I saw
> this by using VERSIONS).*
> (I am not making any changes to the source file.)
>
> My configurations:
> Collector Source: collectorSource(35853)
> Collector Sink: {regexAll("(\\w+)\\t+(\\w+)\\t+(\\w+)", "row", 
> "data1", "data2") => hbase("ft02", "%{row}", "cf1", "col", "%{data1}", 
> "cf2", "coll", "%{data2}")}
>
> Agent Source: tail("/tmp/test03")
> Agent Sink: agentSink("localhost",35853)
>
>
> Any help is appreciated!
>
> -- 
> *Regards*,
> Rahul Patodi
>


-- 
----------------------------
장정식 / jsjang@gruter.com
Gruter Inc., Principal, R&D Team
www.gruter.com
Cloud, Search and Social
----------------------------