Posted to user@hadoop.apache.org by Uthayan Suthakar <ut...@gmail.com> on 2015/01/26 13:17:48 UTC

MapReduce job is not picking up appended data.

I have a Flume agent that streams data into an HDFS sink (appending to the
same file), and I can "hdfs dfs -cat" the file and see the data in HDFS.
However, when I run a MapReduce job on the folder that contains the appended
data, it only picks up the first batch that was flushed into HDFS
(batchSize = 100). The rest is not picked up, although I can cat the file and
see it. When I execute the MapReduce job after the file is rolled (closed),
it picks up all of the data.

Do you know why the MR job fails to find the rest of the batches even though
they exist?

So this is what I'm trying to do:

1) Read a constant flow of data from a message queue and write it into HDFS.
2) Rolling is configured by interval (1 hour), e.g. hdfs.rollInterval = 3600.
3) The number of events written to the file before flushing to HDFS is set to
100, e.g. hdfs.batchSize = 100.
4) Append support is enabled at the lower level, e.g. hdfs.append.support =
true (see the configuration sketch below).
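
For reference, a minimal sink configuration along these lines would look
roughly as follows (the agent and sink names and the path are illustrative;
rollCount and rollSize set to 0 are an assumption to disable count- and
size-based rolling; only rollInterval and batchSize come from the setup
above):

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://test/data/input
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.batchSize = 100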

Snippets from Flume source:

  if (conf.getBoolean("hdfs.append.support", false) == true
      && hdfs.isFile(dstPath)) {
    outStream = hdfs.append(dstPath);
  } else {
    outStream = hdfs.create(dstPath);
  }

5) Now all of the configuration for appending data into HDFS is in place.
6) I tested Flume and could see an hdfs://test/data/input/event1.tmp file
being written into HDFS.
7) When I run hdfs dfs -cat hdfs://test/data/input/event1.tmp, I can see all
of the data that has been appended to the file, e.g. 500+ events.
8) However, when I executed a simple MR job to read the folder
hdfs://test/data/input, it only picked up the first 100 events, although the
file had over 500 events.

So it would appear that Flume is in fact appending data into HDFS, but the MR
job fails to pick up everything; perhaps a block caching or partitioning
issue? Has anyone come across this?
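
One way to narrow this down: FileInputFormat computes its splits from the file
length in the FileStatus it gets from the NameNode, and for a file that is
still open for write that length is generally only updated when a block
completes, when the writer hsyncs with the update-length flag, or when the
file is closed. Flushed data can therefore be readable with cat yet invisible
to the job. A minimal sketch that compares the two lengths, assuming standard
Hadoop 2.x client APIs and the example path above:

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VisibleLengthCheck {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path file = new Path("hdfs://test/data/input/event1.tmp");
    FileSystem fs = file.getFileSystem(conf);

    // Length recorded by the NameNode -- this is what FileInputFormat
    // bases its input splits on.
    long reportedLen = fs.getFileStatus(file).getLen();

    // Bytes actually readable, including data flushed to the DataNodes of the
    // still-open last block (roughly what 'hdfs dfs -cat' sees).
    long readableLen = 0;
    try (InputStream in = fs.open(file)) {
      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = in.read(buf)) > 0) {
        readableLen += n;
      }
    }

    System.out.println("NameNode-reported length: " + reportedLen);
    System.out.println("Readable length:          " + readableLen);
    // If readableLen > reportedLen, an MR job over this file will only read
    // up to reportedLen until the file is rolled.
  }
}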

Re: MapReduce job is not picking up appended data.

Posted by Uthayan Suthakar <ut...@gmail.com>.
Azuryy, I'm pretty sure that I can 'cat' the data. Please see the evidence
below:

(1)
>>>Flume.conf:
a1.sinks.k1.hdfs.rollInterval=3600
a1.sinks.k1.hdfs.batchSize = 10


>>>I sent 21 events and could verify them with 'cat':
$ hdfs dfs -cat
/user/mon/input/flume/test/15-01-27/data.2015.01.27.13.1422361100490.tmp |
wc -l
21

>>>But when I submitted a MapReduce job on the above directory, it only picked
up 11 records (batchSize is 10, but it always processes one event more than
the batch size):
Map-Reduce Framework:
Map input records=11


(2)
>>>I then sent 9 more events and could see that they had been appended to the
file.
$ hdfs dfs -cat
/user/wdtmon/atlas_xrd_mon/input/flume/test/15-01-27/data.2015.01.27.13.1422361100490.tmp
| wc -l
30

>>>However, when I executed the MapReduce job on the file, it still picked up
only those 11 events.
Map-Reduce Framework:
Map input records=11
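
For comparison, the length that the NameNode currently records for the
still-open file can be checked with ls; while the file is open this is
typically smaller than what cat returns, and it is the length the MapReduce
splits are based on:

$ hdfs dfs -ls /user/mon/input/flume/test/15-01-27/data.2015.01.27.13.1422361100490.tmp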


Any idea what's going on?


On 27 January 2015 at 08:30, Azuryy Yu <az...@gmail.com> wrote:

> Are you sure you can 'cat' the latest batch of the data on HDFS?
> For Flume, the data is available only after the file is rolled, because
> Flume only calls FileSystem.close() when it rolls the file.

Re: MapReduce job is not picking up appended data.

Posted by Azuryy Yu <az...@gmail.com>.
Are you sure you can 'cat' the latest batch of the data on HDFS?
For Flume, the data is available only after the file is rolled, because Flume
only calls FileSystem.close() when it rolls the file.
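
The underlying distinction is between flushing data so that new readers can
see it and updating the file length at the NameNode. A minimal sketch of that
distinction, assuming the Hadoop 2.x HDFS client API (the path and payload are
made up for illustration):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream.SyncFlag;

public class FlushVsLength {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path file = new Path("/tmp/flush-vs-length.txt"); // illustrative path
    FileSystem fs = file.getFileSystem(conf);

    FSDataOutputStream out = fs.create(file, true);
    out.write("first batch\n".getBytes(StandardCharsets.UTF_8));

    // hflush(): the data becomes visible to new readers (e.g. hdfs dfs -cat),
    // but the file length recorded by the NameNode is not updated, so anything
    // relying on getFileStatus().getLen() still sees the old length.
    out.hflush();
    System.out.println("after hflush: " + fs.getFileStatus(file).getLen());

    out.write("second batch\n".getBytes(StandardCharsets.UTF_8));

    // hsync with UPDATE_LENGTH flushes and also updates the length at the
    // NameNode (only available when the stream is an HdfsDataOutputStream).
    if (out instanceof HdfsDataOutputStream) {
      ((HdfsDataOutputStream) out).hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
    }
    System.out.println("after hsync:  " + fs.getFileStatus(file).getLen());

    // close() finalizes the file; only then is the full length always visible.
    out.close();
    System.out.println("after close:  " + fs.getFileStatus(file).getLen());
  }
}

If Flume only hflushes between batches and only closes the file when it rolls,
the appended events would be cat-able but not reflected in the length the MR
job splits on until the roll, which matches the behaviour described above.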

