Posted to user@flume.apache.org by Bertrand Dechoux <de...@gmail.com> on 2014/07/23 08:00:35 UTC

Re: Skippin those gost darn 0 byte diles

The best would be to get hold of a Flume developer. I am not strictly
sure of all the differences between sync/flush/hsync/hflush across the
different Hadoop versions. It might be the case that you are only flushing
on the client side. Even if it were a clean strategy, creation+flush is
unlikely to be an atomic operation.
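
As a rough, untested sketch of the distinction (Hadoop 2.x FileSystem API;
the path is just an example): flush() is client-side only, while
hflush()/hsync() push data to the datanodes. Note also that, if I read the
source correctly, SequenceFile.Writer.sync() only writes a sync marker into
the stream, so it is not a durability flush by itself.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FlushSemanticsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/tmp/flush-demo"));
    try {
      out.writeBytes("some record\n");
      out.flush();   // client-side only: bytes may still sit in the local buffer
      out.hflush();  // pushes bytes to the datanodes; new readers can see them,
                     // but -ls may still report a stale (possibly 0-byte) length
      out.hsync();   // additionally asks the datanodes to sync to disk
    } finally {
      out.close();   // close() completes the file and fixes the reported length
    }
  }
}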

It is worth testing the read of an empty sequence file (both truly empty and
with only a header). It should be quite easy with a unit test. A solution
would indeed be to validate the behaviour of SequenceFile.Reader / the
InputFormat on these edge cases. But nothing guarantees that you won't have a
record split between two HDFS blocks. That would mean that, during the write,
only the first block is visible and therefore only part of the record. It
would be normal for the reader to fail in that case. You could tweak the
MapReduce bad-record skipping, but that feels like hacking around a system
whose design is wrong from the beginning.
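
A minimal sketch of such a test (untested; assumes Hadoop 2.x and JUnit 4 on
the classpath, and the paths are just examples). A header-only file should
simply yield no records, while a truly 0-byte file should fail while the
reader parses the header:

import static org.junit.Assert.assertFalse;

import java.io.EOFException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.junit.Test;

public class EmptySequenceFileTest {

  @Test
  public void headerOnlyFileYieldsNoRecords() throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path p = new Path("target/header-only.seq");

    // A writer that is closed immediately leaves only the header on disk.
    SequenceFile.createWriter(fs, conf, p, Text.class, Text.class).close();

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, p, conf);
    assertFalse(reader.next(new Text(), new Text()));  // no records, no exception
    reader.close();
  }

  @Test(expected = EOFException.class)
  public void trulyEmptyFileFailsWhileReadingTheHeader() throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path p = new Path("target/zero-bytes.seq");

    fs.create(p).close();  // 0 bytes: not even a SequenceFile header

    new SequenceFile.Reader(fs, p, conf);  // header parsing blows up here
  }
}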

Anyway, a solution (seen in Flume, if I remember correctly) is having a good
file name strategy. For example, all new files end in ".open", and the suffix
is removed only once they are finished. Then, for processing, you only target
the finished files.

For Hive, you might need to adapt the strategy a bit, because Hive may not
be able to target only files with a specific name (you are the expert). A
simple move of the file from a temporary directory to the table directory
would have the same effect, because from the point of view of HDFS it is
the same operation: a metadata change only.
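
An untested sketch of that rename-into-place idea (Hadoop 2.x API; the paths
and class name are just examples). Because the rename is a metadata-only
operation, readers see either the whole closed file or nothing:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RenameIntoPlace {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write under a name (or temporary directory) that readers never target...
    Path open = new Path("/user/beacon/_tmp/part-0001.seq.open");
    Path done = new Path("/user/beacon/2014072117/part-0001.seq");

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, open, Text.class, Text.class);
    writer.append(new Text("key"), new Text("value"));
    writer.close();

    // ...then publish it with a single rename once it is finished.
    if (!fs.rename(open, done)) {
      throw new java.io.IOException("rename failed: " + open + " -> " + done);
    }
  }
}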

Bertrand Dechoux


On Wed, Jul 23, 2014 at 12:16 AM, Edward Capriolo <ed...@gmail.com>
wrote:

> Here is the stack trace...
>
>  Caused by: java.io.EOFException
>   at java.io.DataInputStream.readByte(DataInputStream.java:267)
>   at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
>   at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
>   at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:2072)
>   at org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:2139)
>   at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2214)
>   at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:109)
>   at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:84)
>   at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
>   ... 15 more
>
>
>
>
> On Tue, Jul 22, 2014 at 6:14 PM, Edward Capriolo <ed...@gmail.com>
> wrote:
>
>> Currently using:
>>
>>     <dependency>
>>             <groupId>org.apache.hadoop</groupId>
>>             <artifactId>hadoop-hdfs</artifactId>
>>             <version>2.3.0</version>
>>         </dependency>
>>
>>
>> I have this piece of code that creates the writer:
>>
>> writer = SequenceFile.createWriter(fs, conf, p, Text.class, Text.class,
>> CompressionType.BLOCK, codec);
>>
>> Then I have a piece of code like this...
>>
>>   public static final long SYNC_EVERY_LINES = 1000;
>>  if (meta.getLinesWritten() % SYNC_EVERY_LINES == 0){
>>         meta.getWriter().sync();
>>       }
>>
>>
>> And I commonly see:
>>
>> [ecapriolo@staging-hadoop-cdh-67-14 ~]$ hadoop dfs -ls  /user/beacon/
>> 2014072117
>> DEPRECATED: Use of this script to execute hdfs command is deprecated.
>> Instead use the hdfs command for it.
>>
>> Found 12 items
>> -rw-r--r--   3 service-igor supergroup    1065682 2014-07-21 17:50
>> /user/beacon/2014072117/0bb6cd71-70ac-405a-a8b7-b8caf9af8da1
>> -rw-r--r--   3 service-igor supergroup    1029041 2014-07-21 17:40
>> /user/beacon/2014072117/1b0ef6b3-bd51-4100-9d4b-1cecdd565f93
>> -rw-r--r--   3 service-igor supergroup    1002096 2014-07-21 17:10
>> /user/beacon/2014072117/34e2acb4-2054-44df-bbf7-a4ce7f1e5d1b
>> -rw-r--r--   3 service-igor supergroup    1028450 2014-07-21 17:30
>> /user/beacon/2014072117/41c7aa62-d27f-4d53-bed8-df2fb5803c92
>> -rw-r--r--   3 service-igor supergroup          0 2014-07-21 17:50
>> /user/beacon/2014072117/5450f246-7623-4bbd-8c97-8176a0c30351
>> -rw-r--r--   3 service-igor supergroup    1084873 2014-07-21 17:30
>> /user/beacon/2014072117/8b36fbca-6f5b-48a3-be3c-6df6254c3db2
>> -rw-r--r--   3 service-igor supergroup    1043108 2014-07-21 17:20
>> /user/beacon/2014072117/949da11a-247b-4992-b13a-5e6ce7e51e9b
>> -rw-r--r--   3 service-igor supergroup     986866 2014-07-21 17:10
>> /user/beacon/2014072117/979bba76-4d2e-423f-92f6-031bc41f6fbd
>> -rw-r--r--   3 service-igor supergroup          0 2014-07-21 17:50
>> /user/beacon/2014072117/b76db189-054f-4dac-84a4-a65f39a6c1a9
>> -rw-r--r--   3 service-igor supergroup    1040931 2014-07-21 17:50
>> /user/beacon/2014072117/bba6a677-226c-4982-8fb2-4b136108baf1
>> -rw-r--r--   3 service-igor supergroup    1012137 2014-07-21 17:40
>> /user/beacon/2014072117/be940202-f085-45bb-ac84-51ece2e1ba47
>> -rw-r--r--   3 service-igor supergroup    1028467 2014-07-21 17:20
>> /user/beacon/2014072117/c336e0c8-76e7-40e7-98e2-9f529f25577b
>>
>> Sometimes, even though they show as 0 bytes, you can read data from them.
>> Sometimes it blows up with a stack trace that I have since lost.
>>
>>
>> On Tue, Jul 22, 2014 at 5:45 PM, Bertrand Dechoux <de...@gmail.com>
>> wrote:
>>
>>> I looked at the source out of curiosity: for the latest version (2.4), the
>>> header is flushed during writer creation, and of course the key/value
>>> classes are provided at that point. By 0 bytes, do you really mean without
>>> even the header, or 0 bytes of payload?
>>>
>>>
>>> On Tue, Jul 22, 2014 at 11:05 PM, Bertrand Dechoux <de...@gmail.com>
>>> wrote:
>>>
>>>> The header is expected to contain the full names of the key class and value
>>>> class, so if those are only known once the first record arrives (?), the
>>>> file indeed cannot respect its own format.
>>>>
>>>> I haven't tried it, but LazyOutputFormat should solve your problem.
>>>>
>>>> https://hadoop.apache.org/docs/current/api/index.html?org/apache/hadoop/mapred/lib/LazyOutputFormat.html
>>>>
>>>> Regards
>>>>
>>>> Bertrand Dechoux
>>>>
>>>>
>>>> Bertrand Dechoux
>>>>
>>>>
>>>> On Tue, Jul 22, 2014 at 10:39 PM, Edward Capriolo <
>>>> edlinuxguru@gmail.com> wrote:
>>>>
>>>>> I have two processes: one that writes sequence files directly to HDFS,
>>>>> and a Hive table that reads those files.
>>>>>
>>>>> All works well, except that I am only flushing the files
>>>>> periodically, and the SequenceFile input format gets angry when it
>>>>> encounters 0-byte seq files.
>>>>>
>>>>> I was considering a flush and sync on the first record write. I was also
>>>>> thinking I should just be able to hack the sequence file input format to
>>>>> skip 0-byte files and not throw an exception on readFully(), which it
>>>>> sometimes does.
>>>>>
>>>>> Anyone ever tackled this?
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Skippin those gost darn 0 byte diles

Posted by Bertrand Dechoux <de...@gmail.com>.
For reference : https://issues.apache.org/jira/browse/SPARK-1960
(which seems highly related)

I don't know if anything is tracked on the Hadoop/MapReduce side.

Bertrand Dechoux


On Wed, Jul 23, 2014 at 5:15 PM, Edward Capriolo <ed...@gmail.com>
wrote:

> Anyway, a solution (seen in Flume, if I remember correctly) is having a good
> file name strategy. For example, all new files end in ".open", and the suffix
> is removed only once they are finished. Then, for processing, you only target
> the finished files.
>
> I am not sure this will help. The sequence file reader will still try to
> open the file regardless of its name.
>
> For Hive, you might need to adapt the strategy a bit, because Hive may not
> be able to target only files with a specific name (you are the expert). A
> simple move of the file from a temporary directory to the table directory
> would have the same effect, because from the point of view of HDFS it is
> the same operation: a metadata change only.
>
> I would like to consider a file as soon as there is reasonable data in
> it. If I have to rename/move files, I will not be able to see the data
> until the file is renamed or moved in. (I am building files for N minutes
> before closing them.) The problem only happens with 0-byte files; files
> currently being written work fine.
>
> It seems like the split calculation could throw away 0-byte files before
> we ever get down to the record reader and parsing the header. An
> interesting thing is that even though dfs -ls shows the files as 0
> bytes, sometimes I can dfs -text these 0-byte files and they actually
> have data! Sometimes when I dfs -text them I get the exception attached!
>
> So it is interesting that the semantics here are not obvious. Can we
> map-reduce a file that is still being written? How does that work? It would
> be nice to understand the semantics here.
>
>
>
>
>
>
>
>
> On Wed, Jul 23, 2014 at 2:00 AM, Bertrand Dechoux <de...@gmail.com>
> wrote:
>
>> The best would be to get a hold on a Flume developer. I am not strictly
>> sure of all the differences between sync/flush/hsync/hflush and the
>> different hadoop versions. It might be the case that you are only flushing
>> on the client side. Even if it was a clean strategy, creation+flush is
>> unlikely to be an atomic operation.
>>
>> It is worth testing the read of an empty sequence file (real empty and
>> with only header). It should be quite easy with a unit test. A solution
>> would indeed to validate the behaviour of SequenceFileReader / InputFormat
>> on edge cases. But nothing guarantee you that you won't have a record split
>> between two HDFS blocks. This implies that during the writing only the
>> first block is visible and only a part of the record. It would be normal
>> for the reader to fail on that case. You could tweak mapreduce bad records
>> skipping but that feels like hacking a system where the design is wrong
>> from the beginning.
>>
>> Anyway, a solution (seen in Flume if I remember correctly) is having a
>> good file name strategy. For exemple, all new files should end in ".open"
>> and only when they are finished the suffix is removed. Then for processing,
>> you only target the latter.
>>
>> For Hive, you might need to adapt the strategy a bit because Hive may not
>> be able to target only files with a specific name (you are the expert). A
>> simple move of the file from a temporary directory to the table directory
>> would have the same effect (because from the point of view of HDFS, it's
>> the same operation : metadata change only).
>>
>> Bertrand Dechoux
>>
>>
>> On Wed, Jul 23, 2014 at 12:16 AM, Edward Capriolo <ed...@gmail.com>
>> wrote:
>>
>>> Here is the stack trace...
>>>
>>>  Caused by: java.io.EOFException
>>>   at java.io.DataInputStream.readByte(DataInputStream.java:267)
>>>   at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
>>>   at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
>>>   at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:2072)
>>>   at org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:2139)
>>>   at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2214)
>>>   at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:109)
>>>   at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:84)
>>>   at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
>>>   ... 15 more
>>>
>>>
>>>
>>>
>>> On Tue, Jul 22, 2014 at 6:14 PM, Edward Capriolo <ed...@gmail.com>
>>> wrote:
>>>
>>>> Currently using:
>>>>
>>>>     <dependency>
>>>>             <groupId>org.apache.hadoop</groupId>
>>>>             <artifactId>hadoop-hdfs</artifactId>
>>>>             <version>2.3.0</version>
>>>>         </dependency>
>>>>
>>>>
>>>> I have this piece of code that does.
>>>>
>>>> writer = SequenceFile.createWriter(fs, conf, p, Text.class, Text.class,
>>>> CompressionType.BLOCK, codec);
>>>>
>>>> Then I have a piece of code like this...
>>>>
>>>>   public static final long SYNC_EVERY_LINES = 1000;
>>>>  if (meta.getLinesWritten() % SYNC_EVERY_LINES == 0){
>>>>         meta.getWriter().sync();
>>>>       }
>>>>
>>>>
>>>> And I commonly see:
>>>>
>>>> [ecapriolo@staging-hadoop-cdh-67-14 ~]$ hadoop dfs -ls  /user/beacon/
>>>> 2014072117
>>>> DEPRECATED: Use of this script to execute hdfs command is deprecated.
>>>> Instead use the hdfs command for it.
>>>>
>>>> Found 12 items
>>>> -rw-r--r--   3 service-igor supergroup    1065682 2014-07-21 17:50
>>>> /user/beacon/2014072117/0bb6cd71-70ac-405a-a8b7-b8caf9af8da1
>>>> -rw-r--r--   3 service-igor supergroup    1029041 2014-07-21 17:40
>>>> /user/beacon/2014072117/1b0ef6b3-bd51-4100-9d4b-1cecdd565f93
>>>> -rw-r--r--   3 service-igor supergroup    1002096 2014-07-21 17:10
>>>> /user/beacon/2014072117/34e2acb4-2054-44df-bbf7-a4ce7f1e5d1b
>>>> -rw-r--r--   3 service-igor supergroup    1028450 2014-07-21 17:30
>>>> /user/beacon/2014072117/41c7aa62-d27f-4d53-bed8-df2fb5803c92
>>>> -rw-r--r--   3 service-igor supergroup          0 2014-07-21 17:50
>>>> /user/beacon/2014072117/5450f246-7623-4bbd-8c97-8176a0c30351
>>>> -rw-r--r--   3 service-igor supergroup    1084873 2014-07-21 17:30
>>>> /user/beacon/2014072117/8b36fbca-6f5b-48a3-be3c-6df6254c3db2
>>>> -rw-r--r--   3 service-igor supergroup    1043108 2014-07-21 17:20
>>>> /user/beacon/2014072117/949da11a-247b-4992-b13a-5e6ce7e51e9b
>>>> -rw-r--r--   3 service-igor supergroup     986866 2014-07-21 17:10
>>>> /user/beacon/2014072117/979bba76-4d2e-423f-92f6-031bc41f6fbd
>>>> -rw-r--r--   3 service-igor supergroup          0 2014-07-21 17:50
>>>> /user/beacon/2014072117/b76db189-054f-4dac-84a4-a65f39a6c1a9
>>>> -rw-r--r--   3 service-igor supergroup    1040931 2014-07-21 17:50
>>>> /user/beacon/2014072117/bba6a677-226c-4982-8fb2-4b136108baf1
>>>> -rw-r--r--   3 service-igor supergroup    1012137 2014-07-21 17:40
>>>> /user/beacon/2014072117/be940202-f085-45bb-ac84-51ece2e1ba47
>>>> -rw-r--r--   3 service-igor supergroup    1028467 2014-07-21 17:20
>>>> /user/beacon/2014072117/c336e0c8-76e7-40e7-98e2-9f529f25577b
>>>>
>>>> Sometimes even though they show as 0 bytes you can read data from them.
>>>> Sometimes it blows up with a stack trace I have lost.
>>>>
>>>>
>>>> On Tue, Jul 22, 2014 at 5:45 PM, Bertrand Dechoux <de...@gmail.com>
>>>> wrote:
>>>>
>>>>> I looked at the source by curiosity, for the latest version (2.4), the
>>>>> header is flushed during the writer creation. Of course, key/value classes
>>>>> are provided. By 0-bytes, you really mean even without the header? Or 0
>>>>> bytes of payload?
>>>>>
>>>>>
>>>>> On Tue, Jul 22, 2014 at 11:05 PM, Bertrand Dechoux <dechouxb@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> The header is expected to have the full name of the key class and
>>>>>> value class so if it is only detected with the first record (?) indeed the
>>>>>> file can not respect its own format.
>>>>>>
>>>>>> I haven't tried it but LazyOutputFormat should solve your problem.
>>>>>>
>>>>>> https://hadoop.apache.org/docs/current/api/index.html?org/apache/hadoop/mapred/lib/LazyOutputFormat.html
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Bertrand Dechoux
>>>>>>
>>>>>>
>>>>>> Bertrand Dechoux
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 22, 2014 at 10:39 PM, Edward Capriolo <
>>>>>> edlinuxguru@gmail.com> wrote:
>>>>>>
>>>>>>> I have two processes. One that writes sequence files directly to
>>>>>>> hdfs, the other that is a hive table that reads these files.
>>>>>>>
>>>>>>> All works well with the exception that I am only flushing the files
>>>>>>> periodically. SequenceFile input format gets angry when it encounters
>>>>>>> 0-bytes seq files.
>>>>>>>
>>>>>>> I was considering flush and sync on first record write. Also was
>>>>>>> thinking should just be able to hack sequence file input format to skip 0
>>>>>>> byte files and not throw exception on readFully() which it sometimes does.
>>>>>>>
>>>>>>> Anyone ever tackled this?
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Skippin those gost darn 0 byte diles

Posted by Edward Capriolo <ed...@gmail.com>.
Anyway, a solution (seen in Flume, if I remember correctly) is having a good
file name strategy. For example, all new files end in ".open", and the suffix
is removed only once they are finished. Then, for processing, you only target
the finished files.

I am not sure this will help. The sequence file reader will still try to
open the file regardless of its name.

For Hive, you might need to adapt the strategy a bit, because Hive may not
be able to target only files with a specific name (you are the expert). A
simple move of the file from a temporary directory to the table directory
would have the same effect, because from the point of view of HDFS it is
the same operation: a metadata change only.

I would like to consider a file as soon as there is reasonable data in
it. If I have to rename/move files, I will not be able to see the data
until the file is renamed or moved in. (I am building files for N minutes
before closing them.) The problem only happens with 0-byte files; files
currently being written work fine.

It seems like the split calculation could throw away 0-byte files before we
ever get down to the record reader and parsing the header. An interesting
thing is that even though dfs -ls shows the files as 0 bytes, sometimes I
can dfs -text these 0-byte files and they actually have data! Sometimes
when I dfs -text them I get the exception attached!
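
For illustration, an untested sketch of what I have in mind, using the old
mapred API (the class name is made up; a Hive table could point at it with
STORED AS INPUTFORMAT ... OUTPUTFORMAT ..., I believe):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class NonEmptySequenceFileInputFormat
    extends SequenceFileInputFormat<Text, Text> {

  // Drop 0-byte files while the splits are computed, before any
  // record reader tries to parse a SequenceFile header.
  @Override
  protected FileStatus[] listStatus(JobConf job) throws IOException {
    List<FileStatus> nonEmpty = new ArrayList<FileStatus>();
    for (FileStatus status : super.listStatus(job)) {
      if (status.getLen() > 0) {
        nonEmpty.add(status);
      }
    }
    return nonEmpty.toArray(new FileStatus[nonEmpty.size()]);
  }
}

One caveat: as noted above, a file that is still being written can report a
stale 0-byte length while actually containing data, so this would skip those
files as well until their length is updated.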

So it is interesting that the semantics here are not obvious. Can we
map-reduce a file that is still being written? How does that work? It would
be nice to understand the semantics here.








On Wed, Jul 23, 2014 at 2:00 AM, Bertrand Dechoux <de...@gmail.com>
wrote:

> The best would be to get a hold on a Flume developer. I am not strictly
> sure of all the differences between sync/flush/hsync/hflush and the
> different hadoop versions. It might be the case that you are only flushing
> on the client side. Even if it was a clean strategy, creation+flush is
> unlikely to be an atomic operation.
>
> It is worth testing the read of an empty sequence file (real empty and
> with only header). It should be quite easy with a unit test. A solution
> would indeed to validate the behaviour of SequenceFileReader / InputFormat
> on edge cases. But nothing guarantee you that you won't have a record split
> between two HDFS blocks. This implies that during the writing only the
> first block is visible and only a part of the record. It would be normal
> for the reader to fail on that case. You could tweak mapreduce bad records
> skipping but that feels like hacking a system where the design is wrong
> from the beginning.
>
> Anyway, a solution (seen in Flume if I remember correctly) is having a
> good file name strategy. For exemple, all new files should end in ".open"
> and only when they are finished the suffix is removed. Then for processing,
> you only target the latter.
>
> For Hive, you might need to adapt the strategy a bit because Hive may not
> be able to target only files with a specific name (you are the expert). A
> simple move of the file from a temporary directory to the table directory
> would have the same effect (because from the point of view of HDFS, it's
> the same operation : metadata change only).
>
> Bertrand Dechoux
>
>
> On Wed, Jul 23, 2014 at 12:16 AM, Edward Capriolo <ed...@gmail.com>
> wrote:
>
>> Here is the stack trace...
>>
>>  Caused by: java.io.EOFException
>>   at java.io.DataInputStream.readByte(DataInputStream.java:267)
>>   at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
>>   at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
>>   at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:2072)
>>   at org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:2139)
>>   at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2214)
>>   at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:109)
>>   at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:84)
>>   at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
>>   ... 15 more
>>
>>
>>
>>
>> On Tue, Jul 22, 2014 at 6:14 PM, Edward Capriolo <ed...@gmail.com>
>> wrote:
>>
>>> Currently using:
>>>
>>>     <dependency>
>>>             <groupId>org.apache.hadoop</groupId>
>>>             <artifactId>hadoop-hdfs</artifactId>
>>>             <version>2.3.0</version>
>>>         </dependency>
>>>
>>>
>>> I have this piece of code that does.
>>>
>>> writer = SequenceFile.createWriter(fs, conf, p, Text.class, Text.class,
>>> CompressionType.BLOCK, codec);
>>>
>>> Then I have a piece of code like this...
>>>
>>>   public static final long SYNC_EVERY_LINES = 1000;
>>>  if (meta.getLinesWritten() % SYNC_EVERY_LINES == 0){
>>>         meta.getWriter().sync();
>>>       }
>>>
>>>
>>> And I commonly see:
>>>
>>> [ecapriolo@staging-hadoop-cdh-67-14 ~]$ hadoop dfs -ls  /user/beacon/
>>> 2014072117
>>> DEPRECATED: Use of this script to execute hdfs command is deprecated.
>>> Instead use the hdfs command for it.
>>>
>>> Found 12 items
>>> -rw-r--r--   3 service-igor supergroup    1065682 2014-07-21 17:50
>>> /user/beacon/2014072117/0bb6cd71-70ac-405a-a8b7-b8caf9af8da1
>>> -rw-r--r--   3 service-igor supergroup    1029041 2014-07-21 17:40
>>> /user/beacon/2014072117/1b0ef6b3-bd51-4100-9d4b-1cecdd565f93
>>> -rw-r--r--   3 service-igor supergroup    1002096 2014-07-21 17:10
>>> /user/beacon/2014072117/34e2acb4-2054-44df-bbf7-a4ce7f1e5d1b
>>> -rw-r--r--   3 service-igor supergroup    1028450 2014-07-21 17:30
>>> /user/beacon/2014072117/41c7aa62-d27f-4d53-bed8-df2fb5803c92
>>> -rw-r--r--   3 service-igor supergroup          0 2014-07-21 17:50
>>> /user/beacon/2014072117/5450f246-7623-4bbd-8c97-8176a0c30351
>>> -rw-r--r--   3 service-igor supergroup    1084873 2014-07-21 17:30
>>> /user/beacon/2014072117/8b36fbca-6f5b-48a3-be3c-6df6254c3db2
>>> -rw-r--r--   3 service-igor supergroup    1043108 2014-07-21 17:20
>>> /user/beacon/2014072117/949da11a-247b-4992-b13a-5e6ce7e51e9b
>>> -rw-r--r--   3 service-igor supergroup     986866 2014-07-21 17:10
>>> /user/beacon/2014072117/979bba76-4d2e-423f-92f6-031bc41f6fbd
>>> -rw-r--r--   3 service-igor supergroup          0 2014-07-21 17:50
>>> /user/beacon/2014072117/b76db189-054f-4dac-84a4-a65f39a6c1a9
>>> -rw-r--r--   3 service-igor supergroup    1040931 2014-07-21 17:50
>>> /user/beacon/2014072117/bba6a677-226c-4982-8fb2-4b136108baf1
>>> -rw-r--r--   3 service-igor supergroup    1012137 2014-07-21 17:40
>>> /user/beacon/2014072117/be940202-f085-45bb-ac84-51ece2e1ba47
>>> -rw-r--r--   3 service-igor supergroup    1028467 2014-07-21 17:20
>>> /user/beacon/2014072117/c336e0c8-76e7-40e7-98e2-9f529f25577b
>>>
>>> Sometimes even though they show as 0 bytes you can read data from them.
>>> Sometimes it blows up with a stack trace I have lost.
>>>
>>>
>>> On Tue, Jul 22, 2014 at 5:45 PM, Bertrand Dechoux <de...@gmail.com>
>>> wrote:
>>>
>>>> I looked at the source by curiosity, for the latest version (2.4), the
>>>> header is flushed during the writer creation. Of course, key/value classes
>>>> are provided. By 0-bytes, you really mean even without the header? Or 0
>>>> bytes of payload?
>>>>
>>>>
>>>> On Tue, Jul 22, 2014 at 11:05 PM, Bertrand Dechoux <de...@gmail.com>
>>>> wrote:
>>>>
>>>>> The header is expected to have the full name of the key class and
>>>>> value class so if it is only detected with the first record (?) indeed the
>>>>> file can not respect its own format.
>>>>>
>>>>> I haven't tried it but LazyOutputFormat should solve your problem.
>>>>>
>>>>> https://hadoop.apache.org/docs/current/api/index.html?org/apache/hadoop/mapred/lib/LazyOutputFormat.html
>>>>>
>>>>> Regards
>>>>>
>>>>> Bertrand Dechoux
>>>>>
>>>>>
>>>>> Bertrand Dechoux
>>>>>
>>>>>
>>>>> On Tue, Jul 22, 2014 at 10:39 PM, Edward Capriolo <
>>>>> edlinuxguru@gmail.com> wrote:
>>>>>
>>>>>> I have two processes. One that writes sequence files directly to
>>>>>> hdfs, the other that is a hive table that reads these files.
>>>>>>
>>>>>> All works well with the exception that I am only flushing the files
>>>>>> periodically. SequenceFile input format gets angry when it encounters
>>>>>> 0-bytes seq files.
>>>>>>
>>>>>> I was considering flush and sync on first record write. Also was
>>>>>> thinking should just be able to hack sequence file input format to skip 0
>>>>>> byte files and not throw exception on readFully() which it sometimes does.
>>>>>>
>>>>>> Anyone ever tackled this?
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

>>>>>> hdfs, the other that is a hive table that reads these files.
>>>>>>
>>>>>> All works well with the exception that I am only flushing the files
>>>>>> periodically. SequenceFile input format gets angry when it encounters
>>>>>> 0-bytes seq files.
>>>>>>
>>>>>> I was considering flush and sync on first record write. Also was
>>>>>> thinking should just be able to hack sequence file input format to skip 0
>>>>>> byte files and not throw exception on readFully() which it sometimes does.
>>>>>>
>>>>>> Anyone ever tackled this?
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
