Posted to common-user@hadoop.apache.org by Dan Buchan <da...@rentify.com> on 2014/02/13 19:09:09 UTC

har file globbing problem

We have a dataset of ~8 million files, about 0.5 to 2 MB each, and we're
having trouble getting them analysed after building a HAR file.

The files are already in a pre-existing directory structure: two nested
levels of directories, with 20-100 PDFs at the bottom of each leaf of the
tree.

/user/hadoop/all_the_files/*/*/*.pdf

It was trivial to move these to HDFS and to build a HAR archive; I used the
following command to make the archive:

bin/hadoop archive -archiveName test.har -p /user/hadoop/ all_the_files/*/*/ /user/hadoop/

Listing the contents of the HAR (bin/hadoop fs -lsr
har:///user/hadoop/test.har) shows everything as I'd expect.

When we come to run the Hadoop job with this command, trying to wildcard
into the archive:

bin/hadoop jar My.jar har:///user/hadoop/test.har/all_the_files/*/*/ output

it fails with the following exception:

    Exception in thread "main" java.lang.IllegalArgumentException: Can not
create a Path from an empty string
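
As far as I understand it, the glob is expanded at submission time via
FileSystem.globStatus, so the sketch below is the kind of standalone check I
have in mind to see whether the pattern matches anything inside the archive
at all. The class name and hard-coded paths are purely illustrative, not our
real code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HarGlobCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same pattern the job is given, expanded by hand against the
        // har:// filesystem
        Path pattern = new Path("har:///user/hadoop/test.har/all_the_files/*/*");
        FileSystem fs = pattern.getFileSystem(conf);
        FileStatus[] matches = fs.globStatus(pattern);
        if (matches == null || matches.length == 0) {
          System.out.println("glob matched nothing inside the archive");
        } else {
          for (FileStatus status : matches) {
            System.out.println(status.getPath());
          }
        }
      }
    }

If that printed nothing it would at least suggest the glob never matches
inside the archive, though I haven't confirmed that is where the empty-string
Path comes from.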

Running the job with the non-archived files is fine, i.e.:

    bin/hadoop jar My.jar all_the_files/*/*/ output

However, this only works for our modest test set of files; any substantial
number of files quickly makes the NameNode run out of memory.
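
To give an idea of the setup, here is a minimal driver of the kind of job
we're running (illustrative only; this is not the actual contents of My.jar,
and the identity mapper is a stand-in). The first argument goes to
FileInputFormat as the input path or glob, and the second is the output
directory:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pdf analysis");
        job.setJarByClass(MyDriver.class);
        job.setMapperClass(Mapper.class);  // identity mapper as a stand-in
        job.setNumReduceTasks(0);
        // args[0] is the input path/glob and args[1] the output directory,
        // exactly as passed on the bin/hadoop jar command line above
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }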

Can you use file globs with HAR archives? Or is there a different way to
build the archive, one I've missed, that includes just the files?
I appreciate that a SequenceFile might be a better fit for this task, but
I'd like to know the solution to this issue if there is one.
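
For completeness, this is roughly the SequenceFile packing I mean (a sketch
only; the output path, the Text/BytesWritable key-value choice, and taking
the local PDF paths as arguments are all illustrative):

    import java.io.File;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackPdfs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/user/hadoop/all_the_files.seq");
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(out),
            SequenceFile.Writer.keyClass(Text.class),
            SequenceFile.Writer.valueClass(BytesWritable.class))) {
          // args holds local PDF paths, e.g. from `find all_the_files -name '*.pdf'`
          for (String name : args) {
            byte[] bytes = Files.readAllBytes(new File(name).toPath());
            writer.append(new Text(name), new BytesWritable(bytes));
          }
        }
      }
    }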
