Posted to dev@pig.apache.org by Mohnish Kodnani <mo...@gmail.com> on 2012/09/27 03:38:01 UTC

Pig and HARFileSystem

Hi,
I emailed the user mailing list about this problem but did not get much
input, so I am writing to the developer list instead.
I have 2 questions about Pig and how it uses the HAR FileSystem.

1. Path globbing does not seem to work with HAR files in Pig. Is this
intentional? For example:
hadoop fs -ls har:///x/y/z/{a.har,b.har}/*
works and lists all the files in both HAR files. If I give the same path as
the input path to a Pig script, it does not work.

2. Wildcards in the HAR path.
    As in the example above, the following works with hadoop fs:
hadoop fs -ls har://x/y/*/a.har/*
This lists all files from every folder that contains a.har.

If I give the same path as the input path to Pig, it does not work. I have
tried both of these cases on Pig 0.8.
For the second case, if I remove the last wildcard (the one that should
match the files) and name a file explicitly, it works.
For example, with this input path to Pig:
har://x/y/*/a.har/logFile

Pig can read the file and gives me records back. So the wildcard in the
middle of the path is handled, but a wildcard as the last path element is
not.
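
To be concrete about what I expect from the brace pattern in question 1, here is a small Python sketch. It is only an illustration of brace alternation, not Hadoop or Pig code, and the function name expand_braces is made up for this example. It only handles the {a,b} alternation; the trailing * is left untouched, since matching it needs the actual file listing.

```python
import re


def expand_braces(pattern):
    """Expand {a,b,...} alternations in a path pattern into explicit patterns.

    Illustration only: this mimics the brace alternation I expect,
    it does not call into Hadoop or Pig.
    """
    # Find the first brace group that has no nested braces inside it.
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        # Recurse so that any remaining brace groups are expanded too.
        expanded.extend(expand_braces(head + option + tail))
    return expanded


paths = expand_braces("har:///x/y/z/{a.har,b.har}/*")
for p in paths:
    print(p)
```

In other words, I expect the single brace pattern to behave like the two patterns har:///x/y/z/a.har/* and har:///x/y/z/b.har/* taken together, which is what hadoop fs -ls appears to do.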

Any insight into whether this should or should not work would be great. I
have 30000 files in one folder inside the HAR, so I cannot list each one.
I want to use a wildcard as the last element in the path, and use path
globbing to select multiple HAR files.

Thanks
Mohnish