Posted to common-user@hadoop.apache.org by Meng Mao <me...@gmail.com> on 2011/09/02 22:04:48 UTC

do HDFS files starting with _ (underscore) have special properties?

We have a compression utility that tries to grab all subdirs of a directory
on HDFS. It makes a call like this:
FileStatus[] subdirs = fs.globStatus(new Path(inputdir, "*"));

and handles files vs dirs accordingly.

We tried to run our utility against a dir containing a computed SOLR shard,
which has files that look like this:
-rw-r--r--   2 hadoopuser visible 8538430603 2011-09-01 18:58 /test/output/solr-20110901165238/part-00000/data/index/_ox.fdt
-rw-r--r--   2 hadoopuser visible  233396596 2011-09-01 18:57 /test/output/solr-20110901165238/part-00000/data/index/_ox.fdx
-rw-r--r--   2 hadoopuser visible        130 2011-09-01 18:57 /test/output/solr-20110901165238/part-00000/data/index/_ox.fnm
-rw-r--r--   2 hadoopuser visible 2147948283 2011-09-01 18:55 /test/output/solr-20110901165238/part-00000/data/index/_ox.frq
-rw-r--r--   2 hadoopuser visible   87523726 2011-09-01 18:57 /test/output/solr-20110901165238/part-00000/data/index/_ox.nrm
-rw-r--r--   2 hadoopuser visible  920936168 2011-09-01 18:57 /test/output/solr-20110901165238/part-00000/data/index/_ox.prx
-rw-r--r--   2 hadoopuser visible   22619542 2011-09-01 18:58 /test/output/solr-20110901165238/part-00000/data/index/_ox.tii
-rw-r--r--   2 hadoopuser visible 2070214402 2011-09-01 18:51 /test/output/solr-20110901165238/part-00000/data/index/_ox.tis
-rw-r--r--   2 hadoopuser visible         20 2011-09-01 18:51 /test/output/solr-20110901165238/part-00000/data/index/segments.gen
-rw-r--r--   2 hadoopuser visible        282 2011-09-01 18:55 /test/output/solr-20110901165238/part-00000/data/index/segments_2


The globStatus call seems only able to pick up those last 2 files; the
several files that start with _ don't register.

I've skimmed the FileSystem and GlobExpander source to see if there's
anything related to this, but didn't see it. Google didn't turn up anything
about underscores. Am I misunderstanding something about the glob patterns
needed to pick these up, or am I unaware of some filename convention in HDFS?

Re: do HDFS files starting with _ (underscore) have special properties?

Posted by Harsh J <ha...@cloudera.com>.
Meng,

- Moving this discussion to cdh-user@cloudera.org since it may be CDH
specific at this point. (Link:
https://groups.google.com/a/cloudera.org/group/cdh-user)
- I've bcc'd common-user@ for this mail alone.
- Added you on cc in case you aren't subscribed.

Judging by your version output, you are on CDH2, an older release of
CDH. Would you be able to upgrade your cluster to CDH3?

I haven't tried running against your _exact_ version yet, but running
against the latest CDH2 version of HDFS from
http://archive.cloudera.com/cdh/2/, I think it still works fine (the
jar contains the same test code as before):

➜  hadoop-0.20.1+169.127 > bin/hadoop jar ~/globtester.jar
hdfs://localhost/user/harshchouraria/_abc
hdfs://localhost/user/harshchouraria/_def
hdfs://localhost/user/harshchouraria/abc
hdfs://localhost/user/harshchouraria/def

On Sun, Sep 4, 2011 at 12:04 AM, Meng Mao <me...@gmail.com> wrote:
> I get the opposite behavior --
>
> [this is more or less how I listed the files in the original email]
> hadoop dfs -ls /test/output/solr-20110901165238/part-00000/data/index/*
> -rw-r--r--   2 hadoopuser visible 8538430603 2011-09-01 18:58
> /test/output/solr-20110901165238/part-00000/data/index/_ox.fdt
> -rw-r--r--   2 hadoopuser visible  233396596 2011-09-01 18:57
> /test/output/solr-20110901165238/part-00000/data/index/_ox.fdx
> -rw-r--r--   2 hadoopuser visible        130 2011-09-01 18:57
> /test/output/solr-20110901165238/part-00000/data/index/_ox.fnm
> -rw-r--r--   2 hadoopuser visible 2147948283 2011-09-01 18:55
> /test/output/solr-20110901165238/part-00000/data/index/_ox.frq
> -rw-r--r--   2 hadoopuser visible   87523726 2011-09-01 18:57
> /test/output/solr-20110901165238/part-00000/data/index/_ox.nrm
> -rw-r--r--   2 hadoopuser visible  920936168 2011-09-01 18:57
> /test/output/solr-20110901165238/part-00000/data/index/_ox.prx
> -rw-r--r--   2 hadoopuser visible   22619542 2011-09-01 18:58
> /test/output/solr-20110901165238/part-00000/data/index/_ox.tii
> -rw-r--r--   2 hadoopuser visible 2070214402 2011-09-01 18:51
> /test/output/solr-20110901165238/part-00000/data/index/_ox.tis
> -rw-r--r--   2 hadoopuser visible         20 2011-09-01 18:51
> /test/output/solr-20110901165238/part-00000/data/index/segments.gen
> -rw-r--r--   2 hadoopuser visible        282 2011-09-01 18:55
> /test/output/solr-20110901165238/part-00000/data/index/segments_2
>
> Whereas my globStatus doesn't capture them.
>
> I thought we were on Cloudera's CDH3, but now I'm not sure. This is what
> version reports:
> $ hadoop version
> Hadoop 0.20.1+169.56
> Subversion  -r 8e662cb065be1c4bc61c55e6bff161e09c1d36f3
> Compiled by root on Tue Feb  9 13:40:08 EST 2010



-- 
Harsh J

Re: do HDFS files starting with _ (underscore) have special properties?

Posted by Meng Mao <me...@gmail.com>.
I get the opposite behavior --

[this is more or less how I listed the files in the original email]
hadoop dfs -ls /test/output/solr-20110901165238/part-00000/data/index/*
-rw-r--r--   2 hadoopuser visible 8538430603 2011-09-01 18:58 /test/output/solr-20110901165238/part-00000/data/index/_ox.fdt
-rw-r--r--   2 hadoopuser visible  233396596 2011-09-01 18:57 /test/output/solr-20110901165238/part-00000/data/index/_ox.fdx
-rw-r--r--   2 hadoopuser visible        130 2011-09-01 18:57 /test/output/solr-20110901165238/part-00000/data/index/_ox.fnm
-rw-r--r--   2 hadoopuser visible 2147948283 2011-09-01 18:55 /test/output/solr-20110901165238/part-00000/data/index/_ox.frq
-rw-r--r--   2 hadoopuser visible   87523726 2011-09-01 18:57 /test/output/solr-20110901165238/part-00000/data/index/_ox.nrm
-rw-r--r--   2 hadoopuser visible  920936168 2011-09-01 18:57 /test/output/solr-20110901165238/part-00000/data/index/_ox.prx
-rw-r--r--   2 hadoopuser visible   22619542 2011-09-01 18:58 /test/output/solr-20110901165238/part-00000/data/index/_ox.tii
-rw-r--r--   2 hadoopuser visible 2070214402 2011-09-01 18:51 /test/output/solr-20110901165238/part-00000/data/index/_ox.tis
-rw-r--r--   2 hadoopuser visible         20 2011-09-01 18:51 /test/output/solr-20110901165238/part-00000/data/index/segments.gen
-rw-r--r--   2 hadoopuser visible        282 2011-09-01 18:55 /test/output/solr-20110901165238/part-00000/data/index/segments_2

Whereas my globStatus doesn't capture them.

I thought we were on Cloudera's CDH3, but now I'm not sure. This is what
version reports:
$ hadoop version
Hadoop 0.20.1+169.56
Subversion  -r 8e662cb065be1c4bc61c55e6bff161e09c1d36f3
Compiled by root on Tue Feb  9 13:40:08 EST 2010





On Fri, Sep 2, 2011 at 11:45 PM, Harsh J <ha...@cloudera.com> wrote:

> Meng,
>
> What version of Hadoop are you on? I'm able to use globStatus(Path)
> for '_' listing successfully, with a '*' glob, although FsShell's ls
> utility doesn't give the same result (which is odd here!).
>
> Here's my test code which can validate that the listing is indeed
> done: http://pastebin.com/vCbd2wmK
>
> $ hadoop dfs -ls
> Found 4 items
> drwxr-xr-x   - harshchouraria supergroup          0 2011-09-03 09:09
> /user/harshchouraria/_abc
> -rw-r--r--   1 harshchouraria supergroup          0 2011-09-03 09:10
> /user/harshchouraria/_def
> drwxr-xr-x   - harshchouraria supergroup          0 2011-09-03 08:10
> /user/harshchouraria/abc
> -rw-r--r--   1 harshchouraria supergroup          0 2011-09-03 09:10
> /user/harshchouraria/def
>
>
> $ hadoop dfs -ls '*'
> -rw-r--r--   1 harshchouraria supergroup          0 2011-09-03 09:10
> /user/harshchouraria/_def
> -rw-r--r--   1 harshchouraria supergroup          0 2011-09-03 09:10
> /user/harshchouraria/def
>
> $ # No dir results! ^^
>
> $ hadoop jar myjar.jar # (My code)
> hdfs://localhost/user/harshchouraria/_abc
> hdfs://localhost/user/harshchouraria/_def
> hdfs://localhost/user/harshchouraria/abc
> hdfs://localhost/user/harshchouraria/def
>
> I suppose that means globStatus is fine, but the FsShell.ls(…) code
> does something more than a simple glob status, and filters away
> directory results when used with a glob.

Re: do HDFS files starting with _ (underscore) have special properties?

Posted by Harsh J <ha...@cloudera.com>.
Meng,

What version of Hadoop are you on? I'm able to use globStatus(Path)
for '_' listing successfully, with a '*' glob, although FsShell's ls
utility doesn't give the same result (which is odd here!).

Here's my test code which can validate that the listing is indeed
done: http://pastebin.com/vCbd2wmK

$ hadoop dfs -ls
Found 4 items
drwxr-xr-x   - harshchouraria supergroup          0 2011-09-03 09:09 /user/harshchouraria/_abc
-rw-r--r--   1 harshchouraria supergroup          0 2011-09-03 09:10 /user/harshchouraria/_def
drwxr-xr-x   - harshchouraria supergroup          0 2011-09-03 08:10 /user/harshchouraria/abc
-rw-r--r--   1 harshchouraria supergroup          0 2011-09-03 09:10 /user/harshchouraria/def


$ hadoop dfs -ls '*'
-rw-r--r--   1 harshchouraria supergroup          0 2011-09-03 09:10 /user/harshchouraria/_def
-rw-r--r--   1 harshchouraria supergroup          0 2011-09-03 09:10 /user/harshchouraria/def

$ # No dir results! ^^

$ hadoop jar myjar.jar # (My code)
hdfs://localhost/user/harshchouraria/_abc
hdfs://localhost/user/harshchouraria/_def
hdfs://localhost/user/harshchouraria/abc
hdfs://localhost/user/harshchouraria/def

I suppose that means globStatus is fine, but the FsShell.ls(…) code
does something more than a simple glob status, and filters away
directory results when used with a glob.

On Sat, Sep 3, 2011 at 3:07 AM, Meng Mao <me...@gmail.com> wrote:
> Is there a programmatic way to access these hidden files then?



-- 
Harsh J

Re: do HDFS files starting with _ (underscore) have special properties?

Posted by Meng Mao <me...@gmail.com>.
Is there a programmatic way to access these hidden files then?
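
One hedged way to think about this: if the listing call takes a filter that you supply yourself, an accept-all filter leaves underscore names in the result. The sketch below uses java.nio on the local filesystem as a stand-in so that it runs without a cluster; ListEverything and listAll are hypothetical names, not Hadoop code. On HDFS the analogous calls would be FileSystem.listStatus(Path) or FileSystem.globStatus(Path, PathFilter), assuming your Hadoop version provides them as documented.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Local-filesystem stand-in (assumption): lists every entry of a directory
// with an explicit accept-all filter and tags each one as file or dir.
class ListEverything {
    public static List<String> listAll(Path dir) throws IOException {
        // The filter is under our control, so nothing is dropped by name.
        DirectoryStream.Filter<Path> acceptAll = entry -> true;
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, acceptAll)) {
            for (Path entry : stream) {
                String kind = Files.isDirectory(entry) ? "dir:" : "file:";
                names.add(kind + entry.getFileName());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample layout, loosely modelled on the index dir above.
        Path dir = Files.createTempDirectory("index");
        Files.createFile(dir.resolve("_ox.fdt"));
        Files.createFile(dir.resolve("segments_2"));
        Files.createDirectory(dir.resolve("_abc"));
        for (String line : listAll(dir)) {
            System.out.println(line);
        }
    }
}
```

If the underscore names show up with an accept-all filter but not in your utility, the filtering is probably happening in code that sits above the raw listing call.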

On Fri, Sep 2, 2011 at 5:20 PM, Edward Capriolo <ed...@gmail.com> wrote:

> Files starting with '_' are considered 'hidden' like unix files starting
> with '.'. I did not know that for a very long time because not everyone
> follows this rule or even knows about it.
>

Re: do HDFS files starting with _ (underscore) have special properties?

Posted by Edward Capriolo <ed...@gmail.com>.
On Fri, Sep 2, 2011 at 4:04 PM, Meng Mao <me...@gmail.com> wrote:

> We have a compression utility that tries to grab all subdirs to a directory
> on HDFS. It makes a call like this:
> FileStatus[] subdirs = fs.globStatus(new Path(inputdir, "*"));
>
> and handles files vs dirs accordingly.
>
> We tried to run our utility against a dir containing a computed SOLR shard,
> which has files that look like this:
> -rw-r--r--   2 hadoopuser visible 8538430603 2011-09-01 18:58
> /test/output/solr-20110901165238/part-00000/data/index/_ox.fdt
> -rw-r--r--   2 hadoopuser visible  233396596 2011-09-01 18:57
> /test/output/solr-20110901165238/part-00000/data/index/_ox.fdx
> -rw-r--r--   2 hadoopuser visible        130 2011-09-01 18:57
> /test/output/solr-20110901165238/part-00000/data/index/_ox.fnm
> -rw-r--r--   2 hadoopuser visible 2147948283 2011-09-01 18:55
> /test/output/solr-20110901165238/part-00000/data/index/_ox.frq
> -rw-r--r--   2 hadoopuser visible   87523726 2011-09-01 18:57
> /test/output/solr-20110901165238/part-00000/data/index/_ox.nrm
> -rw-r--r--   2 hadoopuser visible  920936168 2011-09-01 18:57
> /test/output/solr-20110901165238/part-00000/data/index/_ox.prx
> -rw-r--r--   2 hadoopuser visible   22619542 2011-09-01 18:58
> /test/output/solr-20110901165238/part-00000/data/index/_ox.tii
> -rw-r--r--   2 hadoopuser visible 2070214402 2011-09-01 18:51
> /test/output/solr-20110901165238/part-00000/data/index/_ox.tis
> -rw-r--r--   2 hadoopuser visible         20 2011-09-01 18:51
> /test/output/solr-20110901165238/part-00000/data/index/segments.gen
> -rw-r--r--   2 hadoopuser visible        282 2011-09-01 18:55
> /test/output/solr-20110901165238/part-00000/data/index/segments_2
>
>
> The globStatus call seems only able to pick up those last 2 files; the
> several files that start with _ don't register.
>
> I've skimmed the FileSystem and GlobExpander source to see if there's
> anything related to this, but didn't see it. Google didn't turn up anything
> about underscores. Am I misunderstanding something about the regex patterns
> needed to pick these up or unaware of some filename convention in HDFS?
>

Files starting with '_' are considered 'hidden' like unix files starting
with '.'. I did not know that for a very long time because not everyone
follows this rule or even knows about it.
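
For illustration, the naming convention described above can be written as a small predicate. This is a minimal sketch, not Hadoop code: the class HiddenNames and its methods are hypothetical names. As far as I know, Hadoop applies this rule when MapReduce's FileInputFormat lists job input, rather than inside HDFS itself, which would fit Harsh's test in this thread where globStatus returned the underscore names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (assumption): models the "names starting with '_' or
// '.' are hidden" convention on plain path strings.
class HiddenNames {
    /** True when the last path component starts with '_' or '.'. */
    public static boolean isHidden(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.startsWith("_") || name.startsWith(".");
    }

    /** Returns the paths whose last component is not hidden, in input order. */
    public static List<String> visible(List<String> paths) {
        List<String> kept = new ArrayList<>();
        for (String p : paths) {
            if (!isHidden(p)) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String dir = "/test/output/solr-20110901165238/part-00000/data/index/";
        List<String> paths = Arrays.asList(
                dir + "_ox.fdt", dir + "_ox.fdx",
                dir + "segments.gen", dir + "segments_2");
        for (String p : visible(paths)) {
            System.out.println(p);
        }
    }
}
```

Applied to the index directory from the original question, a filter like this would keep only segments.gen and segments_2, which matches the symptom reported there.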