Posted to user@hadoop.apache.org by Sigurd Spieckermann <si...@gmail.com> on 2012/09/10 11:57:30 UTC

Reading from HDFS from inside the mapper

Hi,

I would like to perform a map-side join of two large datasets, where dataset
A consists of m*n elements and dataset B consists of n elements. For the
join, every element of dataset B needs to be accessed m times, and each mapper
would join one element from A with the corresponding element from B.
Elements here are actually data blocks. Is there a performance problem (and a
difference compared to a slightly modified map-side join using the
join-package) if I set dataset A as the map-reduce input and load the
relevant element of dataset B directly from HDFS inside the mapper? I
could store the elements of B in a MapFile for faster random access. In the
second case, without the join-package, I would not have to partition the
datasets manually, which would allow a bit more flexibility, but I'm
wondering whether HDFS access from inside a mapper is inherently a bad idea.
Also, does Hadoop have a cache for such situations, by any chance?

I appreciate any comments!

Sigurd
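
A minimal sketch of the second approach, assuming the newer
org.apache.hadoop.mapreduce API and a MapFile of B keyed by the same join key
as A; the config key, the key/value types and the join logic below are
illustrative assumptions, not something taken from this thread:

// Sketch only: stream dataset A as the job input and, for each record, look
// up the matching element of dataset B in a MapFile stored on HDFS.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideLookupMapper
    extends Mapper<Text, BytesWritable, Text, BytesWritable> {

  private MapFile.Reader bReader;
  private final BytesWritable bValue = new BytesWritable();

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    // Hypothetical config key pointing at the MapFile directory of dataset B.
    Path bDir = new Path(conf.get("join.b.path"));
    // Open the reader once per task, not once per record.
    bReader = new MapFile.Reader(bDir.getFileSystem(conf), bDir.toString(), conf);
  }

  @Override
  protected void map(Text key, BytesWritable aValue, Context context)
      throws IOException, InterruptedException {
    // Random access into B by join key; emit the joined record if B has it.
    if (bReader.get(key, bValue) != null) {
      context.write(key, aValue);  // combine aValue and bValue as needed
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    if (bReader != null) {
      bReader.close();
    }
  }
}

Opening the reader once per task keeps the per-record cost to a single MapFile
lookup; whether that beats the join-package or a DistributedCache copy depends
on how often each task touches the same blocks of B.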

Re: Reading from HDFS from inside the mapper

Posted by Sigurd Spieckermann <si...@gmail.com>.
OK, I see... Is there any way to change this? I need a guaranteed ordering
for the map-side join to work correctly, and I need standalone mode for
debugging code that is executed on the mapper/reducer nodes.
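
One way to take the listing order out of the equation, as a sketch only (not
something suggested in this thread), is to enumerate the part files yourself
and sort them explicitly before any order-sensitive use, instead of relying on
the order returned by listStatus or fs -ls:

// Sketch: never depend on the order in which a FileSystem lists children;
// sort the statuses explicitly so the code behaves the same on HDFS and on
// the LocalFileSystem used in standalone mode.
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SortedPartListing {
  public static Path[] sortedParts(Configuration conf, Path dir) throws IOException {
    FileSystem fs = dir.getFileSystem(conf);
    FileStatus[] parts = fs.listStatus(dir);  // order is filesystem-dependent
    Arrays.sort(parts);                       // FileStatus compares by path
    Path[] paths = new Path[parts.length];
    for (int i = 0; i < parts.length; i++) {
      paths[i] = parts[i].getPath();
    }
    return paths;
  }
}

Whether this alone is enough for the join-package depends on where it builds
its own file listings; at minimum it makes every listing under your own
control deterministic.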

2012/9/17 Harsh J <ha...@cloudera.com>

> Sigurd,
>
> The implementation of fs -ls in the LocalFileSystem relies on Java's
> File#list
> http://docs.oracle.com/javase/6/docs/api/java/io/File.html#list()
> which states "There is no guarantee that the name strings in the
> resulting array will appear in any specific order; they are not, in
> particular, guaranteed to appear in alphabetical order.". That may
> just be what is biting you, since standalone mode uses LFS.
>
> On Mon, Sep 17, 2012 at 6:45 PM, Sigurd Spieckermann
> <si...@gmail.com> wrote:
> > I've tracked down the problem to only occur in standalone mode. In
> > pseudo-distributed mode, everything works fine. My underlying OS is
> Ubuntu
> > 12.04 64bit. When I access the directory in linux directly, everything
> looks
> > normal. It's just when I access it through hadoop. Has anyone seen this
> > problem before and knows a solution?
> >
> > Thanks,
> > Sigurd
> >
> >
> > 2012/9/17 Sigurd Spieckermann <si...@gmail.com>
> >>
> >> I'm experiencing a strange problem right now. I'm writing part-files to
> >> the HDFS providing initial data and (which should actually not make a
> >> difference anyway) write them in ascending order, i.e. part-00000,
> >> part-00001 etc. -- in that order. But when I do "hadoop dfs -ls xyz",
> they
> >> are in the order part-00001, part-00000, part-00002, part-00003 etc.
> How is
> >> that possible? Why aren't they shown in natural order? Also the map-side
> >> join package considers them in this order which causes problems.
> >>
> >>
> >> 2012/9/10 Sigurd Spieckermann <si...@gmail.com>
> >>>
> >>> OK, interesting. Just to confirm: is it okay to distribute quite large
> >>> files through the DistributedCache? Dataset B could be on the order of
> >>> gigabytes. Also, if I have much fewer nodes than elements/blocks in A,
> then
> >>> the probability that every node will have to read (almost) every block
> of B
> >>> is quite high so given DC is okay here in general, it would be more
> >>> efficient to use DC over HDFS reading. How about the case though that
> I have
> >>> m*n nodes, then every node would receive all of B while only needing a
> small
> >>> fraction, right? Could you maybe elaborate on this in a few sentences
> just to
> >>> be sure I understand Hadoop correctly?
> >>>
> >>> Thanks,
> >>> Sigurd
> >>>
> >>> 2012/9/10 Harsh J <ha...@cloudera.com>
> >>>>
> >>>> Sigurd,
> >>>>
> >>>> Hemanth's recommendation of DistributedCache does fit your requirement
> >>>> - it is a generic way of distributing files and archives to tasks of a
> >>>> job. It is not something that pushes things automatically in memory,
> >>>> but on the local disk of the TaskTracker your task runs on. You can
> >>>> choose to then use a LocalFileSystem impl. to read it out from there,
> >>>> which would end up being (slightly) faster than your same approach
> >>>> applied to MapFiles on HDFS.
> >>>>
> >>>> On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann
> >>>>
> >>>> <si...@gmail.com> wrote:
> >>>> > I checked DistributedCache, but in general I have to assume that
> none
> >>>> > of the
> >>>> > datasets fits in memory... That's why I was considering map-side
> join,
> >>>> > but
> >>>> > by default it doesn't fit to my problem. I could probably get it to
> >>>> > work
> >>>> > though, but I would have to enforce the requirements of the map-side
> >>>> > join.
> >>>> >
> >>>> >
> >>>> > 2012/9/10 Hemanth Yamijala <yh...@thoughtworks.com>
> >>>> >>
> >>>> >> Hi,
> >>>> >>
> >>>> >> You could check DistributedCache
> >>>> >>
> >>>> >> (
> http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache
> ).
> >>>> >> It would allow you to distribute data to the nodes where your tasks
> >>>> >> are run.
> >>>> >>
> >>>> >> Thanks
> >>>> >> Hemanth
> >>>> >>
> >>>> >>
> >>>> >> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann
> >>>> >> <si...@gmail.com> wrote:
> >>>> >>>
> >>>> >>> Hi,
> >>>> >>>
> >>>> >>> I would like to perform a map-side join of two large datasets
> where
> >>>> >>> dataset A consists of m*n elements and dataset B consists of n
> >>>> >>> elements. For
> >>>> >>> the join, every element in dataset B needs to be accessed m times.
> >>>> >>> Each
> >>>> >>> mapper would join one element from A with the corresponding
> element
> >>>> >>> from B.
> >>>> >>> Elements here are actually data blocks. Is there a performance
> >>>> >>> problem (and
> >>>> >>> difference compared to a slightly modified map-side join using the
> >>>> >>> join-package) if I set dataset A as the map-reduce input and load
> >>>> >>> the
> >>>> >>> relevant element from dataset B directly from the HDFS inside the
> >>>> >>> mapper? I
> >>>> >>> could store the elements of B in a MapFile for faster random
> access.
> >>>> >>> In the
> >>>> >>> second case without the join-package I would not have to partition
> >>>> >>> the
> >>>> >>> datasets manually which would allow a bit more flexibility, but
> I'm
> >>>> >>> wondering if HDFS access from inside a mapper is strictly bad.
> Also,
> >>>> >>> does
> >>>> >>> Hadoop have a cache for such situations by any chance?
> >>>> >>>
> >>>> >>> I appreciate any comments!
> >>>> >>>
> >>>> >>> Sigurd
> >>>> >>
> >>>> >>
> >>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Harsh J
> >>>
> >>>
> >>
> >
>
>
>
> --
> Harsh J
>

Re: Reading from HDFS from inside the mapper

Posted by Harsh J <ha...@cloudera.com>.
Sigurd,

The implementation of fs -ls in the LocalFileSystem relies on Java's
File#list (http://docs.oracle.com/javase/6/docs/api/java/io/File.html#list()),
whose documentation states: "There is no guarantee that the name strings in
the resulting array will appear in any specific order; they are not, in
particular, guaranteed to appear in alphabetical order." That may just be
what is biting you, since standalone mode uses the LocalFileSystem.
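
In other words, any consumer of File#list (or of a LocalFileSystem listing
built on top of it) has to impose an order itself when order matters; a
minimal illustration, not from the thread:

// Sketch: File#list returns names in an arbitrary, filesystem-dependent
// order, so sort explicitly when a stable order is required.
import java.io.File;
import java.util.Arrays;

public class ListInOrder {
  public static String[] listSorted(File dir) {
    String[] names = dir.list();  // no ordering guarantee
    if (names != null) {
      Arrays.sort(names);         // e.g. part-00000, part-00001, ...
    }
    return names;
  }
}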

On Mon, Sep 17, 2012 at 6:45 PM, Sigurd Spieckermann
<si...@gmail.com> wrote:
> I've tracked down the problem to only occur in standalone mode. In
> pseudo-distributed mode, everything works fine. My underlying OS is Ubuntu
> 12.04 64bit. When I access the directory in linux directly, everything looks
> normal. It's just when I access it through hadoop. Has anyone seen this
> problem before and knows a solution?
>
> Thanks,
> Sigurd
>
>
> 2012/9/17 Sigurd Spieckermann <si...@gmail.com>
>>
>> I'm experiencing a strange problem right now. I'm writing part-files to
>> the HDFS providing initial data and (which should actually not make a
>> difference anyway) write them in ascending order, i.e. part-00000,
>> part-00001 etc. -- in that order. But when I do "hadoop dfs -ls xyz", they
>> are in the order part-00001, part-00000, part-00002, part-00003 etc. How is
>> that possible? Why aren't they shown in natural order? Also the map-side
>> join package considers them in this order which causes problems.
>>
>>
>> 2012/9/10 Sigurd Spieckermann <si...@gmail.com>
>>>
>>> OK, interesting. Just to confirm: is it okay to distribute quite large
>>> files through the DistributedCache? Dataset B could be on the order of
>>> gigabytes. Also, if I have much fewer nodes than elements/blocks in A, then
>>> the probability that every node will have to read (almost) every block of B
>>> is quite high so given DC is okay here in general, it would be more
>>> efficient to use DC over HDFS reading. How about the case though that I have
>>> m*n nodes, then every node would receive all of B while only needing a small
>>> fraction, right? Could you maybe elaborate on this in a few sentences just to
>>> be sure I understand Hadoop correctly?
>>>
>>> Thanks,
>>> Sigurd
>>>
>>> 2012/9/10 Harsh J <ha...@cloudera.com>
>>>>
>>>> Sigurd,
>>>>
>>>> Hemanth's recommendation of DistributedCache does fit your requirement
>>>> - it is a generic way of distributing files and archives to tasks of a
>>>> job. It is not something that pushes things automatically in memory,
>>>> but on the local disk of the TaskTracker your task runs on. You can
>>>> choose to then use a LocalFileSystem impl. to read it out from there,
>>>> which would end up being (slightly) faster than your same approach
>>>> applied to MapFiles on HDFS.
>>>>
>>>> On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann
>>>>
>>>> <si...@gmail.com> wrote:
>>>> > I checked DistributedCache, but in general I have to assume that none
>>>> > of the
>>>> > datasets fits in memory... That's why I was considering map-side join,
>>>> > but
>>>> > by default it doesn't fit to my problem. I could probably get it to
>>>> > work
>>>> > though, but I would have to enforce the requirements of the map-side
>>>> > join.
>>>> >
>>>> >
>>>> > 2012/9/10 Hemanth Yamijala <yh...@thoughtworks.com>
>>>> >>
>>>> >> Hi,
>>>> >>
>>>> >> You could check DistributedCache
>>>> >>
>>>> >> (http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
>>>> >> It would allow you to distribute data to the nodes where your tasks
>>>> >> are run.
>>>> >>
>>>> >> Thanks
>>>> >> Hemanth
>>>> >>
>>>> >>
>>>> >> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann
>>>> >> <si...@gmail.com> wrote:
>>>> >>>
>>>> >>> Hi,
>>>> >>>
>>>> >>> I would like to perform a map-side join of two large datasets where
>>>> >>> dataset A consists of m*n elements and dataset B consists of n
>>>> >>> elements. For
>>>> >>> the join, every element in dataset B needs to be accessed m times.
>>>> >>> Each
>>>> >>> mapper would join one element from A with the corresponding element
>>>> >>> from B.
>>>> >>> Elements here are actually data blocks. Is there a performance
>>>> >>> problem (and
>>>> >>> difference compared to a slightly modified map-side join using the
>>>> >>> join-package) if I set dataset A as the map-reduce input and load
>>>> >>> the
>>>> >>> relevant element from dataset B directly from the HDFS inside the
>>>> >>> mapper? I
>>>> >>> could store the elements of B in a MapFile for faster random access.
>>>> >>> In the
>>>> >>> second case without the join-package I would not have to partition
>>>> >>> the
>>>> >>> datasets manually which would allow a bit more flexibility, but I'm
>>>> >>> wondering if HDFS access from inside a mapper is strictly bad. Also,
>>>> >>> does
>>>> >>> Hadoop have a cache for such situations by any chance?
>>>> >>>
>>>> >>> I appreciate any comments!
>>>> >>>
>>>> >>> Sigurd
>>>> >>
>>>> >>
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>>
>>>
>>
>



-- 
Harsh J
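
For readers following the DistributedCache recommendation quoted above, a
minimal sketch of the Hadoop 1.x-era filecache API; the HDFS path and the way
the local copy is picked up are illustrative assumptions:

// Sketch: ship a file to every task's node via DistributedCache, then read
// the node-local copy from the task instead of going back to HDFS.
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheExample {

  // Driver side: register the HDFS file before submitting the job.
  public static void addToCache(JobConf conf) {
    DistributedCache.addCacheFile(URI.create("hdfs:///data/b/part-00000"), conf);
  }

  // Task side (e.g. in Mapper#configure): locate the already-localized copy.
  public static Path localCopy(JobConf conf) throws IOException {
    Path[] cached = DistributedCache.getLocalCacheFiles(conf);
    // cached[0] lives on the TaskTracker's local disk; read it through the
    // local file system implementation.
    return FileSystem.getLocal(conf).makeQualified(cached[0]);
  }
}

Since every task on a node shares the node-local copy, shipping a
multi-gigabyte B to every node only pays off when most tasks would have ended
up reading most of B anyway, which matches the discussion above.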

Re: Reading from HDFS from inside the mapper

Posted by Sigurd Spieckermann <si...@gmail.com>.
I've tracked the problem down: it only occurs in standalone mode. In
pseudo-distributed mode, everything works fine. My underlying OS is Ubuntu
12.04, 64-bit. When I access the directory directly in Linux, everything
looks normal; it is only when I access it through Hadoop. Has anyone seen
this problem before, and does anyone know a solution?

Thanks,
Sigurd

2012/9/17 Sigurd Spieckermann <si...@gmail.com>

> I'm experiencing a strange problem right now. I'm writing part-files to
> the HDFS providing initial data and (which should actually not make a
> difference anyway) write them in ascending order, i.e. part-00000,
> part-00001 etc. -- in that order. But when I do "hadoop dfs -ls xyz", they
> are in the order part-00001, part-00000, part-00002, part-00003 etc. How is
> that possible? Why aren't they shown in natural order? Also the map-side
> join package considers them in this order which causes problems.
>
>
> 2012/9/10 Sigurd Spieckermann <si...@gmail.com>
>
>> OK, interesting. Just to confirm: is it okay to distribute quite large
>> files through the DistributedCache? Dataset B could be on the order of
>> gigabytes. Also, if I have much fewer nodes than elements/blocks in A, then
>> the probability that every node will have to read (almost) every block of B
>> is quite high so given DC is okay here in general, it would be more
>> efficient to use DC over HDFS reading. How about the case though that I
>> have m*n nodes, then every node would receive all of B while only needing a
>> small fraction, right? Could you maybe elaborate on this in a few sentences
>> just to be sure I understand Hadoop correctly?
>>
>> Thanks,
>> Sigurd
>>
>> 2012/9/10 Harsh J <ha...@cloudera.com>
>>
>>> Sigurd,
>>>
>>> Hemanth's recommendation of DistributedCache does fit your requirement
>>> - it is a generic way of distributing files and archives to tasks of a
>>> job. It is not something that pushes things automatically in memory,
>>> but on the local disk of the TaskTracker your task runs on. You can
>>> choose to then use a LocalFileSystem impl. to read it out from there,
>>> which would end up being (slightly) faster than your same approach
>>> applied to MapFiles on HDFS.
>>>
>>> On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann
>>>
>>> <si...@gmail.com> wrote:
>>> > I checked DistributedCache, but in general I have to assume that none
>>> of the
>>> > datasets fits in memory... That's why I was considering map-side join,
>>> but
>>> > by default it doesn't fit to my problem. I could probably get it to
>>> work
>>> > though, but I would have to enforce the requirements of the map-side
>>> join.
>>> >
>>> >
>>> > 2012/9/10 Hemanth Yamijala <yh...@thoughtworks.com>
>>> >>
>>> >> Hi,
>>> >>
>>> >> You could check DistributedCache
>>> >> (
>>> http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache
>>> ).
>>> >> It would allow you to distribute data to the nodes where your tasks
>>> are run.
>>> >>
>>> >> Thanks
>>> >> Hemanth
>>> >>
>>> >>
>>> >> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann
>>> >> <si...@gmail.com> wrote:
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>> I would like to perform a map-side join of two large datasets where
>>> >>> dataset A consists of m*n elements and dataset B consists of n
>>> elements. For
>>> >>> the join, every element in dataset B needs to be accessed m times.
>>> Each
>>> >>> mapper would join one element from A with the corresponding element
>>> from B.
>>> >>> Elements here are actually data blocks. Is there a performance
>>> problem (and
>>> >>> difference compared to a slightly modified map-side join using the
>>> >>> join-package) if I set dataset A as the map-reduce input and load the
>>> >>> relevant element from dataset B directly from the HDFS inside the
>>> mapper? I
>>> >>> could store the elements of B in a MapFile for faster random access.
>>> In the
>>> >>> second case without the join-package I would not have to partition
>>> the
>>> >>> datasets manually which would allow a bit more flexibility, but I'm
>>> >>> wondering if HDFS access from inside a mapper is strictly bad. Also,
>>> does
>>> >>> Hadoop have a cache for such situations by any chance?
>>> >>>
>>> >>> I appreciate any comments!
>>> >>>
>>> >>> Sigurd
>>> >>
>>> >>
>>> >
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>

Re: Reading from HDFS from inside the mapper

Posted by Sigurd Spieckermann <si...@gmail.com>.
I've tracked down the problem to only occur in standalone mode. In
pseudo-distributed mode, everything works fine. My underlying OS is Ubuntu
12.04 64bit. When I access the directory in linux directly, everything
looks normal. It's just when I access it through hadoop. Has anyone seen
this problem before and knows a solution?

Thanks,
Sigurd

2012/9/17 Sigurd Spieckermann <si...@gmail.com>

> I'm experiencing a strange problem right now. I'm writing part-files to
> the HDFS providing initial data and (which should actually not make a
> difference anyway) write them in ascending order, i.e. part-00000,
> part-00001 etc. -- in that order. But when I do "hadoop dfs -ls xyz", they
> are in the order part-00001, part-00000, part-00002, part-00003 etc. How is
> that possible? Why aren't they shown in natural order? Also the map-side
> join package considers them in this order which causes problems.
>
>
> 2012/9/10 Sigurd Spieckermann <si...@gmail.com>
>
>> OK, interesting. Just to confirm: is it okay to distribute quite large
>> files through the DistributedCache? Dataset B could be on the order of
>> gigabytes. Also, if I have much fewer nodes than elements/blocks in A, then
>> the probability that every node will have to read (almost) every block of B
>> is quite high so given DC is okay here in general, it would be more
>> efficient to use DC over HDFS reading. How about the case though that I
>> have m*n nodes, then every node would receive all of B while only needing a
>> small fraction, right? Could you maybe elaborate on this in a few sentence
>> just to be sure I understand Hadoop correctly?
>>
>> Thanks,
>> Sigurd
>>
>> 2012/9/10 Harsh J <ha...@cloudera.com>
>>
>>> Sigurd,
>>>
>>> Hemanth's recommendation of DistributedCache does fit your requirement
>>> - it is a generic way of distributing files and archives to tasks of a
>>> job. It is not something that pushes things automatically in memory,
>>> but on the local disk of the TaskTracker your task runs on. You can
>>> choose to then use a LocalFileSystem impl. to read it out from there,
>>> which would end up being (slightly) faster than your same approach
>>> applied to MapFiles on HDFS.
>>>
>>> On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann
>>>
>>> <si...@gmail.com> wrote:
>>> > I checked DistributedCache, but in general I have to assume that none
>>> of the
>>> > datasets fits in memory... That's why I was considering map-side join,
>>> but
>>> > by default it doesn't fit to my problem. I could probably get it to
>>> work
>>> > though, but I would have to enforce the requirements of the map-side
>>> join.
>>> >
>>> >
>>> > 2012/9/10 Hemanth Yamijala <yh...@thoughtworks.com>
>>> >>
>>> >> Hi,
>>> >>
>>> >> You could check DistributedCache
>>> >> (
>>> http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache
>>> ).
>>> >> It would allow you to distribute data to the nodes where your tasks
>>> are run.
>>> >>
>>> >> Thanks
>>> >> Hemanth
>>> >>
>>> >>
>>> >> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann
>>> >> <si...@gmail.com> wrote:
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>> I would like to perform a map-side join of two large datasets where
>>> >>> dataset A consists of m*n elements and dataset B consists of n
>>> elements. For
>>> >>> the join, every element in dataset B needs to be accessed m times.
>>> Each
>>> >>> mapper would join one element from A with the corresponding element
>>> from B.
>>> >>> Elements here are actually data blocks. Is there a performance
>>> problem (and
>>> >>> difference compared to a slightly modified map-side join using the
>>> >>> join-package) if I set dataset A as the map-reduce input and load the
>>> >>> relevant element from dataset B directly from the HDFS inside the
>>> mapper? I
>>> >>> could store the elements of B in a MapFile for faster random access.
>>> In the
>>> >>> second case without the join-package I would not have to partition
>>> the
>>> >>> datasets manually which would allow a bit more flexibility, but I'm
>>> >>> wondering if HDFS access from inside a mapper is strictly bad. Also,
>>> does
>>> >>> Hadoop have a cache for such situations by any chance?
>>> >>>
>>> >>> I appreciate any comments!
>>> >>>
>>> >>> Sigurd
>>> >>
>>> >>
>>> >
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>

Re: Reading from HDFS from inside the mapper

Posted by Sigurd Spieckermann <si...@gmail.com>.
I've tracked down the problem to only occur in standalone mode. In
pseudo-distributed mode, everything works fine. My underlying OS is Ubuntu
12.04 64bit. When I access the directory in linux directly, everything
looks normal. It's just when I access it through hadoop. Has anyone seen
this problem before and knows a solution?

Thanks,
Sigurd

2012/9/17 Sigurd Spieckermann <si...@gmail.com>

> I'm experiencing a strange problem right now. I'm writing part-files to
> the HDFS providing initial data and (which should actually not make a
> difference anyway) write them in ascending order, i.e. part-00000,
> part-00001 etc. -- in that order. But when I do "hadoop dfs -ls xyz", they
> are in the order part-00001, part-00000, part-00002, part-00003 etc. How is
> that possible? Why aren't they shown in natural order? Also the map-side
> join package considers them in this order which causes problems.
>
>
> 2012/9/10 Sigurd Spieckermann <si...@gmail.com>
>
>> OK, interesting. Just to confirm: is it okay to distribute quite large
>> files through the DistributedCache? Dataset B could be on the order of
>> gigabytes. Also, if I have much fewer nodes than elements/blocks in A, then
>> the probability that every node will have to read (almost) every block of B
>> is quite high so given DC is okay here in general, it would be more
>> efficient to use DC over HDFS reading. How about the case though that I
>> have m*n nodes, then every node would receive all of B while only needing a
>> small fraction, right? Could you maybe elaborate on this in a few sentence
>> just to be sure I understand Hadoop correctly?
>>
>> Thanks,
>> Sigurd
>>
>> 2012/9/10 Harsh J <ha...@cloudera.com>
>>
>>> Sigurd,
>>>
>>> Hemanth's recommendation of DistributedCache does fit your requirement
>>> - it is a generic way of distributing files and archives to tasks of a
>>> job. It is not something that pushes things automatically in memory,
>>> but on the local disk of the TaskTracker your task runs on. You can
>>> choose to then use a LocalFileSystem impl. to read it out from there,
>>> which would end up being (slightly) faster than your same approach
>>> applied to MapFiles on HDFS.
>>>
>>> On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann
>>>
>>> <si...@gmail.com> wrote:
>>> > I checked DistributedCache, but in general I have to assume that none
>>> of the
>>> > datasets fits in memory... That's why I was considering map-side join,
>>> but
>>> > by default it doesn't fit to my problem. I could probably get it to
>>> work
>>> > though, but I would have to enforce the requirements of the map-side
>>> join.
>>> >
>>> >
>>> > 2012/9/10 Hemanth Yamijala <yh...@thoughtworks.com>
>>> >>
>>> >> Hi,
>>> >>
>>> >> You could check DistributedCache
>>> >> (
>>> http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache
>>> ).
>>> >> It would allow you to distribute data to the nodes where your tasks
>>> are run.
>>> >>
>>> >> Thanks
>>> >> Hemanth
>>> >>
>>> >>
>>> >> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann
>>> >> <si...@gmail.com> wrote:
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>> I would like to perform a map-side join of two large datasets where
>>> >>> dataset A consists of m*n elements and dataset B consists of n
>>> elements. For
>>> >>> the join, every element in dataset B needs to be accessed m times.
>>> Each
>>> >>> mapper would join one element from A with the corresponding element
>>> from B.
>>> >>> Elements here are actually data blocks. Is there a performance
>>> problem (and
>>> >>> difference compared to a slightly modified map-side join using the
>>> >>> join-package) if I set dataset A as the map-reduce input and load the
>>> >>> relevant element from dataset B directly from the HDFS inside the
>>> mapper? I
>>> >>> could store the elements of B in a MapFile for faster random access.
>>> In the
>>> >>> second case without the join-package I would not have to partition
>>> the
>>> >>> datasets manually which would allow a bit more flexibility, but I'm
>>> >>> wondering if HDFS access from inside a mapper is strictly bad. Also,
>>> does
>>> >>> Hadoop have a cache for such situations by any chance?
>>> >>>
>>> >>> I appreciate any comments!
>>> >>>
>>> >>> Sigurd
>>> >>
>>> >>
>>> >
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>

Re: Reading from HDFS from inside the mapper

Posted by Sigurd Spieckermann <si...@gmail.com>.
I've tracked down the problem: it only occurs in standalone mode. In
pseudo-distributed mode, everything works fine. My underlying OS is Ubuntu
12.04 64-bit. When I access the directory directly in Linux, everything
looks normal; it's only when I access it through Hadoop. Has anyone seen
this problem before and found a solution?
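
As a workaround I'm considering listing the part files myself and sorting
them by name before using them, so the order no longer depends on what the
underlying file system happens to return. Rough, untested sketch against
the old API (class and method names are placeholders):

import java.util.Arrays;
import java.util.Comparator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

public class SortedParts {
  // Returns the part files of an output directory in name order,
  // regardless of the order the FileSystem lists them in.
  public static Path[] sortedPartFiles(Configuration conf, Path dir)
      throws Exception {
    FileStatus[] parts =
        dir.getFileSystem(conf).globStatus(new Path(dir, "part-*"));
    Arrays.sort(parts, new Comparator<FileStatus>() {
      public int compare(FileStatus a, FileStatus b) {
        return a.getPath().getName().compareTo(b.getPath().getName());
      }
    });
    Path[] result = new Path[parts.length];
    for (int i = 0; i < parts.length; i++) {
      result[i] = parts[i].getPath();
    }
    return result;
  }
}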

Thanks,
Sigurd

2012/9/17 Sigurd Spieckermann <si...@gmail.com>

> I'm experiencing a strange problem right now. I'm writing part-files to
> the HDFS providing initial data and (which should actually not make a
> difference anyway) write them in ascending order, i.e. part-00000,
> part-00001 etc. -- in that order. But when I do "hadoop dfs -ls xyz", they
> are in the order part-00001, part-00000, part-00002, part-00003 etc. How is
> that possible? Why aren't they shown in natural order? Also the map-side
> join package considers them in this order which causes problems.
>
>
> 2012/9/10 Sigurd Spieckermann <si...@gmail.com>
>
>> OK, interesting. Just to confirm: is it okay to distribute quite large
>> files through the DistributedCache? Dataset B could be on the order of
>> gigabytes. Also, if I have much fewer nodes than elements/blocks in A, then
>> the probability that every node will have to read (almost) every block of B
>> is quite high so given DC is okay here in general, it would be more
>> efficient to use DC over HDFS reading. How about the case though that I
>> have m*n nodes, then every node would receive all of B while only needing a
>> small fraction, right? Could you maybe elaborate on this in a few sentence
>> just to be sure I understand Hadoop correctly?
>>
>> Thanks,
>> Sigurd
>>
>> 2012/9/10 Harsh J <ha...@cloudera.com>
>>
>>> Sigurd,
>>>
>>> Hemanth's recommendation of DistributedCache does fit your requirement
>>> - it is a generic way of distributing files and archives to tasks of a
>>> job. It is not something that pushes things automatically in memory,
>>> but on the local disk of the TaskTracker your task runs on. You can
>>> choose to then use a LocalFileSystem impl. to read it out from there,
>>> which would end up being (slightly) faster than your same approach
>>> applied to MapFiles on HDFS.
>>>
>>> On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann
>>>
>>> <si...@gmail.com> wrote:
>>> > I checked DistributedCache, but in general I have to assume that none
>>> of the
>>> > datasets fits in memory... That's why I was considering map-side join,
>>> but
>>> > by default it doesn't fit to my problem. I could probably get it to
>>> work
>>> > though, but I would have to enforce the requirements of the map-side
>>> join.
>>> >
>>> >
>>> > 2012/9/10 Hemanth Yamijala <yh...@thoughtworks.com>
>>> >>
>>> >> Hi,
>>> >>
>>> >> You could check DistributedCache
>>> >> (
>>> http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache
>>> ).
>>> >> It would allow you to distribute data to the nodes where your tasks
>>> are run.
>>> >>
>>> >> Thanks
>>> >> Hemanth
>>> >>
>>> >>
>>> >> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann
>>> >> <si...@gmail.com> wrote:
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>> I would like to perform a map-side join of two large datasets where
>>> >>> dataset A consists of m*n elements and dataset B consists of n
>>> elements. For
>>> >>> the join, every element in dataset B needs to be accessed m times.
>>> Each
>>> >>> mapper would join one element from A with the corresponding element
>>> from B.
>>> >>> Elements here are actually data blocks. Is there a performance
>>> problem (and
>>> >>> difference compared to a slightly modified map-side join using the
>>> >>> join-package) if I set dataset A as the map-reduce input and load the
>>> >>> relevant element from dataset B directly from the HDFS inside the
>>> mapper? I
>>> >>> could store the elements of B in a MapFile for faster random access.
>>> In the
>>> >>> second case without the join-package I would not have to partition
>>> the
>>> >>> datasets manually which would allow a bit more flexibility, but I'm
>>> >>> wondering if HDFS access from inside a mapper is strictly bad. Also,
>>> does
>>> >>> Hadoop have a cache for such situations by any chance?
>>> >>>
>>> >>> I appreciate any comments!
>>> >>>
>>> >>> Sigurd
>>> >>
>>> >>
>>> >
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>

Re: Reading from HDFS from inside the mapper

Posted by Sigurd Spieckermann <si...@gmail.com>.
I'm experiencing a strange problem right now. I'm writing part-files to
HDFS to provide initial data and (although it shouldn't make a difference
anyway) I write them in ascending order, i.e. part-00000, part-00001 etc.
-- in that order. But when I run "hadoop dfs -ls xyz", they are listed in
the order part-00001, part-00000, part-00002, part-00003 etc. How is that
possible? Why aren't they shown in natural order? The map-side join package
also considers them in this order, which causes problems.
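
For context, this is roughly how I'm wiring up the join (old mapred API;
untested sketch, the paths are placeholders). As far as I understand it,
the join package pairs up the part files of the two inputs by their
position in the listing, which is why the order above matters:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class JoinSetup {
  public static void configureJoin(JobConf conf) {
    // Inner join over two directories of identically partitioned,
    // sorted part files.
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", SequenceFileInputFormat.class,
        new Path("/data/A"), new Path("/data/B")));
  }
}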

2012/9/10 Sigurd Spieckermann <si...@gmail.com>

> OK, interesting. Just to confirm: is it okay to distribute quite large
> files through the DistributedCache? Dataset B could be on the order of
> gigabytes. Also, if I have much fewer nodes than elements/blocks in A, then
> the probability that every node will have to read (almost) every block of B
> is quite high so given DC is okay here in general, it would be more
> efficient to use DC over HDFS reading. How about the case though that I
> have m*n nodes, then every node would receive all of B while only needing a
> small fraction, right? Could you maybe elaborate on this in a few sentence
> just to be sure I understand Hadoop correctly?
>
> Thanks,
> Sigurd
>
> 2012/9/10 Harsh J <ha...@cloudera.com>
>
>> Sigurd,
>>
>> Hemanth's recommendation of DistributedCache does fit your requirement
>> - it is a generic way of distributing files and archives to tasks of a
>> job. It is not something that pushes things automatically in memory,
>> but on the local disk of the TaskTracker your task runs on. You can
>> choose to then use a LocalFileSystem impl. to read it out from there,
>> which would end up being (slightly) faster than your same approach
>> applied to MapFiles on HDFS.
>>
>> On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann
>>
>> <si...@gmail.com> wrote:
>> > I checked DistributedCache, but in general I have to assume that none
>> of the
>> > datasets fits in memory... That's why I was considering map-side join,
>> but
>> > by default it doesn't fit to my problem. I could probably get it to work
>> > though, but I would have to enforce the requirements of the map-side
>> join.
>> >
>> >
>> > 2012/9/10 Hemanth Yamijala <yh...@thoughtworks.com>
>> >>
>> >> Hi,
>> >>
>> >> You could check DistributedCache
>> >> (
>> http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache
>> ).
>> >> It would allow you to distribute data to the nodes where your tasks
>> are run.
>> >>
>> >> Thanks
>> >> Hemanth
>> >>
>> >>
>> >> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann
>> >> <si...@gmail.com> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> I would like to perform a map-side join of two large datasets where
>> >>> dataset A consists of m*n elements and dataset B consists of n
>> elements. For
>> >>> the join, every element in dataset B needs to be accessed m times.
>> Each
>> >>> mapper would join one element from A with the corresponding element
>> from B.
>> >>> Elements here are actually data blocks. Is there a performance
>> problem (and
>> >>> difference compared to a slightly modified map-side join using the
>> >>> join-package) if I set dataset A as the map-reduce input and load the
>> >>> relevant element from dataset B directly from the HDFS inside the
>> mapper? I
>> >>> could store the elements of B in a MapFile for faster random access.
>> In the
>> >>> second case without the join-package I would not have to partition the
>> >>> datasets manually which would allow a bit more flexibility, but I'm
>> >>> wondering if HDFS access from inside a mapper is strictly bad. Also,
>> does
>> >>> Hadoop have a cache for such situations by any chance?
>> >>>
>> >>> I appreciate any comments!
>> >>>
>> >>> Sigurd
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>
>

Re: Reading from HDFS from inside the mapper

Posted by Sigurd Spieckermann <si...@gmail.com>.
OK, interesting. Just to confirm: is it okay to distribute quite large
files through the DistributedCache? Dataset B could be on the order of
gigabytes. Also, if I have far fewer nodes than elements/blocks in A, the
probability that every node will have to read (almost) every block of B is
quite high, so assuming the DistributedCache is okay here in general, it
would be more efficient than reading from HDFS. But what about the case
where I have m*n nodes? Then every node would receive all of B while only
needing a small fraction of it, right? Could you elaborate on this in a few
sentences just to be sure I understand Hadoop correctly?
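
Just to make the alternative I have in mind concrete (reading the relevant
element of B straight from HDFS instead of shipping all of B to every
node), roughly something like this inside the mapper -- untested sketch,
old API, key/value types and the path are placeholders:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class BLookupMapper extends MapReduceBase {
  private MapFile.Reader bReader;
  private final Text bValue = new Text();

  @Override
  public void configure(JobConf conf) {
    try {
      FileSystem fs = FileSystem.get(conf);
      // MapFile directory holding (part of) dataset B on HDFS.
      bReader = new MapFile.Reader(fs, "/data/B/part-00000", conf);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  // Random access into B; returns null if the key is not present.
  protected Text lookupB(IntWritable key) throws IOException {
    return (Text) bReader.get(key, bValue);
  }
}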

Thanks,
Sigurd

2012/9/10 Harsh J <ha...@cloudera.com>

> Sigurd,
>
> Hemanth's recommendation of DistributedCache does fit your requirement
> - it is a generic way of distributing files and archives to tasks of a
> job. It is not something that pushes things automatically in memory,
> but on the local disk of the TaskTracker your task runs on. You can
> choose to then use a LocalFileSystem impl. to read it out from there,
> which would end up being (slightly) faster than your same approach
> applied to MapFiles on HDFS.
>
> On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann
> <si...@gmail.com> wrote:
> > I checked DistributedCache, but in general I have to assume that none of
> the
> > datasets fits in memory... That's why I was considering map-side join,
> but
> > by default it doesn't fit to my problem. I could probably get it to work
> > though, but I would have to enforce the requirements of the map-side
> join.
> >
> >
> > 2012/9/10 Hemanth Yamijala <yh...@thoughtworks.com>
> >>
> >> Hi,
> >>
> >> You could check DistributedCache
> >> (
> http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache
> ).
> >> It would allow you to distribute data to the nodes where your tasks are
> run.
> >>
> >> Thanks
> >> Hemanth
> >>
> >>
> >> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann
> >> <si...@gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I would like to perform a map-side join of two large datasets where
> >>> dataset A consists of m*n elements and dataset B consists of n
> elements. For
> >>> the join, every element in dataset B needs to be accessed m times. Each
> >>> mapper would join one element from A with the corresponding element
> from B.
> >>> Elements here are actually data blocks. Is there a performance problem
> (and
> >>> difference compared to a slightly modified map-side join using the
> >>> join-package) if I set dataset A as the map-reduce input and load the
> >>> relevant element from dataset B directly from the HDFS inside the
> mapper? I
> >>> could store the elements of B in a MapFile for faster random access.
> In the
> >>> second case without the join-package I would not have to partition the
> >>> datasets manually which would allow a bit more flexibility, but I'm
> >>> wondering if HDFS access from inside a mapper is strictly bad. Also,
> does
> >>> Hadoop have a cache for such situations by any chance?
> >>>
> >>> I appreciate any comments!
> >>>
> >>> Sigurd
> >>
> >>
> >
>
>
>
> --
> Harsh J
>

Re: Reading from HDFS from inside the mapper

Posted by Harsh J <ha...@cloudera.com>.
Sigurd,

Hemanth's recommendation of DistributedCache does fit your requirement
- it is a generic way of distributing files and archives to the tasks of a
job. It does not push anything into memory automatically; it places the
files on the local disk of the TaskTracker your task runs on. You can
then choose to use a LocalFileSystem impl. to read them from there,
which would end up being (slightly) faster than the same approach
applied to MapFiles on HDFS.
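
Roughly like this, for example (untested sketch against the old API; the
key/value types are up to you):

import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapred.JobConf;

public class CacheReader {
  // Opens the first file shipped via the DistributedCache through the
  // local file system of the node the task runs on.
  public static SequenceFile.Reader openFirstCachedFile(JobConf conf)
      throws IOException {
    Path[] cached = DistributedCache.getLocalCacheFiles(conf);
    FileSystem localFs = FileSystem.getLocal(conf);
    return new SequenceFile.Reader(localFs, cached[0], conf);
  }
}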

On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann
<si...@gmail.com> wrote:
> I checked DistributedCache, but in general I have to assume that none of the
> datasets fits in memory... That's why I was considering map-side join, but
> by default it doesn't fit to my problem. I could probably get it to work
> though, but I would have to enforce the requirements of the map-side join.
>
>
> 2012/9/10 Hemanth Yamijala <yh...@thoughtworks.com>
>>
>> Hi,
>>
>> You could check DistributedCache
>> (http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
>> It would allow you to distribute data to the nodes where your tasks are run.
>>
>> Thanks
>> Hemanth
>>
>>
>> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann
>> <si...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I would like to perform a map-side join of two large datasets where
>>> dataset A consists of m*n elements and dataset B consists of n elements. For
>>> the join, every element in dataset B needs to be accessed m times. Each
>>> mapper would join one element from A with the corresponding element from B.
>>> Elements here are actually data blocks. Is there a performance problem (and
>>> difference compared to a slightly modified map-side join using the
>>> join-package) if I set dataset A as the map-reduce input and load the
>>> relevant element from dataset B directly from the HDFS inside the mapper? I
>>> could store the elements of B in a MapFile for faster random access. In the
>>> second case without the join-package I would not have to partition the
>>> datasets manually which would allow a bit more flexibility, but I'm
>>> wondering if HDFS access from inside a mapper is strictly bad. Also, does
>>> Hadoop have a cache for such situations by any chance?
>>>
>>> I appreciate any comments!
>>>
>>> Sigurd
>>
>>
>



-- 
Harsh J

Re: Reading from HDFS from inside the mapper

Posted by Sigurd Spieckermann <si...@gmail.com>.
I checked DistributedCache, but in general I have to assume that none of
the datasets fits in memory... That's why I was considering a map-side
join, but by default it doesn't fit my problem. I could probably get it to
work, though I would have to enforce the requirements of the map-side
join.
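
(To spell out what I mean by "enforce the requirements": as far as I
understand, both inputs need the same number of partitions, produced with
the same partitioner and sorted by key. Something like this preparatory
job for B -- untested sketch, old API; types, paths and the partition
count are placeholders, and A would have to be prepared the same way:)

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class PrepareB {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(PrepareB.class);
    conf.setJobName("repartition-B");
    // Assumes B is stored as SequenceFile<IntWritable, Text>.
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    // Same partitioner and number of reducers as for dataset A, so that
    // part-i of A lines up with part-i of B.
    conf.setPartitionerClass(HashPartitioner.class);
    conf.setNumReduceTasks(8);
    conf.setOutputKeyClass(IntWritable.class);
    conf.setOutputValueClass(Text.class);
    // MapFileOutputFormat writes sorted, randomly accessible parts.
    conf.setOutputFormat(MapFileOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path("/data/B-raw"));
    FileOutputFormat.setOutputPath(conf, new Path("/data/B"));
    JobClient.runJob(conf);
  }
}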

2012/9/10 Hemanth Yamijala <yh...@thoughtworks.com>

> Hi,
>
> You could check DistributedCache (
> http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
> It would allow you to distribute data to the nodes where your tasks are run.
>
> Thanks
> Hemanth
>
>
> On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann <
> sigurd.spieckermann@gmail.com> wrote:
>
>> Hi,
>>
>> I would like to perform a map-side join of two large datasets where
>> dataset A consists of m*n elements and dataset B consists of n elements.
>> For the join, every element in dataset B needs to be accessed m times. Each
>> mapper would join one element from A with the corresponding element from B.
>> Elements here are actually data blocks. Is there a performance problem (and
>> difference compared to a slightly modified map-side join using the
>> join-package) if I set dataset A as the map-reduce input and load the
>> relevant element from dataset B directly from the HDFS inside the mapper? I
>> could store the elements of B in a MapFile for faster random access. In the
>> second case without the join-package I would not have to partition the
>> datasets manually which would allow a bit more flexibility, but I'm
>> wondering if HDFS access from inside a mapper is strictly bad. Also, does
>> Hadoop have a cache for such situations by any chance?
>>
>> I appreciate any comments!
>>
>> Sigurd
>>
>
>

Re: Reading from HDFS from inside the mapper

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
Hi,

You could check DistributedCache (
http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
It would allow you to distribute data to the nodes where your tasks are run.
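
For example, something along these lines in the job driver (sketch; the
path is a placeholder for a file of dataset B on HDFS):

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetup {
  public static void addDatasetB(JobConf conf) throws Exception {
    // The file is copied to the local disk of every node that runs a
    // task of this job; tasks can then read it locally.
    DistributedCache.addCacheFile(new URI("/data/B/b.seq"), conf);
  }
}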

Thanks
Hemanth

On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann <
sigurd.spieckermann@gmail.com> wrote:

> Hi,
>
> I would like to perform a map-side join of two large datasets where
> dataset A consists of m*n elements and dataset B consists of n elements.
> For the join, every element in dataset B needs to be accessed m times. Each
> mapper would join one element from A with the corresponding element from B.
> Elements here are actually data blocks. Is there a performance problem (and
> difference compared to a slightly modified map-side join using the
> join-package) if I set dataset A as the map-reduce input and load the
> relevant element from dataset B directly from the HDFS inside the mapper? I
> could store the elements of B in a MapFile for faster random access. In the
> second case without the join-package I would not have to partition the
> datasets manually which would allow a bit more flexibility, but I'm
> wondering if HDFS access from inside a mapper is strictly bad. Also, does
> Hadoop have a cache for such situations by any chance?
>
> I appreciate any comments!
>
> Sigurd
>
