Posted to user@hadoop.apache.org by Adrian CAPDEFIER <ch...@gmail.com> on 2013/08/31 03:01:36 UTC

Job config before read fields

Howdy,

I apologise for the lack of code in this message, but the code is fairly
convoluted and it would obscure my problem. That being said, I can put
together some sample code if really needed.

I am trying to pass some metadata between the map & reduce steps. This
metadata is read and generated in the map step and stored in the job
config. It also needs to be recreated on the reduce node before the key/
value fields can be read in the readFields function.

I had assumed that I would be able to override the Reducer.setup() function
and that would be it, but apparently the readFields function is called
before the Reducer.setup() function.

My question is: what is the best place on the reduce node where I can
access the job configuration/context before the readFields function is
called?
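
For concreteness, a minimal, made-up sketch of the kind of key type involved
(the class and field names are mine, not from the real code):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: deserializing and comparing it needs a separately
// obtained "metadata" object that is deliberately not written with every record.
public class CustomKey implements WritableComparable<CustomKey> {
    private byte[] payload = new byte[0];    // the actual key bytes
    private Map<String, String> metadata;    // needed to interpret the payload

    // This would have to be populated from the job config *before* the
    // framework first calls readFields() -- which is exactly the open question.
    public void setMetadata(Map<String, String> metadata) {
        this.metadata = metadata;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        int len = in.readInt();
        payload = new byte[len];
        in.readFully(payload);
        // interpreting 'payload' correctly requires 'metadata' to be present here
    }

    @Override
    public int compareTo(CustomKey other) {
        // the comparison would also depend on the metadata; placeholder only
        return 0;
    }
}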

This is the stack trace:

        at
org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
        at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
        at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:70)
        at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
        at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
        at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1298)
        at
org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)

Re: Job config before read fields

Posted by Adrian CAPDEFIER <ch...@gmail.com>.
Hi Shahab,

Sorry about the late reply; a personal matter came up and took most of
my time. Thank you for your replies.

The solution I chose was to temporarily transfer the metadata along with
the data and then restore it on the reduce nodes. This works from a
functional perspective as long as there are no performance requirements and
it will have to do for now.
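
A minimal sketch of that interim approach (the wrapper and field types are
hypothetical stand-ins): the value simply carries the serialized metadata
alongside each record, so readFields() is self-sufficient on the reduce side,
at the cost of repeating the metadata for every record.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Ships the metadata inline with the data; functional, but wasteful,
// hence the performance caveat above.
public class DataWithMetadata implements Writable {
    private Text metadata = new Text();   // stand-in for the real metadata type
    private Text data = new Text();       // stand-in for the real data type

    @Override
    public void write(DataOutput out) throws IOException {
        metadata.write(out);
        data.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        metadata.readFields(in);   // restored first, then available to read the data
        data.readFields(in);
    }
}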

The permanent solution will likely involve tweaking Hadoop, but that is a
different kettle of fish.
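
(For reference on the comparator idea in the exchange quoted below: in the
Hadoop versions I have looked at, a sort comparator registered on the job is
created with ReflectionUtils.newInstance(clazz, conf), which calls setConf()
on anything implementing Configurable -- so a comparator can see the job
config before any keys are compared. A hedged sketch, with a made-up property
name and Text as a stand-in key class:)

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;

public class MetadataAwareComparator extends WritableComparator implements Configurable {
    private Configuration conf;
    private String serializedMetadata;   // whatever was stashed in the job config

    public MetadataAwareComparator() {
        super(Text.class, true);   // stand-in key class; create instances for compare()
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        // hypothetical property under which the metadata was stored at submission time
        this.serializedMetadata = conf.get("my.app.metadata");
        // compare(...) could now be overridden to rebuild and use the metadata
    }

    @Override
    public Configuration getConf() {
        return conf;
    }
}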


On Sun, Sep 1, 2013 at 12:48 AM, Shahab Yunus <sh...@gmail.com> wrote:

> Personally, I don't know a way to access job configuration parameters in
> custom implementation of Writables ( at least not an elegant and
> appropriate one. Of course hacks of various kinds be done.) Maybe experts
> can chime in?
>
> One idea that I thought about was to use MapWritable (if you have not
> explored it already.) You can encode the 'custom metadata' for you 'data'
> as one byte symbols and move your data in the M/R flow as a map. Then while
> deserialization you will have the type (or your 'custom metadata') in the
> key part of the map and the value would be you actual data. This aligns
> with the efficient approach that is used natively in Hadoop for
> Strings/Text i.e. compact metadata (though I agree that you are not taking
> advantage of the other aspect of non-dependence between metadata and the
> data it defines.)
>
> Take a look at that:
> Page 96 of the Definitive Guide:
>
> http://books.google.com/books?id=Nff49D7vnJcC&pg=PA96&lpg=PA96&dq=mapwritable+in+hadoop&source=bl&ots=IiixYu7vXu&sig=4V6H7cY-MrNT7Rzs3WlODsDOoP4&hl=en&sa=X&ei=aX4iUp2YGoaosASs_YCACQ&sqi=2&ved=0CFUQ6AEwBA#v=onepage&q=mapwritable%20in%20hadoop&f=false
>
> and then this:
>
> http://www.chrisstucchio.com/blog/2011/mapwritable_sometimes_a_performance_hog.html
>
> and add your own custom types here (note that you are restricted by size
> of byte):
>
> http://hadoop.sourcearchive.com/documentation/0.20.2plus-pdfsg1-1/AbstractMapWritable_8java-source.html
>
> Regards,
> Shahab
>
>
> On Sat, Aug 31, 2013 at 5:38 AM, Adrian CAPDEFIER <ch...@gmail.com>wrote:
>
>> Thank you for your help Shahab.
>>
>> I guess I wasn't being too clear. My logic is that I use a custom type as
>> key and in order to deserialize it on the compute nodes, I need an extra
>> piece of information (also a custom type).
>>
>> To use an analogy, a Text is serialized by writing the length of the
>> string as a number and then the bytes that compose the actual string. When
>> it is deserialized, the number informs the reader when to stop reading the
>> string. This number is varies from string to string and it is compact so it
>> makes sense to serialize it with the string.
>>
>> My use case is similar to it. I have a complex type (let's call this
>> data), and in order to deserialize it, I need another complex type (let's
>> call this second type metadata). The metadata is not closely tied to the
>> data (i.e. if the data value changes, the metadata does not) and the
>> metadata size is quite large.
>>
>> I ruled out a couple of options, but please let me know if you think I
>> did so for the wrong reasons:
>> 1. I could serialize each data value with it's own metadata value, but
>> since the data value count is in the +tens of millions and the metadata
>> value distinct count can be up to one hundred, it would waste resources in
>> the system.
>> 2. I could serialize the metadata and then the data as a collection
>> property of the metadata. This would be an elegant solution code-wise, but
>> then all the data would have to be read and kept in memory as a massive
>> object before any reduce operations can happen. I wasn't able to find any
>> info on this online so this is just a guess from peeking at the hadoop code.
>>
>> My "solution" was to serialize the data with a hash of the metadata and
>> separately serialize the metadata and its hash in the job configuration (as
>> key/value pairs). For this to work, I would need to be able to deserialize
>> the metadata on the reduce node before the data is deserialized in the
>> readFields() method.
>>
>> I think that for that to happen I need to hook into the code somewhere
>> where a context or job configuration is used (before readFields()), but I'm
>> stumped as to where that is.
>>
>>  Cheers,
>> Adi
>>
>>
>> On Sat, Aug 31, 2013 at 3:42 AM, Shahab Yunus <sh...@gmail.com>wrote:
>>
>>> What I meant was that you might have to split or redesign your logic or
>>> your usecase (which we don't know about)?
>>>
>>> Regards,
>>> Shahab
>>>
>>>
>>> On Fri, Aug 30, 2013 at 10:31 PM, Adrian CAPDEFIER <
>>> chivas314159@gmail.com> wrote:
>>>
>>>> But how would the comparator have access to the job config?
>>>>
>>>>
>>>> On Sat, Aug 31, 2013 at 2:38 AM, Shahab Yunus <sh...@gmail.com>wrote:
>>>>
>>>>> I think you have to override/extend the Comparator to achieve that,
>>>>> something like what is done in Secondary Sort?
>>>>>
>>>>> Regards,
>>>>> Shahab
>>>>>
>>>>>
>>>>> On Fri, Aug 30, 2013 at 9:01 PM, Adrian CAPDEFIER <
>>>>> chivas314159@gmail.com> wrote:
>>>>>
>>>>>> Howdy,
>>>>>>
>>>>>> I apologise for the lack of code in this message, but the code is
>>>>>> fairly convoluted and it would obscure my problem. That being said, I can
>>>>>> put together some sample code if really needed.
>>>>>>
>>>>>> I am trying to pass some metadata between the map & reduce steps.
>>>>>> This metadata is read and generated in the map step and stored in the job
>>>>>> config. It also needs to be recreated on the reduce node before the key/
>>>>>> value fields can be read in the readFields function.
>>>>>>
>>>>>> I had assumed that I would be able to override the Reducer.setup()
>>>>>> function and that would be it, but apparently the readFields function is
>>>>>> called before the Reducer.setup() function.
>>>>>>
>>>>>> My question is what is any (the best) place on the reduce node where
>>>>>> I can access the job configuration/ context before the readFields function
>>>>>> is called?
>>>>>>
>>>>>> This is the stack trace:
>>>>>>
>>>>>>         at
>>>>>> org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
>>>>>>         at
>>>>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
>>>>>>         at
>>>>>> org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:70)
>>>>>>         at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
>>>>>>         at
>>>>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
>>>>>>         at
>>>>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1298)
>>>>>>         at
>>>>>> org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
>>>>>>         at
>>>>>> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
>>>>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>>>>>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>>>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>         at
>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
>>>>>>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Job config before read fields

Posted by Shahab Yunus <sh...@gmail.com>.
Personally, I don't know a way to access job configuration parameters in
custom implementations of Writables (at least not an elegant and
appropriate one; of course, hacks of various kinds can be done). Maybe experts
can chime in?

One idea that I thought about was to use MapWritable (if you have not
explored it already). You can encode the 'custom metadata' for your 'data'
as one-byte symbols and move your data through the M/R flow as a map. Then,
during deserialization, you will have the type (or your 'custom metadata') in
the key part of the map, and the value would be your actual data. This aligns
with the efficient approach that is used natively in Hadoop for
Strings/Text, i.e. compact metadata (though I agree that you are not taking
advantage of the other aspect, the non-dependence between the metadata and the
data it defines).
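
A hedged sketch of that idea, with made-up symbol values -- a ByteWritable
acts as the one-byte "metadata symbol" keying the actual record inside a
MapWritable:

import org.apache.hadoop.io.ByteWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

public class MetadataSymbols {
    // One-byte symbols standing for the 'custom metadata' (values are illustrative).
    public static final byte PLAIN_RECORD = 0x01;
    public static final byte SCHEMA_V2    = 0x02;

    // Wrap a record and its metadata symbol into a MapWritable for the M/R flow.
    public static MapWritable wrap(byte symbol, Text record) {
        MapWritable map = new MapWritable();
        map.put(new ByteWritable(symbol), record);
        return map;
    }
}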

Take a look at that:
Page 96 of the Definitive Guide:
http://books.google.com/books?id=Nff49D7vnJcC&pg=PA96&lpg=PA96&dq=mapwritable+in+hadoop&source=bl&ots=IiixYu7vXu&sig=4V6H7cY-MrNT7Rzs3WlODsDOoP4&hl=en&sa=X&ei=aX4iUp2YGoaosASs_YCACQ&sqi=2&ved=0CFUQ6AEwBA#v=onepage&q=mapwritable%20in%20hadoop&f=false

and then this:
http://www.chrisstucchio.com/blog/2011/mapwritable_sometimes_a_performance_hog.html

and add your own custom types here (note that you are restricted by the size
of a byte):
http://hadoop.sourcearchive.com/documentation/0.20.2plus-pdfsg1-1/AbstractMapWritable_8java-source.html
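
If the map is to carry application-specific Writable classes, one commonly
sketched approach (assuming the protected addToMap helper inherited from
AbstractMapWritable behaves as in the source linked above) is to pre-register
them in a MapWritable subclass; the id space is a single byte, hence the
restriction mentioned above. The class names here are hypothetical:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Writable;

public class AppMapWritable extends MapWritable {

    // Trivial custom value type, kept minimal so the sketch is self-contained.
    public static class MyMetadataWritable implements Writable {
        private int version;

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(version);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            version = in.readInt();
        }
    }

    public AppMapWritable() {
        super();
        addToMap(MyMetadataWritable.class);   // assign the class a one-byte id
    }
}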

Regards,
Shahab


On Sat, Aug 31, 2013 at 5:38 AM, Adrian CAPDEFIER <ch...@gmail.com> wrote:

> Thank you for your help Shahab.
>
> I guess I wasn't being too clear. My logic is that I use a custom type as
> key and in order to deserialize it on the compute nodes, I need an extra
> piece of information (also a custom type).
>
> To use an analogy, a Text is serialized by writing the length of the
> string as a number and then the bytes that compose the actual string. When
> it is deserialized, the number informs the reader when to stop reading the
> string. This number is varies from string to string and it is compact so it
> makes sense to serialize it with the string.
>
> My use case is similar to it. I have a complex type (let's call this
> data), and in order to deserialize it, I need another complex type (let's
> call this second type metadata). The metadata is not closely tied to the
> data (i.e. if the data value changes, the metadata does not) and the
> metadata size is quite large.
>
> I ruled out a couple of options, but please let me know if you think I did
> so for the wrong reasons:
> 1. I could serialize each data value with it's own metadata value, but
> since the data value count is in the +tens of millions and the metadata
> value distinct count can be up to one hundred, it would waste resources in
> the system.
> 2. I could serialize the metadata and then the data as a collection
> property of the metadata. This would be an elegant solution code-wise, but
> then all the data would have to be read and kept in memory as a massive
> object before any reduce operations can happen. I wasn't able to find any
> info on this online so this is just a guess from peeking at the hadoop code.
>
> My "solution" was to serialize the data with a hash of the metadata and
> separately serialize the metadata and its hash in the job configuration (as
> key/value pairs). For this to work, I would need to be able to deserialize
> the metadata on the reduce node before the data is deserialized in the
> readFields() method.
>
> I think that for that to happen I need to hook into the code somewhere
> where a context or job configuration is used (before readFields()), but I'm
> stumped as to where that is.
>
> Cheers,
> Adi
>
>
> On Sat, Aug 31, 2013 at 3:42 AM, Shahab Yunus <sh...@gmail.com>wrote:
>
>> What I meant was that you might have to split or redesign your logic or
>> your usecase (which we don't know about)?
>>
>> Regards,
>> Shahab
>>
>>
>> On Fri, Aug 30, 2013 at 10:31 PM, Adrian CAPDEFIER <
>> chivas314159@gmail.com> wrote:
>>
>>> But how would the comparator have access to the job config?
>>>
>>>
>>> On Sat, Aug 31, 2013 at 2:38 AM, Shahab Yunus <sh...@gmail.com>wrote:
>>>
>>>> I think you have to override/extend the Comparator to achieve that,
>>>> something like what is done in Secondary Sort?
>>>>
>>>> Regards,
>>>> Shahab
>>>>
>>>>
>>>> On Fri, Aug 30, 2013 at 9:01 PM, Adrian CAPDEFIER <
>>>> chivas314159@gmail.com> wrote:
>>>>
>>>>> Howdy,
>>>>>
>>>>> I apologise for the lack of code in this message, but the code is
>>>>> fairly convoluted and it would obscure my problem. That being said, I can
>>>>> put together some sample code if really needed.
>>>>>
>>>>> I am trying to pass some metadata between the map & reduce steps. This
>>>>> metadata is read and generated in the map step and stored in the job
>>>>> config. It also needs to be recreated on the reduce node before the key/
>>>>> value fields can be read in the readFields function.
>>>>>
>>>>> I had assumed that I would be able to override the Reducer.setup()
>>>>> function and that would be it, but apparently the readFields function is
>>>>> called before the Reducer.setup() function.
>>>>>
>>>>> My question is what is any (the best) place on the reduce node where I
>>>>> can access the job configuration/ context before the readFields function is
>>>>> called?
>>>>>
>>>>> This is the stack trace:
>>>>>
>>>>>         at
>>>>> org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
>>>>>         at
>>>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
>>>>>         at
>>>>> org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:70)
>>>>>         at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
>>>>>         at
>>>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
>>>>>         at
>>>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1298)
>>>>>         at
>>>>> org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
>>>>>         at
>>>>> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
>>>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>>>>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>         at
>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
>>>>>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Job config before read fields

Posted by Adrian CAPDEFIER <ch...@gmail.com>.
Thank you for your help Shahab.

I guess I wasn't being too clear. My logic is that I use a custom type as
key and in order to deserialize it on the compute nodes, I need an extra
piece of information (also a custom type).

To use an analogy, a Text is serialized by writing the length of the string
as a number and then the bytes that compose the actual string. When it is
deserialized, the number informs the reader when to stop reading the
string. This number varies from string to string and it is compact, so it
makes sense to serialize it with the string.

My use case is similar to it. I have a complex type (let's call this data),
and in order to deserialize it, I need another complex type (let's call
this second type metadata). The metadata is not closely tied to the data
(i.e. if the data value changes, the metadata does not) and the metadata
size is quite large.

I ruled out a couple of options, but please let me know if you think I did
so for the wrong reasons:
1. I could serialize each data value with its own metadata value, but
since the data value count is in the tens of millions or more and the metadata
value distinct count can be up to one hundred, it would waste resources in
the system.
2. I could serialize the metadata and then the data as a collection
property of the metadata. This would be an elegant solution code-wise, but
then all the data would have to be read and kept in memory as a massive
object before any reduce operations can happen. I wasn't able to find any
info on this online so this is just a guess from peeking at the hadoop code.

My "solution" was to serialize the data with a hash of the metadata and
separately serialize the metadata and its hash in the job configuration (as
key/value pairs). For this to work, I would need to be able to deserialize
the metadata on the reduce node before the data is deserialized in the
readFields() method.

I think that for that to happen I need to hook into the code somewhere
where a context or job configuration is used (before readFields()), but I'm
stumped as to where that is.
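
To make the idea concrete, below is a minimal sketch of the kind of key class I
have in mind. It is only an illustration and rests on an assumption I have not
verified: that the framework builds key instances through
ReflectionUtils.newInstance() with the job configuration, so that a key
implementing Configurable receives the configuration before readFields() is
called. MyKey, the payload layout and the "metadata.<hash>" configuration keys
are made-up names; the serialized metadata would have to be put under those
keys (e.g. conf.set("metadata." + hash, ...)) before the job is submitted.

    // Sketch only. Assumes the key instance is created via
    // ReflectionUtils.newInstance(), which calls setConf() on Configurable
    // objects before readFields() is ever invoked on them.
    // "MyKey" and the "metadata.<hash>" configuration keys are made-up names.
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.WritableComparable;

    public class MyKey implements WritableComparable<MyKey>, Configurable {

        private Configuration conf;
        private int metadataHash;   // small, serialized with every record
        private String metadata;    // large, rebuilt from the job configuration
        private byte[] payload;     // the actual data

        @Override
        public void setConf(Configuration conf) {
            // Called when the instance is created, i.e. before readFields().
            this.conf = conf;
        }

        @Override
        public Configuration getConf() {
            return conf;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(metadataHash);
            out.writeInt(payload.length);
            out.write(payload);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            metadataHash = in.readInt();
            if (conf != null) {
                // Hypothetical: the serialized metadata was stored in the job
                // configuration, keyed by its hash, before job submission.
                metadata = conf.get("metadata." + metadataHash);
            }
            payload = new byte[in.readInt()];
            in.readFully(payload);
        }

        @Override
        public int compareTo(MyKey other) {
            // The real comparison would use the metadata; kept trivial here.
            return (metadataHash < other.metadataHash) ? -1
                 : (metadataHash > other.metadataHash) ? 1 : 0;
        }
    }

Whether the spill-time comparator in the stack trace below builds its keys with
the configuration available is exactly the part I have not been able to
confirm, hence the null check in readFields().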

Cheers,
Adi


On Sat, Aug 31, 2013 at 3:42 AM, Shahab Yunus <sh...@gmail.com>wrote:

> What I meant was that you might have to split or redesign your logic or
> your usecase (which we don't know about)?
>
> Regards,
> Shahab
>
>
> On Fri, Aug 30, 2013 at 10:31 PM, Adrian CAPDEFIER <chivas314159@gmail.com
> > wrote:
>
>> But how would the comparator have access to the job config?
>>
>>
>> On Sat, Aug 31, 2013 at 2:38 AM, Shahab Yunus <sh...@gmail.com>wrote:
>>
>>> I think you have to override/extend the Comparator to achieve that,
>>> something like what is done in Secondary Sort?
>>>
>>> Regards,
>>> Shahab
>>>
>>>
>>> On Fri, Aug 30, 2013 at 9:01 PM, Adrian CAPDEFIER <
>>> chivas314159@gmail.com> wrote:
>>>
>>>> Howdy,
>>>>
>>>> I apologise for the lack of code in this message, but the code is
>>>> fairly convoluted and it would obscure my problem. That being said, I can
>>>> put together some sample code if really needed.
>>>>
>>>> I am trying to pass some metadata between the map & reduce steps. This
>>>> metadata is read and generated in the map step and stored in the job
>>>> config. It also needs to be recreated on the reduce node before the key/
>>>> value fields can be read in the readFields function.
>>>>
>>>> I had assumed that I would be able to override the Reducer.setup()
>>>> function and that would be it, but apparently the readFields function is
>>>> called before the Reducer.setup() function.
>>>>
>>>> My question is what is any (the best) place on the reduce node where I
>>>> can access the job configuration/ context before the readFields function is
>>>> called?
>>>>
>>>> This is the stack trace:
>>>>
>>>>         at
>>>> org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
>>>>         at
>>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
>>>>         at
>>>> org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:70)
>>>>         at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
>>>>         at
>>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
>>>>         at
>>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1298)
>>>>         at
>>>> org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
>>>>         at
>>>> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
>>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>>>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>         at
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
>>>>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>>>
>>>>
>>>
>>
>

Re: Job config before read fields

Posted by Shahab Yunus <sh...@gmail.com>.
What I meant was that you might have to split or redesign your logic or
your usecase (which we don't know about)?

Regards,
Shahab


On Fri, Aug 30, 2013 at 10:31 PM, Adrian CAPDEFIER
<ch...@gmail.com>wrote:

> But how would the comparator have access to the job config?
>
>
> On Sat, Aug 31, 2013 at 2:38 AM, Shahab Yunus <sh...@gmail.com>wrote:
>
>> I think you have to override/extend the Comparator to achieve that,
>> something like what is done in Secondary Sort?
>>
>> Regards,
>> Shahab
>>
>>
>> On Fri, Aug 30, 2013 at 9:01 PM, Adrian CAPDEFIER <chivas314159@gmail.com
>> > wrote:
>>
>>> Howdy,
>>>
>>> I apologise for the lack of code in this message, but the code is fairly
>>> convoluted and it would obscure my problem. That being said, I can put
>>> together some sample code if really needed.
>>>
>>> I am trying to pass some metadata between the map & reduce steps. This
>>> metadata is read and generated in the map step and stored in the job
>>> config. It also needs to be recreated on the reduce node before the key/
>>> value fields can be read in the readFields function.
>>>
>>> I had assumed that I would be able to override the Reducer.setup()
>>> function and that would be it, but apparently the readFields function is
>>> called before the Reducer.setup() function.
>>>
>>> My question is what is any (the best) place on the reduce node where I
>>> can access the job configuration/ context before the readFields function is
>>> called?
>>>
>>> This is the stack trace:
>>>
>>>         at
>>> org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
>>>         at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
>>>         at
>>> org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:70)
>>>         at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
>>>         at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
>>>         at
>>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1298)
>>>         at
>>> org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
>>>         at
>>> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>>         at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
>>>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>>
>>>
>>
>

Re: Job config before read fields

Posted by Adrian CAPDEFIER <ch...@gmail.com>.
But how would the comparator have access to the job config?


On Sat, Aug 31, 2013 at 2:38 AM, Shahab Yunus <sh...@gmail.com>wrote:

> I think you have to override/extend the Comparator to achieve that,
> something like what is done in Secondary Sort?
>
> Regards,
> Shahab
>
>
> On Fri, Aug 30, 2013 at 9:01 PM, Adrian CAPDEFIER <ch...@gmail.com>wrote:
>
>> Howdy,
>>
>> I apologise for the lack of code in this message, but the code is fairly
>> convoluted and it would obscure my problem. That being said, I can put
>> together some sample code if really needed.
>>
>> I am trying to pass some metadata between the map & reduce steps. This
>> metadata is read and generated in the map step and stored in the job
>> config. It also needs to be recreated on the reduce node before the key/
>> value fields can be read in the readFields function.
>>
>> I had assumed that I would be able to override the Reducer.setup()
>> function and that would be it, but apparently the readFields function is
>> called before the Reducer.setup() function.
>>
>> My question is what is any (the best) place on the reduce node where I
>> can access the job configuration/ context before the readFields function is
>> called?
>>
>> This is the stack trace:
>>
>>         at
>> org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
>>         at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
>>         at
>> org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:70)
>>         at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
>>         at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
>>         at
>> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1298)
>>         at
>> org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>         at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
>>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>
>>
>

Re: Job config before read fields

Posted by Shahab Yunus <sh...@gmail.com>.
I think you have to override/extend the Comparator to achieve that,
something like what is done in Secondary Sort?
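
Roughly along these lines. This sketch assumes that a comparator registered
with job.setSortComparatorClass() is instantiated through
ReflectionUtils.newInstance() with the job configuration, so a Configurable
comparator has setConf() called before any compare(); please double-check that
against your Hadoop version. MyKey stands in for your custom key class, and the
"metadata.*" configuration keys are only placeholders:

    // Sketch only. MyKey is a placeholder for the custom key type, and the
    // way the metadata would be looked up from the configuration is
    // hypothetical.
    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class MetadataAwareComparator extends WritableComparator
            implements Configurable {

        private Configuration conf;

        public MetadataAwareComparator() {
            super(MyKey.class, true);   // true: materialize keys for compare()
        }

        @Override
        public void setConf(Configuration conf) {
            // Injected when the comparator is instantiated, before the sort runs.
            this.conf = conf;
        }

        @Override
        public Configuration getConf() {
            return conf;
        }

        @SuppressWarnings("rawtypes")
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            // conf is available here, so the metadata could be fetched with
            // something like conf.get("metadata." + hash) and used in the
            // comparison.
            return super.compare(a, b);
        }
    }

You would register it with job.setSortComparatorClass(MetadataAwareComparator.class)
(and with job.setGroupingComparatorClass() as well, if the grouping needs the
same logic).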

Regards,
Shahab


On Fri, Aug 30, 2013 at 9:01 PM, Adrian CAPDEFIER <ch...@gmail.com>wrote:

> Howdy,
>
> I apologise for the lack of code in this message, but the code is fairly
> convoluted and it would obscure my problem. That being said, I can put
> together some sample code if really needed.
>
> I am trying to pass some metadata between the map & reduce steps. This
> metadata is read and generated in the map step and stored in the job
> config. It also needs to be recreated on the reduce node before the key/
> value fields can be read in the readFields function.
>
> I had assumed that I would be able to override the Reducer.setup()
> function and that would be it, but apparently the readFields function is
> called before the Reducer.setup() function.
>
> My question is what is any (the best) place on the reduce node where I can
> access the job configuration/ context before the readFields function is
> called?
>
> This is the stack trace:
>
>         at
> org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
>         at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
>         at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:70)
>         at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
>         at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
>         at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1298)
>         at
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
>
