Posted to mapreduce-user@hadoop.apache.org by Suraj Nayak <sn...@gmail.com> on 2015/03/18 18:30:27 UTC

Re: Reading 2 table data in MapReduce for Performing Join

Hi All,

The patch from https://issues.apache.org/jira/browse/HIVE-4997 helped!
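
For anyone who finds this thread later: below is a minimal driver sketch of
how the patched API might be used. HCatMultipleInputs comes from the
HIVE-4997 patch, and the addInput signature shown (database, table, optional
filter, per-table mapper) is my assumption about the patch's API, so verify
it against the patch version you applied.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    // Assumed package for the class added by the HIVE-4997 patch:
    import org.apache.hive.hcatalog.mapreduce.HCatMultipleInputs;

    public class JoinDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hcat-two-table-join");
        job.setJarByClass(JoinDriver.class);

        // One call per table, each with its own mapper, analogous to
        // Hadoop's MultipleInputs.addInputPath. Signature is assumed:
        // addInput(job, dbName, tableName, filter, mapperClass).
        HCatMultipleInputs.addInput(job, "default", "table_a", null, TableAMapper.class);
        HCatMultipleInputs.addInput(job, "default", "table_b", null, TableBMapper.class);

        job.setReducerClass(JoinReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The mapper and reducer classes referenced here are sketched below the quoted
question.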

On Tue, Mar 17, 2015 at 1:05 AM, Suraj Nayak <sn...@gmail.com> wrote:

> Hi,
>
> I tried reading data from one Hive table in MapReduce via HCatalog, using
> something similar to
> https://cwiki.apache.org/confluence/display/Hive/HCatalog+InputOutput#HCatalogInputOutput-RunningMapReducewithHCatalog
> and was able to read it successfully.
>
> Now I am trying to read 2 tables, as the requirement is to join them. I
> did not find an API similar to *FileInputFormat.addInputPaths* in
> *HCatInputFormat*. What is the equivalent in HCatalog?
>
> I had previously performed a join over plain HDFS files using
> FileInputFormat (determining each record's source table from the input
> split in the mapper). This article helped me code that join:
> http://www.codingjunkie.com/mapreduce-reduce-joins/. Can someone suggest
> how I can perform the join using HCatalog?
>
> Briefly, the aim is to (one way to wire this up is sketched below this
> quote):
>
>    - Read 2 tables (with nearly identical schemas).
>    - If a key exists in both tables, send its records to the same reducer.
>    - Do some processing on the records in the reducer.
>    - Save the output to a file or a Hive table.
>
> *P.S.: The reason for using MapReduce for the join is a complex
> requirement which can't be solved via Hive/Pig directly.*
>
> Any help will be greatly appreciated :)
>
> --
> Thanks
> Suraj Nayak M
>
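
For the plan quoted above, a minimal reduce-side join sketch, under the same
assumptions as the driver sketch earlier in this message (HCatalog package
names, and field 0 as the join key -- adjust both to your setup). Each
table's mapper tags its records with the source table, and the reducer emits
only keys present on both sides:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hive.hcatalog.data.HCatRecord;

    // Tags each record from table_a; TableBMapper is identical except it
    // writes a "B|" prefix. The wide Writable key type matters for
    // ORC-backed tables (see the rest of this thread).
    class TableAMapper extends Mapper<Writable, HCatRecord, Text, Text> {
      @Override
      protected void map(Writable key, HCatRecord value, Context ctx)
          throws IOException, InterruptedException {
        // Assumption: field 0 holds the (non-null) join key.
        ctx.write(new Text(value.get(0).toString()), new Text("A|" + value));
      }
    }

    class JoinReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context ctx)
          throws IOException, InterruptedException {
        List<String> aSide = new ArrayList<String>();
        List<String> bSide = new ArrayList<String>();
        for (Text v : values) {
          String s = v.toString();
          if (s.startsWith("A|")) aSide.add(s.substring(2));
          else bSide.add(s.substring(2));
        }
        // Inner join: emit a pair only when the key exists in both tables.
        for (String a : aSide)
          for (String b : bSide)
            ctx.write(key, new Text(a + "\t" + b));
      }
    }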



-- 
Thanks
Suraj Nayak M

Re: Reading 2 table data in MapReduce for Performing Join

Posted by Suraj Nayak <sn...@gmail.com>.
This is solved. I used Writable instead of LongWritable or NullWritable as
the Mapper input key type.
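
A minimal sketch of the fix, assuming the HCatalog mapper from earlier in
the thread (package names assumed as before):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hive.hcatalog.data.HCatRecord;

    // Declaring the input key as the Writable interface accepts both the
    // LongWritable keys produced for TEXTFILE-backed tables and the
    // NullWritable keys produced for ORC-backed tables, so no cast fails.
    public class MyMapper extends Mapper<Writable, HCatRecord, Text, Text> {
      @Override
      protected void map(Writable key, HCatRecord value, Context ctx)
          throws IOException, InterruptedException {
        // ... same body as before; the key is typically ignored anyway.
      }
    }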

Thanks
Suraj Nayak
On 19-Mar-2015 9:48 PM, "Suraj Nayak" <sn...@gmail.com> wrote:

Re: Reading 2 table data in MapReduce for Performing Join

Posted by Suraj Nayak <sn...@gmail.com>.
Is this related to https://issues.apache.org/jira/browse/HIVE-4329? Is
there a workaround?

On Thu, Mar 19, 2015 at 9:47 PM, Suraj Nayak <sn...@gmail.com> wrote:

-- 
Thanks
Suraj Nayak M

Re: Reading 2 table data in MapReduce for Performing Join

Posted by Suraj Nayak <sn...@gmail.com>.
Hi All,

I was able to integrate HCatMultipleInputs from the patch for tables
created as TEXTFILE. But I get an error when I read a table created as ORC.
The error is below:

15/03/19 10:51:32 INFO mapreduce.Job: Task Id :
attempt_1425012118520_9756_m_000000_0, Status : FAILED
Error: java.lang.ClassCastException: org.apache.hadoop.io.NullWritable
cannot be cast to org.apache.hadoop.io.LongWritable
    at com.abccompany.mapreduce.MyMapper.map(MyMapper.java:15)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
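
(For context: this looks like a mapper whose input key is declared as
LongWritable -- a hypothetical reconstruction of the code around
MyMapper.java:15 follows, not the actual source. That declaration works for
TEXTFILE-backed tables, whose record reader supplies LongWritable byte
offsets as keys, but the ORC reader supplies NullWritable keys, so the
implicit cast fails exactly as in the trace above.)

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hive.hcatalog.data.HCatRecord;

    // Hypothetical reconstruction -- the key type is too narrow for ORC input:
    public class MyMapper extends Mapper<LongWritable, HCatRecord, Text, Text> {
      @Override
      protected void map(LongWritable key, HCatRecord value, Context ctx)
          throws IOException, InterruptedException {
        // ...
      }
    }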


Can anyone help?

Thanks in advance!

On Wed, Mar 18, 2015 at 11:00 PM, Suraj Nayak <sn...@gmail.com> wrote:

-- 
Thanks
Suraj Nayak M
