Posted to dev@hive.apache.org by "Namit Jain (JIRA)" <ji...@apache.org> on 2010/02/25 01:40:27 UTC

[jira] Created: (HIVE-1197) create a new input format where a mapper spans a file

create a new input format where a mapper spans a file
-----------------------------------------------------

                 Key: HIVE-1197
                 URL: https://issues.apache.org/jira/browse/HIVE-1197
             Project: Hadoop Hive
          Issue Type: New Feature
          Components: Query Processor
            Reporter: Namit Jain
            Assignee: Namit Jain
             Fix For: 0.6.0


This will be needed for Sort merge joins.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1197) create a new input format where a mapper spans a file

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-1197:
------------------------------

    Attachment: hive.1197.4.patch

Updated the test output file.

> create a new input format where a mapper spans a file
> -----------------------------------------------------
>
>                 Key: HIVE-1197
>                 URL: https://issues.apache.org/jira/browse/HIVE-1197
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Siying Dong
>             Fix For: 0.6.0
>
>         Attachments: hive.1197.1.patch, hive.1197.2.patch, hive.1197.3.patch, hive.1197.4.patch
>
>
> This will be needed for Sort merge joins.



[jira] Updated: (HIVE-1197) create a new input format where a mapper spans a file

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-1197:
------------------------------

    Attachment: hive.1197.3.patch

No change from hive.1197.3.patch besides adding a newline at the end of the file.



[jira] Commented: (HIVE-1197) create a new input format where a mapper spans a file

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840787#action_12840787 ] 

Siying Dong commented on HIVE-1197:
-----------------------------------

At least when I tried it from my box, it didn't work.





[jira] Commented: (HIVE-1197) create a new input format where a mapper spans a file

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840793#action_12840793 ] 

Namit Jain commented on HIVE-1197:
----------------------------------

Does the new test work for you?



[jira] Commented: (HIVE-1197) create a new input format where a mapper spans a file

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839576#action_12839576 ] 

He Yongqiang commented on HIVE-1197:
------------------------------------

Correction about 5.1: it should be ((number of splits done) + currReader.getProgress()) / (total split number)
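
A minimal sketch of the corrected formula, written as plain arithmetic rather than as the patch's actual code. The class, method, and parameter names are illustrative assumptions; only the formula itself comes from the comment above.

```java
// Illustrative only: names are hypothetical and not taken from the patch.
class ChainedProgressSketch {
    /**
     * @param splitsDone      number of splits already fully consumed
     * @param currentProgress progress of the split being read, in [0, 1]
     * @param totalSplits     total number of splits handled by this reader
     */
    static float progress(int splitsDone, float currentProgress, int totalSplits) {
        if (totalSplits <= 0) {
            return 1.0f; // nothing to read, so treat the reader as finished
        }
        // Finished splits count as 1 each; the current split counts fractionally.
        return (splitsDone + currentProgress) / totalSplits;
    }

    public static void main(String[] args) {
        // Two of four splits finished, third one half read: (2 + 0.5) / 4
        System.out.println(progress(2, 0.5f, 4)); // prints 0.625
    }
}
```

With the same inputs, the earlier form (splits done) / (total) + progress would give 0.5 + 0.5 = 1.0, which reports completion while half of the splits are still unread.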



[jira] Commented: (HIVE-1197) create a new input format where a mapper spans a file

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839574#action_12839574 ] 

Namit Jain commented on HIVE-1197:
----------------------------------

Overall, looks good - some general comments.

Would it be a good idea to make BucketizedHiveInputFormat extend HiveInputFormat, and BucketizedHiveRecordReader extend HiveRecordReader?
You won't have to copy a lot of code, and it would be easier to maintain. For example, the check for ExecMapper in HiveRecordReader and similar future
optimizations would be easier to maintain.



[jira] Commented: (HIVE-1197) create a new input format where a mapper spans a file

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839575#action_12839575 ] 

He Yongqiang commented on HIVE-1197:
------------------------------------

Looks very good overall, congrats!

Just a few minor comments:
1. Can you change inputFormatClassName to use getter and setter methods?
2. There is some code duplicated from HiveInputFormat; can we reuse it?
3. In BucketizedHiveRecordReader's next, I think we should remove the check of "curReader == null". We should throw an exception if curReader == null, which means the reader has been closed.
4. I think we should remove line 207 in BucketizedHiveInputFormat:   newjob.setInputFormat(inputFormat.getClass());
5. In HiveRecordReader,
5.1 progress is calculated based on (number of splits done) / (total split number). Can we make it more accurate? Let's say the work is evenly divided among all splits. Something like this: (number of splits done) / (total split number) + currReader.getProgress();
5.2 getPos should return currReader.getPos()

Another one: do you think it is a good idea to let BucketizedHiveInputFormat extend HiveInputFormat? That way, the code would be clearer. And we should put the RecordReader and InputSplit in the same file as BucketizedHiveInputFormat.
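
Points 3 and 5.2 can be illustrated with a rough sketch of a reader that chains several underlying readers. This is not the code from the patch: SimpleReader, ListReader, and ChainedReaderSketch are hypothetical names, the interface is a cut-down stand-in rather than the Hadoop RecordReader, and the sketch tracks the closed state with a separate flag instead of relying on curReader == null.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical, minimal stand-in for a record reader. This is NOT the Hadoop
// RecordReader interface; it only keeps the pieces needed for the sketch.
interface SimpleReader {
    String next();   // returns null once the reader is exhausted
    long getPos();   // position within the underlying split
    void close();
}

// In-memory reader used only to make the sketch runnable.
class ListReader implements SimpleReader {
    private final List<String> records;
    private int pos = 0;

    ListReader(String... records) {
        this.records = Arrays.asList(records);
    }

    public String next() {
        return pos < records.size() ? records.get(pos++) : null;
    }

    public long getPos() {
        return pos;
    }

    public void close() {
    }
}

// Reads several underlying readers back to back, as one logical reader.
class ChainedReaderSketch implements SimpleReader {
    private final Iterator<SimpleReader> remaining;
    private SimpleReader curReader;
    private boolean closed = false;

    ChainedReaderSketch(List<SimpleReader> readers) {
        this.remaining = readers.iterator();
        this.curReader = remaining.hasNext() ? remaining.next() : null;
    }

    public String next() {
        if (closed) {
            // Calling next() after close() is a caller bug, so fail loudly.
            throw new IllegalStateException("reader has been closed");
        }
        while (curReader != null) {
            String record = curReader.next();
            if (record != null) {
                return record;
            }
            // Current reader is exhausted: move on to the next one, if any.
            curReader.close();
            curReader = remaining.hasNext() ? remaining.next() : null;
        }
        return null; // every underlying reader is exhausted
    }

    public long getPos() {
        if (closed) {
            throw new IllegalStateException("reader has been closed");
        }
        // Delegate to the reader currently being consumed.
        return curReader == null ? 0L : curReader.getPos();
    }

    public void close() {
        if (curReader != null) {
            curReader.close();
        }
        curReader = null;
        closed = true;
    }
}
```

Using an unchecked exception keeps the sketch short; a real reader would more likely surface the error as an IOException.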



[jira] Commented: (HIVE-1197) create a new input format where a mapper spans a file

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840903#action_12840903 ] 

Namit Jain commented on HIVE-1197:
----------------------------------

+1

will commit if the tests pass



[jira] Commented: (HIVE-1197) create a new input format where a mapper spans a file

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838763#action_12838763 ] 

Zheng Shao commented on HIVE-1197:
----------------------------------

Can you explain what "a mapper spans a file" means?




[jira] Assigned: (HIVE-1197) create a new input format where a mapper spans a file

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain reassigned HIVE-1197:
--------------------------------

    Assignee: Siying Dong  (was: Namit Jain)



[jira] Commented: (HIVE-1197) create a new input format where a mapper spans a file

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840785#action_12840785 ] 

Namit Jain commented on HIVE-1197:
----------------------------------

Looks good to me. Can you update the test result file? I am getting a diff.
Other than that, I am fine and can merge, unless Yongqiang has some additional comments.



[jira] Resolved: (HIVE-1197) create a new input format where a mapper spans a file

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain resolved HIVE-1197.
------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]

Committed. Thanks Siying



[jira] Commented: (HIVE-1197) create a new input format where a mapper spans a file

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840771#action_12840771 ] 

Namit Jain commented on HIVE-1197:
----------------------------------

What is the reason for creating such a large table in the test?
Is that necessary for testing, given that changing dfs.block.size is not helping?



[jira] Commented: (HIVE-1197) create a new input format where a mapper spans a file

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838959#action_12838959 ] 

Namit Jain commented on HIVE-1197:
----------------------------------

Currently, the split that a mapper processes is determined by a variety of parameters, including the dfs block size, the min split size, etc.
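
As a rough illustration of how those parameters interact, here is a sketch of the split-size arithmetic as I understand it from the classic org.apache.hadoop.mapred.FileInputFormat, namely max(minSize, min(goalSize, blockSize)). The class name, the sizes, and the simplified split count are assumptions made for the example, and real implementations allow some slack on the last split.

```java
// Illustrative arithmetic only; it does not call any Hadoop code.
class SplitSizeSketch {
    // Assumed rule: never smaller than minSize, otherwise the smaller of
    // the goal size (total size / requested map tasks) and the block size.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Simplified split count for one file: ceiling of fileSize / splitSize.
    static long numSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 1024 * mb;   // one 1 GB bucket file (made-up size)
        long blockSize = 64 * mb;    // made-up dfs block size
        long goalSize = fileSize / 2; // two map tasks requested

        long defaultSplit = computeSplitSize(goalSize, 1L, blockSize);
        System.out.println(numSplits(fileSize, defaultSplit)); // prints 16

        // Raising the minimum split size to the file size yields one split.
        long wholeFileSplit = computeSplitSize(goalSize, fileSize, blockSize);
        System.out.println(numSplits(fileSize, wholeFileSplit)); // prints 1
    }
}
```

Under these assumptions a single bucket file is read by 16 mappers by default, so no mapper sees the whole sorted bucket.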

It might be useful to have an option for when the user wants a mapper to scan 1 whole file. This will be especially useful for sort-merge join.
If the data is partitioned into various buckets, and each bucket is sorted, the sort-merge join can join the different buckets together.

For example, consider the following scenario:

table T1: sorted and bucketed by column 'key' into 1000 buckets
table T2: sorted and bucketed by column 'key' into 1000 buckets


and the query:


select * from T1 join T2 on key
mapjoin.

Instead of joining the table T1 with T2, the 1000 buckets can be joined with each other individually.
Since the data is sorted on the join key, sort-merge join can be used.
Say the buckets are named: b0001, b0002 .. b1000
Say table T1 is the big table, and the buckets from T2 are read by the mapper that is spawned to process T1.
Under the current approach, it will be very difficult to perform outer joins.

For example, if bucket b1 for T1 contains:


1
2
5
6
9
16
22
30

and the corresponding bucket for T2 contains:

2
4
8


If there are 2 mappers for bucket b1 for T1, processing 4 records each ((1,2,5,6) and (9,16,22,30) respectively),
it will be very difficult to perform an outer join. The mappers will need to peek into the previous record
and the next record respectively.

Moreover, it will be very difficult to ensure that the result also has 1000 buckets. Another map-reduce job
will be needed for that.

This can be easily solved if we are guaranteed that the whole bucket (or the file corresponding to the bucket)
will be processed by a single mapper.
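
The example above can be worked through with a small sketch. It is not Hive's join operator: it is a plain full outer merge over two sorted arrays of unique keys, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative full outer sort-merge join over two sorted arrays of unique keys.
class BucketMergeJoinSketch {
    static List<String> fullOuterJoin(int[] left, int[] right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        // Advance whichever side has the smaller key; emit a match on equality.
        while (i < left.length && j < right.length) {
            if (left[i] == right[j]) {
                out.add("(" + left[i] + "," + right[j] + ")");
                i++;
                j++;
            } else if (left[i] < right[j]) {
                out.add("(" + left[i] + ",null)");
                i++;
            } else {
                out.add("(null," + right[j] + ")");
                j++;
            }
        }
        // Whatever remains on either side has no partner.
        while (i < left.length) {
            out.add("(" + left[i++] + ",null)");
        }
        while (j < right.length) {
            out.add("(null," + right[j++] + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        int[] t1Bucket = {1, 2, 5, 6, 9, 16, 22, 30};
        int[] t2Bucket = {2, 4, 8};
        // One mapper sees the whole T1 bucket.
        System.out.println(fullOuterJoin(t1Bucket, t2Bucket));
        // A second mapper that sees only the last four T1 records.
        System.out.println(fullOuterJoin(new int[] {9, 16, 22, 30}, t2Bucket));
    }
}
```

Run on the whole bucket, the merge yields 10 rows with a single match on key 2. Run on only the second half of T1's bucket against the same T2 bucket, it emits (null,2) even though key 2 has a match in the first half, which is the kind of error the single-mapper guarantee avoids.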
 
