Posted to dev@pig.apache.org by "Chao Wang (JIRA)" <ji...@apache.org> on 2009/11/09 21:12:32 UTC

[jira] Created: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra' TableInputFormat

[Zebra] to support record(row)-based file split in Zebra' TableInputFormat
--------------------------------------------------------------------------

                 Key: PIG-1077
                 URL: https://issues.apache.org/jira/browse/PIG-1077
             Project: Pig
          Issue Type: New Feature
    Affects Versions: 0.4.0
            Reporter: Chao Wang
            Assignee: Chao Wang
             Fix For: 0.6.0


TFile now supports splitting by record sequence number (see Jira HADOOP-6218). We want to use this to provide record(row)-based input split support in Zebra.
One prominent benefit: for very large data files, we can create much finer-grained input splits than before, when we could only create one big split per big file.

In more detail, by default (when the user does not specify the number of splits to generate), the new row-based getSplits() works as follows:
1) Select the biggest column group in terms of data size, split all of its TFiles according to the HDFS block size (64 MB or 128 MB), and get a list of physical byte offsets per TFile. For example, assume that for the 1st TFile we get offset1, offset2, ..., offset10;
2) Invoke TFile.getRecordNumNear(long offset) to get the RecordNum of a key-value pair near each byte offset. For the example above, say we get recordNum1, recordNum2, ..., recordNum10;
3) Stitch together the [0, recordNum1], [recordNum1+1, recordNum2], ..., [recordNum9+1, recordNum10], [recordNum10+1, lastRecordNum] ranges of all column groups to form 11 record-based input splits for the 1st TFile;
4) For each input split, create a TFile scanner through TFile.createScannerByRecordNum(long beginRecNum, long endRecNum).
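
The arithmetic in steps 1) to 3) can be sketched as follows. This is only an illustrative Python model, not the Zebra or TFile Java code: the function names and the record_starts list (byte offset at which each record begins) are assumptions made for the example, and the real TFile.getRecordNumNear may choose a different nearby record. The ranges are inclusive, following the bracket notation above; the thread does not say whether endRecNum in step 4) is inclusive.

```python
from bisect import bisect_right
from typing import List, Tuple


def block_offsets(file_size: int, block_size: int) -> List[int]:
    """Step 1: byte offsets of the block boundaries strictly inside the file."""
    return list(range(block_size, file_size, block_size))


def record_num_near(record_starts: List[int], offset: int) -> int:
    """Step 2 (hypothetical stand-in): last record starting at or before offset."""
    return max(bisect_right(record_starts, offset) - 1, 0)


def record_ranges(boundaries: List[int], last_record: int) -> List[Tuple[int, int]]:
    """Step 3: inclusive [begin, end] record ranges between boundary records."""
    ranges = []
    begin = 0
    for b in boundaries:
        # Skip boundaries that would yield an empty or inverted range.
        if begin <= b < last_record:
            ranges.append((begin, b))
            begin = b + 1
    ranges.append((begin, last_record))
    return ranges
```

With ten distinct boundary record numbers below lastRecordNum, record_ranges returns the 11 ranges described in step 3).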

Note: the conversion from byte offset to record number will be done by each mapper rather than at the job initialization phase. This is for performance reasons, since the conversion incurs some TFile reading overhead.
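
A minimal sketch of that deferral, assuming a split descriptor that carries only byte offsets. ByteRangeSplit, resolve_in_mapper and the record_num_near callback are hypothetical names for illustration (the callback stands in for the TFile lookup); none of this is taken from the patch.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class ByteRangeSplit:
    """Hypothetical split: built at job setup from byte offsets only, no file reads."""
    path: str
    begin_offset: int
    end_offset: int


def resolve_in_mapper(
    split: ByteRangeSplit,
    record_num_near: Callable[[str, int], int],
) -> Tuple[int, int]:
    """Runs inside the task, so each mapper pays the lookup cost for its own split only."""
    if split.begin_offset == 0:
        begin = 0
    else:
        # The previous split ends at the record near begin_offset, so start one after it.
        begin = record_num_near(split.path, split.begin_offset) + 1
    end = record_num_near(split.path, split.end_offset)
    return begin, end
```

Job setup then only does arithmetic on file sizes, and the TFile reads are spread across the mappers instead of being done serially on the client.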

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780998#action_12780998 ] 

Alan Gates commented on PIG-1077:
---------------------------------

Patch checked into 0.6 branch.


[jira] Updated: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1077:
----------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 0.6.0)
                   0.7.0
           Status: Resolved  (was: Patch Available)

Patch checked in.


[jira] Updated: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1077:
----------------------------

    Fix Version/s: 0.6.0


[jira] Updated: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat

Posted by "Chao Wang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Wang updated PIG-1077:
---------------------------

    Summary: [Zebra] to support record(row)-based file split in Zebra's TableInputFormat  (was: [Zebra] to support record(row)-based file split in Zebra' TableInputFormat)


[jira] Updated: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat

Posted by "Chao Wang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Wang updated PIG-1077:
---------------------------

    Attachment: patch_Pig1077


[jira] Updated: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat

Posted by "Chao Wang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Wang updated PIG-1077:
---------------------------

    Release Note: 
In this jira, we also plan to resolve a dependency issue: Zebra's record-based split needs Hadoop's TFile split support to work. Because of this dependency, Zebra has to maintain its own copy of the Hadoop jar in svn in order to build. Furthermore, Zebra currently sits inside Pig in svn, and Pig maintains its own copy of the Hadoop jar in its lib directory, which makes things even messier. Finally, Zebra is new, changes often, and needs new revisions quickly, while Hadoop and Pig are more mature and move more slowly, so they cannot make new releases for Zebra all the time.

After carefully thinking this through, we plan to fork the TFile part of Hadoop and port it into Zebra's own code base. This will greatly simplify Zebra's build process and also enable quick revisions.

Last, we would like to point out that this is a short-term solution for Zebra, and we plan to:
1) port all changes to Zebra TFile back into Hadoop TFile;
2) in the long run, have a single unified solution for this.



[jira] Commented: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat

Posted by "Yan Zhou (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780735#action_12780735 ] 

Yan Zhou commented on PIG-1077:
-------------------------------

This patch is also targeted for the 0.6 release, so it needs to be on the 0.6 branch too.


[jira] Commented: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777793#action_12777793 ] 

Hadoop QA commented on PIG-1077:
--------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12424874/patch_Pig1077
  against trunk revision 835499.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 104 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/49/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/49/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/49/console

This message is automatically generated.


[jira] Updated: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat

Posted by "Chao Wang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Wang updated PIG-1077:
---------------------------

    Status: Patch Available  (was: Open)


[jira] Commented: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat

Posted by "Yan Zhou (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777680#action_12777680 ] 

Yan Zhou commented on PIG-1077:
-------------------------------

+1
