Posted to dev@pig.apache.org by "Thejas M Nair (JIRA)" <ji...@apache.org> on 2009/10/29 20:09:59 UTC

[jira] Created: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
---------------------------------------------------------------------------------------------------

                 Key: PIG-1062
                 URL: https://issues.apache.org/jira/browse/PIG-1062
             Project: Pig
          Issue Type: Task
            Reporter: Thejas M Nair


This is part of the effort to implement new load store interfaces as laid out in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
PigStorage and BinStorage are now working.

SampleLoader and its subclasses, RandomSampleLoader and PoissonSampleLoader, need to be changed to work with the new LoadFunc interface.
Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
PoissonSampleLoader is used by skew join. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771615#action_12771615 ] 

Thejas M Nair commented on PIG-1062:
------------------------------------

Skew join uses the total number of input tuples in PartitionSkewedKeys.calculateReducers(..) to calculate the number of reducers.
In the version in trunk, PoissonSampleLoader adds the size on disk of the sampled tuple as the last column of the tuple. This is used to calculate the average size on disk in PartitionSkewedKeys. The total number of tuples is estimated as input-file-size / avg-size-of-tuple-on-disk.

But with the new interface, the size on disk of a tuple cannot be estimated (there is no getPosition). Also, the size of the input on disk cannot be estimated if the input is not a file, or if the load function is passed some metadata instead of a file name.

Ideally this information should be obtained through the ResourceStatistics described in the proposal. Since that is not available right now, here is another proposal:

PoissonSampleLoader currently reads almost all the rows because it tries to sample evenly spaced tuples from the split. It will now read until the last tuple and add an additional tuple that holds the number of tuples in that split. This special tuple needs to be distinguished from the sampled tuples. I don't have a good way to do that except for using two columns: the first column holds a unique marker string, and the second column holds the number of rows. Does anybody have a better suggestion?

PartitionSkewedKeys will look at all these special rows and add up the row counts to get the total number of rows.
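
The marker-row idea above can be sketched roughly as follows. This is only an illustration, assuming the sampler output is modeled as plain Object[] rows; the marker string, the class name and the method names are hypothetical and are not taken from the patch.

```java
import java.util.Arrays;
import java.util.List;

class MarkerRowCount {
    // Hypothetical marker value; the thread does not say which string is used.
    static final String MARKER = "__row_count_marker__";

    // A row is "special" when its first column is the marker and its second is a count.
    static boolean isMarkerRow(Object[] row) {
        return row.length == 2 && MARKER.equals(row[0]) && row[1] instanceof Long;
    }

    // Sum the per-split counts carried by the marker rows; ordinary samples are ignored.
    static long totalRows(List<Object[]> sampledRows) {
        long total = 0;
        for (Object[] row : sampledRows) {
            if (isMarkerRow(row)) {
                total += (Long) row[1];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Object[]> rows = Arrays.asList(
            new Object[] {"key1", 42},
            new Object[] {MARKER, 1000L},   // split 1 held 1000 tuples
            new Object[] {"key2", 7},
            new Object[] {MARKER, 2500L});  // split 2 held 2500 tuples
        System.out.println(totalRows(rows)); // prints 3500
    }
}
```

A weakness of this scheme, visible even in the sketch, is that a real sampled tuple could in principle collide with the marker, which is why the marker string has to be unique.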



[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772017#action_12772017 ] 

Dmitriy V. Ryaboy commented on PIG-1062:
----------------------------------------

I have ResourceStats hooked up to LogicalOperators already, need to port the code to the new branch.  This will let us take statistics, if they are available, and pass them into the PoissonSampleLoader at initialization time, so it can get the number of tuples and avg tuple size directly from Stats.

That being said, statistics may not always be available...

Before I go into the more fanciful suggestion below -- perhaps a simple hack will do.  We have counters in Hadoop. Any reason we can't just read "bytes read in map", "records read in map", "bytes written in map", "records written in map" counters directly?

If I am overlooking something obvious, here's the "ignore counters" suggestion:

If my understanding is correct, in PoissonSampleLoader we are interested in the average size of a tuple more than # of tuples -- # of tuples is just used as a way of crudely estimating avg size of tuple on disk, which is in turn used to crudely estimate the size of tuple in memory.  The estimate is likely to be very off, by the way, if we are not loading from BinStorage, but from arbitrary loadFuncs, as the underlying data, even if it is a file, might be compressed.

Perhaps we can get the average tuple size directly, instead? We could get that  in the mappers of the sampling job by recording memory usage at the first getNext() call, forcing garbage collection, buffering up K tuples, and getting memory usage again. 

We now have the following variables available to each sampling mapper in the SkewedPartitioner:

* sample rate S (for the appropriate Poisson distribution)
* total # of mappers, M
* available heap size on the reducer, H
* estimated avg size of tuple, s

The number of tuples we want to sample is then simply T = max(10, S*H/(s*M))

In getNext(), we can now allocate a buffer for T elements, populate it with the first T tuples, and continue scanning the partition. For every ith next() call, we generate a random number r s.t. 0<=r<i, and if r<T we insert the new tuple into our buffer at position r.  This gives us a nicely random sample of the tuples in the partition.

So this gets around the need for file size info on that side.
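
For reference, the buffer-and-replace scheme described above is essentially reservoir sampling. Below is a minimal sketch, assuming the tuples arrive as a plain Iterator; the class and method names are hypothetical, and targetSampleCount simply evaluates T = max(10, S*H/(s*M)) with the variables listed earlier.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

class ReservoirSketch {
    // T = max(10, S*H/(s*M)), using the variable names from the comment above.
    static int targetSampleCount(double rate, long heapBytes, long avgTupleBytes, int mappers) {
        double t = rate * heapBytes / ((double) avgTupleBytes * mappers);
        return (int) Math.max(10.0, t);
    }

    // Keep a uniform random sample of at most t items from a stream of unknown length.
    // The int counter limits this sketch to fewer than 2^31 items per stream.
    static <E> List<E> sample(Iterator<E> input, int t, Random rng) {
        List<E> buffer = new ArrayList<>(t);
        int seen = 0;                        // number of items read so far
        while (input.hasNext()) {
            E item = input.next();
            seen++;
            if (buffer.size() < t) {
                buffer.add(item);            // the first t items fill the buffer
            } else {
                int r = rng.nextInt(seen);   // uniform in [0, seen)
                if (r < t) {
                    buffer.set(r, item);     // happens with probability t / seen
                }
            }
        }
        return buffer;
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            input.add(i);
        }
        // Prints 5 values drawn from 0..999; which ones depends on the seed.
        System.out.println(sample(input.iterator(), 5, new Random(42)));
    }
}
```

Each item ends up in the buffer with the same probability, and the buffer size stays fixed no matter how large the partition is.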

Now, PartitionSkewedKeys uses file size / avg_tuple_disk_size to estimate the total number of tuples, and uses this estimate, plus the ratio of instances of a given key in the sample to the total sample size, to predict the total number of records with a given key in the input.  But given the number of sampled tuples and the sample rate, couldn't we calculate the total number of records in the original file by simply reversing the formula for determining the number of tuples to sample?  If we do this, there is no need to append any metadata.

Lastly, if we do want to move around metadata such as number of records in input, etc, and we don't want to use Hadoop counters, we should extend BinStorage with ResourceStats serialization, and use ResourceStatistics for this.  Even if the original data might not have stats, there is no reason we can't generate these basic counts at runtime for the data we write ourselves.

-D


[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Sriranjan Manjunath (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772767#action_12772767 ] 

Sriranjan Manjunath commented on PIG-1062:
------------------------------------------

We cannot use counters since load/join will result in 2 jobs - the first one being the sampler. Your design looks good otherwise.


[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772197#action_12772197 ] 

Thejas M Nair commented on PIG-1062:
------------------------------------

Dmitriy,
I had overlooked the fact that the input file size is also being used to calculate the number of samples. Thanks for pointing it out.

I don't know if there are any problems in using counters directly, as long as the information is required only after the sampling phase (the first MapReduce job), i.e. it could be used in PartitionSkewedKeys.

The logic in PoissonSampleLoader.computeSamples is as follows (a detailed explanation will be added soon to the sampler wiki page). The goal is to sample all keys from the first input that will need to be partitioned across multiple reducers in the join phase.
Let us assume X tuples fit into the available memory in a reducer. Let's say we want to get 10 samples in each set of X tuples, with 95% confidence. Using Poisson distribution formulas, we arrive at 17 as the number of tuples to be sampled every X tuples. (I don't know why the Poisson distribution is the appropriate choice.)
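
The comment does not show the calculation behind the number 17, so the following is only one reading that happens to reproduce it, not necessarily how the figure was derived: treat the number of samples per X tuples as Poisson-distributed and look for the smallest integer mean at which more than 10 samples are drawn with probability of at least 0.95. The class and method names are hypothetical.

```java
class PoissonRateSketch {
    // P(N <= k) for N ~ Poisson(mean), summed term by term.
    static double cdf(int k, double mean) {
        double term = Math.exp(-mean);   // P(N = 0)
        double sum = term;
        for (int i = 1; i <= k; i++) {
            term *= mean / i;            // P(N = i) derived from P(N = i - 1)
            sum += term;
        }
        return sum;
    }

    // Smallest integer mean for which P(N > k) >= confidence.
    static int smallestMean(int k, double confidence) {
        int mean = 1;
        while (1.0 - cdf(k, mean) < confidence) {
            mean++;
        }
        return mean;
    }

    public static void main(String[] args) {
        // P(N <= 10) is about 0.049 at mean 17 and about 0.077 at mean 16.
        System.out.println(smallestMean(10, 0.95)); // prints 17
    }
}
```

Under the same model, requiring "at least 10" samples rather than "more than 10" gives 16, so the exact criterion matters by one.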

The total number of tuples to be sampled cannot be calculated without knowing the total number of tuples. But what we do know is that we should sample one tuple every (X/17) tuples. To calculate X, we need the average size of a tuple in memory. Using the process memory usage is unlikely to give a good approximation of that, because (as per my understanding) calling the garbage collector is not guaranteed to free the memory used by all unused objects. Tuple.getMemorySize() can be used to get an estimate of the memory used by a tuple. The average size could be estimated and corrected as we sample more tuples.
I.e., PoissonSampleLoader.getNext() will return every (H/s)th tuple in the input (using H and s from the previous comment).

In PartitionSkewedKeys.exec(), Dmitriy's idea of using the number of samples and the sample rate (H/s) can be used to estimate the total number of tuples.
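
A rough sketch of the interval-based sampling and the reverse estimate described above. It assumes the skip interval is X/17 with X = H/s, as in the preceding paragraphs, and uses an array of byte sizes as a stand-in for calling Tuple.getMemorySize() on each sampled tuple. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SkipIntervalSketch {
    static final int SAMPLES_PER_MEMORY_LOAD = 17;  // the figure quoted in the comment above

    // Tuples to advance per emitted sample: X / 17, where X = heapBytes / avgTupleBytes.
    static long skipInterval(long heapBytes, long avgTupleBytes) {
        long tuplesThatFit = heapBytes / Math.max(1, avgTupleBytes);
        return Math.max(1, tuplesThatFit / SAMPLES_PER_MEMORY_LOAD);
    }

    // Indices of the tuples that would be sampled from one split.
    static List<Integer> sampleIndices(long[] tupleBytes, long heapBytes) {
        List<Integer> picked = new ArrayList<>();
        long bytesSeen = 0;
        long i = 0;
        while (i < tupleBytes.length) {
            picked.add((int) i);
            bytesSeen += tupleBytes[(int) i];
            long avg = bytesSeen / picked.size();   // running average over the samples so far
            i += skipInterval(heapBytes, avg);      // the interval is corrected as the average changes
        }
        return picked;
    }

    // Reverse the sampling: each sample stands for roughly one skip interval of input tuples.
    static long estimateTotal(int sampleCount, long interval) {
        return sampleCount * interval;
    }

    public static void main(String[] args) {
        long[] sizes = new long[1000];
        java.util.Arrays.fill(sizes, 100L);               // 1000 tuples of 100 bytes each
        long heap = 170_000L;                             // X = 1700 tuples, interval = 100
        List<Integer> picked = sampleIndices(sizes, heap);
        System.out.println(picked.size());                                        // prints 10
        System.out.println(estimateTotal(picked.size(), skipInterval(heap, 100))); // prints 1000
    }
}
```

The estimate is exact here only because every tuple has the same size; with varying sizes the interval changes as the running average is corrected, and the result is an approximation.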

WeightedRangePartitioner.setConf is another function using fileSize().  That needs to change as well. I haven't looked at that yet.


[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776035#action_12776035 ] 

Thejas M Nair commented on PIG-1062:
------------------------------------

bq. You can get the same info from the counters without unnecessarily complicating tuple processing, imo. In fact you can use (num bytes read / num records read) to get the old calculation, and not rely on number of samples and local average size estimates.

Yes, I agree that using counters would have been a cleaner way to get the total number of tuples in the reducer, but it looks like there is no reliable way to get the complete counter information of a map from the reducer part of the MR job.



[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772797#action_12772797 ] 

Dmitriy V. Ryaboy commented on PIG-1062:
----------------------------------------

The sampler (in this design) reads all the data, so number of records read is total number of records in dataset, and the number of records written is total number of samples. Same for bytes.  The sampler produces a histogram file, which is then used by the join task -- so there is no reliance on counters there.




[jira] Updated: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1062:
-------------------------------

    Attachment: PIG-1062.patch.3

New patch after merging with the latest changes to the load-store-redesign branch. Incompatible with trunk.
Pasting the output of test-patch (test cases have not been updated):

     [exec] -1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
     [exec]                         Please justify why no tests are needed for this patch.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.



[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772623#action_12772623 ] 

Thejas M Nair commented on PIG-1062:
------------------------------------

Even after the interface changes, Pig can compute the file size by adding up the size of each split (from InputSplit.getLength()). The documentation of the function in the interface does not make it clear whether this is the size on disk, compressed or uncompressed, etc. Assuming it is the size on disk (uncompressed), estimating the total memory it will require is a challenge; one has to make assumptions about the compression ratio and the serialization method.
Using Tuple.getMemorySize() while sampling will give more accurate numbers for the reducer memory the tuples will consume.


[jira] Updated: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1062:
-------------------------------

    Attachment: PIG-1062.patch

Patch for order-by sampling and skew-join sampling changes.


[jira] Updated: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-1062:
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.7.0
     Hadoop Flags: [Incompatible change, Reviewed]
           Status: Resolved  (was: Patch Available)

Marking resolved since all changes have been committed. 

> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.7.0
>
>         Attachments: PIG-1062.5.patch, PIG-1062.patch, PIG-1062.patch.3
>
>
> This is part of the effort to implement new load store interfaces as laid out in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
> PigStorage and BinStorage are now working.
> SampleLoader and subclasses -RandomSampleLoader, PoissonSampleLoader need to be changed to work with new LoadFunc interface.  
> Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
> PoissonSampleLoader is used by skew join. 


[jira] Assigned: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair reassigned PIG-1062:
----------------------------------

    Assignee: Thejas M Nair


[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779054#action_12779054 ] 

Thejas M Nair commented on PIG-1062:
------------------------------------

{quote}
In SampleLoader.java
====================
Isn't the idea of SampleLoader only to carry common code for RandomSampleLoader and PoissonLoader
and add a computeSamples() method? - Looks like now it has the getNext() implementation
needed by RandomSampleLoader in it now. Should we move that to RandomSampleLoader instead? 
{quote}
RandomSampleLoader.getNext() is fairly generic; it can be used by any new sample loader class where the number of samples to be taken in each map is known in advance. So having this getNext() implementation in SampleLoader can be useful in the future.

{quote}
Why is skipNext() needed? Can't loader.getNext() == null be used instead? If so, is recordReader
needed?
{quote}
skipNext() calls recordReader.getNext(), which does not parse the record into a tuple, unlike loader.getNext(). This way records can be skipped more efficiently.
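
To illustrate the point about skipping without parsing, here is a toy sketch. It does not use the Pig or Hadoop classes; an Iterator of strings stands in for the record reader, and the class and its fields are hypothetical.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SkipWithoutParsingSketch {
    private final Iterator<String> rawRecords;  // stand-in for the underlying record reader
    int parseCount = 0;                         // how many records were actually parsed

    SkipWithoutParsingSketch(List<String> lines) {
        this.rawRecords = lines.iterator();
    }

    // Advance past one record without parsing it; false when the input is exhausted.
    boolean skipNext() {
        if (!rawRecords.hasNext()) {
            return false;
        }
        rawRecords.next();
        return true;
    }

    // Read and parse one record into fields; null when the input is exhausted.
    String[] getNext() {
        if (!rawRecords.hasNext()) {
            return null;
        }
        parseCount++;
        return rawRecords.next().split("\t");
    }

    public static void main(String[] args) {
        SkipWithoutParsingSketch loader = new SkipWithoutParsingSketch(
            Arrays.asList("a\t1", "b\t2", "c\t3", "d\t4"));
        loader.skipNext();
        loader.skipNext();
        String[] tuple = loader.getNext();
        System.out.println(tuple[0] + " parsed=" + loader.parseCount); // prints: c parsed=1
    }
}
```

Only the record that is returned as a sample pays the parsing cost; the two skipped records are read but never split into fields.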

I will create a new patch addressing other comments.


[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777794#action_12777794 ] 

Hadoop QA commented on PIG-1062:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12424927/PIG-1062.patch.3
  against trunk revision 835499.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    -1 patch.  The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/156/console

This message is automatically generated.


[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778654#action_12778654 ] 

Pradeep Kamath commented on PIG-1062:
-------------------------------------

Review comments:
In SampleLoader.java
====================
Isn't the idea of SampleLoader only to carry common code for RandomSampleLoader and PoissonLoader
and add a computeSamples() method? - Looks like now it has the getNext() implementation
needed by RandomSampleLoader in it now. Should we move that to RandomSampleLoader instead?


{code}
System.err.println("Sample " + samples[nextSampleIdx]);
{code}
The debug statement above should be removed.


Why is skipNext() needed? Can't loader.getNext() == null be used instead? If so, is recordReader
needed?

In RandomSampleLoader.java
==========================
The XXX FIXME comment (put in by me :)) should be removed.

I think we should move the actual getNext() implementation code from SampleLoader to here.

In PoissonSampleLoader.java
============================

{code}
 40         // this will be value of first column in the special row   
{code}
I think this is no longer the case - should be removed.


{code}
// memory per sample. divide this by avgTupleMemSize to get skipInterval
private long memPerSample=0;
{code}
Should the above be called memToSkipPerSample?


{code}
if(skipInterval == -1){
{code}
It doesn't look like skipInterval is initialized to -1


Instead of keeping track of max. num of columns in the different rows and then appending the
special marker string and num of rows at the end, would it be better to just have these as the
first two fields of the last tuple emitted and then introduce a split-union combination to 
ensure that the foreach pipeline gets the regular tuples (excluding the special tuple)?



> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>         Attachments: PIG-1062.patch, PIG-1062.patch.3
>
>


[jira] Updated: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1062:
-------------------------------

    Attachment: PIG-1062.5.patch

Latest patch addressing Pradeep's comments.
I will fix the orderby, skew-join test cases in a separate patch.
testpatch output -
     [exec] -1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
     [exec]                         Please justify why no tests are needed for this patch.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.




> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>         Attachments: PIG-1062.5.patch, PIG-1062.patch, PIG-1062.patch.3
>
>


[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779081#action_12779081 ] 

Pradeep Kamath commented on PIG-1062:
-------------------------------------

Isn't the current implementation in SampleLoader.getNext() a random sample implementation? Given that it is random, would it not be more appropriate in RandomSampleLoader than in the base class SampleLoader? Since right now the only two subclasses under SampleLoader are Poisson and RandomSampleLoader, should we decide on moving this implementation into SampleLoader later when there is a use case for this being in the base class?

> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>         Attachments: PIG-1062.patch, PIG-1062.patch.3
>
>


[jira] Closed: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai closed PIG-1062.
---------------------------


> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.7.0
>
>         Attachments: PIG-1062.5.patch, PIG-1062.patch, PIG-1062.patch.3
>
>


[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772565#action_12772565 ] 

Thejas M Nair commented on PIG-1062:
------------------------------------

The use of fileSize() in WeightedRangePartitioner.setConf is alright; it checks the size of an intermediate file.

> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>


[jira] Updated: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1062:
-------------------------------

    Issue Type: Sub-task  (was: Task)
        Parent: PIG-966

> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>


[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777352#action_12777352 ] 

Hadoop QA commented on PIG-1062:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12424797/PIG-1062.patch
  against trunk revision 835499.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    -1 patch.  The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/48/console

This message is automatically generated.

> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>         Attachments: PIG-1062.patch
>
>


[jira] Updated: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1062:
-------------------------------

    Status: Open  (was: Patch Available)

> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>         Attachments: PIG-1062.patch
>
>


[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776052#action_12776052 ] 

Dmitriy V. Ryaboy commented on PIG-1062:
----------------------------------------

It looks like ReduceContext has a getCounter() method. Am I missing a subtlety?
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/ReduceContext.html

> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>


[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778666#action_12778666 ] 

Arun C Murthy commented on PIG-1062:
------------------------------------

bq. It looks like ReduceContext has a getCounter() method. Am I missing a subtlety?

The counters you get from a {Map|Reduce}Context are specific to that task. One would have to jump through a whole set of hoops (i.e. create a new JobClient, or its equivalent in the new context object APIs, and query the JobTracker for rolled-up counters), and even then they aren't guaranteed to be completely accurate until job completion. Thus I wouldn't recommend that we rely upon them.

> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>         Attachments: PIG-1062.patch, PIG-1062.patch.3
>
>


[jira] Updated: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1062:
-------------------------------

    Status: Patch Available  (was: Open)

> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>         Attachments: PIG-1062.patch, PIG-1062.patch.3
>
>


[jira] Updated: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1062:
-------------------------------

    Status: Patch Available  (was: Open)

This patch is for the load-store-redesign branch, so Hudson might not be able to apply it to trunk.

> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>         Attachments: PIG-1062.patch
>
>


[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779573#action_12779573 ] 

Thejas M Nair commented on PIG-1062:
------------------------------------

Instead of adding the num-rows information as a separate, special last tuple, I am making a change to append it to the end of the last regular tuple (as a special marker column followed by a num-rows column).
{quote}
Instead of keeping track of max. num of columns in the different rows and then appending the
special marker string and num of rows at the end, would it be better to just have these as the
first two fields of the last tuple emitted and then introduce a split-union combination to
ensure that the foreach pipeline gets the regular tuples (excluding the special tuple)?
{quote}
In the implementation in my upcoming patch, the foreach pipeline that evaluates the join expression (in the map of the sampling MR job) gets regular tuples, except for the last tuple. This is safer than the existing implementation in trunk, where every tuple has a disk-size column appended to it. The split-union approach proposed above helps the special tuple bypass the foreach, but getting it past the sort in the reduce stage of the sampling MR job would involve a lot more changes (if the special tuple has the marker and num-rows as its first two columns).
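
A minimal Java sketch of the "append marker and num-rows to the last tuple" idea may help illustrate it. This is not the code from the patch: a List<Object> stands in for a Pig tuple, and the class name, method names and marker value are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; a List<Object> stands in for a Pig tuple.
final class MarkerTupleSketch {
    // Hypothetical marker value, chosen so it is unlikely to collide with real data.
    static final String MARKER = "__numrows_marker__";

    /** Returns a copy of the last sampled tuple with (marker, numRows) appended at the end. */
    static List<Object> appendRowCount(List<Object> lastSample, long numRows) {
        List<Object> out = new ArrayList<>(lastSample);
        out.add(MARKER);
        out.add(numRows);
        return out;
    }

    /** Returns the row count if the tuple ends with (marker, count), otherwise -1. */
    static long rowCountOf(List<Object> tuple) {
        int n = tuple.size();
        if (n >= 2 && MARKER.equals(tuple.get(n - 2)) && tuple.get(n - 1) instanceof Long) {
            return (Long) tuple.get(n - 1);
        }
        return -1L;
    }
}
```

Because the two extra columns sit at the end, the leading columns of the last tuple keep their usual positions, so code that evaluates the join key on them does not need a special case.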


> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>         Attachments: PIG-1062.patch, PIG-1062.patch.3
>
>


[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776526#action_12776526 ] 

Thejas M Nair commented on PIG-1062:
------------------------------------

Proposal for sampling in RandomSampleLoader (as well as the SampleLoader class), which is used for order-by queries:
Problem: With the new interface, we cannot use the old approach of dividing the size of the file by the number of samples required and skipping that many bytes to get a new sample.
Proposal: The approach proposed by Dmitriy for sampling is used:
bq. In getNext(), we can now allocate a buffer for T elements, populate it with the first T tuples, and continue scanning the partition. For every ith next() call, we generate a random number r s.t. 0<=r<i, and if r<T we insert the new tuple into our buffer at position r. This gives us a nicely random sample of the tuples in the partition.
To avoid parsing all tuples, RecordReader.nextKeyValue() will be called (instead of loader.getNext()) if the current tuple is to be skipped.
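
The sampling scheme quoted above is commonly known as reservoir sampling. A minimal, self-contained Java sketch follows, assuming a plain Iterator in place of the Hadoop RecordReader; the class and method names are hypothetical and this is not the code from the patch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of the buffer-of-T-elements scheme described above.
final class ReservoirSketch {
    /** Returns up to t items chosen uniformly at random from an input of unknown length. */
    static <E> List<E> sample(Iterator<E> input, int t, Random rng) {
        List<E> buffer = new ArrayList<>(t);
        int seen = 0; // number of records consumed so far (the "i" in the description)
        while (input.hasNext()) {
            E item = input.next();
            seen++;
            if (buffer.size() < t) {
                // The first t records fill the buffer directly.
                buffer.add(item);
            } else {
                // r is uniform in [0, seen); the record is kept only if r < t.
                int r = rng.nextInt(seen);
                if (r < t) {
                    buffer.set(r, item);
                }
            }
        }
        return buffer;
    }
}
```

In this sketch every record is materialized before the random draw. The proposal instead draws first and only advances the reader for skipped records, so that they are never parsed.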

bq. It looks like ReduceContext has a getCounter() method. Am I missing a subtlety?
Arun C Murthy (mapreduce committer) has agreed to elaborate on his recommendation on this in the jira.


> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>


[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772807#action_12772807 ] 

Dmitriy V. Ryaboy commented on PIG-1062:
----------------------------------------

Thejas:

bq. sending a special tuple with number of rows in the split will likely lead to more accurate estimate of number of reducers required.

You can get the same info from the counters without unnecessarily complicating tuple processing, imo. In fact you can use (num bytes read / num records read) to get the old calculation, and not rely on number of samples and local average size estimates.

> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>


[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779126#action_12779126 ] 

Thejas M Nair commented on PIG-1062:
------------------------------------

Yes, I think SampleLoader.getNext() can be moved to RandomSampleLoader. Any new class that needs the random sample implementation can subclass RandomSampleLoader instead of SampleLoader.

> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>         Attachments: PIG-1062.patch, PIG-1062.patch.3
>
>


[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772704#action_12772704 ] 

Thejas M Nair commented on PIG-1062:
------------------------------------

As indicated in the previous comment, I am planning to go ahead with the [earlier proposal|https://issues.apache.org/jira/browse/PIG-1062?focusedCommentId=12772197&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12772197] . The current sample frequency would be one tuple every ( (H/s) * (1/17) ) tuples.

In PartitionSkewedKeys.exec(), the number of reducers for join key k1 can be computed using (no_of_samples(k1) / 17) . But the accuracy of this calculation depends on the accuracy of the computed average tuple size (s in (H/s) * (1/17)). Sending a special tuple with number of rows in the split will likely lead to more accurate estimate of number of reducers required.
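
To make the arithmetic concrete, here is a minimal Java sketch. It assumes H is the memory available for holding tuples and s is the average tuple size, both in bytes; these definitions come from the earlier proposal and are not restated in this comment. The class and method names are hypothetical and not part of Pig, and rounding the reducer count up (with a minimum of one) is an assumption rather than something stated above.

```java
// Hypothetical illustration of the two formulas in the comment above.
final class SkewSamplingSketch {
    // The constant 17 from the comment: samples taken per memory-load of tuples.
    static final int SAMPLES_PER_MEMORY_LOAD = 17;

    /** Tuples between consecutive samples: (H / s) * (1 / 17), never less than 1. */
    static long skipInterval(long memoryBytes, long avgTupleBytes) {
        if (avgTupleBytes <= 0) {
            throw new IllegalArgumentException("avgTupleBytes must be positive");
        }
        long tuplesInMemory = memoryBytes / avgTupleBytes; // H / s
        return Math.max(1L, tuplesInMemory / SAMPLES_PER_MEMORY_LOAD);
    }

    /** Reducers for one join key: no_of_samples(k1) / 17, rounded up (assumption). */
    static long reducersForKey(long samplesForKey) {
        long rounded = (samplesForKey + SAMPLES_PER_MEMORY_LOAD - 1) / SAMPLES_PER_MEMORY_LOAD;
        return Math.max(1L, rounded);
    }
}
```

For example, with H = 1700 bytes and s = 10 bytes, 170 tuples fit in memory and skipInterval returns 10; a key that shows up in 35 samples would then get 3 reducers. If s is underestimated, the interval grows and each key collects fewer samples, which is the accuracy concern raised above.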

> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>


[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

Posted by "Ying He (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771979#action_12771979 ] 

Ying He commented on PIG-1062:
------------------------------

I would suggest adding the total number of tuples of a split to the last sample as a field. All other sample tuples can have this field as NULL. Then PartitionSkewedKeys.calculateReducers can add up this field from all the samples to get the total number of tuples in the input.

If we use a separate tuple with a different format to represent the total number of tuples, that would involve a bigger change. The sampling job currently adds an "all" to all samples to group them into one bag, and then sorts the tuples by keys. If tuples are of different formats, the execution plan has to become more complex to deal with these special tuples.
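
A minimal Java sketch of the summing step suggested above, assuming every sample carries the row-count field as its last column (NULL for all but the last sample of each split). A List<Object> stands in for a Pig tuple, and the class and method names are hypothetical.

```java
import java.util.List;

// Hypothetical illustration; a List<Object> stands in for a Pig tuple.
final class RowCountSketch {
    /** Sums the trailing row-count field over all samples, skipping NULL entries. */
    static long totalInputTuples(List<List<Object>> samples) {
        long total = 0;
        for (List<Object> sample : samples) {
            if (sample.isEmpty()) {
                continue;
            }
            Object last = sample.get(sample.size() - 1);
            if (last != null) {
                // Only the last sample of each split carries a non-NULL count.
                total += ((Number) last).longValue();
            }
        }
        return total;
    }
}
```

Since all samples share one format, they can still be grouped into a single bag and sorted by key without changes to the execution plan.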

> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>