Posted to dev@pig.apache.org by "Harsh J (JIRA)" <ji...@apache.org> on 2012/09/19 16:38:07 UTC

[jira] [Created] (PIG-2924) PigStats should not be assuming all Storage classes to be file-based storage

Harsh J created PIG-2924:
----------------------------

             Summary: PigStats should not be assuming all Storage classes to be file-based storage
                 Key: PIG-2924
                 URL: https://issues.apache.org/jira/browse/PIG-2924
             Project: Pig
          Issue Type: Bug
          Components: tools
    Affects Versions: 0.9.2
            Reporter: Harsh J


Using PigStatsUtil (as Oozie does) to collect JobStats for jobs that use HBaseStorage blows up when the stats are accumulated.

This is because JobStats (which adds the stats up) assumes that all storages are file-based and that it can run listStatus and similar operations on the filename provided by their filespec. For HBaseStorage, that filename is set to the table name and no such file exists, which leads to an exception (FileNotFound or Invalid URI, depending on whether 'tablename' or 'hbase://tablename' is used).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (PIG-2924) PigStats should not be assuming all Storage classes to be file-based storage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park reassigned PIG-2924:
----------------------------------

    Assignee: Cheolsoo Park
    

[jira] [Commented] (PIG-2924) PigStats should not be assuming all Storage classes to be file-based storage

Posted by "Bill Graham (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459167#comment-13459167 ] 

Bill Graham commented on PIG-2924:
----------------------------------

We ran into similar issues with HCatalog and reducer estimation (PIG-2573, PIG-2574), since an HDFS path was assumed.

For this issue we could register different classes that know how to look up (or not support) stats based on the URI prefix of the data location (hdfs, hbase, s3, etc).
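
As an illustration only, prefix-based registration along these lines could look like the sketch below. None of this is Pig code: `SchemeRegistrySketch`, `OutputStatsLookup`, and the sample sizes are hypothetical names and values made up for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical sketch: none of these types exist in Pig.
class SchemeRegistrySketch {

    /** Looks up the output size for one kind of data location. */
    interface OutputStatsLookup {
        // An empty result means "stats are not supported for this location".
        OptionalLong outputBytes(String location);
    }

    private final Map<String, OutputStatsLookup> byScheme = new HashMap<>();

    void register(String scheme, OutputStatsLookup lookup) {
        byScheme.put(scheme, lookup);
    }

    /** Extracts the text before "://", or "" when the location has no prefix. */
    static String schemeOf(String location) {
        int idx = location.indexOf("://");
        return idx < 0 ? "" : location.substring(0, idx);
    }

    OptionalLong outputBytes(String location) {
        OutputStatsLookup lookup = byScheme.get(schemeOf(location));
        // Unknown prefixes are treated as unsupported rather than as files.
        return lookup == null ? OptionalLong.empty() : lookup.outputBytes(location);
    }

    public static void main(String[] args) {
        SchemeRegistrySketch registry = new SchemeRegistrySketch();
        // Placeholder implementations; real ones would query the storage system.
        registry.register("hdfs", location -> OptionalLong.of(1024L));
        registry.register("hbase", location -> OptionalLong.empty());

        System.out.println(registry.outputBytes("hdfs://nn/out"));
        System.out.println(registry.outputBytes("hbase://tablename"));
        System.out.println(registry.outputBytes("s3://bucket/out"));
    }
}
```

A location without a registered prefix yields an empty result instead of an exception, which is the behavior the stats collection lacks today.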
                

[jira] [Commented] (PIG-2924) PigStats should not be assuming all Storage classes to be file-based storage

Posted by "Bill Graham (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493019#comment-13493019 ] 

Bill Graham commented on PIG-2924:
----------------------------------

Sorry for the delay on the review, Cheolsoo. Looking good. A few more comments. Let me know what you think.

- I think we need to pass the {{POStore}} instead of the location. PigStorage impls provided by random parties might not all abide by a unique namespacing convention in their location syntax. For example, {{VerticaStorer}} uses a syntax like "{[db_schema].[table_name]}" (curly brackets included). Another implementor could use the same syntax.  
- JobStats.getOutputSize could be simplified by doing this, which is more commonly done:
{noformat}
String reporterNames = conf.get(
   PigStatsOutputSizeReader.OUTPUT_SIZE_READER_KEY,
   FileBasedOutputSizeReader.class.getCanonicalName());
{noformat}
- Does {{PigContext.instantiateFuncFromSpec(className)}} (without appending "()") not work?
- It seems like it would be reasonable for {{PigStatsOutputSizeReader.getOutputSize}} to throw IOException all the way up to {{JobStats}}.
- Let's make {{DummyOutputSizeReader}} an inner class of {{TestJobStats}} since that package is already totally bloated.
- In {{pig.properties}} reducers' should not have an apostrophe (no possessive for inanimate objects).
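
To illustrate the first point, the sketch below checks whether a store location parses as a URI with a scheme, using only `java.net.URI`. The class and method names are made up for this example and are not part of Pig; the bracketed string is the Vertica-style syntax quoted above.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Pig.
class LocationSchemeSketch {

    /** Returns the URI scheme of a location, or null if it has none or does not parse. */
    static String schemeOrNull(String location) {
        try {
            return new URI(location).getScheme();
        } catch (URISyntaxException e) {
            // Curly and square brackets are illegal here, so the location is not a URI.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(schemeOrNull("hdfs://namenode/out"));
        System.out.println(schemeOrNull("tablename"));
        System.out.println(schemeOrNull("{[db_schema].[table_name]}"));
    }
}
```

Both the bare table name and the bracketed syntax yield no scheme, so a reader that sees only the location string cannot tell them apart. It needs more context, such as the store operator itself.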
                

[jira] [Updated] (PIG-2924) PigStats should not be assuming all Storage classes to be file-based storage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-2924:
-------------------------------

    Attachment: PIG-2924.patch

I am attaching a patch that implements Bill's suggestion. To make the computation of the output size pluggable, I did the following:
- Added a new interface called PigStatsOutputSizeComputer.
- Added a default implementation of this interface called FileBasedOutputSizeComputer.
- Added a new flag via which a custom output size computer can be registered.
- Added unit test cases for both file-based and non-file-based systems.

Basically, I followed the pattern that Bill introduced for the reducer estimator in PIG-2574. Any comments would be appreciated.
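
A minimal sketch of what such a pluggable design could look like, assuming a single-method interface and a reflective lookup by class name. The interface shape, the class names, and the use of `java.util.Properties` and local files in place of Hadoop's configuration and file system are all assumptions for illustration, not the contents of the patch. The property key is the one mentioned elsewhere in this thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;
import java.util.stream.Stream;

// Hypothetical sketch; names and signatures are illustrative, not taken from the patch.
class PluggableComputerSketch {

    static final String COMPUTER_KEY = "pig.stats.output.size.computer";

    interface OutputSizeComputer {
        long getOutputSize(String location) throws IOException;
    }

    /** Default: treat the location as a local file or directory and add up file sizes. */
    static class FileBasedComputer implements OutputSizeComputer {
        @Override
        public long getOutputSize(String location) throws IOException {
            Path path = Paths.get(location);
            if (!Files.isDirectory(path)) {
                return Files.size(path); // throws if the file does not exist
            }
            try (Stream<Path> children = Files.list(path)) {
                return children.filter(Files::isRegularFile).mapToLong(p -> {
                    try {
                        return Files.size(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }).sum();
            }
        }
    }

    /** Instantiates the configured computer, falling back to the file-based default. */
    static OutputSizeComputer load(Properties conf) throws ReflectiveOperationException {
        String className = conf.getProperty(COMPUTER_KEY, FileBasedComputer.class.getName());
        return (OutputSizeComputer) Class.forName(className).getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        Path out = Files.createTempFile("part-", ".txt");
        Files.write(out, new byte[] {1, 2, 3, 4, 5});
        OutputSizeComputer computer = load(new Properties());
        System.out.println(computer.getOutputSize(out.toString()));
        Files.delete(out);
    }
}
```

A non-file-based store would register its own implementation under the same key, so the stats code never attempts file operations on a table name.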

Thanks!
                

[jira] [Commented] (PIG-2924) PigStats should not be assuming all Storage classes to be file-based storage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486456#comment-13486456 ] 

Cheolsoo Park commented on PIG-2924:
------------------------------------

Hi Bill, thank you very much for reviewing my patch!

I totally agree with most of your comments. In particular, adding a supports() method seems like an elegant way to support multiple computers. I will make that change in a new patch.

But I am wondering whether you would agree to remove POStore from the interface. I want to remove it because I don't think POStore is needed to implement supports() and getOutputSize() for any kind of computer. All we probably need is the URI string, so it seems to make sense to pass the URI string (or a URI object) instead of the whole POStore. Please let me know if you think otherwise.

Regarding the name of the interface, I couldn't come up with a better name. Reader sounds good to me. Maybe reporter or calculator?

Thanks!
                

[jira] [Updated] (PIG-2924) PigStats should not be assuming all Storage classes to be file-based storage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-2924:
-------------------------------

    Affects Version/s: 0.10.0
               Status: Patch Available  (was: Open)
    

[jira] [Updated] (PIG-2924) PigStats should not be assuming all Storage classes to be file-based storage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-2924:
-------------------------------

       Resolution: Fixed
    Fix Version/s: 0.12
           Status: Resolved  (was: Patch Available)

Bill gave +1 in the RB:
https://reviews.apache.org/r/8122/

Committed to trunk.
                

[jira] [Updated] (PIG-2924) PigStats should not be assuming all Storage classes to be file-based storage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-2924:
-------------------------------

    Status: Patch Available  (was: Open)
    

[jira] [Updated] (PIG-2924) PigStats should not be assuming all Storage classes to be file-based storage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-2924:
-------------------------------

    Attachment: PIG-2924-2.patch

Adding the new property (pig.stats.output.size.computer) to the default pig.properties file with comments.
                

[jira] [Updated] (PIG-2924) PigStats should not be assuming all Storage classes to be file-based storage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-2924:
-------------------------------

    Attachment: PIG-2924-5.patch

Thank you very much for reviewing my patch, Bill!

I agree that passing POStore is better. I also agree with the other points and incorporated them in this new patch. Please let me know what you think.
                

[jira] [Updated] (PIG-2924) PigStats should not be assuming all Storage classes to be file-based storage

Posted by "Bill Graham (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Graham updated PIG-2924:
-----------------------------

    Status: Open  (was: Patch Available)

This looks great, thanks for taking this one. I think we need to make a few changes to the pattern used in PIG-2574 though, because we could have a case where multiple store funcs each write to a different data source.

* Instead of registering a single new computer it would be ideal if we could register a list of computers.
* Each computer could have a {{boolean supports(POStore poStore)}} method that returns whether this class supports a given POStore. This can often be done by inspecting the output path. A default URI-based abstract class could help with that part.
* The computers would then be consulted in order, where the first to support the POStore wins.
* If a computer can't determine a size for some reason (i.e., it doesn't support it or an exception occurred), it shouldn't return 0. Instead maybe we reserve -1 for this case and document it as such. 
* Having the word Computer in the interface name and configs could cause confusion, due to how it's an overloaded term. I don't have any great suggestions though. {{PigStatsOutputSizeReader}}?
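
The points above can be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed implementation: `StoreInfo` is a minimal stand-in for Pig's `POStore`, and `ReaderChainSketch`, `OutputSizeReader`, `UNKNOWN_SIZE`, and `fixed` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the proposal; StoreInfo stands in for Pig's POStore.
class ReaderChainSketch {

    /** Reserved value meaning "the size could not be determined". */
    static final long UNKNOWN_SIZE = -1L;

    /** Minimal stand-in for a store operator: just the output location. */
    static class StoreInfo {
        final String location;

        StoreInfo(String location) {
            this.location = location;
        }
    }

    interface OutputSizeReader {
        boolean supports(StoreInfo store);

        long getOutputSize(StoreInfo store) throws Exception;
    }

    /** Consults readers in registration order; the first one that supports the store wins. */
    static long outputSize(List<OutputSizeReader> readers, StoreInfo store) {
        for (OutputSizeReader reader : readers) {
            if (reader.supports(store)) {
                try {
                    return reader.getOutputSize(store);
                } catch (Exception e) {
                    return UNKNOWN_SIZE; // a failure is reported as unknown, never as 0
                }
            }
        }
        return UNKNOWN_SIZE; // no registered reader supports this store
    }

    /** Demo helper: a reader that supports one location prefix and reports a fixed size. */
    static OutputSizeReader fixed(String prefix, long size) {
        return new OutputSizeReader() {
            @Override
            public boolean supports(StoreInfo store) {
                return store.location.startsWith(prefix);
            }

            @Override
            public long getOutputSize(StoreInfo store) {
                return size;
            }
        };
    }

    public static void main(String[] args) {
        List<OutputSizeReader> readers =
                Arrays.asList(fixed("hdfs://", 2048L), fixed("hdfs://", 1L));
        System.out.println(outputSize(readers, new StoreInfo("hdfs://nn/out")));
        System.out.println(outputSize(readers, new StoreInfo("hbase://tablename")));
    }
}
```

Returning -1 rather than 0 lets callers distinguish "nothing was written" from "nobody could measure this output".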

Thoughts? 
 

                

[jira] [Updated] (PIG-2924) PigStats should not be assuming all Storage classes to be file-based storage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-2924:
-------------------------------

    Attachment: PIG-2924-4.patch

Updating the comments in pig.properties to make it clear that the user can register multiple readers, and that the first reader that supports the given URI will be used.
                

[jira] [Commented] (PIG-2924) PigStats should not be assuming all Storage classes to be file-based storage

Posted by "Harsh J (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458723#comment-13458723 ] 

Harsh J commented on PIG-2924:
------------------------------

A straightforward fix may be to intercept the FileNotFound exceptions. A proper fix may be to have storages report whether they are file-based or not.
                

[jira] [Updated] (PIG-2924) PigStats should not be assuming all Storage classes to be file-based storage

Posted by "Cheolsoo Park (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-2924:
-------------------------------

    Attachment: PIG-2924-3.patch

I updated the patch as follows:
{quote}
Having the word Computer in the interface name and configs could cause confusion, due to how it's an overloaded term. I don't have any great suggestions though. PigStatsOutputSizeReader?
{quote}
Changed to {{PigStatsOutputSizeReader}}.
{quote}
Instead of registering a single new computer it would be ideal if we could register a list of computers.
{quote}
Fixed.
{quote}
Each computer could have a boolean supports(POStore poStore) method that returns whether this class supports a given POStore. This can often be done by inspecting the output path. A default URI-based abstract class could help with that part.
{quote}
Each reader implements a {{boolean supports(String uri)}} method. For {{FileBasedOutputSizeReader}}, it returns the result of {{UriUtil.isHDFSFileOrLocalOrS3N()}}.
{quote}
The computers would then be consulted in order, where the first to support the POStore wins.
{quote}
Fixed.
{quote}
If a computer can't determine a size for some reason (i.e., it doesn't support it or an exception occurred), it shouldn't return 0. Instead maybe we reserve -1 for this case and document it as such.
{quote}
Fixed.

In addition, I replaced {{POStore}} with {{String}}. Please let me know what you think.
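
For readers unfamiliar with the check, a string-based {{supports}} for the file-based reader could look roughly like the sketch below. It is only an approximation written for this example: the class and method names are hypothetical, and the actual rules in Pig's {{UriUtil}} may differ.

```java
// Hypothetical approximation; Pig's UriUtil may apply different rules.
class FileBasedSupportsSketch {

    /** Rough stand-in for an "is this an HDFS, local, or S3N location?" check. */
    static boolean looksFileBased(String uri) {
        if (uri == null) {
            return false;
        }
        // Absolute paths and a few known schemes are treated as file-based.
        return uri.startsWith("/")
                || uri.startsWith("hdfs:")
                || uri.startsWith("file:")
                || uri.startsWith("s3n:");
    }

    public static void main(String[] args) {
        System.out.println(looksFileBased("hdfs://namenode/user/out"));
        System.out.println(looksFileBased("/user/out"));
        System.out.println(looksFileBased("hbase://tablename"));
        System.out.println(looksFileBased("tablename"));
    }
}
```

With a check like this, both 'tablename' and 'hbase://tablename' are rejected by the file-based reader, so neither form reaches the file system calls that fail in the original report.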

Thanks!
                