Posted to hcatalog-commits@incubator.apache.org by "Sushanth Sowmyan (JIRA)" <ji...@apache.org> on 2011/06/09 21:45:58 UTC

[jira] [Created] (HCATALOG-42) Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat

Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat
--------------------------------------------------------------------------------

                 Key: HCATALOG-42
                 URL: https://issues.apache.org/jira/browse/HCATALOG-42
             Project: HCatalog
          Issue Type: Improvement
    Affects Versions: 0.2
            Reporter: Sushanth Sowmyan


HCatalog allows users to abstract away underlying storage details and refer to data as tables and partitions. In this view, the storage abstraction is about classifying how data is organized rather than about where it is stored. A user thus specifies the partitions to be stored and leaves it to HCatalog to figure out how and where to do so.

When reading data, a user can specify the table to read from along with various partition key/value combinations to prune by, much like a SQL WHERE clause. When writing, however, the abstraction is not as seamless. We still require the end user to write data to the table partition by partition. Each partition requires fine-grained knowledge of the key/value pairs it takes, that knowledge must be available in advance, and the writer must have already grouped the requisite data accordingly before attempting to store.
For example, the following pig script illustrates this:

--
A = load 'raw' using HCatLoader(); 
... 
split Z into for_us if region='us', for_eu if region='eu', for_asia if region='asia'; 
store for_us into 'processed' using HCatStorer("ds=20110110, region=us");
store for_eu into 'processed' using HCatStorer("ds=20110110, region=eu");
store for_asia into 'processed' using HCatStorer("ds=20110110, region=asia");
--

The major issue here is that MapReduce programs and Pig scripts need to be aware of all the possible values of a key, and that list must be maintained and modified whenever new values are introduced, which may not always be easy or even possible. With more partitions, scripts become cumbersome. And if each partition being written launches a separate HCatalog store, we increase the load on the HCatalog server and multiply the number of jobs launched for the store by the number of partitions.

It would be far preferable if HCatalog could figure out all the required partitions from the data being written, which would allow us to simplify the above script to the following:

--
A = load 'raw' using HCatLoader(); 
... 
store Z into 'processed' using HCatStorer("ds=20110110");
--
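The grouping that a dynamic-partition storer would have to do internally can be sketched as follows. This is not part of any patch on this issue; it is a minimal, illustrative Python sketch, and the field names ("ds", "region") and records are assumptions made for the example.

```python
# Sketch of the dynamic-partitioning idea: derive partition values from
# the records themselves instead of requiring one store per partition.
from collections import defaultdict

def group_by_partition(records, static_spec, dynamic_keys):
    """Route each record to the partition derived from its own values.

    static_spec  -- partition keys fixed by the store call, e.g. {"ds": "20110110"}
    dynamic_keys -- partition keys whose values come from the data, e.g. ["region"]
    """
    partitions = defaultdict(list)
    for rec in records:
        spec = dict(static_spec)
        for key in dynamic_keys:
            # Each record supplies its own value for the dynamic keys.
            spec[key] = rec[key]
        # A partition is identified by the full key/value spec.
        partitions[tuple(sorted(spec.items()))].append(rec)
    return partitions

records = [
    {"region": "us", "clicks": 10},
    {"region": "eu", "clicks": 7},
    {"region": "us", "clicks": 3},
]
parts = group_by_partition(records, {"ds": "20110110"}, ["region"])
```

With this approach the writer no longer needs to enumerate region values up front; new values simply produce new partitions at write time.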



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HCATALOG-42) Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat

Posted by "Sushanth Sowmyan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HCATALOG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sushanth Sowmyan updated HCATALOG-42:
-------------------------------------

    Attachment:     (was: HCATALOG-42.4.patch)


[jira] [Updated] (HCATALOG-42) Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat

Posted by "Sushanth Sowmyan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HCATALOG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sushanth Sowmyan updated HCATALOG-42:
-------------------------------------

    Attachment: HCATALOG-42.patch

Patch attached


[jira] [Updated] (HCATALOG-42) Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat

Posted by "Sushanth Sowmyan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HCATALOG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sushanth Sowmyan updated HCATALOG-42:
-------------------------------------

    Attachment: HCATALOG-42.3.patch

Patch update.


[jira] [Updated] (HCATALOG-42) Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat

Posted by "Sushanth Sowmyan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HCATALOG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sushanth Sowmyan updated HCATALOG-42:
-------------------------------------

    Attachment:     (was: HCATALOG-42.3.patch)


[jira] [Assigned] (HCATALOG-42) Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat

Posted by "Sushanth Sowmyan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HCATALOG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sushanth Sowmyan reassigned HCATALOG-42:
----------------------------------------

    Assignee: Sushanth Sowmyan


[jira] [Commented] (HCATALOG-42) Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat

Posted by "Sushanth Sowmyan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HCATALOG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064840#comment-13064840 ] 

Sushanth Sowmyan commented on HCATALOG-42:
------------------------------------------

A couple of points of feedback and results from testing:

+ We need to disable the max_partitions check, as it's not clear whether we want to keep it. We can disable it by default (in code) and revisit this if we see the need to change it.
+ Issues with HAR: on a secure cluster with security enabled, attempting to launch a HAR job from the OutputCommitter fails because we cannot launch a job from a task without a JobTracker delegation token. The code needs to be fixed to fetch one, pass it along, use it to launch the HAR job, and cancel it afterwards.
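The max_partitions check discussed above amounts to a simple guard on the number of distinct partitions a dynamic write would create. The following is an illustrative Python sketch, not HCatalog's actual implementation; the limit name and default value are assumptions for the example.

```python
# Sketch of a max-partitions safety check for a dynamic-partition write:
# fail fast if the data would create more partitions than a configured cap.
MAX_DYNAMIC_PARTITIONS = 100  # illustrative default, not an HCatalog setting

def check_partition_count(partition_specs, limit=MAX_DYNAMIC_PARTITIONS):
    """Raise if a dynamic write would create too many partitions.

    partition_specs -- the distinct partition specs derived from the data
    limit           -- maximum allowed partitions; None disables the check
    """
    if limit is not None and len(partition_specs) > limit:
        raise RuntimeError(
            f"Dynamic write would create {len(partition_specs)} partitions, "
            f"exceeding the configured limit of {limit}")

# Disabled-by-default behaviour, as suggested above: pass limit=None.
check_partition_count({("region", "us"), ("region", "eu")}, limit=None)
check_partition_count({("region", "us")}, limit=100)
```

Disabling by default while keeping the hook in place matches the "disable it in code, revisit later" suggestion in the comment.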


[jira] [Updated] (HCATALOG-42) Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat

Posted by "Sushanth Sowmyan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HCATALOG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sushanth Sowmyan updated HCATALOG-42:
-------------------------------------

    Attachment: HCATALOG-42.5.patch

Added a max_partitions test and an explicit cleanupJob call semantic in the output storage driver, rather than calling the default one.


[jira] [Updated] (HCATALOG-42) Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat

Posted by "Sushanth Sowmyan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HCATALOG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sushanth Sowmyan updated HCATALOG-42:
-------------------------------------

    Attachment:     (was: HCATALOG-42.5.patch)


[jira] [Updated] (HCATALOG-42) Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat

Posted by "Sushanth Sowmyan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HCATALOG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sushanth Sowmyan updated HCATALOG-42:
-------------------------------------

    Attachment: hadoop_archive-0.3.1.jar

Jar required for the HAR functionality to compile and work; it should be included in the lib directory. It is part of Hadoop and is therefore covered by the Apache License.

> Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat
> --------------------------------------------------------------------------------
>
>                 Key: HCATALOG-42
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-42
>             Project: HCatalog
>          Issue Type: Improvement
>    Affects Versions: 0.2
>            Reporter: Sushanth Sowmyan
>            Assignee: Sushanth Sowmyan
>         Attachments: HCATALOG-42.patch, hadoop_archive-0.3.1.jar
>
>
> HCatalog lets users abstract away underlying storage details and refer to data as tables and partitions. In this view, the storage abstraction is about classifying how data is organized rather than where it is stored: a user specifies the partitions to be stored and leaves it to HCatalog to figure out how and where to store them.
> When it comes to reading data, a user can name the table and specify various partition key-value combinations to prune, much like a SQL WHERE clause. When it comes to writing, however, the abstraction is not as seamless. We still require the end user to write data out to the table partition by partition; those partitions require fine-grained knowledge of the key-value pairs involved, we require that knowledge in advance, and we require the writer to have already grouped the data accordingly before attempting to store it.
> For example, the following Pig script illustrates this:
> --
> A = load 'raw' using HCatLoader(); 
> ... 
> split Z into for_us if region='us', for_eu if region='eu', for_asia if region='asia'; 
> store for_us into 'processed' using HCatStorage("ds=20110110, region=us"); 
> store for_eu into 'processed' using HCatStorage("ds=20110110, region=eu"); 
> store for_asia into 'processed' using HCatStorage("ds=20110110, region=asia"); 
> --
> This has a major issue: MapReduce programs and Pig scripts must know all the possible values of a key, and that list must be maintained and updated whenever new values are introduced, which may not always be easy or even possible. With more partitions, scripts become cumbersome. And if each partition written launches a separate HCatalog store, we increase the load on the HCatalog server and multiply the number of store jobs by the number of partitions.
> It would be far preferable if HCatalog could figure out all the required partitions from the data being written, which would let us simplify the above script to the following:
> --
> A = load 'raw' using HCatLoader(); 
> ... 
> store Z into 'processed' using HCatStorage("ds=20110110"); 
> --
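The proposal above can be sketched conceptually: instead of one store per known key value, the writer derives each record's partition from the record itself and groups records by the resulting partition spec. This is an illustrative sketch only, not HCatalog's actual implementation; the function name and record layout are hypothetical.

```python
# Conceptual sketch (not HCatalog code): dynamic partitioning routes each
# record to a partition path derived from its own key values, so the writer
# no longer needs to enumerate every key value in advance.
from collections import defaultdict

def dynamic_partition_write(records, static_spec, dynamic_keys):
    """Group records by the values of their dynamic partition keys.

    static_spec  -- partition values fixed by the user, e.g. {"ds": "20110110"}
    dynamic_keys -- partition keys whose values come from the data, e.g. ["region"]
    """
    partitions = defaultdict(list)
    for rec in records:
        # Build the full partition spec: static values first, then values
        # read out of the record itself.
        spec = dict(static_spec)
        for key in dynamic_keys:
            spec[key] = rec[key]
        path = "/".join(f"{k}={v}" for k, v in spec.items())
        partitions[path].append(rec)
    return dict(partitions)

records = [
    {"region": "us", "clicks": 10},
    {"region": "eu", "clicks": 7},
    {"region": "us", "clicks": 3},
]
out = dynamic_partition_write(records, {"ds": "20110110"}, ["region"])
# out maps "ds=20110110/region=us" and "ds=20110110/region=eu" to their records.
```

The single `store Z` in the simplified script corresponds to one such grouping pass: the set of output partitions is discovered from the data, not declared by the script.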

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HCATALOG-42) Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat

Posted by "Sushanth Sowmyan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HCATALOG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sushanth Sowmyan updated HCATALOG-42:
-------------------------------------

    Attachment:     (was: HCATALOG-42.patch)

> Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat
> --------------------------------------------------------------------------------
>
>                 Key: HCATALOG-42
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-42
>             Project: HCatalog
>          Issue Type: Improvement
>    Affects Versions: 0.2
>            Reporter: Sushanth Sowmyan
>            Assignee: Sushanth Sowmyan
>         Attachments: HCATALOG-42.2.patch, hadoop_archive-0.3.1.jar


[jira] [Updated] (HCATALOG-42) Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat

Posted by "Sushanth Sowmyan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HCATALOG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sushanth Sowmyan updated HCATALOG-42:
-------------------------------------

    Attachment: HCATALOG-42.6.patch

Updated the patch to fix a bug where the order of the generated paths could vary; partition paths are now generated consistently in the order of the table's partition keys.
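The fix described above can be illustrated with a small sketch (illustrative only, not the patch itself): build the path by iterating the table's declared partition keys in order, rather than iterating an unordered spec map.

```python
# Sketch: derive the partition path from the table's declared key order,
# so the same spec always yields the same path regardless of how the
# spec map happens to be ordered.
def partition_path(table_partition_keys, spec):
    # spec may arrive in any order; the table's key order wins.
    return "/".join(f"{k}={spec[k]}" for k in table_partition_keys)

# The same spec supplied in two different orders yields one path.
p1 = partition_path(["ds", "region"], {"region": "us", "ds": "20110110"})
p2 = partition_path(["ds", "region"], {"ds": "20110110", "region": "us"})
```

Both calls produce `ds=20110110/region=us`, which is what makes the generated paths deterministic across runs.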

> Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat
> --------------------------------------------------------------------------------
>
>                 Key: HCATALOG-42
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-42
>             Project: HCatalog
>          Issue Type: Improvement
>    Affects Versions: 0.2
>            Reporter: Sushanth Sowmyan
>            Assignee: Sushanth Sowmyan
>         Attachments: HCATALOG-42.6.patch, hadoop_archive-0.3.1.jar


[jira] [Updated] (HCATALOG-42) Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat

Posted by "Sushanth Sowmyan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HCATALOG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sushanth Sowmyan updated HCATALOG-42:
-------------------------------------

    Attachment: HCATALOG-42.2.patch

New patch update; it supersedes the previous patch and depends on the attached jar being included in lib/.

> Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat
> --------------------------------------------------------------------------------
>
>                 Key: HCATALOG-42
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-42
>             Project: HCatalog
>          Issue Type: Improvement
>    Affects Versions: 0.2
>            Reporter: Sushanth Sowmyan
>            Assignee: Sushanth Sowmyan
>         Attachments: HCATALOG-42.2.patch, HCATALOG-42.patch, hadoop_archive-0.3.1.jar


[jira] [Updated] (HCATALOG-42) Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat

Posted by "Sushanth Sowmyan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HCATALOG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sushanth Sowmyan updated HCATALOG-42:
-------------------------------------

    Attachment:     (was: HCATALOG-42.2.patch)

> Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat
> --------------------------------------------------------------------------------
>
>                 Key: HCATALOG-42
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-42
>             Project: HCatalog
>          Issue Type: Improvement
>    Affects Versions: 0.2
>            Reporter: Sushanth Sowmyan
>            Assignee: Sushanth Sowmyan
>         Attachments: HCATALOG-42.3.patch, hadoop_archive-0.3.1.jar


[jira] [Updated] (HCATALOG-42) Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat

Posted by "Sushanth Sowmyan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HCATALOG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sushanth Sowmyan updated HCATALOG-42:
-------------------------------------

    Attachment: HCATALOG-42.4.patch

Patch updated after some test fixes.

> Storing across partitions(Dynamic Partitioning) from HCatStorer/HCatOutputFormat
> --------------------------------------------------------------------------------
>
>                 Key: HCATALOG-42
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-42
>             Project: HCatalog
>          Issue Type: Improvement
>    Affects Versions: 0.2
>            Reporter: Sushanth Sowmyan
>            Assignee: Sushanth Sowmyan
>         Attachments: HCATALOG-42.4.patch, hadoop_archive-0.3.1.jar
