Posted to dev@gora.apache.org by "Alexis (JIRA)" <ji...@apache.org> on 2010/12/18 19:53:01 UTC

[jira] Created: (GORA-20) Flush datastore regularly

Flush datastore regularly
-------------------------

                 Key: GORA-20
                 URL: https://issues.apache.org/jira/browse/GORA-20
             Project: Gora
          Issue Type: New Feature
          Components: storage
            Reporter: Alexis


Right now you need to explicitly call the flush method, or close the datastore, to make the I/O operations actually happen.

The issue is described here: http://techvineyard.blogspot.com/2010/12/build-nutch-20.html#Free_up_the_memory. Click on the image to see it at full size and look at the heap utilization in the top-right chart.

Not everybody has infinite memory. In a Nutch fetch process, I usually run into trouble after around 20k URLs have been downloaded, because the buffered records take up all the memory: the Java heap space is set to 1G on a system that "only" has 1G of RAM as well.

The feature consists of allowing the datastore to be flushed regularly during the Hadoop job's reduce phase, in org.apache.gora.mapreduce.GoraReducer. We would just add a maxBuffer parameter, with a default value of 10000 for example, which you can override in org.apache.gora.mapreduce.GoraOutputFormat. It indicates the maximum number of records buffered in memory before the next flush operation actually writes them to the datastore. The counter and threshold would be members of the org.apache.hadoop.mapreduce.RecordWriter subclass returned by the getRecordWriter method.

An idea of the fix is suggested in the above link. 
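
To make that idea concrete, here is a minimal sketch of such a RecordWriter (a hypothetical illustration, not an actual patch; only the Hadoop and Gora API calls are real):

import java.io.IOException;

import org.apache.gora.persistency.Persistent;
import org.apache.gora.store.DataStore;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Hypothetical writer that flushes the datastore every maxBuffer records
// instead of letting them all pile up in memory until close().
public class BufferedGoraRecordWriter<K, T extends Persistent>
    extends RecordWriter<K, T> {

  private final DataStore<K, T> store;
  private final int maxBuffer;  // flush threshold, e.g. 10000 by default
  private int buffered = 0;     // records written since the last flush

  public BufferedGoraRecordWriter(DataStore<K, T> store, int maxBuffer) {
    this.store = store;
    this.maxBuffer = maxBuffer;
  }

  @Override
  public void write(K key, T value) throws IOException {
    store.put(key, value);
    if (++buffered >= maxBuffer) {
      store.flush();  // push buffered records to the backend, freeing the heap
      buffered = 0;
    }
  }

  @Override
  public void close(TaskAttemptContext context) throws IOException {
    store.close();  // closing the store flushes any remaining records
  }
}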

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (GORA-20) Flush datastore regularly

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GORA-20?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972861#action_12972861 ] 

Doğacan Güney commented on GORA-20:
-----------------------------------

Hello,

Thanks for the patch. Can you attach it to this issue so we can commit it and credit the patch to you?



[jira] Issue Comment Edited: (GORA-20) Flush datastore regularly

Posted by "Alexis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/GORA-20?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972879#action_12972879 ] 

Alexis edited comment on GORA-20 at 12/18/10 7:11 PM:
------------------------------------------------------

I added a new org.apache.gora.mapreduce.GoraRecordWriter class (a GoraRecordReader already existed).

The writer reads a "gora.buffer.limit" Hadoop property, which allows the user to change the default value of 10000. One needs to set it in the usual mapred-site.xml configuration file, to be added to the $NUTCH_HOME/conf directory for example. See the attached example.
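
For instance, a minimal mapred-site.xml overriding the default could look like this (the value 5000 is just an illustration, not the attached file):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>gora.buffer.limit</name>
    <value>5000</value>
  </property>
</configuration>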

I hope this feature is OK; otherwise I can revert it and stick with a constant, immutable value of 10000.




[jira] Updated: (GORA-20) Flush datastore regularly

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GORA-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated GORA-20:
------------------------------

    Affects Version/s: 0.1-incubating
        Fix Version/s: 0.1-incubating

Marking as 0.1-incubating



[jira] Updated: (GORA-20) Flush datastore regularly

Posted by "Alexis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GORA-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexis updated GORA-20:
-----------------------

    Attachment: mapred-site.xml
                gora.patch

I added a new org.apache.gora.mapreduce.GoraRecordWriter class (a GoraRecordReader already existed).

The writer reads a "gora.buffer.limit" Hadoop property, which allows the user to change the default value of 10000. One needs to set it in the usual mapred-site.xml configuration file, to be added to the $NUTCH_HOME/conf directory for example. See the attached example.

I hope this feature is OK; otherwise I can revert it and stick with a constant, immutable value of 10000.




[jira] Resolved: (GORA-20) Flush datastore regularly

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/GORA-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved GORA-20.
-------------------------------

    Resolution: Fixed

Committed revision 1057554.

I've modified the patch slightly to include the license header and removed the author comment (the other classes do not have one).
GORA does not come with a default Hadoop-style config file; overriding the value can be done e.g. in the Nutch conf.
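
For jobs configured programmatically, the same property can presumably be set on the job's Configuration instead, e.g.:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Flush every 5000 records instead of the 10000 default (illustrative value).
conf.setInt("gora.buffer.limit", 5000);
// ... pass conf to the Job / GoraOutputFormat as usual.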

Thanks Alexis! 
