Posted to dev@nutch.apache.org by "Ferdy Galema (Created) (JIRA)" <ji...@apache.org> on 2012/04/18 16:39:36 UTC

[jira] [Created] (NUTCH-1340) Increase scalability by only removing markers when they actually exist for DbUpdaterReducer

Increase scalability by only removing markers when they actually exist for DbUpdaterReducer
-------------------------------------------------------------------------------------------

                 Key: NUTCH-1340
                 URL: https://issues.apache.org/jira/browse/NUTCH-1340
             Project: Nutch
          Issue Type: Improvement
            Reporter: Ferdy Galema
             Fix For: nutchgora


After applying GORA-120 (which is already a huge performance boost by itself), one of the major bottlenecks of the DbUpdaterReducer is the deletion of the markers. The update reducer simply sets every row to delete its markers. A lot of rows do not actually have the markers, but the deletes are fired off in any case. Because the markers are always already part of the input, a simple check to see whether they exist greatly improves performance.

In particular, this is very expensive in HBase, because every single Delete immediately triggers a connection to the regionservers. (Deletes ignore the "autoflush=false" directive.) Although deletes can be done in batch, this is currently not supported by Gora. For one, it is very difficult to implement in the current HBaseStore with regard to multithreading, and secondly, I noticed that performance did not increase significantly.

Performance debugging on a real-life cluster shows that this currently seems to be the biggest bottleneck of the DbUpdaterReducer. (Remember: only after applying GORA-120.)
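
The check described above can be illustrated with a minimal, self-contained Java sketch. It is not the attached patch and does not use the Nutch or Gora APIs: the Row class, the removeMarkerIfExists helper, the marker names and the delete counter are all hypothetical stand-ins, chosen only to show how checking the input first reduces the number of deletes that are issued.

```java
import java.util.HashMap;
import java.util.Map;

public class MarkerCleanupSketch {

  // Illustrative marker names only; these are assumptions, not the names Nutch uses.
  static final String[] MARKER_NAMES = {"generate", "fetch", "parse"};

  /** Hypothetical stand-in for a stored row. It counts the deletes it would send to the store. */
  static class Row {
    private final Map<String, String> markers = new HashMap<>();
    private int deletesIssued = 0;

    void putMarker(String name, String value) {
      markers.put(name, value);
    }

    String getMarker(String name) {
      return markers.get(name);
    }

    /** Every call counts as one delete, whether or not the marker exists. */
    String removeMarker(String name) {
      deletesIssued++;
      return markers.remove(name);
    }

    int deletesIssued() {
      return deletesIssued;
    }
  }

  /** Removes the marker only if it is present on the row, so no delete is issued otherwise. */
  static String removeMarkerIfExists(Row row, String name) {
    if (row.getMarker(name) != null) {
      return row.removeMarker(name);
    }
    return null;
  }

  /**
   * Builds rowCount rows, where every markedEvery-th row carries all markers,
   * then cleans up the markers and returns the total number of deletes issued.
   */
  static int countDeletes(int rowCount, int markedEvery, boolean checkFirst) {
    int total = 0;
    for (int i = 0; i < rowCount; i++) {
      Row row = new Row();
      if (i % markedEvery == 0) {
        for (String name : MARKER_NAMES) {
          row.putMarker(name, "batch-1");
        }
      }
      for (String name : MARKER_NAMES) {
        if (checkFirst) {
          removeMarkerIfExists(row, name);
        } else {
          row.removeMarker(name);
        }
      }
      total += row.deletesIssued();
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println("unconditional deletes: " + countDeletes(10, 5, false));
    System.out.println("deletes with check:    " + countDeletes(10, 5, true));
  }
}
```

In this sketch the saving depends entirely on how many rows actually carry markers. When most rows have none, as described above, most deletes are skipped.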

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1340) Increase scalability by only removing markers when they actually exist for DbUpdaterReducer

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263364#comment-13263364 ] 

Hudson commented on NUTCH-1340:
-------------------------------

Integrated in Nutch-nutchgora #240 (See [https://builds.apache.org/job/Nutch-nutchgora/240/])
    NUTCH-1340 Increase scalability by only removing markers when they actually exist for DbUpdaterReducer (Revision 1330722)

     Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/java/org/apache/nutch/crawl/DbUpdateReducer.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/storage/Mark.java

                


        

[jira] [Updated] (NUTCH-1340) Increase scalability by only removing markers when they actually exist for DbUpdaterReducer

Posted by "Ferdy Galema (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1340:
--------------------------------

    Attachment: NUTCH-1340-v1.txt
    


        

[jira] [Closed] (NUTCH-1340) Increase scalability by only removing markers when they actually exist for DbUpdaterReducer

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema closed NUTCH-1340.
-------------------------------

    Resolution: Fixed
    


        

[jira] [Updated] (NUTCH-1340) Increase scalability by only removing markers when they actually exist for DbUpdaterReducer

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1340:
--------------------------------

    Attachment: NUTCH-1340-v2.txt

v2 of the patch, including Javadoc. This patch increases performance, but when updating huge crawls it can still be troublesome to process the huge number of deletes. However, this is something that needs to be solved in Gora.

Committed!

Thanks Lewis.
                


        

[jira] [Commented] (NUTCH-1340) Increase scalability by only removing markers when they actually exist for DbUpdaterReducer

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262121#comment-13262121 ] 

Lewis John McGibbney commented on NUTCH-1340:
---------------------------------------------

Hi Ferdy. I am +1 for this going into 2.0. If you could do your usual and provide a small Javadoc comment for the new method you introduce that would be great. 
                
