Posted to dev@nutch.apache.org by Giuseppe Totaro <to...@gmail.com> on 2015/03/02 03:45:52 UTC

Re: Review Request 31579: Patch for NUTCH-1949: Dump out the Nutch data into the Common Crawl format

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31579/
-----------------------------------------------------------

(Updated March 2, 2015, 2:45 a.m.)


Review request for nutch, Lewis McGibbney and Chris Mattmann.


Bugs: NUTCH-1949
    https://issues.apache.org/jira/browse/NUTCH-1949


Repository: nutch


Description
-------

Patch for NUTCH-1949: first version of the CommonCrawlDataDumper tool, which maps Nutch data into the Common Crawl format.


Diffs
-----

  trunk/conf/nutch-default.xml 1662875 
  trunk/src/bin/nutch 1662875 
  trunk/src/java/org/apache/nutch/tools/AbstractCommonCrawlFormat.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/CommonCrawlFormat.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatFactory.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatJackson.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatJettinson.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatSimple.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/FileDumper.java 1662875 

Diff: https://reviews.apache.org/r/31579/diff/
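
The file names in the diff list suggest a format interface with several interchangeable implementations chosen through a factory. The sketch below only illustrates that general shape and is not the code from the patch: the class name CommonCrawlSketch, the Format interface, the create() method, and the JSON field names (url, contentType, timestamp) are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from the NUTCH-1949 patch.
public class CommonCrawlSketch {

    /** Assumed contract: turn one fetched record into a JSON string. */
    public interface Format {
        String toJson(String url, String contentType, long fetchTimeMillis);
    }

    /** Hand-rolled writer, standing in for a dependency-free implementation. */
    static final class SimpleFormat implements Format {
        @Override
        public String toJson(String url, String contentType, long fetchTimeMillis) {
            // LinkedHashMap keeps the fields in insertion order.
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("url", quote(url));
            fields.put("contentType", quote(contentType));
            fields.put("timestamp", Long.toString(fetchTimeMillis));

            StringBuilder sb = new StringBuilder("{");
            for (Map.Entry<String, String> e : fields.entrySet()) {
                if (sb.length() > 1) {
                    sb.append(',');
                }
                sb.append(quote(e.getKey())).append(':').append(e.getValue());
            }
            return sb.append('}').toString();
        }

        /** Wraps a string in double quotes and escapes characters JSON forbids. */
        private static String quote(String s) {
            StringBuilder sb = new StringBuilder("\"");
            for (char c : s.toCharArray()) {
                switch (c) {
                    case '"':  sb.append("\\\""); break;
                    case '\\': sb.append("\\\\"); break;
                    case '\n': sb.append("\\n");  break;
                    case '\r': sb.append("\\r");  break;
                    case '\t': sb.append("\\t");  break;
                    default:
                        if (c < 0x20) {
                            // Remaining control characters become 4-digit hex escapes.
                            sb.append(String.format("\\u%04x", (int) c));
                        } else {
                            sb.append(c);
                        }
                }
            }
            return sb.append('"').toString();
        }
    }

    /** Factory: callers pick an implementation by name (names are assumed). */
    public static Format create(String name) {
        switch (name) {
            case "simple":
                return new SimpleFormat();
            default:
                throw new IllegalArgumentException("Unknown format: " + name);
        }
    }

    public static void main(String[] args) {
        Format format = create("simple");
        System.out.println(format.toJson("http://example.org/", "text/html", 1425264352000L));
    }
}
```

With this layout, a library-backed implementation could be added as another case in create() without touching the callers.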


Testing
-------

Tested locally against Nutch segments.


Thanks,

Giuseppe Totaro


Re: Review Request 31579: Patch for NUTCH-1949: Dump out the Nutch data into the Common Crawl format

Posted by Chris Mattmann <ma...@apache.org>.

> On March 2, 2015, 6:53 p.m., Julien Nioche wrote:
> > Any reason why you can't have this in a separate plugin as an extension of IndexWriter? See [https://issues.apache.org/jira/browse/NUTCH-1949?focusedCommentId=14336272&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14336272]

Yep, I think we should do this, as a next step. I'll file a ticket to get this integrated into an IndexingPlugin, but I think this is +1 and good to go now.


- Chris


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31579/#review74783
-----------------------------------------------------------




Re: Review Request 31579: Patch for NUTCH-1949: Dump out the Nutch data into the Common Crawl format

Posted by Julien Nioche <jn...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31579/#review74783
-----------------------------------------------------------


Any reason why you can't have this in a separate plugin as an extension of IndexWriter? See [https://issues.apache.org/jira/browse/NUTCH-1949?focusedCommentId=14336272&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14336272]

- Julien Nioche




Re: Review Request 31579: Patch for NUTCH-1949: Dump out the Nutch data into the Common Crawl format

Posted by Chris Mattmann <ma...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31579/#review75199
-----------------------------------------------------------

Ship It!

- Chris Mattmann




Re: Review Request 31579: Patch for NUTCH-1949: Dump out the Nutch data into the Common Crawl format

Posted by Lewis McGibbney <le...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31579/#review75187
-----------------------------------------------------------

Ship It!

- Lewis McGibbney




Re: Review Request 31579: Patch for NUTCH-1949: Dump out the Nutch data into the Common Crawl format

Posted by Giuseppe Totaro <to...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31579/
-----------------------------------------------------------

(Updated March 4, 2015, 12:19 a.m.)


Review request for nutch, Lewis McGibbney and Chris Mattmann.


Summary (updated)
-----------------

Patch for NUTCH-1949: Dump out the Nutch data into the Common Crawl format


Bugs: NUTCH-1949
    https://issues.apache.org/jira/browse/NUTCH-1949


Repository: nutch


Description
-------

Patch for NUTCH-1949: first version of the CommonCrawlDataDumper tool, which maps Nutch data into the Common Crawl format.


Diffs
-----

  trunk/src/bin/nutch 1662875 
  trunk/src/java/org/apache/nutch/tools/AbstractCommonCrawlFormat.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/CommonCrawlFormat.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatFactory.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatJackson.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatJettinson.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatSimple.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/FileDumper.java 1662875 

Diff: https://reviews.apache.org/r/31579/diff/


Testing
-------

Tested locally against Nutch segments.


Thanks,

Giuseppe Totaro


Re: Review Request 31579: Patch for NUTCH-1949: Dump out the Nutch data into the Common Crawl format

Posted by Lewis McGibbney <le...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31579/#review74818
-----------------------------------------------------------


I am +1 on this patch based on all of Chris' suggestions for improvement. Would be nice to see a plan come together which moves the functionality to an indexing plugin as suggested by Julien. I think that would make sense.

- Lewis McGibbney




Re: Review Request 31579: Patch for NUTCH-1949: Dump out the Nutch data into the Common Crawl format

Posted by Giuseppe Totaro <to...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31579/
-----------------------------------------------------------

(Updated March 2, 2015, 5:58 p.m.)


Review request for nutch, Lewis McGibbney and Chris Mattmann.


Changes
-------

Patch updated based on feedback from Chris Mattmann.


Bugs: NUTCH-1949
    https://issues.apache.org/jira/browse/NUTCH-1949


Repository: nutch


Description
-------

Patch for NUTCH-1949: first version of the CommonCrawlDataDumper tool, which maps Nutch data into the Common Crawl format.


Diffs (updated)
-----

  trunk/src/bin/nutch 1662875 
  trunk/src/java/org/apache/nutch/tools/AbstractCommonCrawlFormat.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/CommonCrawlFormat.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatFactory.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatJackson.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatJettinson.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatSimple.java PRE-CREATION 
  trunk/src/java/org/apache/nutch/tools/FileDumper.java 1662875 

Diff: https://reviews.apache.org/r/31579/diff/


Testing
-------

Tested locally against Nutch segments.


Thanks,

Giuseppe Totaro