Posted to dev@nutch.apache.org by Giuseppe Totaro <to...@gmail.com> on 2015/03/02 03:45:52 UTC
Re: Review Request 31579: Patch for NUTCH-1949: Dump out the Nutch data
into the Common Crawl format
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31579/
-----------------------------------------------------------
(Updated March 2, 2015, 2:45 a.m.)
Review request for nutch, Lewis McGibbney and Chris Mattmann.
Bugs: NUTCH-1949
https://issues.apache.org/jira/browse/NUTCH-1949
Repository: nutch
Description
-------
Patch for NUTCH-1949: first version of the CommonCrawlDataDumper tool, which maps Nutch data into the Common Crawl format.
Diffs
-----
trunk/conf/nutch-default.xml 1662875
trunk/src/bin/nutch 1662875
trunk/src/java/org/apache/nutch/tools/AbstractCommonCrawlFormat.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/CommonCrawlFormat.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatFactory.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatJackson.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatJettinson.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatSimple.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/FileDumper.java 1662875
Diff: https://reviews.apache.org/r/31579/diff/
Testing
-------
Tested locally against Nutch segments.
Thanks,
Giuseppe Totaro
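The file list above suggests an interface, a factory, and several interchangeable serializer implementations. The following is only a rough sketch of that general shape, not the code in the patch: every name here (FormatSketch, SketchFormat, SimpleSketchFormat, SketchFormatFactory) is hypothetical, the JSON field names are illustrative rather than the actual Common Crawl schema, and the content is treated as UTF-8 text for simplicity.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical entry point: builds one record and prints its JSON form.
class FormatSketch {
  public static void main(String[] args) {
    Map<String, String> headers = new LinkedHashMap<>();
    headers.put("Content-Type", "text/html");
    SketchFormat format = SketchFormatFactory.create("simple");
    byte[] content = "<p>hi</p>".getBytes(StandardCharsets.UTF_8);
    System.out.println(format.toJson("http://example.com/", content, headers));
  }
}

// Hypothetical interface: one fetched record in, one JSON string out.
interface SketchFormat {
  String toJson(String url, byte[] content, Map<String, String> headers);
}

// Hand-rolled JSON building with no external library (an assumption made
// for this sketch; the patch's own implementations are not shown here).
final class SimpleSketchFormat implements SketchFormat {
  @Override
  public String toJson(String url, byte[] content, Map<String, String> headers) {
    StringBuilder sb = new StringBuilder();
    sb.append("{\"url\":\"").append(escape(url)).append("\",");
    sb.append("\"headers\":{");
    boolean first = true;
    for (Map.Entry<String, String> e : headers.entrySet()) {
      if (!first) {
        sb.append(',');
      }
      sb.append('"').append(escape(e.getKey())).append("\":\"")
          .append(escape(e.getValue())).append('"');
      first = false;
    }
    sb.append("},");
    // Simplification: the raw bytes are decoded as UTF-8 text.
    sb.append("\"content\":\"")
        .append(escape(new String(content, StandardCharsets.UTF_8)))
        .append("\"}");
    return sb.toString();
  }

  // Escapes the characters that must be escaped inside a JSON string.
  static String escape(String s) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case '"': out.append("\\\""); break;
        case '\\': out.append("\\\\"); break;
        case '\n': out.append("\\n"); break;
        case '\r': out.append("\\r"); break;
        case '\t': out.append("\\t"); break;
        default:
          if (c < 0x20) {
            out.append(String.format("\\u%04x", (int) c));
          } else {
            out.append(c);
          }
      }
    }
    return out.toString();
  }
}

// Hypothetical factory: selects an implementation by name.
final class SketchFormatFactory {
  static SketchFormat create(String name) {
    if ("simple".equalsIgnoreCase(name)) {
      return new SimpleSketchFormat();
    }
    throw new IllegalArgumentException("Unknown format: " + name);
  }
}
```

Keeping the serializer behind an interface means a caller only names the variant it wants; adding a library-backed implementation would then be a change to the factory alone.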
Re: Review Request 31579: Patch for NUTCH-1949: Dump out the Nutch data
into the Common Crawl format
Posted by Chris Mattmann <ma...@apache.org>.
> On March 2, 2015, 6:53 p.m., Julien Nioche wrote:
> > Any reason why you can't have this in a separate plugin as an extension of IndexWriter? See [https://issues.apache.org/jira/browse/NUTCH-1949?focusedCommentId=14336272&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14336272]
Yep, I think we should do this, as a next step. I'll file a ticket to get this integrated into an IndexingPlugin, but I think this is +1 and good to go now.
- Chris
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31579/#review74783
-----------------------------------------------------------
Re: Review Request 31579: Patch for NUTCH-1949: Dump out the Nutch data
into the Common Crawl format
Posted by Julien Nioche <jn...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31579/#review74783
-----------------------------------------------------------
Any reason why you can't have this in a separate plugin as an extension of IndexWriter? See [https://issues.apache.org/jira/browse/NUTCH-1949?focusedCommentId=14336272&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14336272]
- Julien Nioche
Re: Review Request 31579: Patch for NUTCH-1949: Dump out the Nutch
data into the Common Crawl format
Posted by Chris Mattmann <ma...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31579/#review75199
-----------------------------------------------------------
Ship it!
- Chris Mattmann
Re: Review Request 31579: Patch for NUTCH-1949: Dump out the Nutch
data into the Common Crawl format
Posted by Lewis McGibbney <le...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31579/#review75187
-----------------------------------------------------------
Ship it!
- Lewis McGibbney
Re: Review Request 31579: Patch for NUTCH-1949: Dump out the Nutch
data into the Common Crawl format
Posted by Giuseppe Totaro <to...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31579/
-----------------------------------------------------------
(Updated March 4, 2015, 12:19 a.m.)
Review request for nutch, Lewis McGibbney and Chris Mattmann.
Summary (updated)
-----------------
Patch for NUTCH-1949: Dump out the Nutch data into the Common Crawl format
Bugs: NUTCH-1949
https://issues.apache.org/jira/browse/NUTCH-1949
Repository: nutch
Description
-------
Patch for NUTCH-1949: first version of the CommonCrawlDataDumper tool, which maps Nutch data into the Common Crawl format.
Diffs
-----
trunk/src/bin/nutch 1662875
trunk/src/java/org/apache/nutch/tools/AbstractCommonCrawlFormat.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/CommonCrawlFormat.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatFactory.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatJackson.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatJettinson.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatSimple.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/FileDumper.java 1662875
Diff: https://reviews.apache.org/r/31579/diff/
Testing
-------
Tested locally against Nutch segments.
Thanks,
Giuseppe Totaro
Re: Review Request 31579: Patch for NUTCH-1949: Dump out the Nutch data
into the Common Crawl format
Posted by Lewis McGibbney <le...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31579/#review74818
-----------------------------------------------------------
I am +1 on this patch based on all of Chris' suggestions for improvement. Would be nice to see a plan come together which moves the functionality to an indexing plugin as suggested by Julien. I think that would make sense.
- Lewis McGibbney
Re: Review Request 31579: Patch for NUTCH-1949: Dump out the Nutch data
into the Common Crawl format
Posted by Giuseppe Totaro <to...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31579/
-----------------------------------------------------------
(Updated March 2, 2015, 5:58 p.m.)
Review request for nutch, Lewis McGibbney and Chris Mattmann.
Changes
-------
Patch updated based on feedback from Chris Mattmann.
Bugs: NUTCH-1949
https://issues.apache.org/jira/browse/NUTCH-1949
Repository: nutch
Description
-------
Patch for NUTCH-1949: first version of the CommonCrawlDataDumper tool, which maps Nutch data into the Common Crawl format.
Diffs (updated)
-----
trunk/src/bin/nutch 1662875
trunk/src/java/org/apache/nutch/tools/AbstractCommonCrawlFormat.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/CommonCrawlFormat.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatFactory.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatJackson.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatJettinson.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/CommonCrawlFormatSimple.java PRE-CREATION
trunk/src/java/org/apache/nutch/tools/FileDumper.java 1662875
Diff: https://reviews.apache.org/r/31579/diff/
Testing
-------
Tested locally against Nutch segments.
Thanks,
Giuseppe Totaro