Posted to dev@nutch.apache.org by Chris Mattmann <ma...@apache.org> on 2014/09/06 06:57:07 UTC
Re: Review Request 9119: Create SegmentContentDumperTool for easily
extracting out file contents from SegmentDirs
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9119/
-----------------------------------------------------------
(Updated Sept. 6, 2014, 4:57 a.m.)
Review request for nutch and Julien Le Dem.
Bugs: NUTCH-1526
https://issues.apache.org/jira/browse/NUTCH-1526
Repository: nutch
Description
-------
Will contain the patch for the SegmentContentDumperTool described in NUTCH-1526:
./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]
-segmentRootDir full file path to the root segment directory, e.g., crawl/segments
-regexUrlPattern a regex URL pattern to select URL keys to dump from the content DB in each segment
-outputDir The output directory to write file names to.
-metadata --key=value where key is a Content Metadata key and value is a value to check.
Diffs (updated)
-----
./trunk/src/java/org/apache/nutch/tools/FileDumper.java PRE-CREATION
Diff: https://reviews.apache.org/r/9119/diff/
Testing
-------
Testing it on DARPA XDATA XNET.
Thanks,
Chris Mattmann
Re: Review Request 9119: Create SegmentContentDumperTool for easily
extracting out file contents from SegmentDirs
Posted by Chris Mattmann <ma...@apache.org>.
(Updated Sept. 10, 2014, 3:15 a.m.)
Review request for nutch.
Bugs: NUTCH-1526
https://issues.apache.org/jira/browse/NUTCH-1526
Repository: nutch
Description
-------
Will contain the patch for the SegmentContentDumperTool described in NUTCH-1526:
./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]
-segmentRootDir full file path to the root segment directory, e.g., crawl/segments
-regexUrlPattern a regex URL pattern to select URL keys to dump from the content DB in each segment
-outputDir The output directory to write file names to.
-metadata --key=value where key is a Content Metadata key and value is a value to check.
Diffs
-----
./trunk/src/java/org/apache/nutch/tools/FileDumper.java PRE-CREATION
Diff: https://reviews.apache.org/r/9119/diff/
Testing
-------
Testing it on DARPA XDATA XNET.
Thanks,
Chris Mattmann
Re: Review Request 9119: Create SegmentContentDumperTool for easily
extracting out file contents from SegmentDirs
Posted by Chris Mattmann <ma...@apache.org>.
> On Sept. 10, 2014, 1:24 a.m., Julien Le Dem wrote:
> > Not sure why I'm added to this review but I figured I could review it anyway :)
Thank you for the review Julien! I removed you from the Review Board after this, my mistake! BTW, your comments are great, and I plan on addressing them!
- Chris
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9119/#review52809
-----------------------------------------------------------
Re: Review Request 9119: Create SegmentContentDumperTool for easily
extracting out file contents from SegmentDirs
Posted by Chris Mattmann <ma...@apache.org>.
> On Sept. 10, 2014, 1:24 a.m., Julien Le Dem wrote:
> > ./trunk/src/java/org/apache/nutch/tools/FileDumper.java, line 45
> > <https://reviews.apache.org/r/9119/diff/1/?file=681989#file681989line45>
> >
> > this should be in the scope of the main method.
> > If you wanted to write unit tests it would be inconvenient, as calling main more than once would accumulate the stats.
Thanks, addressed.
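To illustrate the scoping point, here is a minimal sketch, not the actual FileDumper code. It assumes the stats in question are simple per-type counters; the class name DumpStatsSketch and the method run are hypothetical. Because the map is a local variable, each call starts from an empty tally, which is what makes repeated calls from a unit test safe.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the dumper; names are illustrative only.
class DumpStatsSketch {

    // Tallies items per type for a single run. The map lives inside the
    // method, so nothing carries over from one call to the next.
    static Map<String, Integer> run(List<String> types) {
        Map<String, Integer> typeCounts = new HashMap<String, Integer>();
        for (String type : types) {
            typeCounts.merge(type, 1, Integer::sum);
        }
        return typeCounts;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(args)));
    }
}
```

Had typeCounts been a static field, a second call in the same JVM would have added to the totals from the first.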
> On Sept. 10, 2014, 1:24 a.m., Julien Le Dem wrote:
> > ./trunk/src/java/org/apache/nutch/tools/FileDumper.java, line 64
> > <https://reviews.apache.org/r/9119/diff/1/?file=681989#file681989line64>
> >
> > you might want to throw an exception if this returns false
Fixed.
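The thread does not say which call returns the boolean. The sketch below assumes it is java.io.File.mkdirs() on the output directory, which is only a guess; OutputDirSketch and ensureDirectory are hypothetical names. It shows one way to turn a false return into an exception.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper; assumes the unchecked boolean came from File.mkdirs().
class OutputDirSketch {

    static File ensureDirectory(File dir) throws IOException {
        // mkdirs() also returns false when the directory already exists,
        // so test isDirectory() first and only fail on a real problem.
        if (!dir.isDirectory() && !dir.mkdirs()) {
            throw new IOException("Could not create output directory: " + dir);
        }
        return dir;
    }
}
```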
> On Sept. 10, 2014, 1:24 a.m., Julien Le Dem wrote:
> > ./trunk/src/java/org/apache/nutch/tools/FileDumper.java, line 88
> > <https://reviews.apache.org/r/9119/diff/1/?file=681989#file681989line88>
> >
> > as this is all working from the local file system, using Files all the way and converting to path when needed seems more natural.
> > new Path(file.toURI()) for example.
Not sure how to address this one?
> On Sept. 10, 2014, 1:24 a.m., Julien Le Dem wrote:
> > ./trunk/src/java/org/apache/nutch/tools/FileDumper.java, line 117
> > <https://reviews.apache.org/r/9119/diff/1/?file=681989#file681989line117>
> >
> > we create new File(outputFullPath) twice.
Fixed.
> On Sept. 10, 2014, 1:24 a.m., Julien Le Dem wrote:
> > ./trunk/src/java/org/apache/nutch/tools/FileDumper.java, line 116
> > <https://reviews.apache.org/r/9119/diff/1/?file=681989#file681989line116>
> >
> > does content close the stream?
Not sure, but I put the close in the finally block now and refactored.
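For reference, a minimal sketch of the close-in-finally pattern mentioned here. It is not the refactored FileDumper code; WriteContentSketch and writeBytes are hypothetical names, and a plain byte array stands in for the fetched content.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical helper showing the pattern only.
class WriteContentSketch {

    static void writeBytes(File target, byte[] content) throws IOException {
        OutputStream out = null;
        try {
            out = new FileOutputStream(target);
            out.write(content);
        } finally {
            // Runs whether or not write() threw, so the stream is closed
            // regardless of what the caller or the content object does.
            if (out != null) {
                out.close();
            }
        }
    }
}
```

On Java 7 and later, a try-with-resources statement gives the same guarantee with less code.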
- Chris
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9119/#review52809
-----------------------------------------------------------
Re: Review Request 9119: Create SegmentContentDumperTool for easily
extracting out file contents from SegmentDirs
Posted by Julien Le Dem <ju...@ledem.net>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9119/#review52809
-----------------------------------------------------------
Not sure why I'm added to this review but I figured I could review it anyway :)
./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91979>
this should be in the scope of the main method.
If you wanted to write unit tests it would be inconvenient, as calling main more than once would accumulate the stats.
./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91974>
you might want to throw an exception if this returns false
./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91975>
as this is all working from the local file system, using Files all the way and converting to path when needed seems more natural.
new Path(file.toURI()) for example.
./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91978>
usually the fileSystem object needs to be retrieved from the path.
path.getFileSystem(conf)
./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91976>
does content close the stream?
./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91977>
we create new File(outputFullPath) twice.
- Julien Le Dem
Re: Review Request 9119: Create SegmentContentDumperTool for easily
extracting out file contents from SegmentDirs
Posted by Chris Mattmann <ma...@apache.org>.
> On Sept. 9, 2014, 11:40 p.m., Lewis McGibbney wrote:
> >
Thanks Lewis, I'll address these right away.
- Chris
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9119/#review52796
-----------------------------------------------------------
Re: Review Request 9119: Create SegmentContentDumperTool for easily
extracting out file contents from SegmentDirs
Posted by Chris Mattmann <ma...@apache.org>.
> On Sept. 9, 2014, 11:40 p.m., Lewis McGibbney wrote:
> > ./trunk/src/java/org/apache/nutch/tools/FileDumper.java, line 101
> > <https://reviews.apache.org/r/9119/diff/1/?file=681989#file681989line101>
> >
> > When I change the Text() class to use the UTF8() class, I get the following
> >
> > lmcgibbn@LMC-032857 /usr/local/trunk/runtime/local(master) $ ./bin/nutch org.apache.nutch.tools.FileDumper . /usr/local/trunk/src/testresources/testcrawl/segments/
> > 2014-09-09 16:02:21.339 java[3942:1903] Unable to load realm info from SCDynamicStore
> > Sep 09, 2014 4:02:21 PM org.apache.nutch.tools.FileDumper main
> > INFO: Processing segment: [/usr/local/trunk/src/testresources/testcrawl/segments/20060919213635]
> > Exception in thread "main" java.io.EOFException
> > at java.io.DataInputStream.readFully(DataInputStream.java:197)
> > at java.io.DataInputStream.readFully(DataInputStream.java:169)
> > at org.apache.nutch.protocol.Content.readFieldsCompressed(Content.java:99)
> > at org.apache.nutch.protocol.Content.readFields(Content.java:154)
> > at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
> > at org.apache.nutch.tools.FileDumper.main(FileDumper.java:101)
> >
> > UTF8 is of course deprecated now so we need to stick with Text and implement the correct code.
Hey @Lewis, not sure if this is really an error or not. I grepped around all the Nutch code, and also did a find -name for anything that references testcrawl. No Nutch code in src/test or src/java references it. So I'm not sure that we should be using old UTF8 (instead of Text) crawl dirs here. I will go ahead and add some exception handling anyway and try to make it more robust.
[chipotle:~/src/nutch/src] mattmann% grep -R "testcrawl" test
[chipotle:~/src/nutch/src] mattmann% grep -R "testcrawl" *
[chipotle:~/src/nutch/src] mattmann% find . -name "testcrawl" -print
- Chris
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9119/#review52796
-----------------------------------------------------------
Re: Review Request 9119: Create SegmentContentDumperTool for easily
extracting out file contents from SegmentDirs
Posted by Lewis McGibbney <le...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9119/#review52796
-----------------------------------------------------------
./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91927>
This should read
FileDumper <output directory> <segments dir>
./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91928>
If I invoke this tool without ANY arguments, I get the following
lmcgibbn@LMC-032857 /usr/local/trunk/runtime/local(master) $ ./bin/nutch org.apache.nutch.tools.FileDumper
2014-09-09 15:57:19.045 java[3866:1903] Unable to load realm info from SCDynamicStore
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.nutch.tools.FileDumper.main(FileDumper.java:53)
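A minimal sketch of an argument check that would replace the ArrayIndexOutOfBoundsException with a usage message. It is not the patch itself; ArgsSketch and parse are hypothetical names, and the usage text simply follows the "FileDumper <output directory> <segments dir>" form suggested above.

```java
// Hypothetical argument check; names are illustrative only.
class ArgsSketch {

    static final String USAGE =
        "Usage: FileDumper <output directory> <segments dir>";

    // Returns {outputDir, segmentsDir}, or fails with the usage message
    // instead of indexing into an array that may be empty.
    static String[] parse(String[] args) {
        if (args == null || args.length != 2) {
            throw new IllegalArgumentException(USAGE);
        }
        return new String[] { args[0], args[1] };
    }

    public static void main(String[] args) {
        try {
            String[] parsed = parse(args);
            System.out.println("output=" + parsed[0] + " segments=" + parsed[1]);
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
        }
    }
}
```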
./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91929>
When I invoke this tool as follows
lmcgibbn@LMC-032857 /usr/local/trunk/runtime/local(master) $ ./bin/nutch org.apache.nutch.tools.FileDumper . /usr/local/trunk/src/testresources/testcrawl/segments/
2014-09-09 15:59:06.185 java[3883:1903] Unable to load realm info from SCDynamicStore
Sep 09, 2014 3:59:06 PM org.apache.nutch.tools.FileDumper main
INFO: Processing segment: [/usr/local/trunk/src/testresources/testcrawl/segments/20060919213635]
Exception in thread "main" java.io.IOException: wrong key class: org.apache.hadoop.io.Text is not class org.apache.hadoop.io.UTF8
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1886)
at org.apache.nutch.tools.FileDumper.main(FileDumper.java:99)
./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91931>
When I change the Text() class to use the UTF8() class, I get the following
lmcgibbn@LMC-032857 /usr/local/trunk/runtime/local(master) $ ./bin/nutch org.apache.nutch.tools.FileDumper . /usr/local/trunk/src/testresources/testcrawl/segments/
2014-09-09 16:02:21.339 java[3942:1903] Unable to load realm info from SCDynamicStore
Sep 09, 2014 4:02:21 PM org.apache.nutch.tools.FileDumper main
INFO: Processing segment: [/usr/local/trunk/src/testresources/testcrawl/segments/20060919213635]
Exception in thread "main" java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.apache.nutch.protocol.Content.readFieldsCompressed(Content.java:99)
at org.apache.nutch.protocol.Content.readFields(Content.java:154)
at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
at org.apache.nutch.tools.FileDumper.main(FileDumper.java:101)
UTF8 is of course deprecated now so we need to stick with Text and implement the correct code.
- Lewis McGibbney