Posted to dev@nutch.apache.org by Chris Mattmann <ma...@apache.org> on 2014/09/06 06:57:07 UTC

Re: Review Request 9119: Create SegmentContentDumperTool for easily extracting out file contents from SegmentDirs

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9119/
-----------------------------------------------------------

(Updated Sept. 6, 2014, 4:57 a.m.)


Review request for nutch and Julien Le Dem.


Bugs: NUTCH-1526
    https://issues.apache.org/jira/browse/NUTCH-1526


Repository: nutch


Description
-------

Will contain the patch for the SegmentContentDumperTool described in NUTCH-1526:

./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]
   -segmentRootDir full file path to the root segment directory, e.g., crawl/segments
   -regexUrlPattern a regex URL pattern to select URL keys to dump from the content DB in each segment
   -outputDir The output directory to write file names to.
   -metadata --key=value where key is a Content Metadata key and value is a value to check.


Diffs (updated)
-----

  ./trunk/src/java/org/apache/nutch/tools/FileDumper.java PRE-CREATION 

Diff: https://reviews.apache.org/r/9119/diff/


Testing
-------

Testing it on DARPA XDATA XNET.


Thanks,

Chris Mattmann


Re: Review Request 9119: Create SegmentContentDumperTool for easily extracting out file contents from SegmentDirs

Posted by Chris Mattmann <ma...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9119/
-----------------------------------------------------------

(Updated Sept. 10, 2014, 3:15 a.m.)


Review request for nutch.


Bugs: NUTCH-1526
    https://issues.apache.org/jira/browse/NUTCH-1526


Repository: nutch


Description
-------

Will contain the patch for the SegmentContentDumperTool described in NUTCH-1526:

./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]
   -segmentRootDir full file path to the root segment directory, e.g., crawl/segments
   -regexUrlPattern a regex URL pattern to select URL keys to dump from the content DB in each segment
   -outputDir The output directory to write file names to.
   -metadata --key=value where key is a Content Metadata key and value is a value to check.


Diffs
-----

  ./trunk/src/java/org/apache/nutch/tools/FileDumper.java PRE-CREATION 

Diff: https://reviews.apache.org/r/9119/diff/


Testing
-------

Testing it on DARPA XDATA XNET.


Thanks,

Chris Mattmann


Re: Review Request 9119: Create SegmentContentDumperTool for easily extracting out file contents from SegmentDirs

Posted by Chris Mattmann <ma...@apache.org>.

> On Sept. 10, 2014, 1:24 a.m., Julien Le Dem wrote:
> > Not sure why I'm added to this review but I figured I could review it anyway :)

Thank you for the review Julien! I removed you from the Review Board after this, my mistake! BTW, your comments are great, and I plan on addressing them!


- Chris


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9119/#review52809
-----------------------------------------------------------




Re: Review Request 9119: Create SegmentContentDumperTool for easily extracting out file contents from SegmentDirs

Posted by Chris Mattmann <ma...@apache.org>.

> On Sept. 10, 2014, 1:24 a.m., Julien Le Dem wrote:
> > ./trunk/src/java/org/apache/nutch/tools/FileDumper.java, line 45
> > <https://reviews.apache.org/r/9119/diff/1/?file=681989#file681989line45>
> >
> >     This should be in the scope of the main method.
> >     If you wanted to write unit tests, it would be inconvenient, as calling main more than once would accumulate the stats.

Thanks, addressed.
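
A minimal sketch of the scoping point above, assuming the stats were held in a static field (the diff itself is not quoted in this thread, so the class and variable names below are hypothetical). Creating the counter inside the method means each call starts from empty stats:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the tool; not the actual FileDumper code.
class DumpStatsSketch {

  // Counts occurrences per type. The map is created inside the method,
  // so each call starts with empty stats instead of accumulating
  // across calls the way a static field would.
  static Map<String, Integer> run(String[] types) {
    Map<String, Integer> typeCounts = new HashMap<>();
    for (String type : types) {
      typeCounts.merge(type, 1, Integer::sum);
    }
    return typeCounts;
  }

  public static void main(String[] args) {
    System.out.println(run(new String[] {"text/html", "image/png", "text/html"}));
  }
}
```

With this shape, a unit test can call run() repeatedly and assert on each result independently.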


> On Sept. 10, 2014, 1:24 a.m., Julien Le Dem wrote:
> > ./trunk/src/java/org/apache/nutch/tools/FileDumper.java, line 64
> > <https://reviews.apache.org/r/9119/diff/1/?file=681989#file681989line64>
> >
> >     you might want to throw an exception if this returns false

Fixed.


> On Sept. 10, 2014, 1:24 a.m., Julien Le Dem wrote:
> > ./trunk/src/java/org/apache/nutch/tools/FileDumper.java, line 88
> > <https://reviews.apache.org/r/9119/diff/1/?file=681989#file681989line88>
> >
> >     As this is all working from the local file system, using Files all the way and converting to path when needed seems more natural.
> >     new Path(file.toURI()) for example.

Not sure how to address this one?


> On Sept. 10, 2014, 1:24 a.m., Julien Le Dem wrote:
> > ./trunk/src/java/org/apache/nutch/tools/FileDumper.java, line 117
> > <https://reviews.apache.org/r/9119/diff/1/?file=681989#file681989line117>
> >
> >     we create new File(outputFullPath) twice.

Fixed.


> On Sept. 10, 2014, 1:24 a.m., Julien Le Dem wrote:
> > ./trunk/src/java/org/apache/nutch/tools/FileDumper.java, line 116
> > <https://reviews.apache.org/r/9119/diff/1/?file=681989#file681989line116>
> >
> >     does content close the stream?

Not sure, but I put the close in the finally block now and refactored.
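
A minimal sketch of the close-in-finally pattern mentioned above, using only java.io. The helper name and signature are hypothetical and not taken from the patch:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical helper; names are illustrative, not taken from the patch.
class WriteAndCloseSketch {

  // Writes the bytes to the target file and closes the stream in a
  // finally block, so it is released even if write() throws.
  static void writeBytes(File target, byte[] data) throws IOException {
    OutputStream out = null;
    try {
      out = new FileOutputStream(target);
      out.write(data);
    } finally {
      if (out != null) {
        out.close();
      }
    }
  }

  public static void main(String[] args) throws IOException {
    File tmp = File.createTempFile("dump-sketch", ".bin");
    writeBytes(tmp, new byte[] {1, 2, 3});
    System.out.println(tmp.length());
    tmp.delete();
  }
}
```

On Java 7 and later, a try-with-resources statement gives the same guarantee with less code.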


- Chris


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9119/#review52809
-----------------------------------------------------------




Re: Review Request 9119: Create SegmentContentDumperTool for easily extracting out file contents from SegmentDirs

Posted by Julien Le Dem <ju...@ledem.net>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9119/#review52809
-----------------------------------------------------------


Not sure why I'm added to this review but I figured I could review it anyway :)


./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91979>

    This should be in the scope of the main method.
    If you wanted to write unit tests, it would be inconvenient, as calling main more than once would accumulate the stats.



./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91974>

    you might want to throw an exception if this returns false



./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91975>

    As this is all working from the local file system, using Files all the way and converting to path when needed seems more natural.
    new Path(file.toURI()) for example.



./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91978>

    Usually the FileSystem object needs to be retrieved from the path.
    path.getFileSystem(conf)



./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91976>

    does content close the stream?



./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91977>

    we create new File(outputFullPath) twice.


- Julien Le Dem




Re: Review Request 9119: Create SegmentContentDumperTool for easily extracting out file contents from SegmentDirs

Posted by Chris Mattmann <ma...@apache.org>.

> On Sept. 9, 2014, 11:40 p.m., Lewis McGibbney wrote:
> >

Thanks Lewis, I'll address these right away.


- Chris


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9119/#review52796
-----------------------------------------------------------




Re: Review Request 9119: Create SegmentContentDumperTool for easily extracting out file contents from SegmentDirs

Posted by Chris Mattmann <ma...@apache.org>.

> On Sept. 9, 2014, 11:40 p.m., Lewis McGibbney wrote:
> > ./trunk/src/java/org/apache/nutch/tools/FileDumper.java, line 101
> > <https://reviews.apache.org/r/9119/diff/1/?file=681989#file681989line101>
> >
> >     When I change the Text() class to use the UTF8() class, I get the following
> >     
> >     lmcgibbn@LMC-032857 /usr/local/trunk/runtime/local(master) $ ./bin/nutch org.apache.nutch.tools.FileDumper . /usr/local/trunk/src/testresources/testcrawl/segments/
> >     2014-09-09 16:02:21.339 java[3942:1903] Unable to load realm info from SCDynamicStore
> >     Sep 09, 2014 4:02:21 PM org.apache.nutch.tools.FileDumper main
> >     INFO: Processing segment: [/usr/local/trunk/src/testresources/testcrawl/segments/20060919213635]
> >     Exception in thread "main" java.io.EOFException
> >     	at java.io.DataInputStream.readFully(DataInputStream.java:197)
> >     	at java.io.DataInputStream.readFully(DataInputStream.java:169)
> >     	at org.apache.nutch.protocol.Content.readFieldsCompressed(Content.java:99)
> >     	at org.apache.nutch.protocol.Content.readFields(Content.java:154)
> >     	at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
> >     	at org.apache.nutch.tools.FileDumper.main(FileDumper.java:101)
> >         
> >     UTF8 is of course deprecated now so we need to stick with Text and implement the correct code.

Hey @Lewis, I'm not sure whether this is really an error. I grepped through all the Nutch code and also ran find -name for anything that references testcrawl. No Nutch code in src/test or src/java references it, so I'm not sure we should be using old UTF8 (instead of Text) crawl dirs here. I will go ahead and add some exception handling anyway and try to make it more robust.

[chipotle:~/src/nutch/src] mattmann% grep -R "testcrawl" test
[chipotle:~/src/nutch/src] mattmann% grep -R "testcrawl" *
[chipotle:~/src/nutch/src] mattmann% find . -name "testcrawl" -print


- Chris


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9119/#review52796
-----------------------------------------------------------




Re: Review Request 9119: Create SegmentContentDumperTool for easily extracting out file contents from SegmentDirs

Posted by Lewis McGibbney <le...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9119/#review52796
-----------------------------------------------------------



./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91927>

    This should read 
    
    FileDumper <output directory> <segments dir>



./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91928>

    If I invoke this tool without ANY arguments, I get the following
    
    lmcgibbn@LMC-032857 /usr/local/trunk/runtime/local(master) $ ./bin/nutch org.apache.nutch.tools.FileDumper
    2014-09-09 15:57:19.045 java[3866:1903] Unable to load realm info from SCDynamicStore
    Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
    	at org.apache.nutch.tools.FileDumper.main(FileDumper.java:53)
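
One way to avoid the ArrayIndexOutOfBoundsException above is to check the argument count before indexing into args. This is a minimal sketch, assuming the two positional arguments from the usage string suggested earlier in this review; the class and method names are hypothetical and not the actual FileDumper code:

```java
// Hypothetical sketch of an argument check; not the actual FileDumper code.
class ArgsCheckSketch {

  static final String USAGE = "Usage: FileDumper <output directory> <segments dir>";

  // Returns null when the arguments are acceptable, otherwise the usage message.
  static String validate(String[] args) {
    if (args == null || args.length < 2) {
      return USAGE;
    }
    return null;
  }

  public static void main(String[] args) {
    String error = validate(args);
    if (error != null) {
      // Print usage and exit instead of failing on args[0].
      System.err.println(error);
      System.exit(1);
    }
    System.out.println("output dir: " + args[0] + ", segments dir: " + args[1]);
  }
}
```

Keeping the check in a separate method makes it testable without triggering System.exit.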



./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91929>

    When I invoke this tool as follows
    
    lmcgibbn@LMC-032857 /usr/local/trunk/runtime/local(master) $ ./bin/nutch org.apache.nutch.tools.FileDumper . /usr/local/trunk/src/testresources/testcrawl/segments/
    2014-09-09 15:59:06.185 java[3883:1903] Unable to load realm info from SCDynamicStore
    Sep 09, 2014 3:59:06 PM org.apache.nutch.tools.FileDumper main
    INFO: Processing segment: [/usr/local/trunk/src/testresources/testcrawl/segments/20060919213635]
    Exception in thread "main" java.io.IOException: wrong key class: org.apache.hadoop.io.Text is not class org.apache.hadoop.io.UTF8
    	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1886)
    	at org.apache.nutch.tools.FileDumper.main(FileDumper.java:99)



./trunk/src/java/org/apache/nutch/tools/FileDumper.java
<https://reviews.apache.org/r/9119/#comment91931>

    When I change the Text() class to use the UTF8() class, I get the following
    
    lmcgibbn@LMC-032857 /usr/local/trunk/runtime/local(master) $ ./bin/nutch org.apache.nutch.tools.FileDumper . /usr/local/trunk/src/testresources/testcrawl/segments/
    2014-09-09 16:02:21.339 java[3942:1903] Unable to load realm info from SCDynamicStore
    Sep 09, 2014 4:02:21 PM org.apache.nutch.tools.FileDumper main
    INFO: Processing segment: [/usr/local/trunk/src/testresources/testcrawl/segments/20060919213635]
    Exception in thread "main" java.io.EOFException
    	at java.io.DataInputStream.readFully(DataInputStream.java:197)
    	at java.io.DataInputStream.readFully(DataInputStream.java:169)
    	at org.apache.nutch.protocol.Content.readFieldsCompressed(Content.java:99)
    	at org.apache.nutch.protocol.Content.readFields(Content.java:154)
    	at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
    	at org.apache.nutch.tools.FileDumper.main(FileDumper.java:101)
        
    UTF8 is of course deprecated now so we need to stick with Text and implement the correct code.


- Lewis McGibbney

