Posted to dev@nutch.apache.org by "Gabriele Kahlout (Updated) (JIRA)" <ji...@apache.org> on 2012/03/04 12:50:00 UTC
[jira] [Updated] (NUTCH-1001) bin/nutch fetch/parse handle crawl/segments directory
[ https://issues.apache.org/jira/browse/NUTCH-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gabriele Kahlout updated NUTCH-1001:
------------------------------------
Attachment: Fetcher.java, nutch1001v2.patch
Here is the awaited patch. Here's how I generated it:
svn co https://svn.apache.org/repos/asf/nutch/trunk/ nutch
cd nutch
ant
It checked out revision 1296779 and built successfully. Then I applied the desired changes and successfully built nutch again. I then exported the diff:
svn diff > ../nutch1001v2.patch
I checked the resulting patch and it looks good this time.
> bin/nutch fetch/parse handle crawl/segments directory
> -----------------------------------------------------
>
> Key: NUTCH-1001
> URL: https://issues.apache.org/jira/browse/NUTCH-1001
> Project: Nutch
> Issue Type: Improvement
> Reporter: Gabriele Kahlout
> Priority: Minor
> Fix For: 1.5
>
> Attachments: Fetcher.java, NUTCH-1001.patch, nutch1001v2.patch
>
>
> I'm having issues porting scripts across different systems to support the step of extracting the latest/only segments resulting from the generate phase.
> Variants include:
> $ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1` #[1]
> $ s1=`ls -d crawl/segments/2* | tail -1` #[2]
> $ segment=`$HADOOP_HOME/bin/hadoop dfs -ls crawl/segments | tail -1 | grep -o [a-zA-Z0-9/\-]* |tail -1`
> $ segment=`$HADOOP_HOME/bin/hdfs dfs -ls crawl/segments | tail -1 | grep -o [a-zA-Z0-9/\-]* |tail -1`
> And I'm not sure what Windows users would have to do. Some users may also make do with:
> bin/nutch fetch with crawl/segments/2*
> But I don't see a need for the user to extract or worry about the latest/only segment, or for that to be a described step in every Nutch tutorial. Moreover, only fetch and parse expect a single segment, while other commands are fine with the directory of segments.
> Therefore, I think it would be beneficial if fetch and parse also handled directories of segments.
> [1] http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
> [2] http://wiki.apache.org/nutch/NutchTutorial#Command_Line_Searching
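For illustration only, the behaviour requested in the issue can be approximated on a local filesystem with a small POSIX shell helper. This is a rough sketch, not what the attached patch does: the function name latest_segment is hypothetical, and it assumes that segment names are timestamps (so name order matches creation order) and that a directory containing a crawl_generate subdirectory is itself a segment.

```shell
# Hypothetical helper, not part of Nutch: print the newest segment under a
# segments directory, or the argument itself if it already is a segment.
# Assumption: segment names are timestamps, so the last name in sorted
# glob order is the most recent segment.
latest_segment() {
  dir=${1%/}                      # drop one trailing slash, if present
  # Assumption: a segment contains a crawl_generate subdirectory.
  if [ -d "$dir/crawl_generate" ]; then
    printf '%s\n' "$dir"
    return 0
  fi
  latest=
  for candidate in "$dir"/*/; do
    # Skip the unexpanded pattern when the directory has no subdirectories.
    [ -d "$candidate" ] || continue
    latest=${candidate%/}         # globs expand in sorted order; keep the last
  done
  [ -n "$latest" ] || return 1    # nothing found: report failure
  printf '%s\n' "$latest"
}
```

With such a helper, a script could call bin/nutch fetch "$(latest_segment crawl/segments)" without parsing ls output. It does not cover segments stored in HDFS.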