Posted to dev@nutch.apache.org by "Gabriele Kahlout (Issue Comment Edited) (JIRA)" <ji...@apache.org> on 2012/02/26 19:26:48 UTC
[jira] [Issue Comment Edited] (NUTCH-1001) bin/nutch fetch/parse handle crawl/segments directory

    [ https://issues.apache.org/jira/browse/NUTCH-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216802#comment-13216802 ] 

Gabriele Kahlout edited comment on NUTCH-1001 at 2/26/12 6:25 PM:
------------------------------------------------------------------

I've re-applied the desired changes to the Fetcher file, this time without changing the indentation and formatting of the file.

The attached patch is just for the Fetcher file. Once you agree on it, I'll be happy to make the corresponding changes to the Parser file as well.

I've also attached the modified file with the patch applied to it. It's based on the latest checkout of branch-1.4, rev 1293880. 

The novelty introduced by the patch is a fetch method that takes an array of segments (the contents of a directory) and processes each one in turn:

{code}
  private void fetch(FileStatus[] segsStates, int threads) throws IOException {
    for (FileStatus segmentStatus : segsStates) {
      fetch(segmentStatus.getPath(), threads);
    }
  }
{code}
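
To make the per-segment loop easier to try outside Nutch, here is a minimal sketch of the same idea using java.nio instead of Hadoop's FileSystem. SegmentLister and listSegments are hypothetical names, not part of the patch, and skipping non-directory entries and sorting by name are my own assumptions about what a caller would want:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the patch: collects the sub-directories of a
// segments directory so that each one can be handed to a single-segment fetch.
final class SegmentLister {
  static List<Path> listSegments(Path segmentsDir) throws IOException {
    List<Path> segments = new ArrayList<>();
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(segmentsDir)) {
      for (Path entry : entries) {
        // Assumption: plain files in the directory are not segments, so skip them.
        if (Files.isDirectory(entry)) {
          segments.add(entry);
        }
      }
    }
    // Names of equal length made only of digits sort in chronological order.
    Collections.sort(segments);
    return segments;
  }
}
```

A caller would then loop over the returned list and invoke the single-segment fetch on each path, as the patch does with the FileStatus array.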

Whether the user input is a single segment or a directory of segments is determined by examining the input file name. If it's a long timestamp then it's a segment; otherwise it might be a directory (or wrong input).
{code}
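    try {
      // Throws NumberFormatException if the name is not a long timestamp.
      validateSegmentName(segment);
      fetch(segment, threads);
      return 0;
    } catch (NumberFormatException nfe) {
      // Not a segment name: treat the input as a directory of segments.
      FileStatus[] segStates = segment.getFileSystem(getConf()).listStatus(segment);
      fetch(segStates, threads);
      return 0;
    }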
{code}
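
For illustration, the name check itself might look roughly like the sketch below. The body of validateSegmentName is not shown in this comment, so SegmentNameCheck and looksLikeSegment are hypothetical stand-ins, and returning a boolean instead of throwing is my own choice:

```java
// Hypothetical stand-in for the patch's validateSegmentName(); not the actual code.
final class SegmentNameCheck {
  // Returns true when the last path component parses as a long,
  // e.g. "crawl/segments/20120226182500".
  static boolean looksLikeSegment(String path) {
    // Drop one trailing slash so "crawl/segments/" is read as "crawl/segments".
    String trimmed = path.endsWith("/") ? path.substring(0, path.length() - 1) : path;
    String name = trimmed.substring(trimmed.lastIndexOf('/') + 1);
    try {
      Long.parseLong(name);
      return true;
    } catch (NumberFormatException nfe) {
      return false;
    }
  }
}
```

Note that with a check like this, a directory of segments whose own name happens to be numeric would be mistaken for a single segment.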
                
> bin/nutch fetch/parse handle crawl/segments directory
> -----------------------------------------------------
>
>                 Key: NUTCH-1001
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1001
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Gabriele Kahlout
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: Fetcher.java, NUTCH-1001.patch, nutch-1001_fetcher.patch
>
>
> I'm having issues porting scripts across different systems to support the step of extracting the latest/only segments resulting from the generate phase.
> Variants include:
> $ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1` #[1]
> $ s1=`ls -d crawl/segments/2* | tail -1` #[2]
> $ segment=`$HADOOP_HOME/bin/hadoop dfs -ls crawl/segments | tail -1 | grep -o [a-zA-Z0-9/\-]* |tail -1`
> $ segment=`$HADOOP_HOME/bin/hdfs -ls crawl/segments | tail -1 | grep -o [a-zA-Z0-9/\-]* |tail -1`
> And I'm not sure what Windows users would have to do. Some users may also make do with:
> bin/nutch fetch with crawl/segments/2*
> But I don't see why the user should have to extract or worry about the latest/only segment, or why this should be a described step in every Nutch tutorial. Moreover, only fetch and parse expect a single segment, while other commands are fine with the directory of segments.
> Therefore, I think it's beneficial if fetch and parse also handle directories of segments. 
> [1] http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
> [2] http://wiki.apache.org/nutch/NutchTutorial#Command_Line_Searching
