Posted to dev@nutch.apache.org by "Stefan Groschupf (JIRA)" <ji...@apache.org> on 2006/01/08 00:08:20 UTC

[jira] Created: (NUTCH-166) secure jobtracker info pages with a password

secure jobtracker info pages with a password
--------------------------------------------

         Key: NUTCH-166
         URL: http://issues.apache.org/jira/browse/NUTCH-166
     Project: Nutch
        Type: Improvement
    Versions: 0.8-dev    
    Reporter: Stefan Groschupf
     Fix For: 0.8-dev


Since people often post stack traces to the mailing list that contain IP addresses, it is easy for others to view the info pages of the jobtracker.
These pages may contain more security-critical information, such as further IP addresses and internal host names.

Therefore this patch adds Basic password authentication to the Jetty server.
The user name is 'admin' and the password can be configured in the Nutch configuration file.
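
As a rough illustration of the check described above (not the code in the attached patch, and not Jetty's API), the sketch below validates an HTTP "Authorization: Basic ..." header against the fixed user name 'admin' and a password supplied by the caller. The class and method names are hypothetical, the password argument stands in for whatever the Nutch configuration would provide, and it assumes a JDK that has java.util.Base64 (8 or later).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

// Hypothetical helper for illustration; not part of the patch or of Jetty.
final class BasicAuthCheck {
    static final String USER = "admin"; // fixed user name, as described in the issue

    /** Builds the header value a client would send for the given credentials. */
    static String headerFor(String user, String password) {
        String raw = user + ":" + password;
        return "Basic " + Base64.getEncoder().encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns true only if the header carries "admin" and the configured password. */
    static boolean isAuthorized(String authorizationHeader, String configuredPassword) {
        if (authorizationHeader == null || configuredPassword == null) {
            return false;
        }
        // The scheme name is matched case-insensitively.
        if (!authorizationHeader.regionMatches(true, 0, "Basic ", 0, 6)) {
            return false;
        }
        byte[] decoded;
        try {
            decoded = Base64.getDecoder().decode(authorizationHeader.substring(6).trim());
        } catch (IllegalArgumentException e) {
            return false; // payload is not valid Base64
        }
        byte[] expected = (USER + ":" + configuredPassword).getBytes(StandardCharsets.UTF_8);
        // MessageDigest.isEqual avoids an early exit on the first differing byte.
        return MessageDigest.isEqual(decoded, expected);
    }
}
```

Basic authentication only encodes the credentials, it does not encrypt them, so a check like this protects the info pages from casual browsing rather than from someone who can observe the traffic.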

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


Re: NPE in Indexer.java line 184

Posted by Gal Nitzan <gn...@usa.net>.
OK. thanks for the patch.

I shall apply it tonight.

I promise :) to let you know...

Gal.


On Mon, 2006-01-09 at 10:53 +0100, Andrzej Bialecki wrote:
> Gal Nitzan wrote:
> 
> >Sorry :) no.
> >
> >  
> >
> 
> Hmm. ok. :) But I think that patch is needed anyway, because now we 
> silently assume that parse plugins will always copy all Content metadata 
> to ParseData.metadata, while it may not be the case - and it certainly 
> does not happen if there is a parse error ... and this patch fixes it. 
> Later on, Indexer tries to retrieve these values from 
> parseData.metadata, and not from the content.metadata (because we try to 
> avoid reading too much data, so the content part of a segment is not 
> accessed during indexing).
> 
> >I run fetcher with parse.
> >
> >This NPE  happens for only a few documents and that is the problem :)
> >  
> >
> 
> Ok, then I think I know what is going on... Please try this patch - 
> that's the same problem, actually: these few documents failed to parse, 
> and we got an empty parseData - but in this case it means also empty 
> metadata, which means no segment name nor score in parseData.metadata.
> 
> Please test and report if it helps.
> 
> plain text document attachment (patch)
> Index: Fetcher.java
> ===================================================================
> --- Fetcher.java	(revision 367099)
> +++ Fetcher.java	(working copy)
> @@ -223,6 +223,9 @@
>          parse.getData().getMetadata().setProperty(SIGNATURE_KEY, StringUtil.toHexString(signature));
>          datum.setSignature(signature);
>        }
> +      // add segment name and score to parseData metadata
> +      parse.getData().getMetadata().setProperty(SEGMENT_NAME_KEY, segmentName);
> +      parse.getData().getMetadata().setProperty(SCORE_KEY, Float.toString(datum.getScore()));
>  
>        try {
>          output.collect



Re: What/how num of required maps is set? OOP Wrong list

Posted by Gal Nitzan <gn...@usa.net>.
On Mon, 2006-01-09 at 12:07 +0200, Gal Nitzan wrote:
> I am trying to figure out how the number of required maps is
> set/calculated by Nutch.
> 
> I have 3 task trackers.
> 
> I added one more.
> 
> When I run fetch, only the initial three are fetching.
> 
> I added the task tracker before calling generate (in case that
> matters).
> 
> Thanks,
> 
> G.
> 
> 
> 
> 



What/how num of required maps is set?

Posted by Gal Nitzan <gn...@usa.net>.
I am trying to figure out how the number of required maps is
set/calculated by Nutch.

I have 3 task trackers.

I added one more.

When I run fetch, only the initial three are fetching.

I added the task tracker before calling generate (in case that
matters).

Thanks,

G.




Re: NPE in Indexer.java line 184

Posted by Andrzej Bialecki <ab...@getopt.org>.
Gal Nitzan wrote:

>Sorry :) no.
>
>  
>

Hmm. ok. :) But I think that patch is needed anyway, because now we 
silently assume that parse plugins will always copy all Content metadata 
to ParseData.metadata, while it may not be the case - and it certainly 
does not happen if there is a parse error ... and this patch fixes it. 
Later on, Indexer tries to retrieve these values from 
parseData.metadata, and not from the content.metadata (because we try to 
avoid reading too much data, so the content part of a segment is not 
accessed during indexing).

>I run fetcher with parse.
>
>This NPE  happens for only a few documents and that is the problem :)
>  
>

Ok, then I think I know what is going on... Please try this patch - 
that's the same problem, actually: these few documents failed to parse, 
and we got an empty parseData - but in this case it means also empty 
metadata, which means no segment name nor score in parseData.metadata.

Please test and report if it helps.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: NPE in Indexer.java line 184

Posted by Gal Nitzan <gn...@usa.net>.
Sorry :) no.

I run the fetcher with parsing enabled.

This NPE happens for only a few documents, and that is the problem :)



On Mon, 2006-01-09 at 09:43 +0100, Andrzej Bialecki wrote:
> Gal Nitzan wrote:
> 
> >Hi Andrzej,
> >
> >The value cannot be null is my message :)
> >
> >  
> >
> 
> :)
> 
> I'm guessing that you are using Fetcher in non-parsing mode, and then 
> you run ParseSegment as a separate step, right?
> 
> Please try the attached patch.
> 
> plain text document attachment (patch)
> Index: ParseSegment.java
> ===================================================================
> --- ParseSegment.java	(revision 367099)
> +++ ParseSegment.java	(working copy)
> @@ -58,9 +58,16 @@
>        status = new ParseStatus(e);
>      }
>  
> +    ContentProperties metadata = parse.getData().getMetadata();
>      // compute the new signature
>      byte[] signature = SignatureFactory.getSignature(getConf()).calculate(content, parse);
> -    parse.getData().getMetadata().setProperty(Fetcher.SIGNATURE_KEY, StringUtil.toHexString(signature));
> +    metadata.setProperty(Fetcher.SIGNATURE_KEY, StringUtil.toHexString(signature));
> +    // copy segment name and score
> +    String segmentName = content.getMetadata().getProperty(Fetcher.SEGMENT_NAME_KEY);
> +    String score = content.getMetadata().getProperty(Fetcher.SCORE_KEY);
> +    metadata.setProperty(Fetcher.SEGMENT_NAME_KEY, segmentName);
> +    metadata.setProperty(Fetcher.SCORE_KEY, score);
> +    
>      if (status.isSuccess()) {
>        output.collect(key, new ParseImpl(parse.getText(), parse.getData()));
>      } else {
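
One detail worth noting about the ParseSegment patch above: it copies the segment name and score unconditionally, so a key missing from the content metadata would be passed on as null. The thread does not say how ContentProperties treats a null value. The sketch below uses java.util.Properties as a stand-in (its setProperty rejects null) and copies only the keys that are actually present. The class name, method name and key strings are hypothetical.

```java
import java.util.Properties;

// Stand-in sketch: java.util.Properties plays the role of the metadata objects.
final class MetadataCopy {
    /**
     * Copies the given keys from source to target, skipping keys that the
     * source does not contain. Returns the number of keys actually copied.
     */
    static int copyIfPresent(Properties source, Properties target, String... keys) {
        int copied = 0;
        for (String key : keys) {
            String value = source.getProperty(key);
            if (value != null) {
                target.setProperty(key, value);
                copied++;
            }
        }
        return copied;
    }
}
```

A caller could compare the returned count with the number of keys requested and log a warning when something is missing, instead of finding out later during indexing.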



Re: NPE in Indexer.java line 184

Posted by Andrzej Bialecki <ab...@getopt.org>.
Gal Nitzan wrote:

>Hi Andrzej,
>
>The value cannot be null is my message :)
>
>  
>

:)

I'm guessing that you are using Fetcher in non-parsing mode, and then 
you run ParseSegment as a separate step, right?

Please try the attached patch.




Re: NPE in Indexer.java line 184

Posted by Gal Nitzan <gn...@usa.net>.
Hi Andrzej,

The value cannot be null is my message :)

060109 094543 task_r_9xvvcz  Could not get property: segment name
060109 094543 task_r_9xvvcz  [Ljava.lang.StackTraceElement;@154864a
060109 094543 task_r_9xvvcz java.lang.NullPointerException: value cannot be null
060109 094543 task_r_9xvvcz     at org.apache.lucene.document.Field.<init>(Field.java:469)
060109 094543 task_r_9xvvcz     at org.apache.lucene.document.Field.<init>(Field.java:412)
060109 094543 task_r_9xvvcz     at org.apache.lucene.document.Field.UnIndexed(Field.java:195)
060109 094543 task_r_9xvvcz     at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:200)
060109 094543 task_r_9xvvcz     at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
060109 094543 task_r_9xvvcz     at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:603)
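
The trace shows the field constructor rejecting a null value that came from a missing "segment name" property. Until the metadata is populated at fetch or parse time, one defensive option is to check for the missing value before building the field and skip that document. The sketch below models this with java.util.Properties as a stand-in for the parse metadata. The class, method and key names are hypothetical and this is not the actual Indexer code.

```java
import java.util.Properties;

// Hypothetical guard; the class, method and key names are illustrative only.
final class IndexGuard {
    /**
     * Returns the segment name stored under the given key, or null when it is
     * missing, so the caller can skip the document instead of passing null on.
     */
    static String segmentNameOrNull(Properties parseMeta, String segmentKey) {
        String segment = parseMeta.getProperty(segmentKey);
        if (segment == null) {
            System.err.println("Skipping document: no value for " + segmentKey);
        }
        return segment;
    }

    /** True when both the segment name and the score are present. */
    static boolean isIndexable(Properties parseMeta, String segmentKey, String scoreKey) {
        return segmentNameOrNull(parseMeta, segmentKey) != null
            && parseMeta.getProperty(scoreKey) != null;
    }
}
```

Compared with wrapping the call in a try/catch, an explicit check makes the skipped documents visible in the log and does not hide unrelated exceptions.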


Gal

On Sun, 2006-01-08 at 10:07 +0100, Andrzej Bialecki wrote:
> Gal Nitzan wrote:
> 
> >Hi
> >
> >While the reduce task is running I sometimes get this exception and it
> >breaks the whole job.
> >
> >As a workaround I put this line in a try/catch and just return; however,
> >I am not sure why the metadata lacks the segment name key.
> >
> >This workaround is good enough for now.
> >
> >  
> >
> 
> Stacktrace?
> 



Re: NPE in Indexer.java line 184

Posted by Andrzej Bialecki <ab...@getopt.org>.
Gal Nitzan wrote:

>Hi
>
>While the reduce task is running I sometimes get this exception and it
>breaks the whole job.
>
>As a workaround I put this line in a try/catch and just return; however,
>I am not sure why the metadata lacks the segment name key.
>
>This workaround is good enough for now.
>
>  
>

Stacktrace?




NPE in Indexer.java line 184

Posted by Gal Nitzan <gn...@usa.net>.
Hi

While the reduce task is running I sometimes get this exception and it
breaks the whole job.

As a workaround I put this line in a try/catch and just return; however,
I am not sure why the metadata lacks the segment name key.

This workaround is good enough for now.

G.



[jira] Updated: (NUTCH-166) secure jobtracker info pages with a password

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-166?page=all ]

Stefan Groschupf updated NUTCH-166:
-----------------------------------

    Attachment: passwordPatch.txt

> secure jobtracker info pages with a password
> --------------------------------------------
>
>          Key: NUTCH-166
>          URL: http://issues.apache.org/jira/browse/NUTCH-166
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: passwordPatch.txt
>
> Since people often post stack traces to the mailing list that contain IP addresses, it is easy for others to view the info pages of the jobtracker.
> These pages may contain more security-critical information, such as further IP addresses and internal host names.
> Therefore this patch adds Basic password authentication to the Jetty server.
> The user name is 'admin' and the password can be configured in the Nutch configuration file.



[jira] Resolved: (NUTCH-166) secure jobtracker info pages with a password

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-166?page=all ]
     
Sami Siren resolved NUTCH-166:
------------------------------

    Resolution: Won't Fix

this is hadoop related

> secure jobtracker info pages with a password
> --------------------------------------------
>
>          Key: NUTCH-166
>          URL: http://issues.apache.org/jira/browse/NUTCH-166
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: passwordPatch.txt
>
> Since people often post stack traces to the mailing list that contain IP addresses, it is easy for others to view the info pages of the jobtracker.
> These pages may contain more security-critical information, such as further IP addresses and internal host names.
> Therefore this patch adds Basic password authentication to the Jetty server.
> The user name is 'admin' and the password can be configured in the Nutch configuration file.
