Posted to user@nutch.apache.org by Max Stricker <st...@gmail.com> on 2011/08/12 13:36:54 UTC

ParseResult.put : result not added if Url contains ?,& or #

Hi, 

I need to add the result of parsing manually to my index using ParseResult.put(). 
Everything works fine and the result shows up in my Solr index afterwards, except if the 
URL (which I use as the key) includes characters like #, ? or &. 
At first I thought crawl-urlfilter.txt could be the issue and was discarding the URLs that do not match the filters, 
but I already removed the -[?*!@=] rule, with no success. 
Looking at the source code of ParseResult.java 
I cannot see why some results would be rejected, because entries are simply put into a HashMap<Text,Parse> where the Text key contains my URL. 
The URL has a form like this: www.host.com?param=val. 
What setting could cause such an issue, and how can I force such URLs into the index? 
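As a side note, a minimal sketch like the following could be used to check whether the configured URL filter plugins are the ones rejecting such a URL (assuming Nutch 1.x on the classpath; the class name FilterCheck and the test URL are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.util.NutchConfiguration;

public class FilterCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        // Runs the URL through every activated URL filter plugin
        URLFilters filters = new URLFilters(conf);
        String url = "http://www.host.com/?param=val"; // hypothetical test URL
        // filter() returns the URL if it is accepted, or null if any filter rejects it
        System.out.println(filters.filter(url));
    }
}

With a rule like -[?*!@=] still active this should print null for a query URL, and the URL itself once the rule is removed.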

In the filter method of my HtmlParseFilter implementation I add a result like this: 

parseResult.put(URL, new ParseText("myParseText"), new ParseData( 
       new ParseStatus(ParseStatus.SUCCESS), "aTitle", new Outlink[0], 
       content.getMetadata())); 

and if the URL contains none of #, ?, & everything works fine. 
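For context, here is a trimmed-down sketch of how such a put() call could sit inside an HtmlParseFilter; the class name MyParseFilter and the derived URL are illustrative, the interface signatures follow Nutch 1.x:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class MyParseFilter implements HtmlParseFilter {

    private Configuration conf;

    public ParseResult filter(Content content, ParseResult parseResult,
            HTMLMetaTags metaTags, DocumentFragment doc) {
        // Key the extra entry by a URL that carries a query string (illustrative)
        Text url = new Text(content.getUrl() + "?param=val");
        parseResult.put(url, new ParseText("myParseText"),
                new ParseData(new ParseStatus(ParseStatus.SUCCESS), "aTitle",
                        new Outlink[0], content.getMetadata()));
        return parseResult;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    public Configuration getConf() {
        return conf;
    }
}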

Any ideas? 
Thanks for any help.        

Re: ParseResult.put : result not added if Url contains ?,& or #

Posted by jasimop <st...@gmail.com>.
> 
> Did you do a complete recrawl? 

Yes I did; it does not change anything.




Re: ParseResult.put : result not added if Url contains ?,& or #

Posted by Markus Jelsma <ma...@openindex.io>.

On Friday 12 August 2011 13:36:54 Max Stricker wrote:
> Hi,
> 
> I need to add the result of parsing manually to my index using
> ParseResult.put(). Everything works fine and the result shows up in my
> Solr index afterwards, except if the URL (which I use as the key) includes
> characters like #, ? or &.
> At first I thought crawl-urlfilter.txt could be the issue and was discarding
> the URLs that do not match the filters, but I already removed the -[?*!@=]
> rule, with no success.

Did you do a complete recrawl?

> Looking at the source code of ParseResult.java
> I cannot see why some results would be rejected, because entries are simply
> put into a HashMap<Text,Parse> where the Text key contains my URL. The
> URL has a form like this: www.host.com?param=val.
> What setting could cause such an issue, and how can I force such URLs into
> the index?
> 
> In the filter method of my HtmlParseFilter implementation I add a result
> like this:
> 
> parseResult.put(URL, new ParseText("myParseText"), new ParseData(
>        new ParseStatus(ParseStatus.SUCCESS), "aTitle", new Outlink[0],
>        content.getMetadata()));
> 
> and if the URL contains none of #, ?, & everything works fine.
> 
> Any ideas?
> Thanks for any help.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350