Posted to user@nutch.apache.org by dietric <di...@gmail.com> on 2011/07/22 16:28:02 UTC

Re: Customize Tika Parser - How to access nutch Content object or is it possible to stack Parsers

Torsten,
I am trying to do the same thing: manipulating the content of a document
parsed with Tika using HTMLParseFilter. I am having trouble identifying the
proper API call in the filter implementation class. Would you be willing to
share your code, since you said you had that part working?
Thx
Dietrich
 

Torsten Krah wrote:
> 
> On Friday, 23 July 2010, at 11:12:28, Julien Nioche wrote:
> 
> For HTML this is OK and already works.
> But for non-HTML content (PDF, DOC, etc.) I did not find any filter API
> like the HTML one (e.g. a BinaryParseFilter or something similar).
> How can this be done there with a filter-like approach?
> 
> thx
> 
> Torsten
> 
> 



Re: ranking of search results

Posted by Dietrich <di...@gmail.com>.
I'd suggest a proximity search:
http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_search_for_one_term_near_another_term_.28say.2C_.22batman.22_and_.22movie.22.29
I suppose you are using the dismax handler (you should anyway).
Quote:
The dismax handler can easily create sloppy phrase queries with the pf
(phrase fields) and ps (phrase slop) parameters:

q=batman movie&pf=text&ps=100

The dismax handler also allows users to explicitly specify a phrase
query with double quotes, and the qs (query slop) parameter can be used
to add slop to any explicit phrase queries:

q="batman movie"&qs=100
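
As a rough illustration of how the parameters above end up in a request, here
is a minimal Python sketch that builds a URL-encoded dismax query string. The
endpoint URL, the function name dismax_url, the use of defType=dismax to
select the handler, and the qf value are assumptions for the example and not
part of the quoted FAQ; only q, pf, ps and qs and the field name "text" come
from it.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: a local Solr instance; adjust host, port and path to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def dismax_url(user_query, field="text", phrase_slop=100, query_slop=None):
    """Build a dismax request URL (hypothetical helper, not a Solr API).

    pf/ps boost documents where the query terms occur near each other;
    qs adds slop to phrases the user quoted explicitly.
    """
    params = {
        "defType": "dismax",  # assumption: handler chosen via defType
        "q": user_query,
        "qf": field,          # assumption: search the same field that pf boosts
        "pf": field,          # phrase field, as in the quoted example
        "ps": phrase_slop,    # phrase slop, as in the quoted example
    }
    if query_slop is not None:
        params["qs"] = query_slop  # slop for explicitly quoted phrases
    # urlencode escapes spaces and double quotes so the URL is valid.
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Sloppy phrase boost for an unquoted two-word query.
    print(dismax_url("lady gaga"))
    # Explicit phrase query with query slop.
    url = dismax_url('"batman movie"', query_slop=100)
    print(parse_qs(urlsplit(url).query))
```

With pf and ps set, documents containing both terms close together should
score higher than documents matching only one term, which is the behaviour
asked for in the original question.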



Dietrich Schmidt
914 298 0548
http://www.linkedin.com/in/dietrichschmidt



On Fri, Jul 22, 2011 at 8:01 PM,  <al...@aim.com> wrote:
> Hello,
>
> I use Nutch 1.2 and Solr to index about 3500 domains. I noticed that search results for queries with two or more keywords are not ranked properly.
> For example, for the query "Lady Gaga", some results that contain "Lady" are displayed first, then some results with both keywords, and so on. It seems to me that results containing both words should be displayed first, followed by those containing only one of the keywords.
>
> Any idea how to correct this?
>
> Thanks.
> Alex.
>

ranking of search results

Posted by al...@aim.com.
Hello,

I use Nutch 1.2 and Solr to index about 3500 domains. I noticed that search results for queries with two or more keywords are not ranked properly.
For example, for the query "Lady Gaga", some results that contain "Lady" are displayed first, then some results with both keywords, and so on. It seems to me that results containing both words should be displayed first, followed by those containing only one of the keywords.

Any idea how to correct this?

Thanks.
Alex.