Posted to user@nutch.apache.org by Harris Rappaport <hp...@gmail.com> on 2011/08/21 05:10:00 UTC

Searching for special characters

Hi,
On your wiki (here: https://wiki.apache.org/nutch/Features ) it says
that special characters and punctuation are treated as spaces, but it does
not say where in the code this is or how to configure it. How can I
configure Nutch not to ignore special characters?

Re: Searching for special characters

Posted by Markus Jelsma <ma...@openindex.io>.
Harris,

This depends entirely on how you set up your analyzers in Solr. Nutch sends
the parsed text verbatim to Solr, so it's likely that a filter in your Solr
analysis chain, such as a misconfigured word delimiter filter, is stripping
your non-alphanumeric characters.

Consult the Solr docs and mailing list.
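
To illustrate why the analysis chain decides whether "tree+++" and "tree"
return the same hits, here is a minimal Python sketch. It is not Solr code:
whitespace_tokens, alnum_tokens and matches are hypothetical stand-ins for a
tokenizer that splits only on whitespace and for one that treats every
non-alphanumeric character as a separator. In Solr itself, the behaviour comes
from the tokenizer and filters configured for the field type in schema.xml.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace only, so punctuation stays inside the tokens."""
    return text.lower().split()


def alnum_tokens(text):
    """Treat every non-alphanumeric character as a separator."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(query, document, analyze):
    """Return True if every analyzed query token occurs in the analyzed document."""
    doc_tokens = set(analyze(document))
    return all(token in doc_tokens for token in analyze(query))


doc = "A tree grows in the garden"

# Punctuation stripped at analysis time: "tree+++" collapses to "tree".
print(matches("tree+++", doc, alnum_tokens))       # True
# Punctuation kept: "tree+++" is a distinct token that is not in the document.
print(matches("tree+++", doc, whitespace_tokens))  # False
# A genuinely different word fails under both analyzers.
print(matches("treennn", doc, alnum_tokens))       # False
```

Under the punctuation-stripping analyzer, the query "tree+++" is reduced to
the single token "tree" before matching, which is consistent with the
identical result sets reported in the quoted message below.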

Cheers,

> Are you sure this is the case? Do I need to change any configuration
> somewhere (either to make Solr search for special characters, or to make
> sure Nutch indexes them)? The wiki says that Nutch 1.3 treats special
> characters as whitespace; is this wrong? I indexed some pages and am
> testing out the search privately using http://127.0.0.1:8983/solr/admin/ as
> shown on the Nutch tutorial page
> https://wiki.apache.org/nutch/NutchTutorial. Special characters seem
> to be ignored. For example, I can search for
> "tree" and get certain results, then try "tree\+\+\+" and get exactly the
> same results, even though the string "tree+++" does not appear anywhere, so
> shouldn't I get no results, just as if I had searched for treennn?

Re: Searching for special characters

Posted by Harris Rappaport <hp...@gmail.com>.
Are you sure this is the case? Do I need to change any configuration
somewhere (either to make Solr search for special characters, or to make
sure Nutch indexes them)? The wiki says that Nutch 1.3 treats special
characters as whitespace; is this wrong? I indexed some pages and am
testing out the search privately using http://127.0.0.1:8983/solr/admin/ as
shown on the Nutch tutorial page
https://wiki.apache.org/nutch/NutchTutorial. Special characters seem
to be ignored. For example, I can search for
"tree" and get certain results, then try "tree\+\+\+" and get exactly the
same results, even though the string "tree+++" does not appear anywhere, so
shouldn't I get no results, just as if I had searched for treennn?

On Mon, Aug 22, 2011 at 4:27 AM, Markus Jelsma
<ma...@openindex.io>wrote:

> In 1.3 search is delegated to Solr. It can happily search (or ignore)
> `special` chars.

Re: Searching for special characters

Posted by Markus Jelsma <ma...@openindex.io>.
In 1.3 search is delegated to Solr. It can happily search (or ignore) 
`special` chars. 

> I downloaded and played around a bit with 1.3 but I don't really have
> anything invested in it (so if this is easier using another version, I
> would gladly use that instead).

Re: Searching for special characters

Posted by Harris Rappaport <hp...@gmail.com>.
I downloaded and played around a bit with 1.3 but I don't really have
anything invested in it (so if this is easier using another version, I would
gladly use that instead).

On Sun, Aug 21, 2011 at 7:23 AM, Markus Jelsma
<ma...@openindex.io>wrote:

> What version of Nutch are you using?

Re: Searching for special characters

Posted by Markus Jelsma <ma...@openindex.io>.
What version of Nutch are you using?

> Hi,
> On your wiki (here: https://wiki.apache.org/nutch/Features ) it says
> that special characters and punctuation are treated as spaces, but it does
> not say where in the code this is or how to configure it. How can I
> configure Nutch not to ignore special characters?