Posted to solr-user@lucene.apache.org by CP Hennessy <cp...@openapp.ie> on 2010/03/10 01:53:16 UTC

Using SOLR

Hi,
  I'm trying to figure out whether SOLR is the component I need and, if so, 
whether I'm asking the right questions :)

I need to index a large set of multilingual documents against a 
project-specific taxonomy. 

From what I've read, SOLR should be perfect for this. 

However I'm not sure that my approach is correct. I've been able to run the 
example solr setup and index the given documents. 

Now I want to add my taxonomy (in English first), and this is where I'm 
stumbling (or not understanding the documentation).

To do this I understand that I need to define a field to store the result of 
the taxonomy analysis. I also need to define the analysis steps used to 
generate the values for this field (lowercase, synonyms, stemming, etc.).

In solr/conf/schema.xml, inside <types>, I've added:

    <fieldType name="Taxonomy" class="solr.TextField" indexed="True">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="ontology-synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"
                ignoreCase="true"/>
      </analyzer>
    </fieldType>

and 

    <field name="taxonomy" type="Taxonomy" indexed="true" stored="true"
           required="true" multiValued="true"/>

I am able to test my fieldType through the /solr/admin/analysis.jsp page and 
it seems to be doing what I expect. 
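
For readers following along, here is a rough Python stand-in for the order in which the filters in that fieldType run. It is only a sketch: the synonym and keep-word data are made up, stemming is left out, and duplicate removal is simplified (the real filter only drops duplicates at the same token position).

```python
# Rough stand-in for the analyzer chain in the fieldType above.
# SYNONYMS and KEEP_WORDS are made-up sample data, not the real
# contents of ontology-synonyms.txt or keepwords.txt.
SYNONYMS = {"auto": ["auto", "car"]}   # expand="true" keeps every variant
KEEP_WORDS = {"car", "engine"}


def analyze(text):
    """Return the tokens that survive a chain ordered like the one above."""
    tokens = text.split()                      # WhitespaceTokenizerFactory
    tokens = [t.lower() for t in tokens]       # LowerCaseFilterFactory
    expanded = []
    for t in tokens:                           # SynonymFilterFactory
        expanded.extend(SYNONYMS.get(t, [t]))
    # Stemming (SnowballPorterFilterFactory) is left out of this sketch.
    unique = list(dict.fromkeys(expanded))     # simplified duplicate removal
    return [t for t in unique if t in KEEP_WORDS]  # KeepWordFilterFactory


print(analyze("The Auto has a new Engine"))
```

One thing the ordering suggests: because the keep-word filter runs after the stemmer, the entries in keepwords.txt probably need to be in their stemmed form to match.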

When I now add a test document containing several words from the 
keepwords.txt file, the result seems to indicate that it was processed 
correctly.  

How can I get the details of what has been indexed for my file?


I also don't know how to perform a search based on the taxonomy.
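
As a starting point, a field-scoped query against the standard select handler might look like the sketch below. The host, port and path are those of the stock example server and the `id` field comes from the example schema; both are assumptions about this setup, and the helper name is invented for illustration.

```python
from urllib.parse import urlencode

# Assumed base URL of the stock example server; adjust to your install.
SOLR_SELECT = "http://localhost:8983/solr/select"


def taxonomy_query_url(term, rows=10):
    """Build a select URL that searches only the taxonomy field."""
    params = {"q": "taxonomy:" + term, "fl": "id,taxonomy", "rows": rows}
    return SOLR_SELECT + "?" + urlencode(params)


print(taxonomy_query_url("engine"))
```

The `taxonomy` value returned for each hit is the stored copy of the input, not the analyzed tokens. Terms containing spaces or query-syntax characters would need quoting or escaping before being passed in.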

Any pointers would be greatly appreciated.

Thanks in advance,
CPH

Re: Using SOLR

Posted by Erick Erickson <er...@gmail.com>.
Luke won't help you "retrieve the matched taxonomy",
it just lets you look at your index and run queries against
it....

WARNING: I haven't personally used MoreLikeThis
functionality, but it sounds like that's at least in
the ballpark if you consider your Taxonomy a document
and want the list of documents that are similar...

I don't know how you'd get the matches though...

Erick@NotVeryMuchHelp.



Re: Using SOLR

Posted by CP Hennessy <cp...@openapp.ie>.
Hi,

I may not have stated my aim clearly enough, or I may be using the wrong 
terms, so I'll restate what I want to be able to do:

- I have a fixed set of words and phrases, some of which I expect to find in 
the documents I want to process. I call this set my taxonomy.

- I have many documents to process, and the main thing I want to do is match 
their content against my taxonomy.

- For every document processed, I need to retrieve its matches with the 
taxonomy.
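
To make the goal in the three points above concrete, here is a minimal plain-Python sketch of the matching itself, outside Solr. The taxonomy entries and documents are invented, and it ignores synonyms and stemming, so it only illustrates the expected input and output.

```python
import re

# Hypothetical taxonomy entries; the real list would come from the project.
TAXONOMY = ["solar panel", "wind turbine", "grid"]


def match_taxonomy(text, taxonomy=TAXONOMY):
    """Return the taxonomy entries that occur in text, ignoring case."""
    found = []
    for entry in taxonomy:
        # \b keeps "grid" from matching inside "gridlock".
        pattern = r"\b" + re.escape(entry) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(entry)
    return found


docs = {
    "doc1": "A new Wind Turbine was connected to the grid.",
    "doc2": "Gridlock delayed the solar panel delivery.",
}
for doc_id, text in docs.items():
    print(doc_id, match_taxonomy(text))
```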

If I were able to do the above I'd be quite happy. 

Is this the type of thing that Solr is well suited to?

If so, is Luke the right mechanism for retrieving the matched taxonomy 
terms?

Thanks,
CPH


On Wed 10 Mar 2010 01:17:25 Erick Erickson wrote:
> Well, the LukeRequestHandler lets you peek at the
> index, see:
> http://wiki.apache.org/solr/LukeRequestHandler
> 
> warning: it'll take a bit for this to make lots of sense.
> 
> You can get a copy of Luke (google Lucene Luke) for
> what the above is based on, point it at your index and
> have at it.
> 
> One bit of warning though. It'll be easy to confuse
> what you stored (which is just a raw copy of
> your input) with what you indexed (which is
> what's searched on). If you're looking at either tool
> and what you see looks suspiciously like
> your raw data, look further to see if you can find
> the terms...
> 
> To answer your question about searching, it all depends
> (tm). What do you mean by Taxonomy? Different
> people use that term...er...differently. Some example
> inputs and how searching should behave in your
> problem space would be very helpful.
> 
> HTH
> Erick
> 

Re: Using SOLR

Posted by Erick Erickson <er...@gmail.com>.
Well, the LukeRequestHandler lets you peek at the
index, see:
http://wiki.apache.org/solr/LukeRequestHandler

warning: it'll take a bit for this to make lots of sense.

You can get a copy of Luke (google Lucene Luke) for
what the above is based on, point it at your index and
have at it.

One bit of warning though. It'll be easy to confuse
what you stored (which is just a raw copy of
your input) with what you indexed (which is
what's searched on). If you're looking at either tool
and what you see looks suspiciously like
your raw data, look further to see if you can find
the terms...
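
For looking at indexed terms rather than stored values, a request to the LukeRequestHandler along these lines may help. This is a sketch, not a tested recipe: the host, port and /admin/luke path assume the stock example configuration, and the helper name is invented. `fl` and `numTerms` are parameters described on the wiki page linked above.

```python
from urllib.parse import urlencode

# Assumed location of the handler on the stock example server.
LUKE_URL = "http://localhost:8983/solr/admin/luke"


def luke_terms_url(field, num_terms=20):
    """Build a URL asking the handler for the top indexed terms of a field."""
    params = {"fl": field, "numTerms": num_terms}
    return LUKE_URL + "?" + urlencode(params)


print(luke_terms_url("taxonomy"))
```

If the response lists lowercased, filtered tokens for the field, those are the indexed terms; text that looks exactly like the raw input is most likely the stored value.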

To answer your question about searching, it all depends
(tm). What do you mean by Taxonomy? Different
people use that term...er...differently. Some example
inputs and how searching should behave in your
problem space would be very helpful.

HTH
Erick
