Posted to solr-user@lucene.apache.org by Michael Imbeault <mi...@sympatico.ca> on 2006/11/13 00:53:37 UTC

Sentence level searching

Hello everyone,

I'm trying to do some sentence-level searching with Solr; basically, I
want to find out whether two words occur in the same sentence. As I read
on the Lucene mailing list, there are many ways to do this, including but
not limited to:

- Inserting special boundary terms to denote the start and end of a
  sentence. It is unclear to me what kind of query should be used to fetch
  results from within one sentence (something like: start_sentence_token
  word1 word2 end_sentence_token)?
- Increasing the token position at a sentence boundary by a large amount
  (1000?) so that "x y"~500 (or more) won't match across sentence
  boundaries.

Is there an existing filter class that I could use to do this, or should 
I first parse my text fields with PHP and some NLP tool, and index the 
result (for the first case)? For the second case (increment token 
position), how should I do this within Solr?

Are there any plans to implement such functionality as standard?

Thanks for the help,

-- 
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212


Re: Sentence level searching

Posted by Naveen33 <na...@gmail.com>.
Hi Michael,
What you are looking for can be achieved in Solr, but not directly.
You will have to write a custom query parser that builds on the Lucene
query parser, and in that parser you will have to use span queries:

SpanQuery1 - your terms (term1, term2, ... termN), a range (slop) such as
50, and the ordered flag set to true or false.
SpanQuery2 - your sentence start and sentence end markers, for example
#sb# and #se#. This second span query is "#se# #sb#" with length -1 and
ordered set to true.
The final span query is SpanQuery1 minus SpanQuery2.

In this way you will be able to achieve what you want. It works; I have
tried it.
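
A minimal sketch of that "near, but not across a boundary" logic, written in plain Python rather than against the Lucene API (in Lucene the corresponding building blocks would be SpanNearQuery and SpanNotQuery). The function name same_sentence_spans and the sample tokens are made up for illustration; only the #sb# and #se# markers come from the description above.

```python
def same_sentence_spans(tokens, term1, term2, slop=50,
                        boundaries=("#sb#", "#se#")):
    """Return (start, end) index pairs where term1 precedes term2 with at
    most `slop` tokens in between and no boundary marker between them."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != term1:
            continue
        # j - i - 1 is the number of tokens between the two terms.
        for j in range(i + 1, min(len(tokens), i + slop + 2)):
            if tokens[j] == term2:
                between = tokens[i + 1:j]
                # Model of the "minus" step: drop spans crossing a boundary.
                if not any(b in between for b in boundaries):
                    hits.append((i, j))
    return hits


tokens = "#sb# solr uses lucene #se# #sb# spans are fast #se#".split()
print(same_sentence_spans(tokens, "solr", "lucene"))   # same sentence
print(same_sentence_spans(tokens, "lucene", "spans"))  # crosses a boundary
```

This only models the ordered case; an unordered variant would also have to look for term2 before term1.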


Thanks & Regards
Naveen
India




Re: Sentence level searching

Posted by Michael Imbeault <mi...@sympatico.ca>.
So basically it's just as I thought it was, thanks for the help :) I had
checked the wiki before asking, but it lacks details and is often vague,
or presupposes that you know some specific terms without
explaining them. It's all clear now, thanks to you ;)

Michael Imbeault




Re: Sentence level searching

Posted by Chris Hostetter <ho...@fucit.org>.
: Thanks for the answer Yonik; I forgot about multivalued fields! I'm not
: exactly sure of how to add multiple values to a single field (aside from
: copyField). The code I'm thinking of using:

If you look at the exampledocs, "features" and "cat" are both multivalued
fields... you just list multiple <field>s with the same name in your

: Field in schema.xml: <field name="abstract" type="text" indexed="true"
: stored="false" multiValued="true" />
:
: Where am I supposed to configure the value of the gap?
: positionIncrementGap in the fieldtype definition is my guess, but I'm

correct.

: not sure. Also, am I supposed to put multivalued in the fieldtype
: definition? Alternatively, could I put positionIncrementGap in the
: <field> that I posted just above?

I *think* positionIncrementGap has to be set on the fieldtype ... but
I'm not 100% certain of that.

multiValued and the other field attributes (indexed, stored,
compressed, omitNorms) can be set on the field or inherited from the
fieldtype.

More info can be found in the comments of the example schema.xml, as well
as these wiki pages...

http://wiki.apache.org/solr/SchemaXml
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters


-Hoss


Re: Sentence level searching

Posted by Michael Imbeault <mi...@sympatico.ca>.
Hello everyone,
> Solr puts a configurable gap between values of the same field, so you
> could index every sentence as a separate value of a multi-valued
> field.
Thanks for the answer Yonik; I forgot about multivalued fields! I'm not
exactly sure of how to add multiple values to a single field (aside from
copyField). The code I'm thinking of using:

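    // PHP code to build the XML: one <field name="abstract"> element
    // per sentence, each appended to the document inside the loop.
    // Assumes $dom is a DOMDocument, $doc is the <doc> element, and
    // $sentences (a hypothetical name) holds the abstract split into
    // sentences.
    foreach ($sentences as $sentence) {
        $abstract_element = $dom->createElement('field');
        $abstract_element->setAttribute('name', 'abstract');
        $abstract_text = $dom->createTextNode($sentence);
        $abstract_element->appendChild($abstract_text);
        $doc->appendChild($abstract_element);
    }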

Field in schema.xml: <field name="abstract" type="text" indexed="true"
stored="false" multiValued="true" />

Where am I supposed to configure the value of the gap? 
positionIncrementGap in the fieldtype definition is my guess, but I'm 
not sure. Also, am I supposed to put multivalued in the fieldtype 
definition? Alternatively, could I put positionIncrementGap in the 
<field> that I posted just above?

Thanks for the help,
Michael


Re: Sentence level searching

Posted by Yonik Seeley <yo...@apache.org>.
On 11/12/06, Michael Imbeault <mi...@sympatico.ca> wrote:
> I'm trying to do some sentence-level searching with Solr; basically, I
> want to find out whether two words occur in the same sentence. As I read
> on the Lucene mailing list, there are many ways to do this, including but
> not limited to:
>
> - Inserting special boundary terms to denote the start and end of a
>   sentence. It is unclear to me what kind of query should be used to fetch
>   results from within one sentence (something like: start_sentence_token
>   word1 word2 end_sentence_token)?

Span queries... but there isn't really query parser support for them.

> - Increasing the token position at a sentence boundary by a large amount
>   (1000?) so that "x y"~500 (or more) won't match across sentence
>   boundaries.

That's probably the easiest and simplest.
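
To see why this works, here is a small plain-Python model (not Solr or Lucene code) of token positions with a gap at each sentence boundary. The function names are made up, and the slop check |pos(y) - pos(x) - 1| <= slop is a simplified model of how a two-term sloppy phrase query such as "x y"~500 is evaluated, so treat it as an illustration only.

```python
def assign_positions(sentences, gap=1000):
    """Return (token, position) pairs; every sentence after the first
    starts `gap` extra positions later."""
    positions = []
    pos = 0
    for i, sentence in enumerate(sentences):
        if i > 0:
            pos += gap  # the jump at the sentence boundary
        for token in sentence.lower().split():
            positions.append((token, pos))
            pos += 1
    return positions


def sloppy_match(positions, x, y, slop):
    """Simplified two-term sloppy phrase check for the query "x y"~slop."""
    xs = [p for t, p in positions if t == x]
    ys = [p for t, p in positions if t == y]
    return any(abs(py - px - 1) <= slop for px in xs for py in ys)


sentences = ["the cat sat", "a dog barked"]
with_gap = assign_positions(sentences, gap=1000)
print(sloppy_match(with_gap, "cat", "sat", 500))  # same sentence
print(sloppy_match(with_gap, "cat", "dog", 500))  # different sentences
```

The slop has to stay below the gap; with no gap at all, "cat dog"~500 would match across the boundary.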

> Is there an existing filter class that I could use to do this, or should
> I first parse my text fields with PHP and some NLP tool, and index the
> result (for the first case)? For the second case (increment token
> position), how should I do this within Solr?

Solr puts a configurable gap between values of the same field, so you
could index every sentence as a separate value of a multi-valued
field.
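
As a rough sketch of what "every sentence as a separate value" looks like on the indexing side, the following Python builds a Solr XML update message with one <field> element per sentence. The field name "abstract" is the one used elsewhere in this thread; the "id" field, the function name and the sample sentences are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc_id, sentences):
    """Build an <add><doc>...</doc></add> message with one 'abstract'
    field per sentence (assumes the schema has 'id' and 'abstract')."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    id_field = ET.SubElement(doc, "field", name="id")
    id_field.text = doc_id
    for sentence in sentences:
        # Repeating the same field name yields multiple values.
        field = ET.SubElement(doc, "field", name="abstract")
        field.text = sentence
    return ET.tostring(add, encoding="unicode")


print(build_add_xml("1", ["The cat sat.", "A dog barked."]))
```

The field has to be declared multiValued in schema.xml for Solr to accept the repeated values.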

A better solution would be to have a tokenizer that could detect the
end of sentences and either insert a gap or a special token that
another filter could act on.
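
A minimal model of the second variant (a special token that a later filter turns into a gap), again in plain Python and not tied to any Lucene class. The marker "#eos#", the gap of 1000 and both function names are assumptions chosen for illustration.

```python
def boundary_to_gap(tokens, boundary="#eos#", gap=1000):
    """Yield (token, position_increment) pairs. Boundary markers are
    dropped and their gap is added to the next real token's increment."""
    pending = 0
    for token in tokens:
        if token == boundary:
            pending += gap
            continue
        yield token, 1 + pending
        pending = 0


def absolute_positions(stream):
    """Turn position increments into absolute positions (first token at 0)."""
    pos = -1
    out = []
    for token, inc in stream:
        pos += inc
        out.append((token, pos))
    return out


tokens = ["cat", "sat", "#eos#", "dog", "ran"]
print(absolute_positions(boundary_to_gap(tokens)))
```

Keeping the sentence detection in the tokenizer and the gap logic in a separate filter means the gap size can be changed without touching the detection step.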

-Yonik