Posted to solr-user@lucene.apache.org by Oleksiy Druzhynin <me...@gmail.com> on 2013/05/26 23:41:08 UTC

split document or not

I have a document divided into paragraphs. What is the better way to add it to Solr?
As a single str field:

<field name="main">
  paragraph1
  paragraph2
  paragraph3
</field>

Or as a multivalued field:
<field name="paragraph">paragraph1</field>
<field name="paragraph">paragraph2</field>
<field name="paragraph">paragraph3</field>

Re: split document or not

Posted by Upayavira <uv...@odoko.co.uk>.
On Sun, May 26, 2013, at 10:41 PM, Oleksiy Druzhynin wrote:
> I have a document divided into paragraphs. What is the better way to add it to Solr?
> As a single str field:
> 
> <field name="main">
>   paragraph1
>   paragraph2
>   paragraph3
> </field>
> 
> Or as a multivalued field:
> <field name="paragraph">paragraph1</field>
> <field name="paragraph">paragraph2</field>
> <field name="paragraph">paragraph3</field>

Depends what you want!

Leaving aside what you want back in terms of stored fields, it won't
make a huge amount of difference - the words will still be indexed.

The main difference I can think of is to do with positionIncrementGap,
which is used to influence term positions, which is relevant for phrase
queries. Take the following sentences:

I like Summer.
Sun warms the earth.

As a single field, "summer sun" would match as a phrase query. As a
multivalued field with a positionIncrementGap of zero, I'm pretty sure
"summer sun" would also match. However, with a gap of 100 for a
multivalued field, "summer" and "sun" would be considered 101 positions
apart - as such they're not next to each other and therefore wouldn't
constitute a phrase.
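
To make the position arithmetic concrete, here is a minimal Python sketch of
how positions could be assigned across the values of a multivalued field. It
is a simplified model, not Solr's actual analysis chain: the function names
index_positions and phrase_matches are made up for illustration, and the
regex tokenizer is only a crude stand-in for a real analyzer.

```python
import re

def index_positions(values, position_increment_gap):
    """Assign a position to every token, value by value (simplified model)."""
    positions = {}   # token -> list of positions where it occurs
    position = -1    # so that the first token lands on position 0
    for value in values:
        # Crude stand-in for an analyzer: lowercase and split on non-word chars.
        tokens = re.findall(r"\w+", value.lower())
        for token in tokens:
            position += 1
            positions.setdefault(token, []).append(position)
        if tokens:
            # The gap is added between consecutive values of the same field.
            position += position_increment_gap
    return positions

def phrase_matches(positions, phrase):
    """True if the phrase terms occur at consecutive positions (no slop)."""
    terms = phrase.lower().split()
    if not terms:
        return False
    for start in positions.get(terms[0], []):
        if all(start + i in positions.get(term, []) for i, term in enumerate(terms)):
            return True
    return False

sentences = ["I like Summer.", "Sun warms the earth."]
single = index_positions([" ".join(sentences)], 100)  # one value: gap never applies
multi_gap_0 = index_positions(sentences, 0)
multi_gap_100 = index_positions(sentences, 100)

print(phrase_matches(single, "summer sun"))         # True
print(phrase_matches(multi_gap_0, "summer sun"))    # True
print(phrase_matches(multi_gap_100, "summer sun"))  # False
print(multi_gap_100["sun"][0] - multi_gap_100["summer"][0])  # 101
```

In this model "summer" sits at position 2 and, with a gap of 100, "sun" at
position 103, which is where the figure of 101 comes from.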

Upayavira

Re: split document or not

Posted by Hard_Club <me...@gmail.com>.
Do I first need to search for the whole document's ID, and then search among
its paragraphs stored as separate docs?



--
View this message in context: http://lucene.472066.n3.nabble.com/split-document-or-not-tp4066170p4066751.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: split document or not

Posted by Hard_Club <me...@gmail.com>.
But in this case the phrase frequency across the whole document will not be
taken into account, because the document is split into subdocuments. Or is
that not true?



--
View this message in context: http://lucene.472066.n3.nabble.com/split-document-or-not-tp4066170p4066734.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: split document or not

Posted by Jason Hellman <jh...@innoventsolutions.com>.
You may wish to explore the concept of using the Result Grouping (Field Collapsing) feature in which your paragraphs are individual documents that share a field to group them by (the ID of the document/book/article/whatever).

http://wiki.apache.org/solr/FieldCollapsing

This will net you absolutely isolated results for paragraphs, and give you a great deal of flexibility on how to query the results in cases where you do or do not need them grouped.
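
As a rough illustration of that layout, here is a small Python sketch that
builds one document per paragraph and a grouped query string. The field names
(id, doc_id, paragraph_no, text) and the helper names are assumptions made up
for this example; group, group.field, group.limit and wt are the standard
Result Grouping and response-format request parameters. Nothing is sent to
Solr here, the code only shows the shape of the data and of the request.

```python
from urllib.parse import urlencode

def paragraph_docs(doc_id, paragraphs):
    """One Solr document per paragraph, all sharing the parent document's ID."""
    return [
        {
            "id": f"{doc_id}_p{n}",   # hypothetical unique key per paragraph
            "doc_id": doc_id,         # hypothetical field used for grouping
            "paragraph_no": n,        # lets you find the paragraph's extra data
            "text": text,
        }
        for n, text in enumerate(paragraphs, start=1)
    ]

def grouped_query(user_query, paragraphs_per_doc=3):
    """Query string for a select request grouped by the parent document."""
    params = {
        "q": f"text:({user_query})",
        "group": "true",
        "group.field": "doc_id",
        "group.limit": paragraphs_per_doc,  # matching paragraphs kept per group
        "wt": "json",
    }
    return urlencode(params)

docs = paragraph_docs("book1", ["I like Summer.", "Sun warms the earth."])
print(docs[1]["id"])         # book1_p2
print(grouped_query("sun"))
```

Because each hit is a paragraph document, its id and paragraph_no come back
with the result, so any extra data tied to a paragraph can be looked up from
there.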

Jason


On May 28, 2013, at 3:10 PM, Hard_Club <me...@gmail.com> wrote:

> Thanks, Alexandre.
> 
> But I need to know which paragraph matched the request. I need this
> because paragraphs are bound to some extra data that I have to output on
> the result page. So I need to know the paragraph IDs. How do I bind such
> an attribute to a multivalued field?


Re: split document or not

Posted by Hard_Club <me...@gmail.com>.
Thanks, Alexandre.

But I need to know which paragraph matched the request. I need this
because paragraphs are bound to some extra data that I have to output on
the result page. So I need to know the paragraph IDs. How do I bind such
an attribute to a multivalued field?



--
View this message in context: http://lucene.472066.n3.nabble.com/split-document-or-not-tp4066170p4066629.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: split document or not

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
That depends on what you are trying to search. Start your schema
design from your _search_ requirements, not your document
requirements.

See the presentation by Gilt on how they went through different
iterations on their document schema design:
http://www.slideshare.net/trenaman/lucene-revolution-2013-adrian-trenaman
. Don't worry if you don't get half of it at first though, they are
quite advanced for many of us.

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Sun, May 26, 2013 at 5:41 PM, Oleksiy Druzhynin <me...@gmail.com> wrote:
> I have a document divided into paragraphs. What is the better way to add it to Solr?
> As a single str field:
>
> <field name="main">
>   paragraph1
>   paragraph2
>   paragraph3
> </field>
>
> Or as a multivalued field:
> <field name="paragraph">paragraph1</field>
> <field name="paragraph">paragraph2</field>
> <field name="paragraph">paragraph3</field>