Posted to solr-user@lucene.apache.org by Oleksiy Druzhynin <me...@gmail.com> on 2013/05/26 23:41:08 UTC
split document or not
I have a document divided into paragraphs. What is the better way to add it to Solr?
As a single string field:
<field name="main">
paragraph1
paragraph2
paragraph3
</field>
Or as a multivalued field:
<field name="paragraph">paragraph1</field>
<field name="paragraph">paragraph2</field>
<field name="paragraph">paragraph3</field>
Re: split document or not
Posted by Upayavira <uv...@odoko.co.uk>.
On Sun, May 26, 2013, at 10:41 PM, Oleksiy Druzhynin wrote:
> I have a document divided into paragraphs. What is the better way to add it to Solr?
> As a single string field:
>
> <field name="main">
> paragraph1
> paragraph2
> paragraph3
> </field>
>
> Or as a multivalued field:
> <field name="paragraph">paragraph1</field>
> <field name="paragraph">paragraph2</field>
> <field name="paragraph">paragraph3</field>
Depends what you want!
Leaving aside what you want back in terms of stored fields, it won't
make a huge amount of difference - the words will still be indexed.
The main difference I can think of is to do with positionIncrementGap,
which is used to influence term positions, which is relevant for phrase
queries. Take the following sentences:
I like Summer.
Sun warms the earth.
As a single field, "summer sun" would match as a phrase query. As a
multivalued field with a positionIncrementGap of zero, I'm pretty sure
"summer sun" would also match. However, with a gap of 100 for a
multivalued field, "summer" and "sun" would be considered 101 positions
apart - as such they're not next to each other and therefore wouldn't
constitute a phrase.
Upayavira
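The effect of positionIncrementGap described above can be sketched with a small simulation. This is a simplified model rather than Lucene's actual indexing code: the tokenizer, the function names, and the assumption that the gap is added once between consecutive values are all illustrative.

```python
# Simplified model of how term positions are assigned across the values of a
# multivalued field. This is an illustration, not Lucene's actual code.

def tokenize(text):
    """Lowercase and strip punctuation, roughly like a standard text analyzer."""
    return [w.strip(".,!?").lower() for w in text.split()]

def index_positions(values, position_increment_gap):
    """Return (term, position) pairs for all values of one field."""
    postings = []
    position = -1
    for i, value in enumerate(values):
        if i > 0:
            # Assumed behaviour: the gap is added once between consecutive values.
            position += position_increment_gap
        for term in tokenize(value):
            position += 1
            postings.append((term, position))
    return postings

def phrase_matches(postings, phrase):
    """True if the phrase terms occur at consecutive positions."""
    terms = tokenize(phrase)
    by_position = {pos: term for term, pos in postings}
    return any(
        all(by_position.get(start + k) == term for k, term in enumerate(terms))
        for start in by_position
    )

paragraphs = ["I like Summer.", "Sun warms the earth."]
for gap in (0, 100):
    postings = index_positions(paragraphs, gap)
    print(gap, phrase_matches(postings, "summer sun"))
```

With a gap of 0 the two values behave like one continuous text, so the phrase matches; with a gap of 100 the model places "sun" 101 positions after "summer", so it does not.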
Re: split document or not
Posted by Hard_Club <me...@gmail.com>.
Do I need to first search for the whole document's ID and then search among
its paragraphs, which are stored as separate docs?
--
View this message in context: http://lucene.472066.n3.nabble.com/split-document-or-not-tp4066170p4066751.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: split document or not
Posted by Hard_Club <me...@gmail.com>.
But in this case the phrase frequency across the whole document will not be
taken into account, because the document is split into subdocuments. Or is
that not true?
--
View this message in context: http://lucene.472066.n3.nabble.com/split-document-or-not-tp4066170p4066734.html
Re: split document or not
Posted by Jason Hellman <jh...@innoventsolutions.com>.
You may wish to explore the Result Grouping (Field Collapsing) feature: index your paragraphs as individual documents that share a field to group them by (the ID of the document/book/article/whatever).
http://wiki.apache.org/solr/FieldCollapsing
This gives you fully isolated results for paragraphs, and a great deal of flexibility in how you query, whether or not you need the results grouped.
Jason
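As a rough sketch of the approach Jason describes, the snippet below builds one document per paragraph and a grouped query string. The field names (doc_id, paragraph_id, text) and the helper functions are hypothetical and would have to match your own schema; group, group.field and group.limit are the Result Grouping request parameters.

```python
from urllib.parse import urlencode

def paragraph_docs(doc_id, paragraphs):
    """One Solr document per paragraph, all sharing the parent's ID."""
    return [
        {
            "id": f"{doc_id}-p{n}",   # unique key for the paragraph document
            "doc_id": doc_id,          # hypothetical field used for grouping
            "paragraph_id": n,         # hypothetical field tying back to extra data
            "text": text,              # hypothetical searchable text field
        }
        for n, text in enumerate(paragraphs, start=1)
    ]

def grouped_query(user_query, group_field="doc_id", per_group=3):
    """Query string asking Solr to group paragraph hits by parent document."""
    return urlencode({
        "q": user_query,
        "group": "true",
        "group.field": group_field,
        "group.limit": per_group,
    })

docs = paragraph_docs("book42", ["I like Summer.", "Sun warms the earth."])
print(docs[1]["id"])
print(grouped_query('text:"summer sun"'))
```

Because each hit is a paragraph document, its paragraph_id comes back with the result and can be used to look up the extra data attached to that paragraph.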
On May 28, 2013, at 3:10 PM, Hard_Club <me...@gmail.com> wrote:
> Thanks, Alexandre.
>
> But I need to know which paragraph matched the request. I need this
> because the paragraphs are bound to some extra data that I need to output
> on the results page, so I need to know the paragraph IDs. How can I bind
> such an attribute to a multivalued field?
Re: split document or not
Posted by Hard_Club <me...@gmail.com>.
Thanks, Alexandre.
But I need to know which paragraph matched the request. I need this
because the paragraphs are bound to some extra data that I need to output
on the results page, so I need to know the paragraph IDs. How can I bind
such an attribute to a multivalued field?
--
View this message in context: http://lucene.472066.n3.nabble.com/split-document-or-not-tp4066170p4066629.html
Re: split document or not
Posted by Alexandre Rafalovitch <ar...@gmail.com>.
That depends on what you are trying to search. Start your schema
design from your _search_ requirements, not your document
requirements.
See the presentation by Gilt on how they went through different
iterations on their document schema design:
http://www.slideshare.net/trenaman/lucene-revolution-2013-adrian-trenaman
Don't worry if you don't get half of it at first, though; they are
quite advanced for many of us.
Regards,
Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working. (Anonymous - via GTD
book)
On Sun, May 26, 2013 at 5:41 PM, Oleksiy Druzhynin <me...@gmail.com> wrote:
> I have a document divided into paragraphs. What is the better way to add it to Solr?
> As a single string field:
>
> <field name="main">
> paragraph1
> paragraph2
> paragraph3
> </field>
>
> Or as a multivalued field:
> <field name="paragraph">paragraph1</field>
> <field name="paragraph">paragraph2</field>
> <field name="paragraph">paragraph3</field>