Posted to solr-user@lucene.apache.org by Атанас Атанасов <at...@gmail.com> on 2013/08/28 10:14:25 UTC

Newbie SOLR question

Hello,

My name is Atanas Atanasov. I have been using SOLR 1.4/3.5/4.3 for a year
and a half and I'm really satisfied with what it provides. Searching and
indexing are extremely fast, and it is easy to work with.
However, I ran into a small problem that I can't figure out.
I'm using SOLR to store the content/text of different types of
documents (.pdf, .txt, .doc, etc.).
The whole document content is one SOLR record (all the text from all
pages of the document).
schema.xml is in the SOLR_Document_Level folder of the attached .zip file.
This worked absolutely fine, but I wanted to see the exact page or pages of
a document where the search matches are.

I redesigned it so that every page of a document is a row in the SOLR
database (schema.xml is in the SOLR_Page_Level folder of the attached .zip
file). This works well, but it resulted in the following problem:
Example: I search for (lucene AND apache). If both words are on the same
page, I get a hit and a result is returned. However, if the words are on
different pages of a document, no results are found.
My goal is to find the exact page of a document where the match is.
Dynamic fields would solve this problem, but there are very big documents
with many pages, so I don't think this is a solution.
Can you help me with some ideas on how to make it work?

Just for information: I am using SOLR as a REST service hosted in Apache,
with a .NET application to work with it.
If you have questions, please feel free to ask.

Thanks in advance and Best Regards,
Atanas Atanasov

Re: Newbie SOLR question

Posted by Атанас Атанасов <at...@gmail.com>.

Thanks a lot!

I don't know how I missed this discussion.
Thank you again!

Best regards
Atanas


On Fri, Aug 30, 2013 at 11:31 AM, Aloke Ghoshal <al...@gmail.com> wrote:

> Hi,
>
> Please refer to my response from a few months back:
>
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/%3CCAHT6S2aZ_W2AV04rdMOeeCk5e9o0k4YTktF0pjSEcsH-LLsnSw@mail.gmail.com%3E
>
> Our model indexes N + 1 records per document in Solr: N individual pages
> plus 1 record for the whole original document. Once the whole-document
> record has matched a given set of terms, the corresponding page-boundary
> cases can be handled by relaxing the page-level search condition to an OR
> (you could even add these OR matches alongside the AND matches, with a
> lower boost).
>
> Regards,
> Aloke

Re: Newbie SOLR question

Posted by Aloke Ghoshal <al...@gmail.com>.

Hi,

Please refer to my response from a few months back:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/%3CCAHT6S2aZ_W2AV04rdMOeeCk5e9o0k4YTktF0pjSEcsH-LLsnSw@mail.gmail.com%3E

Our model indexes N + 1 records per document in Solr: N individual pages
plus 1 record for the whole original document. Once the whole-document
record has matched a given set of terms, the corresponding page-boundary
cases can be handled by relaxing the page-level search condition to an OR
(you could even add these OR matches alongside the AND matches, with a
lower boost).

Regards,
Aloke
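
A minimal sketch of the two-step lookup described in this message, written
as plain parameter builders so that it does not assume any particular
client library. The field names (doc_id, type, content, page_number), the
type:document / type:page markers, and the example document id are
assumptions for illustration and are not taken from the poster's schema;
q, fq, fl, hl, hl.fl and wt are standard Solr request parameters.

```python
from urllib.parse import urlencode


def document_query(terms):
    """Step 1: find whole-document records that contain ALL terms."""
    return {
        "q": "content:(" + " AND ".join(terms) + ")",
        "fq": "type:document",  # hypothetical marker for the "+1" record
        "fl": "doc_id",
        "wt": "json",
    }


def page_query(terms, doc_id):
    """Step 2: inside one matched document, find pages containing ANY term."""
    return {
        "q": "content:(" + " OR ".join(terms) + ")",
        # Two filters: page records only, and only pages of this document.
        "fq": ["type:page", 'doc_id:"%s"' % doc_id],
        "fl": "doc_id,page_number",
        "hl": "true",
        "hl.fl": "content",
        "wt": "json",
    }


def to_query_string(params):
    # doseq=True repeats the key for list values, giving one fq per filter.
    return urlencode(params, doseq=True)


terms = ["lucene", "apache"]
step1 = document_query(terms)
step2 = page_query(terms, "report-42")
print(to_query_string(step1))
print(to_query_string(step2))
```

The first query decides whether the document matches at all; the second
only locates pages inside a document that is already known to match, so
relaxing it to OR does not bring in unrelated documents.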



On Fri, Aug 30, 2013 at 12:11 PM, Атанас Атанасов <at...@gmail.com> wrote:

> Thanks for the response. Your suggestion is to keep the existing way of
> indexing data, where every page of a document is a row in the SOLR
> database, change the "content" field to be stored-only, and add another
> index-only field (e.g. document_content) holding the whole content of the
> document. This is a good idea, but I am also using the highlighter and I
> think it won't work, since it requires the field to be stored=true. My
> problem would be solved if there were a way to search in the index-only
> field, where the whole document is indexed, but get the highlights/context
> of the match from the existing page.
> Originally my idea was to keep the data in the existing format (1 page = 1
> record) but somehow search in results grouped by document, or in some kind
> of union between the pages of a document. Is this possible?

Re: Newbie SOLR question

Posted by Атанас Атанасов <at...@gmail.com>.

Thanks for the response. Your suggestion is to keep the existing way of
indexing data, where every page of a document is a row in the SOLR
database, change the "content" field to be stored-only, and add another
index-only field (e.g. document_content) holding the whole content of the
document. This is a good idea, but I am also using the highlighter and I
think it won't work, since it requires the field to be stored=true. My
problem would be solved if there were a way to search in the index-only
field, where the whole document is indexed, but get the highlights/context
of the match from the existing page.
Originally my idea was to keep the data in the existing format (1 page = 1
record) but somehow search in results grouped by document, or in some kind
of union between the pages of a document. Is this possible?
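
One way to picture the "union between pages of a document" idea is to do
it on the client: fetch page hits with an OR query, group them by
document, and keep only the documents whose pages together cover every
search term. The sketch below works on hits that have already been
fetched and uses naive lowercase word matching rather than Solr's
analysis chain; the doc_id, page and text keys are hypothetical names,
not fields from the attached schema.

```python
import re
from collections import defaultdict


def words(text):
    """Naive tokenizer: lowercase alphanumeric runs (not Solr's analyzer)."""
    return set(re.findall(r"\w+", text.lower()))


def documents_covering_all_terms(page_hits, terms):
    """Group page hits by document and keep documents whose pages,
    taken together, contain every term.

    Returns {doc_id: {term: [page numbers where the term occurs]}}.
    """
    wanted = {t.lower() for t in terms}
    per_doc = defaultdict(lambda: defaultdict(list))
    for hit in page_hits:
        # Record every wanted term that appears on this page.
        for term in wanted & words(hit["text"]):
            per_doc[hit["doc_id"]][term].append(hit["page"])
    return {
        doc_id: {term: sorted(pages) for term, pages in found.items()}
        for doc_id, found in per_doc.items()
        if set(found) == wanted  # every term seen somewhere in the document
    }


hits = [
    {"doc_id": "A", "page": 4, "text": "Apache projects overview"},
    {"doc_id": "A", "page": 5, "text": "Lucene is a search library"},
    {"doc_id": "B", "page": 1, "text": "Only Lucene is mentioned here"},
]
result = documents_covering_all_terms(hits, ["lucene", "apache"])
print(result)
```

This keeps the one-page-per-record layout, but it moves the AND logic out
of Solr, so it is only practical when the OR query returns a manageable
number of page hits.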


On Thu, Aug 29, 2013 at 4:45 PM, Alexandre Rafalovitch
<ar...@gmail.com> wrote:

> Assuming you want both pages to match, you need the text to be present on
> both pages. Do you actually return/store the text of the page in Solr? If
> so, you can make that 'page' field stored-only and add another field that
> is index-only, into which you put all your matching logic. That index-only
> field can then contain the page plus another line/paragraph/page on each
> side.
>
> Regards,
>    Alex.

Re: Newbie SOLR question

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

Assuming you want both pages to match, you need the text to be present on
both pages. Do you actually return/store the text of the page in Solr? If
so, you can make that 'page' field stored-only and add another field that
is index-only, into which you put all your matching logic. That index-only
field can then contain the page plus another line/paragraph/page on each
side.

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
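
A rough sketch of how the records for this layout might be assembled
before indexing, assuming a full neighbouring page is added on each side
(the message leaves the choice between a line, a paragraph or a page
open). The field names page_text and match_text, the id scheme, and the
sample pages are all hypothetical. In schema terms, page_text would be
stored but not indexed and match_text indexed but not stored.

```python
def build_page_records(doc_id, pages):
    """Build one record per page.

    page_text is what would be stored and returned; match_text is what
    would be indexed for searching and also includes the neighbouring
    pages.
    """
    records = []
    for i, text in enumerate(pages):
        start = max(0, i - 1)
        neighbours = pages[start:i + 2]  # previous, current, next page
        records.append({
            "id": "%s_p%d" % (doc_id, i + 1),  # hypothetical id scheme
            "doc_id": doc_id,
            "page_number": i + 1,
            "page_text": text,
            "match_text": " ".join(neighbours),
        })
    return records


pages = ["intro to apache", "lucene internals", "appendix"]
records = build_page_records("report-42", pages)
for r in records:
    print(r["page_number"], "|", r["match_text"])
```

With these sample pages, a search for (lucene AND apache) against
match_text would match the records for pages 1 and 2, because each of
them sees its neighbour's text. Terms that are more than one page apart
would still not match together.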


On Thu, Aug 29, 2013 at 2:49 PM, Alexandre Rafalovitch
<ar...@gmail.com> wrote:

> So, if the match spans pages 4 and 5, what do you want returned? Page 4,
> page 5, or both?
>
> Regards,
>      Alex

Re: Newbie SOLR question

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

So, if the match spans pages 4 and 5, what do you want returned? Page 4,
page 5, or both?

Regards,
     Alex