Posted to solr-user@lucene.apache.org by Jason Funk <ja...@gmail.com> on 2013/04/23 23:02:37 UTC

Book text with chapter line number

Hello.

I'm trying to figure out if Solr is going to work for a new project that I want to build. At its heart it's a book text searching application. Each book is broken into chapters and each chapter is broken into lines. I want to be able to search these books, return relevant sections of the book, and display the results with chapter and line number. I'm not sure how I would structure my data so that it's efficient and functional. I could simply treat each line of text as a document, which would provide some of the functionality, but what if the search query spanned two lines? Then it seems the passage the user was searching for wouldn't be returned. I could treat each book as a document and use highlighting to find the context, but that seems to limit weighting/results for best matches and makes it difficult to find chapter/line numbers. What is the best way to do this with Solr?

Is there a better tool to use to solve my problem?

Re: Book text with chapter line number

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
It seems that the normal use case is line=document with some exception
for cross-line indexing.

The edge case could be solved either by indexing additional 'two-line'
documents with a lower boost or by adding a 'context' field holding the
line before/after where applicable (e.g. within the same para).  Then
there might also be some trick around using the highlighter to figure
out whether the match came from the 'line' field or the 'context' field.
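
As a rough illustration of the 'context' field variant, here is a minimal
Python sketch that builds one document per line and copies the neighbouring
lines into a second field. The field names (id, book, chapter, line_no,
line, context) and the id format are assumptions made for this example, not
an established schema, and the sketch ignores paragraph boundaries.

```python
import json

def build_line_docs(book_id, chapter, lines, first_line_no=1):
    """Build one Solr-style document (a plain dict) per line of a chapter.

    'line' holds the line's own text; 'context' holds the previous line,
    the line itself and the next line, so a phrase that straddles a line
    break can still match the 'context' field.
    All field names here are hypothetical.
    """
    docs = []
    for i, text in enumerate(lines):
        prev_text = lines[i - 1] if i > 0 else ""
        next_text = lines[i + 1] if i + 1 < len(lines) else ""
        line_no = first_line_no + i
        docs.append({
            "id": f"{book_id}-{chapter}-{line_no}",
            "book": book_id,
            "chapter": chapter,
            "line_no": line_no,
            "line": text,
            "context": " ".join(t for t in (prev_text, text, next_text) if t),
        })
    return docs

docs = build_line_docs("moby-dick", 1, [
    "Call me Ishmael. Some years ago, never mind",
    "how long precisely, having little or no money",
    "in my purse, I set out to sea.",
])
print(json.dumps(docs[1], indent=2))
```

At query time, one option would be to search both fields and weight 'line'
above 'context', for instance with the edismax parser and something like
qf=line^2 context, so that single-line matches rank ahead of cross-line ones.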

I also like the payload idea, though there does not seem to be much
information around on using it.

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Wed, Apr 24, 2013 at 10:28 AM, Paul Libbrecht <pa...@hoplahup.net> wrote:
> It would then be easy to store a map of "term position" to line number and page number along with each paragraph, wouldn't it?
>
> Paul

Re: Book text with chapter line number

Posted by Paul Libbrecht <pa...@hoplahup.net>.
It would then be easy to store a map of "term position" to line number and page number along with each paragraph, wouldn't it?
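
One way to read this suggestion, sketched in Python: for each paragraph,
record the term position at which every line starts, then look a matching
position up with a binary search. The sketch assumes whitespace
tokenisation; positions produced by a real Solr analyzer (stop words,
word splitting and so on) may differ, so the map would have to be built with
the same analysis chain. Page numbers are left out, and the function names
are made up for the example.

```python
from bisect import bisect_right

def build_position_map(lines, first_line_no):
    """Return (paragraph_text, starts, line_numbers).

    starts[k] is the 0-based term position (whitespace tokens) at which
    line line_numbers[k] begins inside the paragraph.
    """
    starts, line_numbers, position = [], [], 0
    for offset, line in enumerate(lines):
        tokens = line.split()
        if not tokens:
            continue  # blank lines contribute no term positions
        starts.append(position)
        line_numbers.append(first_line_no + offset)
        position += len(tokens)
    return " ".join(lines), starts, line_numbers

def line_for_position(starts, line_numbers, term_position):
    """Map a term position inside the paragraph back to its line number."""
    if term_position < 0:
        raise ValueError("term_position must be non-negative")
    index = bisect_right(starts, term_position) - 1
    return line_numbers[index]

text, starts, line_numbers = build_position_map([
    "Call me Ishmael. Some years ago, never mind",
    "how long precisely, having little or no money",
    "in my purse, I set out to sea.",
], first_line_no=10)

# Term position 8 is the first word of the second line ("how").
hit_line = line_for_position(starts, line_numbers, 8)
print(hit_line)
```

The starts and line_numbers lists are small, so they could be kept in a
stored field next to the paragraph text and consulted only when rendering
results.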

Paul


On 24 Apr 2013, at 16:24, Timothy Potter wrote:

> Chapter seems too broad and line seems too narrow -- have you thought
> about paragraph level? Something like:
> 
> docID, book fields (title, author, publisher, etc), chapter fields (#,
> title, pages, etc), section fields (title, #, etc), sub-sectionN
> fields, paragraph text, lines
> 
> Seems like line #'s would only be useful for display so just store the
> lines the paragraph covers.


Re: Book text with chapter line number

Posted by Timothy Potter <th...@gmail.com>.
Chapter seems too broad and line seems too narrow -- have you thought
about paragraph level? Something like:

docID, book fields (title, author, publisher, etc), chapter fields (#,
title, pages, etc), section fields (title, #, etc), sub-sectionN
fields, paragraph text, lines

Seems like line #'s would only be useful for display so just store the
lines the paragraph covers.
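
A minimal Python sketch of the paragraph-level layout, assuming that
paragraphs are separated by blank lines and that line numbers count every
line of the chapter, blank ones included. The field names and the id format
are placeholders for illustration; section and sub-section fields are left
out for brevity.

```python
def build_paragraph_docs(book, chapter_no, chapter_lines):
    """Group a chapter's lines into one document per paragraph.

    Each document stores the paragraph text plus the first and last line
    number it covers, which is only needed for display.
    """
    docs = []
    current, start = [], None
    # The trailing "" acts as a sentinel that flushes the last paragraph.
    for line_no, text in enumerate(list(chapter_lines) + [""], start=1):
        if text.strip():
            if start is None:
                start = line_no
            current.append(text.strip())
        elif current:
            docs.append({
                "id": f"{book['title']}/{chapter_no}/{start}",
                "book_title": book["title"],
                "book_author": book["author"],
                "chapter_no": chapter_no,
                "line_start": start,
                "line_end": line_no - 1,
                "text": " ".join(current),
            })
            current, start = [], None
    return docs

docs = build_paragraph_docs(
    {"title": "Moby-Dick", "author": "Herman Melville"},
    1,
    ["Call me Ishmael.", "Some years ago.", "", "There now is your", "insular city."],
)
for doc in docs:
    print(doc["line_start"], doc["line_end"], doc["text"])
```

Because a phrase rarely spans paragraphs, this avoids most of the
cross-line matching problem, at the cost of reporting a line range rather
than an exact line.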



On Tue, Apr 23, 2013 at 7:51 PM, Walter Underwood <wu...@wunderwood.org> wrote:
> If you can represent your books in XML, then MarkLogic could do the job very cleanly. It isn't free, but it is very good.
>
> wunder

Re: Book text with chapter line number

Posted by Walter Underwood <wu...@wunderwood.org>.
If you can represent your books in XML, then MarkLogic could do the job very cleanly. It isn't free, but it is very good.

wunder

On Apr 23, 2013, at 6:47 PM, Jason Funk wrote:

> Is there a better tool than Solr to use for my situation?

--
Walter Underwood
wunder@wunderwood.org




Re: Book text with chapter line number

Posted by Jason Funk <ja...@gmail.com>.
Is there a better tool than Solr to use for my situation?


On Apr 23, 2013, at 5:04 PM, Jack Krupansky <ja...@basetechnology.com> wrote:

> There is no simple, obvious, and direct approach, right out of the box. Sure, you can highlight passages of raw text, right out of the box, but that won't give you chapters, pages, and line numbers. To do all of that, you would have to either:
> 
> 1. Add chapter, page, and line number as part of the payload for each word. And add some custom document transformers to access the information.
> or
> 2. Index each line as a separate Solr document, with fields for book, chapter, page, and line number.
> 
> -- Jack Krupansky


Re: Book text with chapter line number

Posted by Jack Krupansky <ja...@basetechnology.com>.
There is no simple, obvious, and direct approach, right out of the box. 
Sure, you can highlight passages of raw text, right out of the box, but that 
won't give you chapters, pages, and line numbers. To do all of that, you 
would have to either:

1. Add chapter, page, and line number as part of the payload for each word, 
and add some custom document transformers to access the information.
or
2. Index each line as a separate Solr document, with fields for book, 
chapter, page, and line number.
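
To make option 1 a little more concrete, here is a hedged Python sketch of
the indexing side only. It packs chapter and line number into one integer
and writes each word as word|payload, the delimited form that a field
using Solr's DelimitedPayloadTokenFilterFactory with the integer encoder
can parse. The packing scheme, the LINES_PER_CHAPTER bound and the function
names are assumptions for this example; page number is omitted, and reading
the payloads back at query time still needs the custom code mentioned above.

```python
LINES_PER_CHAPTER = 100_000  # assumed upper bound on lines in one chapter

def pack(chapter, line):
    """Pack chapter and line number into a single integer payload."""
    if not 0 <= line < LINES_PER_CHAPTER:
        raise ValueError("line number out of range for this packing scheme")
    return chapter * LINES_PER_CHAPTER + line

def unpack(payload):
    """Inverse of pack(): return (chapter, line)."""
    return divmod(payload, LINES_PER_CHAPTER)

def to_delimited_payloads(chapter, numbered_lines, delimiter="|"):
    """Render (line_no, text) pairs as 'word|payload' tokens.

    Words are split on whitespace; a word that itself contains the
    delimiter would need escaping, which this sketch does not handle.
    """
    tokens = []
    for line_no, text in numbered_lines:
        payload = pack(chapter, line_no)
        tokens.extend(f"{word}{delimiter}{payload}" for word in text.split())
    return " ".join(tokens)

field_value = to_delimited_payloads(3, [(41, "Call me"), (42, "Ishmael")])
print(field_value)
print(unpack(300042))
```

Option 2 needs no custom code at all, which is probably why it is the more
common starting point.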

-- Jack Krupansky

-----Original Message----- 
From: Jason Funk
Sent: Tuesday, April 23, 2013 5:02 PM
To: solr-user@lucene.apache.org
Subject: Book text with chapter line number

Hello.

I'm trying to figure out if Solr is going to work for a new project that I 
am wanting to build. At its heart it's a book text searching application. 
Each book is broken into chapters and each chapter is broken into lines. I 
want to be able to search these books and return relevant sections of the 
book and display the results with chapter and line number. I'm not sure how 
I would structure my data so that it's efficient and functional. I could 
simply treat each line of text as a document which would provide some of the 
functionality but what if the search query spanned two lines? Then it seems 
the passage the user was searching for wouldn't be returned. I could treat 
each book as a document and use highlighting to find the context but that 
seems to limit weighting/results for best matches and makes it difficult to 
find chapter/line numbers. What is the best way to do this with Solr?

Is there a better tool to use to solve my problem?