Posted to java-user@lucene.apache.org by "logic.cpp" <lo...@gmail.com> on 2011/11/17 21:31:56 UTC

Best document format / markup for text indexing?

tl;dr version:

We're converting tons (hundreds of thousands?) of books into digital text.

What is the best format/markup/ebook standard/document standard/other to use for easiest and best text search support?

***

Longer version:

The following are some desired user-experience features of the project; they will probably influence how the content should be stored:

- Granular access to the text content.
Users would be able to fetch a specific phrase in a specific paragraph on a specific page in a specific chapter of a specific book. (A 'document' may consist of a single chapter of a book.)

- Cross-referencing.
Most likely achieved through an RDBMS: users should be able to follow references to/from content that mentions a topic or quotes related content in other books.
(Similar to Wikipedia articles linking to one another.)

- Full text search
This is probably where Lucene comes in.


So which format/markup/standard would allow software to easily fetch and cross-reference granular bits of data, while also being easily indexable by Lucene?

Would it be better to store all the books' digital text directly in the RDBMS? If so, can Lucene index such data?

Thanks

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Best document format / markup for text indexing?

Posted by "logic.cpp" <lo...@gmail.com>.
Thank you for the help; I will see where this leads me.






Re: Best document format / markup for text indexing?

Posted by Michael Sokolov <so...@ifactory.com>.
In my experience, books and other semi-structured text documents are 
best handled as XML. There are many different XML "vocabularies" 
for doing this, each of which has benefits for different kinds of 
documents. You should probably look at TEI, NLM Book, and DocBook 
though - these are some widely used standard formats for capturing 
structured book-type texts. There are other standards for journal 
articles and other kinds of documents.

The question of how to store, index and retrieve the kind of information 
and structure captured by XML documents has gotten a lot of attention, 
too.  There are XML-specific data stores such as MarkLogic and eXist 
(which uses Lucene for full text search).  Or you could consider 
"rolling your own" with something like Solr/Lucene as a search index.  
Because you're posting on this list, I assume you're considering the 
last option, which is a good one, but will require some development 
effort as you consider how to map document structures into indexes, how 
to preserve document structure when you highlight query terms, etc.

-Mike Sokolov

PS - if you are interested in professional help, please consider our 
platform (pubfactory.net) and drop me an e-mail.
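To make the "map document structures into indexes" step concrete, here is a minimal sketch, assuming a toy DocBook-like input. The element names, attribute names, and field names are illustrative only, not taken from any particular standard; in a real system each field map would become a Lucene Document (e.g. StringField for the ids, TextField for the body) handed to an IndexWriter:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.*;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;

public class ChapterMapper {
    // Parse a toy, DocBook-like XML book and emit one field map per chapter.
    // Each map is the shape of a would-be Lucene Document: book title,
    // chapter number, and the chapter's full text.
    public static List<Map<String, String>> chapters(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        String bookTitle = doc.getDocumentElement().getAttribute("title");
        NodeList chapters = doc.getElementsByTagName("chapter");
        List<Map<String, String>> out = new ArrayList<>();
        for (int i = 0; i < chapters.getLength(); i++) {
            Element ch = (Element) chapters.item(i);
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("book", bookTitle);
            fields.put("chapter", String.valueOf(i + 1)); // position as a stable id
            fields.put("body", ch.getTextContent().trim());
            out.add(fields);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<book title='Moby-Dick'>"
                   + "<chapter><p>Call me Ishmael.</p></chapter>"
                   + "<chapter><p>Some years ago.</p></chapter></book>";
        for (Map<String, String> f : chapters(xml)) {
            System.out.println(f);
        }
    }
}
```

The same walk is where you would also record paragraph and page boundaries if you need highlighting to land back inside the original structure.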







RE: Best document format / markup for text indexing?

Posted by Paul Allan Hill <pa...@metajure.com>.
> What is the best format/markup/ebook standard/document standard/other to use for easiest and best text search support?

The helpful Tika libraries can parse any number of formats and extract text for indexing into Lucene, so I'm thinking the real question is which format is better when you want to display the document.

It seems you need to ask what a "document" is as far as Lucene is concerned. Possibly the answer is each sentence (not the chapter), because I'm wondering if fundamentally the user wants to see each line and its references to other lines in this or other documents, but also to view the whole document when needed.
So then you need:
1. A nice viewable version of each file (chapter).
2. Table(s) (in the RDBMS) that can cross-link every verse/sentence/line to every other. Isn't that how cross-references work - at the sentence level?
3. Table(s) (in the RDBMS) that link each sentence to its chapter, book, and work (or alternatively some field(s) in Lucene that identify the context).
4. A Lucene index over the "sentences" (the fundamental cross-referenceable subunit of the text).

Maybe someone else has ideas about mapping from text in a document to a particular verse and its cross-references, but that sounds like a lot of mapping to me, so I would do the work up front and build the index of verses/sentences.
Just my beginner's $0.02 worth.
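The sentence-level unit in step 4 can be sketched with the JDK's BreakIterator. The "book:chapter:sentence" id scheme below is just an illustration; the point is that the RDBMS cross-reference tables and the Lucene index both key off the same stable id:

```java
import java.text.BreakIterator;
import java.util.*;

public class SentenceUnits {
    // Split a chapter's text into sentence-level units, each tagged with a
    // book:chapter:sentence id so a cross-reference table and a search index
    // can point at the same subunit.
    public static List<Map<String, String>> sentences(String book, int chapter, String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<Map<String, String>> out = new ArrayList<>();
        int n = 0;
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (s.isEmpty()) continue;
            Map<String, String> unit = new LinkedHashMap<>();
            unit.put("id", book + ":" + chapter + ":" + (++n)); // stable cross-link key
            unit.put("text", s);
            out.add(unit);
        }
        return out;
    }

    public static void main(String[] args) {
        for (Map<String, String> u : sentences("moby-dick", 1,
                "Call me Ishmael. Some years ago I went to sea. It was cold.")) {
            System.out.println(u);
        }
    }
}
```

Each unit would then become one Lucene document (id as a keyword field, text as an analyzed field), which is the up-front work described above.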

-Paul


