Posted to java-user@lucene.apache.org by Will Murnane <wi...@gmail.com> on 2009/10/27 23:50:44 UTC

Split single string into several fields?

Hello list,
  I have some semi-structured text that has some markup elements, and
I want to put those elements into a separate field so I can search by
them.  For example (using HTML syntax):
---- 8< ---- document
<h1>Section title</h1>
Body content
---- >8 ----
I can find that the things inside <h1>s are "Section" and "title", and
"Body" and "content" are outside.  I want to create two fields for
this document:
insideh1 -> "Section", "title"
alltext -> "Section", "title", "Body", "content"

What's the best way to approach this?  My initial thought is to make
some kind of MultiAnalyzer that consumes the text and produces several
token streams, which are added to the document one at a time.  Is that
a reasonable strategy?

Thanks!
Will



Re: Split single string into several fields?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Robert Muir wrote:
> Will, I think this parsing of documents into different fields is separate
> and unrelated to Lucene's analysis (tokenization)...
> the analysis comes into play once you have a field and you want to break
> the text into indexable units (words, or the entire field as one token,
> like your URLs).
> 
> I wouldn't suggest making a big complicated analyzer that tries to parse
> HTML in addition to breaking text into words; I would keep parsing and
> analysis separate.
> Then I would handle different fields with different analyzers. I think
> Erick already mentioned PerFieldAnalyzerWrapper; it's useful for this.

It's also possible to do the tokenization ahead of time, i.e. before you 
pass the document to IndexWriter. You can construct the TokenStream 
using your own analysis chain, and use Field.setTokenStreamValue() - 
this way you will index exactly the token stream you want, and you can 
even create other fields in the document (or split this token stream 
into several fields).
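
For example, something like this (an untested sketch against the Lucene 2.9
API, using the Field(String, TokenStream) constructor and assuming an open
IndexWriter named writer; the analyzer chain and field names are just
examples):

import java.io.StringReader;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// build the token streams yourself, outside IndexWriter's analyzer
TokenStream inside = new LowerCaseFilter(
    new WhitespaceTokenizer(new StringReader("Section title")));
TokenStream all = new LowerCaseFilter(
    new WhitespaceTokenizer(new StringReader("Section title Body content")));

Document doc = new Document();
// a Field built from a TokenStream is indexed exactly as given,
// bypassing the IndexWriter's analyzer for that field
doc.add(new Field("insideh1", inside));
doc.add(new Field("alltext", all));
writer.addDocument(doc);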


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Split single string into several fields?

Posted by Robert Muir <rc...@gmail.com>.
Will, I think this parsing of documents into different fields is separate
and unrelated to Lucene's analysis (tokenization)...
the analysis comes into play once you have a field and you want to break
the text into indexable units (words, or the entire field as one token,
like your URLs).

I wouldn't suggest making a big complicated analyzer that tries to parse
HTML in addition to breaking text into words; I would keep parsing and
analysis separate.
Then I would handle different fields with different analyzers. I think
Erick already mentioned PerFieldAnalyzerWrapper; it's useful for this.

If there is some performance consideration... it seems you are worried
about this with respect to parsing, not actual analysis... then maybe use
the SAX API?
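
For example, a rough sketch of the wrapper (untested, against the Lucene
2.9 API; the "imgsrc" field name and the KeywordAnalyzer-for-URLs mapping
are just examples):

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// default analyzer for ordinary text fields...
PerFieldAnalyzerWrapper analyzer =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_29));
// ...but treat the whole contents of the URL field as a single token
analyzer.addAnalyzer("imgsrc", new KeywordAnalyzer());
// hand this wrapper to the IndexWriter and each field gets its own analyzer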

On Tue, Oct 27, 2009 at 9:56 PM, Will Murnane <wi...@gmail.com> wrote:

> On Tue, Oct 27, 2009 at 21:21, Jake Mannix <ja...@gmail.com> wrote:
> > On Tue, Oct 27, 2009 at 6:12 PM, Erick Erickson <erickerickson@gmail.com> wrote:
> >
> >> Could you go into your use case a bit more? Because I'm confused.
> >> Why don't you want your text tokenized? You say you want to search it,
> >> which means you have to analyze it.
> >
> >
> > I think Will is suggesting that he doesn't want to have to analyze it
> > *again* -
> > if he really has different fields for every tag type, it would get
> > prohibitively
> > expensive in terms of Indexing CPU usage to retokenize over and over
> > again.
> >
> > Is that what your concern is, Will?
> More or less.  Different types of tags need different tokenization:
> just as an example, I want to parse an img tag which contains a src
> attribute as a URL, and tokenize the URL as such (i.e., even if there
> are spaces they're treated as a unit), but the contents of a paragraph
> must be tokenized as English text.
>
> So I think the solution (because there's only one Analyzer per
> IndexWriter, and thus per document) is to do all the
> field-type-specific stuff outside of Lucene, and then use a very
> generic Analyzer, like the "\0"-splitter mentioned above.
>
> On Tue, Oct 27, 2009 at 21:12, Erick Erickson <er...@gmail.com>
> wrote:
> > If you need different analyzers for each field, see
> PerFieldAnalyzerWrapper.
>
> That's very close to what I need, but I don't think it lines up quite
> right.  When I find some tokens inside an h1 tag (assume for
> simplicity that I only need to consider the innermost tag around a
> particular element) they won't be in the category for
> things-inside-h2-tags.  So I think trying to find all the things that
> are in h1 tags in one pass through the DOM tree, then things in h2
> tags in another, and so forth, will be slower than traversing the tree
> once and filing everything in its place myself, then feeding each list
> into Lucene as a field.
>
> So, in other words, I think using an individual Analyzer for each type
> of tag will be inefficient, so I'll run one big Analyzer, then put its
> results into Lucene.
>
> Will
>


-- 
Robert Muir
rcmuir@gmail.com

Re: Split single string into several fields?

Posted by Grant Ingersoll <gs...@apache.org>.
Not sure if it completely applies here, but you might also have a look
at the TeeSinkTokenFilter in the core analysis package
(org.apache.lucene.analysis). It is designed to tee/sink tokens off from
one main field to other fields.
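
Something like this, maybe (an untested sketch against the Lucene 2.9 API;
field names are just examples):

import java.io.StringReader;
import org.apache.lucene.analysis.TeeSinkTokenFilter;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// tokens flowing through the source stream are teed into the sink as well
TeeSinkTokenFilter source = new TeeSinkTokenFilter(
    new WhitespaceTokenizer(new StringReader("Section title Body content")));
TeeSinkTokenFilter.SinkTokenStream sink = source.newSinkTokenStream();

Document doc = new Document();
// add the source field first so it is consumed before the sink replays
doc.add(new Field("alltext", source));
doc.add(new Field("insideh1", sink));

A TeeSinkTokenFilter.SinkFilter passed to newSinkTokenStream(filter) would
let the sink keep only the tokens you want in the second field.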


On Oct 27, 2009, at 9:56 PM, Will Murnane wrote:

> On Tue, Oct 27, 2009 at 21:21, Jake Mannix <ja...@gmail.com>  
> wrote:
>> On Tue, Oct 27, 2009 at 6:12 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>>
>>> Could you go into your use case a bit more? Because I'm confused.
>>> Why don't you want your text tokenized? You say you want to search  
>>> it,
>>> which means you have to analyze it.
>>
>>
>> I think Will is suggesting that he doesn't want to have to analyze it
>> *again* -
>> if he really has different fields for every tag type, it would get
>> prohibitively
>> expensive in terms of Indexing CPU usage to retokenize over and over
>> again.
>>
>> Is that what your concern is, Will?
> More or less.  Different types of tags need different tokenization:
> just as an example, I want to parse an img tag which contains a src
> attribute as a URL, and tokenize the URL as such (i.e., even if there
> are spaces they're treated as a unit), but the contents of a paragraph
> must be tokenized as English text.
>
> So I think the solution (because there's only one Analyzer per
> IndexWriter, and thus per document) is to do all the
> field-type-specific stuff outside of Lucene, and then use a very
> generic Analyzer, like the "\0"-splitter mentioned above.
>
> On Tue, Oct 27, 2009 at 21:12, Erick Erickson  
> <er...@gmail.com> wrote:
>> If you need different analyzers for each field, see  
>> PerFieldAnalyzerWrapper.
>
> That's very close to what I need, but I don't think it lines up quite
> right.  When I find some tokens inside an h1 tag (assume for
> simplicity that I only need to consider the innermost tag around a
> particular element) they won't be in the category for
> things-inside-h2-tags.  So I think trying to find all the things that
> are in h1 tags in one pass through the DOM tree, then things in h2
> tags in another, and so forth, will be slower than traversing the tree
> once and filing everything in its place myself, then feeding each list
> into Lucene as a field.
>
> So, in other words, I think using an individual Analyzer for each type
> of tag will be inefficient, so I'll run one big Analyzer, then put its
> results into Lucene.
>
> Will
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search




Re: Split single string into several fields?

Posted by Will Murnane <wi...@gmail.com>.
On Tue, Oct 27, 2009 at 21:21, Jake Mannix <ja...@gmail.com> wrote:
> On Tue, Oct 27, 2009 at 6:12 PM, Erick Erickson <er...@gmail.com> wrote:
>
>> Could you go into your use case a bit more? Because I'm confused.
>> Why don't you want your text tokenized? You say you want to search it,
>> which means you have to analyze it.
>
>
> I think Will is suggesting that he doesn't want to have to analyze it
> *again* -
> if he really has different fields for every tag type, it would get
> prohibitively
> expensive in terms of Indexing CPU usage to retokenize over and over
> again.
>
> Is that what your concern is, Will?
More or less.  Different types of tags need different tokenization:
just as an example, I want to parse an img tag which contains a src
attribute as a URL, and tokenize the URL as such (i.e., even if there
are spaces they're treated as a unit), but the contents of a paragraph
must be tokenized as English text.

So I think the solution (because there's only one Analyzer per
IndexWriter, and thus per document) is to do all the
field-type-specific stuff outside of Lucene, and then use a very
generic Analyzer, like the "\0"-splitter mentioned above.

On Tue, Oct 27, 2009 at 21:12, Erick Erickson <er...@gmail.com> wrote:
> If you need different analyzers for each field, see PerFieldAnalyzerWrapper.

That's very close to what I need, but I don't think it lines up quite
right.  When I find some tokens inside an h1 tag (assume for
simplicity that I only need to consider the innermost tag around a
particular element) they won't be in the category for
things-inside-h2-tags.  So I think trying to find all the things that
are in h1 tags in one pass through the DOM tree, then things in h2
tags in another, and so forth, will be slower than traversing the tree
once and filing everything in its place myself, then feeding each list
into Lucene as a field.

So, in other words, I think using an individual Analyzer for each type
of tag will be inefficient, so I'll run one big Analyzer, then put its
results into Lucene.
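
A rough sketch of the single traversal I have in mind (untested; the
collect() helper and the map layout are hypothetical, and it assumes a
parsed org.w3c.dom tree):

import java.util.HashMap;
import java.util.Map;
import org.w3c.dom.Node;

// one pass over the DOM: file each text node under the name of its
// innermost enclosing element, and into the all-text buffer as well
static void collect(Node node, Map<String, StringBuilder> byTag,
                    StringBuilder all) {
  if (node.getNodeType() == Node.TEXT_NODE) {
    String tag = node.getParentNode().getNodeName().toLowerCase();
    StringBuilder sb = byTag.get(tag);
    if (sb == null) {
      sb = new StringBuilder();
      byTag.put(tag, sb);
    }
    sb.append(node.getNodeValue()).append('\0');
    all.append(node.getNodeValue()).append('\0');
  }
  for (Node child = node.getFirstChild(); child != null;
       child = child.getNextSibling()) {
    collect(child, byTag, all);
  }
}

Each accumulated buffer would then become one field, fed through the
'\0'-splitting analyzer.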

Will



Re: Split single string into several fields?

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Oct 27, 2009 at 6:12 PM, Erick Erickson <er...@gmail.com> wrote:

> Could you go into your use case a bit more? Because I'm confused.
> Why don't you want your text tokenized? You say you want to search it,
> which means you have to analyze it.


I think Will is suggesting that he doesn't want to have to analyze it
*again* - if he really has different fields for every tag type, it would
get prohibitively expensive in terms of indexing CPU usage to retokenize
over and over again.

Is that what your concern is, Will?

  -jake

Re: Split single string into several fields?

Posted by Erick Erickson <er...@gmail.com>.
Could you go into your use case a bit more? Because I'm confused.
Why don't you want your text tokenized? You say you want to search it,
which means you have to analyze it. All I'm suggesting is passing the text
from whatever HTML element into the analyzer, without the surrounding
markup. I'm suggesting that you might be able to use the analyzers
Lucene provides and just pass in text strings, without any need to create
your own analyzer.

If you need different analyzers for each field, see PerFieldAnalyzerWrapper.

Best
Erick

> On Tue, Oct 27, 2009 at 8:44 PM, Will Murnane <wi...@gmail.com> wrote:

> On Tue, Oct 27, 2009 at 19:17, Erick Erickson <er...@gmail.com>
> wrote:
> > Unless I don't understand at all what you're going for, wouldn't
> > it work to just put the HTML through some kind of parser (strict or
> > loose depending on how well-formed your HTML is), then just
> > extract the text from your document and push them into your
> > Lucene document? Various parsers make this more or less
> > simple...
> That's more or less what I was suggesting.  The problem as I see it is
> that Lucene wants to do its own tokenizing step.  I declared my
> IndexWriter like this:
> writer = new IndexWriter(IndexDirectory, new MySpecialAnalyzer(),
> true, MaxFieldLength.UNLIMITED);
> and the code in the MySpecialAnalyzer class is indeed called later on.
>
> So, I think this approach:
> > domObj = parse(htmldocument);
> > Document lucDoc = new Document();
> > lucDoc.add("insideh1", domObj.getText(<dom path to H1>));
> (etc) won't work, because when I put that text in it'll be analyzed again.
>
> Perhaps I'll write a ZeroSplittingAnalyzer or something, do all the
> work before I give anything to Lucene, then '\0'-join my tokens and
> feed them to the simple analyzer.  So something like this:
> Document doc = new Document();
> doc.add(new Field("h1", "hello\0world",
>     Field.Store.NO, Field.Index.ANALYZED));
> doc.add(new Field("alltext", "hello\0world\0goodnight\0moon",
>     Field.Store.NO, Field.Index.ANALYZED));
>
> I think that makes sense.  Comments?
>
> Will
>
> >
> > HTH
> > Erick
> >
> >
> > On Tue, Oct 27, 2009 at 6:50 PM, Will Murnane <will.murnane@gmail.com> wrote:
> >
> >> Hello list,
> >>  I have some semi-structured text that has some markup elements, and
> >> I want to put those elements into a separate field so I can search by
> >> them.  For example (using HTML syntax):
> >> ---- 8< ---- document
> >> <h1>Section title</h1>
> >> Body content
> >> ---- >8 ----
> >> I can find that the things inside <h1>s are "Section" and "title", and
> >> "Body" and "content" are outside.  I want to create two fields for
> >> this document:
> >> insideh1 -> "Section", "title"
> >> alltext -> "Section", "title", "Body", "content"
> >>
> >> What's the best way to approach this?  My initial thought is to make
> >> some kind of MultiAnalyzer that consumes the text and produces several
> >> token streams, which are added to the document one at a time.  Is that
> >> a reasonable strategy?
> >>
> >> Thanks!
> >> Will
> >>

Re: Split single string into several fields?

Posted by Will Murnane <wi...@gmail.com>.
On Tue, Oct 27, 2009 at 19:17, Erick Erickson <er...@gmail.com> wrote:
> Unless I don't understand at all what you're going for, wouldn't
> it work to just put the HTML through some kind of parser (strict or
> loose depending on how well-formed your HTML is), then just
> extract the text from your document and push them into your
> Lucene document? Various parsers make this more or less
> simple...
That's more or less what I was suggesting.  The problem as I see it is
that Lucene wants to do its own tokenizing step.  I declared my
IndexWriter like this:
writer = new IndexWriter(IndexDirectory, new MySpecialAnalyzer(),
true, MaxFieldLength.UNLIMITED);
and the code in the MySpecialAnalyzer class is indeed called later on.

So, I think this approach:
> domObj = parse(htmldocument);
> Document lucDoc = new Document();
> lucDoc.add("insideh1", domObj.getText(<dom path to H1>));
(etc) won't work, because when I put that text in it'll be analyzed again.

Perhaps I'll write a ZeroSplittingAnalyzer or something, do all the
work before I give anything to Lucene, then '\0'-join my tokens and
feed them to the simple analyzer.  So something like this:
Document doc = new Document();
doc.add(new Field("h1", "hello\0world",
    Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("alltext", "hello\0world\0goodnight\0moon",
    Field.Store.NO, Field.Index.ANALYZED));

I think that makes sense.  Comments?

Will

>
> HTH
> Erick
>
>
> On Tue, Oct 27, 2009 at 6:50 PM, Will Murnane <wi...@gmail.com> wrote:
>
>> Hello list,
>>  I have some semi-structured text that has some markup elements, and
>> I want to put those elements into a separate field so I can search by
>> them.  For example (using HTML syntax):
>> ---- 8< ---- document
>> <h1>Section title</h1>
>> Body content
>> ---- >8 ----
>> I can find that the things inside <h1>s are "Section" and "title", and
>> "Body" and "content" are outside.  I want to create two fields for
>> this document:
>> insideh1 -> "Section", "title"
>> alltext -> "Section", "title", "Body", "content"
>>
>> What's the best way to approach this?  My initial thought is to make
>> some kind of MultiAnalyzer that consumes the text and produces several
>> token streams, which are added to the document one at a time.  Is that
>> a reasonable strategy?
>>
>> Thanks!
>> Will
>>


Re: Split single string into several fields?

Posted by Erick Erickson <er...@gmail.com>.
Unless I don't understand at all what you're going for, wouldn't
it work to just put the HTML through some kind of parser (strict or
loose depending on how well-formed your HTML is), then extract the
text from your document and push it into your Lucene document? Various
parsers make this more or less simple...

Something like, for each document
domObj = parse(htmldocument);
Document lucDoc = new Document();

lucDoc.add("insideh1", domObj.getText(<dom path to H1>));

lucDoc.add("insideh1", domObj.getText(<dom path to title>));

lucDoc.add("alltext", <like above>);
lucDoc.add("alltext, <like above>);
.
.
.
<add document to lucene index>
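
In real API terms that last part might look like this (untested sketch;
h1Text and allText are the strings pulled out of the DOM, writer is your
IndexWriter, and the Store/Index flags are just one choice):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document lucDoc = new Document();
lucDoc.add(new Field("insideh1", h1Text,
    Field.Store.NO, Field.Index.ANALYZED));
lucDoc.add(new Field("alltext", allText,
    Field.Store.NO, Field.Index.ANALYZED));
writer.addDocument(lucDoc);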

HTH
Erick


On Tue, Oct 27, 2009 at 6:50 PM, Will Murnane <wi...@gmail.com> wrote:

> Hello list,
>  I have some semi-structured text that has some markup elements, and
> I want to put those elements into a separate field so I can search by
> them.  For example (using HTML syntax):
> ---- 8< ---- document
> <h1>Section title</h1>
> Body content
> ---- >8 ----
> I can find that the things inside <h1>s are "Section" and "title", and
> "Body" and "content" are outside.  I want to create two fields for
> this document:
> insideh1 -> "Section", "title"
> alltext -> "Section", "title", "Body", "content"
>
> What's the best way to approach this?  My initial thought is to make
> some kind of MultiAnalyzer that consumes the text and produces several
> token streams, which are added to the document one at a time.  Is that
> a reasonable strategy?
>
> Thanks!
> Will
>