Posted to java-user@lucene.apache.org by Christoph Hermann <he...@informatik.uni-freiburg.de> on 2010/10/15 18:15:33 UTC

Tokenizing XML

Hi,

Is there a Tokenizer in Lucene that tokenizes XML correctly?

That is, given the following XML:
<span>this is <span attr="foo">example</span>text.</span>

I would like to get these tokens (or something similar):
<span> | this | is | <span attr="foo"> | example | </span> | text. | </span>

Or would I need to write such a Tokenizer myself?

Regards,
Christoph Hermann

-- 
Christoph Hermann
Institut für Informatik
Tel: +49 761-203-8171 Fax: +49 761-203-8162
e-mail: hermann@informatik.uni-freiburg.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Tokenizing XML

Posted by Erick Erickson <er...@gmail.com>.
Well, it's hard to say what "correctly" would be. Remove all
XML? Preserve attributes? Preserve tags? Put the attributes
and values into fields in the document? My point is that there's
no obviously "correct" parsing.
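
To make the "attributes into fields" option concrete, here is a minimal
sketch that uses the JDK's StAX parser (javax.xml.stream) to separate the
text content from the attribute name/value pairs. It does not touch Lucene
at all; the class name XmlFieldSketch and its two members are hypothetical,
and how the results would be mapped onto document fields is left open.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: splits XML into plain text and "name=value" attribute pairs.
public class XmlFieldSketch {
    public final StringBuilder text = new StringBuilder();
    public final List<String> attributes = new ArrayList<>();

    public static XmlFieldSketch parse(String xml) {
        XmlFieldSketch out = new XmlFieldSketch();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    // Collect every attribute of the opening tag.
                    for (int i = 0; i < r.getAttributeCount(); i++) {
                        out.attributes.add(r.getAttributeLocalName(i) + "=" + r.getAttributeValue(i));
                    }
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    // Character data between tags; the tags themselves are dropped.
                    out.text.append(r.getText());
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
        return out;
    }

    public static void main(String[] args) {
        XmlFieldSketch f = parse("<span>this is <span attr=\"foo\">example</span>text.</span>");
        System.out.println(f.text);       // this is exampletext.
        System.out.println(f.attributes); // [attr=foo]
    }
}
```

The text could then go into one analyzed field and the attribute pairs into
another, but that mapping is exactly the design decision described above.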

But if you just want to strip out all the tags (<...>), it seems like
PatternTokenizer might work for you...
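
As a rough illustration of the pattern-based idea, here is a minimal sketch
in plain java.util.regex. It is not Lucene's PatternTokenizer (whose
constructor differs between versions); the class name XmlTokenSketch, the
keepTags flag, and the regular expression are assumptions made for this
example. The pattern treats anything between '<' and the next '>' as one tag
token, so it will misbehave on comments, CDATA sections, or attribute values
that contain '>'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: regex-based splitting of XML into tag and word tokens.
public class XmlTokenSketch {
    // Either a whole tag (<...>) or a run of characters that is neither whitespace nor '<'.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^\\s<]+");

    public static List<String> tokenize(String xml, boolean keepTags) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(xml);
        while (m.find()) {
            String token = m.group();
            // Drop tag tokens when the caller only wants the text.
            if (keepTags || !token.startsWith("<")) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String xml = "<span>this is <span attr=\"foo\">example</span>text.</span>";
        System.out.println(String.join(" | ", tokenize(xml, true)));
        System.out.println(String.join(" | ", tokenize(xml, false)));
    }
}
```

With keepTags set to true this yields the token list from the original
question; with false it yields only this | is | example | text. The same
regular expression could probably be handed to a pattern-based tokenizer in
matching-group mode, but check the API of the Lucene version in use.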

HTH
Erick
