Posted to java-user@lucene.apache.org by Eran Sevi <er...@gmail.com> on 2008/03/11 14:23:29 UTC

Specialized XML handling in Lucene

Hi,

I would like to ask for suggestions on the best design for the following
scenario:

I have a very large number of XML files (around 1M).
Each file contains several sections. Each section contains many elements
(about 1000-5000).
Each element has a value and some attributes describing the value (like
metadata), for example:

<Section1>
    <Element1 id="0"  type="A" meta1="val11" meta2="val21">value1</Element1>
    <Element1 id="1"  type="B" meta1="val12" meta2="val21">value2</Element1>
...
</Section1>
<Section2>
    <Element2 id="0"  type="D" meta1="val11" meta3="val31">value3</Element2>
    <Element2 id="1"  type="B" meta1="val13" meta3="val34">value1</Element2>
...
</Section2>
...

As you can see, each attribute can have any value, and attribute names can
be the same in different sections.

I would like to index the XML in such a way that I can perform queries like:

element1=value1 AND type=A AND meta2=val21

and also more complicated queries that include positions between elements,
and even range queries on attribute values.

Indexing each element as a separate document might not be possible because
of the large number of documents it would create (more than 5 billion docs),
and it might also make results difficult to interpret: I still want to know how
many original XML documents contain the searched terms.
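
To illustrate the second concern, here is a minimal sketch in plain Python
(not Lucene code). It assumes each per-element document would carry a
hypothetical file_id field naming the XML file it came from, so that
element-level hits can be rolled back up to distinct source files. The
function name and the shape of the hit tuples are made up for illustration.

```python
from collections import Counter

def count_source_files(hits):
    """Roll element-level hits up to the XML files they came from.

    hits: iterable of (file_id, element_id) pairs, one per matching element.
    Returns (number of distinct files, Counter of hits per file).
    """
    per_file = Counter(file_id for file_id, _ in hits)
    return len(per_file), per_file

# Three matching elements spread over two source files (made-up data).
hits = [("a.xml", 0), ("a.xml", 7), ("b.xml", 3)]
distinct_files, per_file = count_source_files(hits)
print(distinct_files)  # 2
```

With billions of element documents this roll-up has to run over every hit,
which is part of why the per-element layout looks costly.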

Indexing each attribute as a separate field is also difficult, because I
would then need the positional information of the matched terms to check that
they were all found in the same place (and thus "belong" to the same element).
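
The "same place" check can be sketched as follows, in plain Python with a toy
in-memory index rather than Lucene. The sketch assumes one document per XML
file, one field per attribute name (plus one per element name), and that the
n-th element of the file always occupies position n in every field. The
<doc> wrapper element and the function names are assumptions added so the
sample parses on its own.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The sample from above, wrapped in a hypothetical <doc> root element.
SAMPLE = """<doc>
<Section1>
    <Element1 id="0" type="A" meta1="val11" meta2="val21">value1</Element1>
    <Element1 id="1" type="B" meta1="val12" meta2="val21">value2</Element1>
</Section1>
<Section2>
    <Element2 id="0" type="D" meta1="val11" meta3="val31">value3</Element2>
    <Element2 id="1" type="B" meta1="val13" meta3="val34">value1</Element2>
</Section2>
</doc>"""

def build_postings(xml_text):
    """Map (field, term) -> set of element positions within one file."""
    postings = defaultdict(set)
    position = 0
    for section in ET.fromstring(xml_text):
        for element in section:
            # The element's own value goes into a field named after its tag.
            value = (element.text or "").strip()
            postings[(element.tag.lower(), value)].add(position)
            # Each attribute goes into a field named after the attribute.
            for name, attr_value in element.attrib.items():
                postings[(name, attr_value)].add(position)
            position += 1
    return postings

def matching_positions(postings, clauses):
    """Positions at which every (field, term) clause matches (an AND query)."""
    sets = [postings.get(clause, set()) for clause in clauses]
    return set.intersection(*sets) if sets else set()

postings = build_postings(SAMPLE)

# element1=value1 AND type=A AND meta2=val21 -> the first element only.
print(matching_positions(
    postings, [("element1", "value1"), ("type", "A"), ("meta2", "val21")]))  # {0}

# type=A and meta1=val12 both occur in the file, but on different elements.
print(matching_positions(postings, [("type", "A"), ("meta1", "val12")]))  # set()
```

The second query shows the problem: a plain per-field AND would match the
file, and only the position intersection reveals that no single element
satisfies both clauses.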

Please send me your thoughts on a possible solution with Lucene (if one
exists).

Thanks,
Eran S.