Posted to java-user@lucene.apache.org by Alex vB <ma...@avomberg.de> on 2010/11/09 23:30:22 UTC

Implementing indexing of Versioned Document Collections

Hello everybody,

I would like to implement the paper "Compact Full-Text Indexing of Versioned
Document Collections" [1] by Torsten Suel for my diploma thesis in Lucene.
The basic idea is to create a two-level index structure. On the first level
a document is identified by its document ID, with a posting list entry if the
term exists in at least one version. For every posting on the first level
with term t we have a bitvector on the second level. These bitvectors contain
as many bits as there are versions of the document, and bit i is set to 1
if version i contains term t, and remains 0 otherwise.

http://lucene.472066.n3.nabble.com/file/n1872701/Unbenannt_1.jpg 

This little picture is just for demonstration purposes. It shows a posting
list for the term car, composed of 4 document IDs. If a hit is found
in document 6, another look-up is needed on the second level to get the
corresponding versions (versions 1, 5, 7, 8, 9 and 10 out of 10 versions in total).
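The two-level lookup described above can be sketched with plain Java collections. This is only an illustration of the data structure, not Lucene code, and the class and method names are made up for the example:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the two-level structure: the first level maps a term to the
// documents containing it in at least one version; the second level stores,
// per (term, doc), a bitvector with one bit per version.
public class TwoLevelIndex {
    // term -> (docId -> version bitvector); TreeMap keeps docIds ordered
    // like a posting list
    private final Map<String, TreeMap<Integer, BitSet>> postings = new HashMap<>();

    public void add(String term, int docId, int version) {
        postings.computeIfAbsent(term, t -> new TreeMap<>())
                .computeIfAbsent(docId, d -> new BitSet())
                .set(version);
    }

    // Second-level lookup: which versions of docId contain the term?
    public BitSet versions(String term, int docId) {
        Map<Integer, BitSet> perDoc = postings.get(term);
        return perDoc == null ? new BitSet() : perDoc.getOrDefault(docId, new BitSet());
    }

    public static void main(String[] args) {
        TwoLevelIndex idx = new TwoLevelIndex();
        // "car" occurs in versions 1, 5, 7, 8, 9, 10 of document 6,
        // as in the example above
        for (int v : new int[] {1, 5, 7, 8, 9, 10}) {
            idx.add("car", 6, v);
        }
        System.out.println(idx.versions("car", 6)); // {1, 5, 7, 8, 9, 10}
    }
}
```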

At the moment I am using Wikipedia (the simplewiki dump) as a source with a
SAXParser and can resolve each document with all its versions from the XML
file (fields are Title, ID, and Content, separated for each version). My problem
is that I am unsure how to connect the second level with the first one and
how to store it. The key points needed are:
- information from posting list creation to build the bitvector (term ->
doc -> versions)
- storing the bitvectors
- implementing search on the second level

For the first steps I disabled term frequencies and positions because the
paper doesn't handle them. I would be happy to get any running version at
all. :)
At the moment I can create bitvectors for the documents. I realized this
with a HashMap<String, BitSet> in TermsHashPerField, where I grab the current
term in add() (I hope this is the correct location for retrieving the
inverted list's terms). In any case, I can create the correct bitvectors and write
them into a text file.
Excerpt of bitvectors from the article "April":
april : 
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101101110111111111111111111
never : 
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000
ayriway : 
0000000000000000000000000000000000000111111111111111111111111111111111111111111111111111111111111111111111111111111111111101101110111111111111111111
inclusive : 
1111111111111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 
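Building such a per-document HashMap<String, BitSet> can be done independently of Lucene's internals. This sketch (toBitString is an illustrative helper, not part of the original code) produces the same kind of 0/1 rows as the excerpt above, on a tiny made-up document with four versions:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Build one bitvector per term over all versions of a single document.
public class VersionBitVectors {
    // Render a BitSet as a fixed-width 0/1 string, as in the excerpt above.
    static String toBitString(BitSet bits, int numVersions) {
        StringBuilder sb = new StringBuilder(numVersions);
        for (int i = 0; i < numVersions; i++) {
            sb.append(bits.get(i) ? '1' : '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // versions[i] holds the distinct terms of version i of the document
        String[][] versions = {
            {"april", "inclusive"},
            {"april", "inclusive"},
            {"april", "never"},
            {"april"},
        };
        Map<String, BitSet> map = new HashMap<>();
        for (int v = 0; v < versions.length; v++) {
            for (String term : versions[v]) {
                // set bit v for every term seen in version v
                map.computeIfAbsent(term, t -> new BitSet()).set(v);
            }
        }
        int n = versions.length;
        System.out.println("april : " + toBitString(map.get("april"), n));     // 1111
        System.out.println("never : " + toBitString(map.get("never"), n));     // 0010
        System.out.println("inclusive : " + toBitString(map.get("inclusive"), n)); // 1100
    }
}
```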

The next step would be storing all bitvectors in the index. At first glance I
would like to use an extra field to store the created bitvectors permanently in the
index. It seems to be the easiest way for a first implementation without
touching the low-level internals of Lucene. Can I add a field after I
have already started writing the document through IndexWriter? How would I do
this? Or are there any other suggestions for storing? Another idea is to
extend Lucene's index format, but this seems a little too difficult
for me. Maybe I could write this information into my own file. Could
anybody point me in the right direction? :)

Currently I am focusing on storage, and will try to extend Lucene's search
after that step.

THX in advance & best regards 
Alex

[1] http://cis.poly.edu/suel/
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Implementing-indexing-of-Versioned-Document-Collections-tp1872701p1872701.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Implementing indexing of Versioned Document Collections

Posted by Alex vB <ma...@avomberg.de>.
Hi again,

my payloads are working fine, as I have now figured out (I hadn't seen the
nextPosition method). I do have real problems with adding the bitvectors.
Currently I am creating them during tokenization. Therefore, as already
mentioned, they are only completely created once all fields are tokenized,
because I add every new term occurrence to a HashMap and create/update the
linked bitvector during this analysis process. I read in another post that
changing or updating already-set payloads isn't possible. Furthermore, I need
to store the payload only ONCE per term and not at every term position. For
example, in the wiki article for April I would have around 5000 occurrences
of the term "April"! Storing the payload once would save a lot of memory.

1) Is it possible to pre-analyze fields? Maybe analyzing twice: first
to build the bitvectors (without writing them!) and a second time for
normal index writing with bitvector payloads.
2) Alternatively, I could still add the bitvectors during tokenization if I
were able to set the current term in my custom filter (extends
TokenFilter). In my HashMap I have pairs of <Term, BitVector>, and I could
iterate over all term keys. Is it possible to manually set the current term
and the corresponding payload? I tried something like this after all fields
and streams had been tokenized (without success):

for (Map.Entry<String, BitSet> e : map.entrySet()) {
    String key = e.getKey();
    BitSet value = e.getValue();

    // set the current term and attach its bitvector as payload
    termAtt.setTermBuffer(key);
    Payload bitvectorPayload = new Payload(toByteArray(value));
    payloadAttr.setPayload(bitvectorPayload);
}

3) Can I use payloads without term positions? 

If my questions are unclear please tell me! :)

Best regards
Alex



-- 
View this message in context: http://lucene.472066.n3.nabble.com/Implementing-indexing-of-Versioned-Document-Collections-tp1872701p1913140.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: Implementing indexing of Versioned Document Collections

Posted by Alex vB <ma...@avomberg.de>.
Hello Pulkit,

thank you for your answer, and excuse my late reply. I am currently
working on the payload stuff and have implemented my own Analyzer and
TokenFilter for adding custom payloads. As far as I understand, I can add
a payload for every term occurrence and write it into the posting list. My
posting list now looks like this:

car -> DocID1, [Payload 1], DocID2, [Payload 2], ..., DocID N, [Payload N]

Each payload is a BitSet depending on the versions of a document. I
must admit that the index is getting really big at the moment, because I am
adding around 8 to 16 bytes with each payload. I have to find a good
compression for the bitvectors.
Furthermore, I always get the error
org.apache.lucene.index.CorruptIndexException: checksum mismatch in segments
file if I use my own Analyzer. After I comment out the checksum test,
everything works fine. Even Luke isn't giving me an error. Any ideas?
Another problem is the bitvector creation during tokenization. I run
through all versions during the tokenizing step to create my bitvectors
(stored in a HashMap). So my bitvectors are only completely created after the
last field is analyzed (I added every Wikipedia version as its own field).
Therefore I need to add the payload after the tokenizing step. Is this
possible? What happens if I add a payload for the current term and add another
payload for the same term later? Is it overwritten or appended?
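One cheap starting point for the compression question above is run-length encoding, since version bitvectors of long-lived terms are mostly long runs of 1s and 0s (the paper explores more sophisticated schemes; this is only an illustrative baseline, and encodeRuns/decodeRuns are made-up helper names):

```java
import java.util.ArrayList;
import java.util.List;

// Baseline run-length coder for 0/1 version bitvectors.
public class RunLengthBits {
    // Encode a 0/1 string as run lengths. The first run is assumed to be
    // of '1's, so a bitvector starting with '0' gets a zero-length first run.
    static List<Integer> encodeRuns(String bits) {
        List<Integer> runs = new ArrayList<>();
        char current = '1';
        int len = 0;
        for (char c : bits.toCharArray()) {
            if (c == current) {
                len++;
            } else {
                runs.add(len);   // close the current run
                current = c;
                len = 1;
            }
        }
        runs.add(len);
        return runs;
    }

    // Invert encodeRuns: expand alternating runs back into a 0/1 string.
    static String decodeRuns(List<Integer> runs) {
        StringBuilder sb = new StringBuilder();
        char current = '1';
        for (int len : runs) {
            for (int i = 0; i < len; i++) sb.append(current);
            current = (current == '1') ? '0' : '1';
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bits = "1111100011";
        List<Integer> runs = encodeRuns(bits);
        System.out.println(runs);             // [5, 3, 2]
        System.out.println(decodeRuns(runs)); // 1111100011
    }
}
```

The run lengths would then still need a variable-length integer encoding before going into a payload, but even this baseline shows how a 148-bit vector with few version changes collapses to a handful of numbers.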

Greetings
Alex
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Implementing-indexing-of-Versioned-Document-Collections-tp1872701p1910449.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: Implementing indexing of Versioned Document Collections

Posted by Pulkit Singhal <pu...@gmail.com>.
1) You can attach byte array "Payloads" for every occurrence of a term
during indexing. A payload is stored at each term position during indexing
and can be retrieved during searching. You may want to consider taking this
approach rather than writing bitvectors to a text file. If you feel that I
should have read your thesis summary more closely and don't know what the
heck I'm talking about ... then I politely yield :)

2) Can I add a field after I already started writing the document through
IndexWriter?
Yes, you can. You can add new documents that have more fields than the ones
you added in the past. You can also "update" (internally it does a delete
then add) your document to have more fields at a later point in time.
Although in your use-case I simply didn't see why you would need to go back
and add more fields.

3) How would I do this?
a) Well, if you are simply adding documents, keep adding new ones with more
fields and Lucene won't complain. If you use stored fields, then you may
notice inconsistencies when you pull documents in your search results,
because some documents will have additional stored fields while others will
not... I am not sure how you would want to handle that.
b) If you are updating, then my personal approach would be to fetch the
document via a search that is unique enough to get just that one doc
back. In order to do this, people usually make sure to store a unique id
when indexing the document. Then, using that document, create a new one with
all the values of the old one plus the new fields you want to add. Delete
the old doc. Add the new doc.
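The delete-then-add recipe in b) can be sketched against a toy in-memory store rather than a real Lucene index (all class and method names here are illustrative, not Lucene API): the point is that "update" means copy the old fields, merge in the new ones, delete the old doc, and add the merged one.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of the delete-then-add update pattern keyed by a unique id.
public class DeleteThenAdd {
    // uniqueId -> (fieldName -> fieldValue)
    private final Map<String, Map<String, String>> store = new HashMap<>();

    public void add(String id, Map<String, String> fields) {
        store.put(id, new HashMap<>(fields));
    }

    // "Update": fetch the old doc by id, merge in new fields,
    // delete the old doc, then add the merged one.
    public void update(String id, Map<String, String> newFields) {
        Map<String, String> merged = new HashMap<>();
        Map<String, String> old = store.get(id);
        if (old != null) merged.putAll(old); // keep all old field values
        merged.putAll(newFields);            // add / overwrite with new fields
        store.remove(id);                    // delete the old doc
        store.put(id, merged);               // add the merged doc
    }

    public Map<String, String> get(String id) {
        return store.get(id);
    }

    public static void main(String[] args) {
        DeleteThenAdd idx = new DeleteThenAdd();
        idx.add("doc6", Map.of("title", "April"));
        idx.update("doc6", Map.of("bitvector", "1101"));
        System.out.println(idx.get("doc6")); // both title and bitvector survive
    }
}
```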

Good luck!

- Pulkit
