Posted to dev@vxquery.apache.org by Steven Jacobs <sj...@ucr.edu> on 2013/09/17 19:59:06 UTC

Index test on bigger dataset

I ran a test on one of Preston's real-world data sets (Weather collection)
that had around 40,000 files. I am attaching the results. There are three
graphs.

The first shows the time for returning the entire XML for all 40,000 files.
My index algorithm has huge gains over collection, no matter how much of
the data is returned.

The second shows how the two algorithms perform as the number of files
increases. Both increase linearly, but collection has a much steeper slope.

The last is a one-point comparison for returning paths that exist in only
100 of the 40,000 files. Once again, the index has a huge advantage.


Steven

Re: Index test on bigger dataset

Posted by Michael Carey <mj...@ics.uci.edu>.
And I may be misinterpreting, but - it also sounds like it's not 
apples-to-apples in terms of the queries?
(I.e., are you perhaps not returning the entirety of the documents? Or 
does the index hold a complete replica of the data? What is the query, 
or queries?)

Cheers,
Mike



Re: Index test on bigger dataset

Posted by Steven Jacobs <sj...@ucr.edu>.
I just found a problem with my results. It looks like the way Lucene was
set up, it was unable to cache the entire result set, so it was only
returning a subset of the files. The result is still correct for returning
2% of the files (because that subset was small enough for Lucene to
return), so it does get 2% in around ten seconds, whereas collection takes
around 40 seconds, but I need to figure out some things before retesting
the 100% results. It might not even be possible to return this amount of
data from the index at one time.

Steven



Re: Index test on bigger dataset

Posted by Vinayak Borkar <vi...@gmail.com>.
That seems very plausible. Opening and closing files is fairly 
expensive. In addition, each file read is a random seek on disk.
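Vinayak's point is easy to demonstrate with a toy measurement. The sketch below (not the VXQuery code; file names, counts, and sizes are made up) reads the same number of bytes once via many small files, one open/close each, and once via a single file:

```python
import os
import tempfile
import time

def time_reads(n_files=200, size=1000):
    """Toy comparison: reading n small files (one open/close and seek per
    file, as collection() does) vs. one file holding the same total bytes
    (as a single sequential index read would)."""
    with tempfile.TemporaryDirectory() as d:
        # Create n_files small files plus one big file of equal total size.
        for i in range(n_files):
            with open(os.path.join(d, f"doc{i}.xml"), "w") as f:
                f.write("x" * size)
        big = os.path.join(d, "all.dat")
        with open(big, "w") as f:
            f.write("x" * (size * n_files))

        # Many opens, one per file.
        t0 = time.perf_counter()
        total_many = 0
        for i in range(n_files):
            with open(os.path.join(d, f"doc{i}.xml")) as f:
                total_many += len(f.read())
        t_many = time.perf_counter() - t0

        # One open for the same number of bytes.
        t0 = time.perf_counter()
        with open(big) as f:
            total_one = len(f.read())
        t_one = time.perf_counter() - t0

        return total_many, total_one, t_many, t_one
```

On most systems the many-files pass is noticeably slower even though both read the same byte count, which is the overhead being discussed here.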





Re: Index test on bigger dataset

Posted by Steven Jacobs <sj...@ucr.edu>.
I changed my indexing algorithm to fetch data incrementally, in chunks
small enough for it to handle at one time. This yielded results more like
what we expected. I am attaching the revised graphs. The second and third
graphs did not change, as those results already fit in a single fetch.
They show that the index has a huge advantage for returning small amounts
of data.

The first graph shows the negative effect of returning all of the data
from the index. What slows my algorithm down is the number of times it
needs to re-fetch from the index. In the case of returning all data from
all 40,000 files, it took 32 chunks to perform the operation. An
interesting thing to find out is how the number of chunk fetches
correlates with the loss of advantage over collection. I don't know if I
will be able to do a more thorough examination before Summer of Code is
over, but it is definitely something to look at in the future.

In any case, indexing will be a poor choice when returning most of the
data from a large data set, but it shines when fetching a small subset.
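The incremental retrieval described above can be sketched as a simple paging loop. This is an illustration only, with a hypothetical `fetch_page` callback standing in for the actual Lucene query, not the VXQuery code:

```python
def fetch_all_in_chunks(fetch_page, chunk_size):
    """Pull results one chunk at a time until the source is exhausted,
    so at most chunk_size entries are held per fetch.

    fetch_page(offset, limit) returns the next slice of results,
    or an empty list when there is nothing left."""
    results = []
    offset = 0
    fetches = 0
    while True:
        page = fetch_page(offset, chunk_size)
        if not page:
            break
        results.extend(page)
        offset += len(page)
        fetches += 1
    return results, fetches
```

If the chunks were equal-sized, 32 fetches over the whole collection would correspond to a chunk holding 1/32 of the data, which matches the re-fetch count reported above.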

Steven




Re: Index test on bigger dataset

Posted by Michael Carey <mj...@ics.uci.edu>.
Got it - now it makes more sense.  (In that case, I question the use of 
the term "index", though; I would view this instead as an "indexed 
collection implementation" that's an alternative to basic file storage.  
But that's just words; it doesn't affect the implementation or 
performance work. :-))



Re: Index test on bigger dataset

Posted by Steven Jacobs <sj...@ucr.edu>.
The index actually has all of the data, so it is capable of recreating the
XML structure. We use a parent-child relationship to maintain an id for
each piece of data. For example, assume the following XML code:
<letter>
  <title maxlength="10">My letter</title>
  <body>Here is the body</body>
  <footer>
    <greeting>Good Bye!</greeting>
  </footer>
</letter>

From this snippet we would get the following tuples (id, value, type):
0 "letter" element
0.0 "title" element
0.0.0 "maxlength" attribute
0.0.0.0 "10" text
0.0.1 "My letter" text
0.1 "body" element
0.1.0 "Here is the body" text
0.2 "footer" element
0.2.0 "greeting" element
0.2.0.0 "Good Bye!" text

Then we store these tuples in a Lucene index. This means that the tests I
ran on 100% of the data were actually producing the entire result from the
index. Intuitively, the index should really stand out when we are only
returning a small subset of the results, as collection will parse
everything while the index will only look at the relevant data, and this
does seem to be the case (it finished in a quarter of the time in my test).
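The id scheme above can be sketched roughly as follows. This is a simplified illustration using Python's ElementTree, not the actual VXQuery/Lucene code, and it ignores namespaces and mixed content:

```python
import xml.etree.ElementTree as ET

def node_tuples(elem, node_id="0"):
    """Emit (id, value, type) tuples for a subtree: each child's id is its
    parent's id plus a dotted position, matching the numbering above.
    Attributes come first, then the element's text, then child elements."""
    out = [(node_id, elem.tag, "element")]
    pos = 0
    for name, value in elem.attrib.items():
        out.append((f"{node_id}.{pos}", name, "attribute"))
        out.append((f"{node_id}.{pos}.0", value, "text"))
        pos += 1
    if elem.text and elem.text.strip():
        out.append((f"{node_id}.{pos}", elem.text.strip(), "text"))
        pos += 1
    for child in elem:
        out.extend(node_tuples(child, f"{node_id}.{pos}"))
        pos += 1
    return out

letter = ET.fromstring(
    '<letter><title maxlength="10">My letter</title>'
    '<body>Here is the body</body>'
    '<footer><greeting>Good Bye!</greeting></footer></letter>'
)
```

Running `node_tuples(letter)` on the letter example reproduces the tuple list above, e.g. the attribute value "10" lands at id 0.0.0.0 as a text child of the maxlength attribute node.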

At this point I am getting a memory overflow when trying to retrieve ALL
data from the index, so I am looking into Lucene's capability of
incremental results. I will post results when I get things working.

Steven



Re: Index test on bigger dataset

Posted by Vinayak Borkar <vi...@gmail.com>.
Steven,


It might be useful to send a brief note describing how you use Lucene to 
store XML documents.

Thanks,
Vinayak




Re: Index test on bigger dataset

Posted by Michael Carey <mj...@ics.uci.edu>.
PS (to my last mail) - I would have thought the index results would 
still have to be used to go and get the full XML documents back? (Or is 
this less of an index and more of an alternative complete storage solution?)



Re: Index test on bigger dataset

Posted by Steven Jacobs <sj...@ucr.edu>.
My current theory is that collection pays the overhead of opening a file
handle 40,000 times, while the index opens a handle only once to read the
index results. I don't know if this is enough to produce the large
difference, though.

Steven



Re: Index test on bigger dataset

Posted by Michael Carey <mj...@ics.uci.edu>.
Interesting!  I haven't followed closely enough yet, but now you have my 
interest. :-)
Do you have an explanation for why your index wins even in the case of 100%?
(Not intuitive - maybe I am missing some details that would fill my 
intuition gap.)
