Posted to dev@lucene.apache.org by Robert Muir <rc...@gmail.com> on 2009/11/12 19:53:59 UTC

TrecContentSource and docname/iteration number

If I use TrecContentSource to index a collection, it puts the doc name into
the docname field, just as I like.
Say I have a doc with
<DOCNO>DOCID0001</DOCNO>
The problem is that it concatenates the iteration number to this document name:

name = name + "_" + iteration;

This produces a docname of DOCID0001_0, which won't work if I am trying to
use the quality package to measure relevance, since the relevance judgments
refer to the doc as DOCID0001.

Does anyone object to changing TrecContentSource to *not do this*?
I would think the primary reason you would want to use it would be to
measure relevance.

Alternatively, we could change DocNameExtractor in the quality package to
ignore this _<iteration> suffix. Either approach works for me.
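
For illustration, the second option would amount to stripping a trailing
"_<digits>" suffix from the stored docname before comparing it with the
judgments. The sketch below is only a rough idea of that: the class and
method names are hypothetical and are not part of DocNameExtractor or any
other Lucene class.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a Lucene class): removes a trailing "_<digits>"
// suffix such as the iteration number appended to the docname.
final class IterationSuffix {
    // Matches an underscore followed by one or more digits at the end of the name.
    private static final Pattern SUFFIX = Pattern.compile("_\\d+$");

    static String strip(String docName) {
        return SUFFIX.matcher(docName).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(strip("DOCID0001_0")); // prints DOCID0001
    }
}
```

One drawback of this route: a DOCNO that itself ends in "_<digits>" and was
indexed without the suffix would be truncated (AP_880212_0001 would become
AP_880212), which is an argument for fixing the name at the source instead.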
-- 
Robert Muir
rcmuir@gmail.com

Re: TrecContentSource and docname/iteration number

Posted by Robert Muir <rc...@gmail.com>.
Shai, ok. I couldn't tell whether it referred to round iterations (it looked
that way).

I left the default as it was, for back compat, but now there's an obscure
option, content.source.excludeIteration, that you can use to disable it.

I used content.source.* because I saw that the other ContentSources seemed to
do a similar thing (reuters, etc.), although I didn't implement any code to
respect it there.
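
A minimal sketch of how such an option might gate the suffix, with the
default preserving the old behaviour. Only the property name
content.source.excludeIteration comes from this thread; the class name, the
boolean value format, and the use of java.util.Properties in place of the
benchmark's own configuration class are assumptions made for illustration.

```java
import java.util.Properties;

// Hypothetical stand-in for the docname logic in a content source.
final class DocNameBuilder {
    private final boolean excludeIteration;

    DocNameBuilder(Properties config) {
        // Property name is from the thread; defaulting to "false" keeps the
        // old behaviour (suffix appended) for back compat.
        this.excludeIteration = Boolean.parseBoolean(
                config.getProperty("content.source.excludeIteration", "false"));
    }

    String build(String docNo, int iteration) {
        // Append "_<iteration>" unless the option disables it.
        return excludeIteration ? docNo : docNo + "_" + iteration;
    }
}
```

With an empty configuration, build("DOCID0001", 0) gives DOCID0001_0 as
before; with the property set to true it gives DOCID0001, which matches the
judgments.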

On Thu, Nov 12, 2009 at 10:54 PM, Shai Erera <se...@gmail.com> wrote:

> I think the fix you've made makes sense. The iteration number is added in
> case you want to collect more documents than are available (so that it
> starts over with the first one). I don't think it has to do with the
> iterations option in Benchmark, although it could.
>
> Being able to configure it makes sense to me. What's the default? I
> personally don't mind if it would be without iterations ...
>
> BTW, we could decide not to allow configuring it, and have the code add
> _<iter> to the names only if there is a second iteration. So the names
> would be DOCID0001 and, in the second iteration, DOCID0001_0 (or _1).
>
> Shai


-- 
Robert Muir
rcmuir@gmail.com

Re: TrecContentSource and docname/iteration number

Posted by Shai Erera <se...@gmail.com>.
I think the fix you've made makes sense. The iteration number is added in
case you want to collect more documents than are available (so that it
starts over with the first one). I don't think it has to do with the
iterations option in Benchmark, although it could.

Being able to configure it makes sense to me. What's the default? I
personally don't mind if it would be without iterations ...

BTW, we could decide not to allow configuring it, and have the code add
_<iter> to the names only if there is a second iteration. So the names
would be DOCID0001 and, in the second iteration, DOCID0001_0 (or _1).
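
Roughly, the non-configurable variant could look like the sketch below. It
assumes a zero-based iteration counter where 0 is the first pass over the
collection, and it picks the _1 numbering for the second pass; the class and
method names are hypothetical, not existing Lucene code.

```java
// Hypothetical sketch: suffix is added only once the source wraps around.
final class LazyIterationSuffix {
    static String docName(String docNo, int iteration) {
        // First pass (iteration 0) keeps the original DOCNO untouched, so it
        // still matches the relevance judgments; later passes get "_<iter>".
        return iteration == 0 ? docNo : docNo + "_" + iteration;
    }

    public static void main(String[] args) {
        System.out.println(docName("DOCID0001", 0)); // prints DOCID0001
        System.out.println(docName("DOCID0001", 1)); // prints DOCID0001_1
    }
}
```

This keeps first-pass names usable by the quality package while still making
names unique when the documents are reused.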

Shai
