Posted to dev@lucene.apache.org by "Shai Erera (JIRA)" <ji...@apache.org> on 2008/12/05 15:33:44 UTC

[jira] Created: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents

TrecDocMaker skips over documents when "Date" is missing from documents
-----------------------------------------------------------------------

                 Key: LUCENE-1479
                 URL: https://issues.apache.org/jira/browse/LUCENE-1479
             Project: Lucene - Java
          Issue Type: Bug
          Components: contrib/benchmark
            Reporter: Shai Erera
             Fix For: 2.4.1


TrecDocMaker skips over TREC documents that do not have a "Date" line. When it encounters such a document, the code may skip several documents until it finds the next tag it is searching for.
As a result, instead of reading ~25M documents from the GOV2 collection, the code reads only ~23M (I don't remember the exact numbers).

The fix adds a terminatingTag to read() so that the code looks for prefix only until terminatingTag is found. Appropriate changes were made in getNextDocData().
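
For illustration only, here is a minimal sketch of a bounded search; it is not the code from the patch. It assumes the input is available as an iterator of lines, and the class name BoundedReadSketch and the exact signature of read() are hypothetical.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical sketch; not the code from the attached patch.
class BoundedReadSketch {

    /**
     * Returns the first line that starts with prefix, or null if a line
     * starting with terminatingTag (or the end of input) is reached first.
     */
    static String read(Iterator<String> lines, String prefix, String terminatingTag) {
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.startsWith(prefix)) {
                return line;
            }
            if (terminatingTag != null && line.startsWith(terminatingTag)) {
                // Stop here instead of running into the next document.
                return null;
            }
        }
        return null; // end of input
    }

    public static void main(String[] args) {
        // The first header has no "Date:" line; the second one does.
        Iterator<String> lines = Arrays.asList(
            "<DOCHDR>", "Content-Type: text/html", "</DOCHDR>",
            "<DOCHDR>", "Date: from the next document", "</DOCHDR>").iterator();
        System.out.println(read(lines, "Date:", "</DOCHDR>")); // prints "null"
    }
}
```

Returning null when terminatingTag is reached first lets the caller treat the field as absent rather than consuming lines that belong to the next document.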

Patch to follow

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662452#action_12662452 ] 

Shai Erera commented on LUCENE-1479:
------------------------------------

This patch does not include a test case because a test would require the TREC data set. Is it valid to add a test case that fails if the TREC data is missing? If not, can you suggest how I can simulate it?
I can create several documents in the TREC format and feed those files to TrecDocMaker.
Or ... I'll look into extending TrecDocMaker and instead of feeding it with File(s), I'll feed it with some mock documents (String), which simulate the bug. Not sure if that's doable right away; it might require changing a method to protected.
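
One way to picture the mock-document idea is sketched below. This is not TrecDocMakerTest; the simplified document layout, the class MockTrecDocsSketch and the countDocs helper are assumptions made for illustration. The second mock document has no "Date:" line, so an unbounded search for it runs into the third document.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical sketch of a mock-document test; names and layout are illustrative.
class MockTrecDocsSketch {

    // Three documents in a simplified TREC-like layout; doc-2 has no "Date:" line.
    static final String MOCK_DOCS = String.join("\n",
        "<DOC>", "<DOCNO>doc-1</DOCNO>", "<DOCHDR>", "Date: d1", "</DOCHDR>", "body 1", "</DOC>",
        "<DOC>", "<DOCNO>doc-2</DOCNO>", "<DOCHDR>", "</DOCHDR>", "body 2", "</DOC>",
        "<DOC>", "<DOCNO>doc-3</DOCNO>", "<DOCHDR>", "Date: d3", "</DOCHDR>", "body 3", "</DOC>");

    /**
     * Counts documents. For each "<DOC>" line it searches for a "Date:" line.
     * When bounded is false the search has no stop tag, which mimics the bug.
     */
    static int countDocs(String text, boolean bounded) {
        Iterator<String> lines = Arrays.asList(text.split("\n")).iterator();
        int count = 0;
        while (lines.hasNext()) {
            if (!lines.next().equals("<DOC>")) {
                continue;
            }
            while (lines.hasNext()) {
                String line = lines.next();
                if (line.startsWith("Date:")) {
                    break; // date found
                }
                if (bounded && line.startsWith("</DOCHDR>")) {
                    break; // header ended without a date
                }
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countDocs(MOCK_DOCS, false)); // prints 2: doc-3 is swallowed
        System.out.println(countDocs(MOCK_DOCS, true));  // prints 3
    }
}
```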

Also, I'm not near the code now, so I can't tell if DocData allows for a null Date. But I guess it's just easier to assign the current date, for simplicity (you never know if at some point the date becomes a *must* ...).
I kept that logic from the unpatched TrecDocMaker ...

Shai


[jira] Resolved: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1479.
----------------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 2.4.1)
    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Committed revision 733697.

Thanks Shai!


[jira] Updated: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1479:
-------------------------------

    Attachment:     (was: LUCENE-1479.patch)


[jira] Assigned: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-1479:
------------------------------------------

    Assignee: Michael McCandless


[jira] Commented: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662338#action_12662338 ] 

Michael McCandless commented on LUCENE-1479:
--------------------------------------------

Ahh the last minute "trivial" code fix... gets you every time ;)

With the new patch, a document missing a date is assigned the current Date, right?  Can we instead leave it as unset (null)?  (Does DocData allow a null Date?)

Can you add a test case exposing the bug & showing the fix?



[jira] Commented: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662941#action_12662941 ] 

Michael McCandless commented on LUCENE-1479:
--------------------------------------------

Patch looks great Shai!  I'll commit shortly.


[jira] Commented: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662464#action_12662464 ] 

Michael McCandless commented on LUCENE-1479:
--------------------------------------------

{quote}
> I'll feed it with some mock documents (String), which simulate the bug.
{quote}
Right, either that or make a mocked set of docs in a .gz file and just use TrecDocMaker directly.

{quote}
> Also, I'm not near the code now, so I can't tell if DocData allows for a null Date. 
{quote}
I think DocData ought to be able to handle a null Date, especially since the real data (GOV2 in your OP) has real documents lacking a Date.

To pretend the document did have a Date that happens to be today's date doesn't seem good (I don't know of anything in contrib/benchmark that depends on accurate dates now, but we may have something in the future where shifting dates could confuse things).


[jira] Updated: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1479:
-------------------------------

    Attachment: LUCENE-1479-2.patch

This patch includes TrecDocMakerTest and removes the assignment of new Date() when "Date:" is missing.
The test includes several test cases, including one for the missing date.
I checked, and a null date is handled when building a Document object, so it's safe to assign a null date.
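
As a rough picture of what "handled" can mean here, the sketch below adds a date field only when a date is present. It is an assumption-laden illustration, not the benchmark code: the class NullableDateDocSketch, the buildFields helper and the field names are all hypothetical.

```java
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; field names and the map-based "document" are illustrative only.
class NullableDateDocSketch {

    /** Builds a field map and omits the date field when date is null. */
    static Map<String, String> buildFields(String name, String body, Date date) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("docname", name);
        fields.put("body", body);
        if (date != null) {
            // Only documents that really carry a date get a date field.
            fields.put("docdate", String.valueOf(date.getTime()));
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(buildFields("doc-2", "body 2", null).keySet()); // prints [docname, body]
    }
}
```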


[jira] Commented: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662073#action_12662073 ] 

Michael McCandless commented on LUCENE-1479:
--------------------------------------------

Shai, it seems like a doc that has no "Date: XXX" would leave dateStr as null and would then cause an NPE when parseDate is later called?  Or am I missing something?
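
A guard along the following lines would avoid the NPE. This is a sketch, not the patched code: parseDateOrNull and the date pattern are assumptions, and returning null for a missing or unparseable date is only one possible choice.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical sketch; the helper name and the date pattern are assumptions.
class NullSafeDateSketch {

    /** Returns the parsed date, or null when dateStr is missing or unparseable. */
    static Date parseDateOrNull(String dateStr) {
        if (dateStr == null) {
            return null; // no "Date:" line was found for this document
        }
        SimpleDateFormat format =
            new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss z", Locale.US);
        try {
            return format.parse(dateStr.trim());
        } catch (ParseException e) {
            return null; // treat a malformed date like a missing one
        }
    }

    public static void main(String[] args) {
        System.out.println(parseDateOrNull(null)); // prints "null"
        System.out.println(parseDateOrNull("Fri, 05 Dec 2008 15:33:44 GMT") != null); // prints "true"
    }
}
```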

Also I'm getting a compilation error:

{code}
[javac] Compiling 1 source file to /tango/mike/src/lucene.trecdocmaker/build/contrib/benchmark/classes/java
[javac] /tango/mike/src/lucene.trecdocmaker/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecDocMaker.java:190: variable name might not have been initialized
[javac]     String name = sb.substring(DOCNO.length(), name.indexOf(TERM_DOCNO, DOCNO.length()));
[javac]                                                ^
[javac] 1 error
{code}


[jira] Updated: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1479:
-------------------------------

    Attachment: LUCENE-1479.patch

Thanks Mike, you're right. The compilation error is the result of a refactoring of that line, where I replaced two substring calls with a single one. I forgot to use 'sb' in the second indexOf call, hence the compilation error.
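
For reference, the corrected call would look roughly like this. The values of DOCNO and TERM_DOCNO and the type of sb are assumptions; only the use of sb in both calls comes from the explanation above.

```java
// Hypothetical sketch; constant values and the StringBuilder type are assumed.
class DocnoExtractSketch {

    static final String DOCNO = "<DOCNO>";       // assumed value
    static final String TERM_DOCNO = "</DOCNO>"; // assumed value

    /**
     * Extracts the text between the opening and closing tags. Both the
     * substring call and the indexOf call operate on sb, so no variable is
     * read before it is initialized. Assumes sb starts with DOCNO and
     * contains TERM_DOCNO; otherwise substring throws.
     */
    static String extractName(StringBuilder sb) {
        return sb.substring(DOCNO.length(), sb.indexOf(TERM_DOCNO, DOCNO.length()));
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("<DOCNO>doc-1</DOCNO>");
        System.out.println(extractName(sb)); // prints "doc-1"
    }
}
```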

Regarding dateStr - I fixed that. Thanks for noticing it.


[jira] Updated: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1479:
---------------------------------------

    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
    Fix Version/s: 2.9

Please remember to include the trunk release (2.9, in this case) in the fix version when you set fix version to a point release on a branch (2.4.1, in this case).


[jira] Updated: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1479:
-------------------------------

    Attachment: LUCENE-1479.patch

Patch to fix the bug
