Posted to dev@lucene.apache.org by "Steven Parkes (JIRA)" <ji...@apache.org> on 2007/07/31 23:07:52 UTC

[jira] Created: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

Create enwiki indexable data as line-per-article rather than file-per-article
-----------------------------------------------------------------------------

                 Key: LUCENE-971
                 URL: https://issues.apache.org/jira/browse/LUCENE-971
             Project: Lucene - Java
          Issue Type: Improvement
            Reporter: Steven Parkes


Create a line per article rather than a file. Consume with indexLineFile task.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Resolved: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-971.
---------------------------------------

    Resolution: Fixed


OK, committed with these small changes:

  * Replaced conf/wikipedia.alg -> conf/extractWikipedia.alg in the
    comment in that file.

  * Moved doc.maker line up under the "# Where to get documents from:"
    comment

  * In build.xml, removed the extract-enwiki target so that "ant
    enwiki" does the right thing.

Thanks Steve!


> Create enwiki indexable data as line-per-article rather than file-per-article
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-971
>                 URL: https://issues.apache.org/jira/browse/LUCENE-971
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: LUCENE-971.patch.txt, LUCENE-971.patch.txt, LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.



[jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517048 ] 

Doron Cohen commented on LUCENE-971:
------------------------------------

Mmm... an additional advantage of this is not needing to extract 
the entire enwiki collection in order to index it - setting the 
repetition count to 100 for AddDocTask in alternative 1 or for 
WriteLineDocTask in alternative 2 would mean that only 100
docs from the huge file are extracted.
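A rough illustration of that point, not the benchmark's actual code: when documents are pulled one at a time from a streaming source, a consumer that asks for only N of them stops reading once it has them (apart from the reader's own buffering). The names FirstNDocsSketch and firstN are made up for this sketch, and a plain Reader with one line per document stands in for the enwiki dump.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one line of input stands in for one article.
class FirstNDocsSketch {

    // Pull at most n documents from the source, then stop reading.
    static List<String> firstN(Reader source, int n) {
        List<String> docs = new ArrayList<>();
        try {
            BufferedReader in = new BufferedReader(source);
            String line;
            // The size check comes first, so no extra line is read once n is reached.
            while (docs.size() < n && (line = in.readLine()) != null) {
                docs.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return docs;
    }

    public static void main(String[] args) {
        // Stand-in for the huge file; only the first two "documents" are requested.
        Reader hugeFile = new StringReader("doc1\ndoc2\ndoc3\ndoc4\n");
        System.out.println(firstN(hugeFile, 2));
    }
}
```

In the benchmark itself the limit would presumably come from the repetition count in the alg file rather than from a method argument.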

> Create enwiki indexable data as line-per-article rather than file-per-article
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-971
>                 URL: https://issues.apache.org/jira/browse/LUCENE-971
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Steven Parkes
>         Attachments: LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.



[jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518663 ] 

Michael McCandless commented on LUCENE-971:
-------------------------------------------

Super, new patch looks good.  I will commit!

> Create enwiki indexable data as line-per-article rather than file-per-article
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-971
>                 URL: https://issues.apache.org/jira/browse/LUCENE-971
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: LUCENE-971.patch.txt, LUCENE-971.patch.txt, LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.



[jira] Updated: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

Posted by "Steven Parkes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Parkes updated LUCENE-971:
---------------------------------

         Assignee: Steven Parkes
    Lucene Fields: [Patch Available]  (was: [Patch Available, New])

> Create enwiki indexable data as line-per-article rather than file-per-article
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-971
>                 URL: https://issues.apache.org/jira/browse/LUCENE-971
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: LUCENE-971.patch.txt, LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.



[jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

Posted by "Steven Parkes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518016 ] 

Steven Parkes commented on LUCENE-971:
--------------------------------------

Sounds good. New patch soon.

> Create enwiki indexable data as line-per-article rather than file-per-article
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-971
>                 URL: https://issues.apache.org/jira/browse/LUCENE-971
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Steven Parkes
>         Attachments: LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.



[jira] Updated: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

Posted by "Steven Parkes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Parkes updated LUCENE-971:
---------------------------------

    Attachment: LUCENE-971.patch.txt

All agreed and fixed. Thanks.

> Create enwiki indexable data as line-per-article rather than file-per-article
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-971
>                 URL: https://issues.apache.org/jira/browse/LUCENE-971
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: LUCENE-971.patch.txt, LUCENE-971.patch.txt, LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.



[jira] Updated: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

Posted by "Steven Parkes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Parkes updated LUCENE-971:
---------------------------------

    Attachment: LUCENE-971.patch.txt

> Create enwiki indexable data as line-per-article rather than file-per-article
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-971
>                 URL: https://issues.apache.org/jira/browse/LUCENE-971
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Steven Parkes
>         Attachments: LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.



[jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

Posted by "Steven Parkes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516997 ] 

Steven Parkes commented on LUCENE-971:
--------------------------------------

I can look at what it would take to avoid the line file ... but ... what about the overhead of the XML parser? I don't tend to think of XML parsers as "light". Would bundling that into the test be a concern?

I guess it's not an issue if you're just using this to create an index and then are going to do your performance measurements on the queries of the index. But for measuring index performance, I would probably be cautious of bundling in the XML processing (until proven insignificant).

> Create enwiki indexable data as line-per-article rather than file-per-article
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-971
>                 URL: https://issues.apache.org/jira/browse/LUCENE-971
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Steven Parkes
>         Attachments: LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.



[jira] Updated: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

Posted by "Steven Parkes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Parkes updated LUCENE-971:
---------------------------------

    Attachment: LUCENE-971.patch.txt

Okay. Here's an update to the patch.

Changes:

1) EnwikiDocMaker replaces ExtractWikipedia

2) A sample algorithm is provided (and used by the build.xml file, which could be removed if desired)

3) A bug in LineDocMaker is fixed: it was storing both the title and date in the title field (small enough that it doesn't need its own JIRA?)

4) LineDocMaker was made derivable-from

Much of the code in LineDocMaker is useful in EnwikiDocMaker, so I made EnwikiDocMaker derive from it (it's inheritance for implementation, not abstraction, so it could be changed, of course).

5) Made LineDocMaker and WriteLineDocTask multicharacter safe

Or at least I tried to. Wikipedia has non-ASCII characters in it. To make LineDocMaker work as a base class, I made it use an explicit FileInputStream, which is required so that SAX can extract the encoding correctly. I made WriteLineDocTask always write UTF-8 so that I can get non-ASCII in the output file. Seems like UTF-8 is the best encoding for line files? At the same time, I made LineDocMaker assume UTF-8 (unless told otherwise by a derived class like EnwikiDocMaker) so that the line files created by EnwikiDocMaker/WriteLineDocTask can be read by LineDocMaker w/o loss.
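The UTF-8 round trip described above can be sketched roughly as follows. This is not the code from the patch: the class Utf8LineFileSketch, its method names, the tab separator and the title/date/body field order are all assumptions made for illustration. The point it shows is that both the writer and the reader name UTF-8 explicitly instead of relying on the platform default encoding.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a one-line-per-article file with explicit UTF-8.
class Utf8LineFileSketch {
    static final char SEP = '\t'; // assumed field separator

    // Tabs and line breaks inside a field would corrupt the format, so replace them.
    static String clean(String s) {
        return s.replace('\t', ' ').replace('\n', ' ').replace('\r', ' ');
    }

    static BufferedWriter openForWrite(File f) throws IOException {
        return new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(f), StandardCharsets.UTF_8));
    }

    static BufferedReader openForRead(File f) throws IOException {
        return new BufferedReader(
            new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8));
    }

    static void writeLine(BufferedWriter out, String title, String date, String body)
            throws IOException {
        out.write(clean(title) + SEP + clean(date) + SEP + clean(body) + '\n');
    }

    // Split into at most three fields: title, date, body.
    static String[] parseLine(String line) {
        return line.split("\t", 3);
    }

    // Write one article to a temp file and read it back.
    static String[] roundTrip(String title, String date, String body) {
        try {
            File f = File.createTempFile("linefile", ".txt");
            f.deleteOnExit();
            try (BufferedWriter out = openForWrite(f)) {
                writeLine(out, title, date, body);
            }
            try (BufferedReader in = openForRead(f)) {
                return parseLine(in.readLine());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // \u00fc is a non-ASCII character (u with diaeresis).
        String[] fields = roundTrip("Z\u00fcrich", "2007-05-27", "Largest city\nin Switzerland");
        System.out.println(fields.length + " fields, title length " + fields[0].length());
    }
}
```

If either side fell back to a single-byte platform default, the non-ASCII title would not survive the trip, which is the loss the explicit encoding is meant to avoid.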

> Create enwiki indexable data as line-per-article rather than file-per-article
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-971
>                 URL: https://issues.apache.org/jira/browse/LUCENE-971
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Steven Parkes
>         Attachments: LUCENE-971.patch.txt, LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.



[jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516996 ] 

Michael McCandless commented on LUCENE-971:
-------------------------------------------

This looks great!

One alternate approach here would be to create a WikipediaDocMaker
(implementing DocMaker interface) that pulls directly from the XML
file and feeds documents into the alg.

Then, to make a line file, one could create an alg that pulls docs
from WikipediaDocMaker and uses WriteLineDoc task to create the
line-by-line file.

One benefit of this approach is creating docs of a certain size (10
tokens, 100 tokens, etc) would become a one-step process (single alg)
instead of what I think is a 2-step process now (make first line file,
then reprocess into second line file).  Another benefit would be you
could make wikipedia tasks that pull directly from the XML file and
not even use a line file as an intermediary.

Steve, do you think this would be a hard change?  I think it should be
easy, except, I'm not sure how to do this w/ SAX since SAX is "in
control".  You sort of need coroutines.  Or maybe one thread is
running SAX and putting doc data into a shared queue, and then the other
thread (the normal "main" thread that benchmark runs) would pull from
this queue?
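The two-thread idea could look roughly like the sketch below: one thread lets SAX drive the parse and pushes each finished item into a bounded queue, and the caller pulls from that queue at its own pace. Everything here is illustrative and assumed rather than taken from any patch: the class SaxQueueSketch, the queue size of 16, the end-of-input sentinel, and the simplification of extracting only <title> elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: SAX stays "in control" on its own thread,
// while the caller pulls results one at a time.
class SaxQueueSketch {
    // End-of-input marker, compared by identity so no real title can match it.
    private static final String EOF = new String("EOF");

    // Bounded, so the parser blocks instead of running far ahead of the consumer.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(16);
    private boolean done;

    SaxQueueSketch(byte[] xml) {
        Thread producer = new Thread(() -> parse(xml));
        producer.setDaemon(true); // do not keep the JVM alive if the consumer stops early
        producer.start();
    }

    private void parse(byte[] xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml), new TitleHandler());
        } catch (Exception e) {
            // A real implementation should hand the failure to the consumer.
            e.printStackTrace();
        } finally {
            put(EOF);
        }
    }

    private void put(String s) {
        try {
            queue.put(s);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns the next title, or null once the input is exhausted.
    String next() {
        if (done) return null;
        try {
            String s = queue.take();
            if (s == EOF) {
                done = true;
                return null;
            }
            return s;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private class TitleHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inTitle;

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (qName.equals("title")) {
                inTitle = true;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inTitle) text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (qName.equals("title")) {
                inTitle = false;
                put(text.toString());
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<mediawiki><page><title>A</title></page>"
                   + "<page><title>B</title></page></mediawiki>";
        SaxQueueSketch source = new SaxQueueSketch(xml.getBytes(StandardCharsets.UTF_8));
        for (String t = source.next(); t != null; t = source.next()) {
            System.out.println(t);
        }
    }
}
```

The bounded queue is a design choice: it caps memory use and keeps the parser only a few documents ahead of the consumer, at the cost of some thread hand-off overhead per document.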


> Create enwiki indexable data as line-per-article rather than file-per-article
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-971
>                 URL: https://issues.apache.org/jira/browse/LUCENE-971
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Steven Parkes
>         Attachments: LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.



[jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517047 ] 

Doron Cohen commented on LUCENE-971:
------------------------------------

> But, this is the case regardless of which approach we use (ie, both
> approaches allow you to use a line file -- the WriteLineDocTask writes a
> line file from any DocMaker).  It's just that the new approach would
> buy us more flexibility for those people who don't need (or want) to
> use the line file as an intermediary.

So there would now be two alternative ways to index wiki data:
(1) using the proposed WikiDocMaker directly to feed AddDoc task.
(2) using line file after first running WriteLineDocTask when the 
doc maker was WikiDocMaker.

I like this approach.

This means that WikiDocMaker would read the data straight from 
temp/enwiki-20070527-pages-articles.xml. So the extract-enwiki 
target in build.xml would no longer be needed, right?



> Create enwiki indexable data as line-per-article rather than file-per-article
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-971
>                 URL: https://issues.apache.org/jira/browse/LUCENE-971
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Steven Parkes
>         Attachments: LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.



[jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517007 ] 

Michael McCandless commented on LUCENE-971:
-------------------------------------------


> I can look at what it would take to avoid the line file ... but
> ... what about the overhead of the XML parser? I don't tend to think
> of XML parsers as "light". Would bundling that into the test be a
> concern?

Right, I too would not consider XML parsing overhead "light".  So tests
that are sensitive to the XML parsing cost should first create a line
file.

But, this is the case regardless of which approach we use (ie, both
approaches allow you to use a line file -- the WriteLineDocTask writes a
line file from any DocMaker).  It's just that the new approach would
buy us more flexibility for those people who don't need (or want) to
use the line file as an intermediary.


> Create enwiki indexable data as line-per-article rather than file-per-article
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-971
>                 URL: https://issues.apache.org/jira/browse/LUCENE-971
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Steven Parkes
>         Attachments: LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.



[jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518524 ] 

Michael McCandless commented on LUCENE-971:
-------------------------------------------

Patch looks good; a few comments:

  * In conf/wikipedia.alg:

    - The comment says "Reuters" but should say "Wikipedia"

    - It's only processing 1 doc?  I think you should change the ": 1"
      to ": *"?
 
    - Maybe rename this to conf/extractEnWikipedia.alg?

  * When I tried to run this I hit OOM (on Linux).  Then I changed the
    line in conf/wikipedia.alg to this:

      {WriteLineDoc() > : *

    And OOM went away and I was able to produce the full line file.
    That change tells benchmark not to record PerfTask details.  So I
    think we should make that change too.


> Create enwiki indexable data as line-per-article rather than file-per-article
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-971
>                 URL: https://issues.apache.org/jira/browse/LUCENE-971
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: LUCENE-971.patch.txt, LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.
