You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2007/12/28 18:20:43 UTC

[jira] Created: (LUCENE-1103) WikipediaTokenizer

WikipediaTokenizer
------------------

Key: LUCENE-1103
URL: https://issues.apache.org/jira/browse/LUCENE-1103
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor

I have extended StandardTokenizer to recognize Wikipedia syntax and mark tokens with certain attributes. It isn't necessarily complete, but it does a good enough job for it to be consumed and improved by others.

It sets the Token.type() value depending on the Wikipedia syntax (links, internal links, bold, italics, etc.) based on my pass at http://en.wikipedia.org/wiki/Wikipedia:Tutorial

I have only tested it with the benchmarking EnwikiDocMaker wikipedia stuff and it seems to do a decent job.

Caveats: I am not sure how to best handle testing, since the content is licensed under GNU Free Doc License, I believe I can't copy and paste a whole document into the unit test. I have hand coded one doc and have another one that just generally runs over the benchmark Wikipedia download.

One more question is where to put it. It could go in analysis, but the tests at least will have a dependency on Benchmark. I am thinking of adding a new contrib/wikipedia where this could grow to have other wikipedia things (perhaps we would move EnwikiDocMaker there????) and reverse the dependency on Benchmark.

I will post a patch over the next few days.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1103) WikipediaTokenizer

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated LUCENE-1103:
------------------------------------

    Attachment: LUCENE-1103.patch

More legible unit test.  Also now skips HTML tags from within the text.  Fixed issue with heading.  Added support for <ref> tag in addition to the brace-based citation {{

> WikipediaTokenizer
> ------------------
>
>                 Key: LUCENE-1103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1103
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch
>
>
> I have extended StandardTokenizer to recognize Wikipedia syntax and mark tokens with certain attributes.  It isn't necessarily complete, but it does a good enough job for it to be consumed and improved by others.
> It sets the Token.type() value depending on the Wikipedia syntax (links, internal links, bold, italics, etc.) based on my pass at http://en.wikipedia.org/wiki/Wikipedia:Tutorial
> I have only tested it with the benchmarking EnwikiDocMaker wikipedia stuff and it seems to do a decent job.
> Caveats:  I am not sure how to best handle testing, since the content is licensed under GNU Free Doc License, I believe I can't copy and paste a whole document into the unit test.  I have hand coded one doc and have another one that just generally runs over the benchmark Wikipedia download.
> One more question is where to put it.  It could go in analysis, but the tests at least will have a dependency on Benchmark.  I am thinking of adding a new contrib/wikipedia where this could grow to have other wikipedia things (perhaps we would move EnwikiDocMaker there????) and reverse the dependency on Benchmark.
> I will post a patch over the next few days.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1103) WikipediaTokenizer

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated LUCENE-1103:
------------------------------------

    Attachment: LUCENE-1103.patch

More URL testing and fixes.

> WikipediaTokenizer
> ------------------
>
>                 Key: LUCENE-1103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1103
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch
>
>
> I have extended StandardTokenizer to recognize Wikipedia syntax and mark tokens with certain attributes.  It isn't necessarily complete, but it does a good enough job for it to be consumed and improved by others.
> It sets the Token.type() value depending on the Wikipedia syntax (links, internal links, bold, italics, etc.) based on my pass at http://en.wikipedia.org/wiki/Wikipedia:Tutorial
> I have only tested it with the benchmarking EnwikiDocMaker wikipedia stuff and it seems to do a decent job.
> Caveats:  I am not sure how to best handle testing, since the content is licensed under GNU Free Doc License, I believe I can't copy and paste a whole document into the unit test.  I have hand coded one doc and have another one that just generally runs over the benchmark Wikipedia download.
> One more question is where to put it.  It could go in analysis, but the tests at least will have a dependency on Benchmark.  I am thinking of adding a new contrib/wikipedia where this could grow to have other wikipedia things (perhaps we would move EnwikiDocMaker there????) and reverse the dependency on Benchmark.
> I will post a patch over the next few days.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1103) WikipediaTokenizer

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555642#action_12555642 ] 

Doug Cutting commented on LUCENE-1103:
--------------------------------------

Should the position increment be zero for link urls, so that phrase searches work correctly with anchors?  One might even index URLs in a separate field...

> WikipediaTokenizer
> ------------------
>
>                 Key: LUCENE-1103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1103
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch
>
>
> I have extended StandardTokenizer to recognize Wikipedia syntax and mark tokens with certain attributes.  It isn't necessarily complete, but it does a good enough job for it to be consumed and improved by others.
> It sets the Token.type() value depending on the Wikipedia syntax (links, internal links, bold, italics, etc.) based on my pass at http://en.wikipedia.org/wiki/Wikipedia:Tutorial
> I have only tested it with the benchmarking EnwikiDocMaker wikipedia stuff and it seems to do a decent job.
> Caveats:  I am not sure how to best handle testing, since the content is licensed under GNU Free Doc License, I believe I can't copy and paste a whole document into the unit test.  I have hand coded one doc and have another one that just generally runs over the benchmark Wikipedia download.
> One more question is where to put it.  It could go in analysis, but the tests at least will have a dependency on Benchmark.  I am thinking of adding a new contrib/wikipedia where this could grow to have other wikipedia things (perhaps we would move EnwikiDocMaker there????) and reverse the dependency on Benchmark.
> I will post a patch over the next few days.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1103) WikipediaTokenizer

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated LUCENE-1103:
------------------------------------

    Attachment: LUCENE-1103.patch

Handle anchors in links and check for various link conditions

> WikipediaTokenizer
> ------------------
>
>                 Key: LUCENE-1103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1103
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch
>
>
> I have extended StandardTokenizer to recognize Wikipedia syntax and mark tokens with certain attributes.  It isn't necessarily complete, but it does a good enough job for it to be consumed and improved by others.
> It sets the Token.type() value depending on the Wikipedia syntax (links, internal links, bold, italics, etc.) based on my pass at http://en.wikipedia.org/wiki/Wikipedia:Tutorial
> I have only tested it with the benchmarking EnwikiDocMaker wikipedia stuff and it seems to do a decent job.
> Caveats:  I am not sure how to best handle testing, since the content is licensed under GNU Free Doc License, I believe I can't copy and paste a whole document into the unit test.  I have hand coded one doc and have another one that just generally runs over the benchmark Wikipedia download.
> One more question is where to put it.  It could go in analysis, but the tests at least will have a dependency on Benchmark.  I am thinking of adding a new contrib/wikipedia where this could grow to have other wikipedia things (perhaps we would move EnwikiDocMaker there????) and reverse the dependency on Benchmark.
> I will post a patch over the next few days.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1103) WikipediaTokenizer

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated LUCENE-1103:
------------------------------------

    Attachment: LUCENE-1103.patch

Should now handle external links with non link text, i.e. [http://foo.com/junk Foo Junk] and spit out 3 tokens: http://foo.com/junk, Foo, Junk

> WikipediaTokenizer
> ------------------
>
>                 Key: LUCENE-1103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1103
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1103.patch, LUCENE-1103.patch
>
>
> I have extended StandardTokenizer to recognize Wikipedia syntax and mark tokens with certain attributes.  It isn't necessarily complete, but it does a good enough job for it to be consumed and improved by others.
> It sets the Token.type() value depending on the Wikipedia syntax (links, internal links, bold, italics, etc.) based on my pass at http://en.wikipedia.org/wiki/Wikipedia:Tutorial
> I have only tested it with the benchmarking EnwikiDocMaker wikipedia stuff and it seems to do a decent job.
> Caveats:  I am not sure how to best handle testing, since the content is licensed under GNU Free Doc License, I believe I can't copy and paste a whole document into the unit test.  I have hand coded one doc and have another one that just generally runs over the benchmark Wikipedia download.
> One more question is where to put it.  It could go in analysis, but the tests at least will have a dependency on Benchmark.  I am thinking of adding a new contrib/wikipedia where this could grow to have other wikipedia things (perhaps we would move EnwikiDocMaker there????) and reverse the dependency on Benchmark.
> I will post a patch over the next few days.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1103) WikipediaTokenizer

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555674#action_12555674 ] 

Grant Ingersoll commented on LUCENE-1103:
-----------------------------------------

{quote}
Should the position increment be zero for link urls, so that phrase searches work correctly with anchors? 
{quote}

Good point.  I will look into it.

{quote}
One might even index URLs in a separate field...
{quote}

Yep, there are all kinds of fun things you can do with the various pieces, and the TeeTokenFilter/SinkTokenizer is really handy for this stuff. 

> WikipediaTokenizer
> ------------------
>
>                 Key: LUCENE-1103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1103
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch
>
>
> I have extended StandardTokenizer to recognize Wikipedia syntax and mark tokens with certain attributes.  It isn't necessarily complete, but it does a good enough job for it to be consumed and improved by others.
> It sets the Token.type() value depending on the Wikipedia syntax (links, internal links, bold, italics, etc.) based on my pass at http://en.wikipedia.org/wiki/Wikipedia:Tutorial
> I have only tested it with the benchmarking EnwikiDocMaker wikipedia stuff and it seems to do a decent job.
> Caveats:  I am not sure how to best handle testing, since the content is licensed under GNU Free Doc License, I believe I can't copy and paste a whole document into the unit test.  I have hand coded one doc and have another one that just generally runs over the benchmark Wikipedia download.
> One more question is where to put it.  It could go in analysis, but the tests at least will have a dependency on Benchmark.  I am thinking of adding a new contrib/wikipedia where this could grow to have other wikipedia things (perhaps we would move EnwikiDocMaker there????) and reverse the dependency on Benchmark.
> I will post a patch over the next few days.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1103) WikipediaTokenizer

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated LUCENE-1103:
------------------------------------

    Attachment: LUCENE-1103.patch

Adds in EXTERNAL_LINK_URL type to distinguish the first token of an external link from the other tokens.

Also adds in a POM template, so this project can be mavenized.

Are we there yet?

> WikipediaTokenizer
> ------------------
>
>                 Key: LUCENE-1103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1103
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch
>
>
> I have extended StandardTokenizer to recognize Wikipedia syntax and mark tokens with certain attributes.  It isn't necessarily complete, but it does a good enough job for it to be consumed and improved by others.
> It sets the Token.type() value depending on the Wikipedia syntax (links, internal links, bold, italics, etc.) based on my pass at http://en.wikipedia.org/wiki/Wikipedia:Tutorial
> I have only tested it with the benchmarking EnwikiDocMaker wikipedia stuff and it seems to do a decent job.
> Caveats:  I am not sure how to best handle testing, since the content is licensed under GNU Free Doc License, I believe I can't copy and paste a whole document into the unit test.  I have hand coded one doc and have another one that just generally runs over the benchmark Wikipedia download.
> One more question is where to put it.  It could go in analysis, but the tests at least will have a dependency on Benchmark.  I am thinking of adding a new contrib/wikipedia where this could grow to have other wikipedia things (perhaps we would move EnwikiDocMaker there????) and reverse the dependency on Benchmark.
> I will post a patch over the next few days.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1103) WikipediaTokenizer

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated LUCENE-1103:
------------------------------------

    Fix Version/s: 2.3

Patch shortly.  This will be all new code, other than minor changes to include javadocs.  I am going to create contrib/wikipedia, as there are probably other things that can go in here once the seed is started.

> WikipediaTokenizer
> ------------------
>
>                 Key: LUCENE-1103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1103
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>
> I have extended StandardTokenizer to recognize Wikipedia syntax and mark tokens with certain attributes.  It isn't necessarily complete, but it does a good enough job for it to be consumed and improved by others.
> It sets the Token.type() value depending on the Wikipedia syntax (links, internal links, bold, italics, etc.) based on my pass at http://en.wikipedia.org/wiki/Wikipedia:Tutorial
> I have only tested it with the benchmarking EnwikiDocMaker wikipedia stuff and it seems to do a decent job.
> Caveats:  I am not sure how to best handle testing, since the content is licensed under GNU Free Doc License, I believe I can't copy and paste a whole document into the unit test.  I have hand coded one doc and have another one that just generally runs over the benchmark Wikipedia download.
> One more question is where to put it.  It could go in analysis, but the tests at least will have a dependency on Benchmark.  I am thinking of adding a new contrib/wikipedia where this could grow to have other wikipedia things (perhaps we would move EnwikiDocMaker there????) and reverse the dependency on Benchmark.
> I will post a patch over the next few days.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1103) WikipediaTokenizer

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated LUCENE-1103:
------------------------------------

    Attachment: LUCENE-1103.patch

Adds contrib/wikipedia.  Updates the javadocs build and the site docs.

Will commit Friday, pending feedback.

Current issues:
Doesn't assign multiple types to tokens that have more than one possible type, i.e. something like '''[[link]]''' (a bold link).  In this case, I assign the more important type: the link.

> WikipediaTokenizer
> ------------------
>
>                 Key: LUCENE-1103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1103
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1103.patch
>
>
> I have extended StandardTokenizer to recognize Wikipedia syntax and mark tokens with certain attributes.  It isn't necessarily complete, but it does a good enough job for it to be consumed and improved by others.
> It sets the Token.type() value depending on the Wikipedia syntax (links, internal links, bold, italics, etc.) based on my pass at http://en.wikipedia.org/wiki/Wikipedia:Tutorial
> I have only tested it with the benchmarking EnwikiDocMaker wikipedia stuff and it seems to do a decent job.
> Caveats:  I am not sure how to best handle testing, since the content is licensed under GNU Free Doc License, I believe I can't copy and paste a whole document into the unit test.  I have hand coded one doc and have another one that just generally runs over the benchmark Wikipedia download.
> One more question is where to put it.  It could go in analysis, but the tests at least will have a dependency on Benchmark.  I am thinking of adding a new contrib/wikipedia where this could grow to have other wikipedia things (perhaps we would move EnwikiDocMaker there????) and reverse the dependency on Benchmark.
> I will post a patch over the next few days.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1103) WikipediaTokenizer

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated LUCENE-1103:
------------------------------------

    Attachment: LUCENE-1103.patch

The first alphanum in internal and external link gets a positionIncrement of 0, so phrases can still work.

> WikipediaTokenizer
> ------------------
>
>                 Key: LUCENE-1103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1103
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch
>
>
> I have extended StandardTokenizer to recognize Wikipedia syntax and mark tokens with certain attributes.  It isn't necessarily complete, but it does a good enough job for it to be consumed and improved by others.
> It sets the Token.type() value depending on the Wikipedia syntax (links, internal links, bold, italics, etc.) based on my pass at http://en.wikipedia.org/wiki/Wikipedia:Tutorial
> I have only tested it with the benchmarking EnwikiDocMaker wikipedia stuff and it seems to do a decent job.
> Caveats:  I am not sure how to best handle testing, since the content is licensed under GNU Free Doc License, I believe I can't copy and paste a whole document into the unit test.  I have hand coded one doc and have another one that just generally runs over the benchmark Wikipedia download.
> One more question is where to put it.  It could go in analysis, but the tests at least will have a dependency on Benchmark.  I am thinking of adding a new contrib/wikipedia where this could grow to have other wikipedia things (perhaps we would move EnwikiDocMaker there????) and reverse the dependency on Benchmark.
> I will post a patch over the next few days.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1103) WikipediaTokenizer

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556057#action_12556057 ] 

Grant Ingersoll commented on LUCENE-1103:
-----------------------------------------

I updated the link slightly to increment the first token of a link (i.e. the URL or the Wiki link) and then not increment the next token in the link, such that the link and the first display token will be at the same position instead of the first way I had it which put the link token at the same position as the previous token.

I also modified the EXTERNAL link state to recognize https

> WikipediaTokenizer
> ------------------
>
>                 Key: LUCENE-1103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1103
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch
>
>
> I have extended StandardTokenizer to recognize Wikipedia syntax and mark tokens with certain attributes.  It isn't necessarily complete, but it does a good enough job for it to be consumed and improved by others.
> It sets the Token.type() value depending on the Wikipedia syntax (links, internal links, bold, italics, etc.) based on my pass at http://en.wikipedia.org/wiki/Wikipedia:Tutorial
> I have only tested it with the benchmarking EnwikiDocMaker wikipedia stuff and it seems to do a decent job.
> Caveats:  I am not sure how to best handle testing, since the content is licensed under GNU Free Doc License, I believe I can't copy and paste a whole document into the unit test.  I have hand coded one doc and have another one that just generally runs over the benchmark Wikipedia download.
> One more question is where to put it.  It could go in analysis, but the tests at least will have a dependency on Benchmark.  I am thinking of adding a new contrib/wikipedia where this could grow to have other wikipedia things (perhaps we would move EnwikiDocMaker there????) and reverse the dependency on Benchmark.
> I will post a patch over the next few days.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Resolved: (LUCENE-1103) WikipediaTokenizer

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved LUCENE-1103.
-------------------------------------

       Resolution: Fixed
    Lucene Fields: [Patch Available]  (was: [New])

Committed.

> WikipediaTokenizer
> ------------------
>
>                 Key: LUCENE-1103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1103
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch
>
>
> I have extended StandardTokenizer to recognize Wikipedia syntax and mark tokens with certain attributes.  It isn't necessarily complete, but it does a good enough job for it to be consumed and improved by others.
> It sets the Token.type() value depending on the Wikipedia syntax (links, internal links, bold, italics, etc.) based on my pass at http://en.wikipedia.org/wiki/Wikipedia:Tutorial
> I have only tested it with the benchmarking EnwikiDocMaker wikipedia stuff and it seems to do a decent job.
> Caveats:  I am not sure how to best handle testing, since the content is licensed under GNU Free Doc License, I believe I can't copy and paste a whole document into the unit test.  I have hand coded one doc and have another one that just generally runs over the benchmark Wikipedia download.
> One more question is where to put it.  It could go in analysis, but the tests at least will have a dependency on Benchmark.  I am thinking of adding a new contrib/wikipedia where this could grow to have other wikipedia things (perhaps we would move EnwikiDocMaker there????) and reverse the dependency on Benchmark.
> I will post a patch over the next few days.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org