You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Mike Sokolov (JIRA)" <ji...@apache.org> on 2011/06/15 15:12:47 UTC

[jira] [Created] (SOLR-2597) XmlCharFilter

XmlCharFilter
-------------

                 Key: SOLR-2597
                 URL: https://issues.apache.org/jira/browse/SOLR-2597
             Project: Solr
          Issue Type: Improvement
          Components: Schema and Analysis
    Affects Versions: 4.0
            Reporter: Mike Sokolov


This CharFilter processes incoming XML using the Woodstox parser, stripping all non-text content and remembering offsets, just like HTMLCharFilter, but respecting XML conventions like XML entities defined in a DTD.  XmlCharFilter also provides the ability to exclude (and include) the content of certain named elements.

In order to compute character offsets properly when mixed line termination styles are present (\r, \r\n), or when XML character entities (&lt;, &quot;, &amp;) are present, we require a newer version of Woodstox (4.1.1) than is currently in solr/lib.  The earlier versions of the parser could not report these entity events, so we couldn't tell the difference between "<" and "&lt;" and the offsets could be wrong.  The upgraded version is in the patch.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [jira] [Created] (SOLR-2597) XmlCharFilter

Posted by Mike Sokolov <so...@ifactory.com>.

Perhaps that name would be more consistent with HTMLStripCharFilter, 
yes, but it wasn't the one I was using.  Also - I mean to post a patch 
here, but left the important files on a machine which is inaccessible at 
the moment, so I will post this evening.

-Mike

On 06/15/2011 09:28 AM, Koji Sekiguchi wrote:
> Did you mean Xml*Strip*CharFilter?
>
> koji

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [jira] [Created] (SOLR-2597) XmlCharFilter

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.

Did you mean Xml*Strip*CharFilter?

koji
-- 
http://www.rondhuit.com/en/

(11/06/15 22:12), Mike Sokolov (JIRA) wrote:
> XmlCharFilter
> -------------
>
>                   Key: SOLR-2597
>                   URL: https://issues.apache.org/jira/browse/SOLR-2597
>               Project: Solr
>            Issue Type: Improvement
>            Components: Schema and Analysis
>      Affects Versions: 4.0
>              Reporter: Mike Sokolov
>
>
> This CharFilter processes incoming XML using the Woodstox parser, stripping all non-text content and remembering offsets, just like HTMLCharFilter, but respecting XML conventions like XML entities defined in a DTD.  XmlCharFilter also provides the ability to exclude (and include) the content of certain named elements.
>
> In order to compute character offsets properly when mixed line termination styles are present (\r, \r\n), or when XML character entities (&lt;,&quot;,&amp;) are present, we require a newer version of Woodstox (4.1.1) than is currently in solr/lib.  The earlier versions of the parser could not report these entity events, so we couldn't tell the difference between "<" and"&lt;" and the offsets could be wrong.  The upgraded version is in the patch.
>
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (SOLR-2597) XmlCharFilter

Posted by "Mike Sokolov (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Sokolov updated SOLR-2597:
-------------------------------

    Attachment: SOLR-2597.patch

Updated patch addresses (most of) Robert and Hoss' comments (thanks for the speedy review!):

Test now uses the random in the test framework

I added a test for the factory (actually all the tests now use the factory since it is now used to create the parser), but I haven't plumbed this all the way through to a schema declaration. 

Moved to org.apache.solr.analysis: I don't know if this is the right place for this, but at least it should resolve any jar and java 1.6 dependency problems - I think? - at least I can compile and run the tests from both eclipse and ant command line although I'm not sure what that proves exactly.

The parser is now created in the factory rather than being maintained as a static in the reader class.

> XmlCharFilter
> -------------
>
>                 Key: SOLR-2597
>                 URL: https://issues.apache.org/jira/browse/SOLR-2597
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 4.0
>            Reporter: Mike Sokolov
>         Attachments: SOLR-2597.patch, SOLR-2597.patch
>
>
> This CharFilter processes incoming XML using the Woodstox parser, stripping all non-text content and remembering offsets, just like HTMLCharFilter, but respecting XML conventions like XML entities defined in a DTD.  XmlCharFilter also provides the ability to exclude (and include) the content of certain named elements.
> In order to compute character offsets properly when mixed line termination styles are present (\r, \r\n), or when XML character entities (&lt;, &quot;, &amp;) are present, we require a newer version of Woodstox (4.1.1) than is currently in solr/lib.  The earlier versions of the parser could not report these entity events, so we couldn't tell the difference between "<" and "&lt;" and the offsets could be wrong.  The upgraded version is in the patch.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-2597) XmlCharFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050406#comment-13050406 ] 

Robert Muir commented on SOLR-2597:
-----------------------------------

yes, the factories are long-lived and do expensive things up-front to configure themselves (parsing files etc)


> XmlCharFilter
> -------------
>
>                 Key: SOLR-2597
>                 URL: https://issues.apache.org/jira/browse/SOLR-2597
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 4.0
>            Reporter: Mike Sokolov
>         Attachments: SOLR-2597.patch
>
>
> This CharFilter processes incoming XML using the Woodstox parser, stripping all non-text content and remembering offsets, just like HTMLCharFilter, but respecting XML conventions like XML entities defined in a DTD.  XmlCharFilter also provides the ability to exclude (and include) the content of certain named elements.
> In order to compute character offsets properly when mixed line termination styles are present (\r, \r\n), or when XML character entities (&lt;, &quot;, &amp;) are present, we require a newer version of Woodstox (4.1.1) than is currently in solr/lib.  The earlier versions of the parser could not report these entity events, so we couldn't tell the difference between "<" and "&lt;" and the offsets could be wrong.  The upgraded version is in the patch.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (SOLR-2597) XmlCharFilter

Posted by "Mike Sokolov (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Sokolov updated SOLR-2597:
-------------------------------

    Attachment: SOLR-2597.patch

I tried to include the upgraded Woodstox jars, but I don't think I figured how to put binaries in the patch actually.  What's needed are: http://repository.codehaus.org/org/codehaus/woodstox/woodstox-core-asl/4.1.1/woodstox-core-asl-4.1.1.jar and http://repository.codehaus.org/org/codehaus/woodstox/stax2-api/3.1.1/stax2-api-3.1.1.jar
which replace the existing wstx-asl-xxx.jar. 

> XmlCharFilter
> -------------
>
>                 Key: SOLR-2597
>                 URL: https://issues.apache.org/jira/browse/SOLR-2597
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 4.0
>            Reporter: Mike Sokolov
>         Attachments: SOLR-2597.patch
>
>
> This CharFilter processes incoming XML using the Woodstox parser, stripping all non-text content and remembering offsets, just like HTMLCharFilter, but respecting XML conventions like XML entities defined in a DTD.  XmlCharFilter also provides the ability to exclude (and include) the content of certain named elements.
> In order to compute character offsets properly when mixed line termination styles are present (\r, \r\n), or when XML character entities (&lt;, &quot;, &amp;) are present, we require a newer version of Woodstox (4.1.1) than is currently in solr/lib.  The earlier versions of the parser could not report these entity events, so we couldn't tell the difference between "<" and "&lt;" and the offsets could be wrong.  The upgraded version is in the patch.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-2597) XmlCharFilter

Posted by "Mike Sokolov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050370#comment-13050370 ] 

Mike Sokolov commented on SOLR-2597:
------------------------------------

OK - I can extend LuceneTestCase, use its random, add can certainly a test for the Factory.

I'm not sure what the right package for this code is; working in Eclipse of course, all the jars get mushed into one giant classpath.  I guess I should build w/ant to see the dependency issues?  But it does sound as if it needs to move somewhere where solr/lib contents can be a dependent.

Apparently there is another jar you can get (http://woodstox.codehaus.org/stax-api-1.0.1.jar) to provide the javax.xml.stream package (StaX) for Java 5, but it doesn't sound as if it would be worth the trouble if this moves into solr land - is that right, can we rely on Java 6 there? 

I agree that having a static parser is distasteful, but it's a performance optimization.  It tends to be expensive to instantiate these parsers.  I'm not clear on what the object lifecycle for the XmlCharFilter is exactly - Robert are you saying the factory is long-lived, but the filter is not?

> XmlCharFilter
> -------------
>
>                 Key: SOLR-2597
>                 URL: https://issues.apache.org/jira/browse/SOLR-2597
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 4.0
>            Reporter: Mike Sokolov
>         Attachments: SOLR-2597.patch
>
>
> This CharFilter processes incoming XML using the Woodstox parser, stripping all non-text content and remembering offsets, just like HTMLCharFilter, but respecting XML conventions like XML entities defined in a DTD.  XmlCharFilter also provides the ability to exclude (and include) the content of certain named elements.
> In order to compute character offsets properly when mixed line termination styles are present (\r, \r\n), or when XML character entities (&lt;, &quot;, &amp;) are present, we require a newer version of Woodstox (4.1.1) than is currently in solr/lib.  The earlier versions of the parser could not report these entity events, so we couldn't tell the difference between "<" and "&lt;" and the offsets could be wrong.  The upgraded version is in the patch.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-2597) XmlCharFilter

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050184#comment-13050184 ] 

Hoss Man commented on SOLR-2597:
--------------------------------

Mike: thanks for the patch!

as Koji mentioned on the mailing list, might want to consider naming this XmlStripCharFilter ... that was my first opinion, but reading the docs the "include" and "exclude" options definitely make it a bit more generic, so i'm leaning towards the opinion that XmlCharFilter is better.

(there's an argument to be made that we should have an XmlStripCharFilter that only removes pi/comments/whitespace and resolves entities, and then a distinct XmlTagCharFilter that does the include/exclude -- but i'm guessing that would be less efficient since this makes it possible to do in one pass, and anyone who wants include/exclude at the "tag" level is almost certainly going to want the striping/entities as well)

skiming the patch i'm +1 except for the "new Random" in the test case ... if you take a look at the existing test cases you'll see how you can hook into the solr test framework to get random values that are consistent with a global seed -- that way if a test fails, it will report which seed was used and people can reproduce it using system properties.

would also be nice to have a test of the Factory (using a schema.xml declaration) but that's not nearly as important.

and of course: would be great if "the xml policeman" uwe could review.

bq. I tried to include the upgraded Woodstox jars, but I don't think I figured how to put binaries in the patch actually.

it's not possible, so don't worry about it.  the important thing is noting in a comment (like you did) exactly what new/upgraded jars are needed.


> XmlCharFilter
> -------------
>
>                 Key: SOLR-2597
>                 URL: https://issues.apache.org/jira/browse/SOLR-2597
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 4.0
>            Reporter: Mike Sokolov
>         Attachments: SOLR-2597.patch
>
>
> This CharFilter processes incoming XML using the Woodstox parser, stripping all non-text content and remembering offsets, just like HTMLCharFilter, but respecting XML conventions like XML entities defined in a DTD.  XmlCharFilter also provides the ability to exclude (and include) the content of certain named elements.
> In order to compute character offsets properly when mixed line termination styles are present (\r, \r\n), or when XML character entities (&lt;, &quot;, &amp;) are present, we require a newer version of Woodstox (4.1.1) than is currently in solr/lib.  The earlier versions of the parser could not report these entity events, so we couldn't tell the difference between "<" and "&lt;" and the offsets could be wrong.  The upgraded version is in the patch.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-2597) XmlCharFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050189#comment-13050189 ] 

Robert Muir commented on SOLR-2597:
-----------------------------------

just one comment: taking a look at the patch, it currently won't compile because the analysis module has no dependencies and thus no woodstox or whatever.
(but, thanks for trying to integrate it here!!!)

One step would be, rather than have this thing static, can we just have the ctors to this thing take a general XMLInputFactory instead, e.g.
{noformat}
public XmlCharFilter (CharStream reader, XMLInputFactory inputFactory) {
{noformat}

The corresponding Solr CharFilterFactory could then configure it with all the woodstox-specific parameters.
But, this still wouldn't solve the issue that all of lucene and modules are on java5 (and it looks like this uses java6-specific APIs).

I don't think it makes sense to block the patch for these issues, so one workaround would be to just add it to Solr-only.
If/when we ever move to java 6 in lucene we could then move it into the analysis module.
Another option would be if the XML policeman knows some workaround (sorry, not my thing).


> XmlCharFilter
> -------------
>
>                 Key: SOLR-2597
>                 URL: https://issues.apache.org/jira/browse/SOLR-2597
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 4.0
>            Reporter: Mike Sokolov
>         Attachments: SOLR-2597.patch
>
>
> This CharFilter processes incoming XML using the Woodstox parser, stripping all non-text content and remembering offsets, just like HTMLCharFilter, but respecting XML conventions like XML entities defined in a DTD.  XmlCharFilter also provides the ability to exclude (and include) the content of certain named elements.
> In order to compute character offsets properly when mixed line termination styles are present (\r, \r\n), or when XML character entities (&lt;, &quot;, &amp;) are present, we require a newer version of Woodstox (4.1.1) than is currently in solr/lib.  The earlier versions of the parser could not report these entity events, so we couldn't tell the difference between "<" and "&lt;" and the offsets could be wrong.  The upgraded version is in the patch.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org