You are viewing a plain text version of this content. The canonical link for it is here.

Posted to solr-dev@lucene.apache.org by "Fergus McMenemie (JIRA)" <ji...@apache.org> on 2009/09/16 14:59:57 UTC

[jira] Created: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
----------------------------------------------------------------------------

                 Key: SOLR-1437
                 URL: https://issues.apache.org/jira/browse/SOLR-1437
             Project: Solr
          Issue Type: Improvement
          Components: contrib - DataImportHandler
    Affects Versions: 1.4
            Reporter: Fergus McMenemie
            Priority: Minor
             Fix For: 1.4


As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul updated SOLR-1437:
-----------------------------

    Attachment: masked_NPE.patch

This must be fixed anyway in the trunk . I am not opening another issue because my previous patch includes this fix too

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: masked_NPE.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul updated SOLR-1437:
-----------------------------

    Attachment: putNulls_using_bitset.patch

another enhancement to use a BitSet instead of a HashSet

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>             Fix For: 1.4
>
>         Attachments: masked_NPE.patch, putNulls_using_bitset.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fergus McMenemie updated SOLR-1437:
-----------------------------------

    Attachment: SOLR-1437.patch

incorporated SOLR-1465 into this patch!

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756453#action_12756453 ] 

Noble Paul edited comment on SOLR-1437 at 9/17/09 3:16 AM:
-----------------------------------------------------------

renaming done in  trunk

let us be clear about what all do we wish to achieve in wild-card support

* Do we wish to support wild-cards in the forEach parameter 
* How to handle the following case
example
{code}
d=  /a/b/c/d
e  = //e
{code}
and if the real  path to the tag e = /a/x/e . it would not really hard to traverse both the paths
{code]
Just keep in mind that It is hard to achieve everything. But is is easy to achieve a subset of features and meet most of the requirements. We are not trying to implement a complete framework for xml handling

      was (Author: noble.paul):
    renaming done in  trunk

let us be clear about what all do we wish to achieve in wild-card support

* Do we wish to support wild-cards in the forEach parameter 
* How to handle the following case
example
{code}
d=  /a/b/c/d
e  = //e
and if the real  path to the tag e = /a/x/e . it would not really hard to traverse both the paths
{code]
Just keep in mind that It is hard to achieve everything. But is is easy to achieve a subset of features and meet most of the requirements. We are not trying to implement a complete framework for xml handling
  
> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Priority: Minor
>             Fix For: 1.5
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762093#action_12762093 ] 

Noble Paul commented on SOLR-1437:
----------------------------------

bq.PutNulls: I can see what it is doing but I still dont know why it is needed.

This is not very useful if you are indexing directly w/o tranformers. Imagine , if you write a transformer which wouldhave to make some business decision if one value is missing, this helps. 

I see that you have optimized for the case where no wild card is used .

Fergus, I am +tive of putting this into 1.4 (if you are done with it). This is clearly a huge improvment over what was already existing . Whatever other enhancements are required , can be done .

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul resolved SOLR-1437.
------------------------------

    Resolution: Fixed

committed r822154

Thanks Fergus

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>             Fix For: 1.4
>
>         Attachments: masked_NPE.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756358#action_12756358 ] 

Noble Paul commented on SOLR-1437:
----------------------------------

for any normal event , parser.next(); should be called in each iteration. But for CDATA it should not do so because handling of CDATA itself would have consumed the next event

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Priority: Minor
>             Fix For: 1.5
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul updated SOLR-1437:
-----------------------------

    Attachment: SOLR-1437.patch

removed a few unused lines and formatted at some places.

Fergus, I am tempted to make it go in 1.4? 
It looks largely OK to me. What else do you plan to add ? Do we need more testcases?

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757091#action_12757091 ] 

Noble Paul commented on SOLR-1437:
----------------------------------

committed r816577
thanks Fergus

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758860#action_12758860 ] 

Fergus McMenemie commented on SOLR-1437:
----------------------------------------

Apologies! my testcase which uses the same Xpath with the forEach and field value works; once I removed the typo!



> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761146#action_12761146 ] 

Noble Paul commented on SOLR-1437:
----------------------------------

the putNulls are very important . I can probably add a testcase for that and probably it will be evident to you.

one another concern I have with the patch is that every node which is not relevant results in walking up the tree to figure out if there was a wild-card . So even if somebody does not use wild-cards he ends up paying the price. Each node can keep a boolean flag if that branch has a wild-card upstream and this flag has to be set at the time of building the tree.

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fergus McMenemie updated SOLR-1437:
-----------------------------------

    Attachment: SOLR-1437.patch

XPathRecordReader enhanced to allow use of //abc type expressions. Sorry but there are lots of changes!

Lots of comments added which attempt to clarify its operation. Ditto for TestXPathRecordReader where the test strings have been unwrapped and formatted to make them clearer.

Passes all tests except for TestErrorHandling.java which I think is sensitive to the new features.

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756469#action_12756469 ] 

Fergus McMenemie commented on SOLR-1437:
----------------------------------------

Good points.

I am still make sure I understand the existing code completely. Adding lots of comments.

I was not going to allow wild-cards in the forEach parameter. Can you think of a use case?

Your second point is covered. At this point my plan is to scrap skipTag, replacing it with equivalent inline code. Then after I have found some way of recording //tagnames within the node tree (I am considering the fact that currently it is invalid for the rootNode to have attributes; and I was thinking of abusing the rootNode attributes field and using it to store //tagname Node trees). Then where skipTag would currently be called, I would instead compare the parsed localname against the "attributes" of the rootTag.

Potential issues I intend to ignored are. If we have: 
{{
column="e" xpath="//e/f[@qualifier='fullTitle']"
column="d" xpath="/a/b/c/d" flatten="true"
}}

and the XML stream contains
{{
/a/b/c/d/e...
}}
Then I will not populate column 'e'.  Also the expression for column 'd' which is an absolute Xpath takes precedence over that for column 'e'.

Thoughts?

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762158#action_12762158 ] 

Noble Paul commented on SOLR-1437:
----------------------------------

the else return; was a bug . look at the mask_NPE patch . finally must not have a return statement it leads to suprression of exceptions

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: masked_NPE.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756450#action_12756450 ] 

Fergus McMenemie commented on SOLR-1437:
----------------------------------------

OK. Would it be bad if I recoded things around a renamed variable "eventAlreadyFetched"? IMHO it far more self-explanatory.

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Priority: Minor
>             Fix For: 1.5
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758664#action_12758664 ] 

Noble Paul commented on SOLR-1437:
----------------------------------

bq. inside parse() the valuesAddedinThisFrame HashSet and the Stack<Set<String>> stack variables are only used to aid in the clean up after out-puting record.

yes,

there is a testcase sameForEachAndXpath()  which has same forEach and field. So, something is strange  here

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756477#action_12756477 ] 

Noble Paul commented on SOLR-1437:
----------------------------------

bq. I was not going to allow wild-cards in the forEach parameter. Can you think of a use case?

not required.

If a path can be reached in a non-wildcard xpath it should take precedence over the wild-card one

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fergus McMenemie updated SOLR-1437:
-----------------------------------

    Comment: was deleted

(was: A patch of XPathRecordReader.java adding a few cosmetic changes to comments plus.....

A rewritten core of the parse() method. IMHO it is simpler to understand with the inner while and special paser.next state variable removed and it is probably faster as well.

Rearranged method getOrAddChild renaming it to getOrAddNode in preparation for adding support for // construct. This method is called from build().

While in build(), renamed local variable "name" to "xpseg" to better reflect that it holds a segment from an Xpath. And to remove warnings about local variables masking class variables.

"ant clean test" runs OK with these changes!)

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fergus McMenemie updated SOLR-1437:
-----------------------------------

    Attachment: SOLR-1437.patch

Good to see you reuse your own code!

This new patch is the same as the previous version excepting that the references to SOLR and datasource etc have been rewritten.

Also, Noble, can you check over and review my comments around line 237 in the file XPathRecordReader.java. Is this correct?

{code}
      } else {
        // can we ever get here? This means we are collecting for an Xpath
        // that is outwith any forEach expression
        if (attributes != null || hasText)
          valuesAddedinThisFrame = new HashSet<String>();
        stack.push(valuesAddedinThisFrame);
      }
{code}


> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762093#action_12762093 ] 

Noble Paul edited comment on SOLR-1437 at 10/4/09 10:52 PM:
------------------------------------------------------------

bq.PutNulls: I can see what it is doing but I still dont know why it is needed.

This is not very useful if you are indexing directly w/o tranformers. Imagine , if you write a transformer which wouldhave to make some business decision if one value is missing, this helps. 

I see that you have optimized for the case where no wild card is used .

Fergus, I am +tive of putting this into 1.4 (if you are done with it). This is clearly a huge improvment over what was already existing . Whatever other enhancements are required , can be done later.

      was (Author: noble.paul):
    bq.PutNulls: I can see what it is doing but I still dont know why it is needed.

This is not very useful if you are indexing directly w/o tranformers. Imagine , if you write a transformer which wouldhave to make some business decision if one value is missing, this helps. 

I see that you have optimized for the case where no wild card is used .

Fergus, I am +tive of putting this into 1.4 (if you are done with it). This is clearly a huge improvment over what was already existing . Whatever other enhancements are required , can be done .
  
> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fergus McMenemie updated SOLR-1437:
-----------------------------------

    Attachment:     (was: SOLR-1437.patch)

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756469#action_12756469 ] 

Fergus McMenemie edited comment on SOLR-1437 at 9/17/09 3:53 AM:
-----------------------------------------------------------------

Good points.

I am still make sure I understand the existing code completely. Adding lots of comments.

I was not going to allow wild-cards in the forEach parameter. Can you think of a use case?

Your second point is covered. At this point my plan is to scrap skipTag, replacing it with equivalent inline code. Then after I have found some way of recording //tagnames within the node tree (I am considering the fact that currently it is invalid for the rootNode to have attributes; and I was thinking of abusing the rootNode attributes field and using it to store //tagname Node trees). Then where skipTag would currently be called, I would instead compare the parsed localname against the "attributes" of the rootTag.

Potential issues I intend to ignored are. If we have: 
{noformat} 
column="e" xpath="//e/f[@qualifier='fullTitle']"
column="d" xpath="/a/b/c/d" flatten="true"
{noformat} 

and the XML stream contains
{noformat} 
/a/b/c/d/e...
{noformat} 
Then I will not populate column 'e'.  Also the expression for column 'd' which is an absolute Xpath takes precedence over that for column 'e'.

Thoughts?

      was (Author: fergus):
    Good points.

I am still make sure I understand the existing code completely. Adding lots of comments.

I was not going to allow wild-cards in the forEach parameter. Can you think of a use case?

Your second point is covered. At this point my plan is to scrap skipTag, replacing it with equivalent inline code. Then after I have found some way of recording //tagnames within the node tree (I am considering the fact that currently it is invalid for the rootNode to have attributes; and I was thinking of abusing the rootNode attributes field and using it to store //tagname Node trees). Then where skipTag would currently be called, I would instead compare the parsed localname against the "attributes" of the rootTag.

Potential issues I intend to ignored are. If we have: 
{{
column="e" xpath="//e/f[@qualifier='fullTitle']"
column="d" xpath="/a/b/c/d" flatten="true"
}}

and the XML stream contains
{{
/a/b/c/d/e...
}}
Then I will not populate column 'e'.  Also the expression for column 'd' which is an absolute Xpath takes precedence over that for column 'e'.

Thoughts?
  
> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758652#action_12758652 ] 

Fergus McMenemie commented on SOLR-1437:
----------------------------------------

Noble,

Playing with the code... some observations I would like confirmed.

1) inside parse() the valuesAddedinThisFrame HashSet and the Stack<Set<String>> stack variables are only used to aid in the clean up after out-puting  record.

2) The code seems unable to collect text for a forEach xpath. So for the following fragment of code

{code}
    String xml="<root>\n"
             + "  <status>live</status>\n"
             + "  <contenido id=\"10097\" idioma=\"cat\">\n"
             + "    Cats can be cute\n"
             + "    <antetitulo></antetitulo>\n"
             + "    <titulo>\n           This is my title\n    </titulo>\n"
             + "    <resumen>\n          This is my summary\n   </resumen>\n"
             + "    <texto>\n     This is the body of my text\n   </texto>\n"
             + "    </contenido>\n"
             + "</root>";
    XPathRecordReader rr = new XPathRecordReader("/root/contenido");
    rr.addField("cat"   ,"/root/contenido", false); //  ***** FAILS *****
    rr.addField("id",    "/root/contenido/@id", false);
{code}

we can get the string associated with the id attrbute of <contenido> but not its child text! Is this a design goal, or just the way the code ended up behaving. Do we want it to continue to work this way?

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul updated SOLR-1437:
-----------------------------

    Attachment: SOLR-1437.patch

There is a crazy bug (i've added a testcase) we need to fix this before committing

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757823#action_12757823 ] 

Noble Paul commented on SOLR-1437:
----------------------------------

DIH needs streaming of a records which contain a  bunch of fields.  The Tika one is not designed for that kind of consumption

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul reassigned SOLR-1437:
--------------------------------

    Assignee: Noble Paul

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756051#action_12756051 ] 

Fergus McMenemie commented on SOLR-1437:
----------------------------------------

A pity we may not make the 1.4 release, but I guess there is no harm in trying!

Looking through the code for XPathRecordReader I see a variable skipNextEvent inside the parse method. Can anybody explain why we need to skip an event at the end of a text block?

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Priority: Minor
>             Fix For: 1.5
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758663#action_12758663 ] 

Noble Paul commented on SOLR-1437:
----------------------------------

it is not an intentional. 

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756916#action_12756916 ] 

Noble Paul commented on SOLR-1437:
----------------------------------

the patch looks fine.

* It is not created from the latest trunk. SO it did not apply
* Remove the references to Solr (and other stuff like dataSource) in the documentation. XPathRecordReader is completely independent of Solr.  We actually use it extensively in our internal projects for xml parsing

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757795#action_12757795 ] 

Lance Norskog commented on SOLR-1437:
-------------------------------------

Should we just steal the streaming XPath implementation from TIKA? Does it do all we need?

Note that it seems to only be used for their OpenDocument readers, so it might not have the breadth of testing that we would need.

[http://lucene.apache.org/tika/apidocs/org/apache/tika/sax/xpath/package-tree.html]

[http://issues.apache.org/jira/browse/TIKA-116]
[http://issues.apache.org/jira/browse/TIKA-129]

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756469#action_12756469 ] 

Fergus McMenemie edited comment on SOLR-1437 at 9/17/09 3:55 AM:
-----------------------------------------------------------------

Good points.

I am still making sure I understand the existing code completely. Adding lots of comments.

I was not going to allow wild-cards in the forEach parameter. Can you think of a use case?

Your second point is covered. At this point my plan is to scrap skipTag, replacing it with equivalent inline code. Then after I have found some way of recording //tagnames within the node tree (I am considering the fact that currently it is invalid for the rootNode to have attributes; and I was thinking of abusing the rootNode attributes field and using it to store //tagname Node trees). Then where skipTag would currently be called, I would instead compare the parsed localname against the "attributes" of the rootTag.

Potential issues I intend to ignored are. If we have: 
{noformat} 
column="e" xpath="//e/f[@qualifier='fullTitle']"
column="d" xpath="/a/b/c/d" flatten="true"
{noformat} 

and the XML stream contains
{noformat} 
/a/b/c/d/e...
{noformat} 
Then I will not populate column 'e'.  Also the expression for column 'd' which is an absolute Xpath takes precedence over that for column 'e'.

Thoughts?

      was (Author: fergus):
    Good points.

I am still make sure I understand the existing code completely. Adding lots of comments.

I was not going to allow wild-cards in the forEach parameter. Can you think of a use case?

Your second point is covered. At this point my plan is to scrap skipTag, replacing it with equivalent inline code. Then after I have found some way of recording //tagnames within the node tree (I am considering the fact that currently it is invalid for the rootNode to have attributes; and I was thinking of abusing the rootNode attributes field and using it to store //tagname Node trees). Then where skipTag would currently be called, I would instead compare the parsed localname against the "attributes" of the rootTag.

Potential issues I intend to ignored are. If we have: 
{noformat} 
column="e" xpath="//e/f[@qualifier='fullTitle']"
column="d" xpath="/a/b/c/d" flatten="true"
{noformat} 

and the XML stream contains
{noformat} 
/a/b/c/d/e...
{noformat} 
Then I will not populate column 'e'.  Also the expression for column 'd' which is an absolute Xpath takes precedence over that for column 'e'.

Thoughts?
  
> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fergus McMenemie updated SOLR-1437:
-----------------------------------

    Attachment: SOLR-1437.patch

Sorted; added another check for null.

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: masked_NPE.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759211#action_12759211 ] 

Fergus McMenemie commented on SOLR-1437:
----------------------------------------

Still trying to understand all the ins and outs of the code! 

Around line 250 in XPathRecordReader.java we have the following expression "recordStarted && !isRecord". Why is this done? Why do we want to call putText() with a null value? Why do we only call putText Nodes where isRecord is false?

{code}
      try {
        if (attributes != null) {
          for (Node node : attributes) {
            String value = parser.getAttributeValue(null, node.name);
            if (value != null || (recordStarted && !isRecord)) {
              putText(values, value, node.fieldName, node.multiValued);
              valuesAddedinThisFrame.add(node.fieldName);
            }
          }
        }
{code}

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fergus McMenemie updated SOLR-1437:
-----------------------------------

    Attachment: SOLR-1437.patch

I propose to have this patch which only addresses the //* issue submitted in time for 1.4 and close this issue.

The other new feature and sorting out pushNulls can be left for later.

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761153#action_12761153 ] 

Fergus McMenemie commented on SOLR-1437:
----------------------------------------

Re your concern. Correct, i was was also concerned about this, and of course how bad it is depends on the nature of your data and the tree of interesting nodes. However, testing indicated it did not seem to cause to much of an impact. It is not quite as bad as you say; it has to walk up the tree ONCE , not for every node which is not relivent, but rather every time we depart from the list of interesting nodes. Nevertheless your solution is solution is a good one.


> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762156#action_12762156 ] 

Fergus McMenemie commented on SOLR-1437:
----------------------------------------

Hmmm, the version I uploaded works. I think that while polishing my code you removed an "else return;" from XPathRecordReader.java around line 317.

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: masked_NPE.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul updated SOLR-1437:
-----------------------------

         Priority: Major  (was: Minor)
    Fix Version/s:     (was: 1.5)
                   1.4

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>             Fix For: 1.4
>
>         Attachments: masked_NPE.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762128#action_12762128 ] 

Fergus McMenemie commented on SOLR-1437:
----------------------------------------

OK, having a look at it right now.

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: masked_NPE.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762156#action_12762156 ] 

Fergus McMenemie edited comment on SOLR-1437 at 10/5/09 4:47 AM:
-----------------------------------------------------------------

Hmmm, the version I uploaded works. I think that while polishing my code you removed an "else return;" from XPathRecordReader.java around line 317. I guess you thought it was a redundant statement, not sure what is going on at all. But I think that code is unchanged from the original.

      was (Author: fergus):
    Hmmm, the version I uploaded works. I think that while polishing my code you removed an "else return;" from XPathRecordReader.java around line 317.
  
> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: masked_NPE.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fergus McMenemie updated SOLR-1437:
-----------------------------------

    Attachment: SOLR-1437.patch

regenerated patch against latest trunk

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761496#action_12761496 ] 

Fergus McMenemie commented on SOLR-1437:
----------------------------------------

I am doing more work on this to:
* improve performance by avoiding having to walk back up tree
* to review use of putNulls
* avoid emitting parent nodes of an emitted record

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fergus McMenemie updated SOLR-1437:
-----------------------------------

    Attachment:     (was: SOLR-1437.patch)

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756453#action_12756453 ] 

Noble Paul commented on SOLR-1437:
----------------------------------

renaming done in  trunk

let us be clear about what all do we wish to achieve in wild-card support

* Do we wish to support wild-cards in the forEach parameter 
* How to handle the following case
example
{code}
d=  /a/b/c/d
e  = //e
and if the real  path to the tag e = /a/x/e . it would not really hard to traverse both the paths
{code]
Just keep in mind that It is hard to achieve everything. But is is easy to achieve a subset of features and meet most of the requirements. We are not trying to implement a complete framework for xml handling

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Priority: Minor
>             Fix For: 1.5
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul updated SOLR-1437:
-----------------------------

    Fix Version/s:     (was: 1.4)
                   1.5

it may not be viable to target this for 1.4

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Priority: Minor
>             Fix For: 1.5
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fergus McMenemie updated SOLR-1437:
-----------------------------------

    Attachment: SOLR-1437.patch

A patch of XPathRecordReader.java adding a few cosmetic changes to comments plus.....

A rewritten core of the parse() method. IMHO it is simpler to understand with the inner while and special paser.next state variable removed and it is probably faster as well.

Rearranged method getOrAddChild renaming it to getOrAddNode in preparation for adding support for // construct. This method is called from build().

While in build(), renamed local variable "name" to "xpseg" to better reflect that it holds a segment from an Xpath. And to remove warnings about local variables masking class variables.

"ant clean test" runs OK with these changes!


> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756453#action_12756453 ] 

Noble Paul edited comment on SOLR-1437 at 9/17/09 3:16 AM:
-----------------------------------------------------------

renaming done in  trunk

let us be clear about what all do we wish to achieve in wild-card support

* Do we wish to support wild-cards in the forEach parameter 
* How to handle the following case
example
{code}
d=  /a/b/c/d
e  = //e
{code}
and if the real  path to the tag e = /a/x/e . it would not really hard to traverse both the paths

Just keep in mind that It is hard to achieve everything. But is is easy to achieve a subset of features and meet most of the requirements. We are not trying to implement a complete framework for xml handling

      was (Author: noble.paul):
    renaming done in  trunk

let us be clear about what all do we wish to achieve in wild-card support

* Do we wish to support wild-cards in the forEach parameter 
* How to handle the following case
example
{code}
d=  /a/b/c/d
e  = //e
{code}
and if the real  path to the tag e = /a/x/e . it would not really hard to traverse both the paths
{code]
Just keep in mind that It is hard to achieve everything. But is is easy to achieve a subset of features and meet most of the requirements. We are not trying to implement a complete framework for xml handling
  
> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Priority: Minor
>             Fix For: 1.5
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760895#action_12760895 ] 

Noble Paul commented on SOLR-1437:
----------------------------------

Fergus can you give the patch updated to trunk?

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756476#action_12756476 ] 

Noble Paul commented on SOLR-1437:
----------------------------------

try not to make too many changes. If you have javadocs in your patch give it as a separate patch and I shall commit it immedietly .  

Keeping the patch small can make it easy for me to review it.

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762094#action_12762094 ] 

Fergus McMenemie commented on SOLR-1437:
----------------------------------------

I am done!

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul updated SOLR-1437:
-----------------------------

    Attachment: SOLR-1437.patch

any_decendent_from_root fails

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fergus McMenemie updated SOLR-1437:
-----------------------------------

    Attachment:     (was: SOLR-1437.patch)

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760602#action_12760602 ] 

Fergus McMenemie commented on SOLR-1437:
----------------------------------------

Making good progress! The //expression and its testcase is done and appear to work well. I would have submitted a patch already except that having only altered "XPathRecordReader.java" I am now trying to get "ant clean test" to stop failing in org.apache.solr.handler.dataimport.TestErrorHandling.

Also while I do not yet have a total grasp of the code I think it is possible to have XPathRecordReader.java only return fields which are children of the forEach expression when a record is emitted. I consider this desirable and would like to press ahead with this extra change; are there other views?

Having established that "Set<String> valuesAddedinThisFrame" and "Stack<Set<String>> stack", in Node are only used for clearing up after a record has been emitted. I am now pondering the purpose of the HashSet childrenFound and putNulls. Any suggestions?

Lastly what are the chances of this making the 1.4 release?

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761124#action_12761124 ] 

Fergus McMenemie commented on SOLR-1437:
----------------------------------------

I am quite pleased with it as far as it goes and think it would be good for 1.4. I have tested it against my test set of 3000 XML documents and replacing:

{code}
        <field column="para1" name="text"                xpath="/record/sect1/para" flatten="true"/>
        <field column="para2" name="text"                xpath="/record/list/listitem/para" flatten="true"/>
        <field column="para32"     name="text"                        xpath="/record/address/para"  flatten="true" />
        <field column="para40"     name="text"                        xpath="/record/authoredBy/para"  flatten="true" />
        <field column="para43"     name="text"                        xpath="/record/dataGroup/address/para"  flatten="true" />
        <field column="para47"     name="text"                        xpath="/record/dataGroup/keyPersonnel/doubleList/first/para"  flatten="true" />
        <field column="para49"     name="text"                        xpath="/record/dataGroup/keyPersonnel/doubleList/second/para"  flatten="true" />
        <field column="para50"     name="text"                        xpath="/record/dataGroup/keyPersonnel/para"  flatten="true" />
        <field column="para51"     name="text"                        xpath="/record/dataGroup/para"  flatten="true" />
        <field column="para57"     name="text"                        xpath="/record/doubleList/first/para"  flatten="true" />
        <field column="para59"     name="text"                        xpath="/record/doubleList/second/para"  flatten="true" />
        <field column="para63"     name="text"                        xpath="/record/keyPersonnel/doubleList/first/para"  flatten="true" />
        <field column="para65"     name="text"                        xpath="/record/keyPersonnel/doubleList/second/para"  flatten="true" />
        <field column="para68"     name="text"                        xpath="/record/list/listItem/para"  flatten="true" />
        <field column="para75"     name="text"                        xpath="/record/mediaBlock/doubleList/first/para"  flatten="true" />
        <field column="para77"     name="text"                        xpath="/record/mediaBlock/doubleList/second/para"  flatten="true" />
        <field column="para172"     name="text"                        xpath="/record/noteGroup/note/para"  flatten="true" />
        <field column="para174"     name="text"                        xpath="/record/para"  flatten="true" />
        <field column="para179"     name="text"                        xpath="/record/relatedInfo/list/listItem/relatedArticle/para"  flatten="true" />
        <field column="para184"     name="text"                        xpath="/record/sect1/address/dataGroup/para"  flatten="true" />
        <field column="para185"     name="text"                        xpath="/record/sect1/address/para"  flatten="true" />
        <field column="para195"     name="text"                        xpath="/record/sect1/dataGroup/address/para"  flatten="true" />
        <field column="para199"     name="text"                        xpath="/record/sect1/dataGroup/keyPersonnel/doubleList/first/para"  flatten="true" />
        <field column="para201"     name="text"                        xpath="/record/sect1/dataGroup/keyPersonnel/doubleList/second/para"  flatten="true" />
        <field column="para202"     name="text"                        xpath="/record/sect1/dataGroup/keyPersonnel/para"  flatten="true" />
        <field column="para203"     name="text"                        xpath="/record/sect1/dataGroup/para"  flatten="true" />
        <field column="para208"     name="text"                        xpath="/record/sect1/doubleList/first/para"  flatten="true" />
        <field column="para212"     name="text"                        xpath="/record/sect1/doubleList/second/list/listItem/para"  flatten="true" />
        <field column="para213"     name="text"                        xpath="/record/sect1/doubleList/second/para"  flatten="true" />
        <field column="para217"     name="text"                        xpath="/record/sect1/keyPersonnel/doubleList/first/para"  flatten="true" />
        <field column="para219"     name="text"                        xpath="/record/sect1/keyPersonnel/doubleList/second/para"  flatten="true" />
        <field column="para220"     name="text"                        xpath="/record/sect1/keyPersonnel/para"  flatten="true" />
        <field column="para225"     name="text"                        xpath="/record/sect1/list/listItem/list/listItem/para"  flatten="true" />
        <field column="para226"     name="text"                        xpath="/record/sect1/list/listItem/para"  flatten="true" />
        <field column="para240"     name="text"                        xpath="/record/sect1/para"  flatten="true" />
        <field column="para244"     name="text"                        xpath="/record/sect1/sect2/doubleList/first/para"  flatten="true" />
        <field column="para246"     name="text"                        xpath="/record/sect1/sect2/doubleList/second/para"  flatten="true" />
        <field column="para251"     name="text"                        xpath="/record/sect1/sect2/list/listItem/list/listItem/para"  flatten="true" />
        <field column="para252"     name="text"                        xpath="/record/sect1/sect2/list/listItem/para"  flatten="true" />
        <field column="para258"     name="text"                        xpath="/record/sect1/sect2/noteGroup/note/para"  flatten="true" />
        <field column="para259"     name="text"                        xpath="/record/sect1/sect2/para"  flatten="true" />
        <field column="para265"     name="text"                        xpath="/record/sect1/sect2/sect3/list/listItem/list/listItem/para"  flatten="true" />
        <field column="para266"     name="text"                        xpath="/record/sect1/sect2/sect3/list/listItem/para"  flatten="true" />
        <field column="para271"     name="text"                        xpath="/record/sect1/sect2/sect3/para"  flatten="true" />
        <field column="para275"     name="text"                        xpath="/record/sect1/sect2/sect3/sect4/list/listItem/para"  flatten="true" />
        <field column="para279"     name="text"                        xpath="/record/sect1/sect2/sect3/sect4/para"  flatten="true" />
        <field column="para284"     name="text"                        xpath="/record/sect1/sect2/sect3/sect4/sect5/para"  flatten="true" />
        <field column="para295"     name="text"                        xpath="/record/sect1/sect2/sect3/table/tgroup/tbody/row/entry/noteGroup/note/para"  flatten="true" />
        <field column="para297"     name="text"                        xpath="/record/sect1/sect2/sect3/table/tgroup/tbody/row/entry/para"  flatten="true" />
        <field column="para301"     name="text"                        xpath="/record/sect1/sect2/sect3/table/tgroup/thead/row/entry/para"  flatten="true" />
        <field column="para312"     name="text"                        xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/list/listItem/para"  flatten="true" />
        <field column="para315"     name="text"                        xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/noteGroup/note/para"  flatten="true" />
        <field column="para316"     name="text"                        xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/noteGroup/para"  flatten="true" />
        <field column="para318"     name="text"                        xpath="/record/sect1/sect2/table/tgroup/tbody/row/entry/para"  flatten="true" />
        <field column="para322"     name="text"                        xpath="/record/sect1/sect2/table/tgroup/thead/row/entry/para"  flatten="true" />
        <field column="para341"     name="text"                        xpath="/record/sect1/table/tgroup/tbody/row/entry/noteGroup/note/para"  flatten="true" />
        <field column="para342"     name="text"                        xpath="/record/sect1/table/tgroup/tbody/row/entry/noteGroup/para"  flatten="true" />
        <field column="para344"     name="text"                        xpath="/record/sect1/table/tgroup/tbody/row/entry/para"  flatten="true" />
        <field column="para348"     name="text"                        xpath="/record/sect1/table/tgroup/thead/row/entry/para"  flatten="true" />
        <field column="para371"     name="text"                        xpath="/record/table/tgroup/tbody/row/entry/noteGroup/note/para"  flatten="true" />
        <field column="para373"     name="text"                        xpath="/record/table/tgroup/tbody/row/entry/para"  flatten="true" />
        <field column="para377"     name="text"                        xpath="/record/table/tgroup/thead/row/entry/para"  flatten="true" />
{code]

with 

{code}
       <field column="text"                             xpath="//para" flatten="true"/>
{code}

The indexes seemed equivalent and time to index was also equivalent.

I have one concern which should be addressed before any 1.4 release. I still do not understand the purpose of the HashSet childrenFound and putNulls, if its important then I suspect that whatever is done to childNodes when an end_element is parsed also needs done to descNodes; but I have a feeling the whole lot may be unnecessary and can be removed. If it is required we need to explain it.

The last change I would like to see, which I am happy to leave to 1.5, involves making sure emitted records do not contain tags from parent nodes unless they are stipulated by "commonField"

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fergus McMenemie updated SOLR-1437:
-----------------------------------

    Attachment: SOLR-1437.patch

A first patch to XPathRecordReader.java. These changes pass "ant clean test" and my own test cases. The changes are all about adding or improving comments and JavaDoc. There are very minor code changes where method definitions are changed from single to multiple lines, to allow descriptions to be added against the method arguments.

One bigger code change involves removal of the skiptag method and its replacement with some in-iine code in the one place it was called from.

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761838#action_12761838 ] 

Fergus McMenemie commented on SOLR-1437:
----------------------------------------

OK!

* I have a better method of dealing with //* searches.
* I think I know how to only emitting fields associated with the current record.
* PutNulls: I can see what it is doing but I still dont know why it is needed. I could understand if there was a requirement that every valid hasText node for a record must be defined or null; but I dont think that is what it id doing? Can you help?

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1437) DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fergus McMenemie updated SOLR-1437:
-----------------------------------

    Attachment: SOLR-1437.patch

A patch of XPathRecordReader.java adding a few cosmetic changes to comments plus.....

A rewritten core of the parse() method. IMHO it is simpler to understand with the inner while and special paser.next state variable removed and it is probably faster as well.

Rearranged method getOrAddChild renaming it to getOrAddNode in preparation for adding support for // construct. This method is called from build().

While in build(), renamed local variable "name" to "xpseg" to better reflect that it holds a segment from an Xpath. And to remove warnings about local variables masking class variables.

"ant clean test" runs OK with these changes!

> DIH: Enhance XPathRecordReader to deal with //tagname and other improvments.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1437
>                 URL: https://issues.apache.org/jira/browse/SOLR-1437
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Fergus McMenemie
>            Assignee: Noble Paul
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1437.patch, SOLR-1437.patch, SOLR-1437.patch
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> As per http://www.nabble.com/Re%3A-Extract-info-from-parent-node-during-data-import-%28redirect%3A%29-td25471162.html it would be nice to be able to use expressions such as //tagname when parsing XML documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.