Posted to solr-dev@lucene.apache.org by "Fergus McMenemie (JIRA)" <ji...@apache.org> on 2009/02/01 23:15:59 UTC

[jira] Created: (SOLR-1000) DIH FileListEntityProcessor fileName filters directory names and stops recursion

DIH FileListEntityProcessor fileName filters directory names and stops recursion 
---------------------------------------------------------------------------------

                 Key: SOLR-1000
                 URL: https://issues.apache.org/jira/browse/SOLR-1000
             Project: Solr
          Issue Type: Improvement
          Components: contrib - DataImportHandler
    Affects Versions: 1.3
            Reporter: Fergus McMenemie


I have been trying to find out why DIH in FileListEntityProcessor mode did not appear to be recursing into subdirectories. Going through FileListEntityProcessor.java I eventually tumbled to the fact that my filename filter setting from data-config.xml also applied to directory names.

Now, I feel that the fileName filter should be applied to the files fed into the parser; it should not be applied to the directory names we are recursing through. I bodged the code to adjust the behavior so that the "fileName" and "excludes" attributes of "entity" only apply to filenames and not directory names. It now recurses through my directory tree, only indexing the appropriate files! I think the new behavior is more standard.
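
For readers following along, the behavior described above can be sketched roughly as follows. This is not the DIH source or the attached patch: the class RecursiveFileLister and its method names are hypothetical, and the use of Matcher.find() for the name test is an assumption. The point is only that the filters are consulted for regular files, while directories are always entered when recursion is on.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not the DIH code: walks a directory tree and applies
// the name filters to regular files only, so directories are always entered.
class RecursiveFileLister {

    // fileName and excludes may each be null, meaning "no filter".
    static List<File> list(File dir, Pattern fileName, Pattern excludes, boolean recursive) {
        List<File> out = new ArrayList<>();
        collect(dir, fileName, excludes, recursive, out);
        return out;
    }

    private static void collect(File dir, Pattern fileName, Pattern excludes,
                                boolean recursive, List<File> out) {
        File[] children = dir.listFiles();
        if (children == null) {
            return; // not a directory, or not readable
        }
        for (File child : children) {
            if (child.isDirectory()) {
                // Directory names are never tested against the filters.
                if (recursive) {
                    collect(child, fileName, excludes, true, out);
                }
                continue;
            }
            String name = child.getName();
            if (fileName != null && !fileName.matcher(name).find()) {
                continue; // file name does not match the include pattern
            }
            if (excludes != null && excludes.matcher(name).find()) {
                continue; // file name matches the exclude pattern
            }
            out.add(child);
        }
    }
}
```

With an include pattern such as "\\.xml$" (an illustrative value), a subdirectory named "docs" does not match the pattern but is still descended into, which is the behavior asked for in this issue.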

I will submit a patch once I have constructed one!


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1000) DIH FileListEntityProcessor fileName filters directory names and stops recursion

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670102#action_12670102 ] 

Shalin Shekhar Mangar commented on SOLR-1000:
---------------------------------------------

Thanks Fergus.

One minor thing:
{code}
while (true) {
  Map<String, Object> r = getNext();
  if (r != null) r = applyTransformer(r);
  return r;
}
{code}

In the new code the loop is never used, because the method returns on the first pass. The difference is important because Transformers can skip documents by doing map.put("$skipDoc", true) on this map. If a document is skipped, applyTransformer returns null and we'd like to request a new row from the data source (the entity processor in this case). With this change, null is returned instead, which signals that the DataSource/EntityProcessor has run out of data even though it has not.
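
A minimal sketch of the looping behavior described here, under stated assumptions: SkipAwareRows, nextRow and row are hypothetical names, and the stand-in applyTransformer simply drops rows that already carry a "$skipDoc" flag (in DIH the flag would be set by a Transformer). It is meant to show why a null from the transformer step must lead to another fetch rather than being returned.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an entity processor; not the DIH implementation.
class SkipAwareRows {
    private final Iterator<Map<String, Object>> rows;

    SkipAwareRows(List<Map<String, Object>> source) {
        this.rows = source.iterator();
    }

    // Returns the next raw row, or null when the source is exhausted.
    private Map<String, Object> getNext() {
        return rows.hasNext() ? rows.next() : null;
    }

    // Stand-in transformer step: a row flagged with $skipDoc becomes null.
    private Map<String, Object> applyTransformer(Map<String, Object> row) {
        return Boolean.TRUE.equals(row.get("$skipDoc")) ? null : row;
    }

    // Keeps fetching until a row survives the transformer or the source ends.
    Map<String, Object> nextRow() {
        while (true) {
            Map<String, Object> r = getNext();
            if (r == null) {
                return null; // genuinely out of data
            }
            r = applyTransformer(r);
            if (r != null) {
                return r; // row was not skipped
            }
            // Row was skipped: go round the loop and request another one.
        }
    }

    // Convenience factory for building illustrative rows.
    static Map<String, Object> row(String id, boolean skip) {
        Map<String, Object> m = new HashMap<>();
        m.put("id", id);
        if (skip) {
            m.put("$skipDoc", true);
        }
        return m;
    }
}
```

In this sketch, null reaches the caller only when getNext() itself returns null, so a skipped row is never mistaken for the end of the data.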

Except for this, the patch looks great! I'll commit this shortly.

> DIH FileListEntityProcessor fileName filters directory names and stops recursion 
> ---------------------------------------------------------------------------------
>
>                 Key: SOLR-1000
>                 URL: https://issues.apache.org/jira/browse/SOLR-1000
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.3
>            Reporter: Fergus McMenemie
>            Assignee: Shalin Shekhar Mangar
>         Attachments: SOLR-1000.patch, SOLR-1000.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I have been trying to find out why DIH in FileListEntityProcessor mode did not appear to be recursing into subdirectories. Going through FileListEntityProcessor.java I eventually tumbled to the fact that my filename filter setting from data-config.xml also applied to directory names.
> Now, I feel that the fileName filter should be applied to the files fed into the parser; it should not be applied to the directory names we are recursing through. I bodged the code to adjust the behavior so that the "fileName" and "excludes" attributes of "entity" only apply to filenames and not directory names. It now recurses through my directory tree, only indexing the appropriate files! I think the new behavior is more standard.
> I will submit a patch once I have constructed one!



[jira] Assigned: (SOLR-1000) DIH FileListEntityProcessor fileName filters directory names and stops recursion

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar reassigned SOLR-1000:
-------------------------------------------

    Assignee: Shalin Shekhar Mangar




[jira] Updated: (SOLR-1000) DIH FileListEntityProcessor fileName filters directory names and stops recursion

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fergus McMenemie updated SOLR-1000:
-----------------------------------

    Attachment: SOLR-1000.patch

Here is my first attempt at a patch. It seems to work OK; however, the test case I added, TestFileListEntityProcessor.java, fails. I need somebody who knows what they are doing to point out what I am doing wrong!




[jira] Updated: (SOLR-1000) DIH FileListEntityProcessor fileName filters directory names and stops recursion

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fergus McMenemie updated SOLR-1000:
-----------------------------------

    Attachment: SOLR-1000.patch

Sorted out the bugs in the JUnit test and added a few other improvements to the test.




[jira] Commented: (SOLR-1000) DIH FileListEntityProcessor fileName filters directory names and stops recursion

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669662#action_12669662 ] 

Shalin Shekhar Mangar commented on SOLR-1000:
---------------------------------------------

First, the ClassCastException occurred because AbstractDataImportHandlerTest tries to read a string from the attributes map, but in this case testRecursive put a boolean true rather than a string into the 'recursive' attribute. That was fixed by adding the string "true" instead of a boolean. I'll fix AbstractDataImportHandlerTest to use String.valueOf to handle these cases in the future.
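
The type mismatch behind the ClassCastException can be illustrated with a small sketch. AttributeReader and its two methods are hypothetical names, not part of AbstractDataImportHandlerTest; the sketch assumes only that attributes are held in a Map<String, Object> and read back as strings.

```java
import java.util.Map;

// Hypothetical illustration, not the test-harness code.
class AttributeReader {

    // Throws ClassCastException when the stored value is not a String,
    // for example when a Boolean was put into the map.
    static String readByCast(Map<String, Object> attrs, String key) {
        return (String) attrs.get(key);
    }

    // Tolerates any value type by converting it to its string form.
    // The null check avoids String.valueOf turning a missing value into "null".
    static String readByValueOf(Map<String, Object> attrs, String key) {
        Object value = attrs.get(key);
        return value == null ? null : String.valueOf(value);
    }
}
```

With a map holding Boolean.TRUE under "recursive", readByCast fails at the cast while readByValueOf yields the string "true", which is why either storing "true" as a string or reading through String.valueOf avoids the exception.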

After this fix, the assert at the end of testRecursive failed. This is because it expects to find 3 files, but "a.xml", "b.xml" and "c.props" are in the same directory and, due to the 'fileName' regex, c.props won't be picked up. I guess you meant to add c.props to another child directory inside the one you are creating?




[jira] Resolved: (SOLR-1000) DIH FileListEntityProcessor fileName filters directory names and stops recursion

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar resolved SOLR-1000.
-----------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.4

Committed revision 740423.

I reverted the change I mentioned above and moved the issue to the DIH CHANGES.txt.

Thanks Fergus!

