Posted to solr-dev@lucene.apache.org by "Trey Grainger (JIRA)" <ji...@apache.org> on 2010/03/21 03:17:27 UTC

[jira] Created: (SOLR-1837) Reconstruct a Document (stored fields, indexed fields, payloads)

Reconstruct a Document (stored fields, indexed fields, payloads)
----------------------------------------------------------------

                 Key: SOLR-1837
                 URL: https://issues.apache.org/jira/browse/SOLR-1837
             Project: Solr
          Issue Type: New Feature
          Components: Schema and Analysis, web gui
    Affects Versions: 1.5
         Environment: All
            Reporter: Trey Grainger
            Priority: Minor
             Fix For: 1.5


One Solr feature I've been sorely in need of is the ability to inspect an index for any particular document.  While the analysis page is good when you have specific content and a specific field/type you want to test the analysis process for, once a document is indexed it is not currently possible to easily see what is actually sitting in the index.

One can use the Lucene Index Browser (Luke), but this has several limitations (GUI only, doesn't understand the Solr schema, doesn't display many non-text fields in human-readable format, doesn't show payloads, some bugs lead to missing terms, exposes features dangerous to use in a production Solr environment, slow or difficult to check from a remote location, etc.).  However, the document reconstruction feature of Luke provides the base for what can become a much more powerful tool when coupled with Solr's understanding of a schema.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (SOLR-1837) Reconstruct a Document (stored fields, indexed fields, payloads)

Posted by "Trey Grainger (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Trey Grainger updated SOLR-1837:
--------------------------------

    Remaining Estimate: 168h  (was: 120h)
     Original Estimate: 168h  (was: 120h)


[jira] Updated: (SOLR-1837) Reconstruct a Document (stored fields, indexed fields, payloads)

Posted by "Trey Grainger (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Trey Grainger updated SOLR-1837:
--------------------------------

    Attachment: SOLR-1837.patch

Here's what I have thus far.  The only bug I currently know about is that Solr multi-valued fields (i.e. <field name="x">value1</field><field name="x">value2</field>) currently display as concatenated together instead of as an array of separate fields in the "stored fields" view.
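
To illustrate the multi-valued display issue, here is a minimal Python sketch. It is not taken from the patch: the function name and the list-of-pairs input are hypothetical. It groups repeated stored field names into a list of values rather than concatenating them:

```python
from collections import OrderedDict


def group_stored_fields(fields):
    """Group (name, value) pairs so a repeated field name yields a list of values.

    `fields` is a hypothetical stand-in for the stored fields read back
    from a document, in the order they were stored.
    """
    grouped = OrderedDict()
    for name, value in fields:
        # Append to the existing list for this name, creating it on first use.
        grouped.setdefault(name, []).append(value)
    return grouped


if __name__ == "__main__":
    stored = [("x", "value1"), ("x", "value2")]
    for name, values in group_stored_fields(stored).items():
        print(name, values)
```

With this grouping, field "x" is shown as two separate values rather than as one concatenated string.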

I've referred to the tool in the admin interface as the "Document Inspector" instead of "Document Reconstructor" to prevent confusion over lost/changed/added terms due to index-time analysis.

Any feedback appreciated.


[jira] Commented: (SOLR-1837) Reconstruct a Document (stored fields, indexed fields, payloads)

Posted by "Trey Grainger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847936#action_12847936 ] 

Trey Grainger commented on SOLR-1837:
-------------------------------------

> Re: bugs in Luke that result in missing terms - I recently fixed one such bug, and indeed it was located in the DocReconstructor - if you are aware of others then please report them using the Luke issue tracker.

I just pulled down the most recent Luke code, and it does look like that recent fix was made to cover the bug I saw.  Unfortunately, the fix results in a null reference for me on my index.  I'll open an issue, as it looks like all that's needed is an extra null check.

> Document reconstruction is a very IO-intensive operation, so I would advise against using it on a production system, and also it produces inexact results (because analysis is usually a lossy operation).

I hear you about it being IO-intensive.  There are also other admin tools in Solr which do similarly intensive operations (the schema browser, for example, which generates a list of all fields and a distribution of terms within those fields).  The intent of the tool is for one-off debugging, not for any kind of automated querying, but I'll try to do some tests to see to what degree this tool is affecting our current production systems (I have not seen any noticeable effect thus far).

Also, regarding the process being lossy: in this case, that is kind of the point of the tool (in my use) - to see what has actually been put into the index vs. what was in the document sent to the engine.  For example, if I index a field with the text "Wi-fi hotspots are a life-saver" with payloads on parts of speech, as well as stemming, I want to be able to see something like:
"wi [1] / fi [1] | wifi [1] / hotspot [1] / are [2] / a [3] / life [1] / saver [1] | lifesaver [1]"

With no payloads, this would simply be
"wi / fi | wifi / hotspots | hotspot / are / a / life / saver | lifesaver"

So I had initially named the tool the Solr Document Reconstructor, after the name you gave to the tool in Luke.  Based on your comments, I think it might be less confusing for me to call it something like "Document Inspector", since it is not truly reconstructing the original document.
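
The position-and-payload notation used in the examples above can be sketched as follows. This is a minimal Python illustration, not code from the patch: the function name, the (position, term, payload) input format, and the specific positions assigned to each term are assumptions. It joins terms at the same position with " | " and separates positions with " / ":

```python
from itertools import groupby


def render_tokens(tokens, show_payloads=True):
    """Render (position, term, payload) triples as 'a / b | c' text.

    ' / ' separates token positions; ' | ' separates terms that share
    a position (e.g. a word part and its concatenated form).
    """
    # sorted() is stable, so terms keep their input order within a position.
    ordered = sorted(tokens, key=lambda t: t[0])
    groups = []
    for _, same_position in groupby(ordered, key=lambda t: t[0]):
        parts = []
        for _, term, payload in same_position:
            if show_payloads and payload is not None:
                parts.append(f"{term} [{payload}]")
            else:
                parts.append(term)
        groups.append(" | ".join(parts))
    return " / ".join(groups)


if __name__ == "__main__":
    # Hypothetical positions and payloads for "Wi-fi hotspots are a life-saver".
    tokens = [
        (0, "wi", 1), (1, "fi", 1), (1, "wifi", 1), (2, "hotspot", 1),
        (3, "are", 2), (4, "a", 3), (5, "life", 1),
        (6, "saver", 1), (6, "lifesaver", 1),
    ]
    print(render_tokens(tokens))
    print(render_tokens(tokens, show_payloads=False))
```

Keeping the position as an explicit key makes it clear which terms were injected at the same position by analysis and which occupy their own slot.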

I'll try to get what I have pushed up today so you can check it out if you want.  Thanks for your great work on that tool!


[jira] Commented: (SOLR-1837) Reconstruct a Document (stored fields, indexed fields, payloads)

Posted by "Trey Grainger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847866#action_12847866 ] 

Trey Grainger commented on SOLR-1837:
-------------------------------------

I've been working on implementing the document reconstruction feature over the past week and have created an additional admin page which exposes it.  The functionality is essentially a reworking of the Lucene document reconstruction functionality in Luke, but with improvements to handle the problems listed in the JIRA issue description above.

I'll be pushing up a patch soon and will look forward to any additional recommendations after others have had a chance to try it out.


[jira] Commented: (SOLR-1837) Reconstruct a Document (stored fields, indexed fields, payloads)

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847923#action_12847923 ] 

Andrzej Bialecki  commented on SOLR-1837:
-----------------------------------------

Re: bugs in Luke that result in missing terms - I recently fixed one such bug, and indeed it was located in the DocReconstructor - if you are aware of others then please report them using the Luke issue tracker.

Document reconstruction is a very IO-intensive operation, so I would advise against using it on a production system, and also it produces inexact results (because analysis is usually a lossy operation).
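
As a small illustration of why analysis is lossy, the following Python sketch uses a toy analyzer (lowercasing, a hypothetical stop list, and a naive plural-stripping rule; none of this is Solr's or Luke's actual code). It shows two different inputs producing identical index terms, so the original text cannot be recovered from the terms alone:

```python
import re

# Hypothetical stop list, for illustration only.
STOPWORDS = {"a", "are", "the"}


def toy_analyze(text):
    """Lowercase, split on non-alphanumerics, drop stop words, strip a trailing 's'."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    # Naive "stemming": remove a final 's' from words longer than three letters.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in kept]


if __name__ == "__main__":
    first = toy_analyze("Wi-fi hotspots are a life-saver")
    second = toy_analyze("wi fi hotspot: the life saver")
    print(first)
    print(first == second)
```

Case, punctuation, stop words, and the plural ending are all discarded at index time, which is why a reconstructed document can only approximate what was originally submitted.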
