Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2010/11/19 18:56:15 UTC

[jira] Created: (NUTCH-936) LanguageIdentifier should not set empty lang field on NutchDocument

LanguageIdentifier should not set empty lang field on NutchDocument
-------------------------------------------------------------------

                 Key: NUTCH-936
                 URL: https://issues.apache.org/jira/browse/NUTCH-936
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 1.2
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
            Priority: Minor
             Fix For: 1.3, 2.0


For some reason the language identifier plugin sometimes sets an empty value for the lang field. This is confirmed to occur in 1.2 when parsing a scanned PDF file that cannot be OCR'd into proper text. Whether or not the parser is at fault, the plugin itself should not add an empty value. The plugin already checks for a null value and sets the lang field to `unknown` in that case, which is fine. When the lang string is empty, it should be set to `unknown` as well.

This might break clients that have conditional logic on the empty value but not on the `unknown` value: if `unknown` never occurred in their setup, they may not have added it to their logic.

However, it seems like overkill to put this change behind a configuration option and have Nutch keep its current behavior by default. Any thoughts on this one?
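
To illustrate the compatibility concern, here is a minimal client-side sketch that treats a missing, an empty and an `unknown` lang value alike, so it behaves the same before and after this change. The class and method names (`LangClientSketch`, `isUndetermined`) are hypothetical and not part of Nutch.

```java
// Hypothetical client-side helper; an illustration only, not Nutch code.
final class LangClientSketch {

    /**
     * Returns true when the indexed lang value carries no usable
     * language code: null, the empty string, or the literal "unknown".
     */
    static boolean isUndetermined(String lang) {
        return lang == null || lang.isEmpty() || "unknown".equals(lang);
    }
}
```

A client that branches on `isUndetermined(lang)` rather than comparing against the empty string alone should not notice the change.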

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-936) LanguageIdentifier should not set empty lang field on NutchDocument

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-936:
--------------------------------

    Patch Info: [Patch Available]

> LanguageIdentifier should not set empty lang field on NutchDocument
> -------------------------------------------------------------------
>
>                 Key: NUTCH-936
>                 URL: https://issues.apache.org/jira/browse/NUTCH-936
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.2
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.3, 2.0
>
>         Attachments: NUTCH-936-v12-1.patch, NUTCH-936-v13-1.patch, NUTCH-936-v13-1.patch
>
>
> For some reason the language identifier plugin sometimes sets an empty value for the lang field. This is confirmed to occur in 1.2 when parsing a scanned PDF file that cannot be OCR'd into proper text, resulting in an empty content field. Whether or not the parser is at fault, the plugin itself should not add an empty value, because an empty content field can always occur. The plugin already checks for a null value and sets the lang field to `unknown` in that case, which is fine. When the lang string is empty, it should be set to `unknown` as well.
> This might break clients that have conditional logic on the empty value but not on the `unknown` value: if `unknown` never occurred in their setup, they may not have added it to their logic. However, it seems like overkill to put this change behind a configuration option and have Nutch keep its current behavior by default. Any thoughts on this one?
> Here's the troublesome URL: http://www.nrc.nl/redactie/binnenland/memo_buza_irak.pdf. It returns an empty content field and an empty lang string in 1.2, and presumably in trunk and other versions as well.



[jira] Commented: (NUTCH-936) LanguageIdentifier should not set empty lang field on NutchDocument

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934474#action_12934474 ] 

Markus Jelsma commented on NUTCH-936:
-------------------------------------

Committed for 1.3 in revision 1037732.
I can't commit to trunk right now because I still cannot compile the checkout.



[jira] Issue Comment Edited: (NUTCH-936) LanguageIdentifier should not set empty lang field on NutchDocument

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934453#action_12934453 ] 

Markus Jelsma edited comment on NUTCH-936 at 11/22/10 8:10 AM:
---------------------------------------------------------------

Here are patches for the current 1.2 stable, branch 1.3 and trunk. They add a `lang.length() == 0` check to the existing `lang == null` check, without a configuration setting.
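
As a rough sketch of what that check amounts to (this is not the patch itself; the class, method and constant names are made up for illustration):

```java
// Hypothetical sketch; not the actual Nutch LanguageIdentifier code.
final class LangFieldSketch {

    // Fallback value named in the issue description.
    static final String UNKNOWN = "unknown";

    /** Returns the detected language, or "unknown" when it is null or empty. */
    static String normalize(String lang) {
        // Existing check (lang == null) combined with the added
        // check (lang.length() == 0).
        if (lang == null || lang.length() == 0) {
            return UNKNOWN;
        }
        return lang;
    }
}
```

With this in place, both `normalize(null)` and `normalize("")` yield `unknown`, while a detected code such as `nl` passes through unchanged.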

      was (Author: markus17):
    Here are patches for the current 1.2 stable, branch 1.3 and trunk. 
  


[jira] Updated: (NUTCH-936) LanguageIdentifier should not set empty lang field on NutchDocument

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-936:
--------------------------------

    Description: 
For some reason the language identifier plugin sometimes sets an empty value for the lang field. This is confirmed to occur in 1.2 when parsing a scanned PDF file that cannot be OCR'd into proper text, resulting in an empty content field. Whether or not the parser is at fault, the plugin itself should not add an empty value, because an empty content field can always occur. The plugin already checks for a null value and sets the lang field to `unknown` in that case, which is fine. When the lang string is empty, it should be set to `unknown` as well.

This might break clients that have conditional logic on the empty value but not on the `unknown` value: if `unknown` never occurred in their setup, they may not have added it to their logic. However, it seems like overkill to put this change behind a configuration option and have Nutch keep its current behavior by default. Any thoughts on this one?

Here's the troublesome URL: http://www.nrc.nl/redactie/binnenland/memo_buza_irak.pdf. It returns an empty content field and an empty lang string in 1.2, and presumably in trunk and other versions as well.

  was:
For some reason the language identifier plugin sometimes sets an empty value for the lang field. This is confirmed to occur in 1.2 when parsing a scanned PDF file that cannot be OCR'd into proper text. Whether or not the parser is at fault, the plugin itself should not add an empty value. The plugin already checks for a null value and sets the lang field to `unknown` in that case, which is fine. When the lang string is empty, it should be set to `unknown` as well.

This might break clients that have conditional logic on the empty value but not on the `unknown` value: if `unknown` never occurred in their setup, they may not have added it to their logic.

However, it seems like overkill to put this change behind a configuration option and have Nutch keep its current behavior by default. Any thoughts on this one?




[jira] Updated: (NUTCH-936) LanguageIdentifier should not set empty lang field on NutchDocument

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-936:
--------------------------------

    Attachment: NUTCH-936-v13-1.patch
                NUTCH-936-v13-1.patch
                NUTCH-936-v12-1.patch

Here are patches for the current 1.2 stable, branch 1.3 and trunk. 



[jira] Closed: (NUTCH-936) LanguageIdentifier should not set empty lang field on NutchDocument

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche closed NUTCH-936.
-------------------------------

    Resolution: Fixed

Committed in trunk under revision 1051985.
Thanks
