Posted to dev@nutch.apache.org by "kiran (JIRA)" <ji...@apache.org> on 2012/09/06 13:43:07 UTC

[jira] [Created] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

kiran created NUTCH-1467:
----------------------------

             Summary: nutch 1.5.1 not able to parse mutliValued metatags
                 Key: NUTCH-1467
                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.5.1
            Reporter: kiran
            Priority: Minor


Hi,

I have been able to parse metatags in an HTML page by following http://wiki.apache.org/nutch/IndexMetatags. However, it does not work well when there are two metatags with the same name but different contents.

Has anyone else encountered this kind of issue?

Are there any changes that need to be made to the config files to make it work?

Many Thanks,


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Comment Edited] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "kiran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13467072#comment-13467072 ] 

kiran edited comment on NUTCH-1467 at 10/2/12 5:40 AM:
-------------------------------------------------------

These 4 files work as a patch to the parse-metatags plugin. They save the multiple values in an array and then send it to Solr. The HTMLMetaProcessor patch needs to be applied in the parse-html and parse-tika plugins. The HTMLMetaTags patch should be applied in 'src/java/parse'. The HTMLMetadataIndexer patch goes in the index-metadata plugin.

This is an improvement on the last patch I wrote, where I used a separator to save the multiple values of the same tag.

I am working on writing a JUnit test.

Does anyone know why this plugin is not included in nutch-2.0?

Does this plugin need to be rewritten to be included in nutch-2.0? I did try a few things, but the classes do not compile.

Many Thanks for the help.
                
      was (Author: kiranch):
    These 4 files will work as patch to the parse-metatags plugin. They will save the multiValues in an array and then send it to Solr. HTMLMetaProcessor patch need to be applied in parse-html and parse-tika plugins. HTMLMetaTags patch should be in the 'src/java/parse'. HTMLMetadataIndexer in index-metadata plugin
                  
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: Patch_HTMLMetaProcessor.patch, Patch_HTMLMetaTags.patch, Patch_MetadataIndexer.patch, Patch_MetaTagsParser.patch, patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,


[jira] [Updated] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "kiran (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kiran updated NUTCH-1467:
-------------------------

    Description: 
Hi,

I have been able to parse metatags in an HTML page by following http://wiki.apache.org/nutch/IndexMetatags. However, it does not work well when there are two metatags with the same name but different contents.

Has anyone else encountered this kind of issue?

Are there any changes that need to be made to the config files to make it work?

When there are two tags with the same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.

Many Thanks,
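
Note: the last-value-wins behaviour described above can be reproduced outside Nutch. The following is a minimal sketch, assuming the metatags are collected in a java.util.Properties object (as the later comments about HTMLMetaTags.generalTags suggest). The class name LastValueWinsDemo and the sample tag values are hypothetical and not taken from the Nutch sources.

```java
import java.util.Properties;

// Hypothetical stand-alone demo; not part of Nutch.
class LastValueWinsDemo {

    /** Stores each (name, content) pair the way a Properties-backed holder would. */
    static Properties collect(String[][] metaTags) {
        Properties tags = new Properties();
        for (String[] tag : metaTags) {
            // setProperty replaces any value already stored under the same key,
            // so an earlier tag with the same name is silently lost.
            tags.setProperty(tag[0], tag[1]);
        }
        return tags;
    }

    public static void main(String[] args) {
        // Two metatags with the same name but different contents.
        String[][] metaTags = {
            {"keywords", "nutch"},
            {"keywords", "solr"}
        };
        Properties tags = collect(metaTags);
        System.out.println(tags.getProperty("keywords")); // prints "solr"
    }
}
```

Only the content of the later tag survives, which matches the symptom reported in this issue.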


  was:
Hi,

I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 

Does anyone encounter this kind of issue ?  

Are there any changes that need to be made to the config files to make it work ?

Many Thanks,


    
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Many Thanks,


[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454856#comment-13454856 ] 

Julien Nioche commented on NUTCH-1467:
--------------------------------------

Hi Kiran

Thank you for your comments. Re-index all attributes: this could be done by adding the option to parse-metatags and allowing values to be set using regular expressions in index-metadata.

Don't worry about being slow; no one's in a hurry and we are all learning from each other.
                
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,


[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454790#comment-13454790 ] 

Julien Nioche commented on NUTCH-1467:
--------------------------------------

bq. I will work on it soon but i am thinking of working on tika parser so that it can get all the attributes by default, index them and send it to solr 'attr_*' dynamic field, so that instead of specifying manually any attributes will be accepted. That would be helpful i think than the parse-metatags.

A big fat -1 from me. It is definitely not a good idea to index all the possible attributes by default.

Adding a test illustrating the new behaviour for this issue would have been good. +1 to being able to store multiple values instead of relying on a separator by convention.

Markus - my understanding is that committers mark an issue as resolved but it's up to the author of the issue to confirm that all is done by closing it.
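
Note: the concern about relying on a separator by convention can be illustrated with a small sketch. It assumes a comma as the separator, which is a hypothetical choice; the separator actually used in patch.txt is not stated in this thread. The class and method names are also hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone demo; not part of Nutch.
class SeparatorPitfallDemo {

    // Assumed separator, chosen only for illustration.
    static final String SEPARATOR = ",";

    /** Concatenates all values of one metatag into a single string. */
    static String joinValues(List<String> values) {
        return String.join(SEPARATOR, values);
    }

    /** Splits a concatenated string back into individual values. */
    static List<String> splitValues(String joined) {
        return Arrays.asList(joined.split(SEPARATOR));
    }

    public static void main(String[] args) {
        // Two values that themselves contain the separator character.
        List<String> original = Arrays.asList("Smith, John", "Doe, Jane");
        List<String> roundTripped = splitValues(joinValues(original));
        System.out.println(original.size());     // prints 2
        System.out.println(roundTripped.size()); // prints 4
    }
}
```

Any value that contains the separator is split into extra values on the way back, whereas a real list of values needs no such convention.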
                
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,


[jira] [Comment Edited] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "kiran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13467072#comment-13467072 ] 

kiran edited comment on NUTCH-1467 at 10/2/12 5:42 AM:
-------------------------------------------------------

These 4 files work as a patch to the parse-metatags plugin. They save the multiple values in an array and then send it to Solr. The HTMLMetaProcessor patch needs to be applied in the parse-html and parse-tika plugins. The HTMLMetaTags patch should be applied in 'src/java/parse'. The HTMLMetadataIndexer patch goes in the index-metadata plugin.

This is an improvement on the last patch I wrote, where I used a separator to save the multiple values of the same tag.

I am working on writing a JUnit test.

These patches send the data to Solr as an array and it all works as expected, but when I run the command 'bin/nutch indexchecker' it does not print the values as expected. I have not worked on that yet.

Does anyone know why this plugin is not included in nutch-2.0?

Does this plugin need to be rewritten to be included in nutch-2.0? I did try a few things, but the classes do not compile.

Many Thanks for the help.
                
      was (Author: kiranch):
    These 4 files will work as patch to the parse-metatags plugin. They will save the multiValues in an array and then send it to Solr. HTMLMetaProcessor patch need to be applied in parse-html and parse-tika plugins. HTMLMetaTags patch should be in the 'src/java/parse'. HTMLMetadataIndexer in index-metadata plugin.

This is an improvement to the last patch i wrote where i used a separator to save the mutiplevalues of the same tag.

I am working on writing an junit test.

Does anyone know why this plugin is not included in nutch-2.0 ? 

Should this plugin need to be rewritten to be included in nutch-2.0? I did try few things but the classes are not getting compiled.

Many Thanks for the help.
                  
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: Patch_HTMLMetaProcessor.patch, Patch_HTMLMetaTags.patch, Patch_MetadataIndexer.patch, Patch_MetaTagsParser.patch, patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,


[jira] [Updated] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "kiran (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kiran updated NUTCH-1467:
-------------------------

    Attachment: Patch_MetaTagsParser.patch
                Patch_MetadataIndexer.patch
                Patch_HTMLMetaTags.patch
                Patch_HTMLMetaProcessor.patch

These 4 files work as a patch to the parse-metatags plugin. They save the multiple values in an array and then send it to Solr. The HTMLMetaProcessor patch needs to be applied in the parse-html and parse-tika plugins. The HTMLMetaTags patch should be applied in 'src/java/parse'. The HTMLMetadataIndexer patch goes in the index-metadata plugin.
                
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: Patch_HTMLMetaProcessor.patch, Patch_HTMLMetaTags.patch, Patch_MetadataIndexer.patch, Patch_MetaTagsParser.patch, patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,


[jira] [Reopened] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma reopened NUTCH-1467:
----------------------------------


Thanks for your patch. Reopening issue since nothing is committed. Usually only committers resolve or close an issue. I'll mark it to fix for 1.6.
                
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Many Thanks,


[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "kiran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482668#comment-13482668 ] 

kiran commented on NUTCH-1467:
------------------------------

Hi Sebastian,

Thank you for the suggestions. I will look into them. I have ported the plugin (https://issues.apache.org/jira/browse/NUTCH-1478) based on the patch I wrote here; if the changes here work, then the patch I wrote for the port also needs to be changed.

I will test the NUTCH-1467-TEST-1.patch soon and update here if any more cases need to be covered. 

Regards,
Kiran




                
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: NUTCH-1467-TEST-1.patch, NUTCH-1467-trunk.patch, Patch_HTMLMetaProcessor.patch, Patch_HTMLMetaTags.patch, Patch_MetadataIndexer.patch, Patch_MetaTagsParser.patch, patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,


[jira] [Comment Edited] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "kiran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454296#comment-13454296 ] 

kiran edited comment on NUTCH-1467 at 9/13/12 6:59 AM:
-------------------------------------------------------

Yes, it would be great to store the multiple values as an array instead of a concatenation. I just made a quick workaround, but HTMLMetaTags.generalTags should be changed from its current Properties-based implementation so that it can accept an array of values. That would be the next step.

I will work on it soon, but I am thinking of working on the Tika parser so that it can get all the attributes by default, index them, and send them to the Solr 'attr_*' dynamic field, so that any attribute is accepted instead of having to be specified manually. I think that would be more helpful than parse-metatags.

I am not sure if this feature is already present in Nutch.

Any suggestions are appreciated.

Thank you,
Kiran.
                
      was (Author: kiranch):
    Yes, it would be great to store the multiValues as array instead of concatenation. I just made a quick workaround but  the implementation of HTMLMetaTags.generalTags should be changed from the implementation of properties so that it can accept array of values. That would be the next step. 

I will work on it soon but i am thinking of working on tika parser so that it can get all the attributes by default, index them and send it to solr 'attr_*' dynamic field, so that instead of specifying manually, any attributes will be accepted. That would be helpful i think than the parse-metatags.

I am not sure if this feature is already present in Nutch.

Any suggestions are appreciated.

Thank you,
Kiran.
                  
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,


[jira] [Updated] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "kiran (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kiran updated NUTCH-1467:
-------------------------

    Attachment: patch.txt
    
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>         Attachments: patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Many Thanks,


[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "kiran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453417#comment-13453417 ] 

kiran commented on NUTCH-1467:
------------------------------

Thank you for fixing it in version 1.6.
                
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Many Thanks,


[jira] [Updated] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1467:
----------------------------------------

    Attachment: NUTCH-1467-trunk.patch

Hi Kiran. I've attached a unified patch for your contribution. Thank you for this.
A note on submitting patches: it is much easier for us to review and comment if patches are created from the top-level directory.
Thank you also for your attention to detail regarding the formatting.
                
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: NUTCH-1467-trunk.patch, Patch_HTMLMetaProcessor.patch, Patch_HTMLMetaTags.patch, Patch_MetadataIndexer.patch, Patch_MetaTagsParser.patch, patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,


[jira] [Updated] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1467:
---------------------------------

    Fix Version/s: 1.6
    
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Many Thanks,


[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "kiran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454296#comment-13454296 ] 

kiran commented on NUTCH-1467:
------------------------------

Yes, it would be great to store the multiple values as an array instead of a concatenation. I just made a quick workaround, but HTMLMetaTags.generalTags should be changed from its current Properties-based implementation so that it can accept an array of values. That would be the next step.

I will work on it soon, but I am thinking of working on the Tika parser so that it can get all the attributes by default, index them, and send them to the Solr 'attr_*' dynamic field, so that any attribute is accepted instead of having to be specified manually. I think that would be more helpful than parse-metatags.

I am not sure if this feature is already present in Nutch.

Any suggestions are appreciated.

Thank you,
Kiran.
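
Note: the "next step" mentioned in this comment, a holder that accepts several values per tag name, might look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the code from the attached patches: the class MultiValuedMetaTags and its add/get methods are hypothetical names, and the real HTMLMetaTags class has other responsibilities that are not modelled here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone holder; not the actual HTMLMetaTags class.
class MultiValuedMetaTags {

    // Each tag name maps to every content value seen for it, in document order.
    private final Map<String, List<String>> tags = new LinkedHashMap<>();

    /** Appends a value instead of replacing the one already stored. */
    void add(String name, String content) {
        tags.computeIfAbsent(name, key -> new ArrayList<>()).add(content);
    }

    /** Returns all values for a name, or an empty list if the name is unknown. */
    List<String> get(String name) {
        return tags.getOrDefault(name, Collections.emptyList());
    }

    public static void main(String[] args) {
        MultiValuedMetaTags metaTags = new MultiValuedMetaTags();
        metaTags.add("keywords", "nutch");
        metaTags.add("keywords", "solr");
        System.out.println(metaTags.get("keywords")); // prints [nutch, solr]
    }
}
```

An indexing step could then add each element of the list as a separate value of the same field, which is what a multiValued field on the Solr side expects.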
                
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,


[jira] [Updated] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "kiran (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kiran updated NUTCH-1467:
-------------------------

    Description: 
Hi,

I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 

Does anyone encounter this kind of issue ?  

Are there any changes that need to be made to the config files to make it work ?

When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.

Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 

Many Thanks,


  was:
Hi,

I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 

Does anyone encounter this kind of issue ?  

Are there any changes that need to be made to the config files to make it work ?

When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.

Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) of Virginia Tech. 

Many Thanks,


    
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,


[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13468436#comment-13468436 ] 

Julien Nioche commented on NUTCH-1467:
--------------------------------------

Thanks Kiran. See http://wiki.apache.org/nutch/HowToContribute for info on patches 

                
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: NUTCH-1467-trunk.patch, Patch_HTMLMetaProcessor.patch, Patch_HTMLMetaTags.patch, Patch_MetadataIndexer.patch, Patch_MetaTagsParser.patch, patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,


[jira] [Updated] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "Sebastian Nagel (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1467:
-----------------------------------

    Attachment: NUTCH-1467-TEST-1.patch
    
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: NUTCH-1467-TEST-1.patch, NUTCH-1467-trunk.patch, Patch_HTMLMetaProcessor.patch, Patch_HTMLMetaTags.patch, Patch_MetadataIndexer.patch, Patch_MetaTagsParser.patch, patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,


[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "kiran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13467598#comment-13467598 ] 

kiran commented on NUTCH-1467:
------------------------------

Thank you for the unified patch. I did not know much about patches before, but I will try to follow the rules the next time I do this.

Many thanks for all your help.
                
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: NUTCH-1467-trunk.patch, Patch_HTMLMetaProcessor.patch, Patch_HTMLMetaTags.patch, Patch_MetadataIndexer.patch, Patch_MetaTagsParser.patch, patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,


[jira] [Updated] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "kiran (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kiran updated NUTCH-1467:
-------------------------

    Description: 
Hi,

I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 

Does anyone encounter this kind of issue ?  

Are there any changes that need to be made to the config files to make it work ?

When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.

Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) of Virginia Tech. 

Many Thanks,


  was:
Hi,

I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 

Does anyone encounter this kind of issue ?  

Are there any changes that need to be made to the config files to make it work ?

When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.

Many Thanks,


    
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) of Virginia Tech. 
> Many Thanks,


[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "Sebastian Nagel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454282#comment-13454282 ] 

Sebastian Nagel commented on NUTCH-1467:
----------------------------------------

Since nutch.metadata.Metadata, NutchField, and SolrInputField are all multi-valued, wouldn't it be preferable to keep the multiple values instead of concatenating them in advance? This would require changing HTMLMetaTags.generalTags so that it can store multiple values.
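
To illustrate the idea of keeping every value separate, here is a minimal sketch in plain Java. The class MultiValuedTags, its add and getValues methods, and the key "dc.creator" are hypothetical stand-ins made up for this example; this is not Nutch's Metadata class and not the code from any attached patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multi-valued tag store (not a Nutch class).
public class MultiValuedTags {
    // Each tag name maps to the list of all values seen for it, in order.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Adds one more value under the given name without replacing earlier ones.
    public void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Returns all values stored for the name, or an empty array if there are none.
    public String[] getValues(String name) {
        List<String> found = values.get(name);
        return found == null ? new String[0] : found.toArray(new String[0]);
    }

    public static void main(String[] args) {
        MultiValuedTags tags = new MultiValuedTags();
        tags.add("dc.creator", "R.L. Ticknor");
        tags.add("dc.creator", "J.E. Long");
        // Both values are kept as separate entries.
        for (String creator : tags.getValues("dc.creator")) {
            System.out.println(creator);
        }
    }
}
```

With a store like this, each value could be passed on to the indexing step as its own field value, so no separator character has to be chosen.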
                
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,


[jira] [Resolved] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "kiran (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kiran resolved NUTCH-1467.
--------------------------

    Resolution: Implemented

I have made a patch file (attached below) which will solve the above problem.
I do not think it is the best way to do it, but it is a temporary solution for me for now, so I am posting it here.

For example, if there are two tags with the same name, like this:

<meta name="DC.creator" content="R.L. Ticknor">
<meta name="DC.creator" content="J.E. Long">

The parser (after the patch is applied) will save the values as (metatag.dc.creator=R.L. Ticknor,J.E. Long), separated by commas.

Previously only the second value was saved, because the Java Properties class was used to store the names and values.

The patch applies to the file HTMLMetaProcessor.java in the path ($NUTCH_HOME/src/plugin/parse-html/src/java/org/apache/nutch/parse/html).

It would have been better to save the values as an array instead of a comma-separated string, but since Properties was used to store the names and values, I thought it best to keep them separated by commas.

Anyone who uses the crawled meta values should apply split(',') to the multi-valued fields.
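
The behaviour described above can be sketched in plain Java. This is only an illustration, assuming a simple append-with-comma rule; the class CommaJoinedMetaTags, the helpers addJoined and splitValues, and the key "dc.creator" are hypothetical names and are not taken from the attached patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical illustration of the overwrite problem and the comma workaround.
public class CommaJoinedMetaTags {

    // Stores the value under name; if the name already exists, appends after a comma.
    static void addJoined(Properties tags, String name, String value) {
        String previous = tags.getProperty(name);
        tags.setProperty(name, previous == null ? value : previous + "," + value);
    }

    // Splits a comma-joined value back into its individual parts.
    static List<String> splitValues(String joined) {
        return new ArrayList<>(Arrays.asList(joined.split(",")));
    }

    public static void main(String[] args) {
        // Plain Properties: the second setProperty call replaces the first value.
        Properties overwritten = new Properties();
        overwritten.setProperty("dc.creator", "R.L. Ticknor");
        overwritten.setProperty("dc.creator", "J.E. Long");
        System.out.println(overwritten.getProperty("dc.creator"));

        // Workaround: keep both values in one comma-separated string.
        Properties joined = new Properties();
        addJoined(joined, "dc.creator", "R.L. Ticknor");
        addJoined(joined, "dc.creator", "J.E. Long");
        System.out.println(joined.getProperty("dc.creator"));

        // Consumers split the string to recover the individual values.
        System.out.println(splitValues(joined.getProperty("dc.creator")));
    }
}
```

One limitation of this approach is that a value which itself contains a comma (for example a name written as "Long, J.E.") would be split into two parts by the consumer.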

Please let me know if you have any suggestions. 

                
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>         Attachments: patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Many Thanks,


[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "kiran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454847#comment-13454847 ] 

kiran commented on NUTCH-1467:
------------------------------

Hi Julien,

Thank you for your suggestions. I will look into adding a test showing the new behaviour.

When I talked about indexing all the possible attributes by default, I meant that people who enable that option would have all attributes indexed by default. Sorry for the confusion. It would be useful for me when I am crawling thousands of different pages whose structure I do not know and I still want their metadata.

I will also try to support multiple values for the metatags.

This is my first time contributing small patches to open-source software, so please bear with me if I am slow :)

Kiran
                
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,


[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

Posted by "Sebastian Nagel (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482644#comment-13482644 ] 

Sebastian Nagel commented on NUTCH-1467:
----------------------------------------

Hi Kiran,
thanks for the patch. After a look at it:
* instead of replacing {{Properties generalTags}} in HTMLMetaTags.java by a {{HashMap<String, String[]>}} it seems preferable to use the class {{metadata.Metadata}}:
** provides the required methods
*** add one more value to an array of values
*** {{toString()}} etc.
** would shorten the code significantly
** is sufficiently tested (has its own JUnit test)
* in addition to {{parse.html.HTMLMetaProcessor.java}} also {{parse.tika.HTMLMetaProcessor.java}} needs to be modified

Also, as Julien mentioned, a test would be useful. Added NUTCH-1467-TEST-1.patch as a first draft. Can you have a look at the test? Are all situations covered? Promising: test passes with the current patch applied :)

                
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: NUTCH-1467-TEST-1.patch, NUTCH-1467-trunk.patch, Patch_HTMLMetaProcessor.patch, Patch_HTMLMetaTags.patch, Patch_MetadataIndexer.patch, Patch_MetaTagsParser.patch, patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it work ?
> When there are two tags with same name and different content, it takes the value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,
