Posted to dev@nutch.apache.org by "Ammar Shadiq (JIRA)" <ji...@apache.org> on 2011/04/06 17:11:05 UTC

[jira] [Created] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

[GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
---------------------------------------------------------------------------------------

                 Key: NUTCH-978
                 URL: https://issues.apache.org/jira/browse/NUTCH-978
             Project: Nutch
          Issue Type: New Feature
          Components: parser
    Affects Versions: 1.2
         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
            Reporter: Ammar Shadiq
             Fix For: 2.0


Nutch uses the parse-html plugin to parse web pages: it processes the contents of a page by removing HTML tags and components such as JavaScript and CSS, leaving only the extracted text to be stored in the index. By default, Nutch has no capability to select specific elements of an HTML page, such as particular tags, particular content, or certain parts of the page.

An HTML page has a tree-like XML structure, with HTML tags as its branches and text as its nodes. These branches and nodes can be selected with XPath, which lets us address a specific branch or node of an XML document and therefore extract specific pieces of information and treat them differently depending on their content and the user's requirements. Furthermore, pages within one web domain, such as a news website, usually share the same HTML structure for presenting their information, so the same XPath query can be applied to every page to retrieve the same content elements. All of the XPath queries for selecting the various content elements can be stored in an XPath configuration file.

Nutch is meant to crawl a wide variety of web sources, and not all pages retrieved from those sources share the same HTML structure, so each has to be handled with the appropriate XPath configuration. The correct configuration can be selected automatically by matching the URL of the page against the regex URL pattern associated with each XPath configuration.

This automatic mechanism lets a Nutch user process a wide variety of web pages and keep only the information the user wants, making the index more accurate and its contents more flexible.

A component implementing this idea has been tested on Nutch 1.2 for selecting specific elements of various news websites for the purpose of document clustering. It includes a Configuration Editor application built with the NetBeans 6.9 Application Framework, though it still needs some debugging.
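
For illustration, a minimal Java sketch of the intended mechanism, assuming a hypothetical in-memory stand-in for the XPath configuration file (the URL pattern and XPath expression below are examples only, not part of the actual plugin):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathConfigSketch {

    // Hypothetical mapping: URL regex -> (field name -> XPath expression).
    // In the proposed plugin this mapping would be read from the XPath configuration file.
    static final Map<Pattern, Map<String, String>> CONFIGS =
            new LinkedHashMap<Pattern, Map<String, String>>();
    static {
        Map<String, String> guardian = new LinkedHashMap<String, String>();
        guardian.put("headline", "//div[@id='main-article-info']/h1/text()"); // example XPath only
        CONFIGS.put(Pattern.compile("^https?://www\\.guardian\\.co\\.uk/.*"), guardian);
    }

    public static void main(String[] args) throws Exception {
        String url = "http://www.guardian.co.uk/world/2011/apr/08/ivory-coast-horror-recounted";
        // Assume the fetched page has already been tidied into well-formed XHTML;
        // raw HTML would first need a tolerant parser such as NekoHTML or TagSoup.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse("page.xhtml");

        XPath xpath = XPathFactory.newInstance().newXPath();
        for (Map.Entry<Pattern, Map<String, String>> cfg : CONFIGS.entrySet()) {
            if (!cfg.getKey().matcher(url).matches()) {
                continue;                         // URL does not match this configuration
            }
            for (Map.Entry<String, String> field : cfg.getValue().entrySet()) {
                NodeList nodes = (NodeList) xpath.evaluate(
                        field.getValue(), doc, XPathConstants.NODESET);
                for (int i = 0; i < nodes.getLength(); i++) {
                    System.out.println(field.getKey() + ": " + nodes.item(i).getTextContent());
                }
            }
            break;                                // first matching configuration wins
        }
    }
}

The real plugin would load this mapping from the configuration file and run it inside the parse step, but the regex-then-XPath flow is the same.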

http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212535#comment-13212535 ] 

Lewis John McGibbney commented on NUTCH-978:
--------------------------------------------

I think it's best if we talk off-list for the time being; please get in touch with me at lewismc@apache.org and we can take this forward. GSoC expressions of interest need to be made by the end of the month, and this would be a great project for Nutch.
                
> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: nutchgora
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016748#comment-13016748 ] 

Ammar Shadiq edited comment on NUTCH-978 at 4/7/11 5:04 PM:
------------------------------------------------------------

Wow, thank you very much Mr. Jelsma:-)

      was (Author: ammarshadiq):
    Wow, thank you very much Mr. Jelsman:-)
  
> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: 2.0
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
>
>   Original Estimate: 1680h
>  Remaining Estimate: 1680h

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ammar Shadiq updated NUTCH-978:
-------------------------------

    Priority: Minor  (was: Major)

> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Priority: Minor
>              Labels: gsoc
>             Fix For: 2.0
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf
>
>   Original Estimate: 1680h
>  Remaining Estimate: 1680h

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212542#comment-13212542 ] 

Ammar Shadiq commented on NUTCH-978:
------------------------------------

I'll send you an email.
                
> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: nutchgora
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ammar Shadiq updated NUTCH-978:
-------------------------------

    Attachment: app_screenshoot_url_regex_filter.png
                app_screenshoot_source_view.png
                app_screenshoot_configuration_result_anchor.png
                app_screenshoot_configuration_result.png

> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: 2.0
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
>
>   Original Estimate: 1680h
>  Remaining Estimate: 1680h

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Thomas Koch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017053#comment-13017053 ] 

Thomas Koch commented on NUTCH-978:
-----------------------------------

If this is about main text extraction, then there is a plugin in Tika for it (Boilerpipe), and the Readability bookmarklet has an alternative algorithm for determining the main text.

> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: 2.0
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
>
>   Original Estimate: 1680h
>  Remaining Estimate: 1680h

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (NUTCH-978) A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-978:
---------------------------------------

    Fix Version/s:     (was: nutchgora)
                   2.1

Set and Classify
                
> A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2012, mentor
>             Fix For: 2.1
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip, version_alpha2.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212584#comment-13212584 ] 

Lewis John McGibbney commented on NUTCH-978:
--------------------------------------------

Generally speaking the plugin sounds really useful. The only problem I see is that it is very specific: for it to be integrated into the code base we usually need it to be specific enough to address a given task fully, in a well-defined and well-justified manner, but also general enough to be used in many different contexts. This increases usability and user feedback as well as engagement.

4. With regards to the biggest technical challenge being the processing of web pages, how far did you get with this? Were you able to process them with enough precision to satisfy your requirements?

5. How were you querying it with XPath? You cannot query with XPath, but instead with XQuery. Do you maybe mean that this enabled you to navigate the document and address various parts of it with XPath?

6. OK, I understand why it has crumbled slightly, but I think if the code is there it would be a huge waste if we didn't try to revive it, possibly getting it integrated into the code base, or maybe getting it added as a contrib component rather than shipping it within the core codebase if the former is not viable.

I've had a look at NUTCH-185, but I think we can discard this as it was addressed a very long time ago and is already integrated into the codebase. I was referring more to Jira issues which are currently open, which we could maybe merge or combine to give this a more general and possibly better justified argument for inclusion in the codebase... what do you think? Does NUTCH-585 fit this?
                
> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: nutchgora
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ammar Shadiq updated NUTCH-978:
-------------------------------

    Attachment:     (was: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf)

> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: 2.0
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
>
>   Original Estimate: 1680h
>  Remaining Estimate: 1680h

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017724#comment-13017724 ] 

Ammar Shadiq commented on NUTCH-978:
------------------------------------

Please correct me if I'm wrong.
In my limited understanding, Nutch uses a plugin system; one of the plugin points is for parsing HTML pages (the HTMLParseFilter class), and the appropriate plugin is selected based on the configuration and then run.

Inside parse-html, the main things it extracts are: Content, Title, and Outlinks.

The problem I'm trying to solve is adding custom fields, as in: http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html
for various types of content and various sites, and adding them to the index fields. Instead of creating a new plugin for each site, a Nutch user could simply create an XPath configuration file, put it in the configuration folder, and the parsing of custom fields would be done automatically without writing or compiling any code.

In addition, the user could also override Content, Title, and Outlinks with different results. For example, to set the title of a page from a news site (example: http://www.guardian.co.uk/world/2011/apr/08/ivory-coast-horror-recounted),

instead of the value of <head><title>:
=Ivory Coast horror recounted by victims and perpetrators | World news | The Guardian

extract only the headline, using the XPath /html/body/div[@id='wrapper']/div[@id='box']/div[@id='article-header']/div[@id='main-article-info']/h1/text(), and get:
=Ivory Coast horror recounted by victims and perpetrators

Or follow only the outlinks of related news and ignore the rest:

8 Apr 2011
Ouattara calls for Ivory Coast sanctions to be lifted 
7 Apr 2011
Ivory Coast crisis: Q&A 
5 Apr 2011
After Gbagbo, what next for Ivory Coast? 
5 Apr 2011
Ivory Coast: The final battle 

as in the screenshot here: https://issues.apache.org/jira/secure/attachment/12475860/app_guardian_ivory_coast_news_exmpl.png

Since the default parser is parse-html, I added the handler there as a kind of if-else bypass: if the parsed page has a URL that matches one of the configurations, it is parsed with that configuration; if no configuration matches the URL, the default parser mechanism is used.
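
For illustration, a minimal Java sketch of that if-else bypass; the XPathConfig type, its methods, and findConfig() are hypothetical names used only for this sketch, not the actual Nutch HTMLParseFilter API:

import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class ScraperDispatchSketch {

    /** Hypothetical abstraction over one entry of the XPath configuration file. */
    interface XPathConfig {
        Pattern urlPattern();                 // regex of URLs this configuration applies to
        Map<String, String> fieldXPaths();    // index field name -> XPath expression
    }

    /** Return the first configuration whose URL pattern matches, or null if none does. */
    static XPathConfig findConfig(String url, List<XPathConfig> configs) {
        for (XPathConfig cfg : configs) {
            if (cfg.urlPattern().matcher(url).matches()) {
                return cfg;
            }
        }
        return null;
    }

    static String parse(String url, List<XPathConfig> configs) {
        XPathConfig cfg = findConfig(url, configs);
        if (cfg != null) {
            // A matching configuration exists: extract Title, Content, Outlinks and any
            // custom fields using cfg.fieldXPaths() instead of the stock text extraction.
            return "parsed with XPath configuration for " + cfg.urlPattern().pattern();
        }
        // No configuration matched this URL: fall back to the default parse-html behaviour.
        return "parsed with default parse-html extraction";
    }
}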

I'm sorry for my English and if I'm not presenting my idea well enough.

> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: 2.0
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
>
>   Original Estimate: 1680h
>  Remaining Estimate: 1680h

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Chris A. Mattmann (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212570#comment-13212570 ] 

Chris A. Mattmann commented on NUTCH-978:
-----------------------------------------

Guys, I think it's fine to keep the conversation on-list; in fact, I'd favor it unless there is a specific reason to take it off-list.
                
> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: nutchgora
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma reassigned NUTCH-978:
-----------------------------------

    Assignee: Chris A. Mattmann

If a mentor has been identified then please assign the issue to that mentor.

http://community.apache.org/guide-to-being-a-mentor.html

> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: 2.0
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
>
>   Original Estimate: 1680h
>  Remaining Estimate: 1680h

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-978) A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234050#comment-13234050 ] 

Ammar Shadiq commented on NUTCH-978:
------------------------------------

Sweet, thanks Lewis.
                
> A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2012, mentor
>             Fix For: nutchgora
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip, version_alpha2.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ammar Shadiq updated NUTCH-978:
-------------------------------

    Attachment: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf

Proposal for Google Summer of Code 2011
http://www.google-melange.com/gsoc/homepage/google/gsoc2011

I haven't found a mentor yet :-(

> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>              Labels: gsoc
>             Fix For: 2.0
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf
>
>   Original Estimate: 1680h
>  Remaining Estimate: 1680h

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Lewis John McGibbney (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-978:
---------------------------------------

    Attachment: for_GSoc.zip

In its present form this is quite literally all over the place; it is attached merely for safekeeping.
                
> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: nutchgora
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ammar Shadiq updated NUTCH-978:
-------------------------------

    Attachment:     (was: Screenshot.png)

> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: 2.0
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
>
>   Original Estimate: 1680h
>  Remaining Estimate: 1680h
>

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212576#comment-13212576 ] 

Lewis John McGibbney commented on NUTCH-978:
--------------------------------------------

No bother Chris. So far the questions that have been asked are:
1. Provide a quick rundown of the issue, summarizing all of the above.
2. What were the motivations, purpose and technical challenges encountered whilst working on it?
3. Why did the issue drop away, and what do you think is required to get it back on track and possibly into the codebase?
                
> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: nutchgora
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212605#comment-13212605 ] 

Ammar Shadiq commented on NUTCH-978:
------------------------------------

>> 4. With regards to the biggest technical challenge being the processing of web pages, how far did you get with this? Were you able to process them with enough precision to satisfy your requirements?

I got it to work for my text clustering algorithm; application screenshots are provided here: http://www.facebook.com/media/set/?set=a.2075564646205.124550.1157621543&type=3&l=7313965254. Yes, the precision was quite satisfactory.

>> 5. How were you querying it with XPath? You cannot query with XPath, but instead with XQuery. Do you maybe mean that this enabled you to navigate the document and address various parts of it using XPath?

In my understanding there are three ways to query an XML document: XPath, XQuery and XSLT; I'm sorry if I have that wrong. For navigating the various parts of a page I use a Java HTML parse listener extending HTMLEditorKit.ParserCallback and display the result in the editor application (something like Chromium's Inspect Element); this makes the web page structure visible and therefore makes the XPath expressions easier to write (a sketch follows below).

>> 6. OK, I understand why it has crumbled slightly, but I think if the code is there it would be a huge waste if we didn't try to revive it, possibly getting it integrated into the codebase, and maybe getting it added as a contrib component but not shipping it within the core codebase if the former is not a viable option.

I totally agree.

As for NUTCH-585, I think the idea is different: NUTCH-585 tries to block certain parts of a page. This idea instead retrieves only certain parts and, in addition, stores them in specific Lucene fields (I haven't looked into the Solr implementation yet), automatically discarding the rest.
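As a rough illustration of the HTMLEditorKit.ParserCallback approach mentioned above, the sketch below prints the tag structure of a page with indentation, the kind of tree view an "inspect element" style editor could build on. It is an invented stand-in, not the code from the attached editor application; only the Swing parser API itself is real.

    import java.io.Reader;
    import java.io.StringReader;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    /** Prints the element tree of an HTML fragment, one line per tag, indented by depth. */
    public class StructureDumper extends HTMLEditorKit.ParserCallback {
        private int depth = 0;

        private String indent() {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < depth; i++) sb.append("  ");
            return sb.toString();
        }

        @Override
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            System.out.println(indent() + "<" + t + ">");
            depth++;
        }

        @Override
        public void handleEndTag(HTML.Tag t, int pos) {
            depth--;
        }

        @Override
        public void handleText(char[] data, int pos) {
            System.out.println(indent() + "\"" + new String(data) + "\"");
        }

        public static void main(String[] args) throws Exception {
            String html = "<html><body><div id='main'><h1>Headline</h1><p>Body text.</p></div></body></html>";
            Reader reader = new StringReader(html);
            // Swing's lenient HTML parser; it tolerates the tag soup found on real pages.
            new ParserDelegator().parse(reader, new StructureDumper(), true);
        }
    }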
                
> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: nutchgora
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (NUTCH-978) A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-978:
---------------------------------------

    Fix Version/s:     (was: 2.1)
                   2.2
    
> A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2012, mentor
>             Fix For: 2.2
>
>         Attachments: app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_configuration_result.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip, [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, version_alpha2.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016748#comment-13016748 ] 

Ammar Shadiq commented on NUTCH-978:
------------------------------------

Wow, thank you very much :-)

> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: 2.0
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
>
>   Original Estimate: 1680h
>  Remaining Estimate: 1680h
>

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232713#comment-13232713 ] 

Ammar Shadiq commented on NUTCH-978:
------------------------------------

I don't think I can participate in this year's GSoC, Lewis; I am no longer a student. I've put the code here so it can be freely used and developed further by anyone.

cheers
Ammar
                
> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: nutchgora
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip, version_alpha2.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Chris A. Mattmann (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211490#comment-13211490 ] 

Chris A. Mattmann commented on NUTCH-978:
-----------------------------------------

Hey Lewis,

I didn't end up mentoring this project because the proposal came too late and the Apache GSoC folks mentioned that the program was already over by that time.

+1 to continuing work on it though!

Cheers,
Chris

                
> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: nutchgora
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017724#comment-13017724 ] 

Ammar Shadiq edited comment on NUTCH-978 at 4/8/11 10:45 PM:
-------------------------------------------------------------

Please correct me if I'm wrong.
In my limited understanding, Nutch uses a plugin system, and one of its extension points is for parsing HTML pages (the HTMLParseFilter class); the appropriate plugin is selected later based on the configuration and then run.

Inside parse-html the main things it extracts are: content, title, and outlinks.

The problem I'm trying to solve is adding custom fields, as in http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html, for various types of content from various sites, and adding them to the index. Instead of creating a new plugin for each site, a Nutch user could simply create an XPath configuration file, put it in the configuration folder, and the parsing of the custom fields would happen automatically, without writing or compiling any code.

In addition, the user could also override Content, Title and Outlinks with a different result. For example, take the title of a page from a news site (example: http://www.guardian.co.uk/world/2011/apr/08/ivory-coast-horror-recounted):

instead of the value of <head><title>:
=Ivory Coast horror recounted by victims and perpetrators | World news | The Guardian

get the headline only, using the XPath /html/body/div[@id='wrapper']/div[@id='box']/div[@id='article-header']/div[@id='main-article-info']/h1/text(), which yields:
=Ivory Coast horror recounted by victims and perpetrators

or follow only the outlinks to related news and ignore the rest:

- Ouattara calls for Ivory Coast sanctions to be lifted
- Ivory Coast crisis: Q&A
- After Gbagbo, what next for Ivory Coast?
- Ivory Coast: The final battle

as in the screenshot here: https://issues.apache.org/jira/secure/attachment/12475860/app_guardian_ivory_coast_news_exmpl.png

Since the default parser is parse-html, I added the handler there as a kind of if-else bypass: if the URL of the page being parsed matches one of the configurations, the page is parsed with it; if no configuration matches the URL, the default parser mechanism is used (a sketch of the XPath evaluation follows below).

I'm sorry for my English and if I'm not presenting my idea well enough.
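For readers unfamiliar with evaluating such an expression in Java, here is a minimal sketch using the standard javax.xml.xpath API. It assumes the fetched page has already been turned into a well-formed DOM (real-world HTML usually has to go through a tolerant parser such as TagSoup or NekoHTML first), and the Guardian XPath above is shortened to a hypothetical /html/body/div/h1/text() so the example stays self-contained.

    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    public class TitleExtractor {
        public static void main(String[] args) throws Exception {
            // A tiny, well-formed stand-in for a cleaned-up news page.
            String xhtml = "<html><body><div>"
                    + "<h1>Ivory Coast horror recounted by victims and perpetrators</h1>"
                    + "<p>Article body...</p>"
                    + "</div></body></html>";

            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // A real configuration would carry a site-specific expression such as the Guardian one above.
            String headline = (String) xpath.evaluate("/html/body/div/h1/text()", doc, XPathConstants.STRING);

            System.out.println(headline); // Ivory Coast horror recounted by victims and perpetrators
        }
    }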

      was (Author: ammarshadiq):
    Please correct me if I'm wrong.
In my limited understanding, Nutch using plugin system, one of those are for parsing html pages (HTMLParseFilter class) whose  later selected appropriate plugin based on the configuration and runs it. 

Inside parse-html the main thing it's extract are : Content, Title, and Outlinks.

The problem that I'm trying to solve are, for adding custom field like on : http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html
for various type of content and for various sites and add it to the index fields. Instead of creating a new plugin for each site, nutch user could simply create the xpath configuration file, put it on the configuration folder and the parsing of custom fields could be done automatically without writing/compiling any code.

In addition, user could also bypass Content, Title and Outlinks with a different result, for example, 
Set the title of page's from a news site (example: http://www.guardian.co.uk/world/2011/apr/08/ivory-coast-horror-recounted), 

instead the value of <head><title> :
=Ivory Coast horror recounted by victims and perpetrators | World news | The Guardian

get the title only, by using xpath of : /html/body/div[@id='wrapper']/div[@id='box']/div[@id='article-header']/div[@id='main-article-info']/h1/text(), and get:
=Ivory Coast horror recounted by victims and perpetrators

or only follow outlinks of related news, ignore the rest:

8 Apr 2011
Ouattara calls for Ivory Coast sanctions to be lifted 
7 Apr 2011
Ivory Coast crisis: Q&A 
5 Apr 2011
After Gbagbo, what next for Ivory Coast? 
5 Apr 2011
Ivory Coast: The final battle 

like the screenshoot here : https://issues.apache.org/jira/secure/attachment/12475860/app_guardian_ivory_coast_news_exmpl.png

Since the default parser are parse-html. I add the handler there, some kind of if-else bypass, if the parsed page have URL that match one of those Configuration, it's parsed by it, if there's no configuration matched with the URL, it's uses the default parser mechanism. 

I'm sorry for my English and if I'm not presenting my idea well enough.
  
> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: 2.0
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
>
>   Original Estimate: 1680h
>  Remaining Estimate: 1680h
>

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ammar Shadiq updated NUTCH-978:
-------------------------------

    Attachment: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf

Proposal Updated

> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: 2.0
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
>
>   Original Estimate: 1680h
>  Remaining Estimate: 1680h
>

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017077#comment-13017077 ] 

Ammar Shadiq commented on NUTCH-978:
------------------------------------

Hi Thomas, thank you for the question and for the information about boilerplate and the Readability bookmarklet.

I think this is different.

It's not just about main-text extraction, but also about specific information such as the values of certain <meta> tags, the link to the illustration picture of a news article (a news article usually has only one picture), or certain links (in the form of anchors) that you want to fetch and process in the next crawling iteration. I think this type of specific configuration would be useful (a sketch follows below).

It doesn't use any training. From what I read in the boilerplate paper, boilerplate uses training data for its algorithm and focuses mainly on the number of words to determine the main text. I wonder what happens if the input pages are Japanese or Chinese; I think they developed a custom tokenizer for that, but I haven't explored the component thoroughly, so I'm not sure. I myself use the component I'm working on to parse pages in Bahasa Indonesia.
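To illustrate the kind of per-site field configuration described here, the sketch below maps field names to XPath expressions and evaluates them against a DOM in one pass. The field names, expressions, and sample markup are invented for the example; they are not taken from the attached configuration files, and a real page would first need to be cleaned into well-formed XML.

    import java.io.StringReader;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    public class FieldExtractor {
        public static void main(String[] args) throws Exception {
            String xhtml = "<html><head>"
                    + "<meta name='author' content='Jane Reporter'/>"
                    + "<title>Example article | Example News</title>"
                    + "</head><body>"
                    + "<img src='http://example.com/photo.jpg'/>"
                    + "<h1>Example article</h1>"
                    + "</body></html>";

            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Hypothetical "configuration": index field name -> XPath expression.
            Map<String, String> fields = new LinkedHashMap<String, String>();
            fields.put("author", "/html/head/meta[@name='author']/@content");
            fields.put("headline", "/html/body/h1/text()");
            fields.put("illustration", "/html/body/img/@src");

            XPath xpath = XPathFactory.newInstance().newXPath();
            for (Map.Entry<String, String> e : fields.entrySet()) {
                String value = (String) xpath.evaluate(e.getValue(), doc, XPathConstants.STRING);
                // In the plugin, these values would be added to the parse metadata / index fields.
                System.out.println(e.getKey() + " = " + value);
            }
        }
    }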

> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: 2.0
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
>
>   Original Estimate: 1680h
>  Remaining Estimate: 1680h
>

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081703#comment-13081703 ] 

Lewis John McGibbney commented on NUTCH-978:
--------------------------------------------

If a plugin has been written for this, would it be possible to get the code added to the wiki? As we have both parse-tika and parse-html for text and outlink extraction from HTML, I don't think this plugin serves much purpose for the average user of Nutch; it really only adds value for users looking for a solution to the specific problem addressed, and this is rare.

It would be disappointing if we were not able to harness and share the code from this small project with other members of the Nutch community via the wiki.

> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: 2.0
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016748#comment-13016748 ] 

Ammar Shadiq edited comment on NUTCH-978 at 4/7/11 8:02 AM:
------------------------------------------------------------

Wow, thank you very much Mr. Jelsman:-)

      was (Author: ammarshadiq):
    Wow, thank you very much :-)
  
> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: 2.0
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
>
>   Original Estimate: 1680h
>  Remaining Estimate: 1680h
>

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212582#comment-13212582 ] 

Lewis John McGibbney commented on NUTCH-978:
--------------------------------------------

Replies:

1 & 2. The main motivation for this issue was processing the news documents
required for my undergrad thesis on clustering Bahasa Indonesia news text;
it needed a preprocessing step to extract the title, news content, date,
and related news links separately.

2. The biggest technical challenge for me was processing the web page so
that it could be parsed as an XML document and queried with XPath (see the
sketch below).

3. The issue dropped away because, with a small tweak, I could get it
working for "only" my thesis requirements. I haven't tested it with web
pages other than the ones I used for my thesis, so I don't think it is
anywhere near finished. And since the proposal was not accepted as a GSoC
project, I lost the motivation to keep working on this issue and decided
to work on my thesis instead.

Related issue: https://issues.apache.org/jira/browse/NUTCH-185
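As an illustration of that preprocessing step, here is a minimal stand-alone sketch: it repairs real-world (non-well-formed) HTML into a W3C DOM with NekoHTML and then runs XPath queries over it. The class name, URL, and XPath expressions below are placeholders for illustration only, not code taken from the attached archive.

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical stand-alone example; not part of the attached plugin code. */
public class XPathExtractionSketch {

    public static void main(String[] args) throws Exception {
        // NekoHTML repairs messy real-world HTML into a standard DOM tree.
        DOMParser parser = new DOMParser();
        // Keep element names lower-case so the XPath expressions below match.
        parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
        parser.parse(new InputSource("http://example.com/news/article.html"));
        Document doc = parser.getDocument();

        // Once the page is a DOM, plain javax.xml.xpath queries select the parts we want.
        XPath xpath = XPathFactory.newInstance().newXPath();
        String title = (String) xpath.evaluate("//h1[@class='title']", doc, XPathConstants.STRING);
        String body  = (String) xpath.evaluate("//div[@id='article-body']", doc, XPathConstants.STRING);

        System.out.println("title: " + title);
        System.out.println("body : " + body);
    }
}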
                
> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: nutchgora
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>

[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ammar Shadiq updated NUTCH-978:
-------------------------------

    Attachment: Screenshot.png

> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: 2.0
>
>         Attachments: Screenshot.png, [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
>
>   Original Estimate: 1680h
>  Remaining Estimate: 1680h
>

[jira] [Updated] (NUTCH-978) A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Lewis John McGibbney (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-978:
---------------------------------------

     Labels: gsoc2012 mentor  (was: gsoc2011 mentor)
    Summary: A Plugin for extracting certain element of a web page on html page parsing.  (was: [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.)

This is as I thought. Look, I've marked it for this year's GSoC; students can apply up until April 6th, IIRC, so if there is any interest then we can progress with it. Thanks, Ammar
                
> A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2012, mentor
>             Fix For: nutchgora
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip, version_alpha2.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017378#comment-13017378 ] 

Julien Nioche commented on NUTCH-978:
-------------------------------------

Can you please explain how your proposal differs from the HTMLParseFilter mechanism that Nutch already has?
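For readers skimming the archive: the HTMLParseFilter (HtmlParseFilter in Nutch 1.x) extension point hands each parsed page's DOM fragment to plugins, so an XPath extractor could hook in there rather than replace parse-html. The sketch below only illustrates that idea; the interface and helper calls are recalled from the Nutch 1.x sources and may differ in detail, and the package name, metadata field, and XPath expression are made up, so treat it as an assumption rather than the proposal's actual code.

package org.apache.nutch.parse.xpath;  // hypothetical plugin package

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

/** Hypothetical XPath extractor hooked into the existing HtmlParseFilter extension point. */
public class XPathParseFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    try {
      XPath xpath = XPathFactory.newInstance().newXPath();
      // "headline" and "//h1" are placeholders; a real plugin would look the
      // expression up from an XPath configuration file selected by URL regex.
      // Note: element-name case in the fragment depends on the HTML parser used.
      String headline = (String) xpath.evaluate("//h1", doc, XPathConstants.STRING);
      Parse parse = parseResult.get(content.getUrl());
      if (parse != null && headline != null && headline.trim().length() > 0) {
        parse.getData().getParseMeta().add("headline", headline.trim());
      }
    } catch (Exception e) {
      // On any extraction failure, fall through and keep the default parse result.
    }
    return parseResult;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}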

> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: 2.0
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
>
>   Original Estimate: 1680h
>  Remaining Estimate: 1680h
>

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212523#comment-13212523 ] 

Ammar Shadiq commented on NUTCH-978:
------------------------------------

Hi Lewis,

Since the proposal was not accepted, I spent my summer working on my undergrad thesis. I graduated from college recently and my time has freed up, so I'd love to help; it would be awesome if we could collaborate.

Thanks,
Ammar
                
> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: nutchgora
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211470#comment-13211470 ] 

Lewis John McGibbney commented on NUTCH-978:
--------------------------------------------

Hi Chris, did you mentor this project through GSoC? I've downloaded the .zip available in the description (which I've also attached in case the link goes AWOL) and I'm going to play about with it. I'll attach a patch if I get anywhere.
                
> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: nutchgora
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>

[jira] [Issue Comment Edited] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016407#comment-13016407 ] 

Ammar Shadiq edited comment on NUTCH-978 at 4/7/11 8:02 AM:
------------------------------------------------------------

Proposal for Google Summer of Code 2011
http://www.google-melange.com/gsoc/homepage/google/gsoc2011

      was (Author: ammarshadiq):
    Proposal for Google Summer of Code 2011
http://www.google-melange.com/gsoc/homepage/google/gsoc2011

haven't found any mentor yet :-(
  
> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: 2.0
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
>
>   Original Estimate: 1680h
>  Remaining Estimate: 1680h
>

[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232588#comment-13232588 ] 

Lewis John McGibbney commented on NUTCH-978:
--------------------------------------------

Great, Ammar. Do you want to add this as a GSoC 2012 project? I am already mentoring one project, and time/work restrictions mean that I can't take on another mentoring role. If you don't wish to make this a project this year, at least the code is here for others to pick up in the future.
                
> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: nutchgora
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip, version_alpha2.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>

[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ammar Shadiq updated NUTCH-978:
-------------------------------

    Attachment: version_alpha2.zip

Uploaded the latest version; it works on Nutch 1.2.
                
> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: nutchgora
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, for_GSoc.zip, version_alpha2.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>

[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

Posted by "Ammar Shadiq (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ammar Shadiq updated NUTCH-978:
-------------------------------

    Attachment: app_guardian_ivory_coast_news_exmpl.png

> [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.
> ---------------------------------------------------------------------------------------
>
>                 Key: NUTCH-978
>                 URL: https://issues.apache.org/jira/browse/NUTCH-978
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>            Reporter: Ammar Shadiq
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: gsoc2011, mentor
>             Fix For: 2.0
>
>         Attachments: [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, app_guardian_ivory_coast_news_exmpl.png, app_screenshoot_configuration_result.png, app_screenshoot_configuration_result_anchor.png, app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png
>
>   Original Estimate: 1680h
>  Remaining Estimate: 1680h
>