Posted to dev@nutch.apache.org by "Kaidul Islam (JIRA)" <ji...@apache.org> on 2017/05/26 15:39:04 UTC
[jira] [Commented] (NUTCH-2389) Precise data parsing using Jsoup CSS selectors
[ https://issues.apache.org/jira/browse/NUTCH-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026412#comment-16026412 ]
Kaidul Islam commented on NUTCH-2389:
-------------------------------------
Hi [~lewismc], I ended up writing a {{ParseFilter}} plugin, {{parse-jsoup}}, which extracts content from pages whose URL matches a configured pattern. Extraction uses the jsoup API (selection by CSS selector, optionally reading an attribute) and is driven by an XML configuration file. A sample configuration file:
{code:xml}
<page>
  <!-- Jsoup selection can be applied on those webpages which will match this url pattern -->
  <url-regex>^https?://(?:www\.)?youtu(?:\.be/|be\.com/watch\?v=)(?:[a-zA-Z0-9_-]{11}).*$</url-regex>
  <!-- Fields to parse -->
  <fields>
    <field name="title">
      <css-selector>#eow-title</css-selector>
      <default-value>foobar</default-value>
    </field>
    <field name="description">
      <css-selector>#watch-description-text p#eow-description</css-selector>
    </field>
    <field name="uploadTime">
      <css-selector>.watch-time-text</css-selector>
    </field>
    <field name="likeCount">
      <css-selector>.like-button-renderer-like-button.like-button-renderer-like-button-unclicked span.yt-uix-button-content</css-selector>
    </field>
    <field name="dislikeCount">
      <css-selector>.like-button-renderer-dislike-button.like-button-renderer-dislike-button-unclicked span.yt-uix-button-content</css-selector>
    </field>
    <field name="viewCount">
      <css-selector>.watch-view-count</css-selector>
    </field>
    <field name="subscriberCount">
      <css-selector>.yt-subscriber-count</css-selector>
    </field>
    <field name="publisherName">
      <css-selector>.yt-user-info a</css-selector>
    </field>
    <field name="publisherChannel">
      <css-selector>.yt-user-info a</css-selector>
      <attribute>abs:href</attribute>
    </field>
    <field name="publisherStatus">
      <css-selector>.yt-user-info span</css-selector>
      <attribute>aria-label</attribute>
    </field>
    <field name="category">
      <css-selector>.watch-extras-section :nth-child(1) a</css-selector>
    </field>
  </fields>
</page> <!-- End of page -->
{code}
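To make the shape of this configuration concrete, here is a minimal sketch of how one {{<page>}} block might be loaded into memory and matched against a URL. It uses only the JDK's DOM parser. The class and member names ({{PageRuleSketch}}, {{FieldRule}}, {{parse}}, {{appliesTo}}) are hypothetical and are not taken from the plugin's source; the jsoup extraction step itself is left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical in-memory form of one <page> block; not the plugin's actual classes. */
final class PageRuleSketch {

    /** One <field> entry; attribute and defaultValue are null when the element is absent. */
    static final class FieldRule {
        final String name;
        final String cssSelector;
        final String attribute;
        final String defaultValue;

        FieldRule(String name, String cssSelector, String attribute, String defaultValue) {
            this.name = name;
            this.cssSelector = cssSelector;
            this.attribute = attribute;
            this.defaultValue = defaultValue;
        }
    }

    final Pattern urlPattern;
    final List<FieldRule> fields = new ArrayList<>();

    private PageRuleSketch(Pattern urlPattern) {
        this.urlPattern = urlPattern;
    }

    /** Parses a single <page> element given as an XML string. */
    static PageRuleSketch parse(String xml) {
        try {
            Element page = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            PageRuleSketch rule = new PageRuleSketch(Pattern.compile(childText(page, "url-regex")));
            NodeList nodes = page.getElementsByTagName("field");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element field = (Element) nodes.item(i);
                rule.fields.add(new FieldRule(
                        field.getAttribute("name"),
                        childText(field, "css-selector"),
                        childText(field, "attribute"),
                        childText(field, "default-value")));
            }
            return rule;
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid page configuration", e);
        }
    }

    /** Returns the trimmed text of the first descendant with the given tag, or null if none exists. */
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    /** True when the whole URL matches the configured <url-regex>. */
    boolean appliesTo(String url) {
        return urlPattern.matcher(url).matches();
    }
}
```

With rules in this form, a filter could presumably call jsoup's {{Document.select(cssSelector)}} for each field and read either the element text or, when {{<attribute>}} is set, {{Element.attr(attribute)}} (jsoup resolves the {{abs:}} prefix to an absolute URL), falling back to the default value when nothing matches.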
As with {{parse-metatags}}, I store the extracted values in {{Map<CharSequence, ByteBuffer> metadata}}, with {{jsoup_}} as the key prefix. To index this data, I use an {{IndexingFilter}} plugin similar to {{index-metadata}}, which indexes the entries whose keys start with {{jsoup_}}.
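The prefix convention can be illustrated with a small sketch that uses only JDK types. The class and method names ({{JsoupMetadataSketch}}, {{store}}, {{indexable}}) are hypothetical, as is the choice to strip the prefix when building index field names; the real plugin may keep the full key, and Nutch 2.x would normally use Avro {{Utf8}} keys rather than {{String}}.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper showing the jsoup_ prefix convention; not the plugin's actual code. */
final class JsoupMetadataSketch {

    static final String PREFIX = "jsoup_";

    /** Parse side: stores an extracted value under the prefixed key, encoded as UTF-8 bytes. */
    static void store(Map<CharSequence, ByteBuffer> metadata, String field, String value) {
        metadata.put(PREFIX + field, ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    /** Index side: collects only the prefixed entries, keyed by field name without the prefix (an assumption). */
    static Map<String, String> indexable(Map<CharSequence, ByteBuffer> metadata) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<CharSequence, ByteBuffer> entry : metadata.entrySet()) {
            String key = entry.getKey().toString();
            if (key.startsWith(PREFIX)) {
                // duplicate() so decoding does not move the position of the stored buffer
                String value = StandardCharsets.UTF_8.decode(entry.getValue().duplicate()).toString();
                out.put(key.substring(PREFIX.length()), value);
            }
        }
        return out;
    }
}
```

Entries written by other plugins are left untouched, since the indexing side only looks at keys that start with the prefix.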
This met my requirements at work, where I was building a training dataset and knowledge base of 10M youtube.com videos for an NLP-based project. I am not sure how well it covers the general case.
Also, as far as I can see, a similar plugin was proposed previously in NUTCH-978. Judging from the comments there, it was fairly controversial, and the issue was eventually closed. Please let me know your opinion of this plugin. I have a doubt of my own about it: should it be a parse filter or a parser plugin?
Thanks!
> Precise data parsing using Jsoup CSS selectors
> ----------------------------------------------
>
> Key: NUTCH-2389
> URL: https://issues.apache.org/jira/browse/NUTCH-2389
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Affects Versions: 2.3
> Reporter: Kaidul Islam
> Assignee: Kaidul Islam
> Fix For: 2.4
>
> Original Estimate: 0.05h
> Remaining Estimate: 0.05h
>
> As far as I know, Nutch 1.x and 2.x currently have no feature for extracting exact content from specific websites. I've developed a plugin, {{parse-jsoup}}, using Jsoup for my current project. It extracts precise content for site-specific crawling, driven by a detailed XML configuration (field name, CSS selector, attribute, extraction rules, data type, default value, etc.).
> Please let me know if this feature seems relevant and is not already present in Nutch. I also plan to port it to Nutch 1.x.