Posted to dev@nutch.apache.org by "Kaidul Islam (JIRA)" <ji...@apache.org> on 2017/05/26 15:39:04 UTC
[jira] [Commented] (NUTCH-2389) Precise data parsing using Jsoup CSS selectors

    [ https://issues.apache.org/jira/browse/NUTCH-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026412#comment-16026412 ] 

Kaidul Islam commented on NUTCH-2389:
-------------------------------------

Hi [~lewismc], I ended up writing a {{ParseFilter}} plugin, {{parse-jsoup}}, which extracts content from pages whose URLs match a configured pattern. The extraction uses the jsoup API (selection by CSS selector, optionally reading an attribute) and is driven by an XML configuration file. A sample configuration file:

{code:xml}
<page>
	<!-- Jsoup selection is applied only to pages whose URL matches this pattern -->
	<url-regex>^https?://(?:www\.)?youtu(?:\.be/|be\.com/watch\?v=)(?:[a-zA-Z0-9_-]{11}).*$</url-regex>

	<!-- Fields to parse -->
	<fields>
		<field name="title">
			<css-selector>#eow-title</css-selector>
			<default-value>foobar</default-value>
		</field>
		<field name="description">
			<css-selector>#watch-description-text p#eow-description</css-selector>
		</field>
		<field name="uploadTime">
			<css-selector>.watch-time-text</css-selector>
		</field>
		<field name="likeCount">
			<css-selector>.like-button-renderer-like-button.like-button-renderer-like-button-unclicked span.yt-uix-button-content</css-selector>
		</field>
		<field name="dislikeCount">
			<css-selector>.like-button-renderer-dislike-button.like-button-renderer-dislike-button-unclicked span.yt-uix-button-content</css-selector>
		</field>
		<field name="viewCount">
			<css-selector>.watch-view-count</css-selector>
		</field>
		<field name="subscriberCount">
			<css-selector>.yt-subscriber-count</css-selector>
		</field>
		<field name="publisherName">
			<css-selector>.yt-user-info a</css-selector>
		</field>
		<field name="publisherChannel">
			<css-selector>.yt-user-info a</css-selector>
			<attribute>abs:href</attribute>
		</field>
		<field name="publisherStatus">
			<css-selector>.yt-user-info span</css-selector>
			<attribute>aria-label</attribute>
		</field>
		<field name="category">
			<css-selector>.watch-extras-section :nth-child(1) a</css-selector>
		</field>
	</fields>
</page> <!-- End of page -->
{code}
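
To make the shape of this configuration concrete, here is a minimal sketch of reading one {{<page>}} block into memory and checking whether a URL falls under it. It uses only the JDK's DOM parser and {{java.util.regex}}. The class and method names ({{PageRuleSketch}}, {{FieldRule}}, {{parse}}, {{appliesTo}}) are hypothetical and are not taken from the plugin itself; the actual extraction step, where each selector would be handed to jsoup, is deliberately left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical in-memory form of one <page> block; not the plugin's real classes. */
final class PageRuleSketch {

    /** One <field> entry; attribute and defaultValue are null when the element is absent. */
    static final class FieldRule {
        final String name;
        final String cssSelector;
        final String attribute;
        final String defaultValue;

        FieldRule(String name, String cssSelector, String attribute, String defaultValue) {
            this.name = name;
            this.cssSelector = cssSelector;
            this.attribute = attribute;
            this.defaultValue = defaultValue;
        }
    }

    final Pattern urlPattern;
    final List<FieldRule> fields = new ArrayList<>();

    private PageRuleSketch(Pattern urlPattern) {
        this.urlPattern = urlPattern;
    }

    /** Parses a single <page> element given as an XML string. */
    static PageRuleSketch parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element page = doc.getDocumentElement();
            PageRuleSketch rule = new PageRuleSketch(Pattern.compile(childText(page, "url-regex")));
            NodeList nodes = page.getElementsByTagName("field");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element field = (Element) nodes.item(i);
                rule.fields.add(new FieldRule(
                        field.getAttribute("name"),
                        childText(field, "css-selector"),
                        childText(field, "attribute"),
                        childText(field, "default-value")));
            }
            return rule;
        } catch (Exception e) {
            // Wrap checked parser errors so callers see one unchecked failure type.
            throw new IllegalArgumentException("Invalid page configuration", e);
        }
    }

    /** Returns the trimmed text of the first descendant with this tag, or null if none exists. */
    static String childText(Element parent, String tag) {
        NodeList matches = parent.getElementsByTagName(tag);
        return matches.getLength() == 0 ? null : matches.item(0).getTextContent().trim();
    }

    /** True when the whole URL matches the <url-regex> of this page block. */
    boolean appliesTo(String url) {
        return urlPattern.matcher(url).matches();
    }
}
```

A caller would run {{PageRuleSketch.parse(xml)}} once at start-up, then call {{appliesTo(url)}} for each fetched page and skip extraction when it returns false. Treating a missing {{<attribute>}} as "use the element text" and a missing {{<default-value>}} as "no fallback" is an assumption of this sketch.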

As in {{parse-metatags}}, I put the extracted values into {{Map<CharSequence, ByteBuffer> metadata}}, with {{jsoup_}} as the key prefix. To index this data, I use an {{IndexingFilter}} plugin similar to {{index-metadata}}, which indexes the entries whose keys start with {{jsoup_}}.
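
The prefix convention can be sketched roughly as follows. This is an illustration under stated assumptions, not the plugin's code: the class {{JsoupMetadataSketch}} and its methods are hypothetical, plain {{String}} keys stand in for whatever {{CharSequence}} implementation Nutch actually uses, and values are assumed to be UTF-8 text.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper showing the jsoup_ prefix convention; not part of Nutch. */
final class JsoupMetadataSketch {

    static final String PREFIX = "jsoup_";

    /** Parse side: store an extracted value under "jsoup_" + field name as UTF-8 bytes. */
    static void store(Map<CharSequence, ByteBuffer> metadata, String field, String value) {
        metadata.put(PREFIX + field, ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    /** Index side: pick out only the prefixed entries and strip the prefix from the key. */
    static Map<String, String> indexable(Map<CharSequence, ByteBuffer> metadata) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<CharSequence, ByteBuffer> entry : metadata.entrySet()) {
            String key = entry.getKey().toString();
            if (key.startsWith(PREFIX)) {
                // duplicate() so decoding does not move the stored buffer's position
                String value = StandardCharsets.UTF_8.decode(entry.getValue().duplicate()).toString();
                out.put(key.substring(PREFIX.length()), value);
            }
        }
        return out;
    }
}
```

The prefix keeps the extracted fields apart from other metadata entries, so the indexing side can select them without knowing the field names in advance. Whether the indexed field name should keep or drop the prefix is a design choice; dropping it here is an assumption of the sketch.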

This met my requirements at work, where I was building a training dataset and knowledge base of 10M youtube.com videos for an NLP-based project. I am not sure how well it fits the general case, though.

Also, I see that a similar plugin was proposed earlier in NUTCH-978. Judging from the comments, it was fairly controversial, and the issue was eventually closed. Please let me know your opinion of this plugin. I have a doubt of my own: should it be a parse filter or a parser plugin?

Thanks!

> Precise data parsing using Jsoup CSS selectors
> ----------------------------------------------
>
>                 Key: NUTCH-2389
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2389
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 2.3
>            Reporter: Kaidul Islam
>            Assignee: Kaidul Islam
>             Fix For: 2.4
>
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> As far as I know, Nutch 1.x and 2.x currently have no feature for extracting/parsing exact content from specific websites. I've developed a plugin, {{parse-jsoup}}, using Jsoup for my current project. It extracts precise content for site-specific crawling, driven by a detailed XML configuration (field name, CSS selector, attribute, extraction rules, data type, default value, etc.).
> Please let me know if this feature seems relevant and is not already present in Nutch. I also plan to port it to Nutch 1.x.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)