Posted to dev@nutch.apache.org by "Luke (JIRA)" <ji...@apache.org> on 2014/01/26 13:01:38 UTC

[jira] [Commented] (NUTCH-1414) Date extraction parse filter

    [ https://issues.apache.org/jira/browse/NUTCH-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882282#comment-13882282 ] 

Luke commented on NUTCH-1414:
-----------------------------

Hi Markus/Others,

Firstly, let me say I like this functionality and wish it was built into a shipped plugin - I'm surprised there isn't more interest in this. Am I missing something? Is there a newer/better way of extracting dates from parsed text?

A couple of questions:
* I was wondering whether you'd attempted to pass the extracted date to Solr (or another indexing backend) as a date-typed field, rather than as a string. If so, how did you do it?
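To illustrate what I mean by "a date format": Solr date fields expect ISO-8601 timestamps in UTC with a trailing {{Z}}. A minimal sketch of the conversion, assuming the filter ends up with a day-resolution date and that midnight UTC is an acceptable time component (the class and method names here are hypothetical, not from the patch):

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class SolrDateSketch {
  // Hypothetical helper: renders a day-resolution date as the
  // ISO-8601 UTC timestamp form that Solr date fields accept.
  // Assumption: midnight UTC stands in for the unknown time of day.
  static String toSolrDate(LocalDate date) {
    return DateTimeFormatter.ISO_INSTANT.format(
        date.atStartOfDay(ZoneOffset.UTC).toInstant());
  }
}
```

How the resulting value is then attached to the indexed document is left open here, since that depends on the indexing filter in use.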

* Many websites now put the date in the URL (especially WordPress sites), e.g. /2014/01/26/, -20140126-, /2014/jan/26/, etc. Did you consider also searching the URL?
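As a rough idea of what searching the URL could look like, here is a minimal sketch covering only the three shapes mentioned above. The class name, method names and patterns are my own assumptions for illustration, not anything taken from the patch, and real URLs would certainly need more patterns than this:

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlDateSketch {
  // Matches /2014/01/26/ and /2014/jan/26/ style path segments
  private static final Pattern SLASHED =
      Pattern.compile("/(\\d{4})/(\\d{1,2}|[A-Za-z]{3})/(\\d{1,2})(?:/|$)");
  // Matches an isolated run of exactly eight digits, e.g. -20140126-
  private static final Pattern COMPACT =
      Pattern.compile("(?<!\\d)(\\d{4})(\\d{2})(\\d{2})(?!\\d)");
  private static final List<String> MONTHS = Arrays.asList(
      "jan", "feb", "mar", "apr", "may", "jun",
      "jul", "aug", "sep", "oct", "nov", "dec");

  // Returns the first plausible calendar date found in the URL, if any.
  static Optional<LocalDate> extract(String url) {
    for (Pattern p : Arrays.asList(SLASHED, COMPACT)) {
      Matcher m = p.matcher(url);
      while (m.find()) {
        Optional<LocalDate> d = toDate(m.group(1), m.group(2), m.group(3));
        if (d.isPresent()) {
          return d;
        }
      }
    }
    return Optional.empty();
  }

  // Converts the captured parts to a date; invalid combinations
  // (month 13, day 32, unknown month name) yield an empty result.
  private static Optional<LocalDate> toDate(String y, String mo, String d) {
    int month;
    if (Character.isDigit(mo.charAt(0))) {
      month = Integer.parseInt(mo);
    } else {
      month = MONTHS.indexOf(mo.toLowerCase(Locale.ROOT)) + 1;
      if (month == 0) {
        return Optional.empty();
      }
    }
    try {
      return Optional.of(
          LocalDate.of(Integer.parseInt(y), month, Integer.parseInt(d)));
    } catch (DateTimeException e) {
      return Optional.empty();
    }
  }
}
```

Validating through {{LocalDate.of}} rather than in the regex keeps the patterns short and rejects eight-digit IDs that do not form a real date, although an ID that happens to look like a date would still be accepted.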

* In getFragment() there is this code:
{code}
     // Check if we need to obtain the tail
     if (text.length() > maxFragmentLength + headFragmentLength) {
       tail = text.substring(text.length() - maxFragmentLength);
     }
{code}
I'm not sure that this does what it's meant to.
Looking at the code above, for there to be a tail the total length has to exceed {{2 x maxFragmentLength}} (assuming {{headFragmentLength}} equals {{maxFragmentLength}} once the text is at least that long).
If {{text.length() > 2 x maxFragmentLength}}, then the fragment is essentially of length {{2 x maxFragmentLength}}: the head plus the tail.
However, if {{maxFragmentLength < text.length() <= 2 x maxFragmentLength}}, then the fragment is just the head, so a date in the tail may be missed for short (but not too short) pages. In this case it would make sense to use the whole text as the fragment.
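To make the suggestion concrete, here is a minimal sketch of the behaviour I have in mind. It is not the patch's {{getFragment()}}: I am assuming the head is simply the first {{maxFragmentLength}} characters and that head and tail are joined with a space, since the rest of the method isn't quoted here.

```java
final class FragmentSketch {
  // Hypothetical variant of getFragment(): returns the whole text when
  // a head and tail of maxFragmentLength each would touch or overlap,
  // and head + tail otherwise.
  static String getFragment(String text, int maxFragmentLength) {
    // Short and medium-length pages: keep everything, so a date near
    // the end is not lost
    if (text.length() <= 2 * maxFragmentLength) {
      return text;
    }
    String head = text.substring(0, maxFragmentLength);
    String tail = text.substring(text.length() - maxFragmentLength);
    // Assumption: a single space is an acceptable separator
    return head + " " + tail;
  }
}
```

With this shape the fragment never exceeds roughly {{2 x maxFragmentLength}}, which is the same upper bound the current code has for long pages.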

* I understand Julien's point of view that this is somewhat micro functionality, although handling dates does seem to require quite specific code. I've seen discussions elsewhere that suggest implementing a system such as the one described at http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/ and, whilst this seems to be a good option, it could not offer the same accuracy as this plugin. Is there any chance that this would be promoted to a shipped plugin? What would need to happen for that?


> Date extraction parse filter
> ----------------------------
>
>                 Key: NUTCH-1414
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1414
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.9
>
>         Attachments: NUTCH-1414-1.6-1-testdata.patch, NUTCH-1414-1.6-1.patch
>
>
> Date extraction parse filter for Nutch to provide means to extract an arbitrary page date (article date) from the parse text.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)