Posted to dev@nutch.apache.org by Michael Nebel <mi...@nebel.de> on 2005/09/01 01:21:03 UTC

Re: [jira] Commented: (NUTCH-65) index-more plugin can't parse large set of modification-date

Hi Jérôme,

it works great (see the new function below). But we'll have to add
commons-lang (http://jakarta.apache.org/commons/lang/) to the libraries.
Are there any objections? What is the procedure for adding it?

I'm testing my changes right now (I think it will take the rest of the
night to complete the job :-). So I'll get some sleep. Tomorrow I'll
check whether the conversions were successful and submit a patch to NUTCH-65.

Good night

	Michael


PS: here is the function as it stands now:

-----------------------------------
   // Needs java.text.ParseException, java.util.Date and
   // org.apache.commons.lang.time.DateUtils (commons-lang).
   private long getTime(String date, String url) {
     long time = -1;
     try {
       // first try the standard HTTP date format
       time = HttpDateFormat.toLong(date);
     } catch (ParseException e) {
       // try to parse it as a date in one of the alternative formats
       try {
         Date parsedDate = DateUtils.parseDate(date,
             new String [] {
                 "EEE MMM dd HH:mm:ss yyyy",
                 "EEE, dd MMM yyyy HH:mm:ss zzz",
                 "EEE,dd MMM yyyy HH:mm:ss zzz",
                 "EEE, dd MMM yyyy HH:mm:sszzz",
                 "EEE, dd MMM yyyy HH:mm:ss",
                 "EEE, dd-MMM-yy HH:mm:ss zzz",
                 "yyyy/MM/dd HH:mm:ss.SSS zzz",
                 "yyyy/MM/dd HH:mm:ss.SSS",
                 "yyyy/MM/dd HH:mm:ss zzz",
                 "yyyy/MM/dd",
                 "yyyy.MM.dd HH:mm:ss",
                 "yyyy-MM-dd HH:mm",
                 "dd.MM.yyyy HH:mm:ss zzz",
                 "dd MM yyyy HH:mm:ss zzz",
                 "dd.MM.yyyy; HH:mm:ss",
                 "dd.MM.yyyy HH:mm:ss",
                 "dd.MM.yyyy zzz"
             });
         time = parsedDate.getTime();
         // LOG.warning(url + ": parsed date: " + date + " to:" + time);
       } catch (Exception e2) {
         // none of the formats matched: log it and return -1
         LOG.warning(url + ": can't parse erroneous date: " + date);
       }
     }
     return time;
   }
	


-------------------------
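
For readers without commons-lang at hand, here is a minimal JDK-only sketch of
the same "try each pattern in turn" idea. It is not the commons-lang
implementation. The class name MultiPatternDateParser, the shortened pattern
list, the choice of GMT for dates without a zone, and the strict
whole-string match are all assumptions made for illustration.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of Nutch or commons-lang.
final class MultiPatternDateParser {

    // Illustrative subset of the patterns used in getTime() above.
    static final String[] PATTERNS = {
        "EEE, dd MMM yyyy HH:mm:ss zzz",
        "yyyy/MM/dd HH:mm:ss",
        "yyyy-MM-dd HH:mm",
        "dd.MM.yyyy HH:mm:ss",
        "yyyy/MM/dd"
    };

    /** Returns epoch milliseconds, or -1 if no pattern matches the whole string. */
    static long parse(String date) {
        for (String pattern : PATTERNS) {
            // A new formatter per call: SimpleDateFormat is not thread-safe.
            SimpleDateFormat df = new SimpleDateFormat(pattern, Locale.US);
            df.setLenient(false);
            // Assumption: dates without an explicit zone are read as GMT.
            df.setTimeZone(TimeZone.getTimeZone("GMT"));
            ParsePosition pos = new ParsePosition(0);
            Date parsed = df.parse(date, pos); // null on failure, no exception
            // Accept only if the pattern consumed the entire input.
            if (parsed != null && pos.getIndex() == date.length()) {
                return parsed.getTime();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(parse("15.10.2003 04:58:08"));
        System.out.println(parse("Wed, 15 Jun 2005 18:56:56 GMT"));
        System.out.println(parse("not a date"));
    }
}
```

The order of the patterns matters: the more specific ones should come before
shorter ones such as "yyyy/MM/dd".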

Jérôme Charron wrote:

> Michael,
> 
> the solution is perhaps to use Jakarta Commons DateUtils.parseDate method:
> http://jakarta.apache.org/commons/lang/api/org/apache/commons/lang/time/DateUtils.html#parseDate(java.lang.String,%20java.lang.String[])
> 
> It would give something like:
> 
> Date parsedDate = DateUtils.parseDate(dates[i],
>         new String [] {"yyyy/MM/dd",
>                        "yyyy.MM.dd HH:mm:ss",
>                        "yyyy-MM-dd HH:mm",
>                        ...
>                        and so on
>                        ...
>                        });
> 
> No time to code and test it right now, but I assume it could be a solution.
> I will add a comment to this issue reporting your previous message and mine
> (as usual, to keep track of the information).
> 
> Regards
> 
> Jérôme
> 
> On 8/31/05, Michael Nebel <mi...@nebel.de> wrote:
> 
>>Some more errors (short selection from my logfile). Do we really have to
>>handle them all separately, or are there any functions/tools for this kind
>>of problem?
>>
>>...can't parse erroneous date: 12.06.2005 22:02:54 GMT
>>...can't parse erroneous date: 14.07.2005 GMT
>>...can't parse erroneous date: 15.10.2003 04:58:08
>>...can't parse erroneous date: 16 6 2005 00:00:00 GMT
>>...can't parse erroneous date: 16.06.2005 10:10:57 GMT
>>...can't parse erroneous date: 2005/06/21 20:51:40.618 GMT+2
>>...can't parse erroneous date: 29.06.2005 GMT
>>...can't parse erroneous date: 31.5.2005; 10:14:49
>>...can't parse erroneous date: 968776128
>>...can't parse erroneous date: Aug 17 2005 04:34:59 GMT
>>...can't parse erroneous date: Di, 21 Jun 2005 09:06:32 GMT
>>...can't parse erroneous date: Die, 21 Jun 2005 12:12:22 GMT
>>...can't parse erroneous date: Do, 16 Jun 2005 09:00:10 GMT
>>...can't parse erroneous date: FRI, Jan 25 2099 23:59:59 GMT
>>...can't parse erroneous date: Fri, 01 Okt 2004 10:43:20 GMT
>>...can't parse erroneous date: Fri,17 Jun 2005 00:01:14 GMT
>>...can't parse erroneous date: Friday, 01-Mar-96 16:04:23 GMT
>>...can't parse erroneous date: Friday, 10-Aug-2001 23:36:10 GMT
>>...can't parse erroneous date: Friday, 13-Feb-04 13:09:46 GMT
>>...can't parse erroneous date: Friday, 18-Feb-05 09:33:06 GMT
>>...can't parse erroneous date: Jun 16 2005 08:22:22 GMT
>>...can't parse erroneous date: Mi, 15 Jun 2005 12:06:18 GMT
>>...can't parse erroneous date: Mon, 04 Okt 2004 07:38:27 GMT
>>...can't parse erroneous date: Monday, 03-Nov-2003 18:35:32 GMT
>>...can't parse erroneous date: Monday, 04-Aug-103 13:15:54 GMT
>>...can't parse erroneous date: Monday, 04-Mar-96 07:42:44 GMT
>>...can't parse erroneous date: November 18 2004 11:26:19. CET
>>...can't parse erroneous date: Sat, 11 Okt 2003 16:57:11 GMT
>>...can't parse erroneous date: Sat, 23 Okt 2004 13:50:17 GMT
>>...can't parse erroneous date: Saturday, 25-Nov-50 22:15:00 GMT
>>...can't parse erroneous date: Sun Jul 31 18:10:19 CEST 2005
>>...can't parse erroneous date: Thursday, 23-May-96 08:48:36 GMT
>>...can't parse erroneous date: Tue 21 Jun 2005 13:16:47GMT
>>...can't parse erroneous date: Tue, 05 Okt 2004 05:35:34 GMT
>>...can't parse erroneous date: Tuesday, 04-Mar-03 12:10:17 GMT
>>...can't parse erroneous date: Wed 15 Jun 2005 18:55:15GMT
>>...can't parse erroneous date: Wed 15 Jun 2005 18:58:56GMT
>>...can't parse erroneous date: Wed 15 Jun 2005 19:12:14GMT
>>...can't parse erroneous date: Wed, 06 Okt 2004 13:01:36 GMT
>>...can't parse erroneous date: Wed,15 Jun 2005 18:56:56 GMT
>>...can't parse erroneous date: Wednesday, 15-Jun-05 15:26:26 GMT
>>...can't parse erroneous date: Wednesday, 25-May-05 15:45:54 GMT
>>...can't parse erroneous date: Wednesday, 26-May-2004 10:48:00 GMT
>>
>>
>>
>>
>>Michael Nebel (JIRA) wrote:
>>
>>
>>>[ http://issues.apache.org/jira/browse/NUTCH-65?page=comments#action_12320291 ]
>>>
>>>Michael Nebel commented on NUTCH-65:
>>>------------------------------------
>>>
>>>I checked out the trunk on 24/Aug/05 and still get some errors:
>>>
>>>... can't parse erroneous date: Fri, 29 Okt 2004 11:08:24 GMT
>>>... can't parse erroneous date: Thu, 28 Okt 2004 08:59:16 GMT
>>>... can't parse erroneous date: Thu, 14 Okt 2004 07:17:15 GMT
>>>... can't parse erroneous date: Fri, 08 Okt 2004 12:42:00 GMT
>>>... can't parse erroneous date: Tue, 26 Okt 2004 09:31:48 GMT
>>>... can't parse erroneous date: Tue, 19 Okt 2004 06:03:00 GMT
>>>
>>>Note: it's the German "Okt" for "Oktober", not "Oct" for the English
>>>"October". I think Locale.US is confusing the date parser.
>>>
>>>
>>>>index-more plugin can't parse large set of modification-date
>>>>-------------------------------------------------------------
>>>>
>>>>Key: NUTCH-65
>>>>URL: http://issues.apache.org/jira/browse/NUTCH-65
>>>>Project: Nutch
>>>>Type: Bug
>>>>Components: indexer
>>>>Environment: nutch 0.7, java 1.5, linux
>>>>Reporter: Lutischán Ferenc
>>>
>>>
>>>>I found a problem in MoreIndexingFilter.java.
>>>>When indexing segments, I get a long list of error messages like:
>>>>can't parse errorenous date: Wed, 10 Sep 2003 11:59:14 or
>>>>can't parse errorenous date: Wed, 10 Sep 2003 11:59:14GMT
>>>>I modified the source code (I did not make a patch):
>>>>Original (lines 137-138):
>>>>DateFormat df = new SimpleDateFormat("EEE MMM dd HH:mm:ss yyyy zzz");
>>>>Date d = df.parse(date);
>>>>New:
>>>>DateFormat df = new SimpleDateFormat("EEE, MMM dd HH:mm:ss yyyy", Locale.US);
>>>>Date d = df.parse(date.substring(0,25));
>>>>The modified code works fine.
>>>
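
On the "Okt" dates quoted above: they mix English day names ("Fri") with a
German month abbreviation, so neither an English nor a German locale alone
will match them. One possible workaround, sketched here under assumptions, is
to rewrite the German month abbreviations to their English form before
parsing with Locale.US. The class name GermanMonthNormalizer and the mapping
table are hypothetical and deliberately incomplete (ASCII-only, and German
day names such as "Di" or "Do" are not handled).

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch.
final class GermanMonthNormalizer {

    // German abbreviations that differ from the English ones (illustrative subset).
    private static final Map<String, String> DE_TO_EN = new LinkedHashMap<>();
    static {
        DE_TO_EN.put("Mrz", "Mar");
        DE_TO_EN.put("Mai", "May");
        DE_TO_EN.put("Okt", "Oct");
        DE_TO_EN.put("Dez", "Dec");
    }

    /** Replaces space-delimited German month abbreviations with English ones. */
    static String normalize(String date) {
        String result = date;
        for (Map.Entry<String, String> e : DE_TO_EN.entrySet()) {
            result = result.replace(" " + e.getKey() + " ", " " + e.getValue() + " ");
        }
        return result;
    }

    /** Returns epoch milliseconds, or -1 if the normalized date does not parse. */
    static long parse(String date) {
        SimpleDateFormat df =
            new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz", Locale.US);
        try {
            return df.parse(normalize(date)).getTime();
        } catch (ParseException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("Fri, 29 Okt 2004 11:08:24 GMT"));
        System.out.println(parse("Fri, 29 Okt 2004 11:08:24 GMT"));
    }
}
```

Such a normalization step could run before the pattern list is tried, so the
patterns themselves stay locale-independent.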

--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/



Re: [jira] Commented: (NUTCH-65) index-more plugin can't parse large set of modification-date

Posted by Jérôme Charron <je...@gmail.com>.
> 
> There's already commons-logging in the nutch libs, so I think there's no
> problem adding commons-lang.
> Moreover, it is under the Apache License, so there's no problem.
> I will add it while committing your patch.
> 
No objections to adding commons-lang to the nutch lib.
As it is a generic lib, I plan to add it to nutch/lib (so that it is
available to all nutch code) instead of the index-more plugin lib.
Any objections?

Regards

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: [jira] Commented: (NUTCH-65) index-more plugin can't parse large set of modification-date

Posted by Jérôme Charron <je...@gmail.com>.
> 
> it works great (see the new function below). But we'll have to add
> commons-lang (http://jakarta.apache.org/commons/lang/) to the libraries.
> Are there any objections? What is the procedure for adding it?

There's already commons-logging in the nutch libs, so I think there's no
problem adding commons-lang.
Moreover, it is under the Apache License, so there's no problem.
I will add it while committing your patch.

> I'm testing my changes right now (I think it will take the rest of the
> night to complete the job :-). So I'll get some sleep. Tomorrow I'll
> check whether the conversions were successful and submit a patch to NUTCH-65.

Ok, thanks.
Regards

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/