Posted to user@nutch.apache.org by Sebastian Schick <sc...@informatik.uni-rostock.de> on 2007/09/25 16:47:55 UTC

Re: Last-modified / creation date or time

Hello,

We have the same problem. I accidentally created a new thread about it here:
http://www.nabble.com/problem-with-MoreIndexingFilter-tf4515835.html#a12880357
Are there any solutions yet?

Regards,

Sebastian


chris sleeman wrote:
> 
> Hi,
> 
> Can anyone tell me how to get the last-modified or the creation time of a
> page, crawled and indexed by nutch?
> I tried using the Metadata.LAST_MODIFIED field but it returned me null. I
> need them while displaying my search results.
> 
> Would appreciate any pointers on this.
> 
> Regards,
> Chris
> 
> 

-- 
View this message in context: http://www.nabble.com/Last-modified---creation-date-or-time-tf3704140.html#a12881175
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Last-modified / creation date or time

Posted by Sebastian Schick <sc...@informatik.uni-rostock.de>.
Thanks for the fast reply.
I don't think the "Date" field is what I am looking for.
I tried reading it in MoreIndexingFilter.java and I do get a date, but it is
the fetch date and not the value of the http-equiv="Last-Modified" meta tag
in the HTML files.
So my question is: why does the following not work?

    String lastModified = data.getMeta(Metadata.LAST_MODIFIED);
    if (lastModified != null) {                   // try to parse last-modified
      time = getTime(lastModified, url);          // use as time
                                                  // store as string
      doc.add(new Field("lastModified", Long.toString(time),
                        Field.Store.YES, Field.Index.NO));
    }

I do not understand why lastModified is null, because, as mentioned in my
other thread, the last-modified tag is parsed correctly!
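
For reference, the "parse the date string, use it as a time, store it as a
string" step in the snippet above can be sketched in isolation. This is only a
minimal sketch: it uses the standard java.time API (Java 8 or later) instead of
Nutch's getTime, and the class name HttpDateSketch and the method toEpochMillis
are hypothetical names chosen for illustration. It assumes the value is in the
RFC 1123 format used by HTTP headers, such as the "Date" value shown in this
thread, and it does not explain why getMeta returns null.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not part of Nutch.
class HttpDateSketch {

  /** Parses an RFC 1123 date string; returns epoch millis, or -1 if it cannot be parsed. */
  static long toEpochMillis(String httpDate) {
    if (httpDate == null) {
      return -1L;                                 // nothing to parse
    }
    try {
      return ZonedDateTime
          .parse(httpDate.trim(), DateTimeFormatter.RFC_1123_DATE_TIME)
          .toInstant()
          .toEpochMilli();
    } catch (DateTimeParseException e) {
      return -1L;                                 // not an RFC 1123 date
    }
  }

  public static void main(String[] args) {
    long time = toEpochMillis("Tue, 25 Sep 2007 15:53:40 GMT");
    System.out.println(Long.toString(time));      // the string that would be stored
  }
}
```

Returning -1 for a missing or malformed value lets the caller skip the field
instead of indexing a bogus timestamp.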


Regards,

Sebastian


Susam Pal wrote:
> 
> I tried this code:-
> 
>   System.out.println(metaData);
>   String[] names = metaData.names();
>   for (int i = 0; i < names.length; i++) {
>     System.out.println(names[i] + ": " + metaData.get(names[i]));
>   }
> 
> I got this:-
> 
> nutch.content.digest=96f6d3d267d955728fc98b820fc72c32 Date=Tue, 25 Sep
> 2007 15:53:40 GMT Content-Length=73 nutch.crawl.score=1.0
> nutch.segment.name=20070925212336 Content-Type=text/html;
> charset=UTF-8 Server=Apache/2.2.3 (Debian) PHP/5.2.0-8+etch7
> X-Powered-By=PHP/5.2.0-8+etch7 _ftk_=1190735620303
> nutch.content.digest: 96f6d3d267d955728fc98b820fc72c32
> Date: Tue, 25 Sep 2007 15:53:40 GMT
> Content-Length: 73
> nutch.crawl.score: 1.0
> nutch.segment.name: 20070925212336
> Content-Type: text/html; charset=UTF-8
> Server: Apache/2.2.3 (Debian) PHP/5.2.0-8+etch7
> X-Powered-By: PHP/5.2.0-8+etch7
> _ftk_: 1190735620303
> 
> So, metaData.get("Date") is one good solution.
> 
> I wonder why the date is stored against "Date" whereas DublinCore
> interface (which Metadata implements) defines DATE as:-
> 
> public static final String DATE = "date";
> 
> Regards,
> Susam Pal
> http://susam.in/
> 
> On 9/25/07, Sebastian Schick <sc...@informatik.uni-rostock.de> wrote:
>>
>> Hello,
>>
>> we have the same problem. Accidentally I created a new thread
>> http://www.nabble.com/problem-with-MoreIndexingFilter-tf4515835.html#a12880357
>> here .
>> Are there already any solutions?
>>
>> Regards,
>>
>> Sebastian
>>
>>
>> chris sleeman wrote:
>> >
>> > Hi,
>> >
>> > Can anyone tell me how to get the last-modified or the creation time of
>> a
>> > page, crawled and indexed by nutch?
>> > I tried using the Metadata.LAST_MODIFIED field but it returned me null.
>> I
>> > need them while displaying my search results.
>> >
>> > Would appreciate any pointers on this.
>> >
>> > Regards,
>> > Chris
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Last-modified---creation-date-or-time-tf3704140.html#a12881175
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/Last-modified---creation-date-or-time-tf3704140.html#a12885644
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Last-modified / creation date or time

Posted by Susam Pal <su...@gmail.com>.
I tried this code:-

  System.out.println(metaData);
  String[] names = metaData.names();
  for (int i = 0; i < names.length; i++) {
    System.out.println(names[i] + ": " + metaData.get(names[i]));
  }

I got this:-

nutch.content.digest=96f6d3d267d955728fc98b820fc72c32 Date=Tue, 25 Sep
2007 15:53:40 GMT Content-Length=73 nutch.crawl.score=1.0
nutch.segment.name=20070925212336 Content-Type=text/html;
charset=UTF-8 Server=Apache/2.2.3 (Debian) PHP/5.2.0-8+etch7
X-Powered-By=PHP/5.2.0-8+etch7 _ftk_=1190735620303
nutch.content.digest: 96f6d3d267d955728fc98b820fc72c32
Date: Tue, 25 Sep 2007 15:53:40 GMT
Content-Length: 73
nutch.crawl.score: 1.0
nutch.segment.name: 20070925212336
Content-Type: text/html; charset=UTF-8
Server: Apache/2.2.3 (Debian) PHP/5.2.0-8+etch7
X-Powered-By: PHP/5.2.0-8+etch7
_ftk_: 1190735620303

So, metaData.get("Date") is one good solution.

I wonder why the date is stored under "Date" when the DublinCore
interface (which Metadata implements) defines DATE as:-

public static final String DATE = "date";
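
Why the spelling of the key can matter is sketched below. This is an
illustration only: a plain java.util.Map stands in for the metadata object, the
class name HeaderKeySketch and the method caseInsensitiveCopy are hypothetical,
and whether Nutch's Metadata itself matches names case-sensitively is an
assumption that is not verified here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a name/value metadata store, not Nutch code.
class HeaderKeySketch {

  /** Returns a copy of the headers whose lookups ignore the case of the key. */
  static Map<String, String> caseInsensitiveCopy(Map<String, String> headers) {
    Map<String, String> copy = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    copy.putAll(headers);
    return copy;
  }

  public static void main(String[] args) {
    Map<String, String> headers = new HashMap<>();
    headers.put("Date", "Tue, 25 Sep 2007 15:53:40 GMT");

    // A case-sensitive map finds nothing under the lower-case constant value.
    System.out.println(headers.get("date"));

    // A case-insensitive view finds the value under either spelling.
    System.out.println(caseInsensitiveCopy(headers).get("date"));
  }
}
```

If the store is case-sensitive, a lookup with the "date" constant would miss a
value saved under "Date", which is why the literal "Date" is used above.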

Regards,
Susam Pal
http://susam.in/

On 9/25/07, Sebastian Schick <sc...@informatik.uni-rostock.de> wrote:
>
> Hello,
>
> we have the same problem. Accidentally I created a new thread
> http://www.nabble.com/problem-with-MoreIndexingFilter-tf4515835.html#a12880357
> here .
> Are there already any solutions?
>
> Regards,
>
> Sebastian
>
>
> chris sleeman wrote:
> >
> > Hi,
> >
> > Can anyone tell me how to get the last-modified or the creation time of a
> > page, crawled and indexed by nutch?
> > I tried using the Metadata.LAST_MODIFIED field but it returned me null. I
> > need them while displaying my search results.
> >
> > Would appreciate any pointers on this.
> >
> > Regards,
> > Chris
> >
> >
>
> --
> View this message in context: http://www.nabble.com/Last-modified---creation-date-or-time-tf3704140.html#a12881175
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>