Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2015/03/11 22:12:14 UTC

RE: Handling servers with wrong Last Modified HTTP header

Hello Jorge,

This is an interesting but very complicated issue. First of all, do not rely on HTTP headers; they are incorrect at any scale larger than very small. This is true for Last-Modified, due to dynamic CMSs, but also for many other headers. You can even expect website descriptions in headers such as Content-Type, madness!

The only reliable source of a document's date, and optionally time, is the document itself. This introduces two new problems: 1) what format and language, and 2) where exactly can you find it. Let's discuss these two issues.

The first is the most straightforward to deal with; it is a two-stage process. First you need to extract anything that resembles a date format that is used on Earth, including non-numeric dates such as month names. Then you have to pass all those date candidates through a series of carefully aligned date formats (SimpleDateFormat) and set the appropriate Locale. This stage requires that you have identified the language of the document, or of the part of the document you are processing in the case of multi-language documents.
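
A minimal sketch of the two-stage process described above, for illustration only. The class name DateCandidateParser, the candidate patterns and the list of formats are all assumptions made for this example; none of them are taken from NUTCH-1414 or from Nutch itself. Stage one collects loosely matched date candidates, stage two tries each candidate against SimpleDateFormat patterns using the Locale of the detected language.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Nutch or NUTCH-1414. */
final class DateCandidateParser {

    // Stage 1: loose patterns for text that merely resembles a date.
    private static final Pattern CANDIDATE = Pattern.compile(
        "\\b\\d{4}-\\d{1,2}-\\d{1,2}\\b"             // 2015-03-11
        + "|\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b"          // 11/03/2015
        + "|\\b\\d{1,2}\\s+\\p{L}+\\s+\\d{4}\\b"     // 11 March 2015
        + "|\\b\\p{L}+\\s+\\d{1,2},\\s+\\d{4}\\b",   // March 11, 2015
        Pattern.UNICODE_CHARACTER_CLASS);

    // Stage 2: formats tried in order (an assumed, deliberately short list).
    private static final String[] FORMATS = {
        "yyyy-MM-dd", "dd/MM/yyyy", "d MMMM yyyy", "MMMM d, yyyy"
    };

    /** Returns every candidate in the text that parses as a date for the given locale. */
    static List<Date> extract(String text, Locale locale) {
        List<Date> dates = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            // Collapse runs of whitespace so the candidate matches the format literals.
            Date parsed = parse(m.group().replaceAll("\\s+", " "), locale);
            if (parsed != null) {
                dates.add(parsed);
            }
        }
        return dates;
    }

    /** Tries each format in turn; returns null when none of them accepts the candidate. */
    static Date parse(String candidate, Locale locale) {
        for (String format : FORMATS) {
            SimpleDateFormat sdf = new SimpleDateFormat(format, locale);
            sdf.setLenient(false);                        // reject 99/99/2015 and similar
            sdf.setTimeZone(TimeZone.getTimeZone("UTC")); // dates without a zone are read as UTC
            try {
                return sdf.parse(candidate);
            } catch (ParseException e) {
                // Not this format; fall through to the next one.
            }
        }
        return null;
    }
}
```

Calling DateCandidateParser.extract(text, Locale.ENGLISH) returns the parsed dates in document order. Passing a different Locale changes which month names the "MMMM" formats accept, which is why the language has to be identified first.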

Luckily, I uploaded preliminary work as a Nutch parse plugin a few years ago that does exactly this; check out NUTCH-1414 [1]. You present the extractor with a language and a piece of text, in this case the document's extracted text. It is very basic and has many flaws, but it should work nicely if you present it with concise fragments of text.

The second part of the solution is more cumbersome to deal with. NUTCH-1414 uses the document's extracted text as the source for date extraction, and it really has no clue as to where the date is located in the document's structure. If you use Nutch's basic text extraction (extract all TEXT nodes) you will get bad results for most documents. This can be partially solved by relying on Boilerpipe's text extraction. But using Boilerpipe may in turn prevent you from extracting dates that would have been extracted with no text extraction algorithm at all!

Please check out NUTCH-1414 and see if it works for you. Hopefully, in your case, it will do what you want it to do. A few years ago I decided to move the improved date extraction tool to a separate project, get rid of Boilerpipe altogether, and build a new tool from scratch that can interface with a date extraction tool and has support for looking up the exact spot of the document's date. It works on 95% of the many hundreds of real web page tests, so if you need something that works at scale, you can contact me off list; the stuff has not been open sourced.

Have fun!
Markus

[1]: https://issues.apache.org/jira/browse/NUTCH-1414
 
-----Original message-----
> From:Jorge Luis Betancourt González <jl...@uci.cu>
> Sent: Tuesday 10th March 2015 4:23
> To: dev@nutch.apache.org
> Subject: Handling servers with wrong Last Modified HTTP header
> 
> Recently, in the search app we are working on, we've encountered a lot of websites that have a wrong or invalid date in the Last-Modified HTTP header. For instance, an article posted on a news site back in 2010 has a Last-Modified header of just a few days back. This could be for any number of reasons:
> 
> - A new comment was added to the site
> - Some cache invalidation occurring in the source code of the website that affects the article's page
> - Perhaps a new ad showing in the sidebar
> - Or just plain wrong header handling in the platform code
> 
> From what I've seen, this is handled by several CMSs, some even allowing you to "tweak" the published date. My question is basically whether anyone on the list has a suggestion on how to tackle this situation. For the particular case that we've been working on, most of the URLs have the published date in the URL in the form of yyyy/mm/dd (or some similar fashion), so this could be one way of "guessing" the publication date of the article. I realize that this is no silver bullet, but I'd love to get some feedback on this type of situation. In my experience, when people filter by date in our frontend app, they are usually trying to get news/articles by publication date instead of the Last-Modified date, and they are confused when the returned results have very old publication dates; they usually don't check whether there is, for instance, a new comment.
> 
> I'm leaving the "how to implement this" aside for now; I'm just interested in discussing how to deal with this type of situation. As stated, in our particular case we can rely on the URL patterns for a very good portion, but I was hoping to agree on some general approach that could be integrated into Nutch.
> 
> Regards,
> 
> PS: Should I post this also to the user list? 
> 
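
The URL-based guess mentioned in the quoted message (a yyyy/mm/dd segment in the path) could be sketched roughly as follows. This is only an illustration: the class name UrlDateGuesser, the regular expression and the restriction to years 19xx/20xx are assumptions of this example, not existing Nutch code.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Nutch. */
final class UrlDateGuesser {

    // Matches /yyyy/mm/dd where the day is not followed by a further digit.
    private static final Pattern PATH_DATE =
        Pattern.compile("/((?:19|20)\\d{2})/(\\d{1,2})/(\\d{1,2})(?!\\d)");

    /** Returns the first path segment triple that forms a valid calendar date, if any. */
    static Optional<LocalDate> guess(String url) {
        Matcher m = PATH_DATE.matcher(url);
        while (m.find()) {
            try {
                return Optional.of(LocalDate.of(
                    Integer.parseInt(m.group(1)),
                    Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3))));
            } catch (DateTimeException e) {
                // Looks like a date but is not one (e.g. /2015/13/45); keep scanning.
            }
        }
        return Optional.empty();
    }
}
```

For a URL such as http://example.com/news/2010/05/17/some-article/ (a made-up address) this yields 2010-05-17. URLs that only carry yyyy/mm yield nothing, so the guess is best treated as one signal among several rather than a replacement for looking inside the document.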

Re: Handling servers with wrong Last Modified HTTP header

Posted by Eyeris RodrIguez Rueda <er...@uci.cu>.
Hi all.
It is a really interesting topic, as Markus said; however, I think that a simple improvement to the parse-metatags plugin could help.
If you check the source of this URL: http://www.radiosanctispiritus.cu/es/2013/06/petrocaribe-aprueba-plan-para-crear-zona-economica-especial/ 
Nutch doesn't recognize meta tags for the Open Graph protocol, such as <meta property="og:title" content="PetroCaribe aprueba plan para crear zona económica especial" /> 
nor this other one: <meta property="article:published_time" content="2013-06-30T15:11:10+00:00" />
Parse-metatags could help to solve this problem with a simple modification.

Looking at the source code of parse-metatags, it only reads the metatags.names property from the configuration file:

String[] values = conf.getStrings("metatags.names", "*");

For that reason it only gets tags such as keywords and description. Maybe by adding a property, metatags.property, listing all the wanted meta tags, like og:title;article:published_time,
it would be possible to get the publication date and other metadata when they are present in a page.
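
To illustrate the idea, here is a rough, standalone sketch that pulls selected <meta property="..."> values out of raw HTML and parses article:published_time as an ISO 8601 timestamp. It deliberately does not use any Nutch API: the class MetaPropertyExtractor and its methods are hypothetical, and metatags.property above is only a proposed setting, not an existing one. A real plugin would more likely walk the parsed DOM than run a regular expression over the markup.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the parse-metatags plugin. */
final class MetaPropertyExtractor {

    // Simplified: expects property="..." directly followed by content="...".
    private static final Pattern META_PROPERTY = Pattern.compile(
        "<meta\\s+property=\"([^\"]+)\"\\s+content=\"([^\"]*)\"\\s*/?>",
        Pattern.CASE_INSENSITIVE);

    /** Collects the first value of each wanted property, e.g. "og:title". */
    static Map<String, String> extract(String html, String... wanted) {
        Set<String> keep = new HashSet<>(Arrays.asList(wanted));
        Map<String, String> found = new LinkedHashMap<>();
        Matcher m = META_PROPERTY.matcher(html);
        while (m.find()) {
            String property = m.group(1).toLowerCase(Locale.ROOT);
            if (keep.contains(property)) {
                found.putIfAbsent(property, m.group(2));
            }
        }
        return found;
    }

    /** Parses article:published_time, if present and well formed. */
    static Optional<OffsetDateTime> publishedTime(String html) {
        String value = extract(html, "article:published_time").get("article:published_time");
        if (value == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(OffsetDateTime.parse(value));
        } catch (DateTimeParseException e) {
            return Optional.empty(); // present but not an ISO 8601 offset date-time
        }
    }
}
```

With the two tags quoted above, publishedTime would return 2013-06-30T15:11:10 at offset +00:00, and extract(html, "og:title") would return the page title.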

The issue NUTCH-1561 is resolved, but maybe it could be reopened to include this change, or a new JIRA issue could be created with this modification, so that these meta tags are used when they are present in a web page.
What do you think?

Re: [MASSMAIL]RE: Handling servers with wrong Last Modified HTTP header

Posted by Jorge Luis Betancourt González <jl...@uci.cu>.
Hi Markus:

Replied inline,

----- Original Message -----
From: "Markus Jelsma" <ma...@openindex.io>
To: dev@nutch.apache.org, user@nutch.apache.org
Sent: Wednesday, March 11, 2015 5:12:14 PM
Subject: [MASSMAIL]RE: Handling servers with wrong Last Modified HTTP header

> Hello Jorge,
> 
> This is an interesting but very complicated issue. First of all, do not rely on HTTP headers; they are incorrect at any scale larger than very small. This is true for Last-Modified, due to dynamic CMSs, but also for many other headers. You can even expect website descriptions in headers such as Content-Type, madness!

Madness indeed! 

> The only reliable source of a document's date, and optionally time, is the document itself. This introduces two new problems: 1) what format and language, and 2) where exactly can you find it. Let's discuss these two issues.
> 
> The first is the most straightforward to deal with; it is a two-stage process. First you need to extract anything that resembles a date format that is used on Earth, including non-numeric dates such as month names. Then you have to pass all those date candidates through a series of carefully aligned date formats (SimpleDateFormat) and set the appropriate Locale. This stage requires that you have identified the language of the document, or of the part of the document you are processing in the case of multi-language documents.

One issue within this main one is sites that show several dates on the same page, for instance with comments showing the date/time each comment was posted, which can also cause problems.

> Luckily, I uploaded preliminary work as a Nutch parse plugin a few years ago that does exactly this; check out NUTCH-1414 [1]. You present the extractor with a language and a piece of text, in this case the document's extracted text. It is very basic and has many flaws, but it should work nicely if you present it with concise fragments of text.
> 
> The second part of the solution is more cumbersome to deal with. NUTCH-1414 uses the document's extracted text as the source for date extraction, and it really has no clue as to where the date is located in the document's structure. If you use Nutch's basic text extraction (extract all TEXT nodes) you will get bad results for most documents. This can be partially solved by relying on Boilerpipe's text extraction. But using Boilerpipe may in turn prevent you from extracting dates that would have been extracted with no text extraction algorithm at all!

Yes, Boilerpipe could be one solution, but I personally have encountered problems with pages that have large comment sections, in both directions: comments being included in the article's main text, and content being left out in favor of some comments.

> Please check out NUTCH-1414 and see if it works for you. Hopefully, in your case, it will do what you want it to do. A few years ago I decided to move the improved date extraction tool to a separate project, get rid of Boilerpipe altogether, and build a new tool from scratch that can interface with a date extraction tool and has support for looking up the exact spot of the document's date. It works on 95% of the many hundreds of real web page tests, so if you need something that works at scale, you can contact me off list; the stuff has not been open sourced.

I will test NUTCH-1414 to see how it works. From what I can see, NUTCH-1414 works on the extracted content; in your experience, could it be worth detecting any date in the URL? This was my initial thought for my particular case. I'll contact you as well :) this tool of yours sounds "magical".

Regards,
