Posted to dev@wink.apache.org by "Baram, Eliezer" <eb...@hp.com> on 2010/09/13 10:18:22 UTC

FW: Tolerance to malformed media types in Wink client

And here is the mail he tried to post

---------- Forwarded message ----------
From: Steve Miller <st...@gmail.com>
Date: Mon, Sep 12, 2010 at 2:15 AM
Subject: Tolerance to malformed media types in Wink client
To: wink-user@incubator.apache.org

Hi,
I created a crawler using the Apache Wink client, but I found that the Wink client does not tolerate malformed media types, even when the malformed part is only a media type parameter. Unfortunately, there are a lot of those on the internet.
When Wink receives such a media type, it throws an exception with the message: 'java.lang.IllegalArgumentException ... Verify that the format is like "type/subtype".'
I think it would be good if Wink could be more tolerant of such media types, especially since they are common. It would surely make my life easier :-)

Here are examples of the media types that cause the problem, along with their sources. This is only a sample; the list of sites is longer, but the same media type patterns keep repeating.

URL:   http://www.aol.com/   (and all aol sites around the globe)
Media Type: text/html;;charset=utf-8

URL: http://www.plugrush.com/
Media Type: text/html; charset: UTF-8

URL: http://www.torrentleech.org/
Media Type: text/html; charset=

URL: http://www.comingsoon.net/
Media Type: text/html; $str_charset; charset=ISO-8859-1

URL: http://www.globalsources.com/
Media Type: text/html; UTF-8;charset=ISO-8859-1

URL: http://dic.academic.ru/
Media Type: text/html; utf-8

URL: http://www.warnerbros.com/
Media Type: text/html; UTF-8;charset=UTF-8

Thanks,
Steve
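
To make it easier to see which part of each sample above is malformed, here is a minimal sketch in plain Java. It assumes a simplified rule (a parameter is well formed only if it looks like name=value with both sides non-empty) and does not reproduce Wink's actual parsing. The class and method names are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Wink or JAX-RS.
final class MediaTypeParamCheck {

    /** Returns the parameter segments that are not of the form name=value. */
    static List<String> malformedSegments(String mediaType) {
        List<String> bad = new ArrayList<>();
        // The -1 limit keeps empty segments, so ";;" shows up as "".
        String[] parts = mediaType.split(";", -1);
        // parts[0] is the "type/subtype" portion; everything after it is a parameter.
        for (int i = 1; i < parts.length; i++) {
            String segment = parts[i].trim();
            int eq = segment.indexOf('=');
            // Simplified rule (an assumption): non-empty name, '=', non-empty value.
            boolean wellFormed = eq > 0 && eq < segment.length() - 1;
            if (!wellFormed) {
                bad.add(segment);
            }
        }
        return bad;
    }
}
```

Under this rule, the offending segments in the samples are the empty segment in "text/html;;charset=utf-8", "charset: UTF-8", "charset=", "$str_charset", and the bare "UTF-8" / "utf-8" tokens. In every sample the "type/subtype" part itself is fine.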

Re: FW: Tolerance to malformed media types in Wink client

Posted by Bryant Luk <br...@gmail.com>.
Okay so I fixed the MediaTypeHeaderDelegate in WINK-315 to ignore bad
parameter types.  The unit test should cover the cases mentioned in
the e-mail unless someone thinks of something else.  Thanks.

On Mon, Sep 13, 2010 at 9:54 AM, Nicholas Gallardo
<ni...@yahoo.com> wrote:
> Yep, I can't think of another way to make that work without it being overly
> complicated.
>
>
> For most of the examples below, it looks like just adding logic that ignores the
> param when there's a key without a value would do the trick.
>
>
>
> ----- Original Message ----
> From: Bryant Luk <br...@gmail.com>
> To: wink-dev@incubator.apache.org
> Sent: Mon, September 13, 2010 9:43:17 AM
> Subject: Re: FW: Tolerance to malformed media types in Wink client
>
> I think making this change is fine.  I think we'd have to ignore the
> "malformed" parameters unless someone has a better idea?
>

Re: FW: Tolerance to malformed media types in Wink client

Posted by Nicholas Gallardo <ni...@yahoo.com>.
Yep, I can't think of another way to make that work without it being overly 
complicated.  


For most of the examples in the original mail, it looks like just adding logic that
ignores the param when there's a key without a value would do the trick.
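
The "ignore the param" idea could look roughly like the sketch below. This is only an illustration in plain Java under stated assumptions: it is not the code that went into MediaTypeHeaderDelegate for WINK-315 (that change is not shown in this thread), and the class and method names are made up. It keeps a parameter only when it has both a name and a value and silently skips everything else.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not Wink's MediaTypeHeaderDelegate.
final class LenientMediaTypeParams {

    /** Collects name=value parameters and ignores segments that do not fit. */
    static Map<String, String> parse(String mediaType) {
        Map<String, String> params = new LinkedHashMap<>();
        String[] parts = mediaType.split(";");
        // parts[0] is the "type/subtype" portion; it is not validated here.
        for (int i = 1; i < parts.length; i++) {
            String segment = parts[i].trim();
            int eq = segment.indexOf('=');
            if (eq <= 0) {
                continue; // empty segment, bare token such as "UTF-8", or missing name
            }
            String value = segment.substring(eq + 1).trim();
            if (value.isEmpty()) {
                continue; // key without a value, such as "charset="
            }
            // Parameter names are case-insensitive, so normalize them.
            String name = segment.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            params.put(name, value);
        }
        return params;
    }
}
```

With this approach "text/html;;charset=utf-8" still yields charset=utf-8, while "text/html; charset=" and "text/html; utf-8" yield no parameters at all. Note that "text/html; charset: UTF-8" also ends up with no charset, because the ':' form is simply dropped rather than repaired.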



----- Original Message ----
From: Bryant Luk <br...@gmail.com>
To: wink-dev@incubator.apache.org
Sent: Mon, September 13, 2010 9:43:17 AM
Subject: Re: FW: Tolerance to malformed media types in Wink client

I think making this change is fine.  I think we'd have to ignore the
"malformed" parameters unless someone has a better idea?


Re: FW: Tolerance to malformed media types in Wink client

Posted by Mike Rheinheimer <ro...@apache.org>.
I agree with ignoring the malformed parts, unless you want to try to
accommodate a ':' used in place of the '='.

mike
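
For comparison, here is a rough sketch of what the ':' accommodation might look like for a single parameter segment such as "charset: UTF-8". It is an assumption-laden illustration in plain Java, not anything taken from Wink: the names are hypothetical, and the fallback to ':' is applied only when the segment contains no '=' at all.

```java
import java.util.AbstractMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration; not part of Wink.
final class ColonTolerantParam {

    /** Splits one parameter segment into name and value, or returns empty if it cannot be salvaged. */
    static Optional<Map.Entry<String, String>> split(String segment) {
        String s = segment.trim();
        int sep = s.indexOf('=');
        if (sep < 0) {
            sep = s.indexOf(':'); // fall back to ':' only when there is no '='
        }
        // Reject segments with no separator, no name, or no value.
        if (sep <= 0 || sep == s.length() - 1) {
            return Optional.empty();
        }
        String name = s.substring(0, sep).trim().toLowerCase(Locale.ROOT);
        String value = s.substring(sep + 1).trim();
        return Optional.of(new AbstractMap.SimpleImmutableEntry<>(name, value));
    }
}
```

The trade-off is that this guesses at the sender's intent: " charset: UTF-8" becomes charset=UTF-8, whereas plain ignoring would drop it. Bare tokens such as "utf-8" and empty values such as "charset=" are still rejected.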


On Mon, Sep 13, 2010 at 9:43 AM, Bryant Luk <br...@gmail.com> wrote:
> I think making this change is fine.  I think we'd have to ignore the
> "malformed" parameters unless someone has a better idea?

Re: FW: Tolerance to malformed media types in Wink client

Posted by Bryant Luk <br...@gmail.com>.
I think making this change is fine.  I think we'd have to ignore the
"malformed" parameters unless someone has a better idea?
