Posted to users@camel.apache.org by Bryce Ewing <br...@gmail.com> on 2010/07/17 11:42:39 UTC

Error with RSS component accessing gzip content

Hi,

I have found at least one site that responds to a request for its RSS feed
with gzipped content, without being asked to.  If you request the URL using
curl (with no options) you get binary (gzipped) data back; if you use curl
--compressed, curl expects this and displays the content as plain text.

What this means for the Camel RSS component is that in this block of code
(RssUtils):
        InputStream in = new URL(feedUri).openStream();
        SyndFeedInput input = new SyndFeedInput();
        return input.build(new XmlReader(in));
The content that comes out of the input stream is gzipped content and when
it is parsed an exception is thrown:
        Invalid XML: Error on line 1: Content is not allowed in prolog.

A simple test in the debugger at this point in the code showed that wrapping
the stream in a GZIPInputStream worked, but that would not work for a
non-gzipped stream.
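
One way to cope with both cases would be to peek at the first two bytes and
only decompress when they match the gzip magic number (0x1f 0x8b).  The
following is just a sketch under that assumption: GzipSniffer, maybeGunzip and
gzip are made-up names for illustration, none of this is camel-rss code, and
the demo in main needs Java 9 or later for readAllBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration; not part of camel-rss.
class GzipSniffer {

    // Wraps the stream in a GZIPInputStream only if it starts with the
    // gzip magic number (0x1f 0x8b); otherwise returns the bytes untouched.
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        int first = in.read();
        int second = in.read();
        // Push the peeked bytes back in reverse order so they are re-read in order.
        if (second != -1) in.unread(second);
        if (first != -1) in.unread(first);
        boolean gzipped = first == 0x1f && second == 0x8b;
        return gzipped ? new GZIPInputStream(in) : in;
    }

    // Demo helper: gzip a byte array in memory.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = "<rss version=\"2.0\"/>".getBytes(StandardCharsets.UTF_8);
        // The same call handles the plain and the gzipped body.
        for (byte[] body : new byte[][] { xml, gzip(xml) }) {
            try (InputStream in = maybeGunzip(new ByteArrayInputStream(body))) {
                // readAllBytes needs Java 9 or later.
                System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        }
    }
}
```

Sniffing the bytes does not depend on the server sending correct headers, but
it only recognises gzip and would not help with other encodings.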

Firstly, I am wondering whether anyone has come across this before, whether
there is a workaround, or whether I am just doing something wrong.

Secondly, if this is indeed something to be fixed, I am willing to look into a
solution (thinking that a more robust http client might catch this through
reading the response headers, which do contain: "Content-Encoding: gzip").
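
To make the header idea concrete, here is a rough sketch using only
java.net.URLConnection, whose getContentEncoding() returns the value of the
Content-Encoding response header (or null).  FeedStreams, isGzip, decode and
openFeed are names I made up for illustration; this is not the actual RssUtils
code and it has not been tried against camel-rss.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration; not the actual RssUtils code.
class FeedStreams {

    // True when a Content-Encoding header value names gzip
    // ("x-gzip" is the older alias for the same coding).
    static boolean isGzip(String contentEncoding) {
        if (contentEncoding == null) {
            return false;
        }
        String value = contentEncoding.trim();
        return value.equalsIgnoreCase("gzip") || value.equalsIgnoreCase("x-gzip");
    }

    // Decompresses the body only when the header says it is gzipped.
    static InputStream decode(InputStream body, String contentEncoding) throws IOException {
        return isGzip(contentEncoding) ? new GZIPInputStream(body) : body;
    }

    // Opens the feed and decodes it according to the response header.
    static InputStream openFeed(String feedUri) throws IOException {
        URLConnection connection = new URL(feedUri).openConnection();
        return decode(connection.getInputStream(), connection.getContentEncoding());
    }
}
```

The stream returned by openFeed(feedUri) could then be handed to
new XmlReader(in) in place of the one from openStream().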

Thoughts?

Cheers
Bryce
-- 
View this message in context: http://camel.465427.n5.nabble.com/Error-with-RSS-component-accessing-gzip-content-tp1335918p1335918.html
Sent from the Camel - Users mailing list archive at Nabble.com.

Re: Error with RSS component accessing gzip content

Posted by Willem Jiang <wi...@gmail.com>.
Hi Bryce,

I just went through the code of camel-rss. If you want to get the feed
object from different RSS services, you can consider using
RssDataFormat; it will help you turn an input stream into a SyndFeed.
The route could look like this:
        from("http://xxxx").unmarshal().rss().to("someOtherEndpoint");

Willem
----------------------------------
  Apache Camel, Apache CXF committer
  Open Source Integration http://www.fusesource.com
  Blog http://willemjiang.blogspot.com
  Twitter http://twitter.com/willemjiang



Re: Error with RSS component accessing gzip content

Posted by Bryce Ewing <br...@gmail.com>.
Hi Willem,

Yes, that was pretty much what I was thinking.  Having had more of a look
through the camel-rss component, it seems very much geared towards polling
(understandably, since that is what it does).  I will try creating a
completely separate component to begin with (will probably copy some code
from camel-rss) so that I can get more of an understanding of how things
work too.  I have only used components up until now, so it will be
interesting to write one.

Once I have written a specific processor component we will see how best this
could be integrated into a single component.

Cheers
Bryce

On Tue, Jul 20, 2010 at 1:25 PM, Willem.Jiang [via Camel] <
ml-node+1543487-1672642491-53780@n5.nabble.com<ml...@n5.nabble.com>
> wrote:

> Hi Bryce,
>
> Bryce Ewing wrote:
> > Hi,
> >
> > I have been having a look at the http4 component source and thinking about
> > how much of this would end up being duplicated into the RSS component to
> > properly handle all cases, etc. HttpProducer.extractResponseBody and
> > utilising GZIPHelper.uncompressGzip seems to cover my particular case.
> Yes, and the camel-http component also has this GZIPHelper.
> >
> > This got me thinking about other ways of doing this.  At present the RSS
> > component can read from at least file and http based RSS documents.  This
> > fix would firstly be required by just the http based feeds.  I can see many
> > other ways that RSS could be consumed, there could be RSS documents in a
> > database, in ftp, via xmpp, etc.  The vast majority would most likely be
> > http but it doesn't need to be limited to this.
> As Camel already has the camel-ftp and camel-xmpp components, we could
> leverage them for camel-rss.
> >
> > Firstly should the RSS component be reusing for example the http4 code?  And
> > secondly should the RSS component actually just be the second step in the
> > process, e.g. use the http4 component to do the polling, then the RSS
> > component processes the output from this?
> >
> > The second option would allow for much more flexibility in terms of where
> > the feed is being read from, and much more code reuse.
> >
> > What are the thoughts on this?
>
> Maybe we could add an option to the RSS component that lets it take the
> feed input stream from the in message of the Exchange. Then we could use
> camel-http to fetch the feed, and it would be easy to change the
> transport to ftp or xmpp.
>
> > Cheers
> > Bryce
>
> Willem
> ----------------------------------
> Apache Camel, Apache CXF committer
> Open Source Integration http://www.fusesource.com
> Blog http://willemjiang.blogspot.com
> Twitter http://twitter.com/willemjiang

-- 
View this message in context: http://camel.465427.n5.nabble.com/Error-with-RSS-component-accessing-gzip-content-tp1335918p1564590.html

Re: Error with RSS component accessing gzip content

Posted by Willem Jiang <wi...@gmail.com>.
Hi Bryce,

Bryce Ewing wrote:
> Hi,
> 
> I have been having a look at the http4 component source and thinking about
> how much of this would end up being duplicated into the RSS component to
> properly handle all cases, etc. HttpProducer.extractResponseBody and
> utilising GZIPHelper.uncompressGzip seems to cover my particular case.
Yes, and the camel-http component also has this GZIPHelper.
> 
> This got me thinking about other ways of doing this.  At present the RSS
> component can read from at least file and http based RSS documents.  This
> fix would firstly be required by just the http based feeds.  I can see many
> other ways that RSS could be consumed, there could be RSS documents in a
> database, in ftp, via xmpp, etc.  The vast majority would most likely be
> http but it doesn't need to be limited to this.
As Camel already has the camel-ftp and camel-xmpp components, we could
leverage them for camel-rss.
> 
> Firstly should the RSS component be reusing for example the http4 code?  And
> secondly should the RSS component actually just be the second step in the
> process, e.g. use the http4 component to do the polling, then the RSS
> component processes the output from this?
> 
> The second option would allow for much more flexibility in terms of where
> the feed is being read from, and much more code reuse.
> 
> What are the thoughts on this?

Maybe we could add an option to the RSS component that lets it take the
feed input stream from the in message of the Exchange. Then we could use
camel-http to fetch the feed, and it would be easy to change the
transport to ftp or xmpp.

> 
> Cheers
> Bryce

Willem
----------------------------------
Apache Camel, Apache CXF committer
Open Source Integration http://www.fusesource.com
Blog http://willemjiang.blogspot.com
Twitter http://twitter.com/willemjiang

Re: Error with RSS component accessing gzip content

Posted by Bryce Ewing <br...@gmail.com>.
Hi,

I have been having a look at the http4 component source and thinking about
how much of this would end up being duplicated into the RSS component to
properly handle all cases, etc. HttpProducer.extractResponseBody and
utilising GZIPHelper.uncompressGzip seems to cover my particular case.

This got me thinking about other ways of doing this.  At present the RSS
component can read from at least file and http based RSS documents.  This
fix would firstly be required by just the http based feeds.  I can see many
other ways that RSS could be consumed, there could be RSS documents in a
database, in ftp, via xmpp, etc.  The vast majority would most likely be
http but it doesn't need to be limited to this.

Firstly should the RSS component be reusing for example the http4 code?  And
secondly should the RSS component actually just be the second step in the
process, e.g. use the http4 component to do the polling, then the RSS
component processes the output from this?

The second option would allow for much more flexibility in terms of where
the feed is being read from, and much more code reuse.
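
As a rough illustration of that second option, the parsing step only needs to
accept an InputStream and can leave the choice of transport to whoever calls
it.  In this sketch the JDK's DOM parser stands in for ROME's SyndFeedInput so
that it runs without extra libraries; FeedTitles and itemTitles are made-up
names, not existing camel-rss classes.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical second step: knows about RSS, knows nothing about transport.
class FeedTitles {

    // Collects the <title> of every <item>; the stream may come from
    // http, a file, ftp or anything else that yields bytes.
    static List<String> itemTitles(InputStream feed) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(feed);
        NodeList items = doc.getElementsByTagName("item");
        List<String> titles = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList title = item.getElementsByTagName("title");
            if (title.getLength() > 0) {
                titles.add(title.item(0).getTextContent());
            }
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        // An in-memory feed plays the part of the first step's output.
        String xml = "<rss version=\"2.0\"><channel><title>Feed</title>"
                + "<item><title>First</title></item>"
                + "<item><title>Second</title></item>"
                + "</channel></rss>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(itemTitles(in));
    }
}
```

With this split, gzip handling would live entirely in the first step and the
RSS step would never need to know about it.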

What are the thoughts on this?

Cheers
Bryce
-- 
View this message in context: http://camel.465427.n5.nabble.com/Error-with-RSS-component-accessing-gzip-content-tp1335918p1511721.html

Re: Error with RSS component accessing gzip content

Posted by Willem Jiang <wi...@gmail.com>.
I think we can leverage a more sophisticated HTTP client to handle
"Content-Encoding: gzip" instead of using URL.openStream directly.

Please feel free to file a JIRA [1] for it; a patch with a test case is
welcome :)

[1] https://issues.apache.org/activemq/browse/CAMEL

Willem
