Posted to httpclient-users@hc.apache.org by Mugoma Joseph Okomba <mu...@yengas.com> on 2012/05/11 19:41:59 UTC

HC 4: Excluding images and other types of content

Hello,

I am using HC 4 to download a web page. Since I am only interested in the
text of the web page, I would like to exclude images and other content such
as JavaScript, CSS, etc.

Is there a way to do this in HttpClient?

Thanks.

Mugoma Joseph.


---------------------------------------------------------------------
To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
For additional commands, e-mail: httpclient-users-help@hc.apache.org


Re: HC 4: Excluding images and other types of content

Posted by William Speirs <ws...@apache.org>.
By default, if you point HC 4 at a web page, it downloads only the HTML
of that page. HttpClient does not act like a browser: to get the images,
JavaScript, etc., you would have to parse that HTML yourself, extract all
the links, and request each one separately.

Give it a try...
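
To make the point above concrete, here is a minimal sketch (not from the
original thread) of what "parse that HTML and extract all the links" could
look like. It assumes the page HTML is already in a String, for example
read from the response entity of an HttpGet with EntityUtils.toString().
The class name ResourceLinks and its extract() method are hypothetical
helpers, and the regular expression is a simplification; a real HTML
parser would be more robust. Nothing in this list is downloaded unless
your own code requests it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient: lists the sub-resources an
// HTML page refers to. HttpClient only fetches them if your code asks it to.
class ResourceLinks {

    // Matches src="..." on <img>/<script> tags and href="..." on <link> tags.
    // A regex is enough for a sketch; it will miss unquoted attribute values.
    private static final Pattern REF = Pattern.compile(
            "<(?:img|script)\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']*)[\"']"
          + "|<link\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the referenced URLs in the order they appear in the HTML.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<String>();
        Matcher m = REF.matcher(html);
        while (m.find()) {
            // Exactly one of the two capture groups is set per match.
            urls.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Stand-in for the HTML body returned by a single GET request.
        String html = "<html><head>"
                + "<link rel=\"stylesheet\" href=\"style.css\">"
                + "<script src=\"app.js\"></script></head>"
                + "<body><p>Hello</p><img src=\"logo.png\"></body></html>";
        for (String url : extract(html)) {
            System.out.println(url);
        }
    }
}
```

If all you want is the text of the page, you can simply skip this step:
the single GET for the page URL never pulls in the images, scripts, or
stylesheets on its own.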

Bill-

