Posted to dev@nutch.apache.org by Mohammad Al-Mohsin <me...@mem9.net> on 2015/02/21 15:03:28 UTC

Nutch-Selenium Plugin Truncates Binary Data

I am using the nutch-selenium <https://github.com/momer/nutch-selenium>
plugin, and I also have Tesseract <https://wiki.apache.org/tika/TikaOCR>
installed for extracting text from images.

While crawling with Nutch and Selenium, I noticed that binary content (e.g.
images, PDFs) is always truncated, so parsing is skipped or fails. Here is a
sample of the log:

*Content of size 800750 was truncated to 368. Content is truncated, parse
may fail!*

When I turn Selenium off, parsing works fine and the content is not
truncated.

I found that nutch-selenium returns the HTML body of whatever Firefox
displays. So even when you are fetching an image, Selenium just gives you
the HTML image tag instead of the image itself,
e.g. <img src='xyz.png' height="400" width="600">

To get around this, I modified the Selenium plugin to handle the fetch only
if the Content-Type header starts with 'text', i.e. to catch 'text/html'.
Otherwise, if the content is not textual, it just returns the content as
protocol-httpclient does.
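
A minimal sketch of the prefix check described above, written as a standalone Java class. The class and method names are hypothetical and are not taken from the plugin source. Treating a missing header as non-textual is also an assumption.

```java
import java.util.Locale;

class TextPrefixCheck {
    // Hypothetical helper (not from nutch-selenium): returns true when the
    // Content-Type header value starts with "text", which is the condition
    // described above for letting Selenium handle the fetch.
    static boolean isTextual(String contentType) {
        if (contentType == null) {
            // Assumption: no header means we keep the raw bytes.
            return false;
        }
        return contentType.trim().toLowerCase(Locale.ROOT).startsWith("text");
    }

    public static void main(String[] args) {
        System.out.println(isTextual("text/html; charset=UTF-8")); // true
        System.out.println(isTextual("image/png"));                // false
    }
}
```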

Now binary data is parsed properly, and Selenium still handles rendering of
pages with JavaScript.

Is this the proper way to tackle this? What do you think?


Best regards,
Mohammad Al-Mohsin

Re: Nutch-Selenium Plugin Truncates Binary Data

Posted by Mohammad Al-Mohsin <me...@mem9.net>.

Hi Jiaxin,

In *HttpResponse.java*, you can check the 'Content-Type' header and then
decide whether to:

- Set the response content to the binary HTTP response (check out
protocol-httpclient's source code for hints),
or
- Continue executing *readPlainContent(url)*, which in turn sets the
'content' from the HTML body returned by the Selenium Firefox driver.
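
The two branches above could be sketched roughly as follows. This is not the actual HttpResponse.java code. The class name, the method name, and the two fetcher functions are hypothetical stand-ins for the raw HTTP body and the Selenium-rendered body, so the example runs without Nutch or Selenium on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.function.Function;

class ContentDispatchSketch {
    // Hypothetical stand-ins: rawFetcher plays the role of the plain HTTP
    // response body, renderedFetcher the HTML body produced by the browser.
    private final Function<String, byte[]> rawFetcher;
    private final Function<String, String> renderedFetcher;

    ContentDispatchSketch(Function<String, byte[]> rawFetcher,
                          Function<String, String> renderedFetcher) {
        this.rawFetcher = rawFetcher;
        this.renderedFetcher = renderedFetcher;
    }

    // Chooses a branch based on the Content-Type header value.
    byte[] fetchContent(String url, String contentType) {
        boolean textual = contentType != null
                && contentType.trim().toLowerCase(Locale.ROOT).startsWith("text");
        if (textual) {
            // Textual response: use the rendered HTML body.
            return renderedFetcher.apply(url).getBytes(StandardCharsets.UTF_8);
        }
        // Anything else: return the bytes untouched.
        return rawFetcher.apply(url);
    }

    public static void main(String[] args) {
        ContentDispatchSketch sketch = new ContentDispatchSketch(
                url -> new byte[] {1, 2, 3, 4},
                url -> "<html><body>rendered</body></html>");
        byte[] image = sketch.fetchContent("http://example.com/a.png", "image/png");
        byte[] page = sketch.fetchContent("http://example.com/", "text/html");
        System.out.println(image.length);
        System.out.println(new String(page, StandardCharsets.UTF_8));
    }
}
```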

By the way, since nutch-selenium looks for the HTML body, I think we should
check for the 'text/html' and 'application/xhtml+xml' content types, not
just anything that starts with 'text/'.
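
One way to express that stricter check, as a sketch only: compare the media type against the two values named above, ignoring case and any parameters such as "; charset=UTF-8". The class and method names are hypothetical, and Set.of assumes Java 9 or later.

```java
import java.util.Locale;
import java.util.Set;

class HtmlMediaTypeCheck {
    // The two content types named above; anything else keeps its raw bytes.
    private static final Set<String> RENDERABLE =
            Set.of("text/html", "application/xhtml+xml");

    // Hypothetical helper: true only for HTML or XHTML responses.
    static boolean isRenderable(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters after the first ';' before comparing.
        int semi = contentType.indexOf(';');
        String mediaType = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return RENDERABLE.contains(mediaType);
    }

    public static void main(String[] args) {
        System.out.println(isRenderable("text/html; charset=UTF-8")); // true
        System.out.println(isRenderable("application/xhtml+xml"));    // true
        System.out.println(isRenderable("text/plain"));               // false
        System.out.println(isRenderable("image/png"));                // false
    }
}
```

With this version, other 'text/...' types such as text/plain or text/css are no longer routed through the browser.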


Best regards,
Mohammad Al-Mohsin

On Sat, Feb 21, 2015 at 12:05 PM, Jiaxin Ye <ji...@usc.edu> wrote:

> Hi Mohammad,
>
> Hey, I think that's a very good idea! Any hints about how to change the
> selenium plugin? I am thinking about the same thing but struggling on how
> to do it.
>
> Best,
> Jiaxin

Re: Nutch-Selenium Plugin Truncates Binary Data

Posted by Jiaxin Ye <ji...@usc.edu>.

Hi Mohammad,

Hey, I think that's a very good idea! Any hints about how to change the
Selenium plugin? I am thinking about the same thing but struggling with how
to do it.

Best,
Jiaxin

Re: Nutch-Selenium Plugin Truncates Binary Data

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

Thank you Mohammad!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Mohammad Al-Mohsin <al...@usc.edu>
Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
Date: Monday, February 23, 2015 at 3:13 AM
To: "dev@nutch.apache.org" <de...@nutch.apache.org>
Cc: Mohammad Al-Mohsin <al...@usc.edu>
Subject: Re: Nutch-Selenium Plugin Truncates Binary Data

>Sure, I've just uploaded the updated patch.

Re: Nutch-Selenium Plugin Truncates Binary Data

Posted by Mohammad Al-Mohsin <al...@usc.edu>.

Sure, I've just uploaded the updated patch.

On Sun, Feb 22, 2015 at 4:50 PM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> I think this is fantastic Mohammad!
>
> Can you update the patch on NUTCH-1933 with this improvement,
> so we can get it into the sources?
>
> Cheers,
> Chris

Re: Nutch-Selenium Plugin Truncates Binary Data

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

I think this is fantastic Mohammad!

Can you update the patch on NUTCH-1933 with this improvement,
so we can get it into the sources?

Cheers,
Chris
