Posted to user@nutch.apache.org by remi tassing <ta...@gmail.com> on 2012/02/07 08:16:20 UTC

how are CSV/TXT files handled

Hey guys,

I checked the mailing-list archive but couldn't find an answer to this. I
think CSV and TXT files don't need any kind of parsing, but how are they
handled by default?

Remi

Re: how are CSV/TXT files handled

Posted by remi tassing <ta...@gmail.com>.

Hi,

Tika is parsing properly now. I think the problem was some kind of proxy
issue, combined with the http.content.limit setting.
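
For readers who see the same "Can't fetch URL successfully" message from behind a proxy, here is a minimal sketch of the proxy overrides in conf/nutch-site.xml. The host and port are placeholders, not values from this thread, and the property names http.proxy.host and http.proxy.port are assumed to match your release's nutch-default.xml, so check them there.

```xml
<!-- conf/nutch-site.xml (sketch): proxy.example.com and 8080 are placeholders -->
<configuration>
  <property>
    <name>http.proxy.host</name>
    <!-- Hostname of the proxy the fetcher should go through -->
    <value>proxy.example.com</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <!-- Port the proxy listens on -->
    <value>8080</value>
  </property>
</configuration>
```

After changing these, rerunning bin/nutch parsechecker on one of the URLs above is a quick way to confirm that the fetch goes through.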

Thanks!

Remi

On Fri, Feb 10, 2012 at 11:16 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Remi,
>
> Please ensure that your http.content.limit is sufficient. What are your URL
> filters? Is there any other configuration that could be knocking you off?
>
> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch parsechecker http://avis.free.fr/livret_278_recettes.pdf
> fetching: http://avis.free.fr/livret_278_recettes.pdf
> parsing: http://avis.free.fr/livret_278_recettes.pdf
> contentType: application/pdf
> signature: aa6e668dca553598a943d8abeb0e9f83
> ---------
> Url
> ---------------
> http://avis.free.fr/livret_278_recettes.pdf
> ---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title: Microsoft Word - RECETTES V.doc
> Outlinks: 3
>  outlink: toUrl: http://avis.free.fr anchor:
>  outlink: toUrl: http://avis.free.fr anchor:
>  outlink: toUrl: http://avea.net/cvg/ anchor:
> Content Metadata: ETag="2a11be-535d2-43e29257" Date=Fri, 10 Feb 2012
> 21:11:23 GMT Content-Length=341458 Last-Modified=Thu, 02 Feb 2006 23:14:31
> GMT Content-Type=application/pdf Accept-Ranges=bytes Connection=close
> Server=Apache/ProXad [Aug  9 2008 02:45:09]
> Parse Metadata: xmpTPg:NPages=32 Creation-Date=2006-01-02T00:36:06Z
> created=Mon Jan 02 00:36:06 GMT 2006 Author=CARREFOUR producer=Acrobat
> Distiller 6.0 (Windows) Last-Modified=2006-01-02T00:36:06Z
> Content-Type=application/pdf creator=PScript5.dll Version 5.2
>
> On Wed, Feb 8, 2012 at 2:04 PM, remi tassing <ta...@gmail.com> wrote:
>
> >
> > $ bin/nutch parsechecker http://avis.free.fr/livret_278_recettes.pdf
> > fetching: http://avis.free.fr/livret_278_recettes.pdf
> > Can't fetch URL successfully
> >
>
> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch parsechecker http://spreadsheetpage.com/downloads/xl/keno.xls
> fetching: http://spreadsheetpage.com/downloads/xl/keno.xls
> parsing: http://spreadsheetpage.com/downloads/xl/keno.xls
> contentType: application/vnd.ms-excel
> signature: d3f1d947dfe727e33669dad44957be19
> ---------
> Url
> ---------------
> http://spreadsheetpage.com/downloads/xl/keno.xls
> ---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title:
> Outlinks: 0
> Content Metadata: ETag="a22003-17c00-4531a9cb1dd80" Date=Fri, 10 Feb 2012
> 21:14:40 GMT Content-Length=97280 Last-Modified=Mon, 28 Jul 2008 19:34:30
> GMT Content-Type=application/vnd.ms-excel Connection=close
> Accept-Ranges=bytes Server=Apache/2.2.3 (Red Hat)
> Parse Metadata: Creation-Date=1998-06-23T16:20:19Z Last-Author=John
> Walkenbach Application-Name=Microsoft Excel Author=John Walkenbach
> Company=JWalk And Associates Content-Type=application/vnd.ms-excel
>
>
>
> >
> > $ bin/nutch parsechecker http://spreadsheetpage.com/downloads/xl/keno.xls
> > fetching: http://spreadsheetpage.com/downloads/xl/keno.xls
> > Can't fetch URL successfully
> >
>

Re: how are CSV/TXT files handled

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Remi,

Please ensure that your http.content.limit is sufficient. What are your URL
filters? Is there any other configuration that could be knocking you off?

lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch parsechecker http://avis.free.fr/livret_278_recettes.pdf
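
As a sketch of the first point: http.content.limit caps how many bytes of a fetched document are kept, and anything past the limit is truncated, which can leave a binary file such as a PDF or XLS unparseable. The override below belongs in conf/nutch-site.xml. The value -1 (no truncation) is an illustrative assumption, not a setting taken from this thread. The PDF in the output that follows reports Content-Length=341458, so any nonnegative limit would have to be at least that large for this file.

```xml
<!-- conf/nutch-site.xml (sketch): the value is an assumption for illustration -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Bytes to keep per fetched document; a negative value disables truncation -->
    <value>-1</value>
  </property>
</configuration>
```

Removing the limit altogether trades memory and storage for completeness, so a large finite limit may be the better choice on a broad crawl.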
fetching: http://avis.free.fr/livret_278_recettes.pdf
parsing: http://avis.free.fr/livret_278_recettes.pdf
contentType: application/pdf
signature: aa6e668dca553598a943d8abeb0e9f83
---------
Url
---------------
http://avis.free.fr/livret_278_recettes.pdf
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: Microsoft Word - RECETTES V.doc
Outlinks: 3
  outlink: toUrl: http://avis.free.fr anchor:
  outlink: toUrl: http://avis.free.fr anchor:
  outlink: toUrl: http://avea.net/cvg/ anchor:
Content Metadata: ETag="2a11be-535d2-43e29257" Date=Fri, 10 Feb 2012
21:11:23 GMT Content-Length=341458 Last-Modified=Thu, 02 Feb 2006 23:14:31
GMT Content-Type=application/pdf Accept-Ranges=bytes Connection=close
Server=Apache/ProXad [Aug  9 2008 02:45:09]
Parse Metadata: xmpTPg:NPages=32 Creation-Date=2006-01-02T00:36:06Z
created=Mon Jan 02 00:36:06 GMT 2006 Author=CARREFOUR producer=Acrobat
Distiller 6.0 (Windows) Last-Modified=2006-01-02T00:36:06Z
Content-Type=application/pdf creator=PScript5.dll Version 5.2

On Wed, Feb 8, 2012 at 2:04 PM, remi tassing <ta...@gmail.com> wrote:

>
> $ bin/nutch parsechecker http://avis.free.fr/livret_278_recettes.pdf
> fetching: http://avis.free.fr/livret_278_recettes.pdf
> Can't fetch URL successfully
>

lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch parsechecker http://spreadsheetpage.com/downloads/xl/keno.xls
fetching: http://spreadsheetpage.com/downloads/xl/keno.xls
parsing: http://spreadsheetpage.com/downloads/xl/keno.xls
contentType: application/vnd.ms-excel
signature: d3f1d947dfe727e33669dad44957be19
---------
Url
---------------
http://spreadsheetpage.com/downloads/xl/keno.xls
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title:
Outlinks: 0
Content Metadata: ETag="a22003-17c00-4531a9cb1dd80" Date=Fri, 10 Feb 2012
21:14:40 GMT Content-Length=97280 Last-Modified=Mon, 28 Jul 2008 19:34:30
GMT Content-Type=application/vnd.ms-excel Connection=close
Accept-Ranges=bytes Server=Apache/2.2.3 (Red Hat)
Parse Metadata: Creation-Date=1998-06-23T16:20:19Z Last-Author=John
Walkenbach Application-Name=Microsoft Excel Author=John Walkenbach
Company=JWalk And Associates Content-Type=application/vnd.ms-excel



>
> $ bin/nutch parsechecker http://spreadsheetpage.com/downloads/xl/keno.xls
> fetching: http://spreadsheetpage.com/downloads/xl/keno.xls
> Can't fetch URL successfully
>

Re: how are CSV/TXT files handled

Posted by remi tassing <ta...@gmail.com>.

You're right about parsechecker and Nutch-1.2.

Well, I'm trying Nutch-1.4 right now but I'm still having the same problem.
Here is my parsechecker output:

$ bin/nutch parsechecker http://avis.free.fr/livret_278_recettes.pdf
fetching: http://avis.free.fr/livret_278_recettes.pdf
Can't fetch URL successfully

$ bin/nutch parsechecker http://spreadsheetpage.com/downloads/xl/keno.xls
fetching: http://spreadsheetpage.com/downloads/xl/keno.xls
Can't fetch URL successfully

Remi

On Wed, Feb 8, 2012 at 12:50 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Remi,
>
> To give some history: I think you would have needed to add ms-excel to
> plugin.includes when using 1.2. From what you posted above, this doesn't
> seem to be the case.
>
> Also, as you know, parsechecker is not available in 1.2.
>
> On Wed, Feb 8, 2012 at 9:22 AM, remi tassing <ta...@gmail.com> wrote:
>
> > OK, I just did (it's great, but I've been reluctant because recompiling
> > always gives me errors).
> >
> This is nothing serious; however, there is a ticket logged in Jira for
> fixing the error you see when compiling. Hopefully we can get round to
> fixing it at some stage.
>
> > ---------
> > Version: 5
> > Status: failed(2,0): Can't retrieve Tika parser for mime-type
> > application/ms-excel
> >
> Can anyone advise whether it is necessary to specifically add the contentType
> to the plugin.xml? I've not been grabbing .xls files for quite a while now
> and really can't remember.
>
> hth
>

Re: how are CSV/TXT files handled

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Remi,

To give some history: I think you would have needed to add ms-excel to
plugin.includes when using 1.2. From what you posted above, this doesn't
seem to be the case.

Also, as you know, parsechecker is not available in 1.2.

On Wed, Feb 8, 2012 at 9:22 AM, remi tassing <ta...@gmail.com> wrote:

> OK, I just did (it's great, but I've been reluctant because recompiling
> always gives me errors).
>
This is nothing serious; however, there is a ticket logged in Jira for
fixing the error you see when compiling. Hopefully we can get round to
fixing it at some stage.

> ---------
> Version: 5
> Status: failed(2,0): Can't retrieve Tika parser for mime-type
> application/ms-excel
>
Can anyone advise whether it is necessary to specifically add the contentType
to the plugin.xml? I've not been grabbing .xls files for quite a while now
and really can't remember.
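
For reference, Nutch 1.x normally routes mime types to parser plugins through conf/parse-plugins.xml. Here is a minimal sketch of such a mapping, assuming the stock layout of that file. The application/ms-excel entry is hypothetical, and nothing in this thread confirms that adding it resolves the error quoted above.

```xml
<!-- conf/parse-plugins.xml (sketch): the application/ms-excel entry is hypothetical -->
<parse-plugins>
  <!-- Send this content type to the Tika-based parser plugin -->
  <mimeType name="application/ms-excel">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- Each plugin id used above needs an alias to its extension id -->
  <aliases>
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
  </aliases>
</parse-plugins>
```

A mapping like this only decides which plugin is asked to parse the content. Whether Tika itself recognises the non-standard application/ms-excel type is a separate question.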

hth

Re: how are CSV/TXT files handled

Posted by remi tassing <ta...@gmail.com>.

OK, I just did (it's great, but I had been reluctant because recompiling
always gives me errors).

However, I'm still getting a similar error:
$ bin/nutch parsechecker http://URL
fetching: http://URL
parsing: http://URL
contentType: application/ms-excel
---------
Url
---------------
http://URL
---------
ParseData
---------
Version: 5
Status: failed(2,0): Can't retrieve Tika parser for mime-type
application/ms-excel
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:

My nutch-default.xml and nutch-site.xml both have:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

Remi

On Tue, Feb 7, 2012 at 11:17 AM, Markus Jelsma <ma...@apache.org> wrote:

> Upgrade to 1.4.
>
> > With the "nutch parsechecker" command I get the following error message:
> >
> > "Error: Could not find or load main class parsechecker". This doesn't sound
> > good!
> >
> > On Tue, Feb 7, 2012 at 9:58 AM, remi tassing <ta...@gmail.com> wrote:
> > > What made me start wondering is that I got this error message:
> > >
> > > "failed(2,0): Can't retrieve Tika parser for mime-type
> > > application/ms-excel"
> > >
> > > I'm using Nutch-1.2 and my nutch-site.xml has:
> > >
> > > "<property>
> > >   <name>plugin.includes</name>
> > >   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|q..."
> > >
> > > Remi
> > >
> > > On Tue, Feb 7, 2012 at 9:16 AM, remi tassing <ta...@gmail.com> wrote:
> > >> Hey guys,
> > >>
> > >> I checked the mailing-list archive but couldn't find an answer to this. I
> > >> think CSV and TXT files don't need any kind of parsing, but how are they
> > >> handled by default?
> > >>
> > >> Remi
>

Re: how are CSV/TXT files handled

Posted by Markus Jelsma <ma...@apache.org>.

Upgrade to 1.4.

> With the "nutch parsechecker" command I get the following error message:
>
> "Error: Could not find or load main class parsechecker". This doesn't sound
> good!
>
> On Tue, Feb 7, 2012 at 9:58 AM, remi tassing <ta...@gmail.com> wrote:
> > What made me start wondering is that I got this error message:
> >
> > "failed(2,0): Can't retrieve Tika parser for mime-type
> > application/ms-excel"
> >
> > I'm using Nutch-1.2 and my nutch-site.xml has:
> >
> > "<property>
> >   <name>plugin.includes</name>
> >   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|q..."
> >
> > Remi
> >
> > On Tue, Feb 7, 2012 at 9:16 AM, remi tassing <ta...@gmail.com> wrote:
> >> Hey guys,
> >>
> >> I checked the mailing-list archive but couldn't find an answer to this. I
> >> think CSV and TXT files don't need any kind of parsing, but how are they
> >> handled by default?
> >>
> >> Remi

Re: how are CSV/TXT files handled

Posted by remi tassing <ta...@gmail.com>.

With the "nutch parsechecker" command I get the following error message:

"Error: Could not find or load main class parsechecker". This doesn't sound
good!

On Tue, Feb 7, 2012 at 9:58 AM, remi tassing <ta...@gmail.com> wrote:

> What made me start wondering is that I got this error message:
>
> "failed(2,0): Can't retrieve Tika parser for mime-type
> application/ms-excel"
>
> I'm using Nutch-1.2 and my nutch-site.xml has:
>
> "<property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|q..."
>
> Remi
>
> On Tue, Feb 7, 2012 at 9:16 AM, remi tassing <ta...@gmail.com> wrote:
>
>> Hey guys,
>>
>> I checked the mailing-list archive but couldn't find an answer to this. I
>> think CSV and TXT files don't need any kind of parsing, but how are they
>> handled by default?
>>
>> Remi

Re: how are CSV/TXT files handled

Posted by remi tassing <ta...@gmail.com>.

What made me start wondering is that I got this error message:

"failed(2,0): Can't retrieve Tika parser for mime-type application/ms-excel"

I'm using Nutch-1.2 and my nutch-site.xml has:

"<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|q..."

Remi

On Tue, Feb 7, 2012 at 9:16 AM, remi tassing <ta...@gmail.com> wrote:

> Hey guys,
>
> I checked the mailing-list archive but couldn't find an answer to this. I
> think CSV and TXT files don't need any kind of parsing, but how are they
> handled by default?
>
> Remi