Posted to user@nutch.apache.org by Ayyanar Inbamohan <te...@yahoo.com> on 2005/09/05 07:11:44 UTC

Content-type mismatch for Excel

Hi all,

In my crawl with Nutch 0.6, I am trying to crawl ppt, xls
and zip files.

I got the plugins from JIRA.


Sample lines taken while crawling, where excel is
taken as application/pdf

050905 103243 fetching http://localhost:8080/search_sample/testexcel.xls
050905 103243 fetching http://localhost:8080/search_sample/javaCertStudyNotes.pdf
HR Response Code: 200
HR content.length: 38117
HR contentType: application/pdf
HR url: http://localhost:8080/search_sample/testexcel.xls
HR Response Code: 200
HR content.length: 13824
HR contentType: null


thanks,
Ayyanar...


Re: Content-type mismatch for Excel

Posted by Jérôme Charron <je...@gmail.com>.
> I picked some xls files at random from the internet, crawled them and saw
> some errors. I haven't been able to investigate the errors further, so I can't
> give you a more specific description of the problem :-( If you're
> interested, I can mail you the URLs of my test documents off-list.

Yes, I'm interested in getting these files and integrating them into the
parse-msexcel unit tests.

Thanks

Jérôme


-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: Content-type mismatch for Excel

Posted by Michael Nebel <mi...@nebel.de>.
Hi Jérôme,

Jérôme Charron wrote:
>>The changes are not difficult, but I still
>>observe some other problems with this plugin.
> Ok, what kind of problems?

I picked some xls files at random from the internet, crawled them and saw
some errors. I haven't been able to investigate the errors further, so I can't
give you a more specific description of the problem :-( If you're
interested, I can mail you the URLs of my test documents off-list.

Regards

	Michael


050829 192634 fetching http://www.xxxxxx.xx/xxx/xls/RAL_RGB_Farbkarte.xls
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
        at org.apache.poi.hssf.record.RecordFactory.createRecord(RecordFactory.java:224)
        at org.apache.poi.hssf.record.RecordFactory.createRecords(RecordFactory.java:160)
        at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:163)
        at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:210)
        at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:191)
        at org.apache.nutch.parse.msexcel.ExcelTextExtractor.extractText(ExcelTextExtractor.java:34)
        at org.apache.nutch.parse.msexcel.MSExcelParser.getParse(MSExcelParser.java:73)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:254)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
Caused by: java.lang.ArrayIndexOutOfBoundsException
        at java.lang.System.arraycopy(Native Method)
        at org.apache.poi.hssf.record.UnknownRecord.<init>(UnknownRecord.java:62)
        at org.apache.poi.hssf.record.SubRecord.createSubRecord(SubRecord.java:57)
        at org.apache.poi.hssf.record.ObjRecord.fillFields(ObjRecord.java:99)
        at org.apache.poi.hssf.record.Record.fillFields(Record.java:90)
        at org.apache.poi.hssf.record.Record.<init>(Record.java:55)
        at org.apache.poi.hssf.record.ObjRecord.<init>(ObjRecord.java:61)
        ... 13 more


050829 192650 fetching http://www.xxxxxxx.xx/xxxx/xls/TAG_Liste.xls
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
        at org.apache.poi.hssf.record.RecordFactory.createRecord(RecordFactory.java:224)
        at org.apache.poi.hssf.record.RecordFactory.createRecords(RecordFactory.java:160)
        at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:163)
        at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:210)
        at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:191)
        at org.apache.nutch.parse.msexcel.ExcelTextExtractor.extractText(ExcelTextExtractor.java:34)
        at org.apache.nutch.parse.msexcel.MSExcelParser.getParse(MSExcelParser.java:73)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:254)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
Caused by: java.lang.ArrayIndexOutOfBoundsException
        at java.lang.System.arraycopy(Native Method)
        at org.apache.poi.hssf.record.UnknownRecord.<init>(UnknownRecord.java:62)
        at org.apache.poi.hssf.record.SubRecord.createSubRecord(SubRecord.java:57)
        at org.apache.poi.hssf.record.ObjRecord.fillFields(ObjRecord.java:99)
        at org.apache.poi.hssf.record.Record.fillFields(Record.java:90)
        at org.apache.poi.hssf.record.Record.<init>(Record.java:55)
        at org.apache.poi.hssf.record.ObjRecord.<init>(ObjRecord.java:61)
        ... 13 more
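
Both traces end in an unchecked ArrayIndexOutOfBoundsException thrown from inside the POI workbook constructor. As a rough sketch of how a caller could keep one unreadable spreadsheet from disturbing the rest of a crawl, the extraction call can be wrapped so that any exception becomes an empty result. TextExtractor and GuardedExtractor below are hypothetical stand-ins written for this illustration; they are not Nutch or POI classes, and this is not a description of how the plugin itself handles the failure.

```java
import java.util.Optional;

/** Hypothetical stand-in for any text extractor that may fail on malformed input. */
interface TextExtractor {
    String extractText(byte[] document) throws Exception;
}

/** Hypothetical wrapper: turns any extraction failure into an empty result. */
final class GuardedExtractor {

    static Optional<String> tryExtract(TextExtractor extractor, String url, byte[] document) {
        try {
            return Optional.ofNullable(extractor.extractText(document));
        } catch (Exception e) {
            // Catches checked and unchecked exceptions alike, including an
            // ArrayIndexOutOfBoundsException raised deep inside a parsing library.
            System.err.println("parse failed for " + url + ": " + e);
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // An extractor that always fails, to simulate a malformed file.
        TextExtractor broken = doc -> { throw new ArrayIndexOutOfBoundsException(); };
        Optional<String> text = tryExtract(broken, "http://example.invalid/bad.xls", new byte[0]);
        System.out.println("text extracted: " + text.isPresent());
    }
}
```

Logging the URL with each failure also gives a list of documents that could later serve as unit-test inputs.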


-- 
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/


Re: Content-type mismatch for Excel

Posted by Jérôme Charron <je...@gmail.com>.
> 
> there are some modifications necessary, because the xls-plugin still
> uses an old interface.

Yes, it uses some old interfaces. I have made the changes in my local copy
and will commit them to the trunk.
I have not tested them yet (I will commit in a few days if there are no
objections from other developers).

> The changes are not difficult, but I still
> observe some other problems with this plugin.

Ok, what kind of problems?



-- 
http://motrech.free.fr/
http://www.frutch.org/

Re: Content-type mismatch for Excel

Posted by Michael Nebel <mi...@nebel.de>.
Hi,

there are some modifications necessary, because the xls-plugin still
uses an old interface. The changes are not difficult, but I still
observe some other problems with this plugin.

Regards

	Michael

Ayyanar Inbamohan wrote:

> Hi Jérôme,
> 
> Now I am trying Nutch 0.7. I am using the plugin from
> JIRA, but while building the plugin with ant I am still
> getting two errors from the Excel plugin:
> 
> 
> compile:
>      [echo] Compiling plugin: parse-msexcel
>     [javac] Compiling 3 source files to /home/oss/nutch-0.7/build/parse-msexcel/classes
>     [javac] /home/oss/nutch-0.7/src/plugin/parse-msexcel/src/java/org/apache/nutch/parse/msexcel/MSExcelParser.java:35: getParse(org.apache.nutch.protocol.Content) in org.apache.nutch.parse.msexcel.MSExcelParser cannot implement getParse(org.apache.nutch.protocol.Content) in org.apache.nutch.parse.Parser; overridden method does not throw org.apache.nutch.parse.ParseException
>     [javac]     public Parse getParse(final Content content)throws ParseException {
>     [javac]                  ^
>     [javac] /home/oss/nutch-0.7/src/plugin/parse-msexcel/src/java/org/apache/nutch/parse/msexcel/MSExcelParser.java:103: cannot resolve symbol
>     [javac] symbol  : constructor ParseData (java.lang.String,org.apache.nutch.parse.Outlink[],java.util.Properties)
>     [javac] location: class org.apache.nutch.parse.ParseData
>     [javac]    final ParseData parseData = new ParseData(resultTitle, outlinks, metadata);
>     [javac]                                ^
>     [javac] 2 errors
> 
> How can I avoid the above errors?
> 
> 
> 
> thanks,
> Ayyanar...
> 
> --- Jérôme Charron <je...@gmail.com> wrote:
> 
> 
>>>Sample lines taken while crawling, where excel is
>>>taken as application/pdf
>>
>>
>>I don't think that your xls file is taken as a pdf,
>>but as an unknown file
>>type (Content-Type: null).
>>In Nutch 0.6, if the httpd server is badly
>>configured and doesn't return a
>>good content-type, Nutch can't determine it itself (and
>>then the process is aborted).
>>In Nutch 0.7, the mime-type detector tries to find
>>the document's type if it is
>>not sent by the server (this is a first step in
>>detection; the next is to
>>check that the type returned by the server is the
>>right one). If you can, try
>>Nutch 0.7, which should solve your problem (
>>http://lucene.apache.org/nutch/release/)
>>
>>Regards
>>
>>Jérôme
>>
-- 
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/


Re: Content-type mismatch for Excel

Posted by Ayyanar Inbamohan <te...@yahoo.com>.
Hi Jérôme,

Now I am trying Nutch 0.7. I am using the plugin from
JIRA, but while building the plugin with ant I am still
getting two errors from the Excel plugin:


compile:
     [echo] Compiling plugin: parse-msexcel
    [javac] Compiling 3 source files to /home/oss/nutch-0.7/build/parse-msexcel/classes
    [javac] /home/oss/nutch-0.7/src/plugin/parse-msexcel/src/java/org/apache/nutch/parse/msexcel/MSExcelParser.java:35: getParse(org.apache.nutch.protocol.Content) in org.apache.nutch.parse.msexcel.MSExcelParser cannot implement getParse(org.apache.nutch.protocol.Content) in org.apache.nutch.parse.Parser; overridden method does not throw org.apache.nutch.parse.ParseException
    [javac]     public Parse getParse(final Content content)throws ParseException {
    [javac]                  ^
    [javac] /home/oss/nutch-0.7/src/plugin/parse-msexcel/src/java/org/apache/nutch/parse/msexcel/MSExcelParser.java:103: cannot resolve symbol
    [javac] symbol  : constructor ParseData (java.lang.String,org.apache.nutch.parse.Outlink[],java.util.Properties)
    [javac] location: class org.apache.nutch.parse.ParseData
    [javac]    final ParseData parseData = new ParseData(resultTitle, outlinks, metadata);
    [javac]                                ^
    [javac] 2 errors

How can I avoid the above errors?
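
The first javac error reflects a general Java rule: a method that implements an interface method may not declare a checked exception that the interface method does not declare. The second error says only that no ParseData constructor with that argument list exists in the tree being compiled against, so the call has to match a constructor that class actually declares. The sketch below reproduces the shape of the first problem and one possible way around it, reporting failure through the return value. DocParser, ParseResult and SpreadsheetParser are hypothetical stand-ins, not Nutch classes; the real signatures should be checked against the Nutch 0.7 source.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Stand-in for a parser interface whose method declares no checked exception. */
interface DocParser {
    ParseResult getParse(byte[] content);
}

/** Hypothetical result type carrying either extracted text or a failure reason. */
final class ParseResult {
    final boolean success;
    final String text;
    final String error;

    private ParseResult(boolean success, String text, String error) {
        this.success = success;
        this.text = text;
        this.error = error;
    }

    static ParseResult ok(String text) {
        return new ParseResult(true, text, null);
    }

    static ParseResult failed(Throwable cause) {
        return new ParseResult(false, "", cause.toString());
    }
}

final class SpreadsheetParser implements DocParser {

    // Adding "throws IOException" to this signature would be rejected by javac,
    // because DocParser.getParse declares no checked exception.
    @Override
    public ParseResult getParse(byte[] content) {
        try {
            return ParseResult.ok(extractText(content));
        } catch (IOException e) {
            // Report the failure in the result instead of propagating it.
            return ParseResult.failed(e);
        }
    }

    // Placeholder extraction: decodes the bytes as text and rejects empty input.
    private String extractText(byte[] content) throws IOException {
        if (content == null || content.length == 0) {
            throw new IOException("empty document");
        }
        return new String(content, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ParseResult result = new SpreadsheetParser().getParse(new byte[0]);
        System.out.println("success=" + result.success + ", error=" + result.error);
    }
}
```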



thanks,
Ayyanar...

--- Jérôme Charron <je...@gmail.com> wrote:

> > Sample lines taken while crawling, where excel is
> > taken as application/pdf
> 
> 
> I don't think that your xls file is taken as a pdf,
> but as an unknown file
> type (Content-Type: null).
> In Nutch 0.6, if the httpd server is badly
> configured and doesn't return a
> good content-type, Nutch can't determine it itself (and
> then the process is aborted).
> In Nutch 0.7, the mime-type detector tries to find
> the document's type if it is
> not sent by the server (this is a first step in
> detection; the next is to
> check that the type returned by the server is the
> right one). If you can, try
> Nutch 0.7, which should solve your problem (
> http://lucene.apache.org/nutch/release/)
> 
> Regards
> 
> Jérôme
> 
> -- 
> http://motrech.free.fr/
> http://www.frutch.org/
> 



Re: Content-type mismatch for Excel

Posted by Jérôme Charron <je...@gmail.com>.
> Sample lines taken while crawling, where excel is
> taken as application/pdf


I don't think that your xls file is taken as a pdf, but as an unknown file
type (Content-Type: null).
In Nutch 0.6, if the httpd server is badly configured and doesn't return a
good content-type, Nutch can't determine it itself (and then the process is aborted).
In Nutch 0.7, the mime-type detector tries to find the document's type if it is
not sent by the server (this is a first step in detection; the next is to
check that the type returned by the server is the right one). If you can, try
Nutch 0.7, which should solve your problem (
http://lucene.apache.org/nutch/release/)

Regards

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/
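
To illustrate the first detection step described in the reply above (finding a type when the server sends none), here is a minimal sketch. It does not show Nutch's actual mime-type detector. ContentTypeSniffer is a hypothetical class that trusts a Content-Type header when one is present and otherwise looks at well-known leading bytes, using the file extension only to tell OLE2-based Office formats apart. The second step, verifying a type the server did send, is not covered.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper (not a Nutch class): guesses a type when the server sent none. */
final class ContentTypeSniffer {

    // First bytes of an OLE2 compound file, the container of binary .xls/.ppt/.doc files.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP = {'P', 'K', 3, 4};

    static String guess(String headerType, String url, byte[] content) {
        // 1. Use the server's header if it carries a type (parameters are dropped).
        String base = baseType(headerType);
        if (!base.isEmpty()) {
            return base;
        }
        // 2. Otherwise fall back to the leading "magic" bytes of the content.
        if (startsWith(content, PDF)) return "application/pdf";
        if (startsWith(content, ZIP)) return "application/zip";
        if (startsWith(content, OLE2)) {
            // The OLE2 signature is shared by several Office formats,
            // so the extension is used as a tie-breaker.
            String path = url == null ? "" : url.toLowerCase(Locale.ROOT);
            if (path.endsWith(".xls")) return "application/vnd.ms-excel";
            if (path.endsWith(".ppt")) return "application/vnd.ms-powerpoint";
        }
        return "application/octet-stream";
    }

    private static String baseType(String headerType) {
        if (headerType == null) return "";
        int semi = headerType.indexOf(';');
        String base = semi >= 0 ? headerType.substring(0, semi) : headerType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    private static boolean startsWith(byte[] content, byte[] prefix) {
        if (content == null || content.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (content[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfStart = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(guess(null, "http://localhost:8080/sample.pdf", pdfStart));
    }
}
```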