Posted to user@nutch.apache.org by Ayyanar Inbamohan <te...@yahoo.com> on 2005/09/05 07:11:44 UTC
Content-type mismatch for Excel
Hi all,
In my crawl with Nutch 0.6, I am trying to crawl ppt, xls
and zip files.
I got the plugins from JIRA.
Sample lines taken while crawling, where excel is
taken as application/pdf
050905 103243 fetching http://localhost:8080/search_sample/testexcel.xls
050905 103243 fetching http://localhost:8080/search_sample/javaCertStudyNotes.pdf
HR Response Code: 200
HR content.length: 38117
HR contentType: application/pdf
HR url: http://localhost:8080/search_sample/testexcel.xls
HR Response Code: 200
HR content.length: 13824
HR contentType: null
thanks,
Ayyanar...
Re: Content-type mismatch for Excel
Posted by Jérôme Charron <je...@gmail.com>.
> I took at random some xls-files from the internet, crawled them and saw
> some errors. I haven't been able to check the errors further. So I can't
> give you a more specific description of the problem :-( If you're
> interested, I can mail you the url with my test-documents "off-list".
Yes, I'm interested in getting these files and integrating them into the
parse-msexcel unit tests.
Thanks
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: Content-type mismatch for Excel
Posted by Michael Nebel <mi...@nebel.de>.
Hi Jérôme,
Jérôme Charron wrote:
>>The changes are not difficult, but I still
>>observe some other problems with this plugin.
> Ok, what kind of problems?
I took at random some xls-files from the internet, crawled them and saw
some errors. I haven't been able to check the errors further. So I can't
give you a more specific description of the problem :-( If you're
interested, I can mail you the url with my test-documents "off-list".
Regards
Michael
050829 192634 fetching http://www.xxxxxx.xx/xxx/xls/RAL_RGB_Farbkarte.xls
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
    at org.apache.poi.hssf.record.RecordFactory.createRecord(RecordFactory.java:224)
    at org.apache.poi.hssf.record.RecordFactory.createRecords(RecordFactory.java:160)
    at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:163)
    at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:210)
    at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:191)
    at org.apache.nutch.parse.msexcel.ExcelTextExtractor.extractText(ExcelTextExtractor.java:34)
    at org.apache.nutch.parse.msexcel.MSExcelParser.getParse(MSExcelParser.java:73)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:254)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
Caused by: java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.poi.hssf.record.UnknownRecord.<init>(UnknownRecord.java:62)
    at org.apache.poi.hssf.record.SubRecord.createSubRecord(SubRecord.java:57)
    at org.apache.poi.hssf.record.ObjRecord.fillFields(ObjRecord.java:99)
    at org.apache.poi.hssf.record.Record.fillFields(Record.java:90)
    at org.apache.poi.hssf.record.Record.<init>(Record.java:55)
    at org.apache.poi.hssf.record.ObjRecord.<init>(ObjRecord.java:61)
    ... 13 more
050829 192650 fetching http://www.xxxxxxx.xx/xxxx/xls/TAG_Liste.xls
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
    at org.apache.poi.hssf.record.RecordFactory.createRecord(RecordFactory.java:224)
    at org.apache.poi.hssf.record.RecordFactory.createRecords(RecordFactory.java:160)
    at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:163)
    at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:210)
    at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:191)
    at org.apache.nutch.parse.msexcel.ExcelTextExtractor.extractText(ExcelTextExtractor.java:34)
    at org.apache.nutch.parse.msexcel.MSExcelParser.getParse(MSExcelParser.java:73)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:254)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
Caused by: java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.poi.hssf.record.UnknownRecord.<init>(UnknownRecord.java:62)
    at org.apache.poi.hssf.record.SubRecord.createSubRecord(SubRecord.java:57)
    at org.apache.poi.hssf.record.ObjRecord.fillFields(ObjRecord.java:99)
    at org.apache.poi.hssf.record.Record.fillFields(Record.java:90)
    at org.apache.poi.hssf.record.Record.<init>(Record.java:55)
    at org.apache.poi.hssf.record.ObjRecord.<init>(ObjRecord.java:61)
    ... 13 more
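
Both traces have the same shape: a constructor invoked through reflection fails inside System.arraycopy, and the reflection layer reports it as an InvocationTargetException whose cause is the real error. The following is only a minimal sketch of that wrapping behaviour, not POI code; FragileRecord and ReflectionUnwrapDemo are hypothetical names invented for the illustration.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;

/** Hypothetical record type (not a POI class) whose constructor can overrun its input. */
class FragileRecord {
    public FragileRecord(byte[] data, int length) {
        byte[] copy = new byte[data.length];
        // Asking for more bytes than the source holds makes arraycopy throw
        // an IndexOutOfBoundsException (or a subclass of it).
        System.arraycopy(data, 0, copy, 0, length);
    }
}

final class ReflectionUnwrapDemo {
    private ReflectionUnwrapDemo() {}

    /** Builds a FragileRecord reflectively; returns the underlying failure, or null on success. */
    static Throwable constructionFailure(byte[] data, int length) {
        try {
            Constructor<FragileRecord> ctor =
                FragileRecord.class.getConstructor(byte[].class, int.class);
            ctor.newInstance(data, length);
            return null;
        } catch (InvocationTargetException e) {
            // Reflection wraps whatever the constructor threw; the cause is the real problem.
            return e.getCause();
        } catch (ReflectiveOperationException e) {
            // Missing constructor, access problems and similar reflection failures.
            return e;
        }
    }
}
```

When reading traces like the ones above, the "Caused by" section is therefore the part worth reporting to the library that threw it.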
--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/
Re: Content-type mismatch for Excel
Posted by Jérôme Charron <je...@gmail.com>.
> there are some modifications necessary, because the xls-plugin still uses
> an old interface.
Yes, it uses some old interfaces. I have made the changes in my local copy
and plan to commit them to trunk.
I have not tested them yet, though (I will commit in a few days if there are
no objections from other developers).
> The changes are not difficult, but I still
> observe some other problems with this plugin.
Ok, what kind of problems?
--
http://motrech.free.fr/
http://www.frutch.org/
Re: Content-type mismatch for Excel
Posted by Michael Nebel <mi...@nebel.de>.
Hi,
there are some modifications necessary, because the xls-plugin still uses
an old interface. The changes are not difficult, but I still
observe some other problems with this plugin.
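
The first compile error reported in this thread ("overridden method does not throw ParseException") is a general Java rule: a method that implements an interface method may not declare a checked exception that the interface method does not declare. The sketch below shows the rule and one possible way around it, using stand-in types only. SketchParser, SketchParseException, ParseOutcome and ExcelParserSketch are hypothetical names and do not reflect the real Nutch 0.7 signatures, which should be taken from the Nutch sources.

```java
import java.nio.charset.StandardCharsets;

/** Stand-in checked exception; NOT the real org.apache.nutch.parse.ParseException. */
class SketchParseException extends Exception {
    SketchParseException(String message) { super(message); }
}

/** Stand-in result type that carries either extracted text or a failure message. */
final class ParseOutcome {
    final boolean success;
    final String text;
    final String error;

    private ParseOutcome(boolean success, String text, String error) {
        this.success = success;
        this.text = text;
        this.error = error;
    }

    static ParseOutcome ok(String text) { return new ParseOutcome(true, text, null); }
    static ParseOutcome failed(String error) { return new ParseOutcome(false, null, error); }
}

/** Stand-in for a parser interface whose method has no throws clause. */
interface SketchParser {
    ParseOutcome getParse(byte[] content);
}

final class ExcelParserSketch implements SketchParser {
    // Declaring "throws SketchParseException" here would not compile, because
    // the interface method does not declare it. The failure is reported in the result instead.
    @Override
    public ParseOutcome getParse(byte[] content) {
        try {
            return ParseOutcome.ok(extractText(content));
        } catch (SketchParseException e) {
            return ParseOutcome.failed(e.getMessage());
        }
    }

    // Placeholder for real text extraction; private helpers may still throw checked exceptions.
    private String extractText(byte[] content) throws SketchParseException {
        if (content == null || content.length == 0) {
            throw new SketchParseException("empty document");
        }
        return new String(content, StandardCharsets.UTF_8);
    }
}
```

The second error ("cannot resolve symbol ... constructor ParseData") is the same kind of mismatch: the plugin calls a constructor with an argument list that the installed class no longer offers, so the call has to be adapted to whatever constructors that version declares.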
Regards
Michael
--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/
Re: Content-type mismatch for Excel
Posted by Ayyanar Inbamohan <te...@yahoo.com>.
Hi Jérôme,
I am now trying Nutch 0.7. I am using the plugin from
JIRA, but while building the plugin with ant I still
get two compile errors from the Excel plugin:
compile:
[echo] Compiling plugin: parse-msexcel
[javac] Compiling 3 source files to /home/oss/nutch-0.7/build/parse-msexcel/classes
[javac] /home/oss/nutch-0.7/src/plugin/parse-msexcel/src/java/org/apache/nutch/parse/msexcel/MSExcelParser.java:35: getParse(org.apache.nutch.protocol.Content) in org.apache.nutch.parse.msexcel.MSExcelParser cannot implement getParse(org.apache.nutch.protocol.Content) in org.apache.nutch.parse.Parser; overridden method does not throw org.apache.nutch.parse.ParseException
[javac] public Parse getParse(final Content content)throws ParseException {
[javac] ^
[javac] /home/oss/nutch-0.7/src/plugin/parse-msexcel/src/java/org/apache/nutch/parse/msexcel/MSExcelParser.java:103: cannot resolve symbol
[javac] symbol : constructor ParseData (java.lang.String,org.apache.nutch.parse.Outlink[],java.util.Properties)
[javac] location: class org.apache.nutch.parse.ParseData
[javac] final ParseData parseData = new ParseData(resultTitle, outlinks, metadata);
[javac] ^
[javac] 2 errors
How can I avoid the above errors?
thanks,
Ayyanar...
Re: Content-type mismatch for Excel
Posted by Jérôme Charron <je...@gmail.com>.
> Sample lines taken while crawling, where excel is
> taken as application/pdf
I don't think that your xls file is taken as a PDF, but as an unknown file
type (Content-Type: null).
In Nutch 0.6, if the HTTP server is badly configured and doesn't return a
good content-type, Nutch can't determine it by itself (and processing is then
aborted).
In Nutch 0.7, the mime-type detector tries to find the document's type if
it is not sent by the server (this is a first step in detection; the next is to
check that the type returned by the server is the correct one). If you can, try
Nutch 0.7, which should solve your problem
(http://lucene.apache.org/nutch/release/).
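
To illustrate the idea of finding the type when the server sends none, here is a minimal sketch of such a fallback. It is not the Nutch 0.7 mime-type detector; ContentTypeSniffer and its detect method are hypothetical names, and the logic is an assumption about one reasonable approach: trust the server's header if present, otherwise look at the leading bytes, and use the file extension only to tell apart formats that share a container.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper, not part of Nutch: guesses a MIME type when the server sends none. */
final class ContentTypeSniffer {

    // Signature of OLE2 compound documents, the container used by legacy .xls, .doc and .ppt files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // PDF files start with the ASCII characters "%PDF".
    private static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    private ContentTypeSniffer() {}

    /** Returns the server's type if present, otherwise a guess, otherwise null. */
    static String detect(String serverType, String url, byte[] content) {
        if (serverType != null && !serverType.trim().isEmpty()) {
            return serverType.trim();
        }
        if (startsWith(content, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (url != null && startsWith(content, OLE2_MAGIC)) {
            // The container alone cannot tell Excel from Word or PowerPoint,
            // so the file extension is used to disambiguate.
            String lower = url.toLowerCase(Locale.ROOT);
            if (lower.endsWith(".xls")) return "application/vnd.ms-excel";
            if (lower.endsWith(".ppt")) return "application/vnd.ms-powerpoint";
            if (lower.endsWith(".doc")) return "application/msword";
        }
        return null; // still unknown
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

With a check like this, the testexcel.xls response from the original log (contentType: null) could be assigned application/vnd.ms-excel instead of being left untyped, provided the file really is an OLE2 document.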
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/