
Posted to solr-user@lucene.apache.org by scorpking <le...@gmail.com> on 2011/09/09 12:58:30 UTC

indexing data from rich documents - Tika with solr3.1

Hi everyone,
I have a problem with Tika and Solr. I succeeded in indexing data from
various file formats (pdf, doc...) using an absolute file path, but now I have
a link from the internet (ex: http://myweb/filename.pdf). I want to index from
this link, but it does not work and I don't know why. This is my dataconfig.xml:

<dataConfig>
    <dataSource type="BinFileDataSource" name="bin"/>
    <document>
        <entity name="tika-test" processor="TikaEntityProcessor"
                url="http://myweb/filename.pdf" format="text" dataSource="bin">
            <field column="Author" name="author" meta="true"/>
            <field column="title" name="title" meta="true"/>
            <field column="text" name="text"/>
        </entity>
    </document>
</dataConfig>

When I replace url="http://myweb/filename.pdf" with an absolute file path, it
works very well.
Does anyone know why?
Thanks for your help.

--
View this message in context: http://lucene.472066.n3.nabble.com/indexing-data-from-rich-documents-Tika-with-solr3-1-tp3322555p3322555.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: indexing data from rich documents - Tika with solr3.1

Posted by scorpking <le...@gmail.com>.

Oh, that works for me. Thank you very much, Erik Hatcher-4. I have managed to index
from https.


Re: indexing data from rich documents - Tika with solr3.1

Posted by scorpking <le...@gmail.com>.

Hi all, thank you very much to everyone who helped me. I have indexed from http using DIH.


Re: indexing data from rich documents - Tika with solr3.1

Posted by scorpking <le...@gmail.com>.

Yes, I want to use DIH, and I tried to configure my dataconfig file, but it is
wrong. This is my config:

<dataConfig>
    <dataSource type="JdbcDataSource"
                driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                url="jdbc:sqlserver://ipAddress;databaseName=VTC_Edu" user="myuser"
                password="mypass" name="VTCEduDocument"/>

    <dataSource type="BinURLDataSource" name="dsurl"/>

    <document>
        <entity name="VTCEduDocument" pk="pk_document_id"
                query="select TOP 10 pk_document_id, s_path_origin from [VTC_Edu].[dbo].[tbl_Document]"
                transformer="vn.vtc.solr.transformer.ImageFilter,vn.vtc.solr.transformer.RemoveHTML,RegexTransformer,TemplateTransformer,vn.vtc.solr.transformer.vntransformer,vn.vtc.solr.correctUnicodeString.correctUnicodeString,vn.vtc.solr.unescapeHtmlString.UnescapeHtmlString,vn.vtc.solr.correctISOString.correctISOString">
            <field column="pk_document_id" name="pk_document_id"/>
            <field column="s_path_origin" name="s_path_origin"/>
        </entity>

        <entity processor="TikaEntityProcessor" dataSource="dsurl" format="text"
                url="http://media.gox.vn/edu/document/original/${VTCEduDocument.s_path_origin}">
            <field column="Author" name="author" meta="true"/>
            <field column="title" name="title" meta="true"/>
            <field column="text" name="text"/>
        </entity>
    </document>
</dataConfig>

And here is the error:
SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null Processing Document # 1
	at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
	at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:89)
	at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:38)
	at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
	at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
	at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:591)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:267)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:186)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:353)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:411)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:392)
Caused by: java.net.MalformedURLException: no protocol: nullselect TOP 10 pk_document_id, s_path_origin from [VTC_Edu].[dbo].[tbl_Document]
	at java.net.URL.<init>(URL.java:567)
	at java.net.URL.<init>(URL.java:464)
	at java.net.URL.<init>(URL.java:413)
	at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:81)
	... 10 more

???
Thanks
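
A hedged reading of the trace above: SqlEntityProcessor ends up calling
BinURLDataSource.getData, which suggests the SQL entity was handed the URL data
source rather than the JDBC one, so the query text was treated as a URL. The
sketch below is one possible arrangement, not a fix confirmed in this thread. It
assumes two changes are needed: naming the JDBC source on the SQL entity via
dataSource, and nesting the Tika entity inside the SQL entity so that
${VTCEduDocument.s_path_origin} resolves for each row. The child entity name
"tika-doc" is made up for illustration, and the custom transformer list is left
out for brevity.

```xml
<dataConfig>
    <dataSource type="JdbcDataSource" name="VTCEduDocument"
                driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                url="jdbc:sqlserver://ipAddress;databaseName=VTC_Edu"
                user="myuser" password="mypass"/>
    <dataSource type="BinURLDataSource" name="dsurl"/>

    <document>
        <!-- Assumption: dataSource points the SQL query at the JDBC source explicitly -->
        <entity name="VTCEduDocument" dataSource="VTCEduDocument" pk="pk_document_id"
                query="select TOP 10 pk_document_id, s_path_origin from [VTC_Edu].[dbo].[tbl_Document]">
            <field column="pk_document_id" name="pk_document_id"/>
            <field column="s_path_origin" name="s_path_origin"/>

            <!-- Assumption: nested child entity, run once per row of the parent;
                 the name "tika-doc" is hypothetical -->
            <entity name="tika-doc" processor="TikaEntityProcessor"
                    dataSource="dsurl" format="text"
                    url="http://media.gox.vn/edu/document/original/${VTCEduDocument.s_path_origin}">
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="text"/>
            </entity>
        </entity>
    </document>
</dataConfig>
```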


Re: indexing data from rich documents - Tika with solr3.1

Posted by Erik Hatcher <er...@gmail.com>.

On Sep 18, 2011, at 21:52 , scorpking wrote:

> Hi Erik Hatcher-4,
> I tried indexing with the script from your url, but I have a problem. In your
> case, you knew the absolute path of the files
> (Dir.new("/Users/erikhatcher/apache-solr-3.3.0/docs")), so you could index
> them. In my case, I don't know the absolute path of the files. I only know the
> http address where the files are (ex: you can see this link as
> reference: http://www.lc.unsw.edu.au/onlib/pdf/). Is there another way? Thanks

Write a little script that takes the HTTP directory listing like that, and then uses stream.url (rather than stream.file as my example used).

	Erik

Re: indexing data from rich documents - Tika with solr3.1

Posted by scorpking <le...@gmail.com>.

Hi Erik Hatcher-4,
I tried indexing with the script from your url, but I have a problem. In your
case, you knew the absolute path of the files
(Dir.new("/Users/erikhatcher/apache-solr-3.3.0/docs")), so you could index
them. In my case, I don't know the absolute path of the files. I only know the
http address where the files are (ex: you can see this link as
reference: http://www.lc.unsw.edu.au/onlib/pdf/). Is there another way? Thanks




Re: indexing data from rich documents - Tika with solr3.1

Posted by Erik Hatcher <er...@gmail.com>.

Maybe this quick script will get you running?

    <http://www.lucidimagination.com/blog/2011/08/31/indexing-rich-files-into-solr-quickly-and-easily/>


On Sep 15, 2011, at 00:44 , scorpking wrote:

> Hi Erick Erickson,
> We have files in many formats (doc, ppt, pdf, ...). The purpose of the files is
> to let users search the detailed educational content in them. Because I am new
> to Solr, maybe I don't understand Apache Tika in enough depth. At the moment I
> can't index pdf files from http; with one file it is ok. Thanks for your attention.
> 
> 
> 

Re: indexing data from rich documents - Tika with solr3.1

Posted by scorpking <le...@gmail.com>.

Hi Erick Erickson,
We have files in many formats (doc, ppt, pdf, ...). The purpose of the files is
to let users search the detailed educational content in them. Because I am new
to Solr, maybe I don't understand Apache Tika in enough depth. At the moment I
can't index pdf files from http; with one file it is ok. Thanks for your attention.




Re: indexing data from rich documents - Tika with solr3.1

Posted by Erick Erickson <er...@gmail.com>.

FileListEntityProcessor presupposes it's looking at files on disk. It
doesn't know anything about the web. So, as the stack trace
indicates, it tries to open a directory called http://..... and fails.

What is it you're really trying to do here? Perhaps if you explain
your higher-level problem we can provide some help.

Best
Erick

On Mon, Sep 12, 2011 at 11:53 PM, scorpking <le...@gmail.com> wrote:
> Hi,
> Can you explain this problem to me?
> I have indexed data from multiple files using the Tika libs, and I have indexed
> data from http, but only a single file (ex: http://myweb/filename.pdf). Now I
> have files in many formats under an http path (ex: http://myweb/files/). I tried
> to index data from an http path, but it does not work. This is my data-config.
>
> <dataConfig>
>     <dataSource type="BinURLDataSource" name="bin" encoding="utf-8"/>
>     <document>
>         <entity name="sd" processor="FileListEntityProcessor"
>                 fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)"
>                 baseDir="http://www.lc.unsw.edu.au/onlib/pdf/"
>                 recursive="true" rootEntity="false" transformer="DateFormatTransformer">
>             <entity name="tika-test" processor="TikaEntityProcessor"
>                     url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
>                 <field column="Author" name="author" meta="true"/>
>                 <field column="title" name="title" meta="true"/>
>                 <field column="text" name="text"/>
>             </entity>
>             <field column="file" name="filename"/>
>         </entity>
>     </document>
> </dataConfig>
>
> Error:
> Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' value: http://www.lc.unsw.edu.au/onlib/pdf/ is not a directory Processing Document # 1
>        at org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:124)
>        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:69)
>        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:552)
>        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:267)
>        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:186)
>        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:353)
>        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:411)
>        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:392)
>
> Thanks for your help.
>
>
>

Re: indexing data from rich documents - Tika with solr3.1

Posted by scorpking <le...@gmail.com>.

Hi,
Can you explain this problem to me?
I have indexed data from multiple files using the Tika libs, and I have indexed
data from http, but only a single file (ex: http://myweb/filename.pdf). Now I
have files in many formats under an http path (ex: http://myweb/files/). I tried
to index data from an http path, but it does not work. This is my data-config.

<dataConfig>
    <dataSource type="BinURLDataSource" name="bin" encoding="utf-8"/>
    <document>
        <entity name="sd" processor="FileListEntityProcessor"
                fileName=".*\.(DOC)|(PDF)|(pdf)|(doc)"
                baseDir="http://www.lc.unsw.edu.au/onlib/pdf/"
                recursive="true" rootEntity="false" transformer="DateFormatTransformer">
            <entity name="tika-test" processor="TikaEntityProcessor"
                    url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="text"/>
            </entity>
            <field column="file" name="filename"/>
        </entity>
    </document>
</dataConfig>

Error: 
Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' value: http://www.lc.unsw.edu.au/onlib/pdf/ is not a directory Processing Document # 1
	at org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:124)
	at org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:69)
	at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:552)
	at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:267)
	at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:186)
	at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:353)
	at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:411)
	at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:392)

Thanks for your help.



Re: indexing data from rich documents - Tika with solr3.1

Posted by Erik Hatcher <er...@gmail.com>.

If the only thing you're doing is indexing file content, then you can bypass using the Data Import Handler altogether and use the ExtractingRequestHandler (aka Solr Cell).  And you can feed in a file from a URL using the stream.url capability, like the stream.file example here: <http://wiki.apache.org/solr/ExtractingRequestHandler#Configuration>

Something like -  http://localhost:8983/solr/update/extract?stream.url=http://myweb/filename.pdf&literal.id=filename.pdf
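
Note that stream.url only works when remote streaming is enabled in
solrconfig.xml. A minimal sketch of the relevant fragment, assuming a stock
Solr 3.x layout; the upload limit value is illustrative and not taken from this
thread:

```xml
<requestDispatcher handleSelect="true">
    <!-- enableRemoteStreaming="true" allows Solr to fetch content named by
         stream.url or stream.file; the limit below is an illustrative value -->
    <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048"/>
</requestDispatcher>
```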

But to fix what you're doing below, looks like you should be using BinURLDataSource rather than BinFileDataSource - other than that, it looks fine.
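
As a sketch of that one-attribute change, here is the quoted config with only
the dataSource type swapped and the wrapped url attribute joined onto one line.
Everything else is taken from the original post and is untested here:

```xml
<dataConfig>
    <!-- BinURLDataSource reads bytes from a URL; BinFileDataSource reads local files -->
    <dataSource type="BinURLDataSource" name="bin"/>
    <document>
        <entity name="tika-test" processor="TikaEntityProcessor"
                url="http://myweb/filename.pdf" format="text" dataSource="bin">
            <field column="Author" name="author" meta="true"/>
            <field column="title" name="title" meta="true"/>
            <field column="text" name="text"/>
        </entity>
    </document>
</dataConfig>
```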

	Erik

On Sep 9, 2011, at 06:58 , scorpking wrote:

> Hi everyone,
> I have a problem with Tika and Solr. I succeeded in indexing data from
> various file formats (pdf, doc...) using an absolute file path, but now I have
> a link from the internet (ex: http://myweb/filename.pdf). I want to index from
> this link, but it does not work and I don't know why. This is my dataconfig.xml:
> 
> <dataConfig>
>     <dataSource type="BinFileDataSource" name="bin"/>
>     <document>
>         <entity name="tika-test" processor="TikaEntityProcessor"
>                 url="http://myweb/filename.pdf" format="text" dataSource="bin">
>             <field column="Author" name="author" meta="true"/>
>             <field column="title" name="title" meta="true"/>
>             <field column="text" name="text"/>
>         </entity>
>     </document>
> </dataConfig>
> 
> When I replace url="http://myweb/filename.pdf" with an absolute file path, it
> works very well.
> Does anyone know why?
> Thanks for your help.
> 