Posted to solr-user@lucene.apache.org by scorpking <le...@gmail.com> on 2011/09/30 11:58:14 UTC

How to skip current document to index data from DIP

Hi,
can anyone help me with this problem? I'm using Tika to index data from rich
documents, which are fetched by HTTP request. I query the database to get some
fields and then combine them with the Tika output. Everything is OK, except
that I keep running into a "FileNotFoundException". I understand the error, but
I want to skip those documents and continue indexing the rest.

Thanks for your help.



--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-skip-current-document-to-index-data-from-DIP-tp3381894p3381894.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to skip current document to index data from DIP

Posted by scorpking <le...@gmail.com>.

Hi Erick Erickson,
Thank you for your reply. With my config I can successfully index data over
HTTP using Tika. I combine a database field with a base URL to build the
address that Tika fetches the file from. But during indexing, some of those
URLs do not exist, and I get this error:

*Caused by: java.io.FileNotFoundException:
http://media.gox.vn/edu/document/original/1/2704201010071760_Bai25.ppt
*

This means the file does not exist on the server. I want to skip such files
(documents) and go on to index the next ones. I tried *onError="skip"* to keep
indexing the remaining rich documents, but it doesn't work and the import
stops. Is there a way to overcome this problem?

Best regards
Thanks


Re: How to skip current document to index data from DIP

Posted by Erick Erickson <er...@gmail.com>.

You might want to review:
http://wiki.apache.org/solr/UsingMailingLists

You've given us a config, but no idea what you've
tried. What you observe. What error you're seeing.

What do your logs show? Are you seeing any error
stack traces? Have you tried backing this up and
trying only one addition at a time? For instance,
take out the whole TikaEntityProcessor and see if
you can just connect to your DB and index something
from there.

It appears you have a custom Transformer. Take that
out. Or put logging messages in there to see if you
even get that far.
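
A stripped-down data-config for that first step might look like the sketch
below. It reuses the names from the config quoted further down in this message
and simply drops the nested Tika entity and the custom Transformer. Treat it
as an assumption about what a database-only test could look like, not a
verified setup.

```xml
<dataConfig>
    <dataSource type="JdbcDataSource"
                driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                url="jdbc:sqlserver://myip;databaseName=VTC_Edu"
                user="ac" password="ps" name="dsdb"/>

    <document>
        <!-- Database-only entity: no TikaEntityProcessor, no custom
             Transformer, and no onError override (default is 'abort'). -->
        <entity name="VTCEduDocument" dataSource="dsdb" pk="pk_document_id"
                query="select pk_document_id, s_path_origin from [VTC_Edu].[dbo].[tbl_Document]">
            <field column="pk_document_id" name="pk_document_id"/>
            <field column="s_path_origin" name="s_path_origin"/>
        </entity>
    </document>
</dataConfig>
```

If this imports rows, the nested entity and then the Transformer can be added
back one at a time to see which step breaks.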

In other words, try stuff and tell us what the results
are. But just saying "it doesn't work" gives us very
little to go on.

Best
Erick

On Sun, Oct 2, 2011 at 11:00 PM, scorpking <le...@gmail.com> wrote:
> Hi, thanks for your reply.
> But when I set the attribute onError="skip", no data is imported at all.
> This is my config:
> <dataConfig>
>     <dataSource type="JdbcDataSource"
>                 driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
>                 url="jdbc:sqlserver://myip;databaseName=VTC_Edu"
>                 user="ac" password="ps" name="dsdb"/>
>
>     <dataSource type="BinURLDataSource" name="dsurl"/>
>
>     <document>
>         <entity name="VTCEduDocument" dataSource="dsdb" pk="pk_document_id"
>                 query="select pk_document_id, s_path_origin from [VTC_Edu].[dbo].[tbl_Document]"
>                 onError="skip">
>             <field column="pk_document_id" name="pk_document_id"/>
>             <field column="s_path_origin" name="s_path_origin"/>
>
>             <entity processor="TikaEntityProcessor" dataSource="dsurl" format="text"
>                     url="http://media.gox.vn/edu/document/original/${VTCEduDocument.s_path_origin}"
>                     transformer="com.vtc.search.Converter" onError="skip">
>                 <field column="Author" name="author" meta="true"/>
>                 <field column="title" name="title" meta="true"/>
>                 <field column="text" name="text" encodingvn="true"/>
>             </entity>
>         </entity>
>     </document>
> </dataConfig>
>
> Thanks

Re: How to skip current document to index data from DIP

Posted by scorpking <le...@gmail.com>.

Hi, thanks for your reply.
But when I set the attribute onError="skip", no data is imported at all.
This is my config:
<dataConfig>
    <dataSource type="JdbcDataSource"
                driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                url="jdbc:sqlserver://myip;databaseName=VTC_Edu"
                user="ac" password="ps" name="dsdb"/>

    <dataSource type="BinURLDataSource" name="dsurl"/>

    <document>
        <entity name="VTCEduDocument" dataSource="dsdb" pk="pk_document_id"
                query="select pk_document_id, s_path_origin from [VTC_Edu].[dbo].[tbl_Document]"
                onError="skip">
            <field column="pk_document_id" name="pk_document_id"/>
            <field column="s_path_origin" name="s_path_origin"/>

            <entity processor="TikaEntityProcessor" dataSource="dsurl" format="text"
                    url="http://media.gox.vn/edu/document/original/${VTCEduDocument.s_path_origin}"
                    transformer="com.vtc.search.Converter" onError="skip">
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="text" encodingvn="true"/>
            </entity>
        </entity>
    </document>
</dataConfig>

Thanks



Re: How to skip current document to index data from DIP

Posted by Ahmet Arslan <io...@yahoo.com>.

> can anyone help me with this problem? I'm using Tika to index data from rich
> documents, which are fetched by HTTP request. I query the database to get some
> fields and then combine them with the Tika output. Everything is OK, except
> that I keep running into a "FileNotFoundException". I understand the error, but
> I want to skip those documents and continue indexing the rest.

"onError : (abort|skip|continue) . The default value is 'abort' . 'skip' skips the current document. 'continue' continues as if the error did not happen."

http://wiki.apache.org/solr/DataImportHandler#Schema_for_the_data_config
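
To illustrate where the attribute goes, here is a minimal sketch based on the
config posted earlier in this thread. It is only a sketch under assumptions:
the custom Transformer and its encodingvn attribute are left out to keep the
test simple, the name "tika" on the inner entity is made up for the example,
and nothing in this thread confirms that the installed Solr version applies
onError to a FileNotFoundException raised while fetching the URL.

```xml
<document>
    <!-- Outer entity: onError is omitted, so the default 'abort' applies
         to errors while reading from the database. -->
    <entity name="VTCEduDocument" dataSource="dsdb" pk="pk_document_id"
            query="select pk_document_id, s_path_origin from [VTC_Edu].[dbo].[tbl_Document]">
        <field column="pk_document_id" name="pk_document_id"/>
        <field column="s_path_origin" name="s_path_origin"/>

        <!-- Inner entity ("tika" is a hypothetical name). Per the wiki text
             quoted above, 'skip' skips the current document on error, while
             'continue' carries on as if the error did not happen. -->
        <entity name="tika" processor="TikaEntityProcessor" dataSource="dsurl"
                format="text"
                url="http://media.gox.vn/edu/document/original/${VTCEduDocument.s_path_origin}"
                onError="skip">
            <field column="text" name="text"/>
        </entity>
    </entity>
</document>
```

Setting the attribute only on the inner entity keeps database failures fatal
while letting the missing-file case be handled per document. Swapping "skip"
for "continue" is the other value worth trying if the rows should still be
indexed without the file content.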