Posted to solr-user@lucene.apache.org by kostali hassan <me...@gmail.com> on 2015/12/02 12:02:31 UTC

indexing rich data from directory using solarium

How can I index rich data (MS Word and PDF files) with Solarium from a
directory that contains many files? My config is:

$config = array(
         "endpoint" => array("localhost" => array("host"=>"127.0.0.1",
         "port"=>"8983", "path"=>"/solr", "core"=>"demo",)
        ) );

I tried this code:

$dir = new Folder($dossier);
$files = $dir->find('.*\.*');
foreach ($files as $file) {
    $file = new File($dir->pwd() . DS . $file);

    $update = $client->createUpdate();

    $query = $client->createExtract();
    $query->setFile($file->pwd());
    $query->setCommit(true);
    $query->setOmitHeader(false);
    $doc = $query->createDocument();
    $doc->id = $file->pwd();
    $doc->name = $file->name;
    $doc->title = $file->name();
    $query->setDocument($doc);

    $result = $client->extract($query);
}

When I execute it I get this error:

org.apache.solr.common.SolrException: URLDecoder: Invalid character
encoding detected after position 79 of query string / form data (while
parsing as UTF-8)

Re: indexing rich data from directory using solarium

Posted by Gora Mohanty <go...@mimirtech.com>.
On 2 December 2015 at 21:59, Erik Hatcher <er...@gmail.com> wrote:
> Gora -
>
> SimplePostTool actually already adds the literal.id parameter* when in “auto” mode (and it’s not an XML, JSON, or CSV file).

Ah, OK. It has been a while since I actually used the tool. Thanks for the info.

Regards,
Gora

Re: indexing rich data from directory using solarium

Posted by Erik Hatcher <er...@gmail.com>.
Gora - 

SimplePostTool actually already adds the literal.id parameter* when in “auto” mode (and it’s not an XML, JSON, or CSV file).

	Erik


* See https://github.com/apache/lucene-solr/blob/d4762c1a2677a44c8a580b97cccc239e1e91a25d/solr/core/src/java/org/apache/solr/util/SimplePostTool.java#L786


—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com





Re: indexing rich data from directory using solarium

Posted by kostali hassan <me...@gmail.com>.
The problem with posting from the command line is this:

I started working with Solr 5.3.1 by extracting it to D:\solr and started the
Solr server with:

D:\solr\solr-5.3.1\bin>solr start ;

Then I created a core in standalone mode:

D:\solr\solr-5.3.1\bin>solr create -c mycore

I need to index files from the file system (Word and PDF), and the schema has
no “name” field for the document, so I added this field using curl:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field":{
     "name":"name",
     "type":"text_general",
     "stored":true,
     "indexed":true }
}' http://localhost:8983/solr/mycore/schema
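
The JSON body is easy to get wrong when it is typed or pasted by hand (typographic quotes, for instance, make it invalid). A minimal sketch, assuming the payload is built in PHP instead: the helper name addFieldPayload is hypothetical, and only the field attributes shown in the curl command above are used.

```php
<?php
// Hypothetical helper: builds the Schema API "add-field" body as JSON.
// json_encode always emits plain ASCII double quotes, so the payload
// cannot contain typographic quotes by accident.
function addFieldPayload(string $name, string $type, bool $stored, bool $indexed): string
{
    $payload = array(
        'add-field' => array(
            'name'    => $name,
            'type'    => $type,
            'stored'  => $stored,
            'indexed' => $indexed,
        ),
    );
    return json_encode($payload);
}

// Same field as in the curl command: a stored, indexed text_general field.
$body = addFieldPayload('name', 'text_general', true, true);
```

The resulting string could then be sent as the --data-binary body to the same /schema URL.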



And I re-indexed all the documents with SimplePostTool on Windows:

D:\solr\solr-5.3.1>java -classpath example\exampledocs\post.jar -Dauto=yes
-Dc=mycore -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool
D:\Lucene\document ;



But even though the field “name” was added successfully, it is empty; the
field title gets the name only for PDF documents, not for MS Word ones
(.doc and .docx).

Then I chose to index with the techproducts example, because it does not use
the Schema API and so I can modify my schema directly:

D:\solr\solr-5.3.1>solr -e techproducts

Techproducts returns the name of all the indexed XML files.

Then I created a new core called demo, based on the Solr home
example/techproducts/solr, using the schema.xml (which contains the field
“name”) and solrconfig.xml from techproducts.

When I indexed all the documents, the field name exists but is still empty
for every indexed document.

My question is: how can I get just the name of each document (MS Word and
PDF), not the path as in the field “id” or the field “ressource_name”? Do I
have to create a new field type, or is there another way?
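
When the documents are sent from PHP rather than with SimplePostTool, one option is to derive the name on the client and send it as a field value. A minimal sketch under that assumption; the function name documentName is hypothetical, and the keys it returns (name, title) simply mirror the field names used in this thread.

```php
<?php
// Hypothetical helper: splits a file path into the parts that could be
// sent as field values. Backslashes are normalised first so that Windows
// paths are handled the same way on any platform.
function documentName(string $path): array
{
    $normalized = str_replace('\\', '/', $path);
    $slash = strrpos($normalized, '/');
    $name = ($slash === false) ? $normalized : substr($normalized, $slash + 1);

    // Title = file name without its last extension (".docx", ".pdf", ...).
    $dot = strrpos($name, '.');
    $title = ($dot === false || $dot === 0) ? $name : substr($name, 0, $dot);

    return array('name' => $name, 'title' => $title);
}

$parts = documentName('D:\\Lucene\\document\\report.final.docx');
// $parts['name']  is "report.final.docx"
// $parts['title'] is "report.final"
```

In the Solarium loop from this thread, these values could be assigned to $doc->name and $doc->title in place of the full path.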


Re: indexing rich data from directory using solarium

Posted by Gora Mohanty <go...@mimirtech.com>.
On 2 December 2015 at 22:35, kostali hassan <me...@gmail.com> wrote:
> I fixed it, but there is still a small problem with the 30-second timeout
> of the WAMP server: I can only put about 130 files at a time in a directory
> to index, until all my files are indexed. This is my function to index
> documents:

Again, not familiar with Solarium, and at this point you are probably
better off asking on a Solarium-specific list, but my guess is that
you need keepalive on the connection. It seems that Solarium's
ZendHttpServer does this:
http://wiki.solarium-project.org/index.php/V1:Client_adapters .

Regards,
Gora

Re: indexing rich data from directory using solarium

Posted by kostali hassan <me...@gmail.com>.
I fixed it, but there is still a small problem with the 30-second timeout of
the WAMP server: I can only put about 130 files at a time in a directory to
index, until all my files are indexed. This is my function to index documents:

App::import('Vendor','autoload',array('file'=>'solarium/vendor/autoload.php'));

public function indexDocument(){
    $config = array(
        "endpoint" => array("localhost" => array("host"=>"127.0.0.1",
            "port"=>"8983", "path"=>"/solr", "core"=>"demo",)
        ) );
    $start = microtime(true);

    if($_POST){
        // create a client instance
        $client = new Solarium\Client($config);
        $dossier=$this->request->data['User']['dossier'];
        $dir = new Folder($dossier);
        $files = $dir->find('.*\.*');

        $headers = array('Content-Type:multipart/form-data');

        foreach ($files as $file) {
            $file = new File($dir->pwd() . DS . $file);

            $query = $client->createExtract();
            $query->setFile($file->pwd());
            $query->setCommit(true);
            $query->setOmitHeader(false);

            $doc = $query->createDocument();
            $doc->id =$file->pwd();
            $doc->name = $file->name;
            $doc->title = $file->name();

            $query->setDocument($doc);

            $request = $client->createRequest($query);
            $request->addHeaders($headers);

            $result = $client->executeRequest($request);
        }

    }

    $this->set(compact('start'));
}
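
The 30-second limit mentioned above matches PHP's default max_execution_time, so it may be the PHP script rather than Solr that stops. A minimal sketch, assuming that is the cause: process the files in batches, restart the timer for each batch, and ask for a commit only on the last file of a batch instead of on every file. indexInBatches and the $indexOne callback are hypothetical names; the callback stands in for the Solarium extract call.

```php
<?php
// Hypothetical helper: calls $indexOne($file, $commit) for every file,
// in batches of $batchSize. $commit is true only for the last file of
// each batch. Returns the number of batches processed.
function indexInBatches(array $files, int $batchSize, callable $indexOne): int
{
    $batches = array_chunk($files, max(1, $batchSize));
    foreach ($batches as $batch) {
        // Restart PHP's execution timer for this batch (30 s per batch).
        set_time_limit(30);
        $last = count($batch) - 1;
        foreach ($batch as $i => $file) {
            $indexOne($file, $i === $last);
        }
    }
    return count($batches);
}

// Usage with a stand-in callback that only records what it was asked to do.
$log = array();
$count = indexInBatches(
    array('a.pdf', 'b.doc', 'c.docx', 'd.pdf', 'e.doc'),
    2,
    function ($file, $commit) use (&$log) {
        $log[] = $file . ($commit ? ' +commit' : '');
    }
);
```

Committing once per batch rather than once per document is a design choice here: each commit is comparatively expensive for Solr, so fewer commits usually means faster indexing.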



Re: indexing rich data from directory using solarium

Posted by kostali hassan <me...@gmail.com>.
Yes, I am sure, because I successfully posted the same documents (455 .doc,
.docx and PDF files in 18 seconds) with SimplePostTool.
But now I want to communicate directly with my Solr server using Solarium in
my CakePHP application. I think the only way to get the right encoding is to
set it in the header:
$headers = array('Content-Type:multipart/form-data');
I guess it will work as long as the indexing time does not exceed the
30-second timeout of the WAMP server.


Re: indexing rich data from directory using solarium

Posted by Gora Mohanty <go...@mimirtech.com>.
On 2 December 2015 at 21:55, kostali hassan <me...@gmail.com> wrote:
> Yes, there is an error in my Solr logs:
> SolrException URLDecoder: Invalid character encoding detected after
> position 79 of query string / form data (while parsing as UTF-8)
> This is my post on Stack Overflow:
> http://stackoverflow.com/questions/34017889/solrexception-urldecoder-invalid-character-encoding-detected-after-position-79

Looks like an encoding error all right. Are you very sure that you can
successfully POST the same document with SimplePostTool? If so, I would
guess that you are not using Solarium correctly, i.e., the PDF file is
getting POSTed such that Solr is getting the raw content rather than
the extracted content.

Regards,
Gora

Re: indexing rich data from directory using solarium

Posted by kostali hassan <me...@gmail.com>.
Yes, there is an error in my Solr logs:
SolrException URLDecoder: Invalid character encoding detected after
position 79 of query string / form data (while parsing as UTF-8)
This is my post on Stack Overflow:
http://stackoverflow.com/questions/34017889/solrexception-urldecoder-invalid-character-encoding-detected-after-position-79


Re: indexing rich data from directory using solarium

Posted by Gora Mohanty <go...@mimirtech.com>.
On 2 December 2015 at 17:16, kostali hassan <me...@gmail.com> wrote:
> Yes, that makes sense, thank you. But I want to understand why the same
> data is indexed fine from the shell using SimplePostTool on Windows:
>>
>> D:\solr\solr-5.3.1>java -classpath example\exampledocs\post.jar -Dauto=yes
>> -Dc=solr_docs_core -Ddata=files -Drecursive=yes
>> org.apache.solr.util.SimplePostTool D:\Lucene\document ;

That seems strange. Are you sure that you are posting the same PDF?
With SimplePostTool, you should be POSTing to the URL
/solr/update/extract?literal.id=myid , i.e., you need an option of
something like:
-Durl=http://localhost:8983/solr/update/extract?literal.id=myid in the
command line for SimplePostTool.

Likewise, I am not that familiar with Solarium. Are you sure that the
file is being POSTed to /solr/update/extract ? Are you seeing any
errors in your Solr logs?

Regards,
Gora
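
For reference, a URL of the shape described above can be assembled as follows. A minimal sketch: extractUrl is a hypothetical helper, the host, port and core are the values from the config earlier in the thread, and putting the core name in the path (/solr/demo/...) is an assumption about how that core is addressed.

```php
<?php
// Hypothetical helper: builds the URL of the extracting request handler
// with literal.id set to the document id. PHP_QUERY_RFC3986 makes
// http_build_query percent-encode reserved characters (spaces as %20).
function extractUrl(string $base, string $core, string $id, bool $commit): string
{
    $params = array(
        'literal.id' => $id,
        'commit'     => $commit ? 'true' : 'false',
    );
    $query = http_build_query($params, '', '&', PHP_QUERY_RFC3986);
    return rtrim($base, '/') . '/' . rawurlencode($core) . '/update/extract?' . $query;
}

$url = extractUrl('http://127.0.0.1:8983/solr', 'demo', 'D:\\Lucene\\document\\a b.pdf', true);
```

Letting http_build_query do the escaping keeps characters such as the colon, backslash and space in a Windows path out of the raw query string.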

Re: indexing rich data from directory using solarium

Posted by kostali hassan <me...@gmail.com>.
Yes, that makes sense, thank you. But I want to understand why the same data
is indexed fine from the shell using SimplePostTool on Windows:
>
> D:\solr\solr-5.3.1>java -classpath example\exampledocs\post.jar -Dauto=yes
> -Dc=solr_docs_core -Ddata=files -Drecursive=yes
> org.apache.solr.util.SimplePostTool D:\Lucene\document ;




Re: indexing rich data from directory using solarium

Posted by Gora Mohanty <go...@mimirtech.com>.
On 2 December 2015 at 16:32, kostali hassan <me...@gmail.com> wrote:
[...]
>
> When I execute it I get this error:
>
> org.apache.solr.common.SolrException: URLDecoder: Invalid character
> encoding detected after position 79 of query string / form data (while
> parsing as UTF-8)

Solr expects UTF-8 data. Your documents are probably in some different
encoding. You will need to figure out what the encoding is, and how to
convert it to UTF-8.

Regards,
Gora
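
The error message refers to the query string / form data, so one thing worth checking is whether the parameter values sent with the request (for example a file path used as the id) are valid UTF-8, independently of the document contents. A minimal sketch, assuming the mbstring extension is available and that non-UTF-8 values are in ISO-8859-1; both the helper name toUtf8 and the source encoding are assumptions.

```php
<?php
// Hypothetical helper: returns $value unchanged when it is already valid
// UTF-8, otherwise converts it from $assumedEncoding (a guess; adjust it
// to whatever encoding the file system actually uses).
function toUtf8(string $value, string $assumedEncoding = 'ISO-8859-1'): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    return mb_convert_encoding($value, 'UTF-8', $assumedEncoding);
}

// "résumé.pdf" encoded as ISO-8859-1: the byte 0xE9 alone is not valid UTF-8.
$latin1 = "r\xE9sum\xE9.pdf";
$utf8 = toUtf8($latin1);
```

Values that are already UTF-8 pass through untouched, so the helper can be applied to every id or field value before the request is built.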