Posted to users@solr.apache.org by Dan Armbrust <da...@gmail.com> on 2022/03/03 23:05:05 UTC

question on filesize limits for json

Hi,

I'm experimenting with Solr and indexing schemaless JSON content.

I'm using the latest docker image of Solr, and just testing various things.

Indexing and querying work as I would expect for documents of reasonable size.

However, if I ask it to index a document that is ~100MB, I'm unable to query any results 
from this document.

Yet, I can't find any indication that there was an error in indexing the document.

Indexing:

curl -vv 
'http://localhost:8983/solr/gettingstarted/update/json/docs?f=/docs/**&commit=true' -H 
'Content-type: application/json' -d @837-10000-2022010415135.json
*   Trying 127.0.0.1:8983...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8983 (#0)
 > POST /solr/gettingstarted/update/json/docs?f=/docs/**&commit=true HTTP/1.1
 > Host: localhost:8983
 > User-Agent: curl/7.68.0
 > Accept: */*
 > Content-type:application/json
 > Content-Length: 97581522
 > Expect: 100-continue
 >
* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Security-Policy: default-src 'none'; base-uri 'none'; connect-src 'self'; 
form-action 'self'; font-src 'self'; frame-ancestors 'none'; img-src 'self'; media-src 
'self'; style-src '
self' 'unsafe-inline'; script-src 'self'; worker-src 'self';
< X-Content-Type-Options: nosniff
< X-Frame-Options: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
< Content-Type: text/plain;charset=utf-8
< Vary: Accept-Encoding, User-Agent
< Content-Length: 57
<
{
  "responseHeader":{
    "status":0,
    "QTime":376}}
* Connection #0 to host localhost left intact


No errors are logged in the log file:

2022-03-03 22:57:00.412 INFO (searcherExecutor-26-thread-1-processing-x:gettingstarted) [ 
x:gettingstarted] o.a.s.c.QuerySenderListener QuerySenderListener done.
2022-03-03 22:57:00.414 INFO (searcherExecutor-26-thread-1-processing-x:gettingstarted) [ 
x:gettingstarted] o.a.s.c.SolrCore [gettingstarted] Registered new searcher autowarm time: 
0 ms
2022-03-03 22:57:00.414 INFO  (qtp1515403487-49) [ x:gettingstarted] 
o.a.s.u.p.LogUpdateProcessorFactory [gettingstarted]  webapp=/solr path=/update/json/docs 
params={f=/docs/**&commit=true}{add=[fbb18697-d823-46e8-8571-6dde6750634b 
(1726321231525838848)],commit=} 0 369

The "Num Docs" reported in the solr GUI increases each time I do this.

A query for everything (*:*) gives me the correct doc count.

But no matter what I query for, I cannot get a result from inside the large document.  Am 
I hitting some limit that is silently messing up the indexing and/or the query return?

Thanks,

Dan





Re: question on filesize limits for json

Posted by Eric Pugh <ep...@opensourceconnections.com>.
I’d love to see some documentation improvements; please tag me for review.

There is this document, but I don’t love it: https://cwiki.apache.org/confluence/display/SOLR/HowToContribute

So…. Let me try to describe how you would contribute a documentation fix!

1) Go ahead and fork the GitHub.com/apache/solr project into your own account.  For me, I have GitHub.com/epugh/solr

2) Clone your repo to your local environment: git clone https://github.com/epugh/solr.git

3) Open up a JIRA ticket for your patch on the Solr JIRA

4) Make a branch for your fix via git checkout -B YOUR_JIRA_NUMBER

5) Make sure you can build Solr Ref Guide via ./gradlew buildLocalSite

6) This will output the docs in solr/solr-ref-guide/build/site/index.html

7) Edit the .adoc files that you need to in solr/solr-ref-guide/modules

8) Rebuild the site after your changes, check it, edit it as needed till you are done!

9) Commit your changes…  GitHub Desktop works great.

10) Go to GitHub.com/apache/solr and you will be prompted to submit your branch as a Pull Request to the main project.

11) Tag Eric to review ;-)
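
The git-related steps above, condensed into a shell sketch (the ticket number and fork URL are placeholders; the commands are printed rather than run, since they need a real fork and network access):

```shell
TICKET="SOLR-12345"                      # hypothetical JIRA ticket number
FORK="https://github.com/<you>/solr.git" # your fork of apache/solr

# Assemble the workflow as text so it can be reviewed before running:
STEPS="git clone $FORK
cd solr
git checkout -B $TICKET
./gradlew buildLocalSite   # docs land in solr/solr-ref-guide/build/site/
# edit .adoc files under solr/solr-ref-guide/modules, rebuild, re-check
git commit -am \"$TICKET: describe the docs fix\"
git push origin $TICKET"

printf '%s\n' "$STEPS"
```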



> On Mar 4, 2022, at 4:02 PM, Dan Armbrust <da...@gmail.com> wrote:
> [Dan's reply quoted in full; trimmed here, as it appears as its own message below]

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>	
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.


Re: question on filesize limits for json

Posted by Dan Armbrust <da...@gmail.com>.
On 3/4/22 1:33 PM, 6harat wrote:
> Hi,
> I am not able to hit any limit in terms of uploading a 100MB file and was able to search 
> the relevant fields inside the doc too.
>
>
> I hope:
> 1. The json file that you are trying to upload has a root level key named "docs"
> 2. You are not trying to fetch the entire document when using solr admin UI.
>

Thanks for this - #1 - I had a copy/paste error in my load script, and didn't intend to 
have that /docs in the wildcard string.

That said, there may be an opportunity to improve this documentation a bit:
https://solr.apache.org/guide/6_6/transforming-and-indexing-custom-json.html

Until now, I didn't realize that that is what f=/docs/** was doing in this context - 
relating to the structure of the JSON itself.  In retrospect, it's pretty confusing to use 
/docs in the example there, when the REST endpoint also uses /docs and Solr calls the 
things it indexes documents as well.  Plus, the curl example given has JSON content that 
doesn't contain a root-level key of docs.

How do you typically handle documentation updates?  I think this section would be much 
clearer if f=/docs/** were changed to something else, and if the JSON example given 
actually demonstrated this.
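
For what it's worth, here is roughly the shape such an example could show (a sketch with made-up field names; the curl line is commented out since it assumes a running local Solr):

```shell
# JSON with a root-level "docs" key, so the /docs/** wildcard in the
# f parameter actually has fields to match (field names are made up):
cat > sample.json <<'EOF'
{
  "docs": {
    "id": "doc-1",
    "title": "first",
    "meta": { "lang": "en" }
  }
}
EOF

# curl 'http://localhost:8983/solr/gettingstarted/update/json/docs?f=/docs/**&commit=true' \
#      -H 'Content-type: application/json' -d @sample.json

grep -q '"docs"' sample.json && echo "root docs key present"
```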

I can deal with the scaling issues from the hit on the monster document; this is just 
proof of concept / demo ware.

Thanks,

Dan



Re: question on filesize limits for json

Posted by 6harat <bh...@gmail.com>.
Hi,
I am not able to hit any limit in terms of uploading a 100MB file and was
able to search the relevant fields inside the doc too.

I hope:
1. The json file that you are trying to upload has a root level key named
"docs"
2. You are not trying to fetch the entire document when using solr admin UI.

The reason for stating point 2 is that, AFAIK, Solr makes all fields "stored" in
schemaless mode, so you are asking the admin UI to fetch the full 100MB payload, and it
will likely time out before that completes.

That being said, using Solr in this form may not be the ideal way to accomplish the
task. (I am not a Solr expert, so feel free to disagree if you know better.)
You can instead extract and index only the key parts of the information your users are
most likely to search, use Solr just to get the matching document IDs, and serve the
actual documents for rendering from outside Solr.
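
That split might look roughly like this (a sketch: the file names and fields are made up, and the curl line is commented out since it assumes a running local Solr):

```shell
# The full document stays in a store of your own (filesystem, DB, S3...):
cat > full-doc.json <<'EOF'
{ "id": "837-10000", "title": "Monster doc", "summary": "short abstract",
  "body": "...the ~100MB payload would live here..." }
EOF

# Only the fields users actually search go to Solr (field names made up):
cat > slim-doc.json <<'EOF'
{ "id": "837-10000", "title": "Monster doc", "summary": "short abstract" }
EOF

# curl 'http://localhost:8983/solr/gettingstarted/update/json/docs?commit=true' \
#      -H 'Content-type: application/json' -d @slim-doc.json
# A hit on the slim record gives you the id; render the full document
# from your own store instead of a 100MB stored field in Solr.
grep -q '"body"' slim-doc.json || echo "bulky field kept out of the index"
```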


On Fri, Mar 4, 2022 at 4:35 AM Dan Armbrust <da...@gmail.com>
wrote:

> [original question quoted in full; trimmed here, as it appears as the first message in this thread]

-- 
6harat
[solr enthusiast, not affiliated to core dev team]