Posted to solr-user@lucene.apache.org by Clemens Wyss DEV <cl...@mysign.ch> on 2015/05/08 15:54:05 UTC

determine "big" documents in the index?

Context: Solr/Lucene 5.1

Is there a way to determine documents that occupy a lot of "space" in the index? As I don't store any fields that have text, it must be the terms extracted from the documents that occupy the space.

So my question is: which documents occupy the most space in the inverted index?

Context:
I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect that for some PDFs the extracted text is not really text but "binary blobs". In order to verify this (and possibly omit these PDFs) I hope to get some hints from Solr/Lucene ;)

Re: determine "big" documents in the index?

Posted by Erick Erickson <er...@gmail.com>.
Oops, this may be a better link: http://lucidworks.com/blog/indexing-with-solrj/

On Fri, May 8, 2015 at 9:55 AM, Erick Erickson <er...@gmail.com> wrote:
> bq: has 30'860'099 terms. Is this "too much"
>
> Depends on how you indexed it. If you used shingles, then maybe, maybe
> not. If you just do normal text analysis, it's suspicious to say the
> least. There are about 300K words in the English language and you have
> 100X that. So either
> 1> you have a lot of legitimately unique terms, say part numbers,
> SKUs, etc. digits analyzed as text, whatever.
> 2> you have a lot of garbage in your input. OCR is notorious for this,
> as are binary blobs.
>
> The TermsComponent is your friend; it'll allow you to get an idea of
> what the actual terms are, though it does take a bit of poking around.
>
> There's no good way I know of to tell which docs are taking up space
> in the index. What I'd probably do is use Tika in a SolrJ client and
> look at the data as I send it; here's a place to start:
> https://lucidworks.com/blog/dev/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Fri, May 8, 2015 at 7:30 AM, Clemens Wyss DEV <cl...@mysign.ch> wrote:
>> One of my fields (the "phrase suggestion" field) has 30'860'099 terms. Is this "too much"?
>> Another field (the "single word suggestion") has 2'156'218 terms.
>>
>>
>>
>> -----Original Message-----
>> From: Clemens Wyss DEV [mailto:clemensdev@mysign.ch]
>> Sent: Friday, May 8, 2015 15:54
>> To: solr-user@lucene.apache.org
>> Subject: determine "big" documents in the index?
>>
>> Context: Solr/Lucene 5.1
>>
>> Is there a way to determine documents that occupy a lot of "space" in the index? As I don't store any fields that have text, it must be the terms extracted from the documents that occupy the space.
>>
>> So my question is: which documents occupy the most space in the inverted index?
>>
>> Context:
>> I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect that for some PDFs the extracted text is not really text but "binary blobs". In order to verify this (and possibly omit these PDFs) I hope to get some hints from Solr/Lucene ;)

Re: determine "big" documents in the index?

Posted by Erick Erickson <er...@gmail.com>.
1> Right, shingles (and you've set max size to 3) give a bazillion
possibilities, so the sky's the limit. It's usually smaller than that
since some patterns of words aren't very likely, but it's still a big
number. I'd really take a look at the terms that are actually indexed
with the TermsComponent or similar. Or perhaps run a test where you
_don't_ shingle and see what the cardinality of the field is. If a
large portion of your terms are garbage, it should be pretty obvious.
(There's a quick shingle-count demo below point 2>.)

2> No way that I know of to tell Tika "don't return suspicious stuff",
and I'm not up enough on the internals of Tika to say much. Perhaps
ask the Tika folks directly?
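
To put a rough number on the multiplication from point 1>: here's a
quick, untested snippet (plain Lucene analysis classes, written from
memory, so treat it as a sketch rather than gospel) that prints what
StandardTokenizer plus a maxShingleSize=3 ShingleFilter produces for a
single six-word sentence:

import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ShingleDemo {
  public static void main(String[] args) throws Exception {
    // Roughly the chain from the suggest_phrase field type:
    // StandardTokenizer + ShingleFilter(maxShingleSize=3, outputUnigrams=true)
    Tokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader("please divide this sentence into shingles"));
    ShingleFilter shingles = new ShingleFilter(tokenizer, 2, 3);
    shingles.setOutputUnigrams(true); // the default, spelled out for clarity

    CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
    shingles.reset();
    int count = 0;
    while (shingles.incrementToken()) {
      System.out.println(term.toString());
      count++;
    }
    shingles.end();
    shingles.close();
    // 6 words in -> 6 unigrams + 5 bigrams + 4 trigrams = 15 mostly-unique terms out
    System.out.println("terms: " + count);
  }
}

Run that over 7000 PDFs and a field with tens of millions of terms
isn't surprising at all, especially if a chunk of the "words" are
junk, because junk never repeats and every shingle containing it is
unique.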

I was thinking you'd use Tika in a client-side SolrJ program. Once
the parsing is done, you can get all the text Tika thinks is valid and
examine it to see whether it's "real". You might get some good
results from simply checking whether each word returned is longer
than some arbitrary length, or whether its codepoints fall outside
the range you expect, etc. Anything you do will be imperfect,
unfortunately.
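
For instance, something like this (untested, typed into mail; the
word-length cutoff, the "allowed" punctuation and the skip threshold
are all arbitrary placeholders you'd tune for your languages):

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class SuspiciousPdfCheck {

  public static void main(String[] args) throws Exception {
    try (InputStream in = new FileInputStream(args[0])) {
      BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
      Metadata metadata = new Metadata();
      new AutoDetectParser().parse(in, handler, metadata);
      String text = handler.toString();

      String[] words = text.split("\\s+");
      int suspicious = 0;
      for (String w : words) {
        if (looksLikeGarbage(w)) {
          suspicious++;
        }
      }
      double ratio = words.length == 0 ? 0 : (double) suspicious / words.length;
      System.out.printf("%s: %d/%d suspicious words (%.1f%%)%n",
          args[0], suspicious, words.length, 100 * ratio);
      // e.g. skip indexing when ratio > 0.5 -- the threshold is a guess.
    }
  }

  // Crude heuristics: overly long "words", or words dominated by
  // characters outside the codepoint ranges you expect.
  static boolean looksLikeGarbage(String w) {
    if (w.length() > 25) {
      return true;
    }
    int odd = 0;
    for (int i = 0; i < w.length(); i++) {
      char c = w.charAt(i);
      if (!Character.isLetterOrDigit(c) && "-'.,;:()".indexOf(c) < 0) {
        odd++;
      }
    }
    return w.length() > 0 && odd > w.length() / 2;
  }
}

Anything that flags a large fraction of its words is a good candidate
to skip, or at least to eyeball before indexing.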

Best,
Erick

On Sat, May 9, 2015 at 12:11 AM, Clemens Wyss DEV <cl...@mysign.ch> wrote:
>> If you used shingles
> I do:
>     <fieldType class="solr.TextField" name="suggest_phrase" positionIncrementGap="100">
>       <analyzer>
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
>       </analyzer>
>     </fieldType>
>
>>http://lucidworks.com/blog/indexing-with-solrj/
> This is more or less what I do
>
>>2> you have a lot of garbage in your input.
>>OCR is notorious for this,as are binary blobs.
> What does the AutoDetectParser return in the case of an OCR'd PDF? Can I "detect"/omit an OCR'd PDF?
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Friday, May 8, 2015 18:55
> To: solr-user@lucene.apache.org
> Subject: Re: determine "big" documents in the index?
>
> bq: has 30'860'099 terms. Is this "too much"
>
> Depends on how you indexed it. If you used shingles, then maybe, maybe not. If you just do normal text analysis, it's suspicious to say the least. There are about 300K words in the English language and you have 100X that. So either
> 1> you have a lot of legitimately unique terms, say part numbers,
> SKUs, etc. digits analyzed as text, whatever.
> 2> you have a lot of garbage in your input. OCR is notorious for this,
> as are binary blobs.
>
> The TermsComponent is your friend; it'll allow you to get an idea of what the actual terms are, though it does take a bit of poking around.
>
> There's no good way I know of to tell which docs are taking up space in the index. What I'd probably do is use Tika in a SolrJ client and look at the data as I send it; here's a place to start:
> https://lucidworks.com/blog/dev/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Fri, May 8, 2015 at 7:30 AM, Clemens Wyss DEV <cl...@mysign.ch> wrote:
>> One of my fields (the "phrase suggestion" field) has 30'860'099 terms. Is this "too much"?
>> Another field (the "single word suggestion") has 2'156'218 terms.
>>
>>
>>
>> -----Original Message-----
>> From: Clemens Wyss DEV [mailto:clemensdev@mysign.ch]
>> Sent: Friday, May 8, 2015 15:54
>> To: solr-user@lucene.apache.org
>> Subject: determine "big" documents in the index?
>>
>> Context: Solr/Lucene 5.1
>>
>> Is there a way to determine documents that occupy a lot of "space" in the index? As I don't store any fields that have text, it must be the terms extracted from the documents that occupy the space.
>>
>> So my question is: which documents occupy the most space in the inverted index?
>>
>> Context:
>> I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect
>> that for some PDFs the extracted text is not really text but "binary
>> blobs". In order to verify this (and possibly omit these PDFs) I hope
>> to get some hints from Solr/Lucene ;)

AW: determine "big" documents in the index?

Posted by Clemens Wyss DEV <cl...@mysign.ch>.
> If you used shingles
I do:
    <fieldType class="solr.TextField" name="suggest_phrase" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>    
      </analyzer>
    </fieldType>

>http://lucidworks.com/blog/indexing-with-solrj/
This is more or less what I do

>2> you have a lot of garbage in your input. 
>OCR is notorious for this,as are binary blobs.
What does the AutoDetectParser return in the case of an OCR'd PDF? Can I "detect"/omit an OCR'd PDF?

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Friday, May 8, 2015 18:55
To: solr-user@lucene.apache.org
Subject: Re: determine "big" documents in the index?

bq: has 30'860'099 terms. Is this "too much"

Depends on how you indexed it. If you used shingles, then maybe, maybe not. If you just do normal text analysis, it's suspicious to say the least. There are about 300K words in the English language and you have 100X that. So either
1> you have a lot of legitimately unique terms, say part numbers,
SKUs, etc. digits analyzed as text, whatever.
2> you have a lot of garbage in your input. OCR is notorious for this,
as are binary blobs.

The TermsComponent is your friend; it'll allow you to get an idea of what the actual terms are, though it does take a bit of poking around.

There's no good way I know of to tell which docs are taking up space in the index. What I'd probably do is use Tika in a SolrJ client and look at the data as I send it; here's a place to start:
https://lucidworks.com/blog/dev/2012/02/14/indexing-with-solrj/

Best,
Erick

On Fri, May 8, 2015 at 7:30 AM, Clemens Wyss DEV <cl...@mysign.ch> wrote:
> One of my fields (the "phrase suggestion" field) has 30'860'099 terms. Is this "too much"?
> Another field (the "single word suggestion") has 2'156'218 terms.
>
>
>
> -----Original Message-----
> From: Clemens Wyss DEV [mailto:clemensdev@mysign.ch]
> Sent: Friday, May 8, 2015 15:54
> To: solr-user@lucene.apache.org
> Subject: determine "big" documents in the index?
>
> Context: Solr/Lucene 5.1
>
> Is there a way to determine documents that occupy a lot of "space" in the index? As I don't store any fields that have text, it must be the terms extracted from the documents that occupy the space.
>
> So my question is: which documents occupy the most space in the inverted index?
>
> Context:
> I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect
> that for some PDFs the extracted text is not really text but "binary
> blobs". In order to verify this (and possibly omit these PDFs) I hope
> to get some hints from Solr/Lucene ;)

Re: determine "big" documents in the index?

Posted by Erick Erickson <er...@gmail.com>.
bq: has 30'860'099 terms. Is this "too much"

Depends on how you indexed it. If you used shingles, then maybe, maybe
not. If you just do normal text analysis, it's suspicious to say the
least. There are about 300K words in the English language and you have
100X that. So either
1> you have a lot of legitimately unique terms, say part numbers,
SKUs, etc. digits analyzed as text, whatever.
2> you have a lot of garbage in your input. OCR is notorious for this,
as are binary blobs.

The TermsComponent is your friend; it'll allow you to get an idea of
what the actual terms are, though it does take a bit of poking around.
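
If you want to do it from SolrJ rather than the browser, something
like this (untested, typed straight into mail, so treat it as a
sketch; the core URL and field name are placeholders) will dump the
top terms, assuming a /terms handler wired to the TermsComponent in
your solrconfig.xml:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.TermsResponse;

public class TopTerms {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycollection")) {
      SolrQuery query = new SolrQuery();
      query.setRequestHandler("/terms");      // route to the TermsComponent handler
      query.setTerms(true);
      query.addTermsField("suggest_phrase");  // placeholder: the suspect field
      query.setTermsLimit(100);               // top 100 terms by document frequency

      QueryResponse rsp = solr.query(query);
      for (TermsResponse.Term t : rsp.getTermsResponse().getTerms("suggest_phrase")) {
        System.out.println(t.getTerm() + " -> " + t.getFrequency());
      }
    }
  }
}

If the top of that list is full of shingles built out of random
character runs, you'll know the garbage theory is right.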

There's no good way I know of to tell which docs are taking up space
in the index. What I'd probably do is use Tika in a SolrJ client and
look at the data as I send it; here's a place to start:
https://lucidworks.com/blog/dev/2012/02/14/indexing-with-solrj/
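
The shape of it is roughly this (again untested and from memory, not
lifted from that post, with placeholder URL and field names): parse on
the client with Tika, look at or filter the text, then build the
SolrInputDocument yourself:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaSolrJIndexer {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycollection")) {
      for (String path : args) {
        try (InputStream in = new FileInputStream(path)) {
          BodyContentHandler handler = new BodyContentHandler(-1);
          Metadata metadata = new Metadata();
          new AutoDetectParser().parse(in, handler, metadata);
          String text = handler.toString();

          // This is the point where you can look at (or log) the extracted
          // text and decide to skip documents that look like binary junk.

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", new File(path).getName());
          doc.addField("text", text); // field names are placeholders
          solr.add(doc);
        }
      }
      solr.commit();
    }
  }
}

Since the extraction happens in your own code, that loop is the
natural place to plug in whatever garbage checks you settle on, rather
than relying on the ExtractingRequestHandler server-side.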

Best,
Erick

On Fri, May 8, 2015 at 7:30 AM, Clemens Wyss DEV <cl...@mysign.ch> wrote:
> One of my fields (the "phrase suggestion" field) has 30'860'099 terms. Is this "too much"?
> Another field (the "single word suggestion") has 2'156'218 terms.
>
>
>
> -----Original Message-----
> From: Clemens Wyss DEV [mailto:clemensdev@mysign.ch]
> Sent: Friday, May 8, 2015 15:54
> To: solr-user@lucene.apache.org
> Subject: determine "big" documents in the index?
>
> Context: Solr/Lucene 5.1
>
> Is there a way to determine documents that occupy a lot of "space" in the index? As I don't store any fields that have text, it must be the terms extracted from the documents that occupy the space.
>
> So my question is: which documents occupy the most space in the inverted index?
>
> Context:
> I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect that for some PDFs the extracted text is not really text but "binary blobs". In order to verify this (and possibly omit these PDFs) I hope to get some hints from Solr/Lucene ;)

AW: determine "big" documents in the index?

Posted by Clemens Wyss DEV <cl...@mysign.ch>.
One of my fields (the "phrase suggestion" field) has 30'860'099 terms. Is this "too much"?
Another field (the "single word suggestion") has 2'156'218 terms.



-----Original Message-----
From: Clemens Wyss DEV [mailto:clemensdev@mysign.ch]
Sent: Friday, May 8, 2015 15:54
To: solr-user@lucene.apache.org
Subject: determine "big" documents in the index?

Context: Solr/Lucene 5.1

Is there a way to determine documents that occupy a lot of "space" in the index? As I don't store any fields that have text, it must be the terms extracted from the documents that occupy the space.

So my question is: which documents occupy the most space in the inverted index?

Context:
I index approx. 7000 PDFs (extracted with Tika) into my index. I suspect that for some PDFs the extracted text is not really text but "binary blobs". In order to verify this (and possibly omit these PDFs) I hope to get some hints from Solr/Lucene ;)