Posted to dev@lucy.apache.org by Peter Karman <pe...@peknet.com> on 2010/01/25 21:11:40 UTC

Re: [Lucy] Re: Invalid UTF-8

Marvin Humphrey wrote on 01/25/2010 11:48 AM:

> 
> It would be interesting to see a hexdump of "lextemp" starting at byte 12464.
> That's where the PostingPool run starts.  The combining sequence that triggers
> the exception starts two bytes later, at 12466.

$ hexdump -C -s 12464 -n 16 sources.index.ks/seg_1/lextemp
000030b0  00 00 1f 00 00 00 c1 5c  3c 20 62 20 3e 20 57 69  |.......\< b > Wi|

the sequence c1 5c 3c 20 looks odd to me. It's definitely not UTF-8.

[... /me debugs ... hours pass ...]

the problem is in libswish3, not KinoSearch or the Search::Tools or the
original docs.

Thanks for the tips on how UTF-8 works in KS, though. It was helpful.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: Invalid UTF-8

Posted by Peter Karman <pe...@peknet.com>.
Peter Karman wrote on 1/26/10 8:24 PM:

> 
> However, when I try to build on the two Linux boxen I have (32 and 64) 
> with most recent KS trunk I get this:
> 
> Initializing Charmonizer/Core/OperatingSystem...
> Trying to find a bit-bucket a la /dev/null...
> Creating compiler object...
> Trying to compile a small test file...
> _charm_run.c: In function 'main':
> _charm_run.c:26: error: expected expression before '/' token
> _charm_run.c:26: error: too few arguments to function 'freopen'
> _charm_run.c:27: error: expected expression before '/' token
> _charm_run.c:27: error: too few arguments to function 'freopen'
> failed to compile _charm_run helper utility
> Failed to write charmony.h at buildlib/KinoSearch/Build.pm line 183.
> make: *** [all] Error 25
> 
> 
> could one of the changes you committed in the last 48 hours have caused 
> that?
> 

same with Mac 10.6.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [Lucy] Index-time RAM consumption settings (was Invalid UTF-8)

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 2/1/10 1:52 PM:

> FWIW, once we fix SortWriter's RAM consumption problem, we'll go back to being
> relatively parsimonious with process RAM.

if that proves true, I think it's a non-issue.

I only raise the flag because Xapian has such a dial -- an env var that sets a
flush threshold -- which is often mentioned as a way to trade indexing speed
against memory use. From what I've seen of KS, indexing speed is much faster and
memory use much lower anyway, so I'm not worrying about it.

Thanks for the detailed reply re: the issues involved.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Index-time RAM consumption settings (was Invalid UTF-8)

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Jan 27, 2010 at 10:43:22PM -0600, Peter Karman wrote:

> Is there a way, or any plan, to make the DEFAULT_MEM_THRESH alterable at runtime?

I've made it settable privately so that we could go back to simulating large
indexes within the test suite. But as a public API?  

Well, here's the problem.  It's an implementation detail, specific to
PostingListWriter.  I'm just about to add another, separate SortExternal pool
in SortWriter, which will have its own threshold at which it flushes runs to
disk.  More generally, arbitrary index components added using custom
Architectures might have their own pools and their own thresholds.  How would
setting a default memory threshold for one affect the others?

I don't think it makes sense to expose any of those thresholds specifically.
Lucene has historically exposed all kinds of extra optimization settings via
IndexWriter, which go stale as the underlying implementation changes, bloating
IndexWriter's API and causing confusion:

  setMergeFactor()
  setMaxMergeDocs() 
  setMaxBufferedDocs() 
  setMergePolicy() 
  setMergeScheduler() 
  setRAMBufferSizeMB()
  
And so on.  I think that's sub-optimal design for a number of reasons, and I
think it's important that Lucy *not* go down the same road.

> I'm assuming that in situations where available ram is low, it would be
> helpful to trade-off speed for memory by setting the threshold lower and
> flushing to disk more often. Is that a realistic assumption?

If we were to do something like that, it would be one dial, and instead of
Indexer it would go into IndexManager, where we hide all expert per-session
settings.  Rather than an absolute number, it would be a float multiplier
defaulting to 1.0 which all index components would have the option of
consulting.  PostingListWriter would use it to scale its memory threshold.

However, it would not cap memory usage.  It wouldn't be like specifying a JVM
heap size.  And performance will still depend to a large extent on the size of
the index and the RAM installed in the machine, since speed will dive if our
temp files get ejected from the IO cache.

FWIW, once we fix SortWriter's RAM consumption problem, we'll go back to being
relatively parsimonious with process RAM.

Marvin Humphrey


Re: Invalid UTF-8

Posted by Peter Karman <pe...@peknet.com>.
Peter Karman wrote on 1/27/10 10:43 PM:

> The OSX behaviour was weird. First time it segfaulted. Ran it again 
> under gdb and it completed ok. Ran it again without gdb and I got this:
> 

ignore these complaints. seems my os and/or fs was/is seriously fscked.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: Invalid UTF-8

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 1/27/10 6:41 PM:
> On Tue, Jan 26, 2010 at 07:15:16PM -0800, Marvin Humphrey wrote:
> 
>> Yup, I've now duplicated the problem on my system using 60,000 docs.  
> 
> Fixed by r5764.

cool. thanks for digging in.

I have tested it under RHEL (works great with ~90k docs, 2g of data) and OSX 
10.6 (where it fails, see below), both 64-bit arch.

The OSX behaviour was weird. First time it segfaulted. Ran it again under gdb 
and it completed ok. Ran it again without gdb and I got this:

[karpet@pekmac:~/tmp]$ perl ks-test.pl swishdocs2/
Crawled 1000000 documents
Read past EOF of '/Volumes/users/karpet/tmp/test-ks-utf8/seg_2/ptemp-4284913-to-4383411'
(offset: 4284913 len: 98498), S_refill at ../core/KinoSearch/Store/InStream.c line 145
  at ks-test.pl line 65


Using same test script as I posted before, with 1m docs instead of 33k.

> 
>> I bet I can get that way down by fiddling with the flush threshold.
> 
> Ultimately, I was able to isolate the trigger to a single document with two fields, by
> bringing the threshold at which PostingListWriter flushes all of its
> PostingPools way, way down:
> 
> -#define DEFAULT_MEM_THRESH 0x1000000
> +/* #define DEFAULT_MEM_THRESH 0x1000000 */
> +#define DEFAULT_MEM_THRESH 0x10
> 
> When that variable lived in Perl, the KinoSearch::Test module used to set it
> to a much smaller number at load time.  This had the effect of simulating
> large indexes as far as PostingListWriter was concerned, by forcing runs to be
> flushed many many times.  However, it turns out that we have been doing
> without that important simulation for a long time -- the entire KS test suite
> was not triggering a PostingPool flush even once.  I'm a little surprised that
> after all the refactoring I did on this code recently, there was only a single
> glitch that needed to be fixed.  
> 
> Now even if I set the threshold to 0x100, the whole test suite passes.
> 

this is good and interesting to know. Is there a way, or any plan, to make the
DEFAULT_MEM_THRESH alterable at runtime? I'm assuming that in situations where
available ram is low, it would be helpful to trade-off speed for memory by 
setting the threshold lower and flushing to disk more often. Is that a realistic 
assumption?


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [KinoSearch] Invalid UTF-8

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Jan 26, 2010 at 07:15:16PM -0800, Marvin Humphrey wrote:

> Yup, I've now duplicated the problem on my system using 60,000 docs.  

Fixed by r5764.

> I bet I can get that way down by fiddling with the flush threshold.

Ultimately, I was able to isolate the trigger to a single document with two fields, by
bringing the threshold at which PostingListWriter flushes all of its
PostingPools way, way down:

-#define DEFAULT_MEM_THRESH 0x1000000
+/* #define DEFAULT_MEM_THRESH 0x1000000 */
+#define DEFAULT_MEM_THRESH 0x10

When that variable lived in Perl, the KinoSearch::Test module used to set it
to a much smaller number at load time.  This had the effect of simulating
large indexes as far as PostingListWriter was concerned, by forcing runs to be
flushed many many times.  However, it turns out that we have been doing
without that important simulation for a long time -- the entire KS test suite
was not triggering a PostingPool flush even once.  I'm a little surprised that
after all the refactoring I did on this code recently, there was only a single
glitch that needed to be fixed.  

Now even if I set the threshold to 0x100, the whole test suite passes.

Marvin Humphrey


Re: Invalid UTF-8

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Jan 26, 2010 at 07:07:10PM -0800, Marvin Humphrey wrote:

> We might have to add more docs

Yup, I've now duplicated the problem on my system using 60,000 docs.  

I bet I can get that way down by fiddling with the flush threshold.

Marvin Humphrey

Re: Invalid UTF-8

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Jan 26, 2010 at 08:24:06PM -0600, Peter Karman wrote:

> >Before we go further, what kind of system are you having trouble on?  Is it
> >a 64-bit box?
> 
> yes, 64-bit. Tested on both RHEL 4 and Mac 10.6.

OK, I don't know that this is a 64-bit problem, but I believe that the flushes
would happen on a different schedule under 64-bit.  It's good that you can
duplicate this on multiple systems.

We might have to add more docs.  Investigating...

Marvin Humphrey


Re: [Lucy] Re: Invalid UTF-8

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 1/26/10 8:54 PM:

> Fixed by r5760.  
> 

ack. compiles fine now.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: Invalid UTF-8

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Jan 26, 2010 at 08:24:06PM -0600, Peter Karman wrote:

> failed to compile _charm_run helper utility
> Failed to write charmony.h at buildlib/KinoSearch/Build.pm line 183.
> make: *** [all] Error 25
> 
> could one of the changes you committed in the last 48 hours have caused 
> that?

Fixed by r5760.  

I finished the METAQUOTE -> QUOTE transition in Charmonizer. This was just a
glitch.

Marvin Humphrey


Re: Invalid UTF-8

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 1/26/10 8:03 PM:

>>  perl docmaker.pl \
>>     --utf_factor=0 \
>>     --write_files \
>>     --tmp_dir path/to/my/testdocs/ \
>>     --max_files 33000 \
>>     --max_words 3 \
>>     --tmp_dir_segments 2
> 
> I wonder whether this produces the same corpus on my OS X 10.5.8 MBPro as on
> your system.

no, definitely different. docmaker.pl creates random strings based on your 
system dictionary.


> No matter what, I see the following output:
> 
> marvin@smokey:~/projects/ks/perl $ rm -rf test-ks-utf8/ ; perl -Mblib karpet_utf8_test.pl testdocs/
> Crawled 33000 documents
> marvin@smokey:~/projects/ks/perl $ 
> 

damn.

> 
> Before we go further, what kind of system are you having trouble on?  Is it a
> 64-bit box?

yes, 64-bit. Tested on both RHEL 4 and Mac 10.6.

However, when I try to build on the two Linux boxen I have (32 and 64) with most 
recent KS trunk I get this:

Initializing Charmonizer/Core/OperatingSystem...
Trying to find a bit-bucket a la /dev/null...
Creating compiler object...
Trying to compile a small test file...
_charm_run.c: In function 'main':
_charm_run.c:26: error: expected expression before '/' token
_charm_run.c:26: error: too few arguments to function 'freopen'
_charm_run.c:27: error: expected expression before '/' token
_charm_run.c:27: error: too few arguments to function 'freopen'
failed to compile _charm_run helper utility
Failed to write charmony.h at buildlib/KinoSearch/Build.pm line 183.
make: *** [all] Error 25


could one of the changes you committed in the last 48 hours have caused that?

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: Invalid UTF-8

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Jan 26, 2010 at 12:09:20AM -0600, Peter Karman wrote:
> Here's the test case.

Thanks for the hard work building this case.

>  perl docmaker.pl \
>     --utf_factor=0 \
>     --write_files \
>     --tmp_dir path/to/my/testdocs/ \
>     --max_files 33000 \
>     --max_words 3 \
>     --tmp_dir_segments 2

I wonder whether this produces the same corpus on my OS X 10.5.8 MBPro as on
your system.

> there appears to be something magical in the *total number* of terms parsed.

Might have something to do with when runs are flushed.

> Here are some things I notice.
> 
> 1) if I comment out the swishwordnum and swishdescription in parse_file() 
> it works.
> 
> 2) if I comment out the swishdescription alone, it fails.
> 
> 3) if I comment out the swishwordnum alone, it fails.

I tried out all four possible permutations of swishwordnum and
swishdescription:

         swishdescription  => "",  # yes, empty
         swishwordnum      => 0,   # yes, zero

         #swishdescription  => "",  # yes, empty
         swishwordnum      => 0,   # yes, zero
    
         swishdescription  => "",  # yes, empty
         #swishwordnum      => 0,   # yes, zero

         #swishdescription  => "",  # yes, empty
         #swishwordnum      => 0,   # yes, zero

No matter what, I see the following output:

marvin@smokey:~/projects/ks/perl $ rm -rf test-ks-utf8/ ; perl -Mblib karpet_utf8_test.pl testdocs/
Crawled 33000 documents
marvin@smokey:~/projects/ks/perl $ 


Before we go further, what kind of system are you having trouble on?  Is it a
64-bit box?

Marvin Humphrey


Re: Invalid UTF-8

Posted by Peter Karman <pe...@peknet.com>.
Peter Karman wrote on 1/25/10 9:12 PM:

> I'll try and create a test case. I suspect it's going to be because I'm 
> using a lot of fields of various FieldType combinations.
> 

Here's the test case.

First, you need to create a corpus to test with. I use this script:

http://svn.swish-e.org/libswish3/trunk/perl/docmaker.pl

like this:

  perl docmaker.pl \
     --utf_factor=0 \
     --write_files \
     --tmp_dir path/to/my/testdocs/ \
     --max_files 33000 \
     --max_words 3 \
     --tmp_dir_segments 2

You could also make fewer files with more words in them. Or use a different corpus
altogether. But there appears to be something magical in the *total number* of 
terms parsed.

Second, here's the test script:

--------------------8<------------------------
#!/usr/bin/env perl
use strict;
use warnings;

use File::Find;
use File::Slurp;
use Data::Dump qw( dump );
use KinoSearch::Indexer;
use KinoSearch::Schema;
use KinoSearch::Analysis::PolyAnalyzer;
use KinoSearch::FieldType::FullTextType;
use KinoSearch::FieldType::StringType;

my $usage = "$0 path/to/files\n";
die $usage unless @ARGV;

my $path_to_index = 'test-ks-utf8';
my $lang          = 'en';
my $schema        = KinoSearch::Schema->new();
my $analyzer  = KinoSearch::Analysis::PolyAnalyzer->new( language => $lang, );
my $fieldtype = KinoSearch::FieldType::FullTextType->new(
     analyzer      => $analyzer,
     highlightable => 1,
     sortable      => 1,
);
my $stringtype = KinoSearch::FieldType::StringType->new( sortable => 1, );
$schema->spec_field(
     name => 'swishtitle',
     type => $fieldtype,
);
$schema->spec_field(
     name => 'swishdefault',
     type => $fieldtype,
);

for my $property_name (
     qw(
     swishdescription
     swishdocpath
     swishdocsize
     swishencoding
     swishlastmodified
     swishmime
     swishparser
     swishwordnum
     )
     )
{
     $schema->spec_field(
         name => $property_name,
         type => $stringtype,
     );
}

my $indexer = KinoSearch::Indexer->new(
     schema => $schema,
     index  => $path_to_index,
     create => 1,
);

my $count = 0;

find( { wanted => \&wanted, no_chdir => 1 }, @ARGV );
print "Crawled $count documents\n";
$indexer->commit();

sub wanted {
     my $filename = $File::Find::name;
     return unless $filename =~ m/\.xml/;
     my $doc = parse_file($filename);

     #warn dump $doc;

     $indexer->add_doc($doc);
     $count++;
}

sub parse_file {
     my $file = shift;
     my $buf  = read_file($file);
     $buf =~ s/<.+?>//sg;
     return {
         swishtitle        => "",  # yes, empty
         swishdescription  => "",  # yes, empty
         swishdefault      => $buf,
         swishlastmodified => ( stat($file) )[9],
         swishdocsize      => ( stat($file) )[7],
         swishparser       => 'XML',
         swishmime         => 'application/xml',
         swishencoding     => 'utf-8',
         swishdocpath      => $file,
         swishwordnum      => 0,   # yes, zero
     };
}
--------------------8<------------------------

Here are some things I notice.

1) if I comment out the swishwordnum and swishdescription in parse_file() it works.

2) if I comment out the swishdescription alone, it fails.

3) if I comment out the swishwordnum alone, it fails.

I'm all in for tonight, but hopefully this can help expose what's going on,
either with my code or in KS.

cheers,
pek
-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [Lucy] Re: Invalid UTF-8

Posted by Peter Karman <pe...@peknet.com>.
Peter Karman wrote on 1/25/10 2:11 PM:

> the problem is in libswish3, not KinoSearch or the Search::Tools or the
> original docs.

well, there *was* a problem in libswish3, but it was not the problem causing 
this issue. I still get the error, even with the double-check using the 
utf8_valid() code you suggested. I can even reproduce the problem on a directory 
full of ascii-only files, using only ~4k docs.

I'll try and create a test case. I suspect it's going to be because I'm using a 
lot of fields of various FieldType combinations.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com