Posted to dev@lucy.apache.org by Peter Karman <pe...@peknet.com> on 2010/01/25 21:11:40 UTC

Re: [Lucy] Re: Invalid UTF-8

Marvin Humphrey wrote on 01/25/2010 11:48 AM:

> 
> It would be interesting to see a hexdump of "lextemp" starting at byte 12464.
> That's where the PostingPool run starts.  The combining sequence that triggers
> the exception starts two bytes later, at 12466.

$ hexdump -C -s 12464 -n 16 sources.index.ks/seg_1/lextemp
000030b0  00 00 1f 00 00 00 c1 5c  3c 20 62 20 3e 20 57 69  |.......\< b > Wi|

the sequence c1 5c 3c 20 looks odd to me. It's definitely not UTF-8.

[... /me debugs ... hours pass ...]

the problem is in libswish3, not KinoSearch or the Search::Tools or the
original docs.

Thanks for the tips on how UTF-8 works in KS, though. It was helpful.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: Invalid UTF-8

Posted by Peter Karman <pe...@peknet.com>.
Peter Karman wrote on 1/26/10 8:24 PM:

> 
> However, when I try to build on the two Linux boxen I have (32 and 64) 
> with most recent KS trunk I get this:
> 
> Initializing Charmonizer/Core/OperatingSystem...
> Trying to find a bit-bucket a la /dev/null...
> Creating compiler object...
> Trying to compile a small test file...
> _charm_run.c: In function 'main':
> _charm_run.c:26: error: expected expression before '/' token
> _charm_run.c:26: error: too few arguments to function 'freopen'
> _charm_run.c:27: error: expected expression before '/' token
> _charm_run.c:27: error: too few arguments to function 'freopen'
> failed to compile _charm_run helper utility
> Failed to write charmony.h at buildlib/KinoSearch/Build.pm line 183.
> make: *** [all] Error 25
> 
> 
> could one of the changes you committed in the last 48 hours have caused 
> that?
> 

same with Mac 10.6.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [Lucy] Index-time RAM consumption settings (was Invalid UTF-8)

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 2/1/10 1:52 PM:

> FWIW, once we fix SortWriter's RAM consumption problem, we'll go back to being
> relatively parsimonious with process RAM.

if that proves true, I think it's a non-issue.

I only raise the flag because Xapian has such a dial -- an env var that sets a
flush threshold -- which is often mentioned as a way to trade indexing speed
against memory use. From what I've seen of KS, indexing speed is much faster and
memory use much lower anyway, so I'm not worrying about it.

Thanks for the detailed reply re: the issues involved.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Index-time RAM consumption settings (was Invalid UTF-8)

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Wed, Jan 27, 2010 at 10:43:22PM -0600, Peter Karman wrote:

> Is there a way, or any plan, to make the DEFAULT_MEM_THRESH alterable at runtime?

I've made it settable privately so that we could go back to simulating large
indexes within the test suite. But as a public API?  

Well, here's the problem.  It's an implementation detail, specific to
PostingListWriter.  I'm just about to add another, separate SortExternal pool
in SortWriter, which will have its own threshold at which it flushes runs to
disk.  More generally, arbitrary index components added using custom
Architectures might have their own pools and their own thresholds.  How would
setting a default memory threshold for one affect the others?

I don't think it makes sense to expose any of those thresholds specifically.
Lucene has historically exposed all kinds of extra optimization settings via
IndexWriter, which go stale as the underlying implementation changes, bloating
IndexWriter's API and causing confusion:

  setMergeFactor()
  setMaxMergeDocs() 
  setMaxBufferedDocs() 
  setMergePolicy() 
  setMergeScheduler() 
  setRAMBufferSizeMB()
  
And so on.  I think that's sub-optimal design for a number of reasons, and I
think it's important that Lucy *not* go down the same road.

> I'm assuming that in situations where available ram is low, it would be
> helpful to trade-off speed for memory by setting the threshold lower and
> flushing to disk more often. Is that a realistic assumption?

If we were to do something like that, it would be one dial, and instead of
Indexer it would go into IndexManager, where we hide all expert per-session
settings.  Rather than an absolute number, it would be a float multiplier
defaulting to 1.0 which all index components would have the option of
consulting.  PostingListWriter would use it to scale its memory threshold.

However, it would not cap memory usage.  It wouldn't be like specifying a JVM
heap size.  And performance will still depend to a large extent on the size of
the index and the RAM installed in the machine, since speed will dive if our
temp files get ejected from the IO cache.

FWIW, once we fix SortWriter's RAM consumption problem, we'll go back to being
relatively parsimonious with process RAM.

Marvin Humphrey


Re: Invalid UTF-8

Posted by Peter Karman <pe...@peknet.com>.
Peter Karman wrote on 1/27/10 10:43 PM:

> The OSX behaviour was weird. First time it segfaulted. Ran it again 
> under gdb and it completed ok. Ran it again without gdb and I got this:
> 

ignore these complaints. seems my os and/or fs was/is seriously fscked.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: Invalid UTF-8

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 1/27/10 6:41 PM:
> On Tue, Jan 26, 2010 at 07:15:16PM -0800, Marvin Humphrey wrote:
> 
>> Yup, I've now duplicated the problem on my system using 60,000 docs.  
> 
> Fixed by r5764.

cool. thanks for digging in.

I have tested it under RHEL (works great with ~90k docs, 2g of data) and OSX 
10.6 (where it fails, see below), both 64-bit arch.

The OSX behaviour was weird. First time it segfaulted. Ran it again under gdb 
and it completed ok. Ran it again without gdb and I got this:

[karpet@pekmac:~/tmp]$ perl ks-test.pl swishdocs2/
Crawled 1000000 documents
Read past EOF of '/Volumes/users/karpet/tmp/test-ks-utf8/seg_2/ptemp-4284913-to-4383411'
(offset: 4284913 len: 98498), S_refill at ../core/KinoSearch/Store/InStream.c line 145
  at ks-test.pl line 65


Using same test script as I posted before, with 1m docs instead of 33k.

> 
>> I bet I can get that way down by fiddling with the flush threshold.
> 
> Ultimately, I was able to isolate the trigger to a single document with two fields, by
> bringing the threshold at which PostingListWriter flushes all of its
> PostingPools way, way down:
> 
> -#define DEFAULT_MEM_THRESH 0x1000000
> +/* #define DEFAULT_MEM_THRESH 0x1000000 */
> +#define DEFAULT_MEM_THRESH 0x10
> 
> When that variable lived in Perl, the KinoSearch::Test module used to set it
> to a much smaller number at load time.  This had the effect of simulating
> large indexes as far as PostingListWriter was concerned, by forcing runs to be
> flushed many many times.  However, it turns out that we have been doing
> without that important simulation for a long time -- the entire KS test suite
> was not triggering a PostingPool flush even once.  I'm a little surprised that
> after all the refactoring I did on this code recently, there was only a single
> glitch that needed to be fixed.  
> 
> Now even if I set the threshold to 0x100, the whole test suite passes.
> 

this is good and interesting to know. Is there a way, or any plan, to make the
DEFAULT_MEM_THRESH alterable at runtime? I'm assuming that in situations where
available ram is low, it would be helpful to trade-off speed for memory by 
setting the threshold lower and flushing to disk more often. Is that a realistic 
assumption?


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [KinoSearch] Invalid UTF-8

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Jan 26, 2010 at 07:15:16PM -0800, Marvin Humphrey wrote:

> Yup, I've now duplicated the problem on my system using 60,000 docs.  

Fixed by r5764.

> I bet I can get that way down by fiddling with the flush threshold.

Ultimately, I was able to isolate the trigger to a single document with two fields, by
bringing the threshold at which PostingListWriter flushes all of its
PostingPools way, way down:

-#define DEFAULT_MEM_THRESH 0x1000000
+/* #define DEFAULT_MEM_THRESH 0x1000000 */
+#define DEFAULT_MEM_THRESH 0x10

When that variable lived in Perl, the KinoSearch::Test module used to set it
to a much smaller number at load time.  This had the effect of simulating
large indexes as far as PostingListWriter was concerned, by forcing runs to be
flushed many many times.  However, it turns out that we have been doing
without that important simulation for a long time -- the entire KS test suite
was not triggering a PostingPool flush even once.  I'm a little surprised that
after all the refactoring I did on this code recently, there was only a single
glitch that needed to be fixed.  

Now even if I set the threshold to 0x100, the whole test suite passes.

Marvin Humphrey


Re: Invalid UTF-8

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Jan 26, 2010 at 07:07:10PM -0800, Marvin Humphrey wrote:

> We might have to add more docs

Yup, I've now duplicated the problem on my system using 60,000 docs.  

I bet I can get that way down by fiddling with the flush threshold.

Marvin Humphrey

Re: Invalid UTF-8

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Jan 26, 2010 at 08:24:06PM -0600, Peter Karman wrote:

> >Before we go further, what kind of system are you having trouble on?  Is it
> >a 64-bit box?
> 
> yes, 64-bit. Tested on both RHEL 4 and Mac 10.6.

OK, I don't know that this is a 64-bit problem, but I believe that the flushes
would happen on a different schedule under 64-bit.  It's good that you can
duplicate this on multiple systems.

We might have to add more docs.  Investigating...

Marvin Humphrey


Re: [Lucy] Re: Invalid UTF-8

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 1/26/10 8:54 PM:

> Fixed by r5760.  
> 

ack. compiles fine now.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: Invalid UTF-8

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Jan 26, 2010 at 08:24:06PM -0600, Peter Karman wrote:

> failed to compile _charm_run helper utility
> Failed to write charmony.h at buildlib/KinoSearch/Build.pm line 183.
> make: *** [all] Error 25
> 
> could one of the changes you committed in the last 48 hours have caused 
> that?

Fixed by r5760.  

I finished the METAQUOTE -> QUOTE transition in Charmonizer. This was just a
glitch.

Marvin Humphrey


Re: Invalid UTF-8

Posted by Peter Karman <pe...@peknet.com>.
Marvin Humphrey wrote on 1/26/10 8:03 PM:

>>  perl docmaker.pl \
>>     --utf_factor=0 \
>>     --write_files \
>>     --tmp_dir path/to/my/testdocs/ \
>>     --max_files 33000 \
>>     --max_words 3 \
>>     --tmp_dir_segments 2
> 
> I wonder whether this produces the same corpus on my OS X 10.5.8 MBPro as on
> your system.

no, definitely different. docmaker.pl creates random strings based on your 
system dictionary.


> No matter what, I see the following output:
> 
> marvin@smokey:~/projects/ks/perl $ rm -rf test-ks-utf8/ ; perl -Mblib karpet_utf8_test.pl testdocs/
> Crawled 33000 documents
> marvin@smokey:~/projects/ks/perl $ 
> 

damn.

> 
> Before we go further, what kind of system are you having trouble on?  Is it a
> 64-bit box?

yes, 64-bit. Tested on both RHEL 4 and Mac 10.6.

However, when I try to build on the two Linux boxen I have (32 and 64) with most 
recent KS trunk I get this:

Initializing Charmonizer/Core/OperatingSystem...
Trying to find a bit-bucket a la /dev/null...
Creating compiler object...
Trying to compile a small test file...
_charm_run.c: In function 'main':
_charm_run.c:26: error: expected expression before '/' token
_charm_run.c:26: error: too few arguments to function 'freopen'
_charm_run.c:27: error: expected expression before '/' token
_charm_run.c:27: error: too few arguments to function 'freopen'
failed to compile _charm_run helper utility
Failed to write charmony.h at buildlib/KinoSearch/Build.pm line 183.
make: *** [all] Error 25


could one of the changes you committed in the last 48 hours have caused that?

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: Invalid UTF-8

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Tue, Jan 26, 2010 at 12:09:20AM -0600, Peter Karman wrote:
> Here's the test case.

Thanks for the hard work building this case.

>  perl docmaker.pl \
>     --utf_factor=0 \
>     --write_files \
>     --tmp_dir path/to/my/testdocs/ \
>     --max_files 33000 \
>     --max_words 3 \
>     --tmp_dir_segments 2

I wonder whether this produces the same corpus on my OS X 10.5.8 MBPro as on
your system.

> there appears to be something magical in the *total number* of terms parsed.

Might have something to do with when runs are flushed.

> Here are some things I notice.
> 
> 1) if I comment out the swishwordnum and swishdescription in parse_file() 
> it works.
> 
> 2) if I comment out the swishdescription alone, it fails.
> 
> 3) if I comment out the swishwordnum alone, it fails.

I tried out all four possible permutations of swishwordnum and
swishdescription:

         swishdescription  => "",  # yes, empty
         swishwordnum      => 0,   # yes, zero

         #swishdescription  => "",  # yes, empty
         swishwordnum      => 0,   # yes, zero
    
         swishdescription  => "",  # yes, empty
         #swishwordnum      => 0,   # yes, zero

         #swishdescription  => "",  # yes, empty
         #swishwordnum      => 0,   # yes, zero

No matter what, I see the following output:

marvin@smokey:~/projects/ks/perl $ rm -rf test-ks-utf8/ ; perl -Mblib karpet_utf8_test.pl testdocs/
Crawled 33000 documents
marvin@smokey:~/projects/ks/perl $ 


Before we go further, what kind of system are you having trouble on?  Is it a
64-bit box?

Marvin Humphrey


Re: Invalid UTF-8

Posted by Peter Karman <pe...@peknet.com>.
Peter Karman wrote on 1/25/10 9:12 PM:

> I'll try and create a test case. I suspect it's going to be because I'm 
> using a lot of fields of various FieldType combinations.
> 

Here's the test case.

First, you need to create a corpus to test with. I use this script:

http://svn.swish-e.org/libswish3/trunk/perl/docmaker.pl

like this:

  perl docmaker.pl \
     --utf_factor=0 \
     --write_files \
     --tmp_dir path/to/my/testdocs/ \
     --max_files 33000 \
     --max_words 3 \
     --tmp_dir_segments 2

You could also make fewer files with more words in them. Or use a different corpus
altogether. But there appears to be something magical in the *total number* of 
terms parsed.

Second, here's the test script:

--------------------8<------------------------
#!/usr/bin/env perl
use strict;
use warnings;

use File::Find;
use File::Slurp;
use Data::Dump qw( dump );
use KinoSearch::Indexer;
use KinoSearch::Schema;
use KinoSearch::Analysis::PolyAnalyzer;
use KinoSearch::FieldType::FullTextType;
use KinoSearch::FieldType::StringType;

my $usage = "$0 path/to/files\n";
die $usage unless @ARGV;

my $path_to_index = 'test-ks-utf8';
my $lang          = 'en';
my $schema        = KinoSearch::Schema->new();
my $analyzer  = KinoSearch::Analysis::PolyAnalyzer->new( language => $lang, );
my $fieldtype = KinoSearch::FieldType::FullTextType->new(
     analyzer      => $analyzer,
     highlightable => 1,
     sortable      => 1,
);
my $stringtype = KinoSearch::FieldType::StringType->new( sortable => 1, );
$schema->spec_field(
     name => 'swishtitle',
     type => $fieldtype,
);
$schema->spec_field(
     name => 'swishdefault',
     type => $fieldtype,
);

for my $property_name (
     qw(
     swishdescription
     swishdocpath
     swishdocsize
     swishencoding
     swishlastmodified
     swishmime
     swishparser
     swishwordnum
     )
     )
{
     $schema->spec_field(
         name => $property_name,
         type => $stringtype,
     );
}

my $indexer = KinoSearch::Indexer->new(
     schema => $schema,
     index  => $path_to_index,
     create => 1,
);

my $count = 0;

find( { wanted => \&wanted, no_chdir => 1 }, @ARGV );
print "Crawled $count documents\n";
$indexer->commit();

sub wanted {
     my $filename = $File::Find::name;
     return unless $filename =~ m/\.xml/;
     my $doc = parse_file($filename);

     #warn dump $doc;

     $indexer->add_doc($doc);
     $count++;
}

sub parse_file {
     my $file = shift;
     my $buf  = read_file($file);
     $buf =~ s/<.+?>//sg;
     return {
         swishtitle        => "",  # yes, empty
         swishdescription  => "",  # yes, empty
         swishdefault      => $buf,
         swishlastmodified => ( stat($file) )[9],
         swishdocsize      => ( stat($file) )[7],
         swishparser       => 'XML',
         swishmime         => 'application/xml',
         swishencoding     => 'utf-8',
         swishdocpath      => $file,
         swishwordnum      => 0,   # yes, zero
     };
}
--------------------8<------------------------

Here are some things I notice.

1) if I comment out the swishwordnum and swishdescription in parse_file() it works.

2) if I comment out the swishdescription alone, it fails.

3) if I comment out the swishwordnum alone, it fails.

I'm all in for tonight, but hopefully this can help expose what's going on,
either with my code or in KS.

cheers,
pek
-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com

Re: [Lucy] Re: Invalid UTF-8

Posted by Peter Karman <pe...@peknet.com>.
Peter Karman wrote on 1/25/10 2:11 PM:

> the problem is in libswish3, not KinoSearch or the Search::Tools or the
> original docs.

well, there *was* a problem in libswish3, but it was not the problem causing 
this issue. I still get the error, even with the double-check using the 
utf8_valid() code you suggested. I can even reproduce the problem on a directory 
full of ascii-only files, using only ~4k docs.

I'll try and create a test case. I suspect it's going to be because I'm using a 
lot of fields of various FieldType combinations.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com