Posted to solr-user@lucene.apache.org by xu cheng <xc...@gmail.com> on 2010/11/16 07:25:56 UTC

encoding messy code

hi all:
I have configured an app with Solr to index documents, and some of the
documents contain Chinese content. I've set the Apache Tomcat URIEncoding to
UTF-8, and I use curl to send the documents in XML format.
However, when I query the documents, all the Chinese content comes back as
garbled characters. This has cost me a lot of time.
Does anyone have an idea about it?
thanks
By the way, the Tomcat version is 6.0.20 and Solr is 1.4.
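
A minimal sketch of an update post that declares its charset explicitly; the
URL, field name and sample text below are placeholders, not details from the
setup described above. With curl, the equivalent is sending the body with
--data-binary and an explicit -H "Content-Type: text/xml; charset=utf-8"
header, since omitting the charset lets the receiving side fall back to its
platform default encoding.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class PostUtf8Xml {
    public static void main(String[] args) throws Exception {
        String xml = "<add><doc>"
                + "<field name=\"id\">1</field>"
                // Chinese sample text, written as Unicode escapes to stay encoding-safe
                + "<field name=\"title\">\u4e2d\u6587</field>"
                + "</doc></add>";
        URL url = new URL("http://localhost:8983/solr/update");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setDoOutput(true);
        con.setRequestMethod("POST");
        // Declaring the charset is the important part; without it the server
        // may decode the body with its platform default encoding.
        con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = con.getOutputStream();
        out.write(xml.getBytes("UTF-8"));  // encode the request body as UTF-8
        out.close();
        System.out.println("HTTP " + con.getResponseCode());
    }
}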

Re: DIH full-import failure, no real error message

Posted by Erik Fäßler <er...@uni-jena.de>.
  Retrieval by ID would only be one possible use case; I'm still at the
beginning of the project, and I imagine adding more fields for more
complicated queries in the future. A "WHERE ... LIKE" query over all the
XML documents stored in a DBMS probably wouldn't perform very well ;)

And at a later stage I will process all these documents and add lots of
metadata - by then at the latest I will need a Lucene index rather than a
database. So I'd still be interested in ideas for solving my issue.

Regards,

     Erik

Am 16.11.2010 11:35, schrieb Dennis Gearon:
> Wow, if all you want is to retrieve by ID, a database would be fine, even a NO
> SQL database.
>
>
>   Dennis Gearon
>
>
> Signature Warning
> ----------------
> It is always a good idea to learn from your own mistakes. It is usually a better
> idea to learn from others’ mistakes, so you do not have to make them yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>
>
>
> ----- Original Message ----
> From: Erik Fäßler<er...@uni-jena.de>
> To: solr-user@lucene.apache.org
> Sent: Tue, November 16, 2010 12:33:28 AM
> Subject: DIH full-import failure, no real error message
>
> Hey all,
>
> I'm trying to create a Solr index for the 2010 Medline-baseline (www.pubmed.gov,
> over 18 million XML documents). My goal is to be able to retrieve single XML
> documents by their ID. Each document comes with a unique ID, the PubMedID. So my
> schema (important portions) looks like this:
>
> <field name="pmid" type="string" indexed="true" stored="true" required="true" />
> <field name="date" type="tdate" indexed="true" stored="true"/>
> <field name="xml" type="text" indexed="true" stored="true"/>
>
> <uniqueKey>pmid</uniqueKey>
> <defaultSearchField>pmid</defaultSearchField>
>
> pmid holds the ID, data hold the creation date; xml holds the whole XML document
> (mostly below 5kb). I used the DataImporter to do this. I had to write some
> classes (DataSource, EntityProcessor, DateFormatter) myself, so theoretically,
> the error could lie there.
>
> What happens is that indexing just looks fine at the beginning. Memory usage is
> quite below the maximum (max of 20g, usage of below 5g, most of the time around
> 3g). It goes several hours in this manner until it suddenly stopps. I tried this
> a few times with minor tweaks, non of which made any difference. The last time
> such a crash occurred, over 16.5 million documents already had been indexed
> (argh, so close...). It never stops at the same document and trying to index the
> documents, where the error occurred, just runs fine. Index size on disc was
> between 40g and 50g the last time I had a look.
>
> This is the log from beginning to end:
>
> (I decided to just attach the log for the sake of readability ;) ).
>
> As you can see, Solr's error message is not quite complete. There are no closing
> brackets. The document is cut in half on this message and not even the error
> message itself is complete: The 'D' of
> (D)ataImporter.runCmd(DataImporter.java:389) right after the document text is
> missing.
>
> I have one thought concerning this: I get the input documents as an InputStream
> which I read buffer-wise (at most 1000bytes per read() call). I need to deliver
> the documents in one large byte-Array to the XML parser I use (VTD XML).
> But I don't only get the individual small XML documents but always one larger
> XML blob with exactly 30,000 of these documents. I use a self-written
> EntityProcessor to extract the single documents from the larger blob. These
> blobs have a size of about 50 to 150mb. So what I do is to read these large
> blobs in 1000bytes steps and store each byte array in an ArrayList<byte[]>.
> Afterwards, I create the ultimate byte[] and do System.arraycopy from the
> ArrayList into the byte[].
> I tested this and it looks fine to me. And how I said, indexing the documents
> where the error occurred just works fine (that is, indexing the whole blob
> containing the single document). I just mention this because it kind of looks
> like there is this cut in the document and the missing 'D' reminds me of
> char-encoding errors. But I don't know for real, opening the error log in vi
> doesn't show any broken characters (the last time I had such problems, vi could
> identify the characters in question, other editors just wouldn't show them).
>
> Further ideas from my side: Is the index too big? I think I read something about
> a large index would be something around 10million documents, I aim to
> approximately double this number. But would this cause such an error? In the
> end: What exactly IS the error?
>
> Sorry for the lot of text, just trying to describe the problem as detailed as
> possible. Thanks a lot for reading and I appreciate any ideas! :)
>
> Best regards,
>
>      Erik
>


Re: DIH full-import failure, no real error message

Posted by Dennis Gearon <ge...@sbcglobal.net>.
Wow, if all you want is to retrieve by ID, a database would be fine, even a
NoSQL database.


 Dennis Gearon


Signature Warning
----------------
It is always a good idea to learn from your own mistakes. It is usually a better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



----- Original Message ----
From: Erik Fäßler <er...@uni-jena.de>
To: solr-user@lucene.apache.org
Sent: Tue, November 16, 2010 12:33:28 AM
Subject: DIH full-import failure, no real error message

Hey all,

I'm trying to create a Solr index for the 2010 Medline-baseline (www.pubmed.gov, 
over 18 million XML documents). My goal is to be able to retrieve single XML 
documents by their ID. Each document comes with a unique ID, the PubMedID. So my 
schema (important portions) looks like this:

<field name="pmid" type="string" indexed="true" stored="true" required="true" />
<field name="date" type="tdate" indexed="true" stored="true"/>
<field name="xml" type="text" indexed="true" stored="true"/>

<uniqueKey>pmid</uniqueKey>
<defaultSearchField>pmid</defaultSearchField>

pmid holds the ID, date holds the creation date; xml holds the whole XML document
(mostly below 5 KB). I used the DataImporter to do this. I had to write some
classes (DataSource, EntityProcessor, DateFormatter) myself, so theoretically,
the error could lie there.

What happens is that indexing looks just fine at the beginning. Memory usage is
well below the maximum (max of 20 GB, usage below 5 GB, most of the time around
3 GB). It goes on for several hours in this manner until it suddenly stops. I tried
this a few times with minor tweaks, none of which made any difference. The last
time such a crash occurred, over 16.5 million documents had already been indexed
(argh, so close...). It never stops at the same document, and indexing the
documents where the error occurred just runs fine. Index size on disk was
between 40 and 50 GB the last time I had a look.

This is the log from beginning to end:

(I decided to just attach the log for the sake of readability ;) ).

As you can see, Solr's error message is not quite complete. There are no closing
brackets. The document is cut in half in this message, and not even the error
message itself is complete: the 'D' of
(D)ataImporter.runCmd(DataImporter.java:389) right after the document text is
missing.

I have one thought concerning this: I get the input documents as an InputStream
which I read buffer-wise (at most 1000 bytes per read() call). I need to deliver
the documents in one large byte array to the XML parser I use (VTD-XML).
But I don't get the individual small XML documents on their own; I always get one
larger XML blob with exactly 30,000 of these documents. I use a self-written
EntityProcessor to extract the single documents from the larger blob. These
blobs have a size of about 50 to 150 MB. So what I do is read these large
blobs in 1000-byte steps and store each byte array in an ArrayList<byte[]>.
Afterwards, I create the final byte[] and System.arraycopy each element of the
ArrayList into it.
I tested this and it looks fine to me. And as I said, indexing the documents
where the error occurred just works fine (that is, indexing the whole blob
containing the single document). I only mention this because it kind of looks
like there is this cut in the document, and the missing 'D' reminds me of
character-encoding errors. But I don't know for sure; opening the error log in vi
doesn't show any broken characters (the last time I had such problems, vi could
identify the characters in question while other editors just wouldn't show them).

Further ideas from my side: Is the index too big? I think I read somewhere that
a large index is something around 10 million documents, and I aim to
approximately double that number. But would this cause such an error? And in the
end: what exactly IS the error?

Sorry for the wall of text; I'm just trying to describe the problem in as much
detail as possible. Thanks a lot for reading and I appreciate any ideas! :)

Best regards,

    Erik
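
A minimal sketch of the chunked read-and-reassemble step described above,
assuming nothing beyond an InputStream and the 1000-byte buffer mentioned in
the message; the class and method names are made up. The usual pitfall with
this pattern is keeping the full fixed-size buffer instead of only the bytes
that read() actually returned, so the sketch copies exactly n bytes per chunk.

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public final class BlobReader {
    public static byte[] readFully(InputStream in) throws IOException {
        List<byte[]> chunks = new ArrayList<byte[]>();
        int total = 0;
        byte[] buf = new byte[1000];
        int n;
        while ((n = in.read(buf)) != -1) {
            byte[] chunk = new byte[n];          // keep only the bytes actually read
            System.arraycopy(buf, 0, chunk, 0, n);
            chunks.add(chunk);
            total += n;
        }
        byte[] blob = new byte[total];
        int pos = 0;
        for (byte[] chunk : chunks) {            // stitch the chunks back together
            System.arraycopy(chunk, 0, blob, pos, chunk.length);
            pos += chunk.length;
        }
        return blob;                             // hand this to the XML parser
    }
}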


Re: DIH full-import failure, no real error message

Posted by Erik Fäßler <er...@uni-jena.de>.
Yes, I noticed just after sending the message.
My apologies!

Best,

Erik

Am 20.11.2010 um 00:32 schrieb Chris Hostetter <ho...@fucit.org>:

> 
> : Subject: DIH full-import failure, no real error message
> : References: <AA...@mail.gmail.com>
> : In-Reply-To: <AA...@mail.gmail.com>
> 
> http://people.apache.org/~hossman/#threadhijack
> Thread Hijacking on Mailing Lists
> 
> When starting a new discussion on a mailing list, please do not reply to 
> an existing message, instead start a fresh email.  Even if you change the 
> subject line of your email, other mail headers still track which thread 
> you replied to and your question is "hidden" in that thread and gets less 
> attention.   It makes following discussions in the mailing list archives 
> particularly difficult.
> 
> 
> -Hoss

Re: DIH full-import failure, no real error message

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: DIH full-import failure, no real error message
: References: <AA...@mail.gmail.com>
: In-Reply-To: <AA...@mail.gmail.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.


-Hoss

Re: DIH full-import failure, no real error message

Posted by Erik Fäßler <er...@uni-jena.de>.
  Hi Tommaso,

I'm not sure I saw exactly that, but there was a Solr-UIMA contribution a
few months ago and I had a look. I didn't go into details because our
search engine hasn't been upgraded to Solr yet (but that is to come). I will
keep your link; perhaps it will prove useful to me, thank you!

Best regards,

     Erik

Am 17.11.2010 16:25, schrieb Tommaso Teofili:
> Hi Erik
>
> 2010/11/17 Erik Fäßler<er...@uni-jena.de>
>
>> . But until this point it is necessary to retrieve the full documents,
>> otherwise I'd have to re-evaluate and partly rewrite our UIMA-Pipelines.
>
> Did you see https://issues.apache.org/jira/browse/SOLR-2129 for enhancing
> docs with UIMA pipelines just before they get indexed in Solr?
> Cheers,
> Tommaso
>


Re: DIH full-import failure, no real error message

Posted by Tommaso Teofili <to...@gmail.com>.
Hi Erik

2010/11/17 Erik Fäßler <er...@uni-jena.de>

> . But until this point it is necessary to retrieve the full documents,
> otherwise I'd have to re-evaluate and partly rewrite our UIMA-Pipelines.


Did you see https://issues.apache.org/jira/browse/SOLR-2129 for enhancing
docs with UIMA pipelines just before they get indexed in Solr?
Cheers,
Tommaso

Re: DIH full-import failure, no real error message

Posted by Erik Fäßler <er...@uni-jena.de>.
  Yes, I knew indexing and storing would pose a heavy load, but I wanted to
give it a try. The storing is necessary for the goal I'd like to achieve.
We use a UIMA NLP pipeline to process the Medline documents and we
already have a Medline XML reader. Everything is fine with all this,
except that until now we just stored every single XML document on disk and
saved the paths of the exact documents we wanted to process on a
particular run in a database. Then our UIMA CollectionReader would
retrieve a batch of file paths from the database, read the files and
process them.
This worked fine and it still will - but importing into the database can
take quite a long time because we have to traverse the file system tree
for the correct files. We arranged the files so we can find them more
easily. But still, extracting all the individual files from the larger
XML blobs takes too much time and too many inodes ;)
This is why I'm building a Solr index (nice benefit here: I could implement
search) and - as an alternative - storing them in a database for
retrieval; I will experiment with both solutions and see which one
better fulfills my needs. But until that point it is necessary to
retrieve the full documents; otherwise I'd have to re-evaluate and
partly rewrite our UIMA pipelines. Perhaps that will be the way to go,
but it would be really time-consuming and I'd only do it if there
were great benefits.

It seems David's solution would be ideal for us; perhaps I will read up
on the cloud branch, and HBase in particular.

But - as long as Solr can handle storing the whole XML
documents - I can of course switch off the indexing of the XML. I may
need the whole XML for retrieval, but I can identify the particular parts of
the XML we'd like to search. These can be extracted easily, so this is a
good idea, of course.

Thanks for all your great advice and help, I really appreciate it!

Best,

Erik
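
A minimal SolrJ sketch of the "retrieve the full document" case discussed in
this thread, assuming the pmid/xml schema quoted earlier and a Solr 1.4-era
client; the URL and the example PMID are placeholders.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

public class FetchByPmid {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("pmid:15023993"); // look up via the uniqueKey field
        query.setFields("pmid", "xml");                   // only return what is needed
        QueryResponse response = solr.query(query);
        SolrDocumentList docs = response.getResults();
        if (!docs.isEmpty()) {
            // The stored field holds the complete XML for this citation.
            String xml = (String) docs.get(0).getFieldValue("xml");
            System.out.println(xml);
        }
    }
}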


Am 17.11.2010 01:55, schrieb Erick Erickson:
> They're not mutually exclusive. Part of your index size is because you
> *store*
> the full xml, which means that a verbatim copy of the raw data is placed in
> the
> index, along with the searchable terms. Including the tags. This only makes
> sense if you're going to return the original data to the user AND use the
> index
> to hold it.
>
> Storing has nothing to do with searching (again, pardon me if this is
> obvious),
> which can be confusing. I claim you could reduce the size of your index
> dramatically without losing any search capability by simply NOT storing
> the XML blob, just indexing it. But that may not be what you need to do,
> only you know your problem space.....
>
> Which brings up the question whether it makes sense to index the
> XML tags, but again that will be defined by your problem space. If
> you have a well-defined set of input tags, you could consider indexing
> each of the tags in a unique field, but the query then gets complicated.
>
> I've seen more than a few situations where trying to use a RDBMSs
> search capabilities doesn't work as the database gets larger, and
> your's qualifies as "larger". In particular, RDBMSs don't have very
> sophisticated search capabilities, and the speed gets pretty bad.
> That's OK, because Solr doesn't have very good join capabilities,
> different tools for different problems.
>
> Best
> Erick
>
> On Tue, Nov 16, 2010 at 12:16 PM, Erik Fäßler<er...@uni-jena.de>wrote:
>
>>   Thank you very much, I will have a read on your links.
>>
>> The full-text-red-flag is exactly the thing why I'm testing this with Solr.
>> As was said before by Dennis, I could also use a database as long as I don't
>> need sophisticated query capabilities. To be honest, I don't know the
>> performance gap between a Lucene index and a database in such a case. I
>> guess I will have to test it.
>> This is thought as a substitution for holding every single file on disc.
>> But I need the whole file information because it's not clear which
>> information will be required in the future. And we don't want to re-index
>> every time we add a new field (not yet, that is ;)).
>>
>> Best regards,
>>
>>     Erik
>>
>> Am 16.11.2010 16:27, schrieb Erick Erickson:
>>
>>> The key is that Solr handles merges by copying, and only after
>>> the copy is complete does it delete the old index. So you'll need
>>> at least 2x your final index size before you start, especially if you
>>> optimize...
>>>
>>> Here's a handy matrix of what you need in your index depending
>>> upon what you want to do:
>>>
>>> http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/solr/FieldOptionsByUseCase
>>>
>>> Leaving out what you don't use will help by shrinking your index.
>>>
>>> <
>>> http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/solr/FieldOptionsByUseCase
>>>> the
>>> thing that jumps out is that you're storing your entire XML document
>>> as well as indexing it. Are you expecting to return the document
>>> to the user? Storing the entire document is is a red-flag, you
>>> probably don't want to do this. If you need to return the entire
>>> document some time, one strategy is to index whatever you need
>>> to search, and index what you need to fetch the document from
>>> an external store. You can index the values of selected tags as fields in
>>> your documents. That would also give you far more flexibility
>>> when searching.
>>>
>>> Best
>>> Erick
>>>
>>>
>>>
>>>
>>> On Tue, Nov 16, 2010 at 9:48 AM, Erik Fäßler<erik.faessler@uni-jena.de
>>>> wrote:
>>>    Hello Erick,
>>>> I guess I'm the one asking for pardon - but sure not you! It seems,
>>>> you're
>>>> first guess could already be the correct one. Disc space IS kind of short
>>>> and I believe it could have run out; since Solr is performing a rollback
>>>> after the failure, I didn't notice (beside the fact that this is one of
>>>> our
>>>> server machine, but apparently the wrong mount point...).
>>>>
>>>> I not yet absolutely sure of this, but it would explain a lot and it
>>>> really
>>>> looks like it. So thank you for this maybe not so obvious hint :)
>>>>
>>>> But you also mentioned the merging strategy. I left everything on the
>>>> standards that come with the Solr download concerning these things.
>>>> Could it be that such a great index needs another treatment? Could you
>>>> point me to a Wiki page or something where I get a few tips?
>>>>
>>>> Thanks a lot, I will try building the index on a partition with enough
>>>> space, perhaps that will already do it.
>>>>
>>>> Best regards,
>>>>
>>>>     Erik
>>>>
>>>> Am 16.11.2010 14:19, schrieb Erick Erickson:
>>>>
>>>>   Several questions. Pardon me if they're obvious, but I've spent faaaar
>>>>
>>>>> too much of my life overlooking the obvious...
>>>>>
>>>>> 1>    Is it possible you're running out of disk? 40-50G could suck up
>>>>> a lot of disk, especially when merging. You may need that much again
>>>>> free when a merge occurs.
>>>>> 2>    speaking of merging, what are your merge settings? How are you
>>>>> triggering merges. See<mergeFactor>    and associated in solrconfig.xml?
>>>>> 3>    You might get some insight by removing the Solr indexing part, can
>>>>> you spin through your parsing from beginning to end? That would
>>>>> eliminate your questions about whether you're XML parsing is the
>>>>> problem.
>>>>>
>>>>>
>>>>> 40-50G is a large index, but it's certainly within Solr's capability,
>>>>> so you're not hitting any built-in limits.
>>>>>
>>>>> My first guess would be that you're running out of disk, at least
>>>>> that's the first thing I'd check next...
>>>>>
>>>>> Best
>>>>> Erick
>>>>>
>>>>> On Tue, Nov 16, 2010 at 3:33 AM, Erik Fäßler<erik.faessler@uni-jena.de
>>>>>
>>>>>> wrote:
>>>>>>
>>>>>    Hey all,
>>>>>
>>>>>> I'm trying to create a Solr index for the 2010 Medline-baseline (
>>>>>> www.pubmed.gov, over 18 million XML documents). My goal is to be able
>>>>>> to
>>>>>> retrieve single XML documents by their ID. Each document comes with a
>>>>>> unique
>>>>>> ID, the PubMedID. So my schema (important portions) looks like this:
>>>>>>
>>>>>> <field name="pmid" type="string" indexed="true" stored="true"
>>>>>> required="true" />
>>>>>> <field name="date" type="tdate" indexed="true" stored="true"/>
>>>>>> <field name="xml" type="text" indexed="true" stored="true"/>
>>>>>>
>>>>>> <uniqueKey>pmid</uniqueKey>
>>>>>> <defaultSearchField>pmid</defaultSearchField>
>>>>>>
>>>>>> pmid holds the ID, data hold the creation date; xml holds the whole XML
>>>>>> document (mostly below 5kb). I used the DataImporter to do this. I had
>>>>>> to
>>>>>> write some classes (DataSource, EntityProcessor, DateFormatter) myself,
>>>>>> so
>>>>>> theoretically, the error could lie there.
>>>>>>
>>>>>> What happens is that indexing just looks fine at the beginning. Memory
>>>>>> usage is quite below the maximum (max of 20g, usage of below 5g, most
>>>>>> of
>>>>>> the
>>>>>> time around 3g). It goes several hours in this manner until it suddenly
>>>>>> stopps. I tried this a few times with minor tweaks, non of which made
>>>>>> any
>>>>>> difference. The last time such a crash occurred, over 16.5 million
>>>>>> documents
>>>>>> already had been indexed (argh, so close...). It never stops at the
>>>>>> same
>>>>>> document and trying to index the documents, where the error occurred,
>>>>>> just
>>>>>> runs fine. Index size on disc was between 40g and 50g the last time I
>>>>>> had
>>>>>> a
>>>>>> look.
>>>>>>
>>>>>> This is the log from beginning to end:
>>>>>>
>>>>>> (I decided to just attach the log for the sake of readability ;) ).
>>>>>>
>>>>>> As you can see, Solr's error message is not quite complete. There are
>>>>>> no
>>>>>> closing brackets. The document is cut in half on this message and not
>>>>>> even
>>>>>> the error message itself is complete: The 'D' of
>>>>>> (D)ataImporter.runCmd(DataImporter.java:389) right after the document
>>>>>> text
>>>>>> is missing.
>>>>>>
>>>>>> I have one thought concerning this: I get the input documents as an
>>>>>> InputStream which I read buffer-wise (at most 1000bytes per read()
>>>>>> call).
>>>>>> I
>>>>>> need to deliver the documents in one large byte-Array to the XML parser
>>>>>> I
>>>>>> use (VTD XML).
>>>>>> But I don't only get the individual small XML documents but always one
>>>>>> larger XML blob with exactly 30,000 of these documents. I use a
>>>>>> self-written
>>>>>> EntityProcessor to extract the single documents from the larger blob.
>>>>>> These
>>>>>> blobs have a size of about 50 to 150mb. So what I do is to read these
>>>>>> large
>>>>>> blobs in 1000bytes steps and store each byte array in an
>>>>>> ArrayList<byte[]>.
>>>>>> Afterwards, I create the ultimate byte[] and do System.arraycopy from
>>>>>> the
>>>>>> ArrayList into the byte[].
>>>>>> I tested this and it looks fine to me. And how I said, indexing the
>>>>>> documents where the error occurred just works fine (that is, indexing
>>>>>> the
>>>>>> whole blob containing the single document). I just mention this because
>>>>>> it
>>>>>> kind of looks like there is this cut in the document and the missing
>>>>>> 'D'
>>>>>> reminds me of char-encoding errors. But I don't know for real, opening
>>>>>> the
>>>>>> error log in vi doesn't show any broken characters (the last time I had
>>>>>> such
>>>>>> problems, vi could identify the characters in question, other editors
>>>>>> just
>>>>>> wouldn't show them).
>>>>>>
>>>>>> Further ideas from my side: Is the index too big? I think I read
>>>>>> something
>>>>>> about a large index would be something around 10million documents, I
>>>>>> aim
>>>>>> to
>>>>>> approximately double this number. But would this cause such an error?
>>>>>> In
>>>>>> the
>>>>>> end: What exactly IS the error?
>>>>>>
>>>>>> Sorry for the lot of text, just trying to describe the problem as
>>>>>> detailed
>>>>>> as possible. Thanks a lot for reading and I appreciate any ideas! :)
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>>     Erik
>>>>>>
>>>>>>
>>>>>>


Re: DIH full-import failure, no real error message

Posted by Erick Erickson <er...@gmail.com>.
They're not mutually exclusive. Part of your index size is because you *store*
the full xml, which means that a verbatim copy of the raw data is placed in
the index, along with the searchable terms. Including the tags. This only makes
sense if you're going to return the original data to the user AND use the index
to hold it.

Storing has nothing to do with searching (again, pardon me if this is
obvious), which can be confusing. I claim you could reduce the size of your
index dramatically without losing any search capability by simply NOT storing
the XML blob, just indexing it. But that may not be what you need to do,
only you know your problem space.....

Which brings up the question whether it makes sense to index the
XML tags, but again that will be defined by your problem space. If
you have a well-defined set of input tags, you could consider indexing
each of the tags in a unique field, but the query then gets complicated.

I've seen more than a few situations where trying to use an RDBMS's
search capabilities doesn't work as the database gets larger, and
yours qualifies as "larger". In particular, RDBMSs don't have very
sophisticated search capabilities, and the speed gets pretty bad.
That's OK, because Solr doesn't have very good join capabilities,
different tools for different problems.

Best
Erick
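
A rough sketch of the alternative Erick describes: index selected values
extracted from the XML as their own fields and keep only a pointer to an
external copy, instead of storing the whole blob. The field names other than
pmid and date (title, abstract, xml_path) are hypothetical and would need
matching entries in the schema; the URL and values are placeholders.

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexSelectedFields {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("pmid", "15023993");
        doc.addField("date", "2010-11-16T00:00:00Z");
        doc.addField("title", "Example article title");          // value pulled out of the XML
        doc.addField("abstract", "Example abstract text");       // value pulled out of the XML
        doc.addField("xml_path", "/data/medline/blob_0123.xml"); // pointer to the external store
        solr.add(doc);                                           // no stored copy of the full blob
        solr.commit();
    }
}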

On Tue, Nov 16, 2010 at 12:16 PM, Erik Fäßler <er...@uni-jena.de>wrote:

>  Thank you very much, I will have a read on your links.
>
> The full-text-red-flag is exactly the thing why I'm testing this with Solr.
> As was said before by Dennis, I could also use a database as long as I don't
> need sophisticated query capabilities. To be honest, I don't know the
> performance gap between a Lucene index and a database in such a case. I
> guess I will have to test it.
> This is thought as a substitution for holding every single file on disc.
> But I need the whole file information because it's not clear which
> information will be required in the future. And we don't want to re-index
> every time we add a new field (not yet, that is ;)).
>
> Best regards,
>
>    Erik
>
> Am 16.11.2010 16:27, schrieb Erick Erickson:
>
>> The key is that Solr handles merges by copying, and only after
>> the copy is complete does it delete the old index. So you'll need
>> at least 2x your final index size before you start, especially if you
>> optimize...
>>
>> Here's a handy matrix of what you need in your index depending
>> upon what you want to do:
>>
>> http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/solr/FieldOptionsByUseCase
>>
>> Leaving out what you don't use will help by shrinking your index.
>>
>> <
>> http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/solr/FieldOptionsByUseCase
>> >the
>>
>> thing that jumps out is that you're storing your entire XML document
>> as well as indexing it. Are you expecting to return the document
>> to the user? Storing the entire document is is a red-flag, you
>> probably don't want to do this. If you need to return the entire
>> document some time, one strategy is to index whatever you need
>> to search, and index what you need to fetch the document from
>> an external store. You can index the values of selected tags as fields in
>> your documents. That would also give you far more flexibility
>> when searching.
>>
>> Best
>> Erick
>>
>>
>>
>>
>> On Tue, Nov 16, 2010 at 9:48 AM, Erik Fäßler<erik.faessler@uni-jena.de
>> >wrote:
>>
>>   Hello Erick,
>>>
>>> I guess I'm the one asking for pardon - but sure not you! It seems,
>>> you're
>>> first guess could already be the correct one. Disc space IS kind of short
>>> and I believe it could have run out; since Solr is performing a rollback
>>> after the failure, I didn't notice (beside the fact that this is one of
>>> our
>>> server machine, but apparently the wrong mount point...).
>>>
>>> I not yet absolutely sure of this, but it would explain a lot and it
>>> really
>>> looks like it. So thank you for this maybe not so obvious hint :)
>>>
>>> But you also mentioned the merging strategy. I left everything on the
>>> standards that come with the Solr download concerning these things.
>>> Could it be that such a great index needs another treatment? Could you
>>> point me to a Wiki page or something where I get a few tips?
>>>
>>> Thanks a lot, I will try building the index on a partition with enough
>>> space, perhaps that will already do it.
>>>
>>> Best regards,
>>>
>>>    Erik
>>>
>>> Am 16.11.2010 14:19, schrieb Erick Erickson:
>>>
>>>  Several questions. Pardon me if they're obvious, but I've spent faaaar
>>>
>>>> too much of my life overlooking the obvious...
>>>>
>>>> 1>   Is it possible you're running out of disk? 40-50G could suck up
>>>> a lot of disk, especially when merging. You may need that much again
>>>> free when a merge occurs.
>>>> 2>   speaking of merging, what are your merge settings? How are you
>>>> triggering merges. See<mergeFactor>   and associated in solrconfig.xml?
>>>> 3>   You might get some insight by removing the Solr indexing part, can
>>>> you spin through your parsing from beginning to end? That would
>>>> eliminate your questions about whether you're XML parsing is the
>>>> problem.
>>>>
>>>>
>>>> 40-50G is a large index, but it's certainly within Solr's capability,
>>>> so you're not hitting any built-in limits.
>>>>
>>>> My first guess would be that you're running out of disk, at least
>>>> that's the first thing I'd check next...
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Tue, Nov 16, 2010 at 3:33 AM, Erik Fäßler<erik.faessler@uni-jena.de
>>>>
>>>>> wrote:
>>>>>
>>>>   Hey all,
>>>>
>>>>> I'm trying to create a Solr index for the 2010 Medline-baseline (
>>>>> www.pubmed.gov, over 18 million XML documents). My goal is to be able
>>>>> to
>>>>> retrieve single XML documents by their ID. Each document comes with a
>>>>> unique
>>>>> ID, the PubMedID. So my schema (important portions) looks like this:
>>>>>
>>>>> <field name="pmid" type="string" indexed="true" stored="true"
>>>>> required="true" />
>>>>> <field name="date" type="tdate" indexed="true" stored="true"/>
>>>>> <field name="xml" type="text" indexed="true" stored="true"/>
>>>>>
>>>>> <uniqueKey>pmid</uniqueKey>
>>>>> <defaultSearchField>pmid</defaultSearchField>
>>>>>
>>>>> pmid holds the ID, data hold the creation date; xml holds the whole XML
>>>>> document (mostly below 5kb). I used the DataImporter to do this. I had
>>>>> to
>>>>> write some classes (DataSource, EntityProcessor, DateFormatter) myself,
>>>>> so
>>>>> theoretically, the error could lie there.
>>>>>
>>>>> What happens is that indexing just looks fine at the beginning. Memory
>>>>> usage is quite below the maximum (max of 20g, usage of below 5g, most
>>>>> of
>>>>> the
>>>>> time around 3g). It goes several hours in this manner until it suddenly
>>>>> stopps. I tried this a few times with minor tweaks, non of which made
>>>>> any
>>>>> difference. The last time such a crash occurred, over 16.5 million
>>>>> documents
>>>>> already had been indexed (argh, so close...). It never stops at the
>>>>> same
>>>>> document and trying to index the documents, where the error occurred,
>>>>> just
>>>>> runs fine. Index size on disc was between 40g and 50g the last time I
>>>>> had
>>>>> a
>>>>> look.
>>>>>
>>>>> This is the log from beginning to end:
>>>>>
>>>>> (I decided to just attach the log for the sake of readability ;) ).
>>>>>
>>>>> As you can see, Solr's error message is not quite complete. There are
>>>>> no
>>>>> closing brackets. The document is cut in half on this message and not
>>>>> even
>>>>> the error message itself is complete: The 'D' of
>>>>> (D)ataImporter.runCmd(DataImporter.java:389) right after the document
>>>>> text
>>>>> is missing.
>>>>>
>>>>> I have one thought concerning this: I get the input documents as an
>>>>> InputStream which I read buffer-wise (at most 1000bytes per read()
>>>>> call).
>>>>> I
>>>>> need to deliver the documents in one large byte-Array to the XML parser
>>>>> I
>>>>> use (VTD XML).
>>>>> But I don't only get the individual small XML documents but always one
>>>>> larger XML blob with exactly 30,000 of these documents. I use a
>>>>> self-written
>>>>> EntityProcessor to extract the single documents from the larger blob.
>>>>> These
>>>>> blobs have a size of about 50 to 150mb. So what I do is to read these
>>>>> large
>>>>> blobs in 1000bytes steps and store each byte array in an
>>>>> ArrayList<byte[]>.
>>>>> Afterwards, I create the ultimate byte[] and do System.arraycopy from
>>>>> the
>>>>> ArrayList into the byte[].
>>>>> I tested this and it looks fine to me. And how I said, indexing the
>>>>> documents where the error occurred just works fine (that is, indexing
>>>>> the
>>>>> whole blob containing the single document). I just mention this because
>>>>> it
>>>>> kind of looks like there is this cut in the document and the missing
>>>>> 'D'
>>>>> reminds me of char-encoding errors. But I don't know for real, opening
>>>>> the
>>>>> error log in vi doesn't show any broken characters (the last time I had
>>>>> such
>>>>> problems, vi could identify the characters in question, other editors
>>>>> just
>>>>> wouldn't show them).
>>>>>
>>>>> Further ideas from my side: Is the index too big? I think I read
>>>>> something
>>>>> about a large index would be something around 10million documents, I
>>>>> aim
>>>>> to
>>>>> approximately double this number. But would this cause such an error?
>>>>> In
>>>>> the
>>>>> end: What exactly IS the error?
>>>>>
>>>>> Sorry for the lot of text, just trying to describe the problem as
>>>>> detailed
>>>>> as possible. Thanks a lot for reading and I appreciate any ideas! :)
>>>>>
>>>>> Best regards,
>>>>>
>>>>>    Erik
>>>>>
>>>>>
>>>>>
>

RE: DIH full-import failure, no real error message

Posted by "Buttler, David" <bu...@llnl.gov>.
I am using the Solr cloud branch on 6 machines.  I first load PubMed into HBase, and then push the fields I care about to Solr.  Indexing from HBase to Solr takes about 18 minutes.  Loading into HBase takes a little longer (2 hours?), but it only happens once, so I haven't spent much time trying to optimize.

This gives me the flexibility of a Solr search as well as full document retrieval (and additional processing) from HBase.

Dave
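
A rough sketch of the two-store layout David describes, with the full XML
kept in HBase under the PMID as row key and only selected fields pushed to
Solr for search. The table, column family and qualifier names here are made
up for illustration, and the exact client calls vary a little between HBase
versions.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MedlineHBaseStore {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "medline");

        // Store one document: row key = PMID, a single cell holding the raw XML.
        Put put = new Put(Bytes.toBytes("15023993"));
        put.add(Bytes.toBytes("doc"), Bytes.toBytes("xml"),
                Bytes.toBytes("<MedlineCitation>...</MedlineCitation>"));
        table.put(put);

        // Fetch it again by PMID, e.g. after Solr has returned the matching IDs.
        Result result = table.get(new Get(Bytes.toBytes("15023993")));
        byte[] xml = result.getValue(Bytes.toBytes("doc"), Bytes.toBytes("xml"));
        System.out.println(Bytes.toString(xml));

        table.close();
    }
}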

-----Original Message-----
From: Erik Fäßler [mailto:erik.faessler@uni-jena.de] 
Sent: Tuesday, November 16, 2010 9:16 AM
To: solr-user@lucene.apache.org
Subject: Re: DIH full-import failure, no real error message

  Thank you very much, I will have a read on your links.

The full-text red flag is exactly why I'm testing this with Solr. As
Dennis said before, I could also use a database as long as I don't need
sophisticated query capabilities. To be honest, I don't know the
performance gap between a Lucene index and a database in such a case. I
guess I will have to test it.
This is intended as a substitute for holding every single file on disk.
But I need the whole file information, because it's not clear which
information will be required in the future. And we don't want to
re-index every time we add a new field (not yet, that is ;)).

Best regards,

     Erik

Am 16.11.2010 16:27, schrieb Erick Erickson:
> The key is that Solr handles merges by copying, and only after
> the copy is complete does it delete the old index. So you'll need
> at least 2x your final index size before you start, especially if you
> optimize...
>
> Here's a handy matrix of what you need in your index depending
> upon what you want to do:
> http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/solr/FieldOptionsByUseCase
>
> Leaving out what you don't use will help by shrinking your index.
>
> <http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/solr/FieldOptionsByUseCase>the
> thing that jumps out is that you're storing your entire XML document
> as well as indexing it. Are you expecting to return the document
> to the user? Storing the entire document is is a red-flag, you
> probably don't want to do this. If you need to return the entire
> document some time, one strategy is to index whatever you need
> to search, and index what you need to fetch the document from
> an external store. You can index the values of selected tags as fields in
> your documents. That would also give you far more flexibility
> when searching.
>
> Best
> Erick
>
>
>
>
> On Tue, Nov 16, 2010 at 9:48 AM, Erik Fäßler<er...@uni-jena.de>wrote:
>
>>   Hello Erick,
>>
>> I guess I'm the one asking for pardon - but sure not you! It seems, you're
>> first guess could already be the correct one. Disc space IS kind of short
>> and I believe it could have run out; since Solr is performing a rollback
>> after the failure, I didn't notice (beside the fact that this is one of our
>> server machine, but apparently the wrong mount point...).
>>
>> I not yet absolutely sure of this, but it would explain a lot and it really
>> looks like it. So thank you for this maybe not so obvious hint :)
>>
>> But you also mentioned the merging strategy. I left everything on the
>> standards that come with the Solr download concerning these things.
>> Could it be that such a great index needs another treatment? Could you
>> point me to a Wiki page or something where I get a few tips?
>>
>> Thanks a lot, I will try building the index on a partition with enough
>> space, perhaps that will already do it.
>>
>> Best regards,
>>
>>     Erik
>>
>> Am 16.11.2010 14:19, schrieb Erick Erickson:
>>
>>   Several questions. Pardon me if they're obvious, but I've spent faaaar
>>> too much of my life overlooking the obvious...
>>>
>>> 1>   Is it possible you're running out of disk? 40-50G could suck up
>>> a lot of disk, especially when merging. You may need that much again
>>> free when a merge occurs.
>>> 2>   speaking of merging, what are your merge settings? How are you
>>> triggering merges. See<mergeFactor>   and associated in solrconfig.xml?
>>> 3>   You might get some insight by removing the Solr indexing part, can
>>> you spin through your parsing from beginning to end? That would
>>> eliminate your questions about whether you're XML parsing is the
>>> problem.
>>>
>>>
>>> 40-50G is a large index, but it's certainly within Solr's capability,
>>> so you're not hitting any built-in limits.
>>>
>>> My first guess would be that you're running out of disk, at least
>>> that's the first thing I'd check next...
>>>
>>> Best
>>> Erick
>>>
>>> On Tue, Nov 16, 2010 at 3:33 AM, Erik Fäßler<erik.faessler@uni-jena.de
>>>> wrote:
>>>    Hey all,
>>>> I'm trying to create a Solr index for the 2010 Medline-baseline (
>>>> www.pubmed.gov, over 18 million XML documents). My goal is to be able to
>>>> retrieve single XML documents by their ID. Each document comes with a
>>>> unique
>>>> ID, the PubMedID. So my schema (important portions) looks like this:
>>>>
>>>> <field name="pmid" type="string" indexed="true" stored="true"
>>>> required="true" />
>>>> <field name="date" type="tdate" indexed="true" stored="true"/>
>>>> <field name="xml" type="text" indexed="true" stored="true"/>
>>>>
>>>> <uniqueKey>pmid</uniqueKey>
>>>> <defaultSearchField>pmid</defaultSearchField>
>>>>
>>>> pmid holds the ID, data hold the creation date; xml holds the whole XML
>>>> document (mostly below 5kb). I used the DataImporter to do this. I had to
>>>> write some classes (DataSource, EntityProcessor, DateFormatter) myself,
>>>> so
>>>> theoretically, the error could lie there.
>>>>
>>>> What happens is that indexing just looks fine at the beginning. Memory
>>>> usage is quite below the maximum (max of 20g, usage of below 5g, most of
>>>> the
>>>> time around 3g). It goes several hours in this manner until it suddenly
>>>> stopps. I tried this a few times with minor tweaks, non of which made any
>>>> difference. The last time such a crash occurred, over 16.5 million
>>>> documents
>>>> already had been indexed (argh, so close...). It never stops at the same
>>>> document and trying to index the documents, where the error occurred,
>>>> just
>>>> runs fine. Index size on disc was between 40g and 50g the last time I had
>>>> a
>>>> look.
>>>>
>>>> This is the log from beginning to end:
>>>>
>>>> (I decided to just attach the log for the sake of readability ;) ).
>>>>
>>>> As you can see, Solr's error message is not quite complete. There are no
>>>> closing brackets. The document is cut in half on this message and not
>>>> even
>>>> the error message itself is complete: The 'D' of
>>>> (D)ataImporter.runCmd(DataImporter.java:389) right after the document
>>>> text
>>>> is missing.
>>>>
>>>> I have one thought concerning this: I get the input documents as an
>>>> InputStream which I read buffer-wise (at most 1000bytes per read() call).
>>>> I
>>>> need to deliver the documents in one large byte-Array to the XML parser I
>>>> use (VTD XML).
>>>> But I don't only get the individual small XML documents but always one
>>>> larger XML blob with exactly 30,000 of these documents. I use a
>>>> self-written
>>>> EntityProcessor to extract the single documents from the larger blob.
>>>> These
>>>> blobs have a size of about 50 to 150mb. So what I do is to read these
>>>> large
>>>> blobs in 1000bytes steps and store each byte array in an
>>>> ArrayList<byte[]>.
>>>> Afterwards, I create the ultimate byte[] and do System.arraycopy from the
>>>> ArrayList into the byte[].
>>>> I tested this and it looks fine to me. And how I said, indexing the
>>>> documents where the error occurred just works fine (that is, indexing the
>>>> whole blob containing the single document). I just mention this because
>>>> it
>>>> kind of looks like there is this cut in the document and the missing 'D'
>>>> reminds me of char-encoding errors. But I don't know for real, opening
>>>> the
>>>> error log in vi doesn't show any broken characters (the last time I had
>>>> such
>>>> problems, vi could identify the characters in question, other editors
>>>> just
>>>> wouldn't show them).
>>>>
>>>> Further ideas from my side: Is the index too big? I think I read
>>>> something
>>>> about a large index would be something around 10million documents, I aim
>>>> to
>>>> approximately double this number. But would this cause such an error? In
>>>> the
>>>> end: What exactly IS the error?
>>>>
>>>> Sorry for the lot of text, just trying to describe the problem as
>>>> detailed
>>>> as possible. Thanks a lot for reading and I appreciate any ideas! :)
>>>>
>>>> Best regards,
>>>>
>>>>     Erik
>>>>
>>>>



Re: DIH full-import failure, no real error message

Posted by Erik Fäßler <er...@uni-jena.de>.
  Thank you very much, I will have a read on your links.

The full-text red flag is exactly why I'm testing this with Solr. As
Dennis said before, I could also use a database as long as I don't need
sophisticated query capabilities. To be honest, I don't know the
performance gap between a Lucene index and a database in such a case. I
guess I will have to test it.
This is intended as a substitute for holding every single file on disk.
But I need the whole file information, because it's not clear which
information will be required in the future. And we don't want to
re-index every time we add a new field (not yet, that is ;)).

Best regards,

     Erik

Am 16.11.2010 16:27, schrieb Erick Erickson:
> The key is that Solr handles merges by copying, and only after
> the copy is complete does it delete the old index. So you'll need
> at least 2x your final index size before you start, especially if you
> optimize...
>
> Here's a handy matrix of what you need in your index depending
> upon what you want to do:
> http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/solr/FieldOptionsByUseCase
>
> Leaving out what you don't use will help by shrinking your index.
>
> <http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/solr/FieldOptionsByUseCase>the
> thing that jumps out is that you're storing your entire XML document
> as well as indexing it. Are you expecting to return the document
> to the user? Storing the entire document is is a red-flag, you
> probably don't want to do this. If you need to return the entire
> document some time, one strategy is to index whatever you need
> to search, and index what you need to fetch the document from
> an external store. You can index the values of selected tags as fields in
> your documents. That would also give you far more flexibility
> when searching.
>
> Best
> Erick
>
>
>
>
> On Tue, Nov 16, 2010 at 9:48 AM, Erik Fäßler<er...@uni-jena.de>wrote:
>
>>   Hello Erick,
>>
>> I guess I'm the one asking for pardon - but sure not you! It seems, you're
>> first guess could already be the correct one. Disc space IS kind of short
>> and I believe it could have run out; since Solr is performing a rollback
>> after the failure, I didn't notice (beside the fact that this is one of our
>> server machine, but apparently the wrong mount point...).
>>
>> I not yet absolutely sure of this, but it would explain a lot and it really
>> looks like it. So thank you for this maybe not so obvious hint :)
>>
>> But you also mentioned the merging strategy. I left everything on the
>> standards that come with the Solr download concerning these things.
>> Could it be that such a great index needs another treatment? Could you
>> point me to a Wiki page or something where I get a few tips?
>>
>> Thanks a lot, I will try building the index on a partition with enough
>> space, perhaps that will already do it.
>>
>> Best regards,
>>
>>     Erik
>>
>> Am 16.11.2010 14:19, schrieb Erick Erickson:
>>
>>   Several questions. Pardon me if they're obvious, but I've spent faaaar
>>> too much of my life overlooking the obvious...
>>>
>>> 1>   Is it possible you're running out of disk? 40-50G could suck up
>>> a lot of disk, especially when merging. You may need that much again
>>> free when a merge occurs.
>>> 2>   speaking of merging, what are your merge settings? How are you
>>> triggering merges. See<mergeFactor>   and associated in solrconfig.xml?
>>> 3>   You might get some insight by removing the Solr indexing part, can
>>> you spin through your parsing from beginning to end? That would
>>> eliminate your questions about whether you're XML parsing is the
>>> problem.
>>>
>>>
>>> 40-50G is a large index, but it's certainly within Solr's capability,
>>> so you're not hitting any built-in limits.
>>>
>>> My first guess would be that you're running out of disk, at least
>>> that's the first thing I'd check next...
>>>
>>> Best
>>> Erick
>>>
>>> On Tue, Nov 16, 2010 at 3:33 AM, Erik Fäßler<erik.faessler@uni-jena.de
>>>> wrote:
>>>    Hey all,
>>>> I'm trying to create a Solr index for the 2010 Medline-baseline (
>>>> www.pubmed.gov, over 18 million XML documents). My goal is to be able to
>>>> retrieve single XML documents by their ID. Each document comes with a
>>>> unique
>>>> ID, the PubMedID. So my schema (important portions) looks like this:
>>>>
>>>> <field name="pmid" type="string" indexed="true" stored="true"
>>>> required="true" />
>>>> <field name="date" type="tdate" indexed="true" stored="true"/>
>>>> <field name="xml" type="text" indexed="true" stored="true"/>
>>>>
>>>> <uniqueKey>pmid</uniqueKey>
>>>> <defaultSearchField>pmid</defaultSearchField>
>>>>
>>>> pmid holds the ID, data hold the creation date; xml holds the whole XML
>>>> document (mostly below 5kb). I used the DataImporter to do this. I had to
>>>> write some classes (DataSource, EntityProcessor, DateFormatter) myself,
>>>> so
>>>> theoretically, the error could lie there.
>>>>
>>>> What happens is that indexing just looks fine at the beginning. Memory
>>>> usage is quite below the maximum (max of 20g, usage of below 5g, most of
>>>> the
>>>> time around 3g). It goes several hours in this manner until it suddenly
>>>> stopps. I tried this a few times with minor tweaks, non of which made any
>>>> difference. The last time such a crash occurred, over 16.5 million
>>>> documents
>>>> already had been indexed (argh, so close...). It never stops at the same
>>>> document and trying to index the documents, where the error occurred,
>>>> just
>>>> runs fine. Index size on disc was between 40g and 50g the last time I had
>>>> a
>>>> look.
>>>>
>>>> This is the log from beginning to end:
>>>>
>>>> (I decided to just attach the log for the sake of readability ;) ).
>>>>
>>>> As you can see, Solr's error message is not quite complete. There are no
>>>> closing brackets. The document is cut in half on this message and not
>>>> even
>>>> the error message itself is complete: The 'D' of
>>>> (D)ataImporter.runCmd(DataImporter.java:389) right after the document
>>>> text
>>>> is missing.
>>>>
>>>> I have one thought concerning this: I get the input documents as an
>>>> InputStream which I read buffer-wise (at most 1000bytes per read() call).
>>>> I
>>>> need to deliver the documents in one large byte-Array to the XML parser I
>>>> use (VTD XML).
>>>> But I don't only get the individual small XML documents but always one
>>>> larger XML blob with exactly 30,000 of these documents. I use a
>>>> self-written
>>>> EntityProcessor to extract the single documents from the larger blob.
>>>> These
>>>> blobs have a size of about 50 to 150mb. So what I do is to read these
>>>> large
>>>> blobs in 1000bytes steps and store each byte array in an
>>>> ArrayList<byte[]>.
>>>> Afterwards, I create the ultimate byte[] and do System.arraycopy from the
>>>> ArrayList into the byte[].
>>>> I tested this and it looks fine to me. And how I said, indexing the
>>>> documents where the error occurred just works fine (that is, indexing the
>>>> whole blob containing the single document). I just mention this because
>>>> it
>>>> kind of looks like there is this cut in the document and the missing 'D'
>>>> reminds me of char-encoding errors. But I don't know for real, opening
>>>> the
>>>> error log in vi doesn't show any broken characters (the last time I had
>>>> such
>>>> problems, vi could identify the characters in question, other editors
>>>> just
>>>> wouldn't show them).
>>>>
>>>> Further ideas from my side: Is the index too big? I think I read
>>>> something
>>>> about a large index would be something around 10million documents, I aim
>>>> to
>>>> approximately double this number. But would this cause such an error? In
>>>> the
>>>> end: What exactly IS the error?
>>>>
>>>> Sorry for the lot of text, just trying to describe the problem as
>>>> detailed
>>>> as possible. Thanks a lot for reading and I appreciate any ideas! :)
>>>>
>>>> Best regards,
>>>>
>>>>     Erik
>>>>
>>>>


Re: DIH full-import failure, no real error message

Posted by Erick Erickson <er...@gmail.com>.
The key is that Solr handles merges by copying, and only after
the copy is complete does it delete the old index. So you'll need
at least 2x your final index size before you start, especially if you
optimize...

Here's a handy matrix of what you need in your index depending
upon what you want to do:
http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/solr/FieldOptionsByUseCase

Leaving out what you don't use will help by shrinking your index.

<http://search.lucidimagination.com/search/out?u=http://wiki.apache.org/solr/FieldOptionsByUseCase>the
thing that jumps out is that you're storing your entire XML document
as well as indexing it. Are you expecting to return the document
to the user? Storing the entire document is a red flag; you
probably don't want to do this. If you need to return the entire
document some time, one strategy is to index whatever you need
to search, and index what you need to fetch the document from
an external store. You can index the values of selected tags as fields in
your documents. That would also give you far more flexibility
when searching.

Best
Erick




On Tue, Nov 16, 2010 at 9:48 AM, Erik Fäßler <er...@uni-jena.de>wrote:

>  Hello Erick,
>
> I guess I'm the one asking for pardon - but sure not you! It seems, you're
> first guess could already be the correct one. Disc space IS kind of short
> and I believe it could have run out; since Solr is performing a rollback
> after the failure, I didn't notice (beside the fact that this is one of our
> server machine, but apparently the wrong mount point...).
>
> I not yet absolutely sure of this, but it would explain a lot and it really
> looks like it. So thank you for this maybe not so obvious hint :)
>
> But you also mentioned the merging strategy. I left everything on the
> standards that come with the Solr download concerning these things.
> Could it be that such a great index needs another treatment? Could you
> point me to a Wiki page or something where I get a few tips?
>
> Thanks a lot, I will try building the index on a partition with enough
> space, perhaps that will already do it.
>
> Best regards,
>
>    Erik
>
> Am 16.11.2010 14:19, schrieb Erick Erickson:
>
>  Several questions. Pardon me if they're obvious, but I've spent faaaar
>> too much of my life overlooking the obvious...
>>
>> 1>  Is it possible you're running out of disk? 40-50G could suck up
>> a lot of disk, especially when merging. You may need that much again
>> free when a merge occurs.
>> 2>  speaking of merging, what are your merge settings? How are you
>> triggering merges. See<mergeFactor>  and associated in solrconfig.xml?
>> 3>  You might get some insight by removing the Solr indexing part, can
>> you spin through your parsing from beginning to end? That would
>> eliminate your questions about whether you're XML parsing is the
>> problem.
>>
>>
>> 40-50G is a large index, but it's certainly within Solr's capability,
>> so you're not hitting any built-in limits.
>>
>> My first guess would be that you're running out of disk, at least
>> that's the first thing I'd check next...
>>
>> Best
>> Erick
>>
>> On Tue, Nov 16, 2010 at 3:33 AM, Erik Fäßler<erik.faessler@uni-jena.de
>> >wrote:
>>
>>   Hey all,
>>>
>>> I'm trying to create a Solr index for the 2010 Medline-baseline (
>>> www.pubmed.gov, over 18 million XML documents). My goal is to be able to
>>> retrieve single XML documents by their ID. Each document comes with a
>>> unique
>>> ID, the PubMedID. So my schema (important portions) looks like this:
>>>
>>> <field name="pmid" type="string" indexed="true" stored="true"
>>> required="true" />
>>> <field name="date" type="tdate" indexed="true" stored="true"/>
>>> <field name="xml" type="text" indexed="true" stored="true"/>
>>>
>>> <uniqueKey>pmid</uniqueKey>
>>> <defaultSearchField>pmid</defaultSearchField>
>>>
>>> pmid holds the ID, data hold the creation date; xml holds the whole XML
>>> document (mostly below 5kb). I used the DataImporter to do this. I had to
>>> write some classes (DataSource, EntityProcessor, DateFormatter) myself,
>>> so
>>> theoretically, the error could lie there.
>>>
>>> What happens is that indexing just looks fine at the beginning. Memory
>>> usage is quite below the maximum (max of 20g, usage of below 5g, most of
>>> the
>>> time around 3g). It goes several hours in this manner until it suddenly
>>> stopps. I tried this a few times with minor tweaks, non of which made any
>>> difference. The last time such a crash occurred, over 16.5 million
>>> documents
>>> already had been indexed (argh, so close...). It never stops at the same
>>> document and trying to index the documents, where the error occurred,
>>> just
>>> runs fine. Index size on disc was between 40g and 50g the last time I had
>>> a
>>> look.
>>>
>>> This is the log from beginning to end:
>>>
>>> (I decided to just attach the log for the sake of readability ;) ).
>>>
>>> As you can see, Solr's error message is not quite complete. There are no
>>> closing brackets. The document is cut in half on this message and not
>>> even
>>> the error message itself is complete: The 'D' of
>>> (D)ataImporter.runCmd(DataImporter.java:389) right after the document
>>> text
>>> is missing.
>>>
>>> I have one thought concerning this: I get the input documents as an
>>> InputStream which I read buffer-wise (at most 1000bytes per read() call).
>>> I
>>> need to deliver the documents in one large byte-Array to the XML parser I
>>> use (VTD XML).
>>> But I don't only get the individual small XML documents but always one
>>> larger XML blob with exactly 30,000 of these documents. I use a
>>> self-written
>>> EntityProcessor to extract the single documents from the larger blob.
>>> These
>>> blobs have a size of about 50 to 150mb. So what I do is to read these
>>> large
>>> blobs in 1000bytes steps and store each byte array in an
>>> ArrayList<byte[]>.
>>> Afterwards, I create the ultimate byte[] and do System.arraycopy from the
>>> ArrayList into the byte[].
>>> I tested this and it looks fine to me. And how I said, indexing the
>>> documents where the error occurred just works fine (that is, indexing the
>>> whole blob containing the single document). I just mention this because
>>> it
>>> kind of looks like there is this cut in the document and the missing 'D'
>>> reminds me of char-encoding errors. But I don't know for real, opening
>>> the
>>> error log in vi doesn't show any broken characters (the last time I had
>>> such
>>> problems, vi could identify the characters in question, other editors
>>> just
>>> wouldn't show them).
>>>
>>> Further ideas from my side: Is the index too big? I think I read
>>> something
>>> about a large index would be something around 10million documents, I aim
>>> to
>>> approximately double this number. But would this cause such an error? In
>>> the
>>> end: What exactly IS the error?
>>>
>>> Sorry for the lot of text, just trying to describe the problem as
>>> detailed
>>> as possible. Thanks a lot for reading and I appreciate any ideas! :)
>>>
>>> Best regards,
>>>
>>>    Erik
>>>
>>>
>

Re: DIH full-import failure, no real error message

Posted by Erik Fäßler <er...@uni-jena.de>.
  Hello Erick,

I guess I'm the one asking for pardon - but surely not you! It seems your 
first guess could already be the correct one. Disc space IS kind 
of short and I believe it could have run out; since Solr performs a 
rollback after the failure, I didn't notice (besides the fact that this 
is one of our server machines, but apparently the wrong mount point...).
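
For the record, a quick disk check on the partition that actually holds 
the Solr data directory would have shown this right away; something like 
the following, where the path is just a placeholder:

df -h /path/to/solr/data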

I'm not yet absolutely sure of this, but it would explain a lot and it 
really looks like it. So thank you for this maybe not so obvious hint :)

But you also mentioned the merging strategy. I left everything at the 
defaults that come with the Solr download as far as these things are concerned.
Could it be that such a large index needs a different treatment? Could you 
point me to a Wiki page or something where I can get a few tips?

Thanks a lot, I will try building the index on a partition with enough 
space, perhaps that will already do it.

Best regards,

     Erik

Am 16.11.2010 14:19, schrieb Erick Erickson:
> Several questions. Pardon me if they're obvious, but I've spent faaaar
> too much of my life overlooking the obvious...
>
> 1>  Is it possible you're running out of disk? 40-50G could suck up
> a lot of disk, especially when merging. You may need that much again
> free when a merge occurs.
> 2>  speaking of merging, what are your merge settings? How are you
> triggering merges. See<mergeFactor>  and associated in solrconfig.xml?
> 3>  You might get some insight by removing the Solr indexing part, can
> you spin through your parsing from beginning to end? That would
> eliminate your questions about whether you're XML parsing is the
> problem.
>
>
> 40-50G is a large index, but it's certainly within Solr's capability,
> so you're not hitting any built-in limits.
>
> My first guess would be that you're running out of disk, at least
> that's the first thing I'd check next...
>
> Best
> Erick
>
> On Tue, Nov 16, 2010 at 3:33 AM, Erik Fäßler<er...@uni-jena.de>wrote:
>
>>   Hey all,
>>
>> I'm trying to create a Solr index for the 2010 Medline-baseline (
>> www.pubmed.gov, over 18 million XML documents). My goal is to be able to
>> retrieve single XML documents by their ID. Each document comes with a unique
>> ID, the PubMedID. So my schema (important portions) looks like this:
>>
>> <field name="pmid" type="string" indexed="true" stored="true"
>> required="true" />
>> <field name="date" type="tdate" indexed="true" stored="true"/>
>> <field name="xml" type="text" indexed="true" stored="true"/>
>>
>> <uniqueKey>pmid</uniqueKey>
>> <defaultSearchField>pmid</defaultSearchField>
>>
>> pmid holds the ID, data hold the creation date; xml holds the whole XML
>> document (mostly below 5kb). I used the DataImporter to do this. I had to
>> write some classes (DataSource, EntityProcessor, DateFormatter) myself, so
>> theoretically, the error could lie there.
>>
>> What happens is that indexing just looks fine at the beginning. Memory
>> usage is quite below the maximum (max of 20g, usage of below 5g, most of the
>> time around 3g). It goes several hours in this manner until it suddenly
>> stopps. I tried this a few times with minor tweaks, non of which made any
>> difference. The last time such a crash occurred, over 16.5 million documents
>> already had been indexed (argh, so close...). It never stops at the same
>> document and trying to index the documents, where the error occurred, just
>> runs fine. Index size on disc was between 40g and 50g the last time I had a
>> look.
>>
>> This is the log from beginning to end:
>>
>> (I decided to just attach the log for the sake of readability ;) ).
>>
>> As you can see, Solr's error message is not quite complete. There are no
>> closing brackets. The document is cut in half on this message and not even
>> the error message itself is complete: The 'D' of
>> (D)ataImporter.runCmd(DataImporter.java:389) right after the document text
>> is missing.
>>
>> I have one thought concerning this: I get the input documents as an
>> InputStream which I read buffer-wise (at most 1000bytes per read() call). I
>> need to deliver the documents in one large byte-Array to the XML parser I
>> use (VTD XML).
>> But I don't only get the individual small XML documents but always one
>> larger XML blob with exactly 30,000 of these documents. I use a self-written
>> EntityProcessor to extract the single documents from the larger blob. These
>> blobs have a size of about 50 to 150mb. So what I do is to read these large
>> blobs in 1000bytes steps and store each byte array in an ArrayList<byte[]>.
>> Afterwards, I create the ultimate byte[] and do System.arraycopy from the
>> ArrayList into the byte[].
>> I tested this and it looks fine to me. And how I said, indexing the
>> documents where the error occurred just works fine (that is, indexing the
>> whole blob containing the single document). I just mention this because it
>> kind of looks like there is this cut in the document and the missing 'D'
>> reminds me of char-encoding errors. But I don't know for real, opening the
>> error log in vi doesn't show any broken characters (the last time I had such
>> problems, vi could identify the characters in question, other editors just
>> wouldn't show them).
>>
>> Further ideas from my side: Is the index too big? I think I read something
>> about a large index would be something around 10million documents, I aim to
>> approximately double this number. But would this cause such an error? In the
>> end: What exactly IS the error?
>>
>> Sorry for the lot of text, just trying to describe the problem as detailed
>> as possible. Thanks a lot for reading and I appreciate any ideas! :)
>>
>> Best regards,
>>
>>     Erik
>>


Re: DIH full-import failure, no real error message

Posted by Erick Erickson <er...@gmail.com>.
Several questions. Pardon me if they're obvious, but I've spent faaaar
too much of my life overlooking the obvious...

1> Is it possible you're running out of disk? 40-50G could suck up
a lot of disk, especially when merging. You may need that much again
free when a merge occurs.
2> Speaking of merging, what are your merge settings? How are you
triggering merges? See <mergeFactor> and the associated settings in
solrconfig.xml (a sample snippet follows below).
3> You might get some insight by removing the Solr indexing part: can
you spin through your parsing from beginning to end? That would
eliminate your questions about whether your XML parsing is the
problem.
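
Just to illustrate point 2>, the relevant section of a stock Solr 1.4
solrconfig.xml looks roughly like this (the values are the shipped
example defaults, not a recommendation for your case):

<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>10</mergeFactor>
  <ramBufferSizeMB>32</ramBufferSizeMB>
</indexDefaults>

Lowering mergeFactor keeps fewer segments around (slower indexing,
faster searching); raising it does the opposite.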


40-50G is a large index, but it's certainly within Solr's capability,
so you're not hitting any built-in limits.

My first guess would be that you're running out of disk, at least
that's the first thing I'd check next...

Best
Erick

On Tue, Nov 16, 2010 at 3:33 AM, Erik Fäßler <er...@uni-jena.de>wrote:

>  Hey all,
>
> I'm trying to create a Solr index for the 2010 Medline-baseline (
> www.pubmed.gov, over 18 million XML documents). My goal is to be able to
> retrieve single XML documents by their ID. Each document comes with a unique
> ID, the PubMedID. So my schema (important portions) looks like this:
>
> <field name="pmid" type="string" indexed="true" stored="true"
> required="true" />
> <field name="date" type="tdate" indexed="true" stored="true"/>
> <field name="xml" type="text" indexed="true" stored="true"/>
>
> <uniqueKey>pmid</uniqueKey>
> <defaultSearchField>pmid</defaultSearchField>
>
> pmid holds the ID, data hold the creation date; xml holds the whole XML
> document (mostly below 5kb). I used the DataImporter to do this. I had to
> write some classes (DataSource, EntityProcessor, DateFormatter) myself, so
> theoretically, the error could lie there.
>
> What happens is that indexing just looks fine at the beginning. Memory
> usage is quite below the maximum (max of 20g, usage of below 5g, most of the
> time around 3g). It goes several hours in this manner until it suddenly
> stopps. I tried this a few times with minor tweaks, non of which made any
> difference. The last time such a crash occurred, over 16.5 million documents
> already had been indexed (argh, so close...). It never stops at the same
> document and trying to index the documents, where the error occurred, just
> runs fine. Index size on disc was between 40g and 50g the last time I had a
> look.
>
> This is the log from beginning to end:
>
> (I decided to just attach the log for the sake of readability ;) ).
>
> As you can see, Solr's error message is not quite complete. There are no
> closing brackets. The document is cut in half on this message and not even
> the error message itself is complete: The 'D' of
> (D)ataImporter.runCmd(DataImporter.java:389) right after the document text
> is missing.
>
> I have one thought concerning this: I get the input documents as an
> InputStream which I read buffer-wise (at most 1000bytes per read() call). I
> need to deliver the documents in one large byte-Array to the XML parser I
> use (VTD XML).
> But I don't only get the individual small XML documents but always one
> larger XML blob with exactly 30,000 of these documents. I use a self-written
> EntityProcessor to extract the single documents from the larger blob. These
> blobs have a size of about 50 to 150mb. So what I do is to read these large
> blobs in 1000bytes steps and store each byte array in an ArrayList<byte[]>.
> Afterwards, I create the ultimate byte[] and do System.arraycopy from the
> ArrayList into the byte[].
> I tested this and it looks fine to me. And how I said, indexing the
> documents where the error occurred just works fine (that is, indexing the
> whole blob containing the single document). I just mention this because it
> kind of looks like there is this cut in the document and the missing 'D'
> reminds me of char-encoding errors. But I don't know for real, opening the
> error log in vi doesn't show any broken characters (the last time I had such
> problems, vi could identify the characters in question, other editors just
> wouldn't show them).
>
> Further ideas from my side: Is the index too big? I think I read something
> about a large index would be something around 10million documents, I aim to
> approximately double this number. But would this cause such an error? In the
> end: What exactly IS the error?
>
> Sorry for the lot of text, just trying to describe the problem as detailed
> as possible. Thanks a lot for reading and I appreciate any ideas! :)
>
> Best regards,
>
>    Erik
>

DIH full-import failure, no real error message

Posted by Erik Fäßler <er...@uni-jena.de>.
  Hey all,

I'm trying to create a Solr index for the 2010 Medline-baseline 
(www.pubmed.gov, over 18 million XML documents). My goal is to be able 
to retrieve single XML documents by their ID. Each document comes with a 
unique ID, the PubMedID. So my schema (important portions) looks like this:

<field name="pmid" type="string" indexed="true" stored="true" 
required="true" />
<field name="date" type="tdate" indexed="true" stored="true"/>
<field name="xml" type="text" indexed="true" stored="true"/>

<uniqueKey>pmid</uniqueKey>
<defaultSearchField>pmid</defaultSearchField>

pmid holds the ID, date holds the creation date; xml holds the whole XML 
document (mostly below 5 kB). I used the DataImporter to do this. I had 
to write some classes (DataSource, EntityProcessor, DateFormatter) 
myself, so theoretically, the error could lie there.
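
For illustration, the kind of lookup I'm after would then be something 
like this (assuming Solr runs locally on the default port; 12345678 is 
just a made-up PMID):

curl 'http://localhost:8983/solr/select?q=pmid:12345678&fl=pmid,xml'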

What happens is that indexing just looks fine at the beginning. Memory 
usage is well below the maximum (max of 20g, usage below 5g, most of 
the time around 3g). It goes on in this manner for several hours until it 
suddenly stops. I tried this a few times with minor tweaks, none of 
which made any difference. The last time such a crash occurred, over 
16.5 million documents had already been indexed (argh, so close...). It 
never stops at the same document, and trying to re-index the documents 
where the error occurred just runs fine. Index size on disc was between 
40g and 50g the last time I had a look.

This is the log from beginning to end:

(I decided to just attach the log for the sake of readability ;) ).

As you can see, Solr's error message is not quite complete. There are no 
closing brackets. The document is cut in half in this message, and not 
even the error message itself is complete: the 'D' of 
(D)ataImporter.runCmd(DataImporter.java:389) right after the document 
text is missing.

I have one thought concerning this: I get the input documents as an 
InputStream which I read buffer-wise (at most 1000 bytes per read() 
call). I need to deliver the documents in one large byte array to the 
XML parser I use (VTD XML).
But I don't get the individual small XML documents one by one; I always 
get one larger XML blob with exactly 30,000 of these documents. I use a 
self-written EntityProcessor to extract the single documents from the 
larger blob. These blobs have a size of about 50 to 150 MB. So what I do 
is read these large blobs in 1000-byte steps and store each byte array 
in an ArrayList<byte[]>. Afterwards, I create the final byte[] and use 
System.arraycopy to copy from the ArrayList into it (a rough sketch of 
this follows below).
I tested this and it looks fine to me. And as I said, indexing the 
documents where the error occurred just works fine (that is, indexing 
the whole blob containing the single document). I just mention this 
because it kind of looks like there is this cut in the document, and the 
missing 'D' reminds me of char-encoding errors. But I don't know for 
sure; opening the error log in vi doesn't show any broken characters 
(the last time I had such problems, vi could identify the characters in 
question, other editors just wouldn't show them).
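
To make that concrete, here is a stripped-down sketch of the buffering 
logic I described (not my actual EntityProcessor code, just the idea; 
the class and method names are made up):

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class BlobReader {
    // Read the stream in small steps (at most 1000 bytes per read() call),
    // keep the chunks in a list, and assemble one large byte[] at the end.
    public static byte[] readFully(InputStream in) throws IOException {
        List<byte[]> chunks = new ArrayList<byte[]>();
        int total = 0;
        byte[] buffer = new byte[1000];
        int read;
        while ((read = in.read(buffer)) != -1) {
            byte[] chunk = new byte[read];
            System.arraycopy(buffer, 0, chunk, 0, read);
            chunks.add(chunk);
            total += read;
        }
        byte[] result = new byte[total];
        int pos = 0;
        for (byte[] chunk : chunks) {
            System.arraycopy(chunk, 0, result, pos, chunk.length);
            pos += chunk.length;
        }
        return result;
    }
}

(A java.io.ByteArrayOutputStream would do the same job with less code, 
but either way the assembled byte[] should be byte-identical to the 
stream's content.)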

Further ideas from my side: Is the index too big? I think I read 
somewhere that a large index is something around 10 million documents, 
and I aim for approximately double that number. But would this cause 
such an error? In the end: what exactly IS the error?

Sorry for the wall of text; I'm just trying to describe the problem in 
as much detail as possible. Thanks a lot for reading - I appreciate any 
ideas! :)

Best regards,

     Erik

Re: encoding messy code

Posted by xu cheng <xc...@gmail.com>.
hi:
the problem lies in the web server that interacts with the solr server, and
after some transformation it works now.
thanks

2010/11/16 Peter Karich <pe...@yahoo.de>

>  Am 16.11.2010 07:25, schrieb xu cheng:
>
>  hi all:
>> I configure an app with solr to index documents
>> and there are some Chinese content in the documents
>> and I've configure the apache tomcat URIEncoding to be utf-8
>> and I use the program curl to sent the documents in xml format
>> however , when I query the documents, all the Chinese content becomes
>> messy
>> code. It've cost me a lot of time.
>>
>
> solr handles only utf8. is the xml properly encoded in utf8?
> if you are under linux you can easily convert this via iconv.
> or detect the encoding (based on some heuristics) using enca or similar.
>
> Regards,
> Peter.
>
> --
> http://jetwick.com twitter search prototype
>
>

Re: encoding messy code

Posted by Peter Karich <pe...@yahoo.de>.
  Am 16.11.2010 07:25, schrieb xu cheng:
> hi all:
> I configure an app with solr to index documents
> and there are some Chinese content in the documents
> and I've configure the apache tomcat URIEncoding to be utf-8
> and I use the program curl to sent the documents in xml format
> however , when I query the documents, all the Chinese content becomes messy
> code. It've cost me a lot of time.

Solr handles only UTF-8. Is the XML properly encoded in UTF-8?
If you are on Linux you can easily convert it via iconv,
or detect the encoding (based on some heuristics) using enca or similar.
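
For example (assuming the source file happens to be GBK-encoded - just a 
guess, replace it with whatever the actual encoding is):

iconv -f GBK -t UTF-8 documents.xml > documents-utf8.xml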

Regards,
Peter.

-- 
http://jetwick.com twitter search prototype