Posted to solr-user@lucene.apache.org by neosky <ne...@yahoo.com> on 2012/03/10 19:52:34 UTC

How to index a single big file?

Hello, I have a great challenge here. I have a big file (1.2 GB) with more
than 200 million records that need to be indexed. It might grow to more
than 9 GB with more than 1000 million records later.
Each record contains 3 fields. I am quite new to Solr and Lucene, so I
have some questions:
1. It seems that Solr only works with XML files, so must I transform the
text file into XML?
2. Even if I transform the file into XML format, can Solr deal with such
a big file?
So, I have some ideas. Maybe I should split the big file first.
1. One option is to put each record in its own file, but that would
produce millions of files, which would still be hard to store and index.
2. Another option is to split the file into smaller files of about 10 MB
each, but it seems difficult to split on file size without messing up the
record format.
Do you have any experience indexing this kind of big file? Any ideas or
suggestions are welcome.
Thanks in advance!

Attached is one sample record.
Original raw data:
>A0B531 A0B531_METTP^|^^|^Putative uncharacterized
protein^|^^|^^|^Methanosaeta thermophila PT^|^349307^|^Arch/Euryar^|^28890
MLFALALSLLILTSGSRSIELNNATVIDLAEGKAVIEQPVSGKIFNITAIARIENISVIH
NSHTARCSVEESFWRGVYRYRITADSPVSGILRYEAPLRGQQFISPIVLNGTVVVAIPEG
YTTGARALGIPRPEPYEIFHENRTVVVWRLERESIVEVGFYRNDAPQILGYFFVLLLAAG
IFLAAGYYSSIKKLEAMRRGLK

The format I plan to produce:
<ID>
>A0B531
</ID>
<description>
A0B531_METTP^|^^|^Putative uncharacterized protein^|^^|^^|^Methanosaeta
thermophila PT^|^349307^|^Arch/Euryar^|^28890
</description>
<text>
MLFALALSLLILTSGSRSIELNNATVIDLAEGKAVIEQPVSGKIFNITAIARIENISVIH
NSHTARCSVEESFWRGVYRYRITADSPVSGILRYEAPLRGQQFISPIVLNGTVVVAIPEG
YTTGARALGIPRPEPYEIFHENRTVVVWRLERESIVEVGFYRNDAPQILGYFFVLLLAAG
IFLAAGYYSSIKKLEAMRRGLK
</text>
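
(I am not sure these tags are what Solr actually accepts. From the example
documents that ship with Solr, the XML update handler expects an
<add>/<doc>/<field> envelope, so if I go the XML route I suppose one record
would have to look more like the sketch below, using my planned field names;
the sequence is truncated here for brevity:)

    <add>
      <doc>
        <field name="id">A0B531</field>
        <field name="description">A0B531_METTP^|^^|^Putative uncharacterized protein^|^^|^^|^Methanosaeta thermophila PT^|^349307^|^Arch/Euryar^|^28890</field>
        <field name="text">MLFALALSLLILTSGSRSIELNNATVIDLAEGKAVIEQPVSGKIFNITAIARIENISVIH...</field>
      </doc>
    </add>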






Re: Does Lucene support substring search?

Posted by Ahmet Arslan <io...@yahoo.com>.
> Returning to the post, I would like to know whether Lucene supports
> substring search.
> As you can see, one field of my document is a long string without any
> spaces, which means tokenization doesn't work here. Suppose I want to
> search for the string "TARCSV" in my documents and have the sample record
> below returned from my document set. I have tried both wildcard search
> and fuzzy search, but neither seems to work. I am not sure whether I did
> everything right at the indexing and parsing stages. Does anyone have
> experience with substring search?

Yes, it is possible. Two different approaches are described in a recent thread: http://search-lucene.com/m/Wicj8UB0gl2

One of them uses both trailing and leading wildcards, e.g. q=*TARCSV*.

The other approach makes use of NGramFilterFactory at index time only.

It seems that you will be dealing with extremely long tokens, so it is a
good idea to increase maxTokenLength (the default value is 255); per
SOLR-2188, tokens longer than this are silently ignored.
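
For the NGram route, a field type along these lines in schema.xml is a
starting point (the gram sizes are illustrative; a query term only matches
if its length falls between minGramSize and maxGramSize, so size them to
the substrings you expect to search):

    <fieldType name="text_substring" class="solr.TextField">
      <!-- index time: emit the whole value as one token, then break it
           into overlapping grams so any substring of gram length matches -->
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="10"/>
      </analyzer>
      <!-- query time: no NGram filter, the query string is matched as-is -->
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
    </fieldType>

If you use StandardTokenizerFactory anywhere instead, it takes a
maxTokenLength attribute that covers the 255-character limit above.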


Does Lucene support substring search?

Posted by neosky <ne...@yahoo.com>.
Thank you! Now I use awk to preprocess it. It seems quite efficient. I
think other scripting languages would also be helpful.

Returning to the post, I would like to know whether Lucene supports
substring search.
As you can see, one field of my document is a long string without any
spaces, which means tokenization doesn't work here. Suppose I want to
search for the string "TARCSV" in my documents and have the sample record
below returned from my document set. I have tried both wildcard search
and fuzzy search, but neither seems to work. I am not sure whether I did
everything right at the indexing and parsing stages. Does anyone have
experience with substring search?

>A0B531 A0B531_METTP^|^^|^Putative uncharacterized
protein^|^^|^^|^Methanosaeta thermophila PT^|^349307^|^Arch/Euryar^|^28890 
MLFALALSLLILTSGSRSIELNNATVIDLAEGKAVIEQPVSGKIFNITAIARIENISVIH 
NSH*TARCSV*EESFWRGVYRYRITADSPVSGILRYEAPLRGQQFISPIVLNGTVVVAIPEG 
YTTGARALGIPRPEPYEIFHENRTVVVWRLERESIVEVGFYRNDAPQILGYFFVLLLAAG 
IFLAAGYYSSIKKLEAMRRGLK 


Re: How to index a single big file?

Posted by Grant Ingersoll <gs...@apache.org>.
On Mar 10, 2012, at 1:52 PM, neosky wrote:

> Hello, I have a great challenge here. I have a big file (1.2 GB) with
> more than 200 million records that need to be indexed. It might grow to
> more than 9 GB with more than 1000 million records later.
> Each record contains 3 fields. I am quite new to Solr and Lucene, so I
> have some questions:
> 1. It seems that Solr only works with XML files, so must I transform the
> text file into XML?

There are other formats supported, including just using the SolrJ client with some of your own code that loops through the file.   I wouldn't bother converting to XML; just have your SolrJ program take in a record, convert it to a SolrInputDocument, and send it in.
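
For instance, a minimal sketch of that loop, assuming the record layout
from the sample in your mail ('>' starts each header line; id, description
and text are your planned field names) and leaving out error handling:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import org.apache.solr.common.SolrInputDocument;

    public class RecordParser {

        // Turn one raw record into a SolrInputDocument.
        static SolrInputDocument toDoc(String header, String sequence) {
            SolrInputDocument doc = new SolrInputDocument();
            int space = header.indexOf(' ');
            // ">A0B531 A0B531_METTP^|^..." -> id "A0B531", rest is description
            doc.addField("id", space < 0 ? header.substring(1)
                                         : header.substring(1, space));
            if (space >= 0) {
                doc.addField("description", header.substring(space + 1));
            }
            doc.addField("text", sequence);
            return doc;
        }

        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            String header = null;
            StringBuilder seq = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {        // a '>' line starts a new record
                    if (header != null) {
                        SolrInputDocument doc = toDoc(header, seq.toString());
                        // ... send doc to Solr here, ideally in batches ...
                    }
                    header = line;
                    seq.setLength(0);
                } else {
                    seq.append(line);              // sequence continues
                }
            }
            if (header != null) {
                SolrInputDocument doc = toDoc(header, seq.toString()); // last record
                // ... send it as well ...
            }
            in.close();
        }
    }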


> 2. Even if I transform the file into XML format, can Solr deal with
> such a big file?
> So, I have some ideas. Maybe I should split the big file first.
> 1. One option is to put each record in its own file, but that would
> produce millions of files, which would still be hard to store and index.
> 2. Another option is to split the file into smaller files of about 10 MB
> each, but it seems difficult to split on file size without messing up
> the record format.
> Do you have any experience indexing this kind of big file? Any ideas or
> suggestions are welcome.

I would likely split into some subset of smaller files (I would guess in the range of 10-30M recs per file) and then process those files in parallel (multithreaded) using SolrJ, sending in batches of documents at once or using the StreamingUpdateSolrServer.
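
A rough sketch of the sending side with StreamingUpdateSolrServer (the URL,
queue size, thread count and batch size are illustrative starting points,
not recommendations):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndexer {
        public static void main(String[] args) throws Exception {
            // Buffers up to 20000 docs internally and streams them to Solr
            // over 4 connections; tune both for your hardware.
            StreamingUpdateSolrServer server =
                    new StreamingUpdateSolrServer("http://localhost:8983/solr",
                                                  20000, 4);

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(1000);
            // For each parsed record (see the parsing sketch above):
            //     batch.add(doc);
            //     if (batch.size() >= 1000) { server.add(batch); batch.clear(); }
            if (!batch.isEmpty()) {
                server.add(batch);           // flush the final partial batch
            }
            server.blockUntilFinished();     // wait for queued updates to drain
            server.commit();                 // make the documents searchable
        }
    }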

There are lots of good tutorials on using SolrJ available.


--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com