Posted to java-user@lucene.apache.org by Lokeya <lo...@gmail.com> on 2007/03/17 05:57:48 UTC

Issue while parsing XML files due to control characters, help appreciated.

Hi, 

I am trying to index the content of XML files that are basically the
metadata collected from a website hosting a huge collection of documents.
This metadata XML contains control characters, which cause errors when
parsing with the DOM parser. I tried encoding = UTF-8, but it doesn't seem
to cover all the Unicode characters and I get an error. When I tried
UTF-16, I got a "content not allowed in prolog" error. So my guess is that
no encoding is going to cover almost all Unicode characters. So I tried
splitting my metadata files into small files and processing only the
records that don't throw parsing errors.
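As it turns out, control characters such as 0x00-0x08 are illegal in XML
1.0 no matter which encoding is declared, so no encoding switch will make
them parse; the usual workaround is to strip them before handing the text
to the parser. A minimal sketch (the method name here is invented):

    public static String stripInvalidXmlChars(String in) {
        StringBuffer out = new StringBuffer(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            // Legal XML 1.0 characters: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD
            boolean legal = c == 0x9 || c == 0xA || c == 0xD
                         || (c >= 0x20 && c <= 0xD7FF)
                         || (c >= 0xE000 && c <= 0xFFFD);
            if (legal) {
                out.append(c);
            }
        }
        // Note: this simple version also drops supplementary characters
        // (above U+FFFF), which arrive as surrogate pairs.
        return out.toString();
    }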

But by breaking each metadata file into smaller files I get 10,000 XML
files per metadata file. I have 70 metadata files, so altogether that
becomes 700,000 files. Processing them individually takes a really long
time with Lucene; my guess is that the I/O is time consuming: opening every
small XML file, loading it into the DOM, extracting the required data, and
processing it.

Qn 1: Any suggestions for reducing this indexing time? That would be really
great.

Qn 2: Am I overlooking something in Lucene with respect to indexing?

Right now 12 metadata files take nearly 10 hours, which is really a long time.

Help Appreciated.

Much Thanks.
-- 
View this message in context: http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9526527
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: Issue while parsing XML files due to control characters, help appreciated.

Posted by Cheolgoo Kang <ap...@gmail.com>.
On 3/17/07, Lokeya <lo...@gmail.com> wrote:
>
> Hi,
>
> I am trying to index the content of XML files that are basically the
> metadata collected from a website hosting a huge collection of documents.
> This metadata XML contains control characters, which cause errors when
> parsing with the DOM parser. I tried encoding = UTF-8, but it doesn't seem
> to cover all the Unicode characters and I get an error. When I tried
> UTF-16, I got a "content not allowed in prolog" error. So my guess is that
> no encoding is going to cover almost all Unicode characters. So I tried
> splitting my metadata files into small files and processing only the
> records that don't throw parsing errors.

Are you using CDATA sections in your XML documents?
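(For context: a CDATA section lets text pass through the parser without
entity escaping, e.g. <dc:description><![CDATA[raw text here]]></dc:description>,
but even CDATA cannot contain characters that are illegal in XML 1.0, so it
would not by itself fix the control-character errors above.)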

>
> But by breaking each metadata file into smaller files I get 10,000 XML
> files per metadata file. I have 70 metadata files, so altogether that
> becomes 700,000 files. Processing them individually takes a really long
> time with Lucene; my guess is that the I/O is time consuming: opening
> every small XML file, loading it into the DOM, extracting the required
> data, and processing it.

I think DOM is not appropriate for a batch job like yours. Why don't you
try SAX or XPP? Also, try adjusting IndexWriter's parameters, like
mergeFactor and minMergeDocs.
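A minimal sketch of that combination, assuming Lucene 2.0-era APIs and the
<record>/<dc:title>/<dc:description> tags shown later in this thread; the
class name and the parameter values are only illustrative:

    import java.io.File;
    import java.io.IOException;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class SaxRecordIndexer extends DefaultHandler {
        private final IndexWriter writer;
        private final StringBuffer text = new StringBuffer();
        private String title = "", descr = "";

        SaxRecordIndexer(IndexWriter writer) { this.writer = writer; }

        public void startElement(String uri, String local, String qName, Attributes atts) {
            text.setLength(0);  // start collecting character data for this element
        }

        public void characters(char[] ch, int start, int len) {
            text.append(ch, start, len);
        }

        public void endElement(String uri, String local, String qName) throws SAXException {
            if ("dc:title".equals(qName)) {
                title = text.toString();
            } else if ("dc:description".equals(qName)) {
                descr = text.toString();
            } else if ("record".equals(qName)) {  // one Lucene document per <record>
                try {
                    Document doc = new Document();
                    doc.add(new Field("Title", title, Field.Store.YES, Field.Index.TOKENIZED));
                    doc.add(new Field("Description", descr, Field.Store.YES, Field.Index.TOKENIZED));
                    writer.addDocument(doc);
                } catch (IOException e) {
                    throw new SAXException(e);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("./LUCENE", new StandardAnalyzer(), true);
            writer.setMergeFactor(50);        // example values only -- tune for your data
            writer.setMaxBufferedDocs(1000);  // the Lucene 2.0 successor of minMergeDocs
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            for (int i = 0; i < args.length; i++) {
                parser.parse(new File(args[i]), new SaxRecordIndexer(writer));
            }
            writer.optimize();  // once, at the end
            writer.close();     // once, at the end
        }
    }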

>
> Qn  1: Any suggestion to get this indexing time reduced? It would be really
> great.
>
> Qn 2 : Am I overlooking something in Lucene with respect to indexing?
>
> Right now 12 metadata files take 10 hrs nearly which is really a long time.
>
> Help Appreciated.
>
> Much Thanks.
> --
> View this message in context: http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9526527
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
>
>


-- 
Cheolgoo



Re: Obtaining the (indexed) terms in a field in a particular document

Posted by Erick Erickson <er...@gmail.com>.
Well, depending upon your storage requirements, it's actually
much easier than that. Assuming you're adding
this field (or a duplicate) as UN_TOKENIZED (in this case, no
need to store), you can just spin
through all the terms for that field with TermDocs/TermEnum.
The trick is to have your term start with a value of "". I.e.
new Term(field, "") to enumerate them all. See TermDocs.seek.

This is without TermVectors at all, which'll save you some space.
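A sketch of that enumeration (Lucene 2.0-era API; the index path and the
field name "contents" are placeholders):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class DumpFieldTerms {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("./LUCENE");
            // new Term(field, "") positions the enum at the field's first term
            TermEnum terms = reader.terms(new Term("contents", ""));
            try {
                while (terms.term() != null && "contents".equals(terms.term().field())) {
                    System.out.println(terms.term().text() + " (docFreq=" + terms.docFreq() + ")");
                    if (!terms.next()) break;  // stop at the end of the enumeration
                }
            } finally {
                terms.close();
                reader.close();
            }
        }
    }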

Erick

On 3/20/07, Donna L Gresh <gr...@us.ibm.com> wrote:
>
> Thanks, I see what you are saying.
>
> Seems that if I create the field at index time with term vectors stored,
> then I can iterate through the documents and get both the unique
> identifier and the terms, right? My original question was imprecise in
> that I'm going to want to get all the terms for *all* the documents (one
> document at a time) so I can just iterate through all the documents using
>
>                 for (int i=0; i<indexReaderR.numDocs(); i++) {
>                         TermFreqVector tfv =
> indexReaderR.getTermFreqVector(i,"my text field name");
>
> Donna L. Gresh
> Services Research, Mathematical Sciences Department
> IBM T.J. Watson Research Center
> (914) 945-2472
> http://www.research.ibm.com/people/g/donnagresh
> gresh@us.ibm.com
>
>
>
>
> "Erick Erickson" <er...@gmail.com>
> 03/20/2007 03:08 PM
> Please respond to
> java-user@lucene.apache.org
>
>
> To
> java-user@lucene.apache.org
> cc
>
> Subject
> Re: Obtaining the (indexed) terms in a field in a particular document
>
>
>
>
>
>
> Sorry, but you have to have the Lucene document ID, which you
> can get either as part of a Hits or HitCollector or...
> or by using TermDocs/TermEnum on your unique id (my_id in
> your example).
>
> Erick
>
> On 3/20/07, Erick Erickson <er...@gmail.com> wrote:
> >
> > You can do a document.get(field), *assuming* you have stored the data
> > (Field.Store.YES) at index time, although you may not get
> > stop words.
> >
> > On 3/20/07, Donna L Gresh <gr...@us.ibm.com> wrote:
> > >
> > > My apologies if this is a simple question--
> > >
> > > How can I get all the (stemmed and stop words removed, etc.) terms in
> a
> > > particular field of a particular document?
> > >
> > > Suppose my documents each consist of two fields, one with the name
> > > "my_id"
> > > and a unique identifier, and the other being some text string
> consisting
> > > of a number of words.
> > > I'd like to get all the terms in the text string given the unique
> > > identifier.
> > >
> > > (My basic reason is to do a sort of document similarity between the
> text
> > >
> > > string and some other text string, doing a boolean query with
> > > a number of SHOULD clauses, if this makes sense; I'm welcome to
> > > suggestions of better ways to do this)
> > >
> > > Donna L. Gresh
> > >
> >
> >
>
>

Re: Obtaining the (indexed) terms in a field in a particular document

Posted by Donna L Gresh <gr...@us.ibm.com>.
Thanks, I see what you are saying.

Seems that if I create the field at index time with term vectors stored, 
then I can iterate through the documents and get both the unique 
identifier and the terms, right? My original question was imprecise in 
that I'm going to want to get all the terms for *all* the documents (one 
document at a time) so I can just iterate through all the documents using

                for (int i=0; i<indexReaderR.numDocs(); i++) {
                        TermFreqVector tfv = 
indexReaderR.getTermFreqVector(i,"my text field name");
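For illustration, that loop completed, under the same assumptions: the field
was indexed with term vectors (e.g. Field.TermVector.YES) and the index has
no deleted documents (otherwise iterate to maxDoc() and skip deleted docs):

                for (int i = 0; i < indexReaderR.numDocs(); i++) {
                    TermFreqVector tfv =
                        indexReaderR.getTermFreqVector(i, "my text field name");
                    if (tfv == null) continue;  // doc lacks the field or its vectors
                    String uid = indexReaderR.document(i).get("my_id");
                    String[] terms = tfv.getTerms();         // the indexed terms
                    int[] freqs = tfv.getTermFrequencies();  // parallel frequencies
                    for (int j = 0; j < terms.length; j++) {
                        System.out.println(uid + ": " + terms[j] + " x" + freqs[j]);
                    }
                }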

Donna L. Gresh
Services Research, Mathematical Sciences Department
IBM T.J. Watson Research Center
(914) 945-2472
http://www.research.ibm.com/people/g/donnagresh
gresh@us.ibm.com




"Erick Erickson" <er...@gmail.com> 
03/20/2007 03:08 PM
Please respond to
java-user@lucene.apache.org


To
java-user@lucene.apache.org
cc

Subject
Re: Obtaining the (indexed) terms in a field in a particular document






Sorry, but you have to have the Lucene document ID, which you
can get either as part of a Hits or HitCollector or...
or by using TermDocs/TermEnum on your unique id (my_id in
your example).

Erick

On 3/20/07, Erick Erickson <er...@gmail.com> wrote:
>
> You can do a document.get(field), *assuming* you have stored the data
> (Field.Store.YES) at index time, although you may not get
> stop words.
>
> On 3/20/07, Donna L Gresh <gr...@us.ibm.com> wrote:
> >
> > My apologies if this is a simple question--
> >
> > How can I get all the (stemmed and stop words removed, etc.) terms in 
a
> > particular field of a particular document?
> >
> > Suppose my documents each consist of two fields, one with the name
> > "my_id"
> > and a unique identifier, and the other being some text string 
consisting
> > of a number of words.
> > I'd like to get all the terms in the text string given the unique
> > identifier.
> >
> > (My basic reason is to do a sort of document similarity between the 
text
> >
> > string and some other text string, doing a boolean query with
> > a number of SHOULD clauses, if this makes sense; I'm welcome to
> > suggestions of better ways to do this)
> >
> > Donna L. Gresh
> >
>
>


Re: Obtaining the (indexed) terms in a field in a particular document

Posted by Erick Erickson <er...@gmail.com>.
Sorry, but you have to have the Lucene document ID, which you
can get either as part of a Hits or HitCollector or...
or by using TermDocs/TermEnum on your unique id (my_id in
your example).
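For illustration, a sketch of the TermDocs lookup (assuming the unique id
was indexed untokenized in a field named "my_id", as in the example):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    public class FindDocByUid {
        /** Returns the Lucene document ID for a unique external id, or -1 if absent. */
        static int lookup(IndexReader reader, String uid) throws Exception {
            TermDocs td = reader.termDocs(new Term("my_id", uid));
            try {
                return td.next() ? td.doc() : -1;  // unique field: at most one match
            } finally {
                td.close();
            }
        }
    }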

Erick

On 3/20/07, Erick Erickson <er...@gmail.com> wrote:
>
> You can do a document.get(field), *assuming* you have stored the data
> (Field.Store.YES) at index time, although you may not get
> stop words.
>
> On 3/20/07, Donna L Gresh <gr...@us.ibm.com> wrote:
> >
> > My apologies if this is a simple question--
> >
> > How can I get all the (stemmed and stop words removed, etc.) terms in a
> > particular field of a particular document?
> >
> > Suppose my documents each consist of two fields, one with the name
> > "my_id"
> > and a unique identifier, and the other being some text string consisting
> > of a number of words.
> > I'd like to get all the terms in the text string given the unique
> > identifier.
> >
> > (My basic reason is to do a sort of document similarity between the text
> >
> > string and some other text string, doing a boolean query with
> > a number of SHOULD clauses, if this makes sense; I'm welcome to
> > suggestions of better ways to do this)
> >
> > Donna L. Gresh
> >
>
>

Re: Obtaining the (indexed) terms in a field in a particular document

Posted by Erick Erickson <er...@gmail.com>.
You can do a document.get(field), *assuming* you have stored the data
(Field.Store.YES) at index time, although you may not get
stop words.
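A sketch of that, under the same assumption (Field.Store.YES at index time);
note this returns the original stored string, not the analyzed terms:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;

    public class StoredFieldExample {
        /** Returns the stored (pre-analysis) text of a field, or null if not stored. */
        static String storedText(IndexReader reader, int docId, String field) throws Exception {
            Document doc = reader.document(docId);
            return doc.get(field);  // stemming/stop-word removal is NOT reflected here
        }
    }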

On 3/20/07, Donna L Gresh <gr...@us.ibm.com> wrote:
>
> My apologies if this is a simple question--
>
> How can I get all the (stemmed and stop words removed, etc.) terms in a
> particular field of a particular document?
>
> Suppose my documents each consist of two fields, one with the name "my_id"
> and a unique identifier, and the other being some text string consisting
> of a number of words.
> I'd like to get all the terms in the text string given the unique
> identifier.
>
> (My basic reason is to do a sort of document similarity between the text
> string and some other text string, doing a boolean query with
> a number of SHOULD clauses, if this makes sense; I'm welcome to
> suggestions of better ways to do this)
>
> Donna L. Gresh
>

Re: Issue while parsing XML files due to control characters, help appreciated.

Posted by Doron Cohen <DO...@il.ibm.com>.
Lokeya <lo...@gmail.com> wrote on 21/03/2007 22:09:06:

>
> Initially I was writing into the index 700,000 times. I changed the code
> to write only 70 times, which means I am putting a lot of data into an
> array list, adding it to a doc, and indexing in one shot. That is where
> the improvement came from. To be precise, IndexWriter now adds a document
> 70 times vs. 700,000 times.

I was just concluding that from the code you provided.

So now your index should have only 70 documents, right?

(Search "granularity" would be 1/70 of collection size, is that really
useful?)

If now you open the index writer once, add all those 70 documents, call
optimize once, close index writer once, that's the right way.

> If this is not clear, please let me know. I haven't pasted the latest code,
> where I have fixed the lock issue as well; if required I can do that.
>
> Thanks everyone for quick turnaround and it really helped me a lot.
>
>
> Doron Cohen wrote:
> >
> > Lokeya <lo...@gmail.com> wrote on 18/03/2007 13:19:45:
> >
> >>
> >> Yep I did that, and now my code looks as follows.
> >> The time taken for indexing one file is now
> >> => Elapsed Time in Minutes :: 0.3531
> >> which is really great
> >
> > I am jumping in late, so apologies if I am missing something.
> > However I don't understand where this improvement came
> > from. (more below)
> >
> >>
> >> //Create an Index under LUCENE
> >> IndexWriter writer = new IndexWriter("./LUCENE", new
> >>                                      StopStemmingAnalyzer(),
> >>                                      false);
> >> Document doc = new Document();
> >>
> >> //Get Array List elements and add them as fields to doc
> >> for(int k=0; k < alist_Title.size(); k++)             {
> >>   doc.add(new Field("Title",
> >>                     alist_Title.get(k).toString(),
> >>                     Field.Store.YES,
> >>                     Field.Index.UN_TOKENIZED));
> >> }
> >>
> >> for(int k=0; k < alist_Descr.size(); k++)             {
> >>   doc.add(new Field("Description",
> >>                     alist_Descr.get(k).toString(),
> >>                     Field.Store.YES,
> >>                     Field.Index.UN_TOKENIZED));
> >> }
> >>
> >> //Add the document created out of those fields to
> >> // the IndexWriter, which will create an index
> >> writer.addDocument(doc);
> >> writer.optimize();
> >> writer.close();
> >>
> >> long elapsedTimeMillis = System.currentTimeMillis()-startTime;
> >> System.out.println("Elapsed Time for  "+
> >>                    dirName +" :: " + elapsedTimeMillis);
> >> float elapsedTimeMin = elapsedTimeMillis/(60*1000F);
> >> System.out.println("Elapsed Time in Minutes ::  "+
> >>                    elapsedTimeMin);
> >
> > the code listed above is, still, for each indexed document,
> > doing this:
> >
> >   1. open index writer
> >   2. create a document (and add fields to it)
> >   3. add the doc to the index
> >   4. optimize the index
> >   5. close the index writer
> >
> > As already mentioned here this is very non optimal.
> > So I don't understand where the improvement came from.
> >
> > It should be done like this:
> >
> >   1. open/create the index
> >   2. for each valid doc input {
> >        a. create a document (add fields)
> >        b. add the doc to the index
> >      }
> >   3. optimize the index (optionally)
> >   4. close the index
> >
> > It would probably also help both coding and discussion to capture
> > some logic in methods, e.g.
> >
> > IndexWriter createIndex();
> > IndexWriter openIndex();
> > void closeIndex();
> > Document createDocument(String oai_citeseer_fileName)
> >
> >> but after processing 4 dumpfiles(which means
> >> 40,000 small xml's), I get :
> >>
> >> caught a class java.io.IOException
> >>  40114  with message: Lock obtain timed out:
> >> Lock@/tmp/lucene-e36d478e46e88f594d57a03c10ee0b3b-write.lock
> >>
> >>
> >> This is the new issue now. What could be the reason? I am surprised
> >> because I am only writing to the index under ./LUCENE/ every time and
> >> not doing anything else with the index (of course, to avoid such
> >> synchronization issues!)
> >
> > If the code listed above is getting this exception, and this runs on
> > Windows, then perhaps this is related to issue 665 - intensive IO
> > activity, write lock acquired and released for each and every added
> > document. If so, this too would be avoided by opening the index just
once.
> >
> > Last comment - I did not understand if a single index is used for all
> > 70 metadata files, or each one has its own index, totalling 70 indexes.
> > I would think it should be only one index for all, and then you better
> > avoid calling optimize 70 times - that would waste time too. Better
> > organize the code like this:
> >
> > 1. create/open index
> > 2. for each meta data file {
> >    a. for each "valid element" in meta data file {
> >         a.1 create document
> >         a.2 add doc to index
> >       }
> >    }
> > 3. optimize
> > 4. close index
> >
> > Regards,
> > Doron




Re: Issue while parsing XML files due to control characters, help appreciated.

Posted by Lokeya <lo...@gmail.com>.
Initially I was writing into the index 700,000 times. I changed the code to
write only 70 times, which means I am putting a lot of data into an array
list, adding it to a doc, and indexing in one shot. That is where the
improvement came from. To be precise, IndexWriter now adds a document 70
times vs. 700,000 times.

If this is not clear, please let me know. I haven't pasted the latest code,
where I have fixed the lock issue as well; if required I can do that.

Thanks everyone for quick turnaround and it really helped me a lot.


Doron Cohen wrote:
> 
> Lokeya <lo...@gmail.com> wrote on 18/03/2007 13:19:45:
> 
>>
>> Yep I did that, and now my code looks as follows.
>> The time taken for indexing one file is now
>> => Elapsed Time in Minutes :: 0.3531
>> which is really great
> 
> I am jumping in late, so apologies if I am missing something.
> However I don't understand where this improvement came
> from. (more below)
> 
>>
>> //Create an Index under LUCENE
>> IndexWriter writer = new IndexWriter("./LUCENE", new
>>                                      StopStemmingAnalyzer(),
>>                                      false);
>> Document doc = new Document();
>>
>> //Get Array List elements and add them as fields to doc
>> for(int k=0; k < alist_Title.size(); k++)             {
>>   doc.add(new Field("Title",
>>                     alist_Title.get(k).toString(),
>>                     Field.Store.YES,
>>                     Field.Index.UN_TOKENIZED));
>> }
>>
>> for(int k=0; k < alist_Descr.size(); k++)             {
>>   doc.add(new Field("Description",
>>                     alist_Descr.get(k).toString(),
>>                     Field.Store.YES,
>>                     Field.Index.UN_TOKENIZED));
>> }
>>
>> //Add the document created out of those fields to
>> // the IndexWriter, which will create an index
>> writer.addDocument(doc);
>> writer.optimize();
>> writer.close();
>>
>> long elapsedTimeMillis = System.currentTimeMillis()-startTime;
>> System.out.println("Elapsed Time for  "+
>>                    dirName +" :: " + elapsedTimeMillis);
>> float elapsedTimeMin = elapsedTimeMillis/(60*1000F);
>> System.out.println("Elapsed Time in Minutes ::  "+
>>                    elapsedTimeMin);
> 
> the code listed above is, still, for each indexed document,
> doing this:
> 
>   1. open index writer
>   2. create a document (and add fields to it)
>   3. add the doc to the index
>   4. optimize the index
>   5. close the index writer
> 
> As already mentioned here this is very non optimal.
> So I don't understand where the improvement came from.
> 
> It should be done like this:
> 
>   1. open/create the index
>   2. for each valid doc input {
>        a. create a document (add fields)
>        b. add the doc to the index
>      }
>   3. optimize the index (optionally)
>   4. close the index
> 
> It would probably also help both coding and discussion to capture
> some logic in methods, e.g.
> 
> IndexWriter createIndex();
> IndexWriter openIndex();
> void closeIndex();
> Document createDocument(String oai_citeseer_fileName)
> 
>> but after processing 4 dumpfiles(which means
>> 40,000 small xml's), I get :
>>
>> caught a class java.io.IOException
>>  40114  with message: Lock obtain timed out:
>> Lock@/tmp/lucene-e36d478e46e88f594d57a03c10ee0b3b-write.lock
>>
>>
>> This is the new issue now. What could be the reason? I am surprised
>> because I am only writing to the index under ./LUCENE/ every time and not
>> doing anything else with the index (of course, to avoid such
>> synchronization issues!)
> 
> If the code listed above is getting this exception, and this runs on
> Windows, then perhaps this is related to issue 665 - intensive IO
> activity, write lock acquired and released for each and every added
> document. If so, this too would be avoided by opening the index just once.
> 
> Last comment - I did not understand if a single index is used for all
> 70 metadata files, or each one has its own index, totalling 70 indexes.
> I would think it should be only one index for all, and then you better
> avoid calling optimize 70 times - that would waste time too. Better
> organize the code like this:
> 
> 1. create/open index
> 2. for each meta data file {
>    a. for each "valid element" in meta data file {
>         a.1 create document
>         a.2 add doc to index
>       }
>    }
> 3. optimize
> 4. close index
> 
> Regards,
> Doron
> 
> 
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9608570
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Obtaining the (indexed) terms in a field in a particular document

Posted by Donna L Gresh <gr...@us.ibm.com>.
My apologies if this is a simple question--

How can I get all the (stemmed and stop words removed, etc.) terms in a 
particular field of a particular document?

Suppose my documents each consist of two fields, one with the name "my_id" 
and a unique identifier, and the other being some text string consisting
of a number of words.
I'd like to get all the terms in the text string given the unique 
identifier.

(My basic reason is to do a sort of document similarity between the text 
string and some other text string, doing a boolean query with
a number of SHOULD clauses, if this makes sense; I'm welcome to 
suggestions of better ways to do this)

Donna L. Gresh

Re: Issue while parsing XML files due to control characters, help appreciated.

Posted by Doron Cohen <DO...@il.ibm.com>.
Lokeya <lo...@gmail.com> wrote on 18/03/2007 13:19:45:

>
> Yep I did that, and now my code looks as follows.
> The time taken for indexing one file is now
> => Elapsed Time in Minutes :: 0.3531
> which is really great

I am jumping in late, so apologies if I am missing something.
However I don't understand where this improvement came
from. (more below)

>
> //Create an Index under LUCENE
> IndexWriter writer = new IndexWriter("./LUCENE", new
>                                      StopStemmingAnalyzer(),
>                                      false);
> Document doc = new Document();
>
> //Get Array List elements and add them as fields to doc
> for(int k=0; k < alist_Title.size(); k++)             {
>   doc.add(new Field("Title",
>                     alist_Title.get(k).toString(),
>                     Field.Store.YES,
>                     Field.Index.UN_TOKENIZED));
> }
>
> for(int k=0; k < alist_Descr.size(); k++)             {
>   doc.add(new Field("Description",
>                     alist_Descr.get(k).toString(),
>                     Field.Store.YES,
>                     Field.Index.UN_TOKENIZED));
> }
>
> //Add the document created out of those fields to
> // the IndexWriter, which will create an index
> writer.addDocument(doc);
> writer.optimize();
> writer.close();
>
> long elapsedTimeMillis = System.currentTimeMillis()-startTime;
> System.out.println("Elapsed Time for  "+
>                    dirName +" :: " + elapsedTimeMillis);
> float elapsedTimeMin = elapsedTimeMillis/(60*1000F);
> System.out.println("Elapsed Time in Minutes ::  "+
>                    elapsedTimeMin);

the code listed above is, still, for each indexed document,
doing this:

  1. open index writer
  2. create a document (and add fields to it)
  3. add the doc to the index
  4. optimize the index
  5. close the index writer

As already mentioned here this is very non optimal.
So I don't understand where the improvement came from.

It should be done like this:

  1. open/create the index
  2. for each valid doc input {
       a. create a document (add fields)
       b. add the doc to the index
     }
  3. optimize the index (optionally)
  4. close the index

It would probably also help both coding and discussion to capture
some logic in methods, e.g.

IndexWriter createIndex();
IndexWriter openIndex();
void closeIndex();
Document createDocument(String oai_citeseer_fileName)

> but after processing 4 dumpfiles(which means
> 40,000 small xml's), I get :
>
> caught a class java.io.IOException
>  40114  with message: Lock obtain timed out:
> Lock@/tmp/lucene-e36d478e46e88f594d57a03c10ee0b3b-write.lock
>
>
> This is the new issue now. What could be the reason? I am surprised
> because I am only writing to the index under ./LUCENE/ every time and not
> doing anything else with the index (of course, to avoid such
> synchronization issues!)

If the code listed above is getting this exception, and this runs on
Windows, then perhaps this is related to issue 665 - intensive IO
activity, write lock acquired and released for each and every added
document. If so, this too would be avoided by opening the index just once.

Last comment - I did not understand if a single index is used for all
70 metadata files, or each one has its own index, totalling 70 indexes.
I would think it should be only one index for all, and then you better
avoid calling optimize 70 times - that would waste time too. Better
organize the code like this:

1. create/open index
2. for each meta data file {
   a. for each "valid element" in meta data file {
        a.1 create document
        a.2 add doc to index
      }
   }
3. optimize
4. close index
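A sketch of that structure in Java (Lucene 2.0-era API; StopStemmingAnalyzer
and createDocument() are the names already used in this thread, assumed here
to exist):

    import java.io.File;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class BuildIndex {
        static void indexAll(String[] metadataDirs) throws Exception {
            // 1. create/open the index -- once
            IndexWriter writer = new IndexWriter("./LUCENE", new StopStemmingAnalyzer(), true);
            try {
                // 2. one Lucene document per valid record, all into the same writer
                for (int i = 0; i < metadataDirs.length; i++) {
                    String[] records = new File(metadataDirs[i]).list();
                    for (int j = 0; records != null && j < records.length; j++) {
                        // createDocument(path) is the parse-one-record helper suggested above
                        Document doc = createDocument(metadataDirs[i] + "/" + records[j]);
                        if (doc != null) {
                            writer.addDocument(doc);
                        }
                    }
                }
                writer.optimize();  // 3. optimize -- once
            } finally {
                writer.close();     // 4. close -- once
            }
        }
    }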

Regards,
Doron




Re: Issue while parsing XML files due to control characters, help appreciated.

Posted by Lokeya <lo...@gmail.com>.
Thanks all for the valuable suggestions.

The lock issue also got resolved, and all 7 lakh (700,000) files are indexed
in around 85 minutes, which is like wow!

To get around the lock issue, I followed the suggestion given here:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/200601.mbox/%3CPKEIKOCKCHNIIHNECPCCGELACCAA.koji.sekiguchi@m4.dion.ne.jp%3E

There is another approach as well, though it has certain issues:

try
{
    writer.close();
}
finally
{
    FSDirectory fs = FSDirectory.getDirectory("./LUCENE", false);
    if (IndexReader.isLocked(fs))  // no stray ';' here, or unlock() runs unconditionally
    {
        IndexReader.unlock(fs);
    }
}

Thanks a lot again.


Lokeya wrote:
> 
> I will try to explain what I am trying to do, like an algorithm:
> 
> 1. There are 70 dump files, each with 10,000 record tags of the kind I
> have pasted in my earlier mails. I split every dump file and create 10,000
> XML files, each with a single <record> and its child tags. This is because
> there are some parsing issues due to Unicode characters, and we have
> decided to process only valid records.
> 
> 2. With the above idea, I created 70 directories each with the name of
> dumpfile like : oai_citeseer_1/ etc and under this I will have 10,000
> XML's being extracted out using some Java program.
> 
> 3. So now coming to my program, I do the following:
>     a. Take Each Dumpfile directory.
>     b. Get all child xml files(would be 10,000 in number)
>     c. Get the <title> and <description> tag values and put them in an
> array list.
>     d. Create a new doc. 
>     e. In a for loop, add all the title and description values (which we
> have stored in ArrayLists) as fields; this loop will add 10,000 x 2 =
> 20,000 fields for both tags.
>    f. Add this document to the index writer.
>    g. Optimize and close it.
> 
> Hope this clearly shows how I'm trying to index the values of those tags.
> 
> Additional Info: I am using lucene-core-2.0.0.jar.
> 
> StackTrace :
> Indexer.java:75)
> java.io.IOException: Lock obtain timed out:
> Lock@/tmp/lucene-e36d478e46e88f594d57a03c10ee0b3b-write.lock
>         at org.apache.lucene.store.Lock.obtain(Lock.java:56)
>         at
> org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:254)
>         at
> org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:204)
>         at edu.vt.searchengine.Indexer.main(Indexer.java:110)
> 
> 
> Erick Erickson wrote:
>> 
>> I'm not sure what the lock issue is. What version of Lucene are you
>> using? And what is your filesystem like? There are some known locking
>> issues with some versions of Lucene and some filesystems,
>> particularly NFS mounts as I remember... It would help if you told
>> us the entire stack trace rather than just the exception.
>> 
>> But I really don't understand your index structure. You are still
>> opening and closing Lucene every time you add a document.
>> And each document contains the titles and descriptions
>> of all the files in the directory. Is this actually your intent or
>> do you really want to index one Lucene document per
>> title/description pair? If the latter, you want a loop something like...
>> 
>> open writer
>> for each file
>>     parse file
>>     Doc doc = new Document();
>>     doc.add(field, val);
>>     doc.add(field2, val2);
>>     writer.addDocument(doc);
>> end for
>> 
>> writer.close();
>> 
>> There's no reason to close the writer until the last xml file has been
>> indexed......
>> 
>> If that re-structuring causes your lock error to go away, I'll be
>> baffled because it shouldn't (unless your version of Lucene
>> and filesystem is one of the "interesting" ones).
>> 
>> But it'll make your code simpler...
>> 
>> Best
>> Erick
>> 
>> On 3/18/07, Lokeya <lo...@gmail.com> wrote:
>>>
>>>
>>> Yep I did that, and now my code looks as follows. The time taken for
>>> indexing one file is now => Elapsed Time in Minutes :: 0.3531, which is
>>> really great, but after processing 4 dump files (which means 40,000
>>> small XMLs), I get:
>>>
>>> caught a class java.io.IOException
>>> 40114  with message: Lock obtain timed out:
>>> Lock@/tmp/lucene-e36d478e46e88f594d57a03c10ee0b3b-write.lock
>>>
>>>
>>> This is the new issue now. What could be the reason? I am surprised
>>> because I am only writing to the index under ./LUCENE/ every time and not
>>> doing anything else with the index (of course, to avoid such
>>> synchronization issues!)
>>>
>>> for (int i=0; i<children1.length; i++)
>>> {
>>>     // Get filename of file or directory
>>>     String filename = children1[i];
>>>
>>>     if (!filename.startsWith("oai_citeseer"))
>>>     {
>>>         continue;
>>>     }
>>>     String dirName = filename;
>>>     System.out.println("File => " + filename);
>>>
>>>     NodeList nl = null;
>>>     // Testing the creation of the index
>>>     try
>>>     {
>>>         File dir = new File(dirName);
>>>         String[] children = dir.list();
>>>         if (children == null)
>>>         {
>>>             // Either dir does not exist or is not a directory
>>>         }
>>>         else
>>>         {
>>>             ArrayList alist_Title = new ArrayList(children.length);
>>>             ArrayList alist_Descr = new ArrayList(children.length);
>>>             System.out.println("Size of Array List : " + alist_Title.size());
>>>             String title = null;
>>>             String descr = null;
>>>
>>>             long startTime = System.currentTimeMillis();
>>>             System.out.println(dir + " start time ==> " + System.currentTimeMillis());
>>>
>>>             for (int ii=0; ii<children.length; ii++)
>>>             {
>>>                 // Get filename of file or directory
>>>                 String file = children[ii];
>>>                 System.out.println(" Name of Dir : Record ~~~~~~~ " + dirName + " : " + file);
>>>                 //System.out.println("The name of file parsed now ==> " + file);
>>>
>>>                 // Get the value of the record's metadata tag
>>>                 nl = ValueGetter.getNodeList(filename + "/" + file, "metadata");
>>>
>>>                 if (nl == null)
>>>                 {
>>>                     System.out.println("Error shouldn't be thrown ...");
>>>                     alist_Title.add(ii, "title");
>>>                     alist_Descr.add(ii, "descr");
>>>                     continue;
>>>                 }
>>>
>>>                 // Get the metadata elements (title and description tags) from the dump file
>>>                 ValueGetter vg = new ValueGetter();
>>>
>>>                 // Get the extracted tags Title, Identifier and Description
>>>                 try
>>>                 {
>>>                     title = vg.getNodeVal(nl, "dc:title");
>>>                     alist_Title.add(ii, title);
>>>                     descr = vg.getNodeVal(nl, "dc:description");
>>>                     alist_Descr.add(ii, descr);
>>>                 }
>>>                 catch (Exception ex)
>>>                 {
>>>                     ex.printStackTrace();
>>>                     System.out.println("Excep ==> " + ex);
>>>                 }
>>>             } // End of for
>>>
>>>             // Create an index under LUCENE
>>>             IndexWriter writer = new IndexWriter("./LUCENE", new StopStemmingAnalyzer(), false);
>>>             Document doc = new Document();
>>>
>>>             // Get ArrayList elements and add them as fields to doc
>>>             for (int k=0; k < alist_Title.size(); k++)
>>>             {
>>>                 doc.add(new Field("Title", alist_Title.get(k).toString(),
>>>                         Field.Store.YES, Field.Index.UN_TOKENIZED));
>>>             }
>>>
>>>             for (int k=0; k < alist_Descr.size(); k++)
>>>             {
>>>                 doc.add(new Field("Description", alist_Descr.get(k).toString(),
>>>                         Field.Store.YES, Field.Index.UN_TOKENIZED));
>>>             }
>>>
>>>             // Add the document created out of those fields to the IndexWriter,
>>>             // which will create an index
>>>             writer.addDocument(doc);
>>>             writer.optimize();
>>>             writer.close();
>>>
>>>             long elapsedTimeMillis = System.currentTimeMillis() - startTime;
>>>             System.out.println("Elapsed Time for " + dirName + " :: " + elapsedTimeMillis);
>>>             float elapsedTimeMin = elapsedTimeMillis / (60*1000F);
>>>             System.out.println("Elapsed Time in Minutes :: " + elapsedTimeMin);
>>>         } // End of else
>>>     } // End of try
>>>     catch (Exception ee)
>>>     {
>>>         ee.printStackTrace();
>>>         System.out.println("caught a " + ee.getClass() + "\n with message: " + ee.getMessage());
>>>     }
>>>     System.out.println("Total Record ==> " + total_records);
>>> } // End of for
>>>
>>> Grant Ingersoll-5 wrote:
>>> >
>>> > Move index writer creation, optimization and closure outside of your
>>> > loop.  I would also use a SAX parser.  Take a look at the demo code
>>> > to see an example of indexing.
>>> >
>>> > Cheers,
>>> > Grant
>>> >
>>> > On Mar 18, 2007, at 12:31 PM, Lokeya wrote:
>>> >
>>> >>
>>> >>
>>> >> Erick Erickson wrote:
>>> >>>
>>> >>> Grant:
>>> >>>
>>> >>>  I think that "Parsing 70 files totally takes 80 minutes" really
>>> >>> means parsing 70 metadata files containing 10,000 XML
>>> >>> files each.....
>>> >>>
>>> >>> One Metadata File is split into 10,000 XML files which looks as
>>> >>> below:
>>> >>>
>>> >>> <root>
>>> >>>   <record>
>>> >>>     <header>
>>> >>>       <identifier>oai:CiteSeerPSU:1</identifier>
>>> >>>       <datestamp>1993-08-11</datestamp>
>>> >>>       <setSpec>CiteSeerPSUset</setSpec>
>>> >>>     </header>
>>> >>>     <metadata>
>>> >>>       <oai_citeseer:oai_citeseer
>>> >>>           xmlns:oai_citeseer="http://copper.ist.psu.edu/oai/oai_citeseer/"
>>> >>>           xmlns:dc="http://purl.org/dc/elements/1.1/"
>>> >>>           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>> >>>           xsi:schemaLocation="http://copper.ist.psu.edu/oai/oai_citeseer/
>>> >>>                               http://copper.ist.psu.edu/oai/oai_citeseer.xsd ">
>>> >>>         <dc:title>36 Problems for Semantic Interpretation</dc:title>
>>> >>>         <oai_citeseer:author name="Gabriele Scheler">
>>> >>>           <address>80290 Munchen, Germany</address>
>>> >>>           <affiliation>Institut fur Informatik; Technische Universitat Munchen</affiliation>
>>> >>>         </oai_citeseer:author>
>>> >>>         <dc:subject>Gabriele Scheler 36 Problems for Semantic Interpretation</dc:subject>
>>> >>>         <dc:description>This paper presents a collection of problems for
>>> >>>           natural language analysis derived mainly from theoretical
>>> >>>           linguistics. Most of these problems present major obstacles for
>>> >>>           computational systems of language interpretation. The set of given
>>> >>>           sentences can easily be scaled up by introducing more examples per
>>> >>>           problem. The construction of computational systems could benefit
>>> >>>           from such a collection, either using it directly for training and
>>> >>>           testing or as a set of benchmarks to qualify the performance of a
>>> >>>           NLP system. 1 Introduction The main part of this paper consists of
>>> >>>           a collection of problems for semantic analysis of natural language.
>>> >>>           The problems are arranged in the following way: example sentences,
>>> >>>           concise description of the problem, keyword for the type of
>>> >>>           problem. The sources (first appearance in print) of the sentences
>>> >>>           have been left out, because they are sometimes hard to track and
>>> >>>           will usually not be of much use, as they indicate a starting-point
>>> >>>           of discussion only. The keywords howeve...
>>> >>>         </dc:description>
>>> >>>         <dc:contributor>The Pennsylvania State University CiteSeer Archives</dc:contributor>
>>> >>>         <dc:publisher>unknown</dc:publisher>
>>> >>>         <dc:date>1993-08-11</dc:date>
>>> >>>         <dc:format>ps</dc:format>
>>> >>>         <dc:identifier>http://citeseer.ist.psu.edu/1.html</dc:identifier>
>>> >>>         <dc:source>ftp://flop.informatik.tu-muenchen.de/pub/fki/fki-179-93.ps.gz</dc:source>
>>> >>>         <dc:language>en</dc:language>
>>> >>>         <dc:rights>unrestricted</dc:rights>
>>> >>>       </oai_citeseer:oai_citeseer>
>>> >>>     </metadata>
>>> >>>   </record>
>>> >>> </root>
>>> >>>
>>> >>>
>>> >>> From the above I will extract the Title and the Description tags
>>> >>> to index.
>>> >>>
>>> >>> Code to do this:
>>> >>>
>>> >>> 1. I have 70 directories with the name like oai_citeseerXYZ/
>>> >>> 2. Under each of above directory, I have 10,000 xml files each
>>> having
>>> >>> above xml data.
>>> >>> 3. Program does the following
>>> >>>
>>> >>> File dir = new File(dirName);
>>> >>> String[] children = dir.list();
>>> >>> if (children == null) {
>>> >>>     // Either dir does not exist or is not a directory
>>> >>> }
>>> >>> else {
>>> >>>     for (int ii=0; ii<children.length; ii++) {
>>> >>>         // Get filename of file or directory
>>> >>>         String file = children[ii];
>>> >>>         //System.out.println("The name of file parsed now ==> " + file);
>>> >>>         nl = ReadDump.getNodeList(filename + "/" + file, "metadata");
>>> >>>         if (nl == null) {
>>> >>>             //System.out.println("Error shouldn't be thrown ...");
>>> >>>         }
>>> >>>         // Get the metadata element tags from the xml file
>>> >>>         ReadDump rd = new ReadDump();
>>> >>>
>>> >>>         // Get the extracted tags Title, Identifier and Description
>>> >>>         ArrayList alist_Title = rd.getElements(nl, "dc:title");
>>> >>>         ArrayList alist_Descr = rd.getElements(nl, "dc:description");
>>> >>>
>>> >>>         // Create an index under DIR
>>> >>>         IndexWriter writer = new IndexWriter("./FINAL/",
>>> >>>                 new StopStemmingAnalyzer(), false);
>>> >>>         Document doc = new Document();
>>> >>>
>>> >>>         // Get ArrayList elements and add them as fields to doc
>>> >>>         for (int k=0; k < alist_Title.size(); k++) {
>>> >>>             doc.add(new Field("Title", alist_Title.get(k).toString(),
>>> >>>                     Field.Store.YES, Field.Index.UN_TOKENIZED));
>>> >>>         }
>>> >>>
>>> >>>         for (int k=0; k < alist_Descr.size(); k++) {
>>> >>>             doc.add(new Field("Description", alist_Descr.get(k).toString(),
>>> >>>                     Field.Store.YES, Field.Index.UN_TOKENIZED));
>>> >>>         }
>>> >>>
>>> >>>         // Add the document created out of those fields to the IndexWriter,
>>> >>>         // which will create an index
>>> >>>         writer.addDocument(doc);
>>> >>>         writer.optimize();
>>> >>>         writer.close();
>>> >>>     } // End of for
>>> >>> } // End of else
>>> >>>
>>> >>>
>>> >>> This is the main file which does indexing.
>>> >>>
>>> >>> Hope this will give you an idea.
>>> >>>
>>> >>>
>>> >>> Lokeya:
>>> >>> Can you confirm my supposition? And I'd still post the code
>>> >>> Grant requested if you can.....
>>> >>>
>>> >>> So, you're talking about indexing 10,000 xml files in 2-3 hours,
>>> >>> 8 minutes or so which is spent reading/parsing, right? It'll be
>>> >>> important to know how much data you're indexing and now, so
>>> >>> the code snippet is doubly important....
>>> >>>
>>> >>> Erick
>>> >>>
>>> >>> On 3/18/07, Grant Ingersoll <gs...@apache.org> wrote:
>>> >>>>
>>> >>>> Can you post the relevant indexing code?  Are you doing things like
>>> >>>> optimizing after every file?  Both the parsing and the indexing
>>> >>>> sound
>>> >>>> really long.  How big are these files?
>>> >>>>
>>> >>>> Also, I assume you machine is at least somewhat current, right?
>>> >>>>
>>> >>>> On Mar 18, 2007, at 1:00 AM, Lokeya wrote:
>>> >>>>
>>> >>>>>
>>> >>>>> Thanks for your reply. I tried to check the I/O and parsing time
>>> >>>>> separately from the indexing time. I observed that I/O and parsing
>>> >>>>> for 70 files takes 80 minutes in total, whereas when I combine this
>>> >>>>> with indexing, a single metadata file takes nearly 2 to 3 hours. So
>>> >>>>> it looks like the IndexWriter takes the time, and especially so when
>>> >>>>> we are appending to the index file.
>>> >>>>>
>>> >>>>> So what is the best approach to handle this?
>>> >>>>>
>>> >>>>> Thanks in Advance.
>>> >>>>>
>>> >>>>>
>>> >>>>> Erick Erickson wrote:
>>> >>>>>>
>>> >>>>>> See below...
>>> >>>>>>
>>> >>>>>> On 3/17/07, Lokeya <lo...@gmail.com> wrote:
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> Hi,
>>> >>>>>>>
>>> >>>>>>> I am trying to index the content of XML files that are
>>> >>>>>>> basically the metadata collected from a website hosting a huge
>>> >>>>>>> collection of documents. This metadata XML contains control
>>> >>>>>>> characters, which cause errors when parsing with the DOM parser.
>>> >>>>>>> I tried encoding = UTF-8, but it doesn't seem to cover all the
>>> >>>>>>> Unicode characters and I get an error. When I tried UTF-16, I got
>>> >>>>>>> a "content not allowed in prolog" error. So my guess is that no
>>> >>>>>>> encoding is going to cover almost all Unicode characters. So I
>>> >>>>>>> tried splitting my metadata files into small files and processing
>>> >>>>>>> only the records that don't throw parsing errors.
>>> >>>>>>>
>>> >>>>>>> But by breaking each metadata file into smaller files I get
>>> >>>>>>> 10,000 XML files per metadata file. I have 70 metadata files, so
>>> >>>>>>> altogether that becomes 700,000 files. Processing them
>>> >>>>>>> individually takes a really long time with Lucene; my guess is
>>> >>>>>>> that the I/O is time consuming: opening every small XML file,
>>> >>>>>>> loading it into the DOM, extracting the required data, and
>>> >>>>>>> processing it.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> So why don't you measure and find out before trying to make the
>>> >>>>>> indexing step more efficient? You simply cannot optimize without
>>> >>>>>> knowing where you're spending your time. I can't tell you how
>>> >>>>>> often I've been wrong about "why my program was slow" <G>.
>>> >>>>>>
>>> >>>>>> In this case, it should be really simple. Just comment out the
>>> >>>>>> part where you index the data and run, say, one of your metadata
>>> >>>>>> files. I suspect that Cheolgoo Kang's response is cogent, and you
>>> >>>>>> indeed are spending your time parsing the XML. I further suspect
>>> >>>>>> that the problem is not disk IO, but the time spent parsing. But
>>> >>>>>> until you measure, you have no clue whether you should mess around
>>> >>>>>> with the Lucene parameters, or find another parser, or just live
>>> >>>>>> with it. Assuming that you comment out Lucene and things are still
>>> >>>>>> slow, the next step would be to just read in each file and NOT
>>> >>>>>> parse it to figure out whether it's the IO or the parsing.
>>> >>>>>>
>>> >>>>>> Then you can worry about how to fix it.
>>> >>>>>>
>>> >>>>>> Best
>>> >>>>>> Erick
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>> Qn 1: Any suggestions for reducing this indexing time? That
>>> >>>>>>> would be really great.
>>> >>>>>>>
>>> >>>>>>> Qn 2: Am I overlooking something in Lucene with respect to
>>> >>>>>>> indexing?
>>> >>>>>>>
>>> >>>>>>> Right now 12 metadata files take nearly 10 hours, which is really
>>> >>>>>>> a long time.
>>> >>>>>>>
>>> >>>>>>> Help Appreciated.
>>> >>>>>>>
>>> >>>>>>> Much Thanks.
>>> >>>>>>> --
>>> >>>>>>> View this message in context:
>>> >>>>>>> http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9526527
>>> >>>>>>> Sent from the Lucene - Java Users mailing list archive at
>>> >>>>>>> Nabble.com.
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>
>>> >>>>> --
>>> >>>>> View this message in context:
>>> >>>>> http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9536099
>>> >>>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>
>>> >>>> --------------------------
>>> >>>> Grant Ingersoll
>>> >>>> Center for Natural Language Processing
>>> >>>> http://www.cnlp.org
>>> >>>>
>>> >>>> Read the Lucene Java FAQ at
>>> >>>> http://wiki.apache.org/jakarta-lucene/LuceneFAQ
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>
>>> >>>
>>> >>
>>> >> --
>>> >> View this message in context:
>>> >> http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9540232
>>> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>> >>
>>> >>
>>> >>
>>> >
>>> > ------------------------------------------------------
>>> > Grant Ingersoll
>>> > http://www.grantingersoll.com/
>>> > http://lucene.grantingersoll.com
>>> > http://www.paperoftheweek.com/
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9542471
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>>
>>>
>>>
>> 
>> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9565759
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: Issue while parsing XML files due to control characters, help appreciated.

Posted by Lokeya <lo...@gmail.com>.
I will try to explain what I am trying to do, like an algorithm:

1. There are 70 dump files, each with 10,000 record tags of the kind I have
pasted in my earlier mails. I split every dump file and create 10,000 XML
files, each with a single <record> and its child tags. This is because there
are some parsing issues due to Unicode characters, and we have decided to
process only valid records.

2. With the above idea, I created 70 directories each with the name of
dumpfile like : oai_citeseer_1/ etc and under this I will have 10,000 XML's
being extracted out using some Java program.

3. So now coming to my program, I do the following:
    a. Take Each Dumpfile directory.
    b. Get all child xml files(would be 10,000 in number)
    c. Get the <title> and <description> tag values and put them in an array
list.
    d. Create a new doc. 
    e. In a for loop, add all the title and description values (which we
have stored in ArrayLists) as fields; this loop will add 10,000 x 2 = 20,000
fields for both tags.
   f. Add this document to the index writer.
   g. Optimize and close it.

Hope this clearly shows how I'm trying to index the values of those tags.

Additional Info: I am using lucene-core-2.0.0.jar.

StackTrace :
Indexer.java:75)
java.io.IOException: Lock obtain timed out:
Lock@/tmp/lucene-e36d478e46e88f594d57a03c10ee0b3b-write.lock
        at org.apache.lucene.store.Lock.obtain(Lock.java:56)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:254)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:204)
        at edu.vt.searchengine.Indexer.main(Indexer.java:110)
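
For what it's worth, a run that dies without closing the writer leaves the
lock file behind; the sketch below clears a stale lock with the Lucene 2.0
static helpers before reopening the writer, and is only safe when no other
process is actually writing to the index:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Check for and remove a stale write lock before opening IndexWriter.
    Directory dir = FSDirectory.getDirectory("./LUCENE", false);
    if (IndexReader.isLocked(dir)) {
        IndexReader.unlock(dir);    // forcibly deletes the stale lock file
    }
    dir.close();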


Erick Erickson wrote:
> 
> I'm not sure what the lock issue is. What version of Lucene are you
> using? And what is your filesystem like? There are some known locking
> issues with some versions of Lucene and some filesystems,
> particularly NFS mounts as I remember... It would help if you told
> us the entire stack trace rather than just the exception.
> 
> [...]

-- 
View this message in context: http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9543855
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Issue while parsing XML files due to control characters, help appreciated.

Posted by Erick Erickson <er...@gmail.com>.
I'm not sure what the lock issue is. What version of Lucene are you
using? And what is your filesystem like? There are some known locking
issues with some versions of Lucene and some filesystems,
particularly NFS mounts as I remember... It would help if you told
us the entire stack trace rather than just the exception.

But I really don't understand your index structure. You are still
opening and closing Lucene every time you add a document.
And each document contains the titles and descriptions
of all the files in the directory. Is this actually your intent or
do you really want to index one Lucene document per
title/description pair? If the latter, you want a loop something like...

open writer
for each file
    parse file
    Document doc = new Document();
    doc.add(field, val);
    doc.add(field2, val2);
    writer.addDocument(doc);
end for

writer.close();

There's no reason to close the writer until the last xml file has been
indexed......
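
Spelled out against the 2.0 API it might look like the sketch below
(parseTitle/parseDescription are placeholders for whatever XML extraction
you end up using):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class OneDocPerFileIndexer {
    public static void main(String[] args) throws Exception {
        // open the writer once, up front
        IndexWriter writer = new IndexWriter("./LUCENE", new StandardAnalyzer(), true);
        File[] files = new File(args[0]).listFiles();
        for (int i = 0; i < files.length; i++) {
            Document doc = new Document();    // one Lucene document per xml file
            doc.add(new Field("Title", parseTitle(files[i]),
                    Field.Store.YES, Field.Index.TOKENIZED));
            doc.add(new Field("Description", parseDescription(files[i]),
                    Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        writer.optimize();    // once, after the last document
        writer.close();
    }

    // placeholders -- substitute your DOM/SAX extraction here
    private static String parseTitle(File f) { return f.getName(); }
    private static String parseDescription(File f) { return ""; }
}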

If that re-structuring causes your lock error to go away, I'll be
baffled because it shouldn't (unless your version of Lucene
and filesystem is one of the "interesting" ones).

But it'll make your code simpler...

Best
Erick

On 3/18/07, Lokeya <lo...@gmail.com> wrote:
>
>
> Yep I did that, and now my code looks as follows.
> The time taken for indexing one file is now => Elapsed Time in Minutes ::
> 0.3531, which is really great, but after processing 4 dump files (which
> means 40,000 small xml's), I get:
>
> caught a class java.io.IOException
> 40114  with message: Lock obtain timed out:
> Lock@/tmp/lucene-e36d478e46e88f594d57a03c10ee0b3b-write.lock
>
> [...]

Re: Issue while parsing XML files due to control characters, help appreciated.

Posted by Lokeya <lo...@gmail.com>.
Yep I did that, and now my code looks as follows.
The time taken for indexing one file is now => Elapsed Time in Minutes ::
0.3531, which is really great, but after processing 4 dump files (which means
40,000 small xml's), I get:

caught a class java.io.IOException
 40114  with message: Lock obtain timed out:
Lock@/tmp/lucene-e36d478e46e88f594d57a03c10ee0b3b-write.lock


This is the new issue now. What could be the reason? I am surprised, because
I am only writing to the index under ./LUCENE/ every time and not doing
anything else with the index (of course, to avoid such synchronization
issues!)

		for (int i=0; i<children1.length; i++)
			{
				// Get filename of file or directory
				String filename = children1[i];
				
				if(!filename.startsWith("oai_citeseer"))
				{
					continue;
				}
				String dirName = filename;
				System.out.println("File => "+filename);

				NodeList nl = null;
				//Testing the creation of Index
				try
				{	
					File dir = new File(dirName);
					String[] children = dir.list();
					if (children == null) {
						// Either dir does not exist or is not a directory						
					} 
					else 
					{
						ArrayList alist_Title = new ArrayList(children.length); 
						ArrayList alist_Descr = new ArrayList(children.length);
						System.out.println("Size of Array List : "+alist_Title.size());
						String title = null;
						String descr = null;
					
						long startTime = System.currentTimeMillis();
						System.out.println(dir +" start time  ==> "+
System.currentTimeMillis());

						for (int ii=0; ii<children.length; ii++) 
						{							
							// Get filename of file or directory
							String file = children[ii];
							System.out.println(" Name of Dir : Record ~~~~~~~ " +dirName +" : "+
file);
							//System.out.println("The name of file parsed now ==> "+file);

							//Get the value of recXYZ's metadata tag
							nl = ValueGetter.getNodeList(filename+"/"+file, "metadata");

							if(nl == null)
							{
								System.out.println("Error shouldn't be thrown ...");
								alist_Title.add(ii,"title");
								alist_Descr.add(ii,"descr");
								continue;
							}	

							//Get the metadata element (title and description tags) from the dump file
							ValueGetter vg = new ValueGetter();

							//Get the Extracted Tags Title, Identifier and Description
							try
							{
								title = vg.getNodeVal(nl, "dc:title");
								alist_Title.add(ii,title);
								descr = vg.getNodeVal(nl, "dc:description");
								alist_Descr.add(ii,descr);
							}
							catch(Exception ex)
							{								
								ex.printStackTrace();
								System.out.println("Excep ==> "+ ex);
							}						

						}//End of For
						
						//Create an Index under LUCENE
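						// NB: this writer is opened, optimized and closed once per dump-file directory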

						IndexWriter writer = new IndexWriter("./LUCENE", new
StopStemmingAnalyzer(),false);
						Document doc = new Document();

						//Get the ArrayList elements and add them as fields to the doc
					
						for(int k=0; k < alist_Title.size(); k++)
						{
							doc.add(new Field("Title",alist_Title.get(k).toString(),
Field.Store.YES, Field.Index.UN_TOKENIZED));
						}

						for(int k=0; k < alist_Descr.size(); k++)
						{
							doc.add(new Field("Description",alist_Descr.get(k).toString(),
Field.Store.YES, Field.Index.UN_TOKENIZED));
						}

						//Add the document created out of those fields to the IndexWriter, which will create the index
						writer.addDocument(doc);
						writer.optimize();
						writer.close();

						long elapsedTimeMillis = System.currentTimeMillis()-startTime;
						System.out.println("Elapsed Time for  "+dirName +" :: " +
elapsedTimeMillis);
						float elapsedTimeMin = elapsedTimeMillis/(60*1000F);
						System.out.println("Elapsed Time in Minutes ::  "+ elapsedTimeMin);
						
					}//End of Else

				} //End of try
				catch (Exception ee)
				{
					ee.printStackTrace();
					System.out.println("caught a " + ee.getClass() + "\n with message: "+
ee.getMessage());
				}	
				System.out.println("Total Record ==> "+total_records);

			}// End of For

Grant Ingersoll-5 wrote:
> 
> Move index writer creation, optimization and closure outside of your  
> loop.  I would also use a SAX parser.  Take a look at the demo code  
> to see an example of indexing.
> 
> Cheers,
> Grant
> 
> On Mar 18, 2007, at 12:31 PM, Lokeya wrote:
> 
>>
>>
>> Erick Erickson wrote:
>>>
>>> Grant:
>>>
>>>  I think that "Parsing 70 files totally takes 80 minutes" really
>>> means parsing 70 metadata files containing 10,000 XML
>>> files each.....
>>>
>>> One Metadata File is split into 10,000 XML files which looks as  
>>> below:
>>>
>>> <root>
>>> 	<record>
>>> 	<header>
>>> 	<identifier>oai:CiteSeerPSU:1</identifier>
>>> 	<datestamp>1993-08-11</datestamp>
>>> 	<setSpec>CiteSeerPSUset</setSpec>
>>> 	</header>
>>> 		<metadata>
>>> 		<oai_citeseer:oai_citeseer
>>> xmlns:oai_citeseer="http://copper.ist.psu.edu/oai/oai_citeseer/"  
>>> xmlns:dc
>>> ="http://purl.org/dc/elements/1.1/"
>>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>> xsi:schemaLocation="http://copper.ist.psu.edu/oai/oai_citeseer/
>>> http://copper.ist.psu.edu/oai/oai_citeseer.xsd ">
>>> 		<dc:title>36 Problems for Semantic Interpretation</dc:title>
>>> 		<oai_citeseer:author name="Gabriele Scheler">
>>> 			<address>80290 Munchen , Germany</address>
>>> 			<affiliation>Institut fur Informatik; Technische Universitat
>>> Munchen</affiliation>
>>> 		</oai_citeseer:author>
>>> 		<dc:subject>Gabriele Scheler 36 Problems for Semantic
>>> Interpretation</dc:subject>
>>> 		<dc:description>This paper presents a collection of problems for  
>>> natural
>>> language analysisderived mainly from theoretical linguistics. Most of
>>> these problemspresent major obstacles for computational systems of
>>> language interpretation.The set of given sentences can easily be  
>>> scaled up
>>> by introducing moreexamples per problem. The construction of  
>>> computational
>>> systems couldbenefit from such a collection, either using it  
>>> directly for
>>> training andtesting or as a set of benchmarks to qualify the  
>>> performance
>>> of a NLPsystem.1 IntroductionThe main part of this paper consists  
>>> of a
>>> collection of problems for semanticanalysis of natural language. The
>>> problems are arranged in the following way:example sentencesconcise
>>> description of the problemkeyword for the type of problemThe sources
>>> (first appearance in print) of the sentences have been left  
>>> out,because
>>> they are sometimes hard to track and will usually not be of much  
>>> use,as
>>> they indicate a starting-point of discussion only. The keywords  
>>> howeve...
>>> 		</dc:description>
>>> 		<dc:contributor>The Pennsylvania State University CiteSeer
>>> Archives</dc:contributor>
>>> 		<dc:publisher>unknown</dc:publisher>
>>> 		<dc:date>1993-08-11</dc:date>
>>> 		<dc:format>ps</dc:format>
>>> 		<dc:identifier>http://citeseer.ist.psu.edu/1.html</dc:identifier>
>>> <dc:source>ftp://flop.informatik.tu-muenchen.de/pub/fki/ 
>>> fki-179-93.ps.gz</dc:source>
>>> 		<dc:language>en</dc:language>
>>> 		<dc:rights>unrestricted</dc:rights>
>>> 		</oai_citeseer:oai_citeseer>
>>> 		</metadata>
>>> 	</record>
>>> </root>
>>>
>>>
>>> From the above I will extract the Title and the Description tags  
>>> to index.
>>>
>>> Code to do this:
>>>
>>> 1. I have 70 directories with the name like oai_citeseerXYZ/
>>> 2. Under each of the above directories, I have 10,000 xml files each having
>>> above xml data.
>>> 3. Program does the following
>>>
>>> 					File dir = new File(dirName);
>>> 					String[] children = dir.list();
>>> 					if (children == null) {
>>> 						// Either dir does not exist or is not a directory
>>> 					}
>>> 					else
>>> 					{
>>> 						for (int ii=0; ii<children.length; ii++)
>>> 						{							
>>> 							// Get filename of file or directory
>>> 							String file = children[ii];
>>> 							//System.out.println("The name of file parsed now ==> "+file);
>>> 							nl = ReadDump.getNodeList(filename+"/"+file, "metadata");
>>> 							if(nl == null)
>>> 							{
>>> 								//System.out.println("Error shoudlnt be thrown ...");
>>> 							}	
>>> 							//Get the metadata element tags from xml file
>>> 							ReadDump rd = new ReadDump();
>>>
>>> 							//Get the Extracted Tags Title, Identifier and Description
>>> 							ArrayList alist_Title = rd.getElements(nl, "dc:title");
>>> 							ArrayList alist_Descr = rd.getElements(nl, "dc:description");
>>>
>>> 							//Create an Index under DIR
>>> 							IndexWriter writer = new IndexWriter("./FINAL/", new
>>> StopStemmingAnalyzer(),false);
>>> 							Document doc = new Document();
>>>
>>> 							//Get Array List Elements  and add them as fileds to doc
>>> 							for(int k=0; k < alist_Title.size(); k++)
>>> 							{
>>> 								doc.add(new Field("Title",alist_Title.get(k).toString(),
>>> Field.Store.YES, Field.Index.UN_TOKENIZED));
>>> 							}
>>> 							
>>> 							for(int k=0; k < alist_Descr.size(); k++)
>>> 							{
>>> 								doc.add(new Field("Description",alist_Descr.get(k).toString 
>>> (),
>>> Field.Store.YES, Field.Index.UN_TOKENIZED));
>>> 							}							
>>>
>>> 			//Add the document created out of those fields to the  
>>> IndexWriter which
>>> will create and index
>>> 							writer.addDocument(doc);
>>> 							writer.optimize();
>>> 							writer.close();
>>>                }
>>> 							
>>>
>>> This is the main file which does indexing.
>>>
>>> Hope this will give you an idea.
>>>
>>>
>>> Lokeya:
>>> Can you confirm my supposition? And I'd still post the code
>>> Grant requested if you can.....
>>>
>>> So, you're talking about indexing 10,000 xml files in 2-3 hours,
>>> 8 minutes or so which is spent reading/parsing, right? It'll be
>>> important to know how much data you're indexing and now, so
>>> the code snippet is doubly important....
>>>
>>> Erick
>>>
>>> On 3/18/07, Grant Ingersoll <gs...@apache.org> wrote:
>>>>
>>>> Can you post the relevant indexing code?  Are you doing things like
>>>> optimizing after every file?  Both the parsing and the indexing  
>>>> sound
>>>> really long.  How big are these files?
>>>>
>>>> Also, I assume you machine is at least somewhat current, right?
>>>>
>>>> On Mar 18, 2007, at 1:00 AM, Lokeya wrote:
>>>>
>>>>>
>>>>> Thanks for your reply. I tried to check if the I/O and Parsing is
>>>>> taking time
>>>>> separately and Indexing time also. I observed that I/O and Parsing
>>>>> 70 files
>>>>> totally takes 80 minutes where as when I combine this with Indexing
>>>>> for a
>>>>> single Metadata file it nearly 2 to 3 hours. So looks like
>>>>> IndexWriter takes
>>>>> time that too when we are appending to the Index file this happens.
>>>>>
>>>>> So what is the best approach to handle this?
>>>>>
>>>>> Thanks in Advance.
>>>>>
>>>>>
>>>>> Erick Erickson wrote:
>>>>>>
>>>>>> See below...
>>>>>>
>>>>>> On 3/17/07, Lokeya <lo...@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am trying to index the content from XML files which are
>>>>>>> basically the
>>>>>>> metadata collected from a website which have a huge collection of
>>>>>>> documents.
>>>>>>> This metadata xml has control characters which causes errors
>>>>>>> while trying
>>>>>>> to
>>>>>>> parse using the DOM parser. I tried to use encoding = UTF-8 but
>>>>>>> looks
>>>>>>> like
>>>>>>> it doesn't cover all the unicode characters and I get error. Also
>>>>>>> when I
>>>>>>> tried to use UTF-16, I am getting Prolog content not allowed
>>>>>>> here. So my
>>>>>>> guess is there is no enoding which is going to cover almost all
>>>>>>> unicode
>>>>>>> characters. So I tried to split my metadata files into small
>>>>>>> files and
>>>>>>> processing records which doesnt throw parsing error.
>>>>>>>
>>>>>>> But by breaking metadata file into smaller files I get, 10,000
>>>>>>> xml files
>>>>>>> per
>>>>>>> metadata file. I have 70 metadata files, so altogether it becomes
>>>>>>> 7,00,000
>>>>>>> files. Processing them individually takes really long time using
>>>>>>> Lucene,
>>>>>>> my
>>>>>>> guess is I/O is time consuing, like opening every small xml file
>>>>>>> loading
>>>>>>> in
>>>>>>> DOM extracting required data and processing.
>>>>>>
>>>>>>
>>>>>>
>>>>>> So why don't you measure and find out before trying to make the
>>>>>> indexing
>>>>>> step more efficient? You simply cannot optimize without knowing  
>>>>>> where
>>>>>> you're spending your time. I can't tell you how often I've been  
>>>>>> wrong
>>>>>> about
>>>>>> "why my program was slow" <G>.
>>>>>>
>>>>>> In this case, it should be really simple. Just comment out the
>>>>>> part where
>>>>>> you index the data and run, say, one of your metadata files.. I
>>>>>> suspect
>>>>>> that
>>>>>> Cheolgoo Kang's response is cogent, and you indeed are spending  
>>>>>> your
>>>>>> time parsing the XML. I further suspect that the problem is not
>>>>>> disk IO,
>>>>>> but the time spent parsing. But until you measure, you have no  
>>>>>> clue
>>>>>> whether you should mess around with the Lucene parameters, or find
>>>>>> another parser, or just live with it.. Assuming that you  
>>>>>> comment out
>>>>>> Lucene and things are still slow, the next step would be to just
>>>>>> read in
>>>>>> each file and NOT parse it to figure out whether it's the IO or  
>>>>>> the
>>>>>> parsing.
>>>>>>
>>>>>> Then you can worry about how to fix it..
>>>>>>
>>>>>> Best
>>>>>> Erick
>>>>>>
>>>>>>
>>>>>> Qn  1: Any suggestion to get this indexing time reduced? It  
>>>>>> would be
>>>>>> really
>>>>>>> great.
>>>>>>>
>>>>>>> Qn 2 : Am I overlooking something in Lucene with respect to
>>>>>>> indexing?
>>>>>>>
>>>>>>> Right now 12 metadata files take 10 hrs nearly which is really a
>>>>>>> long
>>>>>>> time.
>>>>>>>
>>>>>>> Help Appreciated.
>>>>>>>
>>>>>>> Much Thanks.
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://www.nabble.com/Issue-while-parsing-XML-files-due-to-
>>>>>>> control-characters%2C-help-appreciated.-tf3418085.html#a9526527
>>>>>>> Sent from the Lucene - Java Users mailing list archive at
>>>>>>> Nabble.com.
>>>>>>>
>>>>>>>
>>>>>>> ----------------------------------------------------------------- 
>>>>>>> ---
>>>>>>> -
>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> View this message in context: http://www.nabble.com/Issue-while-
>>>>> parsing-XML-files-due-to-control-characters%2C-help-appreciated.-
>>>>> tf3418085.html#a9536099
>>>>> Sent from the Lucene - Java Users mailing list archive at  
>>>>> Nabble.com.
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------- 
>>>>> --
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> Center for Natural Language Processing
>>>> http://www.cnlp.org
>>>>
>>>> Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/
>>>> LuceneFAQ
>>>>
>>>>
>>>>
>>>> -------------------------------------------------------------------- 
>>>> -
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>
>>>
>>
>> -- 
>> View this message in context: http://www.nabble.com/Issue-while- 
>> parsing-XML-files-due-to-control-characters%2C-help-appreciated.- 
>> tf3418085.html#a9540232
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
> 
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9542471
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: Issue while parsing XML files due to control characters, help appreciated.

Posted by Grant Ingersoll <gr...@gmail.com>.
Move index writer creation, optimization and closure outside of your  
loop.  I would also use a SAX parser.  Take a look at the demo code  
to see an example of indexing.
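
For instance (just an untested sketch; StopStemmingAnalyzer, ReadDump and
dirName are the names from your snippet, and I'm guessing at the rest):

import java.io.File;
import java.util.ArrayList;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.w3c.dom.NodeList;

// Open the writer once for the whole run, not once per file.
// 'false' appends to an existing index, as in your code.
IndexWriter writer = new IndexWriter("./FINAL/", new StopStemmingAnalyzer(), false);

File dir = new File(dirName);
String[] children = dir.list();
if (children != null) {
    for (int ii = 0; ii < children.length; ii++) {
        NodeList nl = ReadDump.getNodeList(dirName + "/" + children[ii], "metadata");
        if (nl == null) continue;
        ReadDump rd = new ReadDump();
        ArrayList titles = rd.getElements(nl, "dc:title");
        ArrayList descrs = rd.getElements(nl, "dc:description");

        Document doc = new Document();
        for (int k = 0; k < titles.size(); k++)
            doc.add(new Field("Title", titles.get(k).toString(),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
        for (int k = 0; k < descrs.size(); k++)
            doc.add(new Field("Description", descrs.get(k).toString(),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));

        writer.addDocument(doc);
    }
}
writer.optimize(); // once, at the very end
writer.close();    // likewise

(You probably also want Field.Index.TOKENIZED on Description if you mean
to search inside the text, but that's a separate issue.)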

Cheers,
Grant

On Mar 18, 2007, at 12:31 PM, Lokeya wrote:

> [snip]

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/





Re: Issue while parsing XML files due to control characters, help appreciated.

Posted by Lokeya <lo...@gmail.com>.

Erick Erickson wrote:
> 
> Grant:
> 
>  I think that "Parsing 70 files totally takes 80 minutes" really
> means parsing 70 metadata files containing 10,000 XML
> files each.....
> 
> One metadata file is split into 10,000 XML files, each of which looks as below:
> 
> <root>
> 	<record>
> 	<header>
> 	<identifier>oai:CiteSeerPSU:1</identifier>
> 	<datestamp>1993-08-11</datestamp>
> 	<setSpec>CiteSeerPSUset</setSpec>
> 	</header>
> 		<metadata>
> 		<oai_citeseer:oai_citeseer
> xmlns:oai_citeseer="http://copper.ist.psu.edu/oai/oai_citeseer/" xmlns:dc
> ="http://purl.org/dc/elements/1.1/"
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:schemaLocation="http://copper.ist.psu.edu/oai/oai_citeseer/
> http://copper.ist.psu.edu/oai/oai_citeseer.xsd ">   
> 		<dc:title>36 Problems for Semantic Interpretation</dc:title>   
> 		<oai_citeseer:author name="Gabriele Scheler">      
> 			<address>80290 Munchen , Germany</address>      
> 			<affiliation>Institut fur Informatik; Technische Universitat
> Munchen</affiliation>   
> 		</oai_citeseer:author>   
> 		<dc:subject>Gabriele Scheler 36 Problems for Semantic
> Interpretation</dc:subject>   
> 		<dc:description>This paper presents a collection of problems for natural
> language analysisderived mainly from theoretical linguistics. Most of
> these problemspresent major obstacles for computational systems of
> language interpretation.The set of given sentences can easily be scaled up
> by introducing moreexamples per problem. The construction of computational
> systems couldbenefit from such a collection, either using it directly for
> training andtesting or as a set of benchmarks to qualify the performance
> of a NLPsystem.1 IntroductionThe main part of this paper consists of a
> collection of problems for semanticanalysis of natural language. The
> problems are arranged in the following way:example sentencesconcise
> description of the problemkeyword for the type of problemThe sources
> (first appearance in print) of the sentences have been left out,because
> they are sometimes hard to track and will usually not be of much use,as
> they indicate a starting-point of discussion only. The keywords howeve...
> 		</dc:description>   
> 		<dc:contributor>The Pennsylvania State University CiteSeer
> Archives</dc:contributor>   
> 		<dc:publisher>unknown</dc:publisher>   
> 		<dc:date>1993-08-11</dc:date>   
> 		<dc:format>ps</dc:format>   
> 		<dc:identifier>http://citeseer.ist.psu.edu/1.html</dc:identifier>  
> <dc:source>ftp://flop.informatik.tu-muenchen.de/pub/fki/fki-179-93.ps.gz</dc:source>   
> 		<dc:language>en</dc:language>   
> 		<dc:rights>unrestricted</dc:rights>
> 		</oai_citeseer:oai_citeseer>
> 		</metadata>
> 	</record>
> </root>
> 
> 
> From the above I will extract the Title and the Description tags to index.
> 
> Code to do this:
> 
> 1. I have 70 directories with names like oai_citeseerXYZ/
> 2. Under each of the above directories, I have 10,000 xml files, each
> containing xml data as above.
> 3. The program does the following:
> 
> 					File dir = new File(dirName);
> 					String[] children = dir.list();
> 					if (children == null) {
> 						// Either dir does not exist or is not a directory
> 					} 
> 					else 
> 					{
> 						for (int ii=0; ii<children.length; ii++) 
> 						{							
> 							// Get filename of file or directory
> 							String file = children[ii];
> 							//System.out.println("The name of file parsed now ==> "+file);
> 							nl = ReadDump.getNodeList(filename+"/"+file, "metadata");
> 							if(nl == null)
> 							{
> 								//System.out.println("Error shouldn't be thrown ...");
> 							}	
> 							//Get the metadata element tags from xml file
> 							ReadDump rd = new ReadDump();
> 
> 							//Get the Extracted Tags Title, Identifier and Description
> 							ArrayList alist_Title = rd.getElements(nl, "dc:title");
> 							ArrayList alist_Descr = rd.getElements(nl, "dc:description");
> 
> 							//Create an Index under DIR
> 							IndexWriter writer = new IndexWriter("./FINAL/", new
> StopStemmingAnalyzer(),false);
> 							Document doc = new Document();
> 
> 							//Get the ArrayList elements and add them as fields to the doc
> 							for(int k=0; k < alist_Title.size(); k++)
> 							{
> 								doc.add(new Field("Title",alist_Title.get(k).toString(),
> Field.Store.YES, Field.Index.UN_TOKENIZED));
> 							}
> 							
> 							for(int k=0; k < alist_Descr.size(); k++)
> 							{
> 								doc.add(new Field("Description",alist_Descr.get(k).toString(),
> Field.Store.YES, Field.Index.UN_TOKENIZED));
> 							}							
> 
> 							//Add the document built from those fields to the IndexWriter,
> 							//which will create the index
> 							writer.addDocument(doc);
> 							writer.optimize();
> 							writer.close();
>                }
> 							
> 
> This is the main file which does indexing.
> 
> Hope this will give you an idea.
> 
> [snip]

-- 
View this message in context: http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9540232
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: Issue while parsing XML files due to control characters, help appreciated.

Posted by Erick Erickson <er...@gmail.com>.
Grant:

I think that "Parsing 70 files totally takes 80 minutes" really
means parsing 70 metadata files containing 10,000 XML
files each.....

Lokeya:
Can you confirm my supposition? And I'd still like you to post the code
Grant requested, if you can.....

So, you're talking about indexing 10,000 xml files in 2-3 hours,
of which 8 minutes or so is spent reading/parsing, right? It'll be
important to know how much data you're indexing and how, so
the code snippet is doubly important....

Erick

On 3/18/07, Grant Ingersoll <gs...@apache.org> wrote:
> [snip]

Re: Issue while parsing XML files due to control characters, help appreciated.

Posted by Grant Ingersoll <gs...@apache.org>.
Can you post the relevant indexing code?  Are you doing things like  
optimizing after every file?  Both the parsing and the indexing sound  
really long.  How big are these files?

Also, I assume your machine is at least somewhat current, right?

On Mar 18, 2007, at 1:00 AM, Lokeya wrote:

> [snip]

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ





Re: Issue while parsing XML files due to control characters, help appreciated.

Posted by Lokeya <lo...@gmail.com>.
Thanks for your reply. I tried timing the I/O and parsing separately from
the indexing. I observed that I/O and parsing of the 70 files takes 80
minutes in total, whereas when I combine this with indexing, a single
metadata file takes nearly 2 to 3 hours. So it looks like the IndexWriter
takes the time, and this happens when we are appending to the index file.

So what is the best approach to handle this?

Thanks in Advance.


Erick Erickson wrote:
> 
> [snip]

-- 
View this message in context: http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9536099
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: Issue while parsing XML files due to control characters, help appreciated.

Posted by Erick Erickson <er...@gmail.com>.
See below...

On 3/17/07, Lokeya <lo...@gmail.com> wrote:
>
> [snip]
> ... my guess is I/O is time consuming, like opening every small xml file,
> loading it in DOM, extracting the required data and processing.



So why don't you measure and find out before trying to make the indexing
step more efficient? You simply cannot optimize without knowing where
you're spending your time. I can't tell you how often I've been wrong about
"why my program was slow" <G>.

In this case, it should be really simple. Just comment out the part where
you index the data and run, say, one of your metadata files. I suspect that
Cheolgoo Kang's response is cogent, and you indeed are spending your
time parsing the XML. I further suspect that the problem is not disk IO,
but the time spent parsing. But until you measure, you have no clue
whether you should mess around with the Lucene parameters, or find
another parser, or just live with it. Assuming that you comment out
Lucene and things are still slow, the next step would be to just read in
each file and NOT parse it to figure out whether it's the IO or the parsing.

Then you can worry about how to fix it.
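
For instance, a rough way to split the two measurements (just an untested
sketch reusing the names from your snippet) would be:

long parseMs = 0, indexMs = 0;
for (int ii = 0; ii < children.length; ii++) {
    long t0 = System.currentTimeMillis();
    NodeList nl = ReadDump.getNodeList(dirName + "/" + children[ii], "metadata");
    // ... build the Document from nl exactly as you do now ...
    long t1 = System.currentTimeMillis();
    parseMs += (t1 - t0);
    // writer.addDocument(doc); // leave commented out for the parse-only run
    indexMs += (System.currentTimeMillis() - t1);
}
System.out.println("parse/IO: " + parseMs + " ms, index: " + indexMs + " ms");

If the parse-only run is still slow, replace the getNodeList() call with a
plain file read to separate the IO from the parsing.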

Best
Erick

