Posted to user@mahout.apache.org by Allan Roberto Avendano Sudario <aa...@fiec.espol.edu.ec> on 2009/07/02 01:20:41 UTC

Creating Vectors from Text

Regards Community,
Does anyone know how to run the code that creates vectors from text? Java
shows me the following when I try to run this:

 java -cp $CLASSPATH org.apache.mahout.utils.vectors.Driver --dir
~/Desktop/crawlSite/index  --field body --dictOut ~/Desktop/dict/dict.txt
--output ~/Desktop/dict/out.txt --max 50

....

 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
09/07/01 18:02:32 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.NullPointerException
        at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:111)
        at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:82)
        at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:25)
        at org.apache.mahout.utils.vectors.Driver.main(Driver.java:204)


Thanks!

-- 
Allan Avendaño S.
Home: 04 2 800 692
Cell: 09 700 42 48

Re: Creating Vectors from Text

Posted by sushil_kb <ba...@gmail.com>.
I created a ticket and attached the patch in Jira
https://issues.apache.org/jira/browse/MAHOUT-191

- sushil


Isabel Drost-4 wrote:
> 
> On Tue sushil_kb <ba...@gmail.com> wrote:
> 
>> It seems that the problem is that not all the documents in my
>> index have the field I am using to get term vectors from. I made
>> the following changes to make this work
> 
> Would you please be so kind as to open a jira ticket and attach your
> changes as patch?
> 
> Thanks,
> Isabel
> 
> 

-- 
View this message in context: http://www.nabble.com/Creating-Vectors-from-Text-tp24298643p26090851.html
Sent from the Mahout User List mailing list archive at Nabble.com.


Re: Creating Vectors from Text

Posted by Isabel Drost <is...@apache.org>.
On Tue sushil_kb <ba...@gmail.com> wrote:

> It seems that the problem is that not all the documents in my
> index have the field I am using to get term vectors from. I made
> the following changes to make this work

Would you please be so kind as to open a jira ticket and attach your
changes as patch?

Thanks,
Isabel

Re: Creating Vectors from Text

Posted by sushil_kb <ba...@gmail.com>.
It seems that the problem is that not all the documents in my index have
the field I am using to get term vectors from. I made the following
changes to make this work, but I am not sure if that's the right way. I
wanted to get this working so I could run LDA topic modeling on the output
of the Driver.

Index:
utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java
===================================================================
---
utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java
(revision 830343)
+++
utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java
(working copy)
@@ -42,7 +42,7 @@
         break;
       }
       //point.write(dataOut);
-      writer.append(new LongWritable(recNum++), point);
+      if(point!=null) writer.append(new LongWritable(recNum++), point);
 
     }
     return recNum;
Index:
utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java
===================================================================
---
utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java
(revision 830343)
+++
utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java
(working copy)
@@ -104,6 +104,10 @@
       try {
         indexReader.getTermFreqVector(doc, field, mapper);
         result = mapper.getVector();
+        
+        if (result == null)
+        	return null;
+        
         if (idField != null) {
           String id = indexReader.document(doc,
idFieldSelector).get(idField);
           result.setName(id);
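The guard the patch adds can be illustrated with a minimal, self-contained sketch (plain Java, no Mahout or Hadoop dependencies; the SequenceFile writer is simulated with a counter and String stands in for Vector, so names here are illustrative only):

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch of the null guard in the patch above: when the iterator
// yields null for documents that lack the term-vector field, the writer
// skips those entries instead of throwing a NullPointerException.
public class NullGuardSketch {

    // Stands in for SequenceFileVectorWriter.write(): appends each
    // non-null point and returns the number of records written.
    static long writeNonNull(List<String> points) {
        long recNum = 0;
        for (String point : points) {
            if (point != null) {
                // real code: writer.append(new LongWritable(recNum), point)
                recNum++;
            }
        }
        return recNum;
    }

    public static void main(String[] args) {
        // The middle document lacks the field, so its "vector" is null.
        List<String> points = Arrays.asList("vec0", null, "vec2");
        System.out.println(writeNonNull(points)); // prints 2
    }
}
```

Only the null entries are dropped; record numbering stays contiguous for the vectors that are actually written.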





sushil_kb wrote:
> 
> I am having the same problem as Allan. I checked out Mahout from trunk,
> tried to create term frequency vectors from a Lucene index, and ran into
> this:
> 
> 09/10/27 17:36:10 INFO lucene.Driver: Output File:
> /Users/shoeseal/DATA/luc2tvec.out
> 09/10/27 17:36:11 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 09/10/27 17:36:11 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.NullPointerException
> 	at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:109)
> 	at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:1)
> 	at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:40)
> 	at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:200)
> 
> I am running this from Eclipse (snow leopard with JDK 6), on an index that
> has field with stored term vectors..
> 
> my input parameters for Driver are: 
> --dir <path>/smallidx/ --output <path>/luc2tvec.out --idField id_field
>  --field field_with_TV --dictOut <path>/luc2tvec.dict --max 50  --weight
> tf
> 
> Luke shows the following info on the fields I am using:
>  id_field is indexed, stored, omit norms
>  field_with_TV is indexed, tokenized, stored, term vector
> 
> I can run the test LuceneIterableTest fine but when I run the Driver on my
> index I get into trouble. Any possible reasons for this behavior besides
> not having an index field with stored term vector?
> 
> Thanks.
> - sushil
> 
> 
> 
> 
> Grant Ingersoll-6 wrote:
>> 
>> 
>> On Jul 2, 2009, at 12:09 PM, Allan Roberto Avendano Sudario wrote:
>> 
>>> Regards,
>>> This is the entire exception message:
>>>
>>>
>>> java -cp $JAVACLASSPATH org.apache.mahout.utils.vectors.Driver --dir
>>> /home/hadoop/Desktop/<urls>/index  --field content  --dictOut
>>> /home/hadoop/Desktop/dictionary/dict.txt --output
>>> /home/hadoop/Desktop/dictionary/out.txt --max 50 --norm 2
>>>
>>>
>>> 09/07/02 09:35:47 INFO vectors.Driver: Output File:
>>> /home/hadoop/Desktop/dictionary/out.txt
>>> 09/07/02 09:35:47 INFO util.NativeCodeLoader: Loaded the native-hadoop
>>> library
>>> 09/07/02 09:35:47 INFO zlib.ZlibFactory: Successfully loaded & initialized
>>> native-zlib library
>>> 09/07/02 09:35:47 INFO compress.CodecPool: Got brand-new compressor
>>> Exception in thread "main" java.lang.NullPointerException
>>>        at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:111)
>>>        at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:82)
>>>        at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:25)
>>>        at org.apache.mahout.utils.vectors.Driver.main(Driver.java:204)
>>>
>>>
>>> Well, I used a Nutch crawl index; is that correct? Hmm... I changed to
>>> the content field, but nothing happened.
>>> Possibly the Nutch crawl index doesn't have term vectors stored.
>> 
>> This would be my guess.  A small edit to Nutch code would probably  
>> allow it.  Just find where it creates a new Field and add in the TV  
>> stuff.
>> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Creating-Vectors-from-Text-tp24298643p26087765.html
Sent from the Mahout User List mailing list archive at Nabble.com.


Re: Creating Vectors from Text

Posted by sushil_kb <ba...@gmail.com>.
I am having the same problem as Allan. I checked out Mahout from trunk,
tried to create term frequency vectors from a Lucene index, and ran into
this:

09/10/27 17:36:10 INFO lucene.Driver: Output File:
/Users/shoeseal/DATA/luc2tvec.out
09/10/27 17:36:11 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
09/10/27 17:36:11 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.NullPointerException
	at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:109)
	at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:1)
	at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:40)
	at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:200)

I am running this from Eclipse (snow leopard with JDK 6), on an index that
has field with stored term vectors..

my input parameters for Driver are: 
--dir <path>/smallidx/ --output <path>/luc2tvec.out --idField id_field
 --field field_with_TV --dictOut <path>/luc2tvec.dict --max 50  --weight tf

Luke shows the following info on the fields I am using:
 id_field is indexed, stored, omit norms
 field_with_TV is indexed, tokenized, stored, term vector

I can run the test LuceneIterableTest fine but when I run the Driver on my
index I get into trouble. Any possible reasons for this behavior besides not
having an index field with stored term vector?

Thanks.
- sushil




Grant Ingersoll-6 wrote:
> 
> 
> On Jul 2, 2009, at 12:09 PM, Allan Roberto Avendano Sudario wrote:
> 
>> Regards,
>> This is the entire exception message:
>>
>>
>> java -cp $JAVACLASSPATH org.apache.mahout.utils.vectors.Driver --dir
>> /home/hadoop/Desktop/<urls>/index  --field content  --dictOut
>> /home/hadoop/Desktop/dictionary/dict.txt --output
>> /home/hadoop/Desktop/dictionary/out.txt --max 50 --norm 2
>>
>>
>> 09/07/02 09:35:47 INFO vectors.Driver: Output File:
>> /home/hadoop/Desktop/dictionary/out.txt
>> 09/07/02 09:35:47 INFO util.NativeCodeLoader: Loaded the native-hadoop
>> library
>> 09/07/02 09:35:47 INFO zlib.ZlibFactory: Successfully loaded & initialized
>> native-zlib library
>> 09/07/02 09:35:47 INFO compress.CodecPool: Got brand-new compressor
>> Exception in thread "main" java.lang.NullPointerException
>>        at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:111)
>>        at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:82)
>>        at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:25)
>>        at org.apache.mahout.utils.vectors.Driver.main(Driver.java:204)
>>
>>
>> Well, I used a Nutch crawl index; is that correct? Hmm... I changed to
>> the content field, but nothing happened.
>> Possibly the Nutch crawl index doesn't have term vectors stored.
> 
> This would be my guess.  A small edit to Nutch code would probably  
> allow it.  Just find where it creates a new Field and add in the TV  
> stuff.
> 

-- 
View this message in context: http://www.nabble.com/Creating-Vectors-from-Text-tp24298643p26087537.html
Sent from the Mahout User List mailing list archive at Nabble.com.


Re: Creating Vectors from Text

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 2, 2009, at 12:09 PM, Allan Roberto Avendano Sudario wrote:

> Regards,
> This is the entire exception message:
>
>
> java -cp $JAVACLASSPATH org.apache.mahout.utils.vectors.Driver --dir
> /home/hadoop/Desktop/<urls>/index  --field content  --dictOut
> /home/hadoop/Desktop/dictionary/dict.txt --output
> /home/hadoop/Desktop/dictionary/out.txt --max 50 --norm 2
>
>
> 09/07/02 09:35:47 INFO vectors.Driver: Output File:
> /home/hadoop/Desktop/dictionary/out.txt
> 09/07/02 09:35:47 INFO util.NativeCodeLoader: Loaded the native-hadoop
> library
> 09/07/02 09:35:47 INFO zlib.ZlibFactory: Successfully loaded & initialized
> native-zlib library
> 09/07/02 09:35:47 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.NullPointerException
>        at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:111)
>        at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:82)
>        at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:25)
>        at org.apache.mahout.utils.vectors.Driver.main(Driver.java:204)
>
>
> Well, I used a Nutch crawl index; is that correct? Hmm... I changed to
> the content field, but nothing happened.
> Possibly the Nutch crawl index doesn't have term vectors stored.

This would be my guess.  A small edit to Nutch code would probably  
allow it.  Just find where it creates a new Field and add in the TV  
stuff.
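For reference, "the TV stuff" in the Lucene API of this era is the Field.TermVector argument to the Field constructor. A hedged sketch, not runnable on its own (it needs lucene-core 2.x on the classpath, and the field name "content" plus the store/analyze flags are assumptions; the actual spot to edit is wherever Nutch builds its Document):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch only: builds a document whose field stores term vectors, so the
// Mahout Driver has something to read. Without Field.TermVector.YES,
// getTermFreqVector() returns nothing for the field.
static Document withTermVectors(String text) {
    Document doc = new Document();
    doc.add(new Field("content", text,
                      Field.Store.YES,
                      Field.Index.ANALYZED,
                      Field.TermVector.YES));
    return doc;
}
```

After changing the field creation this way, the index has to be rebuilt; term vectors are only recorded for documents indexed with the flag set.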

Re: Creating Vectors from Text

Posted by Allan Roberto Avendano Sudario <aa...@fiec.espol.edu.ec>.
Regards,
This is the entire exception message:


java -cp $JAVACLASSPATH org.apache.mahout.utils.vectors.Driver --dir
/home/hadoop/Desktop/<urls>/index  --field content  --dictOut
/home/hadoop/Desktop/dictionary/dict.txt --output
/home/hadoop/Desktop/dictionary/out.txt --max 50 --norm 2


09/07/02 09:35:47 INFO vectors.Driver: Output File:
/home/hadoop/Desktop/dictionary/out.txt
09/07/02 09:35:47 INFO util.NativeCodeLoader: Loaded the native-hadoop
library
09/07/02 09:35:47 INFO zlib.ZlibFactory: Successfully loaded & initialized
native-zlib library
09/07/02 09:35:47 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.NullPointerException
        at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:111)
        at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:82)
        at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:25)
        at org.apache.mahout.utils.vectors.Driver.main(Driver.java:204)


Well, I used a Nutch crawl index; is that correct? Hmm... I changed to the
content field, but nothing happened.
Possibly the Nutch crawl index doesn't have term vectors stored.

Thanks,


2009/7/1 Grant Ingersoll <gs...@apache.org>

> Is there any more information around the exception?
>
> How did you create your Lucene index?  Does the body field exist and does
> it have Term Vectors stored?
>
>
>
> On Jul 1, 2009, at 7:20 PM, Allan Roberto Avendano Sudario wrote:
>
>  Regards Community,
>> Does anyone know how to run the code that creates vectors from text? Java
>> shows me the following when I try to run this:
>>
>> java -cp $CLASSPATH org.apache.mahout.utils.vectors.Driver --dir
>> ~/Desktop/crawlSite/index  --field body --dictOut ~/Desktop/dict/dict.txt
>> --output ~/Desktop/dict/out.txt --max 50
>>
>> ....
>>
>> WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
>> platform... using builtin-java classes where applicable
>> 09/07/01 18:02:32 INFO compress.CodecPool: Got brand-new compressor
>> Exception in thread "main" java.lang.NullPointerException
>>       at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:111)
>>       at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:82)
>>       at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:25)
>>       at org.apache.mahout.utils.vectors.Driver.main(Driver.java:204)
>>
>>
>> Thanks!
>>
>> --
>> Allan Avendaño S.
>> Home: 04 2 800 692
>> Cell: 09 700 42 48
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


-- 
Allan Avendaño S.
Home: 04 2 800 692
Cell: 09 700 42 48

Re: Creating Vectors from Text

Posted by Grant Ingersoll <gs...@apache.org>.
Is there any more information around the exception?

How did you create your Lucene index?  Does the body field exist and  
does it have Term Vectors stored?


On Jul 1, 2009, at 7:20 PM, Allan Roberto Avendano Sudario wrote:

> Regards Community,
> Does anyone know how to run the code that creates vectors from text? Java
> shows me the following when I try to run this:
>
> java -cp $CLASSPATH org.apache.mahout.utils.vectors.Driver --dir
> ~/Desktop/crawlSite/index  --field body --dictOut ~/Desktop/dict/dict.txt
> --output ~/Desktop/dict/out.txt --max 50
>
> ....
>
> WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
> platform... using builtin-java classes where applicable
> 09/07/01 18:02:32 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.NullPointerException
>        at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:111)
>        at org.apache.mahout.utils.vectors.lucene.LuceneIteratable$TDIterator.next(LuceneIteratable.java:82)
>        at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:25)
>        at org.apache.mahout.utils.vectors.Driver.main(Driver.java:204)
>
>
> Thanks!
>
> -- 
> Allan Avendaño S.
> Home: 04 2 800 692
> Cell: 09 700 42 48

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search