Posted to common-user@hadoop.apache.org by Siddharth Tiwari <si...@live.com> on 2012/08/24 07:52:13 UTC

Reading multiple lines from a microsoft doc in hadoop

hi,
I have doc files in MS Word doc and docx format. They contain entries separated by an empty line. Is it possible for me to read the lines between empty lines as one record at a time? Also, which input format should I use to read doc and docx files? Please help.

*------------------------*

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God.” 

"Maybe other people will try to limit me but I don't limit myself"
 		 	   		  

Re: Reading multiple lines from a microsoft doc in hadoop

Posted by Bejoy KS <be...@gmail.com>.
Hi Siddharth

I believe doc and docx have custom formatting other than plain text, so you may have to build your own input format. You would also need your own record reader if you want an empty line to be the record delimiter.
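For reference, here is a rough sketch (not from this thread) of such a record reader: it wraps the stock LineRecordReader from the new mapreduce API and keeps appending lines to the current value until it reaches an empty line. The class name is made up and error handling is elided.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class ParagraphRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader lineReader = new LineRecordReader();
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lineReader.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        value.clear();
        boolean gotAnyLine = false;
        while (lineReader.nextKeyValue()) {
            Text line = lineReader.getCurrentValue();
            if (line.getLength() == 0) {        // an empty line ends the paragraph
                if (gotAnyLine) {
                    return true;                // emit the accumulated paragraph
                }
                continue;                       // skip leading empty lines
            }
            if (!gotAnyLine) {
                key.set(lineReader.getCurrentKey().get()); // offset of first line
                gotAnyLine = true;
            } else {
                value.append(new byte[]{'\n'}, 0, 1);      // re-join lines with \n
            }
            value.append(line.getBytes(), 0, line.getLength());
        }
        return gotAnyLine;                      // last paragraph at end of split
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() throws IOException { return lineReader.getProgress(); }
    @Override public void close() throws IOException { lineReader.close(); }
}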

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Siddharth Tiwari <si...@live.com>
Date: Fri, 24 Aug 2012 05:52:13 
To: USers Hadoop<us...@hadoop.apache.org>
Reply-To: user@hadoop.apache.org
Subject: Reading multiple lines from a microsoft doc in hadoop


hi,
I have doc files in MS Word doc and docx format. They contain entries separated by an empty line. Is it possible for me to read the lines between empty lines as one record at a time? Also, which input format should I use to read doc and docx files? Please help.

*------------------------*

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God.” 

"Maybe other people will try to limit me but I don't limit myself"
 		 	   		  

Re: Reading multiple lines from a microsoft doc in hadoop

Posted by Biju Balakrishnan <bi...@gmail.com>.
Siddharth,


> I have doc files in msword doc and docx format. These have entries which
> are seperated by an empty line. Is it possible for me to read
> these lines separated from empty lines at a time. Also which inpurformat
> shall I use to read doc docx. Please help
>
>
As far as I know, none of the built-in input formats supports doc or docx (to
be noted: as far as I know).
You might need to write a custom input format to support doc[x] files.

It is better to convert them to text files before processing with Hadoop.
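For reference, a minimal sketch (not from this thread) of that conversion step using Apache Tika to pull plain text out of a doc/docx before loading it into HDFS. The class name and argument handling are made up, and it assumes the tika-parsers jar is on the classpath.

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class DocToText {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0])) {
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
            AutoDetectParser parser = new AutoDetectParser();        // detects doc vs docx automatically
            parser.parse(in, handler, new Metadata());
            System.out.println(handler.toString());                  // extracted plain text, paragraphs kept
        }
    }
}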


-- 
Biju

Re: Reading multiple lines from a microsoft doc in hadoop

Posted by Bertrand Dechoux <de...@gmail.com>.
And that would help you with performance too.
Were you originally planning to have one file per Word document?
What is the average size of your Word documents?
It probably isn't much, and in that case I am afraid your map startup time
won't be negligible.

Regards

Bertrand

On Fri, Aug 24, 2012 at 8:07 AM, Håvard Wahl Kongsgård <
haavard.kongsgaard@gmail.com> wrote:

> It's much easier if you convert the documents to text first
>
> use
> http://tika.apache.org/
>
> or some other doc parser
>
>
> -Håvard
>
> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
> <si...@live.com> wrote:
> > hi,
> > I have doc files in msword doc and docx format. These have entries which
> are
> > seperated by an empty line. Is it possible for me to read
> > these lines separated from empty lines at a time. Also which inpurformat
> > shall I use to read doc docx. Please help
> >
> > *------------------------*
> > Cheers !!!
> > Siddharth Tiwari
> > Have a refreshing day !!!
> > "Every duty is holy, and devotion to duty is the highest form of worship
> of
> > God.”
> > "Maybe other people will try to limit me but I don't limit myself"
>
>
>
> --
> Håvard Wahl Kongsgård
> Faculty of Medicine &
> Department of Mathematical Sciences
> NTNU
>
> http://havard.security-review.net/
>



-- 
Bertrand Dechoux

Re: Reading multiple lines from a microsoft doc in hadoop

Posted by Mohammad Tariq <do...@gmail.com>.
Sorry, I forgot the link:
http://hadoopchicago.com/tips-tricks/custom-xmlreader-boris-lublinsky-michael-segel/

Regards,
    Mohammad Tariq



On Fri, Aug 24, 2012 at 1:10 PM, Mohammad Tariq <do...@gmail.com> wrote:

> Hello Siddharth,
>
>        You can tweak the "NLineInputFormat" as per your requirement and
> use it. It allows us to read a specified no of lines
> unlike "TextInputFormat". Here is a good post by Boris and Michael on
> custom record reader. Also I would suggest you to
> combine similar files together into one bigger file if feasible, as you
> files are very small.
>
> Regards,
>     Mohammad Tariq
>
>
>
> On Fri, Aug 24, 2012 at 1:00 PM, Siddharth Tiwari <
> siddharth.tiwari@live.com> wrote:
>
>> Hi,
>> Thank you for the suggestion. Actually I was using poi to extract text,
>> but since now  I  have so many  documents I thought I will use hadoop
>> directly to parse as well. Average size of each document is around 120 kb.
>> Also I want to read multiple lines from the text until I find a blank line.
>> I do not have any idea ankit how to design custom input format and record
>> reader. Pleaser help with some tutorial tutorial, code or resource around
>> it. I am struggling with the issue. I will be highly grateful. Thank you so
>> much once again
>>
>> > Date: Fri, 24 Aug 2012 08:07:39 +0200
>> > Subject: Re: Reading multiple lines from a microsoft doc in hadoop
>> > From: haavard.kongsgaard@gmail.com
>> > To: user@hadoop.apache.org
>> >
>> > It's much easier if you convert the documents to text first
>> >
>> > use
>> > http://tika.apache.org/
>> >
>> > or some other doc parser
>> >
>> >
>> > -Håvard
>> >
>> > On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
>> > <si...@live.com> wrote:
>> > > hi,
>> > > I have doc files in msword doc and docx format. These have entries
>> which are
>> > > seperated by an empty line. Is it possible for me to read
>> > > these lines separated from empty lines at a time. Also which
>> inpurformat
>> > > shall I use to read doc docx. Please help
>> > >
>> > > *------------------------*
>> > > Cheers !!!
>> > > Siddharth Tiwari
>> > > Have a refreshing day !!!
>> > > "Every duty is holy, and devotion to duty is the highest form of
>> worship of
>> > > God.”
>> > > "Maybe other people will try to limit me but I don't limit myself"
>> >
>> >
>> >
>> > --
>> > Håvard Wahl Kongsgård
>> > Faculty of Medicine &
>> > Department of Mathematical Sciences
>> > NTNU
>> >
>> > http://havard.security-review.net/
>>
>
>

Re: Reading multiple lines from a microsoft doc in hadoop

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Siddharth,

       You can tweak the "NLineInputFormat" as per your requirement and use
it. Unlike "TextInputFormat", it allows you to read a specified number of
lines at a time. Here is a good post by Boris and Michael on writing a
custom record reader. I would also suggest combining similar files into one
bigger file if feasible, since your files are very small.
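For reference, a rough, hypothetical driver sketch (not from this thread) of plain, untweaked NLineInputFormat usage with the new mapreduce API. Depending on your Hadoop version the class may only exist under org.apache.hadoop.mapred.lib (old API) and the setter below may not be available, so treat this as an illustration only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "nline-example");
        job.setJarByClass(NLineDriver.class);
        job.setInputFormatClass(NLineInputFormat.class);
        // Each map task receives at most this many input lines.
        NLineInputFormat.setNumLinesPerSplit(job, 10);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... mapper, reducer, output types and output path go here ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}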

Regards,
    Mohammad Tariq



On Fri, Aug 24, 2012 at 1:00 PM, Siddharth Tiwari <siddharth.tiwari@live.com
> wrote:

> Hi,
> Thank you for the suggestion. Actually I was using poi to extract text,
> but since now  I  have so many  documents I thought I will use hadoop
> directly to parse as well. Average size of each document is around 120 kb.
> Also I want to read multiple lines from the text until I find a blank line.
> I do not have any idea ankit how to design custom input format and record
> reader. Pleaser help with some tutorial tutorial, code or resource around
> it. I am struggling with the issue. I will be highly grateful. Thank you so
> much once again
>
> > Date: Fri, 24 Aug 2012 08:07:39 +0200
> > Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> > From: haavard.kongsgaard@gmail.com
> > To: user@hadoop.apache.org
> >
> > It's much easier if you convert the documents to text first
> >
> > use
> > http://tika.apache.org/
> >
> > or some other doc parser
> >
> >
> > -Håvard
> >
> > On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
> > <si...@live.com> wrote:
> > > hi,
> > > I have doc files in msword doc and docx format. These have entries
> which are
> > > seperated by an empty line. Is it possible for me to read
> > > these lines separated from empty lines at a time. Also which
> inpurformat
> > > shall I use to read doc docx. Please help
> > >
> > > *------------------------*
> > > Cheers !!!
> > > Siddharth Tiwari
> > > Have a refreshing day !!!
> > > "Every duty is holy, and devotion to duty is the highest form of
> worship of
> > > God.”
> > > "Maybe other people will try to limit me but I don't limit myself"
> >
> >
> >
> > --
> > Håvard Wahl Kongsgård
> > Faculty of Medicine &
> > Department of Mathematical Sciences
> > NTNU
> >
> > http://havard.security-review.net/
>

Re: Reading multiple lines from a microsoft doc in hadoop

Posted by Harsh J <ha...@cloudera.com>.
Hi Siddharth,

First of all, please understand the medium: mailing lists aren't an
immediate or interactive help channel, so please be patient with the people
who help you in their own time. Secondly, take a read of
http://www.catb.org/~esr/faqs/smart-questions.html to understand
why certain etiquette is beneficial to both ends.

Your requirement here seems to be that you want to read all text in a
file, in records separated by two newlines. Depending on the version
of Hadoop you use, I think you can probably set
"textinputformat.record.delimiter" to "\n\n" or "\r\n\r\n" to have
this working with the TextInputFormat itself.
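For reference, a minimal, hypothetical driver sketch of that suggestion (the class name is made up, and as noted above, whether "textinputformat.record.delimiter" is honoured depends on the Hadoop version in use).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ParagraphDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Records now run up to the next blank line ("\r\n\r\n" for CRLF input).
        conf.set("textinputformat.record.delimiter", "\n\n");
        Job job = new Job(conf, "paragraph-records");
        job.setJarByClass(ParagraphDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... mapper, reducer, output types and output path go here ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}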

On Sat, Aug 25, 2012 at 5:37 PM, Siddharth Tiwari
<si...@live.com> wrote:
>
> CAn anybody enlighten me on what could be wrongg ?
>
>
> *------------------------*
> Cheers !!!
> Siddharth Tiwari
> Have a refreshing day !!!
> "Every duty is holy, and devotion to duty is the highest form of worship of
> God.”
> "Maybe other people will try to limit me but I don't limit myself"
>
>
> ________________________________
> From: siddharth.tiwari@live.com
> To: user@hadoop.apache.org; bejoy.hadoop@gmail.com; bejoy_ks@yahoo.com
> Subject: RE: Reading multiple lines from a microsoft doc in hadoop
> Date: Sat, 25 Aug 2012 05:35:48 +0000
>
>
>
> Any help on below would be really appreciated. i am stuck with it
>
> *------------------------*
> Cheers !!!
> Siddharth Tiwari
> Have a refreshing day !!!
> "Every duty is holy, and devotion to duty is the highest form of worship of
> God.”
> "Maybe other people will try to limit me but I don't limit myself"
>
>
> ________________________________
> From: siddharth.tiwari@live.com
> To: user@hadoop.apache.org; bejoy.hadoop@gmail.com; bejoy_ks@yahoo.com
> Subject: RE: Reading multiple lines from a microsoft doc in hadoop
> Date: Fri, 24 Aug 2012 20:23:45 +0000
>
> Hi ,
>
> Can anyone please help ?
>
> Thank you in advance
>
>
> *------------------------*
> Cheers !!!
> Siddharth Tiwari
> Have a refreshing day !!!
> "Every duty is holy, and devotion to duty is the highest form of worship of
> God.”
> "Maybe other people will try to limit me but I don't limit myself"
>
>
> ________________________________
> From: siddharth.tiwari@live.com
> To: user@hadoop.apache.org; bejoy.hadoop@gmail.com; bejoy_ks@yahoo.com
> Subject: RE: Reading multiple lines from a microsoft doc in hadoop
> Date: Fri, 24 Aug 2012 16:22:57 +0000
>
> Hi Team,
>
> Thanks a lot for so many good suggestions. I wrote a custom input format for
> reading one paragraph at a time. But when I use it I get lines read. Can you
> please suggest what changes I must make to read one para at a time seperated
> by null lines ?
> below is the code I wrote:-
>
>
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapreduce.InputSplit;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.JobContext;
> import org.apache.hadoop.mapreduce.RecordReader;
> import org.apache.hadoop.mapreduce.TaskAttemptContext;
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> import org.apache.hadoop.mapreduce.lib.input.FileSplit;
> import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
> import org.apache.hadoop.util.LineReader;
>
>
>
>
> /**
>  *
>  */
>
> /**
>  * @author 460615
>  *
>  */
> //FileInputFormat is the base class for all file-based InputFormats
> public class ParaInputFormat extends FileInputFormat<LongWritable,Text> {
> private String nullRegex = "^\\s*$" ;
> public String StrLine = null;
> /*public RecordReader<LongWritable, Text> getRecordReader (InputSplit
> genericSplit, JobConf job, Reporter reporter) throws IOException {
> reporter.setStatus(genericSplit.toString());
> return new ParaInputFormat(job, (FileSplit)genericSplit);
> }*/
> public RecordReader<LongWritable, Text> createRecordReader(InputSplit
> genericSplit, TaskAttemptContext context)throws IOException {
>    context.setStatus(genericSplit.toString());
>    return new LineRecordReader();
>  }
>
>
> public InputSplit[] getSplits(JobContext job, Configuration conf) throws
> IOException {
> ArrayList<FileSplit> splits = new ArrayList<FileSplit>();
> for (FileStatus status : listStatus(job)) {
> Path fileName = status.getPath();
> if (status.isDir()) {
> throw new IOException("Not a file: " + fileName);
> }
> FileSystem  fs = fileName.getFileSystem(conf);
> LineReader lr = null;
> try {
> FSDataInputStream in  = fs.open(fileName);
> lr = new LineReader(in, conf);
> // String regexMatch =in.readLine();
> Text line = new Text();
> long begin = 0;
> long length = 0;
> int num = -1;
> String boolTest = null;
> boolean match = false;
> Pattern p = Pattern.compile(nullRegex);
> // Matcher matcher = new p.matcher();
> while ((boolTest = in.readLine()) != null && (num = lr.readLine(line)) > 0
> && ! ( in.readLine().isEmpty())){
> // numLines++;
> length += num;
>
>
> splits.add(new FileSplit(fileName, begin, length, new String[]{}));}
> begin=length;
> }finally {
> if (lr != null) {
> lr.close();
> }
>
>
>
> }
>
> }
> return splits.toArray(new FileSplit[splits.size()]);
> }
>
>
>
> }
>
>
>
>
>
> *------------------------*
> Cheers !!!
> Siddharth Tiwari
> Have a refreshing day !!!
> "Every duty is holy, and devotion to duty is the highest form of worship of
> God.”
> "Maybe other people will try to limit me but I don't limit myself"
>
>
>> Date: Fri, 24 Aug 2012 09:54:10 +0200
>> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
>> From: haavard.kongsgaard@gmail.com
>> To: user@hadoop.apache.org
>>
>> Hi, maybe you should check out the old nutch project
>> http://nutch.apache.org/ (hadoop was developed for nutch).
>> It's a web crawler and indexer, but the malinglists hold much info
>> doc/pdf parsing which also relates to hadoop.
>>
>> Have never parsed many docx or doc files, but it should be
>> strait-forward. But generally for text analysis preprocessing is the
>> KEY! For example replace dual lines \r\n\r\n or (\n\n) with #### is a
>> simple trick)
>>
>>
>> -Håvard
>>
>> On Fri, Aug 24, 2012 at 9:30 AM, Siddharth Tiwari
>> <si...@live.com> wrote:
>> > Hi,
>> > Thank you for the suggestion. Actually I was using poi to extract text,
>> > but
>> > since now I have so many documents I thought I will use hadoop directly
>> > to parse as well. Average size of each document is around 120 kb. Also I
>> > want to read multiple lines from the text until I find a blank line. I
>> > do
>> > not have any idea ankit how to design custom input format and record
>> > reader.
>> > Pleaser help with some tutorial tutorial, code or resource around it. I
>> > am
>> > struggling with the issue. I will be highly grateful. Thank you so much
>> > once
>> > again
>> >
>> >> Date: Fri, 24 Aug 2012 08:07:39 +0200
>> >> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
>> >> From: haavard.kongsgaard@gmail.com
>> >> To: user@hadoop.apache.org
>> >
>> >>
>> >> It's much easier if you convert the documents to text first
>> >>
>> >> use
>> >> http://tika.apache.org/
>> >>
>> >> or some other doc parser
>> >>
>> >>
>> >> -Håvard
>> >>
>> >> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
>> >> <si...@live.com> wrote:
>> >> > hi,
>> >> > I have doc files in msword doc and docx format. These have entries
>> >> > which
>> >> > are
>> >> > seperated by an empty line. Is it possible for me to read
>> >> > these lines separated from empty lines at a time. Also which
>> >> > inpurformat
>> >> > shall I use to read doc docx. Please help
>> >> >
>> >> > *------------------------*
>> >> > Cheers !!!
>> >> > Siddharth Tiwari
>> >> > Have a refreshing day !!!
>> >> > "Every duty is holy, and devotion to duty is the highest form of
>> >> > worship
>> >> > of
>> >> > God.”
>> >> > "Maybe other people will try to limit me but I don't limit myself"
>> >>
>> >>
>> >>
>> >> --
>> >> Håvard Wahl Kongsgård
>> >> Faculty of Medicine &
>> >> Department of Mathematical Sciences
>> >> NTNU
>> >>
>> >> http://havard.security-review.net/
>>
>>
>>
>> --
>> Håvard Wahl Kongsgård
>> Faculty of Medicine &
>> Department of Mathematical Sciences
>> NTNU
>>
>> http://havard.security-review.net/



-- 
Harsh J

RE: Reading multiple lines from a microsoft doc in hadoop

Posted by Siddharth Tiwari <si...@live.com>.
Can anybody enlighten me on what could be wrong?

*------------------------*

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God.” 

"Maybe other people will try to limit me but I don't limit myself"


From: siddharth.tiwari@live.com
To: user@hadoop.apache.org; bejoy.hadoop@gmail.com; bejoy_ks@yahoo.com
Subject: RE: Reading multiple lines from a microsoft doc in hadoop
Date: Sat, 25 Aug 2012 05:35:48 +0000





Any help on the below would be really appreciated. I am stuck with it.

*------------------------*

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God.” 

"Maybe other people will try to limit me but I don't limit myself"


From: siddharth.tiwari@live.com
To: user@hadoop.apache.org; bejoy.hadoop@gmail.com; bejoy_ks@yahoo.com
Subject: RE: Reading multiple lines from a microsoft doc in hadoop
Date: Fri, 24 Aug 2012 20:23:45 +0000





Hi,

Can anyone please help ?

Thank you in advance

*------------------------*

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God.” 

"Maybe other people will try to limit me but I don't limit myself"


From: siddharth.tiwari@live.com
To: user@hadoop.apache.org; bejoy.hadoop@gmail.com; bejoy_ks@yahoo.com
Subject: RE: Reading multiple lines from a microsoft doc in hadoop
Date: Fri, 24 Aug 2012 16:22:57 +0000





Hi Team,

Thanks a lot for so many good suggestions. I wrote a custom input format for reading one paragraph at a time, but when I use it I still get single lines read. Can you please suggest what changes I must make to read one paragraph at a time, separated by empty lines?
Below is the code I wrote:


import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.util.LineReader;

/**
 * @author 460615
 */
// FileInputFormat is the base class for all file-based InputFormats.
// Intent: one InputSplit per paragraph, where paragraphs are separated by blank lines.
public class ParaInputFormat extends FileInputFormat<LongWritable, Text> {

    // A line containing only whitespace marks the end of a paragraph.
    private static final Pattern NULL_LINE = Pattern.compile("^\\s*$");

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit genericSplit,
            TaskAttemptContext context) throws IOException {
        context.setStatus(genericSplit.toString());
        // Each split covers one paragraph, but LineRecordReader still hands the
        // mapper the lines inside that paragraph one at a time.
        return new LineRecordReader();
    }

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        Configuration conf = job.getConfiguration();
        List<InputSplit> splits = new ArrayList<InputSplit>();

        for (FileStatus status : listStatus(job)) {
            Path fileName = status.getPath();
            if (status.isDir()) {
                throw new IOException("Not a file: " + fileName);
            }
            FileSystem fs = fileName.getFileSystem(conf);
            LineReader lr = null;
            try {
                FSDataInputStream in = fs.open(fileName);
                lr = new LineReader(in, conf);
                Text line = new Text();
                long begin = 0;   // byte offset where the current paragraph starts
                long pos = 0;     // byte offset just past the last line read
                int num;          // bytes consumed by the last readLine call

                while ((num = lr.readLine(line)) > 0) {
                    pos += num;
                    if (NULL_LINE.matcher(line.toString()).matches()) {
                        // Blank line: close the current paragraph if it is non-empty.
                        if (pos - num > begin) {
                            splits.add(new FileSplit(fileName, begin, pos - num - begin,
                                    new String[] {}));
                        }
                        begin = pos; // the next paragraph starts after the blank line
                    }
                }
                // Final paragraph when the file does not end with a blank line.
                if (pos > begin) {
                    splits.add(new FileSplit(fileName, begin, pos - begin, new String[] {}));
                }
            } finally {
                if (lr != null) {
                    lr.close();
                }
            }
        }
        return splits;
    }
}
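
For reference, a minimal driver sketch for plugging this input format into a job could look roughly like the following (the identity Mapper and the input/output paths are only placeholders, not my actual job):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParaDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "paragraph reader");

        job.setJarByClass(ParaDriver.class);
        job.setInputFormatClass(ParaInputFormat.class);   // the custom input format above
        job.setMapperClass(Mapper.class);                  // identity mapper, just to inspect the records
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // placeholder paths; replace with the real input/output locations
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}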




*------------------------*

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God.” 

"Maybe other people will try to limit me but I don't limit myself"


> Date: Fri, 24 Aug 2012 09:54:10 +0200
> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> From: haavard.kongsgaard@gmail.com
> To: user@hadoop.apache.org
> 
> Hi, maybe you should check out the old nutch project
> http://nutch.apache.org/ (hadoop was developed for nutch).
> It's a web crawler and indexer, but the mailing lists hold much info on
> doc/pdf parsing, which also relates to hadoop.
> 
> Have never parsed many docx or doc files, but it should be
> straightforward. But generally, for text analysis preprocessing is the
> KEY! For example, replacing double newlines \r\n\r\n (or \n\n) with #### is a
> simple trick.
> 
> 
> -Håvard
> 
> On Fri, Aug 24, 2012 at 9:30 AM, Siddharth Tiwari
> <si...@live.com> wrote:
> > Hi,
> > Thank you for the suggestion. Actually I was using POI to extract text, but
> > since now I have so many documents I thought I would use hadoop directly
> > to parse as well. Average size of each document is around 120 kb. Also I
> > want to read multiple lines from the text until I find a blank line. I do
> > not have any idea about how to design a custom input format and record reader.
> > Please help with some tutorial, code or resource around it. I am
> > struggling with the issue. I will be highly grateful. Thank you so much once
> > again
> >
> >> Date: Fri, 24 Aug 2012 08:07:39 +0200
> >> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> >> From: haavard.kongsgaard@gmail.com
> >> To: user@hadoop.apache.org
> >
> >>
> >> It's much easier if you convert the documents to text first
> >>
> >> use
> >> http://tika.apache.org/
> >>
> >> or some other doc parser
> >>
> >>
> >> -Håvard
> >>
> >> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
> >> <si...@live.com> wrote:
> >> > hi,
> >> > I have doc files in msword doc and docx format. These have entries which
> >> > are
> >> > seperated by an empty line. Is it possible for me to read
> >> > these lines separated from empty lines at a time. Also which inpurformat
> >> > shall I use to read doc docx. Please help
> >> >
> >> > *------------------------*
> >> > Cheers !!!
> >> > Siddharth Tiwari
> >> > Have a refreshing day !!!
> >> > "Every duty is holy, and devotion to duty is the highest form of worship
> >> > of
> >> > God.”
> >> > "Maybe other people will try to limit me but I don't limit myself"
> >>
> >>
> >>
> >> --
> >> Håvard Wahl Kongsgård
> >> Faculty of Medicine &
> >> Department of Mathematical Sciences
> >> NTNU
> >>
> >> http://havard.security-review.net/
> 
> 
> 
> -- 
> Håvard Wahl Kongsgård
> Faculty of Medicine &
> Department of Mathematical Sciences
> NTNU
> 
> http://havard.security-review.net/
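
On the suggestion above of converting the documents to plain text before they go into HDFS, a Tika snippet along these lines would be roughly what that preprocessing step looks like (only a sketch; the exact Tika API may differ between versions):

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class DocToText {
    public static void main(String[] args) throws IOException, TikaException {
        Tika tika = new Tika();
        // args[0] is a .doc or .docx file; Tika detects the format automatically
        String text = tika.parseToString(new File(args[0]));
        // Paragraph breaks usually come out as newlines; they may still need to be
        // normalised (e.g. collapsing \r\n\r\n into a blank line, or a marker such
        // as ####, as suggested above) before paragraph splitting is reliable.
        System.out.println(text);
    }
}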

RE: Reading multiple lines from a microsoft doc in hadoop

Posted by Siddharth Tiwari <si...@live.com>.
Can anybody enlighten me on what could be wrong?

*------------------------*

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God.” 

"Maybe other people will try to limit me but I don't limit myself"


RE: Reading multiple lines from a microsoft doc in hadoop

Posted by Siddharth Tiwari <si...@live.com>.
Hi ,

Can anyone please help ?

Thank you in advance

*------------------------*

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God.” 

"Maybe other people will try to limit me but I don't limit myself"


From: siddharth.tiwari@live.com
To: user@hadoop.apache.org; bejoy.hadoop@gmail.com; bejoy_ks@yahoo.com
Subject: RE: Reading multiple lines from a microsoft doc in hadoop
Date: Fri, 24 Aug 2012 16:22:57 +0000





Hi Team,

Thanks a lot for so many good suggestions. I wrote a custom input format for reading one paragraph at a time. But when I use it I get lines read. Can you please suggest what changes I must make to read one para at a time seperated by null lines ?
below is the code I wrote:-


import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.util.LineReader;

/**
 * @author 460615
 */
// FileInputFormat is the base class for all file-based InputFormats
public class ParaInputFormat extends FileInputFormat<LongWritable, Text> {

    private String nullRegex = "^\\s*$";
    public String StrLine = null;

    /*
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit, JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(genericSplit.toString());
        return new ParaInputFormat(job, (FileSplit) genericSplit);
    }
    */

    public RecordReader<LongWritable, Text> createRecordReader(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
        context.setStatus(genericSplit.toString());
        return new LineRecordReader();
    }

    public InputSplit[] getSplits(JobContext job, Configuration conf) throws IOException {
        ArrayList<FileSplit> splits = new ArrayList<FileSplit>();
        for (FileStatus status : listStatus(job)) {
            Path fileName = status.getPath();
            if (status.isDir()) {
                throw new IOException("Not a file: " + fileName);
            }
            FileSystem fs = fileName.getFileSystem(conf);
            LineReader lr = null;
            try {
                FSDataInputStream in = fs.open(fileName);
                lr = new LineReader(in, conf);
                // String regexMatch = in.readLine();
                Text line = new Text();
                long begin = 0;
                long length = 0;
                int num = -1;
                String boolTest = null;
                boolean match = false;
                Pattern p = Pattern.compile(nullRegex);
                // Matcher matcher = p.matcher();
                while ((boolTest = in.readLine()) != null
                        && (num = lr.readLine(line)) > 0
                        && !(in.readLine().isEmpty())) {
                    // numLines++;
                    length += num;
                    splits.add(new FileSplit(fileName, begin, length, new String[]{}));
                }
                begin = length;
            } finally {
                if (lr != null) {
                    lr.close();
                }
            }
        }
        return splits.toArray(new FileSplit[splits.size()]);
    }
}




*------------------------*

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God.” 

"Maybe other people will try to limit me but I don't limit myself"


> Date: Fri, 24 Aug 2012 09:54:10 +0200
> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> From: haavard.kongsgaard@gmail.com
> To: user@hadoop.apache.org
> 
> Hi, maybe you should check out the old nutch project
> http://nutch.apache.org/ (hadoop was developed for nutch).
> It's a web crawler and indexer, but the malinglists hold much info
> doc/pdf parsing which also relates to hadoop.
> 
> Have never parsed many docx or doc files, but it should be
> strait-forward. But generally for text analysis preprocessing is the
> KEY! For example replace dual lines \r\n\r\n or (\n\n) with #### is a
> simple trick)
> 
> 
> -Håvard
> 
> On Fri, Aug 24, 2012 at 9:30 AM, Siddharth Tiwari
> <si...@live.com> wrote:
> > Hi,
> > Thank you for the suggestion. Actually I was using poi to extract text, but
> > since now  I  have so many  documents I thought I will use hadoop directly
> > to parse as well. Average size of each document is around 120 kb. Also I
> > want to read multiple lines from the text until I find a blank line. I do
> > not have any idea ankit how to design custom input format and record reader.
> > Pleaser help with some tutorial tutorial, code or resource around it. I am
> > struggling with the issue. I will be highly grateful. Thank you so much once
> > again
> >
> >> Date: Fri, 24 Aug 2012 08:07:39 +0200
> >> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> >> From: haavard.kongsgaard@gmail.com
> >> To: user@hadoop.apache.org
> >
> >>
> >> It's much easier if you convert the documents to text first
> >>
> >> use
> >> http://tika.apache.org/
> >>
> >> or some other doc parser
> >>
> >>
> >> -Håvard
> >>
> >> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
> >> <si...@live.com> wrote:
> >> > hi,
> >> > I have doc files in msword doc and docx format. These have entries which
> >> > are
> >> > seperated by an empty line. Is it possible for me to read
> >> > these lines separated from empty lines at a time. Also which inpurformat
> >> > shall I use to read doc docx. Please help
> >> >
> >> > *------------------------*
> >> > Cheers !!!
> >> > Siddharth Tiwari
> >> > Have a refreshing day !!!
> >> > "Every duty is holy, and devotion to duty is the highest form of worship
> >> > of
> >> > God.”
> >> > "Maybe other people will try to limit me but I don't limit myself"
> >>
> >>
> >>
> >> --
> >> Håvard Wahl Kongsgård
> >> Faculty of Medicine &
> >> Department of Mathematical Sciences
> >> NTNU
> >>
> >> http://havard.security-review.net/
> 
> 
> 
> -- 
> Håvard Wahl Kongsgård
> Faculty of Medicine &
> Department of Mathematical Sciences
> NTNU
> 
> http://havard.security-review.net/
 		 	   		   		 	   		  

RE: Reading multiple lines from a microsoft doc in hadoop

Posted by Siddharth Tiwari <si...@live.com>.
Hi Team,

Thanks a lot for so many good suggestions. I wrote a custom input format to read one paragraph at a time, but when I use it I still get individual lines. Can you please suggest what changes I must make to read one paragraph at a time, separated by blank lines?
Below is the code I wrote:


import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.util.LineReader;

/**
 * @author 460615
 */
// FileInputFormat is the base class for all file-based InputFormats
public class ParaInputFormat extends FileInputFormat<LongWritable, Text> {

    private String nullRegex = "^\\s*$";
    public String StrLine = null;

    /*
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit, JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(genericSplit.toString());
        return new ParaInputFormat(job, (FileSplit) genericSplit);
    }
    */

    public RecordReader<LongWritable, Text> createRecordReader(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
        context.setStatus(genericSplit.toString());
        return new LineRecordReader();
    }

    public InputSplit[] getSplits(JobContext job, Configuration conf) throws IOException {
        ArrayList<FileSplit> splits = new ArrayList<FileSplit>();
        for (FileStatus status : listStatus(job)) {
            Path fileName = status.getPath();
            if (status.isDir()) {
                throw new IOException("Not a file: " + fileName);
            }
            FileSystem fs = fileName.getFileSystem(conf);
            LineReader lr = null;
            try {
                FSDataInputStream in = fs.open(fileName);
                lr = new LineReader(in, conf);
                // String regexMatch = in.readLine();
                Text line = new Text();
                long begin = 0;
                long length = 0;
                int num = -1;
                String boolTest = null;
                boolean match = false;
                Pattern p = Pattern.compile(nullRegex);
                // Matcher matcher = p.matcher();
                while ((boolTest = in.readLine()) != null
                        && (num = lr.readLine(line)) > 0
                        && !(in.readLine().isEmpty())) {
                    // numLines++;
                    length += num;
                    splits.add(new FileSplit(fileName, begin, length, new String[]{}));
                }
                begin = length;
            } finally {
                if (lr != null) {
                    lr.close();
                }
            }
        }
        return splits.toArray(new FileSplit[splits.size()]);
    }
}




*------------------------*

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God.” 

"Maybe other people will try to limit me but I don't limit myself"


> Date: Fri, 24 Aug 2012 09:54:10 +0200
> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> From: haavard.kongsgaard@gmail.com
> To: user@hadoop.apache.org
> 
> Hi, maybe you should check out the old nutch project
> http://nutch.apache.org/ (hadoop was developed for nutch).
> It's a web crawler and indexer, but the malinglists hold much info
> doc/pdf parsing which also relates to hadoop.
> 
> Have never parsed many docx or doc files, but it should be
> strait-forward. But generally for text analysis preprocessing is the
> KEY! For example replace dual lines \r\n\r\n or (\n\n) with #### is a
> simple trick)
> 
> 
> -Håvard
> 
> On Fri, Aug 24, 2012 at 9:30 AM, Siddharth Tiwari
> <si...@live.com> wrote:
> > Hi,
> > Thank you for the suggestion. Actually I was using poi to extract text, but
> > since now  I  have so many  documents I thought I will use hadoop directly
> > to parse as well. Average size of each document is around 120 kb. Also I
> > want to read multiple lines from the text until I find a blank line. I do
> > not have any idea ankit how to design custom input format and record reader.
> > Pleaser help with some tutorial tutorial, code or resource around it. I am
> > struggling with the issue. I will be highly grateful. Thank you so much once
> > again
> >
> >> Date: Fri, 24 Aug 2012 08:07:39 +0200
> >> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> >> From: haavard.kongsgaard@gmail.com
> >> To: user@hadoop.apache.org
> >
> >>
> >> It's much easier if you convert the documents to text first
> >>
> >> use
> >> http://tika.apache.org/
> >>
> >> or some other doc parser
> >>
> >>
> >> -Håvard
> >>
> >> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
> >> <si...@live.com> wrote:
> >> > hi,
> >> > I have doc files in msword doc and docx format. These have entries which
> >> > are
> >> > seperated by an empty line. Is it possible for me to read
> >> > these lines separated from empty lines at a time. Also which inpurformat
> >> > shall I use to read doc docx. Please help
> >> >
> >> > *------------------------*
> >> > Cheers !!!
> >> > Siddharth Tiwari
> >> > Have a refreshing day !!!
> >> > "Every duty is holy, and devotion to duty is the highest form of worship
> >> > of
> >> > God.”
> >> > "Maybe other people will try to limit me but I don't limit myself"
> >>
> >>
> >>
> >> --
> >> Håvard Wahl Kongsgård
> >> Faculty of Medicine &
> >> Department of Mathematical Sciences
> >> NTNU
> >>
> >> http://havard.security-review.net/
> 
> 
> 
> -- 
> Håvard Wahl Kongsgård
> Faculty of Medicine &
> Department of Mathematical Sciences
> NTNU
> 
> http://havard.security-review.net/
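
One minimal change worth sketching here, assuming the cluster runs a Hadoop release whose TextInputFormat honours the textinputformat.record.delimiter property (it was added in later releases, so check your version): skip the custom split logic entirely and let the stock input format treat a blank line as the record separator. The driver below is only an illustration; the class and job names are made up, and the mapper/reducer setup is omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver sketch: no custom InputFormat at all.
public class ParagraphJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Treat a blank line (two consecutive newlines) as the record separator,
        // so each map() call receives one whole paragraph as its value.
        conf.set("textinputformat.record.delimiter", "\n\n");

        Job job = new Job(conf, "paragraph-reader");   // hypothetical job name
        job.setJarByClass(ParagraphJobDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // mapper/reducer classes omitted; set them as in any other job
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With that setting in place, each record handed to the mapper is one whole paragraph, and no ParaInputFormat is needed at all.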
 		 	   		  

Re: Reading multiple lines from a microsoft doc in hadoop

Posted by Håvard Wahl Kongsgård <ha...@gmail.com>.
Hi, maybe you should check out the old Nutch project
http://nutch.apache.org/ (Hadoop was developed for Nutch).
It's a web crawler and indexer, but the mailing lists hold much info
on doc/pdf parsing, which also relates to Hadoop.

I have never parsed many docx or doc files, but it should be
straightforward. But generally, for text analysis, preprocessing is the
KEY! For example, replacing double newlines (\r\n\r\n or \n\n) with #### is a
simple trick.


-Håvard

On Fri, Aug 24, 2012 at 9:30 AM, Siddharth Tiwari
<si...@live.com> wrote:
> Hi,
> Thank you for the suggestion. Actually I was using poi to extract text, but
> since now  I  have so many  documents I thought I will use hadoop directly
> to parse as well. Average size of each document is around 120 kb. Also I
> want to read multiple lines from the text until I find a blank line. I do
> not have any idea ankit how to design custom input format and record reader.
> Pleaser help with some tutorial tutorial, code or resource around it. I am
> struggling with the issue. I will be highly grateful. Thank you so much once
> again
>
>> Date: Fri, 24 Aug 2012 08:07:39 +0200
>> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
>> From: haavard.kongsgaard@gmail.com
>> To: user@hadoop.apache.org
>
>>
>> It's much easier if you convert the documents to text first
>>
>> use
>> http://tika.apache.org/
>>
>> or some other doc parser
>>
>>
>> -Håvard
>>
>> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
>> <si...@live.com> wrote:
>> > hi,
>> > I have doc files in msword doc and docx format. These have entries which
>> > are
>> > seperated by an empty line. Is it possible for me to read
>> > these lines separated from empty lines at a time. Also which inpurformat
>> > shall I use to read doc docx. Please help
>> >
>> > *------------------------*
>> > Cheers !!!
>> > Siddharth Tiwari
>> > Have a refreshing day !!!
>> > "Every duty is holy, and devotion to duty is the highest form of worship
>> > of
>> > God.”
>> > "Maybe other people will try to limit me but I don't limit myself"
>>
>>
>>
>> --
>> Håvard Wahl Kongsgård
>> Faculty of Medicine &
>> Department of Mathematical Sciences
>> NTNU
>>
>> http://havard.security-review.net/



-- 
Håvard Wahl Kongsgård
Faculty of Medicine &
Department of Mathematical Sciences
NTNU

http://havard.security-review.net/
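
As a rough sketch of that conversion step, assuming Apache Tika's org.apache.tika.Tika facade class is on the classpath (method names as in the Tika documentation; adjust for the version you use), one could strip .doc/.docx files down to plain text before they are loaded into HDFS:

import java.io.File;
import java.io.FileWriter;

import org.apache.tika.Tika;

// Pre-processing sketch: convert Word documents to plain text with Tika
// before loading them into HDFS, so the MapReduce job only ever sees text.
public class DocToText {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        for (String name : args) {
            // parseToString extracts the text content of .doc/.docx (and many other formats)
            String text = tika.parseToString(new File(name));
            // Write the result to a .txt file alongside the original,
            // keeping the blank-line separators between entries intact.
            FileWriter out = new FileWriter(name + ".txt");
            out.write(text);
            out.close();
        }
    }
}

Running this over the documents first keeps the MapReduce side simple: the job then only has to deal with plain text and the blank-line record separators.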

Re: Reading multiple lines from a microsoft doc in hadoop

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Siddharth,

       You can tweak the "NLineInputFormat" as per your requirement and use
it. Unlike "TextInputFormat", it allows us to read a specified number of
lines at a time. Here is a good post by Boris and Michael on writing a
custom record reader. Also, I would suggest you combine similar files
into one bigger file if feasible, as your files are very small.

Regards,
    Mohammad Tariq



On Fri, Aug 24, 2012 at 1:00 PM, Siddharth Tiwari <siddharth.tiwari@live.com
> wrote:

> Hi,
> Thank you for the suggestion. Actually I was using poi to extract text,
> but since now  I  have so many  documents I thought I will use hadoop
> directly to parse as well. Average size of each document is around 120 kb.
> Also I want to read multiple lines from the text until I find a blank line.
> I do not have any idea ankit how to design custom input format and record
> reader. Pleaser help with some tutorial tutorial, code or resource around
> it. I am struggling with the issue. I will be highly grateful. Thank you so
> much once again
>
> > Date: Fri, 24 Aug 2012 08:07:39 +0200
> > Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> > From: haavard.kongsgaard@gmail.com
> > To: user@hadoop.apache.org
> >
> > It's much easier if you convert the documents to text first
> >
> > use
> > http://tika.apache.org/
> >
> > or some other doc parser
> >
> >
> > -Håvard
> >
> > On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
> > <si...@live.com> wrote:
> > > hi,
> > > I have doc files in msword doc and docx format. These have entries
> which are
> > > seperated by an empty line. Is it possible for me to read
> > > these lines separated from empty lines at a time. Also which
> inpurformat
> > > shall I use to read doc docx. Please help
> > >
> > > *------------------------*
> > > Cheers !!!
> > > Siddharth Tiwari
> > > Have a refreshing day !!!
> > > "Every duty is holy, and devotion to duty is the highest form of
> worship of
> > > God.”
> > > "Maybe other people will try to limit me but I don't limit myself"
> >
> >
> >
> > --
> > Håvard Wahl Kongsgård
> > Faculty of Medicine &
> > Department of Mathematical Sciences
> > NTNU
> >
> > http://havard.security-review.net/
>
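
To make the custom record reader route concrete, here is an untested sketch (the class names are invented for illustration): an input format whose reader wraps the stock LineRecordReader and keeps appending lines to the current value until it reaches an empty line, so each record handed to the mapper is one paragraph.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Illustrative only: groups consecutive non-blank lines into one record.
public class ParagraphInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new ParagraphRecordReader();
    }

    public static class ParagraphRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader lineReader = new LineRecordReader();
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            lineReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            StringBuilder paragraph = new StringBuilder();
            boolean gotAnything = false;
            // Skip leading blank lines, then accumulate until the next blank line or EOF.
            while (lineReader.nextKeyValue()) {
                String line = lineReader.getCurrentValue().toString();
                if (line.trim().isEmpty()) {
                    if (gotAnything) {
                        break;              // end of the current paragraph
                    }
                    continue;               // still skipping empty lines before a paragraph
                }
                if (!gotAnything) {
                    key.set(lineReader.getCurrentKey().get());  // offset of the first line
                    gotAnything = true;
                } else {
                    paragraph.append('\n');
                }
                paragraph.append(line);
            }
            if (!gotAnything) {
                return false;               // nothing left in this split
            }
            value.set(paragraph.toString());
            return true;
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return lineReader.getProgress();
        }

        @Override
        public void close() throws IOException {
            lineReader.close();
        }
    }
}

Note that this sketch makes no attempt to keep a paragraph that straddles an input split boundary in one piece; for files of around 120 KB, which fit comfortably in a single split, that should not matter, but a production version would need to handle it.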

Re: Reading multiple lines from a microsoft doc in hadoop

Posted by Håvard Wahl Kongsgård <ha...@gmail.com>.
Hi, maybe you should check out the old nutch project
http://nutch.apache.org/ (hadoop was developed for nutch).
It's a web crawler and indexer, but the malinglists hold much info
doc/pdf parsing which also relates to hadoop.

Have never parsed many docx or doc files, but it should be
strait-forward. But generally for text analysis preprocessing is the
KEY! For example replace dual lines \r\n\r\n or (\n\n) with #### is a
simple trick)


-Håvard

On Fri, Aug 24, 2012 at 9:30 AM, Siddharth Tiwari
<si...@live.com> wrote:
> Hi,
> Thank you for the suggestion. Actually I was using poi to extract text, but
> since now  I  have so many  documents I thought I will use hadoop directly
> to parse as well. Average size of each document is around 120 kb. Also I
> want to read multiple lines from the text until I find a blank line. I do
> not have any idea ankit how to design custom input format and record reader.
> Pleaser help with some tutorial tutorial, code or resource around it. I am
> struggling with the issue. I will be highly grateful. Thank you so much once
> again
>
>> Date: Fri, 24 Aug 2012 08:07:39 +0200
>> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
>> From: haavard.kongsgaard@gmail.com
>> To: user@hadoop.apache.org
>
>>
>> It's much easier if you convert the documents to text first
>>
>> use
>> http://tika.apache.org/
>>
>> or some other doc parser
>>
>>
>> -Håvard
>>
>> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
>> <si...@live.com> wrote:
>> > hi,
>> > I have doc files in msword doc and docx format. These have entries which
>> > are
>> > seperated by an empty line. Is it possible for me to read
>> > these lines separated from empty lines at a time. Also which inpurformat
>> > shall I use to read doc docx. Please help
>> >
>> > *------------------------*
>> > Cheers !!!
>> > Siddharth Tiwari
>> > Have a refreshing day !!!
>> > "Every duty is holy, and devotion to duty is the highest form of worship
>> > of
>> > God.”
>> > "Maybe other people will try to limit me but I don't limit myself"
>>
>>
>>
>> --
>> Håvard Wahl Kongsgård
>> Faculty of Medicine &
>> Department of Mathematical Sciences
>> NTNU
>>
>> http://havard.security-review.net/



-- 
Håvard Wahl Kongsgård
Faculty of Medicine &
Department of Mathematical Sciences
NTNU

http://havard.security-review.net/

Re: Reading multiple lines from a microsoft doc in hadoop

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Siddharth,

       You can tweak the "NLineInputFormat" as per your requirement and use
it. It allows us to read a specified no of lines
unlike "TextInputFormat". Here is a good post by Boris and Michael on
custom record reader. Also I would suggest you to
combine similar files together into one bigger file if feasible, as you
files are very small.

Regards,
    Mohammad Tariq



On Fri, Aug 24, 2012 at 1:00 PM, Siddharth Tiwari <siddharth.tiwari@live.com
> wrote:

> Hi,
> Thank you for the suggestion. Actually I was using poi to extract text,
> but since now  I  have so many  documents I thought I will use hadoop
> directly to parse as well. Average size of each document is around 120 kb.
> Also I want to read multiple lines from the text until I find a blank line.
> I do not have any idea ankit how to design custom input format and record
> reader. Pleaser help with some tutorial tutorial, code or resource around
> it. I am struggling with the issue. I will be highly grateful. Thank you so
> much once again
>
> > Date: Fri, 24 Aug 2012 08:07:39 +0200
> > Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> > From: haavard.kongsgaard@gmail.com
> > To: user@hadoop.apache.org
> >
> > It's much easier if you convert the documents to text first
> >
> > use
> > http://tika.apache.org/
> >
> > or some other doc parser
> >
> >
> > -Håvard
> >
> > On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
> > <si...@live.com> wrote:
> > > hi,
> > > I have doc files in msword doc and docx format. These have entries
> which are
> > > seperated by an empty line. Is it possible for me to read
> > > these lines separated from empty lines at a time. Also which
> inpurformat
> > > shall I use to read doc docx. Please help
> > >
> > > *------------------------*
> > > Cheers !!!
> > > Siddharth Tiwari
> > > Have a refreshing day !!!
> > > "Every duty is holy, and devotion to duty is the highest form of
> worship of
> > > God.”
> > > "Maybe other people will try to limit me but I don't limit myself"
> >
> >
> >
> > --
> > Håvard Wahl Kongsgård
> > Faculty of Medicine &
> > Department of Mathematical Sciences
> > NTNU
> >
> > http://havard.security-review.net/
>

RE: Reading multiple lines from a microsoft doc in hadoop

Posted by Siddharth Tiwari <si...@live.com>.
Hi,
Thank you for the suggestion. Actually I was using POI to extract text, but since I now have so many documents I thought I would use Hadoop directly to parse them as well. The average size of each document is around 120 KB. Also, I want to read multiple lines from the text until I find a blank line. I do not have any idea about how to design a custom input format and record reader. Please help with some tutorial, code, or resource around it. I am struggling with the issue and will be highly grateful. Thank you so much once again.

> Date: Fri, 24 Aug 2012 08:07:39 +0200
> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> From: haavard.kongsgaard@gmail.com
> To: user@hadoop.apache.org
> 
> It's much easier if you convert the documents to text first
> 
> use
> http://tika.apache.org/
> 
> or some other doc parser
> 
> 
> -Håvard
> 
> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
> <si...@live.com> wrote:
> > hi,
> > I have doc files in msword doc and docx format. These have entries which are
> > seperated by an empty line. Is it possible for me to read
> > these lines separated from empty lines at a time. Also which inpurformat
> > shall I use to read doc docx. Please help
> >
> > *------------------------*
> > Cheers !!!
> > Siddharth Tiwari
> > Have a refreshing day !!!
> > "Every duty is holy, and devotion to duty is the highest form of worship of
> > God.”
> > "Maybe other people will try to limit me but I don't limit myself"
> 
> 
> 
> -- 
> Håvard Wahl Kongsgård
> Faculty of Medicine &
> Department of Mathematical Sciences
> NTNU
> 
> http://havard.security-review.net/
 		 	   		  

Re: Reading multiple lines from a microsoft doc in hadoop

Posted by Bertrand Dechoux <de...@gmail.com>.
And that would help you with performance too.
Were you originally planning to have one file per Word document?
What is the average size of your Word documents?
It shouldn't be much. I am afraid your map startup time won't be negligible
in that case.
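
One common workaround is to pack the extracted texts into a single
SequenceFile of <file name, file contents> pairs before running the job. A
rough, untested sketch (the directory and output paths are placeholders):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Rough sketch: pack many small extracted-text files into one SequenceFile
// so the job is not dominated by per-file map task startup overhead.
public class PackDocs {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("packed/docs.seq");   // placeholder output path
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, Text.class);
        try (DirectoryStream<java.nio.file.Path> dir =
                     Files.newDirectoryStream(Paths.get("extracted"))) {
            for (java.nio.file.Path p : dir) {
                String content = new String(
                        Files.readAllBytes(p), StandardCharsets.UTF_8);
                // Key = original file name, value = full document text.
                writer.append(new Text(p.getFileName().toString()),
                              new Text(content));
            }
        } finally {
            writer.close();
        }
    }
}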

Regards

Bertrand

On Fri, Aug 24, 2012 at 8:07 AM, Håvard Wahl Kongsgård <
haavard.kongsgaard@gmail.com> wrote:

> It's much easier if you convert the documents to text first
>
> use
> http://tika.apache.org/
>
> or some other doc parser
>
>
> -Håvard
>
> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
> <si...@live.com> wrote:
> > hi,
> > I have doc files in msword doc and docx format. These have entries which
> are
> > seperated by an empty line. Is it possible for me to read
> > these lines separated from empty lines at a time. Also which inpurformat
> > shall I use to read doc docx. Please help
> >
> > *------------------------*
> > Cheers !!!
> > Siddharth Tiwari
> > Have a refreshing day !!!
> > "Every duty is holy, and devotion to duty is the highest form of worship
> of
> > God.”
> > "Maybe other people will try to limit me but I don't limit myself"
>
>
>
> --
> Håvard Wahl Kongsgård
> Faculty of Medicine &
> Department of Mathematical Sciences
> NTNU
>
> http://havard.security-review.net/
>



-- 
Bertrand Dechoux

Re: Reading multiple lines from a microsoft doc in hadoop

Posted by Håvard Wahl Kongsgård <ha...@gmail.com>.
It's much easier if you convert the documents to text first

use
http://tika.apache.org/

or some other doc parser
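
E.g. with the Tika facade class (a rough, untested sketch; the file name is a
placeholder):

import java.io.File;

import org.apache.tika.Tika;

// Rough sketch: extract plain text from a .doc/.docx file with the Tika facade.
public class ExtractText {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        String text = tika.parseToString(new File("report.docx"));
        System.out.println(text);
    }
}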


-Håvard

On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
<si...@live.com> wrote:
> hi,
> I have doc files in msword doc and docx format. These have entries which are
> seperated by an empty line. Is it possible for me to read
> these lines separated from empty lines at a time. Also which inpurformat
> shall I use to read doc docx. Please help
>
> *------------------------*
> Cheers !!!
> Siddharth Tiwari
> Have a refreshing day !!!
> "Every duty is holy, and devotion to duty is the highest form of worship of
> God.”
> "Maybe other people will try to limit me but I don't limit myself"



-- 
Håvard Wahl Kongsgård
Faculty of Medicine &
Department of Mathematical Sciences
NTNU

http://havard.security-review.net/

Re: Reading multiple lines from a microsoft doc in hadoop

Posted by Biju Balakrishnan <bi...@gmail.com>.
Siddharth,


> I have doc files in msword doc and docx format. These have entries which
> are seperated by an empty line. Is it possible for me to read
> these lines separated from empty lines at a time. Also which inpurformat
> shall I use to read doc docx. Please help
>
>
As far as I know, none of the built-in input formats support doc & docx (to be
noted: as far as I know).
You might need to write a custom input format to support doc[x] files.

It's better to convert them to text files before processing with Hadoop.


-- 
*Biju
*

Re: Reading multiple lines from a microsoft doc in hadoop

Posted by Bejoy KS <be...@gmail.com>.
Hi Siddharth

I believe doc and docx have custom formatting other than text. In that case you may have to build your own input format. Also your own record reader if you want to have the record delimiter as an empty line. 
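
A rough, untested sketch of such an input format and record reader (new
mapreduce API; the class names here are made up) could look like this:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Rough sketch: an input format whose records are "paragraphs", i.e. runs of
// non-empty lines delimited by one or more blank lines.
public class ParagraphInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new ParagraphRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // The documents are tiny, so read each file in one split; that way a
        // paragraph can never straddle a split boundary.
        return false;
    }

    public static class ParagraphRecordReader
            extends RecordReader<LongWritable, Text> {

        private final LineRecordReader lines = new LineRecordReader();
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            lines.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            StringBuilder paragraph = new StringBuilder();
            boolean found = false;
            while (lines.nextKeyValue()) {
                String line = lines.getCurrentValue().toString();
                if (line.trim().isEmpty()) {
                    if (found) {
                        break;      // a blank line ends the current paragraph
                    }
                    continue;       // skip leading blank lines
                }
                if (!found) {
                    key.set(lines.getCurrentKey().get());  // offset of first line
                    found = true;
                } else {
                    paragraph.append('\n');
                }
                paragraph.append(line);
            }
            value.set(paragraph.toString());
            return found;
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() throws IOException {
            return lines.getProgress();
        }

        @Override
        public void close() throws IOException {
            lines.close();
        }
    }
}

On newer Hadoop releases you may also be able to skip the custom reader and
just set textinputformat.record.delimiter to "\n\n" in the job Configuration,
but I am not sure which versions support that property.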

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Siddharth Tiwari <si...@live.com>
Date: Fri, 24 Aug 2012 05:52:13 
To: USers Hadoop<us...@hadoop.apache.org>
Reply-To: user@hadoop.apache.org
Subject: Reading multiple lines from a microsoft doc in hadoop


hi,
I have doc files in msword doc and docx format. These have entries which are seperated by an empty line. Is it possible for me to read these lines separated from empty lines at a time. Also which inpurformat shall I use to read doc docx. Please help

*------------------------*

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God.” 

"Maybe other people will try to limit me but I don't limit myself"
 		 	   		  
