Posted to mapreduce-user@hadoop.apache.org by Ranjini Rathinam <ra...@gmail.com> on 2014/01/03 06:16:58 UTC

XML to TEXT

Hi,

I need to convert XML into text using MapReduce.

I have tried both the DOM and SAX parsers.

After using SAXBuilder in the mapper class, a child node acts as the root
element.

When printing to System.out, I found that the root element is reported as
the child elements.

For example, when this XML is passed to the mapper:

<Comp><Emp><id>100</id><name>RR</name></Emp></Comp>

and I print the root element, I get the root element as

<id>
<name>

Please suggest how to fix this.

I need to convert the XML into text using MapReduce code. Please provide
an example.

Required output is

id,name
100,RR

Please help.

Thanks in advance,
Ranjini R
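
For reference, a minimal sketch of the conversion asked for above, outside of MapReduce. It assumes the whole <Comp> document is available as one string and uses the JDK's built-in DOM parser rather than SAXBuilder; the class and method names (CompToCsv, toCsv, elementChildren) are made up for illustration. With a complete document, getDocumentElement() returns <Comp>, and the field names come from the children of each <Emp>.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class CompToCsv {

    /** Returns the element children of a node, skipping text and comment nodes. */
    public static List<Element> elementChildren(Node parent) {
        List<Element> result = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                result.add((Element) children.item(i));
            }
        }
        return result;
    }

    /** Converts a document into CSV: one header line, then one line per child of the root. */
    public static String toCsv(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element root = doc.getDocumentElement(); // <Comp> for the sample input
        StringBuilder csv = new StringBuilder();
        boolean headerWritten = false;
        for (Element emp : elementChildren(root)) {
            List<String> names = new ArrayList<>();
            List<String> values = new ArrayList<>();
            for (Element field : elementChildren(emp)) {
                names.add(field.getNodeName());
                values.add(field.getTextContent().trim());
            }
            if (!headerWritten) {
                // The header is taken from the first record's field names.
                csv.append(String.join(",", names)).append('\n');
                headerWritten = true;
            }
            csv.append(String.join(",", values)).append('\n');
        }
        return csv.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Comp><Emp><id>100</id><name>RR</name></Emp></Comp>";
        System.out.print(toCsv(xml)); // prints "id,name" then "100,RR"
    }
}
```

The sketch takes the header from the first record only, so it assumes every <Emp> carries the same fields in the same order.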

Re: XML to TEXT

Posted by Ranjini Rathinam <ra...@gmail.com>.
Hi,

I used XMLInputFormat, and within it the RecordReader class, the same as you
suggested.

The whole XML is split into parts. For example, consider the XML below:

<Comp><Emp><id></id><name></name></Emp><Emp><id></id><name></name></Emp></Comp>

After using the RecordReader class, the XML output is

<Emp><id></id><name></name></Emp><Emp><id></id><name></name></Emp>

The start and end tags are Emp.

It is not converted into text.

Please suggest and help.

Thanks in advance

Ranjini
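
A note on the output shown above: two <Emp> siblings with no enclosing element do not form a well-formed document, so a DOM parser rejects them as they stand. One possible way to turn such a fragment into text lines is to wrap it in a synthetic root before parsing. The sketch below assumes plain JDK DOM; the wrapper name "records", the class name EmpFragments and the second sample record (101, SS) are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class EmpFragments {

    /** Returns the trimmed text of the first descendant with the given tag, or "" if absent. */
    public static String textOf(Element parent, String tag) {
        NodeList matches = parent.getElementsByTagName(tag);
        return matches.getLength() == 0 ? "" : matches.item(0).getTextContent().trim();
    }

    /** Wraps sibling <Emp> elements in one root and returns an "id,name" line per record. */
    public static List<String> toLines(String fragment) throws Exception {
        // "records" is a hypothetical wrapper element; any name works.
        String wrapped = "<records>" + fragment + "</records>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(wrapped)));
        NodeList emps = doc.getDocumentElement().getElementsByTagName("Emp");
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < emps.getLength(); i++) {
            Element emp = (Element) emps.item(i);
            lines.add(textOf(emp, "id") + "," + textOf(emp, "name"));
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // The second record is invented sample data.
        String fragment = "<Emp><id>100</id><name>RR</name></Emp>"
                + "<Emp><id>101</id><name>SS</name></Emp>";
        for (String line : toLines(fragment)) {
            System.out.println(line);
        }
    }
}
```

If the record reader hands the mapper one <Emp> element per call, the wrapping step is unnecessary and each value can be parsed directly.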

On Fri, Jan 3, 2014 at 11:22 AM, Azuryy Yu <az...@gmail.com> wrote:

> Hi,
>
> You can use org.apache.hadoop.streaming.StreamInputFormat in a MapReduce
> job to convert XML to text.
>
> For example, if your XML looks like this:
> <xml>
>   <name>lll</name>
> </xml>
>
> you need to specify stream.recordreader.begin and stream.recordreader.end
> in the Configuration:
> Configuration conf = new Configuration();
> conf.set("stream.recordreader.begin", "<xml>");
> conf.set("stream.recordreader.end", "</xml>");

Re: XML to TEXT

Posted by Azuryy Yu <az...@gmail.com>.
Hi,

You can use org.apache.hadoop.streaming.StreamInputFormat in a MapReduce job
to convert XML to text.

For example, if your XML looks like this:
<xml>
  <name>lll</name>
</xml>

you need to specify stream.recordreader.begin and stream.recordreader.end
in the Configuration:
Configuration conf = new Configuration();
conf.set("stream.recordreader.begin", "<xml>");
conf.set("stream.recordreader.end", "</xml>");
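
The two properties above name the text markers that delimit one record. As a rough illustration of that idea only, here is a plain-Java sketch that cuts a string into records between a begin and an end marker. It is not Hadoop's record reader, and the class and method names (MarkerSplit, split) are made up. As far as I know, StreamInputFormat also expects the record reader class to be named through stream.recordreader.class (commonly org.apache.hadoop.streaming.StreamXmlRecordReader), which is worth checking against your Hadoop version.

```java
import java.util.ArrayList;
import java.util.List;

public class MarkerSplit {

    /**
     * Returns every substring that starts at `begin` and runs through the
     * next `end`, markers included. Text outside the markers is dropped.
     */
    public static List<String> split(String input, String begin, String end) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = input.indexOf(begin, from);
            if (start < 0) {
                break; // no further record
            }
            int stop = input.indexOf(end, start + begin.length());
            if (stop < 0) {
                break; // unterminated record: ignore the tail
            }
            stop += end.length();
            records.add(input.substring(start, stop));
            from = stop;
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<Comp><Emp><id>1</id></Emp><Emp><id>2</id></Emp></Comp>";
        for (String record : split(xml, "<Emp>", "</Emp>")) {
            System.out.println(record);
        }
    }
}
```

With <Emp> and </Emp> as the markers, each record is a complete element that can be parsed on its own, while the enclosing <Comp> tags are discarded.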







Re: XML to TEXT

Posted by Shekhar Sharma <sh...@gmail.com>.
Which input format are you using? Use the XML input format.

XML to TEXT

Posted by Ranjini Rathinam <ra...@gmail.com>.
Hi,

As suggested, I tried the code, but result.txt contains only the header.
Nothing else was printed.

After debugging, I found that parsing produces no value.

The problem is in the lines below marked in bold (between asterisks). When I
added a System.out call there, no value was printed.

        String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            *Document doc = builder.parse(is);*
            *String ed = doc.getDocumentElement().getNodeName();*
            out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee",
                    doc, XPathConstants.NODESET);


When I print

out.write(xmlContent.getBytes()): the whole XML is printed.

Then I added a System.out call for list, and nothing was printed.
out.write(ed.getBytes()): nothing is printed.

Please suggest where I am going wrong and help me fix this.

Thanks in advance.

I have attached my code. Please review.
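
One possible explanation, offered as a guess: the driver in this message does not set an input format, and with Hadoop's default text input each map() call would receive a single line of the file rather than the whole document. A lone line such as <Company> is not well-formed XML, so builder.parse would throw a SAXException, and the catch block would only print the stack trace. The plain-Java sketch below (class name LineParseCheck is made up, no Hadoop involved) shows the difference between parsing the whole document and parsing it line by line.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class LineParseCheck {

    /** Returns true if the text parses as a complete XML document. */
    public static boolean parses(String text) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Company>\n<Employee>\n<id>100</id>\n</Employee>\n</Company>";
        System.out.println("whole document: " + parses(xml));
        for (String line : xml.split("\n")) {
            // Only a line that is a complete element, such as <id>100</id>, parses.
            System.out.println(line + " -> " + parses(line));
        }
    }
}
```

If this is what is happening, an input format that delivers one complete <Employee> element per record, as suggested earlier in the thread, would avoid the parse failure.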


Mapper class:

public class XmlTextMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final XPathFactory xpathFactory = XPathFactory.newInstance();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String resultFileName = "/user/task/Sales/result.txt";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
        FSDataOutputStream out = fs.create(new Path(resultFileName));
        InputStream resultIS = new ByteArrayInputStream(new byte[0]);
        String header = "id,name\n";
        out.write(header.getBytes());
        String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);

            String ed = doc.getDocumentElement().getNodeName();
            out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee",
                    doc, XPathConstants.NODESET);
            int size = list.getLength();
            for (int i = 0; i < size; i++) {
                Node node = list.item(i);
                String line = "";
                NodeList nodeList = node.getChildNodes();
                int childNumber = nodeList.getLength();
                for (int j = 0; j < childNumber; j++) {
                    line += nodeList.item(j).getTextContent() + ",";
                }
                if (line.endsWith(","))
                    line = line.substring(0, line.length() - 1);
                line += "\n";
                out.write(line.getBytes());
            }
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (XPathExpressionException e) {
            e.printStackTrace();
        }
        IOUtils.copyBytes(resultIS, out, 4096, true);
        out.close();
    }

    public static Object getNode(String xpathStr, Node node, QName returnType)
            throws XPathExpressionException {
        XPath xpath = xpathFactory.newXPath();
        return xpath.evaluate(xpathStr, node, returnType);
    }
}
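
A side note on the inner loop of the mapper: when the XML has line breaks between elements, getChildNodes() also returns the whitespace text nodes, and getTextContent() on a nested block such as <Address> returns all of its descendants' text run together. The sketch below is one hedged way to build a row from an <Employee> element, keeping only direct element children that hold a plain value. It uses plain JDK DOM and XPath, casts the XPath result to the standard NodeList interface instead of the internal DTMNodeList, and the class and method names (EmployeeRows, toRows, hasElementChild) are made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class EmployeeRows {

    /** True if the node has at least one element child, as <Address> does. */
    public static boolean hasElementChild(Node node) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                return true;
            }
        }
        return false;
    }

    /** Returns one comma-separated row per /Company/Employee element. */
    public static List<String> toRows(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList employees = (NodeList) xpath.evaluate(
                "/Company/Employee", doc, XPathConstants.NODESET);
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < employees.getLength(); i++) {
            List<String> values = new ArrayList<>();
            NodeList children = employees.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace text nodes between elements
                }
                if (hasElementChild(child)) {
                    continue; // skip nested blocks such as <Address>
                }
                values.add(child.getTextContent().trim());
            }
            rows.add(String.join(",", values));
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Company>\n<Employee>\n<id>100</id>\n<ename>ranjini</ename>\n"
                + "<dept>IT1</dept>\n<sal>123456</sal>\n<location>nextlevel1</location>\n"
                + "<Address>\n<Home>Chennai1</Home>\n<Office>Navallur1</Office>\n</Address>\n"
                + "</Employee>\n</Company>";
        System.out.println(toRows(xml)); // prints [100,ranjini,IT1,123456,nextlevel1]
    }
}
```

Skipping <Address> is a choice made for this sketch; flattening its children into extra columns would be an equally reasonable design.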



Main class:

public class MainXml {
    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();

        if (args.length != 2) {
            System.err.println("Usage: XMLtoText <input path> <output path>");
            System.exit(-1);
        }

        String output = "/user/task/Sales/";
        Job job = new Job(conf, "XML to Text");
        job.setJarByClass(MainXml.class);
        // job.setJobName("XML to Text");

        FileInputFormat.addInputPath(job, new Path(args[0]));

        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
        Path outPath = new Path(output);
        FileOutputFormat.setOutputPath(job, outPath);
        FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
        if (dfs.exists(outPath)) {
            dfs.delete(outPath, true);
        }
        job.setMapperClass(XmlTextMapper.class);

        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}



My xml file

<Company>
<Employee>
<id>100</id>
<ename>ranjini</ename>
<dept>IT1</dept>
<sal>123456</sal>
<location>nextlevel1</location>
<Address>
<Home>Chennai1</Home>
<Office>Navallur1</Office>
</Address>
</Employee>
<Employee>
<id>1001</id>
<ename>ranjinikumar</ename>
<dept>IT</dept>
<sal>1234516</sal>
<location>nextlevel</location>
<Address>
<Home>Chennai</Home>
<Office>Navallur</Office>
</Address>
</Employee>
</Company>


Thanks in advance
Ranjini. R

XML to TEXT

Posted by Ranjini Rathinam <ra...@gmail.com>.
Hi,

As suggest i tried with the code , but in the result.txt i got output only
header. Nothing else was printing.

After debugging i came to know that while parsing , there is no value.

The problem is in line given below which is bold. While putting SysOut i
found no value printing in this line.

        String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            *Document doc = builder.parse(is);*
            *String ed = doc.getDocumentElement().getNodeName();*
            out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee",
                    doc, XPathConstants.NODESET);


When I print

out.write(xmlContent.getBytes()): the whole XML is printed.

Then I added a SysOut for list: nothing is printed.
out.write(ed.getBytes()): nothing is printed.

Please suggest where I am going wrong and help me fix this.
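
One possible explanation for the symptom above, offered as an assumption and
not a confirmed diagnosis: the posted driver does not set an input format, so
the job uses the default TextInputFormat and each map() call receives a single
line of the file, not the whole document. A line such as <Company> is not a
well-formed document, so builder.parse() throws SAXException, and the catch
block only prints the stack trace. The sketch below reproduces this outside
Hadoop; the class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of the posted job.
class LineVsDocumentParse {

    // Returns the root element name, or null when the text is not a
    // well-formed XML document (the parser throws in that case).
    static String rootNameOrNull(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String whole = "<Company>\n<Employee>\n<id>100</id>\n</Employee>\n</Company>";
        // Parsing the whole document at once.
        System.out.println("whole document: " + rootNameOrNull(whole));
        // Parsing line by line, which is what a line-oriented input format hands to map().
        for (String line : whole.split("\n")) {
            System.out.println("line '" + line + "': " + rootNameOrNull(line));
        }
    }
}
```

Under this assumption, a line like <id>100</id> parses on its own with "id" as
its root element, which would also match the earlier report of child elements
showing up as the root.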

Thanks in advance.

I have attached my code. Please review.


Mapper class:-

public class XmlTextMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final XPathFactory xpathFactory = XPathFactory.newInstance();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String resultFileName = "/user/task/Sales/result.txt";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
        FSDataOutputStream out = fs.create(new Path(resultFileName));
        InputStream resultIS = new ByteArrayInputStream(new byte[0]);
        String header = "id,name\n";
        out.write(header.getBytes());
        String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);

            String ed = doc.getDocumentElement().getNodeName();
            out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee",
                    doc, XPathConstants.NODESET);
            int size = list.getLength();
            for (int i = 0; i < size; i++) {
                Node node = list.item(i);
                String line = "";
                NodeList nodeList = node.getChildNodes();
                int childNumber = nodeList.getLength();
                for (int j = 0; j < childNumber; j++) {
                    line += nodeList.item(j).getTextContent() + ",";
                }
                if (line.endsWith(","))
                    line = line.substring(0, line.length() - 1);
                line += "\n";
                out.write(line.getBytes());
            }
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (XPathExpressionException e) {
            e.printStackTrace();
        }
        IOUtils.copyBytes(resultIS, out, 4096, true);
        out.close();
    }

    public static Object getNode(String xpathStr, Node node, QName retunType)
            throws XPathExpressionException {
        XPath xpath = xpathFactory.newXPath();
        return xpath.evaluate(xpathStr, node, retunType);
    }
}
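
For comparison, here is a minimal sketch of the same extraction outside Hadoop,
assuming the whole document is available as one string. It casts the XPath
result to the public org.w3c.dom.NodeList interface instead of the internal
DTMNodeList class, and it selects the id and ename children by name, so the
whitespace text nodes between elements in a pretty-printed file do not turn
into empty CSV fields. The class name EmployeeCsv and the choice of columns
are illustrative assumptions, not part of the posted job.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the posted job.
class EmployeeCsv {

    // Builds a header plus one "id,ename" line per /Company/Employee element.
    static String toCsv(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Cast to the public NodeList interface, not an internal JDK class.
            NodeList employees = (NodeList) xpath.evaluate(
                    "/Company/Employee", doc, XPathConstants.NODESET);
            StringBuilder csv = new StringBuilder("id,name\n");
            for (int i = 0; i < employees.getLength(); i++) {
                Node employee = employees.item(i);
                // Relative expressions pick named children and skip whitespace nodes.
                String id = xpath.evaluate("id", employee);
                String ename = xpath.evaluate("ename", employee);
                csv.append(id).append(',').append(ename).append('\n');
            }
            return csv.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not convert XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Company>\n"
                + "<Employee>\n<id>100</id>\n<ename>ranjini</ename>\n</Employee>\n"
                + "<Employee>\n<id>1001</id>\n<ename>ranjinikumar</ename>\n</Employee>\n"
                + "</Company>";
        System.out.print(toCsv(xml));
    }
}
```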



Main class
public class MainXml {
    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();

        if (args.length != 2) {
            System.err.println("Usage: XMLtoText <input path> <output path>");
            System.exit(-1);
        }

        String output = "/user/task/Sales/";
        Job job = new Job(conf, "XML to Text");
        job.setJarByClass(MainXml.class);
        // job.setJobName("XML to Text");

        FileInputFormat.addInputPath(job, new Path(args[0]));

        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
        Path outPath = new Path(output);
        FileOutputFormat.setOutputPath(job, outPath);
        FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
        if (dfs.exists(outPath)) {
            dfs.delete(outPath, true);
        }
        job.setMapperClass(XmlTextMapper.class);

        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}



My xml file

<Company>
<Employee>
<id>100</id>
<ename>ranjini</ename>
<dept>IT1</dept>
<sal>123456</sal>
<location>nextlevel1</location>
<Address>
<Home>Chennai1</Home>
<Office>Navallur1</Office>
</Address>
</Employee>
<Employee>
<id>1001</id>
<ename>ranjinikumar</ename>
<dept>IT</dept>
<sal>1234516</sal>
<location>nextlevel</location>
<Address>
<Home>Chennai</Home>
<Office>Navallur</Office>
</Address>
</Employee>
</Company>
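
Earlier in this thread a begin/end-tag record reader was suggested for files
like this one. The idea can be sketched in plain Java: cut the text at each
<Employee> ... </Employee> pair so that every record is a small, well-formed
document that a parser can handle on its own. This is only an illustration of
the splitting step, not the Hadoop record reader itself; the class name is
hypothetical, and the sketch assumes the begin tag carries no attributes and
that Employee elements are not nested inside each other.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating begin/end-tag splitting.
class TagRecordSplitter {

    // Returns every substring that starts with beginTag and ends with the next endTag.
    static List<String> split(String text, String beginTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(beginTag, from);
            if (start < 0) {
                break; // no further records
            }
            int end = text.indexOf(endTag, start + beginTag.length());
            if (end < 0) {
                break; // unterminated record: stop without emitting it
            }
            end += endTag.length();
            records.add(text.substring(start, end));
            from = end;
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<Company>\n"
                + "<Employee>\n<id>100</id>\n</Employee>\n"
                + "<Employee>\n<id>1001</id>\n</Employee>\n"
                + "</Company>";
        for (String record : split(xml, "<Employee>", "</Employee>")) {
            System.out.println("--- record ---");
            System.out.println(record);
        }
    }
}
```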


Thanks in advance
Ranjini. R

Re: XML to TEXT

Posted by Ranjini Rathinam <ra...@gmail.com>.
Hi,

I am using Hive. As suggested, I am using xpath in the SELECT clause, but I
get an "invalid expression" error.

Please give a sample showing how to process XML in Hive.

Thanks in advance

Ranjini
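
The failing Hive query is not shown, so the cause of the error cannot be
pinned down here. One way to narrow it down is to check the XPath expression
on its own against a sample record with the standard javax.xml.xpath API
before passing it to Hive's xpath functions (for example xpath_string). That
Hive accepts exactly the same expressions is an assumption of this sketch,
and the class name XPathCheck is made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for trying out XPath expressions.
class XPathCheck {

    // Evaluates the expression as a string, or returns null if evaluation fails.
    static String eval(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String record = "<Employee><id>100</id><ename>ranjini</ename></Employee>";
        // Absolute paths start at the root element of the record.
        System.out.println(eval(record, "/Employee/id"));
        System.out.println(eval(record, "/Employee/ename/text()"));
        // An unbalanced predicate is one example of a malformed expression.
        System.out.println(eval(record, "/Employee/id["));
    }
}
```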

On Tue, Jan 7, 2014 at 5:14 PM, Ranjini Rathinam <ra...@gmail.com> wrote:

> Hi Gutierrez ,
>
> As suggest i tried with the code , but in the result.txt i got output only
> header. Nothing else was printing.
>
> After debugging i came to know that while parsing , there is no value.
>
> The problem is in line given below which is bold. While putting SysOut i
> found no value printing in this line.
>
>  String xmlContent = value.toString();
>
>         InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
>         DocumentBuilderFactory factory =
> DocumentBuilderFactory.newInstance();
>         DocumentBuilder builder;
>         try {
>             builder = factory.newDocumentBuilder();
>
> * Document doc = builder.parse(is);*
>    String ed=doc.getDocumentElement().getNodeName();
>    out.write(ed.getBytes());
>             DTMNodeList list = (DTMNodeList) getNode("/Company/Employee",
> doc,XPathConstants.NODESET);
>
> When iam printing
>
> out.write(xmlContent.getBytes):- the whole xml is being printed.
>
> then i wrote for Sysout for list ,nothing printed.
>  out.write(ed.getBytes):- nothing is being printed.
>
> Please suggest where i am going wrong. Please help to fix this.
>
> Thanks in advance.
>
> I have attached my code.Please review.
>
>
> Mapper class:-
>
> public class XmlTextMapper extends Mapper<LongWritable, Text, Text, Text> {
>      private static final XPathFactory xpathFactory =
> XPathFactory.newInstance();
>     @Override
>     public void map(LongWritable key, Text value, Context context)
>             throws IOException, InterruptedException {
>         String resultFileName = "/user/task/Sales/result.txt";
>
>         Configuration conf = new Configuration();
>         FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
>         FSDataOutputStream out = fs.create(new Path(resultFileName));
>         InputStream resultIS = new ByteArrayInputStream(new byte[0]);
>         String header = "id,name\n";
>         out.write(header.getBytes());
>          String xmlContent = value.toString();
>
>         InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
>         DocumentBuilderFactory factory =
> DocumentBuilderFactory.newInstance();
>         DocumentBuilder builder;
>         try {
>             builder = factory.newDocumentBuilder();
>             Document doc = builder.parse(is);
>    String ed=doc.getDocumentElement().getNodeName();
>    out.write(ed.getBytes());
>             DTMNodeList list = (DTMNodeList) getNode("/Company/Employee",
> doc,XPathConstants.NODESET);
>              int size = list.getLength();
>             for (int i = 0; i < size; i++) {
>                 Node node = list.item(i);
>                 String line = "";
>                 NodeList nodeList = node.getChildNodes();
>                 int childNumber = nodeList.getLength();
>                 for (int j = 0; j < childNumber; j++)
>     {
>                     line += nodeList.item(j).getTextContent() + ",";
>                 }
>                 if (line.endsWith(","))
>                     line = line.substring(0, line.length() - 1);
>                 line += "\n";
>                 out.write(line.getBytes());
>             }
>         } catch (ParserConfigurationException e) {
>              e.printStackTrace();
>         } catch (SAXException e) {
>              e.printStackTrace();
>         } catch (XPathExpressionException e) {
>              e.printStackTrace();
>         }
>          IOUtils.copyBytes(resultIS, out, 4096, true);
>         out.close();
>     }
>     public static Object getNode(String xpathStr, Node node, QName
> retunType)
>             throws XPathExpressionException {
>         XPath xpath = xpathFactory.newXPath();
>         return xpath.evaluate(xpathStr, node, retunType);
>     }
> }
>
>
>
> Main class
> public class MainXml {
>      public static void main(String[] args) throws Exception {
> Configuration conf = new Configuration();
>         if (args.length != 2) {
>             System.err
>                     .println("Usage: XMLtoText <input path> <output
> path>");
>             System.exit(-1);
>         }
>   String output="/user/task/Sales/";
>        Job job = new Job(conf, "XML to Text");
>         job.setJarByClass(MainXml.class);
>        // job.setJobName("XML to Text");
>
>         FileInputFormat.addInputPath(job, new Path(args[0]));
>        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
>   Path outPath = new Path(output);
>   FileOutputFormat.setOutputPath(job, outPath);
>   FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
>   if (dfs.exists(outPath)) {
>   dfs.delete(outPath, true);
>   }
>         job.setMapperClass(XmlTextMapper.class);
>
>         job.setNumReduceTasks(0);
>         job.setMapOutputKeyClass(Text.class);
>         job.setMapOutputValueClass(Text.class);
>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>     }
> }
>
>
> My xml file
>
> <Company>
> <Employee>
> <id>100</id>
> <ename>ranjini</ename>
> <dept>IT1</dept>
> <sal>123456</sal>
> <location>nextlevel1</location>
> <Address>
> <Home>Chennai1</Home>
> <Office>Navallur1</Office>
> </Address>
> </Employee>
> <Employee>
> <id>1001</id>
> <ename>ranjinikumar</ename>
> <dept>IT</dept>
> <sal>1234516</sal>
> <location>nextlevel</location>
> <Address>
> <Home>Chennai</Home>
> <Office>Navallur</Office>
> </Address>
> </Employee>
> </Company>
>
>
> Thanks in advance.
>
> Ranjini
>
>
>
>>  On Mon, Jan 6, 2014 at 2:44 PM, Ranjini Rathinam <ranjinibecse@gmail.com
>> > wrote:
>>
>>> Hi,
>>>
>>> Thanks a lot .
>>>
>>> Ranjini
>>>
>>> On Fri, Jan 3, 2014 at 10:40 PM, Diego Gutierrez <
>>> diego.gutierrez@ucsp.edu.pe> wrote:
>>>
>>>>  Hi,
>>>>
>>>> I suggest using XPath, which Java supports natively for querying XML.
>>>>
>>>> For the main problem, as with the distcp command (
>>>> http://hadoop.apache.org/docs/r0.19.0/distcp.pdf ), there is no need for
>>>> a reduce function, because you can parse the XML input file and create the
>>>> file you need in the map function. For example, the following code reads an
>>>> XML file in HDFS, parses it, and creates a new file ("/result.txt") with the
>>>> expected format:
>>>> id,name
>>>> 100,RR
>>>>
>>>>
>>>> Mapper function:
>>>>
>>>> import java.io.ByteArrayInputStream;
>>>> import java.io.IOException;
>>>> import java.io.InputStream;
>>>> import java.net.URI;
>>>>
>>>> import javax.xml.namespace.QName;
>>>> import javax.xml.parsers.DocumentBuilder;
>>>> import javax.xml.parsers.DocumentBuilderFactory;
>>>> import javax.xml.parsers.ParserConfigurationException;
>>>> import javax.xml.xpath.XPath;
>>>> import javax.xml.xpath.XPathConstants;
>>>> import javax.xml.xpath.XPathExpressionException;
>>>> import javax.xml.xpath.XPathFactory;
>>>>
>>>> import org.apache.hadoop.conf.Configuration;
>>>> import org.apache.hadoop.fs.FSDataOutputStream;
>>>> import org.apache.hadoop.fs.FileSystem;
>>>> import org.apache.hadoop.fs.Path;
>>>> import org.apache.hadoop.io.IOUtils;
>>>> import org.apache.hadoop.io.LongWritable;
>>>> import org.apache.hadoop.io.Text;
>>>> import org.apache.hadoop.mapreduce.Mapper;
>>>> import org.w3c.dom.Document;
>>>> import org.w3c.dom.Node;
>>>> import org.w3c.dom.NodeList;
>>>> import org.xml.sax.SAXException;
>>>>
>>>> public class XmlToTextMapper extends Mapper<LongWritable, Text, Text, Text> {
>>>>
>>>>     private static final XPathFactory xpathFactory = XPathFactory.newInstance();
>>>>
>>>>     @Override
>>>>     public void map(LongWritable key, Text value, Context context)
>>>>             throws IOException, InterruptedException {
>>>>
>>>>         String resultFileName = "/result.txt";
>>>>
>>>>         Configuration conf = new Configuration();
>>>>         FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
>>>>         FSDataOutputStream out = fs.create(new Path(resultFileName));
>>>>
>>>>         InputStream resultIS = new ByteArrayInputStream(new byte[0]);
>>>>
>>>>         String header = "id,name\n";
>>>>         out.write(header.getBytes());
>>>>
>>>>         // With the default TextInputFormat, value holds one line of the input file.
>>>>         String xmlContent = value.toString();
>>>>         InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
>>>>         DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
>>>>         DocumentBuilder builder;
>>>>         try {
>>>>             builder = factory.newDocumentBuilder();
>>>>             Document doc = builder.parse(is);
>>>>             // Cast to the public NodeList interface rather than an internal JDK class.
>>>>             NodeList list = (NodeList) getNode("/main/data", doc,
>>>>                     XPathConstants.NODESET);
>>>>
>>>>             int size = list.getLength();
>>>>             for (int i = 0; i < size; i++) {
>>>>                 Node node = list.item(i);
>>>>                 String line = "";
>>>>                 NodeList nodeList = node.getChildNodes();
>>>>                 int childNumber = nodeList.getLength();
>>>>                 for (int j = 0; j < childNumber; j++) {
>>>>                     line += nodeList.item(j).getTextContent() + ",";
>>>>                 }
>>>>                 if (line.endsWith(","))
>>>>                     line = line.substring(0, line.length() - 1);
>>>>                 line += "\n";
>>>>                 out.write(line.getBytes());
>>>>             }
>>>>
>>>>         } catch (ParserConfigurationException e) {
>>>>             System.err.println("error: " + e.getMessage());
>>>>             e.printStackTrace();
>>>>         } catch (SAXException e) {
>>>>             System.err.println("error: " + e.getMessage());
>>>>             e.printStackTrace();
>>>>         } catch (XPathExpressionException e) {
>>>>             System.err.println("error: " + e.getMessage());
>>>>             e.printStackTrace();
>>>>         }
>>>>
>>>>         IOUtils.copyBytes(resultIS, out, 4096, true);
>>>>         out.close();
>>>>     }
>>>>
>>>>     public static Object getNode(String xpathStr, Node node, QName returnType)
>>>>             throws XPathExpressionException {
>>>>         XPath xpath = xpathFactory.newXPath();
>>>>         return xpath.evaluate(xpathStr, node, returnType);
>>>>     }
>>>> }
>>>>
>>>>
>>>>
>>>> --------------------------------------
>>>> Main class:
>>>>
>>>>
>>>> import org.apache.hadoop.fs.Path;
>>>> import org.apache.hadoop.io.Text;
>>>> import org.apache.hadoop.mapreduce.Job;
>>>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>>>>
>>>> public class Main {
>>>>
>>>>     public static void main(String[] args) throws Exception {
>>>>
>>>>         if (args.length != 2) {
>>>>             System.err.println("Usage: XMLtoText <input path> <output path>");
>>>>             System.exit(-1);
>>>>         }
>>>>
>>>>         Job job = new Job();
>>>>         job.setJarByClass(Main.class);
>>>>         job.setJobName("XML to Text");
>>>>         FileInputFormat.addInputPath(job, new Path(args[0]));
>>>>         FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>>>
>>>>         // Map-only job: the mapper writes the result file itself.
>>>>         job.setMapperClass(XmlToTextMapper.class);
>>>>         job.setNumReduceTasks(0);
>>>>         job.setMapOutputKeyClass(Text.class);
>>>>         job.setMapOutputValueClass(Text.class);
>>>>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>>>>
>>>>     }
>>>> }
>>>>
>>>> To execute the job you can use :
>>>>
>>>>          bin/hadoop Main /data.xml /output
>>>>
>>>>
>>>> Then you can use this to see result.txt file:
>>>>
>>>>           hadoop fs -cat /result.txt
>>>>
>>>>
>>>> I'm using this xml as input:
>>>>
>>>>
>>>> <main><data><id>1</id><name>NameA</name></data><data><id>2</id><name>NameB</name></data></main>
>>>>
>>>> and the content in result.txt is like this:
>>>>
>>>> id,name
>>>> 1,NameA
>>>> 2,NameB
>>>>
>>>>
>>>> Hope this helps.
>>>>
>>>>
>>>> 2014/1/3 Ranjini Rathinam <ra...@gmail.com>
>>>>
>>>>> Hi,
>>>>>
>>>>> Need to convert XML into text using mapreduce.
>>>>>
>>>>> I have used DOM and SAX parser.
>>>>>
>>>>> After using SAX Builder in mapper class. the child node act as root
>>>>> Element.
>>>>>
>>>>> While seeing in Sys out i found thar root element is taking the child
>>>>> element and printing.
>>>>>
>>>>> For Eg,
>>>>>
>>>>> <Comp><Emp><id>100</id><name>RR</name></Emp></Comp>
>>>>> when this xml is passed in mapper , in sys out printing the root
>>>>> element
>>>>>
>>>>> I am getting the the root element as
>>>>>
>>>>> <id>
>>>>> <name>
>>>>>
>>>>> Please suggest and help to fix this.
>>>>>
>>>>> I need to convert the xml into text using mapreduce code. Please
>>>>> provide with example.
>>>>>
>>>>> Required output is
>>>>>
>>>>> id,name
>>>>> 100,RR
>>>>>
>>>>> Please help.
>>>>>
>>>>> Thanks in advance,
>>>>> Ranjini R
>>>>
>>>>
>>>
>>
>

Re: XML to TEXT

Posted by Ranjini Rathinam <ra...@gmail.com>.
Hi,

I am using Hive. As suggested, I am using xpath in the SELECT clause, but I
get an "invalid expression" error.

Please give a sample showing how to process XML in Hive.

Thanks in advance

Ranjini

On Tue, Jan 7, 2014 at 5:14 PM, Ranjini Rathinam <ra...@gmail.com>wrote:

> Hi Gutierrez ,
>
> As suggest i tried with the code , but in the result.txt i got output only
> header. Nothing else was printing.
>
> After debugging i came to know that while parsing , there is no value.
>
> The problem is in line given below which is bold. While putting SysOut i
> found no value printing in this line.
>
>  String xmlContent = value.toString();
>
>         InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
>         DocumentBuilderFactory factory =
> DocumentBuilderFactory.newInstance();
>         DocumentBuilder builder;
>         try {
>             builder = factory.newDocumentBuilder();
>
> * Document doc = builder.parse(is);*
>    String ed=doc.getDocumentElement().getNodeName();
>    out.write(ed.getBytes());
>             DTMNodeList list = (DTMNodeList) getNode("/Company/Employee",
> doc,XPathConstants.NODESET);
>
> When iam printing
>
> out.write(xmlContent.getBytes):- the whole xml is being printed.
>
> then i wrote for Sysout for list ,nothing printed.
>  out.write(ed.getBytes):- nothing is being printed.
>
> Please suggest where i am going wrong. Please help to fix this.
>
> Thanks in advance.
>
> I have attached my code.Please review.
>
>
> Mapper class:-
>
> public class XmlTextMapper extends Mapper<LongWritable, Text, Text, Text> {
>      private static final XPathFactory xpathFactory =
> XPathFactory.newInstance();
>     @Override
>     public void map(LongWritable key, Text value, Context context)
>             throws IOException, InterruptedException {
>         String resultFileName = "/user/task/Sales/result.txt";
>
>         Configuration conf = new Configuration();
>         FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
>         FSDataOutputStream out = fs.create(new Path(resultFileName));
>         InputStream resultIS = new ByteArrayInputStream(new byte[0]);
>         String header = "id,name\n";
>         out.write(header.getBytes());
>          String xmlContent = value.toString();
>
>         InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
>         DocumentBuilderFactory factory =
> DocumentBuilderFactory.newInstance();
>         DocumentBuilder builder;
>         try {
>             builder = factory.newDocumentBuilder();
>             Document doc = builder.parse(is);
>    String ed=doc.getDocumentElement().getNodeName();
>    out.write(ed.getBytes());
>             DTMNodeList list = (DTMNodeList) getNode("/Company/Employee",
> doc,XPathConstants.NODESET);
>              int size = list.getLength();
>             for (int i = 0; i < size; i++) {
>                 Node node = list.item(i);
>                 String line = "";
>                 NodeList nodeList = node.getChildNodes();
>                 int childNumber = nodeList.getLength();
>                 for (int j = 0; j < childNumber; j++)
>     {
>                     line += nodeList.item(j).getTextContent() + ",";
>                 }
>                 if (line.endsWith(","))
>                     line = line.substring(0, line.length() - 1);
>                 line += "\n";
>                 out.write(line.getBytes());
>             }
>         } catch (ParserConfigurationException e) {
>              e.printStackTrace();
>         } catch (SAXException e) {
>              e.printStackTrace();
>         } catch (XPathExpressionException e) {
>              e.printStackTrace();
>         }
>          IOUtils.copyBytes(resultIS, out, 4096, true);
>         out.close();
>     }
>     public static Object getNode(String xpathStr, Node node, QName
> retunType)
>             throws XPathExpressionException {
>         XPath xpath = xpathFactory.newXPath();
>         return xpath.evaluate(xpathStr, node, retunType);
>     }
> }
>
>
>
> Main class
> public class MainXml {
>      public static void main(String[] args) throws Exception {
> Configuration conf = new Configuration();
>         if (args.length != 2) {
>             System.err.println("Usage: XMLtoText <input path> <output path>");
>             System.exit(-1);
>         }
>   String output="/user/task/Sales/";
>        Job job = new Job(conf, "XML to Text");
>         job.setJarByClass(MainXml.class);
>        // job.setJobName("XML to Text");
>
>         FileInputFormat.addInputPath(job, new Path(args[0]));
>        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
>   Path outPath = new Path(output);
>   FileOutputFormat.setOutputPath(job, outPath);
>   FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
>   if (dfs.exists(outPath)) {
>   dfs.delete(outPath, true);
>   }
>         job.setMapperClass(XmlTextMapper.class);
>
>         job.setNumReduceTasks(0);
>         job.setMapOutputKeyClass(Text.class);
>         job.setMapOutputValueClass(Text.class);
>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>     }
> }
>
>
> My xml file
>
> <Company>
> <Employee>
> <id>100</id>
> <ename>ranjini</ename>
> <dept>IT1</dept>
> <sal>123456</sal>
> <location>nextlevel1</location>
> <Address>
> <Home>Chennai1</Home>
> <Office>Navallur1</Office>
> </Address>
> </Employee>
> <Employee>
> <id>1001</id>
> <ename>ranjinikumar</ename>
> <dept>IT</dept>
> <sal>1234516</sal>
> <location>nextlevel</location>
> <Address>
> <Home>Chennai</Home>
> <Office>Navallur</Office>
> </Address>
> </Employee>
> </Company>
>
>
> Thanks in advance.
>
> Ranjini
>
>
>
>> On Mon, Jan 6, 2014 at 2:44 PM, Ranjini Rathinam <ranjinibecse@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Thanks a lot .
>>>
>>> Ranjini
>>>
>>> On Fri, Jan 3, 2014 at 10:40 PM, Diego Gutierrez <
>>> diego.gutierrez@ucsp.edu.pe> wrote:
>>>
>>>>  Hi,
>>>>
>>>> I suggest using XPath, which Java supports natively for querying XML.
>>>>
>>>> For the main problem, as with the distcp command (
>>>> http://hadoop.apache.org/docs/r0.19.0/distcp.pdf ), there is no need for
>>>> a reduce function, because you can parse the XML input file and create the
>>>> file you need in the map function. For example, the following code reads an
>>>> XML file from HDFS, parses it, and creates a new file ("/result.txt") in the
>>>> expected format:
>>>> id,name
>>>> 100,RR
>>>>
>>>>
>>>> Mapper function:
>>>>
>>>> import java.io.ByteArrayInputStream;
>>>> import java.io.IOException;
>>>> import java.io.InputStream;
>>>> import java.net.URI;
>>>>
>>>> import javax.xml.namespace.QName;
>>>> import javax.xml.parsers.DocumentBuilder;
>>>> import javax.xml.parsers.DocumentBuilderFactory;
>>>> import javax.xml.parsers.ParserConfigurationException;
>>>> import javax.xml.xpath.XPath;
>>>> import javax.xml.xpath.XPathConstants;
>>>> import javax.xml.xpath.XPathExpressionException;
>>>> import javax.xml.xpath.XPathFactory;
>>>>
>>>> import org.apache.hadoop.conf.Configuration;
>>>> import org.apache.hadoop.fs.FSDataOutputStream;
>>>> import org.apache.hadoop.fs.FileSystem;
>>>> import org.apache.hadoop.fs.Path;
>>>> import org.apache.hadoop.io.IOUtils;
>>>> import org.apache.hadoop.io.LongWritable;
>>>> import org.apache.hadoop.io.Text;
>>>> import org.apache.hadoop.mapreduce.Mapper;
>>>> import org.w3c.dom.Document;
>>>> import org.w3c.dom.Node;
>>>> import org.w3c.dom.NodeList;
>>>> import org.xml.sax.SAXException;
>>>>
>>>> import com.sun.org.apache.xml.internal.dtm.ref.DTMNodeList;
>>>>
>>>> public class XmlToTextMapper extends Mapper<LongWritable, Text, Text,
>>>> Text> {
>>>>
>>>>     private static final XPathFactory xpathFactory =
>>>> XPathFactory.newInstance();
>>>>
>>>>     @Override
>>>>     public void map(LongWritable key, Text value, Context context)
>>>>             throws IOException, InterruptedException {
>>>>
>>>>         String resultFileName = "/result.txt";
>>>>
>>>>
>>>>         Configuration conf = new Configuration();
>>>>         FileSystem fs = FileSystem.get(URI.create(resultFileName),
>>>> conf);
>>>>         FSDataOutputStream out = fs.create(new Path(resultFileName));
>>>>
>>>>         InputStream resultIS = new ByteArrayInputStream(new byte[0]);
>>>>
>>>>         String header = "id,name\n";
>>>>         out.write(header.getBytes());
>>>>
>>>>         String xmlContent = value.toString();
>>>>         InputStream is = new
>>>> ByteArrayInputStream(xmlContent.getBytes());
>>>>         DocumentBuilderFactory factory =
>>>> DocumentBuilderFactory.newInstance();
>>>>         DocumentBuilder builder;
>>>>         try {
>>>>             builder = factory.newDocumentBuilder();
>>>>             Document doc = builder.parse(is);
>>>>             DTMNodeList list = (DTMNodeList) getNode("/main/data", doc,
>>>>                     XPathConstants.NODESET);
>>>>
>>>>             int size = list.getLength();
>>>>             for (int i = 0; i < size; i++) {
>>>>                 Node node = list.item(i);
>>>>                 String line = "";
>>>>                 NodeList nodeList = node.getChildNodes();
>>>>                 int childNumber = nodeList.getLength();
>>>>                 for (int j = 0; j < childNumber; j++) {
>>>>                     line += nodeList.item(j).getTextContent() + ",";
>>>>                 }
>>>>                 if (line.endsWith(","))
>>>>                     line = line.substring(0, line.length() - 1);
>>>>                 line += "\n";
>>>>                 out.write(line.getBytes());
>>>>
>>>>             }
>>>>
>>>>         } catch (ParserConfigurationException e) {
>>>>             MyLogguer.log("error: " + e.getMessage());
>>>>             e.printStackTrace();
>>>>         } catch (SAXException e) {
>>>>             MyLogguer.log("error: " + e.getMessage());
>>>>             e.printStackTrace();
>>>>         } catch (XPathExpressionException e) {
>>>>             MyLogguer.log("error: " + e.getMessage());
>>>>             e.printStackTrace();
>>>>         }
>>>>
>>>>         IOUtils.copyBytes(resultIS, out, 4096, true);
>>>>         out.close();
>>>>     }
>>>>
>>>>     public static Object getNode(String xpathStr, Node node, QName
>>>> retunType)
>>>>             throws XPathExpressionException {
>>>>         XPath xpath = xpathFactory.newXPath();
>>>>         return xpath.evaluate(xpathStr, node, retunType);
>>>>     }
>>>> }
>>>>
>>>>
>>>>
>>>> --------------------------------------
>>>> Main class:
>>>>
>>>>
>>>> public class Main {
>>>>
>>>>     public static void main(String[] args) throws Exception {
>>>>
>>>>         if (args.length != 2) {
>>>>             System.err.println("Usage: XMLtoText <input path> <output path>");
>>>>             System.exit(-1);
>>>>         }
>>>>
>>>>         Job job = new Job();
>>>>         job.setJarByClass(Main.class);
>>>>         job.setJobName("XML to Text");
>>>>         FileInputFormat.addInputPath(job, new Path(args[0]));
>>>>         FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>>>
>>>>         job.setMapperClass(XmlToTextMapper.class);
>>>>         job.setNumReduceTasks(0);
>>>>         job.setMapOutputKeyClass(Text.class);
>>>>         job.setMapOutputValueClass(Text.class);
>>>>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>>>>
>>>>     }
>>>> }
>>>>
>>>> To execute the job you can use:
>>>>
>>>>          bin/hadoop Main /data.xml /output
>>>>
>>>>
>>>> Then you can use this to see result.txt file:
>>>>
>>>>           hadoop fs -cat /result.txt
>>>>
>>>>
>>>> I'm using this xml as input:
>>>>
>>>>
>>>> <main><data><id>1</id><name>NameA</name></data><data><id>2</id><name>NameB</name></data></main>
>>>>
>>>> and the content in result.txt is like this:
>>>>
>>>> id,name
>>>> 1,NameA
>>>> 2,NameB
>>>>
>>>>
>>>> Hope this helps.
>>>>
>>>>
>>>> 2014/1/3 Ranjini Rathinam <ra...@gmail.com>
>>>>
>>>>> Hi,
>>>>>
>>>>> I need to convert XML into text using MapReduce.
>>>>>
>>>>> I have tried both DOM and SAX parsers.
>>>>>
>>>>> After using SAX Builder in the mapper class, a child node acts as the
>>>>> root element.
>>>>>
>>>>> Printing to System.out, I found that the root element is reported as a
>>>>> child element.
>>>>>
>>>>> For example,
>>>>>
>>>>> <Comp><Emp><id>100</id><name>RR</name></Emp></Comp>
>>>>> when this XML is passed to the mapper and I print the root element,
>>>>>
>>>>> I get the root element as
>>>>>
>>>>> <id>
>>>>> <name>
>>>>>
>>>>> Please suggest how to fix this.
>>>>>
>>>>> I need to convert the XML into text using MapReduce code. Please
>>>>> provide an example.
>>>>>
>>>>> Required output is
>>>>>
>>>>> id,name
>>>>> 100,RR
>>>>>
>>>>> Please help.
>>>>>
>>>>> Thanks in advance,
>>>>> Ranjini R

RE: XML to TEXT

Posted by Shankar hiremath <sh...@huawei.com>.
As per my understanding, reading one XML line at a time and sending it to a map task will generally not work.
I suggest partitioning the XML data so that each record is one complete "student" element, well-formed per the XML specification.
Then pass each partitioned "student" element as input to a mapper; the mapper parses it and generates the text data on a single line (here you can reuse your existing recursive code).

Example:

<student>
    <id>100</id>
    <name>ranjini-1</name>
    ................................................
</student>
The above student element should be sent to mapper-1.

<student>
    <id>101</id>
    <name>ranjini-2</name>
    ................................................
</student>
The above student element should be sent to mapper-2.



Complete XML:

<school>
    <student>
        <id>100</id>
        <name>ranjini-1</name>
        ................................................
    </student>
    <student>
        <id>101</id>
        <name>ranjini-2</name>
        ................................................
    </student>
    ........
</school>
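
The partitioning step described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name RecordSplitter and its split method are hypothetical, the whole XML is held in one in-memory string, and the start tag carries no attributes and is never nested inside itself. In a real job this scanning would normally live in a custom record reader, so that each map call receives one complete element.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: cuts an XML string into complete records.
class RecordSplitter {

    /** Returns every substring that runs from startTag to the next endTag, inclusive. */
    static List<String> split(String xml, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int begin = xml.indexOf(startTag, from);
            if (begin < 0) {
                break; // no further records
            }
            int end = xml.indexOf(endTag, begin + startTag.length());
            if (end < 0) {
                break; // incomplete trailing record: ignore it
            }
            end += endTag.length();
            records.add(xml.substring(begin, end));
            from = end;
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<school>\n"
                + "<student>\n<id>100</id>\n<name>ranjini-1</name>\n</student>\n"
                + "<student>\n<id>101</id>\n<name>ranjini-2</name>\n</student>\n"
                + "</school>";
        // Each printed block is what one map call would receive.
        for (String record : split(xml, "<student>", "</student>")) {
            System.out.println("--- record ---");
            System.out.println(record);
        }
    }
}
```

Because each record is a well-formed element on its own, a DOM parser can read it even though the enclosing <school> tags are gone.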



From: Ranjini Rathinam [mailto:ranjinibecse@gmail.com]
Sent: 12 February 2014 PM 01:46
To: user@hadoop.apache.org
Subject: Fwd: XML to TEXT



Please help me convert this XML to text.

I have attached the XML. Please find the attachment.

Some students have two address tags, some have one, and some have none.

I need to convert the XML into a string.

This is my desired output:

100,ranjini,HOME,a street,ad street,ads street,chennai,tn,OFFICE,adsja1 street,adsja2 street,adsja3 street,mumbai,Maharastra
101,nivetha,HOME,a street,ad street,ads street,chennai,tn
102,siva


In plain Java I have written this using recursion, but how do I write it in MapReduce?

How should the code be written in MapReduce? Please help.
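
A recursive flattening of one record can be sketched as follows. The attached XML is not part of this thread, so the element names used here (student, address, type, city, state) are assumptions made only for illustration, as are the class and method names. The sketch takes one complete student element as a string, which is the form a mapper would receive once the input is split per record, and collects the text of every leaf element in document order.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: flattens one record element into a comma-separated line.
class StudentFlattener {

    /** Adds the trimmed text of every leaf element under node, in document order. */
    static void collect(Node node, List<String> fields) {
        NodeList children = node.getChildNodes();
        boolean hasElementChild = false;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Only element children are visited, so whitespace text between tags is skipped.
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasElementChild = true;
                collect(child, fields);
            }
        }
        if (!hasElementChild) {
            fields.add(node.getTextContent().trim());
        }
    }

    /** Parses one record and returns its leaf values joined by commas. */
    static String toCsvLine(String recordXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(recordXml.getBytes(StandardCharsets.UTF_8)));
            List<String> fields = new ArrayList<>();
            collect(doc.getDocumentElement(), fields);
            return String.join(",", fields);
        } catch (Exception e) {
            throw new IllegalArgumentException("not a well-formed record: " + recordXml, e);
        }
    }

    public static void main(String[] args) {
        // Assumed record shapes: one student with an address, one without.
        System.out.println(toCsvLine("<student><id>100</id><name>ranjini</name>"
                + "<address><type>HOME</type><city>chennai</city><state>tn</state></address></student>"));
        System.out.println(toCsvLine("<student><id>102</id><name>siva</name></student>"));
    }
}
```

Students with two, one, or no address elements need no special handling, because the recursion simply emits whatever leaves are present.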

Thanks in advance.
Regards,
Ranjini R


On Fri, Jan 10, 2014 at 12:47 PM, Ranjini Rathinam <ra...@gmail.com> wrote:
Hi,

It is working fine now. The problem was in the XML: the whitespace I had put in it.

Thanks a lot.

Regards,
Ranjini.R
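
A related effect of whitespace in the XML is worth checking when the child-node loop from the mapper is reused: even when a whole element reaches the parser, whitespace between tags is kept as text nodes, so getChildNodes() returns more entries than there are elements. The sketch below uses a hypothetical class name and only JDK classes; it counts the children of a compact and of an indented element and joins only the element children.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical demo class: shows how whitespace between tags becomes text nodes.
class WhitespaceNodesDemo {

    /** Parses the XML and returns its root element. */
    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    /** Number of child nodes of the root, whitespace text nodes included. */
    static int childCount(String xml) {
        return root(xml).getChildNodes().getLength();
    }

    /** Joins the text of the root's element children only, skipping text nodes. */
    static String joinElementChildren(String xml) {
        List<String> fields = new ArrayList<>();
        NodeList children = root(xml).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                fields.add(child.getTextContent().trim());
            }
        }
        return String.join(",", fields);
    }

    public static void main(String[] args) {
        String compact = "<Employee><id>100</id><ename>ranjini</ename></Employee>";
        String indented = "<Employee>\n<id>100</id>\n<ename>ranjini</ename>\n</Employee>";
        System.out.println("compact children:  " + childCount(compact));
        System.out.println("indented children: " + childCount(indented));
        System.out.println("element children:  " + joinElementChildren(indented));
    }
}
```

Filtering on Node.ELEMENT_NODE keeps the output line the same whether or not the input is indented.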
On Thu, Jan 9, 2014 at 10:47 PM, Diego Gutierrez <di...@ucsp.edu.pe> wrote:
Hi,
I'm sending you the Eclipse project with the code. Hope this helps.
Regards
Diego Gutiérrez


2014/1/9 Ranjini Rathinam <ra...@gmail.com>
Hi,

I am using Java 1.6, Hadoop 0.20, and Ubuntu 12.04.

If possible, please send the jar and the code for review.

Thanks for the support,

Ranjini
On Wed, Jan 8, 2014 at 11:00 PM, Diego Gutierrez <di...@ucsp.edu.pe> wrote:
Hi,
I've noticed that your XML file has line breaks. By default, Hadoop splits every file into lines and passes them to the map function one at a time; in other words, each call to the map function processes one line of the file. Please remove the line breaks from your XML and try again. I've tested here with your XML file (changing only the XPath call to DTMNodeList list = (DTMNodeList) getNode("/Company/Employee", doc, XPathConstants.NODESET)), and this is the output in result.txt:


id,name
100,ranjini,IT1,123456,nextlevel1,Chennai1Navallur1
1001,ranjinikumar,IT,1234516,nextlevel,ChennaiNavallur
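
The effect of line-oriented input can be reproduced without Hadoop. The sketch below is an illustration only, with a hypothetical class name and a shortened copy of the Company XML: it hands each line to a DOM parser separately, which is what the mapper effectively does when every map call receives a single line, and then parses the whole text at once.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class: parses an XML file line by line and as a whole.
class LineSplitDemo {

    /** Returns true if the text is a well-formed XML document on its own. */
    static boolean parses(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse errors instead of letting the default handler print them.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed when taken alone
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Company>\n<Employee>\n<id>100</id>\n<ename>ranjini</ename>\n</Employee>\n</Company>";
        // One line at a time, as a line-oriented input format would deliver it.
        for (String line : xml.split("\n")) {
            System.out.println(line + " -> parses alone: " + parses(line));
        }
        System.out.println("whole file -> parses: " + parses(xml));
    }
}
```

Lines that hold only an opening or closing tag fail to parse, while a line such as <id>100</id> parses as a tiny document whose root is <id>. That may also explain the symptom reported at the start of the thread, where <id> and <name> appeared as root elements.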

Note: I don't know whether the Java version or the Hadoop version could be the problem here. I'm using Ubuntu 12.04, Oracle Java 7, and Hadoop 2.2.0.

If you want, I can send you the jar file with the code :)

Regards
Diego Gutiérrez.


2014/1/7 Ranjini Rathinam <ra...@gmail.com>
Hi Gutierrez,

As suggested, I tried the code, but result.txt contains only the header. Nothing else is printed.

After debugging, I found that parsing yields no value.

The problem is in the builder.parse line below (shown in bold in the original message). When I print its result, no value appears.

String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);
            String ed = doc.getDocumentElement().getNodeName();
            out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee", doc, XPathConstants.NODESET);

When I print

out.write(xmlContent.getBytes()): the whole XML is printed.

Then I printed the list, and nothing was printed.
out.write(ed.getBytes()): nothing is printed.

Please suggest where I am going wrong and help me fix this.

Thanks in advance.

I have attached my code. Please review.


Mapper class:-

public class XmlTextMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final XPathFactory xpathFactory = XPathFactory.newInstance();
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String resultFileName = "/user/task/Sales/result.txt";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
        FSDataOutputStream out = fs.create(new Path(resultFileName));
        InputStream resultIS = new ByteArrayInputStream(new byte[0]);
        String header = "id,name\n";
        out.write(header.getBytes());
        String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);
            String ed = doc.getDocumentElement().getNodeName();
            out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee", doc, XPathConstants.NODESET);
            int size = list.getLength();
            for (int i = 0; i < size; i++) {
                Node node = list.item(i);
                String line = "";
                NodeList nodeList = node.getChildNodes();
                int childNumber = nodeList.getLength();
                for (int j = 0; j < childNumber; j++) {
                    line += nodeList.item(j).getTextContent() + ",";
                }
                if (line.endsWith(","))
                    line = line.substring(0, line.length() - 1);
                line += "\n";
                out.write(line.getBytes());
            }
        } catch (ParserConfigurationException e) {
             e.printStackTrace();
        } catch (SAXException e) {
             e.printStackTrace();
        } catch (XPathExpressionException e) {
             e.printStackTrace();
        }
        IOUtils.copyBytes(resultIS, out, 4096, true);
        out.close();
    }
    public static Object getNode(String xpathStr, Node node, QName retunType)
            throws XPathExpressionException {
        XPath xpath = xpathFactory.newXPath();
        return xpath.evaluate(xpathStr, node, retunType);
    }
}



Main class
public class MainXml {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err
                    .println("Usage: XMLtoText <input path> <output path>");
            System.exit(-1);
        }
        String output = "/user/task/Sales/";
        Job job = new Job(conf, "XML to Text");
        job.setJarByClass(MainXml.class);
        // job.setJobName("XML to Text");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
        Path outPath = new Path(output);
        FileOutputFormat.setOutputPath(job, outPath);
        FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
        if (dfs.exists(outPath)) {
            dfs.delete(outPath, true);
        }
        job.setMapperClass(XmlTextMapper.class);

        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}


My xml file

<Company>
<Employee>
<id>100</id>
<ename>ranjini</ename>
<dept>IT1</dept>
<sal>123456</sal>
<location>nextlevel1</location>
<Address>
<Home>Chennai1</Home>
<Office>Navallur1</Office>
</Address>
</Employee>
<Employee>
<id>1001</id>
<ename>ranjinikumar</ename>
<dept>IT</dept>
<sal>1234516</sal>
<location>nextlevel</location>
<Address>
<Home>Chennai</Home>
<Office>Navallur</Office>
</Address>
</Employee>
</Company>


Thanks in advance.

Ranjini


On Mon, Jan 6, 2014 at 2:44 PM, Ranjini Rathinam <ra...@gmail.com> wrote:
Hi,

Thanks a lot .

Ranjini
On Fri, Jan 3, 2014 at 10:40 PM, Diego Gutierrez <di...@ucsp.edu.pe> wrote:
Hi,
I suggest using XPath, which Java supports natively for querying XML.
As for the main problem: as with the distcp command ( http://hadoop.apache.org/docs/r0.19.0/distcp.pdf ), there is no need for a reduce function, because you can parse the XML input and create the file you need in the map function. For example, the following code reads an XML file from HDFS, parses it, and creates a new file ( "/result.txt" ) with the expected format:
id,name
100,RR

Mapper function:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

import com.sun.org.apache.xml.internal.dtm.ref.DTMNodeList;

public class XmlToTextMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final XPathFactory xpathFactory = XPathFactory.newInstance();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        String resultFileName = "/result.txt";


        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
        FSDataOutputStream out = fs.create(new Path(resultFileName));

        InputStream resultIS = new ByteArrayInputStream(new byte[0]);

        String header = "id,name\n";
        out.write(header.getBytes());

        String xmlContent = value.toString();
        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);
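            // The XPath result implements the standard org.w3c.dom.NodeList interface,
            // so a cast to the JDK-internal DTMNodeList class is not required.
            NodeList list = (NodeList) getNode("/main/data", doc,
                    XPathConstants.NODESET);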

            int size = list.getLength();
            for (int i = 0; i < size; i++) {
                Node node = list.item(i);
                String line = "";
                NodeList nodeList = node.getChildNodes();
                int childNumber = nodeList.getLength();
                for (int j = 0; j < childNumber; j++) {
                    line += nodeList.item(j).getTextContent() + ",";
                }
                if (line.endsWith(","))
                    line = line.substring(0, line.length() - 1);
                line += "\n";
                out.write(line.getBytes());

            }

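        } catch (ParserConfigurationException e) {
            // MyLogguer was a project-specific helper that is not shown in this mail;
            // plain stderr output is used here so the class compiles on its own.
            System.err.println("error: " + e.getMessage());
            e.printStackTrace();
        } catch (SAXException e) {
            System.err.println("error: " + e.getMessage());
            e.printStackTrace();
        } catch (XPathExpressionException e) {
            System.err.println("error: " + e.getMessage());
            e.printStackTrace();
        }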

        IOUtils.copyBytes(resultIS, out, 4096, true);
        out.close();
    }

    public static Object getNode(String xpathStr, Node node, QName retunType)
            throws XPathExpressionException {
        XPath xpath = xpathFactory.newXPath();
        return xpath.evaluate(xpathStr, node, retunType);
    }
}



--------------------------------------
Main class:


public class Main {

    public static void main(String[] args) throws Exception {

        if (args.length != 2) {
            System.err
                    .println("Usage: XMLtoText <input path> <output path>");
            System.exit(-1);
        }

        Job job = new Job();
        job.setJarByClass(Main.class);
        job.setJobName("XML to Text");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(XmlToTextMapper.class);
        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);

    }
}

To execute the job you can use:

         bin/hadoop Main /data.xml /output

Then you can use this to see result.txt file:

          hadoop fs -cat /result.txt

I'm using this xml as input:

<main><data><id>1</id><name>NameA</name></data><data><id>2</id><name>NameB</name></data></main>
and the content in result.txt is like this:

id,name
1,NameA
2,NameB
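
The XPath step can also be tried outside Hadoop. The following is a minimal sketch, assuming the input has a main root with data records (matching the "/main/data" expression in the mapper). The class name XPathCsvSketch and the method toCsvLines are made up for illustration. Unlike the mapper above, it keeps only element children, so whitespace between tags does not produce empty fields.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class, not part of the job posted in this thread.
public class XPathCsvSketch {

    // Returns one comma-separated line per element matched by recordPath.
    public static List<String> toCsvLines(String xml, String recordPath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList records = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(recordPath, doc, XPathConstants.NODESET);
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < records.getLength(); i++) {
            List<String> fields = new ArrayList<>();
            NodeList children = records.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                // Skip whitespace-only text nodes that sit between elements.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    fields.add(child.getTextContent().trim());
                }
            }
            lines.add(String.join(",", fields));
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<main><data><id>1</id><name>NameA</name></data>"
                + "<data> <id>2</id> <name>NameB</name> </data></main>";
        for (String line : toCsvLines(xml, "/main/data")) {
            System.out.println(line);
        }
    }
}
```

Inside a mapper, the same method could be called on value.toString(), provided each input record is a complete XML document.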

Hope this helps.

2014/1/3 Ranjini Rathinam <ra...@gmail.com>
Hi,

I need to convert XML into text using MapReduce.

I have tried both the DOM and SAX parsers.

After using SAX Builder in the mapper class, the child node acts as the root element.

Printing to System.out shows that the root element is being taken from the child elements.

For example, when this XML is passed to the mapper:

<Comp><Emp><id>100</id><name>RR</name></Emp></Comp>

the root element printed to System.out is

<id>
<name>

Please suggest how to fix this.

I need to convert the XML into text using MapReduce code. Please provide an example.

Required output is

id,name
100,RR

Please help.

Thanks in advance,
Ranjini R

RE: XML to TEXT

Posted by Shankar hiremath <sh...@huawei.com>.
As I understand it, reading the XML one line at a time and sending each line to a map task will not work in general.
I suggest partitioning the XML data so that each record is one complete, well-formed "student" element.
Then pass each "student" element as input to a mapper; the mapper parses that element and generates the text data on a single line (here you can reuse your existing recursive code).

Example:

            <student>
                        <id>100</id>
                        <name>ranjini-1</name>
                        ................................................
            </student>
The above student element should be sent to mapper-1.

            <student>
                        <id>101</id>
                        <name>ranjini-2</name>
                        ................................................
            </student>
The above student element should be sent to mapper-2.



Complete XML:

<school>
            <student>
                        <id>100</id>
                        <name>ranjini-1</name>
                        ................................................
            </student>
            <student>
                        <id>101</id>
                        <name>ranjini-2</name>
                        ................................................
            </student>
            ........
</school>
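
The partitioning idea can be sketched in plain Java. This is only an illustration of cutting the document into records between a begin tag and an end tag, which is roughly what a begin/end-tag record reader does. The class name RecordSplitSketch and the method splitRecords are made up for this sketch. It assumes the begin tag carries no attributes and that student elements are not nested inside each other.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, shown only to illustrate the partitioning step.
public class RecordSplitSketch {

    // Returns every substring that starts at beginTag and ends with the next endTag.
    public static List<String> splitRecords(String xml, String beginTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(beginTag, from);
            if (start < 0) {
                break; // no further records
            }
            int end = xml.indexOf(endTag, start);
            if (end < 0) {
                break; // unterminated record: ignore the tail
            }
            end += endTag.length();
            records.add(xml.substring(start, end));
            from = end;
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<school>\n"
                + "  <student>\n    <id>100</id>\n    <name>ranjini-1</name>\n  </student>\n"
                + "  <student>\n    <id>101</id>\n    <name>ranjini-2</name>\n  </student>\n"
                + "</school>";
        List<String> records = splitRecords(xml, "<student>", "</student>");
        for (int i = 0; i < records.size(); i++) {
            // Each record is a complete element that one map call could parse.
            System.out.println("record " + (i + 1) + ": " + records.get(i).replaceAll("\\s+", ""));
        }
    }
}
```

Each returned string is a well-formed element on its own, so it can be handed to a DOM parser without the enclosing school element.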



From: Ranjini Rathinam [mailto:ranjinibecse@gmail.com]
Sent: 12 February 2014 PM 01:46
To: user@hadoop.apache.org
Subject: Fwd: XML to TEXT



Please help me convert this XML to text.

I have attached the XML; please find the attachment.

Some students have two address tags, some have one, and some have none.

I need to convert the XML into a string.

This is my desired output:

100,ranjini,HOME,a street,ad street,ads street,chennai,tn,OFFICE,adsja1 street,adsja2 street,adsja3 street,mumbai,Maharastra
101,nivetha,HOME,a street,ad street,ads street,chennai,tn
102,siva
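
Lines like the ones above can be produced by a recursive walk over one student element. The following is a minimal sketch, assuming each record is a complete student element. The attached XML is not shown in this thread, so the element names used in main (address, type, street, city, state) are hypothetical, as are the class name FlattenSketch and its methods.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class; the element names in main are assumptions.
public class FlattenSketch {

    // Appends the text of every leaf element under node, in document order.
    static void collectLeaves(Node node, List<String> out) {
        NodeList children = node.getChildNodes();
        boolean hasElementChild = false;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasElementChild = true;
                collectLeaves(child, out); // recurse into nested elements
            }
        }
        if (!hasElementChild) {
            String text = node.getTextContent().trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
    }

    // Parses one record and returns its leaf values joined by commas.
    public static String flatten(String recordXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(recordXml.getBytes(StandardCharsets.UTF_8)));
        List<String> fields = new ArrayList<>();
        collectLeaves(doc.getDocumentElement(), fields);
        return String.join(",", fields);
    }

    public static void main(String[] args) throws Exception {
        String twoAddresses = "<student><id>100</id><name>ranjini</name>"
                + "<address><type>HOME</type><street>a street</street><city>chennai</city><state>tn</state></address>"
                + "<address><type>OFFICE</type><street>adsja1 street</street><city>mumbai</city><state>Maharastra</state></address>"
                + "</student>";
        String noAddress = "<student><id>102</id><name>siva</name></student>";
        System.out.println(flatten(twoAddresses));
        System.out.println(flatten(noAddress));
    }
}
```

Because the walk only follows the elements that are present, students with two, one, or no address tags need no special handling. In a mapper, flatten could be applied to each record and the result written with context.write.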


In plain Java I have written this using recursion, but how do I write it in MapReduce?

How should the code be written in MapReduce? Please help.

Thanks in advance.
Regards,
Ranjini R


On Fri, Jan 10, 2014 at 12:47 PM, Ranjini Rathinam <ra...@gmail.com> wrote:
Hi,

It's working fine now. The problem was in the XML: the whitespace I had put in it.

Thanks a lot.

Regards,
Ranjini.R
On Thu, Jan 9, 2014 at 10:47 PM, Diego Gutierrez <di...@ucsp.edu.pe> wrote:
Hi,
I'm sending you the Eclipse project with the code. Hope this helps.
Regards
Diego Gutiérrez


2014/1/9 Ranjini Rathinam <ra...@gmail.com>
Hi,

I am using Java 1.6, Hadoop 0.20, and Ubuntu 12.04 here.

If possible, please send the jar and the code for review.

Thanks for the support,

Ranjini
On Wed, Jan 8, 2014 at 11:00 PM, Diego Gutierrez <di...@ucsp.edu.pe> wrote:
Hi,
I've noticed that your XML file contains line breaks. By default (TextInputFormat), Hadoop splits each input file into lines and passes them to the map function one at a time; in other words, each map call processes a single line of the file. Please remove the line breaks from your XML and try again. I tested here with your XML file (changing only the XPath expression passed to getNode to "/Company/Employee"), and this is the output in result.txt:


id,name
100,ranjini,IT1,123456,nextlevel1,Chennai1Navallur1
1001,ranjinikumar,IT,1234516,nextlevel,ChennaiNavallur
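
The effect of line-by-line input can be reproduced without Hadoop. The following is a minimal sketch, assuming the mapper receives one line per call as described above. The class name LineSplitSketch and its methods are made up for illustration. It parses each line of a shortened Company document separately, then parses the same document collapsed to a single line.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; it only imitates "one line per map call".
public class LineSplitSketch {

    // Returns the root element name if text is a well-formed XML document, else null.
    public static String rootName(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler throws on fatal errors, so failures surface as exceptions.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(
                    new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            return null;
        }
    }

    // Parses every line on its own, as a line-oriented input format would deliver it.
    public static List<String> rootNamePerLine(String xml) {
        List<String> roots = new ArrayList<>();
        for (String line : xml.split("\n")) {
            roots.add(rootName(line));
        }
        return roots;
    }

    // Removes the whitespace (including line breaks) between adjacent tags.
    public static String toSingleLine(String xml) {
        return xml.replaceAll(">\\s+<", "><").trim();
    }

    public static void main(String[] args) {
        String xml = "<Company>\n<Employee>\n<id>100</id>\n<ename>ranjini</ename>\n</Employee>\n</Company>";
        System.out.println(rootNamePerLine(xml));
        System.out.println(rootName(toSingleLine(xml)));
    }
}
```

Lines such as <id>100</id> parse as tiny documents whose root is id, while <Company> or </Employee> alone do not parse at all, so "/Company/Employee" never matches and only the header is written. After the line breaks are removed, the root is Company again.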

Note: I don't know whether the Java version or the Hadoop version could be the problem here. I'm using Ubuntu 12.04, Oracle Java 7, and Hadoop 2.2.0.

If you want, I can send you the jar file with the code :)

Regards
Diego Gutiérrez.


2014/1/7 Ranjini Rathinam <ra...@gmail.com>
Hi Gutierrez,

As suggested, I tried the code, but result.txt contained only the header. Nothing else was printed.

After debugging, I found that parsing produces no value.

The problem is in the line given below, which is in bold. When I added a System.out call, no value was printed for this line.

String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);
            String ed = doc.getDocumentElement().getNodeName();
            out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee", doc, XPathConstants.NODESET);

When I print

out.write(xmlContent.getBytes):- the whole XML is printed.

Then I added a System.out call for list; nothing was printed.
out.write(ed.getBytes):- nothing is printed.

Please suggest where I am going wrong, and help me fix this.

Thanks in advance.

I have attached my code. Please review.


Mapper class:-

public class XmlTextMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final XPathFactory xpathFactory = XPathFactory.newInstance();
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String resultFileName = "/user/task/Sales/result.txt";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
        FSDataOutputStream out = fs.create(new Path(resultFileName));
        InputStream resultIS = new ByteArrayInputStream(new byte[0]);
        String header = "id,name\n";
        out.write(header.getBytes());
        String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);
            String ed = doc.getDocumentElement().getNodeName();
            out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee", doc, XPathConstants.NODESET);
            int size = list.getLength();
            for (int i = 0; i < size; i++) {
                Node node = list.item(i);
                String line = "";
                NodeList nodeList = node.getChildNodes();
                int childNumber = nodeList.getLength();
                for (int j = 0; j < childNumber; j++) {
                    line += nodeList.item(j).getTextContent() + ",";
                }
                if (line.endsWith(","))
                    line = line.substring(0, line.length() - 1);
                line += "\n";
                out.write(line.getBytes());
            }
        } catch (ParserConfigurationException e) {
             e.printStackTrace();
        } catch (SAXException e) {
             e.printStackTrace();
        } catch (XPathExpressionException e) {
             e.printStackTrace();
        }
        IOUtils.copyBytes(resultIS, out, 4096, true);
        out.close();
    }
    public static Object getNode(String xpathStr, Node node, QName retunType)
            throws XPathExpressionException {
        XPath xpath = xpathFactory.newXPath();
        return xpath.evaluate(xpathStr, node, retunType);
    }
}



Main class
public class MainXml {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err
                    .println("Usage: XMLtoText <input path> <output path>");
            System.exit(-1);
        }
        String output = "/user/task/Sales/";
        Job job = new Job(conf, "XML to Text");
        job.setJarByClass(MainXml.class);
        // job.setJobName("XML to Text");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
        Path outPath = new Path(output);
        FileOutputFormat.setOutputPath(job, outPath);
        FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
        if (dfs.exists(outPath)) {
            dfs.delete(outPath, true);
        }
        job.setMapperClass(XmlTextMapper.class);

        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}


My xml file

<Company>
<Employee>
<id>100</id>
<ename>ranjini</ename>
<dept>IT1</dept>
<sal>123456</sal>
<location>nextlevel1</location>
<Address>
<Home>Chennai1</Home>
<Office>Navallur1</Office>
</Address>
</Employee>
<Employee>
<id>1001</id>
<ename>ranjinikumar</ename>
<dept>IT</dept>
<sal>1234516</sal>
<location>nextlevel</location>
<Address>
<Home>Chennai</Home>
<Office>Navallur</Office>
</Address>
</Employee>
</Company>


Thanks in advance.

Ranjini


On Mon, Jan 6, 2014 at 2:44 PM, Ranjini Rathinam <ra...@gmail.com> wrote:
Hi,

Thanks a lot .

Ranjini
On Fri, Jan 3, 2014 at 10:40 PM, Diego Gutierrez <di...@ucsp.edu.pe> wrote:
Hi,
I suggest using XPath, which Java supports natively for querying XML.
As for the main problem: as with the distcp command ( http://hadoop.apache.org/docs/r0.19.0/distcp.pdf ), there is no need for a reduce function, because you can parse the XML input and create the file you need in the map function. For example, the following code reads an XML file from HDFS, parses it, and creates a new file ( "/result.txt" ) with the expected format:
id,name
100,RR

Mapper function:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

import com.sun.org.apache.xml.internal.dtm.ref.DTMNodeList;

public class XmlToTextMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final XPathFactory xpathFactory = XPathFactory.newInstance();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        String resultFileName = "/result.txt";


        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
        FSDataOutputStream out = fs.create(new Path(resultFileName));

        InputStream resultIS = new ByteArrayInputStream(new byte[0]);

        String header = "id,name\n";
        out.write(header.getBytes());

        String xmlContent = value.toString();
        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);
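            // The XPath result implements the standard org.w3c.dom.NodeList interface,
            // so a cast to the JDK-internal DTMNodeList class is not required.
            NodeList list = (NodeList) getNode("/main/data", doc,
                    XPathConstants.NODESET);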

            int size = list.getLength();
            for (int i = 0; i < size; i++) {
                Node node = list.item(i);
                String line = "";
                NodeList nodeList = node.getChildNodes();
                int childNumber = nodeList.getLength();
                for (int j = 0; j < childNumber; j++) {
                    line += nodeList.item(j).getTextContent() + ",";
                }
                if (line.endsWith(","))
                    line = line.substring(0, line.length() - 1);
                line += "\n";
                out.write(line.getBytes());

            }

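        } catch (ParserConfigurationException e) {
            // MyLogguer was a project-specific helper that is not shown in this mail;
            // plain stderr output is used here so the class compiles on its own.
            System.err.println("error: " + e.getMessage());
            e.printStackTrace();
        } catch (SAXException e) {
            System.err.println("error: " + e.getMessage());
            e.printStackTrace();
        } catch (XPathExpressionException e) {
            System.err.println("error: " + e.getMessage());
            e.printStackTrace();
        }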

        IOUtils.copyBytes(resultIS, out, 4096, true);
        out.close();
    }

    public static Object getNode(String xpathStr, Node node, QName retunType)
            throws XPathExpressionException {
        XPath xpath = xpathFactory.newXPath();
        return xpath.evaluate(xpathStr, node, retunType);
    }
}



--------------------------------------
Main class:


public class Main {

    public static void main(String[] args) throws Exception {

        if (args.length != 2) {
            System.err
                    .println("Usage: XMLtoText <input path> <output path>");
            System.exit(-1);
        }

        Job job = new Job();
        job.setJarByClass(Main.class);
        job.setJobName("XML to Text");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(XmlToTextMapper.class);
        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);

    }
}

To execute the job you can use:

         bin/hadoop Main /data.xml /output

Then you can use this to see result.txt file:

          hadoop fs -cat /result.txt

I'm using this xml as input:

<main><data><id>1</id><name>NameA</name></data><data><id>2</id><name>NameB</name></data></main>
and the content in result.txt is like this:

id,name
1,NameA
2,NameB

Hope this helps.

2014/1/3 Ranjini Rathinam <ra...@gmail.com>
Hi,

I need to convert XML into text using MapReduce.

I have tried both the DOM and SAX parsers.

After using SAX Builder in the mapper class, the child node acts as the root element.

Printing to System.out shows that the root element is being taken from the child elements.

For example, when this XML is passed to the mapper:

<Comp><Emp><id>100</id><name>RR</name></Emp></Comp>

the root element printed to System.out is

<id>
<name>

Please suggest how to fix this.

I need to convert the XML into text using MapReduce code. Please provide an example.

Required output is

id,name
100,RR

Please help.

Thanks in advance,
Ranjini R

RE: XML to TEXT

Posted by Shankar hiremath <sh...@huawei.com>.
As I understand it, reading the XML one line at a time and sending each line to a map task will not work in general.
I suggest partitioning the XML data so that each record is one complete, well-formed "student" element.
Then pass each "student" element as input to a mapper; the mapper parses that element and generates the text data on a single line (here you can reuse your existing recursive code).

Example:

            <student>
                        <id>100</id>
                        <name>ranjini-1</name>
                        ................................................
            </student>
The above student element should be sent to mapper-1.

            <student>
                        <id>101</id>
                        <name>ranjini-2</name>
                        ................................................
            </student>
The above student element should be sent to mapper-2.



Complete XML:

<school>
            <student>
                        <id>100</id>
                        <name>ranjini-1</name>
                        ................................................
            </student>
            <student>
                        <id>101</id>
                        <name>ranjini-2</name>
                        ................................................
            </student>
            ........
</school>



From: Ranjini Rathinam [mailto:ranjinibecse@gmail.com]
Sent: 12 February 2014 PM 01:46
To: user@hadoop.apache.org
Subject: Fwd: XML to TEXT



Please help me convert this XML to text.

I have attached the XML; please find the attachment.

Some students have two address tags, some have one, and some have none.

I need to convert the XML into a string.

This is my desired output:

100,ranjini,HOME,a street,ad street,ads street,chennai,tn,OFFICE,adsja1 street,adsja2 street,adsja3 street,mumbai,Maharastra
101,nivetha,HOME,a street,ad street,ads street,chennai,tn
102,siva


In plain Java I have written this using recursion, but how do I write it in MapReduce?

How should the code be written in MapReduce? Please help.

Thanks in advance.
Regards,
Ranjini R


On Fri, Jan 10, 2014 at 12:47 PM, Ranjini Rathinam <ra...@gmail.com> wrote:
Hi,

It's working fine now. The problem was in the XML: the whitespace I had put in it.

Thanks a lot.

Regards,
Ranjini.R
On Thu, Jan 9, 2014 at 10:47 PM, Diego Gutierrez <di...@ucsp.edu.pe> wrote:
Hi,
I'm sending you the Eclipse project with the code. Hope this helps.
Regards
Diego Gutiérrez


2014/1/9 Ranjini Rathinam <ra...@gmail.com>
Hi,

I am using Java 1.6, Hadoop 0.20, and Ubuntu 12.04 here.

If possible, please send the jar and the code for review.

Thanks for the support,

Ranjini
On Wed, Jan 8, 2014 at 11:00 PM, Diego Gutierrez <di...@ucsp.edu.pe> wrote:
Hi,
I've noticed that your XML file contains line breaks. By default (TextInputFormat), Hadoop splits each input file into lines and passes them to the map function one at a time; in other words, each map call processes a single line of the file. Please remove the line breaks from your XML and try again. I tested here with your XML file (changing only the XPath expression passed to getNode to "/Company/Employee"), and this is the output in result.txt:


id,name
100,ranjini,IT1,123456,nextlevel1,Chennai1Navallur1
1001,ranjinikumar,IT,1234516,nextlevel,ChennaiNavallur

Note: I don't know whether the Java version or the Hadoop version could be the problem here. I'm using Ubuntu 12.04, Oracle Java 7, and Hadoop 2.2.0.

If you want, I can send you the jar file with the code :)

Regards
Diego Gutiérrez.


2014/1/7 Ranjini Rathinam <ra...@gmail.com>
Hi Gutierrez,

As suggested, I tried the code, but result.txt contained only the header. Nothing else was printed.

After debugging, I found that parsing produces no value.

The problem is in the line given below, which is in bold. When I added a System.out call, no value was printed for this line.

String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);
            String ed = doc.getDocumentElement().getNodeName();
            out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee", doc, XPathConstants.NODESET);

When I print

out.write(xmlContent.getBytes):- the whole XML is printed.

Then I added a System.out call for list; nothing was printed.
out.write(ed.getBytes):- nothing is printed.

Please suggest where I am going wrong, and help me fix this.

Thanks in advance.

I have attached my code. Please review.


Mapper class:-

public class XmlTextMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final XPathFactory xpathFactory = XPathFactory.newInstance();
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String resultFileName = "/user/task/Sales/result.txt";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
        FSDataOutputStream out = fs.create(new Path(resultFileName));
        InputStream resultIS = new ByteArrayInputStream(new byte[0]);
        String header = "id,name\n";
        out.write(header.getBytes());
        String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);
            String ed = doc.getDocumentElement().getNodeName();
            out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee", doc, XPathConstants.NODESET);
            int size = list.getLength();
            for (int i = 0; i < size; i++) {
                Node node = list.item(i);
                String line = "";
                NodeList nodeList = node.getChildNodes();
                int childNumber = nodeList.getLength();
                for (int j = 0; j < childNumber; j++) {
                    line += nodeList.item(j).getTextContent() + ",";
                }
                if (line.endsWith(","))
                    line = line.substring(0, line.length() - 1);
                line += "\n";
                out.write(line.getBytes());
            }
        } catch (ParserConfigurationException e) {
             e.printStackTrace();
        } catch (SAXException e) {
             e.printStackTrace();
        } catch (XPathExpressionException e) {
             e.printStackTrace();
        }
        IOUtils.copyBytes(resultIS, out, 4096, true);
        out.close();
    }
    public static Object getNode(String xpathStr, Node node, QName retunType)
            throws XPathExpressionException {
        XPath xpath = xpathFactory.newXPath();
        return xpath.evaluate(xpathStr, node, retunType);
    }
}



Main class
public class MainXml {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err
                    .println("Usage: XMLtoText <input path> <output path>");
            System.exit(-1);
        }
        String output = "/user/task/Sales/";
        Job job = new Job(conf, "XML to Text");
        job.setJarByClass(MainXml.class);
        // job.setJobName("XML to Text");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
        Path outPath = new Path(output);
        FileOutputFormat.setOutputPath(job, outPath);
        FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
        if (dfs.exists(outPath)) {
            dfs.delete(outPath, true);
        }
        job.setMapperClass(XmlTextMapper.class);

        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}


My xml file

<Company>
<Employee>
<id>100</id>
<ename>ranjini</ename>
<dept>IT1</dept>
<sal>123456</sal>
<location>nextlevel1</location>
<Address>
<Home>Chennai1</Home>
<Office>Navallur1</Office>
</Address>
</Employee>
<Employee>
<id>1001</id>
<ename>ranjinikumar</ename>
<dept>IT</dept>
<sal>1234516</sal>
<location>nextlevel</location>
<Address>
<Home>Chennai</Home>
<Office>Navallur</Office>
</Address>
</Employee>
</Company>


Thanks in advance.

Ranjini


On Mon, Jan 6, 2014 at 2:44 PM, Ranjini Rathinam <ra...@gmail.com> wrote:
Hi,

Thanks a lot .

Ranjini
On Fri, Jan 3, 2014 at 10:40 PM, Diego Gutierrez <di...@ucsp.edu.pe> wrote:
Hi,
I suggest using XPath, which Java supports natively for querying XML.
As for the main problem: as with the distcp command ( http://hadoop.apache.org/docs/r0.19.0/distcp.pdf ), there is no need for a reduce function, because you can parse the XML input and create the file you need in the map function. For example, the following code reads an XML file from HDFS, parses it, and creates a new file ( "/result.txt" ) with the expected format:
id,name
100,RR

Mapper function:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class XmlToTextMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final XPathFactory xpathFactory = XPathFactory.newInstance();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        String resultFileName = "/result.txt";

        // Note: map() runs once per input record, so the file is recreated on each call.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
        FSDataOutputStream out = fs.create(new Path(resultFileName));

        InputStream resultIS = new ByteArrayInputStream(new byte[0]);

        String header = "id,name\n";
        out.write(header.getBytes());

        // Parse the input record as a complete XML document.
        String xmlContent = value.toString();
        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);
            // Cast to the public NodeList interface rather than the JDK-internal DTMNodeList.
            NodeList list = (NodeList) getNode("/main/data", doc,
                    XPathConstants.NODESET);

            int size = list.getLength();
            for (int i = 0; i < size; i++) {
                Node node = list.item(i);
                String line = "";
                NodeList nodeList = node.getChildNodes();
                int childNumber = nodeList.getLength();
                for (int j = 0; j < childNumber; j++) {
                    line += nodeList.item(j).getTextContent() + ",";
                }
                if (line.endsWith(","))
                    line = line.substring(0, line.length() - 1);
                line += "\n";
                out.write(line.getBytes());

            }

        // Report failures on stderr, which ends up in the task logs.
        } catch (ParserConfigurationException e) {
            System.err.println("error: " + e.getMessage());
            e.printStackTrace();
        } catch (SAXException e) {
            System.err.println("error: " + e.getMessage());
            e.printStackTrace();
        } catch (XPathExpressionException e) {
            System.err.println("error: " + e.getMessage());
            e.printStackTrace();
        }

        IOUtils.copyBytes(resultIS, out, 4096, true);
        out.close();
    }

    public static Object getNode(String xpathStr, Node node, QName retunType)
            throws XPathExpressionException {
        XPath xpath = xpathFactory.newXPath();
        return xpath.evaluate(xpathStr, node, retunType);
    }
}



--------------------------------------
Main class:


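import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Main {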

    public static void main(String[] args) throws Exception {

        if (args.length != 2) {
            System.err
                    .println("Usage: XMLtoText <input path> <output path>");
            System.exit(-1);
        }

        Job job = new Job();
        job.setJarByClass(Main.class);
        job.setJobName("XML to Text");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(XmlToTextMapper.class);
        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);

    }
}

To execute the job you can use:

         bin/hadoop Main /data.xml /output

Then you can use this to see result.txt file:

          hadoop fs -cat /result.txt

I'm using this XML as input:

<main><data><id>1</id><name>NameA</name></data><data><id>2</id><name>NameB</name></data></main>
and the content of result.txt looks like this:

id,name
1,NameA
2,NameB

Hope this helps.
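
For anyone who wants to try the XPath step on its own before running it in a job, here is a minimal sketch in plain Java with no Hadoop dependency. The class and method names (XPathCsv, toLines) are made up for illustration and are not part of the code above. It assumes the input is one complete, well-formed document and uses the same /main/data expression as the mapper. Unlike the mapper, it only looks at element children, so whitespace between tags does not produce empty fields.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: turns each node matched by rowPath into one CSV line.
final class XPathCsv {

    static List<String> toLines(String xml, String rowPath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList rows = (NodeList) xpath.evaluate(rowPath, doc, XPathConstants.NODESET);

            List<String> lines = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                List<String> fields = new ArrayList<>();
                NodeList children = rows.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    // Skip whitespace-only text nodes between the tags.
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        fields.add(child.getTextContent().trim());
                    }
                }
                lines.add(String.join(",", fields));
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not convert XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<main><data><id>1</id><name>NameA</name></data>"
                + "<data><id>2</id><name>NameB</name></data></main>";
        System.out.println("id,name");
        for (String line : toLines(xml, "/main/data")) {
            System.out.println(line);
        }
    }
}
```

Inside a mapper, the same loop could emit each line with context.write instead of writing to a fixed file, but that is a design choice and not what the code above does.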

2014/1/3 Ranjini Rathinam <ra...@gmail.com>
Hi,

I need to convert XML into text using MapReduce.

I have tried both the DOM and SAX parsers.

After using SAX Builder in the mapper class, a child node acts as the root element.

Looking at the System.out output, I found that the reported root element is actually a child element.

For example,

<Comp><Emp><id>100</id><name>RR</name></Emp></Comp>
when this XML is passed to the mapper and the root element is printed to System.out,

I get the root element as

<id>
<name>

Please suggest how to fix this.

I need to convert the XML into text using MapReduce code. Please provide an example.

Required output is

id,name
100,RR

Please help.

Thanks in advance,
Ranjini R


RE: XML to TEXT

Posted by Shankar hiremath <sh...@huawei.com>.
As I understand it, reading the XML one line at a time and sending each line to a map task will not work in general.
I suggest partitioning the XML data so that each record is one complete "student" element, well-formed per the XML specification.
Then pass each "student" element as input to a mapper; the mapper parses it and generates the text data on a single line (here you can reuse your existing recursive code).

Example:
<student>
    <id>100</id>
    <name>ranjini-1</name>
    ................................................
</student>
The above student element should be sent to mapper-1

<student>
    <id>101</id>
    <name>ranjini-2</name>
    ................................................
</student>
The above student element should be sent to mapper-2
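
What each mapper then has to do with its "student" element can be sketched as follows. This is only an illustration in plain Java, not tested inside a job: StudentFlattener and toLine are made-up names, and the address/type/city tags in the sample are assumptions, since the attached XML is not shown in this thread. The recursion visits every child and records each non-blank text value, so students with zero, one or two address blocks all come out as a single comma-separated line.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: flattens one complete XML element into a CSV line.
final class StudentFlattener {

    // Parses one complete element and joins its leaf text values with commas.
    static String toLine(String elementXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(elementXml.getBytes(StandardCharsets.UTF_8)));
            List<String> values = new ArrayList<>();
            collect(doc.getDocumentElement(), values);
            return String.join(",", values);
        } catch (Exception e) {
            throw new IllegalArgumentException("not a complete XML element", e);
        }
    }

    // Depth-first walk: keep non-blank text nodes, recurse into child elements.
    private static void collect(Node node, List<String> values) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getTextContent().trim();
                if (!text.isEmpty()) {
                    values.add(text);
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                collect(child, values);
            }
        }
    }

    public static void main(String[] args) {
        // The nested tag names here are assumed for the example.
        String student = "<student><id>101</id><name>nivetha</name>"
                + "<address><type>HOME</type><city>chennai</city></address></student>";
        System.out.println(toLine(student));
    }
}
```

In the map function the value passed in would be the element text, and the returned line could be written with context.write.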



Complete XML:

<school>
    <student>
        <id>100</id>
        <name>ranjini-1</name>
        ................................................
    </student>
    <student>
        <id>101</id>
        <name>ranjini-2</name>
        ................................................
    </student>
    ........
</school>
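
The partitioning step itself can be sketched like this, again in plain Java and only as an illustration. ElementSplitter and split are made-up names; in a real job this logic would live in a record reader (for example the XMLInputFormat mentioned earlier in this thread) so that each mapper call receives one complete element. The sketch assumes the start tag has no attributes and that the element is never nested inside itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: cuts a document into complete <tag>...</tag> records.
final class ElementSplitter {

    // Returns every complete <tag>...</tag> block found in xml, in document order.
    static List<String> split(String xml, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(open, from);
            if (start < 0) {
                break;
            }
            int end = xml.indexOf(close, start);
            if (end < 0) {
                break; // unterminated element: not a whole record, so skip it
            }
            end += close.length();
            records.add(xml.substring(start, end));
            from = end;
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<school>\n"
                + "<student>\n<id>100</id>\n<name>ranjini-1</name>\n</student>\n"
                + "<student>\n<id>101</id>\n<name>ranjini-2</name>\n</student>\n"
                + "</school>";
        for (String record : split(xml, "student")) {
            // Collapse line breaks so each record prints on one line.
            System.out.println(record.replace("\n", ""));
        }
    }
}
```

Each string returned by split is a well-formed element on its own, which is what the mapper needs in order to parse it.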



From: Ranjini Rathinam [mailto:ranjinibecse@gmail.com]
Sent: 12 February 2014 PM 01:46
To: user@hadoop.apache.org
Subject: Fwd: XML to TEXT



Please help me convert this XML to text.

I have attached the XML; please find the attachment.

Some students have two address tags, some have one, and some have none.

I need to convert the XML into a string.

This is my desired output:

100,ranjini,HOME,a street,ad street,ads street,chennai,tn,OFFICE,adsja1 street,adsja2 street,adsja3 street,mumbai,Maharastra
101,nivetha,HOME,a street,ad street,ads street,chennai,tn
102,siva


In plain Java I have written this using recursion, but how do I write it in MapReduce?

Please help.

Thanks in advance.
Regards,
Ranjini R


On Fri, Jan 10, 2014 at 12:47 PM, Ranjini Rathinam <ra...@gmail.com> wrote:
Hi,

It's working fine. The problem was in the XML: the whitespace I had added.

Thanks a lot.

Regards,
Ranjini.R
On Thu, Jan 9, 2014 at 10:47 PM, Diego Gutierrez <di...@ucsp.edu.pe> wrote:
Hi,
I'm sending you the Eclipse project with the code. Hope this helps.
Regards
Diego Gutiérrez


2014/1/9 Ranjini Rathinam <ra...@gmail.com>
Hi,

I am using Java 1.6 and Hadoop 0.20 on Ubuntu 12.04.

If possible, please send the jar and the code for review.

Thanks for the support,

Ranjini
On Wed, Jan 8, 2014 at 11:00 PM, Diego Gutierrez <di...@ucsp.edu.pe> wrote:
Hi,
I've noticed that your XML file contains line breaks. By default Hadoop splits every input file into lines and passes them to the map function one at a time; in other words, each call to map processes a single line of the file. Please remove the line breaks from your XML and try again. I've tested here with your XML file, changing only this statement:

DTMNodeList list = (DTMNodeList) getNode("/Company/Employee", doc,
                    XPathConstants.NODESET);

and this is the output in result.txt:


id,name
100,ranjini,IT1,123456,nextlevel1,Chennai1Navallur1
1001,ranjinikumar,IT,1234516,nextlevel,ChennaiNavallur
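
To see the effect of the line splitting without running a job, here is a small sketch in plain Java. LineByLineDemo and rootNameOrNull are made-up names for illustration. It feeds each line of a multi-line document to the parser separately, the way the default line-based input hands lines to map, and then parses the same document with the line breaks removed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: what the XML parser sees when it is given one line at a time.
final class LineByLineDemo {

    // Returns the root element name if text is a well-formed document, else null.
    static String rootNameOrNull(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement().getNodeName();
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String multiLine = "<Company>\n<Employee>\n<id>100</id>\n<ename>ranjini</ename>\n"
                + "</Employee>\n</Company>";
        // One parse attempt per line, as with line-based input splitting.
        for (String line : multiLine.split("\n")) {
            System.out.println(line + " -> " + rootNameOrNull(line));
        }
        // The same document with the line breaks removed.
        System.out.println("whole document -> " + rootNameOrNull(multiLine.replace("\n", "")));
    }
}
```

A line such as <Company> is not a complete document and fails to parse, while a line such as <id>100</id> parses on its own with id as its root. That may also be why the first message in this thread saw <id> and <name> reported as root elements.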

Note: I don't know whether the Java or Hadoop version could be the problem here. I'm using Ubuntu 12.04, Oracle Java 7 and Hadoop 2.2.0.

If you want, I can send you the jar file with the code :)

Regards
Diego Gutiérrez.


2014/1/7 Ranjini Rathinam <ra...@gmail.com>
Hi Gutierrez,

As suggested, I tried the code, but result.txt contained only the header. Nothing else was printed.

After debugging, I found that parsing produces no value.

The problem is in the line Document doc = builder.parse(is); below (bold in the original message). When I added a System.out call there, no value was printed.

String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);
   String ed=doc.getDocumentElement().getNodeName();
   out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee", doc,XPathConstants.NODESET);

When I print

out.write(xmlContent.getBytes()): the whole XML is printed.

Then I added a System.out call for list: nothing was printed.
out.write(ed.getBytes()): nothing is printed.

Please suggest where I am going wrong and help me fix this.

Thanks in advance.

I have attached my code. Please review.


Mapper class:-

public class XmlTextMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final XPathFactory xpathFactory = XPathFactory.newInstance();
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String resultFileName = "/user/task/Sales/result.txt";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
        FSDataOutputStream out = fs.create(new Path(resultFileName));
        InputStream resultIS = new ByteArrayInputStream(new byte[0]);
        String header = "id,name\n";
        out.write(header.getBytes());
        String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);
            String ed = doc.getDocumentElement().getNodeName();
            out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee", doc,XPathConstants.NODESET);
            int size = list.getLength();
            for (int i = 0; i < size; i++) {
                Node node = list.item(i);
                String line = "";
                NodeList nodeList = node.getChildNodes();
                int childNumber = nodeList.getLength();
                for (int j = 0; j < childNumber; j++) {
                    line += nodeList.item(j).getTextContent() + ",";
                }
                if (line.endsWith(","))
                    line = line.substring(0, line.length() - 1);
                line += "\n";
                out.write(line.getBytes());
            }
        } catch (ParserConfigurationException e) {
             e.printStackTrace();
        } catch (SAXException e) {
             e.printStackTrace();
        } catch (XPathExpressionException e) {
             e.printStackTrace();
        }
        IOUtils.copyBytes(resultIS, out, 4096, true);
        out.close();
    }
    public static Object getNode(String xpathStr, Node node, QName retunType)
            throws XPathExpressionException {
        XPath xpath = xpathFactory.newXPath();
        return xpath.evaluate(xpathStr, node, retunType);
    }
}



Main class
public class MainXml {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err
                    .println("Usage: XMLtoText <input path> <output path>");
            System.exit(-1);
        }
        String output = "/user/task/Sales/";
        Job job = new Job(conf, "XML to Text");
        job.setJarByClass(MainXml.class);
        // job.setJobName("XML to Text");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
        Path outPath = new Path(output);
        FileOutputFormat.setOutputPath(job, outPath);
        FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
        if (dfs.exists(outPath)) {
            dfs.delete(outPath, true);
        }
        job.setMapperClass(XmlTextMapper.class);

        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}


My xml file

<Company>
<Employee>
<id>100</id>
<ename>ranjini</ename>
<dept>IT1</dept>
<sal>123456</sal>
<location>nextlevel1</location>
<Address>
<Home>Chennai1</Home>
<Office>Navallur1</Office>
</Address>
</Employee>
<Employee>
<id>1001</id>
<ename>ranjinikumar</ename>
<dept>IT</dept>
<sal>1234516</sal>
<location>nextlevel</location>
<Address>
<Home>Chennai</Home>
<Office>Navallur</Office>
</Address>
</Employee>
</Company>


Thanks in advance.

Ranjini


Fwd: XML to TEXT

Posted by Ranjini Rathinam <ra...@gmail.com>.
>
> Please help me convert this XML to text.
>>
>>
>> I have attached the XML; please find the attachment.
>>
>> Some students have two address tags, some have one, and some have
>> none.
>>
>> I need to convert the XML into a string.
>>
>> This is my desired output:
>>
>> 100,ranjini,HOME,a street,ad street,ads street,chennai,tn,OFFICE,adsja1 street,adsja2 street,adsja3 street,mumbai,Maharastra
>> 101,nivetha,HOME,a street,ad street,ads street,chennai,tn
>> 102,siva
>>
>>
>> In plain Java I have written this using recursion, but how do I write it
>> in MapReduce?
>>
>> Please help.
>>
>> Thanks in advance.
>> Regards,
>> Ranjini R
>>
>>
>> On Fri, Jan 10, 2014 at 12:47 PM, Ranjini Rathinam <
>> ranjinibecse@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> It's working fine. The problem was in the XML: the whitespace I had added.
>>>
>>> Thanks a lot.
>>>
>>> Regards,
>>> Ranjini.R
>>>
>>>  On Thu, Jan 9, 2014 at 10:47 PM, Diego Gutierrez <
>>> diego.gutierrez@ucsp.edu.pe> wrote:
>>>
>>>>  Hi,
>>>>
>>>> I'm sending you the eclipse project with the code. Hope this helps.
>>>>
>>>> Regards
>>>> Diego Gutiérrez
>>>>
>>>>
>>>>
>>>> 2014/1/9 Ranjini Rathinam <ra...@gmail.com>
>>>>
>>>>> Hi,
>>>>>
>>>>> I am using Java 1.6 and Hadoop 0.20 on Ubuntu 12.04.
>>>>>
>>>>> If possible, please send the jar and the code for review.
>>>>>
>>>>> Thanks for the support,
>>>>>
>>>>> Ranjini
>>>>>
>>>>>  On Wed, Jan 8, 2014 at 11:00 PM, Diego Gutierrez <
>>>>> diego.gutierrez@ucsp.edu.pe> wrote:
>>>>>
>>>>>>   Hi,
>>>>>>
>>>>>> I've noticed that your XML file contains line breaks. By default Hadoop
>>>>>> splits every input file into lines and passes them to the map function
>>>>>> one at a time; in other words, each call to map processes a single line
>>>>>> of the file. Please remove the line breaks from your XML and try again.
>>>>>> I've tested here with your XML file, changing only this statement:
>>>>>>
>>>>>> DTMNodeList list = (DTMNodeList) getNode("/Company/Employee", doc,
>>>>>>                     XPathConstants.NODESET);
>>>>>>
>>>>>> and this is the output in result.txt:
>>>>>>
>>>>>>
>>>>>> id,name
>>>>>> 100,ranjini,IT1,123456,nextlevel1,Chennai1Navallur1
>>>>>> 1001,ranjinikumar,IT,1234516,nextlevel,ChennaiNavallur
>>>>>>
>>>>>>
>>>>>> Note: I don't know whether the Java or Hadoop version could be the
>>>>>> problem here. I'm using Ubuntu 12.04, Oracle Java 7 and Hadoop 2.2.0.
>>>>>>
>>>>>>
>>>>>> If you want, I can send you the jar file with the code :)
>>>>>>
>>>>>> Regards
>>>>>> Diego Gutiérrez.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2014/1/7 Ranjini Rathinam <ra...@gmail.com>
>>>>>>
>>>>>>> Hi Gutierrez ,
>>>>>>>
>>>>>>> As suggested, I tried the code, but result.txt contained only the
>>>>>>> header. Nothing else was printed.
>>>>>>>
>>>>>>> After debugging, I found that parsing produces no value.
>>>>>>>
>>>>>>> The problem is in the line marked in bold below. When I added a
>>>>>>> System.out call there, no value was printed.
>>>>>>>
>>>>>>>  String xmlContent = value.toString();
>>>>>>>
>>>>>>>         InputStream is = new
>>>>>>> ByteArrayInputStream(xmlContent.getBytes());
>>>>>>>         DocumentBuilderFactory factory =
>>>>>>> DocumentBuilderFactory.newInstance();
>>>>>>>         DocumentBuilder builder;
>>>>>>>         try {
>>>>>>>             builder = factory.newDocumentBuilder();
>>>>>>>
>>>>>>> * Document doc = builder.parse(is);*
>>>>>>>    String ed=doc.getDocumentElement().getNodeName();
>>>>>>>    out.write(ed.getBytes());
>>>>>>>             DTMNodeList list = (DTMNodeList)
>>>>>>> getNode("/Company/Employee", doc,XPathConstants.NODESET);
>>>>>>>
>>>>>>> When I print
>>>>>>>
>>>>>>> out.write(xmlContent.getBytes()): the whole XML is printed.
>>>>>>>
>>>>>>> Then I added a System.out call for list: nothing was printed.
>>>>>>> out.write(ed.getBytes()): nothing is printed.
>>>>>>>
>>>>>>> Please suggest where I am going wrong and help me fix this.
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>>
>>>>>>> I have attached my code. Please review.
>>>>>>>
>>>>>>>
>>>>>>> Mapper class:-
>>>>>>>
>>>>>>> public class XmlTextMapper extends Mapper<LongWritable, Text, Text,
>>>>>>> Text> {
>>>>>>>      private static final XPathFactory xpathFactory =
>>>>>>> XPathFactory.newInstance();
>>>>>>>     @Override
>>>>>>>     public void map(LongWritable key, Text value, Context context)
>>>>>>>             throws IOException, InterruptedException {
>>>>>>>         String resultFileName = "/user/task/Sales/result.txt";
>>>>>>>
>>>>>>>         Configuration conf = new Configuration();
>>>>>>>         FileSystem fs = FileSystem.get(URI.create(resultFileName),
>>>>>>> conf);
>>>>>>>         FSDataOutputStream out = fs.create(new Path(resultFileName));
>>>>>>>         InputStream resultIS = new ByteArrayInputStream(new byte[0]);
>>>>>>>         String header = "id,name\n";
>>>>>>>         out.write(header.getBytes());
>>>>>>>          String xmlContent = value.toString();
>>>>>>>
>>>>>>>         InputStream is = new
>>>>>>> ByteArrayInputStream(xmlContent.getBytes());
>>>>>>>         DocumentBuilderFactory factory =
>>>>>>> DocumentBuilderFactory.newInstance();
>>>>>>>         DocumentBuilder builder;
>>>>>>>         try {
>>>>>>>             builder = factory.newDocumentBuilder();
>>>>>>>             Document doc = builder.parse(is);
>>>>>>>    String ed=doc.getDocumentElement().getNodeName();
>>>>>>>    out.write(ed.getBytes());
>>>>>>>             DTMNodeList list = (DTMNodeList)
>>>>>>> getNode("/Company/Employee", doc,XPathConstants.NODESET);
>>>>>>>              int size = list.getLength();
>>>>>>>             for (int i = 0; i < size; i++) {
>>>>>>>                 Node node = list.item(i);
>>>>>>>                 String line = "";
>>>>>>>                 NodeList nodeList = node.getChildNodes();
>>>>>>>                 int childNumber = nodeList.getLength();
>>>>>>>                 for (int j = 0; j < childNumber; j++)
>>>>>>>     {
>>>>>>>                     line += nodeList.item(j).getTextContent() + ",";
>>>>>>>                 }
>>>>>>>                 if (line.endsWith(","))
>>>>>>>                     line = line.substring(0, line.length() - 1);
>>>>>>>                 line += "\n";
>>>>>>>                 out.write(line.getBytes());
>>>>>>>             }
>>>>>>>         } catch (ParserConfigurationException e) {
>>>>>>>              e.printStackTrace();
>>>>>>>         } catch (SAXException e) {
>>>>>>>              e.printStackTrace();
>>>>>>>         } catch (XPathExpressionException e) {
>>>>>>>              e.printStackTrace();
>>>>>>>         }
>>>>>>>          IOUtils.copyBytes(resultIS, out, 4096, true);
>>>>>>>         out.close();
>>>>>>>     }
>>>>>>>     public static Object getNode(String xpathStr, Node node, QName
>>>>>>> retunType)
>>>>>>>             throws XPathExpressionException {
>>>>>>>         XPath xpath = xpathFactory.newXPath();
>>>>>>>         return xpath.evaluate(xpathStr, node, retunType);
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Main class
>>>>>>> public class MainXml {
>>>>>>>      public static void main(String[] args) throws Exception {
>>>>>>> Configuration conf = new Configuration();
>>>>>>>         if (args.length != 2) {
>>>>>>>             System.err
>>>>>>>                     .println("Usage: XMLtoText <input path> <output
>>>>>>> path>");
>>>>>>>             System.exit(-1);
>>>>>>>         }
>>>>>>>   String output="/user/task/Sales/";
>>>>>>>        Job job = new Job(conf, "XML to Text");
>>>>>>>         job.setJarByClass(MainXml.class);
>>>>>>>        // job.setJobName("XML to Text");
>>>>>>>
>>>>>>>         FileInputFormat.addInputPath(job, new Path(args[0]));
>>>>>>>        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>>>>>>   Path outPath = new Path(output);
>>>>>>>   FileOutputFormat.setOutputPath(job, outPath);
>>>>>>>   FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
>>>>>>>   if (dfs.exists(outPath)) {
>>>>>>>   dfs.delete(outPath, true);
>>>>>>>   }
>>>>>>>         job.setMapperClass(XmlTextMapper.class);
>>>>>>>
>>>>>>>         job.setNumReduceTasks(0);
>>>>>>>         job.setMapOutputKeyClass(Text.class);
>>>>>>>         job.setMapOutputValueClass(Text.class);
>>>>>>>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> My xml file
>>>>>>>
>>>>>>> <Company>
>>>>>>> <Employee>
>>>>>>> <id>100</id>
>>>>>>> <ename>ranjini</ename>
>>>>>>> <dept>IT1</dept>
>>>>>>> <sal>123456</sal>
>>>>>>> <location>nextlevel1</location>
>>>>>>> <Address>
>>>>>>> <Home>Chennai1</Home>
>>>>>>> <Office>Navallur1</Office>
>>>>>>> </Address>
>>>>>>> </Employee>
>>>>>>> <Employee>
>>>>>>> <id>1001</id>
>>>>>>> <ename>ranjinikumar</ename>
>>>>>>> <dept>IT</dept>
>>>>>>> <sal>1234516</sal>
>>>>>>> <location>nextlevel</location>
>>>>>>> <Address>
>>>>>>> <Home>Chennai</Home>
>>>>>>> <Office>Navallur</Office>
>>>>>>> </Address>
>>>>>>> </Employee>
>>>>>>> </Company>
>>>>>>>
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>>
>>>>>>> Ranjini
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>  On Mon, Jan 6, 2014 at 2:44 PM, Ranjini Rathinam <
>>>>>>>> ranjinibecse@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Thanks a lot .
>>>>>>>>>
>>>>>>>>> Ranjini
>>>>>>>>>
>>>>>>>>> On Fri, Jan 3, 2014 at 10:40 PM, Diego Gutierrez <
>>>>>>>>> diego.gutierrez@ucsp.edu.pe> wrote:
>>>>>>>>>
>>>>>>>>>>  Hi,
>>>>>>>>>>
>>>>>>>>>> I suggest using XPath, which Java supports natively for querying
>>>>>>>>>> XML documents.
>>>>>>>>>>
>>>>>>>>>> As for the main problem: like the distcp command (
>>>>>>>>>> http://hadoop.apache.org/docs/r0.19.0/distcp.pdf ), this job does not
>>>>>>>>>> need a reduce function, because you can parse the XML input and create
>>>>>>>>>> the file you need in the map function. For example, the following code
>>>>>>>>>> reads an XML file from HDFS, parses it, and creates a new file
>>>>>>>>>> ( "/result.txt" ) in the expected format:
>>>>>>>>>> id,name
>>>>>>>>>> 100,RR
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Mapper function:
>>>>>>>>>>
>>>>>>>>>> import java.io.ByteArrayInputStream;
>>>>>>>>>> import java.io.IOException;
>>>>>>>>>> import java.io.InputStream;
>>>>>>>>>> import java.net.URI;
>>>>>>>>>>
>>>>>>>>>> import javax.xml.namespace.QName;
>>>>>>>>>> import javax.xml.parsers.DocumentBuilder;
>>>>>>>>>> import javax.xml.parsers.DocumentBuilderFactory;
>>>>>>>>>> import javax.xml.parsers.ParserConfigurationException;
>>>>>>>>>> import javax.xml.xpath.XPath;
>>>>>>>>>> import javax.xml.xpath.XPathConstants;
>>>>>>>>>> import javax.xml.xpath.XPathExpressionException;
>>>>>>>>>> import javax.xml.xpath.XPathFactory;
>>>>>>>>>>
>>>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>>>> import org.apache.hadoop.fs.FSDataOutputStream;
>>>>>>>>>> import org.apache.hadoop.fs.FileSystem;
>>>>>>>>>> import org.apache.hadoop.fs.Path;
>>>>>>>>>> import org.apache.hadoop.io.IOUtils;
>>>>>>>>>> import org.apache.hadoop.io.LongWritable;
>>>>>>>>>> import org.apache.hadoop.io.Text;
>>>>>>>>>> import org.apache.hadoop.mapreduce.Mapper;
>>>>>>>>>> import org.w3c.dom.Document;
>>>>>>>>>> import org.w3c.dom.Node;
>>>>>>>>>> import org.w3c.dom.NodeList;
>>>>>>>>>> import org.xml.sax.SAXException;
>>>>>>>>>>
>>>>>>>>>> public class XmlToTextMapper extends Mapper<LongWritable, Text,
>>>>>>>>>> Text, Text> {
>>>>>>>>>>
>>>>>>>>>>     private static final XPathFactory xpathFactory =
>>>>>>>>>> XPathFactory.newInstance();
>>>>>>>>>>
>>>>>>>>>>     @Override
>>>>>>>>>>     public void map(LongWritable key, Text value, Context context)
>>>>>>>>>>             throws IOException, InterruptedException {
>>>>>>>>>>
>>>>>>>>>>         String resultFileName = "/result.txt";
>>>>>>>>>>
>>>>>>>>>>         // Note: fs.create() overwrites an existing file, so result.txt
>>>>>>>>>>         // is recreated on every map() call. This relies on the whole
>>>>>>>>>>         // XML document arriving as a single input line.
>>>>>>>>>>         Configuration conf = new Configuration();
>>>>>>>>>>         FileSystem fs =
>>>>>>>>>> FileSystem.get(URI.create(resultFileName), conf);
>>>>>>>>>>         FSDataOutputStream out = fs.create(new
>>>>>>>>>> Path(resultFileName));
>>>>>>>>>>
>>>>>>>>>>         InputStream resultIS = new ByteArrayInputStream(new
>>>>>>>>>> byte[0]);
>>>>>>>>>>
>>>>>>>>>>         String header = "id,name\n";
>>>>>>>>>>         out.write(header.getBytes());
>>>>>>>>>>
>>>>>>>>>>         String xmlContent = value.toString();
>>>>>>>>>>         InputStream is = new
>>>>>>>>>> ByteArrayInputStream(xmlContent.getBytes());
>>>>>>>>>>         DocumentBuilderFactory factory =
>>>>>>>>>> DocumentBuilderFactory.newInstance();
>>>>>>>>>>         DocumentBuilder builder;
>>>>>>>>>>         try {
>>>>>>>>>>             builder = factory.newDocumentBuilder();
>>>>>>>>>>             Document doc = builder.parse(is);
>>>>>>>>>>             // Cast to the standard org.w3c.dom.NodeList rather than
>>>>>>>>>>             // the JDK-internal DTMNodeList, which is not a public API.
>>>>>>>>>>             NodeList list = (NodeList)
>>>>>>>>>> getNode("/main/data", doc,
>>>>>>>>>>                     XPathConstants.NODESET);
>>>>>>>>>>
>>>>>>>>>>             int size = list.getLength();
>>>>>>>>>>             for (int i = 0; i < size; i++) {
>>>>>>>>>>                 Node node = list.item(i);
>>>>>>>>>>                 String line = "";
>>>>>>>>>>                 NodeList nodeList = node.getChildNodes();
>>>>>>>>>>                 int childNumber = nodeList.getLength();
>>>>>>>>>>                 for (int j = 0; j < childNumber; j++) {
>>>>>>>>>>                     line += nodeList.item(j).getTextContent() +
>>>>>>>>>> ",";
>>>>>>>>>>                 }
>>>>>>>>>>                 if (line.endsWith(","))
>>>>>>>>>>                     line = line.substring(0, line.length() - 1);
>>>>>>>>>>                 line += "\n";
>>>>>>>>>>                 out.write(line.getBytes());
>>>>>>>>>>
>>>>>>>>>>             }
>>>>>>>>>>
>>>>>>>>>>         } catch (ParserConfigurationException e) {
>>>>>>>>>>             // MyLogguer is not defined in the posted code; log to stderr.
>>>>>>>>>>             System.err.println("error: " + e.getMessage());
>>>>>>>>>>             e.printStackTrace();
>>>>>>>>>>         } catch (SAXException e) {
>>>>>>>>>>             System.err.println("error: " + e.getMessage());
>>>>>>>>>>             e.printStackTrace();
>>>>>>>>>>         } catch (XPathExpressionException e) {
>>>>>>>>>>             System.err.println("error: " + e.getMessage());
>>>>>>>>>>             e.printStackTrace();
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>>         IOUtils.copyBytes(resultIS, out, 4096, true);
>>>>>>>>>>         out.close();
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>>     public static Object getNode(String xpathStr, Node node,
>>>>>>>>>> QName retunType)
>>>>>>>>>>             throws XPathExpressionException {
>>>>>>>>>>         XPath xpath = xpathFactory.newXPath();
>>>>>>>>>>         return xpath.evaluate(xpathStr, node, retunType);
>>>>>>>>>>     }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --------------------------------------
>>>>>>>>>> Main class:
>>>>>>>>>>
>>>>>>>>>>
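>>>>>>>>>> import org.apache.hadoop.fs.Path;
>>>>>>>>>> import org.apache.hadoop.io.Text;
>>>>>>>>>> import org.apache.hadoop.mapreduce.Job;
>>>>>>>>>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>>>>>>>>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>>>>>>>>>>
>>>>>>>>>> public class Main {
>>>>>>>>>>
>>>>>>>>>>     public static void main(String[] args) throws Exception {
>>>>>>>>>>
>>>>>>>>>>         if (args.length != 2) {
>>>>>>>>>>             System.err.println(
>>>>>>>>>>                     "Usage: XMLtoText <input path> <output path>");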
>>>>>>>>>>             System.exit(-1);
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>>         Job job = new Job();
>>>>>>>>>>         job.setJarByClass(Main.class);
>>>>>>>>>>         job.setJobName("XML to Text");
>>>>>>>>>>         FileInputFormat.addInputPath(job, new Path(args[0]));
>>>>>>>>>>         FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>>>>>>>>>
>>>>>>>>>>         job.setMapperClass(XmlToTextMapper.class);
>>>>>>>>>>         job.setNumReduceTasks(0);
>>>>>>>>>>         job.setMapOutputKeyClass(Text.class);
>>>>>>>>>>         job.setMapOutputValueClass(Text.class);
>>>>>>>>>>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>>>>>>>>>>
>>>>>>>>>>     }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> To execute the job you can use :
>>>>>>>>>>
>>>>>>>>>>          bin/hadoop Main /data.xml /output
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Then you can use this to see result.txt file:
>>>>>>>>>>
>>>>>>>>>>           hadoop fs -cat /result.txt
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'm using this xml as input:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> <main><data><id>1</id><name>NameA</name></data><data><id>2</id><name>NameB</name></data></main>
>>>>>>>>>>
>>>>>>>>>> and the content in result.txt is like this:
>>>>>>>>>>
>>>>>>>>>> id,name
>>>>>>>>>> 1,NameA
>>>>>>>>>> 2,NameB
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hope this helps.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2014/1/3 Ranjini Rathinam <ra...@gmail.com>
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I need to convert XML into text using MapReduce.
>>>>>>>>>>>
>>>>>>>>>>> I have tried both the DOM and SAX parsers.
>>>>>>>>>>>
>>>>>>>>>>> After using SAX Builder in the mapper class, the child node acts
>>>>>>>>>>> as the root element.
>>>>>>>>>>>
>>>>>>>>>>> Looking at System.out, I found that the root element is taking
>>>>>>>>>>> the child elements and printing them.
>>>>>>>>>>>
>>>>>>>>>>> For example, when this XML is passed to the mapper:
>>>>>>>>>>>
>>>>>>>>>>> <Comp><Emp><id>100</id><name>RR</name></Emp></Comp>
>>>>>>>>>>>
>>>>>>>>>>> the root element printed to System.out is
>>>>>>>>>>>
>>>>>>>>>>> <id>
>>>>>>>>>>> <name>
>>>>>>>>>>>
>>>>>>>>>>> Please suggest how to fix this.
>>>>>>>>>>>
>>>>>>>>>>> I need to convert the XML into text using MapReduce code. Please
>>>>>>>>>>> provide an example.
>>>>>>>>>>>
>>>>>>>>>>> The required output is
>>>>>>>>>>>
>>>>>>>>>>> id,name
>>>>>>>>>>> 100,RR
>>>>>>>>>>>
>>>>>>>>>>> Please help.
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>> Ranjini R

Fwd: XML to TEXT

Posted by Ranjini Rathinam <ra...@gmail.com>.
>
> Please help me convert this XML to text.
>>
>>
>> I have attached the XML; please find the attachment.
>>
>> Some students have two address tags, some have one, and some have no
>> address tag at all.
>>
>> I need to convert the XML into a string.
>>
>> This is my desired output:
>>
>> 100,ranjini,HOME,a street,ad street,ads street,chennai,tn,OFFICE,adsja1 street,adsja2 street,adsja3 street,mumbai,Maharastra
>> 101,nivetha,HOME,a street,ad street,ads street,chennai,tn
>> 102,siva
>>
>>
>> In plain Java I have written this using recursion, but how do I write
>> it in MapReduce? Please help.
>>
>> Thanks in advance.
>>  Regards,
>> Ranjini R
>>
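One way to carry the recursive approach over is to keep the recursion in an ordinary helper and call it from map() once per record. The sketch below is only an illustration under assumptions: the attachment is not shown in the thread, so the class name RecordFlattener and the element names in the usage note are hypothetical, and it assumes Java 8 or later (for String.join) and that each map() call receives one complete record as a string.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: flattens one XML record into a comma-separated line.
public class RecordFlattener {

    /** Parses one XML record and returns its leaf values joined by commas. */
    public static String flatten(String xmlRecord) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        xmlRecord.getBytes(StandardCharsets.UTF_8)));
        List<String> values = new ArrayList<String>();
        collectLeaves(doc.getDocumentElement(), values);
        return String.join(",", values);
    }

    // Depth-first walk: recurse into elements that contain child elements,
    // and record the text of elements that contain none.
    private static void collectLeaves(Element element, List<String> values) {
        NodeList children = element.getChildNodes();
        boolean hasElementChild = false;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasElementChild = true;
                collectLeaves((Element) child, values);
            }
        }
        if (!hasElementChild) {
            String text = element.getTextContent().trim();
            if (!text.isEmpty()) {
                values.add(text);
            }
        }
    }
}
```

Inside a mapper this would be called as RecordFlattener.flatten(value.toString()), with the result written through context.write(...). Because the walk simply follows whatever children exist, a record with two address elements, one, or none is handled by the same code.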
>>
>> On Fri, Jan 10, 2014 at 12:47 PM, Ranjini Rathinam <
>> ranjinibecse@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> It's working fine. The problem was in the XML: the whitespace I had added.
>>>
>>> Thanks a lot.
>>>
>>> Regards,
>>> Ranjini.R
>>>
>>>  On Thu, Jan 9, 2014 at 10:47 PM, Diego Gutierrez <
>>> diego.gutierrez@ucsp.edu.pe> wrote:
>>>
>>>>  Hi,
>>>>
>>>> I'm sending you the Eclipse project with the code. Hope this helps.
>>>>
>>>> Regards
>>>> Diego Gutiérrez
>>>>
>>>>
>>>>
>>>> 2014/1/9 Ranjini Rathinam <ra...@gmail.com>
>>>>
>>>>> Hi,
>>>>>
>>>>> I am using Java 1.6 and Hadoop 0.20 on Ubuntu 12.04.
>>>>>
>>>>> If possible, please send the jar and the code for review.
>>>>>
>>>>> Thanks for the support,
>>>>>
>>>>> Ranjini
>>>>>
>>>>>  On Wed, Jan 8, 2014 at 11:00 PM, Diego Gutierrez <
>>>>> diego.gutierrez@ucsp.edu.pe> wrote:
>>>>>
>>>>>>   Hi,
>>>>>>
>>>>>> I've noticed that your XML file contains line breaks. By default,
>>>>>> Hadoop (TextInputFormat) splits each input file into lines and passes
>>>>>> them to the map function one at a time; in other words, each map() call
>>>>>> processes a single line of the file, which on its own is not a
>>>>>> well-formed XML document. Please remove the line breaks from your XML
>>>>>> and try again. I've tested here with your XML file, changing only the
>>>>>> XPath expression:
>>>>>>
>>>>>>     DTMNodeList list = (DTMNodeList)
>>>>>>             getNode("/Company/Employee", doc, XPathConstants.NODESET);
>>>>>>
>>>>>> and this is the output in result.txt:
>>>>>>
>>>>>>
>>>>>> id,name
>>>>>> 100,ranjini,IT1,123456,nextlevel1,Chennai1Navallur1
>>>>>> 1001,ranjinikumar,IT,1234516,nextlevel,ChennaiNavallur
>>>>>>
>>>>>>
>>>>>> Note: I don't know whether the Java version or the Hadoop version could
>>>>>> be the problem here. I'm using Ubuntu 12.04, Oracle Java 7 and Hadoop
>>>>>> 2.2.0.
>>>>>>
>>>>>>
>>>>>> If you want, I can send you the jar file with the code :)
>>>>>>
>>>>>> Regards
>>>>>> Diego Gutiérrez.
>>>>>>
>>>>>>
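The line-break issue described above can be reproduced outside Hadoop. The following is a minimal sketch, not the code from the thread: the class name SingleLineXml and both method names are made up for illustration. It shows that a single line of the multi-line file does not parse as XML, and that removing the whitespace between tags yields a one-line document that does. The parser may print a "[Fatal Error]" message to stderr for the failing case; that is expected.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical helper illustrating why line-by-line input breaks XML parsing.
public class SingleLineXml {

    /** Returns true if the given string parses as a well-formed XML document. */
    public static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            // SAXException for malformed input, IOException for read errors.
            return false;
        }
    }

    /** Removes whitespace (including line breaks) between adjacent tags. */
    public static String toSingleLine(String xml) {
        return xml.replaceAll(">\\s+<", "><").trim();
    }

    public static void main(String[] args) {
        String multiLine = "<Company>\n<Employee>\n<id>100</id>\n</Employee>\n</Company>\n";

        // What a map() call sees with line-based input: one line at a time.
        String firstLine = multiLine.split("\n")[0];
        System.out.println("first line parses:  " + parses(firstLine));

        String oneLine = toSingleLine(multiLine);
        System.out.println("single-line XML:    " + oneLine);
        System.out.println("single line parses: " + parses(oneLine));
    }
}
```

Running toSingleLine over the file before copying it into HDFS is one way to apply the advice above; the regular expression only touches whitespace that sits between a closing '>' and the next '<', so text inside elements is left alone.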
>>>>>>
>>>>>> 2014/1/7 Ranjini Rathinam <ra...@gmail.com>
>>>>>>
>>>>>>> Hi Gutierrez,
>>>>>>>
>>>>>>> As suggested, I tried the code, but result.txt contained only the
>>>>>>> header. Nothing else was printed.
>>>>>>>
>>>>>>> After debugging, I found that parsing produces no value.
>>>>>>>
>>>>>>> The problem is in the line shown in bold below. When I added a
>>>>>>> System.out call, no value was printed for this line.
>>>>>>>
>>>>>>>  String xmlContent = value.toString();
>>>>>>>
>>>>>>>         InputStream is = new
>>>>>>> ByteArrayInputStream(xmlContent.getBytes());
>>>>>>>         DocumentBuilderFactory factory =
>>>>>>> DocumentBuilderFactory.newInstance();
>>>>>>>         DocumentBuilder builder;
>>>>>>>         try {
>>>>>>>             builder = factory.newDocumentBuilder();
>>>>>>>
>>>>>>> * Document doc = builder.parse(is);*
>>>>>>>    String ed=doc.getDocumentElement().getNodeName();
>>>>>>>    out.write(ed.getBytes());
>>>>>>>             DTMNodeList list = (DTMNodeList)
>>>>>>> getNode("/Company/Employee", doc,XPathConstants.NODESET);
>>>>>>>
>>>>>>> When I print:
>>>>>>>
>>>>>>> out.write(xmlContent.getBytes()): the whole XML is printed.
>>>>>>> System.out for list: nothing is printed.
>>>>>>> out.write(ed.getBytes()): nothing is printed.
>>>>>>>
>>>>>>> Please suggest where I am going wrong and help me fix this.
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>>
>>>>>>> I have attached my code. Please review.
>>>>>>>
>>>>>>>
>>>>>>> Mapper class:-
>>>>>>>
>>>>>>> public class XmlTextMapper extends Mapper<LongWritable, Text, Text,
>>>>>>> Text> {
>>>>>>>      private static final XPathFactory xpathFactory =
>>>>>>> XPathFactory.newInstance();
>>>>>>>     @Override
>>>>>>>     public void map(LongWritable key, Text value, Context context)
>>>>>>>             throws IOException, InterruptedException {
>>>>>>>         String resultFileName = "/user/task/Sales/result.txt";
>>>>>>>
>>>>>>>         Configuration conf = new Configuration();
>>>>>>>         FileSystem fs = FileSystem.get(URI.create(resultFileName),
>>>>>>> conf);
>>>>>>>         FSDataOutputStream out = fs.create(new Path(resultFileName));
>>>>>>>         InputStream resultIS = new ByteArrayInputStream(new byte[0]);
>>>>>>>         String header = "id,name\n";
>>>>>>>         out.write(header.getBytes());
>>>>>>>          String xmlContent = value.toString();
>>>>>>>
>>>>>>>         InputStream is = new
>>>>>>> ByteArrayInputStream(xmlContent.getBytes());
>>>>>>>         DocumentBuilderFactory factory =
>>>>>>> DocumentBuilderFactory.newInstance();
>>>>>>>         DocumentBuilder builder;
>>>>>>>         try {
>>>>>>>             builder = factory.newDocumentBuilder();
>>>>>>>             Document doc = builder.parse(is);
>>>>>>>    String ed=doc.getDocumentElement().getNodeName();
>>>>>>>    out.write(ed.getBytes());
>>>>>>>             DTMNodeList list = (DTMNodeList)
>>>>>>> getNode("/Company/Employee", doc,XPathConstants.NODESET);
>>>>>>>              int size = list.getLength();
>>>>>>>             for (int i = 0; i < size; i++) {
>>>>>>>                 Node node = list.item(i);
>>>>>>>                 String line = "";
>>>>>>>                 NodeList nodeList = node.getChildNodes();
>>>>>>>                 int childNumber = nodeList.getLength();
>>>>>>>                 for (int j = 0; j < childNumber; j++)
>>>>>>>     {
>>>>>>>                     line += nodeList.item(j).getTextContent() + ",";
>>>>>>>                 }
>>>>>>>                 if (line.endsWith(","))
>>>>>>>                     line = line.substring(0, line.length() - 1);
>>>>>>>                 line += "\n";
>>>>>>>                 out.write(line.getBytes());
>>>>>>>             }
>>>>>>>         } catch (ParserConfigurationException e) {
>>>>>>>              e.printStackTrace();
>>>>>>>         } catch (SAXException e) {
>>>>>>>              e.printStackTrace();
>>>>>>>         } catch (XPathExpressionException e) {
>>>>>>>              e.printStackTrace();
>>>>>>>         }
>>>>>>>          IOUtils.copyBytes(resultIS, out, 4096, true);
>>>>>>>         out.close();
>>>>>>>     }
>>>>>>>     public static Object getNode(String xpathStr, Node node, QName
>>>>>>> retunType)
>>>>>>>             throws XPathExpressionException {
>>>>>>>         XPath xpath = xpathFactory.newXPath();
>>>>>>>         return xpath.evaluate(xpathStr, node, retunType);
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Main class
>>>>>>> public class MainXml {
>>>>>>>      public static void main(String[] args) throws Exception {
>>>>>>> Configuration conf = new Configuration();
>>>>>>>         if (args.length != 2) {
>>>>>>>             System.err.println(
>>>>>>>                     "Usage: XMLtoText <input path> <output path>");
>>>>>>>             System.exit(-1);
>>>>>>>         }
>>>>>>>   String output="/user/task/Sales/";
>>>>>>>        Job job = new Job(conf, "XML to Text");
>>>>>>>         job.setJarByClass(MainXml.class);
>>>>>>>        // job.setJobName("XML to Text");
>>>>>>>
>>>>>>>         FileInputFormat.addInputPath(job, new Path(args[0]));
>>>>>>>        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>>>>>>   Path outPath = new Path(output);
>>>>>>>   FileOutputFormat.setOutputPath(job, outPath);
>>>>>>>   FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
>>>>>>>   if (dfs.exists(outPath)) {
>>>>>>>   dfs.delete(outPath, true);
>>>>>>>   }
>>>>>>>         job.setMapperClass(XmlTextMapper.class);
>>>>>>>
>>>>>>>         job.setNumReduceTasks(0);
>>>>>>>         job.setMapOutputKeyClass(Text.class);
>>>>>>>         job.setMapOutputValueClass(Text.class);
>>>>>>>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> My xml file
>>>>>>>
>>>>>>> <Company>
>>>>>>> <Employee>
>>>>>>> <id>100</id>
>>>>>>> <ename>ranjini</ename>
>>>>>>> <dept>IT1</dept>
>>>>>>> <sal>123456</sal>
>>>>>>> <location>nextlevel1</location>
>>>>>>> <Address>
>>>>>>> <Home>Chennai1</Home>
>>>>>>> <Office>Navallur1</Office>
>>>>>>> </Address>
>>>>>>> </Employee>
>>>>>>> <Employee>
>>>>>>> <id>1001</id>
>>>>>>> <ename>ranjinikumar</ename>
>>>>>>> <dept>IT</dept>
>>>>>>> <sal>1234516</sal>
>>>>>>> <location>nextlevel</location>
>>>>>>> <Address>
>>>>>>> <Home>Chennai</Home>
>>>>>>> <Office>Navallur</Office>
>>>>>>> </Address>
>>>>>>> </Employee>
>>>>>>> </Company>
>>>>>>>
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>>
>>>>>>> Ranjini
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>  On Mon, Jan 6, 2014 at 2:44 PM, Ranjini Rathinam <
>>>>>>>> ranjinibecse@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Thanks a lot .
>>>>>>>>>
>>>>>>>>> Ranjini
>>>>>>>>>
>>>>>>>>> On Fri, Jan 3, 2014 at 10:40 PM, Diego Gutierrez <
>>>>>>>>> diego.gutierrez@ucsp.edu.pe> wrote:
>>>>>>>>>
>>>>>>>>>>  Hi,
>>>>>>>>>>
>>>>>>>>>> I suggest using XPath, which Java supports natively (javax.xml.xpath)
>>>>>>>>>> for querying XML documents.
>>>>>>>>>>
>>>>>>>>>> For the main problem, as with the distcp command (
>>>>>>>>>> http://hadoop.apache.org/docs/r0.19.0/distcp.pdf ), there is no
>>>>>>>>>> need for a reduce function, because you can parse the XML input file
>>>>>>>>>> and create the file you need in the map function. For example, the
>>>>>>>>>> following code reads an XML file from HDFS, parses it, and creates a
>>>>>>>>>> new file ("/result.txt") in the expected format:
>>>>>>>>>> id,name
>>>>>>>>>> 100,RR
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Mapper function:
>>>>>>>>>>
>>>>>>>>>> import java.io.ByteArrayInputStream;
>>>>>>>>>> import java.io.IOException;
>>>>>>>>>> import java.io.InputStream;
>>>>>>>>>> import java.net.URI;
>>>>>>>>>>
>>>>>>>>>> import javax.xml.namespace.QName;
>>>>>>>>>> import javax.xml.parsers.DocumentBuilder;
>>>>>>>>>> import javax.xml.parsers.DocumentBuilderFactory;
>>>>>>>>>> import javax.xml.parsers.ParserConfigurationException;
>>>>>>>>>> import javax.xml.xpath.XPath;
>>>>>>>>>> import javax.xml.xpath.XPathConstants;
>>>>>>>>>> import javax.xml.xpath.XPathExpressionException;
>>>>>>>>>> import javax.xml.xpath.XPathFactory;
>>>>>>>>>>
>>>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>>>> import org.apache.hadoop.fs.FSDataOutputStream;
>>>>>>>>>> import org.apache.hadoop.fs.FileSystem;
>>>>>>>>>> import org.apache.hadoop.fs.Path;
>>>>>>>>>> import org.apache.hadoop.io.IOUtils;
>>>>>>>>>> import org.apache.hadoop.io.LongWritable;
>>>>>>>>>> import org.apache.hadoop.io.Text;
>>>>>>>>>> import org.apache.hadoop.mapreduce.Mapper;
>>>>>>>>>> import org.w3c.dom.Document;
>>>>>>>>>> import org.w3c.dom.Node;
>>>>>>>>>> import org.w3c.dom.NodeList;
>>>>>>>>>> import org.xml.sax.SAXException;
>>>>>>>>>>
>>>>>>>>>> public class XmlToTextMapper extends Mapper<LongWritable, Text,
>>>>>>>>>> Text, Text> {
>>>>>>>>>>
>>>>>>>>>>     private static final XPathFactory xpathFactory =
>>>>>>>>>> XPathFactory.newInstance();
>>>>>>>>>>
>>>>>>>>>>     @Override
>>>>>>>>>>     public void map(LongWritable key, Text value, Context context)
>>>>>>>>>>             throws IOException, InterruptedException {
>>>>>>>>>>
>>>>>>>>>>         String resultFileName = "/result.txt";
>>>>>>>>>>
>>>>>>>>>>         // Note: fs.create() overwrites an existing file, so result.txt
>>>>>>>>>>         // is recreated on every map() call. This relies on the whole
>>>>>>>>>>         // XML document arriving as a single input line.
>>>>>>>>>>         Configuration conf = new Configuration();
>>>>>>>>>>         FileSystem fs =
>>>>>>>>>> FileSystem.get(URI.create(resultFileName), conf);
>>>>>>>>>>         FSDataOutputStream out = fs.create(new
>>>>>>>>>> Path(resultFileName));
>>>>>>>>>>
>>>>>>>>>>         InputStream resultIS = new ByteArrayInputStream(new
>>>>>>>>>> byte[0]);
>>>>>>>>>>
>>>>>>>>>>         String header = "id,name\n";
>>>>>>>>>>         out.write(header.getBytes());
>>>>>>>>>>
>>>>>>>>>>         String xmlContent = value.toString();
>>>>>>>>>>         InputStream is = new
>>>>>>>>>> ByteArrayInputStream(xmlContent.getBytes());
>>>>>>>>>>         DocumentBuilderFactory factory =
>>>>>>>>>> DocumentBuilderFactory.newInstance();
>>>>>>>>>>         DocumentBuilder builder;
>>>>>>>>>>         try {
>>>>>>>>>>             builder = factory.newDocumentBuilder();
>>>>>>>>>>             Document doc = builder.parse(is);
>>>>>>>>>>             // Cast to the standard org.w3c.dom.NodeList rather than
>>>>>>>>>>             // the JDK-internal DTMNodeList, which is not a public API.
>>>>>>>>>>             NodeList list = (NodeList)
>>>>>>>>>> getNode("/main/data", doc,
>>>>>>>>>>                     XPathConstants.NODESET);
>>>>>>>>>>
>>>>>>>>>>             int size = list.getLength();
>>>>>>>>>>             for (int i = 0; i < size; i++) {
>>>>>>>>>>                 Node node = list.item(i);
>>>>>>>>>>                 String line = "";
>>>>>>>>>>                 NodeList nodeList = node.getChildNodes();
>>>>>>>>>>                 int childNumber = nodeList.getLength();
>>>>>>>>>>                 for (int j = 0; j < childNumber; j++) {
>>>>>>>>>>                     line += nodeList.item(j).getTextContent() +
>>>>>>>>>> ",";
>>>>>>>>>>                 }
>>>>>>>>>>                 if (line.endsWith(","))
>>>>>>>>>>                     line = line.substring(0, line.length() - 1);
>>>>>>>>>>                 line += "\n";
>>>>>>>>>>                 out.write(line.getBytes());
>>>>>>>>>>
>>>>>>>>>>             }
>>>>>>>>>>
>>>>>>>>>>         } catch (ParserConfigurationException e) {
>>>>>>>>>>             // MyLogguer is not defined in the posted code; log to stderr.
>>>>>>>>>>             System.err.println("error: " + e.getMessage());
>>>>>>>>>>             e.printStackTrace();
>>>>>>>>>>         } catch (SAXException e) {
>>>>>>>>>>             System.err.println("error: " + e.getMessage());
>>>>>>>>>>             e.printStackTrace();
>>>>>>>>>>         } catch (XPathExpressionException e) {
>>>>>>>>>>             System.err.println("error: " + e.getMessage());
>>>>>>>>>>             e.printStackTrace();
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>>         IOUtils.copyBytes(resultIS, out, 4096, true);
>>>>>>>>>>         out.close();
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>>     public static Object getNode(String xpathStr, Node node,
>>>>>>>>>> QName retunType)
>>>>>>>>>>             throws XPathExpressionException {
>>>>>>>>>>         XPath xpath = xpathFactory.newXPath();
>>>>>>>>>>         return xpath.evaluate(xpathStr, node, retunType);
>>>>>>>>>>     }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --------------------------------------
>>>>>>>>>> Main class:
>>>>>>>>>>
>>>>>>>>>>
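>>>>>>>>>> import org.apache.hadoop.fs.Path;
>>>>>>>>>> import org.apache.hadoop.io.Text;
>>>>>>>>>> import org.apache.hadoop.mapreduce.Job;
>>>>>>>>>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>>>>>>>>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>>>>>>>>>>
>>>>>>>>>> public class Main {
>>>>>>>>>>
>>>>>>>>>>     public static void main(String[] args) throws Exception {
>>>>>>>>>>
>>>>>>>>>>         if (args.length != 2) {
>>>>>>>>>>             System.err.println(
>>>>>>>>>>                     "Usage: XMLtoText <input path> <output path>");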
>>>>>>>>>>             System.exit(-1);
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>>         Job job = new Job();
>>>>>>>>>>         job.setJarByClass(Main.class);
>>>>>>>>>>         job.setJobName("XML to Text");
>>>>>>>>>>         FileInputFormat.addInputPath(job, new Path(args[0]));
>>>>>>>>>>         FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>>>>>>>>>
>>>>>>>>>>         job.setMapperClass(XmlToTextMapper.class);
>>>>>>>>>>         job.setNumReduceTasks(0);
>>>>>>>>>>         job.setMapOutputKeyClass(Text.class);
>>>>>>>>>>         job.setMapOutputValueClass(Text.class);
>>>>>>>>>>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>>>>>>>>>>
>>>>>>>>>>     }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> To execute the job you can use :
>>>>>>>>>>
>>>>>>>>>>          bin/hadoop Main /data.xml /output
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Then you can use this to see result.txt file:
>>>>>>>>>>
>>>>>>>>>>           hadoop fs -cat /result.txt
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'm using this xml as input:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> <main><data><id>1</id><name>NameA</name></data><data><id>2</id><name>NameB</name></data></main>
>>>>>>>>>>
>>>>>>>>>> and the content in result.txt is like this:
>>>>>>>>>>
>>>>>>>>>> id,name
>>>>>>>>>> 1,NameA
>>>>>>>>>> 2,NameB
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hope this helps.
>>>>>>>>>>
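The XPath step in the mapper above can be tried on its own, without a cluster. The sketch below is an illustration under assumptions rather than the code from the thread: the class name XPathCsvDemo and the method toCsv are hypothetical, and the sample document in the usage note assumes the /main/data layout that the XPath expression in the mapper selects. It casts the XPath result to the standard org.w3c.dom.NodeList, so it does not depend on JDK-internal classes such as DTMNodeList, and it skips non-element children so that whitespace between tags does not produce empty columns.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical standalone version of the XPath-to-CSV step.
public class XPathCsvDemo {

    /** Returns one comma-separated line per node matched by recordPath. */
    public static String toCsv(String xml, String recordPath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        xml.getBytes(StandardCharsets.UTF_8)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList records =
                (NodeList) xpath.evaluate(recordPath, doc, XPathConstants.NODESET);

        StringBuilder out = new StringBuilder();
        for (int i = 0; i < records.getLength(); i++) {
            NodeList fields = records.item(i).getChildNodes();
            StringBuilder line = new StringBuilder();
            for (int j = 0; j < fields.getLength(); j++) {
                Node field = fields.item(j);
                // Ignore whitespace text nodes between elements.
                if (field.getNodeType() != Node.ELEMENT_NODE) {
                    continue;
                }
                if (line.length() > 0) {
                    line.append(',');
                }
                line.append(field.getTextContent());
            }
            out.append(line).append('\n');
        }
        return out.toString();
    }
}
```

For example, toCsv("<main><data><id>1</id><name>NameA</name></data><data><id>2</id><name>NameB</name></data></main>", "/main/data") returns the two lines 1,NameA and 2,NameB. Note that getTextContent() concatenates the text of nested elements, which is why nested Home/Office values run together in the output reported elsewhere in this thread.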
>>>>>>>>>>
>>>>>>>>>> 2014/1/3 Ranjini Rathinam <ra...@gmail.com>
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I need to convert XML into text using MapReduce.
>>>>>>>>>>>
>>>>>>>>>>> I have tried both the DOM and SAX parsers.
>>>>>>>>>>>
>>>>>>>>>>> After using SAX Builder in the mapper class, the child node acts
>>>>>>>>>>> as the root element.
>>>>>>>>>>>
>>>>>>>>>>> Looking at System.out, I found that the root element is taking
>>>>>>>>>>> the child elements and printing them.
>>>>>>>>>>>
>>>>>>>>>>> For example, when this XML is passed to the mapper:
>>>>>>>>>>>
>>>>>>>>>>> <Comp><Emp><id>100</id><name>RR</name></Emp></Comp>
>>>>>>>>>>>
>>>>>>>>>>> the root element printed to System.out is
>>>>>>>>>>>
>>>>>>>>>>> <id>
>>>>>>>>>>> <name>
>>>>>>>>>>>
>>>>>>>>>>> Please suggest how to fix this.
>>>>>>>>>>>
>>>>>>>>>>> I need to convert the XML into text using MapReduce code. Please
>>>>>>>>>>> provide an example.
>>>>>>>>>>>
>>>>>>>>>>> The required output is
>>>>>>>>>>>
>>>>>>>>>>> id,name
>>>>>>>>>>> 100,RR
>>>>>>>>>>>
>>>>>>>>>>> Please help.
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>> Ranjini R

Fwd: XML to TEXT

Posted by Ranjini Rathinam <ra...@gmail.com>.
>
> Please help to convert this xml to text.
>>
>>
>>  I have the attached the xml. Please find the attachement.
>>
>> Some student has two address tag and some student has one address tag and
>> some student dont have address tag tag.
>>
>> I need to convert the xml into string.
>>
>> this is my desired output.
>>
>> 100,ranjini,HOME,a street,ad street,ads street,chennai,tn,OFFICE,adsja1
>> street,adsja2 street,adsja3 street,mumbai,Maharastra
>> 101,nivetha,HOME,a street,ad street,ads street,chennai,tn
>> 102,siva
>>
>>
>> In normal java i have written using recursion but how to write in
>> mapreduce.
>>
>> How to write the code in Mapreduce .? Pl help .
>>
>> Thanks in advance.
>>  Regards,
>> Ranjini R
>>
>>
>> On Fri, Jan 10, 2014 at 12:47 PM, Ranjini Rathinam <
>> ranjinibecse@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Its working fine. problem was in xml . THe space i have given.
>>>
>>> Thanks a lot.
>>>
>>> Regards,
>>> Ranjini.R
>>>
>>>  On Thu, Jan 9, 2014 at 10:47 PM, Diego Gutierrez <
>>> diego.gutierrez@ucsp.edu.pe> wrote:
>>>
>>>>  Hi,
>>>>
>>>> I'm sending you the eclipse project with the code. Hope this helps.
>>>>
>>>> Regards
>>>> Diego Guti�rrez
>>>>
>>>>
>>>>
>>>> 2014/1/9 Ranjini Rathinam <ra...@gmail.com>
>>>>
>>>>> Hi,
>>>>>
>>>>> I am using here java 1.6 and hadoop 0.20 version ,  ubuntu 12.04.
>>>>>
>>>>> If possible please send the jar and code for review.
>>>>>
>>>>> Thanks for the support,
>>>>>
>>>>> Ranjini
>>>>>
>>>>>  On Wed, Jan 8, 2014 at 11:00 PM, Diego Gutierrez <
>>>>> diego.gutierrez@ucsp.edu.pe> wrote:
>>>>>
>>>>>>   Hi,
>>>>>>
>>>>>> I've notice that your xml file has break lines. Hadoop by default
>>>>>> splits every file into lines and pass them to the map function, in other
>>>>>> words, each map function process one line of the file. Please remove the
>>>>>> break lines from your xml and try again. I've tested here with your xml
>>>>>> file(just changing DTMNodeList list = (DTMNodeList)
>>>>>> getNode("/Company/Employee", doc,
>>>>>>                     XPathConstants.NODESET) ) and this is the output
>>>>>> in result.txt
>>>>>>
>>>>>>
>>>>>> id,name
>>>>>> 100,ranjini,IT1,123456,nextlevel1,Chennai1Navallur1
>>>>>> 1001,ranjinikumar,IT,1234516,nextlevel,ChennaiNavallur
>>>>>>
>>>>>>
>>>>>> Note: I dont know if the java version or hadoop version can be the
>>>>>> problem here. I'm using ubuntu 12.04, java oracle 7 and hadoop 2.2.0.
>>>>>>
>>>>>>
>>>>>> If you want, I can send you the jar file with the code :)
>>>>>>
>>>>>> Regards
>>>>>> Diego Guti�rrez.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2014/1/7 Ranjini Rathinam <ra...@gmail.com>
>>>>>>
>>>>>>> Hi Gutierrez ,
>>>>>>>
>>>>>>> As suggest i tried with the code , but in the result.txt i got
>>>>>>> output only header. Nothing else was printing.
>>>>>>>
>>>>>>> After debugging i came to know that while parsing , there is no
>>>>>>> value.
>>>>>>>
>>>>>>> The problem is in line given below which is bold. While putting
>>>>>>> SysOut i found no value printing in this line.
>>>>>>>
>>>>>>>  String xmlContent = value.toString();
>>>>>>>
>>>>>>>         InputStream is = new
>>>>>>> ByteArrayInputStream(xmlContent.getBytes());
>>>>>>>         DocumentBuilderFactory factory =
>>>>>>> DocumentBuilderFactory.newInstance();
>>>>>>>         DocumentBuilder builder;
>>>>>>>         try {
>>>>>>>             builder = factory.newDocumentBuilder();
>>>>>>>
>>>>>>> * Document doc = builder.parse(is);*
>>>>>>>    String ed=doc.getDocumentElement().getNodeName();
>>>>>>>    out.write(ed.getBytes());
>>>>>>>             DTMNodeList list = (DTMNodeList)
>>>>>>> getNode("/Company/Employee", doc,XPathConstants.NODESET);
>>>>>>>
>>>>>>> When iam printing
>>>>>>>
>>>>>>> out.write(xmlContent.getBytes):- the whole xml is being printed.
>>>>>>>
>>>>>>> then i wrote for Sysout for list ,nothing printed.
>>>>>>>  out.write(ed.getBytes):- nothing is being printed.
>>>>>>>
>>>>>>> Please suggest where i am going wrong. Please help to fix this.
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>>
>>>>>>> I have attached my code.Please review.
>>>>>>>
>>>>>>>
>>>>>>> Mapper class:-
>>>>>>>
>>>>>>> public class XmlTextMapper extends Mapper<LongWritable, Text, Text,
>>>>>>> Text> {
>>>>>>>      private static final XPathFactory xpathFactory =
>>>>>>> XPathFactory.newInstance();
>>>>>>>     @Override
>>>>>>>     public void map(LongWritable key, Text value, Context context)
>>>>>>>             throws IOException, InterruptedException {
>>>>>>>         String resultFileName = "/user/task/Sales/result.txt";
>>>>>>>
>>>>>>>         Configuration conf = new Configuration();
>>>>>>>         FileSystem fs = FileSystem.get(URI.create(resultFileName),
>>>>>>> conf);
>>>>>>>         FSDataOutputStream out = fs.create(new Path(resultFileName));
>>>>>>>         InputStream resultIS = new ByteArrayInputStream(new byte[0]);
>>>>>>>         String header = "id,name\n";
>>>>>>>         out.write(header.getBytes());
>>>>>>>          String xmlContent = value.toString();
>>>>>>>
>>>>>>>         InputStream is = new
>>>>>>> ByteArrayInputStream(xmlContent.getBytes());
>>>>>>>         DocumentBuilderFactory factory =
>>>>>>> DocumentBuilderFactory.newInstance();
>>>>>>>         DocumentBuilder builder;
>>>>>>>         try {
>>>>>>>             builder = factory.newDocumentBuilder();
>>>>>>>             Document doc = builder.parse(is);
>>>>>>>    String ed=doc.getDocumentElement().getNodeName();
>>>>>>>    out.write(ed.getBytes());
>>>>>>>             DTMNodeList list = (DTMNodeList)
>>>>>>> getNode("/Company/Employee", doc,XPathConstants.NODESET);
>>>>>>>              int size = list.getLength();
>>>>>>>             for (int i = 0; i < size; i++) {
>>>>>>>                 Node node = list.item(i);
>>>>>>>                 String line = "";
>>>>>>>                 NodeList nodeList = node.getChildNodes();
>>>>>>>                 int childNumber = nodeList.getLength();
>>>>>>>                 for (int j = 0; j < childNumber; j++)
>>>>>>>     {
>>>>>>>                     line += nodeList.item(j).getTextContent() + ",";
>>>>>>>                 }
>>>>>>>                 if (line.endsWith(","))
>>>>>>>                     line = line.substring(0, line.length() - 1);
>>>>>>>                 line += "\n";
>>>>>>>                 out.write(line.getBytes());
>>>>>>>             }
>>>>>>>         } catch (ParserConfigurationException e) {
>>>>>>>              e.printStackTrace();
>>>>>>>         } catch (SAXException e) {
>>>>>>>              e.printStackTrace();
>>>>>>>         } catch (XPathExpressionException e) {
>>>>>>>              e.printStackTrace();
>>>>>>>         }
>>>>>>>          IOUtils.copyBytes(resultIS, out, 4096, true);
>>>>>>>         out.close();
>>>>>>>     }
>>>>>>>     public static Object getNode(String xpathStr, Node node, QName
>>>>>>> retunType)
>>>>>>>             throws XPathExpressionException {
>>>>>>>         XPath xpath = xpathFactory.newXPath();
>>>>>>>         return xpath.evaluate(xpathStr, node, retunType);
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Main class
>>>>>>> public class MainXml {
>>>>>>>      public static void main(String[] args) throws Exception {
>>>>>>> Configuration conf = new Configuration();
>>>>>>>         if (args.length != 2) {
>>>>>>>             System.err
>>>>>>>                     .println("Usage: XMLtoText <input path> <output
>>>>>>> path>");
>>>>>>>             System.exit(-1);
>>>>>>>         }
>>>>>>>   String output="/user/task/Sales/";
>>>>>>>        Job job = new Job(conf, "XML to Text");
>>>>>>>         job.setJarByClass(MainXml.class);
>>>>>>>        // job.setJobName("XML to Text");
>>>>>>>
>>>>>>>         FileInputFormat.addInputPath(job, new Path(args[0]));
>>>>>>>        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>>>>>>   Path outPath = new Path(output);
>>>>>>>   FileOutputFormat.setOutputPath(job, outPath);
>>>>>>>   FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
>>>>>>>   if (dfs.exists(outPath)) {
>>>>>>>   dfs.delete(outPath, true);
>>>>>>>   }
>>>>>>>         job.setMapperClass(XmlTextMapper.class);
>>>>>>>
>>>>>>>         job.setNumReduceTasks(0);
>>>>>>>         job.setMapOutputKeyClass(Text.class);
>>>>>>>         job.setMapOutputValueClass(Text.class);
>>>>>>>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> My xml file
>>>>>>>
>>>>>>> <Company>
>>>>>>> <Employee>
>>>>>>> <id>100</id>
>>>>>>> <ename>ranjini</ename>
>>>>>>> <dept>IT1</dept>
>>>>>>> <sal>123456</sal>
>>>>>>> <location>nextlevel1</location>
>>>>>>> <Address>
>>>>>>> <Home>Chennai1</Home>
>>>>>>> <Office>Navallur1</Office>
>>>>>>> </Address>
>>>>>>> </Employee>
>>>>>>> <Employee>
>>>>>>> <id>1001</id>
>>>>>>> <ename>ranjinikumar</ename>
>>>>>>> <dept>IT</dept>
>>>>>>> <sal>1234516</sal>
>>>>>>> <location>nextlevel</location>
>>>>>>> <Address>
>>>>>>> <Home>Chennai</Home>
>>>>>>> <Office>Navallur</Office>
>>>>>>> </Address>
>>>>>>> </Employee>
>>>>>>> </Company>
>>>>>>>
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>>
>>>>>>> Ranjini
>>>>>>>
>>>>>>>
>>>>>>>

Fwd: XML to TEXT

Posted by Ranjini Rathinam <ra...@gmail.com>.
>
> Please help me convert this XML to text.
>>
>>
>> I have attached the XML; please find the attachment.
>>
>> Some students have two address tags, some have one, and some have no
>> address tag at all.
>>
>> I need to convert the XML into a string.
>>
>> This is my desired output:
>>
>> 100,ranjini,HOME,a street,ad street,ads street,chennai,tn,OFFICE,adsja1 street,adsja2 street,adsja3 street,mumbai,Maharastra
>> 101,nivetha,HOME,a street,ad street,ads street,chennai,tn
>> 102,siva
>>
>>
>> In plain Java I have written this using recursion, but how do I write it
>> in MapReduce?
>>
>> How should the code be written in MapReduce? Please help.
>>
>> Thanks in advance.
>>  Regards,
>> Ranjini R
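
A minimal sketch of how the recursion mentioned above could carry over to MapReduce, assuming each map() call receives one complete student record as a string (the default line-based input only guarantees that when a record sits on a single line). The attachment is not shown in the thread, so the element names (student, address, type, city) and the StudentFlatten class are hypothetical. The helper uses only the JDK's DOM API and emits the text of every leaf element in document order.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class; not part of the code posted in this thread.
class StudentFlatten {

    // Depth-first walk: recurse into element children and record the
    // trimmed text of every element that has no element children (a leaf).
    static void collect(Node node, List<String> values) {
        NodeList children = node.getChildNodes();
        boolean hasElementChild = false;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasElementChild = true;
                collect(child, values);
            }
        }
        if (!hasElementChild) {
            String text = node.getTextContent().trim();
            if (!text.isEmpty()) {
                values.add(text);
            }
        }
    }

    // Parses one record and joins its leaf values with commas.
    static String flattenRecord(String recordXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(recordXml.getBytes(StandardCharsets.UTF_8)));
            List<String> values = new ArrayList<>();
            collect(doc.getDocumentElement(), values);
            return String.join(",", values);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse record", e);
        }
    }

    public static void main(String[] args) {
        // Assumed record layout; the real attachment may differ.
        System.out.println(flattenRecord("<student><id>101</id><name>nivetha</name>"
                + "<address><type>HOME</type><city>chennai</city></address></student>"));
        System.out.println(flattenRecord("<student><id>102</id><name>siva</name></student>"));
    }
}
```

Inside a mapper, the string returned by flattenRecord(value.toString()) could be passed to context.write(...). Records with zero, one, or two address elements need no special casing, because the recursion simply visits whatever children exist.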
>>
>>
>> On Fri, Jan 10, 2014 at 12:47 PM, Ranjini Rathinam <
>> ranjinibecse@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> It's working fine. The problem was in the XML: the whitespace I had put in it.
>>>
>>> Thanks a lot.
>>>
>>> Regards,
>>> Ranjini.R
>>>
>>>  On Thu, Jan 9, 2014 at 10:47 PM, Diego Gutierrez <
>>> diego.gutierrez@ucsp.edu.pe> wrote:
>>>
>>>>  Hi,
>>>>
>>>> I'm sending you the eclipse project with the code. Hope this helps.
>>>>
>>>> Regards
>>>> Diego Gutiérrez
>>>>
>>>>
>>>>
>>>> 2014/1/9 Ranjini Rathinam <ra...@gmail.com>
>>>>
>>>>> Hi,
>>>>>
>>>>> I am using Java 1.6, Hadoop 0.20, and Ubuntu 12.04.
>>>>>
>>>>> If possible please send the jar and code for review.
>>>>>
>>>>> Thanks for the support,
>>>>>
>>>>> Ranjini
>>>>>
>>>>>  On Wed, Jan 8, 2014 at 11:00 PM, Diego Gutierrez <
>>>>> diego.gutierrez@ucsp.edu.pe> wrote:
>>>>>
>>>>>>   Hi,
>>>>>>
>>>>>> I've noticed that your XML file contains line breaks. With the default
>>>>>> input format, Hadoop splits every file into lines and passes them to the
>>>>>> map function one at a time; in other words, each map call processes one
>>>>>> line of the file. Please remove the line breaks from your XML and try
>>>>>> again. I tested here with your XML file, changing only this statement:
>>>>>>
>>>>>>     DTMNodeList list = (DTMNodeList) getNode("/Company/Employee", doc,
>>>>>>             XPathConstants.NODESET);
>>>>>>
>>>>>> and this is the output in result.txt:
>>>>>>
>>>>>>
>>>>>> id,name
>>>>>> 100,ranjini,IT1,123456,nextlevel1,Chennai1Navallur1
>>>>>> 1001,ranjinikumar,IT,1234516,nextlevel,ChennaiNavallur
>>>>>>
>>>>>>
>>>>>> Note: I don't know whether the Java or Hadoop version could be the
>>>>>> problem here. I'm using Ubuntu 12.04, Oracle Java 7, and Hadoop 2.2.0.
>>>>>>
>>>>>>
>>>>>> If you want, I can send you the jar file with the code :)
>>>>>>
>>>>>> Regards
>>>>>> Diego Gutiérrez.
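
The effect described above can be reproduced outside Hadoop. The sketch below feeds an abbreviated copy of the posted XML to the JDK parser one line at a time, the way a line-based record reader would hand it to map(), and then once as a single line. The LineSplitDemo class and the describe helper are illustrative names, not part of the thread's code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; it only uses the JDK's XML APIs.
class LineSplitDemo {

    // Reports what the parser and the /Company/Employee query see for one input.
    static String describe(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(
                    new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("/Company/Employee", doc, XPathConstants.NODESET);
            return "root=" + doc.getDocumentElement().getNodeName()
                    + ", employees=" + hits.getLength();
        } catch (SAXException e) {
            return "not well-formed"; // the fragment is not a complete document
        } catch (ParserConfigurationException | IOException | XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Company>\n<Employee>\n<id>100</id>\n<ename>ranjini</ename>\n"
                + "</Employee>\n</Company>";
        // One call per line, as with line-based input splitting.
        for (String line : xml.split("\n")) {
            System.out.println(line + " -> " + describe(line));
        }
        // One call for the whole document with the line breaks removed.
        System.out.println("single line -> " + describe(xml.replace("\n", "")));
    }
}
```

Lines such as <Company> are not well-formed on their own, while <id>100</id> parses with id as its root and no /Company/Employee match. Both cases are consistent with a result file that contains only the header when the SAXException is merely printed.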
>>>>>>
>>>>>>
>>>>>>

Re: XML to TEXT

Posted by Ranjini Rathinam <ra...@gmail.com>.
Hi Gutierrez,

As suggested, I tried the code, but result.txt contains only the header.
Nothing else is printed.

After debugging, I found that parsing produces no value.

The problem is in the line marked in bold below. When I added a
System.out.println there, no value was printed.

        String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();

* Document doc = builder.parse(is);*
            String ed = doc.getDocumentElement().getNodeName();
            out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee",
                    doc, XPathConstants.NODESET);

When I print

out.write(xmlContent.getBytes()): the whole XML is printed.

Then I added a System.out.println for list: nothing was printed.
out.write(ed.getBytes()): nothing is printed.

Please suggest where I am going wrong and help me fix this.

Thanks in advance.

I have attached my code. Please review it.


Mapper class:-

public class XmlTextMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final XPathFactory xpathFactory = XPathFactory.newInstance();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String resultFileName = "/user/task/Sales/result.txt";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
        FSDataOutputStream out = fs.create(new Path(resultFileName));
        InputStream resultIS = new ByteArrayInputStream(new byte[0]);
        String header = "id,name\n";
        out.write(header.getBytes());
        String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);
            String ed = doc.getDocumentElement().getNodeName();
            out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee",
                    doc, XPathConstants.NODESET);
            int size = list.getLength();
            for (int i = 0; i < size; i++) {
                Node node = list.item(i);
                String line = "";
                NodeList nodeList = node.getChildNodes();
                int childNumber = nodeList.getLength();
                for (int j = 0; j < childNumber; j++) {
                    line += nodeList.item(j).getTextContent() + ",";
                }
                if (line.endsWith(","))
                    line = line.substring(0, line.length() - 1);
                line += "\n";
                out.write(line.getBytes());
            }
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (XPathExpressionException e) {
            e.printStackTrace();
        }
        IOUtils.copyBytes(resultIS, out, 4096, true);
        out.close();
    }

    public static Object getNode(String xpathStr, Node node, QName retunType)
            throws XPathExpressionException {
        XPath xpath = xpathFactory.newXPath();
        return xpath.evaluate(xpathStr, node, retunType);
    }
}



Main class
public class MainXml {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err.println("Usage: XMLtoText <input path> <output path>");
            System.exit(-1);
        }
        String output = "/user/task/Sales/";
        Job job = new Job(conf, "XML to Text");
        job.setJarByClass(MainXml.class);
        // job.setJobName("XML to Text");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
        Path outPath = new Path(output);
        FileOutputFormat.setOutputPath(job, outPath);
        FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
        if (dfs.exists(outPath)) {
            dfs.delete(outPath, true);
        }
        job.setMapperClass(XmlTextMapper.class);
        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}


My xml file

<Company>
<Employee>
<id>100</id>
<ename>ranjini</ename>
<dept>IT1</dept>
<sal>123456</sal>
<location>nextlevel1</location>
<Address>
<Home>Chennai1</Home>
<Office>Navallur1</Office>
</Address>
</Employee>
<Employee>
<id>1001</id>
<ename>ranjinikumar</ename>
<dept>IT</dept>
<sal>1234516</sal>
<location>nextlevel</location>
<Address>
<Home>Chennai</Home>
<Office>Navallur</Office>
</Address>
</Employee>
</Company>


Thanks in advance.

Ranjini



>  On Mon, Jan 6, 2014 at 2:44 PM, Ranjini Rathinam <ra...@gmail.com>wrote:
>
>> Hi,
>>
>> Thanks a lot .
>>
>> Ranjini
>>
>> On Fri, Jan 3, 2014 at 10:40 PM, Diego Gutierrez <
>> diego.gutierrez@ucsp.edu.pe> wrote:
>>
>>>  Hi,
>>>
>>> I suggest using XPath, which Java supports natively for querying XML.
>>>
>>> For the main problem, as with the distcp command (
>>> http://hadoop.apache.org/docs/r0.19.0/distcp.pdf ), there is no need for
>>> a reduce function, because you can parse the XML input file and create the
>>> file you need in the map function. For example, the following code reads an
>>> XML file in HDFS, parses it, and creates a new file ("/result.txt") with the
>>> expected format:
>>> id,name
>>> 100,RR
>>>
>>>
>>> Mapper function:
>>>
>>> import java.io.ByteArrayInputStream;
>>> import java.io.IOException;
>>> import java.io.InputStream;
>>> import java.net.URI;
>>>
>>> import javax.xml.namespace.QName;
>>> import javax.xml.parsers.DocumentBuilder;
>>> import javax.xml.parsers.DocumentBuilderFactory;
>>> import javax.xml.parsers.ParserConfigurationException;
>>> import javax.xml.xpath.XPath;
>>> import javax.xml.xpath.XPathConstants;
>>> import javax.xml.xpath.XPathExpressionException;
>>> import javax.xml.xpath.XPathFactory;
>>>
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.fs.FSDataOutputStream;
>>> import org.apache.hadoop.fs.FileSystem;
>>> import org.apache.hadoop.fs.Path;
>>> import org.apache.hadoop.io.IOUtils;
>>> import org.apache.hadoop.io.LongWritable;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.mapreduce.Mapper;
>>> import org.w3c.dom.Document;
>>> import org.w3c.dom.Node;
>>> import org.w3c.dom.NodeList;
>>> import org.xml.sax.SAXException;
>>>
>>> import com.sun.org.apache.xml.internal.dtm.ref.DTMNodeList;
>>>
>>> public class XmlToTextMapper extends Mapper<LongWritable, Text, Text,
>>> Text> {
>>>
>>>     private static final XPathFactory xpathFactory =
>>> XPathFactory.newInstance();
>>>
>>>     @Override
>>>     public void map(LongWritable key, Text value, Context context)
>>>             throws IOException, InterruptedException {
>>>
>>>         String resultFileName = "/result.txt";
>>>
>>>
>>>         Configuration conf = new Configuration();
>>>         FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
>>>         FSDataOutputStream out = fs.create(new Path(resultFileName));
>>>
>>>         InputStream resultIS = new ByteArrayInputStream(new byte[0]);
>>>
>>>         String header = "id,name\n";
>>>         out.write(header.getBytes());
>>>
>>>         String xmlContent = value.toString();
>>>         InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
>>>         DocumentBuilderFactory factory =
>>> DocumentBuilderFactory.newInstance();
>>>         DocumentBuilder builder;
>>>         try {
>>>             builder = factory.newDocumentBuilder();
>>>             Document doc = builder.parse(is);
>>>             DTMNodeList list = (DTMNodeList) getNode("/main/data", doc,
>>>                     XPathConstants.NODESET);
>>>
>>>             int size = list.getLength();
>>>             for (int i = 0; i < size; i++) {
>>>                 Node node = list.item(i);
>>>                 String line = "";
>>>                 NodeList nodeList = node.getChildNodes();
>>>                 int childNumber = nodeList.getLength();
>>>                 for (int j = 0; j < childNumber; j++) {
>>>                     line += nodeList.item(j).getTextContent() + ",";
>>>                 }
>>>                 if (line.endsWith(","))
>>>                     line = line.substring(0, line.length() - 1);
>>>                 line += "\n";
>>>                 out.write(line.getBytes());
>>>
>>>             }
>>>
>>>         } catch (ParserConfigurationException e) {
>>>             MyLogguer.log("error: " + e.getMessage());
>>>             e.printStackTrace();
>>>         } catch (SAXException e) {
>>>             MyLogguer.log("error: " + e.getMessage());
>>>             e.printStackTrace();
>>>         } catch (XPathExpressionException e) {
>>>             MyLogguer.log("error: " + e.getMessage());
>>>             e.printStackTrace();
>>>         }
>>>
>>>         IOUtils.copyBytes(resultIS, out, 4096, true);
>>>         out.close();
>>>     }
>>>
>>>     public static Object getNode(String xpathStr, Node node, QName
>>> retunType)
>>>             throws XPathExpressionException {
>>>         XPath xpath = xpathFactory.newXPath();
>>>         return xpath.evaluate(xpathStr, node, retunType);
>>>     }
>>> }
>>>
>>>
>>>
>>> --------------------------------------
>>> Main class:
>>>
>>>
>>> public class Main {
>>>
>>>     public static void main(String[] args) throws Exception {
>>>
>>>         if (args.length != 2) {
>>>             System.err
>>>                     .println("Usage: XMLtoText <input path> <output
>>> path>");
>>>             System.exit(-1);
>>>         }
>>>
>>>         Job job = new Job();
>>>         job.setJarByClass(Main.class);
>>>         job.setJobName("XML to Text");
>>>         FileInputFormat.addInputPath(job, new Path(args[0]));
>>>         FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>>
>>>         job.setMapperClass(XmlToTextMapper.class);
>>>         job.setNumReduceTasks(0);
>>>         job.setMapOutputKeyClass(Text.class);
>>>         job.setMapOutputValueClass(Text.class);
>>>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>>>
>>>     }
>>> }
>>>
>>> To execute the job you can use :
>>>
>>>          bin/hadoop Main /data.xml /output.
>>>
>>>
>>> Then you can use this to see result.txt file:
>>>
>>>           hadoop fs -cat /result.txt
>>>
>>>
>>> I'm using this xml as input:
>>>
>>>
>>> <main><data><id>1</id><name>NameA</name></data><data><id>2</id><name>NameB</name></data></main>
>>>
>>> and the content in result.txt is like this:
>>>
>>> id,name
>>> 1,NameA
>>> 2,NameB
>>>
>>>
>>> Hope this helps.
>>>
>>>
>>> 2014/1/3 Ranjini Rathinam <ra...@gmail.com>
>>>
>>>> Hi,
>>>>
>>>> Need to convert XML into text using mapreduce.
>>>>
>>>> I have used DOM and SAX parser.
>>>>
>>>> After using SAX Builder in mapper class. the child node act as root
>>>> Element.
>>>>
>>>> While seeing in Sys out i found thar root element is taking the child
>>>> element and printing.
>>>>
>>>> For Eg,
>>>>
>>>> <Comp><Emp><id>100</id><name>RR</name></Emp></Comp>
>>>> when this xml is passed in mapper , in sys out printing the root element
>>>>
>>>> I am getting the the root element as
>>>>
>>>> <id>
>>>> <name>
>>>>
>>>> Please suggest and help to fix this.
>>>>
>>>> I need to convert the xml into text using mapreduce code. Please
>>>> provide with example.
>>>>
>>>> Required output is
>>>>
>>>> id,name
>>>> 100,RR
>>>>
>>>> Please help.
>>>>
>>>> Thanks in advance,
>>>> Ranjini R
>>>>
>>>
>>>
>>
>

Re: XML to TEXT

Posted by Ranjini Rathinam <ra...@gmail.com>.
Hi Gutierrez,

As suggested, I tried the code, but result.txt contains only the header.
Nothing else is printed.

After debugging, I found that parsing produces no value.

The problem is in the line marked in bold below. When I added a System.out
call there, no value was printed.

        String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();

            *Document doc = builder.parse(is);*
            String ed = doc.getDocumentElement().getNodeName();
            out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee",
                    doc, XPathConstants.NODESET);

When I print with

    out.write(xmlContent.getBytes()): the whole XML is printed.
    out.write(ed.getBytes()): nothing is printed.

I also added a System.out call for list, and nothing was printed.

Please suggest where I am going wrong and help me fix this.

Thanks in advance.

I have attached my code. Please review.


Mapper class:-

public class XmlTextMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final XPathFactory xpathFactory = XPathFactory.newInstance();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String resultFileName = "/user/task/Sales/result.txt";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
        FSDataOutputStream out = fs.create(new Path(resultFileName));
        InputStream resultIS = new ByteArrayInputStream(new byte[0]);
        String header = "id,name\n";
        out.write(header.getBytes());
        String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);
            String ed = doc.getDocumentElement().getNodeName();
            out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee",
                    doc, XPathConstants.NODESET);
            int size = list.getLength();
            for (int i = 0; i < size; i++) {
                Node node = list.item(i);
                String line = "";
                NodeList nodeList = node.getChildNodes();
                int childNumber = nodeList.getLength();
                for (int j = 0; j < childNumber; j++) {
                    line += nodeList.item(j).getTextContent() + ",";
                }
                if (line.endsWith(","))
                    line = line.substring(0, line.length() - 1);
                line += "\n";
                out.write(line.getBytes());
            }
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (XPathExpressionException e) {
            e.printStackTrace();
        }
        IOUtils.copyBytes(resultIS, out, 4096, true);
        out.close();
    }

    public static Object getNode(String xpathStr, Node node, QName retunType)
            throws XPathExpressionException {
        XPath xpath = xpathFactory.newXPath();
        return xpath.evaluate(xpathStr, node, retunType);
    }
}



Main class
public class MainXml {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err
                    .println("Usage: XMLtoText <input path> <output path>");
            System.exit(-1);
        }
        String output = "/user/task/Sales/";
        Job job = new Job(conf, "XML to Text");
        job.setJarByClass(MainXml.class);
        // job.setJobName("XML to Text");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
        Path outPath = new Path(output);
        FileOutputFormat.setOutputPath(job, outPath);
        FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
        if (dfs.exists(outPath)) {
            dfs.delete(outPath, true);
        }
        job.setMapperClass(XmlTextMapper.class);
        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}


My xml file

<Company>
<Employee>
<id>100</id>
<ename>ranjini</ename>
<dept>IT1</dept>
<sal>123456</sal>
<location>nextlevel1</location>
<Address>
<Home>Chennai1</Home>
<Office>Navallur1</Office>
</Address>
</Employee>
<Employee>
<id>1001</id>
<ename>ranjinikumar</ename>
<dept>IT</dept>
<sal>1234516</sal>
<location>nextlevel</location>
<Address>
<Home>Chennai</Home>
<Office>Navallur</Office>
</Address>
</Employee>
</Company>
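
A possible explanation for the header-only result, offered as an assumption rather than a confirmed diagnosis: the driver above does not set an input format, so the job would use the default TextInputFormat, and each map() call would then receive one line of this file instead of the whole document. The sketch below uses only the JDK (no Hadoop) to show what the DOM parser does with individual lines versus the complete text. The class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class LineByLineParseSketch {

    /** Returns the root element name, or null if the text is not well-formed XML. */
    static String rootNameOrNull(String xmlText) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xmlText.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getNodeName();
        } catch (SAXException e) {
            // Not a complete document, e.g. a lone opening or closing tag.
            return null;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("unexpected parser failure", e);
        }
    }

    public static void main(String[] args) {
        String[] lines = {
            "<Company>", "<Employee>", "<id>100</id>", "</Employee>", "</Company>"
        };
        // Each line on its own, as a line-oriented input format would deliver it.
        for (String line : lines) {
            System.out.println(line + " -> " + rootNameOrNull(line));
        }
        // The same lines joined back into one document.
        System.out.println("whole file -> " + rootNameOrNull(String.join("\n", lines)));
    }
}
```

Under that assumption, a line such as <Company> fails with a SAXException (which the mapper only prints via printStackTrace), while a line such as <id>100</id> parses as a tiny document whose root is id. The mapper also calls fs.create on the same path in every map() call, so each call would start result.txt over again.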


Thanks in advance.

Ranjini
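
For reference, the conversion itself can be sketched without Hadoop, assuming the complete document is available as one string. This is not the code from the thread: the class name, method names, and the choice of the id and ename columns are illustrative. It casts the XPath result to the public org.w3c.dom.NodeList interface instead of the JDK-internal DTMNodeList, and it picks child elements by name so that whitespace text nodes between tags do not end up as empty columns.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class EmployeeCsvSketch {

    /** Builds one comma-separated line per node matched by recordPath. */
    static List<String> toCsvLines(String xml, String recordPath, String... fields) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // NODESET results are returned as org.w3c.dom.NodeList.
            NodeList records = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(recordPath, doc, XPathConstants.NODESET);
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < records.getLength(); i++) {
                List<String> values = new ArrayList<>();
                for (String field : fields) {
                    values.add(childText(records.item(i), field));
                }
                lines.add(String.join(",", values));
            }
            return lines;
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            throw new IllegalStateException("could not convert XML", e);
        }
    }

    /** Text of the first direct child element with the given name, or "" if absent. */
    static String childText(Node parent, String name) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Skip whitespace-only text nodes that sit between the element tags.
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && name.equals(child.getNodeName())) {
                return child.getTextContent().trim();
            }
        }
        return "";
    }

    public static void main(String[] args) {
        String xml = "<Company>\n<Employee>\n<id>100</id>\n<ename>ranjini</ename>\n"
                + "<dept>IT1</dept>\n</Employee>\n</Company>";
        System.out.println("id,ename");
        for (String line : toCsvLines(xml, "/Company/Employee", "id", "ename")) {
            System.out.println(line);
        }
    }
}
```

In a map-only job, lines like these would normally be emitted with context.write so that the framework writes the output files, rather than being written to a fixed HDFS path from inside map().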



>  On Mon, Jan 6, 2014 at 2:44 PM, Ranjini Rathinam <ra...@gmail.com>wrote:
>
>> Hi,
>>
>> Thanks a lot .
>>
>> Ranjini
>>
>> On Fri, Jan 3, 2014 at 10:40 PM, Diego Gutierrez <
>> diego.gutierrez@ucsp.edu.pe> wrote:
>>
>>>  Hi,
>>>
>>> I suggest using XPath, which Java supports natively for querying XML.
>>>
>>> For the main problem, as with the distcp command (
>>> http://hadoop.apache.org/docs/r0.19.0/distcp.pdf ), there is no need for
>>> a reduce function, because you can parse the XML input file and create the
>>> file you need in the map function. For example, the following code reads an
>>> XML file in HDFS, parses it, and creates a new file ("/result.txt") with the
>>> expected format:
>>> id,name
>>> 100,RR
>>>
>>>
>>> Mapper function:
>>>
>>> import java.io.ByteArrayInputStream;
>>> import java.io.IOException;
>>> import java.io.InputStream;
>>> import java.net.URI;
>>>
>>> import javax.xml.namespace.QName;
>>> import javax.xml.parsers.DocumentBuilder;
>>> import javax.xml.parsers.DocumentBuilderFactory;
>>> import javax.xml.parsers.ParserConfigurationException;
>>> import javax.xml.xpath.XPath;
>>> import javax.xml.xpath.XPathConstants;
>>> import javax.xml.xpath.XPathExpressionException;
>>> import javax.xml.xpath.XPathFactory;
>>>
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.fs.FSDataOutputStream;
>>> import org.apache.hadoop.fs.FileSystem;
>>> import org.apache.hadoop.fs.Path;
>>> import org.apache.hadoop.io.IOUtils;
>>> import org.apache.hadoop.io.LongWritable;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.mapreduce.Mapper;
>>> import org.w3c.dom.Document;
>>> import org.w3c.dom.Node;
>>> import org.w3c.dom.NodeList;
>>> import org.xml.sax.SAXException;
>>>
>>> import com.sun.org.apache.xml.internal.dtm.ref.DTMNodeList;
>>>
>>> public class XmlToTextMapper extends Mapper<LongWritable, Text, Text,
>>> Text> {
>>>
>>>     private static final XPathFactory xpathFactory =
>>> XPathFactory.newInstance();
>>>
>>>     @Override
>>>     public void map(LongWritable key, Text value, Context context)
>>>             throws IOException, InterruptedException {
>>>
>>>         String resultFileName = "/result.txt";
>>>
>>>
>>>         Configuration conf = new Configuration();
>>>         FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
>>>         FSDataOutputStream out = fs.create(new Path(resultFileName));
>>>
>>>         InputStream resultIS = new ByteArrayInputStream(new byte[0]);
>>>
>>>         String header = "id,name\n";
>>>         out.write(header.getBytes());
>>>
>>>         String xmlContent = value.toString();
>>>         InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
>>>         DocumentBuilderFactory factory =
>>> DocumentBuilderFactory.newInstance();
>>>         DocumentBuilder builder;
>>>         try {
>>>             builder = factory.newDocumentBuilder();
>>>             Document doc = builder.parse(is);
>>>             DTMNodeList list = (DTMNodeList) getNode("/main/data", doc,
>>>                     XPathConstants.NODESET);
>>>
>>>             int size = list.getLength();
>>>             for (int i = 0; i < size; i++) {
>>>                 Node node = list.item(i);
>>>                 String line = "";
>>>                 NodeList nodeList = node.getChildNodes();
>>>                 int childNumber = nodeList.getLength();
>>>                 for (int j = 0; j < childNumber; j++) {
>>>                     line += nodeList.item(j).getTextContent() + ",";
>>>                 }
>>>                 if (line.endsWith(","))
>>>                     line = line.substring(0, line.length() - 1);
>>>                 line += "\n";
>>>                 out.write(line.getBytes());
>>>
>>>             }
>>>
>>>         } catch (ParserConfigurationException e) {
>>>             MyLogguer.log("error: " + e.getMessage());
>>>             e.printStackTrace();
>>>         } catch (SAXException e) {
>>>             MyLogguer.log("error: " + e.getMessage());
>>>             e.printStackTrace();
>>>         } catch (XPathExpressionException e) {
>>>             MyLogguer.log("error: " + e.getMessage());
>>>             e.printStackTrace();
>>>         }
>>>
>>>         IOUtils.copyBytes(resultIS, out, 4096, true);
>>>         out.close();
>>>     }
>>>
>>>     public static Object getNode(String xpathStr, Node node, QName
>>> retunType)
>>>             throws XPathExpressionException {
>>>         XPath xpath = xpathFactory.newXPath();
>>>         return xpath.evaluate(xpathStr, node, retunType);
>>>     }
>>> }
>>>
>>>
>>>
>>> --------------------------------------
>>> Main class:
>>>
>>>
>>> public class Main {
>>>
>>>     public static void main(String[] args) throws Exception {
>>>
>>>         if (args.length != 2) {
>>>             System.err
>>>                     .println("Usage: XMLtoText <input path> <output
>>> path>");
>>>             System.exit(-1);
>>>         }
>>>
>>>         Job job = new Job();
>>>         job.setJarByClass(Main.class);
>>>         job.setJobName("XML to Text");
>>>         FileInputFormat.addInputPath(job, new Path(args[0]));
>>>         FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>>
>>>         job.setMapperClass(XmlToTextMapper.class);
>>>         job.setNumReduceTasks(0);
>>>         job.setMapOutputKeyClass(Text.class);
>>>         job.setMapOutputValueClass(Text.class);
>>>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>>>
>>>     }
>>> }
>>>
>>> To execute the job you can use :
>>>
>>>          bin/hadoop Main /data.xml /output.
>>>
>>>
>>> Then you can use this to see result.txt file:
>>>
>>>           hadoop fs -cat /result.txt
>>>
>>>
>>> I'm using this xml as input:
>>>
>>>
>>> <main><data><id>1</id><name>NameA</name></data><data><id>2</id><name>NameB</name></data></main>
>>>
>>> and the content in result.txt is like this:
>>>
>>> id,name
>>> 1,NameA
>>> 2,NameB
>>>
>>>
>>> Hope this helps.
>>>
>>>
>>> 2014/1/3 Ranjini Rathinam <ra...@gmail.com>
>>>
>>>> Hi,
>>>>
>>>> Need to convert XML into text using mapreduce.
>>>>
>>>> I have used DOM and SAX parser.
>>>>
>>>> After using the SAX builder in the mapper class, the child node acts as
>>>> the root element.
>>>>
>>>> Looking at System.out, I found that the root element is taking the child
>>>> element and printing it.
>>>>
>>>> For example:
>>>>
>>>> <Comp><Emp><id>100</id><name>RR</name></Emp></Comp>
>>>> When this XML is passed to the mapper, System.out prints the root element.
>>>>
>>>> I am getting the root element as
>>>>
>>>> <id>
>>>> <name>
>>>>
>>>> Please suggest and help to fix this.
>>>>
>>>> I need to convert the xml into text using mapreduce code. Please
>>>> provide with example.
>>>>
>>>> Required output is
>>>>
>>>> id,name
>>>> 100,RR
>>>>
>>>> Please help.
>>>>
>>>> Thanks in advance,
>>>> Ranjini R
>>>>
>>>
>>>
>>
>

Re: XML to TEXT

Posted by Ranjini Rathinam <ra...@gmail.com>.
Hi Gutierrez ,

As suggest i tried with the code , but in the result.txt i got output only
header. Nothing else was printing.

After debugging i came to know that while parsing , there is no value.

The problem is in line given below which is bold. While putting SysOut i
found no value printing in this line.

        String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            *Document doc = builder.parse(is);*
            String ed = doc.getDocumentElement().getNodeName();
            out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee", doc,
                    XPathConstants.NODESET);

When I print

out.write(xmlContent.getBytes()): the whole XML is printed.
out.write(ed.getBytes()): nothing is printed.

I also added a System.out for list, and nothing was printed.

Please suggest where I am going wrong and help me fix this.

Thanks in advance.

I have attached my code. Please review.


Mapper class:-

public class XmlTextMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final XPathFactory xpathFactory = XPathFactory.newInstance();
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String resultFileName = "/user/task/Sales/result.txt";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
        FSDataOutputStream out = fs.create(new Path(resultFileName));
        InputStream resultIS = new ByteArrayInputStream(new byte[0]);
        String header = "id,name\n";
        out.write(header.getBytes());
        String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);
            String ed = doc.getDocumentElement().getNodeName();
            out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee", doc,
                    XPathConstants.NODESET);
            int size = list.getLength();
            for (int i = 0; i < size; i++) {
                Node node = list.item(i);
                String line = "";
                NodeList nodeList = node.getChildNodes();
                int childNumber = nodeList.getLength();
                for (int j = 0; j < childNumber; j++) {
                    line += nodeList.item(j).getTextContent() + ",";
                }
                if (line.endsWith(","))
                    line = line.substring(0, line.length() - 1);
                line += "\n";
                out.write(line.getBytes());
            }
        } catch (ParserConfigurationException e) {
             e.printStackTrace();
        } catch (SAXException e) {
             e.printStackTrace();
        } catch (XPathExpressionException e) {
             e.printStackTrace();
        }
        IOUtils.copyBytes(resultIS, out, 4096, true);
        out.close();
    }
    public static Object getNode(String xpathStr, Node node, QName retunType)
            throws XPathExpressionException {
        XPath xpath = xpathFactory.newXPath();
        return xpath.evaluate(xpathStr, node, retunType);
    }
}
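
Two details of the mapper above may be worth checking, independent of the main problem. First, getChildNodes() on pretty-printed XML also returns the whitespace text nodes between elements, so joining every child with "," can produce stray commas and line breaks. Second, the cast to DTMNodeList depends on an internal JDK class; the public org.w3c.dom.NodeList interface is enough. The following is only a sketch in plain Java, outside Hadoop, with a hypothetical class name (EmployeeCsvSketch) and a shortened copy of the sample XML; it selects the wanted children by name with relative XPath expressions.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the posted job.
class EmployeeCsvSketch {

    // Returns a header plus one "id,ename" line per /Company/Employee element.
    static String toCsv(String xml) {
        StringBuilder csv = new StringBuilder("id,ename\n");
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Cast to the public NodeList interface rather than an internal class.
            NodeList employees = (NodeList) xpath.evaluate(
                    "/Company/Employee",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            for (int i = 0; i < employees.getLength(); i++) {
                Node employee = employees.item(i);
                // Relative expressions pick children by name, so the
                // whitespace text nodes between elements are never visited.
                String id = xpath.evaluate("id", employee);
                String ename = xpath.evaluate("ename", employee);
                csv.append(id).append(',').append(ename).append('\n');
            }
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        String xml = "<Company>\n"
                + "<Employee>\n<id>100</id>\n<ename>ranjini</ename>\n<dept>IT1</dept>\n</Employee>\n"
                + "<Employee>\n<id>1001</id>\n<ename>ranjinikumar</ename>\n<dept>IT</dept>\n</Employee>\n"
                + "</Company>";
        System.out.print(toCsv(xml));
    }
}
```

Selecting columns by name also keeps nested elements such as Address out of the line unless they are asked for explicitly.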



Main class:-

public class MainXml {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err.println("Usage: XMLtoText <input path> <output path>");
            System.exit(-1);
        }
        String output = "/user/task/Sales/";
        Job job = new Job(conf, "XML to Text");
        job.setJarByClass(MainXml.class);
        // job.setJobName("XML to Text");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
        Path outPath = new Path(output);
        FileOutputFormat.setOutputPath(job, outPath);
        FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
        if (dfs.exists(outPath)) {
            dfs.delete(outPath, true);
        }
        job.setMapperClass(XmlTextMapper.class);
        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}


My xml file

<Company>
<Employee>
<id>100</id>
<ename>ranjini</ename>
<dept>IT1</dept>
<sal>123456</sal>
<location>nextlevel1</location>
<Address>
<Home>Chennai1</Home>
<Office>Navallur1</Office>
</Address>
</Employee>
<Employee>
<id>1001</id>
<ename>ranjinikumar</ename>
<dept>IT</dept>
<sal>1234516</sal>
<location>nextlevel</location>
<Address>
<Home>Chennai</Home>
<Office>Navallur</Office>
</Address>
</Employee>
</Company>


Thanks in advance.

Ranjini
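
One possible explanation for the header-only result.txt, offered as a guess rather than a diagnosis: the job above does not set an input format, so Hadoop's default TextInputFormat hands the mapper one line of the file per map() call. A single line such as <Company> is not a well-formed document, so builder.parse() would throw a SAXException, which the mapper catches and only prints. Because fs.create() overwrites result.txt on every call, only the header written by the last call would remain. The sketch below, in plain Java with a hypothetical class name (FragmentParseCheck), shows how the parser treats a whole document versus a single line.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the posted job.
class FragmentParseCheck {

    // Returns the root element name, or null when the input
    // is not a well-formed XML document.
    static String rootName(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getNodeName();
        } catch (SAXException e) {
            return null; // the parser rejected the input
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The whole file as one string parses and yields its root element.
        String whole = "<Company><Employee><id>100</id></Employee></Company>";
        // One line of the file, as a line-oriented input format would deliver it.
        String oneLine = "<Company>";
        System.out.println("whole document -> " + rootName(whole));
        System.out.println("single line    -> " + rootName(oneLine));
    }
}
```

If this is the cause, the task logs should show the SAXException stack traces, and the mapper would need an input format that delivers a complete XML record per call.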



>  On Mon, Jan 6, 2014 at 2:44 PM, Ranjini Rathinam <ra...@gmail.com>wrote:
>
>> Hi,
>>
>> Thanks a lot .
>>
>> Ranjini
>>
>> On Fri, Jan 3, 2014 at 10:40 PM, Diego Gutierrez <
>> diego.gutierrez@ucsp.edu.pe> wrote:
>>
>>>  Hi,
>>>
>>> I suggest using XPath, which Java supports natively for querying XML.
>>>
>>> For the main problem, as with the distcp command (
>>> http://hadoop.apache.org/docs/r0.19.0/distcp.pdf ), there is no need for
>>> a reduce function, because you can parse the XML input file and create the
>>> file you need in the map function. For example, the following code reads an
>>> XML file in HDFS, parses it, and creates a new file ("/result.txt") with the
>>> expected format:
>>> id,name
>>> 100,RR
>>>
>>>
>>> Mapper function:
>>>
>>> import java.io.ByteArrayInputStream;
>>> import java.io.IOException;
>>> import java.io.InputStream;
>>> import java.net.URI;
>>>
>>> import javax.xml.namespace.QName;
>>> import javax.xml.parsers.DocumentBuilder;
>>> import javax.xml.parsers.DocumentBuilderFactory;
>>> import javax.xml.parsers.ParserConfigurationException;
>>> import javax.xml.xpath.XPath;
>>> import javax.xml.xpath.XPathConstants;
>>> import javax.xml.xpath.XPathExpressionException;
>>> import javax.xml.xpath.XPathFactory;
>>>
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.fs.FSDataOutputStream;
>>> import org.apache.hadoop.fs.FileSystem;
>>> import org.apache.hadoop.fs.Path;
>>> import org.apache.hadoop.io.IOUtils;
>>> import org.apache.hadoop.io.LongWritable;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.mapreduce.Mapper;
>>> import org.w3c.dom.Document;
>>> import org.w3c.dom.Node;
>>> import org.w3c.dom.NodeList;
>>> import org.xml.sax.SAXException;
>>>
>>> import com.sun.org.apache.xml.internal.dtm.ref.DTMNodeList;
>>>
>>> public class XmlToTextMapper extends Mapper<LongWritable, Text, Text,
>>> Text> {
>>>
>>>     private static final XPathFactory xpathFactory =
>>> XPathFactory.newInstance();
>>>
>>>     @Override
>>>     public void map(LongWritable key, Text value, Context context)
>>>             throws IOException, InterruptedException {
>>>
>>>         String resultFileName = "/result.txt";
>>>
>>>
>>>         Configuration conf = new Configuration();
>>>         FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
>>>         FSDataOutputStream out = fs.create(new Path(resultFileName));
>>>
>>>         InputStream resultIS = new ByteArrayInputStream(new byte[0]);
>>>
>>>         String header = "id,name\n";
>>>         out.write(header.getBytes());
>>>
>>>         String xmlContent = value.toString();
>>>         InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
>>>         DocumentBuilderFactory factory =
>>> DocumentBuilderFactory.newInstance();
>>>         DocumentBuilder builder;
>>>         try {
>>>             builder = factory.newDocumentBuilder();
>>>             Document doc = builder.parse(is);
>>>             DTMNodeList list = (DTMNodeList) getNode("/main/data", doc,
>>>                     XPathConstants.NODESET);
>>>
>>>             int size = list.getLength();
>>>             for (int i = 0; i < size; i++) {
>>>                 Node node = list.item(i);
>>>                 String line = "";
>>>                 NodeList nodeList = node.getChildNodes();
>>>                 int childNumber = nodeList.getLength();
>>>                 for (int j = 0; j < childNumber; j++) {
>>>                     line += nodeList.item(j).getTextContent() + ",";
>>>                 }
>>>                 if (line.endsWith(","))
>>>                     line = line.substring(0, line.length() - 1);
>>>                 line += "\n";
>>>                 out.write(line.getBytes());
>>>
>>>             }
>>>
>>>         } catch (ParserConfigurationException e) {
>>>             MyLogguer.log("error: " + e.getMessage());
>>>             e.printStackTrace();
>>>         } catch (SAXException e) {
>>>             MyLogguer.log("error: " + e.getMessage());
>>>             e.printStackTrace();
>>>         } catch (XPathExpressionException e) {
>>>             MyLogguer.log("error: " + e.getMessage());
>>>             e.printStackTrace();
>>>         }
>>>
>>>         IOUtils.copyBytes(resultIS, out, 4096, true);
>>>         out.close();
>>>     }
>>>
>>>     public static Object getNode(String xpathStr, Node node, QName
>>> retunType)
>>>             throws XPathExpressionException {
>>>         XPath xpath = xpathFactory.newXPath();
>>>         return xpath.evaluate(xpathStr, node, retunType);
>>>     }
>>> }
>>>
>>>
>>>
>>> --------------------------------------
>>> Main class:
>>>
>>>
>>> public class Main {
>>>
>>>     public static void main(String[] args) throws Exception {
>>>
>>>         if (args.length != 2) {
>>>             System.err
>>>                     .println("Usage: XMLtoText <input path> <output
>>> path>");
>>>             System.exit(-1);
>>>         }
>>>
>>>         Job job = new Job();
>>>         job.setJarByClass(Main.class);
>>>         job.setJobName("XML to Text");
>>>         FileInputFormat.addInputPath(job, new Path(args[0]));
>>>         FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>>
>>>         job.setMapperClass(XmlToTextMapper.class);
>>>         job.setNumReduceTasks(0);
>>>         job.setMapOutputKeyClass(Text.class);
>>>         job.setMapOutputValueClass(Text.class);
>>>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>>>
>>>     }
>>> }
>>>
>>> To execute the job you can use :
>>>
>>>          bin/hadoop Main /data.xml /output.
>>>
>>>
>>> Then you can use this to see result.txt file:
>>>
>>>           hadoop fs -cat /result.txt
>>>
>>>
>>> I'm using this xml as input:
>>>
>>>
>>> <main><data><id>1</id><name>NameA</name></data><data><id>2</id><name>NameB</name></data></main>
>>>
>>> and the content in result.txt is like this:
>>>
>>> id,name
>>> 1,NameA
>>> 2,NameB
>>>
>>>
>>> Hope this helps.
>>>
>>>
>>> 2014/1/3 Ranjini Rathinam <ra...@gmail.com>
>>>
>>>> Hi,
>>>>
>>>> Need to convert XML into text using mapreduce.
>>>>
>>>> I have used DOM and SAX parser.
>>>>
>>>> After using SAX Builder in the mapper class, a child node acts as the
>>>> root element.
>>>>
>>>> In the System.out output I found that the root element is reported as
>>>> the child elements.
>>>>
>>>> For example,
>>>>
>>>> <Comp><Emp><id>100</id><name>RR</name></Emp></Comp>
>>>> when this XML is passed to the mapper and the root element is printed,
>>>>
>>>> I get the root element as
>>>>
>>>> <id>
>>>> <name>
>>>>
>>>> Please suggest and help to fix this.
>>>>
>>>> I need to convert the xml into text using mapreduce code. Please
>>>> provide with example.
>>>>
>>>> Required output is
>>>>
>>>> id,name
>>>> 100,RR
>>>>
>>>> Please help.
>>>>
>>>> Thanks in advance,
>>>> Ranjini R
>>>>
>>>
>>>
>>
>

Re: XML to TEXT

Posted by Rajesh Nagaraju <ra...@gmail.com>.
Hi Ranjini,

Can you use Hive? Then you can just use XPath expressions in your SELECT clause.

cheers
R+
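
Hive ships XPath functions such as xpath_string that take an XML string and a path expression. As a rough illustration of what such a query computes for each row, here is a sketch in plain Java using the JDK's javax.xml.xpath; it is not Hive's implementation, and the class and method names (RowXPathSketch, xpathString) are made up for the example. It assumes each row holds one complete <Emp> record as a string.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
class RowXPathSketch {

    // Evaluates one XPath expression against one XML record
    // and returns the string value of the first match.
    static String xpathString(String xmlRecord, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression,
                    new InputSource(new StringReader(xmlRecord)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Each array element stands for one table row holding one record.
        String[] rows = {
            "<Emp><id>100</id><name>RR</name></Emp>",
        };
        System.out.println("id,name");
        for (String row : rows) {
            System.out.println(xpathString(row, "/Emp/id") + ","
                    + xpathString(row, "/Emp/name"));
        }
    }
}
```

The same per-record approach only works if the data is first split so that every row contains a complete record.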


On Mon, Jan 6, 2014 at 2:44 PM, Ranjini Rathinam <ra...@gmail.com>wrote:

> Hi,
>
> Thanks a lot .
>
> Ranjini
>
> On Fri, Jan 3, 2014 at 10:40 PM, Diego Gutierrez <
> diego.gutierrez@ucsp.edu.pe> wrote:
>
>>  Hi,
>>
>> I suggest using XPath, which Java supports natively for querying XML.
>>
>> For the main problem, as with the distcp command (
>> http://hadoop.apache.org/docs/r0.19.0/distcp.pdf ), there is no need for a
>> reduce function, because you can parse the XML input file and create the
>> file you need in the map function. For example, the following code reads an
>> XML file in HDFS, parses it, and creates a new file ("/result.txt") with the
>> expected format:
>> id,name
>> 100,RR
>>
>>
>> Mapper function:
>>
>> import java.io.ByteArrayInputStream;
>> import java.io.IOException;
>> import java.io.InputStream;
>> import java.net.URI;
>>
>> import javax.xml.namespace.QName;
>> import javax.xml.parsers.DocumentBuilder;
>> import javax.xml.parsers.DocumentBuilderFactory;
>> import javax.xml.parsers.ParserConfigurationException;
>> import javax.xml.xpath.XPath;
>> import javax.xml.xpath.XPathConstants;
>> import javax.xml.xpath.XPathExpressionException;
>> import javax.xml.xpath.XPathFactory;
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FSDataOutputStream;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.IOUtils;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.Mapper;
>> import org.w3c.dom.Document;
>> import org.w3c.dom.Node;
>> import org.w3c.dom.NodeList;
>> import org.xml.sax.SAXException;
>>
>> import com.sun.org.apache.xml.internal.dtm.ref.DTMNodeList;
>>
>> public class XmlToTextMapper extends Mapper<LongWritable, Text, Text,
>> Text> {
>>
>>     private static final XPathFactory xpathFactory =
>> XPathFactory.newInstance();
>>
>>     @Override
>>     public void map(LongWritable key, Text value, Context context)
>>             throws IOException, InterruptedException {
>>
>>         String resultFileName = "/result.txt";
>>
>>
>>         Configuration conf = new Configuration();
>>         FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
>>         FSDataOutputStream out = fs.create(new Path(resultFileName));
>>
>>         InputStream resultIS = new ByteArrayInputStream(new byte[0]);
>>
>>         String header = "id,name\n";
>>         out.write(header.getBytes());
>>
>>         String xmlContent = value.toString();
>>         InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
>>         DocumentBuilderFactory factory =
>> DocumentBuilderFactory.newInstance();
>>         DocumentBuilder builder;
>>         try {
>>             builder = factory.newDocumentBuilder();
>>             Document doc = builder.parse(is);
>>             DTMNodeList list = (DTMNodeList) getNode("/main/data", doc,
>>                     XPathConstants.NODESET);
>>
>>             int size = list.getLength();
>>             for (int i = 0; i < size; i++) {
>>                 Node node = list.item(i);
>>                 String line = "";
>>                 NodeList nodeList = node.getChildNodes();
>>                 int childNumber = nodeList.getLength();
>>                 for (int j = 0; j < childNumber; j++) {
>>                     line += nodeList.item(j).getTextContent() + ",";
>>                 }
>>                 if (line.endsWith(","))
>>                     line = line.substring(0, line.length() - 1);
>>                 line += "\n";
>>                 out.write(line.getBytes());
>>
>>             }
>>
>>         } catch (ParserConfigurationException e) {
>>             MyLogguer.log("error: " + e.getMessage());
>>             e.printStackTrace();
>>         } catch (SAXException e) {
>>             MyLogguer.log("error: " + e.getMessage());
>>             e.printStackTrace();
>>         } catch (XPathExpressionException e) {
>>             MyLogguer.log("error: " + e.getMessage());
>>             e.printStackTrace();
>>         }
>>
>>         IOUtils.copyBytes(resultIS, out, 4096, true);
>>         out.close();
>>     }
>>
>>     public static Object getNode(String xpathStr, Node node, QName
>> retunType)
>>             throws XPathExpressionException {
>>         XPath xpath = xpathFactory.newXPath();
>>         return xpath.evaluate(xpathStr, node, retunType);
>>     }
>> }
>>
>>
>>
>> --------------------------------------
>>  Main class:
>>
>>
>> public class Main {
>>
>>     public static void main(String[] args) throws Exception {
>>
>>         if (args.length != 2) {
>>             System.err
>>                     .println("Usage: XMLtoText <input path> <output
>> path>");
>>             System.exit(-1);
>>         }
>>
>>         Job job = new Job();
>>         job.setJarByClass(Main.class);
>>         job.setJobName("XML to Text");
>>         FileInputFormat.addInputPath(job, new Path(args[0]));
>>         FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>
>>         job.setMapperClass(XmlToTextMapper.class);
>>         job.setNumReduceTasks(0);
>>         job.setMapOutputKeyClass(Text.class);
>>         job.setMapOutputValueClass(Text.class);
>>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>>
>>     }
>> }
>>
>> To execute the job you can use :
>>
>>          bin/hadoop Main /data.xml /output.
>>
>>
>> Then you can use this to see result.txt file:
>>
>>           hadoop fs -cat /result.txt
>>
>>
>> I'm using this xml as input:
>>
>>
>> <main><data><id>1</id><name>NameA</name></data><data><id>2</id><name>NameB</name></data></main>
>>
>> and the content in result.txt is like this:
>>
>> id,name
>> 1,NameA
>> 2,NameB
>>
>>
>> Hope this helps.
>>
>>
>> 2014/1/3 Ranjini Rathinam <ra...@gmail.com>
>>
>>> Hi,
>>>
>>> Need to convert XML into text using mapreduce.
>>>
>>> I have used DOM and SAX parser.
>>>
>>> After using SAX Builder in the mapper class, a child node acts as the
>>> root element.
>>>
>>> In the System.out output I found that the root element is reported as the
>>> child elements.
>>>
>>> For example,
>>>
>>> <Comp><Emp><id>100</id><name>RR</name></Emp></Comp>
>>> when this XML is passed to the mapper and the root element is printed,
>>>
>>> I get the root element as
>>>
>>> <id>
>>> <name>
>>>
>>> Please suggest and help to fix this.
>>>
>>> I need to convert the xml into text using mapreduce code. Please provide
>>> with example.
>>>
>>> Required output is
>>>
>>> id,name
>>> 100,RR
>>>
>>> Please help.
>>>
>>> Thanks in advance,
>>> Ranjini R
>>>
>>
>>
>

Re: XML to TEXT

Posted by Rajesh Nagaraju <ra...@gmail.com>.
Hi Ranjini,

Can you use Hive? Then you can just use XPath expressions in your SELECT
clause.

Cheers,
R+


On Mon, Jan 6, 2014 at 2:44 PM, Ranjini Rathinam <ra...@gmail.com> wrote:

> Hi,
>
> Thanks a lot.
>
> Ranjini

Re: XML to TEXT

Posted by Ranjini Rathinam <ra...@gmail.com>.
Hi,

Thanks a lot.

Ranjini

On Fri, Jan 3, 2014 at 10:40 PM, Diego Gutierrez <
diego.gutierrez@ucsp.edu.pe> wrote:

>  Hi,
>
> I suggest using XPath, which Java supports natively (javax.xml.xpath)
> for querying XML.
>
> For the main problem, as with the distcp command (
> http://hadoop.apache.org/docs/r0.19.0/distcp.pdf ), there is no need for a
> reduce function, because you can parse the XML input file and create the
> file you need in the map function. For example, the following code reads an
> XML file in HDFS, parses it, and creates a new file ("/result.txt") with the
> expected format:
> id,name
> 100,RR
>
>
> Mapper function:
>
> import java.io.ByteArrayInputStream;
> import java.io.IOException;
> import java.io.InputStream;
> import java.net.URI;
>
> import javax.xml.namespace.QName;
> import javax.xml.parsers.DocumentBuilder;
> import javax.xml.parsers.DocumentBuilderFactory;
> import javax.xml.parsers.ParserConfigurationException;
> import javax.xml.xpath.XPath;
> import javax.xml.xpath.XPathConstants;
> import javax.xml.xpath.XPathExpressionException;
> import javax.xml.xpath.XPathFactory;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.IOUtils;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.w3c.dom.Document;
> import org.w3c.dom.Node;
> import org.w3c.dom.NodeList;
> import org.xml.sax.SAXException;
>
> public class XmlToTextMapper extends Mapper<LongWritable, Text, Text, Text> {
>
>     private static final XPathFactory xpathFactory = XPathFactory.newInstance();
>
>     // Note: map() is called once per input record (one line with the default
>     // TextInputFormat), so this assumes the whole XML document sits on a
>     // single line. Each call recreates /result.txt.
>     @Override
>     public void map(LongWritable key, Text value, Context context)
>             throws IOException, InterruptedException {
>
>         String resultFileName = "/result.txt";
>
>         Configuration conf = context.getConfiguration();
>         FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
>         FSDataOutputStream out = fs.create(new Path(resultFileName));
>
>         try {
>             out.write("id,name\n".getBytes("UTF-8"));
>
>             InputStream is = new ByteArrayInputStream(
>                     value.toString().getBytes("UTF-8"));
>             DocumentBuilder builder =
>                     DocumentBuilderFactory.newInstance().newDocumentBuilder();
>             Document doc = builder.parse(is);
>             // Cast to the standard NodeList interface instead of the
>             // JDK-internal DTMNodeList class.
>             NodeList list = (NodeList) getNode("/main/data", doc,
>                     XPathConstants.NODESET);
>
>             for (int i = 0; i < list.getLength(); i++) {
>                 // Join the text of each child element with commas.
>                 NodeList children = list.item(i).getChildNodes();
>                 StringBuilder line = new StringBuilder();
>                 for (int j = 0; j < children.getLength(); j++) {
>                     if (j > 0) {
>                         line.append(',');
>                     }
>                     line.append(children.item(j).getTextContent());
>                 }
>                 line.append('\n');
>                 out.write(line.toString().getBytes("UTF-8"));
>             }
>
>         } catch (ParserConfigurationException e) {
>             System.err.println("error: " + e.getMessage());
>             e.printStackTrace();
>         } catch (SAXException e) {
>             System.err.println("error: " + e.getMessage());
>             e.printStackTrace();
>         } catch (XPathExpressionException e) {
>             System.err.println("error: " + e.getMessage());
>             e.printStackTrace();
>         } finally {
>             // Close the output file exactly once, even if parsing fails.
>             IOUtils.closeStream(out);
>         }
>     }
>
>     public static Object getNode(String xpathStr, Node node, QName returnType)
>             throws XPathExpressionException {
>         XPath xpath = xpathFactory.newXPath();
>         return xpath.evaluate(xpathStr, node, returnType);
>     }
> }
>
>
>
> --------------------------------------
> Main class:
>
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>
> public class Main {
>
>     public static void main(String[] args) throws Exception {
>
>         if (args.length != 2) {
>             System.err.println("Usage: XMLtoText <input path> <output path>");
>             System.exit(-1);
>         }
>
>         Job job = new Job();
>         job.setJarByClass(Main.class);
>         job.setJobName("XML to Text");
>         FileInputFormat.addInputPath(job, new Path(args[0]));
>         FileOutputFormat.setOutputPath(job, new Path(args[1]));
>
>         job.setMapperClass(XmlToTextMapper.class);
>         // Map-only job: no reducers are needed.
>         job.setNumReduceTasks(0);
>         job.setMapOutputKeyClass(Text.class);
>         job.setMapOutputValueClass(Text.class);
>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>
>     }
> }
>
> To execute the job you can use:
>
>          bin/hadoop Main /data.xml /output
>
>
> Then you can use this to see the result.txt file:
>
>           hadoop fs -cat /result.txt
>
>
> I'm using this xml as input:
>
>
> <main><data><id>1</id><name>NameA</name></data><data><id>2</id><name>NameB</name></data></main>
>
> and the content in result.txt is like this:
>
> id,name
> 1,NameA
> 2,NameB
>
>
> Hope this helps.
>

Re: XML to TEXT

Posted by Ranjini Rathinam <ra...@gmail.com>.
Hi,

Thanks a lot .

Ranjini

On Fri, Jan 3, 2014 at 10:40 PM, Diego Gutierrez <
diego.gutierrez@ucsp.edu.pe> wrote:

>  Hi,
>
> I suggest to use the XPath, this is a native java support for parse xml
> and json formats.
>
> For the main problem, like distcp command(
> http://hadoop.apache.org/docs/r0.19.0/distcp.pdf ) there is no need of a
> reduce function, because you can parse the xml input file and create the
> file you need in the map function.For example the following code reads an
> xml file in HDFS, parse it and create a new file ( "/result.txt" ) with the
> expected format:
> id,name
> 100,RR
>
>
> Mapper function:
>
> import java.io.ByteArrayInputStream;
> import java.io.IOException;
> import java.io.InputStream;
> import java.net.URI;
>
> import javax.xml.namespace.QName;
> import javax.xml.parsers.DocumentBuilder;
> import javax.xml.parsers.DocumentBuilderFactory;
> import javax.xml.parsers.ParserConfigurationException;
> import javax.xml.xpath.XPath;
> import javax.xml.xpath.XPathConstants;
> import javax.xml.xpath.XPathExpressionException;
> import javax.xml.xpath.XPathFactory;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.IOUtils;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.w3c.dom.Document;
> import org.w3c.dom.Node;
> import org.w3c.dom.NodeList;
> import org.xml.sax.SAXException;
>
> import com.sun.org.apache.xml.internal.dtm.ref.DTMNodeList;
>
> public class XmlToTextMapper extends Mapper<LongWritable, Text, Text,
> Text> {
>
>     private static final XPathFactory xpathFactory =
> XPathFactory.newInstance();
>
>     @Override
>     public void map(LongWritable key, Text value, Context context)
>             throws IOException, InterruptedException {
>
>         String resultFileName = "/result.txt";
>
>
>         Configuration conf = new Configuration();
>         FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
>         FSDataOutputStream out = fs.create(new Path(resultFileName));
>
>         InputStream resultIS = new ByteArrayInputStream(new byte[0]);
>
>         String header = "id,name\n";
>         out.write(header.getBytes());
>
>         String xmlContent = value.toString();
>         InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
>         DocumentBuilderFactory factory =
> DocumentBuilderFactory.newInstance();
>         DocumentBuilder builder;
>         try {
>             builder = factory.newDocumentBuilder();
>             Document doc = builder.parse(is);
>             DTMNodeList list = (DTMNodeList) getNode("/main/data", doc,
>                     XPathConstants.NODESET);
>
>             int size = list.getLength();
>             for (int i = 0; i < size; i++) {
>                 Node node = list.item(i);
>                 String line = "";
>                 NodeList nodeList = node.getChildNodes();
>                 int childNumber = nodeList.getLength();
>                 for (int j = 0; j < childNumber; j++) {
>                     line += nodeList.item(j).getTextContent() + ",";
>                 }
>                 if (line.endsWith(","))
>                     line = line.substring(0, line.length() - 1);
>                 line += "\n";
>                 out.write(line.getBytes());
>
>             }
>
>         } catch (ParserConfigurationException e) {
>             MyLogguer.log("error: " + e.getMessage());
>             e.printStackTrace();
>         } catch (SAXException e) {
>             MyLogguer.log("error: " + e.getMessage());
>             e.printStackTrace();
>         } catch (XPathExpressionException e) {
>             MyLogguer.log("error: " + e.getMessage());
>             e.printStackTrace();
>         }
>
>         IOUtils.copyBytes(resultIS, out, 4096, true);
>         out.close();
>     }
>
>     public static Object getNode(String xpathStr, Node node, QName
> retunType)
>             throws XPathExpressionException {
>         XPath xpath = xpathFactory.newXPath();
>         return xpath.evaluate(xpathStr, node, retunType);
>     }
> }
>
>
>
> --------------------------------------
> Main class:
>
>
> public class Main {
>
>     public static void main(String[] args) throws Exception {
>
>         if (args.length != 2) {
>             System.err
>                     .println("Usage: XMLtoText <input path> <output
> path>");
>             System.exit(-1);
>         }
>
>         Job job = new Job();
>         job.setJarByClass(Main.class);
>         job.setJobName("XML to Text");
>         FileInputFormat.addInputPath(job, new Path(args[0]));
>         FileOutputFormat.setOutputPath(job, new Path(args[1]));
>
>         job.setMapperClass(XmlToTextMapper.class);
>         job.setNumReduceTasks(0);
>         job.setMapOutputKeyClass(Text.class);
>         job.setMapOutputValueClass(Text.class);
>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>
>     }
> }
>
> To execute the job you can use :
>
>          bin/hadoop Main /data.xml /output.
>
>
> Then you can use this to see result.txt file:
>
>           hadoop fs -cat /result.txt
>
>
> I'm using this xml as input:
>
>
> <Comp><Emp><id>1</id><name>NameA</name></data><data><id>2</id><name>NameB</name></Emp></Comp>
>
> and the content in result.txt is like this:
>
> id,name
> 1,NameA
> 2,NameB
>
>
> Hope this helps.
>
>
> 2014/1/3 Ranjini Rathinam <ra...@gmail.com>
>
>> Hi,
>>
>> Need to convert XML into text using mapreduce.
>>
>> I have used DOM and SAX parser.
>>
>> After using SAX Builder in the mapper class, the child node acts as the root
>> element.
>>
>> Looking at System.out, I found that the root element is taking the child
>> element and printing it.
>>
>> For example,
>>
>> <Comp><Emp><id>100</id><name>RR</name></Emp></Comp>
>> when this XML is passed to the mapper, System.out prints the root element.
>>
>> I am getting the root element as
>>
>> <id>
>> <name>
>>
>> Please suggest and help to fix this.
>>
>> I need to convert the xml into text using mapreduce code. Please provide
>> with example.
>>
>> Required output is
>>
>> id,name
>> 100,RR
>>
>> Please help.
>>
>> Thanks in advance,
>> Ranjini R

Re: XML to TEXT

Posted by Diego Gutierrez <di...@ucsp.edu.pe>.
Hi,

I suggest using XPath, which Java supports natively for querying XML
documents.

For the main problem, as with the distcp command (
http://hadoop.apache.org/docs/r0.19.0/distcp.pdf ), there is no need for a
reduce function, because you can parse the XML input file and create the
file you need in the map function. For example, the following code reads an
XML file in HDFS, parses it, and creates a new file ( "/result.txt" ) with the
expected format:
id,name
100,RR


Mapper function:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class XmlToTextMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final XPathFactory xpathFactory =
            XPathFactory.newInstance();

    // Called once per input record. With the default TextInputFormat a
    // record is one line, so the whole XML document must be on a single line.
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        String resultFileName = "/result.txt";

        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
        // create() overwrites an existing file, so this approach only suits
        // an input that produces a single record.
        FSDataOutputStream out = fs.create(new Path(resultFileName));

        try {
            out.write("id,name\n".getBytes(StandardCharsets.UTF_8));

            InputStream is = new ByteArrayInputStream(
                    value.toString().getBytes(StandardCharsets.UTF_8));
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(is);

            // Select every <Emp> record under the <Comp> root element.
            // Cast to the public NodeList interface rather than the
            // JDK-internal DTMNodeList class.
            NodeList list = (NodeList) getNode("/Comp/Emp", doc,
                    XPathConstants.NODESET);

            for (int i = 0; i < list.getLength(); i++) {
                NodeList children = list.item(i).getChildNodes();
                StringBuilder line = new StringBuilder();
                for (int j = 0; j < children.getLength(); j++) {
                    if (j > 0) {
                        line.append(',');
                    }
                    line.append(children.item(j).getTextContent());
                }
                line.append('\n');
                out.write(line.toString().getBytes(StandardCharsets.UTF_8));
            }

        } catch (ParserConfigurationException | SAXException
                | XPathExpressionException e) {
            System.err.println("error: " + e.getMessage());
            e.printStackTrace();
        } finally {
            out.close();
        }
    }

    // Evaluates an XPath expression against a node and returns the result
    // as the requested type (for example XPathConstants.NODESET).
    public static Object getNode(String xpathStr, Node node,
            QName returnType) throws XPathExpressionException {
        XPath xpath = xpathFactory.newXPath();
        return xpath.evaluate(xpathStr, node, returnType);
    }
}



--------------------------------------
Main class:


import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Main {

    public static void main(String[] args) throws Exception {

        if (args.length != 2) {
            System.err.println(
                    "Usage: XMLtoText <input path> <output path>");
            System.exit(-1);
        }

        Job job = new Job();
        job.setJarByClass(Main.class);
        job.setJobName("XML to Text");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Map-only job: the mapper writes /result.txt itself.
        job.setMapperClass(XmlToTextMapper.class);
        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);

    }
}

To execute the job you can use:

         bin/hadoop Main /data.xml /output


Then you can view the result.txt file with:

          hadoop fs -cat /result.txt


I'm using this XML as input:

<Comp><Emp><id>1</id><name>NameA</name></Emp><Emp><id>2</id><name>NameB</name></Emp></Comp>

and the content of result.txt looks like this:

id,name
1,NameA
2,NameB
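
The XPath step can also be tried outside Hadoop. The following is only a
minimal sketch using the JDK's own XML classes; the class name XmlToCsvSketch
and the method toCsv are made up for illustration and are not part of Hadoop
or of the mapper above. It assumes every record element has simple child
elements and that no value contains a comma. Unlike the mapper, it skips
whitespace text nodes, so it also works on indented XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
final class XmlToCsvSketch {

    // Returns one comma-separated line per element matched by recordXPath.
    static List<String> toCsv(String xml, String recordXPath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList records = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(recordXPath, doc, XPathConstants.NODESET);

            List<String> lines = new ArrayList<>();
            for (int i = 0; i < records.getLength(); i++) {
                List<String> fields = new ArrayList<>();
                NodeList children = records.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    // Keep element children only; skip whitespace text nodes.
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        fields.add(child.getTextContent().trim());
                    }
                }
                lines.add(String.join(",", fields));
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not convert XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Comp><Emp><id>1</id><name>NameA</name></Emp>"
                + "<Emp><id>2</id><name>NameB</name></Emp></Comp>";
        System.out.println("id,name");
        for (String line : toCsv(xml, "/Comp/Emp")) {
            System.out.println(line);
        }
    }
}
```

Inside a mapper, each returned line could be written with context.write()
instead of to a fixed HDFS path, so that several map tasks do not overwrite
the same file.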


Hope this helps.


2014/1/3 Ranjini Rathinam <ra...@gmail.com>

> Hi,
>
> Need to convert XML into text using mapreduce.
>
> I have used DOM and SAX parser.
>
> After using SAX Builder in the mapper class, the child node acts as the root
> element.
>
> Looking at System.out, I found that the root element is taking the child
> element and printing it.
>
> For example,
>
> <Comp><Emp><id>100</id><name>RR</name></Emp></Comp>
> when this XML is passed to the mapper, System.out prints the root element.
>
> I am getting the root element as
>
> <id>
> <name>
>
> Please suggest and help to fix this.
>
> I need to convert the xml into text using mapreduce code. Please provide
> with example.
>
> Required output is
>
> id,name
> 100,RR
>
> Please help.
>
> Thanks in advance,
> Ranjini R

Re: XML to TEXT

Posted by Shekhar Sharma <sh...@gmail.com>.
Which input format are you using? Use an XML input format.
On 3 Jan 2014 10:47, "Ranjini Rathinam" <ra...@gmail.com> wrote:

> Hi,
>
> Need to convert XML into text using mapreduce.
>
> I have used DOM and SAX parser.
>
> After using SAX Builder in the mapper class, the child node acts as the root
> element.
>
> Looking at System.out, I found that the root element is taking the child
> element and printing it.
>
> For example,
>
> <Comp><Emp><id>100</id><name>RR</name></Emp></Comp>
> when this XML is passed to the mapper, System.out prints the root element.
>
> I am getting the root element as
>
> <id>
> <name>
>
> Please suggest and help to fix this.
>
> I need to convert the xml into text using mapreduce code. Please provide
> with example.
>
> Required output is
>
> id,name
> 100,RR
>
> Please help.
>
> Thanks in advance,
> Ranjini R

Re: XML to TEXT

Posted by Azuryy Yu <az...@gmail.com>.
Hi,

You can use org.apache.hadoop.streaming.StreamInputFormat in a MapReduce job
to convert XML to text.

For example, if your XML looks like this:
<xml>
  <name>lll</name>
</xml>

you need to specify stream.recordreader.begin and stream.recordreader.end
in the Configuration:
Configuration conf = new Configuration();
conf.set("stream.recordreader.begin", "<xml>");
conf.set("stream.recordreader.end", "</xml>");
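
To make the effect of the begin and end markers concrete, here is a small
plain-Java sketch of what splitting on such markers produces. It is not
Hadoop's record reader, only an illustration; the class name
TagRecordSplitterSketch, the method split, and the second sample record
(101, SS) are assumptions made for this example. Each piece it returns is
still XML, so the mapper has to parse every record itself to turn it into
text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of begin/end marker splitting.
final class TagRecordSplitterSketch {

    // Returns every substring that starts with `begin` and ends with the
    // next following `end`, markers included.
    static List<String> split(String input, String begin, String end) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = input.indexOf(begin, from);
            if (start < 0) {
                break; // no further record
            }
            int stop = input.indexOf(end, start + begin.length());
            if (stop < 0) {
                break; // unterminated record is ignored
            }
            records.add(input.substring(start, stop + end.length()));
            from = stop + end.length();
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<Comp><Emp><id>100</id><name>RR</name></Emp>"
                + "<Emp><id>101</id><name>SS</name></Emp></Comp>";
        // Each printed line is one record as a mapper would receive it.
        for (String record : split(xml, "<Emp>", "</Emp>")) {
            System.out.println(record);
        }
    }
}
```

With "<Emp>" and "</Emp>" as markers, the enclosing <Comp> element is never
part of a record, which is why a parser run on a single record reports a
different root element than the one in the full document.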

On Fri, Jan 3, 2014 at 1:16 PM, Ranjini Rathinam <ra...@gmail.com> wrote:

> Hi,
>
> Need to convert XML into text using mapreduce.
>
> I have used DOM and SAX parser.
>
> After using SAX Builder in the mapper class, the child node acts as the root
> element.
>
> Looking at System.out, I found that the root element is taking the child
> element and printing it.
>
> For example,
>
> <Comp><Emp><id>100</id><name>RR</name></Emp></Comp>
> when this XML is passed to the mapper, System.out prints the root element.
>
> I am getting the root element as
>
> <id>
> <name>
>
> Please suggest and help to fix this.
>
> I need to convert the xml into text using mapreduce code. Please provide
> with example.
>
> Required output is
>
> id,name
> 100,RR
>
> Please help.
>
> Thanks in advance,
> Ranjini R
