Posted to user@mahout.apache.org by David Rahman <dr...@googlemail.com> on 2011/11/08 10:50:04 UTC

Using XML-Data

Hi,

My data is available in XML; it looks something like this:

<data>
  <doc>
    <title> ... </title>
    <abstract> ... </abstract>
    <keyword> ... </keyword>
    ...
    <keyword> ... </keyword>
    <keyword> ... </keyword>
  </doc>
  <doc>
    ...
  </doc>
</data>

I looked into the wikipedia-example and I have a few questions:

1. The first step of the example is to chunk the data into pieces. Is this
necessary, given that my data already comes in pieces? Each XML file contains
~1000 documents, and I want to use ~250 XML files in a first test. Could I
just put the existing XML files into an HDFS folder in Hadoop (a sketch of
what I mean is below)?
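
For question 1, this is roughly what I have in mind: a minimal sketch using
the plain Hadoop FileSystem API to copy the existing XML files into HDFS
unchanged (the paths are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadXmlFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Copy the local directory with the ~250 XML files into an HDFS folder
    // as-is, instead of re-chunking them like the Wikipedia example does.
    fs.copyFromLocalFile(new Path("/local/path/to/xml-files"),
                         new Path("xmlchunks"));
    fs.close();
  }
}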

2. The second step is running the wikipediaDataSetCreator on the chunk files
(chunk-****.xml). I found the WikipediaDataSetCreatorDriver, -Mapper, and
-Reducer. Can someone explain how they work? For example, I don't understand
how the label (the category "country") is defined. In my case there would
also be more than one label per document; a sketch of the output I think I
need is below.
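
To make question 2 more concrete: as far as I can tell, the dataset creator
emits one line per document in the form "label<TAB>document text", with the
label taken from the categories file. This is a rough sketch of how I would
produce the same format from my XML (the class is my own invention, not
Mahout code); because a document can have several <keyword> elements, I
would simply emit one line per keyword:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XmlToLabeledText {
  public static void main(String[] args) throws Exception {
    // Parse one of my XML files and print "label<TAB>text" lines,
    // one line per (keyword, document) pair.
    Document xml = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(new File(args[0]));
    NodeList docs = xml.getElementsByTagName("doc");
    for (int i = 0; i < docs.getLength(); i++) {
      Element doc = (Element) docs.item(i);
      String content = text(doc, "title") + " " + text(doc, "abstract");
      NodeList keywords = doc.getElementsByTagName("keyword");
      for (int k = 0; k < keywords.getLength(); k++) {
        String label = keywords.item(k).getTextContent().trim();
        System.out.println(label + "\t" + content);
      }
    }
  }

  private static String text(Element doc, String tag) {
    NodeList nodes = doc.getElementsByTagName(tag);
    return nodes.getLength() > 0 ? nodes.item(0).getTextContent().trim() : "";
  }
}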

3. In the third step the classifier is trained; here I would use the
Complementary Bayes classifier. When I test the classifier, I also need all
possible candidates (not only the top one). How can I list all candidates
with their weights? So far I have only found a way to list the top
candidates; did I miss something? (A sketch of what I am after is below.)
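
For question 3, this is roughly what I am after (a rough sketch only; the
package, class, and method names below are my assumption from reading the
javadoc of the bayes classifier, so please correct me if I got the API
wrong):

// Sketch only: I am assuming a ClassifierContext/ClassifierResult API that
// can return the top-N results; the exact signatures may be different.
import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.bayes.ClassifierContext;

public class ListAllCandidates {
  public static void printCandidates(ClassifierContext classifier,
                                     String[] tokens) throws Exception {
    // Ask for more results than there are labels, so that (ideally) every
    // candidate category comes back with its weight, not only the best one.
    int maxLabels = 1000; // assumed to be larger than the number of labels
    ClassifierResult[] results =
        classifier.classifyDocument(tokens, "unknown", maxLabels);
    for (ClassifierResult r : results) {
      System.out.println(r.getLabel() + "\t" + r.getScore());
    }
  }
}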

But overall it should be the same as the Wikipedia example, only with more
labels (XML + text + possible categories).

Thanks and regards,
David