Posted to user@hama.apache.org by Steven van Beelen <sm...@gmail.com> on 2013/05/21 12:04:59 UTC

BSP Task Input/InputSplit Filename

Hi all,

The title says it: is there a way to retrieve the filename of the
input/inputsplit a BSP Task is working on? I've been looking for some time
in the docs and source files, but cannot seem to find if one is able to
retrieve the filename/pathname from the input used.

Cheers

Re: BSP Task Input/InputSplit Filename

Posted by "Edward J. Yoon" <ed...@apache.org>.
Good luck.

BTW, if you have to manage a lot of documents, I think you need to
merge the documents into a MapFile or SequenceFile (document ID as key,
document content as value) on HDFS. Apache Nutch will be helpful. Then you
can create an inverted index MR program by editing a few lines of the
word-count MR example.
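
To make the suggestion concrete, here is a minimal plain-Java sketch of the
inverted index described in this thread: each term maps to a postings list of
(document ID, term frequency) pairs. It is not MapReduce or Hama code. The
class name InvertedIndexSketch, the in-memory Map input, and the naive
tokenizer are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch; no Hama, Hadoop, or Nutch classes are involved.
class InvertedIndexSketch {

    // Builds term -> (document ID -> term frequency) from (document ID -> text) pairs.
    static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // Naive tokenization (an assumption): lower-case, split on non-alphanumeric runs.
            String[] terms = doc.getValue().toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
            for (String term : terms) {
                if (term.isEmpty()) {
                    continue; // split() can yield a leading empty token
                }
                // Increment this document's frequency in the term's postings list.
                index.computeIfAbsent(term, t -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1", "hama runs bsp jobs");
        docs.put("doc2", "bsp tasks exchange messages, bsp tasks sync");
        build(docs).forEach((term, postings) -> System.out.println(term + " -> " + postings));
    }
}
```

With the documents stored as (document ID, content) pairs, the document ID
arrives as the record key, so the task never needs to ask for a filename.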

On Wed, May 22, 2013 at 4:42 PM, Steven van Beelen <sm...@gmail.com> wrote:
> For a project I'm trying to implement an Inverted Indexing algorithm, which
> has a 'term' and 'postingslist', in which the postings list consists of a
> 'document id' and 'payload' (in my case term frequency per document).
> I was thinking of inserting multiple different documents and taking the
> filename as documentID, hence the necessity.
> But I've found a way to work around this problem of mine by using different
> input which does not require the filename to be retrievable in a BSP task.
>
> If I will be needing it later on in my project and am working on it, I'll
> let you know.
>
> Thanks for the help thus far!



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: BSP Task Input/InputSplit Filename

Posted by Steven van Beelen <sm...@gmail.com>.
For a project I'm trying to implement an inverted indexing algorithm, which
maps each 'term' to a 'postings list'; each posting consists of a
'document id' and a 'payload' (in my case, the term frequency per document).
I was thinking of feeding in multiple different documents and using the
filename as the document ID, hence the need.
But I've found a way to work around this problem by using a different
input, which does not require the filename to be retrievable in a BSP task.

If I need it later on in my project and start working on it, I'll
let you know.

Thanks for the help thus far!



On Wed, May 22, 2013 at 1:16 AM, Edward J. Yoon <ed...@apache.org> wrote:

> Hi,
>
> Short answer is no, we don't provide API for what you are trying to do.
>
> However, it can be added easily. See BSPPeerImpl.initInput() method,
> InputSplit interface and FileSplit classes.
>
> Why do you need that function? If there's reasonable necessity, Let's
> add it together.

Re: BSP Task Input/InputSplit Filename

Posted by "Edward J. Yoon" <ed...@apache.org>.
Hi,

The short answer is no: we don't provide an API for what you are trying to do.

However, it could be added easily. See the BSPPeerImpl.initInput() method,
the InputSplit interface, and the FileSplit class.

Why do you need that function? If there's a reasonable need, let's
add it together.
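
As a rough illustration of what such an addition might look like, here is a
self-contained Java sketch. None of these types are Hama's real classes:
SplitSketch, FileSplitSketch, PeerSketch, and the getSplitPath() accessor are
hypothetical stand-ins that only mirror the idea of a peer remembering the
split it was given and exposing the path when that split is file-backed.

```java
import java.util.Optional;

// Hypothetical stand-in for a generic input split (not Hama's InputSplit).
interface SplitSketch {
    long getLength();
}

// Hypothetical stand-in for a file-backed split that knows its path.
final class FileSplitSketch implements SplitSketch {
    private final String path;
    private final long start;
    private final long length;

    FileSplitSketch(String path, long start, long length) {
        this.path = path;
        this.start = start;
        this.length = length;
    }

    String getPath() { return path; }
    long getStart() { return start; }
    @Override public long getLength() { return length; }
}

// Hypothetical peer that keeps a reference to the split assigned to it.
final class PeerSketch {
    private final SplitSketch split; // assumed to be stored when input is initialized

    PeerSketch(SplitSketch split) {
        this.split = split;
    }

    // Hypothetical accessor: a path exists only for file-backed splits.
    Optional<String> getSplitPath() {
        if (split instanceof FileSplitSketch) {
            return Optional.of(((FileSplitSketch) split).getPath());
        }
        return Optional.empty();
    }
}

class SplitPathDemo {
    public static void main(String[] args) {
        PeerSketch peer = new PeerSketch(new FileSplitSketch("/data/docs/part-00000", 0L, 1024L));
        System.out.println(peer.getSplitPath().orElse("(no file-backed split)"));
    }
}
```

Returning an Optional keeps the accessor usable for tasks that have no input
or whose split is not backed by a file.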

On Tue, May 21, 2013 at 7:04 PM, Steven van Beelen <sm...@gmail.com> wrote:
> Hi all,
>
> The title says it: is there a way to retrieve the filename of the
> input/inputsplit a BSP Task is working on? I've been looking for some time
> in the docs and source files, but cannot seem to find if one is able to
> retrieve the filename/pathname from the input used.
>
> Cheers



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: BSP Task Input/InputSplit Filename

Posted by Steven van Beelen <sm...@gmail.com>.
Thanks for the notes, Chia-Hung Lin.

I'm familiar with setting my own variables in the Configuration and with
retrieving the input directory via the constant JOB_INPUT_DIR.
But what if you have more than one file, and each task gets its own file?
How does a task know which of the files you set by:

    conf.set("my.path1", "/path/to/file1");
    conf.set("my.path2", "/path/to/file2");
    ....

is the one it is working on? Or am I missing something trivial here?
That is what I'm trying to figure out.
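
One possible convention, sketched here in plain Java, is to number the keys
and let each task look up the entry matching its own task index. This only
helps when the task opens its file itself; it says nothing about which input
split the framework hands to which task. java.util.Properties stands in for
the job Configuration, and the "my.path." key prefix, the class name
PerTaskPathSketch, and the taskIndex parameter are assumptions made for
illustration.

```java
import java.util.Properties;

// Plain-Java sketch; Properties is a stand-in for the job configuration.
class PerTaskPathSketch {

    // Hypothetical key prefix; keys become "my.path.0", "my.path.1", ...
    static final String KEY_PREFIX = "my.path.";

    // Job side: store one path per task under a numbered key.
    static void setPaths(Properties conf, String... paths) {
        for (int i = 0; i < paths.length; i++) {
            conf.setProperty(KEY_PREFIX + i, paths[i]);
        }
    }

    // Task side: look up the path for this task's index (null if none was set).
    static String pathFor(Properties conf, int taskIndex) {
        return conf.getProperty(KEY_PREFIX + taskIndex);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        setPaths(conf, "/path/to/file1", "/path/to/file2");
        for (int task = 0; task < 2; task++) {
            System.out.println("task " + task + " reads " + pathFor(conf, task));
        }
    }
}
```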


On Tue, May 21, 2013 at 4:27 PM, Chia-Hung Lin <cl...@googlemail.com> wrote:

> My understanding is that you can configure the path while constructing the job:
>
>     HamaConfiguration conf = new HamaConfiguration();
>     conf.set("my.path", "/path/to/file");
>     BSPJob bsp = new BSPJob(conf, MyBSP.class);
>
> Then, within your customized BSP class (e.g. MyBSP), call
>
>     peer.getConfiguration().get("my.path");
>
> on the BSPPeer instance to retrieve the file name.
>
> Alternatively, FileInputFormat stores the input path in the Configuration
> under the key "bsp.input.dir", so the path should be obtainable with
> conf.get("bsp.input.dir") while performing the computation.

Re: BSP Task Input/InputSplit Filename

Posted by Chia-Hung Lin <cl...@googlemail.com>.
My understanding is that you can configure the path while constructing the job:

    HamaConfiguration conf = new HamaConfiguration();
    conf.set("my.path", "/path/to/file");
    BSPJob bsp = new BSPJob(conf, MyBSP.class);

Then, within your customized BSP class (e.g. MyBSP), call

    peer.getConfiguration().get("my.path");

on the BSPPeer instance to retrieve the file name.

Alternatively, FileInputFormat stores the input path in the Configuration
under the key "bsp.input.dir", so the path should be obtainable with
conf.get("bsp.input.dir") while performing the computation.

On 21 May 2013 18:04, Steven van Beelen <sm...@gmail.com> wrote:
> Hi all,
>
> The title says it: is there a way to retrieve the filename of the
> input/inputsplit a BSP Task is working on? I've been looking for some time
> in the docs and source files, but cannot seem to find if one is able to
> retrieve the filename/pathname from the input used.
>
> Cheers