Posted to dev@joshua.apache.org by John Hewitt <jo...@seas.upenn.edu> on 2016/11/23 17:11:39 UTC

Any symal experts?

Hey everyone,

I'm packaging up a Java port of Fast Align for Joshua and integrating it into
the pipeline.
Fast Align does not produce symmetrical alignments on its own -- it relies on a
separate tool that I haven't ported to Java.
We package symal (which symmetrizes alignments) with Joshua right now for
GIZA++, so I'm attempting to re-use that.
However, symal reads the .bal format, which it does not document.
It gets away with this because files from GIZA++ are piped through
giza2bal.pl, which itself is not well documented.
I'm attempting to write an equivalent converter, say, fastalign2bal.py.
With a bit of tinkering, I worked out the .bal format from an example:

1

7 jehovah said to moses and aaron :  # 3 2 2 4 5 6 8

8 i řekl hospodin mojžíšovi a aronovi takto :  # 2 2 1 4 5 6 6 7

A template for this would be:

1

NUM_TGT_TOKENS [tgt_token1 tgt_token2 ... tgt_tokenN] # [alignment1 alignment2 ... alignmentN]
NUM_SRC_TOKENS [src_token1 src_token2 ... src_tokenM] # [alignment1 alignment2 ... alignmentM]
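
For reference, here is a minimal sketch of what a fastalign2bal.py could look like under this template. It is not the actual script. It assumes that fast_align writes 0-based `src-tgt` pairs, that each .bal alignment value is the 1-based position of the aligned token in the other sentence, and that 0 marks an unaligned token. Which direction symal expects on which line is a guess, and the names `parse_links`, `per_token_column` and `to_bal` are made up for illustration.

```python
def parse_links(line):
    """Parse a fast_align-style line such as '0-0 1-2' into 0-based (src, tgt) pairs."""
    return [tuple(int(i) for i in pair.split("-")) for pair in line.split()]


def per_token_column(n_tokens, links):
    """Build one .bal alignment column from (token_index, other_index) pairs.

    Positions are converted from 0-based to 1-based. Using 0 for an
    unaligned token is an assumption, not something the thread confirms.
    """
    column = [0] * n_tokens
    for token_index, other_index in links:
        column[token_index] = other_index + 1  # a later link overwrites an earlier one
    return column


def to_bal(src_tokens, tgt_tokens, forward_links, reverse_links):
    """Render one sentence pair in the .bal layout from the template above.

    forward_links fill the target line, reverse_links fill the source line;
    which direction symal expects on which line is a guess.
    """
    tgt_column = per_token_column(len(tgt_tokens), [(t, s) for s, t in forward_links])
    src_column = per_token_column(len(src_tokens), [(s, t) for s, t in reverse_links])

    def render(tokens, column):
        # Two spaces before '#' mirror the example above.
        return "{} {}  # {}".format(len(tokens), " ".join(tokens), " ".join(map(str, column)))

    return "\n".join(["1", render(tgt_tokens, tgt_column), render(src_tokens, src_column)]) + "\n"
```

With forward links `2-0 1-1 1-2 3-3 4-4 5-5 7-6` and reverse links `0-1 1-1 2-0 3-3 4-4 5-5 6-5 7-6`, this produces the two token lines of the example above, without the blank lines between them.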


However, I'm hitting some pretty nasty errors with symal when I pipe in
fastalign2bal.py output.
A few hours with gdb got me part of the way (as far as I can tell, the
formats are identical), but if anyone has experience with symal, I would
greatly appreciate some help.

-John

Re: Any symal experts?

Posted by Matt Post <po...@cs.jhu.edu>.
No worries, just curious. Thanks for the update.


> On Jan 9, 2017, at 10:35 PM, John Hewitt <jo...@seas.upenn.edu> wrote:
> 
> I have to admit, no. Projects in graduate courses got the best of my time
> at the end of last semester, and I took the winter break to stay away from
> work and recover a bit.
> 
> Back from break now; will give an update soon.
> 
> -John
> 


Re: Any symal experts?

Posted by John Hewitt <jo...@seas.upenn.edu>.
I have to admit, no. Projects in graduate courses got the best of my time
at the end of last semester, and I took the winter break to stay away from
work and recover a bit.

Back from break now; will give an update soon.

-John

On Tue, Jan 3, 2017 at 12:03 PM, Matt Post <po...@cs.jhu.edu> wrote:

> John — Any updates on here?

Re: Any symal experts?

Posted by Matt Post <po...@cs.jhu.edu>.
John — Any updates on here?


> On Nov 23, 2016, at 12:28 PM, Matt Post <po...@cs.jhu.edu> wrote:
> 
> I think it will be much less of a headache. The GIZA++ code is notorious for being unreadable, and the Perl piece of that pipeline only hurts (even though Philipp's Perl is unusually clear). I think adding atools to your port is the way to go, and that it's written in C++ should facilitate that.


Re: Any symal experts?

Posted by Matt Post <po...@cs.jhu.edu>.
I think it will be much less of a headache. The GIZA++ code is notorious for being unreadable, and the Perl piece of that pipeline only hurts (even though Philipp's Perl is unusually clear). I think adding atools to your port is the way to go, and that it's written in C++ should facilitate that.




> On Nov 23, 2016, at 12:25 PM, John Hewitt <jo...@seas.upenn.edu> wrote:
> 
> It'll be a headache because it also has no documentation, but to be fair it
> may be less of a headache / a better long-term solution than trying to move
> forward with this hackier solution.
> 
> I'll keep the symal use on the backburner and start putting together an
> atools port.
> 
> -John


Re: Any symal experts?

Posted by John Hewitt <jo...@seas.upenn.edu>.
It'll be a headache because it also has no documentation, but to be fair it
may be less of a headache / a better long-term solution than trying to move
forward with this hackier solution.

I'll keep the symal use on the backburner and start putting together an
atools port.

-John

On Wed, Nov 23, 2016 at 12:18 PM, Matt Post <po...@cs.jhu.edu> wrote:

> John — I suggest trying to ditch those GIZA++ tools entirely. fast_align
> indeed replaced them with "atools"; how much work would it be to port that?

Re: Any symal experts?

Posted by Matt Post <po...@cs.jhu.edu>.
John — I suggest trying to ditch those GIZA++ tools entirely. fast_align indeed replaced them with "atools"; how much work would it be to port that?
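
To give a feel for the scale of such a port, here is a minimal sketch of the two simplest symmetrization heuristics, intersection and union, over forward and reverse alignments. It is not a port of atools and does not cover the grow-diag family of heuristics. It assumes both inputs use 0-based `src-tgt` pairs in the same orientation, and the function names are made up for illustration.

```python
def parse_links(line):
    """Parse '0-0 1-2' into a set of 0-based (src, tgt) pairs."""
    return {tuple(int(i) for i in pair.split("-")) for pair in line.split()}


def format_links(links):
    """Write links back out as 'i-j' pairs, sorted by source then target index."""
    return " ".join("{}-{}".format(s, t) for s, t in sorted(links))


def intersect(forward, reverse):
    """Keep only links proposed in both directions (fewer, more reliable links)."""
    return forward & reverse


def union(forward, reverse):
    """Keep links proposed in either direction (more links, less reliable)."""
    return forward | reverse


def symmetrize(forward_line, reverse_line, heuristic=intersect):
    """Combine one forward and one reverse alignment line with the given heuristic."""
    return format_links(heuristic(parse_links(forward_line), parse_links(reverse_line)))
```

For example, `symmetrize("0-0 1-1 2-1", "0-0 1-1 1-2")` keeps only the links both directions agree on, while passing `heuristic=union` keeps all four distinct links.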

