Posted to dev@pig.apache.org by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org> on 2010/06/27 12:03:49 UTC

[jira] Created: (PIG-1468) DataByteArray.compareTo() does not compare in lexicographic order

DataByteArray.compareTo() does not compare in lexicographic order
-----------------------------------------------------------------

                 Key: PIG-1468
                 URL: https://issues.apache.org/jira/browse/PIG-1468
             Project: Pig
          Issue Type: Bug
            Reporter: Gianmarco De Francisci Morales


The compareTo() method of org.apache.pig.data.DataByteArray does not compare items in lexicographic order.
Instead, it compares the bytes that compose the DataByteArray as signed values.

So, for example, 0xff compares as less than 0x00.
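
A minimal sketch of the effect described above. It is plain Java and not taken from the Pig source; the class and method names (ByteSignDemo, asSigned, asUnsigned) are made up for illustration. Java's byte is signed, so 0xff widens to -1 unless it is masked first:

```java
class ByteSignDemo {
    // Plain widening keeps the sign bit: (byte) 0xff becomes -1.
    static int asSigned(byte b) {
        return b;
    }

    // Masking with 0xff maps the byte to the range 0..255: (byte) 0xff becomes 255.
    static int asUnsigned(byte b) {
        return b & 0xff;
    }

    public static void main(String[] args) {
        byte high = (byte) 0xff;
        byte low = 0x00;
        // Signed view: 0xff sorts before 0x00.
        System.out.println(asSigned(high) < asSigned(low));
        // Unsigned view: 0xff sorts after 0x00, as lexicographic order expects.
        System.out.println(asUnsigned(high) > asUnsigned(low));
    }
}
```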

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: load files

Posted by Jeff Zhang <zj...@gmail.com>.
part-xxxxx is for the old Hadoop MapReduce API, and part-m-xxxxx and
part-r-xxxxx are for the new Hadoop MapReduce API.
You can use Hadoop's globStatus("part-*") to handle both of these cases.
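
To illustrate why a single "part-*" pattern covers both naming schemes, here is a small sketch. It assumes the java.nio glob matcher as a local stand-in; it does not call Hadoop's FileSystem.globStatus, and the class name PartFileGlob and the sample file names are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class PartFileGlob {
    // Local stand-in for the "part-*" glob; Hadoop itself is not involved here.
    private static final PathMatcher PART_FILES =
            FileSystems.getDefault().getPathMatcher("glob:part-*");

    // Returns true when the bare file name matches the "part-*" pattern.
    static boolean isPartFile(String fileName) {
        return PART_FILES.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        // Hypothetical output file names: old style, new map style, new reduce style, and a non-data entry.
        String[] names = {"part-00000", "part-m-00000", "part-r-00000", "_logs"};
        for (String name : names) {
            System.out.println(name + " -> " + isPartFile(name));
        }
    }
}
```

With Hadoop on the classpath, the same pattern would be passed to globStatus as part of a Path instead.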



2010/6/28 Gang Luo <lg...@yahoo.com.cn>:
> Thanks, Jeff.
> In Pig, the file names look like this: part-m-xxxxx (for map results) or part-r-xxxxx (for reduce results), which differ from the Hadoop style (part-xxxxx). Can we control the name of each generated file? If so, how?
>
> Thanks,
> -Gang
>
>
>
> ----- Original Message ----
> From: Jeff Zhang <zj...@gmail.com>
> To: pig-dev@hadoop.apache.org
> Sent: 2010/6/27 (Sun) 9:22:30 PM
> Subject: Re: load files
>
> Hi Gang,
>
> The path specified in load can be either a file or a directory; you
> can also use Hadoop's globStatus.  The path specified in store is
> a directory.
>
>
>
> On Mon, Jun 28, 2010 at 4:44 AM, Gang Luo <lg...@yahoo.com.cn> wrote:
>> Hi all,
>> When we specify the input path for a load operator, is it a file or a directory? Similarly, when we use store-load to connect two MR operators, is the path specified in the store and the load a directory?
>>
>> Thanks,
>> -Gang
>>
>>
>>
>>
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>
>
>
>



-- 
Best Regards

Jeff Zhang

Re: load files

Posted by Gang Luo <lg...@yahoo.com.cn>.
Thanks, Jeff.
In Pig, the file names look like this: part-m-xxxxx (for map results) or part-r-xxxxx (for reduce results), which differ from the Hadoop style (part-xxxxx). Can we control the name of each generated file? If so, how?

Thanks,
-Gang



----- Original Message ----
From: Jeff Zhang <zj...@gmail.com>
To: pig-dev@hadoop.apache.org
Sent: 2010/6/27 (Sun) 9:22:30 PM
Subject: Re: load files

Hi Gang,

The path specified in load can be either a file or a directory; you
can also use Hadoop's globStatus.  The path specified in store is
a directory.



On Mon, Jun 28, 2010 at 4:44 AM, Gang Luo <lg...@yahoo.com.cn> wrote:
> Hi all,
> When we specify the input path for a load operator, is it a file or a directory? Similarly, when we use store-load to connect two MR operators, is the path specified in the store and the load a directory?
>
> Thanks,
> -Gang
>
>
>
>
>



-- 
Best Regards

Jeff Zhang



      

Re: load files

Posted by Jeff Zhang <zj...@gmail.com>.
Hi Gang,

The path specified in load can be either a file or a directory; you
can also use Hadoop's globStatus.  The path specified in store is
a directory.



On Mon, Jun 28, 2010 at 4:44 AM, Gang Luo <lg...@yahoo.com.cn> wrote:
> Hi all,
> When we specify the input path for a load operator, is it a file or a directory? Similarly, when we use store-load to connect two MR operators, is the path specified in the store and the load a directory?
>
> Thanks,
> -Gang
>
>
>
>
>



-- 
Best Regards

Jeff Zhang

load files

Posted by Gang Luo <lg...@yahoo.com.cn>.
Hi all,
When we specify the input path for a load operator, is it a file or a directory? Similarly, when we use store-load to connect two MR operators, is the path specified in the store and the load a directory?

Thanks,
-Gang



      

[jira] Commented: (PIG-1468) DataByteArray.compareTo() does not compare in lexicographic order

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883001#action_12883001 ] 

Gianmarco De Francisci Morales commented on PIG-1468:
-----------------------------------------------------

The test failures seem to be related to Hudson; I see no failures locally.
I think the patch is ready for review.

> DataByteArray.compareTo() does not compare in lexicographic order
> -----------------------------------------------------------------
>
>                 Key: PIG-1468
>                 URL: https://issues.apache.org/jira/browse/PIG-1468
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Gianmarco De Francisci Morales
>            Assignee: Gianmarco De Francisci Morales
>         Attachments: PIG-1468.patch
>
>
> The compareTo() method of org.apache.pig.data.DataByteArray does not compare items in lexicographic order.
> Actually, it takes into account the signum of the bytes that compose the DataByteArray.
> So, for example, 0xff compares to less than 0x00



[jira] Commented: (PIG-1468) DataByteArray.compareTo() does not compare in lexicographic order

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886819#action_12886819 ] 

Daniel Dai commented on PIG-1468:
---------------------------------

The other concern is that we only change DataByteArray, not byte. So the comparators for DataType.BYTEARRAY and DataType.BYTE would differ, which will cause confusion.



[jira] Updated: (PIG-1468) DataByteArray.compareTo() does not compare in lexicographic order

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1468:
------------------------------------------------

    Attachment: PIG-1468.patch



[jira] Commented: (PIG-1468) DataByteArray.compareTo() does not compare in lexicographic order

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883557#action_12883557 ] 

Gianmarco De Francisci Morales commented on PIG-1468:
-----------------------------------------------------

I ran some tests. I see a ~1% decrease in performance overall.

I looked around the codebase for references to the method, and it does not seem there is any place that relies on the specific ordering.

Here is the code I used:

{code}
import java.util.Random;

// Micro-benchmark: signed vs. unsigned lexicographic comparison of byte arrays.
public class TestSpeed {
    private static final int TIMES = (int) 10e6;       // 10,000,000 outer iterations
    private static final int NUM_ARRAYS = (int) 10e5;  // 1,000,000 arrays per batch
    private static final int ARRAY_LENGTH = 50;        // bytes per array

    // Signed comparison: each byte widens to an int in -128..127, so 0xff (-1) < 0x00.
    private static int compareSigned(byte[] b1, byte[] b2) {
        int i;
        for (i = 0; i < b1.length; i++) {
            if (i >= b2.length)
                return 1;   // b2 is a proper prefix of b1
            int a = b1[i];
            int b = b2[i];
            if (a < b)
                return -1;
            else if (a > b)
                return 1;
        }
        if (i < b2.length)
            return -1;      // b1 is a proper prefix of b2
        return 0;
    }

    // Unsigned comparison: "& 0xff" maps each byte to 0..255, so 0xff (255) > 0x00.
    private static int compareUnsigned(byte[] b1, byte[] b2) {
        int i;
        for (i = 0; i < b1.length; i++) {
            if (i >= b2.length)
                return 1;   // b2 is a proper prefix of b1
            int a = b1[i] & 0xff;
            int b = b2[i] & 0xff;
            if (a < b)
                return -1;
            else if (a > b)
                return 1;
        }
        if (i < b2.length)
            return -1;      // b1 is a proper prefix of b2
        return 0;
    }

    public static void main(String[] args) {
        long before, after;
        // Fixed seed so every run compares the same random data.
        Random rand = new Random(123456789);
        byte[][] batch1 = new byte[NUM_ARRAYS][];
        byte[][] batch2 = new byte[NUM_ARRAYS][];
        for (int i = 0; i < NUM_ARRAYS; i++) {
            batch1[i] = new byte[ARRAY_LENGTH];
            batch2[i] = new byte[ARRAY_LENGTH];
            rand.nextBytes(batch1[i]);
            rand.nextBytes(batch2[i]);
        }

        // Note: j is bounded by ARRAY_LENGTH, so each pass compares only the
        // first ARRAY_LENGTH pairs of arrays, not all NUM_ARRAYS pairs.
        before = System.currentTimeMillis();
        for (int i = 0; i < TIMES; i++)
            for (int j = 0; j < ARRAY_LENGTH; j++)
                compareSigned(batch1[j], batch2[j]);
        after = System.currentTimeMillis();
        System.out.println("Time for signed comparison (ms): " + (after - before));

        before = System.currentTimeMillis();
        for (int i = 0; i < TIMES; i++)
            for (int j = 0; j < ARRAY_LENGTH; j++)
                compareUnsigned(batch1[j], batch2[j]);
        after = System.currentTimeMillis();
        System.out.println("Time for UNsigned comparison (ms): " + (after - before));
    }
}
{code}



[jira] Commented: (PIG-1468) DataByteArray.compareTo() does not compare in lexicographic order

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887178#action_12887178 ] 

Gianmarco De Francisci Morales commented on PIG-1468:
-----------------------------------------------------

It is quite easy to fix DataType.compare() to take the unsigned logic into account.
But I am starting to feel that all of this is probably not worth the trouble.
This would make DataType.compare() for Bytes behave differently from Byte.compareTo().
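
To give a sense of how far an unsigned ordering would drift from Byte.compareTo(), the following sketch counts the ordered pairs of byte values on which the two orderings disagree. It is an illustration only; the class and method names are hypothetical and nothing here calls Pig's DataType.compare().

```java
class ByteOrderMismatch {
    // Counts ordered pairs (a, b) of byte values where the signed order given by
    // Byte.compareTo() and the unsigned order (after masking with 0xff) differ.
    static int countDisagreements() {
        int count = 0;
        for (int a = Byte.MIN_VALUE; a <= Byte.MAX_VALUE; a++) {
            for (int b = Byte.MIN_VALUE; b <= Byte.MAX_VALUE; b++) {
                // Byte.compareTo() returns a difference, so normalize it to -1, 0 or 1.
                int signed = Integer.signum(Byte.valueOf((byte) a).compareTo((byte) b));
                int unsigned = Integer.signum(Integer.compare(a & 0xff, b & 0xff));
                if (signed != unsigned) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("Pairs ordered differently: " + countDisagreements());
    }
}
```

The two orderings disagree exactly when one byte is negative and the other is not, which is half of the 65,536 ordered pairs.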




[jira] Commented: (PIG-1468) DataByteArray.compareTo() does not compare in lexicographic order

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882985#action_12882985 ] 

Hadoop QA commented on PIG-1468:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448155/PIG-1468.patch
  against trunk revision 958053.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/354/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/354/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/354/console

This message is automatically generated.



[jira] Commented: (PIG-1468) DataByteArray.compareTo() does not compare in lexicographic order

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883029#action_12883029 ] 

Daniel Dai commented on PIG-1468:
---------------------------------

The problem is that Java has no unsigned byte type. Although DataByteArray holds data of semantically unknown type, so the actual order does not matter, it is more natural to see 0xff > 0x00. I have two comments:
1. Can we measure whether the two additional "and" operations cause any noticeable performance degradation?
2. Is this logic used anywhere else? It is important to keep them consistent.



[jira] Assigned: (PIG-1468) DataByteArray.compareTo() does not compare in lexicographic order

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales reassigned PIG-1468:
---------------------------------------------------

    Assignee: Gianmarco De Francisci Morales



[jira] Updated: (PIG-1468) DataByteArray.compareTo() does not compare in lexicographic order

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1468:
------------------------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (PIG-1468) DataByteArray.compareTo() does not compare in lexicographic order

Posted by "Gianmarco De Francisci Morales (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883464#action_12883464 ] 

Gianmarco De Francisci Morales commented on PIG-1468:
-----------------------------------------------------

1) I will write a simple program to measure the performance impact.

2) I think this logic is not used anywhere else, but I will check.
Furthermore, this patch makes the ordering consistent with Hadoop's WritableComparator.compareBytes() (lexicographic order of binary data).
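
For reference, the lexicographic order of binary data mentioned above can be sketched with the JDK alone. This assumes Java 9 or later, where Arrays.compareUnsigned(byte[], byte[]) is available; it does not use Hadoop or Pig, and the class name UnsignedSort and the sample keys are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UnsignedSort {
    // Returns a sorted copy of the keys in unsigned lexicographic order.
    static List<byte[]> sortUnsigned(List<byte[]> keys) {
        List<byte[]> sorted = new ArrayList<>(keys);
        sorted.sort((x, y) -> Arrays.compareUnsigned(x, y));
        return sorted;
    }

    public static void main(String[] args) {
        List<byte[]> keys = new ArrayList<>();
        keys.add(new byte[] {(byte) 0xff});
        keys.add(new byte[] {0x7f, 0x01});
        keys.add(new byte[] {0x00});
        keys.add(new byte[] {0x7f});
        // Java prints bytes as signed values, so 0xff appears as -1 in the output.
        for (byte[] key : sortUnsigned(keys)) {
            System.out.println(Arrays.toString(key));
        }
    }
}
```

In this order a key that is a proper prefix of another sorts first, and 0xff sorts after every other single-byte key.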
