Posted to user@hive.apache.org by Martijn van Leeuwen <ic...@gmail.com> on 2013/02/02 20:21:08 UTC

Combine multiple row values based upon a condition.

Hi all,

I am new to Apache Hive and I am doing some tests to see if it fits my needs. One of my questions is whether it is possible to "peek" at the next row in order to find out if the values should be combined. Let me explain with an example.

Let's say my data looks like this:

Id name offset
1 Jan 100
2 Janssen 104
3 Klaas 150
4 Jan 160
5 Janssen 164

And my output to another table should be this:

Id fullname offsets
1 Jan Janssen [ 100, 160 ]

I would like to combine the name values from two rows where the offsets of the two rows are no more than 1 character apart.

Is this type of data manipulation possible, and if it is, could someone point me in the right direction, hopefully with some explanation?

Kind regards
Martijn

Re: Combine multiple row values based upon a condition.

Posted by Dean Wampler <de...@thinkbiganalytics.com>.
If you really only need to consider adjacent rows, it might just be easier
to write a UDF or use streaming, where your code remembers the last record
seen and emits a new record if you want to do the join with the current
record.
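
A minimal sketch of the "remember the last record" idea, written in Python, might look like the following. It is only an illustration and has not been run against Hive: the function name `merge_adjacent`, the `(name, offset)` row shape and the `max_gap` parameter are assumptions, and the gap rule (distance from the end of the previous name plus one space) is just one reading of "no more than 1 character apart".

```python
def merge_adjacent(rows, max_gap=1):
    """Yield (name, offset) records, joining rows that sit next to each other.

    `rows` is assumed to be an iterable of (name, offset) sorted by offset.
    Only the last record seen is kept in memory.
    """
    prev = None  # (full_name, start_offset, end_offset) of the last record seen
    for name, offset in rows:
        end = offset + len(name)
        if prev is not None:
            prev_name, prev_start, prev_end = prev
            # Distance between "previous name plus one space" and this name.
            if offset - (prev_end + 1) <= max_gap:
                # Adjacent: extend the remembered record instead of emitting it.
                prev = (prev_name + " " + name, prev_start, end)
                continue
            yield prev_name, prev_start
        prev = (name, offset, end)
    if prev is not None:
        yield prev[0], prev[1]


if __name__ == "__main__":
    sample = [("Jan", 100), ("Janssen", 104), ("Klaas", 150),
              ("Jan", 160), ("Janssen", 164)]
    for record in merge_adjacent(sample):
        print(record)
```

With the sample rows from the original post, this emits "Jan Janssen" twice (at 100 and 160) and leaves "Klaas" untouched; grouping the offsets per name would be a separate step.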

-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330

Re: Combine multiple row values based upon a condition.

Posted by Martijn van Leeuwen <ic...@gmail.com>.
Thank you guys! I will have a look at this.

Kind regards,
Martijn

On Feb 3, 2013, at 8:36 PM, Edward Capriolo <ed...@gmail.com> wrote:

> You may want to look at sort by, distribute by, and cluster by. This
> syntax controls which Reducers the data end up on and how it is sorted
> on each reducer.

Re: Combine multiple row values based upon a condition.

Posted by John Omernik <jo...@omernik.com>.
Yes, I agree with this. If you did a Hive transform to, say, a Python script
that collected your offsets per doc id, and used "distribute by" to ensure
that the script you sent the data to had all the data to work with, you
could then do the logic to join what you need to join together and emit
the resulting set.
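
As a rough sketch of that approach, the script below reads the tab-separated columns of the `entities_extract` table from stdin and writes one line per document and full name. The file name `merge_names.py`, the function names, the `max_gap` parameter, the same-type check and the comma-joined output column are all assumptions made for illustration; the Hive query in the docstring is one possible invocation and has not been run against this data.

```python
"""Hypothetical merge_names.py, intended for use with Hive's TRANSFORM.

One possible invocation (table and columns as posted in this thread):

    ADD FILE merge_names.py;
    SELECT TRANSFORM (doc_id, name, type, len, offset)
           USING 'python merge_names.py'
           AS (doc_id, fullname, offsets)
    FROM (SELECT doc_id, name, type, len, offset
          FROM entities_extract
          DISTRIBUTE BY doc_id
          SORT BY doc_id, offset) t;
"""
import sys
from collections import OrderedDict
from itertools import groupby


def parse(lines):
    """Turn tab-separated lines into (doc_id, name, type, len, offset) tuples."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 5:  # rows with another column count are skipped
            doc_id, name, etype, length, offset = fields
            yield doc_id, name, etype, int(length), int(offset)


def merge_doc(rows, max_gap=1):
    """Return {fullname: [start offsets]} for the rows of one document."""
    merged = OrderedDict()
    current = None  # [fullname, type, start, end] of the entity being built
    for _doc_id, name, etype, length, offset in rows:
        if (current is not None and current[1] == etype
                and offset - (current[3] + 1) <= max_gap):
            current[0] += " " + name  # adjacent: extend the current entity
            current[3] = offset + length
            continue
        if current is not None:
            merged.setdefault(current[0], []).append(current[2])
        current = [name, etype, offset, offset + length]
    if current is not None:
        merged.setdefault(current[0], []).append(current[2])
    return merged


def transform(lines, max_gap=1):
    """Yield one output line per (doc_id, fullname)."""
    # groupby relies on all rows of a document arriving consecutively,
    # which is what DISTRIBUTE BY doc_id plus SORT BY is meant to provide.
    for doc_id, rows in groupby(parse(lines), key=lambda r: r[0]):
        for fullname, offsets in merge_doc(rows, max_gap).items():
            yield "\t".join([doc_id, fullname, ",".join(map(str, offsets))])


if __name__ == "__main__":
    for out in transform(sys.stdin):
        print(out)
```

Because each document is handled on its own, only one document's names are held in memory at a time. Names that are not merged are still emitted with their own offsets, which may or may not be what is wanted.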

On Sun, Feb 3, 2013 at 1:36 PM, Edward Capriolo <ed...@gmail.com> wrote:

> You may want to look at sort by, distribute by, and cluster by. This
> syntax controls which Reducers the data end up on and how it is sorted
> on each reducer.

Re: Combine multiple row values based upon a condition.

Posted by Edward Capriolo <ed...@gmail.com>.
You may want to look at sort by, distribute by, and cluster by. This
syntax controls which Reducers the data end up on and how it is sorted
on each reducer.
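
To make the effect of these clauses concrete, here is a toy model in Python. It is not how Hive is implemented: the function `distribute_and_sort` and its parameters are invented for this illustration, and `zlib.crc32` merely stands in for whatever hash Hive uses. The point is only that DISTRIBUTE BY sends all rows with the same key to the same reducer, SORT BY orders rows within each reducer rather than globally, and CLUSTER BY x is shorthand for DISTRIBUTE BY x SORT BY x.

```python
import zlib
from collections import defaultdict


def distribute_and_sort(rows, num_reducers, distribute_key, sort_key):
    """Toy model: route rows to reducers by key hash, then sort per reducer."""
    reducers = defaultdict(list)
    for row in rows:
        # A deterministic hash, so equal keys always land on the same reducer.
        key_bytes = str(distribute_key(row)).encode("utf-8")
        bucket = zlib.crc32(key_bytes) % num_reducers
        reducers[bucket].append(row)
    # Sorting happens inside each reducer only; there is no global order.
    return {b: sorted(rs, key=sort_key) for b, rs in reducers.items()}


if __name__ == "__main__":
    rows = [("docA", "Putt", 17340), ("docB", "Ken", 5),
            ("docA", "Stephanie", 17330), ("docB", "Lea", 1)]
    result = distribute_and_sort(rows, 3,
                                 distribute_key=lambda r: r[0],
                                 sort_key=lambda r: (r[0], r[2]))
    for bucket, bucket_rows in sorted(result.items()):
        print(bucket, bucket_rows)
```

A script that compares neighbouring rows needs exactly this guarantee: every row of a document on one reducer, in offset order.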

Re: Combine multiple row values based upon a condition.

Posted by Martijn van Leeuwen <ic...@gmail.com>.
Yes, there is. Each document has a UUID as its identifier. The actual output of my MapReduce job that produces the list of person names looks like this:

docId     Name     Type     length     offset
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Lea     PERSON     3     10858
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Lea     PERSON     3     11063
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Ken     PERSON     3     11186
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Marottoli     PERSON     9     11234
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Berkowitz     PERSON     9     17073
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Lea     PERSON     3     17095
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie     PERSON     9     17330
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Putt     PERSON     4     17340
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie     PERSON     9     17347
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie     PERSON     9     17480
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Putt     PERSON     4     17490
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Berkowitz     PERSON     9     19498
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie     PERSON     9     19530

Use the following code to produce a table inside Hive.

DROP TABLE IF EXISTS entities_extract;

    CREATE TABLE entities_extract (doc_id STRING, name STRING, type STRING, len INT, offset BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE
    LOCATION '/research/45924/hive/entities_extract';

LOAD DATA LOCAL INPATH '/home/researcher/hadoop-runnables/files/entitie_extract_by_doc.txt' OVERWRITE INTO TABLE entities_extract;



On Feb 3, 2013, at 8:07 PM, John Omernik <jo...@omernik.com> wrote:

> Is there something akin to a document ID, so we can ensure all rows belonging to the same document are sent to one mapper?

Re: Combine multiple row values based upon a condition.

Posted by John Omernik <jo...@omernik.com>.
Is there something akin to a document ID, so we can ensure all rows
belonging to the same document are sent to one mapper?

Re: Combine multiple row values based upon a condition.

Posted by Martijn van Leeuwen <ic...@gmail.com>.
Hi John,

Here is some background about my data and what I want as output.

I have 215K documents containing text. From those text files I extract the names of persons, organisations and locations using the Stanford NER library (see http://nlp.stanford.edu/software/CRF-NER.shtml).

Looking at the following line:

Jan Janssen was on his way to Klaas to sell the vehicle Jan Janssen stole from his father.

When the classifier is done annotating, the line looks like this:

<PERSON>Jan<PERSON><OFFSET>0<OFFSET> <PERSON>Janssen<PERSON><OFFSET>5<OFFSET> was on his way to <PERSON>Klaas<PERSON><OFFSET>26<OFFSET> to sell the vehicle <PERSON>Jan<PERSON><OFFSET>48<OFFSET> <PERSON>Janssen<PERSON><OFFSET>50<OFFSET> stole from his father.

When looping through this annotated line you can save the persons and their offsets (please note that the offset is a LONG value) inside a Map, for example:

MAP<STRING, LONG> entities

Jan, 0
Janssen, 5 
Klaas, 26
Jan, 48
Janssen, 50

Jan Janssen in the line is actually one person and not two. Jan occurs at offset 0. To determine if Janssen belongs to Jan, I could subtract Jan's offset (0), the length of Jan (3) and 1 (whitespace) from Janssen's offset (5), which gives 5 - (0 + 3 + 1) = 1. If the outcome isn't greater than 1, then combine the two persons into one person.

(offset Janssen) - (offset Jan + length Jan + whitespace) not greater than 1

If this is true, then combine the two persons and save this inside a new MAP<STRING, LONG[]> like
Jan Janssen, [ 0 ].

The next time we come across Jan Janssen inside the text, just save the offset, which produces the following MAP<STRING, LONG[]>:

Jan Janssen, [0, 48] 
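
The rule and the resulting map can be worked through in a few lines of Python. This is a sketch under assumptions: the helper names `gap` and `combine` are made up here, a plain dict of lists stands in for MAP<STRING, LONG[]>, and names that are not merged with a neighbour are left out because the example output only lists the combined name. Note that with the offsets exactly as posted, the second pair gives 50 - (48 + 3 + 1) = -2, which still satisfies "not greater than 1".

```python
from collections import OrderedDict


def gap(prev_name, prev_offset, next_offset):
    """(offset next) - (offset prev + length prev + 1 whitespace)."""
    return next_offset - (prev_offset + len(prev_name) + 1)


def combine(entities, max_gap=1):
    """Build {full name: [offsets]} from (name, offset) pairs in text order."""
    combined = OrderedDict()
    i = 0
    while i < len(entities) - 1:
        (name, offset), (nxt, nxt_offset) = entities[i], entities[i + 1]
        if gap(name, offset, nxt_offset) <= max_gap:
            # Same person: store the offset of the first part only.
            combined.setdefault(name + " " + nxt, []).append(offset)
            i += 2  # both rows are consumed
        else:
            i += 1
    return combined


if __name__ == "__main__":
    entities = [("Jan", 0), ("Janssen", 5), ("Klaas", 26),
                ("Jan", 48), ("Janssen", 50)]
    print(gap("Jan", 0, 5))          # 5 - (0 + 3 + 1)
    print(dict(combine(entities)))
```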

I hope this clarifies my question.
If things are still unclear, please don't hesitate to ask me to clarify my question further.

Kind regards,
Martijn



Re: Combine multiple row values based upon a condition.

Posted by John Omernik <jo...@omernik.com>.
Well, there are some methods that may work, but I'd have to understand your
data and your constraints more. It sounds like you want to sort by offset,
then look at one row and then the next row to determine if the two items
should be joined. It "looks" like you are doing a string comparison between
numbers ("100" to "104": there is only one "position" out of three that is
different, 0 vs 4). Trouble is, look at id 3 and id 4. 150 to 160 is only
one position different as well; are you looking for Klaas Jan? Also, is the
ID field filled from the first match? It seems like you have some very odd
data here. I don't think you've provided enough information on the data for
us to be able to help you.


