Posted to user@hadoop.apache.org by Shashidhar Rao <ra...@gmail.com> on 2015/01/03 17:06:00 UTC

XML files in Hadoop

Hi,

Can someone help me by suggesting the best way to solve this use case?

1. XML files keep flowing in from an external system and need to be stored in
HDFS.
2. These files can be stored directly in a NoSQL database, e.g. any
XML-supporting NoSQL store, or
3. These files need to be processed and stored in one of the databases
(HBase, Hive, etc.).
4. There won't be any updates, only reads; data has to be retrieved based on
some queries, a dashboard has to be created, and bits of analytics are needed.

The XML files are huge and the expected cluster size is roughly around 12
nodes.
I am stuck on the storage part: if I convert the XML to JSON and store it
in HBase, the XML-to-JSON processing will be huge.

It will be read-only, with no updates.

Please suggest how to store these xml files.

Thanks
Shashi

Re: XML files in Hadoop

Posted by Shashidhar Rao <ra...@gmail.com>.
Hi Peyman,

Sure, will try using two Hive tables for the conversion.
It was awesome discussing with you. Thanks a lot.


Shashi

On Sat, Jan 3, 2015 at 10:53 PM, Peyman Mohajerian <mo...@gmail.com>
wrote:

> I would recommend as the first step not to use Flume, but rather land the
> data in HDFS in the source format (XML) and use Hive to convert the format
> from XML to Parquet. That is much simpler to do than using Flume. Flume
> only makes sense if you don't care about the original file format and want to
> ingest the data fast to meet some SLA.
> Flume has a good user guide page if you google it.
> In Hive you need two tables: one that reads the XML data using the XML serde
> (an external table), and a second one in Parquet format; you insert into
> the second table from the source, and that will easily do the format
> conversion.
>
> On Sat, Jan 3, 2015 at 9:16 AM, Shashidhar Rao <raoshashidhar123@gmail.com
> > wrote:
>
>> Hi Peyman,
>>
>> Really appreciate your suggestion.
>> But say, if Tableau has to be used to generate reports, then Tableau
>> works great with Hive.
>>
>> Just one more question: can Flume be used to convert XML data to Parquet?
>> I will store these into Hive as Parquet and generate reports using
>> Tableau.
>>
>> If Flume can convert XML to Parquet, do I need external tools? Can you
>> please provide me some links on how to convert XML to Parquet using Flume?
>> This is because predictive analytics may be used on the Hive data in the
>> end phase of the project.
>>
>> Thanks
>> Shashi
>>
>> On Sat, Jan 3, 2015 at 10:32 PM, Peyman Mohajerian <mo...@gmail.com>
>> wrote:
>>
>>> Hi Shashi,
>>> Sure, you can use JSON instead of Parquet. I was thinking in terms of
>>> using Hive for processing the data, but if you'd like to use Drill (which I
>>> heard is a good choice), then just convert the data to JSON. You don't
>>> have to deal with Parquet or Hive in that case; just use Flume to convert
>>> XML to JSON (there are many other choices to do that within the cluster
>>> too) and then use Drill to read and process the data.
>>>
>>> Thanks,
>>> Peyman
>>>
>>>
>>> On Sat, Jan 3, 2015 at 8:53 AM, Shashidhar Rao <
>>> raoshashidhar123@gmail.com> wrote:
>>>
>>>> Hi Peyman,
>>>>
>>>> Thanks a lot for your suggestions, I really appreciate them and got some
>>>> ideas from them. Here's how I want to proceed.
>>>> 1.  Using Flume convert xml to JSON/Parquet before it reaches HDFS.
>>>> 2.  Store parquet converted files into Hive.
>>>> 3.  Query using Apache Drill in SQL dialect.
>>>>
>>>> But one thing: can you please help me with whether, instead of converting
>>>> to Parquet, I could convert into JSON and then store it in Hive in Parquet
>>>> format? Is this a feasible option?
>>>> The reason I want to convert to json is that Apache Drill works very
>>>> well with JSON format.
>>>>
>>>> Thanks
>>>> Shashi
>>>>
>>>> On Sat, Jan 3, 2015 at 10:08 PM, Peyman Mohajerian <mo...@gmail.com>
>>>> wrote:
>>>>
>>>>> You can land the data in HDFS as XML files and use 'hive xml serde' to
>>>>> read the data and write it back in a more optimal format, e.g. ORC or
>>>>> parquet (depending somewhat on your choice of Hadoop distro). Querying XML
>>>>> data directly via Hive is also doable but slow. Converting to Avro is also
>>>>> doable but in my experience not as fast as ORC or Parquet. Columnar formats
>>>>> give you better performance, but Avro has its own strengths, e.g.
>>>>> managing schema changes better.
>>>>> You can also convert the format before you land the data in HDFS, e.g.
>>>>> using Flume or some other tool for changing the format in flight.
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Jan 3, 2015 at 8:33 AM, Shashidhar Rao <
>>>>> raoshashidhar123@gmail.com> wrote:
>>>>>
>>>>>> Sorry, not Hive files but xml files: converting the xml files to some Avro
>>>>>> format and storing these into Hive will be fast.
>>>>>>
>>>>>> On Sat, Jan 3, 2015 at 9:59 PM, Shashidhar Rao <
>>>>>> raoshashidhar123@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> The exact number of files is not known, but it will run into millions of
>>>>>>> files, depending on the client, who collects terabytes of xml data
>>>>>>> every day. Basically, storing is just one part, but the main part will be
>>>>>>> how to query these data: aggregation, counts, and some analytics over
>>>>>>> these data. Fast retrieval is required, e.g. for a particular year,
>>>>>>> what are the top 10 products, top ten manufacturers and top ten stores, etc.
>>>>>>>
>>>>>>> Will Hive be a better choice? And will converting these Hive files
>>>>>>> to some format work out?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Shashi
>>>>>>>
>>>>>>> On Sat, Jan 3, 2015 at 9:44 PM, Wilm Schumacher <
>>>>>>> wilm.schumacher@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> How many xml files are you planning to store? Perhaps it is possible to
>>>>>>>> store them directly on HDFS and save the metadata in HBase. This sounds
>>>>>>>> more reasonable to me.
>>>>>>>>
>>>>>>>> If the number of xml files is too large (millions or billions), then you
>>>>>>>> can use Hadoop MapFiles to put files together, e.g. based on years or
>>>>>>>> months.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Wilm
>>>>>>>>
>>>>>>>> Am 03.01.2015 um 17:06 schrieb Shashidhar Rao:
>>>>>>>> > Hi,
>>>>>>>> >
>>>>>>>> > Can someone help me by suggesting the best way to solve this use
>>>>>>>> case
>>>>>>>> >
>>>>>>>> > 1. XML files keep flowing from external system and need to be
>>>>>>>> stored
>>>>>>>> > into HDFS.
>>>>>>>> > 2. These files  can be directly stored using NoSql database e.g
>>>>>>>> any
>>>>>>>> > xml supported NoSql. or
>>>>>>>> > 3. These files need to be processed and stored in one of the
>>>>>>>> database
>>>>>>>> > HBase, Hive etc.
>>>>>>>> > 4. There won't be any updates only read and has to be retrieved
>>>>>>>> based
>>>>>>>> > on some queries and a dashboard has to be created , bits of
>>>>>>>> analytics
>>>>>>>> >
>>>>>>>> > The xml files are huge and expected number of nodes is roughly
>>>>>>>> around
>>>>>>>> > 12 nodes.
>>>>>>>> > I am stuck in the storage part say if I convert xml to json and
>>>>>>>> store
>>>>>>>> > it into HBase , the processing part from xml to json will be huge.
>>>>>>>> >
>>>>>>>> > It will be only reading and no updates.
>>>>>>>> >
>>>>>>>> > Please suggest how to store these xml files.
>>>>>>>> >
>>>>>>>> > Thanks
>>>>>>>> > Shashi
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: XML files in Hadoop

Posted by Peyman Mohajerian <mo...@gmail.com>.
I would recommend as the first step not to use Flume, but rather land the
data in HDFS in the source format (XML) and use Hive to convert the format
from XML to Parquet. That is much simpler to do than using Flume. Flume
only makes sense if you don't care about the original file format and want to
ingest the data fast to meet some SLA.
Flume has a good user guide page if you google it.
In Hive you need two tables: one that reads the XML data using the XML serde
(an external table), and a second one in Parquet format; you insert into
the second table from the source, and that will easily do the format
conversion.
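
For illustration, here is a minimal sketch of that two-table approach in
HiveQL. It assumes the third-party hivexmlserde library
(com.ibm.spss.hive.serde2.xml.XmlSerDe) has been added to the session; the
jar path, XML layout, table and column names are all made up for the example,
not taken from this thread.

-- Hypothetical external table over the raw XML, read via the XML serde.
ADD JAR /path/to/hivexmlserde.jar;

CREATE EXTERNAL TABLE products_xml (
  id        STRING,
  name      STRING,
  price     DOUBLE,
  sale_year INT
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "column.xpath.id"        = "/product/id/text()",
  "column.xpath.name"      = "/product/name/text()",
  "column.xpath.price"     = "/product/price/text()",
  "column.xpath.sale_year" = "/product/year/text()"
)
STORED AS
  INPUTFORMAT  'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/data/raw/xml'
TBLPROPERTIES (
  "xmlinput.start" = "<product>",
  "xmlinput.end"   = "</product>"
);

-- Second table stored as Parquet; the INSERT ... SELECT rewrites the data
-- in the columnar format, which is the "format conversion" described above.
CREATE TABLE products_parquet (
  id        STRING,
  name      STRING,
  price     DOUBLE,
  sale_year INT
)
STORED AS PARQUET;

INSERT INTO TABLE products_parquet
SELECT id, name, price, sale_year
FROM products_xml;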

On Sat, Jan 3, 2015 at 9:16 AM, Shashidhar Rao <ra...@gmail.com>
wrote:

> Hi Peyman,
>
> Really appreciate your suggestion.
> But say, if Tableau has to be used to generate reports, then Tableau works
> great with Hive.
>
> Just one more question: can Flume be used to convert XML data to Parquet?
> I will store these into Hive as Parquet and generate reports using Tableau.
>
> If Flume can convert XML to Parquet, do I need external tools? Can you
> please provide me some links on how to convert XML to Parquet using Flume?
> This is because predictive analytics may be used on the Hive data in the end
> phase of the project.
>
> Thanks
> Shashi
>
> On Sat, Jan 3, 2015 at 10:32 PM, Peyman Mohajerian <mo...@gmail.com>
> wrote:
>
>> Hi Shashi,
>> Sure, you can use JSON instead of Parquet. I was thinking in terms of
>> using Hive for processing the data, but if you'd like to use Drill (which I
>> heard is a good choice), then just convert the data to JSON. You don't
>> have to deal with Parquet or Hive in that case; just use Flume to convert
>> XML to JSON (there are many other choices to do that within the cluster
>> too) and then use Drill to read and process the data.
>>
>> Thanks,
>> Peyman
>>
>>
>> On Sat, Jan 3, 2015 at 8:53 AM, Shashidhar Rao <
>> raoshashidhar123@gmail.com> wrote:
>>
>>> Hi Peyman,
>>>
>>> Thanks a lot for your suggestions, I really appreciate them and got some
>>> ideas from them. Here's how I want to proceed.
>>> 1.  Using Flume convert xml to JSON/Parquet before it reaches HDFS.
>>> 2.  Store parquet converted files into Hive.
>>> 3.  Query using Apache Drill in SQL dialect.
>>>
>>> But one thing: can you please help me with whether, instead of converting
>>> to Parquet, I could convert into JSON and then store it in Hive in Parquet
>>> format? Is this a feasible option?
>>> The reason I want to convert to json is that Apache Drill works very
>>> well with JSON format.
>>>
>>> Thanks
>>> Shashi
>>>
>>> On Sat, Jan 3, 2015 at 10:08 PM, Peyman Mohajerian <mo...@gmail.com>
>>> wrote:
>>>
>>>> You can land the data in HDFS as XML files and use 'hive xml serde' to
>>>> read the data and write it back in a more optimal format, e.g. ORC or
>>>> parquet (depending somewhat on your choice of Hadoop distro). Querying XML
>>>> data directly via Hive is also doable but slow. Converting to Avro is also
>>>> doable but in my experience not as fast as ORC or Parquet. Columnar formats
>>>> give you better performance, but Avro has its own strengths, e.g.
>>>> managing schema changes better.
>>>> You can also convert the format before you land the data in HDFS, e.g.
>>>> using Flume or some other tool for changing the format in flight.
>>>>
>>>>
>>>>
>>>> On Sat, Jan 3, 2015 at 8:33 AM, Shashidhar Rao <
>>>> raoshashidhar123@gmail.com> wrote:
>>>>
>>>>> Sorry, not Hive files but xml files: converting the xml files to some Avro
>>>>> format and storing these into Hive will be fast.
>>>>>
>>>>> On Sat, Jan 3, 2015 at 9:59 PM, Shashidhar Rao <
>>>>> raoshashidhar123@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The exact number of files is not known, but it will run into millions of
>>>>>> files, depending on the client, who collects terabytes of xml data
>>>>>> every day. Basically, storing is just one part, but the main part will be
>>>>>> how to query these data: aggregation, counts, and some analytics over
>>>>>> these data. Fast retrieval is required, e.g. for a particular year,
>>>>>> what are the top 10 products, top ten manufacturers and top ten stores, etc.
>>>>>>
>>>>>> Will Hive be a better choice? And will converting these Hive files
>>>>>> to some format work out?
>>>>>>
>>>>>> Thanks
>>>>>> Shashi
>>>>>>
>>>>>> On Sat, Jan 3, 2015 at 9:44 PM, Wilm Schumacher <
>>>>>> wilm.schumacher@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> How many xml files are you planning to store? Perhaps it is possible to
>>>>>>> store them directly on HDFS and save the metadata in HBase. This sounds
>>>>>>> more reasonable to me.
>>>>>>>
>>>>>>> If the number of xml files is too large (millions or billions), then you
>>>>>>> can use Hadoop MapFiles to put files together, e.g. based on years or
>>>>>>> months.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Wilm
>>>>>>>
>>>>>>> Am 03.01.2015 um 17:06 schrieb Shashidhar Rao:
>>>>>>> > Hi,
>>>>>>> >
>>>>>>> > Can someone help me by suggesting the best way to solve this use
>>>>>>> case
>>>>>>> >
>>>>>>> > 1. XML files keep flowing from external system and need to be
>>>>>>> stored
>>>>>>> > into HDFS.
>>>>>>> > 2. These files  can be directly stored using NoSql database e.g any
>>>>>>> > xml supported NoSql. or
>>>>>>> > 3. These files need to be processed and stored in one of the
>>>>>>> database
>>>>>>> > HBase, Hive etc.
>>>>>>> > 4. There won't be any updates only read and has to be retrieved
>>>>>>> based
>>>>>>> > on some queries and a dashboard has to be created , bits of
>>>>>>> analytics
>>>>>>> >
>>>>>>> > The xml files are huge and expected number of nodes is roughly
>>>>>>> around
>>>>>>> > 12 nodes.
>>>>>>> > I am stuck in the storage part say if I convert xml to json and
>>>>>>> store
>>>>>>> > it into HBase , the processing part from xml to json will be huge.
>>>>>>> >
>>>>>>> > It will be only reading and no updates.
>>>>>>> >
>>>>>>> > Please suggest how to store these xml files.
>>>>>>> >
>>>>>>> > Thanks
>>>>>>> > Shashi
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: XML files in Hadoop

Posted by Shashidhar Rao <ra...@gmail.com>.
Hi Peyman,

Really appreciate your suggestion.
But say, if Tableau has to be used to generate reports, then Tableau works
great with Hive.

Just one more question: can Flume be used to convert XML data to Parquet?
I will store these into Hive as Parquet and generate reports using Tableau.

If Flume can convert XML to Parquet, do I need external tools? Can you
please provide me some links on how to convert XML to Parquet using Flume?
This is because predictive analytics may be used on the Hive data in the end
phase of the project.
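
For illustration, this is the kind of aggregation a Tableau dashboard or
report might run against such a Parquet-backed Hive table. It reuses the
hypothetical products_parquet table and sale_year column from the sketch in
Peyman's reply above; none of these names come from the thread itself, and a
similar query could be issued through Drill instead of Hive.

-- Hypothetical example: top 10 products by record count for one year,
-- the sort of query a dashboard or Tableau extract would issue.
SELECT name,
       COUNT(*) AS num_records
FROM   products_parquet
WHERE  sale_year = 2014
GROUP  BY name
ORDER  BY num_records DESC
LIMIT  10;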

Thanks
Shashi

On Sat, Jan 3, 2015 at 10:32 PM, Peyman Mohajerian <mo...@gmail.com>
wrote:

> Hi Shashi,
> Sure, you can use JSON instead of Parquet. I was thinking in terms of using
> Hive for processing the data, but if you'd like to use Drill (which I heard
> is a good choice), then just convert the data to JSON. You don't have
> to deal with Parquet or Hive in that case; just use Flume to convert XML to
> JSON (there are many other choices to do that within the cluster too) and
> then use Drill to read and process the data.
>
> Thanks,
> Peyman
>
>
> On Sat, Jan 3, 2015 at 8:53 AM, Shashidhar Rao <raoshashidhar123@gmail.com
> > wrote:
>
>> Hi Peyman,
>>
>> Thanks a lot for your suggestions, I really appreciate them and got some
>> ideas from them. Here's how I want to proceed.
>> 1.  Using Flume convert xml to JSON/Parquet before it reaches HDFS.
>> 2.  Store parquet converted files into Hive.
>> 3.  Query using Apache Drill in SQL dialect.
>>
>> But one thing: can you please help me with whether, instead of converting
>> to Parquet, I could convert into JSON and then store it in Hive in Parquet
>> format? Is this a feasible option?
>> The reason I want to convert to json is that Apache Drill works very well
>> with JSON format.
>>
>> Thanks
>> Shashi
>>
>> On Sat, Jan 3, 2015 at 10:08 PM, Peyman Mohajerian <mo...@gmail.com>
>> wrote:
>>
>>> You can land the data in HDFS as XML files and use 'hive xml serde' to
>>> read the data and write it back in a more optimal format, e.g. ORC or
>>> parquet (depending somewhat on your choice of Hadoop distro). Querying XML
>>> data directly via Hive is also doable but slow. Converting to Avro is also
>>> doable but in my experience not as fast as ORC or Parquet. Columnar formats
>>> give you better performance, but Avro has its own strengths, e.g.
>>> managing schema changes better.
>>> You can also convert the format before you land the data in HDFS, e.g.
>>> using Flume or some other tool for changing the format in flight.
>>>
>>>
>>>
>>> On Sat, Jan 3, 2015 at 8:33 AM, Shashidhar Rao <
>>> raoshashidhar123@gmail.com> wrote:
>>>
>>>> Sorry, not Hive files but xml files: converting the xml files to some Avro
>>>> format and storing these into Hive will be fast.
>>>>
>>>> On Sat, Jan 3, 2015 at 9:59 PM, Shashidhar Rao <
>>>> raoshashidhar123@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The exact number of files is not known, but it will run into millions of
>>>>> files, depending on the client, who collects terabytes of xml data
>>>>> every day. Basically, storing is just one part, but the main part will be
>>>>> how to query these data: aggregation, counts, and some analytics over
>>>>> these data. Fast retrieval is required, e.g. for a particular year,
>>>>> what are the top 10 products, top ten manufacturers and top ten stores, etc.
>>>>>
>>>>> Will Hive be a better choice? And will converting these Hive files to
>>>>> some format work out?
>>>>>
>>>>> Thanks
>>>>> Shashi
>>>>>
>>>>> On Sat, Jan 3, 2015 at 9:44 PM, Wilm Schumacher <
>>>>> wilm.schumacher@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> How many xml files are you planning to store? Perhaps it is possible to
>>>>>> store them directly on HDFS and save the metadata in HBase. This sounds
>>>>>> more reasonable to me.
>>>>>>
>>>>>> If the number of xml files is too large (millions or billions), then you
>>>>>> can use Hadoop MapFiles to put files together, e.g. based on years or
>>>>>> months.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Wilm
>>>>>>
>>>>>> Am 03.01.2015 um 17:06 schrieb Shashidhar Rao:
>>>>>> > Hi,
>>>>>> >
>>>>>> > Can someone help me by suggesting the best way to solve this use
>>>>>> case
>>>>>> >
>>>>>> > 1. XML files keep flowing from external system and need to be stored
>>>>>> > into HDFS.
>>>>>> > 2. These files  can be directly stored using NoSql database e.g any
>>>>>> > xml supported NoSql. or
>>>>>> > 3. These files need to be processed and stored in one of the
>>>>>> database
>>>>>> > HBase, Hive etc.
>>>>>> > 4. There won't be any updates only read and has to be retrieved
>>>>>> based
>>>>>> > on some queries and a dashboard has to be created , bits of
>>>>>> analytics
>>>>>> >
>>>>>> > The xml files are huge and expected number of nodes is roughly
>>>>>> around
>>>>>> > 12 nodes.
>>>>>> > I am stuck in the storage part say if I convert xml to json and
>>>>>> store
>>>>>> > it into HBase , the processing part from xml to json will be huge.
>>>>>> >
>>>>>> > It will be only reading and no updates.
>>>>>> >
>>>>>> > Please suggest how to store these xml files.
>>>>>> >
>>>>>> > Thanks
>>>>>> > Shashi
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: XML files in Hadoop

Posted by Peyman Mohajerian <mo...@gmail.com>.
Hi Shashi,
Sure, you can use JSON instead of Parquet. I was thinking in terms of using
Hive for processing the data, but if you'd like to use Drill (which I hear
is a good choice), then just convert the data from XML to JSON. You don't
have to deal with Parquet or Hive in that case; just use Flume to convert
XML to JSON (there are many other choices to do that within the cluster too)
and then use Drill to read and process the data.
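For illustration only (the directory layout and field names here are
invented), once the JSON is sitting in HDFS a Drill query can hit the files
directly, with no Hive metastore involved:

  -- 'dfs' is Drill's file-system storage plugin; the path is a plain HDFS
  -- directory of JSON files
  SELECT t.product, COUNT(*) AS sales
  FROM dfs.`/data/sales_json/2014` t
  GROUP BY t.product
  ORDER BY sales DESC
  LIMIT 10;

  -- Drill can also write a Parquet copy itself via CTAS (dfs.tmp is a
  -- writable workspace by default, and Parquet is Drill's default output
  -- format)
  CREATE TABLE dfs.tmp.`sales_2014` AS
  SELECT * FROM dfs.`/data/sales_json/2014`;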

Thanks,
Peyman


On Sat, Jan 3, 2015 at 8:53 AM, Shashidhar Rao <ra...@gmail.com>
wrote:

> Hi Peyman,
>
> Thanks a lot for your suggestions, really appreciate and got some idea
> from your suggestions. Here's what I want to proceed.
> 1.  Using Flume convert xml to JSON/Parquet before it reaches HDFS.
> 2.  Store parquet converted files into Hive.
> 3.  Query using Apache Drill in SQL dialect.
>
> But one thing can you please help me if instead of converting to parquet
> if I convert into json and store in Hive as Parquet format , is this a
> feasible option.
> The reason I want to convert to json is that Apache Drill works very well
> with JSON format.
>
> Thanks
> Shashi
>
> On Sat, Jan 3, 2015 at 10:08 PM, Peyman Mohajerian <mo...@gmail.com>
> wrote:
>
>> You can land the data in HDFS as XML files and use 'hive xml serde' to
>> read the data and write it back in a more optimal format, e.g. ORC or
>> parquet (depending somewhat on your choice of Hadoop distro). Querying XML
>> data directly via Hive is also doable but slow. Converting to Avro is also
>> doable but in my experience not as fast as ORC or Parquet. Columnar formats
>> work give you better performance but Avro has its own strength, e.g.
>> managing schema changes better.
>> You can also convert the format before you land the data in HDFS, e.g.
>> using Flume or some other tool for changing the format in flight.
>>
>>
>>
>> On Sat, Jan 3, 2015 at 8:33 AM, Shashidhar Rao <
>> raoshashidhar123@gmail.com> wrote:
>>
>>> Sorry , not Hive files but xml files to some Avro format and store these
>>> into Hive will be fast .
>>>
>>> On Sat, Jan 3, 2015 at 9:59 PM, Shashidhar Rao <
>>> raoshashidhar123@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Exact number of files is not known but it will run into millions of
>>>> files depending on client's request who collects terabytes of xml data
>>>> every day. Basically, storing is just one part but the main part will be
>>>> how to query these data like  aggregation, count and do some analytics over
>>>> these data. Fast retrieval is required , say for e.g for a particular year
>>>> what are the top 10 products, top ten manufacturers and top ten stores etc.
>>>>
>>>> Will Hive be a better choice ? And will converting these Hive files to
>>>> some format work out.
>>>>
>>>> Thanks
>>>> Shashi
>>>>
>>>> On Sat, Jan 3, 2015 at 9:44 PM, Wilm Schumacher <
>>>> wilm.schumacher@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> how many xml files are you planning to store? Perhaps it is possible to
>>>>> store them directly on hdfs and save meta data in hbase. This sounds
>>>>> more reasonable to me.
>>>>>
>>>>> If the number of xml files is to large (millions and billions), then
>>>>> you
>>>>> can use hadoop map files to put files together. E.g. based on years, or
>>>>> month.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Wilm
>>>>>
>>>>> Am 03.01.2015 um 17:06 schrieb Shashidhar Rao:
>>>>> > Hi,
>>>>> >
>>>>> > Can someone help me by suggesting the best way to solve this use case
>>>>> >
>>>>> > 1. XML files keep flowing from external system and need to be stored
>>>>> > into HDFS.
>>>>> > 2. These files  can be directly stored using NoSql database e.g any
>>>>> > xml supported NoSql. or
>>>>> > 3. These files need to be processed and stored in one of the database
>>>>> > HBase, Hive etc.
>>>>> > 4. There won't be any updates only read and has to be retrieved based
>>>>> > on some queries and a dashboard has to be created , bits of analytics
>>>>> >
>>>>> > The xml files are huge and expected number of nodes is roughly around
>>>>> > 12 nodes.
>>>>> > I am stuck in the storage part say if I convert xml to json and store
>>>>> > it into HBase , the processing part from xml to json will be huge.
>>>>> >
>>>>> > It will be only reading and no updates.
>>>>> >
>>>>> > Please suggest how to store these xml files.
>>>>> >
>>>>> > Thanks
>>>>> > Shashi
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: XML files in Hadoop

Posted by Shashidhar Rao <ra...@gmail.com>.
Hi Peyman,

Thanks a lot for your suggestions, I really appreciate them and they gave me
some ideas. Here's how I want to proceed.
1.  Using Flume, convert XML to JSON/Parquet before it reaches HDFS.
2.  Store the Parquet-converted files in Hive.
3.  Query using Apache Drill's SQL dialect.

But can you please help me with one thing: if instead of converting to
Parquet I convert to JSON and then store it in Hive as Parquet format, is
that a feasible option?
The reason I want to convert to JSON is that Apache Drill works very well
with JSON.
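Just to make the idea concrete (the jar path, table and column names below
are only placeholders), what I have in mind for step 2 is roughly:

  -- JSON written by Flume, read with the JSON SerDe that ships in
  -- hive-hcatalog-core
  ADD JAR /path/to/hive-hcatalog-core.jar;

  CREATE EXTERNAL TABLE sales_json (
    product      STRING,
    manufacturer STRING,
    store        STRING,
    sale_year    INT,
    amount       DOUBLE
  )
  ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
  STORED AS TEXTFILE
  LOCATION '/data/staging/sales_json';

  -- Copy into Parquet so the reporting queries scan a columnar format
  CREATE TABLE sales_parquet (
    product      STRING,
    manufacturer STRING,
    store        STRING,
    sale_year    INT,
    amount       DOUBLE
  )
  STORED AS PARQUET;

  INSERT OVERWRITE TABLE sales_parquet SELECT * FROM sales_json;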

Thanks
Shashi

On Sat, Jan 3, 2015 at 10:08 PM, Peyman Mohajerian <mo...@gmail.com>
wrote:

> You can land the data in HDFS as XML files and use 'hive xml serde' to
> read the data and write it back in a more optimal format, e.g. ORC or
> parquet (depending somewhat on your choice of Hadoop distro). Querying XML
> data directly via Hive is also doable but slow. Converting to Avro is also
> doable but in my experience not as fast as ORC or Parquet. Columnar formats
> work give you better performance but Avro has its own strength, e.g.
> managing schema changes better.
> You can also convert the format before you land the data in HDFS, e.g.
> using Flume or some other tool for changing the format in flight.
>
>
>
> On Sat, Jan 3, 2015 at 8:33 AM, Shashidhar Rao <raoshashidhar123@gmail.com
> > wrote:
>
>> Sorry , not Hive files but xml files to some Avro format and store these
>> into Hive will be fast .
>>
>> On Sat, Jan 3, 2015 at 9:59 PM, Shashidhar Rao <
>> raoshashidhar123@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Exact number of files is not known but it will run into millions of
>>> files depending on client's request who collects terabytes of xml data
>>> every day. Basically, storing is just one part but the main part will be
>>> how to query these data like  aggregation, count and do some analytics over
>>> these data. Fast retrieval is required , say for e.g for a particular year
>>> what are the top 10 products, top ten manufacturers and top ten stores etc.
>>>
>>> Will Hive be a better choice ? And will converting these Hive files to
>>> some format work out.
>>>
>>> Thanks
>>> Shashi
>>>
>>> On Sat, Jan 3, 2015 at 9:44 PM, Wilm Schumacher <
>>> wilm.schumacher@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> how many xml files are you planning to store? Perhaps it is possible to
>>>> store them directly on hdfs and save meta data in hbase. This sounds
>>>> more reasonable to me.
>>>>
>>>> If the number of xml files is to large (millions and billions), then you
>>>> can use hadoop map files to put files together. E.g. based on years, or
>>>> month.
>>>>
>>>> Regards,
>>>>
>>>> Wilm
>>>>
>>>> Am 03.01.2015 um 17:06 schrieb Shashidhar Rao:
>>>> > Hi,
>>>> >
>>>> > Can someone help me by suggesting the best way to solve this use case
>>>> >
>>>> > 1. XML files keep flowing from external system and need to be stored
>>>> > into HDFS.
>>>> > 2. These files  can be directly stored using NoSql database e.g any
>>>> > xml supported NoSql. or
>>>> > 3. These files need to be processed and stored in one of the database
>>>> > HBase, Hive etc.
>>>> > 4. There won't be any updates only read and has to be retrieved based
>>>> > on some queries and a dashboard has to be created , bits of analytics
>>>> >
>>>> > The xml files are huge and expected number of nodes is roughly around
>>>> > 12 nodes.
>>>> > I am stuck in the storage part say if I convert xml to json and store
>>>> > it into HBase , the processing part from xml to json will be huge.
>>>> >
>>>> > It will be only reading and no updates.
>>>> >
>>>> > Please suggest how to store these xml files.
>>>> >
>>>> > Thanks
>>>> > Shashi
>>>>
>>>>
>>>
>>
>

Re: XML files in Hadoop

Posted by Peyman Mohajerian <mo...@gmail.com>.
You can land the data in HDFS as XML files and use 'hive xml serde' to read
the data and write it back in a more optimal format, e.g. ORC or parquet
(depending somewhat on your choice of Hadoop distro). Querying XML data
directly via Hive is also doable but slow. Converting to Avro is also
doable but in my experience not as fast as ORC or Parquet. Columnar formats
give you better performance, but Avro has its own strengths, e.g. it handles
schema changes better.
You can also convert the format before you land the data in HDFS, e.g.
using Flume or some other tool for changing the format in flight.
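As a rough sketch only (this assumes the third-party hivexmlserde library,
and the table, column names and XPath expressions below are made up for
illustration), the read-XML-then-rewrite step could look something like:

  ADD JAR /path/to/hivexmlserde.jar;

  -- External table over the raw XML landed in HDFS (schema is hypothetical)
  CREATE EXTERNAL TABLE sales_xml (
    product      STRING,
    manufacturer STRING,
    store        STRING,
    sale_year    INT,
    amount       DOUBLE
  )
  ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
  WITH SERDEPROPERTIES (
    "column.xpath.product"      = "/sale/product/text()",
    "column.xpath.manufacturer" = "/sale/manufacturer/text()",
    "column.xpath.store"        = "/sale/store/text()",
    "column.xpath.sale_year"    = "/sale/year/text()",
    "column.xpath.amount"       = "/sale/amount/text()"
  )
  STORED AS
    INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  LOCATION '/data/raw/sales_xml'
  TBLPROPERTIES (
    "xmlinput.start" = "<sale",
    "xmlinput.end"   = "</sale>"
  );

  -- Rewrite into a columnar format for faster queries
  CREATE TABLE sales_parquet STORED AS PARQUET
  AS SELECT * FROM sales_xml;

If your distro's Hive version does not support Parquet CTAS, creating the
Parquet table explicitly and doing an INSERT ... SELECT works the same way.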



On Sat, Jan 3, 2015 at 8:33 AM, Shashidhar Rao <ra...@gmail.com>
wrote:

> Sorry , not Hive files but xml files to some Avro format and store these
> into Hive will be fast .
>
> On Sat, Jan 3, 2015 at 9:59 PM, Shashidhar Rao <raoshashidhar123@gmail.com
> > wrote:
>
>> Hi,
>>
>> Exact number of files is not known but it will run into millions of files
>> depending on client's request who collects terabytes of xml data every day.
>> Basically, storing is just one part but the main part will be how to query
>> these data like  aggregation, count and do some analytics over these data.
>> Fast retrieval is required , say for e.g for a particular year what are the
>> top 10 products, top ten manufacturers and top ten stores etc.
>>
>> Will Hive be a better choice ? And will converting these Hive files to
>> some format work out.
>>
>> Thanks
>> Shashi
>>
>> On Sat, Jan 3, 2015 at 9:44 PM, Wilm Schumacher <
>> wilm.schumacher@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> how many xml files are you planning to store? Perhaps it is possible to
>>> store them directly on hdfs and save meta data in hbase. This sounds
>>> more reasonable to me.
>>>
>>> If the number of xml files is to large (millions and billions), then you
>>> can use hadoop map files to put files together. E.g. based on years, or
>>> month.
>>>
>>> Regards,
>>>
>>> Wilm
>>>
>>> Am 03.01.2015 um 17:06 schrieb Shashidhar Rao:
>>> > Hi,
>>> >
>>> > Can someone help me by suggesting the best way to solve this use case
>>> >
>>> > 1. XML files keep flowing from external system and need to be stored
>>> > into HDFS.
>>> > 2. These files  can be directly stored using NoSql database e.g any
>>> > xml supported NoSql. or
>>> > 3. These files need to be processed and stored in one of the database
>>> > HBase, Hive etc.
>>> > 4. There won't be any updates only read and has to be retrieved based
>>> > on some queries and a dashboard has to be created , bits of analytics
>>> >
>>> > The xml files are huge and expected number of nodes is roughly around
>>> > 12 nodes.
>>> > I am stuck in the storage part say if I convert xml to json and store
>>> > it into HBase , the processing part from xml to json will be huge.
>>> >
>>> > It will be only reading and no updates.
>>> >
>>> > Please suggest how to store these xml files.
>>> >
>>> > Thanks
>>> > Shashi
>>>
>>>
>>
>

Re: XML files in Hadoop

Posted by Peyman Mohajerian <mo...@gmail.com>.
You can land the data in HDFS as XML files and use 'hive xml serde' to read
the data and write it back in a more optimal format, e.g. ORC or parquet
(depending somewhat on your choice of Hadoop distro). Querying XML data
directly via Hive is also doable but slow. Converting to Avro is also
doable but in my experience not as fast as ORC or Parquet. Columnar formats
work give you better performance but Avro has its own strength, e.g.
managing schema changes better.
You can also convert the format before you land the data in HDFS, e.g.
using Flume or some other tool for changing the format in flight.



On Sat, Jan 3, 2015 at 8:33 AM, Shashidhar Rao <ra...@gmail.com>
wrote:

> Sorry , not Hive files but xml files to some Avro format and store these
> into Hive will be fast .
>
> On Sat, Jan 3, 2015 at 9:59 PM, Shashidhar Rao <raoshashidhar123@gmail.com
> > wrote:
>
>> Hi,
>>
>> Exact number of files is not known but it will run into millions of files
>> depending on client's request who collects terabytes of xml data every day.
>> Basically, storing is just one part but the main part will be how to query
>> these data like  aggregation, count and do some analytics over these data.
>> Fast retrieval is required , say for e.g for a particular year what are the
>> top 10 products, top ten manufacturers and top ten stores etc.
>>
>> Will Hive be a better choice ? And will converting these Hive files to
>> some format work out.
>>
>> Thanks
>> Shashi
>>
>> On Sat, Jan 3, 2015 at 9:44 PM, Wilm Schumacher <
>> wilm.schumacher@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> how many xml files are you planning to store? Perhaps it is possible to
>>> store them directly on hdfs and save meta data in hbase. This sounds
>>> more reasonable to me.
>>>
>>> If the number of xml files is to large (millions and billions), then you
>>> can use hadoop map files to put files together. E.g. based on years, or
>>> month.
>>>
>>> Regards,
>>>
>>> Wilm
>>>
>>> Am 03.01.2015 um 17:06 schrieb Shashidhar Rao:
>>> > Hi,
>>> >
>>> > Can someone help me by suggesting the best way to solve this use case
>>> >
>>> > 1. XML files keep flowing from external system and need to be stored
>>> > into HDFS.
>>> > 2. These files  can be directly stored using NoSql database e.g any
>>> > xml supported NoSql. or
>>> > 3. These files need to be processed and stored in one of the database
>>> > HBase, Hive etc.
>>> > 4. There won't be any updates only read and has to be retrieved based
>>> > on some queries and a dashboard has to be created , bits of analytics
>>> >
>>> > The xml files are huge and expected number of nodes is roughly around
>>> > 12 nodes.
>>> > I am stuck in the storage part say if I convert xml to json and store
>>> > it into HBase , the processing part from xml to json will be huge.
>>> >
>>> > It will be only reading and no updates.
>>> >
>>> > Please suggest how to store these xml files.
>>> >
>>> > Thanks
>>> > Shashi
>>>
>>>
>>
>

Re: XML files in Hadoop

Posted by Peyman Mohajerian <mo...@gmail.com>.
You can land the data in HDFS as XML files and use 'hive xml serde' to read
the data and write it back in a more optimal format, e.g. ORC or parquet
(depending somewhat on your choice of Hadoop distro). Querying XML data
directly via Hive is also doable but slow. Converting to Avro is also
doable but in my experience not as fast as ORC or Parquet. Columnar formats
work give you better performance but Avro has its own strength, e.g.
managing schema changes better.
You can also convert the format before you land the data in HDFS, e.g.
using Flume or some other tool for changing the format in flight.



On Sat, Jan 3, 2015 at 8:33 AM, Shashidhar Rao <ra...@gmail.com>
wrote:

> Sorry , not Hive files but xml files to some Avro format and store these
> into Hive will be fast .
>
> On Sat, Jan 3, 2015 at 9:59 PM, Shashidhar Rao <raoshashidhar123@gmail.com
> > wrote:
>
>> Hi,
>>
>> Exact number of files is not known but it will run into millions of files
>> depending on client's request who collects terabytes of xml data every day.
>> Basically, storing is just one part but the main part will be how to query
>> these data like  aggregation, count and do some analytics over these data.
>> Fast retrieval is required , say for e.g for a particular year what are the
>> top 10 products, top ten manufacturers and top ten stores etc.
>>
>> Will Hive be a better choice ? And will converting these Hive files to
>> some format work out.
>>
>> Thanks
>> Shashi
>>
>> On Sat, Jan 3, 2015 at 9:44 PM, Wilm Schumacher <
>> wilm.schumacher@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> how many xml files are you planning to store? Perhaps it is possible to
>>> store them directly on hdfs and save meta data in hbase. This sounds
>>> more reasonable to me.
>>>
>>> If the number of xml files is to large (millions and billions), then you
>>> can use hadoop map files to put files together. E.g. based on years, or
>>> month.
>>>
>>> Regards,
>>>
>>> Wilm
>>>
>>> Am 03.01.2015 um 17:06 schrieb Shashidhar Rao:
>>> > Hi,
>>> >
>>> > Can someone help me by suggesting the best way to solve this use case
>>> >
>>> > 1. XML files keep flowing from external system and need to be stored
>>> > into HDFS.
>>> > 2. These files  can be directly stored using NoSql database e.g any
>>> > xml supported NoSql. or
>>> > 3. These files need to be processed and stored in one of the database
>>> > HBase, Hive etc.
>>> > 4. There won't be any updates only read and has to be retrieved based
>>> > on some queries and a dashboard has to be created , bits of analytics
>>> >
>>> > The xml files are huge and expected number of nodes is roughly around
>>> > 12 nodes.
>>> > I am stuck in the storage part say if I convert xml to json and store
>>> > it into HBase , the processing part from xml to json will be huge.
>>> >
>>> > It will be only reading and no updates.
>>> >
>>> > Please suggest how to store these xml files.
>>> >
>>> > Thanks
>>> > Shashi
>>>
>>>
>>
>

Re: XML files in Hadoop

Posted by Shashidhar Rao <ra...@gmail.com>.
Sorry, not Hive files -- I meant: will converting the xml files to some Avro
format and storing these into Hive be fast?

On Sat, Jan 3, 2015 at 9:59 PM, Shashidhar Rao <ra...@gmail.com>
wrote:

> Hi,
>
> Exact number of files is not known but it will run into millions of files
> depending on client's request who collects terabytes of xml data every day.
> Basically, storing is just one part but the main part will be how to query
> these data like  aggregation, count and do some analytics over these data.
> Fast retrieval is required , say for e.g for a particular year what are the
> top 10 products, top ten manufacturers and top ten stores etc.
>
> Will Hive be a better choice ? And will converting these Hive files to
> some format work out.
>
> Thanks
> Shashi
>
> On Sat, Jan 3, 2015 at 9:44 PM, Wilm Schumacher <wilm.schumacher@gmail.com
> > wrote:
>
>> Hi,
>>
>> how many xml files are you planning to store? Perhaps it is possible to
>> store them directly on hdfs and save meta data in hbase. This sounds
>> more reasonable to me.
>>
>> If the number of xml files is to large (millions and billions), then you
>> can use hadoop map files to put files together. E.g. based on years, or
>> month.
>>
>> Regards,
>>
>> Wilm
>>
>> Am 03.01.2015 um 17:06 schrieb Shashidhar Rao:
>> > Hi,
>> >
>> > Can someone help me by suggesting the best way to solve this use case
>> >
>> > 1. XML files keep flowing from external system and need to be stored
>> > into HDFS.
>> > 2. These files  can be directly stored using NoSql database e.g any
>> > xml supported NoSql. or
>> > 3. These files need to be processed and stored in one of the database
>> > HBase, Hive etc.
>> > 4. There won't be any updates only read and has to be retrieved based
>> > on some queries and a dashboard has to be created , bits of analytics
>> >
>> > The xml files are huge and expected number of nodes is roughly around
>> > 12 nodes.
>> > I am stuck in the storage part say if I convert xml to json and store
>> > it into HBase , the processing part from xml to json will be huge.
>> >
>> > It will be only reading and no updates.
>> >
>> > Please suggest how to store these xml files.
>> >
>> > Thanks
>> > Shashi
>>
>>
>

Re: XML files in Hadoop

Posted by Shashidhar Rao <ra...@gmail.com>.
Hi,

The exact number of files is not known, but it will run into millions,
depending on the client, who collects terabytes of xml data every day.
Basically, storing is just one part; the main part will be how to query
the data -- aggregations, counts and some analytics over it. Fast retrieval
is required, e.g. for a particular year, what are the top 10 products, top
ten manufacturers, top ten stores etc.

Will Hive be a better choice? And will converting these Hive files to some
format work out?
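
To make that concrete, the kind of dashboard query this implies would look
something like the sketch below, assuming the data has already been converted
into a Parquet-backed Hive table such as the sales_parquet example above
(endpoint, table and column names are illustrative only):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TopProductsByYear {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // hive-jdbc on the classpath
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // Top 10 products by number of sales records for one year.
             ResultSet rs = stmt.executeQuery(
                 "SELECT product, COUNT(*) AS sales"
               + " FROM sales_parquet"
               + " WHERE sale_year = 2014"
               + " GROUP BY product"
               + " ORDER BY sales DESC"
               + " LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}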

Thanks
Shashi

On Sat, Jan 3, 2015 at 9:44 PM, Wilm Schumacher <wi...@gmail.com>
wrote:

> Hi,
>
> how many xml files are you planning to store? Perhaps it is possible to
> store them directly on hdfs and save meta data in hbase. This sounds
> more reasonable to me.
>
> If the number of xml files is to large (millions and billions), then you
> can use hadoop map files to put files together. E.g. based on years, or
> month.
>
> Regards,
>
> Wilm
>
> Am 03.01.2015 um 17:06 schrieb Shashidhar Rao:
> > Hi,
> >
> > Can someone help me by suggesting the best way to solve this use case
> >
> > 1. XML files keep flowing from external system and need to be stored
> > into HDFS.
> > 2. These files  can be directly stored using NoSql database e.g any
> > xml supported NoSql. or
> > 3. These files need to be processed and stored in one of the database
> > HBase, Hive etc.
> > 4. There won't be any updates only read and has to be retrieved based
> > on some queries and a dashboard has to be created , bits of analytics
> >
> > The xml files are huge and expected number of nodes is roughly around
> > 12 nodes.
> > I am stuck in the storage part say if I convert xml to json and store
> > it into HBase , the processing part from xml to json will be huge.
> >
> > It will be only reading and no updates.
> >
> > Please suggest how to store these xml files.
> >
> > Thanks
> > Shashi
>
>

Re: XML files in Hadoop

Posted by Wilm Schumacher <wi...@gmail.com>.
Hi,

how many xml files are you planning to store? Perhaps it is possible to
store them directly on hdfs and save the metadata in hbase. This sounds
more reasonable to me.
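
A minimal sketch of that split, assuming an HBase table named xml_meta with a
column family 'm'; the paths, row-key scheme and attributes are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IngestXmlWithHBaseMeta {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        FileSystem fs = FileSystem.get(conf);

        // Land one incoming XML file in HDFS as-is.
        Path local = new Path("file:///staging/order-123.xml");
        Path hdfs = new Path("/data/xml/2015/01/order-123.xml");
        fs.copyFromLocalFile(local, hdfs);

        // Record where it lives, plus a few queryable attributes, in HBase.
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table meta = conn.getTable(TableName.valueOf("xml_meta"))) {
            Put put = new Put(Bytes.toBytes("2015-01-03#order-123.xml")); // date-prefixed row key
            put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("hdfs_path"),
                          Bytes.toBytes(hdfs.toString()));
            put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("source"),
                          Bytes.toBytes("external-system"));
            meta.put(put);
        }
    }
}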

If the number of xml files is too large (millions or billions), then you
can use hadoop map files to pack files together, e.g. by year or by
month.
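
Something along these lines, again only a sketch (the per-month layout, the
paths and the ASCII file names are assumptions):

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class PackXmlIntoMapFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path input = new Path("/data/landing/xml/2015-01");      // many small XML files
        Path output = new Path("/data/packed/xml/2015-01.map");  // one MapFile per month

        FileStatus[] files = fs.listStatus(input);
        // MapFile keys must be appended in sorted order.
        Arrays.sort(files, (a, b) -> a.getPath().getName().compareTo(b.getPath().getName()));

        // The pre-2.x constructor; newer Hadoop also offers an Options-based one.
        MapFile.Writer writer =
                new MapFile.Writer(conf, fs, output.toString(), Text.class, Text.class);
        try {
            for (FileStatus file : files) {
                byte[] xml = new byte[(int) file.getLen()];
                FSDataInputStream in = fs.open(file.getPath());
                try {
                    IOUtils.readFully(in, xml, 0, xml.length);
                } finally {
                    in.close();
                }
                // Key = original file name, value = the whole XML document as UTF-8 text.
                writer.append(new Text(file.getPath().getName()), new Text(xml));
            }
        } finally {
            writer.close();
        }
    }
}

A MapFile keeps an index alongside the data, so a single document can later be
looked up by its key with MapFile.Reader without scanning the whole container.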

Regards,

Wilm

Am 03.01.2015 um 17:06 schrieb Shashidhar Rao:
> Hi,
>
> Can someone help me by suggesting the best way to solve this use case
>
> 1. XML files keep flowing from external system and need to be stored
> into HDFS.
> 2. These files  can be directly stored using NoSql database e.g any
> xml supported NoSql. or
> 3. These files need to be processed and stored in one of the database
> HBase, Hive etc.
> 4. There won't be any updates only read and has to be retrieved based
> on some queries and a dashboard has to be created , bits of analytics
>
> The xml files are huge and expected number of nodes is roughly around
> 12 nodes.
> I am stuck in the storage part say if I convert xml to json and store
> it into HBase , the processing part from xml to json will be huge.
>
> It will be only reading and no updates.
>
> Please suggest how to store these xml files.
>
> Thanks
> Shashi

