
How to Offload Oracle and MySQL Databases into Hadoop using Apache Spark and Talend


In the Big Data space, a common pattern is offloading a traditional data warehouse into a Hadoop environment. Whether Hadoop becomes the primary store or only holds “cold” data, Talend makes the offload painless.

Many organizations trying to optimize their data architecture have leveraged Hadoop for their cold data or to maintain an archive. With its native code generation for Hadoop, Talend makes this process easy.

Talend already provides out-of-the-box connectors that support this pattern using Sqoop; here we will focus on how to do the same using Apache Spark.

Apache Spark is a fast and general engine for large-scale data processing. It is available in most recent Hadoop distributions (Cloudera, Hortonworks, MapR, AWS EMR, etc.). Built on a massively parallel processing (MPP) architecture, it lets you parallelize a data flow across the cluster to handle enterprise-scale workloads.

The fastest and best-known solution today for bringing data from your databases into Hadoop is Sqoop (under the hood, Sqoop runs a MapReduce process to perform the offload from the RDBMS to Hadoop). In this post I want to introduce an approach that serves the same purpose as Sqoop but uses Spark as the underlying framework/engine.
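To make the comparison concrete, here is a minimal sketch (outside of Talend) of what the equivalent single-table extraction looks like with Spark’s DataFrame API. The JDBC URL, credentials, table name, and HDFS path below are illustrative placeholders, not values from the job built in this post.

import org.apache.spark.sql.SparkSession

object SingleTableOffload {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdbms-to-hdfs-offload")
      .getOrCreate()

    // Read the source table over JDBC (MySQL shown; Oracle works the same way
    // with a jdbc:oracle:thin URL and the Oracle JDBC driver on the classpath).
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/sales")
      .option("dbtable", "my_table")
      .option("user", "etl_user")
      .option("password", "etl_password")
      .load()

    // Land the table in HDFS; Parquet is just one possible target format.
    df.write.parquet("hdfs:///data/offload/my_table")

    spark.stop()
  }
}

Talend generates and deploys equivalent Spark code for you; the sketch only shows the underlying mechanism that replaces Sqoop’s MapReduce import.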

In this blog post, I will first show how to use Spark to move a single table from Oracle or MySQL into Hadoop. Then, once we have a working job for that task, I will show how to make it generic so it can be driven by a list of tables to move from your database server to Hadoop.

For simplicity, we will focus on the following two scenarios:

  • How to move a table into HDFS from a Spark job.
  • How to automate the job above and turn it into a metadata-driven ingestion framework that works on a list of tables.

Moving a Table into HDFS from a Talend Spark Job

[Figure: Leveraging Talend to offload databases – single-table Spark offload job]

In this scenario, we created a very generic job that extracts from a database table and moves the data into HDFS using Apache Spark and a generic query statement such as:

"SELECT concat_ws('"+context.FIELD_SEPARATOR+"',  "+context.column_list+" ) as my_data FROM my_table".
) as my_data FROM my_table".

context.FIELD_SEPARATOR is a job-level context variable set to ‘,’, ‘;’, ‘|’, or another delimiter. context.column_list is a context variable holding the comma-separated list of fields to extract (for example: field1, field2, field3, etc.).

The Offload job executes the query statement natively on Hadoop using Spark; the generated code is deployed directly through YARN.
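To illustrate how such a query runs in Spark, here is a minimal sketch, assuming the JDBC DataFrame from the earlier sketch is available as df, and that fieldSeparator and columnList stand in for the two context variables; all names and paths are illustrative.

// Values mirroring the job's context variables (assumed for this sketch).
val fieldSeparator = ","                      // context.FIELD_SEPARATOR
val columnList     = "field1, field2, field3" // context.column_list

// Expose the JDBC DataFrame to Spark SQL under the name used in the query.
df.createOrReplaceTempView("my_table")

val delimited = spark.sql(
  s"SELECT concat_ws('$fieldSeparator', $columnList) AS my_data FROM my_table"
)

// Each row is now a single delimited string, which can be written as plain text to HDFS.
delimited.write.text("hdfs:///data/offload/my_table_delimited")

Because the whole statement is plain Spark SQL, changing the separator or the column list only requires changing the context variables, not the job itself.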

Automating the Job above and Turning It into a Metadata-Driven Ingestion Framework That Works on a List of Tables

[Figure: Leveraging Talend to offload databases – metadata-driven offload-preparation job]

The offload-preparation process starts at the database. Next, the list of tables is pulled and contextualized, along with the list of columns for each table (this prepares the variables passed to the Offload job). Once that is done, the parent job simply iterates over the tables to offload to Hadoop and calls the Offload job for each one. The Offload job is the one described in the section “Moving a Table into HDFS from a Talend Spark Job” above.
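As an illustration of the overall flow, here is a minimal sketch of such a metadata-driven loop in Spark, assuming a hypothetical control file on HDFS that lists the tables and columns to offload. The file name and layout, the JDBC URL, and the credentials are assumptions made for this sketch, not part of the Talend jobs described above.

// Hypothetical control file: hdfs:///metadata/offload_tables.csv
// with a header row and the layout: table_name,column_list
case class OffloadSpec(tableName: String, columnList: String)

val specs = spark.read
  .option("header", "true")
  .csv("hdfs:///metadata/offload_tables.csv")
  .collect()
  .map(row => OffloadSpec(row.getString(0), row.getString(1)))

// Iterate over the table list and run the single-table offload for each entry,
// mirroring what the Talend parent job does when it calls the Offload child job.
specs.foreach { spec =>
  val table = spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://dbhost:3306/sales")
    .option("dbtable", spec.tableName)
    .option("user", "etl_user")
    .option("password", "etl_password")
    .load()

  table.selectExpr(s"concat_ws(',', ${spec.columnList}) AS my_data")
    .write.text(s"hdfs:///data/offload/${spec.tableName}")
}

In Talend, the same pattern is expressed as a parent job that iterates over the table list and passes the table name, column list, and separator to the Offload child job as context variables.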


