What’s more exciting than data? Data about data!
Recently I had to assess the impact of data model changes within a transactional system feeding our data warehouse. I pointed our SQL scripts and ETL processes at the pre-production environment, ran the ETL jobs, and performed some basic regression analysis. For every table and column, I wanted to know whether there were any significant changes that warranted further investigation:
- Number of rows
- Number of nulls per column
- Minimum and maximum length of each column
- Number of distinct values per column
The idea was that any large shift in data volume or null counts could indicate something worth investigating.
I could have used Oracle dictionary tables like “ALL_TAB_COL_STATISTICS” to get some of this information, but I wanted more control over the final output. The end result needed to be a data set suitable for analysis in Tibco Spotfire.
I built a simple table within the schema to gather some information for each table/column, and then created a simple Talend job. The job gathers stats using dynamic queries, retrieving table and column names from the Oracle system table, “ALL_TAB_COLUMNS”. The stats table created to store the final output is below.
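A minimal sketch of what such a stats table might look like is below, assuming one column per metric plus a before/after label; the names and types here are illustrative assumptions, not the actual DDL:

```sql
-- Illustrative sketch of the stats table; the real DB_STATS definition may differ.
CREATE TABLE DB_STATS (
    RUN_PHASE      VARCHAR2(10),    -- 'Before' or 'After' the data model change
    TABLE_NAME     VARCHAR2(128),
    COLUMN_NAME    VARCHAR2(128),
    ROW_COUNT      NUMBER,
    NULL_COUNT     NUMBER,
    MIN_LENGTH     NUMBER,
    MAX_LENGTH     NUMBER,
    DISTINCT_COUNT NUMBER
);
```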
I ran the job before and after the ETL change, and then used Spotfire to look for outliers in the metrics. Below, for example, there is a big difference in the “number of nulls” metric for the “STATE_PROVINCE” column when comparing “before” vs “after”.
This is a manufactured example using the Oracle Express HR database. The Talend job itself was straightforward; the most complicated part was the creation of the dynamic SQL statements.
Step 1: Create the Oracle connection
Step 2: Iterate over the Oracle tables
Use a “tOracleTableList” component to iterate through the tables. I am excluding the STATS table itself (DB_STATS) using a WHERE clause in the component.
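The filter itself is just a small expression on the table name, along these lines (assuming the catalog view behind the component exposes TABLE_NAME):

```java
// WHERE clause field of the tOracleTableList component, excluding the stats table itself
"TABLE_NAME <> 'DB_STATS'"
```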
This component iterates over the table names. During each iteration I can grab the table name from the globalMap. The table name is used to build the dynamic SQL statements:
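For reference, a sketch of how the current table name can be read back out of the globalMap (the component name tOracleTableList_1 is an assumption):

```java
// Current table name published by tOracleTableList during each iteration
// (component name tOracleTableList_1 is assumed)
String tableName = (String) globalMap.get("tOracleTableList_1_CURRENT_TABLE");
```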
Step 3: Get the column names for the current table iteration
Now that we have the table name, we can create a query against the “ALL_TAB_COLUMNS” system table in Oracle. The query simply returns the column names for the table. We pull the table name out of the globalMap, courtesy of the “tOracleTableList” component.
It is not obvious from the screenshot, but there are single quotes and double quotes next to each other. This is a theme for most of the dynamic SQL statements in this article.
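The query expression is roughly as follows; the single quotes belong to the SQL literal, while the double quotes delimit the surrounding Java string (the component name tOracleTableList_1 is an assumption):

```java
// Query field of the input component that lists the columns of the current table
"SELECT COLUMN_NAME FROM ALL_TAB_COLUMNS WHERE TABLE_NAME = '"
    + ((String) globalMap.get("tOracleTableList_1_CURRENT_TABLE")) + "'"
```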
Step 4: Iterate over the column names
Later we will need both the table name AND a column name to create SQL statements, so I added a “tFlowToIterate” component. This will let me iterate over the column names. The column name for each iteration will get added to the globalMap as a variable.
At any point I can now pull both the table name and the column name out of the globalMap. Note that I also have access to other fields I am not using in this example, such as “DATA_TYPE” and “DATA_LENGTH”, which are available in the “ALL_TAB_COLUMNS” table.
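With its default settings, tFlowToIterate keys the globalMap entries by connection name and schema column, so the current column name can be read with something like this (the connection name row2 is an assumption):

```java
// Column name published by tFlowToIterate for the current iteration,
// using its default "<connection name>.<schema column>" key convention
String columnName = (String) globalMap.get("row2.COLUMN_NAME");
```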
Step 5: Execute the SQL needed to gather the metrics
This may need some explaining. If, for a particular iteration, the table name is “DEPARTMENTS” and the column name is “DEPARTMENT_NAME”, to gather some basic stats and produce one row of output I need to create a query like this:
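The query is along these lines, built from scalar subqueries so that everything comes back as a single row (the column aliases are illustrative):

```sql
-- One row of stats for DEPARTMENTS.DEPARTMENT_NAME (aliases are illustrative)
SELECT
    'DEPARTMENTS'     AS TABLE_NAME,
    'DEPARTMENT_NAME' AS COLUMN_NAME,
    (SELECT COUNT(*) FROM DEPARTMENTS)                                AS ROW_COUNT,
    (SELECT COUNT(*) FROM DEPARTMENTS WHERE DEPARTMENT_NAME IS NULL)  AS NULL_COUNT,
    (SELECT MIN(LENGTH(DEPARTMENT_NAME)) FROM DEPARTMENTS)            AS MIN_LENGTH,
    (SELECT MAX(LENGTH(DEPARTMENT_NAME)) FROM DEPARTMENTS)            AS MAX_LENGTH,
    (SELECT COUNT(DISTINCT DEPARTMENT_NAME) FROM DEPARTMENTS)         AS DISTINCT_COUNT
FROM DUAL
```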
We are selecting from the special Oracle table “DUAL”, a dummy table with one row and one column, which is ideal when you need a query like this to produce a single row of output. As you can see, this generates some basic stats for the table and column.
Note: There is some redundancy here; for each column I am calculating the total number of rows in the table, which really only needs to be done once per table.
To create this SQL, we have to generate a lot of the values dynamically, by extracting the table and column name from the globalMap variables created earlier in the job.
The next component to add is a “tOracleInput” component. Here we construct and execute the dynamic SQL above, for each table and column. Again, we do get into a little bit of “single quote/double quote” hell, but it isn’t too bad.
I am also adding a “Before” or “After” value from the Talend context to indicate whether this is a view of the stats taken before or after we implemented the data model changes.
It isn’t as bad as it looks; we are basically recreating the SQL above by pulling the table and column names out of the globalMap.
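A sketch of what the query expression might look like is below; the globalMap keys match the earlier steps, and the context variable name (context.run_phase) is an assumption:

```java
// Query field of the tOracleInput component: rebuilds the DUAL query above using
// the table and column names held in the globalMap, plus a 'Before'/'After' label
// from the Talend context (context.run_phase is an assumed variable name).
"SELECT '" + context.run_phase + "' AS RUN_PHASE, "
    + "'" + globalMap.get("tOracleTableList_1_CURRENT_TABLE") + "' AS TABLE_NAME, "
    + "'" + globalMap.get("row2.COLUMN_NAME") + "' AS COLUMN_NAME, "
    + "(SELECT COUNT(*) FROM " + globalMap.get("tOracleTableList_1_CURRENT_TABLE") + ") AS ROW_COUNT, "
    + "(SELECT COUNT(*) FROM " + globalMap.get("tOracleTableList_1_CURRENT_TABLE")
    + " WHERE " + globalMap.get("row2.COLUMN_NAME") + " IS NULL) AS NULL_COUNT, "
    + "(SELECT MIN(LENGTH(" + globalMap.get("row2.COLUMN_NAME") + ")) FROM "
    + globalMap.get("tOracleTableList_1_CURRENT_TABLE") + ") AS MIN_LENGTH, "
    + "(SELECT MAX(LENGTH(" + globalMap.get("row2.COLUMN_NAME") + ")) FROM "
    + globalMap.get("tOracleTableList_1_CURRENT_TABLE") + ") AS MAX_LENGTH, "
    + "(SELECT COUNT(DISTINCT " + globalMap.get("row2.COLUMN_NAME") + ") FROM "
    + globalMap.get("tOracleTableList_1_CURRENT_TABLE") + ") AS DISTINCT_COUNT "
    + "FROM DUAL"
```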
The schema is straightforward:
Step 6: Congratulate Ourselves 😉
You are now querying a database (the Oracle system catalog) to get data about the database (table and column names), so you can create dynamic SQL on the fly to get even more data about the database (the metrics)…
Step 7: Insert the record into the stats table
This is a simple insert of the stats record we just created through dynamic SQL, using a “tOracleOutput” component to write the results into the DB_STATS table:
Step 8: Commit (or rollback)
If the subjob completes without error, we perform an Oracle commit. This is necessary because I didn’t enable auto-commit on the tOracleConnection in Step 1:
If you have hundreds of tables, a trillion rows, and lots of columns, then think carefully before pulling the trigger. It took me a few hours to execute the job against approximately 150 tables (200 million rows in total), but when I put the data into Spotfire it gave me an incredibly useful view of my data, and it picked up some changes I would have missed otherwise.
The changes in the transactional system had in fact included logic changes to several key PL/SQL procedures used during our ETL extractions.
Disclaimer: All opinions expressed in this article are my own and do not necessarily reflect the position of my employer.