
Monday, March 9, 2015

Creating DDL scripts from Excel Model using Pentaho Data Integration

What this post is about

In this post, I will walk through an ETL transformation that converts a data model defined in spreadsheets into a DDL script that can be executed to create tables in PostgreSQL. I am using Pentaho Data Integration Community Edition for this exercise.

Creating Data Models in Spreadsheets

When designing data models, it is often easiest to sketch them out in spreadsheets. I typically use one sheet per table in the workbook, listing the column names, data types, and descriptions.

The structure I use is as follows, with one sample row of data:

ID  AttributeName  DataType  DataWidth  Description  Dimension  SourceColumns  ExtractionMethod
1   W_ATTRIB_1     Numeric   4




Here is a sample fact table shown in Excel



Here is a dimension table (it's the same structure, except the name ends in "_D" instead of "_F", per my naming convention).


The problem with Spreadsheets

The problem with spreadsheets, of course, is that creating the final SQL script from them is not straightforward. For this reason, I am building the transformation in an ETL tool, so that it can be maintained and reused over and over again.
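To make the goal concrete, here is the kind of statement the transformation should emit for each sheet. This is just a sketch: the table name W_SALES_F and the second column are hypothetical, and only W_ATTRIB_1 Numeric 4 comes from the sample row above.

CREATE TABLE W_SALES_F ( W_ATTRIB_1 Numeric(4), W_ATTRIB_2 Numeric(8) ) ;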

Let's walk through the high-level steps of the transformation.

High level steps

The high level steps in the ETL transformation are as follows.

1. Read all the spreadsheets sitting in one folder, including all the sheets in each spreadsheet.
2. Filter the sheets to be processed.
3. Add constants.
4. Concatenate fields.
5. Sort the rows.
6. Denormalize the rows.
7. Concatenate the final SQL.
8. Output to a text file.

Here is a picture of the transformation



Reading all the spreadsheets in the folder

The first step is to read all the spreadsheets in a folder, as well as all the sheets in each spreadsheet file. Since this step is big and important enough to stand on its own, I have covered it in a separate post on my blog (reproduced at the end of this one).

Filter the sheets to be processed by the transformation

Since I may have multiple sheets in each spreadsheet, and only some may be relevant to the model, I added a Filter Rows step. Filter Rows is available under the Flow group of steps on the Design tab.


Click on Edit Step to configure this step. As you can see, I have added two conditions that only allow sheets whose names end with _D or _F.
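If you are reproducing this, the two conditions can be expressed with the ENDS WITH comparator against the sheet-name field, roughly as sketched below. The field name Sheetname is an assumption; it is the extra input field added in the reading step covered later in this post.

Sheetname ENDS WITH _D
OR Sheetname ENDS WITH _F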


Adding some constants

The next step is to add some constants that I need in multiple places in the transformation to fully form the SQL. The Add Constants step can be found under the Transform group of steps on the Design tab.


The constants I added were the following:


#  Constant Name      Value
1  TableColumnKey     TabColKey
2  OpenParanthesis    (
3  ClosedParanthesis  )
4  CreateTable        CREATE TABLE
5  Semicolon          ;
6  Space              (a single space)




Click OK to continue. The next step is to concatenate fields to form the column definitions.

Concatenating Fields to form Column Definitions

I used the "Concat Fields" step under the Transform group of steps on the Design tab as shown below.



After dragging and connecting it to the previous step, right-click and select Edit Step. Enter the following details in the dialog box:

1. TargetFieldName: TableColumnDefinition
2. Separator: (leave blank)

Here are the input columns I concatenated, in the following order.

1. AttributeName
2. Space
3. DataType
4. OpenParanthesis
5. DataWidth
6. ClosedParanthesis

This will create a column called TableColumnDefinition by concatenating all the above fields in the listed order.
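For the sample row shown at the top of this post, the concatenated value comes out as:

W_ATTRIB_1 Numeric(4)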


Next, we need to prepare to flatten the rows. Prior to feeding the rows into the denormalizer, we need to sort them.

Sorting the definitions by Table name

The Sort Rows step is also present under the Transform group of steps. Drag and connect it to the previous step.


Right-click and select Edit Step to make the following entries. The fields need to be in the order listed below.

1. TableName
2. ID

This will sort the input rows accordingly. Click OK to continue.
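The sort matters because the Row Denormalizer in the next step groups consecutive rows by table name. Continuing the hypothetical example (the table name and second row are assumptions, not from the sample sheet), the sorted stream would look roughly like:

TableName   ID   TableColumnDefinition
W_SALES_F   1    W_ATTRIB_1 Numeric(4)
W_SALES_F   2    W_ATTRIB_2 Numeric(8)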



Denormalizing the rows

This is the most important step in the entire transformation. Drag the Row Denormalizer step, found under the Transform group of steps, and connect it to the previous step.

Right-Click and select Edit Step to continue.



On this dialog box, we will enter the following details

1. The Key Field: TableColumnKey (This is the constant we introduced in a previous step and is the same for all input rows)

2. GroupField: TableName (We want to select all rows having the same TableName to be merged into a single row)

3. Target Fields: (We will enter only one target field with the following values)

a. TargetFieldName: AllColumns (Each output row will now have this additional column)

b. ValueFieldName: TableColumnDefinition (This is the concatenated column we created in a previous step)

c. Length: 5000 (We need space for long table statements)

d. Aggregation: Concatenate strings separated by , (This will take the individual column definitions for a given table name and concatenate them, separated by commas, into a list of column definitions)

Click OK to continue.
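Continuing the hypothetical two-column example, the denormalizer collapses the two sorted rows above into a single row per table, whose AllColumns value is:

W_ATTRIB_1 Numeric(4), W_ATTRIB_2 Numeric(8)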



Constructing the Full SQL


The last step before writing to file is to construct the full SQL, using another Concat Fields step. We will concatenate the following with a space between them. Enter the following details:

1. TargetFieldName: FullSQL
2. Length of Target Field: 5000
3. Separator: " " (Single Space)
4. Fields:
a. CreateTable
b. TableName
c. OpenParanthesis
d. AllColumns
e. ClosedParanthesis
f. Semicolon
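Carrying the hypothetical example through, the assembled FullSQL value is the complete statement we set out to build:

CREATE TABLE W_SALES_F ( W_ATTRIB_1 Numeric(4), W_ATTRIB_2 Numeric(8) ) ;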



Writing to file

The last step in this transformation is writing to a file. Drag and connect a Text File Output step from the Output group of steps.


Enter the following

1. Name of file: Your choice
2. Extension: sql



On the Content tab, I made no changes.


Finally, on the Fields tab, we only need one output column, with the following details:

1. Name: FullSQL
2. Length: 5000


That's it. Save the transformation and execute it.


Reading all the spreadsheets in a folder, and all sheets in each spreadsheet using Pentaho Data Integration

About this post

This post outlines the steps needed to read all the Excel spreadsheets in a folder using Pentaho Data Integration. It was part of a larger transformation that I was building, but this one step is large and important enough to warrant its own post.

Let's get started...

Reading all the spreadsheets in the folder

The first step, reading all the spreadsheets in a folder, uses a Microsoft Excel Input step from the Input group on the Design tab on the left; drag it onto the main canvas.



Right-click on the newly added step and select Edit Step, then set the following in the dialog box that pops up.

1. Navigate to "Files" tab.

On this dialog box, I left the spreadsheet engine as Excel 97-2003 (JXL).

2. File or Directory: Enter the name of the folder in which the Excel files reside by clicking on Browse.

3. Regular Expression: Enter a regular expression to select the files you want. In my case since I want all Excel files, my regular expression looks for all the *_star.xls files in the folder.

4. Click on "Add" button next to Files or Directory input box.

5. Click on "Show Filename(s)" button to ensure the correct files are being selected.



The next step is to read all the sheets in each selected spreadsheet.

Reading all sheets in each file

For this, do the following steps:

1. Click on the "Sheets" tab

2. List of sheets to read: Leave the Sheet Name blank and enter Start Row and Start Column as 0. With no sheet name specified, the step reads every sheet in each file.




Reading the header information


Next, navigate to the Content tab.

Enable the following:

1. Header
2. No empty rows
3. Stop on empty row
4. Add filenames to result




Reading consistent fields across multiple sheets


Next, navigate to the Fields tab.

1. Click on "Get fields from header row".

2. This will pick up duplicate field names, one set per sheet, as shown in the first screenshot below.

3. Delete all the additional field names (shown in the screenshot as XXX_1 and beyond) to get the final list of fields.


Here is the final list of fields after deleting the extra ones.


Next, we need to add some additional input fields that we may need in our transformation.

I added the Full filename field, as well as the Sheetname field.


Finally, click "OK" to continue.