Search This Blog

Monday, March 9, 2015

Reading all the spreadsheets in a folder, and all sheets in each spreadsheet using Pentaho Data Integration

About this post

This post, outlines the steps needed to read all Excel spreadsheets in a folder using Pentaho Data Integration. This post was part of a larger transformation that I was building, but this one step is large enough and important enough to warrant its own post. 

Let's get started...

Reading all the spreadsheets in the folder

The first step is to read all the spreadsheets in a folder is to use a Microsoft Excel Input Transformer from the Input Group on the Design tab on the left, by dragging it on the main page.



By right-clicking on the newly added transformer (Edit step), we need to set the following in the dialog box that pops up.

1. Navigate to "Files" tab.

On this dialog box, I left the spreadsheet engine as Excel 97-2003 (JXL).

2. File or Directory: Enter the name of the folder in which the Excel files reside by clicking on Browse.

3. Regular Expression: Enter a regular expression to select the files you want. In my case since I want all Excel files, my regular expression looks for all the *_star.xls files in the folder.

4. Click on "Add" button next to Files or Directory input box.

5. Click on "Show Filename(s)" button to ensure the correct files are being selected.



Next step is to read all sheets in each selected spreadsheet.

Reading all sheets in each file

For this, do the following steps:

1. Click on the "Sheets" tab

2. List of sheets to read: Leave the Sheet Name as blank and enter Start Row and Start Column as 0.




Reading the header information


Next, navigate to the Content tab.

Enable the following:

1. Header
2. No Empty rows
3. Stop on empty row
4. Add filenames to result.




Reading consistent fields across multiple sheets


Next, navigate to the Fields tab.

1. Click on "Get fields from header row".

2. This will get duplicate values for all fields across all sheets, as shown in the first screenshot below.

3. Delete all additional field names (shown on the screenshot as XXX_1 and beyond), to get the final list of fields.


Here is the final list of fields after deleting the extra ones.


Next, we need to add some additional input fields, that we may need in our transformation.

I added Full filename field, as well as Sheetname field.


Finally, click "OK" to continue.


2 comments:

Anonymous said...

Hi,
Thanks for the detailed steps. Can ypu please elaborate the Regular expression? I am trying to get excel files in a folder but I keep getting an error "No files found! Please check filename/directory and regular expression options". Please suggest a solution?

awachs said...

I just clarified the regular expression. Essentially you want to filter for which files you want to read. In my case, if you check the screenshot, I wanted to read all .xls or .xlsx files that matched *_star.xls.