Disclaimer

Download: etl-job-1.0.zip
Migrating Data With DataConnect - Part 2

Loading Data With Pentaho Data Integration

 

In the previous article, we discussed the DataConnect framework and how it can be used to migrate your data into the TRIRIGA application.  We walked through a simple data load scenario, including a basic SQL script template for populating your staging tables.

Obviously, it won't always be practical to manually load in pre-scripted data.  In most cases you'll want to use some sort of ETL (Extract/Transform/Load) solution to retrieve data from your source (a flat file, a database, or some other system), perform basic pre-load manipulation and transformation of the data, and load the data into your staging tables.

The Pentaho Data Integration (also known as "Kettle") ETL suite is a popular tool for creating data movement scenarios.  It is free, open source software that can be downloaded from http://kettle.pentaho.org.  The Pentaho suite comes with several utilities for managing your ETL jobs; the primary tools are:

  • Spoon - A graphical environment for creating transformations and jobs
  • Pan - A command line tool for running transformations
  • Kitchen - A command line utility for executing jobs
We'll be primarily focusing on using the Spoon application to build jobs and transformations that we can use to populate the DataConnect staging tables.

Getting Started

First, you'll need to download the Pentaho Data Integration application.  The download is linked at http://kettle.pentaho.org, or you can go directly to the SourceForge project download site:
Click on the release version that you want to download (at the time of writing, the latest stable version was 3.2.0), then select the appropriate package for your platform.  Be sure to get the application package (you probably don't want the "javadoc" or "src" archives, just the main application); the application packages for version 3.2.0 are named "pdi-ce-3.2.0-stable.dmg" (for Mac OS X), "pdi-ce-3.2.0-stable.tar.gz" (Linux/Unix), and "pdi-ce-3.2.0-stable.zip" (Windows).  After downloading, you just need to expand the archive somewhere; the application is Java-based and doesn't require any special installation (although you will need to have the Java Runtime Environment 1.5 or higher installed).
 
Once you have installed the application, launch the Spoon utility; this can be done by running "spoon.bat" (Windows) or "spoon.sh" (Mac/Linux/Unix) from the directory where you installed Kettle.  The first time you launch Spoon, it will ask you if you want to open a repository; click the "No Repository" button, since we'll be using local files.

The Pentaho Data Integration suite deals with two types of files:
  • Jobs (file extension ".kjb") - A job is a sequential set of tasks or activities, each of which is a fairly discrete unit of work.  For example, a job might copy a file from a shared directory, perform a transformation on the file, then upload the result to an FTP server.  Each task is performed as a separate unit, in the specified order.
  • Transformations (file extension ".ktr") - A transformation defines a flow of data through a set of operations.  While operations are ordered, data can be processed in parallel; rows can effectively flow through the transformation at different rates.  As an example, your transformation might convert an incoming building code to uppercase, look up the corresponding building name from a database, then write the code and name to a flat file.  When the transformation is run on a set of 1000 buildings, the system may process the rows in 5 separate threads.  All rows will go through the transformation steps, but different threads may be performing different tasks at the same time.
In general, if you are operating on a set or sets of data, you will be using a transformation; if you are performing a sequence of tasks, you will want to use a job.

Example

Attached at the bottom of this article is a zip file named "kettle-example.zip"; this contains the following:
  • create-dc-job.ktr - A Kettle transformation that creates a DataConnect Job entry.  This transformation uses variables passed in from a parent job, so can be reused across multiple ETL scenarios.
  • complete-dc-job.ktr - A transformation that marks the DataConnect Job entry as ready for processing (after successful data loading).  This is also a reusable component.
  • load-example.ktr - A Kettle transformation that loads some sample data from a couple of text files to our S_DEVEXAMPLE and S_DEVCHILDEXAMPLE DataConnect staging tables from the previous article.
  • dataconnect-example.kjb - A Kettle job that links all of the above together; this job sets a couple of variables, then runs "create-dc-job.ktr" to create a DataConnect Job entry.  It then executes "load-example.ktr" to load the sample data to the staging tables for devExample and devChildExample.  Finally, it runs "complete-dc-job.ktr" to mark the job ready for processing.
  • sample-parent-data.txt, sample-child-data.txt - Some sample records.
To load the test data, open "dataconnect-example.kjb" in Spoon and click the "Run" button in the toolbar (the button that looks like a small green arrow).  You will be presented with the "Execute A Job" dialog:



You'll need to specify a few variables before clicking "Launch":
  • tririgaDatabaseServer - Host name or IP address of the TRIRIGA database server
  • tririgaDatabasePort - Port number on which the database is running
  • tririgaDatabaseName - Name of the database instance
  • tririgaDatabaseUsername - TRIRIGA database user
  • tririgaDatabasePassword - Password of the database user
Plug the appropriate values into the Variables table on the right hand side, then click "Launch".

Note that the create-dc-job.ktr and complete-dc-job.ktr transformations contain primarily "boilerplate" functionality common to nearly all DataConnect loads; since they use variables passed in from the main job, they can be leveraged across most or all of your load scenarios without modification.  The heavy lifting is in the load-example.ktr transformation, which loads data from the sample text files to the staging tables:



Step-by-step, here's what's happening in this transformation:
  1. Extract Parent Data - This is a Text File Input task, which extracts sample devExample data from "sample-parent-data.txt".
  2. Get Job Number - This is a Get Variables task which retrieves the "dcJobNumber" variable from the parent job (containing the generated DataConnect job number).
  3. Add Parent Correlation ID - This is an Add Sequence task, which is generating a sequential counter value for the correlation identifier.  The output of this task is forked along two different branches; both receive a copy of all output rows.
  4. Map Parent Fields - This is a Select Values task.  This is used to rename fields and select a subset of fields to pass on to the output.
  5. Add Parent Sequence ID - Another Add Sequence task, used to generate the DataConnect sequence number.
  6. Add Basic Parent DC Metadata - This is an Add Constants task, which adds the basic hardcoded data needed for all the staging table entries (including state, action, path, and GUI name).
  7. Save Parent Data to Staging Table - This is a Table Output task.  This is where the final transformed data set is written to the S_DEVEXAMPLE table.
  8. Extract Child Data - This is another Text File Input task, extracting sample devChildExample data from "sample-child-data.txt".  Note that this is a second data stream, running in parallel with the parent data.
  9. Sort Child Records - This is a Sort Rows task, which sorts the child records on the Parent and ID fields.
  10. Sort Parent Records - Another Sort Rows task.  This is the other branch from the previous fork; this sorts the parent records on the ID field.
  11. Join Parent to Child - This is a Merge Join task; this joins the two data streams and results in a combined stream containing matched data from both.  Here, the streams are joined on the parent record's ID.
  12. Map Child Fields - Another Select Values task, selecting a subset of fields for the child output.
  13. Sort by Correlation - Another Sort Rows task; here, the child records are sorted by the correlation identifier from the parent record.
  14. Add Child Sequence ID - An Add Sequence task to generate the child record's DataConnect sequence number.
  15. Add Basic Child DC Metadata - Another Add Constants task to append the basic necessary staging table fields
  16. Save Child Data to Staging Table - Finally, another Table Output task to save the rows to the S_DEVCHILDEXAMPLE table.

Integrating TRIRIGA With Kettle

Once you've got your jobs and transformations built, you'll probably want to start loading data.  If you're working on a one-time data load (for example, migrating legacy data into the system) then it may be sufficient to simply execute the ETL job from within Spoon.  However, if you're going to run your job nightly or on some other scheduled basis, you'll likely need a more robust solution.

"Pan" and "Kitchen" are command-line tools included with Kettle which allow you to run your jobs and transformations from the command line; you can use standard operating system schedulers to execute them that way.  However, we can also leverage the fact that TRIRIGA uses Kettle internally to maintain the WPM Fact tables -- the required libraries to build integrations and interact with the Kettle API are already present in the platform.

Attached to this article is "etl-job.zip", which contains the following:
  • etl-job-om.zip - Object Migration package which creates a few objects under the "devETLJob" module.
  • etl-job.jar - Custom workflow task; this should be copied into the "server/all/lib" subdirectory under your TRIRIGA server's JBoss instance.
  • integration-example.ktr - A Kettle transformation; copy this somewhere on the application server.  If you have more than one server running, this should go on the server where the Workflow Agent executes.
  • sample-parent-data2.txt - Some sample parent records; copy this alongside the transformation.
  • sample-child-data2.txt - Some sample child records; copy this alongside the transformation.
After importing the Object Migration package and copying over the custom task, you'll need to restart your application server.

The package should create an "ETL Job" manager; look for that in the Data Browser, and open the "JOB-Example" example record:



Update the "Job File" and "Log File" fields to reflect the appropriate locations on the application server.  The utility can execute either Kettle Jobs (".kjb") or Kettle Transformations (".ktr").  Here, we have checked the "DataConnect Job" checkbox; this tells the tool to additionally create a DataConnect Job entry for the Business Object specified in the "DataConnect Job Object" field and update it when the ETL job has completed.  In this case, the custom workflow task will automatically maintain the DataConnect job entry (you don't have to do it explicitly within your transformation or job, but can instead just move data into the staging tables).

You can also define variables that should be passed to your job or transformation.  This can be used to parameterize database connections, etc.  Note that the "tririgaDatabaseServer" and other parameters are not needed for this transformation; it is configured to use the JNDI connection that is managed by the TRIRIGA application server itself.  You can click "Execute" to launch the job immediately, or you can define a schedule for the job to execute automatically using the TRIRIGA scheduling engine (via the "Job Schedule" section).

Hopefully this provides you with a solid foundation for exploring DataConnect and Kettle more thoroughly.  Documentation on DataConnect is available from TRIRIGA Elite ("Application Building for the TRIRIGA Application Platform: Data Management" is the document you want); up-to-date documentation and tutorials for Kettle can be found at:


Source code for the custom task presented in this article is available from the triDeveloper Code Repository:


Files