In the previous article, we discussed the DataConnect framework and how it can be used to migrate your data into the TRIRIGA application. We walked through a simple data load scenario, including a basic SQL script template for populating your staging tables. Obviously, it won't always be practical to manually load in pre-scripted data. In most cases you'll want to use some sort of ETL (Extract/Transform/Load) solution to retrieve data from your source (a flat file, a database, or some other system), perform basic pre-load manipulation and transformation of the data, and load the data into your staging tables. The Pentaho Data Integration (also known as "Kettle") ETL suite is a popular tool for creating data movement scenarios. It is free, open source software that can be downloaded from http://kettle.pentaho.org. The Pentaho suite comes with several utilities for managing your ETL jobs; the primary tools are:
Getting StartedFirst, you'll need to download the Pentaho Data Integration application. The download is linked at http://kettle.pentaho.org, or you can go directly to the SourceForge project download site:
Click on the release version that you want to download (at the time of writing, the latest stable version was 3.2.0), then select the appropriate package for your platform. Be sure to get the application package (you probably don't want the "javadoc" or "src" archives, just the main application); the application packages for version 3.2.0 are named "pdi-ce-3.2.0-stable.dmg" (for Mac OS X), "pdi-ce-3.2.0-stable.tar.gz" (Linux/Unix), and "pdi-ce-3.2.0-stable.zip" (Windows). After downloading, you just need to expand the archive somewhere; the application is Java-based and doesn't require any special installation (although you will need to have the Java Runtime Environment 1.5 or higher installed).
Once you have installed the application, launch the Spoon utility; this can be done by running "spoon.bat" (Windows) or "spoon.sh" (Mac/Linux/Unix) from the directory where you installed Kettle. The first time you launch Spoon, it will ask you if you want to open a repository; click the "No Repository" button, since we'll be using local files.
The Pentaho Data Integration suite deals with two types of files:
ExampleAttached at the bottom of this article is a zip file named "kettle-example.zip"; this contains the following:
You'll need to specify a few variables before clicking "Launch":
Note that the create-dc-job.ktr and complete-dc-job.ktr transformations contain primarily "boilerplate" functionality common to nearly all DataConnect loads; since they use variables passed in from the main job, they can be leveraged across most or all of your load scenarios without modification. The heavy lifting is in the load-example.ktr transformation, which loads data from the sample text files to the staging tables: Step-by-step, here's what's happening in this transformation:
Integrating TRIRIGA With KettleOnce you've got your jobs and transformations built, you'll probably want to start loading data. If you're working on a one-time data load (for example, migrating legacy data into the system) then it may be sufficient to simply execute the ETL job from within Spoon. However, if you're going to run your job nightly or on some other scheduled basis, you'll likely need a more robust solution."Pan" and "Kitchen" are command-line tools included with Kettle which allow you to run your jobs and transformations from the command line; you can use standard operating system schedulers to execute them that way. However, we can also leverage the fact that TRIRIGA uses Kettle internally to maintain the WPM Fact tables -- the required libraries to build integrations and interact with the Kettle API are already present in the platform. Attached to this article is "etl-job.zip", which contains the following:
The package should create an "ETL Job" manager; look for that in the Data Browser, and open the "JOB-Example" example record: Update the "Job File" and "Log File" fields to reflect the appropriate locations on the application server. The utility can execute either Kettle Jobs (".kjb") or Kettle Transformations (".ktr"). Here, we have checked the "DataConnect Job" checkbox; this tells the tool to additionally create a DataConnect Job entry for the Business Object specified in the "DataConnect Job Object" field and update it when the ETL job has completed. In this case, the custom workflow task will automatically maintain the DataConnect job entry (you don't have to do it explicitly within your transformation or job, but can instead just move data into the staging tables). You can also define variables that should be passed to your job or transformation. This can be used to parameterize database connections, etc. Note that the "tririgaDatabaseServer" and other parameters are not needed for this transformation; it is configured to use the JNDI connection that is managed by the TRIRIGA application server itself. You can click "Execute" to launch the job immediately, or you can define a schedule for the job to execute automatically using the TRIRIGA scheduling engine (via the "Job Schedule" section). Hopefully this provides you with a solid foundation for exploring DataConnect and Kettle more thoroughly. Documentation on DataConnect is available from TRIRIGA Elite ("Application Building for the TRIRIGA Application Platform: Data Management" is the document you want); up-to-date documentation and tutorials for Kettle can be found at: Source code for the custom task presented in this article is available from the triDeveloper Code Repository: |



