What is DataStage?
- An ETL tool that extracts, transforms, and loads data into a data mart or data warehouse
- Used for data integration projects such as data warehouses and operational data stores (ODS); it can connect to major databases such as Teradata, Oracle, DB2, and SQL Server
- ETL jobs can be migrated between environments such as Dev, UAT, and Prod by exporting and importing DataStage components
- Manages the metadata used in jobs
- Schedules, executes, and monitors jobs
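To make the extract, transform, and load steps concrete, here is a minimal sketch in plain Python. It is not DataStage code: the sample CSV, the `sales` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data standing in for a file or database extract.
SOURCE_CSV = """id,name,amount
1, alice ,10.50
2,BOB,3
3,carol,
"""


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize names, cast types, and drop rows with no amount."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # reject incomplete rows
        cleaned.append(
            (int(row["id"]), row["name"].strip().title(), float(row["amount"]))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Chain the three steps and return what ended up in the target table."""
    load(transform(extract(SOURCE_CSV)), conn)
    return conn.execute("SELECT id, name, amount FROM sales ORDER BY id").fetchall()
```

Calling `run_etl(sqlite3.connect(":memory:"))` loads the two complete rows and skips the one with a missing amount. In DataStage the same three steps would be built graphically from stages rather than written by hand.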
DataStage Architecture:
DataStage allows jobs to be developed in the Server or Parallel edition. The Parallel edition uses parallel processing to handle the data and is suited to large data volumes.
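The idea behind parallel processing of large volumes can be sketched as "partition the rows, process each partition independently, then merge". The Python below is only an illustration under assumptions: the `customer_id` and `amount` fields, the modulo partitioning rule, and all function names are hypothetical and do not describe how the DataStage parallel engine is implemented. A thread pool is used to keep the example self-contained; it shows the structure rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, key, n_partitions):
    """Split rows into buckets by key modulo, so equal keys share a bucket."""
    buckets = [[] for _ in range(n_partitions)]
    for row in rows:
        buckets[row[key] % n_partitions].append(row)
    return buckets


def process_partition(rows):
    """Per-partition work: total the amount for each customer."""
    totals = {}
    for row in rows:
        cid = row["customer_id"]
        totals[cid] = totals.get(cid, 0) + row["amount"]
    return totals


def run_parallel(rows, n_partitions=4):
    """Partition, process the partitions concurrently, and merge the results."""
    buckets = partition(rows, "customer_id", n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_results = list(pool.map(process_partition, buckets))
    merged = {}
    for totals in partial_results:
        # Safe to merge blindly: each key lives in exactly one partition.
        merged.update(totals)
    return merged
```

Partitioning on the grouping key is the design choice that matters here: because all rows for one customer land in the same bucket, the partial results never overlap and can be combined without a second aggregation pass.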
Components:
- Designer
- Director
- Administrator
Administrator:
The following tasks are performed using the Administrator:
- Add, delete, and move projects
- Set up user permissions for projects
- Purge job log files
- Set the timeout interval on the engine
- Trace the engine activity
- Set job parameter defaults
- Issue WebSphere DataStage Engine commands from the Administration client
- Configure parallel job settings
- Create and set environment variables
Enabling job administration in the Director client:
The Administrator can enable job administration features in the Director client. These features let WebSphere DataStage operators release the resources of a job that has aborted or hung, returning the job to a state where it can run.
Enabling them adds two commands to the Director menu:
- Cleanup Resources
- Clear Status File
Designer:
- Design and develop jobs using the graphical design tool
- Stages of various types, such as General, Database, File, and Processing, are used when developing jobs
- Table definitions can be imported directly from the data source or data warehouse tables
- Jobs are compiled in the Designer, which checks for compilation errors in primary inputs, reference outputs, key expressions, transforms, and so on
- Import and export projects between environments
- Server, mainframe and parallel jobs can be created using the designer
- Job parameters are defined on the Parameters page of the job properties and can then be used during development
- Custom routines can be created
- Multiple jobs can be selected for compilation, and a report is provided when compilation finishes
Director:
- Validate, schedule, run, and monitor jobs run by the DataStage Server
- Job status shows the current status of each job, such as running, compiled, finished, aborted, or not compiled
- The job log displays the log entries for the selected job
- Reset a job whose status is aborted or stopped before running it again
- Provides the execution times of the jobs
- Clean up job resources (if the administrator has enabled this option)
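The rule that an aborted or stopped job must be reset before it can run again can be modeled as a small state check. The sketch below is a simplified illustration, not a DataStage API: the `JobStatus` enum uses only the statuses named above, and treating a reset job as `COMPILED` is an assumption made to keep the model small.

```python
from enum import Enum


class JobStatus(Enum):
    NOT_COMPILED = "not compiled"
    COMPILED = "compiled"
    RUNNING = "running"
    FINISHED = "finished"
    ABORTED = "aborted"
    STOPPED = "stopped"


# Statuses that require a reset before the next run.
NEEDS_RESET = {JobStatus.ABORTED, JobStatus.STOPPED}
# Statuses from which a job can be started (assumption of this sketch).
RUNNABLE = {JobStatus.COMPILED, JobStatus.FINISHED}


def reset(status):
    """Return an aborted or stopped job to a runnable state; leave others alone."""
    if status in NEEDS_RESET:
        return JobStatus.COMPILED  # simplification: modeled as 'compiled'
    return status


def can_run(status):
    """True if a job in this status can be started."""
    return status in RUNNABLE
```

For example, `can_run(JobStatus.ABORTED)` is false, while `can_run(reset(JobStatus.ABORTED))` is true.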
Along with these jobs, DataStage provides containers (local and shared) and sequence jobs. A sequence job specifies a sequence of server or parallel jobs to run.
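As a rough picture of what a sequence job does, the following Python sketch runs a list of jobs in order and records a status for each. It is an illustration under assumptions: `run_sequence` and the job callables are hypothetical, and stopping at the first failure is a choice made for this sketch, not a statement about how DataStage sequences must behave.

```python
def run_sequence(jobs):
    """Run (name, callable) pairs in order; stop at the first job that fails.

    Returns a list of (name, status) entries for the jobs that were started.
    """
    log = []
    for name, job in jobs:
        try:
            job()
        except Exception:
            log.append((name, "aborted"))
            break  # assumption: later jobs depend on this one, so do not run them
        log.append((name, "finished"))
    return log
```

Given three jobs where the second raises an error, the returned log contains a "finished" entry for the first, an "aborted" entry for the second, and nothing for the third because it was never started.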