BODS Transformation Overview / BODS List of Transformations
List of available transforms:
- Data Integrator: Data_Transfer, Date_Generation, Effective_Date, Hierarchy_Flattening, History_Preserving, Key_Generation, Map_CDC_Operation, Pivot (Columns to Rows), Reverse Pivot (Rows to Columns), Table_Comparison, XML_Pipeline
- Data Quality: Associate, Country ID, Data Cleanse, DSF2 Walk Sequencer, Geocoder, Global Address Cleanse, Global Suggestion Lists, Match, USA Regulatory Address Cleanse, User-Defined
- Platform: Case, Map_Operation, Merge, Query, Row_Generation, SQL, Validation
- Text Data Processing: Entity_Extraction
Our approach is to gain detailed knowledge of all the above transforms, starting with the most commonly used ones. So we will start with the Query transform.
QUERY Transform
The Query transform is used to retrieve a data set, based on the input schema, that satisfies the conditions we specify. A Query transform is similar to a SQL SELECT statement (see the SQL sketch after the list below). The Query transform is used to perform the following operations:
- Map columns from the input schema to the output schema.
- Perform transformations and functions on the source data.
- Assign primary keys to output schema columns.
- Add new output columns, nested schemas, and function calls to the output schema.
- Perform data nesting and unnesting with sub-schemas of the output schema; also assign Make Current schema.
- Generate a distinct result set output for the input schema.
- Join data from multiple input source schemas; both equi-joins and outer joins are supported.
- Filter input source data.
- Perform aggregation based on input column groups.
- Generate a sorted dataset based on source input column order.
- Generate a DTD, XML Schema, or File Format based on the input or output schema.
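To make the SELECT analogy concrete, here is a minimal SQL sketch of what a single Query transform can express. The EMPLOYEE table and its columns are hypothetical names for illustration, not from the original post:

-- Column mapping, filtering, aggregation and sorting in one Query transform
SELECT DEPT_ID,
       COUNT(*)    AS EMP_COUNT,
       AVG(SALARY) AS AVG_SALARY
FROM   EMPLOYEE
WHERE  STATUS = 'ACTIVE'   -- Filter input source data
GROUP  BY DEPT_ID          -- Aggregation based on input column groups
ORDER  BY DEPT_ID;         -- Sorted dataset output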
CASE Transform
The Case transform is used to divide or route the input data set into multiple output data sets based on defined logical expressions. It is used to implement IF-THEN-ELSE logic at the dataflow level. This transform accepts only one source input. We can define multiple labels and their corresponding CASE expressions. For input rows that do not satisfy any of the CASE conditions, we may choose to output those records using the DEFAULT case. For that we need to select the check box Produce default output when all expressions are false.
Two other featured properties of this transform are Row can be TRUE for one case only and Preserve expression order. If we select the option Row can be TRUE for one case only, then a row is passed only to the first case whose expression returns TRUE; otherwise, the row is passed to all the cases whose expressions return TRUE. The Preserve expression order option is available only when the Row can be TRUE for one case only option is checked. We can select this option if expression order is important to us, because otherwise there is no way to guarantee which expression will evaluate to TRUE first.
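As a rough SQL analogy (the SALES table, REGION column, and target tables below are hypothetical), a Case transform with two labels plus a default behaves like routing one input through mutually exclusive filters:

-- Label NORTH_CASE and label SOUTH_CASE
INSERT INTO SALES_NORTH SELECT * FROM SALES WHERE REGION = 'NORTH';
INSERT INTO SALES_SOUTH SELECT * FROM SALES WHERE REGION = 'SOUTH';
-- DEFAULT case: rows for which no expression is TRUE
INSERT INTO SALES_OTHER SELECT * FROM SALES
WHERE  REGION IS NULL OR REGION NOT IN ('NORTH', 'SOUTH');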
MERGE Transform
The Merge transform is used to combine multiple input datasets with the same schema into a single output dataset of that schema. It is equivalent to the SQL UNION ALL statement. To eliminate duplicate records from the output dataset, i.e. to attain a UNION operation, add a Query transform with the DISTINCT option enabled after the Merge transform.
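In SQL terms, using hypothetical staging tables, the Merge transform alone versus Merge followed by a DISTINCT Query looks like this:

-- Merge transform alone: duplicates are kept
SELECT * FROM CUSTOMER_EAST
UNION ALL
SELECT * FROM CUSTOMER_WEST;

-- Merge + Query with DISTINCT: duplicates removed, i.e. a plain UNION
SELECT * FROM CUSTOMER_EAST
UNION
SELECT * FROM CUSTOMER_WEST;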
VALIDATION Transform
The Validation transform is used to filter or replace the source dataset based on criteria or validation rules to produce the desired output dataset. It enables us to create validation rules on the input dataset and generate the output based on whether rows have passed or failed the validation condition. This transform is typically used for NULL checking of mandatory fields, pattern matching, checking the existence of a value in a reference table, validating datatypes, etc.
The Validation transform can generate three output datasets: Pass, Fail, and RuleViolation. The Pass output schema is identical to the input schema. The Fail output schema has two extra columns, DI_ERRORACTION and DI_ERRORCOLUMNS. The RuleViolation output has three columns: DI_ROWID, DI_RULENAME and DI_COLUMNNAME.
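A minimal SQL sketch of a NULL-check rule, assuming a hypothetical CUSTOMER input with a mandatory CUST_NAME column:

-- Pass output: rows satisfying the validation rule
SELECT * FROM CUSTOMER WHERE CUST_NAME IS NOT NULL;

-- Fail output: rows violating the rule (DS additionally appends
-- the DI_ERRORACTION and DI_ERRORCOLUMNS columns to these rows)
SELECT * FROM CUSTOMER WHERE CUST_NAME IS NULL;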
MAP_OPERATION Transform
The Map_Operation transform allows conversions between data manipulation operations like INSERT, UPDATE, DELETE and REJECT. It enables us to change the operation codes of input data sets to produce the desired output row type. There are four operation codes for any input row type: NORMAL, UPDATE, INSERT and DELETE. In addition, the DISCARD option can be assigned to the output row type. Discarded rows are not passed through to the output of the transform.
If the output record is flagged NORMAL or INSERT, then a new record is inserted in the target table. If it is marked as UPDATE, it overwrites an existing row in the target table. If the input record is flagged as DELETE, it does not load the record in the target table; but if the output row type is set to DELETE, then it deletes the corresponding records present in the target. If the row is marked as DISCARD, then no records are passed to the output of the transform.
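For example, mapping NORMAL input rows to the DELETE output type conceptually turns the load into the following statement. TARGET_TABLE, KEY_COL, and INPUT_ROWS are hypothetical names standing in for the target, its key, and the incoming dataset:

-- Rows arriving as NORMAL, remapped to DELETE, remove their
-- matching rows from the target instead of being inserted
DELETE FROM TARGET_TABLE
WHERE  KEY_COL IN (SELECT KEY_COL FROM INPUT_ROWS);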
ROW_GENERATION Transform
The Row_Generation transform produces a dataset with a single column. The column values start with the number that we specify in the Row number starts at option. The value then increments by one up to the specified number of rows, as set in the Row count option. This transform does not allow any input data set.
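A rough SQL analogy for Row number starts at = 1 and Row count = 10, using a recursive common table expression (some databases require WITH RECURSIVE; syntax varies):

WITH ROW_GEN (ROW_NUM) AS (
    SELECT 1                         -- Row number starts at
    UNION ALL
    SELECT ROW_NUM + 1 FROM ROW_GEN
    WHERE  ROW_NUM < 10              -- Row count
)
SELECT ROW_NUM FROM ROW_GEN;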
SQL Transform
The SQL transform is used to submit or perform standard SQL operations on the database server. The SQL transform supports a single SELECT statement only. This transform does not allow any input data set. Use this transform when other built-in transforms cannot perform the required SQL operation. Try to use this transform as your last option, as it is not optimised for performance and also reduces readability.
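For illustration only (the table and filter are hypothetical), the single SELECT statement a SQL transform accepts might look like:

-- One SELECT statement, executed directly on the database server
SELECT CUST_ID, CUST_NAME, COUNTRY
FROM   CUSTOMER
WHERE  LAST_UPDATE >= '2012-01-01';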
KEY_GENERATION Transform
The Key_Generation transform helps to generate artificial keys for new rows in a table. The transform looks up the maximum existing key value of the surrogate key column from the table and uses it as the starting value to generate new keys for the new rows in the input dataset. The transform expects a column with the same name as the Generated key column of the source table to be part of the input schema.
The source table must be imported into the DS repository before defining the source table for this transform. The fully qualified table name, e.g. DATASTORE.OWNER.TABLE, should be specified. We can also set the Increment value, i.e. the interval between the generated key values; by default it is 1. We can also use a variable placeholder for this option. We will be using this transform frequently while populating surrogate key values of slowly changing dimension tables.
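Conceptually (DIM_CUSTOMER and CUST_KEY are hypothetical names), the transform first runs the equivalent of:

-- Fetch the current maximum surrogate key once per dataflow run
SELECT MAX(CUST_KEY) FROM DIM_CUSTOMER;

-- New rows then receive MAX + 1 * increment, MAX + 2 * increment, ...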
TABLE_COMPARISON Transform
The Table_Comparison transform helps to compare two data sets and generates the difference between them as a resultant data set, with rows flagged as INSERT, UPDATE, or DELETE. This transform can be used to ensure rows are not duplicated in a target table, or to compare the changed records of a data warehouse dimension table. It helps to detect and forward all changes, or only the latest ones, that have occurred since the last time the comparison table was updated. We will be using this transform frequently while implementing slowly changing dimensions and while designing dataflows for recovery.
The comparison table must already be imported into the DS repository. The fully qualified table name, e.g. DATASTORE.OWNER.TABLE, should be specified. Also set the input dataset columns that uniquely identify each row as the Input primary key columns; these columns must be present in the comparison table with the same column names and datatypes. If the primary key value from the input data set does not match a value in the comparison table, DS generates an INSERT row. Otherwise it generates an UPDATE row with the values from the input dataset row, after comparing all the columns in the input data set that are also present in the comparison table, apart from the primary key columns. As per your requirement, select only the required subset of non-key Compare columns; this will give a performance improvement.
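Roughly, the per-row decision can be sketched in SQL as follows. INPUT_ROWS, the COMP comparison table, and the column names are hypothetical, and NULL handling is simplified:

SELECT CASE
         WHEN c.PK_COL IS NULL THEN 'INSERT'   -- no match in comparison table
         WHEN i.COL1 <> c.COL1
           OR i.COL2 <> c.COL2 THEN 'UPDATE'   -- a Compare column differs
       END AS OPCODE,
       i.*
FROM   INPUT_ROWS i
LEFT   JOIN COMP c ON i.PK_COL = c.PK_COL;
-- Rows that match with no differences produce no output row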
If the Input primary key columns have duplicate keys, the transform arbitrarily chooses any one of the rows to compare during dataflow processing, i.e. the order of the input rows is ignored. Selecting the Input contains duplicate keys check box provides a method of handling duplicate keys in the input data set.
If the comparison table contains rows with the same primary keys, the transform arbitrarily chooses any one of the rows to compare. Specify the column of the comparison table with unique keys, i.e. one that by design contains no duplicate keys, as the Generated key column. A generated key column indicates which row of a set containing identical primary keys is to be used in the comparison. This provides a method of handling duplicate keys in the comparison table.
For an UPDATE, the output data set will contain the largest key value found for the given primary key. For a DELETE, the output data set can include all duplicate key rows or just the row with the largest key value.
When we select the check box Detect deleted row(s) from comparison table, the transform flags rows of the comparison table that have no matching key value in the input data set as DELETE. When the Generated key column and Detect deleted row(s) from comparison table options are selected together with either the Row-by-row select or the Sorted input comparison method, an additional section appears to specify how to handle DELETE rows with duplicate keys: Detect all rows or Detect row with largest generated key value.
Apart from all these properties, there are three methods for accessing the comparison table, namely Row-by-row select, Cached comparison table and Sorted input. Below is a brief on when to select which option.
- Row-by-row select is best if the target table is large compared to the number of rows the transform will receive as input. In this case, for every input row the transform fires a SQL query to look up the target table, as sketched after this list.
- Cached comparison table is best when we are comparing the entire target table. DS uses pageable cache as the default. If the table fits in the available memory, we can change the Cache type property of the dataflow to In-Memory.
- Sorted input is best when the input data is pre-sorted based on the primary key columns. DS reads the comparison table in the order of the primary key columns using a sequential read, only once. NOTE: The order of the input data set must exactly match the order of all primary key columns in the Table_Comparison transform.
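As referenced above, a hypothetical sketch of the per-row lookup that Row-by-row select performs (PK_COL and the parameter are illustrative):

-- Fired once per input row; :input_pk stands for that row's primary key value
SELECT *
FROM   DATASTORE.OWNER.TABLE
WHERE  PK_COL = :input_pk;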
NOTE:
- The transform only considers rows flagged as NORMAL as the input dataset.
- Be cautious when using real datatype columns in this transform, as comparison results are unpredictable for this datatype.