It's all about Business Intelligence

March 21, 2011

Aggregator Transformation

Let us explore how to use the Aggregator transformation while building ETL mappings in Informatica.

It lets us perform aggregate calculations such as SUM, AVG, etc. Here is how it works:

1. Reads the data from the input ports

2. Performs the calculations

3. Stores the index and data in caches for further aggregation

4. Repeats steps 1-3 until the last row

5. Writes the results to the output ports

Sorting the aggregator input improves performance. Also limit the number of input ports, which helps use the cache space efficiently.

We can specify conditional expressions inside aggregate functions, which helps reduce the number of rows to be processed.

For example, use the following expression to calculate the total sales where the quantity is greater than 10

SUM( SALES, QUANTITY > 10 )

We can also configure how the Integration Service handles NULL values in aggregate functions. By default, NULL is treated as NULL.

You can also use nested aggregate functions, but in that case all aggregate expressions in the transformation must be nested. You cannot mix single-level and nested aggregate functions in the same transformation.
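For readers who think in SQL, here are rough SQL equivalents of what the aggregator does in these two cases. The table and column names (order_lines, sales, quantity, store_id) are made up purely for illustration.

-- Conditional aggregate, roughly equivalent to SUM( SALES, QUANTITY > 10 ):
SELECT SUM(CASE WHEN quantity > 10 THEN sales END) AS total_sales
FROM   order_lines;

-- A nested aggregate such as MAX( SUM( SALES ) ) aggregates an already
-- aggregated result, e.g. the largest per-store sales total:
SELECT MAX(store_total) AS best_store_total
FROM   (SELECT store_id, SUM(sales) AS store_total
        FROM   order_lines
        GROUP  BY store_id) store_totals;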

The Aggregator transformation has the following properties:

1. Cache Directory: Path where the cache files will be stored.

2. Tracing Level: Sets the log level, ranging from simple logging to the most detailed level, which logs every row that passes through.

3. Sorted Input: When the input is sorted, the aggregator's job becomes easier and performance improves. Check this option only when the input is actually sorted.

Do not use sorted input if an expression uses nested aggregate functions or if you use incremental aggregation.

If you enable the Sorted Input option but the input is not actually sorted, the session will fail.

4. Data and Index Cache Size: Set the cache sizes depending on the requirement.

5. Transformation Scope: Specifies the boundary for applying the aggregation logic, based on the session's commit intervals. Since aggregation usually involves rows spread across the whole input, it is better to keep this as "All Input" rather than "Transaction".

We can also perform incremental aggregation, which is configured while creating the session task in the Workflow Manager. It uses the historical cache to aggregate only the new data.

To use the Sorted Output option:

Add a custom property to the Integration Service from the Admin Console as shown below.

Transformations

In this post we will discuss transformations in Informatica. The term "transformation" refers to transforming the data that comes from the source before it is loaded into the target.

Transformation is a core capability of any ETL product, and Informatica in particular provides 360-degree transformation, meaning we can transform the data into virtually any shape we want.

Transformations are classified based on the way they work.

Based on whether they change the number of rows, we can classify them into two categories:

1. Active – changes the number of rows, like the Aggregator transformation

2. Passive – does not change the number of rows, like the Expression transformation

Based on their participation in the data flow, they can be classified into two categories:

1. Connected – used in the pipeline along with other transformations, sources, and targets

2. Unconnected – called from within another transformation, like an unconnected Lookup transformation

For example, to merge data horizontally (joining two tables, as in SQL) we can use the Joiner transformation, and to merge data vertically (like a UNION ALL in SQL) we can use the Union transformation.
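To make the horizontal/vertical distinction concrete, this is roughly the SQL picture the two transformations correspond to. The tables and columns are made up purely for illustration.

-- Horizontal merge: the Joiner transformation plays the role of a SQL join.
SELECT o.order_id, o.amount, c.customer_name
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id;

-- Vertical merge: the Union transformation plays the role of UNION ALL.
SELECT order_id, amount FROM orders_north
UNION ALL
SELECT order_id, amount FROM orders_south;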

Below is the list of transformations that Informatica 8.x version supports.

Transformation | Type | Description
Aggregator | Active / Connected | Performs aggregate calculations, similar to GROUP BY in SQL (SUM, AVG, etc.)
Source Qualifier | Active / Connected | Represents the rows read from a relational or flat file source
Expression | Passive / Connected | Performs calculations on each row
External Procedure | Passive / Connected or Unconnected | Calls a procedure in a shared library
Filter | Active / Connected | Filters rows based on a condition
HTTP | Passive / Connected | Connects to an HTTP server to read or update data
Java | Active or Passive / Connected | Executes user logic coded in Java; the byte code is stored in the repository
Joiner | Active / Connected | Joins data from different databases or flat file systems, similar to joins in SQL
Lookup | Passive / Connected or Unconnected | Looks up values in a relational table, view, or flat file
Normalizer | Active / Connected | Source qualifier for COBOL sources; can also be used in the pipeline to normalize (pivot from horizontal to vertical) data from relational or flat file sources
Rank | Active / Connected | Limits records to a top or bottom range
Router | Active / Connected | Routes data into multiple output groups based on group conditions
Sequence Generator | Passive / Connected | Generates sequence numbers, typically used as primary keys
Sorter | Active / Connected | Sorts data based on a sort key
SQL | Active or Passive / Connected | Executes SQL queries against a database
Stored Procedure | Passive / Connected or Unconnected | Calls a stored procedure
Transaction Control | Active / Connected | Defines commit and rollback transactions
Union | Active / Connected | Merges data from different databases or flat file systems
Unstructured Data | Active or Passive / Connected | Transforms data in unstructured and semi-structured formats
Update Strategy | Active / Connected | Determines whether to insert, delete, update, or reject rows

In upcoming posts we will discuss each of these transformations in detail.

BFN

November 9, 2010

Prioritizing Logical Table Sources

This is one of the new features introduced in OBIEE 11g; it helps to dynamically choose the logical table source for a particular dimension or fact based on user preference.

It is very important for a reporting tool to have the option to dynamically choose the underlying physical sources. For example, sometimes a user may prefer to get data from the transactional system (current data) or from the warehouse (past data), depending on the requirement.

OBIEE 11g provides a priority setting on logical table sources that helps the BI Server choose the appropriate logical table source at run time. A lower priority number for a logical table source increases its chances of being queried. Typically '0' and '1' are used, with '0' indicating current data and '1' indicating past data.

We can use this priority setting in the following three ways:

1. Standard usage of priority
Here we simply prioritize the logical table sources by setting a priority number for each LTS.

Steps:
1. In the RPD, set the priority group number for each LTS.

2. Permanently reversing the priority of logical table sources
Here we reverse the priority setting in the RPD for only certain subject areas. For example, we can build two subject areas based on the same logical table, one for current data and another for historical data.

Steps:
1. Create a table in the database that holds the subject area names for which the priority should be permanently reversed, along with a reversal flag (1 for reverse, 0 for normal).
2. Create a session initialization block and a vector session variable like below.

3. Reversing the priority based on the user query
Here we reverse the priority setting in the RPD at query time by passing a value through a request variable: '0' keeps the priority defined in the RPD and '1' reverses it.

Steps:

1. Create a table in the database that holds the subject area names and a reversal flag (1 for reverse, 0 for normal), as in the previous method (a sketch of this table and the initialization block query appears after these steps).
2. Create a session initialization block and a vector session variable like below.

3. While running a query from Answers, set the request variable ('1' to reverse the priority, '0' to keep the normal priority), which dynamically shifts the priority.
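Since the original screenshots are not reproduced here, the following is only a rough sketch of the driving table and of the query a row-wise session initialization block could run. All table and column names and sample values are illustrative; the actual vector session variable name should be taken from the OBIEE 11g documentation.

-- Illustrative table driving priority reversal per subject area.
CREATE TABLE lts_priority_reversal (
  subject_area_name VARCHAR2(128),
  reversal_flag     NUMBER(1)      -- 1 = reverse the LTS priority, 0 = keep it
);

INSERT INTO lts_priority_reversal VALUES ('Sales Current', 0);
INSERT INTO lts_priority_reversal VALUES ('Sales Past', 1);
COMMIT;

-- Query the row-wise initialization block could run to populate the vector
-- session variable with (subject area, flag) pairs.
SELECT subject_area_name, reversal_flag
FROM   lts_priority_reversal;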

I have attached a sample RPD that contains all of this work, where sales data may come from the current source or the past source. If you query "Sales Current" the data comes from the current source, and "Sales Past" gives the past data: paint.rpd (rename paint.ppt to paint.rpd).

You can try reversing the priority by passing a value to the request variable mentioned above and getting past data from the "Sales Current" subject area.

October 6, 2010

First Hands on in OBIEE 11g

This document will help you get started with OBIEE 11g and covers the following topics. It is a good starting point for someone who wants to learn OBIEE 11g.

1. Building Physical Layer

2. Building BMM layer

3. Building Presentation Layer

4. Updating the settings in EM for latest RPD

5. Starting BI services

6. Enabling Query Logging

First Hands on in OBIEE 11g.doc

September 4, 2010

Data Warehouse Administration Console (DAC) Part 1

Following my posts on Informatica, let us look into another integral component of BI Applications: the Data Warehouse Administration Console (DAC). Before starting with DAC, let's have an introduction to BI Applications, which will help us understand the big picture.

Oracle BI Applications

Oracle BI Applications is a set of pre-packaged BI applications across many modules like Financials, SCM, CRM, HCM, etc. It comes with a central data warehouse that holds all of this data, plus a set of pre-built ETL routines that load the warehouse from ERP systems such as Oracle E-Business Suite, PeopleSoft, and Siebel CRM.

Components of BI Applications Architecture:

As you can see from the diagram below, Informatica is the ETL tool used in BI Applications (Oracle Data Integrator can also be used). DAC is the console used to manage the analytics schema and to handle configuration, administration, loading, and monitoring of the analytics warehouse.

DAC Client: The user interface through which the user performs the activities (configuration, administration, loading, monitoring) needed to manage the analytics schema.

DAC Server: Executes the instructions given by the DAC client for running the ETL routines.

DAC Repository: Stores the metadata used by DAC to manage this whole data warehouse process.

Terms to know about DAC:

Source System Containers: These are like schemas in a database. Objects in the DAC repository belong to a particular source system container, and objects can be referenced across containers. The DAC repository comes with predefined source system containers that can be copied and customized to implement our solution. We cannot modify the predefined containers; they act as templates.

Subject Areas: A grouping of related tables and tasks for one functional area. For example, within HCM there can be subject areas like Talent Management, Compensation, etc.

Table Types: There can be different table types like Fact, Dimension, Hierarchy, Aggregate, etc.

Tasks & Task Groups: A task is a unit of work required to load one or more tables. It contains information such as the source and target tables and other instructions used for loading. Task groups are groups of tasks that are executed in a defined sequence.

Execution Plan: Defined on subject areas, an execution plan loads the tables in those areas. It includes the tasks involved and the parameters used to execute them.

Schedules: A schedule indicates how often an execution plan should run, as per the business requirement.

DAC Process Cycle:

DAC is used to design, execute, monitor, and diagnose execution plans.

Setup: Database connections, ETL setup in Informatica

Design: Define Tasks, Groups and Execution plans

Execute: Define parameters, schedule and run the execution plans

Monitor: Monitor the run-time executions

Diagnose: In case of task failure, identify the root cause and rerun the task.

Extracting and Loading Mechanisms:

DAC supports different combinations of data extraction and loading to give more flexibility over how the data is transformed and loaded.

Full Extract and Full Load: The first-time load of the data warehouse happens this way.

Full Extract and Incremental Load: This applies when you have to pull all data from a new source and load it into an existing table in the warehouse. It adds logic to decide whether to insert or update each record, to preserve data integrity.

Incremental Extract and Incremental Load: This is the regular process, typically run every night, which captures only the new or changed records, as sketched below.
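Conceptually, an incremental extract boils down to filtering the source on a last-update timestamp using a parameter supplied at run time. A rough sketch follows; the table, column, parameter name, and date format are illustrative, since the real ones come from the BI Applications mappings and the DAC configuration.

-- Illustrative incremental extract: pick up only rows changed since the last
-- successful extract. $$LAST_EXTRACT_DATE stands for a run-time parameter.
SELECT *
FROM   source_orders
WHERE  last_update_date > TO_DATE('$$LAST_EXTRACT_DATE', 'MM/DD/YYYY HH24:MI:SS');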

In the next posts let us see how to set up DAC for the first time, perform a full load, and manage users.

August 26, 2010

OBIEE 11g Information

Link for GA of OBIEE 11g

http://www.oracle.com/technetwork/middleware/bi-enterprise-edition/downloads/bus-intelligence-11g-165436.htm

Code Samples for OBIEE 11g

http://www.oracle.com/technetwork/middleware/bi-foundation/obiee-samples-167534.html

OBIEE 11g Documentation

http://www.oracle.com/technetwork/middleware/bi-enterprise-edition/documentation/index.html

Tutorials on Interactive Reporting

http://www.oracle.com/technetwork/middleware/bi-enterprise-edition/tutorials/index.html

August 19, 2010

Best Practices!!! No Thank You

There is a wonderful message in a recent video blog post from Frank Buytendijk, VP and Fellow at Oracle Corp, which I would like to share here.

It offers a rational approach to thinking about how to adopt best practices.

What do we mean by "best practices"? Something that worked best for you may or may not give the same results for others.

In terms of business processes and standards, blindly adopting best practices may not yield the best results, though mostly it yields average to good results.

But that doesn't mean we should reject best practices altogether, or that they are necessarily bad.

Strategically, best practices can be your starting point: thinking about how they fit your business and analyzing them properly may lead you to revise them, or you may end up innovating a better practice.

You can find more thoughtful business articles at the following links:

Frank’s Blog

Harvard Business Review

There is a wonderful section on Oracle's website about thought leadership. Please check that too.

Byee

August 18, 2010

Hierarchies in OBIEE 11g – Part 2

Continuing from my previous blog post on value-based hierarchies, let us see how to model ragged (unbalanced) and skip-level hierarchies in OBIEE.

In this sample implementation we will use the Products table from the Sales History (SH) sample schema that ships with Oracle Database. The Products entity has a hierarchy like the one below.

For demonstration purposes I removed a level (skip level) for a particular product; the hierarchy is also ragged, since its height is not the same everywhere (not all leaf nodes are at the last level).

Problems with Ragged & Skip Level:

In the ragged case we cannot define a fixed number of levels, since certain rows do not have all the levels. In this example the product Mini Camcorder does not have a subcategory.

In the skip-level case we again have to handle NULL values between levels; otherwise the row containing the skipped level gets dropped and does not appear in the output. In this example the same Mini Camcorder has a NULL value for subcategory, as illustrated below.
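To picture the problem, this is the kind of row the modified data produces; the query assumes the SH Products table with the level removed as described above.

-- With the subcategory removed for this product, the middle level comes back
-- NULL, which is what a fixed-level model would otherwise choke on.
SELECT prod_name, prod_subcategory, prod_category
FROM   products
WHERE  prod_name = 'Mini DV Camcorder with 3.5" Swivel LCD';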

OBIEE 11g has a way to handle NULL problems easily.

It’s just we have to mark those dimensions as Ragged or Skip level and rest will be taken care of.

Import the Product table and Sales table into physical layer. For demonstration purpose I had removed all other columns except product ID from the Sales table.

If not already then create physical join between Products and Sales table

(If you are not sure how to import tables from Oracle database then refer my previous blog post on Value Based Hierarchies )

Create the BMM layer with Products as a dimension and Sales as the fact. After pulling in the logical tables, right-click on Products to create a level-based hierarchy.

After creating the dimension we have to create the Product Category and Product Subcategory levels. Right-click on Product Detail to create a parent level.

Enter the level name and the number of elements, and click OK. Number of elements is a parameter the BI Server uses to estimate the number of members at each level; it does not need to be accurate. In our case we can think of a ratio like 1:10:40:100 (Product Total {the default grand total for all hierarchies} -> Category -> Subcategory -> Product).

Drag the subcategory columns to the newly created level, double-click on the level name, and create keys like below. We need to define a key for each level, so we choose the Subcategory ID and, for display purposes, the Subcategory Name (the "Use for Display" option).

In the same way, create one more parent level above Subcategory called Category (as exists in our scenario), pull the category columns to that level, and define the keys as we did for Subcategory.

Go back to the Product Detail level and create one more key for the description, since the product identifier is an ID column, just as above.

Overall, the dimension should look like below.

Now let us set the dimension as both Ragged and Skipped Levels, as shown below. Double-click on ProductDim to get the window below. This is the extra step that takes care of ragged and skip-level hierarchies.

Define the measure in the Sales fact as shown below.

Pull the BMM model and create the presentation layer (Just drag and drop the BMM model). Remove the identifier columns.

Now let’s go to answers and verify out work.

After logging into answers choose the product Hierarchical column and a measure and see how that product “Mini DV Camcorder with 3.5″ Swivel LCD” which don’t have sub category level got reported.

Now we are done with hierarchies. Next let's look at some of the new enhancements that were not there in OBIEE 10g. byeee

August 17, 2010

Hierarchies in OBIEE 11g – Part 1

It’s extremely exciting to see the breadth and depth of OBIEE 11g features. There are many things which were constrained till previous release, 11g fixed most of them.

Let us start with Hierarchies and see how to implement different types of hierarchies in BI 11g with step by step example. This article is specific to Value Based Hierarchies.

Hierarchy:

Different types of hierarchies exist in an organization's data, and they give more insight for our analysis.

1. Level Based Hierarchy:

The basic type of hierarchy, which has a fixed number of levels. The time dimension hierarchy is a good example:

Year -> Quarter -> Month

There are two special cases of level-based hierarchies.

· Ragged Hierarchy

Members of certain levels don't have child members, so the branches end at different depths.

· Skip Level Hierarchy

A level itself is not present for certain branches. In the diagram above, if we consider "direct" as a delivery mode directly under "wholesale", it becomes a skip-level hierarchy.

2. Value Based Hierarchy

This is also called a parent-child hierarchy. The typical reporting structure in an organization is a good example.

In this article let us see how to create a value based hierarchy step by step.

Believe me, creating a value-based hierarchy in 11g is like walking on ice :)

Let us consider the Employees table, which has a Manager -> Employee relationship, and model it.

Steps:

1. Importing the source table (Employees)

2. Creating the value-based hierarchy in the BMM layer

a. Define the parent (Manager_ID) and child (Employee_ID) columns

b. Generate the parent-child hierarchy table DDL statements

c. Generate the parent-child hierarchy table DML statements

d. Execute the DDL & DML to create the hierarchy table and insert data into it

3. Pull the BMM model into the presentation layer

4. Run Answers and check the output.

Make sure the Employees table has the structure below.
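As a reference for that structure, here is a minimal sketch of the columns this example relies on, assuming a table along the lines of Oracle's HR sample schema where MANAGER_ID points back to EMPLOYEE_ID; adjust it to your own Employees table.

-- Minimal sketch of the source table: MANAGER_ID is the parent key and
-- EMPLOYEE_ID is the member key used later in the hierarchy wizard.
CREATE TABLE employees (
  employee_id NUMBER        PRIMARY KEY,
  first_name  VARCHAR2(20),
  last_name   VARCHAR2(25),
  salary      NUMBER(8,2),
  manager_id  NUMBER        REFERENCES employees (employee_id)
);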

1. Importing the Employees table into the physical layer

Finally update the row count and test the database connectivity

2. Create value based hierarchy in BMM layer

Drag the Employees table twice into the BMM layer (OBIEE requires a star model) as shown below.

Rename the columns and do the basic cleanup

Right-click on the BMM model and join the Employees and Salary Information logical tables.

Now right-click on the Employees logical table and choose to create a new parent-child hierarchy.

Choose the member key (by default it takes the primary key) and the parent column as shown in the screenshot below.

Then click on Parent-Child Settings. This is where we generate the DDL & DML scripts that create and populate the hierarchy table, which the BI Server uses to report parent-child hierarchies.

Click on the new hierarchy table button (the middle one, yellow arrow). If you have already generated a hierarchy table, click on select (red arrow) and choose the table.

After clicking new, give names for the DDL and DML scripts. The wizard will create the SQL scripts.

Give a name for the hierarchy table.

Click Next and verify the scripts.

After finishing you can see that the details are mapped.

After finishing the wizard you can see that the HierarchyTable was imported automatically.

Now go to the script location, run the DDL & DML scripts, and commit the changes from a SQL command prompt. A sketch of the kind of table the wizard generates is shown below.
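The wizard generates the exact DDL for you; as a rough guide, the generated script typically creates a four-column relationship (closure) table along the following lines. The column names shown here follow the common convention and are for orientation only.

-- Typical shape of the generated parent-child relationship table: one row per
-- (member, ancestor) pair, plus a distance and a leaf indicator.
CREATE TABLE hierarchytable (
  member_key   NUMBER,     -- employee_id of the member
  ancestor_key NUMBER,     -- employee_id of one of its ancestors (NULL for roots)
  distance     NUMBER,     -- number of levels between member and ancestor
  is_leaf      NUMBER(1)   -- 1 when the member has no reports
);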

Update the row count and make sure the table was created properly.

3. Pull the BMM model into the presentation layer and create presentation folders.

Make sure that Hierarchy inside the Employees folder is visible.

Now we are done with the metadata definition. Save the RPD and make sure there are no consistency errors.

Start the WLS and BI services and make sure all of them are running fine.

4. Running Answers and verifying the hierarchy

Before logging into Answers, let's enable query logging so that we can check the physical SQL that gets generated while drilling down through the hierarchy.

Open the RPD in online mode and navigate to Manage->Identity and set the log level for Administrator.

Check in the changes and close the RPD.

Log in to the URL http://localhost:9704/analytics using Administrator/Admin123.

After logging in, since we changed the log level in online mode, we have to reload the metadata.

Click on Administration on the top right

Then click on Reload files and Metadata

Then go to our analysis as below.

Click on ValueBasedHierarchy.

Select the Hierarchy column from Employees and the salary measure from the fact.

Then click on results.

We are done and can see the hierarchy information like below.

This output just shows the salary for each employee along with their manager information. The result set is not an aggregated salary for each level; since the salary information pertains to each individual employee, it is not rolled up.

If you check the nqquery log, the BI Server generates only one physical query for any number of drill operations; the rest is a fully in-memory operation.

Let’s continue to see Ragged and Skipped level hierarchy in my next post.

Happy Learning BI J

July 21, 2010

Crash Course on Informatica part 2

In this article we will see how to use the PowerCenter client tools and create a sample transformation.

PowerCenter Client tools:

The client tools help us create, maintain, and run the ETL programs. There are five components in total.

Tool | Usage
Repository Server Administration Console | Creating repositories
Repository Manager | Managing repositories
Designer | Creating data transformations
Workflow Manager | Creating workflows that run the transformations
Workflow Monitor | Running and monitoring the workflows

Repository Server Administration Console:

Access Path:

We can create, delete, start, and stop repositories through the console.

Repository Manager:

Access Path:

We can use Repository manager for the following things.

i. Importing ETL metadata/objects from XML file

ii. Creating Folders which are containers for ETL objects

iii. Managing users & security related things

iv. Versioning the Repository

v. Analyzing dependencies between ETL objects

Designer:

Access Path:

This is the tool used for creating the ETL designs. It has five sub-components.

Component | Usage
Source Analyzer | Contains metadata about the data sources
Warehouse Designer | Contains metadata for the target (typically a warehouse) where data has to be loaded
Transformation Developer | Where we design different types of reusable data transformations
Mapplet Designer | Where we create mapplets (reusable groups of transformations). A typical use case: dimension lookups are performed while loading every fact, so for common dimensions we can write the lookup once and reuse it everywhere facts are loaded.
Mapping Designer | Where we combine the source, transformations, and target into one end-to-end mapping

We can navigate to each component from the Tools menu by clicking the appropriate tool.

Now let us look at each component, with its screenshot and functionality.

Source analyzer:

We can define a source as a database table, flat file, or XML file. Now let us see a sample of how to define a database source.

Step 1:

Define the ODBC connection for the corresponding database. In my case it is an Oracle 11g database.

Step 2:

Click on Import from Database, fill in the database credentials, and click Connect.

Step 3:

After selecting the table, click OK. Now we have defined the source.

Warehouse Designer:

Now let us create the definition for the target (warehouse). Let us say I have a table called w_countries into which I have to load the data.

Step 1:

In my case, for example purposes, the source and target use the same ODBC connection. If your target is a different database, you have to configure another ODBC connection for the target database.

Click on Import from Database, select the table, and then click OK.

Now we have defined the target.

Transformation Developer:

We have already defined the source and target. Now let us build a simple transformation just to propagate the data from source to target.

Let's choose Expression as the type and give the transformation a name.

Double-click on the transformation and enter the port information. Just to illustrate the usage of input, output, and variable ports, I created 4 extra ports.

Mapplet Designer:

Even though we have a simple use case of loading the source data into the target, I created a simple mapplet just to explain its usage. A mapplet must have Mapplet Input and Mapplet Output transformations, which act as the entry and exit points and are used when the mapplet is placed inside a mapping.

I simply created the two extra transformations (Mapplet Input and Mapplet Output) and connected them to our transformation. Now we will see how to use this mapplet in our mapping.

Mapping Designer:

A mapping is where the source and target are connected through transformations, and the mapping is what is used to run the actual ETL. In our case we will reuse the mapplet created in the previous step. Drag and drop the sources, targets, and the mapplet, and connect them. In the mapplet you will see only the input and output ports; the actual definition of the mapplet is hidden, but you can open it by right-clicking and choosing Open Mapplet. We also have to create one more transformation for transaction control: create a transformation of type "Transaction Control" and set its properties as in the screenshot below. This ensures the data loading gets committed.

Now we are done with creating the mapping. Next we will see how to run it by learning the Workflow Manager and Workflow Monitor.

WorkFlow Manager:

Access Path

This tool is used to create workflows, which contain tasks that run the mappings we designed in the Designer. It has three sub-components.

Component | Usage
Tasks | A task is a unit of work, such as running a mapping. There are many types of tasks; in my next post I will explain the purpose of each.
Worklets | Worklets are reusable groups of tasks, similar to mapplets, which help with reusability and maintainability.
Workflows | A workflow combines many tasks and runs them as a flow.

Tasks:

Let’s create a task for our sample case. Open the workflow manager and connect to the repository.

Click on the task developer from the menu

Click on the Tasks menu and click create

For now choose the type as Session, give the task a name, and click Create.

Choose our mapping from the list of mappings available in the repository.

Now the task is created. Next we need to create the source and target database (or flat file) connections, which the workflow will use when it runs.

Go to connections and click on Relational (as mine is Oracle 11g)

Choose Oracle and enter the database details.

In my case both source and target use the same database; otherwise we would have to configure one more connection for the target data source.

After configuring the connection, open the task properties and go to the Mapping tab. Enter the connection details for both source and target as shown below.

Then click OK. Now we are done with the task and can proceed to the worklet.

Worklets:

Worklets are reusable components, similar to mapplets, that contain tasks. We can create both reusable and non-reusable worklets based on the requirements.

For our case we will create a non-reusable worklet.

Go to Tools and choose Worklet Designer.

Then click to create a new worklet.

After creating the worklet, insert the session task we created earlier and connect it using a link.

Every time you save your work, make sure you don't get any errors in the output window.

WorkFlow Designer:

Access Path:

Similar to the worklet menu, click to create a new workflow.

Insert the worklet and connect it using links.

Now save the work and make sure there are no compile errors.

After creating our workflow, we need to assign a server to it; the server will run the workflow every time it is called.

Let us see how to create and assign the Informatica server to this workflow.

Server Configuration:

Follow the screenshot and click on server configuration

Enter the server details, the same as those given during the Informatica server setup.

After finishing the server configuration, disconnect from the repository and connect again. This ensures that all folders in the repository are closed and you won't get any errors while assigning workflows (maybe try the next step without doing this :) ).

Right click on the server and choose “Assign to Workflows”

Select the corresponding workflow and click assign.

Now we have completed the full cycle. Next let's go to the Workflow Monitor and test what we have developed so far.

WorkFlow Monitor:

This tool helps us monitor the workflows we have.

Access Path:

Connect to the repository as below

Start the server and make sure it's running.

Now let's go to the Workflow Manager and start the workflow we designed.

Once you click Start, you will be able to see the status in the monitor. On successful completion you will see a window similar to the one below.

Wow, we are done. The first ETL run completed successfully.

If you go to the Task View tab, you can get statistics for each task, like the number of rows loaded.

We are pretty much done with the basics of Informatica. In my next post let's look at the types of transformations and the types of tasks available in Informatica, with examples.

Please provide your feedback on the content.

Thanks..byee
