Friday, September 12, 2008

INFORMATICA POWER CENTER - A SHORT TUTORIAL

Let's learn Informatica PowerCenter in less than 30 minutes.

We will first discuss the basic overall architecture of the product, and then get some hands-on practice.

Informatica is a tool for ETL and data integration, built on a service-oriented architecture (SOA). SOA provides the ability to scale services and share resources across multiple machines.

Informatica has the ability to communicate with all major data sources and database vendors. It can move data to/from these data sources.

The PowerCenter Domain is the fundamental administrative unit. It is a collection of nodes and services. A node is the logical representation of a machine in a domain. Services are the processes that carry out the whole ETL process, e.g. authentication, authorization, repository management, integration services, etc.

Services for the domain include:

  • Service Manager: Manages all domain operations. It runs the application services and performs domain functions (e.g. authentication, authorization, and logging) on each node in the domain. The Service Manager uses TCP/IP to communicate with the application services.

  • Application services:

    · Repository Service: Manages the repository. The repository holds the metadata, i.e. the data about all the data and the ETL performed on it. The Repository Service retrieves, inserts, and updates metadata in the repository database tables.

    · Integration Service: Runs sessions and workflows. A session is a set of instructions that tells the Integration Service how to move data from sources to targets. A workflow is a set of instructions that tells the Integration Service how to execute tasks, such as sessions, email notifications, and shell commands. The Integration Service reads workflow information from the repository; it connects to the repository through the Repository Service to fetch metadata.

    · SAP BW Service: Listens for RFC requests from SAP BW and processes them.

    · Web Services Hub: Runs within the Service Manager; it receives requests from web service clients and processes them.

There are 2 types of nodes:

  1. Gateway Nodes: A gateway node can run application services, and it can serve as a master gateway node. The master gateway node is the entry point to the domain.
  2. Worker Nodes: A worker node is any node not configured to serve as a gateway. The Service Manager performs limited domain operations on a worker node.

Some other terms:

Mapping: A mapping depicts the flow of data between sources and targets.

Transformations: We add transformations to a mapping to define how the Integration Service extracts and transforms data before it loads a target.

A transformation is a part of a mapping that generates or modifies data. Every mapping includes a Source Qualifier transformation, representing all data read from a source and temporarily stored by the Integration Service. In addition, you can add transformations that calculate a sum, look up a value, or generate a unique ID before the source data reaches the target.

Source Qualifier: The source qualifier represents the rows that the Integration Service reads from the source when it runs a session.

Session: A session is a set of instructions that tells the Integration Service how to move data from sources to targets. You create a session for each mapping that you want the Integration Service to run.

Workflow: A workflow is a set of instructions that tells the Integration Service how to execute tasks, such as sessions, email notifications, and shell commands.

Above is some insight into how Informatica works. For a more in-depth look at the Informatica architecture, you can visit:

http://informatica.techtiks.com/informatica_architecture.html

The Informatica tool has 3 main components:

  1. Client tools: These tools are used for the development process. They include the following:

    · Designer: To define mappings and transformations.

    · Workflow Manager: To define sessions and workflows.

    · Workflow Monitor: To monitor the execution of sessions and workflows.

    · Repository Manager: To manage the repository.

  2. Server tools: Can be considered the staging area, where data from the sources is fetched, transformed, and then loaded into the target.

  3. Repository: The repository is a kind of data inventory where all the metadata related to mappings, sources, targets, etc. is kept.

Now that we are familiar with the overall description of the product, we are ready for some practical hands-on work. Follow the steps below to perform your ETL. These are highly generic steps; depending on your requirements, you will need to perform your own transformations.

1. Create a folder to organize your work.

2. Log on to the Repository in the Designer window.

3. Create (or import) the Source definition in the Repository database.

4. Create the Target definition in the Repository database. If the Target table does not exist, generate and execute the target-creation commands (a sample DDL sketch follows these steps).

5. Now perform the source to target mapping, with appropriate transformations.

6. Now create a session for the mapping and thereafter a workflow for its execution. This will be done in the Workflow Manager.

7. Now run the workflow and watch it in the Workflow Monitor. It will show 'Success' once the operation completes.
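For step 4, the target-creation commands are just ordinary DDL run against the target database. Below is a minimal sketch; the table name EMP_SALARY_TGT and the column types are only my assumptions for illustration, so adjust them to your own target schema.

======================================================================

-- Hypothetical target table; adjust names and data types to your own schema
CREATE TABLE EMP_SALARY_TGT (
    EMPLOYEE_ID      NUMBER(6)    NOT NULL,
    JOB_ID           VARCHAR2(10),
    DEPARTMENT_ID    NUMBER(4),
    DATE_ID          NUMBER(8),
    AMOUNT_CREDITED  NUMBER(10,2)
);

======================================================================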


Time is running out, so I will post a pictorial version of these steps in the next post.
Thanks, and comments are welcome.

Wednesday, August 20, 2008

The Zig-Zag SQL query

Once I was asked to solve a very peculiar problem: for each employee_id, find the current month's salary and the previous month's salary in a single row. The table looks like this:

employee_id  job_id      department_id  date_id   amount_credited
206          AC_ACCOUNT  110            20080731  8500
206          AC_ACCOUNT  110            20080431  8600
100          AD_PRES     90             20081131  26000
101          AD_VP       90             20080731  19000
102          AD_VP       90             20081231  20000
102          AD_VP       90             20080631  18000
121          ST_MAN      50             20080431  8500
102          AD_VP       90             20080831  20000
118          PU_CLERK    30             20081131  2700
118          PU_CLERK    30             20080431  2800
121          ST_MAN      50             20080531  8200
206          AC_ACCOUNT  110            20080831  8600
206          AC_ACCOUNT  110            20080631  8400
100          AD_PRES     90             20080531  24000
100          AD_PRES     90             20080431  27000
100          AD_PRES     90             20080331  26000
100          AD_PRES     90             20080231  25000
…            …           …              …         …

P.S.: Don't be surprised by the date_id values. They are constructed like that to ease understanding; some of them are not valid calendar dates, but assume they are correct for simplicity.

Here, amount_credited is the current month's salary.
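If you want to try the puzzle yourself, here is a rough sketch of how table_given might be created and populated; the column types are my own assumption (Oracle-style), inferred from the sample data above.

======================================================================

-- Assumed structure of table_given, based on the sample rows above
CREATE TABLE table_given (
    employee_id      NUMBER(6),
    job_id           VARCHAR2(10),
    department_id    NUMBER(4),
    date_id          NUMBER(8),
    amount_credited  NUMBER(10,2)
);

-- two of the sample rows (consecutive months for employee 206)
INSERT INTO table_given VALUES (206, 'AC_ACCOUNT', 110, 20080731, 8500);
INSERT INTO table_given VALUES (206, 'AC_ACCOUNT', 110, 20080831, 8600);

======================================================================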

How to go about this SQL puzzle? I did it like this:

======================================================================

-- First branch: join each month's row to the same employee's previous-month row
-- (date_id values for consecutive months differ by 100 in this data set)
SELECT aa.EMPLOYEE_ID,
       aa.DATE_ID,
       aa.AMOUNT_CREDITED "THIS_MONTH",
       bb.AMOUNT_CREDITED "LAST_MONTH"
FROM   table_given aa,
       table_given bb
WHERE  aa.EMPLOYEE_ID = bb.EMPLOYEE_ID
AND    aa.DATE_ID = bb.DATE_ID + 100
UNION ALL
-- Second branch: rows with no previous-month record get a NULL "LAST_MONTH"
SELECT aa.EMPLOYEE_ID,
       aa.DATE_ID,
       aa.AMOUNT_CREDITED "THIS_MONTH",
       NULL "LAST_MONTH"
FROM   table_given aa
WHERE  aa.DATE_ID - 100 NOT IN (SELECT DISTINCT DATE_ID FROM table_given)
ORDER BY 1, 2;

======================================================================

In formulating this query, I have taken into account the relationship among the date_id values: the date_id for consecutive months differs by 100.

e.g. 20080331 breaks down as 2008 (Year), 03 (Month), 31 (Day).
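As a quick (untested) sketch of how the pieces of date_id and the "+100" relationship can be pulled apart in SQL, against the same table_given:

======================================================================

SELECT date_id,
       FLOOR(date_id / 10000)           AS yr,   -- e.g. 2008
       FLOOR(MOD(date_id, 10000) / 100) AS mth,  -- e.g. 03
       MOD(date_id, 100)                AS dy,   -- e.g. 31
       date_id - 100                    AS prev_month_date_id
FROM   table_given;
-- note: this simple "minus 100" arithmetic does not carry across a year
-- boundary (e.g. January), so such rows fall into the NULL branch of the
-- query above.

======================================================================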

This query can be run on any database that follows the SQL standard. The zig-zag output produced is as follows:

employee_id date_id This Month Last Month
100 20080131 24000
100 20080231 25000 24000
100 20080331 26000 25000
100 20080431 27000 26000
100 20080531 24000 27000
100 20080631 25000 24000
100 20080731 26000 25000
100 20080831 27000 26000
100 20080931 24000 27000
100 20081031 25000 24000
100 20081131 26000 25000
100 20081231 27000 26000
101 20080131 17000
101 20080231 18000 17000
101 20080331 19000 18000
101 20080431 20000 19000
…………………. …………………………………. ……………………. ……………….

However, this query can be written more compactly and effectively if we use an Oracle database, which has many proprietary SQL extensions that go beyond the SQL standard. I leave this up to you to try your hand at. Hint: use Oracle's analytic SQL functions (e.g. LAG with an OVER (PARTITION BY ...) clause). Good luck!
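If you want to check your attempt afterwards, here is a minimal (untested) sketch of the analytic version, assuming the same table_given layout:

======================================================================

SELECT employee_id,
       date_id,
       amount_credited AS this_month,
       -- LAG looks back one row within each employee's partition,
       -- ordered by date_id, so the previous month's salary lines up
       LAG(amount_credited) OVER (PARTITION BY employee_id
                                  ORDER BY date_id) AS last_month
FROM   table_given
ORDER  BY employee_id, date_id;

======================================================================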

Perl Vs Python Fight

I was confused about what to start with: Perl or Python? Though I had done Perl before, I wanted to give Python a chance too, if it outperforms Perl. I browsed the web and compiled the following pros and cons for both:

1) Python tends to focus more on code aesthetics than performance; as a result, Python OpenGL developers continue to struggle with overcoming Python's array handling performance issues.

2) Perl is much simpler to port to/from C/C++/Java

3) Perl has a much larger repository of open source libraries

4) Perl has much stronger/faster string handling (dynamic shader programs)

5) Perl simplifies embedding values within text blocks

6) Every function in Python can easily be performed in Perl; the converse is not true, particularly in terms of string handling.

7) Python has no defined global constants

8) Python has no forward referencing/prototypes

9) Python has no simple method of retrieving a dictionary/hash element's key

10) Python's print has no global method for enabling unbuffered I/O

11) Python's formatted print requires lists to be first converted to tuples

12) Python's print has no means to suppress both trailing new lines and spaces

13) Python has no ifdef method (short of reformatting) to block out prototype code

After going through these comparisons, I decided to stick with Perl for now, and to learn Python later only to verify the claims above for myself.