Friday, June 5, 2015

Pentaho Kettle 5.3 CE and copying table data from a Postgres to a MySQL database

My experience with Kettle is quite short. I tried to handle a use case where data in Postgres database tables should be transferred to a MySQL database. In a previous article I described how to set up VirtualBox/Ubuntu 14.04 with MySQL and Postgres databases, so that work is useful for the current task.

First step is to install Kettle
Installation is straightforward: unpack the zip into a new folder, and this minimal setup is all you need. Kettle has an Eclipse-based visual editor, so a Java runtime is a prerequisite (Java 8 64-bit was OK on my Win8.1 laptop). To access different databases, JDBC libraries should be added to the KETTLE_HOME\lib directory (the Postgres driver is the only one already in place).

Some important points from my experience
  • There are different options for how to set up Kettle - keep everything as simple as possible and avoid the database repository. In my experience, a folder/XML-based repository is good enough for development purposes and much easier to manage.
  • There is a huge collection of documentation on the internet (even YouTube videos), but the problem is that it covers different Kettle versions, and not all options of the commercial Kettle version are available in the CE version. My suggestion is to read the old Kettle wiki documentation and avoid the Pentaho EE documentation, because the changes between Kettle versions are not that big. Another option is to download an older Kettle version, but I did not find any reason for that (the 4.4 version is quite popular in forums).
  • If you have a previous installation of Kettle on the same computer, then the files in the %USER%/.kettle directory will be used by both installations (it is better to remove the older one).

Kettle task implementation

My task was simple - I have 2 tables with data in a Postgres database, and I want to transfer the data in these tables to a MySQL database. I had previously installed VirtualBox/Ubuntu 14.04 with MySQL and Postgres databases, with 2 tables of similar structure in both databases.

Because a simple Transformation with 2 Table Input/Output steps is too simplistic a solution, the idea was to solve the problem in a dynamic way. In the future, any number of tables can be transferred using the same pattern (identical table structures in the 2 databases is a prerequisite), and only configuration changes are needed.

Get names and columns of tables from source DB

The first step is to have a transformation that reads table names and columns from the DB metadata; the output is a stream of rows consisting of table_name and table_columns (table_columns is the list of the table's column names separated by commas).

The solution is a transformation:

Table Input -> Add Constants -> Row Denormalizer -> Copy Rows to Result

Table Input.Connection = Postgres DB connection (previously saved in shared.xml)
Table Input.SQL = select table_name, column_name from information_schema.columns where table_schema = 'test' order by table_name, ordinal_position
# Get information from Postgres metadata


Add Constants (required for the Row Denormalizer, which needs a key field to pivot on; adds a constant 1 to all rows in the stream)
Add Constants.Name = grouper
Add Constants.Type = Integer
Add Constants.Value = 1


Row Denormalizer.The key field = grouper
Row Denormalizer.Group field = table_name
Row Denormalizer.Target fieldname = table_columns
Row Denormalizer.Value fieldname = column_name
Row Denormalizer.Key value = 1
Row Denormalizer.Type = String
Row Denormalizer.Aggregation = Concatenate strings separated by ,

Row Denormalizer will concatenate the columns (achievable also with SQL, but we are trying to learn Kettle now :-)), and Copy Rows to Result will make the current transformation's output available to the next step (transformation/job) as a stream. The idea is to pass this stream as parameters to the next step without persisting the information in any file/database.
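
For comparison, the same output could be produced with plain SQL on the Postgres side; a minimal sketch using string_agg (available since Postgres 9.0):

select table_name,
       string_agg(column_name, ',' order by ordinal_position) as table_columns
from information_schema.columns
where table_schema = 'test'
group by table_name
order by table_name;
-- Expected output for the test schema from the setup article below:
--   test_data   | i,d,t
--   test_data_2 | seq,d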

The next step is to have a transformation that saves the data into the target DB.

Get data dynamically from source and save in target DB

In this step we will read data from the source database table and save it into the target database. This would be really simple if it were implemented in a static way:

Table Input -> Table Output

We will have the same components in our transformation, but the table name and column names should come in as named parameters. So in the Transformation properties we define:

Transformation properties.parameters[0].Parameter = table_columns
Transformation properties.parameters[0].Default Value = i,d,t

Transformation properties.parameters[1].Parameter = table_name
Transformation properties.parameters[1].Default Value = test_data

# Default value is optional, but gives the option to test the transformation without an input feed

Table Input.Connection = Postgres DB connection (previously saved in shared.xml)
Table Input.SQL = SELECT ${table_columns} FROM ${table_name}
Table Input.Replace variables in script = Check
# This trick reads the columns and table name from the named parameters
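
With the default parameter values above, the variable substitution should resolve the statement to:

SELECT i,d,t FROM test_data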

Table Output.Connection = MySQL connection (previously saved in shared.xml)
Table Output.Target schema = test
Table Output.Target table = ${table_name}
Table Output.Truncate table = Check
# We always truncate the table before loading; we assume that source/target have the same field names, so field mapping is not needed.
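
On the MySQL side this is roughly equivalent to the following statements (a sketch of the effective SQL with the default parameter values; Table Output actually loads the rows through a batched JDBC prepared insert):

TRUNCATE TABLE test.test_data;
INSERT INTO test.test_data (i, d, t) VALUES (?, ?, ?); -- executed once per row of the input stream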

Main job

Now we have 2 separate transformations, and we need to feed every row of the first transformation's output into the second transformation to get the results. So a loop over the first transformation's output should be implemented, starting the second transformation for every row.

There is a really simple and elegant solution in Kettle for this problem (even without any visible loop on the screen). We need a new job with these elements:

Start -> Transformation 1 -> Transformation 2

No changes are needed in the job properties.

Transformation 1.job entry details.Transformation Spec.Transformation name = <first transformation filename>

Transformation 2.job entry details.Transformation Spec.Transformation name = <second transformation filename>
Transformation 2.job entry details.Advanced.Copy previous results to parameters = Check
Transformation 2.job entry details.Advanced.Execute for every input row = Check
Transformation 2.job entry details.Parameters[0].Parameter = table_columns
Transformation 2.job entry details.Parameters[0].Stream column name = table_columns
Transformation 2.job entry details.Parameters[1].Parameter = table_name
Transformation 2.job entry details.Parameters[1].Stream column name = table_name

Now the first transformation will effectively call the second transformation for every row in its output stream and pass the values as named parameters. If you have data in the source database tables, then running the Main job should transfer the data to the target tables.
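
As a quick sanity check you can compare row counts after the job has run; these queries assume the test tables from the setup article below and should return the same counts in MySQL and Postgres:

SELECT COUNT(*) FROM test.test_data;
SELECT COUNT(*) FROM test.test_data_2;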

Conclusion

We have a very powerful solution to transfer data from a source database to a target one if the table structures in both databases are identical. The main problem during development was to figure out which fields must be filled in to pass one transformation's output stream as input to the next transformation. It actually took many hours to get the second transformation's Parameters part right, because while debugging it is not easy to see where the problem is.

Honestly, I was surprised by how simple and elegant a dynamic solution could be developed in Kettle, and it will hopefully be my favourite ETL tool in the future.


Install on Win8 with Vagrant a VirtualBox/Ubuntu 14.04 VM with MySQL and PostgreSQL databases inside

The goal was to have MySQL/Postgres databases accessible from the local computer. I do not like database server installations on my own laptop, and therefore a VM solution is preferable. This article describes the steps to get an Ubuntu 14.04 VirtualBox VM with MySQL and PostgreSQL servers inside.

On the host machine (Win8 laptop) the first steps are to install
  1. Vagrant 1.8.1
  2. VirtualBox 5.0.12
Vagrant is my favourite tool to manage VMs from the command line, and VirtualBox is the best VM provider. After installation, open a new command-line window (cmder is my favourite, and cmd.exe is OK too) and make a new folder for the VM installation. I assume that vagrant.exe is in your PATH.

First run these from the command line on the Win8.1 machine:
  • vagrant init ubuntu/trusty64
  • vagrant up --provider virtualbox
  • vagrant ssh
You will have a VM with Ubuntu 14.04 server accessible from the command line over SSH. The next steps are to install the MySQL and Postgres databases in the Ubuntu server guest.

Install MySQL database

  • sudo apt-get -y install mysql-server
During installation a root password must be provided (and it is better to remember it too :-)). After installation the MySQL server is up and running, but it is not accessible from the host machine.
  • sudo vi /etc/mysql/my.cnf and change the bind-address line to
    bind-address            = 0.0.0.0 and save the file
  • sudo service mysql restart
  • sudo netstat -tap | grep mysql  
Something like 
tcp        0      0 *:mysql                 *:*                     LISTEN      11462/mysqld
should appear on screen

For testing it is good to have a test user/database/table:
  • mysql -u root -p
    CREATE USER 'test'@'localhost' IDENTIFIED BY 'a';
    CREATE USER 'test'@'%' IDENTIFIED BY 'a';

    GRANT ALL ON *.* TO 'test'@'localhost';
    GRANT ALL ON *.* TO 'test'@'%';

    CREATE DATABASE IF NOT EXISTS test CHARACTER SET = 'utf8';
    USE test;
    CREATE TABLE `test_data` ( `i` bigint(20) NOT NULL, `d` datetime NOT NULL, `t` timestamp);
    CREATE TABLE `test_data_2` ( `seq` bigint(20) NOT NULL, `d` datetime NOT NULL);
The next step is optional, but I prefer to have phpMyAdmin installed on the server as well, so
  • sudo apt-get -y install phpmyadmin
    Answer the prompts (web server: apache2) and provide the passwords
The last step is to forward host ports to the guest server, so on the Win8 machine run
  • vagrant halt
  • Edit Vagrantfile and add port forwarding
    config.vm.network "forwarded_port", guest: 3306, host: 3306
    config.vm.network "forwarded_port", guest: 80, host: 8010
  • vagrant up
After these steps MySQL should be up and running in the guest server and accessible from the host machine as well. You can first validate your MySQL server by opening http://localhost:8010/phpmyadmin in a Win8 browser, and then access the server with any client on Win8 (Oracle SQL Developer, for example).

Install Postgres server

In the guest machine (vagrant ssh from the command line) execute
  • sudo apt-get -y install postgresql postgresql-contrib
  • sudo vi /etc/postgresql/9.3/main/postgresql.conf
    listen_addresses = '*'  # uncomment this line
  • sudo vi /etc/postgresql/9.3/main/pg_hba.conf
    # IPv4 remote connections:
    host    all             all             0.0.0.0/0               md5
    # add previous line to config file and save file
  • sudo service postgresql restart
For testing it is good to have a test user/database/table:
  • sudo -i -u postgres
  • psql
    CREATE USER test SUPERUSER ENCRYPTED PASSWORD 'a';
    CREATE DATABASE test OWNER = test ENCODING = 'UTF8';
    ALTER USER postgres ENCRYPTED PASSWORD 'a';
    \c test
    CREATE SCHEMA IF NOT EXISTS test AUTHORIZATION test;
    -- schema-qualify the tables: the session runs as user postgres, so without
    -- the test. prefix they would land in the public schema and the Kettle
    -- metadata query (table_schema = 'test') would not find them
    CREATE TABLE IF NOT EXISTS test.test_data (i int, d date, t timestamp);
    CREATE TABLE IF NOT EXISTS test.test_data_2 (seq serial, d date);
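
Optionally, still in the same psql session, insert a few sample rows (arbitrary example values) so the Kettle job described above has something to transfer:
    INSERT INTO test.test_data (i, d, t) VALUES (1, CURRENT_DATE, now());
    INSERT INTO test.test_data (i, d, t) VALUES (2, CURRENT_DATE, now());
    INSERT INTO test.test_data_2 (d) VALUES (CURRENT_DATE);
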
In Win8 run    
  • vagrant halt
  • Edit Vagrantfile and add port forwarding
    config.vm.network "forwarded_port", guest: 5432, host: 5432
  • vagrant up
After these steps the Postgres server should be up and running in the guest server and accessible from the host machine. You can check Postgres with any client on Win8 (pgAdmin or Oracle SQL Developer, for example).

Installed software

Software installed in host computer (Win8.1)
  • Vagrant 1.8.1
  • VirtualBox 5.0.12
Software installed in guest computer (Ubuntu 14.04)
  • MySQL 5.5.46
  • phpMyAdmin 4.0.10
  • PostgreSQL 9.3.10

Services accessible from Win8.1 (host)

  • 5432 - Postgres DB (localhost:5432, test/a)
  • 3306 - MySQL DB (localhost:3306, test/a)
  • 8010 - phpMyAdmin (http://localhost:8010/phpmyadmin, root/a)