Friday, June 5, 2015

Pentaho Kettle 5.3 CE and copying table data from Postgres to a MySQL database

My experience with Kettle is quite short. I had a use case where data in Postgres database tables should be transferred to a MySQL database. In a previous article I described how to set up a VirtualBox/Ubuntu 14.04 VM with MySQL and Postgres databases, so that work is useful for the current task.

The first step is to install Kettle. Installation is straightforward - unpack the zip into a new folder and this minimal setup is all you need. Kettle has an Eclipse-based visual editor, so a Java runtime is a prerequisite (Java 8 64-bit was fine on my Win8.1 laptop). To access different databases, JDBC libraries should be added to the KETTLE_HOME\lib directory (the Postgres driver is the only one already in place).

Some important points from my experience
  • There are different options for installing Kettle - keep everything as simple as possible and avoid the PDI database repository. In my experience a folder/XML based repository is good enough for development purposes and much easier to manage.
  • There is a huge collection of documentation on the internet (even YouTube videos), but the problem is that it covers different Kettle versions, and not all options of the commercial Kettle version are available in CE. My suggestion is to read the old Kettle wiki documentation and avoid the Pentaho EE documentation, because changes between Kettle versions are not that big. Another option is to download an older Kettle version, but I did not find any reason for that (version 4.4 is quite popular in forums).
  • If you have a previous installation of Kettle on the same computer, the files in the %USERPROFILE%\.kettle directory will be used by both installations (better to remove the older one).

Kettle task implementation

My task was simple - I have 2 tables with data in a Postgres database and I want to transfer the data in these tables to a MySQL database. I had previously installed a VirtualBox/Ubuntu 14.04 VM with MySQL and Postgres databases, with 2 tables of similar structure in both databases.

Because a simple transformation with 2 Table Input/Output steps would be too simple a solution, the idea was to solve the problem in a dynamic way. In the future any number of tables can be transferred using the same pattern (identical table structures in the 2 databases is a prerequisite) and only configuration changes are needed.

Get names and columns of tables from source DB

The first step is a transformation that reads table names and columns from the DB metadata; its output is a stream of rows consisting of table_name and table_columns (a comma-separated list of the table's column names).

The solution is a transformation with these steps:

Table Input -> Add Constants -> Row Denormalizer -> Copy Rows to result

Table Input.Connection = Postgres Db connection (previously saved in shared.xml)
Table Input.SQL = select table_name, column_name from information_schema.columns where table_schema = 'test' order by table_name, ordinal_position
# Get information from Postgres metadata


Add Constants (required for the denormalizer; adds a constant 1 to all rows in the stream)
Add Constants.Name = grouper
Add Constants.Type = Integer
Add Constants.Value = 1


Row Denormalizer.The key field = grouper
Row Denormalizer.Group field = table_name
Row Denormalizer.Target fieldname = table_columns
Row Denormalizer.Value fieldname = column_name
Row Denormalizer.Key value = 1
Row Denormalizer.Type = String
Row Denormalizer.Aggregation = Concatenate strings separated by ,

Row Denormalizer will concatenate the columns (achievable also by SQL, but we are trying to learn Kettle now :-)) and Copy Rows to result will make the current transformation's output available to the next step (transformation/job) as a stream. The idea is to pass this stream as parameters to the next step without persisting the information in any file/database.
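
For comparison, the equivalent table_name/table_columns rows can be produced with plain SQL on the Postgres side (a sketch using string_agg, available since Postgres 9.0):

select table_name,
       string_agg(column_name::text, ',' order by ordinal_position) as table_columns
from information_schema.columns
where table_schema = 'test'
group by table_name
order by table_name;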

The next step is a transformation that saves the data in the target DB.

Get dynamically data from source and save in target db

In this step we read data from a source database table and save it in the target database. This would be really simple if it were implemented in a static way:

Table Input -> Table Output

We will have the same steps in our transformation, but the table name and column names should come in as named parameters to this transformation. So we set these Transformation properties:

Transformation properties.parameters[0].Parameter = table_columns
Transformation properties.parameters[0].Default Value = i,d,t

Transformation properties.parameters[1].Parameter = table_name
Transformation properties.parameters[1].Default Value = test_data

# Default values are optional, but give the option to test the transformation without an input feed

Table Input.Connection = Postgres Db connection (previously saved in shared.xml)
Table Input.SQL = SELECT ${table_columns} FROM ${table_name}
Table Input.Replace variables in script = Check
# This trick will read columns and table name from named parameters

Table Output.Connection = Mysql connection (previously saved in shared.xml)
Table Output.Target schema = test
Table Output.Target table = ${table_name}
Table Output.Truncate table = Check
# We always truncate the table before loading; we assume source/target have the same field names, so field mapping is not needed.
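
With the default parameter values above, the second transformation effectively executes statements like these (shown only to illustrate the variable substitution; Table Output actually inserts row by row):

-- Table Input against Postgres, after ${...} substitution:
SELECT i,d,t FROM test_data
-- Table Output against MySQL: truncate once, then insert every incoming row:
TRUNCATE TABLE test.test_data;
INSERT INTO test.test_data (i, d, t) VALUES (?, ?, ?);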

Main job

Now we have 2 separate transformations, and every row of the first transformation's output must be fed into the second transformation to get results. So a loop over the first transformation's output should be implemented, with the second transformation started for every row.

There is a really simple and elegant solution in Kettle for this problem (even without a visible loop on screen). We need a new job with these elements:

Start -> Transformation 1 -> Transformation 2

No changes in the job properties are needed.

Transformation 1.job entry details.Transformation Spec.Transformation name = <first transformation filename>

Transformation 2.job entry details.Transformation Spec.Transformation name = <second transformation filename>
Transformation 2.job entry details.Advanced.Copy previous results to parameters = Check
Transformation 2.job entry details.Advanced.Execute for every input row = Check
Transformation 2.job entry details.Parameters[0].Parameter = table_columns
Transformation 2.job entry details.Parameters[0].Stream column name = table_columns
Transformation 2.job entry details.Parameters[1].Parameter = table_name
Transformation 2.job entry details.Parameters[1].Stream column name = table_name

Now the first transformation will trigger the second transformation for every row in its output stream and pass those values as named parameters. If you have data in the source database tables, running the Main job should transfer the data to the target tables.
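
The Main job can also be run outside the visual editor with Kettle's Kitchen command-line runner (a sketch; main_job.kjb stands for whatever filename you saved the job under):

Kitchen.bat /file:C:\kettle\main_job.kjb /level:Basic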

Conclusion

We have a very powerful solution to transfer data from a source database to a target one, as long as the table structures in both databases are identical. The main problem during development was figuring out which fields must be filled in to pass one transformation's output stream as input to the next transformation. It actually took many hours to get the second transformation's Parameters part right, because with debugging it is not easy to see where the problem is.

Honestly, I was surprised how simple and elegant a dynamic solution could be built in Kettle, and it may well become my favourite ETL tool in the future.


Install on Win8 with Vagrant: VirtualBox/Ubuntu 14.04 with MySQL and PostgreSQL databases inside

The goal was to have MySQL/Postgres databases accessible from the local computer. I do not like database server installations on my own laptop, and therefore a VM solution is preferable. This article describes the steps to get an Ubuntu 14.04 VirtualBox VM with MySQL and PostgreSQL servers inside.

On the host machine (Win8 laptop) the first steps are to install
  1. Vagrant 1.8.1
  2. VirtualBox 5.0.12
Vagrant is my favourite tool to manage VMs from the command line and VirtualBox is the best VM provider. After installation, open a new command-line window (cmder is my favourite, but cmd.exe is OK too) and make a new folder for the VM installation. I assume that vagrant.exe is in your PATH.

The first steps are to run from the command line on the Win8.1 machine
  • vagrant init ubuntu/trusty64
  • vagrant up --provider virtualbox
  • vagrant ssh
You will have a VM with Ubuntu 14.04 server, accessible from the command line with ssh. The next steps are to install the MySQL and Postgres databases in the Ubuntu server guest.

Install mysql database

  • sudo apt-get -y install mysql-server
During installation a root password must be provided (and it is better to remember it too :-)). After installation the MySQL server is up and running, but it is not accessible from the host machine.
  • sudo vi /etc/mysql/my.cnf and change the bind-address line to
    bind-address            = 0.0.0.0 and save the file
  • sudo service mysql restart
  • sudo netstat -tap | grep mysql  
Something like 
tcp        0      0 *:mysql                 *:*                     LISTEN      11462/mysqld
should appear on screen

For testing it is good to have a test user/database/tables
  • mysql -u root -p
    CREATE USER 'test'@'localhost' IDENTIFIED BY 'a';
    CREATE USER 'test'@'%' IDENTIFIED BY 'a';

    GRANT ALL ON *.* TO 'test'@'localhost';
    GRANT ALL ON *.* TO 'test'@'%';
       
    CREATE DATABASE IF NOT EXISTS test CHARACTER SET = 'utf8';
    USE test;
    create table `test_data` ( `i` bigint(20) NOT NULL, `d` datetime NOT NULL, `t` timestamp);
    create table `test_data_2` ( `seq` bigint(20) NOT NULL, `d` datetime NOT NULL);
Next step is optional, but I prefer to have phpMyAdmin also installed on the server, so
  • sudo apt-get -y install phpmyadmin
    Answer the prompts (web server: apache2) and provide passwords
The last step is to forward host ports to the guest server, so on the Win8 machine run
  • vagrant halt
  • Edit Vagrantfile and add port forwarding (see the sketch after this list)
    config.vm.network :forwarded_port, guest: 3306, host: 3306
    config.vm.network "forwarded_port", guest: 80, host: 8010
  • vagrant up
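
The forwarded_port lines belong inside the existing Vagrant.configure block, so the resulting Vagrantfile looks roughly like this (a minimal sketch):

Vagrant.configure(2) do |config|
  config.vm.box = "ubuntu/trusty64"
  config.vm.network :forwarded_port, guest: 3306, host: 3306
  config.vm.network "forwarded_port", guest: 80, host: 8010
end
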
After these steps MySQL should be up and running in the guest server and accessible from the host machine as well. You can first validate your MySQL server by opening http://localhost:8010/phpmyadmin in a Win8 browser, and then access the server with any client on Win8 (Oracle SQL Developer, for example).
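
If a mysql command-line client is available on the host (an assumption; any GUI client works just as well), the forwarded port can be checked directly from the Win8 command line:

mysql -h 127.0.0.1 -P 3306 -u test -p -e "SHOW DATABASES;"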

Install postgres server

On the guest machine (vagrant ssh from the command line) execute
  • sudo apt-get -y install postgresql postgresql-contrib
  • sudo vi /etc/postgresql/9.3/main/postgresql.conf
    listen_addresses = '*'  # uncomment this line
  • sudo vi /etc/postgresql/9.3/main/pg_hba.conf
    # IPv4 remote connections:
    host    all             all             0.0.0.0/0               md5
    # add previous line to config file and save file
  •  sudo service postgresql restart
For testing it is good to have a test user/database/tables
  • sudo -i -u postgres
  • psql
    CREATE USER test SUPERUSER ENCRYPTED PASSWORD 'a';
    CREATE DATABASE test OWNER = test ENCODING = 'UTF8';
    ALTER USER postgres ENCRYPTED PASSWORD 'a';
    \c test
    CREATE SCHEMA IF NOT EXISTS test AUTHORIZATION test;
    create table if not exists test.test_data (i int, d date, t timestamp);
    create table if not exists test.test_data_2 (seq serial, d date);
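
The Kettle example above needs some rows to transfer, so it may be worth inserting a couple of test rows while still in psql (illustrative values):

insert into test.test_data values (1, current_date, now());
insert into test.test_data_2 (d) values (current_date);
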
In Win8 run    
  • vagrant halt
  • Edit Vagrantfile and add port forwarding
    config.vm.network "forwarded_port", guest: 5432, host: 5432
  • vagrant up
After these steps the Postgres server should be up and running in the guest and accessible from the host machine. You can check Postgres with any client on Win8 (pgAdmin or Oracle SQL Developer, for example).
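
If a psql client is installed on the host (an assumption), the forwarded port can be verified the same way:

psql -h 127.0.0.1 -p 5432 -U test -d test -c "\dt test.*"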

Installed software

Software installed in host computer (Win8.1)
  • vagrant 1.8.1
  • VirtualBox 5.0.12
Software installed in guest computer (Ubuntu 14.04)
  • mysql 5.5.46
  • phpMyAdmin 4.0.10
  • PostgreSQL 9.3.10

Services accessible from Win8.1 (host)

#   5432 - postgres db (localhost:5432, test/a)
#   3306 - mysql db    (localhost:3306, test/a)
#   8010 - phpMyAdmin  (http://localhost:8010/phpmyadmin, root/a)

Saturday, January 4, 2014

Install Crashplan on Synology NAS DS112+


There is an article from the authors of the CrashPlan Synology package describing the installation process (PC Load Letter). This is my version of the installation process of CrashPlan on a DS112+.
  1. The first step is to add the package source: Package Center -> Settings -> Package Sources: PC Load Letter | http://packages.pcloadletter.co.uk.
  2. Next step is to install Java SE for Embedded 7.
  3. Manually download the Java SE for Embedded 7 archive that is suitable for your Synology NAS (ejre-7u45-fcs-b15-linux-arm-sflt-headless-26_sep_2013.tar.gz for my DS112+) and copy this file to the NAS public folder
  4. Run Package Center -> Community -> Java SE for Embedded 7 "Install". If the Java SE for Embedded 7 compressed file is available in the public folder, then everything should go smoothly.
  5. The next step is to install CrashPlan on the Synology NAS.
  6. Run Package Center -> Community -> CrashPlan "Install". After installation everything is in place, and the next step is to configure CrashPlan on the NAS server.
  7. Configuration should be done from the PC with SSH port forwarding.
  8. Run from PC command line
    ssh -L 4200:localhost:4243 root@<NAS ip-address>
    which will forward PC port 4200 to NAS port 4243
  9. Change CrashPlan ui.properties file (C:\Program Files\CrashPlan\conf\ui.properties in my case)
    servicePort=4200
  10. Run the CrashPlan user interface, change the NAS server parameters as necessary, and finally press "Save changes"
  11. Remove servicePort=4200 from the ui.properties file and configure the local CrashPlan (folders to back up and so on).
There is another good article describing the same process, if you find any problems during installation. There are also some articles with out-of-date information (especially the Synology wiki).
After running all these steps you are ready to back up your PC files/directories to the Synology NAS server as your configuration describes.

I have added some additional (optional) steps to my configuration. On the PC I have configured backup to run every day from 1:00 AM to 6:00 AM. On the NAS server I have added these 2 lines to /etc/crontab
0       0       *       *       *       root    /volume1/@appstore/CrashPlan/bin/CrashPlanEngine start
0       7       *       *       *       root    /volume1/@appstore/CrashPlan/bin/CrashPlanEngine stop

So CrashPlan will run on the NAS server every day from 0:00 AM to 7:00 AM.

Cron daemon restart is necessary to make crontab changes effective.

/usr/syno/etc.defaults/rc.d/S04crond.sh stop
/usr/syno/etc.defaults/rc.d/S04crond.sh start

Tuesday, December 31, 2013

How to load a CSV file into a MySQL database

There are some tricks to know when loading a CSV file into MySQL:
  1. mysql -u <user> -p --local-infile
    (the --local-infile switch is necessary, otherwise you will get an error during data loading: ERROR 1148 (42000): The used command is not allowed with this MySQL version)
  2. use <database>
  3. create the table script (create_tables.sql) in a text editor
    drop table kontakt;
    create table kontakt
     (klkood        bigint(20)
     ,kontakti_liik varchar(32)
     ,vaartus       varchar(100)
     ,kasutusluba   varchar(100)
     ,keeluaeg      varchar(100)
    );
    create index kontakt_klkood on kontakt(klkood);
  4. run the script from the mysql prompt
    \. create_tables.sql
  5. create the load data script (load_data.sql) in a text editor
    LOAD DATA LOCAL INFILE '/vagrant/uptime/EA_kliendi_kontaktinfo_3012.csv' INTO TABLE kontakt
    FIELDS TERMINATED BY ';'
    LINES STARTING BY '"' TERMINATED BY '"\r\n'
    IGNORE 1 LINES;
  6. run the script from the mysql prompt
    \. load_data.sql
  7. If there are warnings then execute
    show warnings;
For me the most difficult part was getting load_data.sql to work, especially the --local-infile parameter.
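
For reference, the LOAD DATA options above expect CSV rows shaped roughly like this (illustrative values, not the real file): each line starts with a double quote, fields are separated by semicolons, and the line ends with a double quote plus CRLF.

"10001;EMAIL;foo@example.com;Y;"
"10002;PHONE;5551234;N;2014-01-01"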

Thursday, December 5, 2013

How to install mysql and mysql workbench in Ubuntu 12.04

It is a really simple task to install MySQL server and MySQL Workbench on Ubuntu 12.04. You should run from the command line:

sudo apt-get install mysql-server
sudo apt-get install mysql-workbench

The first command installs the MySQL server, asks for a root password, and starts the mysql service. To restart the server, run

sudo service mysql restart

The second command installs mysql-workbench, and you are ready to access the MySQL server on localhost.
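
A quick way to verify that the server is up (it prompts for the root password given during installation):

mysql -u root -p -e "SELECT VERSION();"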

Tuesday, November 26, 2013

How to make Ubuntu Desktop Vagrant compliant

It seems reasonable to me to have a Vagrant-compliant Ubuntu Desktop 12.04, even if I will not start it from the command line. These steps are necessary:
  1. Create default user as vagrant/vagrant in Ubuntu setup
  2. Add admin group
    groupadd admin
    usermod -a -G admin vagrant
  3. update sudoers (sudo visudo) and add these lines
    Defaults:vagrant !requiretty
    Defaults env_keep = "SSH_AUTH_SOCK"
    %admin   ALL=(ALL:ALL) NOPASSWD: ALL
  4. Add a port forwarding rule in Virtualbox NAT as follows:
        Name: SSH
        Protocol: TCP
        Host IP:
        Host Port: 2222
        Guest IP:
        Guest Port: 22
  5. Install and configure OpenSSH Server
    sudo apt-get install -y openssh-server
    sudo vi /etc/ssh/sshd_config

    and add/modify these lines:
    Port 22
    PubKeyAuthentication yes
    AuthorizedKeysFile %h/.ssh/authorized_keys
    PermitEmptyPasswords no
    PasswordAuthentication no

    restart the ssh server: sudo service ssh restart
  6. Download the Vagrant public and private keys
    $ mkdir -p /home/vagrant/.ssh
    $ chmod 0700 /home/vagrant/.ssh
    $ wget --no-check-certificate \
        https://raw.github.com/mitchellh/vagrant/master/keys/vagrant.pub \
        -O /home/vagrant/.ssh/authorized_keys
    $ chmod 0600 /home/vagrant/.ssh/authorized_keys
    $ wget --no-check-certificate \
        https://raw.github.com/mitchellh/vagrant/master/keys/vagrant \
        -O /home/vagrant/.ssh/id_rsa
    $ chmod 0600 /home/vagrant/.ssh/id_rsa
    $ chown -R vagrant /home/vagrant/.ssh
  7. Remove a udev file to fix network issues afterwards
    sudo rm -f /etc/udev/rules.d/70-persistent-net.rules
  8. Comment out HWADDR row in /etc/sysconfig/network-scripts/ifcfg-eth1 (to fix https://github.com/mitchellh/vagrant/issues/1777 issue)
After that you are ready to package a new Vagrant box, as sketched below. There are several articles describing the same process in depth.
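
Packaging itself is done from the host command line; a sketch, assuming the VM is named ubuntu-desktop-1204 in VirtualBox:

vagrant package --base ubuntu-desktop-1204 --output ubuntu-desktop-1204.box
vagrant box add ubuntu-desktop-1204.box --name ubuntu-desktop-1204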

Quick file search in Ubuntu

The quick answer to this simple question is

 locate <filename>

and to update the locate database

sudo updatedb

As always, there is also a long answer.