Tutorial to create a Fully Automated Open-Source Work-flow for Statistics and Data Mining

📅 Sunday, March 30, 2014

It is said that the typical person of our times consumes in one day as much information as a typical person in the Middle Ages would digest in his or her whole life. And this is only the visible tip of the iceberg: the information that we see and get. Beneath the surface of these calm waters lies an ever-growing ocean of information: the information gathered by electronic machines. From the app on our mobile phone counting our steps and pulling our position from the GPS to the transactional system of our bank collecting our payment behaviour: every second loads of data are generated. Each person generates daily multiple mega- or even gigabytes of new and unique information. In order to make sense of this we need to understand the underlying trends and concepts that are concealed in these vast amounts of data: in other words we “need to run some statistics”, or do some “data crunching”.

There are very good commercial systems available that will do exactly this. Probably the best known is the information-age pioneer and industry standard: the Statistical Analysis System (better known as SAS). It will make your wallet a few million dollars lighter, and you get an interface that is an interesting mix of programming languages that were already outdated in the 1980s (as in “not even object oriented”) and the most modern, intelligent interfaces. It comes with loads of support and it allows you to have a dynamic view, or a pre-generated set of reports, at your fingertips in your favourite browser. SAS can run on a portable computer, a server or a super-computer; it is powerful and reliable. And above all it is the industry’s standard, so one can safely think: “no one ever gets fired for buying SAS”.

Of course there are alternatives, like Stephen Wolfram’s “Mathematica”. This is a more modern tool that actually does statistics and reporting as something additional on top of its real calling: mathematical (symbolic) analysis. It will cost ten to a hundred times less than SAS (typically under $100’000), but you might want to combine it with a database system. Oh, and if you want to get a flavour: it comes free with the Raspberry Pi! So it will cost you less than your next lunch to get started.

And of course there is an Open Source alternative (or, as is so typical for open source, there is a multitude of possible combinations and configurations that can suit your needs). I will not give an overview of possibilities or combinations (although mentioning a few cannot be avoided), but in this tutorial I will try to give you one working solution.

The idea is more or less as follows:

  1. Collect data in a database (we’ll use MySQL as an example, but e.g. PostgreSQL is also great, and it is ACID compliant; the relevance of this is that you would never need a function such as the sql.reset.query.cache() mentioned in our R code)
  2. Build a data mart with SQL
  3. Build our model (or perform statistical analysis) with R
  4. Present data in PDF via LaTeX (or HTML)

This stack has a few nice aspects and advantages:

  1. It can be fully automated (every day your company’s website can reflect the latest result without human interference!)
  2. It is completely free (gratis) and open source (so you can be sure that there are no backdoors as you can download code, check and compile yourself)
  3. From start to end we use tools that are quite ubiquitous, and for each of them there is an active community to support you … also for free. That is of course only relevant if Google cannot help you.

Step-by-step guide to install the open stack.

  1. The first thing that you will need is a working computer with a good operating system. Everything below should work on a Windows, OS X or FreeBSD machine, but our guidance is for Linux. We assume that you have a working distribution on your PC, that it is booted up and that you know how to install packages (we assume a Debian-based distribution where necessary to make the examples concrete).
  2. Next we need to install the specific software that we will use. If you installed, for example, the Ubuntu Server Edition, then you will not have to install anything; however, if you started from the bare necessities, you might need the following.
    # to be executed as root or insert "sudo" before each line
    #-- MySQL, the database
    apt-get install mysql-server -y 
    apt-get install mysql-client -y 
    apt-get install phpmyadmin -y 
    apt-get install mysql-workbench -y  # if you prefer a GUI for writing queries
    # -- R:
    apt-get install r-base -y
    apt-get install r-base-dev -y
    apt-get install r-cran-xtable -y
    apt-get install r-cran-msm -y
    apt-get install libmysqlclient-dev -y # this is needed for RMySQL !!
    # -- LaTeX:
    apt-get install texlive-latex-base -y 
    apt-get install latex2rtf -y 
    apt-get install latex2html -y 
    apt-get install biblatex -y 
    apt-get install texlive-lang-polish -y
    # -- eventually the text-editor of your choice
    apt-get install kdevelop kate -y
    # -- and also we want to use pdfcrop, and something to visualize our PDF output so we need also 
    apt-get install pdftk   # needed for pdfcrop (and other goodies) but does not include pdfcrop itself
    apt-get install okular  # the PDF viewer
    
    Now you still have to download the Perl script “pdfcrop” from SourceForge, make it executable and copy it, for example, to /usr/local/bin:
    cp pdfcrop /usr/local/bin/
    chmod 755 /usr/local/bin/pdfcrop
    
  3. By now you should have a working environment with all necessary software installed. The next step is to set up a database, get your data and import it in the database. One caveat is that MySQL needs to be able to find your file to upload and it should be allowed to read your files. For me the following worked:
    echo "Please enter your sudo password"
    sudo cp MY_DATA.csv /tmp
    sudo chown mysql:mysql /tmp/*csv
    
    Note 1: Sure, for smaller datasets you can use phpMyAdmin to upload the data. For larger datasets only the MySQL terminal will work reliably … and of course only the terminal can be fully automated (see below how).
    Note 2: If you are working on a machine that has only a command line interface, then you might want to use vi as a text editor and of course you won’t visualize the PDF files … but apart from that everything below works in a terminal (you can do without a GUI!). Actually, during development a GUI is most useful, but once you want to automate the flow nothing can compete with the CLI. You will have the comfort that the work-flow described here can be copied, for example, to a headless server that has no GUI.
  4. Now, upload the data into the database. Of course the code snippet below is just a canvas: you will have to replace not only MY_DATABASE, tbl_TEST, etc. with your names, but also field names, file names, field terminators, etc.
    USE MY_DATABASE;

    DROP TABLE IF EXISTS tbl_TEST;
    CREATE TABLE tbl_TEST
    (
      account_number   CHAR(12) NOT NULL,
      total_balance_01 DECIMAL(12,2),
      PRIMARY KEY (account_number)
    ) ENGINE = INNODB
      COLLATE 'utf8_general_ci';

    LOAD DATA INFILE '/tmp/MY_DATA.csv' INTO TABLE tbl_TEST
      FIELDS TERMINATED BY ';'
      ENCLOSED BY '"'
      LINES TERMINATED BY '\n'
      IGNORE 1 LINES;
    
  5. Typically you will want to prepare the data and create some fields that contain aggregated or otherwise logically derived information (for example, if you model a lending portfolio, you will want to create fields indicating whether an account, at a certain month, was ever in a delinquent state); a minimal sketch of such a step follows after the code below. This is handy as it will reduce compute time further down the road. Another handy hint is that you can simply write all your SQL code in a text file (call it data/build_balance.sql for example) and invoke it from the MySQL console by typing
    source data/build_balance.sql
    
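    For illustration only, such a derived field could be built along the following lines; the table tbl_PAYMENTS, the column days_past_due and the 90-day threshold are purely hypothetical and depend on your own schema:

    -- hypothetical example: flag accounts that were ever seriously delinquent
    ALTER TABLE tbl_TEST ADD COLUMN ever_delinquent TINYINT(1) NOT NULL DEFAULT 0;
    UPDATE tbl_TEST t
      JOIN tbl_PAYMENTS p ON p.account_number = t.account_number
      SET t.ever_delinquent = 1
      WHERE p.days_past_due > 90;
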
  6. And we’re ready to link our database to R, or otherwise stated access our database from within R. To achieve this, do the following.

    1. Get the necessary packages and load them:
      install.packages('RMySQL')
      library(RMySQL)
      
    2. Then, I would advise creating a text file and storing the following functions in it:
      # -------------------------------------------
      # -- function to open the database
      sql.login <- function(theUser = "root", thePwd = "", theDB = "MY_DATABASE", theHost = "localhost")
      {
        if (thePwd == "") {
          cat("Enter MySQL root Password: ")
          thePwd <- readline()
        }
        con <- dbConnect(MySQL(), user = theUser, password = thePwd, dbname = theDB, host = theHost)
        con
      }
      # -------------------------------------------
      # -- close the database connection
      sql.close <- function(con) { dbDisconnect(con) }
      # -------------------------------------------
      # -- get the SQL data (returns a data.frame)
      sql.get <- function(sSQL, NBR = 500000)
      {
        rs   <- dbSendQuery(con, sSQL)
        data <- fetch(rs, n = NBR)
        huh  <- dbHasCompleted(rs)
        dbClearResult(rs)
        data
      }
      # -------------------------------------------
      # -- execute a query that does not return anything (used for UPDATE, CREATE, etc.)
      sql.run <- function(sSQL)
      {
        rs <- dbSendQuery(con, sSQL)
      }
      # -------------------------------------------
      # -- reset the query cache
      sql.reset.query.cache <- function()
      {
        system("sync && echo 3 | sudo tee /proc/sys/vm/drop_caches")
        rc <- dbSendQuery(con, "RESET QUERY CACHE;")
        system("sync && echo 3 | sudo tee /proc/sys/vm/drop_caches")
      }
      
    3. If you call this file for example functions_sql.R, then you can load this functionality in R with the command
      source('functions_sql.R')
      
      That allows you to re-use code in more than one project. So, what you probably will want to do is build a file that loads all your packages, loads functions_sql.R and then runs all your analysis. We’ll refer to this file as calc_Balance.R; a minimal sketch follows below.
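      The following is only an illustrative skeleton of such a file, assuming the directory layout and helper functions used in this tutorial; the package list and the hard-coded password are placeholders:

      # calc_Balance.R -- illustrative skeleton
      library(RMySQL)
      library(xtable)
      library(ggplot2)
      source('functions_sql.R')     # the helper functions defined above
      con <- sql.login(thePwd = "XXXXXX", theDB = "MY_DATABASE")
      # ... the actual analysis goes here, writing its output to tex/inc and tex/img ...
      sql.close(con)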
    4. Now the database can be used from within R.
      con <- sql.login(thePwd = "XXXXXX", theDB = "MY_DATABASE")
      bal <- sql.get("SELECT total_balance_01 FROM tbl_TEST;")$total_balance_01
      summary(bal)
      plot(quantile(bal))
      
  7. This allows us to use the strengths of R combined with the strengths of our data mart to do our analysis, build our model, or whatever we are interested in or paid for. The trick now is to make sure that the output is generated in a format that can be picked up by LaTeX. There are at least three simple possibilities.

    1. Pre-formatted text. This might not be very elegant, but it is fast and it “just works”. This consists of transferring how something looks in the R console to our report. For example:
      sink("./tex/inc/summary.tex")
      summary(bal)
      sink()
      
    2. Formatted tables. For example use the print function for the xtable object:
      install.packages('xtable')
      library(xtable)
      M <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3, byrow = TRUE,
                  dimnames = list(c("row1", "row2"), c("col1", "col2", "col3")))
      sink("tex/inc/matrix.tex")
      x <- xtable(M, align = "|lrrr|", digits = 2, caption = "my caption.")
      print(x)
      sink()
      
    3. Graphs. Also here it is sufficient to print the plot to a file that can be loaded by LaTeX. For example:
      install.packages('ggplot2')  # this has to be done only once!
      library(ggplot2) # this should move to the headers of calc_Balance (or better in a headers.R that is called from there)
      d <- data.frame(bal)
      p <- ggplot(d, aes(x=bal)) + geom_histogram()
      ggsave("tex/img/graph.pdf")
      
  8. Now these elements can be added to your LaTeX (or HTML) report. Note that we assume that the calculations are done in a certain directory and that this directory has a subdirectory tex, which in turn contains img and inc. A minimal skeleton could look as follows:

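The sketch below is only an illustration that pulls in the files generated above; it assumes you compile with pdflatex from within the tex directory, and it uses the verbatim package to include the raw R console output:

    \documentclass{article}
    \usepackage{graphicx}
    \usepackage{verbatim}

    \begin{document}

    \section{Summary of the balances}
    \verbatiminput{inc/summary.tex}

    \section{Results}
    \input{inc/matrix.tex}

    \begin{figure}[h]
      \centering
      \includegraphics[width=0.8\textwidth]{img/graph.pdf}
      \caption{Distribution of the balance.}
    \end{figure}

    \end{document}
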
And name this file, for example, model_doc_Balance.tex.

Typically one will want to re-run a model and its validation information (statistics such as the Kolmogorov-Smirnov test, etc.) on a regular basis when new data becomes available.

Now that we have the work-flow in place it can be automated end-to-end, and even starting the work-flow can be automated. To complete this it is sufficient to create one bash file that runs all previous steps sequentially. Assuming that we have, for each of the previous steps, the following files,

  1. data/prepare_Balance.sh (to do the initial cleanup of the data via sed and awk, for example inserting commas in a fixed-width file, etc.)

  2. data/import_Balance.sql to prepare the data fields that we want to use.

  3. calc_Balance.R to do all the statistical analysis (eventually using a header file to load packages and our functions_sql.R file, and a configuration file to define the specific parameters and names) and spitting out the text and graphs into the tex/inc and tex/img subdirectories.

  4. tex/model_doc_Balance is your model documentation booklet in LaTeX

  5. tex/pres_Balance is the presentation about the model (slides in the “Beamer” class of LaTeX for example)

and assuming that you use Okular to visualize the PDF files in the last lines, then this “do-everything file” could look as follows:

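The following is only a sketch; the file names follow the list above, and the exact LaTeX run sequence (bibtex, makeindex for the nomenclature, multiple pdflatex passes) depends on the packages you actually use:

    #!/bin/bash
    # calc.sh -- run the whole work-flow end-to-end

    # -- 1. prepare and import the data
    sh data/prepare_Balance.sh
    mysql -u root -p MY_DATABASE < data/import_Balance.sql   # use e.g. a ~/.my.cnf file to avoid the password prompt

    # -- 2. run the statistical analysis (writes its output to tex/inc and tex/img)
    Rscript calc_Balance.R

    # -- 3. compile the documentation booklet and the presentation
    cd tex
    pdflatex model_doc_Balance.tex
    bibtex model_doc_Balance
    makeindex model_doc_Balance.nlo -s nomencl.ist -o model_doc_Balance.nls
    pdflatex model_doc_Balance.tex
    pdflatex model_doc_Balance.tex   # repeated to get all page references right
    pdflatex pres_Balance.tex
    pdflatex pres_Balance.tex
    cd ..

    # -- 4. have a look at the results
    okular tex/model_doc_Balance.pdf &
    okular tex/pres_Balance.pdf &
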
Note: The first part of this file prepares the compilation of the LaTeX markup files. This is a little more complex than simply running “latex myfile”, because it is prepared to allow you to use the nomenclature package, build an index, build a table of contents and have a reference system in place. The idea behind the multiple runs is to make sure that one call to our code gets all page references right (maybe I will make a separate tutorial about this).

Assuming that the shell file from the previous point is called “calc.sh”, we could then run our project with the following command:
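For example (a plain invocation; for long calculations you might prefer to run it in the background and keep a log):

    bash calc.sh
    # or, in the background with a log file:
    nohup bash calc.sh > calc.log 2>&1 &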

Linux is not only the market leader on small computers (routers, storage arrays, phones via Android) and the leader among the biggest computers on this planet, it is also a versatile environment that allows you to control your system the way you want. It is for example very simple to fully automate the initiation of all calculations.

Typically it is sufficient to add the script to the relevant crontab file or directory (e.g. /etc/cron.monthly), or to refer to the documentation of your distribution to fine-tune cron (for example Ask Ubuntu for the Ubuntu distribution).
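For instance, assuming the project lives in /home/user/balance (the path and the timing are only illustrative), a crontab entry could look like this:

    # run the full work-flow at 02:00 on the first day of every month
    0 2 1 * * cd /home/user/balance && bash calc.sh > calc.log 2>&1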

Free and Open Source Software (FOSS) is for people who value freedom, and this work-flow can be moulded to suit your specific needs to any extent. For example:

  • You can include the downloading of new data when it should be available.
  • You can email a report by using the mail command, as in the small example after this list (on Ubuntu that might require installing mailutils first: sudo apt-get install mailutils)
  • You can replace the LaTeX markup with Markdown and automatically create a website using, for example, Jinja2 and Mynt (those are also the systems that helped create this website). In that case you can even add some checks to your automation file to verify that everything worked fine and then use rsync to upload the new HTML files to your webserver.
  • Of course you can replace MySQL with PostgreSQL, R with Octave, kate with vi, etc.

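A minimal example of such a notification email (the subject, address and message are of course placeholders):

    echo "The monthly balance report has been refreshed." | mail -s "Balance model: monthly run" someone@example.com
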
I hope that this tutorial is helpful. Please let me know if you find inaccuracies or if you think that it can be improved.