pdf to eps

Publishing my most recent research paper is taking longer than expected. It does not help that Elsevier wants the figures in eps format. I did not want to create them from scratch (using Matlab), so I looked online and found help based on ghostscript. Here is a little script that converts a pdf file to eps directly.

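#!/bin/bash
# Convert pdf to eps file.
# Note: newer Ghostscript releases replaced the epswrite device with eps2write.

while test $# -gt 0
do
    case "$1" in
        *.pdf)
            # quote the file names so that paths containing spaces survive
            /usr/local/bin/gs -dNOPAUSE -dNOCACHE -dBATCH -sDEVICE=epswrite -sOutputFile="${1%.pdf}.eps" "$1"
            ;;
        *)
            echo "Check file $1 - not recognised"
            ;;
    esac
    shift
done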

I saved the script as pdftoeps.sh and used xargs to convert multiple pdf files into eps format:

$ ls | egrep '\.pdf$' | xargs ./pdftoeps.sh
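
For comparison, the same batch conversion can be sketched in Python. This is only a sketch under assumptions: the helper names eps_name, build_gs_command and convert_all are made up for illustration, gs is assumed to be on the PATH, and the Ghostscript flags are simply copied from the shell script above.

```python
import subprocess
import sys


def eps_name(pdf_path):
    """Return the output name, e.g. 'fig1.pdf' -> 'fig1.eps'."""
    if not pdf_path.endswith(".pdf"):
        raise ValueError("not a pdf file: %s" % pdf_path)
    return pdf_path[:-len(".pdf")] + ".eps"


def build_gs_command(pdf_path, gs="gs"):
    """Build the Ghostscript call (flags copied from the shell script)."""
    return [gs, "-dNOPAUSE", "-dNOCACHE", "-dBATCH", "-sDEVICE=epswrite",
            "-sOutputFile=" + eps_name(pdf_path), pdf_path]


def convert_all(paths):
    """Convert every *.pdf in paths; report anything else, like the script."""
    for path in paths:
        if path.endswith(".pdf"):
            subprocess.check_call(build_gs_command(path))
        else:
            print("Check file %s - not recognised" % path)


if __name__ == "__main__":
    convert_all(sys.argv[1:])
```

Because the command is passed as a list, file names with spaces need no extra quoting.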

Beautiful awk

For a recent biophysics project I used awk quite a lot. Awk is quite powerful in manipulating structured files and the advantage is: no python or perl scripts needed. Very powerful operations can be done in shell one-liners.

One task I came across often was counting the number of lines in a file that satisfy a certain condition. For example, you have a big text file (let’s say yourfile.txt) and you want to check how many entries (lines) have a third column greater than, e.g., zero. Assuming whitespace column separators, all you need to do is this:

$ awk '$3>0' yourfile.txt | wc -l

The statement $3>0 checks whether the third column is greater than zero. $1 is the first column etc. $0 represents the whole line. Note that the output of the awk command is passed to the wc (“word count”) command, which in this case actually counts the lines (“-l” flag).

Note that the command contains no explicit print statement. A condition without an action prints the line by default, so whenever $3>0 is true, awk prints the line automatically.
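
To make the logic of the one-liner explicit, here is roughly what it does, written out in Python. This is a sketch, not a replacement: the function name count_matching_lines is made up, and unlike awk it simply skips fields that are not numbers.

```python
def count_matching_lines(lines, column=3, threshold=0.0):
    """Count lines whose whitespace-separated column (1-based) exceeds threshold."""
    count = 0
    for line in lines:
        fields = line.split()
        if len(fields) < column:
            continue  # too few columns: nothing to compare
        try:
            value = float(fields[column - 1])
        except ValueError:
            continue  # simplification: skip non-numeric fields
        if value > threshold:
            count += 1
    return count


sample = ["a b 1.5", "c d -2", "e f 0", "g h 3", "short line"]
print(count_matching_lines(sample))  # 2
```

For a real file, passing the open file object works as well, e.g. count_matching_lines(open("yourfile.txt")).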

O’Reilly has a nice pocket reference covering sed & awk. (I carry it in my work bag at all times.)

Performance improvement for embarrassingly parallel applications using Python’s multiprocessing

Recently, I posted a template for scientific simulations that applies Python’s multiprocessing module for enhanced performance. Simulations usually run independently and require very little communication between the processes. Such processes are commonly called “embarrassingly parallel”. In this post I want to assess the performance improvement of my template. I use two different machines and vary the number of runs and the number of processes.

As expected, on an Intel Core 2 Duo with 2.53GHz I get the optimal speed-up with at least two processes.

Using an Intel Core i5 3.1 GHz (Quad-core), the optimal speedup is achieved by using 4 processes.

The results are not too surprising, but it is good to know that the speedup that can be achieved using my embarrassingly parallel template is close to optimum, or at least close enough for me :)
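
One way to put a number on “close to optimum” is the speedup (run time with one process divided by run time with n processes) and the efficiency (speedup divided by n, where 1.0 would be ideal). A minimal sketch follows. The timings in it are placeholders I made up for illustration, not the measurements from this post, and the function names are hypothetical.

```python
def speedup(t_serial, t_parallel):
    """Speedup relative to the single-process run time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_procs):
    """Fraction of the ideal (linear) speedup that was reached."""
    return speedup(t_serial, t_parallel) / n_procs


# Placeholder wall-clock times in seconds (NOT measurements from the post)
timings = {1: 40.0, 2: 21.0, 4: 11.0}
for n_procs in sorted(timings):
    s = speedup(timings[1], timings[n_procs])
    e = efficiency(timings[1], timings[n_procs], n_procs)
    print("%d processes: speedup %.2f, efficiency %.2f" % (n_procs, s, e))
```

Since main in the template returns the elapsed time, its return values for different process counts could be fed straight into these two functions.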

As always, feedback is appreciated.

Parallel computing in Python using the multiprocessing module

It does not matter where you run your code, either on your local multi-core machine or on a cluster: Parallel programming approaches have become pivotal when it comes to scientific computing.

At the same time Python has become very popular and powerful. Hence, I recently started to learn more about parallel coding approaches in Python. There are two main approaches: using the threading modules (either the low-level _thread or the higher-level threading module) or the multiprocessing module. The reason for the two different approaches is CPython’s GIL (Global Interpreter Lock), which allows only one thread at a time to execute Python bytecode. The GIL is the subject of controversy because it prevents threads from running truly in parallel. I am aware of the contradiction caused by using a higher-level language like Python on the one hand and craving optimization on the other hand. Nevertheless, the multiprocessing module, which side-steps the GIL, seems to be a good place to dive into parallel coding. Another advantage of the multiprocessing module is its higher-level approach. Each process has its own memory space, i.e., global variables are not automatically shared. Therefore, multiprocessing provides its own objects for shared memory (Array and Value).
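
To illustrate the shared-memory objects, here is a minimal sketch with one shared Value and one shared Array that several processes update. The function names add_to_shared and run are made up for this example. It uses the lock that comes built into the shared objects (get_lock), which is a different choice from the separate Lock objects in my template below.

```python
import multiprocessing as mp


def add_to_shared(counter, totals, amount):
    """Add amount to a shared counter and to every slot of a shared array."""
    with counter.get_lock():          # shared objects carry their own lock
        counter.value += amount
    with totals.get_lock():
        for i in range(len(totals)):
            totals[i] += amount


def run(n_procs=2, amount=1):
    """Start n_procs processes that all update the same shared objects."""
    counter = mp.Value('i', 0)        # shared integer
    totals = mp.Array('d', 3)         # shared array of 3 doubles, starts at zero
    procs = [mp.Process(target=add_to_shared, args=(counter, totals, amount))
             for _ in range(n_procs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()                      # wait until every process is done
    return counter.value, list(totals)
```

Calling run() from a script (under the usual if __name__ == '__main__': guard) returns the counter and the array contents after all processes have added their share. An ordinary global variable would instead stay unchanged in the parent process.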

After going through several excellent tutorials, e.g. by Doug Hellmann (pt. I, pt. II) or Norm Matloff’s (UC Davis) book (chapter 14), I created the following template for parallel simulations. Simulations are particularly well suited because they usually don’t share memory (communication between processes is computationally expensive) and run independently. However, the simulation results need to be stored, usually in an array. Here, the result objects (result and result_sq) are defined at the top of main, followed by the associated Locks. The Locks take care of race conditions: a delicate synchronization issue that arises when two or more processes try to modify shared memory at the same time.

#!/usr/bin/env python

# Multiprocessing simulation template
#
# Lars Seemann
# lseemann [at] uh.edu

import time
import random
import multiprocessing as mp

class glbls:                      # globals, other than shared
   DIM_RESULTS = 5                # dimensionality of simulation result
   thrdlist = []                  # list of all instances

def simulation(rnd):
   #______ACTUAL SIMULATION IN HERE________

   # waste some time and get random results for now
   tmp_count=0
   for i in range(int(1e5)):        # usually, the simulation
      tmp_count+=random.random()    # computes and accesses memory
   result = []
   for i in range(glbls.DIM_RESULTS):
      result.append(rnd.gauss(mu=0, sigma=i+1))
   return result

def worker(id,mynreps,rslt,rsltlock,rslt_sq,rsltsqlock,n_sim,nsimlock):
   rnd = random                              # set up random number generator
   for n in range(mynreps):
      tmp_result = simulation(rnd)
      nsimlock.acquire()
      n_sim.value += 1
      nsimlock.release()
      for (i,x) in enumerate(tmp_result):
         rsltlock.acquire()                  # acquire: make sure the shared array is
         rslt[i] += x                        # held by one process only
         rsltlock.release()                  # release
         rsltsqlock.acquire()
         rslt_sq[i] += x*x
         rsltsqlock.release()
   print("\tProcess%d ran %d simulations"%(id,mynreps))

def main(N_MC = 1e3 , N_THREADS = 2):
   nreps=int(N_MC)//N_THREADS                 # integer division: runs per process
   result=mp.Array('d', glbls.DIM_RESULTS)    # shared array object
   result_sq=mp.Array('d',glbls.DIM_RESULTS)  # shared array object
   num_sim=mp.Value('i',0)                    # shared counter of finished simulations
   rslt_lock = mp.Lock()                      # lock
   rslt_sq_lock = mp.Lock()                   # lock
   nsim_lock = mp.Lock()                      # lock

   # Create and start processes
   t_0=time.time()
   for i in range(N_THREADS):
      p = mp.Process(target=worker,args=(i,nreps,result,rslt_lock,result_sq,rslt_sq_lock, num_sim,nsim_lock))
      glbls.thrdlist.append(p)
      p.start()
   for thrd in glbls.thrdlist:thrd.join()     # wait till all processes finish
   t_1=time.time()
   Actual_N_MC = nreps * N_THREADS            # actual nr of simulations
   if num_sim.value!=Actual_N_MC:             # consistency check
      raise ValueError("Inconsistent number of simulations! %d,%d"%(Actual_N_MC,num_sim.value))

   # Assess result
   for i in range(glbls.DIM_RESULTS):
      mean=result[i]/Actual_N_MC
      std_dev=((result_sq[i])/Actual_N_MC - (result[i]/Actual_N_MC)**2)**(0.5)
      print("\tN(mu=0,sigma=%d) : sample mean=%.4f \tstd dev=%.4f"%(i+1, mean, std_dev))
   t = t_1-t_0
   print("Time elapsed: %ds"%t)
   return t

if __name__ == '__main__':
   main()

The class glbls defines global variables: the dimensionality of the simulation results DIM_RESULTS (here set to 5) and a list of all process instances.

simulation contains the actual simulation. I am planning to use this template for an agent-based model. In general a simulation can return several quantities of interest. These quantities are returned as a list whose dimensionality must be defined as the global variable DIM_RESULTS. For demonstration purposes I just create normally distributed variables with zero mean and different variances.

The worker runs the simulation multiple times and stores the results in the shared result arrays. Note that the input of worker contains two multiprocessing.Array objects which are shared by all processes. I also keep the squared results to calculate the standard deviation later on.

main creates, starts, and joins all processes. Inputs are the number of Monte Carlo runs N_MC and the number of processes N_THREADS. Optimal performance can usually be achieved by setting the number of processes to the number of cores of the CPU. Once all processes have finished, the results can be assessed. Here I just calculate the sample mean and sample standard deviation of the normal random numbers.
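
The reason for keeping both the sum and the sum of squares is that the two together are enough to recover the mean and the standard deviation without storing the individual results. A small stand-alone sketch of that last step follows. The function name mean_and_std is made up, and like the template it divides by N, which gives the biased estimate rather than the N-1 version.

```python
import math


def mean_and_std(total, total_sq, n):
    """Mean and standard deviation from sum(x) and sum(x*x) over n values."""
    mean = total / float(n)
    variance = total_sq / float(n) - mean ** 2
    # round-off can push a tiny variance below zero, so clamp before the root
    return mean, math.sqrt(max(variance, 0.0))


samples = [1.0, 2.0, 3.0, 4.0]
total = sum(samples)                      # plays the role of result[i]
total_sq = sum(x * x for x in samples)    # plays the role of result_sq[i]
print(mean_and_std(total, total_sq, len(samples)))
```

In the template, total and total_sq correspond to one entry of the shared arrays and n to Actual_N_MC.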

I hope this template can be utilized for different purposes. Feedback is always appreciated.

Preparing Mac OS X Lion for scientific computing

In this post I am going to summarize how I set up my new iMac to prepare it for its main purpose: scientific computing, mainly in python.

A recent grant from the National Science Foundation allowed me to purchase a new 27-inch iMac (I really like the screen) with the following specifications:

  • CPU: Intel Core i5 3.1 GHz (Quad-core)
  • GPU: AMD Radeon HD 6970M 1GB (maybe I’ll dive into GPU computing / OpenCL later)
  • L2 Cache: 4 x 256KB
  • L3 Cache: 6MB
  • RAM: 8GB DDR3
  • Hard Drive: Hitachi 2TB 7200 RPM (not partitioned yet)
  • OS: Mac OS X 10.7 (Lion)

Here is how I set everything up:
  1. Create Admin account. I prefer to work in an environment that doesn’t have administrative rights by default and explicitly asks for an Admin login for important system changes. Hence, I created an Admin account and I work under a user without these rights. I also created a guest account.
  2. Computer Name. I changed my computer name under System Preferences -> Sharing.
  3. Install Xcode. Xcode provides most necessary compilers and libraries. Xcode version 4.1 for Lion is available for free from here. Older versions are available through the apple developer page as well (login required).
  4. XQuartz? On Leopard (Mac OS X 10.5) I used XQuartz as my X Window System. Lion ships with X11 and XQuartz does not provide an explicit version for Lion. Since I do not see any reason not to use X11 for now, I will leave everything as it is.
  5. Fink package management system. Fink makes third-party applications originally written for Unix easily accessible on Mac OS X.
  6. Python. I recommend using pre-configured python distributions like PythonXY or the Enthought Python distribution. PythonXY is licensed under the GNU GPL license whereas the Enthought Python distribution is only free for academic use (academic email address required). I work with the latter.
  7. WingIDE (optional). I hate to recommend commercial software but WingIDE makes python coding/developing a lot easier. Free alternatives are, e.g., Spyder (comes with PythonXY) or any editor. That brings me to the next topic:
  8. Editor. Oh my… finding a good editor is a long quest, which can get very personal. There are a lot of very good ones out there. Personally, I like vim, TextWrangler, and KomodoEdit.
  9. Setting up github, ec2, Dropbox etc.
  10. .bash_profile & vim syntax highlighting. First, I turned on syntax highlighting in vim as follows:
    $ vim .vimrc
    

    and added the following two lines for syntax highlighting and line numbering:

    syntax on
    set number
    

    Next, I made sure that somewhere (most likely in the beginning) in .bash_profile the following lines can be found

     # Get the aliases and functions
     if [ -f ~/.bashrc ]; then
             . ~/.bashrc
     fi
    

    which basically tells bash to look for the file .bashrc. This is where personalized settings (shortcuts, colors) are placed. My settings:

     #.bashrc defaults
    
     export EDITOR=/usr/bin/vim
    
     export PATH=$PATH:~/bin
    
     export LSCOLORS=Hxfxcxdxbxegedabagacad
    
     export PS1="\[\033[01;33;01m\]\[\033[00;37;01m\]\h:\u\[\033[01;32;01m\] \d \$(date '+%T %Z') \[\033[01;32;01m\] [\[\033[01;37;01m\]$SHLVL\[\033[01;32;01m\]]\n!\!\[\033[01;33;01m\] [ \w ]\[\033[00m\] "
     export PS2="\[\033[01;33;40m\] ?\[\033[00m\] "
    
     PROMPT_COMMAND='echo -ne "\033]0;${USER}@${HOSTNAME%%.*}: ${PWD/$HOME/~}\007"'
    
     alias ls='ls -G'
    

    The outcome is the following.

  11. TeXShop & BibDesk. TeXShop is a convenient LaTeX environment. BibDesk is a great bibliography manager which makes sorting papers almost fun and it works well for citations in LaTeX.

Getting started with Amazon’s EC2 using API tools under Mac OS X / Unix

Since you are reading this I assume that you have at least heard about Amazon’s Cloud Computing service, so I am skipping the introduction about how great it is.

There are several excellent posts on this topic already, e.g. here or here. Therefore I am just going to summarize what I did.

  1. Sign up for Amazon Web Services (AWS) here. Note that the service is in general not free and you need to provide credit card information. However, Amazon’s Free Usage Tier for new customers will keep costs low, if not at zero, if you just want to try it out.
  2. To communicate with Amazon’s EC2 instances you need a certificate and the appropriate command line tools. Once your account is created, you can get the X.509 certificates from your Account -> Security Credentials -> Access Credentials -> X.509 Certificates. Create a new certificate and download the two created files. Both files have the ending .pem (Privacy Enhanced Mail); one has the prefix “cert-” (=certificate) and the other has the prefix “pk-” (=private key). Don’t lose them, they are important. Next, download the EC2 command line API tools.
  3. Create a new folder named .ec2/ in your home directory and move the certificate files and the contents of the unzipped EC2 tools folder into the new .ec2/ folder:
    $ cd
    $ mkdir ~/.ec2
    

    Assuming you stored the certificates and the EC2 tools folder in your home directory

    $ mv *.pem .ec2/
    $ mv ec2-api-tools-1.4*/* .ec2/
    

    Now, the .ec2/ folder should contain both .pem files and the EC2 API related files and folders.

     $ ls -al ~/.ec2

    The content of the .ec2/ folder should look similar to this
    Note that I kept the EC2 tools zip-file and there is an additional .pem file which we will come back to later.

  4. To make working with the EC2 tools more convenient, let’s set some EC2-related paths by adding a few lines to the bash profile. That way our system will automatically know where our keys and EC2 libraries live.
     $ cd
     $ vim .bash_profile
     

    Now add the following section (in insert mode)

     # Setting PATHs for Amazon EC2
     export EC2_HOME=~/.ec2
     export PATH=$PATH:$EC2_HOME/bin
     export EC2_PRIVATE_KEY=~/.ec2/pk-___________________________.pem
     export EC2_CERT=~/.ec2/cert-________________________________.pem
     export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home/
    

    where you need to fill in your actual key names. Finally, let’s activate the changes via

     $ source ~/.bash_profile
     
  5. To securely log into Amazon’s remote machines via ssh we need to generate another key (that’s the one mentioned above). Finally we can put the EC2 library to work
     $ ec2-add-keypair YOURKEY_keypair
     

    Copy the output manually (including the header and footer) and paste it into a new file, e.g. via

     $ cd .ec2/
     $ vim YOURKEY_keypair.pem       # paste and save the key
     $ chmod 600 YOURKEY_keypair.pem # permission
     

    and change the permission explicitly to private.

  6. To run an instance we have to choose an AMI (=Amazon Machine Image). I think of it as a ready-to-use configured system, similar to the Linux distribution DVDs. There are plenty available as you can see by running
    $ ec2-describe-images -o amazon # AMIs from Amazon
    $ ec2-describe-images -a # all AMIs available
     

    I am not very familiar with all the differences yet, but I know I want to focus on scientific computing and not on web-hosting etc. Therefore I wanted to use one I found on Drew Conway’s blog (great read by the way) (ID: ami-84bd41ed). However, it looks like it requires a 64-bit architecture, which does not fall under the Free Tier usage. So let’s try an S3-backed (more on that later) 32-bit Linux AMI:

    $ ec2-run-instances ami-2a1fec43 -k YOURKEY_keypair
    RESERVATION ...
    INSTANCE ...
     

    and if everything works out you will see some information about your instance. I hope you feel as excited as I did when I ran my first instance on the cloud. It has been a long way… The returned information is important and can be retrieved during the whole session via

    $ ec2-describe-instances
     
  7. From here we can go different ways. We can either open port 80 for Apache and access our instance through a browser, or we can open port 22 for ssh. Of course we can do both, but I think this is where the purpose of using EC2 comes into play. Since I am planning to use EC2 as a platform for scientific computation, I will go with ssh. Besides, I know almost nothing about Apache, web-hosting etc. (I hope that doesn’t undermine my credibility for writing this post.)
    $ ec2-authorize default -p 22
    $ ssh -i YOURKEY_keypair.pem ec2-user@ec2-__-__-___-___.compute-1.amazonaws.com
     

    where you need your keypair and the web address of your instance, which can be found by using

    $ ec2-describe-instances
     

    Finally! What a charming welcome to the cloud

           __|  __|_  )  Amazon Linux AMI
           _|  (     /     Beta
          ___|\___|___|
    
    See /usr/share/doc/system-release-2011.02 for latest release notes. :-)
    

    It was a long way but I hope you are as excited as I was. Now we can play around a little bit and see, e.g., what the guys from Amazon have to tell us

    [ec2-user@ip-10-36-13-72 ~]$ cd ../../usr/share/doc/system-release-2011.02/
    [ec2-user@ip-10-36-13-72 system-release-2011.02]$ vim ReleaseNotes.txt
     

    Or play with python

    [ec2-user@ip-10-36-13-72 ~]$ python
     
    >>> a = range(10)
    >>> a
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
     

    or do whatever you have in mind. Enjoy!

  8. I know you are excited, but don’t forget to terminate your instance. We are paying for it!
    Log out of ssh

    [ec2-user@ip-10-36-13-72 ~]$ ~.
    [ec2-user@ip-10-36-13-72 ~]$ Connection to ec2-50-17-138-227.compute-1.amazonaws.com closed.
    

    Get instance info

    $ ec2-describe-instances
    RESERVATION	r-9ef419f1	976958007180	default
    INSTANCE	i-ce8363af	ami-2a1fec43	ec2-__-  ...
    

    Terminate

    $ ec2-terminate-instances i-ce8363af
    INSTANCE	i-ce8363af	running	shutting-down
    

    Double check

    $ ec2-describe-instances
    RESERVATION	r-9ef419f1	976958007180	default
    INSTANCE	i-ce8363af	ami-2a1fec43			terminated ...
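
Most problems I ran into with the EC2 tools came down to the environment variables from step 4. They can be sanity-checked before calling any of the tools. The following is a minimal sketch: the function name missing_ec2_settings is made up, and the only thing it checks is that each variable is set and points to an existing path.

```python
import os

# Variable names taken from the .bash_profile section in step 4
REQUIRED_VARS = ["EC2_HOME", "EC2_PRIVATE_KEY", "EC2_CERT", "JAVA_HOME"]


def missing_ec2_settings(env=None):
    """Return a list of problems: unset variables or paths that do not exist."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append("%s is not set" % name)
        elif not os.path.exists(os.path.expanduser(value)):
            problems.append("%s points to a missing path: %s" % (name, value))
    return problems


if __name__ == "__main__":
    for problem in missing_ec2_settings():
        print(problem)
```

If the script prints nothing, all four variables are set and their paths exist.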
    

I hope this little tutorial helped you and you are ready to explore the world of EC2. I am still at the beginning of the learning process. I will keep you updated. I really enjoyed writing my first blog post ever.

Next Steps

Next I am going to explore the world of AMIs and maybe build my own, focusing on scientific computations using, e.g., Python and its extensions, and on how to incorporate big data for numerical analysis. That brings me to the next step: exploring how to incorporate S3 & EBS storage. The ultimate goal is to run a trading system on EC2 using broker-provided APIs.