Tips for Long Running Computations (evanjones.ca)

[ 2004-November-03 12:46 ]

I have recently been doing some relatively big simulations for my research work. It has been a bit of a painful learning process for me, because working with long running jobs is very different from normal applications. I hope that these tips are useful to others, as I've had to learn them the hard way.

Trial Runs

Since a job could take many hours (or days, or weeks), you really want to verify that it will run to completion before you walk away. This means that you should run some sort of short sanity check before you actually start the job, just to make sure you didn't screw up the parameters.
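As a minimal sketch of such a sanity check in Python: run the exact same code path on a tiny input and check that the result looks plausible before launching the real thing. The `simulate` function, the parameter names `num_trials` and `seed`, and the plausibility range are all hypothetical stand-ins, not part of any real job.

```python
import random


def simulate(num_trials, seed):
    """Hypothetical stand-in for the real job: the mean of random draws."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(num_trials)) / num_trials


def sanity_check(params, trial_size=100):
    """Run the real code path on a tiny input before committing to the full job."""
    # Same parameters as the full run, but with the problem size capped.
    small = dict(params, num_trials=min(trial_size, params["num_trials"]))
    result = simulate(**small)
    # Assumption: a valid result for this toy job lies between 0 and 1.
    if not 0.0 <= result <= 1.0:
        raise ValueError(f"implausible trial result: {result!r}")
    return result


if __name__ == "__main__":
    full_params = {"num_trials": 10_000_000, "seed": 1}
    sanity_check(full_params)        # takes a moment; fails fast on bad parameters
    print(simulate(**full_params))   # the long run only starts if the check passed
```

The important design choice is that the trial run reuses the full job's parameters and code, so a typo in either shows up in seconds rather than hours.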

Incremental Output

I strongly recommend writing results to disk as soon as you have them. Don't forget to flush any disk buffers. This is useful because if you find a bug, the machine crashes, or your process goes berserk and gets killed by the operating system, you will still have something to work with. The partial results can help you debug the problem, and they allow you to monitor a running process. It is terrible to wait days for results, only to realize immediately that something was wrong with your test.
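One way this could look in Python, as a sketch rather than a recipe: append each result to a CSV file the moment it is computed, then flush Python's buffer and ask the operating system to push the data to disk. The function name `run_and_record` and the two-column layout are assumptions made for illustration.

```python
import csv
import os


def run_and_record(path, inputs, compute):
    """Append one row per result as soon as it exists, and force it to disk."""
    # Append mode, so a restarted job adds to earlier partial output.
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for x in inputs:
            writer.writerow([x, compute(x)])
            f.flush()              # empty Python's userspace buffer
            os.fsync(f.fileno())   # ask the OS to write its cache to disk


if __name__ == "__main__":
    run_and_record("results.csv", range(5), lambda x: x * x)
```

Calling `os.fsync` after every row has a cost. If each result takes seconds or longer to compute that cost is negligible; for very fast results, syncing every few hundred rows may be a reasonable compromise. While the job runs, something like `tail -f results.csv` lets you watch its progress.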

Automate Automate Automate

It is very important to automate every step in your computation process. Ideally, you want to type "go" and have the computer spit out your final results: tables, graphs, or a number. If you need to manually perform some tasks before and after a job, you will waste time each time you need to restart it. Restarting happens frequently, either because of mistakes, or because after looking at some results, you need to tweak the parameters. Yes, you will spend a lot of time writing code that is not directly related to the problem you want to solve, but you will save time each time you start a job.
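A rough sketch of the "type go" idea in Python: every step, from setup to the final table, hangs off one entry point. The step names (`prepare`, `run`, `report`), the parameters, and the toy computation are all hypothetical placeholders for whatever your real pipeline does.

```python
import random


def prepare(sizes, seed=0):
    """Setup step: build one configuration per problem size."""
    return [{"size": n, "seed": seed} for n in sizes]


def run(config):
    """Compute step: placeholder simulation that returns one result row."""
    rng = random.Random(config["seed"])
    mean = sum(rng.random() for _ in range(config["size"])) / config["size"]
    return {"size": config["size"], "mean": mean}


def report(rows):
    """Post-processing step: format the final table as tab-separated text."""
    lines = ["size\tmean"]
    lines += [f"{row['size']}\t{row['mean']:.4f}" for row in rows]
    return "\n".join(lines)


def go(sizes=(10, 100, 1000)):
    """The single entry point: setup, computation, and reporting in one call."""
    return report([run(config) for config in prepare(sizes)])


if __name__ == "__main__":
    print(go())
```

With this shape, changing a parameter and rerunning everything is one command, with no manual steps in between to forget.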

Performance Matters ... Sort Of

With long running jobs, small tweaks can shave hours off of your run time. However, it hardly matters if your job takes 6 hours or 9 hours, because both are long enough that you will need to come back the next day. So at that level, performance doesn't matter. What does matter is how your job scales. For example, if you double the size of your test and your job takes fifty times longer, that is not acceptable. For this reason, do the initial work in a high level language that you are comfortable with (I recommend Python). Then, if your job is taking too long, you can easily play with different algorithms. One thing is critically important: optimize the bottleneck, not what you think is the bottleneck. Use a profiler to figure out which parts of your program are slow. You should think about using a low level, high performance language like C only after you have the best algorithms.
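Both checks can be sketched with Python's standard library: time the job at size n and at 2n to see how it scales, and use `cProfile` to see where the time actually goes. The `workload` function is a hypothetical, deliberately quadratic stand-in for a real job, and the helper names are made up for this example.

```python
import cProfile
import io
import pstats
import time


def workload(n):
    """Hypothetical stand-in for the real job: a deliberately quadratic loop."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += i ^ j
    return total


def doubling_ratio(func, n):
    """Return time(2n) / time(n), a rough measure of how the job scales."""
    start = time.perf_counter()
    func(n)
    t_small = time.perf_counter() - start
    start = time.perf_counter()
    func(2 * n)
    t_large = time.perf_counter() - start
    return t_large / max(t_small, 1e-9)  # guard against a zero denominator


def profile(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)  # five most expensive entries
    return result, stream.getvalue()


if __name__ == "__main__":
    print("doubling ratio:", doubling_ratio(workload, 300))
    _, text = profile(workload, 300)
    print(text)
```

For a quadratic algorithm the doubling ratio should land somewhere near 4, and near 2 for a linear one, although single timings are noisy and worth repeating. A ratio far above what you expect is the signal to look for a better algorithm before reaching for C.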

Flat Output

My experiments involve millions of little simulations that are combined to generate bigger results. I originally was creating a hierarchy of little files. The lowest level had the "raw" data, and each higher level combined the data to produce a higher level result. This was perfect until I wanted to group the data in a different way, or use a different equation to combine results. Now, I report the output as one big, flat table, which I post-process to get the results I want. I could use a relational database, but that would be overkill.
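To illustrate why the flat table is flexible, here is a small Python sketch that groups the same rows two different ways without rerunning anything. The column names (`topology`, `load`, `latency_ms`) and the values are invented placeholders, not data from my experiments.

```python
import csv
import io
from collections import defaultdict

# Made-up example rows: one line per simulation, every parameter in a column.
FLAT = """run,topology,load,latency_ms
1,ring,low,10
2,ring,high,20
3,mesh,low,30
4,mesh,high,50
"""


def mean_by(rows, key, value):
    """Group rows by one column and average another."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(float(row[value]))
    return {k: sum(v) / len(v) for k, v in groups.items()}


if __name__ == "__main__":
    rows = list(csv.DictReader(io.StringIO(FLAT)))
    print(mean_by(rows, "topology", "latency_ms"))  # group one way...
    print(mean_by(rows, "load", "latency_ms"))      # ...then another
```

Regrouping is just a different `key` argument, and combining results with a different equation means swapping the averaging line. With a hierarchy of files, either change would have meant regenerating every level above the raw data.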