The ETL from Hell – Diagnosing Batch System Performance Issues

by Nigel Rivett

Too often, the batch systems that underlie a lot of database processing just grow without conscious design. When runs start to extend beyond their allotted time, and tuning no longer solves the problem, it is often discovered that batches are run in series, with draconian error handling. It is time to impose some rational design, and Nigel is a seasoned healer of batch processes.

Overview

Batch systems, which perform housekeeping jobs without human intervention, are often used with databases, usually for the population of data warehouses but more generally for any regular backend processing such as accounting processes. In this article, I'll be discussing the typical problems in batch processing, showing how to determine their cause, and describing how to resolve them. We will concentrate on an overnight batch run because this is such a common way to populate a data warehouse, but the same principles will apply to any batch system, whenever it is run. Systems that are designed for high availability have additional challenges, and processing will already be designed so that maintenance can be carried out while the system is available. These systems can still benefit from the principles outlined in this article because control of the process can still be an issue.

>> Go to Source...
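As a taste of the kind of control Rivett argues for: the first step in diagnosing an over-running batch is knowing where the time goes. Below is a minimal sketch of per-step instrumentation (the table layout and all names are my own assumptions, not taken from the article), which also avoids the "draconian" pattern where one failed step silently kills everything downstream.

```python
# Minimal sketch of batch-step instrumentation (hypothetical names, not
# from Rivett's article): record start/end times and outcome for each
# step so the longest-running parts of an overnight run can be found.
import sqlite3
import time
from contextlib import contextmanager

conn = sqlite3.connect("batch_log.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS step_log (
        step       TEXT,
        started_at REAL,
        ended_at   REAL,
        status     TEXT
    )
""")

@contextmanager
def logged_step(name):
    """Log a batch step's duration and outcome."""
    start = time.time()
    status = "ok"
    try:
        yield
    except Exception:
        status = "failed"   # record the failure; re-raise so the caller
        raise               # decides whether later steps should still run
    finally:
        conn.execute(
            "INSERT INTO step_log VALUES (?, ?, ?, ?)",
            (name, start, time.time(), status),
        )
        conn.commit()

# Usage: wrap each unit of the overnight run.
with logged_step("load_staging"):
    pass  # extract/load work goes here
```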
Data Transformation and Linear Algebra

The problem of data transformation is solved in numerous ways, with varying levels of sophistication and in different flavors. ETL (extract – transform – load) is the buzzword most strongly associated with this topic. Basically, the requirement is to move a defined set of data entities (data structures such as records from tables in schemas) from one presentation into another. That can be a pure space-time transformation (trivial, since it preserves the structure, i.e. the shape) or a structural transformation, which is shape-changing. Linear algebra offers well-understood machinery here, refined over centuries, and most of its work concerns different presentations of so-called vectors, which are well-defined sets of data within a presentation (a multi-dimensional space). Now, the idea is to present a data structure in such a space, or, equivalently, to provide a bidirectional transformation (mapping) onto that space. Impossible? I don't think so. Conclusion: do it then! Ok, watch this blog and your curiosity will be satisfied...
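To make the analogy concrete, here is a minimal numpy sketch (the encoding of a record as a vector and the particular matrix are my own illustration, not from the post): a record becomes a vector, a structural transformation becomes an invertible matrix, and the inverse matrix provides the bidirectional mapping.

```python
# Sketch of the linear-algebra analogy (illustrative only): encode a
# record as a vector, transform it with a matrix, and map it back with
# the inverse -- a bidirectional ("shape-changing") transformation.
import numpy as np

record = np.array([3.0, 5.0])          # a record encoded as a vector

# An invertible transformation matrix: here, a change of presentation
# that maps (a, b) to (a + b, a - b).
T = np.array([[1.0,  1.0],
              [1.0, -1.0]])

transformed = T @ record                    # forward mapping into the new space
recovered = np.linalg.inv(T) @ transformed  # inverse mapping restores the record

assert np.allclose(recovered, record)
```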

Loading half a billion rows into MySQL

Interesting post on the derwiki blog … The comments are quite entertaining, too! Amazing how ignorance produces patronizing statements (-> Morg). See below for the top of the post …

Background

We have a legacy system in our production environment that keeps track of when a user takes an action on Causes.com (joins a Cause, recruits a friend, etc). I say legacy, but I really mean a prematurely-optimized system that I'd like to make less smart. This 500m-record database is split across monthly sharded tables. Seems like a great solution to scaling (and it is), except that we don't need it. And based on our usage pattern (e.g. to count a user's total number of actions, we need to query N tables), this leads to pretty severe performance degradation. Even with a memcache layer sitting in front of the old month tables, new features keep discovering new N-query performance problems. Noticing that we have another database happily chugging along with 900 million records, I decided to migrate the existing system into a single-table setup. The goals were:

- reduce complexity. Querying one table is simpler than N tables.
- push as much complexity as possible to the database. The wrappers around the month-sharding logic in Rails are slow and buggy.
- increase performance. Also related to one table being simpler to query than N.

…

>> Go to Source...
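The excerpt stops before the migration details. A rough sketch of the consolidation it describes might look like the following (the table and column names are guesses on my part, and the original work was done with Rails against MySQL rather than in Python):

```python
# Rough sketch of consolidating monthly sharded tables into one table
# (table/column names are hypothetical; the original used Rails + MySQL).
import pymysql

conn = pymysql.connect(host="localhost", user="app", password="...",
                       database="production")

months = ["2011_01", "2011_02", "2011_03"]  # one entry per monthly shard

with conn.cursor() as cur:
    for month in months:
        # Server-side copy: INSERT ... SELECT moves no rows through the
        # client, which matters at hundreds of millions of records.
        cur.execute(
            f"INSERT INTO actions (user_id, action_type, created_at) "
            f"SELECT user_id, action_type, created_at FROM actions_{month}"
        )
        conn.commit()

# Afterwards, counting a user's actions is one query instead of N:
#   SELECT COUNT(*) FROM actions WHERE user_id = %s
```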