The ETL from Hell – Diagnosing Batch System Performance Issues

by Nigel Rivett

Too often, the batch systems that underlie a lot of database processing just grow without conscious design. When runs start to extend beyond their allotted time, and tuning no longer solves the problem, it is often discovered that batches are run in series, with draconian error handling. It is time to impose some rational design, and Nigel is a seasoned healer of batch processes.

Overview

Batch systems, which perform housekeeping jobs without human intervention, are often used with databases, usually for the population of data warehouses but more generally for any regular backend processing such as accounting processes. In this article, I’ll be discussing the typical problems in batch processing, showing how to determine their cause, and describing how to resolve them. We will concentrate on an overnight batch run because this is such a common way to populate a data warehouse, but the same principles will apply to any batch system, whenever it is run.

Systems that are designed for high availability have additional challenges, and processing will already be designed so that maintenance can be carried out while the system is available. These systems can still benefit from the principles outlined in this article because control of the process can still be an issue.

>> Go to Source...
Sofadeve – Mailer Response Tracker Module

This module went live just in time for the meeting of German academics in Edinburgh in April 2014 and has been hailed as a great success. The module uses the groundbreaking webFRAME from Sofadeve’s product suite and provides a very simple but nevertheless appealing response page, which is linked from the mailer inviting recipients to the upcoming event. Interested? Want to know more? Want to have a look yourself? If you answered yes to any of the above, use our contact page to get in touch...

Trees and Other Hierarchies in MySQL

Great chapter from the unmissable book by Peter Brawley and Arthur Fuller … http://www.artfulsoftware.com/ … Thank you, boys!

Most non-trivial data is hierarchical. Customers have orders, which have line items, which refer to products, which have prices. Population samples have subjects, who take tests, which give results, which have sub-results and norms. Web sites have pages, which have links, which collect hits, which distribute across dates and times. With such data, we know the depth of the hierarchy before we sit down to write a query. The depth of the hierarchy of tables fixes the number of JOINs we need to write.

But if our data describes a family tree, or a browsing history, or a bill of materials, hierarchical depth depends on the data. We no longer know how many JOINs it will take to walk the tree. We need a different data model. That model is the graph (Fig 1), which is a set of nodes (vertices) and the edges (lines or arcs) that connect them. This chapter is about how to model and query graphs in a MySQL database.

Graph theory is a branch of topology. It is the study of geometric relations which aren’t changed by stretching and compression (rubber sheet geometry, some call it). Graph theory is ideal for modelling hierarchies, like family trees, browsing histories, search trees and bills of materials, whose shape and size we can’t know in advance.

>> Go to Source...
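To make the excerpt's point concrete, here is a minimal sketch of a graph stored as an edge list and walked without knowing its depth in advance. It is an illustration only, not the approach taken in the chapter: Python's built-in sqlite3 module stands in for MySQL so the example runs on its own, and the `edges` table, the sample names and the `descendants` helper are all hypothetical.

```python
import sqlite3


def build_family_tree(conn):
    """Create a hypothetical edge-list table: one row per parent -> child edge."""
    conn.execute("CREATE TABLE edges (parent TEXT NOT NULL, child TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO edges (parent, child) VALUES (?, ?)",
        [
            ("Ann", "Bob"),
            ("Ann", "Cat"),
            ("Bob", "Dan"),
            ("Dan", "Eve"),
        ],
    )


def descendants(conn, root):
    """Return (name, depth) for every node reachable from root.

    One query is issued per level of the tree, so the number of
    queries follows the data instead of being fixed in the SQL.
    """
    found = []
    seen = {root}          # guards against revisiting nodes in a cyclic graph
    frontier = [root]      # nodes whose children are fetched next
    depth = 0
    while frontier:
        depth += 1
        placeholders = ",".join("?" * len(frontier))
        rows = conn.execute(
            f"SELECT child FROM edges WHERE parent IN ({placeholders}) "
            "ORDER BY child",
            frontier,
        ).fetchall()
        frontier = []
        for (child,) in rows:
            if child not in seen:
                seen.add(child)
                found.append((child, depth))
                frontier.append(child)
    return found


conn = sqlite3.connect(":memory:")
build_family_tree(conn)
tree = descendants(conn, "Ann")
for name, level in tree:
    print("  " * level + name)
```

With a fixed number of JOINs the query would have to be rewritten whenever the tree grew a level; here the loop simply runs once more.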
Data Transformation and Linear Algebra

The problem of data transformation is solved in numerous ways, with different levels of smartness and in different flavors. ETL (extract, transform, load) is a buzzword closely related to this topic. Basically, the requirement is to move a defined set of data entities (data structures such as records from tables in schemas) from one representation into another. That can be a mere space-time transformation (trivial, as it preserves the structure, or shape) or a structural transformation, which changes the shape. The idea builds on concepts from linear algebra, where well-understood algorithms have been developed over the last centuries; most of the actual work there is done on different representations of so-called vectors, which are well-defined sets of data within a representation (a multidimensional space). So something like the image above. Now, the idea is to try representing a data structure in such a space or, equivalently, to provide a bidirectional transformation (mapping) onto that space. Impossible? I don’t think so. Conclusion: do it, then! OK, watch this blog and your curiosity will be satisfied...

Loading half a billion rows into MySQL

Interesting post on the derwiki blog … The comments are especially entertaining! Amazing how ignorance produces patronizing statements (-> Morg). See below the top of the post …

Background

We have a legacy system in our production environment that keeps track of when a user takes an action on Causes.com (joins a Cause, recruits a friend, etc). I say legacy, but I really mean a prematurely-optimized system that I’d like to make less smart. This 500m record database is split across monthly sharded tables. Seems like a great solution to scaling (and it is), except that we don’t need it. And based on our usage pattern (e.g. to count a user’s total number of actions, we need to query N tables), this leads to pretty severe performance degradation issues. Even with a memcache layer sitting in front of old month tables, new features keep discovering new N-query performance problems. Noticing that we have another database happily chugging along with 900 million records, I decided to migrate the existing system into a single table setup. The goals were:

- reduce complexity. Querying one table is simpler than N tables.
- push as much complexity as possible to the database. The wrappers around the month-sharding logic in Rails are slow and buggy.
- increase performance. Also related to one table query being simpler than N.

… >> Go to Source...
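The N-query problem described in the excerpt can be sketched in a few lines. This is an illustration under stated assumptions, not the post's actual schema or migration: sqlite3 stands in for MySQL so the example is self-contained, and the table names (`actions_2012_01` and so on, `actions`), the column names and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical shard suffixes; the real system has one table per month.
MONTHS = ["2012_01", "2012_02", "2012_03"]

conn = sqlite3.connect(":memory:")

# Sharded layout: one table per month.
for month in MONTHS:
    conn.execute(f"CREATE TABLE actions_{month} (user_id INTEGER, action TEXT)")
conn.execute("INSERT INTO actions_2012_01 VALUES (1, 'join_cause')")
conn.execute("INSERT INTO actions_2012_02 VALUES (1, 'recruit_friend')")
conn.execute("INSERT INTO actions_2012_03 VALUES (2, 'join_cause')")
conn.execute("INSERT INTO actions_2012_03 VALUES (1, 'join_cause')")


def count_actions_sharded(conn, user_id):
    """Count a user's actions with one query per monthly table (N queries)."""
    total = 0
    for month in MONTHS:
        (n,) = conn.execute(
            f"SELECT COUNT(*) FROM actions_{month} WHERE user_id = ?",
            (user_id,),
        ).fetchone()
        total += n
    return total


# Single-table layout: copy every shard into one table, keeping the month.
conn.execute("CREATE TABLE actions (user_id INTEGER, action TEXT, month TEXT)")
for month in MONTHS:
    conn.execute(
        f"INSERT INTO actions SELECT user_id, action, ? FROM actions_{month}",
        (month,),
    )
conn.execute("CREATE INDEX idx_actions_user ON actions (user_id)")


def count_actions_single(conn, user_id):
    """Count a user's actions with a single indexed query."""
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM actions WHERE user_id = ?", (user_id,)
    ).fetchone()
    return n


print(count_actions_sharded(conn, 1), count_actions_single(conn, 1))
```

Both functions return the same totals; the difference is that the sharded version needs application code that knows every table name and issues one round trip per month.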
Big data is better data

And here is another TED talk. This time we are listening to the wonderful Mr. Kenneth Cukier. Self-driving cars were just the start. What’s the future of big data-driven technology and design? In a thrilling science talk, Kenneth Cukier looks at what’s next for machine learning and human knowledge. Watch...