The ETL process and MySQL
Get advice on the ETL process and ETL tools for transforming raw data in MySQL.
I am looking for help regarding the ETL process. I have raw data and I want to transform it in MySQL to remove data errors. What is the process for doing this?
If you Google for extract, transform and load (ETL), you'll find a huge number of references to ETL tools. The reason that all of these tools have been developed is simple: the ETL process is so complex that a tool is usually the best choice. And even though they are usually expensive, they are cost-effective. So, my default answer would be to first advise you to take a serious look at using an ETL tool. I hold zero shares in any ETL company (actually, I hold zero in any IT-related company), so I don't have any ax to grind here.
However, you may well have considered and rejected a tool, so where do you start with ETL using MySQL?
Extraction
MySQL is essentially an RDBMS engine, not an ETL tool. It doesn't have specific extraction tools, so you'll probably have to push the data out of the source systems as XML or CSV files. These will then have to be imported into MySQL tables.
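As a rough sketch, assuming the source system has been exported to a CSV file and using purely illustrative table, column and file names, the import into a MySQL staging table might look like this (LOAD DATA INFILE needs the FILE privilege and, on newer servers, a path permitted by secure_file_priv):

```sql
-- Illustrative staging table for the imported source data.
CREATE TABLE staging_exam_results (
  student_id INT,
  exam_mark  INT,
  gender     VARCHAR(10)
);

-- Bulk-load the CSV export from the source system.
LOAD DATA INFILE '/tmp/exam_results.csv'
INTO TABLE staging_exam_results
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the header row
```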
Transformation/Cleansing
Cleansing the data is often more complex. I can't give you a series of MySQL statements to run on your data, but I can, hopefully, give you some pointers.
Dirty data comes in many flavors, but it often consists of the outliers in the data. For example, consider a column that stores exam marks allocated as percentages. It should contain values between 0 and 100 (to put that more formally, the domain of acceptable values is the set 0 - 100). So, run a MIN and MAX query against the column and that will immediately identify whether there are unacceptable values.
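As a minimal sketch against the hypothetical staging table above:

```sql
-- Any result outside 0-100 flags unacceptable values in the column.
SELECT MIN(exam_mark) AS lowest_mark,
       MAX(exam_mark) AS highest_mark
FROM staging_exam_results;
```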
Next, we know that marks are not evenly distributed across the domain, so run a query that groups the values into bands of, say, 0-10, 11-20, 21-30 and so on. Eye-balling the figures will give you an idea of whether they follow the expected skewed distribution.
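One way to write such a banding query, again using the illustrative names from above (the exact band edges are a detail you can adjust):

```sql
-- Count rows in 10-mark bands: 0-9, 10-19, 20-29 and so on.
SELECT FLOOR(exam_mark / 10) * 10 AS band_start,
       COUNT(*) AS rows_in_band
FROM staging_exam_results
GROUP BY band_start
ORDER BY band_start;
```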
Now consider a column called Gender. You expect it to contain two values: Male and Female. You could run your eye down the column and the values might look all right, but there may be 60 million rows. A simple GROUP BY query, however, will show you each discrete value in the column just once. It helps to perform a count on the primary key as well, and if, for example, you see:
| COUNT of Primary Key | Gender |
| --- | --- |
| 31,234,542 | Male |
| 28,438,765 | Female |
| 42 | |
| 563 | F |
| 766 | M |
Then you know you have a problem. You can then construct an UPDATE query to fix it.
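Using the same illustrative table, the profiling query and a corrective UPDATE might look like this (the exact corrections obviously depend on what your data actually contains):

```sql
-- Show every distinct Gender value once, with a count on the primary key.
SELECT gender, COUNT(student_id) AS row_count
FROM staging_exam_results
GROUP BY gender
ORDER BY row_count DESC;

-- Fold the stray single-letter codes into the expected values.
UPDATE staging_exam_results SET gender = 'Male'   WHERE gender = 'M';
UPDATE staging_exam_results SET gender = 'Female' WHERE gender = 'F';
```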
Of course, once you have worked out how to do the cleansing, you still have to automate it in a script.
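One simple approach, sketched here with illustrative names, is to collect the cleansing statements into a single .sql file and run it from a scheduled job via the mysql command-line client:

```sql
-- cleanse_exam_results.sql
-- Run with: mysql my_database < cleanse_exam_results.sql
UPDATE staging_exam_results SET gender = 'Male'   WHERE gender = 'M';
UPDATE staging_exam_results SET gender = 'Female' WHERE gender = 'F';

-- How to handle out-of-range marks is a business decision;
-- deleting them is only one option (routing them to an error table is another).
DELETE FROM staging_exam_results
WHERE exam_mark NOT BETWEEN 0 AND 100;
```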
Loading
The final step is to load the data: move it from the transformation area to the core data warehouse. This operation is often functionally an INSERT but, as a general rule rather than a MySQL-specific one, we usually try to avoid the SQL INSERT statement here because it is far too slow when the data volume is significant. Instead we tend to use whatever bulk insert option the RDBMS offers.
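A sketch of that final hop, assuming the staging and warehouse tables use the illustrative names from earlier and that the server has file access: export the cleansed rows with SELECT ... INTO OUTFILE, then pull them into the warehouse table with LOAD DATA INFILE rather than issuing millions of individual INSERTs.

```sql
-- Export the cleansed rows from the transformation area...
SELECT student_id, exam_mark, gender
INTO OUTFILE '/tmp/exam_results_clean.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
FROM staging_exam_results;

-- ...then bulk-load them into the warehouse table.
LOAD DATA INFILE '/tmp/exam_results_clean.csv'
INTO TABLE dw_exam_results
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
```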
Overall
As I write this answer, I keep coming back to the fact that all of this is easier with an ETL tool. You can still use MySQL to write and run the SQL, but use the ETL tool to schedule that SQL and control the overall process.
More information on ETL tools:
- ETL tools: What you do and don't want
- Extract, transform and load tutorial: Five-minute audio podcast
- ETL tools and EDR tools: What's the difference?