Are you also waiting for a data import to finish? Then you have come to the right place. Today I am going to share my experiences working with Redshift and Alteryx bulk loading in combination. This serves as an example and can also be done with other databases that support bulk loading (SQL server, Teradata, and Redshift)
Even though Alteryx is quite fast at processing data, I would also like to show how to utilize the power of a big scalable cluster to process data. To do that, I will do the following:
- Move the data to Redshift and show the difference in performance between traditional ODBC Loading and Bulk Loading
- Show queries using the In-Database tools in Alteryx
A little about the dataset:
- US flight data from 1990-2012
- 96.825.753 rows
- Stored locally on my desktop in a YXDB file I am not going to bore you with technical details in this post, but you are more than welcome to contact me for further information.
Moving The Data to Redshift
You remember I asked you about the data import? If yes, then now is the time to pay attention. As a matter of fact I didn't have the patience to let the ODBC import finish however the Bulk Load happened in a little less than 10 minutes.
To give a short explanation on the difference between ODBC loading and Bulk loading let me give you a quick example. ODBC loading happens a row at a time. Imagine stamping the letters one by one and sending them to the post office - not very effective. Bulk loading on the other hand sends all the data in one bulk, hence the name, and then loads the data into the database. This would compare to giving a huge pile of letters to the post office and let their SUPER stamper do the job - much more effective.
So back to the numbers:
Bulk Load: 96.825.753 rows in 9:29 minutes
ODBC Load (Stopped): 381.440 rows in 1:49 hours
I think the numbers speak for themselves. If you want a little bit more info, you can see this video https://vimeo.com/144868131.
Using In-Database Tools For Analysis
In version 9.5 Alteryx introduced in-database tools to make processing of data even smarter. (http://www.alteryx.com/solutions/in-database-processing).
Basically the In-database tools helps you create a query using the tools you are familiar. The result of the query can then be sent back to in-memory processing for further analysis.
In this video, I am showing the difference between doing stuff in-database and in-memory: https://vimeo.com/144866052
The results are quite significant:
In-memory: 2:47 minutes
In-database: 23,1 seconds
I think the above clearly shows the power of both bulk loading an in-database tools. This happens by giving the power to processing back to the database. Start using this today, and you will see huge time improvements in your daily routines.