Magento to Redshift

This page provides custom instructions on how to extract data from Magento’s ecommerce backend and load it into Amazon Redshift. (If this manual process is a bit more involved than you’d prefer, check out Stitch, which can do all the heavy lifting for you in just a few clicks.)

What is Magento?

Magento is a wildly popular open-source content management system for e-commerce web sites. It is developed in PHP and is known for its flexibility and wide adoption across e-commerce businesses of all sizes.

Pulling Data Out of Magento

There are two ways for you to extract data from Magento: the API and pulling directly from the underlying database.

Magento’s API is unique relative to traditional SaaS APIs because Magento is self-hosted to begin with. While the API provides a helpful interface for extracting structured data, you do have lower-level access available if you decide it’s more appropriate.

Depending on the information you want to extract, the Magento API could be a good fit. You can check out the API Docs to learn more. Be warned, however, that the many historical versions of Magento could lead to inconsistent compatibility with different API calls. In most recent version, Magento offers both REST and SOAP versions of their API.

If you’d prefer to dig in at a lower level, you can actually run queries directly on the underlying database that is powering your Magento instance. (Hitting the API is really just doing this via a layer of abstraction.) In this case, you’re really just doing an exercise in MySQL to Redshift or PostgreSQL to Redshift, depending on which database platform you’re using. If you go this route, it will be very helpful to familiarize yourself with the Magento database structure, which you can find here.

Preparing Magento Data for Redshift

Your Magento data will need to be structured into a schema that can be inserted into a Redshift database. If you don’t mind dealing with the default magento DB structure in your analytical environment, this simply means recreating the tables and fields that you pulled from your Magento API or database in the Redshift environment. You can refer to the API docs or use the information_schema tables in those databases to understand these formats.

Inserting Magento Data into Redshift

Once you have identified all of the columns you will want to insert, you can use the CREATE TABLE statement in Redshift to create a table that can receive all of this data.

With a table built, it may seem like the easiest way to add your data (especially if there isn’t much of it), is to build INSERT statements to add data to your Redshift table row-by-row. If you have any experience with SQL, this will be your gut reaction. But beware! Redshift isn’t optimized for inserting data one row at a time, and if you have any kind of high-volume data being inserted, you would be much better off loading the data into Amazon S3 and then using the COPY command to load it into Redshift.

Keeping Data Up-To-Date

You’ve built a script that pulls data from Magento and loads it into Redshift, but that’s only half the battle. What happens when you get new data?

The key is to build your script in such a way that it can also identify incremental updates to your data. Much of Magento’s data includes fields like created_at or auto-incrementing IDs that allow you to quickly identify records that are new since your last update. You can set your script up as a cron job or continuous loop to keep pulling down new data as it appears.

Other Data Warehouse Options

Redshift is totally awesome, but sometimes you need to start smaller or optimize for different things. In this case, many people choose to get started with Postgres, which is an open source RDBMS that uses nearly identical SQL syntax to Redshift. If you’re interested in seeing the relevant steps for loading this data into Postgres, check out Magento to Postgres

Easier and Faster Alternatives

Magento administration is not easy. If you have all the skills necessary to go through this process, chances are building and maintaining a script like this isn’t a very high-leverage use of your time.

Thankfully, products like Stitch were built to solve this problem automatically. With just a few clicks, Stitch starts extracting your Magento data from the underlying database, structuring it in a way that is optimized for analysis, and inserting that data into your Amazon Redshift data warehouse.