Rethinking Relational Database CRUD Design Patterns Since We Live in a Big Data World Базы данных и системы хранения
Chris Bohn (“CB”) has worked for Etsy for 10 years. He was initially hired to manage the company's production PostgreSQL servers, and then assisting in moving production data to horizontally sharded MySQL servers. In the past 5 years, he was tasked with building an analytics capability to replace an aging PostgreSQL-based system that was underperforming. CB has managed Etsy's Vertica installation since then. CB has a degree from the University of California, Berkeley.
The past several years have seen the explosive growth of Big Data Analytics to process huge volumes of data. For example, Big Data analytics can be used to process clickstream data to gain insight into user behavior. Clickstream data generally contains only a reference to the user, without any attributes. This is where marrying production data (dimensions) with clickstream data (facts) yields powerful analytics. Unfortunately, there is often an ETL impedance mismatch between production and Big Data data stores. Depending on the Big Data database, this means either very inefficient ETL, or the inability to even load production data into the system. “CRUD” is an acronym that stands for CReate, Update, Delete. CRUD is a fundamental feature of Relational Databases (“RDBs”) that enables records to be created, updated, deleted. But in the internet and Big Data age of today, the entire notion of CRUD may be fundamentally ill suited, and a better design pattern needed. The ever-decreasing price of data storage and the rise of Big Data Analytics have changed the cost, velocity and value of data. In a modern internet and Big Data context, CRUD has morphed from “pattern” to “anti-pattern” because downstream analytic stacks are not fully or efficiently fed. Yet, many websites and mobile apps are built with RDB backend storage that use CRUD design patterns and thus the business is likely not getting the full benefit of Big Data analytics. This talk describes the problem and presents solutions, many of which are easy to implement. The solutions are based on the author's years of experience at Etsy.com, one of the world's most highly trafficked retail websites.
When RDBs were first brought to market in the 1970s, disk storage was expensive. As such, RDBs were designed to enable updating and deleting records, to minimize the size and cost of the data footprint. This formed a design pattern that has continued into the internet age. Almost every popular web framework today features an Object Relational Model (“ORM”) that maps objects to database records following the CRUD design pattern that updates records and preserves only the latest state – discarding previous values and attributes. Again, this minimizes the storage footprint and thus cost, but the price is lack of data change history. For example, consider a table that stores a user's physical address. If the address record is updated in place, the previous address is then lost. It is better to keep the entire history of a user's addresses, and the way to do that is to treat the table as a log, and just add new addresses, never updating or deleting. But how does one then determine the current address? This is commonly implemented with a pattern called “Soft Delete.” This pattern adds a boolean field to the table to indicate the active record, in this case, the user's current physical address. But this common pattern causes very inefficient ETL to Big Data systems, because inserting a new address record entails an update to the previous entry. Big Data Analytic databases achieve their tremendous speed by means of data encoding and sort order. Unfortunately, updates and deletes are very expensive operations in a database such as Vertica. Other databases such as Google Big Query are append-only, meaning that the Soft Delete pattern can't produce records that can be loaded into it. This talk will present new design patterns based on logging and timestamps that circumvent these issues. Many are easy to implement, even if the source data is high velocity. I will present a pattern for Vertica that uses Top-N Projections, and SQL queries to achieve the same in Google Big Query and other Big Data SQL databases.