Реальный опыт крупной компании в сфере, где инет-компании пока слабы перед большими корпорациями

Доклад принят в Программу конференции
Chris Bohn (CB) (Etsy.com)Chris Bohn (CB)

Тематика доклада: как анализировать огромные объёмы информации о поведении юзеров на сайте максимально оперативно и полно, какие для этого нужны инструменты, если мы ориентируемся не на закупку монстроидальных Oracle/SAP/MS решений, а на open source.

Эта тема, вообще, мало освещена в нашей среде, а у корпораций ведь огромные теории выстроены — хранилища данных, OLAP, многомерный анализ и т.п.

CB заявил, что их инструменты для связки postgres/vertica/обработка http-логов он выпустит в open source прямо на конференции.

 

Abstract:

Etsy.com is a very popular consumer-facing site, with customers in almost every country worldwide and which has high traffic and transaction volume. Etsy database architecture has gone through a number of changes to meet the demands of scaling the site. One of our current projects is upgrading our data analytics capabilities. We use Hadoop for a variety of map reduce jobs, but Hadoop is batchoriented and not suited to ad hoc queries. Furthermore, our business analysts live in a SQL world, they know how to write good SQL queries, and Hadoop does not provide a good match for their needs.

What we wanted was a system wherein our analysts could write the SQL they are used to, but have access to all the big data like clickstreams, and also to be able to correlate that big data against the fact and dimension data from our master databases of record. To provide our business analysts with fast and powerful tools for analyzing data, Etsy licensed Vertica, which is a columnar store database with a focus on fast analytic queries. This talk will present our experience with Vertica, and in particular, the tools that we built to get data from a variety of sources into it reliably and timely. Etsy has a vast amount of clickstream data preserved in log files, and we use large batch jobs to reduce then load many terabytes of behavioral data into Vertica. We also replicate fact and dimension data to Vertica as a continuous trickle load process directly from our master Postgres database. To accomplish this, we had to build our own Postgres-->Vertica replication system, since none existed. We have been successfully

using these replication tools to keep Vertica synchronized with the fact and dimension data from our master Postgres database. Etsy has always been a leader in open-sourcing its internal tools, and the tools we have built for Postgres-->Vertica replication will be officially released to the open source community at this Hiload++ conference. Note that in addition to Postgres, these tools also support replication from MySQL and other sources.

The talk will first discuss the overall architecture and topology of our data ecosystem, and then give a thorough introduction to these replication tools. Topics covered will be installation, operation and monitoring, and the system will be demonstrated. Vertica is a licensed product, but a Community Edition that supports up to 1 TB of raw data is available free of charge. The Community Edition of Vertica can be very useful in a number of ways, from analytic reporting to doing quick audits of production Postgres tables, and the talk will discuss several use cases. This talk will show how, using the replication tools, it is very easy to quickly load data into Vertica from Postgres and other sources.

Target Audience:

Any DBA or IT professional that has an interest in analytic-oriented databases. After the presentation, those attending will have the confidence and ability to set up a small, free Vertica installation that can be used for analytic jobs, and using the tools we are releasing, to very quickly replicate data from Postgres to Vertica.