Sawzall and the PIG

When I heard interesting uses cases of how “Sawzall” is used to hack huge amounts of log data within Google I was thinking about two things.

  • Apache PIG, which is “a platform for analyzing large data sets that consists of a high-level language Pig for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.”
  • CEP (Complex event processing) – consists in processing many events happening across all the layers of an organization, identifying the most meaningful events within the event cloud, analyzing their impact, and taking subsequent action in real time. [ Also look at esper ]

Google has opened parts of this framework in a project called “Szl”

Sawzall is a procedural language developed for parallel analysis of very large data sets (such as logs). It provides protocol buffer handling, regular expression support, string and array manipulation, associative arrays (maps), structured data (tuples), data fingerprinting (64-bit hash values), time values, various utility operations and the usual library functions operating on floating-point and string values. For years Sawzall has been Google’s logs processing language of choice and is used for various other data analysis tasks across the company.

Instead of specifying how to process the entire data set, a Sawzall program describes the processing steps for a single data record independent of others. It also provides statements for emitting extracted intermediate results to predefined containers that aggregate data over all records. The separation between per-record processing and aggregation enables parallelization. Multiple records can be processed in parallel by different runs of the same program, possibly distributed across many machines. The language does not specify a particular implementation for aggregation, but a number of aggregators are supplied. Aggregation within a single execution is automatic. Aggregation of results from multiple executions is not automatic but an example program is supplied.

Here is a quick example of how it could be used…

  topwords: table top(3) of word: string weight count: int;
   fields: array of bytes = splitcsvline(input);
   w: string = string(fields[0]);
   c: int = int(string(fields[1]), 10);
   if (c != 0) {
   emit topwords <- w weight c;
   }

Given the input:

  abc,1
   def,2
   ghi,3
   def,4
   jkl,5

The program (using –table_output) emits:

  topwords[] = def, 6, 0
   topwords[] = jkl, 5, 0
   topwords[] = ghi, 3, 0