# Beyond MapReduce

## In-memory processing

### Hadoop's problems

- Batch processing system:
  - Does not process streamed data, so results are slow to arrive
  - Designed to process very large data sets; not suited for many small files
- Efficient at the map stage, bad at I/O:
  - Data is loaded from and written back to HDFS
  - Shuffle and sort generate a lot of network traffic
  - Job startup and finish take seconds, regardless of job size
- Not a good fit for every case:
  - The structure is rigid: Map, Combiner, Shuffle and sort, Reduce
  - No support for iteration
  - Only one synchronization barrier
  - Bad at e.g. graph processing

### Intro to in-memory processing

- Definition: load the data into memory before processing starts
- Advantages:
  - More flexible computation
  - Iteration is supported
  - No slow I/O required
- Disadvantages:
  - Data must fit in the memory of the distributed storage
  - Additional measures are needed for persistence
  - Fault tolerance is mandatory
- Major frameworks:
  - Apache Spark
  - Graph-centric: Pregel
  - SQL-focused, read-only: Cloudera Impala

### Spark

- Open-source, general engine for large-scale distributed data processing
- A **cluster computing** platform that has an API for distributed programming
- In-memory processing and storage engine
- Loads data from HDFS or Cassandra
- Resource management via standalone Spark, EC2, or YARN
- Can work with Hadoop, or standalone
- Runs locally or on clusters
- Goal: provide distributed datasets that users can use as if they were local
- Keeps the shiny bits of MapReduce:
  - Fault tolerance
  - Data locality
  - Scalability
- Approach: augment the data flow with RDDs

### RDD: Resilient Distributed Datasets

- The basic level of abstraction in Spark
- Distributed memory model: RDDs
- Immutable collections of data, **distributed** across the nodes of the cluster
- A new RDD is created by:
  - loading data from input, or
  - transforming an existing collection to generate a new one
- Can be saved to HDFS or handed to other programs with an action
- Operations:
  - Transformations: define a new RDD from an existing one
    - `map`
    - `filter`
    - `sample`
    - `union`
    - `groupByKey`
    - `reduceByKey`
    - `join`
    - `cache`
  - Actions: take an RDD and return a result to the driver
    - `reduce`
    - `collect`
    - `count`
    - `save`
    - `lookup`

### Scala (Not in test)

- Scala is the native language for Spark
- Syntax similar to Java, but with powerful type inference

### Spark application

- Consists of a **driver** that executes various parallel operations on **RDDs** partitioned across the cluster
- The driver runs on a different machine from where the RDDs are created
- Use an action to retrieve data from an RDD
- TODO: look at diagram at p20

### Components

- The driver program runs the user's main function and executes parallel operations on a cluster
- A Spark application runs as **independent** sets of processes, coordinated by a `SparkContext` in the driver
- The context connects to a cluster manager such as YARN, which allocates system resources
- Work in the cluster is carried out by **executors**, which are managed by the `SparkContext`
- **Executors** are responsible for executing tasks and storing data
- Deployment depends on the cluster manager used, e.g. YARN or standalone Spark

### Computation

- Using anonymous functions, or named functions
- `map`: creates a new RDD, with each original value replaced by the value returned by the mapping function
- `filter`: creates a new RDD containing fewer values (only those that satisfy the predicate)

### Deferred execution

- Transformations are executed only at the moment their results are needed
- Only the invocation of an action triggers the execution chain
- This allows internal optimization: operations can be combined (see the sketch below)
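To make the RDD operations and deferred execution above concrete, here is a minimal word-count-style sketch in Scala against the classic `SparkContext` API; the master URL, input path, and object name are illustrative assumptions, not from the notes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Local master URL and input path are assumptions for illustration.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Transformations only *define* new RDDs; nothing executes yet.
    val lines     = sc.textFile("hdfs:///data/input.txt") // load data from input
    val words     = lines.flatMap(_.split("\\s+"))        // transformation
    val longWords = words.filter(_.length > 3)            // transformation (anonymous function)
    val counts    = longWords.map(w => (w, 1))            // pair RDD for key-based operations
                             .reduceByKey(_ + _)          // transformation

    // The action triggers the whole deferred chain at once,
    // which lets Spark combine the operations internally.
    val result = counts.collect()                         // action: results back to the driver
    result.take(10).foreach(println)

    sc.stop()
  }
}
```

Note how the driver only holds the `SparkContext` and the final collected array; every transformation before `collect()` runs on the executors, partitioned across the cluster.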
### Spark performance

#### Issues

- Because Spark allows so much freedom, task allocation is much more challenging
- Errors appear more often, and are hard to debug
- Knowing the basics of MapReduce helps

#### Tuning

- Memory: Spark uses more memory
- Partitioning of RDDs
- Performance implications of each operation

### Spark ecosystem

- GraphX: graph processing on RDDs
- MLlib: machine learning
- Spark SQL
- Spark Streaming: **stream** processing with DStream RDDs

## Stream processing

### Information streams

- Data continuously generated from various sources
- Unbounded; arrival times are not fixed
- Process the information the moment it is generated
- Apply a function to each new element
- Look for **real-time** changes and respond to them

### Apache Storm

#### Intro

- Developed at BackType; now an Apache project
- Real-time computation on streams
- Features:
  - Scalable
  - Guarantees no data loss
  - Extremely robust and fault tolerant
  - Programming-language agnostic
- Distributed stream processing: tasks are distributed across the cluster

## Discretized Streams

- Unlike true stream processing, information is processed in micro-batches
- Input -> Spark Streaming -> batched input data -> Spark Engine -> processed batch data
- In Spark:
  - Reuses the Spark framework
  - Can use Spark transformations on RDDs
  - Constructs an RDD every few seconds (a defined time interval) to manage the data stream
  - A new RDD is processed at each time slot

### DStream RDD

- Composed of a series of RDDs that represent the data over time
- Choosing the time interval:
  - Small interval: quicker response time, at the cost of frequent batching

### DStream Transformations

- Change each RDD in the stream

### DStream Streaming Context

- Create a `StreamingContext` to manage the stream and its transformations; an action is needed to collect the results

### DStream Sliding Windows

- Some use cases require looking at a set of stream messages to perform a computation
- A sliding window stores a rolling list of the latest items from the stream
- The contents change over time, with new items added and old items dropped
- Use in a Spark DStream:
  - The API configures the size of the window (seconds) and the frequency of computation (seconds)
  - e.g. `reduceByKeyAndWindow((a, b) => math.max(a, b), Seconds(60), Seconds(5))` (a fuller sketch follows below)
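As a concrete illustration of the windowed API, here is a minimal Spark Streaming sketch in Scala that keeps a per-key maximum over a 60-second window, recomputed every 5 seconds; the socket source, host/port, input format, and object name are assumptions for the example.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowSketch {
  def main(args: Array[String]): Unit = {
    // Local master URL is an assumption; at least 2 threads are needed
    // so the receiver and the processing can run concurrently.
    val conf = new SparkConf().setAppName("dstream-window").setMaster("local[2]")

    // Batch interval: a new RDD is constructed every 5 seconds.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical source: lines of "key value" pairs from a local socket.
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.map { line =>
      val Array(k, v) = line.split("\\s+", 2)
      (k, v.trim.toInt)
    }

    // Sliding window: per-key maximum over the last 60 seconds,
    // recomputed every 5 seconds as the window slides.
    val maxPerKey = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => math.max(a, b),
      Seconds(60), // window size
      Seconds(5)   // slide interval / computation frequency
    )

    maxPerKey.print() // output action: triggers the streaming computation

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that the window size and slide interval must both be multiples of the batch interval, since each window is assembled from whole micro-batch RDDs.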