Apache Apex with Apache SAMOA
Jul 17, 2016
Do you like our work......we hire!
Never miss our publications about Open Source, big data and distributed systems, low frequency of one email every two months.
Traditional Machine Learning
- Batch Oriented
- Supervised - most common
- Training and Scoring
- One time model building
- Data set
- Training: Model building
- Holdout: Paremeter tuning
- Test: Accuracy
Online Machine Learning
- Streaming
- Change
- Dynmaically adapt to new patterns in Data
- Change over time (concept drift)
- Model updates
- Approximation Algorithms
- Single pass: one data item at a time
- sublinear space and time per data item
- small error with high probabilities
Apache SAMOA
- What we need
- Platform for streaming learning Algorithms
- Distributed Scalable
Machine learning classification
Machine learning
/ \
Disribué Non distibué
/ \ / \
Batch streaming Batch streaming
| | | |
Hadoop Apex/flink/storm | |
| | | |
Mahout SAMOA R, WEKA MOA
Logical Building Blocks
Each block is a processor (an algorithm), then we create a topology of blocks.
Prequential Eval Tasks in SAMOA
- Interleaved test-then-train
- Evaluates performance for online classifiers
- Basic - Overall
- Sliding Windows based - Most recent
Apex DSPE
Distributed Stream Processing Engine:
- Apex
- Storm
- Flink
Apex Application DAG
- DAG is composed of vertices (Operators) and Edges (Streams)
- Stream is a sequence of data tuples which connects operators
Distribution of tuples
- Calculate Hash of tuple
- Modulo by the number of partition
Iteration support in Apex
- Machine-learning needs iterations
- At the very least, feedback loop
- Apex topology - Acyclic: DAG
Delay Operators
- Increment window id for all outgoing ports
- A note on Fault tolerance
Challenges
Adding DSPE for Apache Apex:
- Differences in the topology builder APIs of SAMOA and Apex
- No concept of Ports in SAMOA
- On demand declaration of streams in SAMOA
- Cycles in topology: Delay Operator
- Serialization of Processor state during checkpointing.
Also serialization of tuples:
- Number of tuples in a single window - Affects number of tuples in future windows coming from the delay operator