Apache Apex with Apache SAMOA

Apache Apex with Apache SAMOA

By Pierre SAUVAGE

Jul 17, 2016

Traditional Machine Learning

  • Batch Oriented
  • Supervised - most common
  • Training and Scoring
  • One time model building
  • Data set
  • Training: Model building
  • Holdout: Paremeter tuning
  • Test: Accuracy

Online Machine Learning

  • Streaming
  • Change
  • Dynmaically adapt to new patterns in Data
  • Change over time (concept drift)
  • Model updates
  • Approximation Algorithms
  • Single pass: one data item at a time
  • sublinear space and time per data item
  • small error with high probabilities

Apache SAMOA

  • What we need
  • Platform for streaming learning Algorithms
  • Distributed Scalable

Machine learning classification

               Machine learning
              /                \
     Disribué                   Non distibué
    /        \                   /        \
  Batch   streaming           Batch    streaming
    |         |                 |          |
  Hadoop Apex/flink/storm       |          |
    |         |                 |          |
  Mahout    SAMOA            R, WEKA      MOA

Logical Building Blocks

Each block is a processor (an algorithm), then we create a topology of blocks.

Prequential Eval Tasks in SAMOA

  • Interleaved test-then-train
  • Evaluates performance for online classifiers
  • Basic - Overall
  • Sliding Windows based - Most recent

Apex DSPE

Distributed Stream Processing Engine:

  • Apex
  • Storm
  • Flink

Apex Application DAG

  • DAG is composed of vertices (Operators) and Edges (Streams)
  • Stream is a sequence of data tuples which connects operators

Distribution of tuples

  • Calculate Hash of tuple
  • Modulo by the number of partition

Iteration support in Apex

  • Machine-learning needs iterations
  • At the very least, feedback loop
  • Apex topology - Acyclic: DAG

Delay Operators

  • Increment window id for all outgoing ports
  • A note on Fault tolerance

Challenges

Adding DSPE for Apache Apex:

  • Differences in the topology builder APIs of SAMOA and Apex
  • No concept of Ports in SAMOA
  • On demand declaration of streams in SAMOA
  • Cycles in topology: Delay Operator
  • Serialization of Processor state during checkpointing.

Also serialization of tuples:

  • Number of tuples in a single window - Affects number of tuples in future windows coming from the delay operator

Canada - Morocco - France

International locations

10 rue de la Kasbah
2393 Rabbat
Canada

We are a team of Open Source enthusiasts doing consulting in Big Data, Cloud, DevOps, Data Engineering, Data Science…

We provide our customers with accurate insights on how to leverage technologies to convert their use cases to projects in production, how to reduce their costs and increase the time to market.

If you enjoy reading our publications and have an interest in what we do, contact us and we will be thrilled to cooperate with you.