Definitions of machine learning algorithms present in Apache Mahout.

Apache Mahout is a machine learning library built for scalability. Its core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.

Mahout contains several categories of algorithms, each defined below. A category may have multiple implementations. Most, but not all, of the implementations are distributed.

Classification

Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
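
As an illustration of the definition above, here is a minimal sketch of a 1-nearest-neighbor classifier. It is not Mahout's implementation; the function name `nearest_neighbor_classify` and the toy training set are hypothetical and chosen only to show how known category memberships are used to label a new observation.

```python
import math


def nearest_neighbor_classify(training_set, observation):
    """Return the category of the training instance closest to `observation`."""
    _, closest_label = min(
        training_set,
        key=lambda instance: math.dist(instance[0], observation),
    )
    return closest_label


# Hypothetical training set: (features, known category) pairs.
training_set = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 8.0), "large"),
    ((9.0, 7.5), "large"),
]

# A new observation whose category is unknown.
predicted = nearest_neighbor_classify(training_set, (7.0, 9.0))
print(predicted)
```

The new observation is assigned the category of its nearest labeled neighbor, which is the simplest way to use a training set of known memberships.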

Clustering

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
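
One common way to form such groups is k-means (Lloyd's algorithm). The sketch below is a single-machine illustration, not Mahout's distributed implementation; the function name `k_means`, the initial centroids, and the sample points are assumptions made for the example.

```python
import math


def k_means(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids


# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
clusters, centroids = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(clusters)
```

Here "similar" means "close in Euclidean distance"; other distance measures lead to different groupings.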

Pattern mining

Pattern mining is a data mining method that involves finding existing patterns in data. In this context, "patterns" often means association rules.
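
To make the idea of an association rule concrete, the following sketch counts frequent item pairs and computes the confidence of a rule such as "bread implies milk". It is a brute-force illustration rather than any algorithm Mahout ships; `frequent_pairs`, `confidence`, and the transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Count item pairs and keep those seen in at least `min_support` transactions."""
    counts = Counter()
    for transaction in transactions:
        for pair in combinations(sorted(set(transaction)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    return sum(consequent in t for t in with_antecedent) / len(with_antecedent)


# Hypothetical shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "milk", "eggs"},
    {"butter", "eggs"},
]

pairs = frequent_pairs(transactions, min_support=2)
rule_confidence = confidence(transactions, "bread", "milk")
print(pairs, rule_confidence)
```

A pair with high support and a rule with high confidence together suggest a pattern worth reporting.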

Regression analysis

Regression analysis is a statistical technique for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables.
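
The simplest case, one dependent and one independent variable, can be sketched with ordinary least squares. The function name `fit_line` and the data are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The fitted slope estimates how much the dependent variable changes per unit change in the independent variable.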

Dimension reduction

Dimension reduction is the process of reducing the number of random variables under consideration and can be divided into feature selection and feature extraction.
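
As a sketch of the feature selection branch only, the example below drops columns whose variance falls below a threshold. Feature extraction (for example, projecting onto principal components) is not shown. The name `select_features`, the threshold, and the data are hypothetical.

```python
from statistics import pvariance


def select_features(rows, min_variance):
    """Keep only the columns whose population variance is at least `min_variance`."""
    columns = list(zip(*rows))
    kept = [i for i, column in enumerate(columns)
            if pvariance(column) >= min_variance]
    reduced = [tuple(row[i] for i in kept) for row in rows]
    return kept, reduced


# Hypothetical data: the middle column is constant and carries no information.
rows = [
    (1.0, 5.0, 0.1),
    (2.0, 5.0, 0.9),
    (3.0, 5.0, 0.5),
]
kept, reduced = select_features(rows, min_variance=0.05)
print(kept, reduced)
```

Feature selection keeps a subset of the original variables unchanged, whereas feature extraction builds new variables from combinations of them.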

Evolutionary algorithm

An evolutionary algorithm uses mechanisms inspired by biological evolution, such as reproduction, mutation, recombination, and selection. Candidate solutions to the optimization problem play the role of individuals in a population, and the fitness function determines the environment within which the solutions “live”.
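
The mechanisms named above can be sketched with a toy genetic algorithm over bit strings. This is an illustrative example, not Mahout code; `evolve`, its parameters, and the choice of `sum` as the fitness function (the "OneMax" problem) are assumptions.

```python
import random


def evolve(fitness, genome_length, population_size=20, generations=40, seed=0):
    """Evolve bit-string genomes toward higher `fitness` values."""
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(genome_length)]
        for _ in range(population_size)
    ]
    for _ in range(generations):
        # Selection: the fitter half survives and acts as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            # Recombination: single-point crossover of two parents.
            cut = rng.randrange(1, genome_length)
            child = mother[:cut] + father[cut:]
            # Mutation: flip one randomly chosen bit of the child.
            position = rng.randrange(genome_length)
            child[position] = 1 - child[position]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


# Fitness is the number of 1-bits, so the optimum is the all-ones genome.
best = evolve(sum, genome_length=12)
print(best, sum(best))
```

Because the parents survive each generation unchanged, the best fitness in the population never decreases from one generation to the next.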

Recommenders / Collaborative filtering

Collaborative filtering is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc.
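
A small sketch may help: here the "collaboration" is among users, whose combined preferences determine which unseen items are suggested. It uses simple item co-occurrence counts and is not Mahout's recommender API; `recommend` and the preference data are hypothetical.

```python
from collections import Counter
from itertools import permutations


def recommend(preferences, user, top_n=2):
    """Suggest unseen items that co-occur most often with the user's items."""
    # Count how often each ordered pair of items is liked by the same user.
    cooccurrence = Counter()
    for items in preferences.values():
        for a, b in permutations(items, 2):
            cooccurrence[(a, b)] += 1

    seen = preferences[user]
    scores = Counter()
    for (a, b), count in cooccurrence.items():
        if a in seen and b not in seen:
            scores[b] += count
    return [item for item, _ in scores.most_common(top_n)]


# Hypothetical preference data: user -> set of liked items.
preferences = {
    "alice": {"hadoop", "mahout"},
    "bob": {"hadoop", "mahout", "lucene"},
    "carol": {"hadoop", "lucene", "solr"},
    "dave": {"mahout", "lucene"},
}
suggestions = recommend(preferences, "alice")
print(suggestions)
```

No item attributes are used; the suggestions come entirely from the behavior of other users.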

Vector Similarity

Vector Similarity allows one to compare one or more vectors with another set of vectors.
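
As one possible similarity measure, the sketch below uses cosine similarity to find, for each vector in one set, the most similar vector in another set. The names `cosine_similarity` and `most_similar` and the sample vectors are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(queries, candidates):
    """For each query vector, return the index of its most similar candidate."""
    return [
        max(range(len(candidates)),
            key=lambda i: cosine_similarity(query, candidates[i]))
        for query in queries
    ]


# Hypothetical vectors.
queries = [(1.0, 0.0), (1.0, 1.0)]
candidates = [(0.0, 2.0), (3.0, 0.1), (2.0, 2.0)]
matches = most_similar(queries, candidates)
print(matches)
```

Cosine similarity ignores vector length and compares direction only; other measures, such as Euclidean distance, can be substituted in `most_similar`.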

Collocation

A collocation is a sequence of words or terms that co-occur more often than would be expected by chance.
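
"More often than would be expected by chance" can be quantified in several ways. The sketch below uses pointwise mutual information (PMI) over adjacent word pairs, chosen here for simplicity; it is not necessarily the statistic Mahout uses. The name `bigram_pmi` and the sample text are hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Score each adjacent word pair by log2(P(pair) / (P(first) * P(second)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total_unigrams = len(tokens)
    total_bigrams = len(tokens) - 1
    scores = {}
    for (first, second), count in bigrams.items():
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        # Positive PMI: the pair occurs more often than independence predicts.
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Hypothetical text.
tokens = "new york is big and new york is old and the city is new".split()
scores = bigram_pmi(tokens)
print(scores[("new", "york")], scores[("is", "new")])
```

PMI tends to overrate pairs of rare words, so practical systems usually combine it with a minimum frequency cutoff or use a different test statistic.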