Definitions of machine learning algorithms present in Apache Mahout
By David WORMS
Mar 8, 2013
Apache Mahout is a machine learning library built for scalability. Its core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.
It contains several families of algorithms, which are defined below. Each family may have multiple implementations. Most of these implementations, but not all, are distributed.
Classification
Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
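To make the definition concrete, here is a minimal sketch of a nearest-centroid classifier in plain Python. It is an illustration only, not one of Mahout's implementations: the function names, the two labels and the toy training set are all assumptions made for this example.

```python
from collections import defaultdict
from math import dist


def train_centroids(samples):
    """Compute one centroid per label from (feature_vector, label) pairs."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(centroids, features):
    """Assign the label whose centroid is closest to the new observation."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical training set: observations whose category is already known.
training_set = [
    ((1.0, 1.2), "small"),
    ((0.8, 1.0), "small"),
    ((5.0, 5.5), "large"),
    ((5.2, 4.9), "large"),
]

centroids = train_centroids(training_set)
print(classify(centroids, (1.1, 0.9)))
print(classify(centroids, (4.8, 5.1)))
```

The training step summarizes the known observations, and the classification step uses that summary to label a new one.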
Clustering
Clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
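As an illustration, the sketch below implements a basic k-means loop in plain Python. It is a simplified, single-machine example rather than Mahout's distributed implementation. Seeding the centroids with the first k points is an assumption made to keep the example deterministic, and the sample points are invented.

```python
from math import dist


def k_means(points, k, iterations=10):
    """Group points into k clusters by alternating assignment and update steps."""
    # Assumption: use the first k points as initial centroids (simple, not robust).
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return clusters, centroids


# Hypothetical data with two visibly separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 10.0)]
clusters, centroids = k_means(points, k=2)
for members in clusters:
    print(members)
```

Unlike classification, no labels are supplied: the groups emerge only from the distances between the points.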
Pattern mining
Pattern mining is a data mining method that involves finding existing patterns in data. In this context patterns often mean association rules.
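An association rule such as "bread implies butter" is usually scored by its support and its confidence. The sketch below computes both measures in plain Python. The transactions are invented, and the code only evaluates a given rule; it does not search for frequent patterns the way a full mining algorithm would.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= transaction for transaction in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]

rule_support = support(transactions, {"bread", "butter"})
rule_confidence = confidence(transactions, {"bread"}, {"butter"})
print(rule_support, rule_confidence)
```

Here bread and butter appear together in two of the four baskets, and two of the three baskets containing bread also contain butter.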
Regression analysis
Regression analysis is a statistical technique for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables.
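The simplest case, one independent variable and one dependent variable, can be sketched with ordinary least squares in a few lines of Python. This is a textbook illustration with invented data, not a description of how Mahout implements regression.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The fitted slope estimates how much the dependent variable changes for each unit change in the independent variable.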
Dimension reduction
Dimension reduction is the process of reducing the number of random variables under consideration and can be divided into feature selection and feature extraction.
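Feature selection is the easier of the two approaches to illustrate. The sketch below drops the columns whose variance falls under a threshold, on the assumption that a nearly constant feature carries little information. The threshold and the data are invented for the example. Feature extraction, which builds new variables from the existing ones, is not shown.

```python
from statistics import pvariance


def select_features(rows, min_variance):
    """Keep only the columns whose population variance reaches min_variance."""
    columns = list(zip(*rows))
    kept = [i for i, column in enumerate(columns) if pvariance(column) >= min_variance]
    reduced = [tuple(row[i] for i in kept) for row in rows]
    return kept, reduced


# Hypothetical data set: the middle feature is constant and adds no information.
rows = [
    (1.0, 5.0, 0.0),
    (2.0, 5.0, 10.0),
    (3.0, 5.0, 20.0),
]

kept, reduced = select_features(rows, min_variance=0.1)
print(kept)
print(reduced)
```

Each observation goes from three variables to two, while the variables that actually differ between observations are preserved.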
Evolutionary algorithm
An evolutionary algorithm uses mechanisms inspired by biological evolution, such as reproduction, mutation, recombination, and selection. Candidate solutions to the optimization problem play the role of individuals in a population, and the fitness function determines the environment within which the solutions “live”.
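The mechanisms named above can be sketched on a toy problem: evolving a bit string that contains as many ones as possible. Everything here is an assumption chosen for illustration, including the problem itself, the population size, the mutation rate and the fixed random seed. It does not reflect Mahout's own implementation.

```python
import random


def fitness(individual):
    """Toy fitness function: the number of ones in the bit string."""
    return sum(individual)


def evolve(population_size=20, genome_length=16, generations=40,
           mutation_rate=0.05, seed=0):
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    initial_best = max(fitness(individual) for individual in population)
    for _ in range(generations):
        # Selection: the fitter half survives unchanged and becomes the parents.
        population.sort(key=fitness, reverse=True)
        parents = population[:population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            # Recombination: single-point crossover of the two parents.
            cut = rng.randrange(1, genome_length)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - gene if rng.random() < mutation_rate else gene
                     for gene in child]
            children.append(child)
        population = parents + children
    return initial_best, max(population, key=fitness)


initial_best, best = evolve()
print(initial_best, fitness(best), best)
```

Because the parents survive each generation unchanged, the best fitness in the population can never decrease from one generation to the next.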
Recommenders / Collaborative filtering
Collaborative filtering is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc.
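In a recommender, the "multiple agents" are typically users and the information being filtered is their ratings. The sketch below shows one simple user-based approach: predict a missing rating as the average of other users' ratings, weighted by how similar those users are to the target user. The users, items, ratings and the choice of cosine similarity are assumptions made for this example.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "alice": {"matrix": 5, "titanic": 1, "alien": 4},
    "bob": {"matrix": 4, "titanic": 1, "alien": 5, "blade_runner": 5},
    "carol": {"matrix": 1, "titanic": 5, "blade_runner": 1},
}


def similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(a[item] ** 2 for item in shared))
    norm_b = sqrt(sum(b[item] ** 2 for item in shared))
    return dot / (norm_a * norm_b)


def predict(ratings, user, item):
    """Similarity-weighted average of the other users' ratings for the item."""
    weighted_sum = total_weight = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        weight = similarity(ratings[user], other_ratings)
        weighted_sum += weight * other_ratings[item]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None


prediction = predict(ratings, "alice", "blade_runner")
print(prediction)
```

Alice's tastes match Bob's far more closely than Carol's, so the predicted rating lands nearer to Bob's 5 than to Carol's 1.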
Vector Similarity
Vector similarity measures how close one or more vectors are to each vector in another set, using a measure such as cosine similarity or Euclidean distance.
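As a small illustration, the sketch below compares a query vector with a set of candidate vectors using cosine similarity, one common choice among several possible measures. The vectors and their names are invented for the example.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, candidates):
    """Name of the candidate vector closest in direction to the query."""
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))


# Hypothetical candidate vectors.
candidates = {
    "a": (1.0, 0.0),
    "b": (0.0, 1.0),
    "c": (1.0, 1.0),
}

closest = most_similar((2.0, 2.1), candidates)
print(closest)
```

Cosine similarity ignores vector length and compares direction only, which is why a query of (2.0, 2.1) matches (1.0, 1.0) most closely.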
Collocation
A collocation is a sequence of words or terms that co-occur more often than would be expected by chance.
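"More often than would be expected by chance" can be quantified by comparing how often a word pair is observed with how often it would appear if the two words were independent. The sketch below does this with pointwise mutual information (PMI), chosen here for brevity; it is an assumption for the example and not necessarily the statistic Mahout uses. The sample sentence is invented.

```python
from collections import Counter
from math import log2


def bigram_pmi(tokens):
    """PMI score for every adjacent word pair in the token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigram_total = len(tokens)
    bigram_total = len(tokens) - 1
    scores = {}
    for (first, second), count in bigrams.items():
        p_pair = count / bigram_total
        # Probability of the pair if the two words occurred independently.
        p_independent = (unigrams[first] / unigram_total) * (unigrams[second] / unigram_total)
        scores[(first, second)] = log2(p_pair / p_independent)
    return scores


# Hypothetical corpus.
tokens = "new york is big and new york is busy and the city is new".split()
scores = bigram_pmi(tokens)
print(scores[("new", "york")], scores[("is", "new")])
```

A positive score means the pair occurs more often than independence would predict. PMI tends to overrate pairs of rare words, so it is usually combined with a minimum frequency cutoff.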