DC-NMF: nonnegative matrix factorization based on divide-and-conquer for fast clustering and topic modeling

DC-NMF: nonnegative matrix factorization based on divide-and-conquer for fast clustering and topic modeling

0.00 Avg rating0 Votes
Article ID: iaor20172809
Volume: 68
Issue: 4
Start Page Number: 777
End Page Number: 798
Publication Date: Aug 2017
Journal: Journal of Global Optimization
Authors: , ,
Keywords: datamining, matrices, heuristics
Abstract:

The importance of unsupervised clustering and topic modeling is well recognized with ever‐increasing volumes of text data available from numerous sources. Nonnegative matrix factorization (NMF) has proven to be a successful method for cluster and topic discovery in unlabeled data sets. In this paper, we propose a fast algorithm for computing NMF using a divide‐and‐conquer strategy, called DC‐NMF. Given an input matrix where the columns represent data items, we build a binary tree structure of the data items using a recently‐proposed efficient algorithm for computing rank‐2 NMF, and then gather information from the tree to initialize the rank‐k NMF, which needs only a few iterations to reach a desired solution. We also investigate various criteria for selecting the node to split when growing the tree. We demonstrate the scalability of our algorithm for computing general rank‐k NMF as well as its effectiveness in clustering and topic modeling for large‐scale text data sets, by comparing it to other frequently utilized state‐of‐the‐art algorithms. The value of the proposed approach lies in the highly efficient and accurate method for initializing rank‐k NMF and the scalability achieved from the divide‐and‐conquer approach of the algorithm and properties of rank‐2 NMF. In summary, we present efficient tools for analyzing large‐scale data sets, and techniques that can be generalized to many other data analytics problem domains along with an open‐source software library called SmallK.

Reviews

Required fields are marked *. Your email address will not be published.