Topic Extraction Using Non-Negative Matrix Factorization

As part of my exploration into emerging research trends in machine learning, I recently revived an older project of mine. The core idea was simple: gather abstracts from top machine learning journals, process them, and use Non-Negative Matrix Factorization (NMF) to uncover overarching themes in the field.

The pipeline looks like this:

  1. Collect abstracts from leading journals.
  2. Preprocess text with tokenization and TF-IDF vectorization.
  3. Apply NMF to extract latent topic clusters.
  4. Analyze results by tracking topic popularity over time and identifying representative and highly cited papers for each theme.

I limited the factorization to 20 topics; beyond that, additional clusters offered little new insight while reducing interpretability. Full methodology and code are available in my GitHub repo: https://github.com/trisha-sen/machine_learning_journal_topic_analysis
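
Since the repo holds the full implementation, here is only a minimal sketch of steps 2 and 3 using scikit-learn. The loader and the vectorizer settings are illustrative assumptions, not necessarily what the project uses:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

abstracts = load_abstracts()  # hypothetical step-1 loader; returns a list of strings

# Step 2: tokenize and weight terms with TF-IDF (non-negative by construction).
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.95, min_df=2)
X = vectorizer.fit_transform(abstracts)  # documents x terms

# Step 3: factorize X ≈ W @ H into 20 latent topics.
model = NMF(n_components=20, init="nndsvd", random_state=0)
W = model.fit_transform(X)  # documents x topics: per-paper topic weights
H = model.components_       # topics x terms: per-topic term weights
```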

Figure 1: End-to-end workflow for topic extraction using NMF, from journal abstract collection to clustering and trend analysis.


Why NMF?

Unlike hard clustering approaches, NMF represents each document as a weighted combination of topics. This fits research papers well, since most span multiple themes (e.g., a paper on attention mechanisms may also touch on optimization strategies).
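
Concretely, each paper's topic mixture is just its row of the W factor from the sketch above; normalizing that row gives interpretable proportions:

```python
import numpy as np

# Topic mixture for one paper: its row of W (from the earlier sketch),
# normalized to proportions. The three-topic cutoff is arbitrary.
doc_idx = 0
mixture = W[doc_idx] / W[doc_idx].sum()
for t in np.argsort(mixture)[::-1][:3]:
    print(f"topic {t}: {mixture[t]:.0%}")
# e.g. a paper on attention mechanisms might come out ~60% attention
# and ~25% optimization, rather than belonging to one hard cluster.
```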

The output not only highlighted well-established areas but also surfaced several emerging fields that were new to me – making this a strong starting point for deeper exploration.

Figure 2: Popularity trends of extracted topics across time, highlighting emerging research themes.
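
Trends like those in Figure 2 can be recovered by aggregating W by publication year. A sketch, assuming a hypothetical `years` array aligned with the abstracts:

```python
import pandas as pd
import matplotlib.pyplot as plt

# `years` is a hypothetical array of publication years, one per abstract.
df = pd.DataFrame(W, columns=[f"topic_{i}" for i in range(W.shape[1])])
df["year"] = years

# Yearly popularity: each topic's share of that year's total topic weight.
trend = df.groupby("year").sum()
trend = trend.div(trend.sum(axis=1), axis=0)

trend.plot(legend=False)
plt.xlabel("year")
plt.ylabel("topic share")
plt.show()
```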


Key Groups and Themes

1. Broad, Foundational Topics

The first four clusters capture the backbone of the ML research landscape:

  • Generic learning and modeling
    Keywords: learning, data, model, training, samples, methods, knowledge, classification
  • Attention mechanisms
    Keywords: attention, features, semantic, fusion, segmentation, multi-level
    • Particularly relevant is the expansion of transformer architectures into computer vision.

“Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the self-attention mechanism. Thanks to its strong representation capabilities, researchers are looking at ways to apply transformer to computer vision tasks… Given its high performance and less need for vision-specific inductive bias, transformer is receiving more and more attention from the computer vision community.”

10.1109/TPAMI.2022.3152247

  • Neural Networks (CNNs)
    Keywords: neural, networks, convolutional, pruning, architecture
  • Optimization algorithms
    Keywords: optimization, convergence, evolutionary, search, objective
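
Keyword lists like the ones above can be read off the H factor: the highest-weight terms in each topic row. A short snippet, reusing `vectorizer` and `H` from the earlier sketch:

```python
# Top terms per topic: the highest-weight entries in each row of H.
terms = vectorizer.get_feature_names_out()
for k, row in enumerate(H):
    top = [terms[i] for i in row.argsort()[::-1][:8]]
    print(f"Topic {k}: {', '.join(top)}")
```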

2. Process Control of Dynamic Systems  

Keywords: control, event-triggered, nonlinear, adaptive

“… feedback optimized control design for a class of strict-feedback nonlinear systems that contain unknown internal dynamics and states that are immeasurable and constrained within some predefined compact sets.” (example: a quarter-car active suspension)

10.1109/TNNLS.2021.3051030


3. Image Super-Resolution with Diffusion Models

Keywords: image, resolution, reconstruction, segmentation, medical, high/low quality

This was the most active industry-driven topic, with contributions from Microsoft, Google, and Snapchat.

“Single-image super-resolution is the process of generating a high-resolution image that is consistent with an input low-resolution image. It falls under the broad family of image-to-image translation tasks, including colorization, in-painting, and de-blurring.”

“Diffusion-based image generators have seen widespread commercial interest, such as Stable Diffusion and DALL-E. These models typically combine diffusion models with other models, such as text-encoders and cross-attention modules to allow text-conditioned generation.”

https://imagen.research.google/, 10.1109/TPAMI.2022.3204461, https://en.wikipedia.org/wiki/Diffusion_model


4. Graph Neural Networks (GNNs)

Keywords: graph, nodes, structure, GCN

“Deep learning has revolutionized many machine learning tasks in recent years… The data in these tasks are typically represented in the Euclidean space. However, there is an increasing number of applications where data are generated from non-Euclidean domains and are represented as graphs with complex relationships and interdependency between objects. The complexity of graph data has imposed significant challenges on existing machine learning algorithms. Recently, many studies on extending deep learning approaches for graph data have emerged”

10.1109/TNNLS.2020.2978386


5. Domain Generalization

Keywords: domain adaptation, cross-domain, transfer

“If an image classifier was trained on photo images, would it work on sketch images? What if a car detector trained using urban images is tested in rural environments?… Generalization to out-of-distribution (OOD) data is a capability natural to humans yet challenging for machines to reproduce. This is because most learning algorithms strongly rely on the i.i.d. assumption on source/target data, which is often violated in practice due to domain shift. Domain generalization (DG) aims to achieve OOD generalization by using only source data for model learning.”

10.1109/TPAMI.2022.3195549


6. Small Object Detection (SOD)

Keywords: detection, anomaly, small objects

“Small Object Detection (SOD), as a sub-field of generic object detection, which concentrates on detecting those objects with small size, is of great theoretical and practical significance in various scenarios such as surveillance, drone scene analysis, pedestrian detection, traffic sign detection in autonomous driving, etc. There remains a huge performance gap in detecting small and normal-sized objects even for leading detectors.”

10.1109/TPAMI.2023.3290594


7. Predictive Learning of Spatiotemporal Sequences

Keywords: temporal, video, motion, frames

“As a key application of predictive learning, generating future frames from historical consecutive frames has received growing interest in machine learning and computer vision communities. It benefits many practical applications and downstream tasks, such as the precipitation forecasting, traffic flow prediction, physical scene understanding, early activity recognition, deep reinforcement learning, and the vision-based model predictive control. Many of these existing approaches suggested leveraging RNNs with stacked LSTM units to capture the temporal dependencies of spatiotemporal data.”

10.1109/TPAMI.2022.3165153


8. Multivariate Time-Series Prediction

Keywords: time series, forecasting, multivariate

“Modern cyberphysical systems (CPS), such as those encountered in manufacturing, aircraft and servers, involve sophisticated equipment that records multivariate time-series (MVTS) data from 10s, 100s, or even thousands of sensors. The MVTS need to be continuously monitored to ensure smooth operation and prevent expensive failures”

10.1109/TNNLS.2021.3105827


9. Other Active Areas

Several other high-interest topics emerged from the remaining clusters as well.


Final Thoughts

This project demonstrates how topic modeling can map the evolving landscape of machine learning research. By systematically analyzing abstracts, we can:

  • Track emerging vs. declining research themes
  • Identify seminal works within each cluster
  • Spot areas ripe for exploration and collaboration
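
For the second point, the simplest version is an argmax over each column of W; `titles` is a hypothetical list aligned with the abstracts, and citation-based ranking would additionally require citation metadata:

```python
import numpy as np

# Most representative paper per topic: the document with the largest
# weight in that topic's column of W. `titles` is hypothetical and
# aligned with `abstracts`.
for k in range(W.shape[1]):
    best = int(np.argmax(W[:, k]))
    print(f"Topic {k:2d}: {titles[best]}")
```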

For me, the most exciting part was uncovering entire research areas I hadn’t explored before – a reminder of just how fast the field continues to expand.
