    Clustering Unstructured Text with LLM Embeddings and HDBSCAN

    By Admin | June 23, 2026 | 9 Mins Read

    In this article, you will learn how to build a text clustering pipeline by combining large language model embeddings with HDBSCAN, a density-based clustering algorithm, to automatically discover topics in unlabeled text data.

    Topics we will cover include:

    • How to generate text embeddings for raw documents using a pre-trained sentence-transformers model.
    • How to reduce the dimensionality of those embeddings with UMAP to prepare them for clustering.
    • How to apply HDBSCAN to automatically discover topic clusters and visualize the results.

    Introduction

    The current era of Generative AI seems to primarily focus on chat interfaces and prompts, but the range of applications of large language models, or LLMs for short, is not limited to just that. Indeed, one of their most powerful downstream abilities consists of turning raw, messy, unstructured text into semantically rich mathematical representations called embeddings. Once that’s done, we can use these text representations for a variety of machine learning use cases, with clustering being no exception.
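
To make the idea of "semantically rich" vectors a bit more concrete, here is a minimal sketch that compares three tiny vectors with cosine similarity. The three-dimensional vectors and their text labels are invented for illustration and were not produced by any real model; actual embedding models output hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (hand-written, not model output)
toy_embeddings = {
    "rocket launch": [0.9, 0.1, 0.0],
    "satellite orbit": [0.8, 0.2, 0.1],
    "engine oil change": [0.1, 0.9, 0.2],
}

query = toy_embeddings["rocket launch"]
similarities = {
    text: cosine_similarity(query, vector)
    for text, vector in toy_embeddings.items()
    if text != "rocket launch"
}

for text, score in similarities.items():
    print(f"rocket launch vs {text}: {score:.3f}")
```

Texts about related subjects end up with a similarity close to 1, while unrelated ones score much lower. That geometric closeness is what a clustering algorithm can exploit.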

    In particular, embeddings can be combined with density-based clustering techniques like HDBSCAN to discover hidden topics, patterns, or categories in a collection of text documents, all without prior labeling.

    This article shows how to construct a text clustering pipeline from scratch. We will use a freely available text dataset and an open-source model trained to generate embeddings, i.e. a so-called embedding model. The icing on the cake: we'll rely on free, modern Python libraries that provide implementations of clustering algorithms like HDBSCAN.

    Step-by-Step Walkthrough

    First, let’s start by installing the key Python libraries we will need:

    • sentence-transformers, to load a pre-trained embedding model from the Hugging Face Hub. The model used here is public, so it downloads without an access token; a token is only needed for gated or private models.
    • umap-learn, to reduce the dimensionality of the embeddings.

    If you are working in a local environment rather than a cloud notebook, you may also need to install scikit-learn, pandas, matplotlib, and seaborn. The HDBSCAN class used below ships with scikit-learn 1.3 and later.

    !pip install sentence-transformers umap-learn

    Now we start coding by getting some data. The fetch_20newsgroups function from scikit-learn, which downloads a collection of newsgroup posts labeled by topic, will do. Even though the dataset contains labels, we will ignore them for clustering, as we are pretending not to know this information and want to group the documents purely by similarity. We also sample the dataset down to 150 documents, which is enough for our example.

    import pandas as pd
    from sklearn.datasets import fetch_20newsgroups

    # Fetching three newsgroup categories from the training split
    categories = ['sci.space', 'sci.med', 'rec.autos']
    newsgroups = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))

    # Keeping documents longer than 100 characters, then sampling 150 of them
    df = pd.DataFrame({'text': newsgroups.data, 'true_label': newsgroups.target})
    df = df[df['text'].str.strip().str.len() > 100].sample(150, random_state=42).reset_index(drop=True)

    print(f"Loaded {len(df)} text documents.")
    print("\nSample document:")
    print(df['text'].iloc[0][:150] + "...")

    Output:

    Loaded 150 text documents.

    Sample document:

    Okay Mr. Dyer, we're properly impressed with your philosophical skills and
    ability to insult people. You're a wonderful speaker and an adept politic...

    The next step is to obtain the embeddings from raw texts. To do this, we load all-MiniLM-L6-v2 from Hugging Face’s sentence-transformers library. This is a lightweight yet effective model to obtain embeddings quickly.

    from sentence_transformers import SentenceTransformer

    # Loading the free, open-source model
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Encoding text documents into dense vector embeddings
    print("Generating embeddings...")
    embeddings = model.encode(df['text'].tolist(), show_progress_bar=True)

    print(f"Embedding matrix shape: {embeddings.shape}")
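
The steps that follow are distance-based, so it can help to see how Euclidean and cosine distance relate. For unit-length vectors, the squared Euclidean distance equals 2 minus twice the cosine similarity, which means both measures rank neighbors identically. The sketch below checks that identity on two random vectors in pure Python. The random vectors are stand-ins for real embeddings, and the helper names are made up for this example. In sentence-transformers, unit-length output can be requested with `normalize_embeddings=True` in `encode`.

```python
import math
import random


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
dim = 384  # output size of all-MiniLM-L6-v2

# Random stand-ins for two embeddings (assumption: not real model output)
a = l2_normalize([rng.gauss(0, 1) for _ in range(dim)])
b = l2_normalize([rng.gauss(0, 1) for _ in range(dim)])

lhs = squared_euclidean(a, b)
rhs = 2 - 2 * dot(a, b)  # for unit vectors, the dot product is the cosine

print(f"squared Euclidean distance: {lhs:.6f}")
print(f"2 - 2 * cosine similarity:  {rhs:.6f}")
```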

    Each embedding has 384 dimensions, which is high for density-based clustering, so we now reduce the dimensionality with the UMAP algorithm from the umap-learn library installed earlier:

    import umap

    # Reducing embedding dimensions to 5, to retain enough density information for clustering
    reducer = umap.UMAP(n_neighbors=15, n_components=5, min_dist=0.0, random_state=42)
    reduced_embeddings = reducer.fit_transform(embeddings)

    print(f"Reduced matrix shape: {reduced_embeddings.shape}")
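
One commonly cited reason for reducing dimensionality before density-based clustering is distance concentration: in high-dimensional spaces, the nearest and farthest neighbors of a point tend to sit at similar distances, which blurs the notion of density. The following sketch illustrates the effect with uniformly random points, which is an assumption made purely for demonstration (real embeddings are not uniformly distributed). The `distance_contrast` helper is hypothetical.

```python
import math
import random


def distance_contrast(n_points, dim, seed=0):
    """Ratio of the largest to the smallest pairwise distance among random points."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    distances = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    return max(distances) / min(distances)


low_dim_contrast = distance_contrast(n_points=100, dim=2)
high_dim_contrast = distance_contrast(n_points=100, dim=384)

print(f"Contrast in   2 dimensions: {low_dim_contrast:.1f}")
print(f"Contrast in 384 dimensions: {high_dim_contrast:.2f}")
```

In two dimensions the farthest pair is many times farther apart than the closest pair, whereas in 384 dimensions the ratio stays close to 1, so "dense" and "sparse" regions become harder to tell apart.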

    Each document is now represented by a vector of only five dimensions. Let's see if this compact representation is meaningful enough to obtain insightful clusters by applying HDBSCAN, a density-based clustering algorithm:

    from sklearn.cluster import HDBSCAN

    # Initializing HDBSCAN
    # min_cluster_size=8: each cluster must contain at least 8 documents
    # store_centers='centroid': also stores each cluster's centroid in clusterer.centroids_
    clusterer = HDBSCAN(min_cluster_size=8, min_samples=3, store_centers='centroid')
    df['cluster'] = clusterer.fit_predict(reduced_embeddings)

    # Counting instances per cluster
    cluster_counts = df['cluster'].value_counts()
    print("\nCluster Distribution:")
    print(cluster_counts)

    Important: the clustering results are partly influenced by the hyperparameter settings we defined for HDBSCAN. I recommend you try out other configurations for the minimum cluster size and other hyperparameters to explore how this affects results.

    Result:

    Cluster Distribution:
    cluster
    0    101
    1     49
    Name: count, dtype: int64
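
To build some intuition for what the min_samples hyperparameter mentioned above controls, here is a simplified sketch of the core distance and mutual reachability distance that HDBSCAN builds on. This is not scikit-learn's implementation: the helper names and the toy points are invented, and counting neighbors without the point itself is an assumption, since conventions differ between implementations.

```python
import math


def core_distance(points, index, k):
    """Distance from points[index] to its k-th nearest other point."""
    distances = sorted(
        math.dist(points[index], other)
        for j, other in enumerate(points)
        if j != index
    )
    return distances[k - 1]


def mutual_reachability(points, i, j, k):
    """Largest of the two core distances and the direct distance."""
    return max(
        core_distance(points, i, k),
        core_distance(points, j, k),
        math.dist(points[i], points[j]),
    )


# Hypothetical 2-D points: a tight group of four plus one isolated point
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]

for k in (2, 3):
    inside = mutual_reachability(points, 0, 1, k)
    outside = mutual_reachability(points, 0, 4, k)
    print(f"k={k}: within group {inside:.3f}, to isolated point {outside:.3f}")
```

Raising k inflates the distances between points in sparse neighborhoods, which generally makes the clustering more conservative and more willing to label points as noise.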

    It looks like HDBSCAN detected two clusters associated with high-density regions in the data space. Would there also be noisy points that were not allocated to either of these two clusters? Let’s check:

    for cluster_id in sorted(df['cluster'].unique()):
        if cluster_id == -1:
            print("\n=== CLUSTER: NOISE / UNCLASSIFIED ===")
        else:
            print(f"\n=== CLUSTER: Discovered Topic #{cluster_id} ===")

        # Getting up to 3 sample texts from this cluster
        samples = df[df['cluster'] == cluster_id]['text'].head(3).tolist()
        for i, sample in enumerate(samples, 1):
            clean_sample = " ".join(sample.split())[:120]
            print(f"  {i}. {clean_sample}...")

    Output:

    === CLUSTER: Discovered Topic #0 ===
      1. Okay Mr. Dyer, we're properly impressed with your philosophical skills and ability to insult people. You're a wonderful ...
      2. I was at an interesting seminar at work (UK's R.A.L. Space Science Dept.) on this subject, specifically on a small-scale...
      3. This is the second post which seems to be blurring the distinction between real disease caused by Candida albicans and t...

    === CLUSTER: Discovered Topic #1 ===
      1. It's great that all these other cars can out-handle, out-corner, and out- accelerate an Integra. But, you've got to ask ...
      2. l diamond star cars (Talon/Eclipse/Laser) put out 190 hp in the turbo models, and 195 hp in the AWD turbo models, These ...
      3. Sorry for the mis-spelling, but I forgot how to spell it after my series of exams and NO-on hand reference here. Is it s...

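    All 150 sampled documents were assigned to one of the two clusters, so HDBSCAN flagged no noise points. Judging by the samples shown, cluster 1 appears to gather the car-related posts, while cluster 0 mixes space and medicine posts: with these settings, HDBSCAN seems to have separated rec.autos from the two science groups without splitting sci.space from sci.med.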
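
Because the original labels were kept aside in the true_label column, one way to check how the discovered clusters line up with them is a contingency table plus a simple purity score. The sketch below uses small made-up label lists so that it runs on its own; the `contingency` and `purity` helpers are hypothetical names, and the numbers are not results from the pipeline above.

```python
from collections import Counter, defaultdict


def contingency(true_labels, cluster_labels):
    """Count how many items of each true label fall into each cluster."""
    table = defaultdict(Counter)
    for truth, cluster in zip(true_labels, cluster_labels):
        table[cluster][truth] += 1
    return table


def purity(true_labels, cluster_labels):
    """Fraction of items that belong to the majority label of their cluster."""
    table = contingency(true_labels, cluster_labels)
    majority_total = sum(max(counts.values()) for counts in table.values())
    return majority_total / len(true_labels)


# Made-up labels for illustration only
true_labels = ["space", "space", "med", "med", "autos", "autos", "autos", "med"]
cluster_labels = [0, 0, 0, 0, 1, 1, 1, 0]

table = contingency(true_labels, cluster_labels)
for cluster_id in sorted(table):
    print(f"cluster {cluster_id}: {dict(table[cluster_id])}")
print(f"purity: {purity(true_labels, cluster_labels):.2f}")
```

On the article's DataFrame, `pd.crosstab(df['true_label'], df['cluster'])` produces the same kind of table directly.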

    For extra insight, we can show some cluster visualizations with the aid of the supplementary code provided below, which shows a scatterplot for every pairwise combination of the five existing components that describe each data point:

    import matplotlib.pyplot as plt
    import seaborn as sns
    import itertools

    # Creating a DataFrame for the 5 reduced embeddings and cluster labels
    reduced_df = pd.DataFrame(reduced_embeddings, columns=[f'UMAP_D{i+1}' for i in range(reduced_embeddings.shape[1])])
    reduced_df['cluster'] = df['cluster']

    # Getting all unique pairwise combinations of the 5 dimensions
    dim_pairs = list(itertools.combinations(reduced_df.columns[:-1], 2))

    num_plots = len(dim_pairs)
    num_cols = 3
    num_rows = (num_plots + num_cols - 1) // num_cols

    plt.figure(figsize=(num_cols * 5, num_rows * 4))

    for i, (dim1, dim2) in enumerate(dim_pairs):
        plt.subplot(num_rows, num_cols, i + 1)
        sns.scatterplot(
            x=dim1,
            y=dim2,
            hue='cluster',
            data=reduced_df,
            palette='viridis',
            s=70,
            alpha=0.7,
            legend='full'
        )
        plt.title(f'{dim1} vs {dim2}')
        plt.xlabel(dim1)
        plt.ylabel(dim2)
        plt.grid(True, linestyle='--', alpha=0.6)

    plt.tight_layout()
    plt.show()

    Result:

    [Figure: Clustering visualizations]
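
As a quick sanity check on the plotting grid, the arithmetic behind the number of subplots can be reproduced on its own. Five dimensions give ten unordered pairs, and the expression used for the row count is integer ceiling division. The column names below simply mirror the ones created in the plotting code.

```python
import itertools
import math

dims = [f"UMAP_D{i + 1}" for i in range(5)]
pairs = list(itertools.combinations(dims, 2))

num_cols = 3
# Adding (num_cols - 1) before floor division rounds up instead of down
num_rows = (len(pairs) + num_cols - 1) // num_cols

print(f"{len(pairs)} pairs -> {num_rows} rows x {num_cols} columns")
# prints "10 pairs -> 4 rows x 3 columns"
```

The grid therefore has twelve slots for ten scatterplots, which leaves the last two positions empty.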

    By trying different configurations for HDBSCAN, you may come across results in which the number of identified clusters could be different from two. Just give it a try!

    Wrapping Up

    Having built the text clustering pipeline, it is worth summarizing why combining LLM embeddings with HDBSCAN pays off. Embeddings obtained through sentence-transformers capture, to some extent, the semantic meaning and linguistic nuances of the original text. HDBSCAN, in turn, infers the number of clusters from the density structure of the data instead of requiring it up front, and it can label outlying points as noise so that they do not distort group-level statistics.


