LLMs contain a LOT of parameters. But what’s a parameter?

When a model is trained, each word in its vocabulary is assigned a numerical value that captures the meaning of that word in relation to all the other words, based on how the word appears in countless examples across the model’s training data.

Each word gets replaced by a kind of code?

Yeah. But there’s a bit more to it. The numerical value—the embedding—that represents each word is in fact a list of numbers, with each number in the list representing a different facet of meaning that the model has extracted from its training data. The length of this list of numbers is another thing that LLM designers can specify before an LLM is trained. A common size is 4,096.

Every word inside an LLM is represented by a list of 4,096 numbers?

Yup, that’s an embedding. Each of those numbers is a parameter: a value that gets tweaked during training. An LLM with embeddings that are 4,096 numbers long is said to have 4,096 dimensions.
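To make the idea concrete, here is a minimal sketch of an embedding table as a lookup from words to lists of numbers. Everything in it is an assumption made for illustration: the four-word vocabulary, the tiny embedding size of 8, the random starting values, and the 50,000-word vocabulary used in the final tally. None of it comes from a real model.

```python
import random

# Hypothetical toy sizes, chosen only so the example runs instantly.
VOCAB = ["table", "chair", "astronaut", "moon"]
EMBEDDING_DIM = 8  # the article's example size is 4,096

random.seed(0)

# One list of numbers per word. Here they are random placeholders,
# standing in for values that training would go on to adjust.
embedding_table = {
    word: [random.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for word in VOCAB
}


def count_embedding_values(vocab_size: int, dim: int) -> int:
    """Every number in every word's list is one trainable value."""
    return vocab_size * dim


toy_count = count_embedding_values(len(VOCAB), EMBEDDING_DIM)

# Assumed vocabulary of 50,000 words, each with a 4,096-number embedding.
large_count = count_embedding_values(50_000, 4096)

print(toy_count)    # 32
print(large_count)  # 204800000
```

Even under these made-up sizes, the embeddings alone account for roughly 200 million adjustable numbers, which gives a sense of how quickly the totals grow.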

Why 4,096?

It might look like a strange number. But LLMs (like anything that runs on a computer chip) work best with powers of two—2, 4, 8, 16, 32, 64, and so on. LLM engineers have found that 4,096 is a power of two that hits a sweet spot between capability and efficiency. Models with fewer dimensions are less capable; models with more dimensions are too expensive or slow to train and run.
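For anyone who wants to check the arithmetic, a quick sketch confirms that 4,096 is a power of two, specifically 2 to the 12th. The helper name `is_power_of_two` is invented for this example.

```python
def is_power_of_two(n: int) -> bool:
    """A positive integer is a power of two if it has exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


# The exponent is one less than the number of bits needed to write 4,096.
exponent = (4096).bit_length() - 1

print(is_power_of_two(4096))  # True
print(exponent)               # 12
print(2 ** exponent)          # 4096
print(is_power_of_two(4000))  # False
```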

Using more numbers allows the LLM to capture very fine-grained information about how a word is used in many different contexts, what subtle connotations it might have, how it relates to other words, and so on.

Back in February, OpenAI released GPT-4.5, the firm’s largest LLM yet (some estimates have put its parameter count at more than 10 trillion). Nick Ryder, a research scientist at OpenAI who worked on the model, told me at the time that bigger models can work with extra information, like emotional cues, such as when a speaker’s words signal hostility: “All of these subtle patterns that come through a human conversation—those are the bits that these larger and larger models will pick up on.”

The upshot is that all the words inside an LLM get encoded into a high-dimensional space. Picture thousands of words floating in the air around you. Words that are closer together have similar meanings. For example, “table” and “chair” will be closer to each other than they are to “astronaut,” which is close to “moon” and “Musk.” Way off in the distance you can see “prestidigitation.” It’s a little like that, but instead of being related to each other across three dimensions, the words inside an LLM are related across 4,096 dimensions.
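One common way to put a number on "closer together" is cosine similarity, which compares the directions of two lists of numbers. The sketch below uses hand-made three-number embeddings that are purely hypothetical, picked so that the furniture words point one way and the space words another. Real embeddings are learned, not written by hand, and have thousands of dimensions.

```python
import math

# Hypothetical, hand-written toy embeddings (3 numbers each, not 4,096).
toy_embeddings = {
    "table":     [0.9, 0.1, 0.0],
    "chair":     [0.8, 0.2, 0.1],
    "astronaut": [0.1, 0.9, 0.3],
    "moon":      [0.0, 0.8, 0.5],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


table_chair = cosine_similarity(toy_embeddings["table"], toy_embeddings["chair"])
table_astronaut = cosine_similarity(toy_embeddings["table"], toy_embeddings["astronaut"])
astronaut_moon = cosine_similarity(toy_embeddings["astronaut"], toy_embeddings["moon"])

print(f"table vs chair:     {table_chair:.2f}")
print(f"table vs astronaut: {table_astronaut:.2f}")
print(f"astronaut vs moon:  {astronaut_moon:.2f}")
```

With these toy values, "table" scores far higher against "chair" than against "astronaut," and "astronaut" sits close to "moon." The same calculation works unchanged on lists of 4,096 numbers.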

Yikes.

It’s dizzying stuff. In effect, an LLM compresses the entire internet into a single monumental mathematical structure that encodes an unfathomable amount of interconnected information. It’s both why LLMs can do astonishing things and why they’re impossible to fully understand.
