
Beyond Generative AI: The JEPA Architecture

Why predicting pixels is a waste of time

If you follow AI, you're drowning in "generative" models. GPT-4, DALL-E 3, Sora… all of these systems are built on the same generative principle: predict or reconstruct the raw data itself, whether that's the next token, the pixels of an image, or the patches of a video.

And according to AI pioneer Yann LeCun, this approach is "doomed."

He argues that this obsession with generative modeling is inefficient and, more importantly, will never lead to true artificial intelligence or "common sense." His proposed alternative, a core part of his vision for autonomous AI, is a new (or rather, revitalized) concept: the Joint Embedding Predictive Architecture (JEPA).

The Problem: Why Generative Models Lack Common Sense

LeCun's main critique of models like LLMs is that they are all-in on "System 1" thinking—fast, intuitive, and reactive, but with no deep understanding or planning capabilities ("System 2").

When an LLM "hallucinates," it's because it doesn't understand the world. It only knows that, statistically, certain words tend to follow other words. It has no underlying world model—no internal simulation of "how things work," "what causes what," or "what is physically possible."

To build a world model, LeCun argues, an AI needs to learn like a human infant. Babies don't learn by reading all of Wikipedia. They learn by observing the world, pushing objects, and building an intuitive model of physics and object permanence.

But trying to learn this by predicting every single pixel in a video is incredibly wasteful. Does a baby need to predict the exact texture of the carpet to know the ball will roll on it? No. They just need to understand the abstract concepts of "ball" and "rolling."

JEPA: Predicting in Abstract Space

This is where JEPA comes in. It's an architecture designed to create a world model by learning efficient, abstract representations.

The core idea is simple: Don't predict the future in pixel space. Predict the future in representation space.

Here's how it works:

  1. Context (Input): The model is given a piece of information, like a portion of an image or a segment of video.
  2. Target (To be Predicted): A different part of the information is masked or hidden.
  3. The JEPA Solution:
    • It uses an "encoder" to turn the context into an abstract representation (a list of numbers, or an embedding).
    • It uses another encoder (the target encoder) to turn the target into its own abstract representation.
    • A predictor network then looks at the context representation and tries to predict the target representation.

The entire system is trained to minimize the difference between the predicted representation and the actual representation.
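To make this loop concrete, here is a minimal sketch of a JEPA-style training step in PyTorch. The encoders and predictor are stand-in MLPs rather than the Vision Transformers used in practice, and the target encoder is a gradient-free copy of the context encoder updated as an exponential moving average, one common way to keep the two encoders from collapsing to a trivial constant output. Treat it as an illustration of the objective, not a faithful reimplementation of Meta's code.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 128

def make_encoder(in_dim=256, emb_dim=EMB_DIM):
    # Stand-in encoder; the real I-JEPA uses a Vision Transformer here.
    return nn.Sequential(nn.Linear(in_dim, 512), nn.GELU(), nn.Linear(512, emb_dim))

context_encoder = make_encoder()
predictor = nn.Sequential(nn.Linear(EMB_DIM, EMB_DIM), nn.GELU(), nn.Linear(EMB_DIM, EMB_DIM))

# The target encoder gets no gradients; it is updated as an exponential
# moving average (EMA) of the context encoder instead.
target_encoder = copy.deepcopy(context_encoder)
for p in target_encoder.parameters():
    p.requires_grad = False

opt = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)

def training_step(context_x, target_x, ema=0.996):
    """One JEPA step: predict the *representation* of the target, never its pixels."""
    s_context = context_encoder(context_x)      # abstract representation of the context
    with torch.no_grad():
        s_target = target_encoder(target_x)     # abstract representation of the target

    s_pred = predictor(s_context)               # guess the target's representation
    loss = F.smooth_l1_loss(s_pred, s_target)   # distance measured in embedding space

    opt.zero_grad()
    loss.backward()
    opt.step()

    # Slowly drag the target encoder toward the context encoder.
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(ema).add_(p_c, alpha=1 - ema)

    return loss.item()

# Toy batch: flattened "context" and "target" crops of the same samples.
print(training_step(torch.randn(32, 256), torch.randn(32, 256)))
```

Notice that the loss compares two embedding vectors. Nothing in this loop ever asks the network to reproduce raw pixels, which is exactly the point of the architecture.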

Why is this better?

Because the model is never asked to reconstruct pixels, it's free to discard all the "noise" and focus only on what matters.

To successfully predict the representation of a masked part of an image, the model is forced to learn high-level concepts. In the I-JEPA (Image-JEPA) model, for example, if the context is the top half of a dog and the target is the bottom half, the network doesn't need to guess the exact fur pattern. It just needs to predict "this will be a 'leg' representation" or "this will be a 'tail' representation."
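As a toy illustration of how the context and target are carved out of an image, here is a sketch of I-JEPA-style block masking over a grid of patches. The block positions, the single target block, and the random tensors standing in for patch embeddings are all simplifications of mine; the actual model samples several target blocks with randomized scales and aspect ratios and computes the embeddings with its two encoders.

```python
import torch

GRID = 14  # a 224x224 image split into 16x16 patches gives a 14x14 patch grid

def block_indices(top, left, height, width, grid=GRID):
    """Flat indices of a rectangular block of patches inside the grid."""
    rows = torch.arange(top, top + height)
    cols = torch.arange(left, left + width)
    return (rows[:, None] * grid + cols[None, :]).flatten()

context_idx = block_indices(top=0, left=0, height=7, width=14)  # top half: the visible dog
target_idx = block_indices(top=7, left=4, height=5, width=6)    # a hidden block in the bottom half

# Random stand-ins for per-patch embeddings; in I-JEPA the context embeddings
# come from the context encoder and the target embeddings from the target encoder.
patch_embeddings = torch.randn(GRID * GRID, 128)

context_repr = patch_embeddings[context_idx]  # what the predictor is allowed to see
target_repr = patch_embeddings[target_idx]    # what it must match, one vector per hidden patch

print(context_repr.shape, target_repr.shape)  # torch.Size([98, 128]) torch.Size([30, 128])
```

The prediction problem is therefore "produce 30 embedding vectors for the hidden block," not "paint thousands of missing pixels," which is what lets the model ignore fur texture and keep only the concept-level information.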

It's learning the essence of the object, not its superficial details.

*This post originally appeared on my Medium.*
