Understanding Semantic Vector Space for SEO Professionals

What is Semantic Vector Space?

The concept of semantic vector space is important for SEOs to understand. Surprisingly, many SEOs still treat keywords on an exact-match basis. An exact-match approach takes no account of how closely related keywords are to one another, whereas a semantic approach does.

With the realization that modern search engines now look at words on a semantic level, we can see that a web page can be found to be relevant to a variety of search terms, even if they're not directly listed on the page. This explains the observation of many that successful web pages can rank in search engines for hundreds, or even thousands, of search queries.

The thing that makes all this possible is the concept of a semantic vector space. By understanding this, we understand better how search engines work and therefore how to rank web pages. Let's take a deeper dive.

How a Semantic Vector Space Works

Vector spaces can seem complex. That's because they are multidimensional, and our intuition tends to be limited to two or three dimensions, based on our three-dimensional experience.

For example, take the screen you're looking at. It shows a two-dimensional image: it goes up and down, and left and right. Easy enough. In the real world there's another dimension: you can go forwards and backwards. It's three-dimensional! We could add time as a fourth dimension, but at that point our understanding generally gets very fuzzy. How about a fifth? It could exist, but we live in a goldfish bowl of three dimensions, so it's hard for us to picture because it doesn't equate to anything we know.

Unfortunately for us, vector spaces work with a multidimensional representation. Luckily for us, the way they work is exactly the same regardless of the number of dimensions, so we can simplify.

Let's say we have a web page about cats. We can imagine a line that on one end reads "not relevant to cats" and on the other end reads "relevant to cats". Our web page sits somewhere on that line, hopefully close to the "relevant to cats" end. We could put all web pages on that line. Let's give them a score from 0 to 1. Our page might score 0.92. A more relevant page may score 0.96. A page about dogs might score 0.3, because dogs are slightly related to cats. A page about Newtonian physics might score 0.01. This is a one-dimensional encoding of how closely related a web page is to cats: a one-dimensional vector.

We can use AI/NLP to classify how closely related each web page is to cats. Now something interesting happens because we've just invented a search engine. It's not a very useful search engine because it only really covers one topic, but it's a search engine.

How does it work? When a user enters a query, we can use the same AI to give that query a number (a one-dimensional vector). If they type "cats", the query will get a number around 1.0, and the search engine should return the pages that score closest to this number. If they type "Newtonian physics", the query will come out around 0, and again the search engine should return the pages that score closest to this number. This works, if you think about it, because the search query was essentially not about cats, so it returned pages that weren't about cats. Similarly, if the user searches for "dogs", it would return pages close to the score of dogs.

When we say "closest to", we don't care if the page is more to the left or right on the line than the query. We care solely about the distance.
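
The one-line search engine described above can be sketched in a few lines of Python. This is only an illustration: the page names are hypothetical, the scores are the example numbers from the text, and the query scores are typed in by hand rather than produced by any real AI/NLP model.

```python
# Hypothetical relevance-to-cats scores, using the example numbers from the text.
PAGE_SCORES = {
    "our cat page": 0.92,
    "a more relevant cat page": 0.96,
    "a page about dogs": 0.30,
    "a page about Newtonian physics": 0.01,
}


def search_1d(query_score, page_scores, top_n=2):
    """Return the top_n page names whose score is closest to query_score."""
    # abs() ignores direction: only the distance along the line matters.
    ranked = sorted(
        page_scores,
        key=lambda name: abs(page_scores[name] - query_score),
    )
    return ranked[:top_n]


# Query scores are assumed values a classifier might assign to each query.
print(search_1d(1.0, PAGE_SCORES))  # query "cats"
print(search_1d(0.0, PAGE_SCORES))  # query "Newtonian physics"
print(search_1d(0.3, PAGE_SCORES))  # query "dogs"
```

With these numbers, the "cats" query returns the two cat pages, and the "Newtonian physics" query returns the physics page first. The dogs page is its runner-up simply because it is the next closest score on the line.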

Let's imagine another dimension! Or, in other words, give our vector a second element. Let's call this dimension "furry". We could imagine this as a second line with "not furry" on one end and "furry" on the other. If something is high on the "cats" line then it is likely to be high on the "furry" line. But at the other end it's not so clear. If something is not a cat then it could be furry, or it might not. It's probably better for us to plot this as a two-dimensional graph.

What this immediately allows us to do is cater for cats that aren't furry, like a sphynx. Previously, typing the query "sphynx" into our search engine would produce a one-dimensional score close to "cat" and return pages close to that score. Now, in our two-dimensional vector space, the query gets a score that is close to "cat" but far from "furry".

What our imaginary search engine has started to do is not just understand the words "cats" and "furry", but understand their meaning and the extent of the connection between them. We could keep adding more and more dimensions to make it more and more specific. Each vector marks a position in an imaginary space, and the distance between that position and something else mapped into the space (say, a query) indicates how relevant they are to each other: the smaller the distance, the more relevant.
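
To make the two-dimensional case concrete, here is a minimal sketch. Every coordinate below is invented for illustration, the first number being the "cats" score and the second the "furry" score. The helper names nearest and nearest_1d are hypothetical, and straight-line (Euclidean) distance is assumed as the measure of closeness.

```python
import math

# Hypothetical (cats, furry) coordinates; all values are made up for illustration.
PAGES = {
    "page about fluffy cats": (0.95, 0.90),
    "page about sphynx cats": (0.93, 0.10),
    "page about dogs": (0.30, 0.85),
    "page about Newtonian physics": (0.01, 0.02),
}


def nearest(query_vec, pages):
    """Return the page whose 2D position is closest to the query."""
    return min(pages, key=lambda name: math.dist(pages[name], query_vec))


def nearest_1d(query_cat_score, pages):
    """Same idea, but looking only at the "cats" dimension."""
    return min(pages, key=lambda name: abs(pages[name][0] - query_cat_score))


# Assumed position of the query "sphynx": very cat, not very furry.
sphynx_query = (0.95, 0.05)

print(nearest_1d(sphynx_query[0], PAGES))  # one dimension only
print(nearest(sphynx_query, PAGES))        # both dimensions
```

With one dimension the sphynx query lands on the fluffy cats page, because nothing distinguishes the two cat pages. With the second dimension added, the sphynx page becomes the closest match.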

And that's exactly what is done in practice. By adding more and more dimensions, the vector space can consider more and more concepts. In practice these dimensions won't correspond neatly to terms like "cats" and "furry" but will be calculated, probably by AI, to encode as much meaning as possible in whatever number of dimensions somebody has chosen to use. That number is often in the hundreds, sometimes thousands. The trade-off is that the more dimensions a vector has, the longer it takes an application using this technique to calculate the distance between vectors and so find their similarity.
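
The same distance calculation extends to any number of dimensions, which also shows where the extra time goes: each additional dimension adds one more term to the sum. The sketch below uses straight-line (Euclidean) distance, as in the examples above, and also cosine similarity, a measure commonly used to compare embedding vectors by the angle between them rather than the gap. Which measure any particular search engine uses is not something this sketch claims to know, and the 300-dimensional vectors are made-up placeholders.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    # One squared difference per dimension, so the work grows with len(a).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# The functions behave the same in 2 dimensions or 300.
print(euclidean_distance([0, 0], [3, 4]))          # 5.0
print(cosine_similarity([1, 0], [0, 1]))           # 0.0 (unrelated directions)

page_vec = [0.1] * 300   # placeholder 300-dimensional page vector
query_vec = [0.2] * 300  # placeholder 300-dimensional query vector
print(euclidean_distance(page_vec, query_vec))
print(cosine_similarity(page_vec, query_vec))
```

Note that the two placeholder vectors point in the same direction, so their cosine similarity is essentially 1 even though their Euclidean distance is not zero. That difference is one reason the choice of measure matters.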


How This Applies to SEO

In practice, in search engines, there does still appear to be an element of exact match queries counting for quite a lot. This seems to be the case when pages have low link equity. But when link equity is high, or as it increases, the pages start to rank for a broader array of search terms.

This basic understanding of how that comes about hopefully suggests why using a variety of connected search terms as keywords is a better tactic than always using the same keyword or phrase, and in particular better than always using that same phrase for internal links. When link equity is low, a variety of terms gives you the opportunity to rank for a wider range of queries. When link equity is high, or as it improves, a variety of terms makes your page more likely to sit at the heart of that topic cluster and appear in more searches.

Whilst it might initially seem to work, the days of picking a keyword or keyword phrase and heavily targeting that are over.