What SEOs Need to Know About Topic Modeling & Semantic Connectivity – Whiteboard Friday
Posted by randfish
Search engines, especially Google, have gotten remarkably good at understanding searchers’ intent—what we mean to search for, even if that’s not exactly what we search for. How in the world do they do this? It’s incredibly complex, but in today’s Whiteboard Friday, Rand covers the basics—what we all need to know about how entities are connected in search.
For reference, here’s a still of this week’s whiteboard!
Video Transcription
Howdy, Moz fans, and welcome to another edition of Whiteboard Friday. This week we’re talking topic modeling and semantic connectivity. Those words might sound big and confusing, but, in fact, they are important to understanding the operations of search engines, and they have some direct influence on things that we might do as SEOs, hence our need to understand them.
Now, I’m going to make a caveat here. I am not an expert in this topic. I have not taken the required math classes, stats classes, programming classes to truly understand this topic in a way that I would feel extremely comfortable explaining. However, even at the surface level of understanding, I feel like I can give some compelling information that hopefully you all and myself included can go research some more about. We’re certainly investigating a lot of topic modeling opportunities and possibilities here at Moz. We’ve done so in the past, and we’re revisiting that again for some future tools, so the topic is fresh on my mind.
So here’s the basic concept. The idea is that search engines are smarter than just knowing that a word, a phrase that someone searches for, like “Super Mario Brothers,” is only supposed to bring back results that have exactly the words “Super Mario Brothers,” that perfect phrase in the title and in the headline and in the document itself. That’s still an SEO best practice because you’re trying to serve visitors who have that search query. But search engines are actually a lot smarter than this.
One of my favorite examples is how intelligent Google has gotten around movie topics. So try, for example, searching for “That movie where the guy is called The Dude,” and you will see that Google properly returns “The Big Lebowski” in the first ranking position. How do they know that? Well, they’ve essentially connected up “movie” and “The Dude” and said, “Aha, those things are most closely related to ‘The Big Lebowski.’ That’s what the intent of the searcher is. That’s the document that we’re going to return, not a document that happens to have ‘That movie about the guy named The Dude’ in the title, exactly those words.”
Here’s another example. So this is Super Mario Brothers, and Super Mario Brothers might be connected to a lot of other terms and phrases. So a search engine might understand that Super Mario Brothers is a little bit more semantically connected to Mario than it is to Luigi, then to Nintendo and then Bowser, the jumping dragon guy, turtle with spikes on his back — I’m not sure exactly what he is — and Princess Peach.
As you go down here, the search engine might actually have a topic modeling algorithm, something like latent semantic indexing, which was an early model, or latent Dirichlet allocation, which is a somewhat later model, or even predictive latent Dirichlet allocation, which is an even later model. The specific model isn’t particularly important, especially for our purposes.
What is important is to know that there’s probably some scoring going on. A search engine, whether Google or Bing, can understand that some of these words are more connected to Super Mario Brothers than others, and it can do the reverse. It can say Super Mario Brothers is somewhat connected to video games and not at all connected to cat food. So if we find a page that happens to have the title element of Super Mario Brothers, but most of the on-page content seems to be about cat food, well, maybe we shouldn’t rank that even if it has lots of incoming links with anchor text saying “Super Mario Brothers” or a very high PageRank or domain authority or those kinds of things.
So search engines, and Google in particular, have gotten very, very smart about this connectivity stuff and this topic modeling post-Hummingbird. Hummingbird, of course, being the algorithm update from last fall that changed a lot of how they can interpret words and phrases.
So knowing that Google and Bing can calculate this relative connectivity, connectivity between the words and phrases and topics, we want to know how are they doing this. That answer is actually extremely broad. So that could come from co-occurrence in web documents. Sorry for turning my back on the camera. I know I’m supposed to move like this, but I just had to do a little twirl for you.
Distance between the keywords. I mean distance on the actual page itself. Does Google find “Super Mario Brothers” near the word “Mario” on a lot of the documents where the two occur, or are they relatively far away? Maybe Super Mario Brothers does appear with cat food a lot, but they’re quite far away. They might look at citations and links between documents in terms of, boy, there are a lot of pages on the web that, when they talk about Super Mario Brothers, also link to pages about Mario, Luigi, Nintendo, etc.
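To make the co-occurrence and distance ideas concrete, here is a minimal sketch in Python. It is not how Google or Bing actually compute this. The function names, the three sample documents, and the choice of "smallest gap in words" as the distance measure are all assumptions made up for illustration.

```python
import re
from itertools import product

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def positions(tokens, phrase):
    """Return every index where the phrase (a list of tokens) starts."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def cooccurrence_and_distance(docs, term_a, term_b):
    """Count documents containing both terms and average their closest gap.

    The gap is measured in words between the starting positions of the two
    terms, using the closest pair of occurrences in each document.
    """
    a, b = tokenize(term_a), tokenize(term_b)
    closest_gaps = []
    for doc in docs:
        tokens = tokenize(doc)
        pos_a, pos_b = positions(tokens, a), positions(tokens, b)
        if pos_a and pos_b:
            closest_gaps.append(min(abs(i - j) for i, j in product(pos_a, pos_b)))
    docs_with_both = len(closest_gaps)
    avg_gap = sum(closest_gaps) / docs_with_both if docs_with_both else None
    return docs_with_both, avg_gap

# Hypothetical documents, invented for this example.
docs = [
    "Super Mario Brothers stars Mario and Luigi in a Nintendo classic.",
    "Luigi joins the cast of Super Mario Brothers again.",
    "Super Mario Brothers shirts are on sale. Also in stock this week: "
    "glasses, mugs, posters, and cat food.",
]

print(cooccurrence_and_distance(docs, "Super Mario Brothers", "Luigi"))
print(cooccurrence_and_distance(docs, "Super Mario Brothers", "cat food"))
```

In this toy set, "Luigi" shows up alongside the phrase in more documents and much closer to it than "cat food" does, which is the kind of signal described above. A real system would work over a far larger corpus and would likely weight these signals rather than just count them.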
They can look at the anchor text connections of those links. They could look at co-occurrence of those words biased by a given corpus, a set of corpora, or from certain domains. So they might say, “Hey, we only want to pay attention to what’s on the fresh web right now or in the blogosphere or on news sites or on trusted domains, these kinds of things as opposed to looking at all of the documents on the web.” They might choose to do this in multiple different sets of corpora.
They can look at queries from searchers, which is a really powerful thing that we unfortunately don’t have access to. So they might see searcher behavior saying that a lot of people who search for Mario, Luigi, Nintendo are also searching for Super Mario Brothers.
They might look at searcher clicks, visits, history, all of that browser data that they’ve got from Chrome and from Android and, of course, from Google itself, and they might say those are corpora that they use to connect up words and phrases.
Probably there’s a whole list of other places that they’re getting this from. So they can build a very robust data set to connect words and phrases. For us, as SEOs, this means a few things.
If you’re targeting a keyword for rankings, say “Super Mario Brothers,” those semantically connected and related terms and phrases can help with a number of things. So if you could know that these were the right words and phrases that search engines connected to Super Mario Brothers, you can do all sorts of stuff. Things like inclusion on the page itself, helping to tell the search engine my page is more relevant for Super Mario Brothers because I include words like Mario, Luigi, Princess Peach, Bowser, Nintendo, etc. as opposed to things like cat food, dog food, T-shirts, glasses, what have you.
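As a rough illustration of the on-page inclusion idea, the sketch below checks which terms from a related-terms list actually appear on a page. The term list and the page text are hypothetical, and the simple "share of terms found" number is an assumption for illustration, not a score any search engine is known to use.

```python
import re

def normalize(text):
    """Lowercase the text and collapse it to space-separated word tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def term_coverage(page_text, related_terms):
    """Return (found terms, missing terms, share of terms found on the page)."""
    # Pad with spaces so a term only matches on whole-word boundaries.
    page = f" {normalize(page_text)} "
    found = [t for t in related_terms if f" {normalize(t)} " in page]
    missing = [t for t in related_terms if t not in found]
    share = len(found) / len(related_terms) if related_terms else 0.0
    return found, missing, share

# Hypothetical related terms and page copy, invented for this example.
related_terms = ["Mario", "Luigi", "Princess Peach", "Bowser", "Nintendo"]
page_text = (
    "Super Mario Brothers is a Nintendo game where Mario and Luigi "
    "race to rescue Princess Peach."
)

found, missing, share = term_coverage(page_text, related_terms)
print("Found:", found)
print("Missing:", missing)
print("Share of related terms on page:", share)
```

The missing list is the useful part in practice: it points at related terms you might consider working into the copy if they genuinely fit the page.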
You can think about it in the links that you earn, the documents that are linking to you and whether they contain those words and phrases and are on those topics, the anchor text that points to you potentially. You can certainly be thinking about this from a naming convention and branding standpoint. So if you’re going to call a product something or call a page something or your unique version of it, you might think about including more of these words or biasing to have those words in the description of the product itself, the formal product description.
For an About page, you might think about the formal bio for a person or a company, including those kinds of words, so that as you’re getting cited around the web or on your book cover jacket or in the presentation that you give at a conference, those words are included. They don’t necessarily have to be links. It’s a potentially powerful thing when a lot of people who mention Super Mario Brothers tend to point to a page like Nintendo8.com, where I think you can actually play the original “Super Mario Brothers” live on the web. It’s kind of fun. Sorry to waste your afternoon with that.
Of course, these can also be additional keywords that you might consider targeting. This can be part of your keyword research in addition to your on-page and link building optimization.
What’s unfortunate is right now there are not a lot of tools out there to help you with this process. There is a tool from Virante. Russ Jones, I think did some funding internally to put this together, and it’s quite cool. It’s nTopic.org. Hopefully, this Whiteboard Friday won’t bring that tool to its knees by sending tons of traffic over there. But if it does, maybe give it a few days and come back. It gives you a broad score with a little more data if you register and log in. It’s got a plugin for Chrome and for WordPress. It’s fairly simplistic right now, but it might help you say, “Is this page on the topic of the term or phrase that I’m targeting?”
There are many, many downloadable tools and libraries. In fact, Code.google.com has an LDA topic modeling tool specifically, and that might have been something that Google used back in the day. We don’t know.
If you do a search for topic modeling tools, you can find these. Unfortunately, almost all of them are going to require some web development background at the very least. Many of them rely on a Python library or an API. Almost all of them also require a training corpus in order to model things on. So you can think about, “Well, maybe I can download Wikipedia’s content and use that as a training model or use the top 10 search results from Google as some sort of training model.”
This is tough stuff. This is one of the reasons why at Moz I’m particularly passionate about trying to make this something that we can help with in our on-page optimization and keyword difficulty tools, because I think this can be very powerful stuff.
What is true is that you can spot check this yourself right now. It is very possible to go look at things like related searches, look at the keyword terms and phrases that also appear on the pages that are ranking in the top 10 and extract these things out and use your own mental intelligence to say, “Are these terms and phrases relevant? Should they be included? Are these things that people would be looking for? Are they topically relevant?” Consider including them and using them for all of these things. Hopefully, over time, we’ll get more sophisticated in the SEO world with tools that can help with this.
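The spot check described above can be partly automated. Here is a minimal sketch that takes the text of several top-ranking pages and lists the words that appear on more than one of them. The page snippets, the stopword list, and the `min_pages` threshold are assumptions made up for illustration. Fetching the pages is left out, and you would still apply your own judgment to the output.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "as", "her", "in", "is", "must", "of",
             "the", "this", "to", "with"}

def shared_terms(pages, min_pages=2):
    """Return (word, page_count) pairs for words found on at least min_pages pages."""
    page_counts = Counter()
    for text in pages:
        # Use a set so each word counts once per page, however often it repeats.
        words = set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS
        page_counts.update(words)
    shared = [(w, n) for w, n in page_counts.items() if n >= min_pages]
    # Most widely shared words first, then alphabetical for stable output.
    return sorted(shared, key=lambda item: (-item[1], item[0]))

# Hypothetical text pasted from three ranking pages, invented for this example.
pages = [
    "Mario and Luigi explore the Mushroom Kingdom in this Nintendo classic.",
    "Nintendo released the game with Mario as the hero and Bowser as the villain.",
    "Bowser kidnaps Princess Peach, and Mario must rescue her.",
]

for word, count in shared_terms(pages):
    print(f"{word}: appears on {count} of {len(pages)} pages")
```

Counting pages rather than raw occurrences keeps one keyword-stuffed page from dominating the list. This only handles single words, so multi-word phrases like "Princess Peach" would need an extension.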
All right, everyone, hope you’ve enjoyed this edition of Whiteboard Friday. Look forward to some great comments, and we’ll see you again next week. Take care.
Video transcription by Speechpad.com