7 Illustrations of How Topical Links Impact SEO, in Theory and Practice

Posted by Cyrus-Shepard

The Internet lives on links.

Marketers have long understood the importance of links to SEO. So much so, it’s something we study regularly here at Moz. At their most basic, links are counted as “votes” of popularity for search engines to rank websites. Beyond this, search engineers have long worked to extract a large number of signals from the simple link, including:
  • Trustworthiness – Links from trusted sites may count as an endorsement
  • Spamminess – Links from known spam sites may count against you
  • Link Manipulation – Looking at signals such as over-optimization and link velocity, search engines may be able to tell when webmasters are trying to “game” the system

One of the most important signals engineers have worked to extract from links is topical relevance. This allows search engines to answer questions such as “What is this website about?” by examining incoming links.

Exactly how search engines use links to measure and weigh topical relevance is subject to debate. Rand has addressed it eloquently here, and again here. Over the years, several US patent filings from Google engineers demonstrate exactly how this process may work.

It’s hugely important to look at these concepts to better understand how incoming links may influence a website’s ability to rank.

This is the “theory” part of SEO. As always, a huge thanks to Bill Slawski and his blog SEO by the Sea, which acted as a starting point of research for many of these concepts. Let’s dive in.

1. Hub and authority pages

In the beginning, there was the Hilltop algorithm.

In the early days of Google, not long after Larry Page figured out how to rank pages based on popularity, the Hilltop algorithm worked out how to rank pages on authority. It accomplished this by looking for “expert” pages linking to them.

An expert page is a document that links to many other topically relevant pages. If a page is linked to from several expert pages, then it is considered an authority on that topic, and may rank higher.

A similar concept using “hub” and “authority” pages was put forth by Jon Kleinberg, a Cornell professor with grants from Google and other search engines. Kleinberg explains:
“…a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs.”
Authoritative Sources in a Hyperlinked Environment (PDF)
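
To make the hub/authority idea concrete, here's a minimal sketch of the iterative calculation Kleinberg describes. It's illustrative only: real implementations run on a query-focused subgraph and handle scale and convergence far more carefully.

```python
# A minimal sketch of the hub/authority iteration (HITS).

def hits(graph, iterations=20):
    """graph: dict mapping each page to the set of pages it links to."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # A good authority is pointed to by many good hubs.
        auth = {p: sum(hub[src] for src, targets in graph.items() if p in targets)
                for p in pages}
        # A good hub points to many good authorities.
        hub = {p: sum(auth[t] for t in graph.get(p, ())) for p in pages}

        # Normalize so scores stay comparable between iterations.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm

    return hub, auth
```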

These were elegant solutions that produced superior search results. While we can’t know the degree to which these concepts are used today, Google acquired the Hilltop algorithm in 2003.

2. Anchor text

Links contain a ton of information. For example, if you link out using the anchor phrase “hipster pizza,” there’s a great chance the page you’re linking to is about pizza (and maybe hipsters).

That’s the idea behind several Google PageRank patents. Earning links with the right anchor text can help your page to rank for similar phrases.

This also explains why you should use descriptive anchor text when linking, as opposed to generic “click here” type links.
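
As a toy illustration (nothing like Google's actual systems), you could tally the anchor text of a page's incoming links to guess what it's about:

```python
# Tally incoming anchor text for a target page; hypothetical data.
from collections import Counter

incoming_links = [
    {"target": "/menu", "anchor": "hipster pizza"},
    {"target": "/menu", "anchor": "best pizza in town"},
    {"target": "/menu", "anchor": "click here"},  # generic anchors say very little
]

def anchor_profile(links, target):
    words = Counter()
    for link in links:
        if link["target"] == target:
            words.update(link["anchor"].lower().split())
    return words.most_common()

print(anchor_profile(incoming_links, "/menu"))
# "pizza" appears most often, so the page is probably about pizza.
```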

Beyond the anchor text, other signals from the linking page — including the title and text surrounding the link — could provide contextual clues as to what the target page is about. While the importance of anchor text has long been established in SEO, the influence of these other elements is harder to prove.

3. Topic-sensitive PageRank

Despite rumors to the contrary, PageRank is very much alive (though Toolbar PageRank is dead).

PageRank technology can be used to distribute all kinds of different ranking signals throughout a search index. While the most common examples are popularity and trust, another signal is topical relevance, as laid out in this paper by Taher Haveliwala, who went on to become a Google software engineer.

The concept works by grouping “seed pages” by topic (for example, the Politics section of the New York Times). Every link out from these pages passes on a small amount of topic-sensitive PageRank, which is passed on through the next set of links, and so on.

When a user enters a search, those pages with the highest topic-sensitive PageRank (associated with the topic of the search) are considered more relevant and may rank higher.
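
Here's a rough sketch of the idea (glossing over dangling pages and many details of Haveliwala's paper): the random-jump portion of PageRank is restricted to the topic's seed pages, so pages well linked from that topical neighborhood accumulate more topic-sensitive PageRank.

```python
# Simplified topic-sensitive PageRank sketch; illustrative only.

def topic_pagerank(graph, seed_pages, damping=0.85, iterations=50):
    """graph: dict page -> list of linked pages; seed_pages: non-empty topic seeds."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    seeds = set(seed_pages) & pages
    rank = {p: 1.0 / len(pages) for p in pages}

    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for page, targets in graph.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
        # The teleport mass lands only on the topic's seed pages.
        for seed in seeds:
            new_rank[seed] += (1.0 - damping) / len(seeds)
        rank = new_rank

    return rank
```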

4. Reasonable surfer

All links are not created equal.

The idea behind Google’s Reasonable Surfer patent is that certain links on a page are more important than others, and are thus assigned increased weight. Examples of more important links include:

  • Prominent links, higher up in the HTML
  • Topically relevant links, related to both the source document and the target document.

Conversely, less important links include:

  • “Terms of Service” and footer links
  • Banner ads
  • Links unrelated to the document

    Because the important links are more likely to be clicked by a “reasonable surfer,” a topically relevant link can carry more weight than an off-topic one.

    “…when a topical cluster associated with the source document is related to a topical cluster associated with the target document, the link has a higher probability of being selected than when the topical cluster associated with the source document is unrelated to the topical cluster associated with the target document.”
    United States Patent: 7716225
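
Purely as an illustration of the concept (the feature names and multipliers below are invented, not taken from the patent), link weighting might look something like this. The point is that likelihood of a click, not mere presence, drives how much weight a link passes.

```python
# Invented features and weights; a sketch of the "reasonable surfer" idea.

def link_weight(link):
    weight = 1.0
    if link.get("position") == "main_content":   # prominent, higher up in the HTML
        weight *= 2.0
    elif link.get("position") == "footer":        # terms-of-service / footer links
        weight *= 0.2
    if link.get("is_banner_ad"):
        weight *= 0.1
    if link.get("topically_related"):             # source and target share a topic cluster
        weight *= 1.5
    return weight

print(link_weight({"position": "main_content", "topically_related": True}))  # 3.0
print(link_weight({"position": "footer"}))                                   # 0.2
```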

    5. Phrase-based indexing

    Not going to lie. Phrase-based indexing can be a tough concept to wrap your head around.

    What’s important to understand is that phrase-based indexing allows search engines to score the relevancy of any link by looking for related phrases in both the source and target pages. The more related phrases, the higher the score.

    In addition to ranking documents based on the most relevant links, phrase-based indexing allows search engines to do cool things with less relevant links, including:

    1. Discounting spam and off-topic links: For example, an injected spam link to a gambling site from a page about cookie recipes will earn a very low outlink score based on relevancy, and would carry less weight.
    2. Fighting “Google Bombing”: For those that remember, Google bombing is the art of ranking a page highly for funny or politically-motivated phrases by “bombing” it with anchor text links, often unrelated to the page itself. Phrase-based indexing can stop Google bombing by scoring the links for relevance against the actual text on the page. This way, irrelevant links can be discounted.
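
Here's a deliberately simplified sketch of that scoring idea. The related-phrase list is invented, and real systems mine co-occurring phrases at a very different scale, but it shows how an off-topic link ends up with a low relevance score.

```python
# Count related phrases shared by the linking page and the target page.

RELATED_PHRASES = {"chocolate chip", "cookie recipe", "baking time", "brown sugar"}

def link_relevance(source_text, target_text, phrases=RELATED_PHRASES):
    source = source_text.lower()
    target = target_text.lower()
    shared = [p for p in phrases if p in source and p in target]
    return len(shared)  # more shared related phrases -> higher link score

# An injected gambling link on a cookie-recipe page shares no related phrases
# with its target, so it scores 0 and would carry little weight.
```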

    6. Local inter-connectivity

    Local inter-connectivity refers to a reranking concept that reorders search results based on measuring how often each page is linked to by all the other pages.

To put it simply, when a page is linked to from a number of high-ranking results, it is likely more relevant than a page with fewer links from the same set of results.
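
A minimal sketch of that reranking step (illustrative only):

```python
# Reorder top results by how many of the other top results link to each page.

def rerank(top_results, outlinks):
    """top_results: ordered list of URLs; outlinks: dict url -> set of URLs it links to."""
    def votes(url):
        return sum(1 for other in top_results
                   if other != url and url in outlinks.get(other, set()))
    # Stable sort: pages with equal votes keep their original order.
    return sorted(top_results, key=votes, reverse=True)
```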

    This also provides a strong hint as to the types of links you should be seeking: pages that already rank highly for your target term.

    7. The Golden Question

If the above concepts seem complex, the good news is that you don’t actually have to understand them in detail when trying to build links to your site.

    To understand if a link is topically relevant to your site, simply ask yourself the golden question of link building: Will this link bring engaged, highly qualified visitors to my website?

The answer to the golden question is exactly what Google engineers are trying to determine when evaluating links, so you can arrive at a good end result without understanding the actual algorithms.

    About those links between sites you control…

One important thing to know is this: in nearly all of these Google patents and papers, every effort is made to count only “unbiased” links from unassociated sites, and to discount links between sites and pages related to one another through preexisting relationships.

    This means that both internal links and links between sites you own or control will be less valuable, while links from non-associated sites will carry far more weight.

    Researching the impact of topical links

While it’s difficult to measure the direct effect these principles exert on Google’s search results (or even whether Google uses them at all), we are able to correlate certain linking characteristics with higher rankings, especially around topical anchor text.

    Below is a sample of results from our Search Engine Ranking Factors study that shows link features positively associated with higher Google rankings. Remember the usual caveat that correlation is not causation, but it sure is a hint.

It’s interesting to note that while both partial and exact match anchor text links correlate with higher rankings, they are both trumped by the overall number of unique websites linking to a page. This supports the notion that it’s best to have a wide variety of link types, including topically relevant links, as part of a healthy backlink profile.
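
If you want to run this kind of analysis against your own keyword set, a bare-bones version might look like the sketch below. The sample data is made up, this is not the study's actual code, and it assumes you already have rankings and link counts plus SciPy installed.

```python
# Mean Spearman correlation between a link metric and ranking position
# across a set of SERPs; illustrative data only.
from scipy.stats import spearmanr

serps = [
    # each SERP: (ranking position, linking root domains) for the top results
    [(1, 320), (2, 410), (3, 95), (4, 60), (5, 12)],
    [(1, 150), (2, 88), (3, 140), (4, 9), (5, 4)],
]

correlations = []
for serp in serps:
    positions = [pos for pos, _ in serp]
    link_counts = [links for _, links in serp]
    rho, _p_value = spearmanr(positions, link_counts)
    # Negate so that "more links at better (lower-numbered) positions"
    # shows up as a positive correlation.
    correlations.append(-rho)

print(sum(correlations) / len(correlations))
```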

    Practical tips for topically relevant links

    Consider this advice when thinking about links for SEO:

    1. DO use good, descriptive anchor text for your links. This applies to internal links, outlinks to other sites, and links you seek from non-biased external sites.
    2. AVOID generic or non-descriptive anchor text.
    3. DO seek relationships from authoritative, topically relevant sites. These include sites that rank well for your target keyword, and “expert” pages that link to many authority sites. (For those interested, Majestic has done some interesting work around Topical Trust Flow.)
    4. AVOID over-optimizing your links. This includes repetitive use of exact match anchor text and keyword stuffing.
    5. DO seek links from relevant pages. This includes examining the title, body, related phrases, and intent of the page to ensure its relevancy to your target topic.
    6. DO seek links that people are more likely to click. The ideal link is often both topically relevant and placed in a prominent position.
    7. AVOID manipulative link building. Marie Haynes has written an excellent explanation of the kinds of unnatural links that you likely want to avoid at all cost.

    Finally, DO try to earn and attract links to your site with high quality, topically relevant content.

    What are your best tips around topically relevant links? Let us know in the comments below!


    Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

    Continue reading →

    Announcing Moz’s New Beginner’s Guide to Content Marketing

    Posted by Trevor-Klein

    I’m thrilled to announce the next in Moz’s series of beginner’s guides:
    The Beginner’s Guide to Content Marketing.

Content marketing is a field full of challenges. Creating content that provides great value to your audience (what we’ve come to call 10x content) is difficult enough, but content marketers also regularly encounter skeptical employers and clients, diminutive budgets, and (you guessed it) a noted lack of time to get it all done. You’re not alone. You’re fighting the good fight, and we’re here to back you up. So is Carl.

    Meet Carl, the Content Cat. He’ll show up in every chapter of the guide for a little levity and to remind you that you’re in good company.

    There’s no denying the importance of content marketing. In its annual study of more than 5,000 marketers, the Content Marketing Institute showed that about 70% of all marketers, B2B and B2C, are creating more content than they did one year ago. Nearly half of B2C marketers have a dedicated content marketing group in their organizations. While this guide is written primarily for those who are relatively new to content marketing, we’d certainly recommend that more advanced marketers take a look through, as we often find veteran teams are missing some key fundamentals.

    Say no more; show me the guide!

    What you’ll learn

    The guide has nine chapters, and we’ve organized them in the order we think folks should think about them when they’re approaching content marketing. Start with planning and goals, move through ideation and execution, then wrap up with analysis and revisions to the process.

    1. What is content marketing? Is it right for my business?

Before we dive too deep into strategy and tactics, there’s something we need to clear up: What in the world is content marketing, anyway? Look it up in 10 different places, and you’ll get 10 different answers to that question. In this chapter, we break it down and offer a look into whether or not it’s a worthwhile investment of your time (spoiler alert: It is).


    2. Content strategy

    Arguably the most important part of any content marketing effort, your content strategy is what keeps you aligned with your company’s goals, ensuring you’re putting your time and effort into areas that will help move needles and earn you the recognition you deserve. There’s more to it than meets the eye, though, and this chapter paints a holistic picture to get you started.


    3. Content and the marketing funnel

    Most folks who are new to content marketing assume that it belongs right at the top of the marketing funnel. We’d like to bust it out of that pigeonhole. The truth is that content belongs at every stage of the funnel, from brand awareness and early acquisition to retention of loyal customers. This chapter shows you which kinds of content typically work well for each major phase of the funnel.


    4. Building a framework and a content team

    There are some things you’ll need to figure out before you even start coming up with ideas for your content. What tools will you use to create it? What processes and standards will you put in place? Who will you work with, and how can you get them aligned with your goals? Setting the framework for your future success will save you from major headaches, and this chapter aims to make sure there’s nothing you’re overlooking.


    5. Content ideation

We’ve all had it happen. We need to write something (be it a blog post, a whitepaper, even an email) and when we sit down to make it happen, nothing. No ideas come to mind. Coming up with ideas for content that really resonates is deceptively difficult, but there are many tricks that’ll help get the proverbial gears turning. We’ll go through those in this chapter.


    6. Content creation

    After all that planning, it’s finally time to dive in and do the hard work of actually creating your content. From getting the formatting right and working with design/UX teams to the most important cliche you can remember — to focus on quality, not quantity — this chapter will help you make the most effective use of your time.


    7. Content promotion

    You’ve done it. You’ve put together a wonderful piece of 10x content, and can’t wait to see the response. Only one thing stands in your way: Getting it in front of the right people. From working with industry influencers to syndication and social promotion, there are a great many ways to connect your content with your audiences; it’s just a matter of choosing the right ones. This chapter aims to point you down the right path.


    8. Analysis and reporting

    Nobody (seriously, nobody) is able to perfectly target their audiences. We make assumptions based on what we know (and can surmise) about the things readers will find valuable. The only way we can get better is by taking a look at how our past content performed. That’s easier said than done, though, and data can often be misleading. This chapter shows you the basics of measurement and reporting so you can get an accurate picture of how things are going.


    9. Iteration, maintenance, and growth

    Like all aspects of marketing, content should be iterative. You should take a close look at how your past work resonated with your audience, learn from what went right (and what went wrong), and revise your approach next time around. It also pays to revisit your processes from time to time; as your organization and your audience grow, the tactics that served you well at the beginning could well be holding you back now. This chapter explores how you can scale your content efforts without sacrificing the quality you’ve worked so hard to instill.


    What are we waiting for? Let’s get started!

    Thanks

    The biggest thanks and the majority of the credit for this guide go to Isla McKetta. She was an immense help with early planning, and wrote the lion’s share of the guide. Derric Wise led the UX efforts, illustrating much of the guide and bringing Carl the Content Cat to life. Huge thanks also go to both Kevin Engle and Abe Schmidt for their fantastic illustrations. Thanks as well to Lisa Wildwood for her keen editing eyes, and to Ronell Smith and Christy Correll for their additional reviews. This guide never would have happened without all of you. =)


    Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

    Continue reading →

    How to Prioritize SEO Tasks & Invest in High-Value Work Items – Whiteboard Friday

    Posted by randfish

    One thing we can all agree on: there’s a lot to think about when it comes to your SEO tasks. Even for the most organized among us, it can be really difficult to prioritize our to-dos and make sure we’re getting the highest return on them. In this week’s Whiteboard Friday, Rand tackles the question that’s a constant subtext in every SEO’s mind.

    Click on the whiteboard image above to open a high resolution version in a new tab!

    Video Transcription

    Howdy, Moz fans, and welcome to another edition of Whiteboard Friday. This week we’re chatting about how to prioritize SEO tasks and specifically get the biggest bang for the buck that we possibly can.

    I know that all of you have to deal with this, whether you are a consultant or at an agency and you’re working with a client and you’re trying to prioritize their SEO tasks in an audit or a set of recommendations that you’ve got, or you’re working on an ongoing basis in-house or as a consultant and you’re trying to tell a team or a boss or manager, “Hey these are all the SEO things that we could potentially do. Which ones should we do first? Which ones are going to get in this sprint, this quarter, or this cycle?” — whatever the cadence is that you’re using.

    I wanted to give you some great ways that we here at Moz have done this and some of the things that I’ve seen from both very small companies, startups, all the way up to large enterprises.

    SEO tasks

Look, the list of SEO tasks can be fairly enormous. It could be all sorts of things: rewrite our titles and descriptions, add rich snippets categories, create new user profile pages, rewrite the remaining dynamic URLs that we haven’t taken care of yet, or add some of the recommended internal links to the blog posts, or do outreach to some influencers that we know in this new space we’re getting into. You might have a huge list of these things that are potential SEO items. I actually urge you to make this list internally for yourself, either as a consulting team or an in-house team, as big as you possibly can.

    I think it’s great to involve decision makers in this process. You reach out to a manager or the rest of your team or your client, whoever it is, and get all of their ideas as well, because you don’t want to walk into these prioritization meetings and then have them go, “Great, those are your priorities. But what about all these things that are my ideas?” You want to capture as many of these as you can. Then you go through a validation process. That’s really the focus of today.

    Prioritization questions to ask yourself

    The prioritization questions that I think all of us need to be asking ourselves before we decide which order tasks will go in and which ones we’re going to focus on are:

    What company goals does this task serve or map to?

    Look, if your company or the organization you’re working with doesn’t actually have big initiatives for the year or the quarter, that’s a whole other matter. I recommend that you make sure your organization gets on top of that or that you as a consultant, if you are a consultant, get a list of what those big goals are.

    Those big things might be, hey, we’re trying to increase revenue from this particular product line, or we’re trying to drive more qualified users to sign up for this feature, or we’re trying to grow traffic to this specific section. Big company goals. It might even be weird things or non-marketing things, like we’re trying to recruit this quarter. It’s really important for us to focus on recruitment. So you might have an SEO task that maps to how do we get more people who are job seekers to our jobs pages, or how do we get our jobs listings more prominent in search results for relevant keywords — that kind of thing. They can map to all sorts of goals across a company.

    What’s an estimated 30, 60, 90, and 1 year value?

    Then, once we have those, we want to ask for an estimated range — this is very important — of value that the task will provide over the next X period of time. I like doing this in terms of several time periods. I don’t like to say we’re only going to estimate what the six month value is. I like to say, “What’s an estimated 30, 60, 90, and 1 year value?”

    You don’t have to be that specific. You could say we’re only going to do this for a month and then for the next year. For each of those time periods here, you’d go here’s our low estimate, our mid estimate, and our high estimate of how this is going to impact traffic or conversion rate or whatever the goal is that you’re mapping to up here.

    Which teams/people are needed to accomplish this work, and what is their estimate of time needed?

    Next, we want to ask which teams or people are needed to accomplish this work and what is their estimate of time needed. Important: what is their estimate, not what’s your estimate. I, as an SEO, think that it’s very, very simple to make small changes to a CMS to allow me to edit a rel=canonical tag. My web dev team tells me differently. I want their opinion. That’s what I want to represent in any sort of planning process.

    If you’re working outside a company as a consultant or at an agency, you need to go validate with their web dev team, with their engineering team, what it’s going to take to make these changes. If you are a contractor and they work with a web dev contractor, you need to talk to that contractor about what it’s going to take.

You never want to present estimates that haven’t been validated by the right team. I might, for example, say there’s a big SEO change that we want to make here at Moz. I might need some help from UX folks, some help from content, some help from the SEOs themselves, and one dev for two weeks. I want all of these different things represented completely in the planning process.
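
Putting the value and effort questions together, one lightweight way to capture the estimates is a simple structure like the hypothetical sketch below (not a Moz template; every number and task name is invented). Dividing the expected value by the teams' own effort estimate gives a rough priority score you can sort on.

```python
# Hypothetical prioritization sketch: low/mid/high value estimates per time
# period plus team-validated effort, turned into "value per week of effort."

tasks = {
    "Rewrite titles and descriptions": {
        "value": {  # estimated added monthly visits: (low, mid, high)
            "30d": (200, 500, 900),
            "90d": (800, 1500, 2500),
            "1y": (3000, 6000, 10000),
        },
        "effort_weeks": 2.0,  # validated with the dev/UX/content teams, not guessed
    },
    "Add recommended internal links to blog posts": {
        "value": {
            "30d": (50, 150, 300),
            "90d": (300, 700, 1200),
            "1y": (1500, 3000, 5000),
        },
        "effort_weeks": 0.5,
    },
}

def priority_score(task, horizon="1y"):
    low, mid, high = task["value"][horizon]
    expected = (low + mid + high) / 3  # crude average of the estimate range
    return expected / task["effort_weeks"]

for name, task in sorted(tasks.items(), key=lambda kv: priority_score(kv[1]), reverse=True):
    print(f"{name}: ~{priority_score(task):.0f} estimated visits per week of effort")
```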

    How will we capture metrics, measure if it’s working, and ID potential problems early?

    Finally, last question I’ll ask in this prioritization is: How are we going to capture the right metrics around this, measure it, see that it’s working, and identify potential problems early on? One of the things that happens with SEO is sometimes something goes wrong — either in the planning phase or the implementation or the launch itself — or something unexpected happens. We update the user profiles to be way more SEO friendly and realize that in the new profile pages we no longer link to this very important piece of internal content that users had uploaded or had created, and so now we’ve lost a bunch of internal links to that and our indexation is dropping out. The user profile pages may be doing great, but that user-generated content is shrinking fast, and so we need to correct that immediately.

    We have to be on the watch for those. That requires validation of design, some form of test if you can (sometimes it’s not needed but many times it is), some launch metrics so you can watch and see how it’s doing, and then ongoing metrics to tell you was that a good change and did it map well to what we predicted it was going to do.

    General wisdom regarding prioritization

Just a few rules now that we’ve been through this process, some general wisdom around here. I think this is true in all aspects of professional life. Under-promise and over-deliver, especially on speed to execute. When you estimate all these things, make sure to leave yourself a nice healthy buffer on both timing and potential value. I like to be very conservative around how I think these types of things can move the needle on the metrics.

    Leave teams and people room in their sprints or whatever the cadence is to do their daily and ongoing and maintenance types of work. You can’t go, “Well, there are four weeks in this time period for this sprint, so we’re going to have the dev do this thing that takes two weeks and that thing that takes two weeks.” Guess what? They have to do other work as well. You’re not the only team asking for things from them. They have their daily work that they’ve got to do. They have maintenance work. They have regular things that crop up that go wrong. They have email that needs to be answered. You’ve got to make sure that those are accounted for.

    I mentioned this before. Never, ever, ever estimate on behalf of other people. It’s not just that you might be wrong about it. That’s actually only a small portion of the problem. The big part of the problem with estimating on behalf of others is then when they see it or when they’re asked to confirm it by a team, a manager, a client or whomever, they will inevitably get upset that you’ve estimated on their behalf and assumed that work will take a certain amount of time. You might’ve been way overestimating, so you feel like, “Hey, man, I left you tons of time. What are you worried about?”

    The frustrating part is not being looped in early. I think, just as a general rule, human beings like to know that they are part of a process for the work that they have to do and not being told, “Okay, this is the work we’re assigning you. You had no input into it.” I promise you, too, if you have these conversations early, the work will get done faster and better than if you left those people out of those conversations.

    Don’t present every option in planning. I know there’s a huge list of things here. What I don’t want you to do is go into a planning process or a client meeting or something like that, sit down and have that full list, and go, “All right. Here’s everything we evaluated. We evaluated 50 different things you could do for SEO.” No, bring them the top five, maybe even just the top three or so. You want to have just the best ones.

You should have the full list available somewhere so if they call up like, “Hey, did you think about doing this, did you think about doing that,” you can say, “Yeah, we did. We’ve done the diligence on it. This is the list of the best things that we’ve got, and here’s our recommended prioritization.” Then that might change around, as people have different opinions about value and which goals are more important in that time period, etc.

If possible, two of the earliest investments I recommend are: first, automated, easy-to-access metrics, building up a culture of metrics and a way to get those metrics easily so that every time you launch something new it doesn’t take you an inordinate amount of time to go get the metrics. Every week or month or quarter, however your reporting cycle goes, it doesn’t take you tons and tons of time to collect and report on those metrics. Automated metrics, especially for SEO, but all kinds of metrics, are hugely valuable.

    Second, CMS upgrades — things that make it such that your content team and your SEO team can make changes on the fly without having to involve developers, engineers, UX folks, all that kind of stuff. If you make it very easy for a content management system to enable editable titles and descriptions, make URLs easily rewritable, make things redirectable simply, allow for rel=canonical or other types of header changes, enable you to put schema markup into stuff, all those kinds of things — if that is right in the CMS and you can get that done early, then a ton of the things over here go from needing lots and lots of people involved to just the SEO or the SEO and the content person involved. That’s really, really nice.

    All right, everyone, I look forward to hearing your thoughts and comments on prioritization methods. We’ll see you again next week for another edition of Whiteboard Friday. Take care.

    Video transcription by Speechpad.com


    Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

    Continue reading →

    Good News: We Launched a New Index Early! Let’s Also Talk About 2015’s Mozscape Index Woes

    Posted by LokiAstari

    Good news, everyone: November’s Mozscape index is here! And it’s arrived earlier than expected.

    Of late, we’ve faced some big challenges with the Mozscape index — and that’s hurt our customers and our data quality. I’m glad to say that we believe a lot of the troubles are now behind us and that this index, along with those to follow, will provide higher-quality, more relevant, and more useful link data.

    Here are some details about this index release:

    • 144,950,855,587 (145 billion) URLs
    • 4,693,987,064 (4 billion) subdomains
    • 198,411,412 (198 million) root domains
    • 882,209,713,340 (882 billion) links
    • Followed vs nofollowed links
      • 3.25% of all links found were nofollowed
      • 63.49% of nofollowed links are internal
      • 36.51% are external
    • Rel canonical: 24.62% of all pages employ the rel=canonical tag
    • The average page has 81 links on it
      • 69 internal links on average
      • 12 external links on average
    • Correlations with Google Rankings:
      • Page Authority: 0.307
      • Domain Authority: 0.176
      • Linking C-Blocks to URL: 0.255

    You’ll notice this index is a bit smaller than much of what we’ve released this year. That’s intentional on our part, in order to get fresher, higher-quality stuff and cut out a lot of the junk you may have seen in older indices. DA and PA scores should be more accurate in this index (accurate meaning more representative of how a domain or page will perform in Google based on link equity factors), and that accuracy should continue to climb in the next few indices. We’ll keep a close eye on it and, as always, report the metrics transparently on our index update release page.

    What’s been up with the index over the last year?

    Let’s be blunt: the Mozscape index has had a hard time this year. We’ve been slow to release, and the size of the index has jumped around.

    Before we get down into the details of what happened, here’s the good news: We’re confident that we have found the underlying problem and the index can now improve. For our own peace of mind and to ensure stability, we will be growing the index slowly in the next quarter, planning for a release at least once a month (or quicker, if possible).

    Also on the bright side, some of the improvements we made while trying to find the problem have increased the speed of our crawlers, and we are now hitting just over a billion pages a day.

    We had a bug.

    There was a small bug in our scheduling code (this is different from the code that creates the index, so our metrics were still good). Previously, this bug had been benign, but due to several other minor issues (when it rains, it pours!), it had a snowball effect and caused some large problems. This made identifying and tracking down the original problem relatively hard.

    The bug had far-reaching consequences…

    The bug was causing lower-value domains to be crawled more frequently than they should have been. This happened because we crawled a huge number of low-quality sites for a 30-day period (we’ll elaborate on this further down), and then generated an index with them. In turn, this raised all these sites’ domain authority above a certain threshold where they would have otherwise been ignored, when the bug was benign. Now that they crossed this threshold (from a DA of 0 to a DA of 1), the bug was acting on them, and when crawls were scheduled, these domains were treated as if they had a DA of 5 or 6. Billions of low-quality sites were flooding the schedule with pages that caused us to crawl fewer pages on high-quality sites because we were using the crawl budget to crawl lots of low-quality sites.

    …And index quality was affected.

    We noticed the drop in high-quality domain pages being crawled. As a result, we started using more and more data to build the index, increasing the size of our crawler fleet so that we expanded daily capacity to offset the low numbers and make sure we had enough pages from high-quality domains to get a quality index that accurately reflected PA/DA for our customers. This was a bit of a manual process, and we got it wrong twice: once on the low side, causing us to cancel index #49, and once on the high side, making index #48 huge.

    Though we worked aggressively to maintain the quality of the index, importing more data meant it took longer to process the data and build the index. Additionally, because of the odd shape of some of the domains (see below) our algorithms and hardware cluster were put under some unusual stress that caused hot spots in our processing, exaggerating some of the delays.

    However, in the final analysis, we maintained the approximate size and shape of good-quality domains, and thus PA and DA were being preserved in their quality for our customers.

    There were a few contributing factors:

    We imported a new set of domains from a partner company.

    We basically did a swap with them. We showed them all the domains we had seen, and they would show us all the domains they had seen. We had a corpus of 390 million domains, while they had 450 million domains. A lot of this was overlap, but afterwards, we had approximately 470 million domains available to our schedulers.

    On the face of it, that doesn’t sound so bad. However, it turns out a large chunk of the new domains we received were domains in .pw and .cn. Not a perfect fit for Moz, as most of our customers are in North America and Europe, but it does provide a more accurate description of the web, which in turn creates better Page/Domain authority values (in theory). More on this below.

Palau, a small island nation in the western Pacific Ocean.

    Palau has the TLD of .pw. Seems harmless, right? In the last couple of years, the domain registrar of Palau has been aggressively marketing itself as the “Professional Web” TLD. This seems to have attracted a lot of spammers (enough that even Symantec took notice).

    The result was that we got a lot of spam from Palau in our index. That shouldn’t have been a big deal, in the grand scheme of things. But, as it turns out, there’s a lot of spam in Palau. In one index, domains with the .pw extension reached 5% of the domains in our index. As a reference point, that’s more than most European countries.

    More interestingly, though, there seem to be a lot of links to .pw domains, but very few outlinks from .pw to any other part of the web.

    Here’s a graph showing the outlinks per domain for each region of the index:


    China and its subdomains (also known as FQDNs).

In China, it seems to be relatively common for domains to have lots of subdomains. Normally, we can handle a site with a lot of subdomains (blogspot.com and wordpress.com are perfect examples of sites with many, many subdomains). But within the .cn TLD, 2% of domains have over 10,000 subdomains, and 80% have several thousand subdomains. This is much rarer in North America and Europe, in spite of a few outliers like Wordpress and Blogspot.

Historically, the Mozscape index has slowly grown the total number of FQDNs, from ¼ billion in 2010 to 1 billion in 2013. Then, in 2014, we started to expand and got 6 billion FQDNs in the index. In 2015, one index had 56 billion FQDNs!

We found that a whopping 45 billion of those FQDNs were coming from only 250,000 domains. That means, on average, these sites had 180,000 subdomains each. (The record was 10 million subdomains for a single domain.)

    Chinese sites are fond of links.

    We started running across pages with thousands of links per page. It’s not terribly uncommon to have a large number of links on a particular page. However, we started to run into domains with tens of thousands of links per page, and tens of thousands of pages on the same site with these characteristics.

At the peak, we had two pages in the index with over 16,000 links each. These could have been quite legitimate pages, but it was hard to tell, given the language barrier. However, in terms of SEO analysis, these pages were providing very little link equity and thus not contributing much to the index.

    This is not exclusively a problem with the .cn TLD; this happens on a lot of spammy sites. But we did find a huge cluster of sites in the .cn TLD that were close together lexicographically, causing a hot spot in our processing cluster.

    We had a 12-hour DNS outage that went unnoticed.

DNS is the backbone of the Internet. It should never die. If DNS fails, the Internet more or less dies, as it becomes impossible to look up the IP address of a domain. Our crawlers, unfortunately, experienced a DNS outage.

    The crawlers continued to crawl, but marked all the pages they crawled as DNS failures. Generally, when we have a DNS failure, it’s because a domain has “died,” or been taken offline. (Fun fact: the average life expectancy of a domain is 40 days.) This information is passed back to the schedulers, and the domain is blacklisted for 30 days, then retried. If it fails again, then we remove it from the schedulers.
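
In rough Python, the failure handling described above might look like the sketch below. This is an illustration of the described behavior, not Moz's actual scheduler code, and the record fields are assumptions.

```python
# First DNS failure: blacklist for 30 days, then retry.
# Second failure after the retry: remove the domain from the schedulers.
import time

BAN_PERIOD_SECONDS = 30 * 24 * 60 * 60

def handle_dns_failure(domain, state):
    """state: dict mapping a domain to its crawl record."""
    record = state.setdefault(domain, {"dns_failures": 0, "banned_until": None, "removed": False})
    record["dns_failures"] += 1

    if record["dns_failures"] == 1:
        # Assume the domain may have died; retry after 30 days.
        record["banned_until"] = time.time() + BAN_PERIOD_SECONDS
    else:
        # Failed again on the retry: drop the domain from scheduling.
        record["removed"] = True
```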

    In a 12-hour period, we crawl a lot of sites (approximately 500,000). We ended up banning a lot of sites from being recrawled for a 30-day period, and many of them were high-value domains.

    Because we banned a lot of high-value domains, we filled that space with lower-quality domains for 30 days. This isn’t a huge problem for the index, as we use more than 30 days of data — in the end, we still included the quality domains. But it did cause a skew in what we crawled, and we took a deep dive into the .cn and .pw TLDs.

    This caused the perfect storm.

We imported a lot of new domains (whose initial DA is unknown) that we had not seen previously. These would have been crawled slowly over time and would likely have been assigned a DA of 0, because their linkage with other domains in the index would be minimal.

    But, because we had a DNS outage that caused a large number of high-quality domains to be banned, we replaced them in the schedule with a lot of low-quality domains from the .pw and .cn TLDs for a 30-day period. These domains, though not connected to other domains in the index, were highly connected to each other. Thus, when an index was generated with this information, a significant percentage of these domains gained enough DA to make the bug in scheduling non-benign.

    With lots of low-quality domains now being available for scheduling, we used up a significant percentage of our crawl budget on low-quality sites. This had the effect of making our crawl of high-quality sites more shallow, while the low-quality sites were either dead or very slow to respond — this caused a reduction in the total number of actual pages crawled.

Another side effect was the shape of the domains we crawled. As noted above, domains with the .pw and .cn TLDs seem to have a different strategy in terms of linking — both externally to one another and internally to themselves — in comparison with North American and European sites. This data shape caused a couple of problems during processing, increasing the time required to build the index (due to the unexpected shape and the resulting hot spots in our processing cluster).

    What measures have we taken to solve this?

    We fixed the originally benign bug in scheduling. This was a two-line code change to make sure that domains were correctly categorized by their Domain Authority. We use DA to determine how deeply to crawl a domain.
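
Conceptually, using DA to decide crawl depth could look like the toy function below. The buckets and page budgets are invented numbers, not our actual scheduler values, but they show why placing a domain in the wrong bucket matters so much.

```python
# Toy illustration of DA-driven crawl scheduling: higher-authority domains
# get a deeper crawl.

def crawl_budget(domain_authority):
    if domain_authority >= 60:
        return 100000   # crawl deeply
    if domain_authority >= 30:
        return 10000
    if domain_authority >= 5:
        return 1000
    if domain_authority >= 1:
        return 50       # barely-known domains get a shallow crawl
    return 10           # unknown/zero-DA domains: sample just a few pages

# The scheduling bug effectively put DA 1 domains in the DA 5-6 bucket,
# handing billions of low-quality sites far more crawl budget than intended.
```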

    During this year, we have increased our crawler fleet and added some extra checks in the scheduler. With these new additions and the bug fix, we are now crawling at record rates and seeing more than 1 billion pages a day being checked by our crawlers.

    We’ve also improved.

    There’s a silver lining to all of this. The interesting shapes of data we saw caused us to examine several bottlenecks in our code and optimize them. This helped improve our performance in generating an index. We can now automatically handle some odd shapes in the data without any intervention, so we should see fewer issues with the processing cluster.

    More restrictions were added.

    1. We have a maximum link limit per page (the first 2,000).
    2. We have banned domains with an excessive number of subdomains.
      • Any domain that has more than 10,000 subdomains has been banned…
      • …Unless it is explicitly whitelisted (e.g. Wordpress.com).
        • We have ~70,000 whitelisted domains.
      • This ban affects approximately 250,000 domains (most with .cn and .pw TLDs)…
        • …and has removed 45 billion subdomains. Yes, BILLION! You can bet that was clogging up a lot of our crawl bandwidth with sites Google probably doesn’t care much about.
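
A rough sketch of how limits like those above could be enforced at scheduling and processing time (the function names and whitelist handling are assumptions for illustration):

```python
MAX_LINKS_PER_PAGE = 2000
MAX_SUBDOMAINS = 10000

def keep_page_links(links):
    # Only the first 2,000 links found on a page are kept.
    return links[:MAX_LINKS_PER_PAGE]

def should_ban_domain(domain, subdomain_count, whitelist):
    # Domains with an excessive number of subdomains are banned
    # unless they are explicitly whitelisted (e.g. wordpress.com).
    return subdomain_count > MAX_SUBDOMAINS and domain not in whitelist
```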

    We made positive changes.

    1. Better monitoring of DNS (complete with alarms).
    2. Banning domains after DNS failure is not automatic for high-quality domains (but still is for low-quality domains).
    3. Several code quality improvements that will make generating the index faster.
    4. We’ve doubled our crawler fleet, with more improvements to come.

    Now, how are things looking for 2016?

Good! But I’ve been told I need to be more specific. 🙂

Before we get to 2016, we still have a good portion of 2015 to go. Our plan is to stabilize the index at around 180 billion URLs by the end of the year and release an index predictably every three weeks.

We are also in the process of improving our correlations to Google’s index. Currently our fit is pretty good at a 75% match, but we’ve been higher, at around 80%; we’re testing a new technique to improve our metrics correlations and Google coverage beyond that. This will be an ongoing process, and though we expect to see improvements in 2015, these improvements will continue on into 2016.

    Our index struggles this year have taught us some very valuable lessons. We’ve identified some bottlenecks and their causes. We’re going to attack these bottlenecks and improve the performance of the processing cluster to get the index out quicker for you.

    We’ve improved the crawling cluster and now exceed a billion pages a day. That’s a lot of pages. And guess what? We still have some spare bandwidth in our data center to crawl more sites. We plan to improve the crawlers to increase our crawl rate, reducing the number of historical days in our index and allowing us to see much more recent data.

    In summary, in 2016, expect to see larger indexes, at a more consistent time frame, using less historical data, that maps closer to Google’s own index. And thank you for bearing with us, through the hard times and the good — we could never do it without you.


    Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

    Continue reading →

Announcing Search Insights from Moz Local!

    Posted by David-Mihm

    When we launched Moz Local, I said at the time that one of the primary goals of our product team was to “help business owners and marketers trying to keep up with the frenetic pace of change in local search.” Today we take a major step forward towards that goal with the beta release of Moz Local Search Insights, the foundation for a holistic understanding of your local search presence.

    As we move into an app-centric world that’s even more dependent on structured, accurate location data than the mobile web, it’s getting harder to keep up with the disparate sources where this data appears — and where customers are finding your business. Enter Moz Local Insights — the hub for analyzing your location-centric digital activity.

    What’s included in this beta release?

    We’ve heard our customers loud and clear — especially those at agencies and enterprise brands — that while enhanced reporting was a major improvement, they needed a more comprehensive way to prove the value of their efforts to clients and company locations.

    We start with daily-updated reporting in three key areas with this release: Location page performance, SERP rankings, and reputation. All of these are available not only within a single location view, but aggregated across all locations in your account, or by locations you’ve tagged with our custom labels.

    Location page performance

    The goal of our new Performance section is to distill the online traffic metrics that matter most to brick-and-mortar businesses into a single digestible screen. After a simple two-click authentication of your Google Analytics account, you’ll see a breakdown of your traffic sources by percentage:

    Clicking into each of the traffic sources on the righthand side will show you the breakdown of traffic from those sources by device type.

    There’s also an ordered list of all prominent local directories that are sending potential customers to your website. While we haven’t yet integrated impression data from these directories, this should give you a relative indicator of customer engagement on each.


    We’re hoping to add even more performance metrics, including Google My Business and other primary consumer destinations, as they become available.

    Visibility

    The Visibility section houses your location-focused ranking reports, with a breakdown of how well you’re performing, both in local packs and in organic results. Similar to the visibility score in Moz Analytics, we’ve combined your rankings across both types of results into a single metric that’s designed to reflect the likelihood that a searcher will click on a result for your business when searching a given keyword.
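
Conceptually, a score like this weights each of your rankings by a rough estimate of how likely a searcher is to click that position, whether in the local pack or in organic results. As a sketch (the click-through-rate curves below are invented placeholders, not the actual values behind the metric):

```python
# Combine local pack and organic rankings into a single "likelihood of a
# click" style score; all CTR numbers are made up for illustration.

ORGANIC_CTR = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}
LOCAL_PACK_CTR = {1: 0.20, 2: 0.12, 3: 0.08}

def visibility_score(rankings):
    """rankings: list of (result_type, position) tuples for one keyword."""
    score = 0.0
    for result_type, position in rankings:
        curve = LOCAL_PACK_CTR if result_type == "local_pack" else ORGANIC_CTR
        score += curve.get(position, 0.0)
    return min(score, 1.0)  # cap at a 100% likelihood of a click

print(visibility_score([("local_pack", 2), ("organic", 4)]))  # 0.19
```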




    The Visibility section also lets you see how you stack up against your competitors — up to three at a time. But rather than preselecting a particular competitor, you can choose any competitor you’d like to compare yourself to on the fly.

    And, of course, we give you the metrics in full table view below (CSV export coming soon) if you prefer to get a little more granular with your visibility analysis by keyword.

    We’ve got a number of other innovative features planned for release later in the beta period, including taking barnacle positions into account (originally heard through Will Scott) when calculating your visibility score, and tracking additional knowledge panel and universal search entries that are appearing for your keywords.

    Reputation

    The Reputation section is probably the most straightforward of the bunch — a simple display of how your review acquisition efforts are progressing, both in terms of volume and the ratings that people are leaving for your business.

    There’s also a distribution of where people are leaving reviews, so you have a sense of what sites your customers are leaving reviews on, and which ones might need a little extra TLC.

    Over time, we’ll be expanding this section to include many more review sources, sentiment analysis, and the ability to receive notifications and summaries of new reviews.


    What’s next?

    You tell us! This is a true beta, and we’ll be paying close attention to your feedback over the next couple of months.

    Search Insights is already enabled for all Moz Local customers by default. Just log in to your dashboard and let us know what you think. And if you’re not yet a Moz Local customer, sign up today to take Search Insights for a free spin during our beta period.

    There’s a lot of underlying infrastructure beneath the surface of this release that will allow us to add new features on a modular basis moving forward, and we’re already working on improvements, such as custom date range selection, CSV exporting, emailed reports, and notifications. But your feedback will help us prioritize and add new features to the roadmap.

    Before I sign off, I want to give a huge thank you to our engineering, design and UX, marketing, and community teams for their hard work, assistance, and patience as we worked to release Moz Local Search Insights into the wild. And most importantly, thank you to you guys — our customers — whose feedback has already proven invaluable and will be even more so as we enter the newest phase of Moz Local!


    Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

    Continue reading →