About frans

frans has written 4625 articles so far, you can find them below.

The Content Marketing Campaign Playbook – Guaranteeing Success in 2016

Posted by SimonPenson

Introduction

Stage One: Setting expectations and objectives

Stage Two: Audience understanding

Stage Three: Brand activation

Stage Four: Campaign plans

Stage Five: Finding the right prospects

Stage Six: Social

Stage Seven: Retargeting

Free downloadable campaign planner

__________

“How do I know if my content campaign is going to work?”

This question is the one I get asked more than any other at present, and for good reason; creating hit content, consistently, is one of the biggest challenges in marketing.

I’m certainly not going to pretend that every piece of content I’ve ever been involved in has been a hit. In fact, the opposite is probably closer to the truth — but failure teaches you much more, and allows you to iterate faster.

The result of that heartache and frustration is a process I want to share with you today, one that’s designed to maximize the possibility of success with content campaigns.

It is also important to point out here that I’m not talking about content strategy or the wider picture, but specifically those bigger campaigns that should punctuate your wider marketing plan.

The difference between good and bad ideas

So, why did we fail so many times? Why were truly exceptional campaign ideas not hitting the mark?

The answer to these questions lay both in our ability to answer a handful of very simple questions and in understanding how to align the various marketing disciplines required to ensure you deliver.

Let’s look now at the process that has made the difference between us delivering content that, while good on the face of it, didn’t deliver the objective, and the pieces that absolutely flew.

Put another way, the process that makes the difference between a campaign asset like the one below — which smashed every one of its key objectives — and one of the many that seemed like a good idea, but just failed to “fly.”

I’m sharing this example, not because it was our best-performing piece, but simply because it was the first time we saw the benefits of getting the process right from the start.

The objective for the piece was a sampling exercise: to get a challenger brand’s condoms into the hands of 5,000 targeted men with a relatively small budget.

The resulting piece was an interactive quiz designed to capitalize on the “Fifty Shades of Grey” noise, as it coincided with the film’s launch. It worked by asking the site visitor a number of questions about their sex life before presenting them with a result between 0–50. The “Greyer” you were, the higher the score.

There were then calls-to-action around social sharing, further learning, and the main “sample request” option.

The piece was so popular that all 5,000 sample requests were taken up in just two days. The Buzzfeed post on it received 2,100 views in the first week, and ten major national websites covered the concept, in addition to numerous smaller blogs.

Facebook traction was also very impressive, with more than 4,500 post “Likes,” 560 comments, and an average engagement rate of 7.4%.

So how was this made possible? Let’s walk through the same process that spawned it now…

STAGE ONE

Setting expectations and objectives

Ground Zero for any successful campaign is objective setting. The process is still overlooked by many, but before you start, you must define what success looks like.

And that MUST come topped off with a healthy serving of realism.

If your budget is a couple of thousand pounds (or dollars) for the entire piece, then you must be honest about what this may achieve — both as a standalone piece and as part of the wider content strategy it sits within.

You must also be clear about where the value is coming from. Is the piece a brand play or a performance marketing effort? Metrics that may suit each of these can be seen below:

Brand content metrics

Dwell time
Social sharing
Eyeballs reached
Sentiment
Visits

Performance marketing content metrics

Visits
Leads generated
Effect on organic search visibility
Citations and links earned

There are many others, of course, but for a campaign to be measurable, you should set clear and realistic KPIs against any of these relevant to the campaign.

For instance, in the earlier example we talked through, the KPIs were simple and we captured them in a format similar to the one below:

Main Objective: Obtain 5,000 product sample requests and reach 100,000 new “eyeballs.”

Secondary Objectives: Improve social engagement, gain coverage on high-profile sites, and increase traffic to the site during the campaign period.

KPIs

PR Placements: 6+ high-profile site placements (notional Hitwise Traffic target of 100,000 for those pieces).

Social: Organic – Reach: 75,000; Engagement: 5,000 | Paid – Reach: 250,000; Clicks: 20,000.

Visits and actions: Doubling of traffic to the site during campaign period and 5,000 sample requests.
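
The KPI sheet above can also be kept as structured data, so progress is easy to check while the campaign runs. What follows is only a minimal sketch: the `Kpi` class is a hypothetical helper, and the "actual" figures are invented for illustration. Only the targets come from the list above.

```python
from dataclasses import dataclass


@dataclass
class Kpi:
    name: str
    target: float
    actual: float = 0.0

    @property
    def progress(self) -> float:
        """Fraction of the target achieved so far (1.0 means target met)."""
        return self.actual / self.target if self.target else 0.0

    @property
    def met(self) -> bool:
        return self.actual >= self.target


# Targets come from the KPI list above; the "actual" values are made up.
kpis = [
    Kpi("High-profile PR placements", target=6, actual=4),
    Kpi("Organic social reach", target=75_000, actual=80_000),
    Kpi("Paid social clicks", target=20_000, actual=10_000),
    Kpi("Sample requests", target=5_000, actual=5_000),
]

for kpi in kpis:
    status = "met" if kpi.met else "behind"
    print(f"{kpi.name}: {kpi.progress:.0%} of target ({status})")
```

Keeping targets and actuals side by side like this makes it harder to quietly redefine success once the campaign is live.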

STAGE TWO

Audience understanding

Who

Once these KPIs are set and agreed upon, the next phase is to center your thinking on the audience with whom you want to engage, in order to achieve those objectives.

For the example campaign, the target market was relatively broad, but ended up being focused on men in the 18–34 range. The brand’s insight was that the trial needed to reflect that, which required a process for collating all known customer information and, in turn, allowed us to create Campaign Personas.

I have written previously about how you can extract data from social to inform audience understanding, and while Facebook has changed Graph Search a lot since penning the piece, there is still value in following some of that process.

Also worth a read on the wider persona process is this excellent guide by Mike King. It contains a huge amount of information on how to leverage data to build an accurate picture of your customer or clients.

Creating campaign- or distribution-specific personas allows you to focus very clearly on creating the right content, angles, and distribution plan to hit those key objectives.

To do that, however, you must first dive into the data.

The starting points for this are existing marketing insight, social data, and/or output from Global Web Index, a SaaS offering (and paid-for tool) that allows you to mine a vast swathe of Internet usage data. Many of the ad platforms you use buy this data to power their own targeting.

Where

From social, you can extract data that adds richness to the picture, including how much time people spend on any particular platform. GWI, however, aggregates that information and allows you to produce insights such as the one in the example below.

From this kind of data, you can plan a detailed, focused, and informed social distribution plan as part of the wider seeding strategy.

What’s also interesting to know is both the type of content this audience currently engages with and also how they believe your brand fits within that picture.

Mistakes are often made when businesses understand what the people they want to attract consume, without taking into account if the brand has the right to play in that specific space.

The good news is you can easily gain insight into both of these areas.

What

The starting point for this insight piece is a dive into Google Display Planner. This free tool is designed to help media planners with display ad targeting, but its data can also be used to understand which sites a select demographic may frequent.

In the example below, you can see we have entered a couple of keyword interests and a topic interest to form a target demographic.

By clicking that “Get placement ideas” button, you’re taken to the main dashboard, where you can further refine everything, from age and gender to device use and back again.

A section I use quite a lot both for paid and PR targeting (as well as for initial audience insight) is the Individual Targeting Ideas > Placements > Sites drill down. This gives you a list of sites visited by your “audience,” which can be downloaded into a CSV file and sorted based on a number of metrics, including traffic, popularity, and more.
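
Once the placements list is downloaded, the sorting step can be scripted instead of done by hand. The sketch below is an illustration only: the column names (`site`, `weekly_impressions`), the site names, and the figures are hypothetical stand-ins, so check the header row of your own export before reusing it.

```python
import csv
import io

# Stand-in for the downloaded export. The column names and figures are
# hypothetical; check the header row of your own CSV file.
SAMPLE_EXPORT = """site,weekly_impressions
example-mens-lifestyle.com,250000
example-news.com,900000
example-fitness-blog.com,40000
"""


def top_sites(csv_text: str, metric: str, limit: int = 10) -> list[dict]:
    """Return the rows with the highest value for `metric`, largest first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row[metric]), reverse=True)
    return rows[:limit]


shortlist = top_sites(SAMPLE_EXPORT, metric="weekly_impressions", limit=2)
for row in shortlist:
    print(row["site"], row["weekly_impressions"])
```

For a real file, read its contents with `open(path, newline="")` and pass the text in the same way.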

This then allows you to select a small number of sites that will most likely be visited by those thinking about your product or service for the next level of analysis.

Why

To understand what they’re into, you must now drill into what your audience shares most on those sites. The best tools for doing that are Ahrefs’ Content Explorer and Buzzsumo.

Taking a random site from the list we created, we can now look at the most-shared content on the site.

For this specific task, we’ll use the former, selecting the “Top Content” option within the main Site Explorer:

Here, we can see the most shared and linked-to assets, and start to understand the sort of content our audience wants to engage with.

We can also make this picture even richer by then looking at a “whole-of-market” view and typing in associated topics into Buzzsumo. This then gives us a full list of the most shared content pieces in a broader sense.

STAGE THREE

Brand activation

As already discussed, however, not every brand can cover every subject, or has the right to do so — understanding this is key to success.

To get a fuller picture here, qualitative survey data is needed. To paint this picture, we will again turn to Global Web Index data. In the absence of such a tool, a quick survey of existing visitors will give you this critical insight.

Below, you can see the answer to what this target audience expects to see from the brand. This doesn’t mean specific content ideas, but rather the type of content it has the authority to produce in the eyes of the audience.

As we can see here, the brand is looked to predominantly as a source of information and knowledge sharing (great brand-as-publisher strategy opportunities!).

It is also clear, however, that they want to engage with the brand and expect relevant, timely content — an important point we will come back to later.

So, we now understand a little more about our audience’s needs and we can use this alongside existing research data and customer knowledge to create personas specific to the campaign.

In the example we’re walking through, those personas were as follows:

The image above is a simplified version and we always use our persona template, which you can download here, to ensure we paint a thorough picture.

The point here is to humanize the data. The mind processes all that information in a much more structured way if you do this, and that means you end up making more precise decisions in how and where you target the campaign.

Personas also make it much easier to scale data understanding outside the group that created them. By having a shared “face” to each segment and trying to align each one to a famous person, it makes it much easier to ensure there’s a shared understanding across the whole working group.

Once this stage is set in stone, the next phase is to move into the campaign idea itself.

Ideation – informing ideas with data

At Zazzle, we use our much-publicized ideation process as the basis for this stage; it is something I have written about previously for Moz.

The principle is that you create left-brain structure around the creative process to ensure you can consistently output great ideas based on the objective.

We follow a 13-step process for doing this, which starts with an underpinning of the ideas against the objective — ensuring that they will achieve it — and defining the content types (as in infographics, video, articles, etc.) relevant to the audience we want to reach.

This process will always unearth great ideas, but not always ideas that fly from a campaign perspective — and for a long time we really struggled to understand why.

It was an anomaly that perplexed us for several months and it took a session of digging into feedback from journalists at real scale, as well as work on the entire distribution process, to really figure it out.

The answer boiled down to not asking the right questions of each concept at an early enough stage, and it required a reversal in how we plan the campaign as a whole.

Testing ideas

The result was a new process that included a session at the end to ask questions of each and every idea recorded to ensure it is “fit for purpose.”

1. Why now?

The first and most important question is, “Why are we doing this now?” We learned the hard way that an idea can be the best idea in the history of content marketing, but if it hasn’t got a “news hook,” you may well be fighting a losing battle.

Such an angle can be manufactured with a little forethought, of course, so this doesn’t mean that only “newsy” content will work.

For instance, if we take a look at a piece on a subject such as finance, there’s always a way to weave a new study, political opinion, or law change into the campaign to give it that critical “run it now” message.

Without it, a journalist or blogger — almost all of whom are motivated by news and trends — will have something more important to run before your piece, and it may just get lost in the noise.

2. What’s the angle?

If your idea passes the first stage of questioning, then the next phase is to look at how you may break that news angle down into a series of angles, or exclusives.

While having one really strong “story” can be enough, it is much better to be able to present a number of different flavors on the same thing. That way, when pitching it, your PR team will be able to approach a larger number of sites with that exclusive they all hunger for.

Below you’ll see an example of how this may work. In this case, we designed a series of exclusive angles for the idea we ended up opting for (an interactive quiz based on the “Fifty Shades of Grey” hype). The data-informed rationale behind it was as follows:

Why now? – “Because the film is launching.”
Why this? – “There’s a huge existing conversation in this area and we can tap into it. The audience is also perfect.”

As you’ll see, there are a number of clearly different angles here supported by supplementary content.

This process then actually shapes the way you build the assets themselves, ensuring that you maximize potential reach.

3. Who is it for?

Once you have established it has legs as a trending opportunity campaign, the next stage is to work hard on understanding who would be interested in it, and where you may find them online.

As we now have several exclusive angles, we can go back to our personas and add an extra layer of detail to define which ones would be interested in each angle/story.

For instance, we know that the free condoms giveaway is most likely to resonate with our male persona, and so we want to push that through relevant websites and social channels more attuned to that audience.

4. Where will we find them?

There are myriad tools and ways in which to do this, enough for a post in its own right, but while I can’t share every one, it’s worth discussing the key tools we use daily to do this.

You find these distinct groups in different places on the web, so grouping those people together helps you to then understand which sites they frequent.

At this stage we often use upstream and downstream traffic data from Hitwise to inform our decision making in a more data-driven way. The platform allows you to see where visitors go before and after visiting specific sites, widening your prospecting list.

STAGE FOUR

Campaign plans

Before we get into the influencer outreach piece, you must first create a site framework for your PR team to work from.

This means creating a handful of example sites for each distribution persona, giving clear examples of where we may find them.

For example, we may find “Steve” on the main social platforms, Buzzfeed, and so on. From this, you can then build a list of similar sites.

The final list of agreed upon and approved prospects is then added into our Content Campaign Planner, which you can download for your own campaigns either via the link here, or later on at the bottom of the article.

Building campaign plans

Below you can see a screenshot of the top sheet of the plan, which captures the overall timeline of each element. The tabs below it then contain all the info on:

The paid social plan – Targeting, spend, target CPC, etc.
The PR plan – Exclusive angles, the sell, content being used, etc.
Prospect list – List of publications to be targeted
Other – A tab to capture any other activity, such as above-the-line activity, if appropriate for the campaign.

Budget distribution

Before we get into the plan details, however, one important point we always cover is budget breakdown.

Regardless of how much budget you have to play with for the overall campaign, it is important to look at wider media planning benchmarking to ensure you split it in a way that will maximize the chance of success.

We use a famous UK ad campaign as the basis for this decision-making process, learning from one of the most successful going: the John Lewis Christmas campaign. It is a wildly successful TV-first creative with a tasty £7 million budget.

Critically, however, only one million of that is spent on creative; the rest is all distribution. While it wins award after award for being an undeniable hit, that budget split ensured it was always going to be successful.

“6 in every 7 campaign dollars should be spent on distribution.”

All too often we get carried away with making the creative stand out, when we should be much more focused on distribution planning.

Exact breakdown will vary, but as a guide, aim for a 70/30 split towards distribution.
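
To make the arithmetic concrete, here is a small sketch that applies both the 70/30 guide and the roughly 6-in-7 split implied by the John Lewis figures. The `split_budget` function and the 10,000 budget are hypothetical and exist only for illustration.

```python
def split_budget(total: float, distribution_share: float = 0.7) -> dict[str, float]:
    """Split a campaign budget between distribution and creative."""
    if not 0.0 <= distribution_share <= 1.0:
        raise ValueError("distribution_share must be between 0 and 1")
    distribution = round(total * distribution_share, 2)
    return {"distribution": distribution, "creative": round(total - distribution, 2)}


# The John Lewis figures quoted above: 1 million of 7 million spent on creative.
john_lewis_share = (7_000_000 - 1_000_000) / 7_000_000  # roughly 0.857

guide_split = split_budget(10_000)                         # 70/30 guide
john_lewis_split = split_budget(10_000, john_lewis_share)  # 6-in-7 split

print(guide_split)
print(john_lewis_split)
```

The two ratios differ, so treat 70/30 as a floor for distribution spend rather than a ceiling.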

STAGE FIVE

Finding the right prospects

Distribution is key, and in the majority of cases your PR plan should deliver the biggest impact, if executed correctly. And that makes your approach to prospecting key to the overall success of the project.

As you’ve already carried out a lot of work around target sites, the next phase is to understand who the right journalists or influencers are inside those businesses.

At this stage, there will also be further work on blogger influencer identification, to ensure that the PR plan has the breadth of targets to cover as many eyeballs as possible.

To do that, you need to look at who is already sharing your content, using a tool like Ahrefs’ Top Referring Content. Reaching out to those already predisposed to linking to you is a surefire way of kickstarting your PR efforts with warm conversations.

Outside of this, there are myriad ways to reach the right bloggers, and this certainly isn’t a guide on influencer outreach. If you did want to know more, I suggest checking out these resources:

Link Prospecting on Steroids – A Streamlined Process by Matthew Barby
The Definitive Guide to Guest Blogging by Brian Dean
The Ultimate Guide to Advanced Guest Blogging by Pratik Dholakiya

From a PR perspective, we only use two tools to simplify the process as much as possible. After trawling through every process and option possible, we’ve settled on a combination of Gorkana and LinkedIn. That may be a process that disappoints some of the more technically minded, but this is based on tens of thousands of hours of experience.

And the process couldn’t be easier, because it is simply about people:

Take your list of sites selected as part of the audience-understanding project.
Enter them into Gorkana and/or LinkedIn to establish the best section editor, journalist, or influencer to reach out to.
Note name, email address, phone number, and any previous communication notes into your planner.

Outside of this, we have been trialling JournoRequest to bolster those efforts and take the legwork out of social monitoring (an effective but labor-intensive process for finding trending opportunities from the journalists themselves).

This simple tool delivers targeted journalist content requests to your inbox and can help when it is part of an “always on” monitoring process that feeds in at the ideas stage.

The pre-pitch

A major mistake often made at this stage is to pick up the phone too early. It’s all too tempting to do that when so much work has led to this point, but before you do, it’s important to pre-plan what you’re going to say and to whom. This ensures that you maximize take-up and don’t confuse which angles you pitch to whom.

This is where the prospecting list from our planner comes into its own. As you can see in the example below, it segments that process and makes it possible to scale the communication across multiple PR team members.

It can often help PRs to write a script before making the call, to ensure the sell is as strong as planned. We ALWAYS tell the journalist that we’ll follow up with all the details on email.

This not only creates an excuse to get their email address if we don’t already have it, but also ensures that it stays front-of-mind and that we make it as easy as possible for them.

STAGE SIX

Social

PR is, of course, only part of the story. It’s important to plan around every other available channel opportunity to maximize reach.

Social is the next consideration, as it will support PR activity. We know from the initial audience piece how much time our target market spends on key platforms.

Supporting the content by creating a regular organic sharing plan across social and other owned channels is the first logical step, but there is obviously much more you can do. The chart below is a great starting point when considering how wide you can, or could, spread the net.

Which option you choose is dependent upon a) the topic of the campaign and b) what insights tell you about the audience you are targeting.

In our example, the interactive quiz was hosted on the site and was pushed organically via all key social channels, as well as being the subject of a significant PR campaign.

Organically, we ensure we can get the most out of the channel by, again, creating a number of editorial angles. In the case of the Skyn piece, this meant creating a number of quotes obtained from the survey results, memes, and so on, both to vary the messaging around the campaign and to ensure we kept it front-of-mind.

It was the paid media side that we focused on most, however, as we saw the targeting in the space as the best way to capture the attention of our audience.

That meant focusing on Instagram and Facebook with the majority of spend, but also drip-feeding it through Twitter to a really tightly-controlled custom audience created from existing customer email data.

Speaking more generally, when there is a paid social budget, our split would start looking like this, to be refined based on insight and the content subject matter:

Facebook 70%
Instagram 20%
Twitter 10%
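
As a quick illustration, the starting split above can be turned into per-platform budgets. This is only a sketch: `allocate_paid_social` is a hypothetical helper and the 3,000 budget is an invented figure.

```python
# Starting split from the list above; adjust it as insight comes in.
PLATFORM_SPLIT = {"Facebook": 0.70, "Instagram": 0.20, "Twitter": 0.10}


def allocate_paid_social(budget: float, split: dict[str, float] = PLATFORM_SPLIT) -> dict[str, float]:
    """Divide a paid social budget across platforms by their share."""
    if abs(sum(split.values()) - 1.0) > 1e-9:
        raise ValueError("platform shares must add up to 100%")
    return {platform: round(budget * share, 2) for platform, share in split.items()}


# The 3,000 budget is a hypothetical figure for illustration.
allocation = allocate_paid_social(3_000)
print(allocation)
```

Passing a different `split` dictionary lets you test an adjusted mix, for example one that adds LinkedIn for a B2B campaign.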

For the majority of markets, with the possible exclusion of B2B, Facebook will almost always trump the rest simply due to the size of the potential audience and the quality of the targeting its ads platform offers.

And while targeting simply by interest sets will work, we almost always find that the best option here is to add the Facebook Website Custom Audience Pixel to your site, and to then use that data to create a custom audience based on those already visiting. It can also be useful to test this against a custom audience created from “lookalikes” based on uploading your email database (if you have one).

However, if the campaign were designed to attract a completely different audience, then we would look more towards modeling the targeting on interests and/or competitors.

For example, if our campaign is designed to attract men to a survey about marriage but the piece is for a wedding and engagement ring specialist, the likelihood may be that the majority of the site’s audience will be female. In this scenario, we would choose interest targeting to make sure we were reaching the right eyeballs.

The same is true of Twitter, too, although clicks here will be more expensive. Instagram is still at a very early stage in its paid lifecycle, which means that CPCs here are relatively affordable but are undoubtedly heading north as more advertisers jump on the platform.

LinkedIn is the most expensive, and hardest to target, of all options — but where there is a high average lifetime value of a customer and your product is in the B2B space, it can work.

There are, of course, several other considerations. You may also want to add other levels, such as native ad opportunities (think Taboola and Outbrain), and even paid search and/or display.

STAGE SEVEN

Retargeting

Display or retargeting can work very well as part of a wider, longer-term strategy to nurture the new visitor in the weeks after they land on your content.

The idea here is to either provide a really targeted piece of content or offer to follow up, thus feeding the whole inbound marketing strategy.

Let’s say your content was the quiz we’ve discussed throughout this piece. We’ve captured their details as part of that activity, but we want to stay front-of-mind. Here we can use retargeting to do just that. Rather than simply using it generically, you can segment to show something like a “10% Off Your Next Purchase” offer, or a follow-up piece of content on the results of the quiz, for instance.

Email

This is also where email can come in. As well as simply promoting the campaign through an editorial newsletter, we can choose to personalize that message further, as we did with our retargeting. This only serves to strengthen the relationship you have with that individual.

Fitting it within a wider strategy

There are many, many thousands more words to write around the topic of lifecycle marketing, but that is the subject of a post for another day.

Before we finish, however, it is definitely worth touching on how that standalone campaign should sit within a wider content strategy.

This is something I have always been incredibly passionate about, as we see time and time again how larger organizations throw money at campaigns without really thinking about how they fit within the whole picture.

Getting that right is about understanding a concept I call “Content Flow,” and measuring it is a subject I have written about previously here. We even built a simple tool to enable marketers to do just that and map the output of their content strategies easily.

The point is that a “big” idea is only as good as the other content that surrounds it. Great ROI does not often flow from a singular piece, but from the overall approach to content strategy. Being able to consistently deliver is the difference between success and failure.

Free downloadable campaign planner!

Content campaigns are a hugely important part of getting that right, and if you’re not already creating them, there should now be fewer barriers in the way of your success.

If you’d like to have a go at it, you can download the campaign planner I use day-to-day by clicking on the image below.


7 Illustrations of How Topical Links Impact SEO, in Theory and Practice

Posted by Cyrus-Shepard

The Internet lives on links.

Marketers have long understood the importance of links to SEO. So much so, it’s something we study regularly here at Moz. At their most basic, links are counted as “votes” of popularity for search engines to rank websites. Beyond this, search engineers have long worked to extract a large number of signals from the simple link, including:

Trustworthiness – Links from trusted sites may count as an endorsement
Spamminess – Links from known spam sites may count against you
Link Manipulation – Looking at signals such as over-optimization and link velocity, search engines may be able to tell when webmasters are trying to “game” the system

One of the most important signals engineers have worked to extract from links is topical relevance. This allows search engines to answer questions such as “What is this website about?” by examining incoming links.

Exactly how search engines use links to measure and weigh topical relevance is subject to debate. Rand has addressed it eloquently here, and again here. Over the years, several US patent filings from Google engineers demonstrate exactly how this process may work.

It’s hugely important to look at these concepts to better understand how incoming links may influence a website’s ability to rank.

This is the “theory” part of SEO. As always, a huge thanks to Bill Slawski and his blog SEO by the Sea, which acted as a starting point of research for many of these concepts. Let’s dive in.

1. Hub and authority pages

In the beginning, there was the Hilltop algorithm.

In the early days of Google, not long after Larry Page figured out how to rank pages based on popularity, the Hilltop algorithm worked out how to rank pages on authority. It accomplished this by looking for “expert” pages linking to them.

An expert page is a document that links to many other topically relevant pages. If a page is linked to from several expert pages, then it is considered an authority on that topic, and may rank higher.

A similar concept using “hub” and “authority” pages was put forth by Jon Kleinberg, a Cornell professor with grants from Google and other search engines. Kleinberg explains:

“…a good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs.”
– Authoritative Sources in a Hyperlinked Environment (PDF)
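
Kleinberg’s definition is circular, and the usual way to resolve it is to iterate: update authority scores from hub scores, then hub scores from authority scores, and repeat until they settle. The following is a simplified sketch of that iteration (often called HITS) on a tiny made-up link graph. The page names are hypothetical, and it says nothing about how any search engine implements the idea in production.

```python
def hits(links: dict[str, list[str]], iterations: int = 50):
    """Return (hub, authority) scores for a small link graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    authority = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        authority = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                authority[target] += hub[source]

        # A page's hub score is the sum of the authority scores it links to.
        hub = {page: sum(authority[t] for t in links.get(page, [])) for page in pages}

        # Normalise so the scores stay bounded between iterations.
        auth_norm = sum(v * v for v in authority.values()) ** 0.5 or 1.0
        hub_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        authority = {page: v / auth_norm for page, v in authority.items()}
        hub = {page: v / hub_norm for page, v in hub.items()}

    return hub, authority


# Hypothetical graph: two pizza "hub" pages and one page with a single link.
graph = {
    "hub_a": ["pizza_1", "pizza_2"],
    "hub_b": ["pizza_1", "pizza_2", "pizza_3"],
    "lonely": ["pizza_3"],
}
hub_scores, authority_scores = hits(graph)
print(sorted(authority_scores.items(), key=lambda item: -item[1]))
```

In this toy graph, the pages linked to by both hubs end up with more authority than the page linked to by only one hub and the single-link page.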

These were elegant solutions that produced superior search results. While we can’t know the degree to which these concepts are used today, Google acquired the Hilltop algorithm in 2003.

2. Anchor text

Links contain a ton of information. For example, if you link out using the anchor phrase “hipster pizza,” there’s a great chance the page you’re linking to is about pizza (and maybe hipsters).

That’s the idea behind several Google PageRank patents. Earning links with the right anchor text can help your page to rank for similar phrases.

This also explains why you should use descriptive anchor text when linking, as opposed to generic “click here” type links.

Beyond the anchor text, other signals from the linking page — including the title and text surrounding the link — could provide contextual clues as to what the target page is about. While the importance of anchor text has long been established in SEO, the influence of these other elements is harder to prove.

3. Topic-sensitive PageRank

Despite rumors to the contrary, PageRank is very much alive (though Toolbar PageRank is dead).

PageRank technology can be used to distribute all kinds of different ranking signals throughout a search index. While the most common examples are popularity and trust, another signal is topical relevance, as laid out in this paper by Taher Haveliwala, who went on to become a Google software engineer.

The concept works by grouping “seed pages” by topic (for example, the Politics section of the New York Times). Every link out from these pages passes on a small amount of topic-sensitive PageRank, which is passed on through the next set of links, and so on.

When a user enters a search, those pages with the highest topic-sensitive PageRank (associated with the topic of the search) are considered more relevant and may rank higher.
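
One way to picture this is as ordinary PageRank in which the “random jump” always lands on a seed page for the topic, so rank can only flow outward from those seeds. The sketch below assumes that simplified model. The graph, page names, and damping value are hypothetical, and it is not a description of Google’s actual implementation.

```python
def topic_pagerank(links: dict[str, list[str]], seeds: set[str],
                   damping: float = 0.85, iterations: int = 50) -> dict[str, float]:
    """PageRank where the random jump always lands on a topic seed page."""
    pages = set(links) | {t for targets in links.values() for t in targets} | set(seeds)
    teleport = {page: (1.0 / len(seeds) if page in seeds else 0.0) for page in pages}
    rank = dict(teleport)

    for _ in range(iterations):
        new_rank = {page: (1 - damping) * teleport[page] for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Pass an equal share of this page's rank through each link.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Pages with no outlinks hand their rank back to the seeds.
                for other in pages:
                    new_rank[other] += damping * rank[page] * teleport[other]
        rank = new_rank
    return rank


# Hypothetical graph: a politics seed page and an unrelated cooking cluster.
graph = {
    "nyt_politics": ["campaign_blog", "senate_site"],
    "campaign_blog": ["senate_site"],
    "senate_site": [],
    "cooking_blog": ["recipe_site"],
    "recipe_site": [],
}
politics_rank = topic_pagerank(graph, seeds={"nyt_politics"})
print(sorted(politics_rank.items(), key=lambda item: -item[1]))
```

Pages reachable from the politics seed pick up topical rank, while the cooking pages, which no seed links toward, receive none.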

4. Reasonable surfer

All links are not created equal.

The idea behind Google’s Reasonable Surfer patent is that certain links on a page are more important than others, and thus assigned increased weight. Examples of more important links include:

Prominent links, higher up in the HTML
Topically relevant links, related to both the source document and the target document

Conversely, less important links include:

“Terms of Service” and footer links
Banner ads
Links unrelated to the document

Because the important links are more likely to be clicked by a “reasonable surfer,” a topically relevant link can carry more weight than an off-topic one.

“…when a topical cluster associated with the source document is related to a topical cluster associated with the target document, the link has a higher probability of being selected than when the topical cluster associated with the source document is unrelated to the topical cluster associated with the target document.”
– United States Patent: 7716225
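To make the weighting idea concrete, here is a small sketch that splits a page's link value in proportion to how likely each link is to be clicked. The three link features and their multipliers are invented for illustration; the patent does not publish actual weights.

```python
from dataclasses import dataclass

@dataclass
class Link:
    target: str
    prominent: bool = False    # placed high up in the HTML
    on_topic: bool = False     # source and target share a topic
    boilerplate: bool = False  # footer, terms of service, banner ad

def click_weight(link):
    """Hypothetical click likelihood; the multipliers are assumptions."""
    weight = 1.0
    if link.prominent:
        weight *= 2.0
    if link.on_topic:
        weight *= 3.0
    if link.boilerplate:
        weight *= 0.1
    return weight

def distribute(page_score, links):
    """Split a page's score across its links in proportion to click weight."""
    total = sum(click_weight(link) for link in links)
    return {link.target: page_score * click_weight(link) / total for link in links}

links = [
    Link("pizza-guide.example", prominent=True, on_topic=True),
    Link("terms.example", boilerplate=True),
    Link("unrelated.example"),
]
shares = distribute(1.0, links)
```

Under these assumed weights, the prominent, on-topic link receives most of the page's value, and the terms-of-service link receives very little.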

5. Phrase-based indexing

Not going to lie. Phrase-based indexing can be a tough concept to wrap your head around.

What’s important to understand is that phrase-based indexing allows search engines to score the relevancy of any link by looking for related phrases in both the source and target pages. The more related phrases, the higher the score.

In addition to ranking documents based on the most relevant links, phrase-based indexing allows search engines to do cool things with less relevant links, including:

Discounting spam and off-topic links: For example, an injected spam link to a gambling site from a page about cookie recipes will earn a very low outlink score based on relevancy, and would carry less weight.
Fighting “Google Bombing”: For those that remember, Google bombing is the art of ranking a page highly for funny or politically-motivated phrases by “bombing” it with anchor text links, often unrelated to the page itself. Phrase-based indexing can stop Google bombing by scoring the links for relevance against the actual text on the page. This way, irrelevant links can be discounted.
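A heavily simplified sketch of that scoring might look like the following: count how many phrases from a topic's related-phrase list appear on both the linking page and the target page. The phrase list, the page text, and the scoring formula are all assumptions for the example; the patented system is considerably more involved.

```python
def outlink_score(source_text, target_text, related_phrases):
    """Fraction of related phrases that appear on both source and target pages."""
    if not related_phrases:
        return 0.0
    source = source_text.lower()
    target = target_text.lower()
    shared = [p for p in related_phrases if p in source and p in target]
    return len(shared) / len(related_phrases)

# Hypothetical related phrases for a cookie recipe page.
phrases = ["chocolate chip", "baking sheet", "cookie dough", "preheat the oven"]

recipe_page = ("Preheat the oven, line a baking sheet, and scoop the "
               "cookie dough with chocolate chip pieces.")
baking_page = "Our cookie dough guide: chill it, then use a lined baking sheet."
casino_page = "Play online poker and slots with a welcome bonus."

on_topic_score = outlink_score(recipe_page, baking_page, phrases)
spam_score = outlink_score(recipe_page, casino_page, phrases)
```

Here the link to the baking page shares two of the four phrases with the recipe page, while the injected casino link shares none and would be the one to discount.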

6. Local inter-connectivity

Local inter-connectivity refers to a reranking concept that reorders search results based on measuring how often each page is linked to by all the other pages.

To put it simply, when a page is linked to from a number of high-ranking results, it is likely more relevant than a page with fewer links from the same set of results.

This also provides a strong hint as to the types of links you should be seeking: pages that already rank highly for your target term.
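One way to picture the reranking step is the sketch below. It takes an initial list of results, counts how many of the other results link to each page, and reorders by that count. The result list, the link data, and the tie-breaking rule (keep the original order) are assumptions for illustration only.

```python
def rerank_by_local_links(results, links):
    """Reorder results by how many other results in the same set link to them."""
    def local_inlinks(page):
        return sum(
            1 for other in results
            if other != page and page in links.get(other, ())
        )
    # sorted() is stable, so pages with equal counts keep their original order.
    return sorted(results, key=lambda page: -local_inlinks(page))

# Hypothetical initial ranking and the links between those results.
results = ["a.example", "b.example", "c.example", "d.example"]
links = {
    "a.example": ["c.example"],
    "b.example": ["c.example", "a.example"],
    "d.example": ["c.example"],
}
reranked = rerank_by_local_links(results, links)
```

In this example, c.example moves to the top because three of the other results link to it, even though it started in third place.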

7. The Golden Question

If the above concepts seem complex, the good news is you don’t have to actually understand the above concepts when trying to build links to your site.

To understand if a link is topically relevant to your site, simply ask yourself the golden question of link building: Will this link bring engaged, highly qualified visitors to my website?

The result of the golden question is exactly what Google engineers are trying to determine when evaluating links, so you can arrive at a good end result without understanding the actual algorithms.

About those links between sites you control…

One important thing to know is this: in nearly all of these Google patents and papers, every effort is made to count only “unbiased” links from unassociated sites, and discount links between sites and pages related to one another through preexisting relationships.

This means that both internal links and links between sites you own or control will be less valuable, while links from non-associated sites will carry far more weight.

Researching the impact of topical links

While it’s difficult to measure the direct effect these principles exert on Google’s search results (or even if Google uses them at all), we are able to correlate certain linking characteristics with higher rankings, especially around topical anchor text.

Below is a sample of results from our Search Engine Ranking Factors study that shows link features positively associated with higher Google rankings. Remember the usual caveat that correlation is not causation, but it sure is a hint.

It’s interesting to note that while both partial and exact match anchor text links correlate with higher rankings, they are both trumped by the overall number of unique websites linking to a page. This supports the notion that it’s best to have a wide variety of link types, including topically relevant links, as part of a healthy backlink profile.

Practical tips for topically relevant links

Consider this advice when thinking about links for SEO:

DO use good, descriptive anchor text for your links. This applies to internal links, outlinks to other sites, and links you seek from non-biased external sites.
AVOID generic or non-descriptive anchor text.
DO seek relationships from authoritative, topically relevant sites. These include sites that rank well for your target keyword, and “expert” pages that link to many authority sites. (For those interested, Majestic has done some interesting work around Topical Trust Flow.)
AVOID over-optimizing your links. This includes repetitive use of exact match anchor text and keyword stuffing.
DO seek links from relevant pages. This includes examining the title, body, related phrases, and intent of the page to ensure its relevancy to your target topic.
DO seek links that people are more likely to click. The ideal link is often both topically relevant and placed in a prominent position.
AVOID manipulative link building. Marie Haynes has written an excellent explanation of the kinds of unnatural links that you likely want to avoid at all cost.

Finally, DO try to earn and attract links to your site with high quality, topically relevant content.

What are your best tips around topically relevant links? Let us know in the comments below!

Big thanks to Abe Schmidt for his amazing animated graphics. If you like illustrated posts, here are 4 others useful at explaining SEO concepts:

Continue reading →

Announcing Moz’s New Beginner’s Guide to Content Marketing

Posted by Trevor-Klein

I’m thrilled to announce the next in Moz’s series of beginner’s guides:
The Beginner’s Guide to Content Marketing.

Content marketing is a field full of challenges. Creating content that provides great value to your audience (what we’ve come to call 10x content) is difficult enough, but content marketers also regularly encounter skeptical employers and clients, diminutive budgets, and (you guessed it) a noted lack of time to get it all done. You’re not alone. You’re fighting the good fight, and we’re here to back you up. So is Carl.

Meet Carl, the Content Cat. He’ll show up in every chapter of the guide for a little levity and to remind you that you’re in good company.

There’s no denying the importance of content marketing. In its annual study of more than 5,000 marketers, the Content Marketing Institute showed that about 70% of all marketers, B2B and B2C, are creating more content than they did one year ago. Nearly half of B2C marketers have a dedicated content marketing group in their organizations. While this guide is written primarily for those who are relatively new to content marketing, we’d certainly recommend that more advanced marketers take a look through, as we often find veteran teams are missing some key fundamentals.

Say no more; show me the guide!

What you’ll learn

The guide has nine chapters, and we’ve organized them in the order we think folks should think about them when they’re approaching content marketing. Start with planning and goals, move through ideation and execution, then wrap up with analysis and revisions to the process.

1. What is content marketing? Is it right for my business?

Before we dive too deep into strategy and tactics, there’s something we need to clear up: What in the world is content marketing, anyway? Look it up in 10 different places, and you’ll get 10 different answers to that question. In this chapter, we break it down and offer a look into whether or not it’s a worthwhile investment of your time (spoiler alert: It is).

2. Content strategy

Arguably the most important part of any content marketing effort, your content strategy is what keeps you aligned with your company’s goals, ensuring you’re putting your time and effort into areas that will help move needles and earn you the recognition you deserve. There’s more to it than meets the eye, though, and this chapter paints a holistic picture to get you started.

3. Content and the marketing funnel

Most folks who are new to content marketing assume that it belongs right at the top of the marketing funnel. We’d like to bust it out of that pigeonhole. The truth is that content belongs at every stage of the funnel, from brand awareness and early acquisition to retention of loyal customers. This chapter shows you which kinds of content typically work well for each major phase of the funnel.

4. Building a framework and a content team

There are some things you’ll need to figure out before you even start coming up with ideas for your content. What tools will you use to create it? What processes and standards will you put in place? Who will you work with, and how can you get them aligned with your goals? Setting the framework for your future success will save you from major headaches, and this chapter aims to make sure there’s nothing you’re overlooking.

5. Content ideation

We’ve all had it happen. We need to write something, be it a blog post, a whitepaper, or even an email, and when we sit down to make it happen, nothing. No ideas come to mind. Coming up with ideas for content that really resonates is deceptively difficult, but there are many tricks that’ll help get the proverbial gears turning. We’ll go through those in this chapter.

6. Content creation

After all that planning, it’s finally time to dive in and do the hard work of actually creating your content. From getting the formatting right and working with design/UX teams to the most important cliche you can remember — to focus on quality, not quantity — this chapter will help you make the most effective use of your time.

7. Content promotion

You’ve done it. You’ve put together a wonderful piece of 10x content, and can’t wait to see the response. Only one thing stands in your way: Getting it in front of the right people. From working with industry influencers to syndication and social promotion, there are a great many ways to connect your content with your audiences; it’s just a matter of choosing the right ones. This chapter aims to point you down the right path.

8. Analysis and reporting

Nobody (seriously, nobody) is able to perfectly target their audiences. We make assumptions based on what we know (and can surmise) about the things readers will find valuable. The only way we can get better is by taking a look at how our past content performed. That’s easier said than done, though, and data can often be misleading. This chapter shows you the basics of measurement and reporting so you can get an accurate picture of how things are going.

9. Iteration, maintenance, and growth

Like all aspects of marketing, content should be iterative. You should take a close look at how your past work resonated with your audience, learn from what went right (and what went wrong), and revise your approach next time around. It also pays to revisit your processes from time to time; as your organization and your audience grow, the tactics that served you well at the beginning could well be holding you back now. This chapter explores how you can scale your content efforts without sacrificing the quality you’ve worked so hard to instill.

What are we waiting for? Let’s get started!

Thanks

The biggest thanks and the majority of the credit for this guide go to Isla McKetta. She was an immense help with early planning, and wrote the lion’s share of the guide. Derric Wise led the UX efforts, illustrating much of the guide and bringing Carl the Content Cat to life. Huge thanks also go to both Kevin Engle and Abe Schmidt for their fantastic illustrations. Thanks as well to Lisa Wildwood for her keen editing eyes, and to Ronell Smith and Christy Correll for their additional reviews. This guide never would have happened without all of you. =)

Continue reading →

How to Prioritize SEO Tasks & Invest in High-Value Work Items – Whiteboard Friday

Posted by randfish

One thing we can all agree on: there’s a lot to think about when it comes to your SEO tasks. Even for the most organized among us, it can be really difficult to prioritize our to-dos and make sure we’re getting the highest return on them. In this week’s Whiteboard Friday, Rand tackles the question that’s a constant subtext in every SEO’s mind.

Click on the whiteboard image above to open a high resolution version in a new tab!

Video Transcription

Howdy, Moz fans, and welcome to another edition of Whiteboard Friday. This week we’re chatting about how to prioritize SEO tasks and specifically get the biggest bang for the buck that we possibly can.

I know that all of you have to deal with this, whether you are a consultant or at an agency and you’re working with a client and you’re trying to prioritize their SEO tasks in an audit or a set of recommendations that you’ve got, or you’re working on an ongoing basis in-house or as a consultant and you’re trying to tell a team or a boss or manager, “Hey these are all the SEO things that we could potentially do. Which ones should we do first? Which ones are going to get in this sprint, this quarter, or this cycle?” — whatever the cadence is that you’re using.

I wanted to give you some great ways that we here at Moz have done this and some of the things that I’ve seen from both very small companies, startups, all the way up to large enterprises.

SEO tasks

Look, the list of SEO tasks can be fairly enormous. It could be all sorts of things: rewrite our titles and descriptions, add rich snippets categories, create new user profile pages, rewrite the remaining dynamic URLs that we haven’t taken care of yet, or add some of the recommended internal links to the blog posts, or do outreach to some influencers that we know in this new space we’re getting into. You might have a huge list of these things that are potential SEO items. I actually urge you to make this list internally for yourself, either as a consulting team or an in-house team, as big as you possibly can.

I think it’s great to involve decision makers in this process. You reach out to a manager or the rest of your team or your client, whoever it is, and get all of their ideas as well, because you don’t want to walk into these prioritization meetings and then have them go, “Great, those are your priorities. But what about all these things that are my ideas?” You want to capture as many of these as you can. Then you go through a validation process. That’s really the focus of today.

Prioritization questions to ask yourself

The prioritization questions that I think all of us need to be asking ourselves before we decide which order tasks will go in and which ones we’re going to focus on are:

What company goals does this task serve or map to?

Look, if your company or the organization you’re working with doesn’t actually have big initiatives for the year or the quarter, that’s a whole other matter. I recommend that you make sure your organization gets on top of that or that you as a consultant, if you are a consultant, get a list of what those big goals are.

Those big things might be, hey, we’re trying to increase revenue from this particular product line, or we’re trying to drive more qualified users to sign up for this feature, or we’re trying to grow traffic to this specific section. Big company goals. It might even be weird things or non-marketing things, like we’re trying to recruit this quarter. It’s really important for us to focus on recruitment. So you might have an SEO task that maps to how do we get more people who are job seekers to our jobs pages, or how do we get our jobs listings more prominent in search results for relevant keywords — that kind of thing. They can map to all sorts of goals across a company.

What’s an estimated 30, 60, 90, and 1 year value?

Then, once we have those, we want to ask for an estimated range — this is very important — of value that the task will provide over the next X period of time. I like doing this in terms of several time periods. I don’t like to say we’re only going to estimate what the six month value is. I like to say, “What’s an estimated 30, 60, 90, and 1 year value?”

You don’t have to be that specific. You could say we’re only going to do this for a month and then for the next year. For each of those time periods here, you’d go here’s our low estimate, our mid estimate, and our high estimate of how this is going to impact traffic or conversion rate or whatever the goal is that you’re mapping to up here.
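As one possible way to write those estimates down, here is a small sketch that stores a low, mid, and high estimate for each time period and checks that each range is ordered. The task name, the goal, and every number below are made up for illustration; they are not real figures from Moz.

```python
from dataclasses import dataclass

@dataclass
class Estimate:
    low: float
    mid: float
    high: float

    def is_ordered(self):
        """A usable range has low <= mid <= high."""
        return self.low <= self.mid <= self.high

# Hypothetical task and numbers, for illustration only.
task = {
    "name": "Rewrite titles and descriptions",
    "goal": "Grow traffic to the blog section",
    # Days after launch -> estimated extra monthly visits.
    "value": {
        30: Estimate(low=0, mid=200, high=500),
        60: Estimate(low=100, mid=400, high=1000),
        90: Estimate(low=150, mid=600, high=1500),
        365: Estimate(low=300, mid=1200, high=3000),
    },
}

all_ordered = all(e.is_ordered() for e in task["value"].values())
```

Keeping the range explicit for each period makes it harder to present a single optimistic number as if it were a promise.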

Which teams/people are needed to accomplish this work, and what is their estimate of time needed?

Next, we want to ask which teams or people are needed to accomplish this work and what is their estimate of time needed. Important: what is their estimate, not what’s your estimate. I, as an SEO, think that it’s very, very simple to make small changes to a CMS to allow me to edit a rel=canonical tag. My web dev team tells me differently. I want their opinion. That’s what I want to represent in any sort of planning process.

If you’re working outside a company as a consultant or at an agency, you need to go validate with their web dev team, with their engineering team, what it’s going to take to make these changes. If you are a contractor and they work with a web dev contractor, you need to talk to that contractor about what it’s going to take.

You never want to present estimates that haven’t been validated by the right team. I might, for example, say there’s a big SEO change that we want to make here at Moz. I might need some help from UX folks, some help from content, some help from the SEOs themselves, and one dev for two weeks. I want to represent all of these different things completely in the planning process.

How will we capture metrics, measure if it’s working, and ID potential problems early?

Finally, last question I’ll ask in this prioritization is: How are we going to capture the right metrics around this, measure it, see that it’s working, and identify potential problems early on? One of the things that happens with SEO is sometimes something goes wrong — either in the planning phase or the implementation or the launch itself — or something unexpected happens. We update the user profiles to be way more SEO friendly and realize that in the new profile pages we no longer link to this very important piece of internal content that users had uploaded or had created, and so now we’ve lost a bunch of internal links to that and our indexation is dropping out. The user profile pages may be doing great, but that user-generated content is shrinking fast, and so we need to correct that immediately.

We have to be on the watch for those. That requires validation of design, some form of test if you can (sometimes it’s not needed but many times it is), some launch metrics so you can watch and see how it’s doing, and then ongoing metrics to tell you was that a good change and did it map well to what we predicted it was going to do.

General wisdom regarding prioritization

Just a few rules now that we’ve been through this process, some general wisdom around here. I think this is true in all aspects of professional life. Under-promise and over-deliver, especially on speed to execute. When you estimate all these things, make sure to leave yourself a nice healthy buffer and potential value. I like to be very conservative around how I think these types of things can move the needle on the metrics.

Leave teams and people room in their sprints or whatever the cadence is to do their daily and ongoing and maintenance types of work. You can’t go, “Well, there are four weeks in this time period for this sprint, so we’re going to have the dev do this thing that takes two weeks and that thing that takes two weeks.” Guess what? They have to do other work as well. You’re not the only team asking for things from them. They have their daily work that they’ve got to do. They have maintenance work. They have regular things that crop up that go wrong. They have email that needs to be answered. You’ve got to make sure that those are accounted for.

I mentioned this before. Never, ever, ever estimate on behalf of other people. It’s not just that you might be wrong about it. That’s actually only a small portion of the problem. The big part of the problem with estimating on behalf of others is then when they see it or when they’re asked to confirm it by a team, a manager, a client or whomever, they will inevitably get upset that you’ve estimated on their behalf and assumed that work will take a certain amount of time. You might’ve been way overestimating, so you feel like, “Hey, man, I left you tons of time. What are you worried about?”

The frustrating part is not being looped in early. I think, just as a general rule, human beings like to know that they are part of a process for the work that they have to do and not being told, “Okay, this is the work we’re assigning you. You had no input into it.” I promise you, too, if you have these conversations early, the work will get done faster and better than if you left those people out of those conversations.

Don’t present every option in planning. I know there’s a huge list of things here. What I don’t want you to do is go into a planning process or a client meeting or something like that, sit down and have that full list, and go, “All right. Here’s everything we evaluated. We evaluated 50 different things you could do for SEO.” No, bring them the top five, maybe even just the top three or so. You want to have just the best ones.

You should have the full list available somewhere so if they call up like, “Hey, did you think about doing this, did you think about doing that,” you can say, “Yeah, we did. We’ve done the diligence on it. This is the list of the best things that we’ve got, and here’s our recommended prioritization.” Then that might change around, as people have different opinions about value and which goals are more important in that time period, etc.

If possible, two of the earliest investments I recommend are A.) automated, easy-to-access metrics, building up a culture of metrics and a way to get those metrics easily so that every time you launch something new it doesn’t take you an inordinate amount of time to go get the metrics. Every week or month or quarter, however your reporting cycle goes, it doesn’t take you tons and tons of time to collect and report on those metrics. Automated metrics, especially for SEO, but all kinds of metrics are hugely valuable.

Second, CMS upgrades — things that make it such that your content team and your SEO team can make changes on the fly without having to involve developers, engineers, UX folks, all that kind of stuff. If you make it very easy for a content management system to enable editable titles and descriptions, make URLs easily rewritable, make things redirectable simply, allow for rel=canonical or other types of header changes, enable you to put schema markup into stuff, all those kinds of things — if that is right in the CMS and you can get that done early, then a ton of the things over here go from needing lots and lots of people involved to just the SEO or the SEO and the content person involved. That’s really, really nice.

All right, everyone, I look forward to hearing your thoughts and comments on prioritization methods. We’ll see you again next week for another edition of Whiteboard Friday. Take care.

Video transcription by Speechpad.com

Continue reading →

Good News: We Launched a New Index Early! Let’s Also Talk About 2015’s Mozscape Index Woes

Posted by LokiAstari

Good news, everyone: November’s Mozscape index is here! And it’s arrived earlier than expected.

Of late, we’ve faced some big challenges with the Mozscape index — and that’s hurt our customers and our data quality. I’m glad to say that we believe a lot of the troubles are now behind us and that this index, along with those to follow, will provide higher-quality, more relevant, and more useful link data.

Here are some details about this index release:

144,950,855,587 (145 billion) URLs
4,693,987,064 (4.7 billion) subdomains
198,411,412 (198 million) root domains
882,209,713,340 (882 billion) links
Followed vs nofollowed links
- 3.25% of all links found were nofollowed
- 63.49% of nofollowed links are internal
- 36.51% are external
Rel canonical: 24.62% of all pages employ the rel=canonical tag
The average page has 81 links on it
- 69 internal links on average
- 12 external links on average
Correlations with Google Rankings:
- Page Authority: 0.307
- Domain Authority: 0.176
- Linking C-Blocks to URL: 0.255

You’ll notice this index is a bit smaller than much of what we’ve released this year. That’s intentional on our part, in order to get fresher, higher-quality stuff and cut out a lot of the junk you may have seen in older indices. DA and PA scores should be more accurate in this index (accurate meaning more representative of how a domain or page will perform in Google based on link equity factors), and that accuracy should continue to climb in the next few indices. We’ll keep a close eye on it and, as always, report the metrics transparently on our index update release page.

What’s been up with the index over the last year?

Let’s be blunt: the Mozscape index has had a hard time this year. We’ve been slow to release, and the size of the index has jumped around.

Before we get down into the details of what happened, here’s the good news: We’re confident that we have found the underlying problem and the index can now improve. For our own peace of mind and to ensure stability, we will be growing the index slowly in the next quarter, planning for a release at least once a month (or quicker, if possible).

Also on the bright side, some of the improvements we made while trying to find the problem have increased the speed of our crawlers, and we are now hitting just over a billion pages a day.

We had a bug.

There was a small bug in our scheduling code (this is different from the code that creates the index, so our metrics were still good). Previously, this bug had been benign, but due to several other minor issues (when it rains, it pours!), it had a snowball effect and caused some large problems. This made identifying and tracking down the original problem relatively hard.

The bug had far-reaching consequences…

The bug was causing lower-value domains to be crawled more frequently than they should have been. This happened because we crawled a huge number of low-quality sites for a 30-day period (we’ll elaborate on this further down), and then generated an index with them. In turn, this raised all these sites’ domain authority above a certain threshold where they would have otherwise been ignored, when the bug was benign. Now that they crossed this threshold (from a DA of 0 to a DA of 1), the bug was acting on them, and when crawls were scheduled, these domains were treated as if they had a DA of 5 or 6. Billions of low-quality sites were flooding the schedule with pages that caused us to crawl fewer pages on high-quality sites because we were using the crawl budget to crawl lots of low-quality sites.

…And index quality was affected.

We noticed the drop in high-quality domain pages being crawled. As a result, we started using more and more data to build the index, increasing the size of our crawler fleet so that we expanded daily capacity to offset the low numbers and make sure we had enough pages from high-quality domains to get a quality index that accurately reflected PA/DA for our customers. This was a bit of a manual process, and we got it wrong twice: once on the low side, causing us to cancel index #49, and once on the high side, making index #48 huge.

Though we worked aggressively to maintain the quality of the index, importing more data meant it took longer to process the data and build the index. Additionally, because of the odd shape of some of the domains (see below) our algorithms and hardware cluster were put under some unusual stress that caused hot spots in our processing, exaggerating some of the delays.

However, in the final analysis, we maintained the approximate size and shape of good-quality domains, and thus PA and DA were being preserved in their quality for our customers.

There were a few contributing factors:

We imported a new set of domains from a partner company.

We basically did a swap with them. We showed them all the domains we had seen, and they would show us all the domains they had seen. We had a corpus of 390 million domains, while they had 450 million domains. A lot of this was overlap, but afterwards, we had approximately 470 million domains available to our schedulers.

On the face of it, that doesn’t sound so bad. However, it turns out a large chunk of the new domains we received were domains in .pw and .cn. Not a perfect fit for Moz, as most of our customers are in North America and Europe, but it does provide a more accurate description of the web, which in turn creates better Page/Domain authority values (in theory). More on this below.

Palau, a small island nation in the middle of the Pacific Ocean.

Palau has the TLD of .pw. Seems harmless, right? In the last couple of years, the domain registrar of Palau has been aggressively marketing itself as the “Professional Web” TLD. This seems to have attracted a lot of spammers (enough that even Symantec took notice).

The result was that we got a lot of spam from Palau in our index. That shouldn’t have been a big deal, in the grand scheme of things. But, as it turns out, there’s a lot of spam in Palau. In one index, domains with the .pw extension reached 5% of the domains in our index. As a reference point, that’s more than most European countries.

More interestingly, though, there seem to be a lot of links to .pw domains, but very few outlinks from .pw to any other part of the web.

Here’s a graph showing the outlinks per domain for each region of the index:

[Graph: outlinks per domain for each region of the index]

China and its subdomains (also known as FQDNs).

In China, it seems to be relatively common for domains to have lots of subdomains. Normally, we can handle a site with a lot of subdomains (blogspot.com and wordpress.com are perfect examples of sites with many, many subdomains). But within the .cn TLD, 2% of domains have over 10,000 subdomains, and 80% have several thousand subdomains. This is much rarer in North America and in Europe, in spite of a few outliers like WordPress and Blogspot.

Historically, the Mozscape index has slowly grown the total number of FQDNs, from ¼ billion in 2010 to 1 billion in 2013. Then, in 2014, we started to expand and got 6 billion FQDNs in the index. In 2015, one index had 56 billion FQDNs!

We found that a whopping 45 billion of those FQDNs were coming from only 250,000 domains. That means, on average, these sites had 180,000 subdomains each. (The record was 10 million subdomains for a single domain.)
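For anyone who wants to check the arithmetic, the figures quoted above work out as follows. This is just a quick calculation using the numbers from this post; the variable names are ours.

```python
# Figures quoted in the post.
total_fqdns_2015 = 56_000_000_000   # FQDNs in the largest 2015 index
concentrated_fqdns = 45_000_000_000  # FQDNs traced to a small set of domains
concentrated_domains = 250_000

# Average number of subdomains per domain in that small set.
average_subdomains = concentrated_fqdns / concentrated_domains

# Share of the whole index's FQDNs that came from those domains.
share_of_index = concentrated_fqdns / total_fqdns_2015
```

The average comes out to 180,000 subdomains per domain, and those 250,000 domains account for roughly 80% of all FQDNs in that index.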

Chinese sites are fond of links.

We started running across pages with thousands of links per page. It’s not terribly uncommon to have a large number of links on a particular page. However, we started to run into domains with tens of thousands of links per page, and tens of thousands of pages on the same site with these characteristics.

At the peak, we had two pages in the index with over 16,000 links on each of these pages. These could have been quite legitimate pages, but it was hard to tell, given the language barrier. However, in terms of SEO analysis, these pages were providing very little link equity and thus not contributing much to the index.

This is not exclusively a problem with the .cn TLD; this happens on a lot of spammy sites. But we did find a huge cluster of sites in the .cn TLD that were close together lexicographically, causing a hot spot in our processing cluster.

We had a 12-hour DNS outage that went unnoticed.

DNS is the backbone of the Internet. It should never die. If DNS fails, the Internet more or less dies, as it becomes impossible to look up the IP address of a domain. Our crawlers, unfortunately, experienced a DNS outage.

The crawlers continued to crawl, but marked all the pages they crawled as DNS failures. Generally, when we have a DNS failure, it’s because a domain has “died,” or been taken offline. (Fun fact: the average life expectancy of a domain is 40 days.) This information is passed back to the schedulers, and the domain is blacklisted for 30 days, then retried. If it fails again, then we remove it from the schedulers.
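The blacklisting rule described above can be sketched roughly as follows. This is an illustration only, not Moz's scheduler code: the `DomainState` class and both function names are hypothetical, and only the 30-day period and the "fail twice, then remove" behavior come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

BLACKLIST_PERIOD = timedelta(days=30)  # period stated in the text


@dataclass
class DomainState:
    """Hypothetical per-domain record kept by a scheduler."""
    name: str
    dns_failures: int = 0
    blacklisted_until: Optional[datetime] = None
    removed: bool = False


def record_dns_failure(state: DomainState, now: datetime) -> None:
    """First failure: blacklist for 30 days. Failure on the retry: remove."""
    state.dns_failures += 1
    if state.dns_failures == 1:
        state.blacklisted_until = now + BLACKLIST_PERIOD
    else:
        state.removed = True


def is_schedulable(state: DomainState, now: datetime) -> bool:
    """A domain can be crawled if it is neither removed nor still blacklisted."""
    if state.removed:
        return False
    return state.blacklisted_until is None or now >= state.blacklisted_until


# Usage: one failure keeps the domain out of the schedule for 30 days.
start = datetime(2015, 1, 1)
domain = DomainState("example.com")
record_dns_failure(domain, start)
print(is_schedulable(domain, start + timedelta(days=1)))   # False
print(is_schedulable(domain, start + timedelta(days=30)))  # True
```

Note that a rule like this cannot tell a dead domain apart from a resolver outage on the crawler's side: both look like a DNS failure, which is why the outage had such a wide effect.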

In a 12-hour period, we crawl a lot of sites (approximately 500,000). We ended up banning a lot of sites from being recrawled for a 30-day period, and many of them were high-value domains.

Because we banned a lot of high-value domains, we filled that space with lower-quality domains for 30 days. This isn’t a huge problem for the index, as we use more than 30 days of data — in the end, we still included the quality domains. But it did cause a skew in what we crawled, and we took a deep dive into the .cn and .pw TLDs.

This caused the perfect storm.

We imported a lot of new domains (whose initial DA is unknown) that we had not seen previously. These would have been crawled slowly over time and would likely have been assigned a DA of 0, because their linkage with other domains in the index would be minimal.

But, because we had a DNS outage that caused a large number of high-quality domains to be banned, we replaced them in the schedule with a lot of low-quality domains from the .pw and .cn TLDs for a 30-day period. These domains, though not connected to other domains in the index, were highly connected to each other. Thus, when an index was generated with this information, a significant percentage of these domains gained enough DA to make the bug in scheduling non-benign.

With lots of low-quality domains now being available for scheduling, we used up a significant percentage of our crawl budget on low-quality sites. This had the effect of making our crawl of high-quality sites more shallow, while the low-quality sites were either dead or very slow to respond — this caused a reduction in the total number of actual pages crawled.

Another side effect was the shape of the domains we crawled. As noted above, domains with the .pw and .cn TLDs seem to have a different linking strategy (both externally to one another and internally to themselves) compared with North American and European sites. This unexpected data shape created hot spots in our processing cluster, which increased the time required to process the data.

What measures have we taken to solve this?

We fixed the originally benign bug in scheduling. This was a two-line code change to make sure that domains were correctly categorized by their Domain Authority. We use DA to determine how deeply to crawl a domain.

During this year, we have increased our crawler fleet and added some extra checks in the scheduler. With these new additions and the bug fix, we are now crawling at record rates and seeing more than 1 billion pages a day being checked by our crawlers.

We’ve also improved.

There’s a silver lining to all of this. The interesting shapes of data we saw caused us to examine several bottlenecks in our code and optimize them. This helped improve our performance in generating an index. We can now automatically handle some odd shapes in the data without any intervention, so we should see fewer issues with the processing cluster.

More restrictions were added.

We have a maximum link limit per page (the first 2,000).
We have banned domains with an excessive number of subdomains.
- Any domain that has more than 10,000 subdomains has been banned…
- …Unless it is explicitly whitelisted (e.g. WordPress.com).
  - We have ~70,000 whitelisted domains.
- This ban affects approximately 250,000 domains (most with .cn and .pw TLDs)…
  - …and has removed 45 billion subdomains. Yes, BILLION! You can bet that was clogging up a lot of our crawl bandwidth with sites Google probably doesn’t care much about.
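The two restrictions above amount to simple filters. Here is a minimal sketch, assuming a list of links per page and a known subdomain count per domain. The function names and the whitelist contents are hypothetical; only the 2,000-link and 10,000-subdomain limits come from the list above.

```python
from __future__ import annotations

MAX_LINKS_PER_PAGE = 2_000  # only the first 2,000 links on a page are kept
MAX_SUBDOMAINS = 10_000     # domains above this are banned unless whitelisted


def links_to_keep(links: list[str]) -> list[str]:
    """Keep only the first MAX_LINKS_PER_PAGE links found on a page."""
    return links[:MAX_LINKS_PER_PAGE]


def is_domain_banned(domain: str, subdomain_count: int, whitelist: set[str]) -> bool:
    """Ban domains with too many subdomains, except whitelisted ones."""
    if domain.lower() in whitelist:
        return False
    return subdomain_count > MAX_SUBDOMAINS


# Usage with a made-up whitelist and made-up domains.
whitelist = {"wordpress.com", "blogspot.com"}
print(is_domain_banned("wordpress.com", 10_000_000, whitelist))  # False
print(is_domain_banned("spam.example", 180_000, whitelist))      # True
print(len(links_to_keep(["http://example.com/"] * 16_000)))      # 2000
```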

We made positive changes.

Better monitoring of DNS (complete with alarms).
Banning domains after DNS failure is not automatic for high-quality domains (but still is for low-quality domains).
Several code quality improvements that will make generating the index faster.
We’ve doubled our crawler fleet, with more improvements to come.
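The revised DNS rule might look something like the sketch below. The DA cut-off for "high-quality" is a made-up value for illustration, since the actual threshold is not stated here, and the function name is hypothetical.

```python
HIGH_QUALITY_DA = 50  # hypothetical cut-off; the real threshold is not given


def should_auto_ban(domain_authority: float, dns_failed: bool) -> bool:
    """Auto-ban after a DNS failure only when the domain is low-quality."""
    return dns_failed and domain_authority < HIGH_QUALITY_DA


print(should_auto_ban(domain_authority=5, dns_failed=True))   # True
print(should_auto_ban(domain_authority=80, dns_failed=True))  # False
```

With a check like this, a resolver outage can no longer push high-value domains out of the schedule for 30 days on its own.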

Now, how are things looking for 2016?

Good! But I’ve been told I need to be more specific. 🙂

Before we get to 2016, we still have a good portion of 2015 to go. Our plan is to stabilize the index at around 180 billion URLs for the end of the year and release an index predictably every three weeks.

We are also in the process of improving our correlations to Google’s index. Currently our fit is pretty good at a 75% match, but we’ve been higher at around 80%; we’re testing a new technique to improve our metrics correlations and Google coverage beyond that. This will be an ongoing process, and though we expect to see improvements in 2015, these improvements will continue on into 2016.

Our index struggles this year have taught us some very valuable lessons. We’ve identified some bottlenecks and their causes. We’re going to attack these bottlenecks and improve the performance of the processing cluster to get the index out quicker for you.

We’ve improved the crawling cluster and now exceed a billion pages a day. That’s a lot of pages. And guess what? We still have some spare bandwidth in our data center to crawl more sites. We plan to improve the crawlers to increase our crawl rate, reducing the number of historical days in our index and allowing us to see much more recent data.

In summary, in 2016, expect to see larger indexes, at a more consistent time frame, using less historical data, that maps closer to Google’s own index. And thank you for bearing with us, through the hard times and the good — we could never do it without you.
