
Should SEOs Care About Internal Links? – Whiteboard Friday

Posted by randfish

Internal links are one of those essential SEO items you have to get right to avoid getting them really wrong. Rand shares 18 tips to help inform your strategy, going into detail about their attributes, internal vs. external links, ideal link structures, and much, much more in this edition of Whiteboard Friday.


Video Transcription

Howdy, Moz fans, and welcome to another edition of Whiteboard Friday. This week we’re going to chat a little bit about internal links and internal link structures. Now, it is not the most exciting thing in the SEO world, but it’s something that you have to get right and getting it wrong can actually cause lots of problems.

Attributes of internal links

So let’s start by talking about some of the things that are true about internal links. Internal links, when I say that phrase, what I mean is a link that exists on a website, let’s say ABC.com here, that is linking to a page on the same website, so over here, linking to another page on ABC.com. We’ll do /A and /B. This is actually my shipping routes page. So you can see I’m linking from A to B with the anchor text “shipping routes.”

The idea of an internal link is really initially to drive visitors from one place to another, to show them where they need to go to navigate from one spot on your site to another spot. They’re different from external links only in that, in the HTML code, you’re pointing to the same fundamental root domain. In the initial early versions of the internet, that didn’t matter all that much, but for SEO, it matters quite a bit because external links are treated very differently from internal links. That is not to say, however, that internal links have no power or no ability to change rankings, to change crawling patterns and to change how a search engine views your site. That’s what we need to chat about.
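To make the "same root domain" distinction concrete, here is a minimal Python sketch that labels a link as internal or external. The helper names `site_key` and `is_internal` are made up for illustration, and comparing hostnames with a leading "www." stripped is a simplifying assumption: it treats subdomains such as blog.abc.com as a different site, and a real root-domain check would need the Public Suffix List.

```python
from urllib.parse import urljoin, urlparse


def site_key(url: str) -> str:
    """Return a rough site identifier: the hostname without a leading 'www.'.

    Assumption for illustration only; this is not a true root-domain lookup.
    """
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_internal(page_url: str, href: str) -> bool:
    """True when href, resolved against page_url, stays on the same site."""
    target = urljoin(page_url, href)  # resolves relative links such as "/B"
    return site_key(target) == site_key(page_url)


page = "https://abc.com/A"
print(is_internal(page, "/B"))                     # relative link, same site
print(is_internal(page, "https://www.abc.com/B"))  # absolute link, same site
print(is_internal(page, "https://example.org/B"))  # link to a different site
```

The first two calls report an internal link and the third an external one.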



1. Anchor text is something that can be considered. The search engines have generally minimized its importance, but it’s certainly something that’s in there for internal links.

2. The location on the page actually matters quite a bit, just as it does with external links. With internal links, it matters almost more, because navigation and footers specifically have attributes around internal links that can be problematic.

Those are essentially when Google in particular sees manipulation in the internal link structure, specifically things like you’ve stuffed anchor text into all of the internal links trying to get this shipping routes page ranking by putting a little link down here in the footer of every single page and then pointing over here trying to game and manipulate us, they hate that. In fact, there is an algorithmic penalty for that kind of stuff, and we can see it very directly.



We’ve actually run tests where we’ve observed that jamming these kinds of anchor text-rich links into footers or into navigation gets a site ranking poorly, and removing them gets it ranking well again. Google reverses that penalty pretty quickly too, which is nice. So if you are not ranking well and you’re like, “Oh no, Rand, I’ve been doing a lot of that,” maybe take it away. Your rankings might come right back. That’s great.



3. The link target matters obviously from one place to another.

4. The importance of the linking page, this is actually a big one with internal links. So it is generally the case that if a page on your website has lots of external links pointing to it, it gains authority and it has more ability to sort of generate a little bit, not nearly as much as external links, but a little bit of ranking power and influence by linking to other pages. So if you have very well-linked-to pages on your site, you should make sure to link out from those to pages on your site that a) need it and b) are actually useful for your users. That’s another signal we’ll talk about.



5. The relevance of the link, so pointing to my shipping routes page from a page about other types of shipping information, totally great. Pointing to it from my dog food page, well, it doesn’t make great sense. Unless I’m talking about shipping routes of dog food specifically, it seems like it’s lacking some of that context, and search engines can pick up on that as well.

6. The first link on the page. So this matters mostly in terms of the anchor text, just as it does for external links. Basically, if you are linking in a bunch of different places to this page from this one, Google will usually, at least in all of our experiments so far, count the first anchor text only. So if I have six different links to this and the first link says “Click here,” “Click here” is the anchor text that Google is going to apply, not “Click here” and “shipping routes” and “shipping.” Those subsequent links won’t matter as much.

7. Then the type of link matters too. Obviously, I would recommend that you keep it in the HTML link format rather than trying to do something fancy with JavaScript. Even though Google can technically follow those, it looks to us like they’re not treated with quite the same authority and ranking influence. Text is slightly, slightly better than images in our testing, although that testing is a few years old at this point. So maybe image links are treated exactly the same. Either way, do make sure you have that. If you’re doing image links, by the way, remember that the alt attribute of that image is what becomes the anchor text of that link.
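Points 6 and 7 can be checked on your own pages with a short script. The sketch below uses Python's standard `html.parser` to list each link target with the anchor text of its first occurrence, using an image's alt attribute as the anchor text for image links. The class and function names are hypothetical, and "first anchor text wins" is the behavior observed in the experiments described above, not a documented search engine rule.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.links = []    # list of (href, anchor_text)
        self._href = None  # href of the <a> currently open, if any
        self._parts = []   # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._href, self._parts = attrs["href"], []
        elif tag == "img" and self._href is not None:
            # For an image link, the alt attribute stands in for anchor text.
            self._parts.append(attrs.get("alt") or "")

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join(" ".join(self._parts).split())  # tidy whitespace
            self.links.append((self._href, text))
            self._href = None


def first_anchor_per_target(html: str) -> dict:
    """Map each href to the anchor text of its first occurrence on the page."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    first = {}
    for href, text in parser.links:
        first.setdefault(href, text)  # later links to the same href are ignored
    return first


sample = (
    '<p><a href="/B">Click here</a> to see our '
    '<a href="/B">shipping routes</a>. '
    '<a href="/C"><img src="map.png" alt="route map"></a></p>'
)
result = first_anchor_per_target(sample)
print(result)  # {'/B': 'Click here', '/C': 'route map'}
```

In the sample, the second link to /B ("shipping routes") is ignored because "Click here" came first, which is exactly the situation to look for and fix.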

Internal versus external links

A. External links usually give more authority and ranking ability.

That shouldn’t be surprising. An external link is like a vote from an independent, hopefully independent, hopefully editorially given website to your website saying, “This is a good place for you to go for this type of information.” On your own site, it’s like a vote for yourself, so engines don’t treat it the same.

B. Anchor text of internal links generally has less influence.

So, as we mentioned, me pointing to my page with the phrase that I want to rank for isn’t necessarily a bad thing, but I shouldn’t do it in a manipulative way. I shouldn’t do it in a way that’s going to look spammy or sketchy to visitors, because if visitors stop clicking around my site or engaging with it or they bounce more, I will definitely lose ranking influence much faster than if I simply make those links credible and usable and useful to visitors. Besides, the anchor text of internal links is not as powerful anyway.



C. A lack of internal links can seriously hamper a page’s ability to get crawled + ranked.

It is, however, the case that a lack of internal links, like an orphan page that doesn’t have many internal or any internal links from the rest of its website, that can really hamper a page’s ability to rank. Sometimes it will happen. External links will point to a page. You’ll see that page in your analytics or in a report about your links from Moz or Ahrefs or Majestic, and then you go, “Oh my gosh, I’m not linking to that page at all from anywhere else on my site.” That’s a bad idea. Don’t do that. That is definitely problematic.
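Finding orphan pages is a simple set comparison once you have a list of pages that should exist and a map of which pages link where. The sketch below assumes you have already gathered both (for example, from a sitemap and a crawl); the function name and the sample paths are made up for illustration.

```python
def find_orphans(known_pages, internal_links):
    """Return known pages that no other page on the site links to.

    known_pages: iterable of URL paths you expect to exist (for example from a
        sitemap or a backlink report).
    internal_links: dict mapping each source path to the paths it links to.
    """
    linked_to = set()
    for source, targets in internal_links.items():
        linked_to.update(t for t in targets if t != source)  # ignore self-links
    return sorted(set(known_pages) - linked_to)


internal_links = {
    "/": ["/A", "/B"],
    "/A": ["/B"],
    "/B": ["/"],
    "/old-guide": ["/"],  # links out, but nothing links in
}
known_pages = ["/", "/A", "/B", "/old-guide"]
print(find_orphans(known_pages, internal_links))  # ['/old-guide']
```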

D. It’s still the case, by the way, that, broadly speaking, pages with more links on them will send less link value per link.

So, essentially, you remember the original PageRank formula from Google. It said basically like, “Oh, well, if there are five links, send one-fifth of the PageRank power to each of those, and if there are four links, send one-fourth.” Obviously, one-fourth is bigger than one-fifth. So taking away that fifth link could mean that each of the four pages that you’ve linked to get a little bit more ranking authority and influence in the original PageRank algorithm.
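The one-fifth versus one-fourth arithmetic can be written out in a few lines of Python. This is only the simplified even split described above; it ignores the damping factor in the original PageRank formula and says nothing about how Google weighs links today.

```python
from fractions import Fraction


def share_per_link(link_count: int) -> Fraction:
    """Fraction of a page's passable value that each link receives
    when that value is split evenly across all links on the page."""
    if link_count < 1:
        raise ValueError("a page needs at least one link to pass value")
    return Fraction(1, link_count)


five = share_per_link(5)
four = share_per_link(4)
print(five, four)   # 1/5 1/4
print(four - five)  # 1/20 gained by each remaining link
```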

Look, PageRank is old, very, very old at this point, but at least the theories behind it are not completely gone. So it is the case that if you have a page with tons and tons of links on it, that tends to send out less authority and influence than a page with few links on it, which is why it can definitely pay to do some spring cleaning on your website and clear out any rubbish pages or rubbish links, ones that visitors don’t want, that search engines don’t want, that you don’t care about. Clearing that up can actually have a positive influence. We’ve seen that on a number of websites where they’ve cleaned up their information architecture, whittled down their links to just the stuff that matters the most and the pages that matter the most, and then seen increased rankings across the board from all sorts of signals, positive signals, user engagement signals, link signals, context signals that help the engines rank them better.

E. Internal link flow (aka PR sculpting) is rarely effective, and usually has only mild effects… BUT a little of the right internal linking can go a long way.

Then finally, I do want to point out what was previously called PageRank sculpting. You have probably heard of it in the SEO world. This was a practice that I’d say from maybe 2003, 2002 to about 2008, 2009, had this life where there would be panel discussions about PageRank sculpting and all these examples of how to do it and software that would crawl your site and show you the ideal PageRank sculpting system to use and which pages to link to and not.



When PageRank was the dominant algorithm inside of Google’s ranking system, yeah, it was the case that PageRank sculpting could have some real effect. These days, that is dramatically reduced. It’s not entirely gone because of some of these other principles that we’ve talked about, just having lots of links on a page for no particularly good reason is generally bad and can have harmful effects and having few carefully chosen ones has good effects. But most of the time, internal linking, optimizing internal linking beyond a certain point is not very valuable, not a great value add.

But a little of what I’m calling the right internal linking, that’s what we’re going to talk about, can go a long way. For example, if you have those orphan pages or pages that are clearly the next step in a process or that users want and they cannot find them or engines can’t find them through the link structure, it’s bad. Fixing that can have a positive impact.


Ideal internal link structures

So ideally, in an internal linking structure, you want something kind of like this. This is a very rough illustration here. The homepage has maybe 100 links on it to internal pages. One hop away from that, you’ve got your 100 different pages of whatever it is, subcategories or category pages, places that can get folks deeper into your website. Then from there, each of those has maybe a maximum of 100 unique links, and they get you 2 hops away from the homepage, which takes you to 10,000 pages that do the same thing.



I. No page should be more than 3 link “hops” away from another (on most small–>medium sites).

Now, the idea behind this is that basically in one, two, three hops, three links away from the homepage and three links away from any page on the site, I can get to up to a million pages. So when you talk about, “How many clicks do I have to get? How far away is this in terms of link distance from any other page on the site?” a great internal linking structure should be able to get you there in three or fewer link hops. If it’s a lot more, you might have an internal linking structure that’s really creating sort of these long pathways of forcing you to click before you can ever reach something, and that is not ideal, which is why it can make very good sense to build smart categories and subcategories to help people get in there.

I’ll give you the most basic example in the world, a traditional blog. In order to reach any post that was published two years ago, I’ve got to click Next, Next, Next, Next, Next, Next through all this pagination until I finally get there. Or if I’ve done a really good job with my categories and my subcategories, I can click on the category of that blog post and I can find it very quickly in a list of the last 50 blog posts in that particular category, great, or by author or by tag, however you’re doing your navigation.
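Link distance is easy to measure once you have a map of which pages link where. The following Python sketch runs a breadth-first search from the homepage and compares the two blog setups just described. The URL paths are invented for the example, and the "up to a million pages" figure assumes every page carries 100 links to pages not reached any other way.

```python
from collections import deque


def click_depths(links, start="/"):
    """Breadth-first search: fewest link hops from start to each reachable page."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:  # first visit is the shortest path
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


# Paginated blog: the old post sits at the end of a Next, Next, Next chain.
paginated = {
    "/": ["/page/2"],
    "/page/2": ["/page/3"],
    "/page/3": ["/page/4"],
    "/page/4": ["/old-post"],
}
# Same post, reachable through a category listing.
categorized = {
    "/": ["/category/shipping"],
    "/category/shipping": ["/old-post"],
}
print(click_depths(paginated)["/old-post"])    # 4
print(click_depths(categorized)["/old-post"])  # 2
print(100 ** 3)  # 1000000 pages within 3 hops at 100 unique links per page
```

Any page whose depth comes out above 3 is a candidate for a new category, tag, or hub link.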



II. Pages should contain links that visitors will find relevant and useful.

If no one ever clicks on a link, that is a bad signal for your site, and it is a bad signal for Google as well. I don’t just mean no one ever. Very, very few people ever and many of them who do click it click the back button because it wasn’t what they wanted. That’s also a bad sign.

III. Just as no two pages should be targeting the same keyword or searcher intent, likewise no two links should be using the same anchor text to point to different pages. Canonicalize!

For example, if over here I had a shipping routes link that pointed to this page and then another shipping routes link, same anchor text pointing to a separate page, page C, why am I doing that? Why am I creating competition between my own two pages? Why am I having two things that serve the same function or at least to visitors would appear to serve the same function and search engines too? I should canonicalize those. Canonicalize those links, canonicalize those pages. If a page is serving the same intent and keywords, keep it together.
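A quick way to spot this problem is to group your internal links by anchor text and flag any text that points to more than one page. The sketch below assumes you already have a list of (source, anchor text, target) tuples from a crawl; the function name and sample data are hypothetical.

```python
from collections import defaultdict


def conflicting_anchors(links):
    """Find anchor texts that point to more than one target page.

    links: iterable of (source, anchor_text, target) tuples.
    """
    targets_by_anchor = defaultdict(set)
    for _source, anchor, target in links:
        # Compare anchor text case-insensitively and without stray spaces.
        targets_by_anchor[anchor.strip().lower()].add(target)
    return {a: sorted(t) for a, t in targets_by_anchor.items() if len(t) > 1}


links = [
    ("/A", "shipping routes", "/B"),
    ("/D", "Shipping Routes", "/C"),
    ("/A", "contact us", "/contact"),
]
print(conflicting_anchors(links))  # {'shipping routes': ['/B', '/C']}
```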

IV. Limit use of the rel=”nofollow” to UGC or specific untrusted external links. It won’t help your internal link flow efforts for SEO.

Rel=”nofollow” was sort of the classic way that people had been doing PageRank sculpting that we talked about earlier here. I would strongly recommend against using it for that purpose. Google said that they’ve put in some preventative measures so that rel=”nofollow” links sort of do this leaking PageRank thing, as they call it. I wouldn’t stress too much about that, but I certainly wouldn’t use rel=”nofollow.”

What I would do is if I’m trying to do internal link sculpting, I would just do careful curation of the links and pages that I’ve got. That is the best way to help your internal link flow. That’s things like…



V. Removing low-value content, low-engagement content and creating internal links that people actually do want. That is going to give you the best results.

VI. Don’t orphan! Make sure pages that matter have links to (and from) them.

Last, but not least, there should never be an orphan. There should never be a page with no links to it, and certainly there should never be a page that is well linked to that isn’t linking back out to portions of your site that are of interest or value to visitors and to Google.

So following these practices, I think you can do some awesome internal link analysis, internal link optimization and help your SEO efforts and the value visitors get from your site. We’ll see you again next week for another edition of Whiteboard Friday. Take care.

Video transcription by Speechpad.com



It’s Here: The Finalized MozCon 2017 Agenda

Posted by ronell-smith

That sound you hear is the coming together of MozCon 2017.

[You can hear that, right? It’s not just me.]

With less than two months to go, most of the nuts and bolts of the event have been fastened together to create what looks to be one of the strongest MozCons in history. Yeah, that’s saying a lot, but once you’ve perused the speakers’ lineup, we’re sure you’ll agree.

MozCon has a rich tradition of bringing together the best and brightest minds in digital marketing, creating a place for individuals across the globe to learn from top-notch speakers, network, share ideas, and learn about the tools, services, and tactics they can put to use in their work and their business.

As a bonus, attendees also get to enjoy lots of snacks, coffee and lots and lots of bacon.

Also, this year we’ll offer pre-MozCon SEO workshops on Sunday, July 16. Keep reading for more info.

You will, however, need a ticket to attend the event, so you might want to take care of that sooner rather than later, since it always sells out:

Buy my MozCon 2017 ticket!

Now for the meaty details you’ve been waiting for.

The MozCon 2017 Agenda

Monday


08:00–09:00am
Breakfast



09:00–09:20am
Welcome to MozCon 2017

Rand Fishkin, Wizard of Moz
@randfish

Rand Fishkin is the founder and former CEO of Moz, co-author of a pair of books on SEO, and co-founder of Inbound.org. Rand’s an un-save-able addict of all things content, search, and social on the web.


09:20–10:05am
How to Get Big Links

Lisa Myers, Verve Search
@LisaDMyers

Everyone wants links and coverage from sites such as New York Times, the Wall Street Journal, and the BBC, but very few achieve it. This is how we cracked it. Over and over.

Lisa is the founder and CEO of award-winning SEO agency Verve Search and founder of Womeninsearch.net. Feminist, mother of two, and modern-day shield maiden.


10:05–10:35am
Data-Driven Design

Oli Gardner, Unbounce
@oligardner

Data-Driven Design (3D) is an actionable, evidence-based framework for creating websites & landing pages that will increase your leads, sales, and customers. In this session you’ll learn how to use the latest industry conversion data to inform copywriting and design decisions that impact conversions. Additionally, I’ll share a new methodology for prioritizing your marketing optimization that will show you which pages are awesome (leave them alone), which pages aren’t (massive ROI potential here), and help you develop a common language that your teams of marketers, designers, and copywriters can use to work better together to collectively increase your conversion rates.

Oli, founder of Unbounce, is on a mission to rid the world of marketing mediocrity by using data-informed copywriting, design, interaction, and psychology to create a more delightful experience for marketers and customers alike.


10:35–11:05am
AM Break


11:10–11:30am
How to Write Customer-Driven Copy That Converts

Joel Klettke, Business Casual Copywriting & Case Study Buddy
@JoelKlettke

If you want to write copy that converts, you need to get into your customers’ heads. But how do you do that? How do you know which pain points you need to address, features customers care about, or benefits your audience needs to hear? Marketers are sick and tired of hearing “it depends.” I’ll give the audience a practical framework for writing customer-driven copy that any business can apply.

Joel is a freelance conversion copywriter and strategist for Business Casual Copywriting. He also owns and runs Case Study Buddy, a done-for-you case studies service.


11:30–11:50am
What We Learned From Reddit & How It Can Help Your Brand Take Content Marketing to the Next Level

Daniel Russell, Go Fish Digital
@dnlRussell

It almost seems too good to be true — online forums where people automatically segment themselves into different markets and demographics and then vote on what content they like best. These forums, including Reddit, are treasure troves of content ideas. I’ll share actionable insights from three case studies that demonstrate how your marketing can benefit from content on Reddit.

Daniel is a director at Go Fish Digital whose work has hit the front page of Reddit, earned the #1 spot on YouTube, and been featured in Entrepreneur, Inc., The Washington Post, WSJ, and Fast Company.


11:50am–12:10pm
How to Build an SEO-Intent-Based Framework for Any Business

Kathryn Cunningham, Adept Marketing
@kac4509

Everyone knows intent behind the search matters. In e-commerce, intent is somewhat easy to see. B2B, or better yet healthcare, isn’t quite as easy. Matching persona intent to keywords requires a bit more thought. I will cover how to find intent modifiers during keyword research, how to organize those modifiers into the search funnel, and how to quickly find unique universal results at different levels of the search funnel to utilize.

Kathryn is an SEO consultant for Adept Marketing, although to many of her office mates she is known as the Excel nerd.


12:10–01:40pm
Lunch


01:45–02:30pm
Size Doesn’t Matter: Great Content by Teams of One

Ian Lurie, Portent, Inc.
@portentint

Feel the energy surge through your veins as you gain content creation powers THE LIKES OF WHICH YOU HAVE NEVER EXPERIENCED… Or, just learn a process for creating great content when it’s just you and your little teeny team. Because size doesn’t matter.

Ian Lurie is founder, CEO, and nerdiest marketing nerd at Portent, a digital marketing agency he started in the Cretaceous era, aka 1995. Ian’s meandering career includes marketing copywriting, expert dungeon master, bike messenger-ing, and office temp worker.


02:30–03:00pm
The Tie That Binds: Why Email is Key to Maximizing Marketing ROI

Justine Jordan, Litmus
@meladorri

If nailing the omnichannel experience (whatever that means!) is key to getting more traffic and converting more leads, what happens if we have our channel priorities out of order? Justine will show you how email — far from being an old-school afterthought — is core to hitting marketing goals, building lifetime value, and making customers happy.

Justine is obsessed with helping marketers create, test, and send better email. Named 2015 Email Marketer Thought Leader of the Year, she is strangely passionate about email marketing, hates being called a spammer, and still gets nervous when pressing send.


03:00–03:30pm
PM Break


03:35–04:05pm
Marketing in a Conversational World: How to Get Discovered, Delight Your Customers and Earn the Conversion

Purna Virji, Microsoft
@purnavirji

Capturing and keeping attention is one of the hardest parts of our job today. Fact: It’s just going to get harder with the advent of new technology and conversational interfaces. In the brave new world we’re stepping into, the key questions are: How do we get discovered? How can we delight our audiences? And how can we grow revenue for our clients? Come to this session to learn how to make your marketing and advertising efforts something people are going to want to consume.

Named by PPC Hero as the #1 most influential PPC expert in the world, Purna specializes in SEM, SEO, and future search trends. She is a popular global keynote speaker and columnist, an avid traveler, aspiring top chef, and amateur knitter.



04:05–04:50pm
Thinking Smaller: Optimizing for the New Wave of Social Video Platforms

Phil Nottingham, Wistia
@philnottingham

SnapChat, Facebook, Twitter, Instagram, Periscope… the list goes on. All social networks are now video platforms, but it’s hard to know where to invest. In this session, Phil will be giving you all the tips and tricks for what to make, how to get your content in front of the right audiences, and how to get the most value from the investment you’re making in social video.

Phil Nottingham is a strategist who believes in the power of creative video content to improve the way companies speak to their customers, and regularly speaks around the world about video strategy, SEO, and technical marketing.


07:00–10:00pm
Monday Night #MozCrawl

The Monday night pub crawl is back.

For the uninitiated, “pub crawl” is not meant to convey what you do after a night of drinking.

Rather, during the MozCon pub crawl, attendees visit some of the best bars in Seattle.

(Each stop is sponsored by a trusted partner; you’ll need to bring your MozCon badge for free drinks and light appetizers. You’ll also need your US ID or passport.)

More deets to follow.


Tuesday


08:00–09:00am
Breakfast


09:05–09:50am
I’d Rather Be Thanked Than Ranked

Wil Reynolds, Seer Interactive
@wilreynolds

Ego and assumptions led me to choose the wrong keywords for my own site (yeah, me, Wil Reynolds, Mr. RCS). How did I spend three years optimizing my site and building links to finally crack the top three for six critical keywords, only to find out that I wasted all that time? However, in spite of targeting the wrong words, Seer grew the business. In this presentation, I’ll show you the mistakes I made and share with you approaches that can help you build content that gets you thanked.

A former teacher with a knack for advising, he’s been helping Fortune 500 companies develop SEO strategies since 1999. Today, Seer is home to over 100 employees across Philadelphia and San Diego.


09:50–10:35am
Reverse-Engineer Google’s Research to Serve Up the Best, Most Relevant Content for Your Audience

Rob Bucci, STAT Search Analytics
@STATrob

The SERP is the front-end to Google’s multi-billion dollar consumer research machine. They know what searchers want. In this data-heavy talk, Rob will teach you how to uncover what Google already knows about what web searchers are looking for. Using this knowledge, you can deliver the right content to the right searchers at the right time, every time.

Rob loves the challenge of staying ahead of the changes Google makes to their SERPs. When not working, you can usually find him hiking up a mountain, falling down a ski slope, or splashing around in the ocean.


10:35–11:05am
AM Break


11:10–11:15am
MozCon Ignite Preview


11:15–11:35am
More Than SEO: 3 Ways To Prove UX Matters Too

Matthew Edgar, Elementive
@MatthewEdgarCO

Great SEO is increasingly dependent on having a website with a great user experience. To make your user experience great requires carefully tracking what people do so that you always know where to improve. But what do you track? In this 15-minute talk, I’ll cover three effective and advanced ways to use event tracking in Google Analytics to understand a website’s user

Matthew is a web analytics and technical marketing consultant at Elementive.


11:35–11:55am
A Site Migration: Redirects, Resources, & Reflection

Jayna Grassel, Dick’s Sporting Goods
@jaynagrassel

Site. Migration. No two words elicit more fear, joy, or excitement to a digital marketer. When the idea was shared three years ago, the company was excited. They dreamed of new features and efficiency. But as SEOs, we knew better. We knew there would be midnight strategy sessions with IT. More UAT environments than we could track. Deadlines, requirements, and compromises forged through hallway chats. …The result was a stable transition with minimal dips in traffic. What we didn’t know, however, was the amount of cross-functional coordination that was required to pull it off.

Jayna is the SEO manager at Dick’s Sporting Goods and is the unofficial world’s second-fastest crocheter.


11:55am–12:15pm
The 8 Paid Promotion Tactics That Will Get You To Quit Organic Traffic

Kane Jamison, Content Harmony
@kanejamison

Digital marketers are ignoring huge opportunities to promote their content through paid channels, and I want to give them the tools to get started. How many brands out there are spending $500+ on a blog post, then moving on to the next one before that post has been seen by 500 people, or even 50? For some reason, everyone thinks about Outbrain and native ads when we talk about paid content distribution, but the real opportunity is in highly targeted paid social.

Kane is the founder of Content Harmony, a content marketing agency based here in Seattle. The Content Harmony team specializes in full funnel content marketing and content promotion.


12:15–01:45pm
Lunch


01:50–02:20pm
How to Be a Happy Marketer: Survive the Content Crisis and Drive Results by Mastering Your Customer’s Transformational Journey

Tara-Nicholle Nelson, Transformational Consumer Insights
@taranicholle

Branded content is way up, but customer engagement with that content is plummeting. This whole scene makes it hard to get up in the morning, as a marketer. But there’s a new path beyond the epidemic of disengagement and, at the end of it, your brand and your content become regular stops along your customer’s everyday journey.

Tara-Nicholle Nelson is the CEO of Transformational Consumer Insights, the former VP of Marketing for MyFitnessPal, and author of the Transformational Consumer.


02:20–02:50pm
Up and to the Right: Growing Traffic, Conversions, & Revenue

Matthew Barby, HubSpot
@matthewbarby

So many of the case studies that document how a company has grown from 0 to X forget to mention that the solutions they found apply to their specific scenario and won’t work for everyone. This falls into the dangerous category of bad advice for generic problems. Instead of building up a list of other companies’ tactics, marketers need to understand how to diagnose and solve problems across their entire funnel. Illustrated with real-world examples, I’ll be talking you through the process that I take to come up with ideas that none of my competitors are thinking of.

Matt, who heads up user acquisition at HubSpot, is an award-winning blogger, startup advisor, and a lecturer.


02:50–03:20pm
How to Operationalize Growth for Maximum Revenue

Joanna Lord, ClassPass
@JoannaLord

Joanna will walk through tactical ways to organize your team, build system foundations, and create processes that fuel growth across the company. You’ll hear how to coordinate with product, engineering, CX, and sales to ensure you’re maximizing your opportunity to acquire, retain, and monetize your customers.

Joanna is the CMO of ClassPass, the world’s leading fitness membership. Prior to that she was VP of Marketing at Porch and CMO of BigDoor. She is a global keynote and digital evangelist. Joanna is a recognized thought leader in digital marketing and a startup mentor.


03:20–03:50pm
PM Break


03:55–04:25pm
Analytics to Drive Optimization & Personalization

Krista Seiden, Google
@kristaseiden

Getting the most out of your optimization efforts means understanding the data you’re collecting, from analytics implementation, to report setup, to analysis techniques. In this session, Krista walks you through several tips for using analytics data to empower your optimization efforts, and then takes it further to show you how to up-level your efforts to take advantage of personalization from mass scale all the way down to individual user actions.

Krista Seiden is the Analytics Advocate for Google, advocating for all things data, web, mobile, optimization, and more. Keynote speaker, practitioner, writer on Analytics and Optimization, and passionate supporter of #WomenInAnalytics.


04:25–05:10pm
Facing the Future: 5 Simple Tactics for 5 Scary Changes

Dr. Pete Meyers, Moz
@dr_pete

We’ve seen big changes to SEO recently, from an explosion in SERP features to RankBrain to voice search. These fundamental changes to organic search marketing can be daunting, and it’s hard to know where to get started. Dr. Pete will walk you through five big changes and five tactics for coping with those changes today.

Dr. Peter J. Meyers (aka “Dr. Pete”) is Marketing Scientist for Seattle-based Moz, where he works with the marketing and data science teams on product research and data-driven content.


07:00–10:00pm
MozCon Ignite

Join us for an evening of networking and passion-talks. Laugh, cheer, and be inspired as your peers share their 5-minute talks about their hobbies, passion projects, and life lessons.

Be sure to bring your MozCon badge.


Wednesday


09:00–10:00am
Breakfast


10:05–10:50am
The Truth About Mobile-First Indexing

Cindy Krum, MobileMoxie, LLC
@suzzicks

Mobile-first design has been a best practice for a while, and Google is finally about to support it with mobile-first indexing. But mobile-first design and mobile-first indexing are not the same thing. Mobile-first indexing is about cross-device accessibility of information, to help integrate digital assistants and web-enabled devices that don’t even have browsers to achieve Google’s larger goals. Learn how mobile-first indexing will give digital marketers their first real swing at influencing Google’s new AI (Artificial Intelligence) landscape. Marketers who embrace an accurate understanding of mobile-first indexing could see a huge first-mover advantage, similar to the early days of the web, and we all need to be prepared.

Cindy, the CEO and Founder of MobileMoxie, LLC, is the author of Mobile Marketing: Finding Your Customers No Matter Where They Are. She brings fresh and creative ideas to her clients, and regularly speaks at US and international digital marketing events.


10:50–11:20am
Powerful Brands Have Communities

Tara Reed, Apps Without Code
@TaraReed_

You are laser-focused on user growth. Meanwhile, you’re neglecting a gold mine of existing customers who desperately want to be part of your brand’s community. Tara Reed shares how to use communities, gamification, and membership content to grow your revenue.

Tara Reed is a tech entrepreneur & marketer. After running marketing initiatives at Google, Foursquare, & Microsoft, Tara branched out to launch her own apps & startups. Today, Tara helps people implement cutting-edge marketing into their businesses.


11:20–11:50am
AM Break


11:55am–12:25pm
From Anchor to Asset: How Agencies Can Wisely Create Data-Driven Content

Heather Physioc, VML
@HeatherPhysioc

Creative agencies are complicated and messy, often embracing chaos instead of process, and focusing exclusively on one-time campaign creative instead of continuous web content creation. Campaign creative can be costly, and not sustainable for most large brands. How can creative shops produce data-driven streams of high-quality content for the web that stays true to its creative roots — but faster, cheaper, and continuously? I’ll show you how.

Heather is director of Organic Search at global digital ad agency VML, which performs search engine optimization services for multinational brands like Hill’s Pet Nutrition, Electrolux/Frigidaire, Bridgestone, EXPRESS, and Wendy’s.


12:25–12:55pm
5 Secrets: How to Execute Lean SEO to Increase Qualified Leads

Britney Muller, Moz
@BritneyMuller

I invite you to steal some of the ideas I’ve gleaned from managing SEO for the behemoth bad-ass Moz.com. Learn what it takes to move the needle on qualified leads, execute quick wins, and keep your head above water. I’ll go over my biggest Moz.com successes, failures, tests, and lessons.

Britney is a Minnesota native who moved to Colorado to fulfill a dream of being a snowboard bum! After 50+ days on the mountain her first season, she got stir-crazy and taught herself how to program, then found her way into SEO while writing for a local realtor.


12:55–02:25pm
Lunch


02:30–03:15pm
SEO Experimentation for Big-Time Results

Stephanie Chang, Etsy
@stephpchang

One of the biggest business hurdles any brand faces is how to prioritize and validate SEO recommendations. This presentation describes an SEO experimentation framework you can use to effectively test how changes made to your pages affect SEO performance.

Stephanie currently leads the Global Acquisition & Retention Marketing teams at Etsy. Previously, she was a Senior Consultant at Distilled.


03:15–03:45pm
Winning Value Propositions for Crawlers and Consumers

Dawn Anderson, Move It Marketing/Manchester Metropolitan University
@dawnieando

In an evolving mobile-first web, we can utilize preempting solutions to create winning value propositions, which are designed to attract and satisfy search engine crawlers and keep consumers happy. I’ll outline a strategy and share tactics that help ensure increased organic reach, in addition to highlighting smart ways to view data, intent, consumer choice theory, and crawl optimization.

Dawn Anderson is an International and Technical SEO Consultant, Director of Move It Marketing, and a lecturer at Manchester Metropolitan University.


03:45–04:15pm
PM Break


04:20–05:05pm
Inside the Googling Mind: An SEO’s Guide to Winning Clicks, Hearts, & Rankings in the Years Ahead

Rand Fishkin, Founder of Moz, doer of SEO, feminist
@randfish

Searcher behavior, intent, and satisfaction are on the verge of overtaking classic SEO inputs (keywords, links, on-page, etc). In this presentation, Rand will examine the shift that behavioral signals have caused, and list the step-by-step process to build a strategy that can thrive long-term in Google’s new reality.

Rand Fishkin is the founder and former CEO of Moz, co-author of a pair of books on SEO, and co-founder of Inbound.org. Rand’s an un-save-able addict of all things content, search, and social on the web.


07:00–11:30pm
MozCon Bash

Join us at Garage Billiards for an evening of networking, billiards, bowling, and karaoke with MozCon friends new and old. Don’t forget to bring your MozCon badge and US ID or passport.


Additional Pre-MozCon Sunday Workshops


12:30pm–5:05pm
SEO Intensive

Offered as 75-minute sessions, the five workshops will be taught by Mozzers Rand Fishkin, Britney Muller, Brian Childs, Russ Jones, and Dr. Pete. Topics include The 10 Jobs of SEO-focused Content, Keyword Targeting for RankBrain and Beyond, and Risk-Averse Link Building at Scale, among others.

These workshops are separate from MozCon; you’ll need a ticket to attend them.


Amped up for a talk or ten? Curious about new methods? Excited to learn? Get your ticket before they sell out:

Snag my ticket to MozCon 2017!


Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

Tackling Tag Sprawl: Crawl Budget, Duplicate Content, and User-Generated Content

Posted by rjonesx.

Alright, so here’s the situation. You have a million-product website. Your competitors have a lot of the same products. You need unique content. What do you do? The same thing everyone does — you turn to user-generated content. Problem solved, right?

User-generated content (UGC) can be an incredibly valuable source of content and organization, helping you build natural language descriptions and human-driven organization of site content. One common feature sites use to take advantage of user-created content is tags, found everywhere from e-commerce sites to blogs. Webmasters can leverage tags to power site search, create taxonomies and categories of products for browsing, and provide rich descriptions of site content.

This is a logical and practical approach, but can cause intractable SEO problems if left unchecked. For mega-sites, manually moderating millions of user-submitted tags can be cumbersome (if not wholly impossible). Leaving tags unchecked, though, can create massive problems with thin content, duplicate content, and general content sprawl. In our case study below, three technical SEOs from different companies joined forces to solve a massive tag sprawl problem. The project was led by Jacob Bohall, VP of Marketing at Hive Digital, while computational statistics services were provided by J.R. Oakes of Adapt Partners and Russ Jones of Moz. Let’s dive in.

What is tag sprawl?

We define tag sprawl as the unchecked growth of unique, user-contributed tags resulting in a large amount of near-duplicate pages and unnecessary crawl space. Tag sprawl generates URLs likely to be classified as doorway pages, pages appearing to exist only for the purpose of building an index across an exhaustive array of keywords. You’ve probably seen this in its most basic form in the tagging of posts across blogs, which is why most SEOs recommend a blanket “noindex, follow” across tag pages in WordPress sites. This simple approach can be an effective solution for small blog sites, but it is rarely the solution for major e-commerce sites that rely more heavily on tags for categorizing products.

The three following tag clouds represent a list of user-generated terms associated with different stock photos. Note: User behavior is generally to place as many tags as possible in an attempt to ensure maximum exposure for their products.

  1. USS Yorktown, Yorktown, cv, cvs-10, bonhomme richard, revolutionary war-ships, war-ships, naval ship, military ship, attack carriers, patriots point, landmarks, historic boats, essex class aircraft carrier, water, ocean
  2. ship, ships, Yorktown, war boats, Patriot pointe, old war ship, historic landmarks, aircraft carrier, war ship, naval ship, navy ship, see, ocean
  3. Yorktown ship, Warships and aircraft carriers, historic military vessels, the USS Yorktown aircraft carrier

As you can see, each user has generated valuable information for the photos, which we would want to use as a basis for creating indexable taxonomies for related stock images. However, at any type of scale, we have immediate threats of:

  • Thin content: Only a handful of products share the user-generated tag when a user creates a more specific/defining tag, e.g. “cvs-10”
  • Duplicate and similar content: Many of these tags will overlap, e.g. “USS Yorktown” vs. “Yorktown,” “ship” vs. “ships,” “cv” vs. “cvs-10,” etc.
  • Bad content: Created by improper formatting, misspellings, verbose tags, hyphenation, and similar mistakes made by users.

Now that you understand what tag sprawl is and how it negatively affects your site, how can we address this issue at scale?

The proposed solution

In correcting tag sprawl, we have some basic (at the surface) problems to solve. We need to effectively review each tag in our database and place them in groups so further action can be taken. First, we determine the quality of a tag (how likely is someone to search for this tag, is it spelled correctly, is it commercial, is it used for many products) and second, we determine if there is another tag very similar to it that has a higher quality.

  1. Identify good tags: We defined a good tag as a term capable of contributing meaning, and easily justifiable as an indexed page in search results. This also entailed identifying a “master” tag to represent groups of similar terms.
  2. Identify bad tags: We wanted to isolate tags that should not appear in our database due to misspellings, duplicates, poor format, high ambiguity, or a likelihood of producing a low-quality page.
  3. Relate bad tags to good tags: We assumed many of our initial “bad tags” could be a range of duplicates, i.e. plural/singular, technical/slang, hyphenated/non-hyphenated, conjugations, and other stems. There could also be two phrases which refer to the same thing, like “Yorktown ship” vs. “USS Yorktown.” We need to identify these relationships for every “bad” tag.

For the project inspiring this post, our sample tag database comprised over 2,000,000 “unique” tags, making this a nearly impossible feat to accomplish manually. While theoretically we could have leveraged Mechanical Turk or a similar platform to get “manual” review, early tests of this method proved unsuccessful. We would need a programmatic method (several methods, in fact) that we could later reproduce when adding new tags.

The methods

Keeping in mind the goal of identifying good tags, labeling bad tags, and relating bad tags to good tags, we employed more than a dozen methods, including: spell correction, bid value, tag search volume, unique visitors, tag count, Porter stemming, lemmatization, Jaccard index, Jaro-Winkler distance, Keyword Planner grouping, Wikipedia disambiguation, and K-Means clustering with word vectors. Each method helped us either determine whether a tag was valuable or, if it was not, identify an alternate tag that was.

Spell correction

  • Method: One of the obvious issues with user-generated content is the occurrence of misspellings. We would regularly find misspellings where semicolons are transposed for the letter “L” or words have unintended characters at the beginning or end. Luckily, Linux has an excellent built-in spell checker called Aspell which we were able to use to fix a large volume of issues.
  • Benefits: This offered a quick, early win in that it was fairly easy to identify bad tags when they were composed of words that weren’t included in the dictionary or included characters that were simply inexplicable (like a semicolon in the middle of a word). Moreover, if the corrected word or phrase occurred in the tag list, we could trust the corrected phrase as a potentially good tag, and relate the misspelled term to the good tag. Thus, this method helped us both filter bad tags (misspelled terms) and find good tags (the spell-corrected term).
  • Limitations: The biggest limitation with this methodology was that combinations of correctly spelled words or phrases aren’t necessarily useful for users or the search engine. For example, many of the tags in the database were concatenations of multiple tags where the user space-delimited rather than comma-delimited their submitted tags. Thus, a tag might consist of correctly spelled terms but still be useless in terms of search value. Moreover, there were substantial dictionary limitations, especially with domain names, brand names, and Internet slang. In order to accommodate this, we added a personal dictionary that included a list of the top 10,000 domains according to Quantcast, several thousand brands, and a slang dictionary. While this was helpful, there were still several false recommendations that needed to be handled. For example, we saw “purfect” correct to “perfect,” despite being a pop-culture reference for cat images. We also noticed some users reference this saying as “purrfect,” “purrrfect,” “purrrrfect,” “purrfeck,” etc. Ultimately, we had to rely on other metrics to determine whether we trusted the misspelling recommendations.
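
To make the idea concrete, here is a minimal sketch of the same filter-and-relate step. It does not use Aspell; instead it swaps in Python's standard `difflib` fuzzy matcher against a list of already-trusted tags. The function name, the character whitelist, the cutoff, and the sample tags are all assumptions made for illustration.

```python
import difflib
import re

# Characters we treat as legitimate in a tag; everything else is stripped.
# This whitelist is an assumption for the sketch, not a rule from the project.
UNEXPECTED_CHARS = re.compile(r"[^a-z0-9 '\-]")


def suggest_correction(tag, known_tags, personal_dictionary=(), cutoff=0.85):
    """Return a trusted spelling for `tag`, or None if nothing is close enough."""
    cleaned = UNEXPECTED_CHARS.sub("", tag.lower()).strip()
    # Terms in the personal dictionary (brands, domains, slang) are never "corrected".
    if cleaned in personal_dictionary or cleaned in known_tags:
        return cleaned
    matches = difflib.get_close_matches(cleaned, known_tags, n=1, cutoff=cutoff)
    return matches[0] if matches else None


known_tags = ["baseball", "facebook", "perfect", "yorktown"]  # hypothetical good tags

stray_semicolon = suggest_correction("basebal;", known_tags)
false_positive = suggest_correction("purfect", known_tags)
protected = suggest_correction("purfect", known_tags, personal_dictionary={"purfect"})
```

Without the personal dictionary, “purfect” is mapped to “perfect,” which is the kind of false recommendation described above; adding the term to the dictionary protects it.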

Bid value

  • Method: While a tag might be good in the sense that it is descriptive, we wanted tags that were commercially relevant. Using the estimated cost-per-click of the tag or tag phrase proved useful in making sure that the term could attract buyers, not just visitors.
  • Benefits: One of the great features of this methodology is that it tends to have a high signal-to-noise ratio. Most tags that have high CPCs tend to be commercially relevant and searched frequently enough to warrant inclusion as “good tags.” In many cases we could feel confident that a tag was good just on this metric alone.
  • Limitations: However, the bid value metric comes with some pretty big limitations, too. For starters, Google Keyword Planner’s disambiguation problem is readily apparent. Google combines related keywords together when reporting search volume and CPC data, which means a tag like “facbook” would return the same data as “facebook.” Obviously, we would prefer to map “facbook” to “facebook” rather than keep both tags, so in some cases the CPC metric wasn’t sufficient to identify good tags. A further limitation of the bid value was the difficulty of acquiring CPC data. Google now requires running active AdWords campaigns to get access to CPC value. It is no simple feat to look up 5,000,000 keywords in Google Keyword Planner, even if you have a sufficient account. Luckily, we felt comfortable that historical data would be trustworthy enough, so we didn’t need to acquire fresh data.

Tag search volume

  • Method: Similar to CPC, we could use search volume to determine the potential value of a tag. We had to be careful not to rely on the tag itself, though, since the tag could be so generic that it earns traffic unrelated to the product itself. For example, the tag “USS Yorktown” might get a few hundred searches a month, but “USS Yorktown T-shirt” gets 0. For all of the tags in our index, we tracked down the search volume for the tag plus the product name, in order to make sure we had good estimates of potential product traffic.
  • Benefits: Like CPC, this metric did a very good job of consolidating our tag data set to just keywords that were likely to deliver traffic. In the vast majority of cases, if “tag + product” had search volume, we could feel confident that it is a good term.
  • Limitations: Unfortunately, this method fell victim to the same disambiguation problem that CPC presents. Because Google groups terms together, it is possible that on some occasions two tags will be given the same metrics. For example: “pontoons boat,” “pontoonboat,” “pontoon boats,” “pontoon boat,” “pontoon boating,” and “pontoons boats” were in the same traffic volume group which also included tags like “yacht” and “yachts.” Moreover, there is no accounting for keyword difficulty in this metric. Some tags, when combined with product types, produce keywords that receive substantial traffic but will always be out of reach for a templated tag page.

Unique visitors

  • Method: This method was a no-brainer: protect the tags that already receive traffic from Google. We exported all of the tags from Google Analytics that had received search traffic from Google in the last 12 months. Generally speaking, this should be a fairly safe list of terms.
  • Benefits: When doing experimental work with a client, it is always nice to be able to give them a scenario that almost guarantees improvement. Because we were able to protect tags that already receive traffic by labeling them as good (in the vast majority of cases), we could ensure that the client had a high probability of profiting from the changes we made and minimal risk of any traffic loss.
  • Limitations: Unfortunately, even this method wasn’t perfect. If a product (or set of products) with high enough authority included a poor variation of a tag, then the bad variant would rank and receive traffic. We had to use other strategies to verify our selections from this method and devise a method to encourage a tag swap in the index for the correct version of a term.

Tag count

  • Method: The frequency with which a tag was used on the site was often a strong signal that we could trust the tag, especially when compared with other similar tags. By counting the number of times each tag was used on the site, we could bias our final set of trusted tags in favor of these more popular terms.
  • Benefits: This was a great tie-breaker metric when we had two tags that were very similar but needed to choose just one. For example, sometimes two variants of a phrase were completely acceptable (such as a version with and without a hyphen). We could simply defer to the one with a higher tag count.
  • Limitations: The clear limitation of tag frequency is that many of the most frequent tags were too generic to be useful. The tag “blue” isn’t particularly useful when it just helps people find “blue t-shirts.” The term is too generic and too competitive to warrant inclusion. Additionally, including too broad a tag would simply create a very large crawl vs. traffic-potential ratio. A common tag will have hundreds if not thousands of matching products, creating many pages of products for the single tag. If a tag produces 50 paginated product listings, but only has the potential to drive 10 visitors a year, it might not be worth it.
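
The tie-breaker can be sketched in a few lines. The product data and the helper name `pick_by_count` below are hypothetical; the only idea taken from the text is "defer to the variant with the higher tag count."

```python
from collections import Counter

# Hypothetical tag assignments: one list of tags per product.
product_tags = [
    ["t-shirt", "blue"],
    ["t-shirt", "yorktown"],
    ["t shirt", "blue"],
    ["t-shirt"],
]

# Count how many products carry each tag.
tag_counts = Counter(tag for tags in product_tags for tag in tags)


def pick_by_count(variants, counts):
    """Among near-identical variants, prefer the most frequently used one.

    Ties fall back to alphabetical order so the result is deterministic.
    """
    return min(variants, key=lambda tag: (-counts[tag], tag))


master = pick_by_count(["t shirt", "t-shirt"], tag_counts)
```

Note that a raw count says nothing about whether a tag is too generic; here “blue” is popular too, so the count is best used only to choose between variants that other methods have already grouped.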

Porter stemming

  • Method: Stemming is a method used to identify the root word from a tag by scanning the word right to left and using various pattern matching rules to remove characters (suffixes) until you arrive at the word’s stem. There are a couple of popular stemmers available, but we found Porter stemming to be more accurate as a tool for seeing alternative word forms. You can geek out by looking at the Porter stemming algorithm in Snowball here, or you can play with a JS version here.
  • Benefits: Plural and possessive terms can be grouped by their stem for further analysis. Running Porter stemming on the terms “pony” and “ponies” will return “poni” as the stem, which can then be used to group terms for further analysis. You can also run Porter stemming on phrases. For example, “boating accident,” “boat accidents,” “boating accidents,” etc. share the stem “boat accid.” This can be a crude and quick method for grouping variations. Porter stemming is also able to clean text more gently, where other stemmers can be too aggressive for our efforts; e.g., Lancaster stemmer reduces “woman” to “wom,” while Porter stemmer leaves it as “woman.”
  • Limitations: Stemming is intended for finding a common root for terms and phrases, and does not create any type of indication as to the proper form of a term. The Porter stemming method applies a fixed set of rules to the English language by blanket removing trailing “s,” “e,” “ance,” “ing,” and similar word endings to try and find the stem. For this to work well, you have to have all of the correct rules (and exceptions) in place to get the correct stems in all cases. This can be particularly problematic with words that end in S but are not plural, like “billiards” or “Brussels.” Additionally, this method does not help with mapping related terms such as “boat crash,” “crashed boat,” “boat accident,” etc. which would stem to “boat crash,” “crash boat,” and “boat accid.”
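
The grouping step can be illustrated without a full stemmer. The sketch below deliberately uses a toy three-rule suffix stripper, which is not the Porter algorithm and produces different stems (for example “accident” rather than “accid”); it only shows how a stem key can bucket variants, and how a blanket rule mangles a word like “billiards.” All names here are made up for the example.

```python
from collections import defaultdict

# Toy rule set, far smaller than Porter's; an assumption for illustration only.
SUFFIXES = ("ing", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_key(tag):
    """Stem every word in a tag phrase and rejoin them."""
    return " ".join(crude_stem(word) for word in tag.lower().split())


def group_by_stem(tags):
    """Bucket tags that share the same stem key."""
    groups = defaultdict(list)
    for tag in tags:
        groups[stem_key(tag)].append(tag)
    return dict(groups)


groups = group_by_stem(
    ["boating accident", "boat accidents", "boating accidents", "boat crash", "billiards"]
)
```

The three accident variants land in one bucket, “boat crash” stays on its own even though it means much the same thing, and “billiards” is wrongly reduced to “billiard.”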

Lemmatization

  • Method: Lemmatization works similarly to stemming. However, instead of using a rule set for editing words by removing letters to arrive at a stem, lemmatization attempts to map the term to its simplest dictionary form, using a lexical database such as WordNet, and return a canonical “lemma” of the word. A crude way to think about lemmatization is just simplifying a word. Here’s an API to check out.
  • Benefits: This method often works better than stemming. Terms like “ship,” “shipped,” and “ships” are all mapped to “ship” by this method, while “shipping” or “shipper,” which are terms that have distinct meaning despite the same stem, are retained. You can create an array of “lemma” from phrases which can be compared to other phrases resolving word order issues. This proved to be a more reliable method for grouping variations than stemming.
  • Limitations: As with many of the methods, context for mapping related terms can be difficult. Lemmatization can provide better filters for context, but to do so generally relies on identifying the word form (noun, adjective, etc.) to appropriately map to a root term. Given the inconsistency of the user-generated content, it is inaccurate to assume all words are in adjective form (describing a product), or noun form (the product itself). This inconsistency can produce wild results. For example, “strip socks” could be intended as a tag for socks with a strip of color on them, such as “striped socks,” or it could be “stripper socks” or some other leggings, a match that could only be found if there are other products and tags to compare for context. Additionally, it doesn’t create associations between all related words, just textual derivatives, so you are still seeking out a canonical between mailman, courier, shipper, etc.
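
Here is a small sketch of the "array of lemma" comparison. The lemma dictionary is a tiny hand-written stand-in for a real resource such as WordNet, and `lemma_key` is a hypothetical helper name; a production version would also need part-of-speech information, as noted above.

```python
# Tiny hand-written lemma dictionary standing in for a real lexical database.
LEMMAS = {
    "ships": "ship",
    "shipped": "ship",
    "boats": "boat",
    "crashed": "crash",
    "crashes": "crash",
}


def lemma_key(tag, lemmas=LEMMAS):
    """Map each word to its lemma and sort, so word order no longer matters."""
    return tuple(sorted(lemmas.get(word, word) for word in tag.lower().split()))


same_phrase = lemma_key("boat crash") == lemma_key("crashed boat")
kept_distinct = lemma_key("shipping") != lemma_key("ships")
```

Sorting the lemmas is what resolves the word-order issue that stemming alone could not, while words missing from the dictionary (such as “shipping”) are simply left as they are.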

Jaccard index

  • Method: The Jaccard index is a similarity coefficient measured by Intersection over Union. Now, don’t run off just yet, it is actually quite straightforward.

    Imagine you had two piles with 3 marbles in each: Red, Green, and Blue in the first, Red, Green and Yellow in the second. The “Intersection” of these two piles would be Red and Green, since both piles have those two colors. The “Union” would be Red, Green, Blue and Yellow, since that is the complete list of all the colors. The Jaccard index would be 2 (Red and Green) divided by 4 (Red, Green, Blue, and Yellow). Thus, the Jaccard index of these two piles would be .5. The higher the Jaccard index, the more similar the two sets.
    So what does this have to do with tags? Well, imagine we have two tags: “ocean” and “sea.” We can get a list of all of the products that have the tag “ocean” and a list of all of the products that have the tag “sea.” Finally, we get the Jaccard index of those two sets. The higher the score, the more related they are. Perhaps we find that 70% of the products with the tag “ocean” also have the tag “sea”; we now know that the two are fairly well-related. However, when we run the same measurement to compare “basement” and “casement,” we find that they only have a Jaccard index of .02. Even though they are very similar in terms of characters, they mean quite different things. We can rule out mapping the two terms together.
  • Benefits: The greatest benefit of using the Jaccard index is that it allows us to find highly related tags which may have absolutely no textual characteristics in common, and are more likely to have an overly similar or duplicate results set. While most of the metrics we have considered so far help us find “good” or “bad” tags, the Jaccard index helps us find “related” tags without having to do any complex machine learning.
  • Limitations: While certainly useful, the Jaccard index methodology has its own problems. The biggest issue we ran into had to do with tags that were used together nearly all the time but weren’t substitutes of one another. For example, consider the tags “babe ruth” and his nickname, “sultan of swat.” The latter tag only occurred on products which also had the “babe ruth” tag (since this was one of his nicknames), so they had quite a high Jaccard index. However, Google doesn’t map these two terms together in search, so we would prefer to keep the nickname and not simply redirect it to “babe ruth.” We needed to dig deeper if we were to determine when we should keep both tags or when we should redirect one to another. On its own, this method was also not sufficient for identifying cases where a user consistently misspelled tags or used incorrect syntax, as their products would essentially be orphans without “union.”
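
The calculation is short enough to show in full. This sketch reproduces the marble example from the text and then applies the same function to product sets; the product IDs are invented for illustration and do not come from the project data.

```python
def jaccard_index(a, b):
    """Return |a ∩ b| / |a ∪ b| for two collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# The marble example: two shared colors out of four distinct colors.
marble_score = jaccard_index({"red", "green", "blue"}, {"red", "green", "yellow"})

# Hypothetical product IDs carrying each tag.
products_by_tag = {
    "ocean": {101, 102, 103, 104},
    "sea": {102, 103, 104, 105},
    "basement": {201, 202, 203},
    "casement": {203, 301, 302},
}

ocean_sea = jaccard_index(products_by_tag["ocean"], products_by_tag["sea"])
basement_casement = jaccard_index(products_by_tag["basement"], products_by_tag["casement"])
```

Because the function only looks at which products carry each tag, it never inspects the characters of the tags themselves, which is why it can relate “ocean” to “sea” and keep “basement” apart from “casement.”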

Jaro-Winkler distance

  • Method: There are several edit distance and string similarity metrics that we used throughout this process. Edit distance is simply some measurement of how difficult it is to change one word to another. For example, the most basic edit distance metric, Levenshtein distance, between “Russ Jones” and “Russell Jones” is 3 (you have to add “E,” “L,” and “L” to transform Russ to Russell). This can be used to help us find similar words and phrases. In our case, we used a particular edit distance measure called “Jaro-Winkler distance” which gives higher precedence to words and phrases that are similar at the beginning. For example, “Baseball” would be closer to “Baseballer” than to “Basketball” because the differences are at the very end of the term.
  • Benefits: Edit distance metrics helped us find many very similar variants of tags, especially when the variants were not necessarily misspellings. This was particularly valuable when used in conjunction with the Jaccard index metrics, because we could apply a character-level metric on top of a character-agnostic metric (i.e. one that cares about the letters in the tag and one that doesn’t).
  • Limitations: Edit distance metrics can be kind of stupid. According to Jaro-Winkler distance, “Baseball” and “Basketball” are far more related to one another than “Baseball” and “Pitcher” or “Catcher.” “Round” and “Circle” have a horrible edit distance metric, while “Round” and “Pound” look very similar. Edit distance simply cannot be used in isolation to find similar tags.
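
For readers who want to see the two measures side by side, below is a compact, dependency-free sketch of Levenshtein distance and Jaro-Winkler similarity (higher means more similar). It follows the standard textbook definitions with the usual prefix cap of 4 and scaling factor of 0.1; it is not necessarily the implementation or the parameters used on the project.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for ch1, ch2 in zip(s1[:4], s2[:4]):
        if ch1 != ch2:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


russ_distance = levenshtein("Russ Jones", "Russell Jones")
closer_to_baseballer = jaro_winkler("baseball", "baseballer") > jaro_winkler("baseball", "basketball")
```

Running the same function on “round” against “pound” and “circle” shows the weakness described above: the character-level score favors “pound,” even though “circle” is the related term.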

Keyword Planner grouping

  • Method: While Google’s choice to combine similar keywords in Keyword Planner has been problematic for predicting traffic, it has actually offered us a new method to identify highly related terms. Whenever two tags share identical metrics from Google Keyword Planner (average monthly traffic, historical traffic, CPC, and competition), we can conclude that there is an increased chance the two are related to one another.
  • Benefits: This method is extremely useful for acronyms (which are particularly difficult to detect). While Google groups together COO and Chief Operating Officer, you can imagine that standard methods like those outlined above might have problems detecting the relationship.
  • Limitations: The greatest drawback for this methodology was that it created numerous false positives among less popular terms. There are just too many keywords which have an annual search volume average of 10, are searched 10 times monthly, and have a CPC and competition of 0. Thus, we had to limit the use of this methodology to more popular terms where there were only a handful of matches.
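
A minimal sketch of this grouping follows. Every number in the sample export is invented, and the `min_volume` threshold is an assumed way of restricting the method to more popular terms; the only idea taken from the text is that tags with identical metrics are candidates for being related.

```python
from collections import defaultdict

# Hypothetical Keyword Planner export: tag -> (avg. monthly searches, CPC, competition).
metrics_by_tag = {
    "coo": (40500, 1.25, 0.12),
    "chief operating officer": (40500, 1.25, 0.12),
    "pontoon boat": (74000, 0.85, 1.0),
    "pontoon boats": (74000, 0.85, 1.0),
    "purrfect socks": (10, 0.0, 0.0),
    "yorktown mug": (10, 0.0, 0.0),
}


def related_by_metrics(metrics, min_volume=1000):
    """Group tags whose metrics are identical, ignoring low-volume groups."""
    groups = defaultdict(list)
    for tag, values in metrics.items():
        groups[values].append(tag)
    return [
        sorted(tags)
        for values, tags in groups.items()
        if len(tags) > 1 and values[0] >= min_volume
    ]


related = related_by_metrics(metrics_by_tag)
```

The two low-volume tags share identical metrics purely by coincidence, which is the false-positive case the volume threshold is meant to filter out.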

Wikipedia disambiguation

  • Method: Many of the methods above are great for grouping similar/related terms, but do not provide a high-confidence method for determining the “master” term or phrase to represent a grouping of related/duplicate terms. While considerations can be made for testing all tags against an English language model, the lack of pop culture references and phrases makes it unreliable. To do this effectively, we found Wikipedia to be a trusted source for identifying the proper spelling, tense, formatting, and word order for any given tag. For example, if users tagged a product as “Lord of the Rings,” “LOTR,” and “The Lord of the Rings,” it can be difficult to determine which tag should be preferred (certainly we don’t need all 3). If you search Wikipedia for these terms, you will see that they redirect you to the page titled “The Lord of the Rings.” In many cases, we can trust their canonical variant as the “good tag.” Please note that we don’t encourage scraping any website or violating their terms of use. Wikipedia does offer an export of their entire database that can be used for research purposes.
  • Benefits: When a tag could be mapped to a Wikipedia entry, this method proved to be highly effective at providing validation that a tag had potential value, or creating a point of reference for related tags. If the Wikipedia community felt a tag or tag phrase was important enough to have an article dedicated to it, then the tag was more likely to be a valuable term vs. a random entry or keyword stuffing by the user. Further, the methodology allows for grouping related terms without any bias on word order. Doing a search on Wikipedia creates a search results page (“pontoon boats”), or redirects you to a correction of the article (“disneyworld” becomes “Walt Disney World”). Wikipedia also tends to have entries for some pop culture references, so things that would get flagged as a misspelling, such as “lolcats,” can be vindicated by the existence of a matching Wikipedia article.
  • Limitations: While Wikipedia is effective at delivering a consistent formal tag for disambiguation, it can at times be more sterile than user-friendly. This can run counter to other signals such as CPC or traffic volume methods. For example, “pontoon boats” becomes “Pontoon (Boat)”, or “Lily” becomes “lilium.” All signals indicate the former case as the most popular, but Wikipedia disambiguation suggests the latter to be the correct usage. Wikipedia also contains entries for very broad terms, like each number, year, letter, etc. so simply applying a rule that any Wikipedia article is an allowed tag would continue to contribute to tag sprawl problems.
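
Assuming the article titles and redirects have already been extracted from the database export into two lookup tables, the resolution step might look like the sketch below. The table contents are limited to the examples mentioned in the text, and the table layout and the function name `resolve_title` are hypothetical.

```python
# Hypothetical lookup tables built offline from the Wikipedia database export.
# Keys are lowercased so that matching is case-insensitive.
article_titles = {
    "the lord of the rings": "The Lord of the Rings",
    "walt disney world": "Walt Disney World",
}
redirects = {
    "lord of the rings": "The Lord of the Rings",
    "lotr": "The Lord of the Rings",
    "disneyworld": "Walt Disney World",
}


def resolve_title(tag, titles, redirect_map, max_hops=5):
    """Follow redirects until an article title is reached; None if there is no match."""
    current = tag.strip().lower()
    for _ in range(max_hops):
        if current in titles:
            return titles[current]
        if current not in redirect_map:
            return None
        current = redirect_map[current].lower()
    return None  # Gave up: redirect chain too long or circular.


user_tags = ["Lord of the Rings", "LOTR", "The Lord of the Rings", "asdfgh"]
canonical = {tag: resolve_title(tag, article_titles, redirects) for tag in user_tags}
```

All three Tolkien variants resolve to the same canonical title, which can then serve as the candidate “master” tag, while the unmatched tag returns nothing and has to be judged by the other methods.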

K-means clustering with word vectors

  • Method: Finally, we attempted to transform the tags into a subset of more meaningful tags using word embeddings and k-means clustering. Generally, the process involved transforming the tags into tokens (individual words), then refining by part-of-speech (noun, verb, adjective), and finally lemmatizing the tokens (“blue shirts” becomes “blue shirt”). From there, we transformed all the tokens into a custom Word2Vec embedding model based on adding the vectors of each resulting token array. We created a label array and a vector array of each tag in the dataset, then ran k-means with 10 percent of the total count of the tags as the value for number of centroids. At first we tested on 30,000 tags and obtained reasonable results.
    Once k-means had completed, we pulled all of the centroids and obtained their nearest relative from the custom Word2Vec model, then we assigned the tags to their centroid category in the main dataset.

    Tag Tokens | Tag Pos | Tag Lemm. | Categorization
    [‘beach’, ‘photographs’] | [(‘beach’, ‘NN’), (‘photographs’, ‘NN’)] | [‘beach’, ‘photograph’] | beach photo
    [‘seaside’, ‘photographs’] | [(‘seaside’, ‘NN’), (‘photographs’, ‘NN’)] | [‘seaside’, ‘photograph’] | beach photo
    [‘coastal’, ‘photographs’] | [(‘coastal’, ‘JJ’), (‘photographs’, ‘NN’)] | [‘coastal’, ‘photograph’] | beach photo
    [‘seaside’, ‘photographs’] | [(‘seaside’, ‘NN’), (‘photographs’, ‘NN’)] | [‘seaside’, ‘photograph’] | beach photo
    [‘seaside’, ‘posters’] | [(‘seaside’, ‘NN’), (‘posters’, ‘NNS’)] | [‘seaside’, ‘poster’] | beach photo
    [‘coast’, ‘photographs’] | [(‘coast’, ‘NN’), (‘photographs’, ‘NN’)] | [‘coast’, ‘photograph’] | beach photo
    [‘beach’, ‘photos’] | [(‘beach’, ‘NN’), (‘photos’, ‘NNS’)] | [‘beach’, ‘photo’] | beach photo
    The Categorization column above was the centroid selected by k-means. Notice how it handled the matching of “seaside” to “beach” and “coastal” to “beach.”
  • Benefits: This method seemed to do a good job of finding associations between the tags and their categories that were more semantic than character-driven. “Blue shirt” might be matched to “clothing.” This was obviously not possible without the semantic relationships found within the vector space.
  • Limitations: Ultimately, the chief limitation that we encountered was trying to run k-means on the full two million tags while ending up with 200,000 categories (centroids). Sklearn for Python allows for multiple concurrent jobs, but only across the centroid initializations, of which there were 11 in this case. This meant that even on a 60-core processor, the number of concurrent jobs was capped at 11. We tried PCA (principal component analysis) to reduce the vector sizes (300 to 10), but the results were poor overall. In addition, because embeddings are generally built based on probabilistic closeness of terms in the corpus on which they were trained, there were matches where you could understand why they matched, but which were obviously not the correct category (e.g., “19th century art” was picked as a category for “18th century art”). Finally, context matters, and word embeddings obviously struggle to distinguish between “duck” (the animal) and “duck” (the action).
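
To make the vector-and-cluster idea concrete, here is a minimal sketch in plain Python. It is not the project's code: the two-dimensional vectors are made up for illustration (a real Word2Vec model would supply them), the tiny k_means loop stands in for a library implementation, and all names are hypothetical.

```python
import math

# Toy 2-D "embeddings", invented for illustration only. A trained Word2Vec
# model would supply vectors with hundreds of dimensions instead.
TOY_VECTORS = {
    "beach": (0.9, 0.1), "seaside": (0.8, 0.2), "coast": (0.85, 0.15),
    "photo": (0.1, 0.9), "photograph": (0.15, 0.85),
    "blue": (-0.9, 0.1), "shirt": (-0.1, -0.9),
}

def tag_vector(lemmas):
    """Add the vectors of a tag's lemmatized tokens to get one vector per tag."""
    dims = len(next(iter(TOY_VECTORS.values())))
    total = [0.0] * dims
    for lemma in lemmas:
        for d, value in enumerate(TOY_VECTORS[lemma]):
            total[d] += value
    return tuple(total)

def nearest(vector, centroids):
    """Index of the centroid closest to the vector (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vector, centroids[i]))

def k_means(vectors, centroids, iterations=10):
    """Plain Lloyd iterations: assign to nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in vectors:
            clusters[nearest(v, centroids)].append(v)
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    return centroids

lemmatized_tags = [
    ["beach", "photograph"], ["seaside", "photograph"],
    ["coast", "photograph"], ["beach", "photo"], ["blue", "shirt"],
]
vectors = [tag_vector(tag) for tag in lemmatized_tags]
# Deterministic starting centroids for the example: the first and last tag.
centroids = k_means(vectors, [vectors[0], vectors[-1]])
labels = [nearest(v, centroids) for v in vectors]
```

In this toy data the four photo tags land in one cluster and “blue shirt” in the other. The project described above used 10 percent of the tag count as the number of centroids; the sketch uses two only to keep the example readable.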

Bringing it all together

Using a combination of the methods above, we were able to develop a series of methodology confidence scores that could be applied to any tag in our dataset, generating a heuristic for how to consider each tag going forward. These were case-level strategies to determine the appropriate methodology. We denoted these as follows:

  • Good Tags: This mostly started as our “do not touch” list of terms which already received traffic from Google. After some confirmation exercises, the list was expanded to include unique terms with rankings potential, commercial appeal, and unique product sets to deliver to customers. For example, a heuristic for this category might look like this:
    1. If tag is identical to Wikipedia entry and
    2. Tag + product has estimated search traffic and
    3. Tag has CPC value then
    4. Mark as “Good Tag”
  • Okay Tags: This represents terms that we would like to retain associated with products and their descriptions, as they could be used within the site to add context to a page, but do not warrant their own indexable space. These tags were mapped to be redirected or canonicalized to a “master,” but still included on a page for topical relevancy, natural language queries, long-tail searches, etc. For example, a heuristic for this category might look like this:
    1. If tag is identical to Wikipedia entry but
    2. Tag + product has no search volume and
    3. Vector tag matches a “Good Tag” then
    4. Mark as “Okay Tag” and redirect to “Good Tag”
  • Bad Tags to Remap: This grouping represents bad tags that were mapped to a replacement. These tags would literally be deleted and replaced with a corrected version. These were most often misspellings or terms discovered through stemming/lemmatization/etc. where a dominant replacement was identified. For example, a heuristic for this category might look like this:
    1. If tag is not identical to either Wikipedia or vector space and
    2. Tag + product has no search volume and
    3. Tag has no search volume and
    4. Tag Wikipedia entry matches a “Good Tag” then
    5. Mark as “Bad Tag to Remap”
  • Bad Tags to Remove: These are tags that were flagged as bad tags that could not be related to a good tag. Essentially, these needed to be removed from our database completely. This final group represented the worst of the worst in the sense that the existence of the tag would likely be considered a negative indicator of site quality. Considerations were made for character length of tags, lack of Wikipedia entries, inability to map to word vectors, no previous traffic, no predicted traffic or CPC value, etc. In many cases, these were nonsense phrases.
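
The example heuristics above can be sketched as a small rule function. This is only an illustration under stated assumptions: the TagSignals fields and the classify function are hypothetical names, and each boolean is assumed to have been computed already by the methods described earlier. The text gives no explicit rule for “Bad Tags to Remove,” so the sketch returns None for anything the three example heuristics do not cover rather than guessing.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TagSignals:
    # All field names are hypothetical; they mirror the conditions in the text.
    identical_to_wikipedia_entry: bool
    identical_to_vector_match: bool
    tag_product_search_volume: int   # estimated searches for "tag + product"
    tag_search_volume: int           # estimated searches for the tag alone
    cpc: float                       # estimated cost-per-click
    vector_match_is_good_tag: bool
    wikipedia_match_is_good_tag: bool

def classify(s: TagSignals) -> Optional[str]:
    """Apply the three example heuristics in order; None means no rule matched."""
    if (s.identical_to_wikipedia_entry
            and s.tag_product_search_volume > 0
            and s.cpc > 0):
        return "Good Tag"
    if (s.identical_to_wikipedia_entry
            and s.tag_product_search_volume == 0
            and s.vector_match_is_good_tag):
        return "Okay Tag"
    if (not s.identical_to_wikipedia_entry
            and not s.identical_to_vector_match
            and s.tag_product_search_volume == 0
            and s.tag_search_volume == 0
            and s.wikipedia_match_is_good_tag):
        return "Bad Tag to Remap"
    return None  # left for further review, e.g. as a removal candidate

good = TagSignals(True, True, 320, 900, 0.75, True, True)
okay = TagSignals(True, False, 0, 40, 0.0, True, True)
remap = TagSignals(False, False, 0, 0, 0.0, False, True)
unmatched = TagSignals(False, False, 0, 0, 0.0, False, False)
```

Ordering the rules from most to least trusted means a tag is never downgraded once it satisfies a stronger rule.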

Altogether, we were able to reduce the number of tags by 87.5%, consolidating the site down to a reasonable, targeted, and useful set of tags which properly organized the corpus without either wasting crawl budget or limiting user engagement.

Conclusions: Advanced white hat SEO

It was nearly nine years ago that a well-known black hat SEO called out white hat SEO as being simple, stale, and bereft of innovation. He claimed that “advanced white hat SEO” was an oxymoron — it simply did not exist. I was proud at the time to respond to his claims with a technique Hive Digital was using which I called “Second Page Poaching.” It was a great technique, but it paled in comparison to the sophistication of methods we now see today. I never envisioned either the depth or breadth of technical proficiency which would develop within the white hat SEO community for dealing with unique but persistent problems facing webmasters.

I sincerely doubt most of the readers here will have the specific tag sprawl problem described above. I’d be lucky if even a few of you have run into it. What I hope is that this post might disabuse us of any caricatures of white hat SEO as facile or stagnant and inspire those in our space to their best work.



Tackling Tag Sprawl: Crawl Budget, Duplicate Content, and User-Generated Content

Posted by rjonesx.

Alright, so here’s the situation. You have a million-product website. Your competitors have a lot of the same products. You need unique content. What do you do? The same thing everyone does — you turn to user-generated content. Problem solved, right?

User-generated content (UGC) can be an incredibly valuable source of content and organization, helping you build natural language descriptions and human-driven organization of site content. One common feature sites use to take advantage of user-created content is tags, found everywhere from e-commerce sites to blogs. Webmasters can leverage tags to power site search, create taxonomies and categories of products for browsing, and provide rich descriptions of site content.

This is a logical and practical approach, but can cause intractable SEO problems if left unchecked. For mega-sites, manually moderating millions of user-submitted tags can be cumbersome (if not wholly impossible). Leaving tags unchecked, though, can create massive problems with thin content, duplicate content, and general content sprawl. In our case study below, three technical SEOs from different companies joined forces to solve a massive tag sprawl problem. The project was led by Jacob Bohall, VP of Marketing at Hive Digital, while computational statistics services were provided by J.R. Oakes of Adapt Partners and Russ Jones of Moz. Let’s dive in.

What is tag sprawl?

We define tag sprawl as the unchecked growth of unique, user-contributed tags resulting in a large amount of near-duplicate pages and unnecessary crawl space. Tag sprawl generates URLs likely to be classified as doorway pages, pages appearing to exist only for the purpose of building an index across an exhaustive array of keywords. You’ve probably seen this in its most basic form in the tagging of posts across blogs, which is why most SEOs recommend a blanket “noindex, follow” across tag pages in WordPress sites. This simple approach can be an effective solution for small blog sites, but is not often the solution for major e-commerce sites that rely more heavily on tags for categorizing products.

The three following tag clouds represent a list of user-generated terms associated with different stock photos. Note: User behavior is generally to place as many tags as possible in an attempt to ensure maximum exposure for their products.

  1. USS Yorktown, Yorktown, cv, cvs-10, bonhomme richard, revolutionary war-ships, war-ships, naval ship, military ship, attack carriers, patriots point, landmarks, historic boats, essex class aircraft carrier, water, ocean
  2. ship, ships, Yorktown, war boats, Patriot pointe, old war ship, historic landmarks, aircraft carrier, war ship, naval ship, navy ship, see, ocean
  3. Yorktown ship, Warships and aircraft carriers, historic military vessels, the USS Yorktown aircraft carrier

As you can see, each user has generated valuable information for the photos, which we would want to use as a basis for creating indexable taxonomies for related stock images. However, at any type of scale, we have immediate threats of:

  • Thin content: Only a handful of products share the user-generated tag when a user creates a more specific/defining tag, e.g. “cvs-10”
  • Duplicate and similar content: Many of these tags will overlap, e.g. “USS Yorktown” vs. “Yorktown,” “ship” vs. “ships,” “cv” vs. “cvs-10,” etc.
  • Bad content: Created by improper formatting, misspellings, verbose tags, hyphenation, and similar mistakes made by users.

Now that you understand what tag sprawl is and how it negatively affects your site, how can we address this issue at scale?

The proposed solution

In correcting tag sprawl, we have some basic (at the surface) problems to solve. We need to review each tag in our database and place it in a group so that further action can be taken. First, we determine the quality of a tag (how likely is someone to search for this tag, is it spelled correctly, is it commercial, is it used for many products) and second, we determine if there is another tag very similar to it that has a higher quality.

  1. Identify good tags: We defined a good tag as a term capable of contributing meaning, and easily justifiable as an indexed page in search results. This also entailed identifying a “master” tag to represent groups of similar terms.
  2. Identify bad tags: We wanted to isolate tags that should not appear in our database due to misspellings, duplicates, poor format, high ambiguity, or a likelihood of producing a low-quality page.
  3. Relate bad tags to good tags: We assumed many of our initial “bad tags” could be a range of duplicates, i.e. plural/singular, technical/slang, hyphenated/non-hyphenated, conjugations, and other stems. There could also be two phrases which refer to the same thing, like “Yorktown ship” vs. “USS Yorktown.” We need to identify these relationships for every “bad” tag.

For the project inspiring this post, our sample tag database comprised over 2,000,000 “unique” tags, making this a nearly impossible feat to accomplish manually. While theoretically we could have leveraged Mechanical Turk or a similar platform to get “manual” review, early tests of this method proved unsuccessful. We would need a programmatic method (several methods, in fact) that we could later reproduce when adding new tags.

The methods

Keeping in mind the goal of identifying good tags, labeling bad tags, and relating bad tags to good tags, we employed more than a dozen methods, including: spell correction, bid value, tag search volume, unique visitors, tag count, Porter stemming, lemmatization, Jaccard index, Jaro-Winkler distance, Keyword Planner grouping, Wikipedia disambiguation, and k-means clustering with word vectors. Each method either helped us determine whether the tag was valuable or, if it was not, helped us identify an alternate tag that was valuable.

Spell correction

  • Method: One of the obvious issues with user-generated content is the occurrence of misspellings. We would regularly find misspellings where a semicolon was typed in place of the letter “L” or words had unintended characters at the beginning or end. Luckily, Linux has an excellent built-in spell checker called Aspell which we were able to use to fix a large volume of issues.
  • Benefits: This offered a quick, early win in that it was fairly easy to identify bad tags when they were composed of words that weren’t included in the dictionary or included characters that were simply inexplicable (like a semicolon in the middle of a word). Moreover, if the corrected word or phrase occurred in the tag list, we could trust the corrected phrase as a potentially good tag, and relate the misspelled term to the good tag. Thus, this method helped us both filter bad tags (misspelled terms) and find good tags (the spell-corrected term).
  • Limitations: The biggest limitation with this methodology was that combinations of correctly spelled words or phrases aren’t necessarily useful for users or the search engine. For example, many of the tags in the database were concatenations of multiple tags where the user space-delimited rather than comma-delimited their submitted tags. Thus, a tag might consist of correctly spelled terms but still be useless in terms of search value. Moreover, there were substantial dictionary limitations, especially with domain names, brand names, and Internet slang. In order to accommodate this, we added a personal dictionary that included a list of the top 10,000 domains according to Quantcast, several thousand brands, and a slang dictionary. While this was helpful, there were still several false recommendations that needed to be handled. For example, we saw “purfect” correct to “perfect,” despite being a pop-culture reference for cat images. We also noticed some users reference this saying as “purrfect,” “purrrfect,” “purrrrfect,” “purrfeck,” etc. Ultimately, we had to rely on other metrics to determine whether we trusted the misspelling recommendations.
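
The idea of relating a misspelled tag to a good tag, while protecting terms in a personal dictionary, can be sketched as follows. To keep the example self-contained it swaps Aspell for Python's standard difflib module, which is a different and much cruder technique; the word lists, the cutoff, and the function name are all assumptions made for illustration.

```python
import difflib

# Hypothetical word lists. The project used Aspell plus a personal dictionary
# of domains, brands, and slang; these small sets only stand in for them.
DICTIONARY = {"perfect", "pontoon", "boat", "facebook"}
PERSONAL_WORDS = {"purrfect", "lolcats"}
GOOD_TAGS = ["facebook", "pontoon boat", "perfect"]

def suggest_replacement(tag, cutoff=0.85):
    """Return a known good tag to remap `tag` to, or None to leave it alone."""
    tag = tag.lower()
    # Tags made only of dictionary or personal-dictionary words are trusted.
    if all(w in DICTIONARY or w in PERSONAL_WORDS for w in tag.split()):
        return None
    # Otherwise look for a sufficiently similar tag that we already trust.
    matches = difflib.get_close_matches(tag, GOOD_TAGS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

facbook_fix = suggest_replacement("facbook")
semicolon_fix = suggest_replacement("pontoon;boat")
purrfect_fix = suggest_replacement("purrfect")
```

Here “facbook” and “pontoon;boat” are related to existing good tags, while “purrfect” is left untouched because the personal dictionary vouches for it.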

Bid value

  • Method: While a tag might be good in the sense that it is descriptive, we wanted tags that were commercially relevant. Using the estimated cost-per-click of the tag or tag phrase proved useful in making sure that the term could attract buyers, not just visitors.
  • Benefits: One of the great features of this methodology is that it tends to have a high signal-to-noise ratio. Most tags that have high CPCs tend to be commercially relevant and searched frequently enough to warrant inclusion as “good tags.” In many cases we could feel confident that a tag was good just on this metric alone.
  • Limitations: However, the bid value metric comes with some pretty big limitations, too. For starters, Google Keyword Planner’s disambiguation problem is readily apparent. Google combines related keywords together when reporting search volume and CPC data, which means a tag like “facbook” would return the same data as “facebook.” Obviously, we would prefer to map “facbook” to “facebook” rather than keep both tags, so in some cases the CPC metric wasn’t sufficient to identify good tags. A further limitation of the bid value was the difficulty of acquiring CPC data. Google now requires running active AdWords campaigns to get access to CPC value. It is no simple feat to look up 5,000,000 keywords in Google Keyword Planner, even if you have a sufficient account. Luckily, we felt comfortable that historical data would be trustworthy enough, so we didn’t need to acquire fresh data.

Tag search volume

  • Method: Similar to CPC, we could use search volume to determine the potential value of a tag. We had to be careful not to rely on the tag itself, though, since the tag could be so generic that it earns traffic unrelated to the product itself. For example, the tag “USS Yorktown” might get a few hundred searches a month, but “USS Yorktown T-shirt” gets 0. For all of the tags in our index, we tracked down the search volume for the tag plus the product name, in order to make sure we had good estimates of potential product traffic.
  • Benefits: Like CPC, this metric did a very good job of consolidating our tag data set to just keywords that were likely to deliver traffic. In the vast majority of cases, if “tag + product” had search volume, we could feel confident that it is a good term.
  • Limitations: Unfortunately, this method fell victim to the same disambiguation problem that CPC presents. Because Google groups terms together, it is possible that on some occasions two tags will be given the same metrics. For example: “pontoons boat,” “pontoonboat,” “pontoon boats,” “pontoon boat,” “pontoon boating,” and “pontoons boats” were in the same traffic volume group which also included tags like “yacht” and “yachts.” Moreover, there is no accounting for keyword difficulty in this metric. Some tags, when combined with product types, produce keywords that receive substantial traffic but will always be out of reach for a templated tag page.

Unique visitors

  • Method: This method was a no-brainer: protect the tags that already receive traffic from Google. We exported all of the tags from Google Analytics that had received search traffic from Google in the last 12 months. Generally speaking, this should be a fairly safe list of terms.
  • Benefits: When doing experimental work with a client, it is always nice to be able to give them a scenario that almost guarantees improvement. Because we were able to protect tags that already receive traffic by labeling them as good (in the vast majority of cases), we could ensure that the client had a high probability of profiting from the changes we made and minimal risk of any traffic loss.
  • Limitations: Unfortunately, even this method wasn’t perfect. If a product (or set of products) with high enough authority included a poor variation of a tag, then the bad variant would rank and receive traffic. We had to use other strategies to verify our selections from this method and devise a method to encourage a tag swap in the index for the correct version of a term.

Tag count

  • Method: The frequency with which a tag was used on the site was often a strong signal that we could trust the tag, especially when compared with other similar tags. By counting the number of times each tag was used on the site, we could bias our final set of trusted tags in favor of these more popular terms.
  • Benefits: This was a great tie-breaker metric when we had two tags that were very similar but needed to choose just one. For example, sometimes two variants of a phrase were completely acceptable (such as a version with and without a hyphen). We could simply defer to the one with a higher tag count.
  • Limitations: The clear limitation of tag frequency is that many of the most frequent tags were too generic to be useful. The tag “blue” isn’t particularly useful when it just helps people find “blue t-shirts.” The term is too generic and too competitive to warrant inclusion. Additionally, the inclusion of too broad of a tag would simply create a very large crawl vs. traffic-potential ratio. A common tag will have hundreds if not thousands of matching products, creating many pages of products for the single tag. If a tag produces 50 paginated product listings, but only has the potential to drive 10 visitors a year, it might not be worth it.
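
As a rough illustration of the tie-breaker and of the crawl vs. traffic-potential ratio, here is a short sketch. The tag assignments are invented, and both helper names are hypothetical.

```python
from collections import Counter

# Hypothetical data: one entry per (product, tag) pairing found on the site.
tag_assignments = [
    "t-shirt", "t-shirt", "t-shirt", "tshirt",
    "war ship", "warship", "warship",
]
tag_counts = Counter(tag_assignments)

def pick_preferred(variants, counts):
    """Among acceptable variants, keep the one used most often on the site."""
    return max(variants, key=lambda tag: counts[tag])

def listing_pages_per_visitor(listing_pages, yearly_visitors):
    """Crude crawl cost vs. traffic potential; higher means less worthwhile."""
    if yearly_visitors == 0:
        return float("inf")
    return listing_pages / yearly_visitors

preferred_shirt = pick_preferred(["t-shirt", "tshirt"], tag_counts)
preferred_ship = pick_preferred(["war ship", "warship"], tag_counts)
# The example from the text: 50 paginated listings for 10 visitors a year.
ratio = listing_pages_per_visitor(50, 10)
```

A ratio like this only ranks tags against one another; where to draw the cutoff is a judgment call the text does not specify.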

Porter stemming

  • Method: Stemming is a method used to identify the root word from a tag by scanning the word right to left and using various pattern matching rules to remove characters (suffixes) until you arrive at the word’s stem. There are a couple of popular stemmers available, but we found Porter stemming to be more accurate as a tool for seeing alternative word forms. You can geek out by looking at the Porter stemming algorithm in Snowball here, or you can play with a JS version here.
  • Benefits: Plural and possessive terms can be grouped by their stem for further analysis. Running Porter stemming on the terms “pony” and “ponies” will return “poni” as the stem, which can then be used to group terms for further analysis. You can also run Porter stemming on phrases. For example, “boating accident,” “boat accidents,” “boating accidents,” etc. share the stem “boat accid.” This can be a crude and quick method for grouping variations. Porter stemming is also able to clean text more kindly, where other stemmers can be too aggressive for our efforts; e.g., Lancaster stemmer reduces “woman” to “wom,” while Porter stemmer leaves it as “woman.”
  • Limitations: Stemming is intended for finding a common root for terms and phrases, and does not create any type of indication as to the proper form of a term. The Porter stemming method applies a fixed set of rules to the English language by blanket removing trailing “s,” “e,” “ance,” “ing,” and similar word endings to try to find the stem. For this to work well, you have to have all of the correct rules (and exceptions) in place to get the correct stems in all cases. This can be particularly problematic with words that end in S but are not plural, like “billiards” or “Brussels.” Additionally, this method does not help with mapping related terms such as “boat crash,” “crashed boat,” “boat accident,” etc. which would stem to “boat crash,” “crash boat,” and “boat accid.”

Lemmatization

  • Method: Lemmatization works similarly to stemming. However, instead of using a rule set for editing words by removing letters to arrive at a stem, lemmatization attempts to map the term to its simplest dictionary form, using a lexical database such as WordNet, and return a canonical “lemma” of the word. A crude way to think about lemmatization is just simplifying a word. Here’s an API to check out.
  • Benefits: This method often works better than stemming. Terms like “ship,” “shipped,” and “ships” are all mapped to “ship” by this method, while “shipping” or “shipper,” which are terms that have distinct meaning despite the same stem, are retained. You can create an array of “lemma” from phrases which can be compared to other phrases resolving word order issues. This proved to be a more reliable method for grouping variations than stemming.
  • Limitations: As with many of the methods, context for mapping related terms can be difficult. Lemmatization can provide better filters for context, but to do so generally relies on identifying the word form (noun, adjective, etc.) to appropriately map to a root term. Given the inconsistency of the user-generated content, it is inaccurate to assume all words are in adjective form (describing a product), or noun form (the product itself). This inconsistency can present wild results. For example, “strip socks” could be intended as a tag for socks with a strip of color on them, such as “striped socks,” or it could be “stripper socks” or some other leggings that would be a match only found if there were other products and tags to compare for context. Additionally, it doesn’t create associations between all related words, just textual derivatives, so you are still seeking out a canonical between mailman, courier, shipper, etc.

Jaccard index

  • Method: The Jaccard index is a similarity coefficient measured by Intersection over Union. Now, don’t run off just yet, it is actually quite straightforward.

    Imagine you had two piles with 3 marbles in each: Red, Green, and Blue in the first, Red, Green and Yellow in the second. The “Intersection” of these two piles would be Red and Green, since both piles have those two colors. The “Union” would be Red, Green, Blue and Yellow, since that is the complete list of all the colors. The Jaccard index would be 2 (Red and Green) divided by 4 (Red, Green, Blue, and Yellow). Thus, the Jaccard index of these two piles would be .5. The higher the Jaccard index, the more similar the two sets.
    So what does this have to do with tags? Well, imagine we have two tags: “ocean” and “sea.” We can get a list of all of the products that have the tag “ocean” and a list of all of the products that have the tag “sea.” Finally, we get the Jaccard index of those two sets. The higher the score, the more related they are. Perhaps we find that 70% of the products with the tag “ocean” also have the tag “sea”; we now know that the two are fairly well-related. However, when we run the same measurement to compare “basement” and “casement,” we find that they only have a Jaccard index of .02. Even though they are very similar in terms of characters, they mean quite different things. We can rule out mapping the two terms together.
  • Benefits: The greatest benefit of using the Jaccard index is that it allows us to find highly related tags which may have absolutely no textual characteristics in common, and are more likely to have an overly similar or duplicate results set. While most of the metrics we have considered so far help us find “good” or “bad” tags, the Jaccard index helps us find “related” tags without having to do any complex machine learning.
  • Limitations: While certainly useful, the Jaccard index methodology has its own problems. The biggest issue we ran into had to do with tags that were used together nearly all the time but weren’t substitutes of one another. For example, consider the tags “babe ruth” and his nickname, “sultan of swat.” The latter tag only occurred on products which also had the “babe ruth” tag (since this was one of his nicknames), so they had quite a high Jaccard index. However, Google doesn’t map these two terms together in search, so we would prefer to keep the nickname and not simply redirect it to “babe ruth.” We needed to dig deeper if we were to determine when we should keep both tags or when we should redirect one to another. As a standalone, this method also was not sufficient at identifying cases where a user consistently misspelled tags or used incorrect syntax, as their products would essentially be orphans without “union.”
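
The Jaccard calculation described above is short enough to show in full. The sketch below reproduces the marble example and then applies the same function to tags; the product IDs are invented for illustration, and the function name is an assumption.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# The marble example from the text: 2 shared colors out of 4 distinct colors.
pile_one = {"red", "green", "blue"}
pile_two = {"red", "green", "yellow"}
marble_score = jaccard_index(pile_one, pile_two)

# Hypothetical product IDs carrying each tag.
products_by_tag = {
    "ocean": {101, 102, 103, 104},
    "sea": {102, 103, 104, 105},
    "basement": {201, 202},
    "casement": {301, 302},
}
ocean_sea = jaccard_index(products_by_tag["ocean"], products_by_tag["sea"])
basement_casement = jaccard_index(
    products_by_tag["basement"], products_by_tag["casement"]
)
```

With this toy data, “ocean” and “sea” share three of five products, while “basement” and “casement” share none, even though the two words differ by a single character.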

Jaro-Winkler distance

  • Method: There are several edit distance and string similarity metrics that we used throughout this process. Edit distance is simply some measurement of how difficult it is to change one word to another. For example, the most basic edit distance metric, Levenshtein distance, between “Russ Jones” and “Russell Jones” is 3 (you have to add “E,” “L,” and “L” to transform Russ to Russell). This can be used to help us find similar words and phrases. In our case, we used a particular edit distance measure called “Jaro-Winkler distance” which gives higher precedence to words and phrases that are similar at the beginning. For example, “Baseball” would be closer to “Baseballer” than to “Basketball” because the differences are at the very end of the term.
  • Benefits: Edit distance metrics helped us find many very similar variants of tags, especially when the variants were not necessarily misspellings. This was particularly valuable when used in conjunction with the Jaccard index metrics, because we could apply a character-level metric on top of a character-agnostic metric (i.e. one that cares about the letters in the tag and one that doesn’t).
  • Limitations: Edit distance metrics can be kind of stupid. According to Jaro-Winkler distance, “Baseball” and “Basketball” are far more related to one another than “Baseball” and “Pitcher” or “Catcher.” “Round” and “Circle” have a horrible edit distance metric, while “Round” and “Pound” look very similar. Edit distance simply cannot be used in isolation to find similar tags.
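
For readers who want to experiment, here is a plain-Python sketch of both measures mentioned above. It follows the standard textbook definitions rather than any particular library, and it reports Jaro-Winkler as a similarity between 0 and 1 (higher means closer), using the conventional prefix weight of 0.1 over at most four characters; those parameter choices are assumptions, not details from the project.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

def jaro(s1, s2):
    """Jaro similarity: matches within a sliding window, minus transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i, ch in enumerate(s1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if ch != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Boost the Jaro score for strings sharing a prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)

russ_distance = levenshtein("Russ Jones", "Russell Jones")
to_baseballer = jaro_winkler("baseball", "baseballer")
to_basketball = jaro_winkler("baseball", "basketball")
to_pitcher = jaro_winkler("baseball", "pitcher")
```

On these inputs the sketch agrees with the text: “baseball” scores closer to “baseballer” than to “basketball,” and both score far higher than “pitcher,” which is exactly the weakness described in the limitations above.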

Keyword Planner grouping

  • Method: While Google’s choice to combine similar keywords in Keyword Planner has been problematic for predicting traffic, it has actually offered us a new method to identify highly related terms. Whenever two tags share identical metrics from Google Keyword Planner (average monthly traffic, historical traffic, CPC, and competition), we can conclude that there is an increased chance the two are related to one another.
  • Benefits: This method is extremely useful for acronyms (which are particularly difficult to detect). While Google groups together COO and Chief Operating Officer, you can imagine that standard methods like those outlined above might have problems detecting the relationship.
  • Limitations: The greatest drawback for this methodology was that it created numerous false positives among less popular terms. There are just too many keywords which have an annual search volume average of 10, are searched 10 times monthly, and have a CPC and competition of 0. Thus, we had to limit the use of this methodology to more popular terms where there were only a handful of matches.
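
Grouping tags by identical metrics, with a popularity floor to limit false positives, might look like the sketch below. The metric values are invented, the tuple is simplified to three fields (the text also mentions historical traffic), and the threshold and function name are assumptions.

```python
from collections import defaultdict

# Hypothetical metrics per tag:
# (average monthly searches, CPC in dollars, competition score)
metrics_by_tag = {
    "coo": (12100, 1.85, 0.12),
    "chief operating officer": (12100, 1.85, 0.12),
    "pontoon boat": (40500, 0.95, 1.0),
    "pontoon boats": (40500, 0.95, 1.0),
    "zzz widget": (10, 0.0, 0.0),
    "qqq gadget": (10, 0.0, 0.0),
}

def group_by_identical_metrics(metrics, min_monthly_searches=100):
    """Group tags whose metrics are identical, skipping low-volume tags."""
    groups = defaultdict(list)
    for tag, values in metrics.items():
        if values[0] < min_monthly_searches:
            continue  # too many unrelated tags share near-zero metrics
        groups[values].append(tag)
    # Only groups with at least two tags suggest a relationship.
    return sorted(sorted(tags) for tags in groups.values() if len(tags) > 1)

related_groups = group_by_identical_metrics(metrics_by_tag)
unfiltered_groups = group_by_identical_metrics(metrics_by_tag, min_monthly_searches=0)
```

Dropping the threshold to zero pulls the two unrelated low-volume tags into a group of their own, which is the false-positive problem the limitation describes.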

Wikipedia disambiguation

  • Method: Many of the methods above are great for grouping similar/related terms, but do not provide a high-confidence method for determining the “master” term or phrase to represent a grouping of related/duplicate terms. While considerations can be made for testing all tags against an English language model, the lack of pop culture references and phrases makes it unreliable. To do this effectively, we found Wikipedia to be a trusted source for identifying the proper spelling, tense, formatting, and word order for any given tag. For example, if users tagged a product as “Lord of the Rings,” “LOTR,” and “The Lord of the Rings,” it can be difficult to determine which tag should be preferred (certainly we don’t need all 3). If you search Wikipedia for these terms, you will see that they redirect you to the page titled “The Lord of the Rings.” In many cases, we can trust their canonical variant as the “good tag.” Please note that we don’t encourage scraping any website or violating their terms of use. Wikipedia does offer an export of their entire database that can be used for research purposes.
  • Benefits: When a tag could be mapped to a Wikipedia entry, this method proved highly effective at validating that a tag had potential value, or at creating a point of reference for related tags. If the Wikipedia community felt a tag or tag phrase was important enough to have an article dedicated to it, then the tag was more likely to be a valuable term than a random entry or keyword stuffing by the user. Further, the methodology allows for grouping related terms without any bias on word order. Doing a search on Wikipedia creates a search results page (“pontoon boats”), or redirects you to a correction of the article (“disneyworld” becomes “Walt Disney World”). Wikipedia also tends to have entries for some pop culture references, so things that would get flagged as a misspelling, such as “lolcats,” can be vindicated by the existence of a matching Wikipedia article.
  • Limitations: While Wikipedia is effective at delivering a consistent formal tag for disambiguation, it can at times be more sterile than user-friendly. This can run counter to other signals such as CPC or traffic volume methods. For example, “pontoon boats” becomes “Pontoon (Boat)”, or “Lily” becomes “lilium.” All signals indicate the former case as the most popular, but Wikipedia disambiguation suggests the latter to be the correct usage. Wikipedia also contains entries for very broad terms, like each number, year, letter, etc. so simply applying a rule that any Wikipedia article is an allowed tag would continue to contribute to tag sprawl problems.

K-means clustering with word vectors

  • Method: Finally, we attempted to transform the tags into a subset of more meaningful tags using word embeddings and k-means clustering. Generally, the process involved transforming the tags into tokens (individual words), then refining by part-of-speech (noun, verb, adjective), and finally lemmatizing the tokens (“blue shirts” becomes “blue shirt”). From there, we embedded the tokens with a custom Word2Vec model and represented each tag by adding together the vectors of its token array. We created a label array and a vector array covering every tag in the dataset, then ran k-means with the number of centroids set to 10 percent of the total tag count. At first we tested on 30,000 tags and obtained reasonable results.
    Once k-means had completed, we pulled all of the centroids and obtained their nearest relative from the custom Word2Vec model, then we assigned the tags to their centroid category in the main dataset.

    Tag Tokens Tag Pos Tag Lemm. Categorization
    [‘beach’, ‘photographs’] [(‘beach’, ‘NN’), (‘photographs’, ‘NN’)] [‘beach’, ‘photograph’] beach photo
    [‘seaside’, ‘photographs’] [(‘seaside’, ‘NN’), (‘photographs’, ‘NN’)] [‘seaside’, ‘photograph’] beach photo
    [‘coastal’, ‘photographs’] [(‘coastal’, ‘JJ’), (‘photographs’, ‘NN’)] [‘coastal’, ‘photograph’] beach photo
    [‘seaside’, ‘photographs’] [(‘seaside’, ‘NN’), (‘photographs’, ‘NN’)] [‘seaside’, ‘photograph’] beach photo
    [‘seaside’, ‘posters’] [(‘seaside’, ‘NN’), (‘posters’, ‘NNS’)] [‘seaside’, ‘poster’] beach photo
    [‘coast’, ‘photographs’] [(‘coast’, ‘NN’), (‘photographs’, ‘NN’)] [‘coast’, ‘photograph’] beach photo
    [‘beach’, ‘photos’] [(‘beach’, ‘NN’), (‘photos’, ‘NNS’)] [‘beach’, ‘photo’] beach photo
    The Categorization column above shows the centroid selected by k-means. Notice how it handled the matching of “seaside” to “beach” and “coastal” to “beach.”
  • Benefits: This method seemed to do a good job of finding associations between the tags and their categories that were more semantic than character-driven. “Blue shirt” might be matched to “clothing.” This was obviously not possible without the semantic relationships found within the vector space.
  • Limitations: Ultimately, the chief limitation we encountered was running k-means on the full two million tags, which meant ending up with 200,000 categories (centroids). Sklearn for Python allows for multiple concurrent jobs, but only across the initializations of the centroids, of which there were 11 in this case. Even on a 60-core processor, the number of concurrent jobs was therefore capped at 11. We tried PCA (principal component analysis) to reduce the vector sizes (300 to 10), but the results were poor overall. In addition, because embeddings are generally built on the probabilistic closeness of terms in the corpus on which they were trained, some matches were understandable but obviously not the correct category (e.g., “19th century art” was picked as a category for “18th century art”). Finally, context matters, and word embeddings obviously struggle to understand the difference between “duck” (the animal) and “duck” (the action).
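To make the pipeline above more concrete, here is a rough, dependency-free sketch of the same idea: sum the token vectors for each tag, cluster the tag vectors with k-means, and label each cluster. Everything here is an assumption for illustration. The 2-D `EMBEDDINGS` values are invented stand-ins for a trained Word2Vec model, the hand-rolled `kmeans` replaces Sklearn, and each cluster is labeled with its nearest member tag rather than the nearest term in the embedding model.

```python
import math

# Toy 2-D vectors standing in for a trained Word2Vec model (invented values).
EMBEDDINGS = {
    "beach": (1.0, 0.1), "seaside": (0.9, 0.2), "coast": (0.95, 0.15),
    "photograph": (0.1, 1.0), "photo": (0.1, 0.9), "poster": (0.3, 0.8),
    "blue": (-1.0, 0.2), "shirt": (-0.9, -0.5), "tee": (-0.8, -0.6),
}


def tag_vector(tokens):
    """Sum the vectors of a tag's lemmatized tokens (all tokens must be known)."""
    vecs = [EMBEDDINGS[t] for t in tokens]
    return tuple(sum(dim) for dim in zip(*vecs))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm with a deterministic init (the first k points)."""
    centroids = list(points[:k])
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return centroids, assignment


def label_clusters(tags, vectors, centroids, assignment):
    """Label each cluster with the member tag closest to its centroid."""
    labels = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            nearest = min(members, key=lambda i: math.dist(vectors[i], centroid))
            labels[c] = " ".join(tags[nearest])
    return labels


if __name__ == "__main__":
    tags = [["beach", "photograph"], ["blue", "shirt"], ["seaside", "photograph"],
            ["coast", "photograph"], ["beach", "photo"], ["blue", "tee"],
            ["seaside", "poster"]]
    vectors = [tag_vector(t) for t in tags]
    centroids, assignment = kmeans(vectors, k=2)
    labels = label_clusters(tags, vectors, centroids, assignment)
    for tag, cluster in zip(tags, assignment):
        print(" ".join(tag), "->", labels[cluster])
```

The toy example uses k=2 for seven tags purely for readability. Our runs set k to 10 percent of the tag count, which is exactly what made the full two-million-tag run so expensive.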

Bringing it all together

Using a combination of the methods above, we were able to develop a series of methodology confidence scores that could be applied to any tag in our dataset, generating a heuristic for how to consider each tag going forward. These were case-level strategies to determine the appropriate methodology. We denoted these as follows:

  • Good Tags: This mostly started as our “do not touch” list of terms which already received traffic from Google. After some confirmation exercises, the list was expanded to include unique terms with rankings potential, commercial appeal, and unique product sets to deliver to customers. For example, a heuristic for this category might look like this:
    1. If tag is identical to Wikipedia entry and
    2. Tag + product has estimated search traffic and
    3. Tag has CPC value then
    4. Mark as “Good Tag”
  • Okay Tags: This represents terms that we would like to retain associated with products and their descriptions, as they could be used within the site to add context to a page, but do not warrant their own indexable space. These tags were mapped to be redirected or canonicaled to a “master,” but still included on a page for topical relevancy, natural language queries, long-tail searches, etc. For example, a heuristic for this category might look like this:
    1. If tag is identical to Wikipedia entry but
    2. Tag + product has no search volume
    3. Vector tag matches a “Good Tag”
    4. Mark as “Okay Tag” and redirect to “Good Tag”
  • Bad Tags to Remap: This grouping represents bad tags that were mapped to a replacement. These tags would literally be deleted and replaced with a corrected version. These were most often misspellings or terms discovered through stemming/lemmatization/etc. where a dominant replacement was identified. For example, a heuristic for this category might look like this:
    1. If tag is not identical to either Wikipedia or vector space and
    2. Tag + product has no search volume
    3. Tag has no volume
    4. Tag Wikipedia entry matches a “Good Tag”
    5. Mark as “Bad Tag to Remap”
  • Bad Tags to Remove: These are tags that were flagged as bad tags that could not be related to a good tag. Essentially, these needed to be removed from our database completely. This final group represented the worst of the worst in the sense that the existence of the tag would likely be considered a negative indicator of site quality. Considerations were made for character length of tags, lack of Wikipedia entries, inability to map to word vectors, no previous traffic, no predicted traffic or CPC value, etc. In many cases, these were nonsense phrases.
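The four categories above can be read as a simple rule cascade. The sketch below is one hedged way to express it, not the scoring code we used. The `TagSignals` field names are hypothetical, the two "no volume" checks in the remap heuristic are collapsed into a single flag, and the "Needs Review" bucket is an added assumption for tags that fit none of the example heuristics.

```python
from dataclasses import dataclass


@dataclass
class TagSignals:
    """Per-tag signals. Field names are hypothetical, chosen for this sketch."""
    matches_wikipedia: bool        # tag is identical to a Wikipedia entry
    has_search_traffic: bool       # tag + product has estimated search traffic
    has_cpc_value: bool            # tag has CPC value
    vector_match_is_good: bool     # vector-space match is a "Good Tag"
    wikipedia_match_is_good: bool  # tag's Wikipedia entry matches a "Good Tag"


def classify(s: TagSignals) -> str:
    """Apply the example heuristics in order and return a category name."""
    if s.matches_wikipedia and s.has_search_traffic and s.has_cpc_value:
        return "Good Tag"
    if s.matches_wikipedia and not s.has_search_traffic and s.vector_match_is_good:
        return "Okay Tag"  # redirect or canonical to the matching "Good Tag"
    if (not s.matches_wikipedia and not s.has_search_traffic
            and s.wikipedia_match_is_good):
        return "Bad Tag to Remap"
    no_signals = not any([
        s.matches_wikipedia, s.has_search_traffic, s.has_cpc_value,
        s.vector_match_is_good, s.wikipedia_match_is_good,
    ])
    if no_signals:
        return "Bad Tag to Remove"
    # Assumption: anything else is left for manual review.
    return "Needs Review"


if __name__ == "__main__":
    print(classify(TagSignals(True, True, True, False, False)))
    print(classify(TagSignals(False, False, False, False, False)))
```

Ordering matters here: the rules run from most to least valuable, so a tag is only considered for removal after every way of keeping or remapping it has failed.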

Altogether, we were able to reduce the number of tags by 87.5%, consolidating the site down to a reasonable, targeted, and useful set of tags that properly organized the corpus without either wasting crawl budget or limiting user engagement.

Conclusions: Advanced white hat SEO

It was nearly nine years ago that a well-known black hat SEO called out white hat SEO as being simple, stale, and bereft of innovation. He claimed that “advanced white hat SEO” was an oxymoron — it simply did not exist. I was proud at the time to respond to his claims with a technique Hive Digital was using which I called “Second Page Poaching.” It was a great technique, but it paled in comparison to the sophistication of methods we now see today. I never envisioned either the depth or breadth of technical proficiency which would develop within the white hat SEO community for dealing with unique but persistent problems facing webmasters.

I sincerely doubt most of the readers here will have the specific tag sprawl problem described above. I’d be lucky if even a few of you have run into it. What I hope is that this post might disabuse us of any caricatures of white hat SEO as facile or stagnant and inspire those in our space to their best work.



Lessons from 1,000 Voice Searches (on Google Home)

Posted by Dr-Pete

It’s hardly surprising that Google Home is an extension of Google’s search ecosystem. Home is attempting to answer more and more questions, drawing those answers from search results. There’s an increasingly clear connection between Featured Snippets in search and voice answers.

For example, let’s say a hedgehog wanders into your house and you naturally find yourself wondering what you should feed it. You might search for “What do hedgehogs eat?” On desktop, you’d see a Featured Snippet like the following:

Given that you’re trying to wrangle a strange hedgehog, searching on your desktop may not be practical, so you ask Google Home: “Ok, Google — What do hedgehogs eat?” and hear the following:

Google Home leads with the attribution to Ark Wildlife (since a voice answer has no direct link), and then repeats a short version of the desktop snippet. The connection between the two answers is, I hope, obvious.

Anecdotally, this is a pattern we see often on Google Home, but how consistent is it? How does Google handle Featured Snippets in other formats (including lists and tables)? Are some questions answered wildly differently by Google Home compared to desktop search?

Methodology (10K → 1K)

To find out the answer to these questions, I needed to start with a fairly large set of searches that were likely to generate answers in the form of Featured Snippets. My colleague Russ Jones pulled a set of roughly 10,000 popular searches beginning with question words (Who, What, Where, Why, When, How) from a third-party “clickstream” source (actual web activity from a very large set of users).

I ran those searches on desktop (automagically, of course) and found that just over half (53%) had Featured Snippets. As we’ve seen in other data sets, Google is clearly getting serious about direct answers.

The overall set of popular questions was dominated by “What?” and “How?” phrases:

Given the prevalence of “How to?” questions, I’ve broken them out in this chart. The purple bars show how many of these searches generated Featured Snippets. “How to?” questions were very likely to display a Featured Snippet, with other types of questions displaying them less than half of the time.
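For anyone who wants to run a similar breakdown on their own keyword list, the bucketing can be sketched in a few lines. This is a minimal illustration rather than the script used for the study. The `question_bucket` and `snippet_rate_by_bucket` helpers are hypothetical names, and the snippet flags in the sample rows are placeholders (the last row is an invented query added only to show a miss).

```python
from collections import Counter

QUESTION_WORDS = ("who", "what", "where", "why", "when", "how")


def question_bucket(query):
    """Return the question bucket for a query, splitting out "how to"."""
    words = query.lower().split()
    if not words or words[0] not in QUESTION_WORDS:
        return None  # not a question-word search
    if words[0] == "how" and len(words) > 1 and words[1] == "to":
        return "how to"
    return words[0]


def snippet_rate_by_bucket(rows):
    """rows: (query, has_featured_snippet) pairs -> {bucket: share with snippets}."""
    totals, with_snippet = Counter(), Counter()
    for query, has_snippet in rows:
        bucket = question_bucket(query)
        if bucket is None:
            continue
        totals[bucket] += 1
        with_snippet[bucket] += int(has_snippet)
    return {bucket: with_snippet[bucket] / totals[bucket] for bucket in totals}


if __name__ == "__main__":
    rows = [
        ("What do hedgehogs eat?", True),
        ("How to draw a dinosaur?", True),
        ("How to bake chicken breast?", True),
        ("Why do we sneeze?", True),
        ("how tall is a hedgehog", False),  # invented query, placeholder flag
    ]
    print(snippet_rate_by_bucket(rows))
```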

Of the roughly 5,300 searches in the full data set that had Featured Snippets, those snippets broke down into four types, as follows:

Text snippets — paragraph-based answers like the one at the top of this post — accounted for roughly two-thirds of all of the Featured Snippets in our original data set. List snippets accounted for just under one-third — these are bullet lists, like this one for “How to draw a dinosaur?”:

Step 1 – Draw a small oval. Step 5 – Dinosaur! It’s as simple as that.

Table snippets made up less than 2% of the Featured Snippets in our starting data set. These snippets contain a small amount of tabular data, like this search for “What generation am I?”:

If you throw your money recklessly at your avocado toast habit instead of buying a house, you’re probably a millennial (sorry, content marketing joke).

Finally, video snippets are a special class of Featured Snippet with a large video thumbnail and direct link (dominated by YouTube). Here’s one for “Who is the spiciest memelord?”:

I’m honestly not sure what commentary I can add to that result. Since there’s currently no way for a video to appear on Google Home, we excluded video snippets from the rest of the study.

Google has also been testing some hybrid Featured Snippets. In some cases, for example, they attempt to extract a specific answer from the text, such as this answer for “When was 1984 written?” (Hint: the answer is not 1984):

For the purposes of this study, we treated these hybrids as text snippets. Given the concise answer at the top, these hybrids are well-suited to voice results.

From the 5.3K questions with snippets, I selected 1,000, excluding video but purposely including a disproportionate number of list and table types (to better see if and how those translated into voice).
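The selection step is essentially a quota-based sample by snippet type. A rough sketch follows, with the caveat that the quotas and the toy population are invented for illustration. The actual mix of types in the 1,000 is not broken out here, and `oversample` is a hypothetical helper name.

```python
import random


def oversample(snippets, quotas, seed=0):
    """Pick up to quotas[type] queries per snippet type.

    snippets: (query, snippet_type) pairs.
    quotas:   {snippet_type: max_count}; types without a quota are excluded.
    """
    rng = random.Random(seed)  # seeded so the sample is repeatable
    by_type = {}
    for query, snippet_type in snippets:
        by_type.setdefault(snippet_type, []).append(query)
    sample = []
    for snippet_type, quota in quotas.items():
        pool = by_type.get(snippet_type, [])
        chosen = rng.sample(pool, min(quota, len(pool)))
        sample.extend((query, snippet_type) for query in chosen)
    return sample


if __name__ == "__main__":
    # Toy population: mostly text, some lists, very few tables, one video.
    population = (
        [(f"text query {i}", "text") for i in range(20)]
        + [(f"list query {i}", "list") for i in range(8)]
        + [(f"table query {i}", "table") for i in range(2)]
        + [("video query 0", "video")]
    )
    # Invented quotas that deliberately over-represent lists and tables.
    picked = oversample(population, {"text": 5, "list": 4, "table": 3})
    print(len(picked), "queries selected")
```

Leaving "video" out of the quotas is what excludes it, and a rare type such as tables simply contributes everything it has when its quota exceeds the pool.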

Why only 1,000? Because, unlike desktop searches, there’s no easy way to do this. Over the course of a couple of days, I had to run all of these voice searches manually on Google Home. It’s possible that I went temporarily insane. At one point, I saw a spider on my Google Home staring back at me. Fearing that I was hallucinating, I took a picture and posted it on Twitter:

I was assured that the spider was, in point of fact, not a figment of my imagination. I’m still not sure about the half-hour when the spider sang me selections from the Hamilton soundtrack.

From snippets to voice answers

So, how many of the 1,000 searches yielded voice answers? The short answer is: 71%. Diving deeper, it turns out that this percentage is strongly dependent on the type of snippet:

Text snippets in our 1K data set yielded voice answers 87% of the time. List snippets dropped to just under half, and table snippets only generated voice answers one-third of the time. This makes sense — long lists and most tables are simply harder to translate into voice.

In the case of tables, some of these results were from different sites or in a different format. In other words, the search generated a Featured Snippet and a voice answer, but the voice answer was of a different type (text, for example) and attributed to a different source. Only 20% of Featured Snippets in table format generated voice answers that came from the same source.

From a search marketing standpoint, text snippets are going to generate a voice answer almost 9 out of 10 times. Optimizing for text/paragraph snippets is a good starting point for ranking on voice search and should generally be a win-win across devices.

Special: Knowledge Graph

What about the Featured Snippets that didn’t generate voice answers? It turns out there was quite a variety of exceptions in play. One exception was answers that came directly from the Knowledge Graph on Google Home, without any attribution. For example, the question “What is the nuclear option?” produces this Featured Snippet (for me, at least) on desktop:

On Google Home, though, I get an unattributed answer that seems to come from the Knowledge Graph:

It’s unclear why Google has chosen one over the other for voice in this particular case. Across the 1,000 keyword set, there were about 30 keywords where something similar happened.

Special: Device help

Google Home seems to translate some searches as device-specific help. For example, “How to change your name?” returns desktop results about legally changing your name as an individual. On Google Home, I get the following:

Other searches from our list that triggered device help include:

  • How to contact Google?
  • How to send a fax online?
  • What are you up to?

Special: Easter eggs

Google Home has some Easter eggs that seem unique to voice search. One of my personal favorites — the question “What is best in life?” — generates the following:

Here’s a list of the other Easter eggs in our 1,000 phrase data set:

  • How many letters are in the alphabet?
  • What are your strengths?
  • What came first, the chicken or the egg?
  • What generation am I?
  • What is the meaning of life?
  • What would you do for a Klondike bar?
  • Where do babies come from?
  • Where in the world is Carmen Sandiego?
  • Where is my iPhone?
  • Where is Waldo?
  • Who is your daddy?

Easter eggs are a bit less predictable than device help. Generally speaking, though, both are rare and shouldn’t dissuade you from trying to rank for Featured Snippets and voice answers.

Special: General confusion

In a handful of cases, Google simply didn’t understand the question or couldn’t answer the exact question. For example, I could not get Google to understand the question “What does MAGA mean?” The answer I got back (maybe it’s my Midwestern accent?) was:

On second thought, maybe that’s not entirely inaccurate.

One interesting case is when Google decides to answer a slightly different question. On desktop, if you search for “How to become a vampire?”, you might see the following Featured Snippet:

On Google Home, I’m asked to clarify my intent:

I suspect both of these cases will improve over time, as voice recognition continues to advance and Google becomes better at surfacing answers.

Special: Recipe results

Back in April, Google launched a new set of recipe functions across search and Google Home. Many “How to?” questions related to cooking now generate something like this (the question I asked was “How to bake chicken breast?”):

You can opt to find a recipe on Google search and send it to your Google Home, or Google can simply pick a recipe for you. Either way, it will guide you through step-by-step instructions.

Special: Health conditions

A half-dozen or so health questions, from general questions to diseases, generated results like the following. This one is for the question “Why do we sneeze?”:

This has no clear connection to desktop search results, and I’m not clear if it’s a signal for future, expanded functionality. It seems to be of limited use right now.

Special: WikiHow

A handful of “How to?” questions triggered an unusual response. For example, if I ask Google Home “How to write a press release?” I get back:

If I say “yes,” I’m taken directly to a wikiHow assistant that uses a different voice. The wikiHow answers are much longer than text-based Featured Snippets.

How should we adapt?

Voice search and voice appliances (including Google Assistant and Google Home) are evolving quickly right now, and it’s hard to know where any of this will be in the next couple of years. From a search marketing standpoint, I don’t think it makes sense to drop everything to invest in voice, but I do think we’ve reached a point where some forward momentum is prudent.

First, I highly recommend simply being aware of how your industry and your major keywords/questions “appear” on Google Home (or Google Assistant on your mobile device). Look at the recipe situation above — for 99%+ of the people reading this article, that’s a novelty. If you’re in the recipe space, though, it’s game-changing, and it’s likely a sign of more to come.

Second, I feel strongly that Featured Snippets are a win-win right now. Almost 90% of the text-only Featured Snippets we tracked yielded a voice answer. These snippets are also prominent on desktop and mobile searches. Featured Snippets are a great starting point for understanding the voice ecosystem and establishing your foothold.

