
Introducing Progressive Web Apps: What They Might Mean for Your Website and SEO

Posted by petewailes

Progressive Web Apps. Ah yes, those things that Google would have you believe are a combination of Gandhi and Dumbledore, come to save the world from the terror that is the Painfully Slow Website™.

But what actually makes a PWA? Should you have one? And if you create one, how will you make sure it ranks? Well, read on to find out…

What’s a PWA?

Given that Google came up with the term, I thought we'd kick off with their definition:

“A Progressive Web App uses modern web capabilities to deliver an app-like user experience.”
Progressive Web Apps

The really exciting thing about PWAs: they could make app development less necessary. Your mobile website becomes your app. Speaking to some of my colleagues at Builtvisible, this seemed to be a point of interesting discussion: do brands need an app and a website, or a PWA?

Fleshing this out a little, this means we’d expect things like push notifications, background sync, the site/app working offline, having a certain look/design to feel like a native application, and being able to be set on the device home screen.

These are things we traditionally haven’t had available to us on the web. But thanks to new browsers supporting more and more of the HTML5 spec and advances in JavaScript, we can start to create some of this functionality. On the whole, Progressive Web Apps are:

Progressive
Work for every user, regardless of browser choice because they’re built with progressive enhancement as a core tenet.
Responsive
Fit any form factor: desktop, mobile, tablet, or whatever is next.
Connectivity independent
Enhanced with service workers to work offline or on low quality networks.
App-like
Feel like an app to the user with app-style interactions and navigation because they’re built on the app shell model.
Fresh
Always up-to-date thanks to the service worker update process.
Safe
Served via HTTPS to prevent snooping and ensure content hasn’t been tampered with.
Discoverable
Are identifiable as “applications” thanks to W3C manifests and service worker registration scope allowing search engines to find them.
Re-engageable
Make re-engagement easy through features like push notifications.
Installable
Allow users to “keep” apps they find most useful on their home screen without the hassle of an app store.
Linkable
Easily share via URL and not require complex installation.
Source: Your First Progressive Web App (Google)

It’s worth taking a moment to unpack the “app-like” part of that. Fundamentally, there are two parts to a PWA: service workers (which we’ll come to in a minute), and application shell architecture. Google defines this as:

…the minimal HTML, CSS, and JavaScript powering a user interface. The application shell should:
  • load fast
  • be cached
  • dynamically display content
An application shell is the secret to reliably good performance. Think of your app’s shell like the bundle of code you’d publish to an app store if you were building a native app. It’s the load needed to get off the ground, but might not be the whole story. It keeps your UI local and pulls in content dynamically through an API.
Instant Loading Web Apps with an Application Shell Architecture

This method of loading content allows for incredibly fast perceived speed. We're able to get something that looks like our site in front of a user almost instantly, just without any content. The page then goes and fetches the content, and all's well. Obviously, if we actually did things this way in the real world, we'd run into SEO issues pretty quickly, but we'll address that later too.

If, then, a Progressive Web App is at its core just a website served in a clever way, with extra features for loading content, why would we want one?

The use case

Let me be clear before I get into this: for most people, a PWA is something you don't need. That's important enough that it bears repeating, so I'll repeat it:

You probably don’t need a PWA.

The reason for this is that most websites don’t need to be able to behave like an app. This isn’t to say that there’s no benefit to having the things that PWA functionality can bring, but for many sites, the benefits don’t outweigh the time it takes to implement the functionality at the moment.

When should you look at a PWA then? Well, let’s look at a checklist of things that may indicate that you do need one…

Signs a PWA may be appropriate

You have:

  • Content that regularly updates, such as stock tickers, rapidly changing prices or inventory levels, or other real-time data
  • A chat or comms platform, requiring real-time updates and push notifications for new items coming in
  • An audience likely to pull data and then browse it offline, such as a news app or a blog publishing many articles a day
  • A site with regularly updated content which users may check in to several times a day
  • Users who are mostly using a supported browser

In short, you have something beyond a normal website, with interactive or time-sensitive components, or rapidly released or updated content. A good example is the Google Weather PWA:

If you're running a normal site, with a blog that maybe updates every day or two, or even less frequently, then whilst it might be nice to have a site that acts as a PWA, there are probably more useful things you could be doing with your time for your business.

How they work

So, you have something that would benefit from this sort of functionality, but need to know how these things work. Welcome to the wonder that is the service worker.

Service workers can be thought of as a proxy sitting between your website and the network: they intercept the requests your pages make and can control the responses that come back. That means we can do things like hold a copy of data that's been requested, so when it's asked for again we can serve it straight back (this is called caching). We can fetch data once, then replay it a thousand times without having to fetch it again. Think of it like a musician recording an album: they don't have to play a concert every time you want to listen to their music. Same thing, but with network data.
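
To make that concrete, here's a minimal sketch of the pattern using the standard Service Worker API (the /sw.js file name and the cache-first strategy are just illustrative, not a recommended production setup):

// In your page: register the worker if the browser supports it
if ('serviceWorker' in navigator) {
  navigator.serviceWorker.register('/sw.js');
}

// In sw.js: intercept requests and answer from the cache first,
// falling back to the network (the "recorded album" replaying itself)
self.addEventListener('fetch', function (event) {
  event.respondWith(
    caches.match(event.request).then(function (cached) {
      return cached || fetch(event.request);
    })
  );
});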

If you want a more thorough explanation of service workers, check out this moderately technical talk given by Jake Archibald from Google.

What service workers can do

Service workers fundamentally exist to deliver extra features which haven't been available to browsers until now. These include things like:

  • Push notifications, for telling a user that something has happened, such as receiving a new message, or that the page they’re viewing has been updated
  • Background sync, for updating data while a user isn’t using the page/site
  • Offline caching, to allow for an experience where a user may still be able to access some functionality of a site while offline
  • Handling geolocation or other device hardware-querying data (such as device gyroscope data)
  • Pre-fetching data a user will soon require, such as images further down a page

It’s planned that in the future, they’ll be able to do even more than they currently can. For now though, these are the sorts of features you’ll be able to make use of. Obviously these mostly load data via AJAX, once the app is already loaded.
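
As a taste of how the push notification piece is wired up, here's a minimal sketch using the standard Push API; the VAPID key and the endpoint you post the subscription to are placeholders, not part of any particular setup:

// In your page, once the service worker is registered and ready
navigator.serviceWorker.ready.then(function (registration) {
  return registration.pushManager.subscribe({
    userVisibleOnly: true,                       // required by Chrome
    applicationServerKey: applicationServerKey   // placeholder: your VAPID public key as a Uint8Array
  });
}).then(function (subscription) {
  // Placeholder endpoint: your server stores the subscription and sends pushes later
  return fetch('/api/save-subscription', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(subscription)
  });
});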

What are the SEO implications?

So you’re sold on Progressive Web Apps. But if you create one, how will you make sure it ranks? As with any new front-end technology, there are always implications for your SEO visibility. But don’t panic; the potential issues you’ll encounter with a PWA have been solved before by SEOs who have worked on JavaScript-heavy websites. For a primer on that, take a look at this article on JS SEO.

There are a few issues you may encounter if you’re going to have a site that makes use of application shell architecture. Firstly, it’s pretty much required that you’re going to be using some form of JS framework or view library, like Angular or React. If this is the case, you’re going to want to take a look at some Angular.JS or React SEO advice. If you’re using something else, the short version is you’ll need to be pre-rendering pages on the server, then picking up with your application when it’s loaded. This enables you to have all the good things these tools give you, whilst also serving something Google et al can understand. Despite their recent advice that they’re getting good at rendering this sort of application, we still see plenty of examples in the wild of them flailing horribly when they crawl heavy JS stuff.

Assuming you're in the world of clever JS front-end technologies, to make sure you do things the PWA way, you'll also need to be delivering the CSS and JS required to make the page work along with the HTML. Not just including script tags with the src attribute, but the whole file, inline.

Obviously, this means you’re going to increase the size of the page you’re sending down the wire, but it has the upside of meaning that the page will load instantly. More than that, though, with all the JS (required for pick-up) and CSS (required to make sense of the design) delivered immediately, the browser will be able to render your content and deliver something that looks correct and works straightaway.

Again, as we’re going to be using service workers to cache content once it’s arrived, this shouldn’t have too much of an impact. We can also cache all the CSS and JS external files required separately, and load them from the cache store rather than fetching them every time. This does make it very slightly more likely that the PWA will fail on the first time that a user tries to request your site, but you can still handle this case gracefully with an error message or default content, and re-try on the next page view.
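
For illustration, pre-caching those external CSS and JS files (and the shell itself) happens in the service worker's install event. A minimal sketch, with placeholder cache name and file paths:

// sw.js: cache the application shell and its assets once, at install time
var SHELL_CACHE = 'app-shell-v1';   // bump the version to invalidate old caches
var SHELL_FILES = [                 // placeholder paths for your own shell assets
  '/',
  '/css/app.css',
  '/js/app.js'
];

self.addEventListener('install', function (event) {
  event.waitUntil(
    caches.open(SHELL_CACHE).then(function (cache) {
      return cache.addAll(SHELL_FILES);
    })
  );
});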

There are other potential issues people can run into, as well. The Washington Post, for example, built a PWA version of their site, but it only works on a mobile device. Obviously, that means the site can be crawled nicely by Google's mobile bots, but not the desktop ones. It's important to respect the P part of the acronym: the website should enable features that a user can make use of, but still work in a normal manner for those whose browsers don't support them. It's about enhancing functionality progressively, not demanding that people upgrade their browser.

The only slightly tricky part is that, for the best experience, you need to design your application offline-first. How that's done is covered in Jake's talk above. The one issue with going down that route: you're only serving content once someone has arrived at your site and waited long enough for everything to load. Obviously, in the case of Google, that's not going to work well. So here's what I'd suggest…

Rather than just sending your application shell, requesting content via AJAX on load, and then picking up, use this workflow instead:

  • User arrives at site
  • Site sends back the application shell (the minimum HTML, JS, and CSS to make everything work immediately), along with…
  • …the content AJAX response, pre-loaded as state for the application
  • The application loads that immediately, and then picks up the front end.

Adding in the data required means that, on load, we don’t have to make an AJAX call to get the initial data required. Instead, we can bundle that in too, so we get something that can render content instantly as well.

As an example of this, let’s think of a weather app. Now, the basic model would be that we send the user all the content to show a basic version of our app, but not the data to say what the weather is. In this modified version, we also send along what today’s weather is, but for any subsequent data request, we then go to the server with an AJAX call.

This means we still deliver content that Google et al can index, without possible issues from our AJAX calls failing. From Google and the user’s perspective, we’re just delivering a very high-performance initial load, then registering service workers to give faster experiences for every subsequent page and possibly extra functionality. In the case of a weather app, that might mean pre-fetching tomorrow’s weather each day at midnight, or notifying the user if it’s going to rain, for example.
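
Here's a minimal sketch of that "bundle today's data with the shell" idea; the state shape, renderForecast(), and the /api/forecast endpoint are all hypothetical:

// Inlined by the server into the initial HTML response, alongside the shell
window.__INITIAL_STATE__ = { city: 'London', tempC: 14, summary: 'Light rain' };

// In the application bundle, when it picks up on the client
var initial = window.__INITIAL_STATE__;
if (initial) {
  renderForecast(initial);                  // render instantly, no AJAX round trip
} else {
  fetch('/api/forecast')                    // fall back to a normal request
    .then(function (res) { return res.json(); })
    .then(renderForecast);
}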

Going further

If you’re interested in learning more about PWAs, I highly recommend reading this guide to PWAs by Addy Osmani (a Google Chrome engineer), and then putting together a very basic working example, like the train one Jake mentions in his YouTube talk referenced earlier. If you’re interested in that, I recommend Jake’s Udacity course on creating a PWA available here.



Here's How to Generate and Insert Rel Canonical with Google Tag Manager

Posted by luciamarin

This post was originally in YouMoz, and was promoted to the main blog because it provides great value and interest to our community. The author’s views are entirely his or her own and may not reflect the views of Moz, Inc.

In this article, we’re going to learn how to create the rel canonical URL tag using Google Tag Manager, and how to insert it in every page of our website so that the correct canonical is automatically generated in each URL.

We’ll do it using Google Tag Manager and its variables.

Why send a canonical from each page to itself?

Javier Lorente gave us a very good explanation/reminder at the 2015 SEO Salad event in Zaragoza (Spain). In short, there may be various factors that cause Google to index unexpected variants of a URL, and this is often beyond our control:

  • External pages that display our website but use another URL (e.g., Google’s own cache, other search engines and content aggregators, archive.org, etc.). This way, Google will know which one is the original page at all times.
  • Parameters that are irrelevant to SEO/content such as certain filters and order sequences

By including this “standard” canonical in every URL, we are making it easy for Google to identify the original content.

How do we generate the dynamic value of the canonical URL?

To generate the canonical URL dynamically, we need to force it to always correspond to the "clean" (i.e., absolute, unique, and simplified) URL of each page, taking into account the www, URL query string parameters, anchors, etc.

Remember that, in summary, the URL variables that can be created in GTM (Google Tag Manager) correspond to the following components:

URL variables in Google Tag Manager

We want to create a unique URL for each page, without queries or anchors. We need a “clean” URL variable, and we can’t use the {{Page URL}} built-in variable, for two reasons:

  1. Although the fragment doesn't form part of the URL by default, query string parameters do
  2. Potential problems with protocol and hostname, if different options are admitted (e.g., SSL and www)

Therefore, we need to combine Protocol + Host + Path into a single variable.

Now, let’s take a step-by-step look at how to create our {{Page URL Canonical}} variable.

1. Create {{Page Protocol}} to capture whether the URL is served over http:// or https://

page protocol

Note: We're assuming that the entire website will always function under a single protocol. If that's not the case, then we should replace the {{Page Protocol}} variable with plain text in the final variable of Step #4. (This will allow us to force it to always be http/https, without exception.)

2. Create {{Page Hostname Canonical}}

We need a variable in which the hostname is always the same, regardless of whether or not it's entered into the browser with the www. We can decide which version to use based on which of the two domains is 301-redirected to the other, and keep the redirect's destination as the canonical.

How do we create the canonical domain?

  • Option 2.1: Redirect the domain with www. to a domain without www. via 301
    Our canonical URL is WITHOUT www. We need to create Page Hostname, but make sure we always remove the www:
    Page hostname canonical without www
  • Option 2.2: Redirect the domain without www. to a domain with www. via 301
    Our canonical URL is WITH www. We need to create Page Hostname without www (like before), and then insert the www in front using a constant variable:
    Page hostname canonical with www

3. Enable the {{Page Path}} built-in variable

Enabled Built-in variables

Note: Although we have the {{Page Hostname}} built-in variable, for this exercise it's preferable not to use it, as we're not 100% sure how it will behave in relation to the www (for instance, it isn't configurable, unlike a variable we create ourselves in GTM).

4. Create {{Page URL Canonical}}

Link the three previous variables to form a constant variable:

{{Page Protocol}}://{{Page Hostname Canonical}}{{Page Path}}

Summary/Important notes:

  1. Protocol: returns http / https (without ://), which is why we enter this part by hand
  2. Hostname: we can force removal of the www. or not
  3. Path: included from the slash /. Does not include the query, so it’s perfect. We use the built-in option for Page Path.

Page URL canonical

Now that we have created {{Page URL Canonical}}, we could even populate it into Google Analytics via custom dimensions. You can learn to do that in this Google Analytics custom dimensions guide.

How can we insert the canonical into a page using Tag Manager?

Let’s suppose we’ve already got a canonical URL generated dynamically via GTM: {{Page URL Canonical}}.

Now, we need to look at how to insert it into the page using a GTM tag. We should emphasize that this is NOT the “ideal” solution, as it’s always preferable to insert the tag into the <head> of the source code. But, we have confirming evidence from various sources that it DOES work if it’s inserted via GTM. And, as we all know, in most companies, the ideal doesn’t always coincide with the possible!

If we could insert content directly into the <head> via GTM, it would be sufficient to use the following custom HTML tag:

<link rel="canonical" href="{{Page URL Canonical}}" />

But we know that this won't work, because content inserted via custom HTML tags usually ends up at the end of the <body>, and Google won't accept or read a <link rel="canonical"> tag there.

So then, how do we do it? We can use JavaScript code to generate the tag and insert it into the <head>, as described in this article, but in a form that has been adapted for the canonical tag:

<script>
 var c = document.createElement('link');
 c.rel = 'canonical';
 c.href = {{Page URL Canonical}};
 document.head.appendChild(c);
</script>

And then, we can set it to fire on the “All Pages” trigger. Seems almost too easy, doesn’t it?
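
One small variation worth considering (my own addition, not part of the original recipe): guard against pages that already ship a canonical in their source, so you don't end up inserting a second one.

<script>
 // Only insert the tag if the page doesn't already contain a canonical
 if (!document.querySelector('link[rel="canonical"]')) {
   var c = document.createElement('link');
   c.rel = 'canonical';
   c.href = {{Page URL Canonical}};
   document.head.appendChild(c);
 }
</script>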

REL Canonical

How do we check whether our rel canonical is working?

Very simple: Check whether the code is generated correctly on the page.

How do we do that?

By looking at the DevTools Console in Chrome, or by using a browser plugin like Firebug that shows the code generated on the page in the DOM (Document Object Model). We won't find it in the source code (Ctrl+U).

Here’s how to do this step-by-step:

  1. Open Chrome
  2. Press F12
  3. Click on the first tab in the console (Elements)
    elements tab
  4. Press Ctrl+F and search for “canonical”
  5. If the URL appears in the correct form at the end of the <head>, that means the tag has been generated correctly via Tag Manager
    tag generated correctly
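
Alternatively, you can check straight from the DevTools Console with a quick one-liner; it prints the canonical URL currently in the DOM, or null if the tag was never inserted:

var tag = document.querySelector('link[rel="canonical"]');
console.log(tag ? tag.href : null);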

That’s it. Easy-peasy, right?

So, what are your thoughts?

Do you also use Google Tag Manager to improve your SEO? Why don’t you give us some examples of when it’s been useful (or not)?




How to Generate Content Ideas Using Screaming Frog in 20(ish) Minutes

Posted by Todd_McDonald

A steady rise in content-related marketing disciplines and an increasing connection between effective SEO and content have made the benefits of harnessing strategic content clearer than ever. However, success isn't always easy. It's often quite difficult, as I'm sure many of you know.

A number of challenges must be overcome for success to be realized from end-to-end, and finding quick ways to keep your content ideas fresh and relevant is invaluable. To help with this facet of developing strategic content, I’ve laid out a process below that shows how a few SEO tools and a little creativity can help you identify content ideas based on actual conversations your audience is having online.

What you’ll need

Screaming Frog: The first thing you'll need is a copy of Screaming Frog (SF) and a license. Fortunately, it isn't expensive (around $150 USD per year) and there are a number of tutorials if you aren't familiar with the program. After you've downloaded and set it up, you're ready to get to work.

Google AdWords Account: Most of you will have access to an AdWords account due to actually running ads through it. If you aren’t active with the AdWords system, you can still create an account and use the tools for free, although the process has gotten more annoying over the years.

Excel/Google Drive (Sheets): Either one will do. You’ll need something to work with the data outside of SF.

Browser: We walk through the examples below utilizing Chrome.

The concept

One way to gather ideas for content is to aggregate data on what your target audience is talking about. There are a number of ways to do this, including utilizing search data, but it lags behind real-time social discussions, and the various tools we have at our disposal as SEOs rarely show the full picture without A LOT of monkey business. In some situations, determining intent can be tricky and require further digging and research. On the flipside, gathering information on social conversations isn’t necessarily that quick either (Twitter threads, Facebook discussion, etc.), and many tools that have been built to enhance this process are cost-prohibitive.

But what if you could efficiently uncover hundreds of specific topics, long-tail queries, questions, and more that your audience is talking about, and you could do it in around 20 minutes of focused work? That would be sweet, right? Well, it can be done by using SF to crawl discussions that your audience is having online in forums, on blogs, Q&A sites, and more.

Still here? Good, let’s do this.

The process

Step 1 – Identifying targets

The first thing you’ll need to do is identify locations where your ideal audience is discussing topics related to your industry. While you may already have a good sense of where these places are, expanding your list or identifying sites that match well with specific segments of your audience can be very valuable. In order to complete this task, I’ll utilize Google’s Display Planner. For the purposes of this article, I’ll walk through this process for a pretend content-driven site in the Home and Garden vertical.

Please note, searches within Google or other search engines can also be a helpful part of this process, especially if you’re familiar with advanced operators and can identify platforms with obvious signatures that sites in your vertical often use for community areas. WordPress and vBulletin are examples of that.

Google’s Display Planner

Before getting started, I want to note I won’t be going deep on how to use the Display Planner for the sake of time, and because there are a number of resources covering the topic. I highly suggest some background reading if you’re not familiar with it, or at least do some brief hands-on experimenting.

I’ll start by looking for options in Google’s Display Planner by entering keywords related to my website and the topics of interest to my audience. I’ll use the single word “gardening.” In the screenshot below, I’ve selected “individual targeting ideas” from the menu mid-page, and then “sites.” This allows me to see specific sites the system believes match well with my targeting parameters.

[Screenshot: Display Planner site targeting results in Google Chrome]

I’ll then select a top result to see a variety of information tied to the site, including demographics and main topics. Notice that I could refine my search results further by utilizing the filters on the left side of the screen under “Campaign Targeting.” For now, I’m happy with my results and won’t bother adjusting these.

Step 2 – Setting up Screaming Frog

Next, I’ll take the website URL and open it in Chrome.

Once on the site, I need to first confirm that there’s a portion of the site where discussion is taking place. Typically, you’ll be looking for forums, message boards, comment sections on articles or blog posts, etc. Essentially, any place where users are interacting can work, depending on your goals.

In this case, I’m in luck. My first target has a “Gardening Questions” section that’s essentially a message board.

[Screenshot: the "Gardening Questions" section of the target site]

A quick look at a few of the thread names shows a variety of questions being asked and a good number of threads to work with. The specific parameters around this are up to you — just a simple judgment call.

Now for the fun part — time to fire up Screaming Frog!

I’ll utilize the “Custom Extraction” feature found here:

Configuration → Custom → Extraction

…within SF (you can find more details and a broader set of use-case documentation for this feature here). Utilizing Custom Extraction will allow me to grab specific text (or other elements) from a set of pages.

Configuring extraction parameters

I’ll start by configuring the extraction parameters.

[Screenshot: Screaming Frog custom extraction settings]

In this shot I’ve opened the custom extraction settings and have set the first extractor to XPath. I need multiple extractors set up, because multiple thread titles on the same URL need to be grabbed. You can simply cut and paste the code into the next extractors — but be sure to update the number sequence (outlined in orange) at the end to avoid grabbing the same information over and over.

Notice as well, I’ve set the extraction type to “extract text.” This is typically the cleanest way to grab the information needed, although experimentation with the other options may be required if you’re having trouble getting the data you need.

Tip: As you work on this, you might find you need to grab different parts of the HTML than what you thought. This process of getting things dialed can take some trial-and-error (more on this below).

Grabbing Xpath code

To grab the actual extraction code we need (visible in the middle box above):

  1. Use Chrome
  2. Navigate to a URL with the content you want to capture
  3. Right-click on the text you’d like to grab and select “inspect” or “inspect element”

[Screenshot: inspecting the thread title in Google Chrome]

Make sure you see the text you want highlighted in the code view, then right-click and select “XPath” (you can use other options, but I recommend reviewing the SF documentation mentioned above first).

[Screenshot: copying the XPath in Chrome DevTools]

It’s worth noting that many times, when you’re trying to grab the XPath for the text you want, you’ll actually need to select the HTML element one level above the text selected in the front-end view of the website (step three above).
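
Before pasting the expression into SF, it can save a round of trial-and-error to test it in the same DevTools console with the $x() helper; the XPath below is a placeholder for whatever you copied:

// Returns an array of matching elements; map them to their visible text
$x('//*[@id="content"]/div[2]/ul/li[1]/a').map(function (el) {
  return el.textContent.trim();
});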

At this point, it’s not a bad idea to run a very brief test crawl to make sure the desired information is being pulled. To do this:

  1. Start the crawler on the URL of the page where the XPath information was copied from
  2. Stop the crawler after about 10–15 seconds and navigate to the “custom” tab of SF, set the filter to “extraction” (or something different if you adjusted naming in some way), and look for data in the extractor fields (scroll right). If this is done right, I’ll see the text I wanted to grab next to one of the first URLs crawled. Bingo.

[Screenshot: extraction results in the Screaming Frog "custom" tab]

Resolving extraction issues & controlling the crawl

Everything looks good in my example, on the surface. What you’ll likely notice, however, is that there are other URLs listed without extraction text. This can happen when the code is slightly different on certain pages, or SF moves on to other site sections. I have a few options to resolve this issue:

  1. Crawl other batches of pages separately walking through this same process, but with adjusted XPath code taken from one of the other URLs.
  2. Switch to using regex or another option besides XPath to help broaden parameters and potentially capture the information I’m after on other pages.
  3. Ignore the pages altogether and exclude them from the crawl.

In this situation, I’m going to exclude the pages I can’t pull information from based on my current settings and lock SF into the content we want. This may be another point of experimentation, but it doesn’t take much experience for you to get a feel for the direction you’ll want to go if the problem arises.

In order to lock SF to URLs I would like data from, I’ll use the “include” and “exclude” options under the “configuration” menu item. I’ll start with include options.

[Screenshot: Screaming Frog include configuration]

Here, I can configure SF to only crawl specific URLs on the site using regex. In this case, what’s needed is fairly simple — I just want to include anything in the /questions/ subfolder, which is where I originally found the content I want to scrape. One parameter is all that’s required, and it happens to match the example given within SF ☺:

  • http://www.site.com/questions/.*

The “excludes” are where things get slightly (but only slightly) trickier.

During the initial crawl, I took note of a number of URLs that SF was not extracting information from. In this instance, these pages are neatly tucked into various subfolders. This makes exclusion easy as long as I can find and appropriately define them.

[Screenshot: Screaming Frog URL list showing the subfolders to exclude]

In order to cut these folders out, I’ll add the following lines to the exclude filter:

  • http://www.site.com/question/archive/.*
  • http://www.site.com/question/show/.*

Upon further testing, I discovered I needed to exclude the following folders as well:

  • http://www.site.com/question/genus/.*
  • http://www.site.com/question/popular/.*
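
If you'd like to sanity-check patterns like these before handing them to SF, a quick sketch in any JavaScript console works; the URLs and patterns below simply mirror the illustrative examples above:

var include = /^http:\/\/www\.site\.com\/questions\/.*/;
var exclude = /^http:\/\/www\.site\.com\/question\/(archive|show|genus|popular)\/.*/;

[
  'http://www.site.com/questions/identify-this-plant',   // should be crawled
  'http://www.site.com/question/archive/2016/05'         // should be excluded
].forEach(function (url) {
  console.log(url, 'include:', include.test(url), 'exclude:', exclude.test(url));
});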

It’s worth noting that you don’t HAVE to work through this part of configuring SF to get the data you want. If SF is let loose, it will crawl everything within the start folder, which would also include the data I want. The refinements above are far more efficient from a crawl perspective and also lessen the chance I’ll be a pest to the site. It’s good to play nice.

Completed crawl & extraction example

Here’s how things look now that I’ve got the crawl dialed:

[Screenshot: the completed crawl and extraction in Screaming Frog]

Now I'm 99.9% good to go! The last crawl configuration step is to reduce speed to avoid negatively impacting the website (or getting throttled). This can easily be done by going to Configuration → Speed and reducing the number of threads and the maximum number of URLs requested per second. I usually stick with something at or under 5 threads and 2 URLs per second.

Step 3 – Ideas for analyzing data

After the end goal is reached (run time, URIs crawled, etc.), it's time to stop the crawl and move on to data analysis. There are a number of ways to start breaking apart the information grabbed that can be helpful, but for now I'll walk through one approach with a couple of variations.

Identifying popular words and phrases

My objective is to help generate content ideas and identify words and phrases that my target audience is using in a social setting. To do that, I'll use a couple of simple tools to help me break apart my information:

  • Tagcrowd.com
  • Online-Utility.org (its text analysis tool)
  • Excel (or Google Sheets)

The top two URLs perform text analysis, with some of you possibly already familiar with the basic word-cloud generating abilities of tagcrowd.com. Online-Utility won't pump out pretty visuals, but it provides a helpful breakout of common 2- to 8-word phrases, as well as occurrence counts on individual words. There are many tools that perform these functions; find the ones you like best if these don't work!

I’ll start with Tagcrowd.com.

Utilizing Tagcrowd for analysis

The first thing I need to do is export a .csv of the data scraped from SF and combine all the extractor data columns into one. I can then remove blank rows, and after that scrub my data a little. Typically, I remove things like:

  • Punctuation
  • Extra spaces (the Excel “trim” function often works well)
  • Odd characters

Now that I’ve got a clean data set free of extra characters and odd spaces, I’ll copy the column and paste it into a plain text editor to remove formatting. I often use the one online at editpad.org.
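
As an aside, if you'd rather script the scrubbing than do it by hand, here's a rough sketch of the same steps (assuming the extractor columns have already been combined into one list of strings):

function cleanTitles(titles) {
  return titles
    .map(function (t) {
      return t
        .replace(/[^\w\s'-]/g, ' ')    // drop punctuation and odd characters
        .replace(/\s+/g, ' ')          // collapse extra spaces (like Excel's TRIM)
        .trim();
    })
    .filter(function (t) { return t.length > 0; });   // remove blank rows
}

console.log(cleanTitles(['What is this plant??', '  Help identify  this seed  ']));
// ["What is this plant", "Help identify this seed"]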

That leaves me with this:

[Screenshot: the cleaned list of thread titles in Editpad]

In Editpad, you can easily copy your clean data and paste it into the entry box on Tagcrowd. Once you’ve done that, hit visualize and you’re there.

Tagcrowd.com

[Screenshot: Tagcrowd word cloud generated from the thread titles]

There are a few settings down below that can be edited in Tagcrowd, such as minimum word occurrence, similar word grouping, etc. I typically use a minimum word occurrence of 2 (which I've used for this example) so that words show up with some level of frequency and clutter is cut out. You may set a higher threshold depending on how many words you want to look at.

For my example, I’ve highlighted a few items in the cloud that are somewhat informational.

Clearly, there's a fair amount of discussion around "flowers," "seeds," and the words "identify" and "ID." While I have no doubt my gardening sample site is already discussing most of these major topics, such as flowers, seeds, and trees, perhaps they haven't realized how common questions are around identification. This one item could lead to a world of new content ideas.

In my example, I didn't crawl my sample site very deeply and thus my data was fairly limited. Deeper crawling will yield more interesting results, and you've likely already realized how, in this example, crawling during various seasons could highlight topics and issues that are currently important to gardeners.

It’s also interesting that the word “please” shows up. Many would probably ignore this, but to me, it’s likely a subtle signal about the communication style of the target market I’m dealing with. This is polite and friendly language that I’m willing to bet would not show up on message boards and forums in many other verticals ☺. Often, the greatest insights besides understanding popular topics from this type of study are related to a better understanding of communication style, phrasing, and more that your audience uses. All of this information can help you craft your strategy for connection, content, and outreach.

Utilizing Online-Utility.org for analysis

Since I’ve already scrubbed and prepared my data for Tagcrowd, I can paste it into the Online-Utility entry box and hit “process text.”

After doing this, I ended up with the following output:

[Screenshots: Online-Utility.org text analysis output]

There’s more information available, but for the sake of space, I’ve grabbed only a couple of shots to give you the idea of most of what you’ll see.

Notice in the first image, the phrases “identify this plant” & “what is this” both show up multiple times in the content I grabbed, further supporting the likelihood that content developed around plant identification is a good idea and something that seems to be in demand.

Utilizing Excel for analysis

Let’s take a quick look at one other method for analyzing my data.

One of the simplest ways to digest the information is in Excel. After scrubbing the data and combining it into one column, a simple A→Z sort puts the information in a format that helps bring patterns to light.

[Screenshot: the sorted thread titles in Microsoft Excel]

Here, I can see a list of specific questions ripe for content development! This type of information, combined with data from tools such as keywordtool.io, can help identify and capture long-tail search traffic and topics of interest that would otherwise be hidden.

Tip: Extracting information this way sets you up for very simple promotion opportunities. If you build great content that answers one of these questions, go share it back at the site you crawled! There’s nothing spammy about providing a good answer with a link to more information if the content you’ve developed is truly an asset.

It’s also worth noting that since this site was discovered through the Display Planner, I already have demographic information on the folks who are likely posting these questions. I could also do more research on who is interested in this brand (and likely posting this type of content) utilizing the powerful ad tools at Facebook.

This information allows me to quickly connect demographics with content ideas and keywords.

While intent has proven to be very powerful and will sometimes outweigh misaligned messaging, it’s always great to know as much about who you’re talking to and be able to cater messaging to them.

Wrapping it up

This is just the beginning and it’s important to understand that.

The real power of this process lies in its use of simple, affordable tools to gain information efficiently, making it accessible to many on your team and an easy sell to those who hold the purse strings, whatever your organization's size. It's affordable for small and mid-size businesses, and far less likely to leave those at the enterprise level waiting on larger purchases.

What information is gathered and how it is analyzed can vary wildly, even within my stated objective of generating content ideas. All of it can be right. The variations on this method are numerous and allow for creative problem solvers and thinkers to easily gather data that can bring them great insight into their audiences’ wants, needs, psychographics, demographics, and more.

Be creative and happy crawling!



Duplicate Listings and the Case of the Nomadic New Mexican Restaurant

Posted by MiriamEllis

Albuquerque's locals and tourists agree, you can't find a more authentic breakfast in town than at Perea's New Mexican Restaurant. Yelp reviewers exclaim, "Best green chile ever!!", "Soft, chewy, thick-style homemade flour tortillas soak up all the extra green chili," "My go-to for great huevos rancheros," and "Carne was awesome! Tender, flavorful, HOT!" The descriptions alone are enough to make one salivate, but the Yelp reviews for this gem of an eatery also tell another story, one so heavily spiced with the potential of duplicate listings that it may take away the appetite of any hard-working local SEO:

“Thru all of the location changes, this is a true family restaurant with home cooking.”

“This restaurant for whatever reason, changes locations every couple years or so.”

“They seem to wander from different locations”

“As other reviews have already mentioned, Perea’s changes locations periodically (which is puzzling/inconvenient — the only reason they don’t get 5 stars)”

“They switch locations every few years and the customers follow this place wherever it goes.”

Reading those, the local SEO sets aside sweet dreams of sopapillas because he very much doubts the accuracy of that last review comment. Are all customers really following this restaurant from place to place, or are visitors (with money to spend) being misdirected to false locations via outdated, inconsistent, and duplicate listings?

The local SEO can't stand the suspense, so he fires up Moz Check Listing.

He types in the most recent name/zip code combo he can find, and up comes:

[Screenshot: Moz Check Listing results for the restaurant]

A total of 2 different names, 3 different phone numbers, and 4 different addresses! In 5 seconds, the local SEO has realized that business listings around the web are likely misdirecting diners left and right, undoubtedly depriving the restaurant of revenue as locals fail to keep up with the inconvenient moves or travelers simply never find the right place at all. Sadly, two of those phone numbers return an out-of-service message, further lessening the chances that patrons will get to enjoy this establishment’s celebrated food. Where is all this bad data coming from?

The local SEO clicks on just the first entry to start gaining clues, and from there, he clicks on the duplicates tab for a detailed, clickable list of duplicates that Check Listing surfaces for that particular location:

[Screenshot: the Duplicates tab in Moz Check Listing]

From this simple Duplicates interface, you can immediately see that 1 Google My Business listing, 1 Foursquare listing, 3 Facebook Places, 1 Neustar Localeze listing, and 1 YP listing bear further investigation. Clicking the icons takes you right to the sources. You’ve got your clues now, and only need to solve your case. Interested?

The paid version of Moz Local lets you add multiple variants of a client's name, address, and phone number to help surface further duplicates. Finally, your Moz Local dashboard also enables you to request closure of duplicates on our Direct Network partners. What a relief!

Chances are, most of your clients don’t move locations every couple of years (at least, we hope not!), but should an incoming client alert you to a move they’ve made in the past decade or so, it’s likely that a footprint of their old location still exists on the web. Even if they haven’t moved, they may have changed phone numbers or rebranded, and instead of editing their existing listings to reflect these core data changes, they may have ended up with duplicate listings that are then auto-replicating themselves throughout the ecosystem.

Google and local SEOs share a common emotion about duplicate listings: both feel uneasy about inconsistent data they can’t trust, knowing the potential to misdirect and frustrate human users. Feeling unsettled about duplicates for an incoming client today?

Get your appetite back for powerful local SEO with our free Check Listing tool!

