Why Meaning Will Ultimately Determine Your Brand’s Content Marketing Success
Posted by ronell-smith
In 2009 Fletcher Cleaves was a top high school football prospect ready for the next level, eager to do in college what he’d done in high school: rack up yards as a running back. But before Cleaves could realize his dream of playing at the next level, a texting, distracted driver plowed into the car he was driving, forever changing his life’s trajectory.
Today, Cleaves, paralyzed from the chest down as a result of the accident, serves as a tragic reminder of how something as seemingly harmless as texting and driving can alter lives. It’s impossible to watch the video below and not immediately realize three important facts:
- Texting and driving is a big deal.
- This young man was unfairly robbed of his future.
- This big brand nailed the messaging.
Telecommunications brands (and airline companies) enjoy some of the worst customer service ratings on the planet. And to make matters worse, their core messaging via print, radio and online ads is equally atrocious, doing very little to make would-be customers give them a second look.
However, with the latest iteration of the “It Can Wait” campaign, which is rich with stories and features stunning video recreation, AT&T did something all brands looking to make a mark in content marketing should copy: They delivered content with meaning.
The end of utility
We live in a world rich in information and teeming with data. The ability to analyze the results of our content marketing efforts, even in real-time, is as astonishing as it is mesmerizing and revealing. Our teams can know, before a word is written, a design is delivered or a report is generated, what the results should be based on the assigned key performance indicators (KPIs). The automation present in online marketing can make it feel as though the world we inhabit is more fantasy than reality, as if the press of a button will always lead to the results we expect.
Yet we still struggle with how to create content that commands attention, that nudges prospects to take immediate action, that leads to the vast majority of our customers moving from brand loyalists to brand ambassadors and advocates.
Why is this?
I propose that we’ve misread the tea leaves.
In the last three years, marketers (even this one) have sung from the rooftops that your content must be useful and relevant, have immediacy, and deliver impact. And if you followed this advice, you likely found a modicum of success, if only for a short time.
How could we expect any different when the customers we’re all clamoring for are being bombarded with thousands of messages every day? When that happens, even the most resonant voices get drowned out. And for those of us who’ve thrown our hats into the usefulness and relevance ring, we’ve largely committed ourselves to a life of struggle that’s tough to recover from.
This line of thinking occurred to me in July of 2014, as I finished Jay Baer’s book Youtility during the plane ride home from MozCon 2014. I agree with and applaud Baer for bringing to light the novel term, which he defines as “Marketing that’s wanted by customers. Youtility is massively useful information, provided for free, that creates long-term trust and kinship between your company and your customers.”
But I’m afraid this ship has largely sailed. Not because usefulness is any less important, but because the threshold was so low that every brand and their sister jumped online via websites, social media, forums, message boards and everywhere else with information that temporarily sated prospects’ appetites but did little to create a lasting impression.
If your desire is to create a brand whose content is sought-after and, indeed, clamored for, you must bake meaning into your content.
Without meaning, your brand’s content is adrift
Like many of you, I centered most of my early content-creation efforts around pleasing Google, thinking in terms of three types of queries:
1: Informational: Where prospects are likely to look for information
2: Navigational: What prospects are likely to be looking for on those sites
3: Transactional: What prospects are ready/likely to buy
The result of this thinking (outlined in the graphic below) was the myriad 350-word posts that now clog the web.
There’s a better way.
It’s time your content led with meaning, and that process begins with a revamping of the thought process surrounding content ideation and content creation. Why is that important?
We cannot win otherwise, says Bill Sebald, founder of Greenlane SEO, a Pennsylvania-based SEO firm.
“Think about it,” he says. “Many brands are still writing low-quality articles that deliver little value and have zero impact to their customers or prospects. That’s bad enough, but when you consider the prevalence of these thin content pieces, is there any wonder how the Panda Update evokes fear in these same brands? Being useful is great. It can and does work fine, for a while. But what you want as a brand is lasting impact, people seeking you out, top-of-mind awareness. As it regards content marketing, that only happens when your brand is known for delivering content with meaning, which sticks in the gut of the folks who read it.”
In All Your Content Doesn’t Matter Without Meaning, Sebald shared five easy-to-follow questions he thinks brands should ask themselves as they work to create content with meaning:
- Did I say anything new?
- Did I say something that will get someone’s attention?
- Is the content part of a strategy?
- Am I really an expert in this topic?
- Did my copy focus on relationships Google knows about?
Any brand committed to asking themselves at least three of those questions before any content is created is swimming in the deep end of the pool, having moved away from the pack and on the way to delivering meaningful content.
After reading Sebald’s post, I dug into my notes to discern what I think it takes to win the race for content marketing’s next frontier.
If your brand is looking to separate from the pack, I’d like to share three ideas I’ve seen work well for brands of all sizes, even in boring verticals, such as HVAC and plumbing.
1. Be where your prospects are, at the time they need your information, with a message so good they cannot ignore you.
As a lifelong angler, I’m keen to compare marketing to bass fishing, whereby bait and location are pretty much all that matters. Or so I thought, until one day I got my hands on an underwater camera and could see fish swimming all around my lure, which they ignored.
That’s when I realized bait and location are only as good as timing.
No matter how great the quality of my tackle or how well-placed my lure, the fish must be ready to bite for me to find success.
How your brand can put this thinking to work: Personalize your company’s blog by adding bi-weekly or monthly interviews with people who’ve used your services/products, and who can share information that’s hyper-relevant to issues prospects are likely dealing with at the time.
For example, in the month of October a pool company might highlight a customer who maintains their own pool but who hires a pool company for winterization help. Or, in the same month, an accountant might share a video blog of a couple who owns a small business and does a great job of staying on top of expenses.
You might notice that I never said the person spotlighted mentions the brand or even uses them for service. That’s immaterial. What’s key is (a) the person shares a compelling story that’s (b) delivered on your blog and (c) is information they can use right away for where they are in the decision-making process. (It’s important that the content not appear salesy because too often the prospects who’re most likely to need your services aren’t even looking for those services. They’re simply suckers for a good story.)
2. Make them feel confident about what the brand stands for, not simply the purchase they might someday make.
One of my favorite words from college is ubiquity. Get to know this word if your brand is to produce meaningful content. Your brand should show up in all the places and for all the things prospects would expect to find you ranking for, conversing about and, more important, being shared by others for.
To instill your content with meaning, it must show up in places and for things prospects likely would expect to find it showing up for. This isn’t simply about ubiquity. It shows empathy.
A brand that does this better than most is Seattle-based REI. It’s amazing the range of terms they rank highly for. If they sell it, there’s a great chance REI shows up somewhere in or near the top of the SERPs for the category.
For example, I simply typed “snow goggles” into the search box, and voila, look who shows up. Also, look who they show up above. Better yet, imagine all of the large eyewear brands they’re outcompeting for this position.
By clicking on the query, you immediately see why they’re at the top of the SERPs: The content is rich in visuals and answers every question a prospect would ever have surrounding snow goggles.
I discovered the strength of REI’s content ideation and creation efforts in 2013, while completing a content strategy roadmap for one of the largest two-way radio manufacturers in the world.
Despite the brand’s heft, REI was always ahead of them in the SERPs, with social shares, in online conversations, etc.
When I visited with Jonathon Colman, formerly the in-house SEO for REI, at Facebook headquarters in San Francisco, I understood why REI had content ubiquity: “From the start, they did something right that continues to [work in their favor],” says Colman, who works for Facebook in the areas of product user experience and content strategy. “They simply focused on creating and sharing the best content for their users, not on marketing.”
Those words resonated with me, as they should with you.
How your brand can put this thinking to work
Stop thinking like a marketer and start thinking like a customer. I’ve written before about keeping and sharing a document that lists the questions and comments prospects and customers share during calls, on social media and via any other platforms used to capture customer sentiment.
This document could form the basis for content that’s written and shared by your marketing team. However, your brand must go farther to deliver meaning through its content.
An approach I’ve recommended to clients and seen good success with works as follows:
- Focus on creating one big piece of content per month: This pulls your team away from thinking about creating content for content’s sake. It also ensures that the team is able to marshal its resources to research, design, and create content with meaning. The goal with each big content piece is to answer every reasonable question and/or objection a prospect might have before doing business with you. For example, an SEO agency might, in month one, create a big content piece titled “How Small Companies Can Win With Personalized Content,” detailing in depth how becoming a popular local expert can earn the brand links, gain press attention and increase overall business. In month two, the same agency might go all-in on a post titled “How Your Mom and Pop Shop Can Beat the Big Guys,” whereby they outline an actionable plan for how to smartly use their blog, one social media platform and a small PPC budget to generate awareness, site visits, links and earned media. Prospects are likely to see the agency as the one to help get them over the hump.
- Ignore the competition: Instead of checking the SERPs to see what’s ranking highest for content in your vertical on the topic you wish to create, look at the content that’s being shared outside your area by brands that have no relation to your vertical. You cannot win long-term by copying a strategy that your competition is better equipped to deploy, so don’t emulate them. Look at what non-competing brands are doing to deliver meaningful content. It could be a TV show, even, which you study for how characters are developed. Think of the regional car dealerships who grew to be household names in the late ’90s by delivering sitcom-style commercials and ads based off popular TV shows that meant something to the audience. Your brand can find similar inspiration by looking outside your area.
- Make consistency a mainstay: REI wins at content marketing in large part because the brand is consistent. No matter where you find their content, it’s thorough and deserving of its place in the pantheon of content marketers. Don’t simply pour your heart into the big content piece, then allow everything else to fall by the wayside. Your brand must imbue every area, all departments and any content shared with meaning. This effort takes shape as the development, design and product teams placing users in the driver’s seat early on in the process; the marketing team only sharing information that, first and foremost, addresses the needs of the audience; the customer service team creating customer happiness, not quashing complaints; and sales team members frequently checking on prospects, even when no sale is imminent.
The goal here is to, as the saying goes, be so good they cannot ignore you.
3. Help your customers become the best versions of themselves
It’s likely you’ve seen the graphic below online before, maybe even on the Buffer Blog, which is where I found it. The image expertly sums up where I think the brands who ultimately win at content marketing will have to go: Turning away from their own interests and keying in on how the brand can better enable the customer to (a) better do what they endeavor to do and (b) become a version of themselves they never imagined possible.
Sound far-fetched? Imagine the car commercials showing an average Joe who is all of a sudden a handsome hero admired by beautiful passersby because of his new wheels.
Your brand can become the means-something-to-prospects darling of its industry, too, with the adoption of three simple steps applied with conviction:
- Personalization — Develop people (at least one, but a few would be even better) in your company who can become the public face of the brand, who make it easier for prospects to form a connection with the company and more likely that content is shared and amplified more frequently as their popularity increases.
- Become a helper, not a hero — Stop thinking that your content or your product or your service needs to be life-changing to get the attention of prospects. They desire to be the heroes and sheroes of their own journey; they simply need an assist from you to create a lasting bond they won’t soon forget about.
- Make users’ stories a core of your marketing efforts — Let’s get this straight: No one gives a damn about your story. Your brand’s story only becomes relevant when prospects have been made to feel important and special by you, and then desire to explore further the meaning behind the brand. How do you accomplish that task? By integrating the stories of customers into your marketing efforts.
How your brand can put this thinking to work
The importance of using an engaging personality to deliver meaning for your content cannot be overstated. In fact, it’s likely the shortest path to winning attention and garnering success.
I’ll use Canadian personal trainer Dean Somerset as an example. I discovered Somerset a few years ago when he dropped a few helpful knowledge bombs in the comments of a fitness blog I was reading. I then found a link to his blog, which I have now become a religious follower of. Over the years, we’ve traded numerous emails, interacted myriad times via Twitter, Facebook, and Instagram, and I’ve even hired him for training assessments.
Why?
Aside from being brilliant, he’s a goofball who takes his work, not himself, too seriously.
But most important, the core of every post he creates or video he shares or every Facebook Q&A he offers is helping others become better at physical health and physical fitness than they ever imagined they could.
The result is that, in a relatively short time span, Somerset has become one of the top young minds in the fitness industry, in no small part because he creates heroes with nearly every piece of content he shares. (If you doubt me, watch the video below.)
Don’t think for a second that your brand can’t do the same:
- Look for members on your team who have personality and who are uniquely qualified to create content (e.g., video, text, SlideShare, etc.) on topics readers care about. Empower them to share, converse and engage around this content, whether locally (e.g., Meetups), nationally (e.g., conferences) or online (e.g., blogs, social media, etc.).
- The script these experts must work from, for everything they share, should begin with the question, “How can this [blog, video, etc.] help at least one person do something better tomorrow that they cannot yet do today?” Answer this question, and you won’t simply create meaning for your content, you’ll create meaning, relevance and top-of-mind awareness for the brand as well.
It’s hard for a brand to escape being successful if this mindset is ever-present.
The last area we’ll look at is storytelling, which is very popular in content marketing. And almost no one gets it right.
Yes, people do love stories. They eat them up, especially compelling, heart-wrenching stories or, even better, tales of tremendous uplift.
However, people are not interested in your brand’s story — at least not yet.
The only stories brands should be telling are those of their users. The brands who have realized this are leaving the brand storytellers in the dust, while turning up the dial on meaning and significance to the audience.
A great example is Patagonia and their Worn Wear video series. Instead of creating ads showcasing the durability of their products, they filmed actual customers who’ve been using the same Patagonia products for years and who wouldn’t trade the brand’s products for those of any other company.
These are rabid fans, loyal to the nth degree.
Don’t drink the brand storytelling Kool-Aid. Tell the stories of your users.
Identify a handful of ardent fans of your product or service, then reach out to them via phone to ask if they’d mind being part of a short-video series you’re doing to showcase people and brands doing great things. (I mentioned a similar approach earlier, which is ideal for the smallest companies. I think this effort plays into a much broader strategy for larger brands.)
Depending on your budget and their location, you could either have a small camera crew visit their office or walk them through how to shoot what you need on their mobile devices. You could also provide them with a script.
Here’s the kicker: During the video, they are not allowed to talk about your brand, product or service in any way shape or form.
The goal is to get video of them going about their day, at home and at work, as they share what makes them tick, what’s important to them, who they are and why they do what they do.
This is their story, remember? And as such, your brand is a bit player, not a/the star. Also, the lack of a mention washes away any suspicion viewers might have of your brand’s motives. Most important, however, you get a real, authentic success story on your website and domain, so the implication is that your brand was a helper in this heroic journey.
If this post accomplishes anything, my wish is that it makes clear how necessary and how realistic it is for your brand to create meaningful content.
Scraping and Cleaning Your Data with Google Sheets: A Closer Look
Posted by Jeremy_Gottlieb
Have you ever wanted to automate pulling data from a web page—such as building a Twitter audience—and wanted a way to magically make all of the Twitter handles from the web page appear in your Google Sheet, but didn’t know how? If learning Python isn’t your cup of tea, using a few formulas in Google Sheets will allow you to easily and quickly scrape data from a URL that, were you to do so manually, could take hours.
For Windows users, Niels Bosma’s amazing SEO plug-in for Excel is an option that could also be used for this purpose, but if you analyze data on a Mac, this tutorial on formulas in Google Sheets will help make your life much easier, as the plug-in doesn’t work on Macs.
Within Google Sheets, there are 3 formulas that I like to use in order to save myself huge amounts of time and headspace. These are:
- IMPORTXML
- QUERY
- REGEXEXTRACT
With just these 3 formulas, you should be able to scrape and clean the data you need for whatever purpose you may come across—whether that be curating Twitter audiences, analyzing links, or anything else that you can think of. The beauty of these formulas is in their versatility, so the use cases for them are practically infinite. By understanding the concept behind this, the variables can be substituted depending on the individual use case. However, the essential process for scraping, cleaning and presenting data will remain the same.
It should be noted that scraping has limitations, and some sites (like Google) don’t really want anyone scraping their content. The purpose of this post is purely to help you smart Moz readers pull and sort data even faster and more easily than you would’ve thought possible.
Let’s find some funny people on Twitter we should follow (or target. Does it really matter?). Googling around the subject of funny people on Twitter, I find myself landing on the following page:
Bingo. Straight copying and pasting into a Google Doc would be a disaster; there’s simply way too much other content on the page. This is where IMPORTXML comes in.
The first step is to open up a Google Sheet and input the desired URL into a cell. It could be any cell, but in the example below, I placed the URL into cell A1.
Just before we begin with the scraping, we need to figure out exactly what data we plan on scraping. In this case, it happens to be Twitter handles, so this is how we’re going to do it.
First, right click on our target (the Twitter handle) and click “Inspect Element.”
Once in “Inspect Element,” we want to figure out where on the page our target lives.
Because we want the Twitter handle and not the URL, we’re going to focus on the attribute “target” rather than “href” within the <a></a> tags. We also happen to notice that the <a></a> tags are “children” of the <h3></h3> tags. What these values mean is a topic for another post, but what we need to keep in mind is that for this particular URL, this is where our desired information lives that we need to extract. It will almost certainly live in a different area with different attributes on any other given URL; this is just the information that’s unique to the site we’re on.
Let’s get to the scary stuff (maybe?): how to write the formula.
I put the formula in cell A3, where I have the red arrow. As can be seen in the highlighted rectangle, I wrote =IMPORTXML(A1, "//h3//a[@target='_blank']"), which yielded a wonderful, organized list of all the top Twitter handles to follow from the page. Voila. Cool, right? (Type the quotation marks as straight quotes; curly “smart” quotes pasted from a web page will cause a formula error.)
Something to remember when doing this is that the values have been created via a formula, so trying to copy and paste them regularly can get messy; you’ll need to copy and paste as values.
Now, let’s break down the madness.
Like any other function in Sheets, you’ll need to begin with an equal sign, so we start with =IMPORTXML. Next, we reference the cell with our targeted URL (in this case, cell A1) and then add a comma. The XPath query that follows must be wrapped in straight double quotation marks and begins with two forward slashes (//), which mean “look anywhere in the page.” Next, you select the element you want to scrape (in this case, the h3 tag). We don’t want all of the information in the h3 elements, just the <a></a> tags inside them, and only those whose “target” attribute is set to “_blank,” which is where we find the Twitter handles. To capture this part, we add //a[@target='_blank'], which matches only <a> tags carrying target='_blank'. Putting it all together, the formula =IMPORTXML(A1, "//h3//a[@target='_blank']") can be translated as “From the URL within cell A1, select every <a> tag that sits inside an <h3> tag and has a target attribute equal to _blank.”
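For readers who do want to peek at the Python side, here is a minimal sketch of what that XPath expression selects. The markup and the handles in it are made up for illustration (they are not from the page in the screenshots), and it uses Python's built-in ElementTree, which only understands a subset of XPath and needs well-formed markup, so treat it as a way to reason about the query rather than a replacement for IMPORTXML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the scraped page.
PAGE = """
<html><body>
  <h3><a href="https://twitter.com/example_one" target="_blank">@example_one</a></h3>
  <p>Some text with <a href="/about" target="_blank">an unrelated link</a>.</p>
  <h3><a href="https://twitter.com/example_two" target="_blank">@example_two</a></h3>
  <h3><a href="/archive">Archive</a></h3>
</body></html>
"""

root = ET.fromstring(PAGE)

# Same idea as //h3//a[@target='_blank']: every <a> with target="_blank"
# that sits somewhere inside an <h3>. The leading "." is there because
# ElementTree paths are relative to the element being searched.
handles = [a.text for a in root.findall(".//h3//a[@target='_blank']")]

print(handles)
```

The link inside the paragraph is skipped because it is not inside an h3, and the “Archive” link is skipped because it has no target attribute.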
In this particular case, the Twitter handles were the only element that could be scraped based on our formula and how it was originally written within the HTML, but sometimes that’s not the case. What if we were looking for travel bloggers and came across a site like the one seen below, where our desired Twitter handles are within a text paragraph?
Taking a look at the Inspect Element button, we see the following information:
In the top rectangle is the div and the class we need, and in the second rectangle is the other half of the information we require: the <p> tag. The <p> tag is used in HTML to specify where a given paragraph is. The Twitter handles we’re looking for are located within a text paragraph, so we’ll need to select the <p> tag as the element to scrape.
Once again, we input the URL into a cell (any empty cell works) and write out the new formula =IMPORTXML(A1, "//div[@class='span8 column_container']//p"). Instead of selecting all of the h3 elements like in the preceding example, this time we’re finding all of the <p> tags within the div elements that have a class of “span8 column_container”. The reason we’re looking for <p> tags within div elements that have a class of “span8 column_container” is because there are other <p> tags on the page that contain information we likely won’t need. All of the Twitter handles are contained within <p> tags inside that specifically-classed div, so by selecting it, we’ll have selected the most appropriate data.
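The class-based selection can be sketched the same way in Python. Again, the markup, the “sidebar” class and the handle below are invented for illustration; only the “span8 column_container” class comes from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one div we care about and one we want to skip.
PAGE = """
<html><body>
  <div class="sidebar"><p>Subscribe to our newsletter</p></div>
  <div class="span8 column_container">
    <p>Jane writes about budget travel. Follow her at @example_jane</p>
    <p>A paragraph with no handle in it.</p>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Same idea as //div[@class='span8 column_container']//p:
# only <p> tags inside the div whose class attribute is exactly that string.
paragraphs = [
    "".join(p.itertext()).strip()
    for p in root.findall(".//div[@class='span8 column_container']//p")
]

print(paragraphs)
```

One thing worth knowing: [@class='...'] compares the whole attribute value, so a div with only “span8”, or with the same classes in a different order, would not match.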
However, the results of this are not perfect and look like this:
The results are less than ideal, but manageable nonetheless – we ultimately just want Twitter handles, but are provided with a whole bunch of other text. Highlighted in the green rectangle is a result closer to what I want, but not in the column I need (there’s also another one down the page out of the view of the screenshot, but most are where I need them). To make sure we get all the data in the appropriate format, we can copy and paste values for everything within columns A–C, which will remove the values populated by formulas and replace them with hard values that can be manipulated. Once that is done, we can cut and paste the outlying values (one in column B and one in column C) into their corresponding cells in column A.
All of our data is now in column A; however, some of the cells include information that does not contain a Twitter handle. We’re going to fix this by running the =QUERY function and separating the cells that contain “@” from the ones that do not. In a separate cell (I used cell C4), we’re going to input =QUERY(A4:A36, "Select A where A contains '@'") and hit enter. BOOM. From here on, we’ll have only cells that contain Twitter handles, a huge improvement over having a mixed bag of results that contain both cells with and without Twitter handles. To explain, our formula can be translated as “From within the array A4:A36, select the cell in column A when that cell contains ‘@’.” It’s pretty self-explanatory, but is nonetheless a fantastic formula that is incredibly powerful. The image below shows what this looks like:
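If it helps to see the logic outside of Sheets, the “contains” filter boils down to something like the following Python sketch. The cell values are placeholders, not the real scraped text.

```python
# Hypothetical cell values standing in for the range A4:A36.
column_a = [
    "Jane writes about budget travel. Follow her at @example_jane",
    "A paragraph with no handle in it.",
    "",
    "Sam covers food and road trips: @example_sam",
]

# Rough equivalent of: Select A where A contains '@'
with_handles = [cell for cell in column_a if "@" in cell]

print(with_handles)
```

Like the QUERY version, this keeps the whole cell whenever an “@” appears anywhere in it, so an email address would slip through as well; the leftover text still has to be trimmed in a later step.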
Keep in mind that the results we just pulled are going to contain excess information within the cells that we’ll need to remove. To do this, we’ll need to run the =REGEXEXTRACT formula, which will pretty much eliminate any need you have for the =RIGHT, =LEFT, =MID, =FIND, and =LEN formulas, or any mixture of those. While useful, these functions can get a bit complicated and need to work in unison in order to produce the same results as =REGEXEXTRACT. A more detailed explanation of these formulas with visuals can be found here.
We’ll run the formula on the results produced from running the =QUERY formula. Using =REGEXEXTRACT, we’ll select the top cell in the queries column (in this case, C4) and then extract everything in it beginning with “@”, the start of what we’re looking for. Our desired formula, which I placed in cell E4, will look like =REGEXEXTRACT(C4, "\@.*"). The backslash escapes the character that follows it, and the .* means select everything after. Thus, the formula can be translated as “For cell C4, extract all of the content beginning at the ‘@’.”
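To make the comparison with the FIND/MID style of formula concrete, here is a small Python sketch that does the extraction both ways on one made-up cell value. The variable names are illustrative only.

```python
import re

# Hypothetical contents of cell C4.
cell_c4 = "Jane writes about budget travel. Follow her at @example_jane"

# Regex route, analogous to REGEXEXTRACT: "@" followed by everything after it.
# "@" has no special meaning in a regular expression, so no escape is needed here.
match = re.search(r"@.*", cell_c4)
extracted = match.group(0) if match else None

# Manual route, analogous to combining FIND with MID or RIGHT:
# locate the "@" first, then slice from that position to the end.
start = cell_c4.find("@")
manual = cell_c4[start:] if start != -1 else None

print(extracted)
print(manual)
```

Both routes land on the same text; the regex simply folds the “find the position, then cut” steps into a single pattern.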
To get all of the other values, all we need to do is click and grab the bottom right corner of cell E4 and drag it down to cell E28, alongside the end of our array at cell C28. Dragging down the corner of E4 will apply the formula within it to the cells included within the drag. We want to include up to E28 because the corresponding cell C28 is the last cell in the array we are applying the formula to. Doing this will provide the results shown below:
Though a nice and clean output, the data in column E is created by formula and cannot be easily manipulated. We’ll need to copy and paste as values within this column to have everything we need and be able to manipulate the data.
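The drag-down step is just the same extraction repeated for every row. As a rough Python sketch (with placeholder rows, and a hypothetical helper named extract), it looks like this. The second pattern is an optional tightening that is not part of the walkthrough above: it stops at the end of the handle rather than the end of the cell.

```python
import re

# "@" and everything after it, as in the REGEXEXTRACT formula.
HANDLE_TO_END = re.compile(r"@.*")
# Assumption for illustration: a handle is "@" plus letters, digits or underscores.
HANDLE_ONLY = re.compile(r"@[A-Za-z0-9_]+")

# Hypothetical rows standing in for the filtered column C.
column_c = [
    "Follow her at @example_jane",
    "Sam (@example_sam) covers food and road trips",
]


def extract(pattern, cell):
    """Return the first match in the cell, or an empty string if there is none."""
    found = pattern.search(cell)
    return found.group(0) if found else ""


# One result per row, like dragging the formula down column E.
column_e = [extract(HANDLE_TO_END, cell) for cell in column_c]
handles_only = [extract(HANDLE_ONLY, cell) for cell in column_c]

print(column_e)
print(handles_only)
```

When a handle sits in the middle of a sentence, the “everything after” pattern drags the rest of the sentence along, which is where the narrower pattern earns its keep. Returning an empty string for rows with no match is a choice made for this sketch.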
If you’d like to play around with the Google Sheet and make your own copy, you can find the original here.
Hopefully this helps provide some direction and insight into how you can easily scrape and clean data from web pages. If you’re interested in learning more, here’s a list of great resources:
- Xpath Data Scraping Tutorial video (for PC users)
- The ImportXML Guide for Google Docs
- A Content Marketer’s Guide to Data Scraping
- How to Get the Most Out of Regex
Want more use cases, tips, and things to watch out for when scraping? I interviewed the following experts for their insights into the world of web scraping:
- Dave Sottimano, VP Strategy, Define Media Group, Inc.
- Chad Gingrich, Senior SEO Manager, Seer Interactive
- Dan Butler, Head of SEO, Builtvisible
- Tom Critchlow, tomcritchlow.com
- Ian Lurie, CEO and Founder, Portent, Inc.
- Mike King, Founder, iPullRank
Question 1: Describe a time when automated scraping “saved your life.”
“During the time when hreflang was first released, there were a lot of implementation & configuration issues. While isolated testing was very informative, it was the automated scraping of SERPs that helped me realize the impact of certain international configurations and make important decisions for clients.” – Dave Sottimano
“We wanted a way to visualize forum data to see what types of questions their clients’ audiences were talking about most frequently to be able to create a content strategy out of that data. We scraped Reddit and various forums, grabbing data like post titles, views, number of replies, and even the post content. We were able to aggregate all that data to put together a really interesting look at the most popular questions and visualize keywords within the post title and comments that might be a prime target for content. Another way we use scraping often at Seer is for keyword research. Being able to look at much larger seed keyword sets provides a huge advantage and time savings. Additionally, being able to easily pull search results to inform your keyword research is important and couldn’t be done without scraping.” – Chad Gingrich
“I’d say scraping saves my life on a regular basis, but one scenario that stands out in particular was when a client requested Schema.org mark-up for each of its 60 hotels in 6 different languages. Straightforward request, or so I thought—turns out they had very limited development resource to implement themselves, and an aged CMS that didn’t offer the capabilities of simply downloading a database so that mark up could be appended. Firing up ImportXML in Google Sheets, I could scrape anything (titles, source images, descriptions, addresses, geo-coordinates, etc.), and combined with a series of concatenates was able to compile the data so all that was needed was to upload the code to the corresponding page.” – Dan Butler
“I’ve lost count of the times when ad-hoc scraping has saved my bacon. There were low-stress times when fetching a bunch of pages and pulling their meta descriptions into Excel was useful, but my favorite example in recent times was with a client of mine who was in talks with Facebook to be included in F8. We were crunching data to get into the keynote speech and needed to analyze some social media data for URLs at reasonable scale (a few thousand URLs). It’s the kind of data that existed somewhere in the client’s system as an SQL query, but we didn’t have time to get the dev team to get us the data. It was very liberating to spend 20 minutes fetching and analyzing the data ourselves to get a fast turnaround for Facebook.” – Tom Critchlow
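The low-stress case Tom mentions, pulling meta descriptions into a spreadsheet, can be sketched with nothing but the Python standard library. This is only an illustration: it assumes the pages have already been fetched into a dictionary of URL to HTML, and the names `meta_description` and `to_csv` are made up for the example.

```python
import csv
import io
from html.parser import HTMLParser

class MetaDescriptionParser(HTMLParser):
    """Remember the first <meta name="description"> content seen."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attrs = dict(attrs)
        # Attribute values can be None, so fall back to an empty string.
        if (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()

def meta_description(html):
    parser = MetaDescriptionParser()
    parser.feed(html)
    parser.close()
    return parser.description

def to_csv(pages):
    """Turn a {url: html} dict into CSV text that Excel can open."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["url", "meta_description"])
    for url, html in pages.items():
        writer.writerow([url, meta_description(html) or ""])
    return out.getvalue()
```

Pages without a description end up with an empty cell, which makes missing descriptions easy to filter for.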
“We discovered a client simultaneously pointed all of their home page links at a staging subdomain, and that they’d added a meta robots noindex/nofollow to their home page about one hour after they did it. We saw the crawl result and thought, “Huh, that can’t be right.” We assumed our crawler was broken. Nope. That’s about the best timing we could’ve hoped for. But it saved the client from a major gaffe that could’ve cost them tens of thousands of dollars. Another time we had to do a massive content migration from a client that had a static site. The client was actually starting to cut and paste thousands of pages. We scraped them all into a database, parsed them and automated the whole process.” – Ian Lurie
“Generally, I hate any task where I have to copy and paste, because any time you’re doing that, a computer could be doing it for you. The moment that stands out the most to me is when I first started at Razorfish and they gave me the task of segmenting 3 million links from a Majestic export. I wrote a PHP script that collected 30 data points per link. This was before any of the tools like CognitiveSEO or even LinkDetective existed. Pretty safe to say that saved me from wanting to throw my computer off the top of the building.” – Mike King
Question 2: What are your preferred tools/methods for doing it?
“Depends on the scale and the type of job. For quick stuff, it’s usually Google Docs (ImportXML, or I’ll write a custom function), and at scale I really like Scrapinghub. As SEO tasks move closer towards data analysis (science), I think I’ll be much more likely to rely on web import modules provided by big data analytics platforms such as RapidMiner or KNIME for any scraping.” – Dave Sottimano
“Starting out, Outwit is a great tool. It’s essentially a browser that lets you build scrapers easily by using the source code. …I’ve started using Ruby to have more control and scalability. I chose Ruby because of the front end/backend components, but Python is also a great choice and is definitely a standard for scraping (Google uses it). I think it’s inevitable that you learn to code when you’re interested in scraping because you’re almost always going to need something you can’t readily get from simple tools. Other tools I like are the scraper Chrome plugin for quick one page scrapes, Scrapebox, RegExr, & Text2re for building and testing regex. And of course, SEO Tools for Excel.” – Chad Gingrich
“I love tools like Screaming Frog and URL Profiler, but find that having the power of a simple spreadsheet behind the approach offers a little more flexibility by saving time being able to manage the output, perform a series of concatenated lookups, and turn it into a dynamic report for ongoing maintenance. Google Sheets also has the ability for you to create custom scripts, so you can connect to multiple APIs or even scrape & convert JSON output. Hey, it’s free as well!” – Dan Butler
“Google Docs is by far the most versatile, powerful and fast method for doing this, in my personal experience. I started with ImportXML and cut my teeth using that before graduating to Google Scripts and more powerful, robust, and cron-driven uses. Occasionally, I’ve used Python to build my own scrapers, but this has so far never really proven to be an effective use of my time—though it has been fun.” – Tom Critchlow
“We have our own toolset in-house. It’s built on Python and Cython, and has a very powerful regex engine, so we can extract pretty much anything we want. We also write custom tools when we need them to do something really unique, like analyze image types/compression. For really, really big sites—millions of pages—we may use DeepCrawl. But our in-house toolset does the trick 99% of the time and gives us a lot of flexibility.” – Ian Lurie
“While I know there are a number of WYSIWYG tools for it at this point, I still prefer writing a script. That way I get exactly what I want and it’s in the precise format that I’m looking for.” – Mike King
Question 3: What are common pitfalls with web scraping to watch out for?
“Bad data. This ranges from hidden characters and encoding issues to bad HTML, and sometimes you’re just being fed crap by some clever system admin. As a general rule, I’d far rather pay for an API than scrape.” – Dave Sottimano
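Hidden characters and encoding issues are the part of "bad data" that is easiest to guard against in code. The sketch below is one possible cleanup pass, not a complete solution: the `clean_text` name and the particular list of zero-width characters are assumptions chosen for the example.

```python
import unicodedata

# Zero-width characters and the byte-order mark often hide in scraped text.
HIDDEN = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

def clean_text(raw):
    """Normalize scraped text: decode, unify Unicode forms, drop invisible junk."""
    # Undecodable bytes become U+FFFD instead of raising an exception.
    text = raw.decode("utf-8", errors="replace") if isinstance(raw, bytes) else raw
    # NFKC folds look-alike forms together (for example, a non-breaking
    # space becomes a regular space).
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(HIDDEN)
    # Collapse every run of whitespace into a single space.
    text = " ".join(text.split())
    # Drop any remaining control characters such as NUL.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
```

A pass like this will not rescue data that a site is deliberately poisoning, but it does stop invisible characters from silently breaking joins and lookups later on.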
“Just because you can scrape something doesn’t mean you should, and sometimes too much data just confuses the end goal. I like to outline what I’m going to scrape and why I need it/what I’ll do with that data before scraping one piece of data. Use brain power up front, let the scraping automate the rest for you, and you’ll come out the other side in a much better place.” – Chad Gingrich
“If you’re setting up dynamic reports or building your own tools, make sure you have something like Change Detection running so you can be alerted when X% of the target HTML has changed, which could invalidate your XPath. On the flipside, it’s crazy how common parsing private API credentials/authentication is via public HTTP GET requests or over XHR—seriously, sites need to start locking this stuff down if they don’t want it accessible in the public domain.” – Dan Butler
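If a hosted change-detection service is not an option, the same kind of check can be approximated locally. The following is a rough sketch under stated assumptions: it compares only the sequence of tags (with their `id` and `class` attributes) between a saved baseline and a fresh copy of the page, and the 20% threshold and all function names are illustrative.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect (tag, id, class) for every start tag, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self.tags.append((tag, attrs.get("id"), attrs.get("class")))

def skeleton(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags

def changed_fraction(old_html, new_html):
    """Return 0.0 for an identical tag structure, up to 1.0 for no overlap."""
    matcher = SequenceMatcher(None, skeleton(old_html), skeleton(new_html), autojunk=False)
    return 1.0 - matcher.ratio()

def needs_review(old_html, new_html, threshold=0.2):
    # The threshold is an arbitrary starting point, not a recommended value.
    return changed_fraction(old_html, new_html) > threshold
```

Comparing structure rather than raw text means ordinary content updates do not trigger an alert, while a template redesign that could break a selector does.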
“The most common pitfall with computers is that they only do what you tell them—this sounds obvious, but it’s a good reminder that when you get frustrated, you usually only have yourself to blame. Oh—and don’t forget to check your recurring tasks every once in a while.” – Tom Critchlow
“It’s important to slow your crawls down. I’m not even talking about Google scraping. I’m talking about crawling other folks’ web sites. I’m continuously amazed at just how poorly optimized most site technology stacks really are. If you start hitting one page a second, you may actually slow or crash a site for a multi-million-dollar business. We once killed a client’s site with a one-page-per-second crawl—they were a Fortune 1000 company. It’s ridiculous, but it happens more often than you might think. Also, if you don’t design your crawler to detect and avoid spider traps, you could end up crawling 250,000 pages of utter duplicate crap. That’s a waste of server resources. Once you find an infinitely-expanding URL or other problem, have your crawler move on.” – Ian Lurie
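Both of Ian's points, throttling and spider-trap avoidance, can be sketched in a few lines of Python. Treat this as an illustration rather than a production crawler: the five-second delay, the depth and repeat limits, and the names `Throttle` and `looks_like_trap` are all assumptions picked for the example, and a real crawler would tune them per site.

```python
import time
from collections import Counter
from urllib.parse import urlsplit

def looks_like_trap(url, max_depth=10, max_repeats=2, max_length=200):
    """Heuristic check for infinitely expanding URLs."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(url) > max_length or len(segments) > max_depth:
        return True
    # The same path segment repeating many times usually signals a loop.
    return any(n > max_repeats for n in Counter(segments).values())

class Throttle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, delay_seconds=5.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay_seconds
        self.clock = clock
        self.sleep = sleep
        self.last_hit = {}

    def wait(self, url):
        host = urlsplit(url).netloc
        last = self.last_hit.get(host)
        if last is not None:
            remaining = self.delay - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
        self.last_hit[host] = self.clock()
```

In a crawl loop, the idea would be to skip any URL for which `looks_like_trap(url)` is true and to call `throttle.wait(url)` immediately before each request, so that the crawler moves on from problem URLs and never hits one host faster than the configured delay.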
“The biggest pitfall I run into these days is that a lot of sites are rendering their content with JavaScript and a standard text-based crawler doesn’t always cut it. More often than not, I’m scraping with a headless browser. My favorite abstraction of PhantomJS is NightmareJS because it’s quick and easy, so I use that. The other thing is that sometimes people’s code is so bad that there’s no structure, so you end up grabbing everything and needing to sort through it.” – Mike King
Do you have any interesting use-cases or experiences with data scraping? Sound off in the comments!