
IPv6, C-Blocks, and How They Affect SEO

Posted by Tom-Anthony

You have probably heard about IPv6, but you might remain a bit confused about the details of what it is, how it works, and what it means for the future of the Internet. This post gives a quick introduction to IPv6, and discusses the SEO implications that could follow from the IPv6 roll-out (touching specifically on the concept of C-Blocks). A quick caveat: This stuff is hard, so let me know if you spot any missteps!

A very brief intro to IP addresses (v4) & C-Blocks

You’re likely familiar with IP addresses; they are usually written as four numbers separated by periods (e.g. 199.181.132.250):

Example IP address (IPv4).

This format of an IP address is the common format in use everywhere, and is called IPv4. There are four bytes in an IP address like this, with each byte separated by a period (meaning 32 bits in total, for the geeks). Every (sub)domain resolves to at least one such IP address (it might be several, but let’s ignore that for now). Nice and simple.
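As a quick illustration, here is a minimal Python sketch of that resolution step; www.example.com is just a placeholder domain, and the addresses returned will depend on your network and DNS:

```python
import socket

# Resolve a (sub)domain to its IPv4 address(es).
# Returns a tuple: (hostname, list of aliases, list of IPv4 addresses).
print(socket.gethostbyname_ex("www.example.com"))
```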

Now a key SEO concept that comes out of this is the idea of C-Blocks (not to be confused with Class C IPs, a different thing that people often mix up with C-Blocks), which has been around in the SEO space for a decade or more. Very simply, the idea is that if the first 3 bytes of two IP addresses are identical, then we consider the two addresses to be in the same C-Block:

Two example IP addresses in the same C-Block (blue).
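Because a C-Block is just a shared first three bytes, it is easy to check in code. Here is a minimal sketch using Python’s standard ipaddress module; the two address pairs are illustrative:

```python
import ipaddress

def same_c_block(ip_a: str, ip_b: str) -> bool:
    """True if two IPv4 addresses share their first three bytes (the same /24)."""
    return (ipaddress.ip_network(f"{ip_a}/24", strict=False)
            == ipaddress.ip_network(f"{ip_b}/24", strict=False))

print(same_c_block("199.181.132.249", "199.181.132.250"))  # True  (same C-Block)
print(same_c_block("199.181.132.250", "199.182.132.250"))  # False (second byte differs)
```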

So why is this interesting to us? Why is this important to SEO? The old-school logic is that if two IPs are in the same C-Block, then the sites are quite likely related, and thus the links between those sites (on average) should not count as strongly in terms of PageRank. My personal opinion is that nowadays Google has many other signals available to make these same sorts of connections, so the C-Block issue is far less important than it once was.

So, as it turns out (surprise!) the two IP addresses above are indeed related:

Disney and ABC have a near identical IP address, both in the same C-Block.

Sure enough, they are both companies in the Disney family. It makes sense that links between these two domains probably shouldn’t indicate as much trust as links from similarly large, but unrelated, sites.

Introducing IPv6

So, there is a problem with IP addresses in the format above (IPv4): there are “only” 4 billion of them, and we have essentially exhausted the supply. We have so many connected devices nowadays, and the creators of IPv4 never envisioned how vast the Internet would become in the 30+ years since its release. Luckily, they saw the problem early on and started working on a successor, IPv6 (IPv5 had already been used for another protocol that never took off).

IPv6 address format:

IPv6 addresses are much longer than IPv4 addresses; the format looks like this (e.g. 2001:0db8:85a3:0000:0000:8a2e:0370:7334):

An example IPv6 address.

Things just got serious! There are now 8 blocks rather than 4, and rather than each block being 1 byte (represented as a number from 0-255), each block is 2 bytes represented by 4 hexadecimal characters. There are 128 bits in an IPv6 address, meaning that instead of a measly 4,000,000,000 addresses like IPv4, IPv6 has around 340,000,000,000,000,000,000,000,000,000,000,000,000.
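You can verify those sizes with a couple of lines of Python; the IPv6 address below is the standard documentation example, and 2^128 is the exact figure the rounded number above comes from:

```python
import ipaddress

v4 = ipaddress.ip_address("199.181.132.250")
v6 = ipaddress.ip_address("2001:0db8:85a3:0000:0000:8a2e:0370:7334")

print(2 ** v4.max_prefixlen)  # 4294967296 (~4 billion)
print(2 ** v6.max_prefixlen)  # 340282366920938463463374607431768211456 (~3.4 x 10^38)
```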

In the next few years we’ll be entering a world where hundreds of devices in each home are capable of networking and need an IP address, and IPv6 will help make that a reality. We are also going to see websites using IPv6 addresses more and more commonly, and a few years from now we’ll start to see websites that only have an IPv6 address.


CIDR Notation

Before we go any further, we need to introduce a key concept for understanding IP addresses: CIDR notation.

IPv6 exclusively uses CIDR notation (e.g. /24), so the SEO community will need to understand this concept. It is really simple, but usually explained badly.

As we mentioned, IPv4 IP addresses are 32 bits long, so if we were sick and twisted we could look at the IP address as binary:

Example IPv4 IP address shown in dot decimal format and as binary.
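If you want to see that binary view for yourself, here is a quick Python sketch; the address is the illustrative one from above:

```python
import ipaddress

ip = ipaddress.IPv4Address("199.181.132.250")
bits = format(int(ip), "032b")  # the address as 32 binary digits
print(".".join(bits[i:i + 8] for i in range(0, 32, 8)))
# 11000111.10110101.10000100.11111010
```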

Colloquially, CIDR notation could be described as a format for describing a group of closely related IP addresses, in a similar fashion to how a C-Block works. It is represented by a number after a slash appended to a partial IP address (e.g. 199.181.132/24), which states how many of the leading bits (binary digits) are identical. CIDR is flexible, and in CIDR terms a C-Block would be a /24, because the first 24 bits (3 groups of 8 bits) of the address are the same:

Two IP addresses in the same C-Block. The first 24 bits (3 blocks of 8 bits) are identical.

This can be represented in this case as 199.181.132/24.

Now CIDR notation is more refined and more accurate than the concept of a C-Block; in the example above the two IP addresses are not just in the same C-Block, they are even more closely related, as the first 6 bits of the last block are also identical. In CIDR notation we could say both these IP addresses are in the same /30 block, indicating that the 30 leading bits are identical.

Notice that with CIDR the smaller the number after the slash, the more IP addresses in that block (because we’re saying fewer leading bits must be identical).
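A short sketch makes this concrete. The two addresses are the illustrative pair from above, and num_addresses shows how the block grows as the prefix number shrinks:

```python
import ipaddress

a = ipaddress.ip_address("199.181.132.249")
b = ipaddress.ip_address("199.181.132.250")

for prefix in (30, 24, 16):
    net = ipaddress.ip_network(f"199.181.132.248/{prefix}", strict=False)
    print(f"/{prefix}: {net.num_addresses} addresses, "
          f"contains both: {a in net and b in net}")
# /30: 4 addresses, /24: 256, /16: 65536 -- all three contain both example IPs
```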

IPv6 & C-Blocks?

Now, “CIDR /24” is not exactly catchy, so someone made up the name “C-Block” to make it easier to talk about, but that name doesn’t extend so easily to IPv6. So, the question is: can we generalise something similar?

The point of a C-Block, from Google’s perspective and from ours as SEOs, is solely to identify whether links originate on the same ISP network, so that should remain the focus. My best guess would be to focus on how these IPs are allocated to ISPs (ISPs normally get large contiguous blocks of IP addresses, which they can then use for their customers’ websites).

In IPv4 ISPs would own bunches of C-Blocks, and so if you could see multiple links originating from the same C-Block it implied the sites were hosted together, and there was a far greater chance they were somehow related.

Illustration of an “ISP Block” (/32); the blue part of the address is stable and indicates the ISP. The red part can change and represents addresses at that ISP.

With IPv6, I believe that ISPs will be given /32 blocks (the leading 32 bits will be the same, leaving 96 bits to create addresses for their customers), which they will then assign to their users in /64 blocks (I asked a few people, and this tends to be what is happening, though I have read that it might sometimes be /48 blocks instead). Notice that each ISP now has vastly more IP addresses than the whole IPv4 Internet had in total!

This also means each end user will get more IP addresses for their own network than there are in total IPv4 IP addresses. Welcome to the Internet of things!

Some of these ISPs serve home users, so each house gets a block of IPv6 addresses for its devices (for the techies: IPv6 does away with NAT for the most part, I believe; all the devices in your house will get a ‘real’ IP). In the other scenario the ISP serves servers, and here each server gets assigned a /64 block; this is the case we are interested in.

Illustration of a “Customer Block” (/64); the blue part indicates a particular customer. The red part can change and represents addresses belonging to that customer.

So, I think the equivalent of a C-Block in IPv6 land would be the /32 block, because that is what an ISP will usually be assigned (and it allows them to carve that up into 4 billion /64 blocks for their users!).
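Computing these proposed blocks for a given address is straightforward. Here is a minimal sketch using Python’s ipaddress module; the address is the documentation example from earlier, and the “ISP Block”/“Customer Block” labels are the names proposed in this post, not standard terminology:

```python
import ipaddress

def ipv6_blocks(addr: str):
    """Return the /32 ('ISP Block') and /64 ('Customer Block') containing addr."""
    isp = ipaddress.ip_network(f"{addr}/32", strict=False)
    customer = ipaddress.ip_network(f"{addr}/64", strict=False)
    return isp, customer

isp, customer = ipv6_blocks("2001:0db8:85a3:0000:0000:8a2e:0370:7334")
print(isp)       # 2001:db8::/32      -- same block implies same ISP
print(customer)  # 2001:db8:85a3::/64 -- same block implies same customer/server
```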

Furthermore, in IPv6 the minimum allocation is /32, so as I understand it a single /32 block cannot run across multiple ISPs; there is no way two IPs in the same /32 could belong to two different ISPs. If our goal is still to judge whether two sites are more likely related than two random sites, then knowing they are on the same ISP (which is exactly what C-Blocks told us) is what we need.

Also, if you chose /64 instead, then each ISP has 4 billion of these blocks to give away, which is far too sparse to identify associations between sites in different blocks.

However, there is a counter-argument here. Note that a single server having a /64 block of IPs means that every website should have a different IPv6 address (even if it shares an IPv4 address).
Geek side note: indeed, the HTTP Host header accepts an IPv6 address to distinguish which site on the server you want.

So now a single server with multiple sites will have a separate IP for each of those sites (it is also possible that the server has multiple IPv6 blocks assigned, one for each different customer – I think this is actually the intention and hopefully becomes the reality).

So, if I am running a network of websites that I’m interlinking with one another, then it is quite likely, if I just have a single hosting account, that all of these sites are in the same /64 block of IPv6 addresses. That should be a very strong signal that the sites are closely linked. However, I’m fairly sure that those trying to be manipulative will try to avoid this scenario and get a different block of addresses for each site. But if they are with the same ISP, then they’ll still be in the same /32 block.

My recommendation on an IPv6 C-Block

So, if you followed all that then I’d suggest:
  • Sites in the same /32 block are the equivalent of sites in the same C-Block in IPv4.
  • Sites in the same /64 block are either on the exact same server or belong to the same customer, so are even more closely related than at C-Block level.
These need easier, more accessible names; how about:
  • “ISP Block” for /32 blocks.
  • “Customer Block” for /64 blocks.
Then we would be able to say things like:
  • In IPv6, IP addresses in the same ISP Block most closely resemble the relationship of IPs in the same C-Block in IPv4.
  • In IPv6, IP addresses in the same Customer Block are likely very closely related, and probably belong to the same person/organisation.

What should I take away from all this?

As I mentioned further up, I’m not convinced that IPv4 C-Blocks are as important from Google’s perspective as they once were, as Google can likely access multiple other signals to tie sites together. Whilst C-Blocks are still useful as a substitute for those signals for SEOs, who don’t have all of Google’s resources, they aren’t something that should guide your decision making. If you are running legitimate sites, you shouldn’t be concerned about hosting them in the same C-Block. In fact, I’d advise against deliberately spreading them across different C-Blocks, as that could itself look manipulative to Google (who will likely work out the relationship anyway).

With IPv6, I think “Customer Blocks” could become a very important SEO feature, as they represent an even closer relationship than C-Blocks did, and that is something Google will likely make use of. It is still going to take a while until IPv6 becomes prevalent enough for all of this to matter, so for the moment this is just something to have on your radar; it will increase in importance over the next couple of years.


Using Kimono Labs to Scrape the Web for Free

Posted by CatalystSEM

This post was originally in YouMoz, and was promoted to the main blog because it provides great value and interest to our community. The author’s views are entirely his or her own and may not reflect the views of Moz, Inc.

Historically, I have written and presented about big data—using data to create insights, and how to automate your data ingestion process by connecting to APIs and leveraging advanced database technologies.

Recently I spoke at SMX West about leveraging the rich data in webmaster tools. After the panel, I was approached by the in-house SEO of a small company, who asked me how he could extract and leverage all the rich data out there without having a development team or large budget. I pointed him to the CSV exports and some of the more hidden tools for extracting Google data, such as the GA Query Builder and the YouTube Analytics Query Builder.

However, what do you do if there is no API? What do you do if you want to look at unstructured data, or use a data source that does not provide an export?

For today’s analytics pros, the world of scraping, or content extraction (sounds less black hat), has evolved a lot, and there are lots of great technologies and tools out there to help solve those problems. Many companies have emerged that specialize in programmatic content extraction, such as Mozenda, ScraperWiki, Import.io, and Outwit, but for today’s example I will use Kimono Labs. Kimono is simple and easy to use and offers very competitive pricing (including a very functional free version). I should also note that I have no connection to Kimono; it’s simply the tool I used for this example.

Before we get into the actual “scraping” I want to briefly discuss how these tools work.

The purpose of a tool like Kimono is to take unstructured data (not organized or exportable) and convert it into a structured format. The prime example of this is a ranking tool. A ranking tool reads Google’s results page, extracts the information and, based on certain rules, creates a visual view of the data: your ranking report.

Kimono Labs allows you to extract this data either on demand or as a scheduled job. Once you’ve extracted the data, it then allows you to either download it via a file or extract it via their own API. This is where Kimono really shines—it basically allows you to take any website or data source and turn it into an API or automated export.

For today’s exercise I would like to create two scrapers.

A. A ranking tool that will take Google’s results and store them in a data set, just like any other ranking tool. (Disclaimer: this is meant only as an example, as scraping Google’s results is against Google’s Terms of Service).

B. A ranking tool for Slideshare. We will simulate a Slideshare search and then extract all the results including some additional metrics. Once we have collected this data, we will look at the types of insights you are able to generate.

1. Sign up

Signup is simple; just go to http://www.kimonolabs.com/signup and complete the form. You will then be brought to a welcome page where you will be asked to drag their bookmarklet into your bookmarks bar.

The Kimonify Bookmarklet is the trigger that will start the application.

2. Building a ranking tool

Simply navigate your browser to Google and perform a search; in this example I am going to use the term “scraping.” Once the results page is displayed, press the Kimonify button (in some cases you might need to search again). Once you complete your search you should see a screen like the one below:

It is basically the default results page, but on the top you should see the Kimono Tool Bar. Let’s have a close look at that:

The bar is broken down into a few actions:

  • URL – The current URL you are analyzing.
  • ITEM NAME – Once you define an item to collect, you should name it.
  • ITEM COUNT – This will show you the number of results in your current collection.
  • NEW ITEM – Once you have completed the first item, you can click this to start to collect the next set.
  • PAGINATION – You use this mode to define the pagination link.
  • UNDO – I hope I don’t have to explain this 😉
  • EXTRACTOR VIEW – The mode you see in the screenshot above.
  • MODEL VIEW – Shows you the data model (the items and the type).
  • DATA VIEW – Shows you the actual data the current page would collect.
  • DONE – Saves your newly created API.

After you press the bookmarklet you need to start tagging the individual elements you want to extract. You can do this simply by clicking on the desired elements on the page (collectable elements change color when you hover over them).

Kimono will then try to identify similar elements on the page; it will highlight some suggested ones and you can confirm a suggestion via the little checkmark:

A great way to make sure you have the correct elements is by looking at the count. For example, we know that Google shows 10 results per page, therefore we want to see “10” in the item count box, which indicates that we have 10 similar items marked. Now go ahead and name your new item group; each collection of elements should have a unique name. On this page, it would be “Title”.

Now it’s time to confirm the data; just click on the little Data icon to see a preview of the actual data this page would collect. In the data view you can switch between different formats (JSON, CSV and RSS). If everything went well, it should look like this:

As you can see, it not only extracted the visual title but also the underlying link. Good job!

To collect some more info, click on the Extractor icon again and pick out the next element.

Now click on the Plus icon and then on the description of the first listing. Since the first listing contains sitelinks, it is not clear to Kimono what the structure is, so we need to help it along and click on the next description as well.

As soon as you do this, Kimono will identify some other descriptions; however, our count only shows 8 instead of the 10 items that are actually on that page. As we scroll down, we see some entries with author markup; Kimono is not sure if they are part of the set, so click the little checkbox to confirm. Your count should jump to 10.

Now that you have identified all 10 objects, go ahead and name that group; the process is the same as in the Title example. In order to make our tool better than others, I would like to add one more set: the author info.

Once again, click the Plus icon to start a new collection and scroll down to click on the author name. Because this is totally unstructured, Kimono will make a few recommendations; in this case we are working by exclusion, so press the X for everything that’s not an author name. Since the word “by” is included, highlight only the name, not “by”, to exclude it (keep in mind you can always undo if things get odd).

Once you’ve highlighted both names, the results should look like the example below, with the count in the circle being 2, representing the two authors listed on this page.

Out of interest I did the same for the number of people in their Google+ circles. Once you have done that, click on the Model View button, and you should see all the fields. If you click on the Data View you should see the data set with the authors and circles.

As a final step, let’s go back to the Extractor view and define the pagination; just click the Pagination button (it looks like a book) and select the next link. Once you have done that, click Done.

You will be presented with a screen similar to this one:

Here you simply name your API, define how often you want this data to be extracted and how many pages you want to crawl. All of these settings can be changed manually; I would leave it with On demand and 10 pages max to not overuse your credits.

Once you’ve saved your API, there are a ton of options (too many to review here). Kimono has a great learning section you can check out any time.

Collecting the listings requires a quick setup: click on the Pagination tab, turn it on, and set your schedule to On demand to pull data when you ask for it. Your screen should look like this:

Now press Crawl and Kimono will start collecting your data. If you see any issues, you can always click on Edit API and go back to the extraction screen.

Once the crawl is completed, go to the Test Endpoint tab to view or download your data (I prefer CSV because you can easily open it in Excel, Spotfire, etc.). A possible next step here would be doing this for multiple keywords and then analyzing the impact of, say, G+ authority on rankings. Again, many of you might say that a ranking tool can already do this, and that’s true, but I wanted to cover the basics before we dive into the next one.

3. Extracting SlideShare data

With Slideshare’s recent growth in popularity, it has become a document-sharing tool of choice for many marketers. But what’s really on Slideshare? Who are the influencers? What makes it tick? We can use a custom scraper to extract that kind of data from Slideshare.

To get started, point your browser to Slideshare and pick a keyword to search for.

For our example I want to look at presentations that talk about PPC in English, sorted by popularity, so the URL would be:
http://www.slideshare.net/search/slideshow?ft=presentations&lang=en&page=1&q=ppc&qf=qf1&sort=views&ud=any

Once you are on that page, press the Kimonify button as you did earlier and tag the elements. In this case I will tag:

  • Title
  • Description
  • Category
  • Author
  • Likes
  • Slides

Once you have tagged those, go ahead and add the pagination as described above.

That will make a nice rich dataset which should look like this:

Hit Done and you’re finished. In order to quickly highlight the benefits of this rich data, I am going to load it into Spotfire to get some interesting statistics (I hope).

4. Insights

Rather than do a step-by-step walkthrough of how to build dashboards, which you can find here, I just want to show you some insights you can glean from this data:

  • Most Popular Authors by Category. This shows you the top contributors and the categories they are in for PPC (squares sized by Likes).

  • Correlations. Is there a correlation between the number of slides and the number of likes? Why not find out? (See the sketch after this list.)

  • Category with the most PPC content. Discover where your content works best (most likes).
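To make the correlation bullet concrete, here is a minimal pandas sketch, assuming you exported the Kimono collection to a CSV named slideshare_ppc.csv with columns labelled “slides”, “likes”, and “category” (all hypothetical names; match them to whatever you called your items while tagging):

```python
import pandas as pd

# Hypothetical filename and column names -- adjust to your own export.
df = pd.read_csv("slideshare_ppc.csv")

print(df["slides"].corr(df["likes"]))    # correlation between slide count and likes
print(df.groupby("category")["likes"]
        .sum()
        .sort_values(ascending=False))   # which categories collect the most likes
```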

5. Output

One of the great things about Kimono that we have not really covered is that it actually converts websites into APIs. That means you build them once, and each time you need the data you can call it up; if I call the Slideshare API again tomorrow, the data will be different. You have basically “appified” Slideshare. The interesting part here is the flexibility that Kimono offers. If you go to the How to Use slide, you will see the way Kimono treats the source URL. In this case it looks like this:

Aside from the export, the way you can pull data from Kimono is via their own API; in this case you call the default URL,
http://www.kimonolabs.com/api/YOURAPIID?apikey=YO…

You would get the default data from the original URL; however, as illustrated in the table above, you can dynamically adjust elements of the source URL.

For example, if you append “&q=SEO”
(http://www.kimonolabs.com/api/YOURAPIID?apikey=YOURAPIKEY&q=SEO)
you would get the top slides for SEO instead of PPC. You can change any of the URL options easily.
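For completeness, here is a hedged sketch of making that call from Python with the requests library; YOURAPIID and YOURAPIKEY are the same placeholders used above, not real credentials:

```python
import requests

API_ID = "YOURAPIID"    # placeholder -- your Kimono API ID
API_KEY = "YOURAPIKEY"  # placeholder -- your Kimono API key

resp = requests.get(
    f"http://www.kimonolabs.com/api/{API_ID}",
    params={"apikey": API_KEY, "q": "SEO"},  # q overrides the source URL's query term
)
resp.raise_for_status()
print(resp.json())  # the scraped results as structured JSON
```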

I know this was a lot of information, but believe me when I tell you, we have just scratched the surface. Tools like Kimono offer a variety of advanced functions that really open up the possibilities. Once you start to realize the potential, you will come up with some amazing, innovative ideas; I would love to see some of them shared here in the comments. So get out there and start scraping, and please feel free to tweet at me or reply below with any questions or comments!

