Posted by rjonesx.
As much as we like to debate content vs. links, sometimes great content just seems to dominate. I don’t mean to say that great content doesn’t earn great links, or that the purpose of creating great content isn’t to earn links, but simply that some content on the web seems to shine through the SERPs.
Content might not be king, but it has a lot of sway in Google’s kingdom.
After sifting through tons of SERP data to find million-dollar answer boxes (answer box results that rank at the top for keywords driving millions of dollars in traffic), I decided to dig deep to find content just like it across the web. But I wanted to do something different, something harder. I wanted to find content that didn’t have huge Domain Authority. Sure, it is easy for the Wikipedias and YouTubes of this world to rank for huge keywords, but what about the little guy? Are there any pieces of content out there bringing in millions of dollars of traffic from domains with Domain Authority around 50 or lower? And if so, what sets this content apart from the rest? Let’s find out!
First, I needed a little help in deconstructing exactly what makes this great content tick. I enlisted the SEO greats: Garrett French of Citation Labs, who essentially wrote the book on linkable content, and Mark Traphagen, social media guru extraordinaire from Stone Temple.
So let’s begin.
Finding great content
I didn’t want to start with any assumptions. I didn’t want to assume that great content was pretty, or thorough, or authoritative. I wanted to judge content by its results, not its features. I set 3 distinct qualifications:
- The content URL couldn’t be a home page.
- The domain couldn’t have a Moz Domain Authority above 55.
- The content URL had to earn more than $1,000,000 a year in traffic, based on a recent click-through model, traffic volume, and the estimated CPC of the keywords for which it ranks (a rough sketch of this kind of estimate follows this list).
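As a rough sketch of how such a traffic-value estimate might be computed (my framing, not necessarily the exact model used in this study): for each keyword a URL ranks for, multiply monthly search volume by a modeled click-through rate for its ranking position and by the keyword’s estimated CPC, then sum across keywords and annualize. In Sheets terms, assuming hypothetical columns B (monthly volume), C (modeled CTR for the position), and D (estimated CPC), with one keyword per row:

=SUMPRODUCT(B2:B1000, C2:C1000, D2:D1000) * 12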
With those parameters set, I went digging. With SERPScape and the MozScape API, we quickly uncovered dozens of contenders out of just a sampling of the data set. So, what did we discover? What patterns did we find across the board? What set this content apart?
Feature #1: On-point
One of the most obvious trends was simply how perfectly and thoroughly the top content answered users’ queries. It wasn’t that the content was necessarily long (although in many cases it was). However, the content was highly relevant, regardless of its length. Take, for example, this “bed sizes” web page on SleepTrain.com.
Most webmasters would be content with just throwing up a quick intro paragraph and dimensions, but the SleepTrain site provides the information in several different ways…
- An overlay comparison image with dimensions
- A textual table listing the sizes
- Several separate images showing how people fit on the different mattress sizes
- A textual analysis of common bed sizes, describing who would and wouldn’t fit based on height
Now, I know what you’re thinking: “This isn’t all that great!” But everything must be seen in context. Look at the next several listings: Wikipedia is a nightmare of text, BetterSleep is just text, Bedding Experts is a little better but lacks the overlay chart, and SleepCountry has only the overlay chart… No other page in the top 10 answers all of a user’s questions as thoroughly yet succinctly as the SleepTrain site.
But don’t take my word for it; we saw this over and over again in the data. We know that good, thorough content can rank well, and we saw just that. The average topical relevancy scores of our Million Dollar Content pieces were significantly better, time and time again, than those of the average competition in the SERPs.
In fact, some pages had scores that were truly mind-blowing. One particular page on resume templates hit 99.96% relevancy! To get that level of precision, not only do you need to be highly thorough, you also have to be highly restrictive to prevent the addition of content that isn’t relevant. That means no filler. As a result, this one particular page ranked for over 2,000 related keywords!
Feature #2: Bold
Conventional wisdom rarely helps you win in a competitive atmosphere. If you do what everyone else thinks should be done, you are predictable, and predictable is beatable.
For a few years now, one of the items on my regular audit list has been page speed. We know that TTFB (time to first byte) correlates with search rankings, that fast download speeds correlate with increased conversions and better user engagement, and we even have an official announcement from Google that page speed matters for rankings.
Well, StayGlam gives Google a giant middle finger when it comes to page speed. The page is bold, image-laden, and even filled with ads.
The page clocks in at a turtle’s pace of 24.9 seconds to load and weighs in at an elephantine 7.49MB! But maybe that is the point.
The game of SEO is all about compromises. When you make a page load quickly, you often have to compromise on images, text, and thoroughness. When you make a page informative, you might have to compromise on conversion rates. In this case, the webmasters struck a completely different balance. They chose not to compromise on thoroughness, information content, or conversion points (look at the ads!), and instead let page speed die a horrendous death. But the trade-off worked!
StayGlam wasn’t the only site we saw throw page speed to the wind in order to go big. Sites in the resume, calendar, degree, and health care spaces often took refuge in being big before being quick.
But we also saw the opposite hold true. Pared-back resources that answer one question very quickly, very easily, very simply can also win. What never seems to make its way to the top, though, is conventional content on a conventional site. If you aren’t a big brand, you’d better be different, be better, be bold.
Feature #3: Fresh
Can content survive in high-spam, high-value keyword niches? You bet it can. I was shocked when I came upon this one, as it was just a well-managed blog post that was now several years old. It was surrounded by the latest entrants into a niche that was notoriously being shut down and cleaned out: free streaming movies.
So how does a simple blog post on the best free movie sites manage to bring in $1,000,000+ in traffic not just this year, or last year, or the year before but for years and years on end?
Well, one thing we noticed about it and many others was content freshness. I can’t tell you how many times a client has been scared to update their content that already ranks. “But what if I break it? What if I lose rankings?”
Not updating your content IS breaking it.
The truth is that if you aren’t updating your content regularly, Google has to assume that your content is losing its reliability. So why not update it? Over time, you will build up a great backlink profile through sheer longevity, while at the same time keeping your content as fresh as that of new competitors entering the space.
The author here found a great opportunity. People wanted to find these sites, they kept disappearing, and someone needed to keep an up-to-date record of the best ones. Now, the webmaster didn’t create it once and leave it, or update it annually. They updated it regularly. The net result?
This piece of content has enjoyed long-term, million-dollar rankings while competitors have come and gone. They have ranked for thousands of keywords for several years by simply creating great content and keeping it fresh.
Linkable million-dollar pages
I am now going to turn this study over to Garrett French. Garrett is the founder and chief link strategist of Citation Labs, a link-building agency and campaign incubator. He’s developed multiple link-building tools, including the Link Prospector and the Broken Link Finder. He also co-wrote The Ultimate Guide to Link Building with Link Moses himself, Eric Ward. Garrett and his team lead monthly webinars on enterprise content strategy and promotion from the Citation Labs Blog.
Only 34% of the content studied has at least one link in Open Site Explorer (OSE). That’s right – there are tons of pages getting $1,000,000+ worth of organic search traffic yearly that have few, if any, external links. A lack of links does not necessarily demonstrate a lack of linkability, but I will say that, overall, these pages don’t seem “designed” for linkability.
Before we get to individual examples of linkability though (they do exist in this set!) I’d like to outline some basics on how we evaluated these pages.
- At Citation Labs, we divide linkers into “curators” who collect URLs for a single existing resource page and “editors” who publish new topic pages. Tactically speaking, the curators support broken link building and “link request” efforts, while editors support PR and guest posting campaigns.
- We believe that it’s primarily the linkers themselves who define a document’s linkability – both by their decision to link (or not) and by how many potential linkers there happen to be.
Linkable Document – Timberline Knolls
Drug addiction, a subcategory of mental health, is one of the single most linkable topics we’ve encountered in our work thus far. This URL provides clear and comprehensive information for the concerned loved ones of a potential heroin user. These concerned loved ones are a “linker-valued audience.”
To get a quick read on how many curators might be out there for this topic, search for the query heroin inurl:links.html. We use the inurl:links.html portion of the query to get a sense of volume. There’s a ton out there for this document, which makes it not only linkable but worthy of further promotion on its own.
Curators are – relatively speaking – quite rare. Their existence seems to be topically driven; they are especially prevalent across the health and education spaces.
Linkable Document – Wixon Jewelers
I would examine the potential for a broken link building campaign in the “birthstones” area for this URL. In addition, it appears (based on this query: birthstones inurl:links.html) that there are enough potential opportunities to support a request campaign as well.
Birthstones probably won’t get curators linking quite like addiction will. That said, they remain embedded in our collective psyche and if a related URL happens to be dead this could be a great candidate for a linkable page.
Linkable Document – SMU Mustangs
I’m not a sports person, but this URL stood out in our analysis because it had 60+ root linking domains. This seems to be a hub for SMU’s football team, complete with a calendar. Bloggers, sports journalists, opponents, local events websites – all of these folks should be interested in linking to and supporting this team. Businesses could consider starting a competitive football team to replicate this effort 😉
But seriously, one takeaway, especially for local businesses, is to support beloved local sports teams and events.
Linkable Document – The Best Schools
At first pass, my strategy would be to promote via PR, ideally in conjunction with the ranked schools to help them get the most out of their top ranking. Secondly, I’d run a low-scale branded guest posting effort. Guest posting topics could cover “following dreams,” “seizing the day,” “increasing your income,” “going back to school as a parent”, etc. If you repackage the data for a linker-valued audience (Best Online Colleges for Seniors) you could potentially build out a link request campaign too.
Linkable Document – Top 10 Home Remedies
The title – “How to Get Rid of Pimples Fast” – makes this one a tough pitch to skin health curators. That said, I think it could be a fantastic citation opportunity in a guest posting campaign. Target blogs that are more lifestyle-oriented – makeup blogs, perhaps, or dating advice blogs – and build out titles that are not necessarily directly related to pimples or blemishes themselves.
Here are a couple more in that same vein – they could work well as supporting citations in a guest posting effort:
Most editors would not think twice about allowing those links to live so long as they fit topically and have potential appeal to the reading audience.
The majority of these million dollar pages are not purely linkable, but many could support link building campaigns. Pay close attention to the link profile of the entire domain for link building campaign guidance – the ranking pages may not be there based on their individual link earnings.
Shareable million-dollar pages
So how do these million-dollar content pieces actually perform in the very different context of social media? We’ll let the venerable Mark Traphagen, Senior Director of Online Marketing at Stone Temple Consulting, give us some insights on how this high-performing content makes out in the world of social media. Mark is a world traveler, speaker, and consultant, and is a Klout Top 10 Expert for SEO & Content Marketing, meaning he actually does know how to make this social stuff work.
Just as Garrett revealed above that million dollar content does not necessarily have to have a lot of external links (or even any at all), so I found that there is little-to-no correlation between the number of social shares and whether or not content will win Russ’s million dollar prize.
45% of our sample group had no social shares at all (according to Buzzsumo) and 66% had fewer than 300 shares.
Of course, just like having a lot of good links “sure can’t hurt,” having a lot of social shares certainly increases the chances that your content will do well organically. In fact, the page with the highest number of social shares in the sample group (it had over 1 million) also has the lowest domain authority of the group (21). Moreover, 60% of the pages with 1000 or more social shares have a DA of 40 or less.
Now I’m not suggesting that this proves that the million dollar status of those pages was driven directly by their social popularity. In fact, I consider it unlikely that social popularity is a direct ranking factor at the present time. However, it is likely that wide exposure via social media increases the chances of activity that very likely does factor into Google’s ranking algorithm.
Before I take a deeper look at the most-shared content, I have to share two interesting tidbits from my examination of the pages Russ sampled for this study:
- Facebook is as killer for this type of content as most people think it is. For those pages with at least 100 social shares, a whopping 92% had the vast majority of those shares occur via Facebook. For most of those, almost all the social sharing happened on Facebook.
- None of the pages that had zero social shares had visible social sharing buttons. To be fair, several of them were simply landing pages linking to other content, and thus not really shareable. But most of the rest have characteristics that typically make content more likely to be shared, yet they provide no easy opportunity for visitors to take that action.
The shareability winners
Let’s examine the factors that most likely made the three most-shared pages in our sample set so shareable.
80 Nail Designs for Short Nails – 1 million shares
This stayglam.com page is almost embarrassingly easy to analyze, as Buzzsumo shows that all but about 800 of its 1 million+ shares came from Pinterest.
If there ever were a textbook example of “made for Pinterest,” it’s this page. The entirety of the content is 80 dazzling images of colorful and exotic nail designs, such as the following:
The images are fashion-centered, brightly-colored, and oriented toward a female audience, the perfect trifecta of Pinterest shareability.
Here’s the kicker: those 1 million Pinterest shares happened in spite of the fact that the stayglam.com page has no social share buttons! This serves as clear proof that if your content is amazingly shareable, and in particular well-adapted for a particular social network, visitors will share it even if it isn’t easy to do so.
It’s probable, though, that the vast majority of those 1 million shares weren’t made directly from the content page. The most likely scenario is that a few influential Pinterest users did the initial sharing, and then thousands upon thousands of other Pinterest users repinned those shares.
How to Get Rid of Pimples Fast – 73,300 shares
People love to share “how to” content that they think will be helpful to their social connections. Why? Social psychology tells us that the feeling of being helpful to others conveys as much benefit to the giver as to the receivers, and often more.
A HubSpot study found that content with the word “how” in the title is among the most shared on Twitter.
Furthermore, this content piece speaks directly to a very common (and embarrassing) problem with quick, easy fixes, exactly what people in such a situation seek. The page also has several easy-to-understand infographics, which undoubtedly make it even more appealing to share. The Open Graph image tag is properly set so that the most appealing of those images appears in shares on networks like Facebook and Google+.
Finally, this piece of content, like the previous, exemplifies that highly-shareable content will be shared, even if the site itself does not make sharing easy. In this case, the page does have share buttons for Twitter and Facebook, but they are at the bottom of the page, and below ads and other navigation. Nevertheless, once the content found its way to Facebook (where almost all of its shares occurred), it took off.
Positive & Inspirational Life Quotes – 15,800 shares
Frankly, this page has very little going for it other than the one thing that probably earned it 6.3K shares on Facebook and another 1,000 on Twitter: it is well-optimized for a very popular sharing category on both of those networks, quotations.
According to a New York Times commissioned study, people share content to satisfy any of four psychological needs. Those needs are:
- Entertainment
- Self-definition
- Relationship building
- Self-fulfillment
Inspirational quotes fulfill at least #1, #2, and #4 of the above, and probably help contribute to #3. They are entertaining in that they fit the kind of light, easily-digested, feel-good moments that many people turn to Facebook and Twitter for. Quotations also help us define ourselves to our tribe; they are a quick “tag” for aspirations that are likely shared by others in our social circles. Finally, quotes provide self-fulfillment, as sharing them makes us feel like we have contributed something positive to the world (and with very little effort!).
Out of our sample group, this was the only content that had a volume of Twitter shares worth mentioning. Most likely that was because a number of the quotations used a “click to tweet” feature, where a Twitter user can, with one click, share the quote to her Twitter stream. Even though the previous two examples show that highly-sharable content can get shared even without the site providing an easy way to do so, making that content one-click sharable can boost the share volume even higher.
- Social shares are not necessary for achieving million-dollar content status in search. However, in some cases, having them may improve your content’s chances in that regard.
- Content that meets the criteria of being highly shareable sometimes needs little or no boost from the publishing site itself, as long as enough visitors take the initiative to share it themselves. A recent Buzzsumo study published here on the Moz Blog found that “surprising, unexpected and entertaining images, quizzes and videos have the potential to go viral with high shares.” However, the study showed that those content types typically earn few links, even if they are highly shared. This confirms Garrett’s findings above.
- Making content easy to share (by providing easy-to-find share buttons, for example), while not necessary, can boost the overall number of shares and/or get the content shared to other networks where an influencer hasn’t already done the work.
- Despite all the negative press about how much Facebook has reduced the ability for brand content to get organic reach, it remains by far the most “viral-ready” social network. If your content can get a good toehold there by being shared by some influencers, Facebook can still provide organic reach magic. Of course, paid boosting of content can vastly accelerate the chances of that happening, and this study did not examine whether any of the content was supported with paid social advertising.
So what are the takeaways? What makes something million-dollar content? I think there are a few standouts…
- Go big and bold. You have to stand out from the crowd, and if you can’t do that with your domain authority, you have to do it with your content.
- Stay relevant, both in freshness and thoroughness. Know what your user wants and deliver it.
- Some sites just get lucky, but other sites make their luck. There were certainly a number of pages that still seemed to rank inexplicably, with average content, few social shares, and even fewer links. Don’t bank on that. Do the leg work and you too can create million dollar content.
Posted by Jeremy_Gottlieb
Have you ever wanted to automate pulling data from a web page—say, to build a Twitter audience—and wished you could magically make all of the Twitter handles on the page appear in your Google Sheet, but didn’t know how? If learning Python isn’t your cup of tea, a few formulas in Google Sheets will let you easily and quickly scrape data from a URL that, were you to do so manually, could take hours.
For Windows users, Niels Bosma’s amazing SEO plug-in for Excel is an option that could also be used for this purpose, but if you analyze data on a Mac, this tutorial on formulas in Google Sheets will help make your life much easier, as the plug-in doesn’t work on Macs.
Within Google Sheets, there are three formulas that I like to use in order to save myself huge amounts of time and headspace. These are:
- IMPORTXML
- QUERY
- REGEXEXTRACT
With just these three formulas, you should be able to scrape and clean the data you need for whatever purpose you may come across—whether that be curating Twitter audiences, analyzing links, or anything else you can think of. The beauty of these formulas is their versatility, so the use cases for them are practically infinite. Once you understand the concept behind each one, you can substitute the variables for your individual use case. However, the essential process for scraping, cleaning, and presenting data will remain the same.
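Before diving in, here are the bare skeletons of the three formulas as Sheets expects them (the placeholder argument names here are mine, not official):

=IMPORTXML(url, xpath_query)
=QUERY(data, query_string)
=REGEXEXTRACT(text, regular_expression)

Each one is covered in detail, with real arguments, in the walkthrough below.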
It should be noted that scraping has limitations, and some sites (like Google) don’t really want anyone scraping their content. The purpose of this post is purely to help you smart Moz readers pull and sort data even faster and more easily than you would’ve thought possible.
Let’s find some funny people on Twitter we should follow (or target. Does it really matter?). Googling around the subject of funny people on Twitter, I find myself landing on the following page:
Bingo. Straight copying and pasting into a Google Doc would be a disaster; there’s simply way too much other content on the page. This is where IMPORTXML comes in.
The first step is to open up a Google Sheet and input the desired URL into a cell. It could be any cell, but in the example below, I placed the URL into cell A1.
Just before we begin with the scraping, we need to figure out exactly what data we plan on scraping. In this case, it happens to be Twitter handles, so this is how we’re going to do it.
First, right-click on our target (the Twitter handle) and click “Inspect Element.”
Once in “Inspect Element,” we want to figure out where on the page our target lives.
Because we want the Twitter handle and not the URL, we’re going to focus on the “target” attribute rather than “href” within the <a></a> tags. We also happen to notice that the <a></a> tags are “children” of the <h3></h3> tags. What these values mean is a topic for another post; what we need to keep in mind is that, for this particular URL, this is where the information we want to extract lives. It will almost certainly live in a different area, with different attributes, on any other given URL; this is just the information that’s unique to the site we’re on.
Let’s get to the scary stuff (maybe?): how to write the formula.
I put the formula in cell A3, where I have the red arrow. As can be seen in the highlighted rectangle, I wrote =IMPORTXML(A1, "//h3//a[@target='_blank']"), which yielded a wonderful, organized list of all the top Twitter handles to follow from the page. Voila. Cool, right?
Something to remember when doing this is that the values have been created via a formula, so trying to copy and paste them regularly can get messy; you’ll need to copy and paste as values.
Now, let’s break down the madness.
Like any other function in Sheets, you’ll need to begin with an equal sign, so we start with =IMPORTXML. Next, we reference the cell with our targeted URL (in this case, cell A1) and then add a comma. Double quotation marks are required to begin the query, followed by two forward slashes (“//”). Next, you select the element you want to scrape (in this case, the h3 tag). We don’t want all of the information in the h3 elements, just the text of the <a></a> tags inside them—specifically, the <a></a> tags whose “target” attribute is set to “_blank”, since that’s where the Twitter handles live. To capture this, we add //a[@target='_blank']. Putting it all together, the formula =IMPORTXML(A1, "//h3//a[@target='_blank']") can be translated as: “From the URL in cell A1, select the text of every <a> tag that sits inside an <h3> tag and has a target attribute of ‘_blank’.”
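Two practical notes, offered as a sketch rather than gospel. First, if you paste the formula, make sure the quotes are plain straight quotes; curly “smart” quotes (which word processors and some blogs substitute automatically) will break the formula:

=IMPORTXML(A1, "//h3//a[@target='_blank']")

Second, a hypothetical variant: if you wanted the profile URLs instead of the handle text, XPath lets you pull the attribute value itself by appending /@href:

=IMPORTXML(A1, "//h3//a[@target='_blank']/@href")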
In this particular case, the Twitter handles were the only element that could be scraped based on our formula and how it was originally written within the HTML, but sometimes that’s not the case. What if we were looking for travel bloggers and came across a site like the one seen below, where our desired Twitter handles are within a text paragraph?
Taking a look at the Inspect Element button, we see the following information:
In the top rectangle is the div and the class we need, and in the second rectangle is the other half of the information we require: the <p> tag. The <p> tag is used in HTML to mark a paragraph. The Twitter handles we’re looking for are located within a text paragraph, so we’ll need to select the <p> tag as the element to scrape.
Once again, we input the URL into a cell (any empty cell works) and write out the new formula =IMPORTXML(A1, "//div[@class='span8 column_container']//p"). Instead of selecting all of the h3 elements like in the preceding example, this time we’re finding all of the <p> tags within the div elements that have a class of “span8 column_container”. The reason we restrict ourselves to that specifically-classed div is that there are other <p> tags on the page containing information we likely won’t need. All of the Twitter handles are contained within <p> tags inside that div, so by selecting it, we’ll have selected the most appropriate data.
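As a hypothetical aside: if the handles on a page like this were wrapped in their own links rather than sitting in plain paragraph text, you could drill one level deeper and grab just the anchors, something like:

=IMPORTXML(A1, "//div[@class='span8 column_container']//p//a")

Here, though, the handles are plain text inside the paragraphs, so we take the whole <p> contents and clean them up over the next two steps.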
However, the results of this are not perfect and look like this:
The results are less than ideal, but manageable nonetheless – we ultimately just want Twitter handles, but are provided with a whole bunch of other text. Highlighted in the green rectangle is a result closer to what I want, but not in the column I need (there’s also another one down the page out of the view of the screenshot, but most are where I need them). To make sure we get all the data in the appropriate format, we can copy and paste values for everything within columns A–C, which will remove the values populated by formulas and replace them with hard values that can be manipulated. Once that is done, we can cut and paste the outlying values (one in column B and one in column C) into their corresponding cells in column A.
All of our data is now in column A; however, some of the cells include information that doesn’t contain a Twitter handle. We’re going to fix this by running the =QUERY function and separating the cells that contain “@” from the ones that don’t. In a separate cell (I used cell C4), we’re going to input =QUERY(A4:A36, "select A where A contains '@'") and hit enter. BOOM. From here on, we’ll have only cells that contain Twitter handles, a huge improvement over a mixed bag of results both with and without them. To explain, our formula can be translated as “From within the array A4:A36, select the cells in column A that contain ‘@’.” It’s pretty self-explanatory, but it’s nonetheless a fantastic formula that is incredibly powerful. The image below shows what this looks like:
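For completeness, here are a couple of hypothetical variations on the same idea (the cell ranges are just the ones from this example):

=QUERY(A4:A36, "select A where A contains '@' order by A")
=QUERY(A4:A36, "select A where A matches '.*@[A-Za-z0-9_]+.*'")

The first sorts the surviving rows alphabetically; the second swaps the simple substring check for a regular-expression match, which can help when “@” appears in text you don’t actually want.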
Keep in mind that the results we just pulled are going to contain excess information within the cells that we’ll need to remove. To do this, we’ll need to run the =REGEXEXTRACT formula, which will pretty much eliminate any need you have for the =RIGHT, =LEFT, =MID, =FIND, and =LEN formulas, or any mixture of those. While useful, these functions can get a bit complicated and need to work in unison in order to produce the same results as =REGEXEXTRACT. A more detailed explanation of these formulas with visuals can be found here.
We’ll run the formula on the results produced from running the =QUERY formula. Using =REGEXEXTRACT, we’ll select the top cell in the queried column (in this case, C4) and then extract everything beginning with “@”, the start of what we’re looking for. Our desired formula will look like =REGEXEXTRACT(C4, "\@.*"). The backslash escapes the following character, and the .* means “match everything after.” Thus, the formula can be translated as “For cell C4, extract all of the content beginning at the ‘@’.”
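One refinement worth knowing about, offered as an assumption-laden sketch: the pattern above grabs everything to the end of the cell, so any trailing text after a handle comes along with it. Assuming the handles stick to Twitter’s standard character set (letters, digits, and underscores), a tighter pattern stops at the end of the handle itself:

=REGEXEXTRACT(C4, "@[A-Za-z0-9_]+")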
To get all of the other values, all we need to do is click and grab the bottom-right corner of cell E4 and drag it down to row 28, parallel to the end of our array at cell C28. Dragging down the corner of E4 will apply the formula within it to the cells included in the drag. We want to include up to E28 because the corresponding cell, C28, is the last cell in the array we’re applying the formula to. Doing this will provide the results shown below:
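If you’d rather skip the dragging entirely, Sheets’ ARRAYFORMULA can (in my experience) apply the extraction to the whole range at once, using the same assumed range as above:

=ARRAYFORMULA(REGEXEXTRACT(C4:C28, "\@.*"))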
Though a nice and clean output, the data in column E is created by formula and cannot be easily manipulated. We’ll need to copy and paste as values within this column to have everything we need and be able to manipulate the data.
If you’d like to play around with the Google Sheet and make your own copy, you can find the original here.
Hopefully this helps provide some direction and insight into how you can easily scrape and clean data from web pages. If you’re interested in learning more, here’s a list of great resources:
- Xpath Data Scraping Tutorial video (for PC users)
- The ImportXML Guide for Google Docs
- A Content Marketer’s Guide to Data Scraping
- How to Get the Most Out of Regex
Want more use cases, tips, and things to watch out for when scraping? I interviewed the following experts for their insights into the world of web scraping:
- Dave Sottimano, VP Strategy, Define Media Group, Inc.
- Chad Gingrich, Senior SEO Manager, Seer Interactive
- Dan Butler, Head of SEO, Builtvisible
- Tom Critchlow, tomcritchlow.com
- Ian Lurie, CEO and Founder, Portent, Inc.
- Mike King, Founder, iPullRank
Question 1: Describe a time when automated scraping “saved your life.”
“During the time when hreflang was first released, there were a lot of implementation & configuration issues. While isolated testing was very informative, it was the automated scraping of SERPs that helped me realize the impact of certain international configurations and make important decisions for clients.” – Dave Sottimano
“We wanted a way to visualize forum data to see what types of questions our clients’ audiences were talking about most frequently to be able to create a content strategy out of that data. We scraped Reddit and various forums, grabbing data like post titles, views, number of replies, and even the post content. We were able to aggregate all that data to put together a really interesting look at the most popular questions and visualize keywords within the post title and comments that might be a prime target for content. Another way we use scraping often at Seer is for keyword research. Being able to look at much larger seed keyword sets provides a huge advantage and time savings. Additionally, being able to easily pull search results to inform your keyword research is important and couldn’t be done without scraping.” – Chad Gingrich
“I’d say scraping saves my life on a regular basis, but one scenario that stands out in particular was when a client requested Schema.org mark-up for each of its 60 hotels in 6 different languages. Straightforward request, or so I thought—turns out they had very limited development resource to implement themselves, and an aged CMS that didn’t offer the capabilities of simply downloading a database so that mark up could be appended. Firing up ImportXML in Google Sheets, I could scrape anything (titles, source images, descriptions, addresses, geo-coordinates, etc.), and combined with a series of concatenates was able to compile the data so all that was needed was to upload the code to the corresponding page.” – Dan Butler
“I’ve lost count of the times when ad-hoc scraping has saved my bacon. There were low-stress times when fetching a bunch of pages and pulling their meta descriptions into Excel was useful, but my favorite example in recent times was with a client of mine who was in talks with Facebook to be included in F8. We were crunching data to get into the keynote speech and needed to analyze some social media data for URLs at reasonable scale (a few thousand URLs). It’s the kind of data that existed somewhere in the client’s system as an SQL query, but we didn’t have time to get the dev team to get us the data. It was very liberating to spend 20 minutes fetching and analyzing the data ourselves to get a fast turnaround for Facebook.” – Tom Critchlow
“We discovered a client simultaneously pointed all of their home page links at a staging subdomain, and that they’d added a meta robots noindex/nofollow to their home page about one hour after they did it. We saw the crawl result and thought, “Huh, that can’t be right.” We assumed our crawler was broken. Nope. That’s about the best timing we could’ve hoped for. But it saved the client from a major gaffe that could’ve cost them tens of thousands of dollars. Another time we had to do a massive content migration from a client that had a static site. The client was actually starting to cut and paste thousands of pages. We scraped them all into a database, parsed them and automated the whole process.“ – Ian Lurie
“Generally, I hate any task where I have to copy and paste, because any time you’re doing that, a computer could be doing it for you. The moment that stands out the most to me is when I first started at Razorfish and they gave me the task of segmenting 3 million links from a Majestic export. I wrote a PHP script that collected 30 data points per link. This was before any of the tools like CognitiveSEO or even LinkDetective existed. Pretty safe to say that saved me from wanting to throw my computer off the top of the building.“ – Mike King
Question 2: What are your preferred tools/methods for doing it?
“Depends on the scale and the type of job. For quick stuff, it’s usually Google docs (ImportXML, or I’ll write a custom function), and on scale I really like Scraping Hub. As SEO tasks move closer towards data analysis (science), I think I’ll be much more likely to rely on web import modules provided by big data analytics platforms such as RapidMiner or Knime for any scraping.” – Dave Sottimano
“Starting out, Outwit is a great tool. It’s essentially a browser that lets you build scrapers easily by using the source code. …I’ve started using Ruby to have more control and scalability. I chose Ruby because of the front end/backend components, but Python is also a great choice and is definitely a standard for scraping (Google uses it). I think it’s inevitable that you learn to code when you’re interested in scraping because you’re almost always going to need something you can’t readily get from simple tools. Other tools I like are the scraper Chrome plugin for quick one page scrapes, Scrapebox, RegExr, & Text2re for building and testing regex. And of course, SEO Tools for Excel.” – Chad Gingrich
“I love tools like Screaming Frog and URL Profiler, but find that having the power of a simple spreadsheet behind the approach offers a little more flexibility by saving time being able to manage the output, perform a series of concatenated lookups, and turn it into a dynamic report for ongoing maintenance. Google Sheets also has the ability for you to create custom scripts, so you can connect to multiple APIs or even scrape & convert JSON output. Hey, it’s free as well!” – Dan Butler
“Google Docs is by far the most versatile, powerful and fast method for doing this, in my personal experience. I started with ImportXML and cut my teeth using that before graduating to Google Scripts and more powerful, robust, and cron-driven uses. Occasionally, I’ve used Python to build my own scrapers, but this has so far never really proven to be an effective use of my time—though it has been fun.” – Tom Critchlow
“We have our own toolset in-house. It’s built on Python and Cython, and has a very powerful regex engine, so we can extract pretty much anything we want. We also write custom tools when we need them to do something really unique, like analyze image types/compression. For really, really big sites—millions of pages—we may use DeepCrawl. But our in-house toolset does the trick 99% of the time and gives us a lot of flexibility.” – Ian Lurie
“While I know there are a number of WYSIWYG tools for it at this point, I still prefer writing a script. That way I get exactly what I want, and it’s in the precise format that I’m looking for.” – Mike King
Question 3: What are common pitfalls with web scraping to watch out for?
“Bad data. This ranges from hidden characters and encoding issues to bad HTML, and sometimes you’re just being fed crap by some clever system admin. As a general rule, I’d far rather pay for an API than scrape.” – Dave Sottimano
“Just because you can scrape something doesn’t mean you should, and sometimes too much data just confuses the end goal. I like to outline what I’m going to scrape and why I need it/what I’ll do with that data before scraping one piece of data. Use brain power up front, let the scraping automate the rest for you, and you’ll come out the other side in a much better place.” – Chad Gingrich
“If you’re setting up dynamic reports or building your own tools, make sure you have something like Change Detection running so you can be alerted when X% of the target HTML has changed, which could invalidate your Xpath. On the flipside, it’s crazy how commonly private API credentials/authentication are passed via public HTTP GET requests or over XHR—seriously, sites need to start locking this stuff down if they don’t want it accessible in the public domain.” – Dan Butler
“The most common pitfall with computers is that they only do what you tell them—this sounds obvious, but it’s a good reminder that when you get frustrated, you usually only have yourself to blame. Oh—and don’t forget to check your recurring tasks every once in a while.” – Tom Critchlow
“It’s important to slow your crawls down. I’m not even talking about Google scraping. I’m talking about crawling other folks’ web sites. I’m continuously amazed at just how poorly optimized most site technology stacks really are. If you start hitting one page a second, you may actually slow or crash a site for a multi-million-dollar business. We once killed a client’s site with a one-page-per-second crawl—they were a Fortune 1000 company. It’s ridiculous, but it happens more often than you might think. Also, if you don’t design your crawler to detect and avoid spider traps, you could end up crawling 250,000 pages of utter duplicate crap. That’s a waste of server resources. Once you find an infinitely-expanding URL or other problem, have your crawler move on.” – Ian Lurie
Do you have any interesting use-cases or experiences with data scraping? Sound off in the comments!