I spend a lot of my day doing technical site audits for clients. It’s often pretty tedious work, but I’ve run into a number of little problems (things that often aren’t even visible to the user) that had giant repercussions for search. For a few of them the fix was BIG…the development team explained how everything depended on X, and changing X would take forever and break everything and the site would be ugly and users would cry…etc. Then we’d have the little conversation that goes “Well, then, how ARE you going to get your customers…since you won’t be getting any from Google!”
Without further ado, here are some of the biggies I’ve run into.
Photo courtesy Robert Huffstutter on Flickr.
Staging servers getting indexed
It’s pretty common to have a staging environment where you can put the latest version of your website for testing and review before going “live” with it. And if your team is in multiple locations, then the easy thing to do is just put it out there on the internet…maybe on a subdomain, like staging.mywebsite.com. The problem comes when somehow, somewhere, Google discovers the site (perhaps you sent a link to it using your Gmail account?). And indexes it.
Now what happens when you move the new version of the site to the live site? What does Google see? Clearly you’re a scraper: Google saw all that content weeks ago (and still sees it). Your live site looks like just a copy of the staging site, which appears to Google to be the original (it saw all the content there first, after all).
But it’s not super obvious what’s happened. Because the staging site has virtually no links to it, it doesn’t rank, so you never see it in the results. But your live site, with all its links, is seen by Google as a festering pile of duplicate content.
The solution? Block all user agents in the robots.txt file on the staging server: a User-agent: * line followed by Disallow: /.
Oh, and when you go to move the pages from the staging server to the live site? You’re not going to want to move that robots.txt file too 🙂 Think about it….
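If you want a safety net for that last point, here is a minimal Python sketch of a pre-deploy check, using the standard library's robots.txt parser. The function name and the two sample files are my own illustration, not anything from a real deployment tool, and the parser only approximates how Googlebot reads the file.

```python
from urllib.robotparser import RobotFileParser


def blocks_everything(robots_txt: str, user_agent: str = "Googlebot") -> bool:
    """Return True if this robots.txt keeps the given crawler out of the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # With prefix matching, a rule that blocks "/" blocks every path on the site.
    return not parser.can_fetch(user_agent, "/")


# Illustrative file contents (assumptions, not taken from a real site).
STAGING_ROBOTS = "User-agent: *\nDisallow: /\n"
LIVE_ROBOTS = "User-agent: *\nDisallow:\n"

if __name__ == "__main__":
    print("staging blocked:", blocks_everything(STAGING_ROBOTS))
    print("live blocked:", blocks_everything(LIVE_ROBOTS))
```

Run it against the robots.txt you are about to push live. If it reports the site as blocked, the staging file came along for the ride.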
Ajax and Content
The Site of Many Flavors
So your site responds to requests for mysite.com as well as www.mysite.com? And, you jumped on the bandwagon and made it work under https when Google announced that https pages would get a ranking boost (yeah, right :-/)? Fabulous. But did you do your redirects? If you DON’T 301 redirect from your non-www to your www version (or the other way around is ok too), and you DON’T 301 redirect requests for http to https, then Google will see 4 complete, separate websites…all with the same content.
Just updating your menus to link to everything with www and https isn’t enough. Google’s still got a memory of those non-www and non-https pages (probably from other sites that linked to you a while ago).
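To make the redirect rule concrete, here is a rough Python sketch of the decision your server needs to make on every request. It assumes https://www.mysite.com is the version you picked as canonical; the constants and the function name are hypothetical, and in practice this logic usually lives in your web server config rather than in application code.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Assumed canonical flavor of the site (pick yours and stick with it).
CANONICAL_SCHEME = "https"
CANONICAL_HOST = "www.mysite.com"
BARE_HOST = "mysite.com"


def redirect_target(url: str) -> Optional[str]:
    """Return the URL to 301 to, or None if the request is already canonical."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in (BARE_HOST, CANONICAL_HOST):
        return None  # not one of our hostnames, nothing to do
    if parts.scheme == CANONICAL_SCHEME and host == CANONICAL_HOST:
        return None  # already the one true version
    # Keep path and query so deep links land on the matching page.
    return urlunsplit(
        (CANONICAL_SCHEME, CANONICAL_HOST, parts.path or "/", parts.query, "")
    )


if __name__ == "__main__":
    for u in ("http://mysite.com/widgets?color=purple",
              "http://www.mysite.com/",
              "https://mysite.com",
              "https://www.mysite.com/widgets"):
        print(u, "->", redirect_target(u))
```

All three non-canonical flavors collapse onto the fourth, which is the whole point: one site in Google's eyes instead of four.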
Side note here: when you DO move to https, make sure you create a new project in Google Webmaster Tools for the https site. You’ll find that only some of the stuff will still show up under your old http project there.
Robots.txt blocking style sheets
With the avalanche of hackers out there ripping into WordPress sites, people are doing all sorts of things in a desperate attempt to keep the wolves at bay. And so they block wp-content, wp-includes, and wp-admin in their robots.txt file.
But, first of all, only spiders respect robots.txt…hackers giggle at your lame attempt to block them, and go right on in.
The problem that is caused by blocking these is that you may have style sheets in those folders that are needed to render images, menus, etc. When Google Panda goes to take a peek at your page and see all your lovely content–especially that content above-the-fold–if the stylesheet is blocked by robots.txt, there might be nothing for Panda to see. You can see how Google sees your page by doing a Fetch & Render in Google Webmaster Tools. I’ve had clients whose sites have been totally image-free because of a blocked style sheet; multiple clients have had what should have been a horizontal menu with pulldowns turn into a vertical 3-page list of black menu items on a white background. Oops.
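A Fetch & Render is the real test, but you can get a quick first read with a few lines of Python. This sketch takes a robots.txt and a list of asset URLs and reports which ones a crawler would be locked out of. The theme name and asset paths are made up for illustration, and the standard library parser is only an approximation of Googlebot's own matching.

```python
from urllib.robotparser import RobotFileParser


def blocked_assets(robots_txt, asset_urls, user_agent="Googlebot"):
    """Return the asset URLs this robots.txt hides from the given crawler."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in asset_urls if not parser.can_fetch(user_agent, url)]


# The kind of "security" robots.txt described above.
ROBOTS = """User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/
"""

# Hypothetical URLs: a theme stylesheet, a script, and a normal page.
ASSETS = [
    "https://www.mysite.com/wp-content/themes/mytheme/style.css",
    "https://www.mysite.com/wp-includes/js/jquery/jquery.js",
    "https://www.mysite.com/about/",
]

if __name__ == "__main__":
    for url in blocked_assets(ROBOTS, ASSETS):
        print("blocked:", url)
```

Anything that shows up in that list and is needed to draw the page is a candidate for the naked-menu problem.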
Blocking in robots.txt instead of doing a noindex,follow
There are really very few reasons to EVER block anything in robots.txt. One good exception is the staging site example from above. But besides that, it’s NOT the best way to shape what Google indexes on your site, and here’s why.
When you block a set of pages in robots.txt, you’re telling Googlebot STAY OUT. The pages won’t be crawled, and the links on them to other pages on your site won’t be counted.
What you ACTUALLY want to do is to set meta robots directives in the pages themselves like this:
<meta name="robots" content="noindex,follow" />
This tells Google to go ahead and crawl the page, and count the link juice outbound from that page to other pages, but don’t bother indexing the page.
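If you want to spot-check which directives a page actually carries, here is a small Python sketch using the standard library HTML parser. The class and function names are mine, and it only looks at the meta tag. It does not look at an X-Robots-Tag HTTP header, which can carry the same directives.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # "noindex, follow" and "NOINDEX,FOLLOW" should read the same.
        self.directives.update(
            d.strip().lower() for d in content.split(",") if d.strip()
        )


def robots_directives(html: str) -> set:
    """Return the set of meta robots directives found in the HTML."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex,follow" /></head></html>'
    print(sorted(robots_directives(page)))
```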
Let’s say you have a “share this page” link on all pages of your 10,000 page site. And that sharing page of course has really nothing on it, so you don’t want it indexed. But, that sharing page has the main navigation on it, like any other page, with links to your 300 most important pages.
Blocking /share-page.html in robots.txt means all the link juice of the 10,000 different share-page.html pages (because you’re probably passing the page to be shared as a parameter, e.g. share-page.html?page=purple-widgets.html) that WOULD have gone to your 300 most important pages is flushed down the toilet. If, instead, you did a noindex,follow on those pages, then you’d have 10,000 more little bits of link juice flowing to those 300 most important pages.
Joomla’s Big Bad Default Setting
Joomla, by default, disallows the /images/ folder. So Google sees no images on ANY of your pages. Pretty dry, boring site you’ve got there, dude.
.htaccess is NOT your firewall
Back to them evil hackers. Yes, they’re out there, and yes, there’s a TON of them. The SEMpdx blog has probably had a few hundred hacking attempts in just the time I’ve spent writing this blog post.
Did you know you can block IP addresses in .htaccess? And did you know that lists of IP addresses for China, Russia, Nigeria, etc. are out there?
Don’t do it. You’re using a hammer to drive in a screw.
To hell with URL standards
Don’t stifle my creative side, man. I’ve got a new smooth way to use characters in URLs. Y’all are gonna love it.
I had a client who was using # characters instead of ? and & for parameter separators. They couldn’t figure out why Google only indexed their home page, when they had hundreds of thousands of pages of content.
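The # character is supposed to be used to indicate an in-page anchor. Everything AFTER it is never sent to the server; it’s a location within the page. So as far as Google was concerned, every one of those hundreds of thousands of URLs was the same page: the home page.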
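You can see the effect with Python's URL functions. This is just an illustration: the two sample URLs are invented (I'm not reproducing the client's actual format), and dropping the fragment is a stand-in for what a crawler or a server ends up seeing.

```python
from urllib.parse import urldefrag


def url_as_requested(url: str) -> str:
    """Return the URL with the fragment dropped, i.e. what reaches the server."""
    return urldefrag(url).url


# Hypothetical examples of the two separator styles.
HASH_STYLE = "http://www.mysite.com/products#category=widgets#color=purple"
QUERY_STYLE = "http://www.mysite.com/products?category=widgets&color=purple"

if __name__ == "__main__":
    print(url_as_requested(HASH_STYLE))
    print(url_as_requested(QUERY_STYLE))
```

The hash-style URL loses all of its "parameters", while the query-style URL comes through intact.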
A/B Testing Gone Wild
My client was using regular parameters in the URL for A/B testing mods to their home page, e.g.:
That’s not NECESSARILY a bad idea. You can use rel=canonical to point both of those to the base page, http://www.mysite.com/, and you should be OK. But if you neglect to do that, all of a sudden you have 3 different home pages, in Google’s eyes anyway.
Here’s a case where using the # for something other than an in-page anchor wouldn’t have been such a bad idea, since Google would have treated every variant as the one base URL.
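For the rel=canonical route, here is one way to sketch it in Python: strip the test parameter and emit the link tag for the page head. The parameter name "variant" is purely an assumption for illustration (use whatever your testing setup actually adds), and the function names are mine.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical name of the A/B testing parameter; substitute your own.
TEST_PARAMS = {"variant"}


def canonical_url(url: str) -> str:
    """Return the URL with A/B testing parameters and any fragment removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TEST_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path or "/", urlencode(kept), "")
    )


def canonical_link_tag(url: str) -> str:
    """Build the <link rel="canonical"> tag to place in the page head."""
    return '<link rel="canonical" href="%s" />' % escape(canonical_url(url), quote=True)


if __name__ == "__main__":
    print(canonical_link_tag("http://www.mysite.com/?variant=a"))
    print(canonical_link_tag("http://www.mysite.com/?variant=b"))
```

Both test variants end up pointing at the same base page, so Google folds them back into one home page.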
404 to the home page
If you thought an easy way to handle not-found pages (and capture otherwise lost link juice) would be to set up your 404 handler to 301 redirect to the home page, you’d be right. It IS an easy way. But like a lot of easy things, it’ll bite you.
Google wants to see an HTTP 404 error code returned when a non-existent page is fetched. My theory is that it’s because some spammy people at one time figured they could make Google think they had a million-page site by creating links to URLs to a million pages, then fabricate content on-the-fly in their 404 handler by taking the words out of the URLs and injecting them into a template of other words. Then, if that template had a link to somewheres else on it, well then, some little page might be gettin’ a heap of link juice, might’n it.
Doesn’t matter if I’m right about this, or if I did it myself at one point. I mean, if I had THIS FRIEND who did that at one point. What matters is that Google will check your site for this every few weeks. Look in your webserver logs long enough, and you’ll see Googlebot trying to fetch a URL that’s really long, a big jumble of letters and numbers.
Not only does Google check in this fashion, but if Google finds there are pages on your site that come back nearly empty and SEEM to be page-not-found pages, Google will mark those as “soft 404 errors” in Webmaster Tools. If you want to see exactly what HTTP responses are being returned by your server, I’m a big fan of the httpfox plugin for Firefox. It will show you not only the final HTTP code, but each hop along the way, if there are multiple redirects.
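Here is what the right behavior looks like in a toy Python WSGI app: known pages get a 200, and anything else gets an honest 404 status with a friendly body, not a 301 to the home page. The page table and the helper that calls the app in-process are illustrative assumptions, not a real site's routing.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical page table standing in for a real site's routing.
PAGES = {
    "/": b"<h1>Home</h1>",
    "/widgets": b"<h1>Widgets</h1>",
}


def app(environ, start_response):
    """Serve known pages with 200; answer everything else with a real 404."""
    body = PAGES.get(environ.get("PATH_INFO", "/"))
    if body is None:
        status = "404 Not Found"
        body = b"<h1>Page not found</h1>"
    else:
        status = "200 OK"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]


def status_for(path):
    """Call the app in-process and return the status line it sends."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    seen = []

    def start_response(status, headers, exc_info=None):
        seen.append(status)

    b"".join(app(environ, start_response))
    return seen[0]


if __name__ == "__main__":
    print(status_for("/"))
    # The kind of long jumbled URL Googlebot probes with.
    print(status_for("/zxq81kd0vnw93jf7a2"))
```

The jumbled URL should come back as a 404. If your own site answers a probe like that with a 200 or a redirect to the home page, that is the problem described above.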
There’s a million ways to shoot yourself in the foot when it comes to search optimization. With a little luck, some of you have some juicy horror stories to share in the comments!