Have you checked to see how many pages from your blog are in Google’s index? I think it is reasonable to assume that every page you have published is indeed indexed by Google and that each of those pages should be found when you search for a unique phrase from that page.
But the chances of that being the case are actually quite slim.
And while this might seem like a massive problem, when you dig deeper, it might not actually be that much of an issue.
I am doing a lot of SEO work this month. To track the progress and success of this work I first had to make a note of the current stats for the areas I want to improve. This includes, amongst other things, my Alexa ranking, daily traffic, inbound links and some data from my Google Webmaster Tools account.
One of the factors I recorded from my Google Webmaster Tools account is the number of pages from this site that are indexed and cached by Google. Three days ago the figure was 133. Yesterday it was down to 129 and today the figure has dropped by another 1.
The total is now 128.
To find out how many of your pages have been indexed, log in to your Google Webmaster Tools account, click on Site Configuration > Sitemaps and read the information on that page.
This is assuming you have a GWT account and you have installed a sitemap. If not, register for your account here and get the XML Sitemap Generator plugin here.
Establishing which pages are in Google’s index
Manually establishing which pages are indexed is not a simple task. I had a look around for a free tool that might do this, but I couldn’t find anything. I eventually bought Andy Black’s Index Checker, which, for $7.00, is an absolute steal. It runs through all the pages on your server (which can be imported from an XML sitemap) and tells you if they have been cached by Google.
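If you want to do part of this by hand, a rough sketch of the first step (pulling the page list out of an XML sitemap and comparing it with a list of pages you know to be indexed) might look like the following. This is not how Index Checker works internally, which I have no knowledge of. The sample sitemap and the hand-entered "indexed" list are made up for illustration, and the function names are my own.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the <loc> URLs listed in a sitemap document, in order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def missing_from_index(sitemap_urls, indexed_urls):
    """Return the sitemap URLs that are absent from the indexed list."""
    indexed = set(indexed_urls)
    return [url for url in sitemap_urls if url not in indexed]


# Illustrative sitemap only; a real one would list every page on the site.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://mydigitalinternet.com/what-is-a-blog/</loc></url>
  <url><loc>http://mydigitalinternet.com/privacy-policy/</loc></url>
  <url><loc>http://mydigitalinternet.com/</loc></url>
</urlset>"""

# Hypothetical list of pages already confirmed as indexed (entered by hand).
KNOWN_INDEXED = ["http://mydigitalinternet.com/"]

if __name__ == "__main__":
    for url in missing_from_index(urls_from_sitemap(SAMPLE_SITEMAP), KNOWN_INDEXED):
        print("Not indexed:", url)
```

The comparison itself is just a set lookup. The hard part, finding out whether Google really has a page, still has to come from a tool or from manual searches.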
I was surprised, and disappointed, to find out that 22 pages have not been indexed. And some of those were pages I hoped would earn me a commission through an affiliate sale.
This is the full list created by Index Checker
- http://mydigitalinternet.com/seopressor-plugin-review-for-wordpress/
- http://mydigitalinternet.com/offline-reading-for-bloggers-books-on-blogging/
- http://mydigitalinternet.com/membership-site-mastermind-%e2%80%93-are-membership-sites-the-future/
- http://mydigitalinternet.com/what-is-shared-web-hosting/
- http://mydigitalinternet.com/search-results/
- http://mydigitalinternet.com/will-sitewide-reciprocal-and-directory-links-hurt-your-blog/
- http://mydigitalinternet.com/why-i-started-to-blog-and-why-i-can%e2%80%99t-stop/
- http://mydigitalinternet.com/what-is-a-blog/
- http://mydigitalinternet.com/could-you-be-missing-out-on-an-inbound-link/
- http://mydigitalinternet.com/website-design-ideas-looking-for-inspiration/
- http://mydigitalinternet.com/google-introduces-buzz/
- http://mydigitalinternet.com/would-you-like-to-try-the-new-google-adsense-interface/
- http://mydigitalinternet.com/google-wave-will-it-be-the-dogs-bollocks/
- http://mydigitalinternet.com/google-chrome-is-now-available-in-beta-for-the-mac/
- http://mydigitalinternet.com/now-you-can-design-your-own-blogger-theme-using-the-template-designer/
- http://mydigitalinternet.com/privacy-policy/
- http://mydigitalinternet.com/blog-comment-demon-software-review/
- http://mydigitalinternet.com/true-google-ignores-the-content-of-the-meta-keywords-tag/
- http://mydigitalinternet.com/thanks/
- http://mydigitalinternet.com/terms-of-use/
- http://mydigitalinternet.com/google-to-host-third-live-online-webmaster-chat/
- http://mydigitalinternet.com/check-forums-before-buying-when-researching-software/
From the above list I can see a couple of problem URLs that can easily be fixed, a couple of pages I wouldn’t expect to be indexed (e.g. search results and thanks), a few out-of-date pages (mostly related to Google products, such as Wave) and a couple of false positives. Even so, I would have expected Google to have indexed and cached the vast majority of them, especially the Privacy Policy and Terms pages, which have been online since day one.
As far as I can ascertain there are three possible issues here:
1) Duplicate content
I changed domains a few months ago. I set up 301 redirects (through cPanel) from the old domain for each page. Having dug deeper, it looks like there are two that have not worked properly. I have set these up again.
2) Affiliate links
A few of the pages I expect to have been indexed have affiliate links within the body text. The links point to products on Amazon and are not cloaked in any way. This could be a problem.
3) Stray characters in the filename
I can’t be sure if the stray characters are the cause of the problem, but it does seem likely. This – %e2%80%93 – is the URL-encoded form of an en dash (a typographic hyphen) and this – %e2%80%99 – is the URL-encoded form of a curly apostrophe.
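The two escape sequences can be decoded to see exactly which characters ended up in the URLs. The sketch below uses Python's standard library to name each non-ASCII character, and then shows one possible way to produce a plain ASCII slug. The helper names are my own, and the clean-up rule (dash becomes a hyphen, apostrophe is dropped) is an assumption rather than anything WordPress does for you. Any slug you change would also need a redirect from the old URL.

```python
import re
import unicodedata
from urllib.parse import unquote


def explain_escapes(slug):
    """Decode a percent-encoded slug and describe each non-ASCII character."""
    decoded = unquote(slug)  # percent-escapes are decoded as UTF-8 by default
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch))
        for ch in decoded
        if ord(ch) > 127
    ]


def ascii_slug(slug):
    """Return a plain ASCII version of the slug (assumed clean-up rule)."""
    decoded = unquote(slug)
    decoded = decoded.replace("\u2013", "-").replace("\u2019", "")
    return re.sub(r"-{2,}", "-", decoded)  # collapse runs of hyphens


SLUGS = [
    "membership-site-mastermind-%e2%80%93-are-membership-sites-the-future",
    "why-i-started-to-blog-and-why-i-can%e2%80%99t-stop",
]

if __name__ == "__main__":
    for slug in SLUGS:
        print(explain_escapes(slug), "->", ascii_slug(slug))
```

The first slug decodes to U+2013 (EN DASH) and the second to U+2019 (RIGHT SINGLE QUOTATION MARK), the typographic characters an editor often substitutes for a plain hyphen and apostrophe.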
How to get the pages into Google’s index
Now that I have used Andy’s tool and established which pages are not in Google’s index, I have to do my best to get them in, or back in if they’ve been removed. A method I tried a couple of days ago, when I realized that there were so many missing pages, was to ping my archives page using Pingomatic. This hasn’t done much good yet, but it doesn’t mean it hasn’t worked. It may take a few days for the data in GWT to update.
Instead of waiting for this to happen, I will give Google a little nudge and include (below) links to the unindexed pages I would like Google to know about. I don’t have to include links to the Privacy or Terms page as those links are in the footer (although they haven’t worked yet!).
- http://mydigitalinternet.com/offline-reading-for-bloggers-books-on-blogging/
- http://mydigitalinternet.com/membership-site-mastermind-%e2%80%93-are-membership-sites-the-future/
- http://mydigitalinternet.com/what-is-shared-web-hosting/
- http://mydigitalinternet.com/will-sitewide-reciprocal-and-directory-links-hurt-your-blog/
- http://mydigitalinternet.com/why-i-started-to-blog-and-why-i-can%e2%80%99t-stop/
- http://mydigitalinternet.com/what-is-a-blog/
- http://mydigitalinternet.com/could-you-be-missing-out-on-an-inbound-link/
- http://mydigitalinternet.com/website-design-ideas-looking-for-inspiration/
- http://mydigitalinternet.com/now-you-can-design-your-own-blogger-theme-using-the-template-designer/
- http://mydigitalinternet.com/blog-comment-demon-software-review/
- http://mydigitalinternet.com/true-google-ignores-the-content-of-the-meta-keywords-tag/
- http://mydigitalinternet.com/check-forums-before-buying-when-researching-software/
This has been a very interesting exercise and I will closely watch the progress of page indexation in the future. If you have never monitored the number of pages from your blog Google has in its index, I think it is worthwhile spending a little time finding out. You could be in for a shock.
Have you experienced similar problems with your blog? If so, I would be very interested to know about your situation and what you did to resolve the problems you experienced.



This is great Stephen, you talked like an SEO expert here :) I agree with the affiliate links within content, I see most bloggers that don’t hide their affiliate links (though I admire them for their honesty through links), but it does hurt their site in some ways. I do meta refresh with my affiliate links and I restrict crawlers from reaching that redirect through robots.txt.
About indexing, it’s best to structure your internal links very well. If possible, each post should have at least 1 link coming from another post within the content, and you can use a “related posts plugin” to also reduce the site’s bounce rate. You can also build deep links to your posts to ensure that they are crawled and indexed. I’m actually focusing on this one, because I’m experimenting with Pagerank flow and keyword ranking (for the homepage) by only using internal links :)
Lastly, it’s important that you build links pointing to your sitemap once in a while, just to make sure that all pages will be crawled by search engine bots.
Regards,
Jason
Thanks Jason!
There are a few plugins around that help cloak affiliate links. GoCodes is one of them. I prefer to do it manually using PHP files that work much the same way as GoCodes. The directory those files are stored in is blocked using robots.txt.
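To confirm that a robots.txt rule really does keep crawlers out of the redirect directory, something like the sketch below can help. It uses Python's built-in robots.txt parser. The directory name /go/ and the redirect file name are placeholders I have invented for the example, not the real paths on this site.

```python
from urllib.robotparser import RobotFileParser

# Example rules only; "/go/" is a hypothetical directory for redirect files.
ROBOTS_TXT = """\
User-agent: *
Disallow: /go/
"""


def is_blocked(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text disallows this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://mydigitalinternet.com/go/some-product.php",  # hypothetical redirect
        "http://mydigitalinternet.com/what-is-a-blog/",
    ):
        print(url, "blocked" if is_blocked(ROBOTS_TXT, url) else "crawlable")
```

This only tells you what the rules say. Whether a given crawler honours them is up to the crawler.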
I use the related posts plugin too, and it’s great. It does a good job of throwing up relevant links at the bottom of each post and I think it does help keep bounce rate down.
Internal linking is very important. I am going through some old posts to see how they can be used to link to some of the newer posts. This helps readers find related content, reduces bounce rate (once again) and helps with SEO (especially as some of the older pages are more likely to have PageRank).
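One way to spot posts that need an internal link is to list every post that no other post links to. The sketch below assumes you have already collected, by whatever means, a map of each post to the internal URLs it links to. The link map shown is invented for the example.

```python
def posts_without_inbound_links(outbound):
    """Return posts that no other post links to.

    `outbound` maps each post URL to the set of internal URLs it links to.
    """
    linked_to = set()
    for source, targets in outbound.items():
        # A post linking to itself does not count as an inbound link.
        linked_to.update(t for t in targets if t != source)
    return sorted(url for url in outbound if url not in linked_to)


# Hypothetical link map; the real one would come from crawling the blog.
SAMPLE_LINKS = {
    "/what-is-a-blog/": {"/what-is-shared-web-hosting/"},
    "/what-is-shared-web-hosting/": {"/what-is-a-blog/"},
    "/google-introduces-buzz/": {"/what-is-a-blog/", "/google-introduces-buzz/"},
}

if __name__ == "__main__":
    for url in posts_without_inbound_links(SAMPLE_LINKS):
        print("No internal links point to:", url)
```

Posts that turn up in this list are good candidates for a link from an older, related article.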
Thanks for the advice Stephen.
I always wondered why people cloak affiliate links. I’m not bothered if my readers can see the affiliation, but I didn’t realise that it’s no good for SEO? Why doesn’t Google like affiliate links? Is it because they regard anything with a sales link as just advertising copy?
I’m going to my Webmaster Tools account now to see how I’m doing………
John
Thanks John.
I don’t think affiliate links are always bad for SEO, I think it depends upon the context of the link(s). If you have a page that contains lots of affiliate links and nothing else, Google will probably decide that the page is not useful and will not index it. If you have a page that contains a 500 word post that adds value and you have one affiliate link at the bottom of the article, that shouldn’t really be a problem.
A few reasons why Google doesn’t like affiliate links could be 1) It considers an affiliate link without the NOFOLLOW tag to be a paid link, 2) If it follows an affiliate link without the NOFOLLOW tag it’s a waste of resources and 3) Because affiliate links are providing traffic to a product, the seller of that product is not using AdWords for promotion, and therefore, Google is losing money.
These reasons are just speculation, of course. It’s hard to be certain of anything in relation to Google!
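If you want to check a post for affiliate links that are missing the nofollow attribute, a small script can do the looking for you. The sketch below is only that, a sketch: the list of affiliate hosts, the sample HTML and the product URL in it are all made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class FollowedLinkFinder(HTMLParser):
    """Collect links to the given hosts that lack rel="nofollow"."""

    def __init__(self, affiliate_hosts):
        super().__init__()
        self.affiliate_hosts = affiliate_hosts
        self.followed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href") or ""
        host = urlparse(href).hostname or ""
        rel_values = (attr.get("rel") or "").lower().split()
        is_affiliate = any(
            host == h or host.endswith("." + h) for h in self.affiliate_hosts
        )
        if is_affiliate and "nofollow" not in rel_values:
            self.followed.append(href)


def followed_affiliate_links(html, affiliate_hosts=("amazon.com",)):
    """Return affiliate hrefs in the HTML that search engines may follow."""
    finder = FollowedLinkFinder(affiliate_hosts)
    finder.feed(html)
    return finder.followed


# Made-up post body; the product URL and tag are placeholders.
SAMPLE_POST = """
<p>I liked <a href="http://www.amazon.com/dp/EXAMPLE?tag=example-20">this book</a>
and <a rel="nofollow" href="http://www.amazon.com/dp/OTHER?tag=example-20">this one</a>.
See also <a href="http://mydigitalinternet.com/what-is-a-blog/">What is a blog?</a></p>
"""

if __name__ == "__main__":
    print(followed_affiliate_links(SAMPLE_POST))
```

Whether adding nofollow actually changes how Google treats a page is, like everything else here, speculation.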