Think about it. Why do you create a website? For your potential customers or audience to easily find you and for you to stand out among the competition, right? How does your content actually get to be seen? Is all the content on your site always seen?
Why you need to find all the pages on your website
It is possible that pages containing valuable information never get seen at all. If this is the case for your website, then you are probably losing out on significant traffic, or even potential customers.
There could also be pages that are rarely seen, and when they are, users/visitors/potential customers hit a dead end, as they cannot access other pages. They can only leave. This is just as bad as those pages that are never seen. Google will begin to note the high bounce rates and question your site’s credibility. This will see your web pages rank lower and lower.
How your content actually gets to be seen
What is crawling and indexing?
For Google to show your content to users/visitors/potential customers, it first needs to know that the content exists. This happens via crawling, which is when search engines look for new content and add it to their database of already existing content.
What makes crawling possible?
- Links
- Sitemaps
- Content Management Systems (CMS - Wix, Blogger)
When you add a link from an existing page to a new page, for example via anchor text, search engine bots or spiders are able to follow the link to the new page and add it to Google’s ‘database’ for future reference.
Sitemaps are also known as XML Sitemaps. Here, the site owner submits a list of all their pages to the search engine. The webmaster can also include details like the last date of modification. The pages are then crawled and added to the ‘database’. This does not, however, happen in real time. Your new pages or content will not be crawled as soon as you submit your sitemap. Crawling may happen after days or weeks.
Most sites using a Content Management System (CMS) auto-generate these, so it's a bit of a shortcut. The only time a site might not have the sitemap generated is if you created a website from scratch.
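If your site was built from scratch and has no generated sitemap, a small script can produce one. The following is a minimal sketch, assuming you already have a list of page URLs with optional last-modified dates. The `example.com` URLs and the `build_sitemap` helper are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML text for (url, last_modified) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        if last_modified:  # the last-modified date is optional
            ET.SubElement(entry, "lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages on an example domain.
example_pages = [
    ("https://www.example.com/", "2024-01-15"),
    ("https://www.example.com/services", None),
]
sitemap_xml = build_sitemap(example_pages)
print(sitemap_xml)
```

The resulting text can be saved as sitemap.xml and submitted to the search engine in the usual way.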
If your website is powered by a CMS like Blogger or Wix, the hosting provider (in this case the CMS) is able to ‘tell search engines to crawl any new pages or content on your website.’
What is indexing?
Indexing in simple terms is the adding of the crawled pages and content into Google’s ‘database’, which is actually referred to as Google’s index.
Before the content and pages are added to the index, the search engine bots strive to understand the page and the content therein. They even go ahead to catalog files like images and videos.
This is why as a webmaster, on-page SEO comes in handy (page titles, headings, and use of alt text, among others). When your page or pages have these aspects, it becomes easier for Google to ‘understand’ your content, catalog it appropriately and index it correctly.
Sometimes, you may not want some pages, or parts of a website, crawled. In that case, you need to give directives to search engine bots in a robots.txt file. Using such directives also makes crawling and indexing easier, as there are fewer pages being crawled. Learn more about robots.txt.
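To see how a bot would interpret robots.txt directives, you can test URLs against them. Below is a minimal sketch using Python's standard `urllib.robotparser` module. The rules and the `example.com` URLs are made up for illustration, and `is_crawlable` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; real rules will differ per site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /thank-you
"""


def is_crawlable(robots_txt, url, user_agent="*"):
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in ("https://www.example.com/services",
            "https://www.example.com/admin/login"):
    print(url, "->", "allowed" if is_crawlable(ROBOTS_TXT, url) else "blocked")
```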
You can also use another directive, noindex, if there are pages that you do not want to appear in the search results. Learn more about noindex.
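The noindex directive is commonly placed in a page's robots meta tag. As a rough sketch, assuming you already have a page's HTML as a string, the following checks whether that tag carries noindex. The class and function names are hypothetical, and the sample HTML is invented.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def has_noindex(html):
    """Return True if the page's robots meta tag contains noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


sample_page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(sample_page))
```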
Before you start adding noindex, you’ll want to identify all of your pages so you can clean up your site and make it easier for crawlers to crawl and index your site properly.
What are orphan pages?
An orphan page can be defined as one that has no links from other pages on your site. This makes it almost impossible for these pages to be found by search engine bots, and in addition by users. If the bots cannot find the page, then they will not show it on search results, which further reduces the chances of users finding it.
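If you have a record of which pages link to which, orphan pages are simply the ones that never appear as a link target. Here is a minimal sketch, assuming a hypothetical `links_by_page` mapping from each page to the internal pages it links to. The home page is treated as an entry point and exempted.

```python
def find_orphan_pages(links_by_page, home_page="/"):
    """Return pages that no other page links to (the home page is exempt)."""
    linked_pages = set()
    for source, targets in links_by_page.items():
        # A page linking to itself does not count as an inbound link.
        linked_pages.update(target for target in targets if target != source)
    return sorted(
        page for page in links_by_page
        if page not in linked_pages and page != home_page
    )


# Hypothetical site structure.
site_links = {
    "/": ["/about", "/services"],
    "/about": ["/services"],
    "/services": ["/"],
    "/christmas-sale": ["/services"],  # nothing links to this page
}
orphans = find_orphan_pages(site_links)
print(orphans)
```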
How do orphan pages come about?
Orphan pages may result from an attempt to keep content private, syntax errors, typos, duplicate content or expired content that was not linked. Here are more ways:
- Test pages that were used for A/B testing and that were never deactivated
- Landing pages that were based on a season, for example, Christmas, Thanksgiving or Easter
- ‘Forgotten’ pages as a result of site migration
How about dead-end pages?
Unlike orphan pages, dead-end pages have links from other pages on the website but do not link out to any other pages. Examples of dead-end pages include thank you pages, services pages with no calls to action, and “nothing found” pages when users search for something via the search option.
When you have dead-end pages, people who visit them only have two options: to leave the site or go back to the previous page. That means that you are losing significant traffic, especially if these pages happen to be ‘main pages’ on your website. Worse still, users are left either frustrated, confused or wondering, ‘what’s next’?
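One rough way to spot a dead-end page is to check whether its HTML contains any outgoing links at all. The sketch below assumes you have the page's HTML as a string and ignores links that only jump within the same page. `LinkCollector`, `outgoing_links` and `is_dead_end` are hypothetical names, and the sample pages are invented.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, skipping same-page anchors."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and not href.startswith("#"):
                self.hrefs.append(href)


def outgoing_links(html):
    """Return the list of links a visitor could follow from this page."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.hrefs


def is_dead_end(html):
    """Return True if the page offers no link to any other page."""
    return not outgoing_links(html)


thank_you_page = "<html><body><h1>Thank you!</h1><a href='#top'>Back to top</a></body></html>"
about_page = "<html><body><a href='/services'>View our services</a></body></html>"
print(is_dead_end(thank_you_page), is_dead_end(about_page))
```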
Where do dead-end pages come from?
Dead-end pages result from pages having no calls to action. An example here would be an about page that alludes to the services that your company offers but has no link to those services. Once the reader understands what drives your company, the values you uphold, how the company was founded and the services you offer, and is already excited, you need to tell them what to do next.
A simple call to action button ‘view our services’ will do the job. Make sure that the button, when clicked, actually opens the services page. You do not want the user to be served with a 404, which will leave him/her frustrated as well.
What are hidden pages?
Hidden pages are those that are not accessible via a menu or navigation. Though a visitor may be able to view them, especially through anchor text or inbound links, they can be difficult to find.
Pages that fall into the category section are likely to be hidden pages too, as they are located in the admin panel. The search engine may never be able to access them, as search engine bots do not access information stored in databases.
Should all hidden pages be done away with?
Newsletter sign ups
You can have a page that breaks down the benefits of signing up to the newsletter, how frequently users should expect to receive it, or a graphic showing the newsletter (or previous newsletter). Remember to include the sign up link as well.
Pages containing user information
How to find hidden pages
Hidden pages are highly likely to be hidden from search engines via robots.txt. To access a site’s robots.txt, type [domain name]/robots.txt into a browser and press Enter. Replace ‘domain name’ with your site’s domain name. Look out for entries beginning with ‘Disallow’. A ‘nofollow’ directive, by contrast, is set in a page’s robots meta tag or on individual links rather than in robots.txt.
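If the robots.txt file is long, you can pull out the blocked paths with a few lines of code instead of reading it by eye. This is a minimal sketch, assuming you have copied the file's text into a string. `disallowed_paths` is a hypothetical helper, and the sample rules are invented.

```python
def disallowed_paths(robots_txt):
    """Return the paths listed in non-empty Disallow entries."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical robots.txt content.
sample_robots = """\
User-agent: *
Disallow: /private/   # staff only
Disallow:
Allow: /
Disallow: /old-campaign
"""
print(disallowed_paths(sample_robots))
```

Each path returned is a candidate location for hidden pages that you may want to review.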
Manually finding them
If you sell products via your website, for example, and suspect that one of your product categories may be hidden, you can manually look for it. To do this, copy and paste another product’s URL and edit it accordingly. If you don’t find it, then you were right!
What if you have no idea of what the hidden pages could be? If you organize your website in directories, you can type domainname/folder-name into your browser’s address bar and navigate through the pages and sub-directories.
Once you have found your hidden pages (and they do not need to stay hidden as discussed above), you need to add them to your sitemap and submit a crawl request.
How to find all the pages on your site
Using your sitemap file
We have already looked at sitemaps. Your sitemap would come in handy when analyzing all of your web pages. If you do not have a sitemap, you can use a sitemap generator to generate one for you. All you need to do is enter your domain name and the sitemap will be generated for you.
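Once you have a sitemap, listing the URLs in it is straightforward. The following is a minimal sketch, assuming a standard XML sitemap whose text you have already loaded into a string. The `example.com` entries are placeholders and `urls_from_sitemap` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(sitemap_xml):
    """Return every <loc> URL listed in a sitemap's <url> entries."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Placeholder sitemap content.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc><lastmod>2024-01-15</lastmod></url>
  <url><loc>https://www.example.com/services</loc></url>
</urlset>"""
print(urls_from_sitemap(sample_sitemap))
```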
Using your CMS
If your site is powered by a content management system (CMS) like WordPress, and your sitemap does not contain all the links, it is possible to generate the list of all your web pages from the CMS. To do this, use a plugin like Export All URLs.
Using a log
A log of all the pages served to visitors also comes in handy. To access the log, log in to your cPanel, then find ‘raw log files’. Alternatively, request your hosting provider to share it. This way you get to see the most frequently visited pages, the never visited pages and those with the highest drop off rates. Pages with high bounce rates or no visitors could be dead-end or orphan pages.
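Raw log files are easier to read once they are summarized. Below is a rough sketch that counts successful requests per page, assuming your host writes logs in the widely used Common or Combined Log Format. Your log layout may differ, in which case the pattern would need adjusting. The sample lines and the `count_page_hits` helper are made up for illustration.

```python
import re
from collections import Counter

# Matches the request and status parts of a Common/Combined Log Format line.
REQUEST_PATTERN = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)


def count_page_hits(log_lines):
    """Count successful (status 200) requests per page path."""
    hits = Counter()
    for line in log_lines:
        match = REQUEST_PATTERN.search(line)
        if match and match.group("status") == "200":
            path = match.group("path").split("?", 1)[0]  # ignore query strings
            hits[path] += 1
    return hits


# Invented sample log lines.
sample_log = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /services HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Oct/2023:13:56:01 +0000] "GET /services?ref=home HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Oct/2023:13:56:40 +0000] "GET /old-page HTTP/1.1" 404 512',
    '203.0.113.7 - - [10/Oct/2023:13:57:12 +0000] "GET /about HTTP/1.1" 200 1830',
]
page_hits = count_page_hits(sample_log)
for path, count in page_hits.most_common():
    print(path, count)
```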
Using Google Analytics
Here are the steps to follow:
Step 1: Log in to your Analytics page.
Step 2: Go to ‘behavior’ then ‘site content’
Step 3: Go to ‘all pages’
Step 4: Scroll to the bottom and on the right choose ‘show rows’
Step 5: Select 500 or 1000 depending on how many pages you would estimate your site to have
Step 6: Scroll up and on the top right choose ‘export’
Step 7: Choose ‘export as .xlsx’ (excel)
Step 8: Once the excel is exported choose ‘dataset 1’
Step 9: Sort by ‘unique page views’.
Step 10: Delete all other rows and columns apart from the one with your URLs
Step 11: Use this formula on the second column:
Step 12: Replace the domain with your site’s domain. Drag the formula so that it is applied to the other cells as well.
You now have all your URLs.
If you want to convert them to hyperlinks in order to easily click and access them when looking something up, go on to step 13.
Step 13: Use this formula on the third column:
Drag the formula so that it is applied to the other cells as well.
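If you prefer a script to a spreadsheet, the same result (a list of full URLs built from the exported page paths) can be sketched in a few lines. This assumes the export contains page paths such as /services rather than full URLs. The domain, the sample paths and the `to_full_urls` helper are hypothetical.

```python
def to_full_urls(domain, page_paths):
    """Prefix each page path with the domain, dropping repeated entries."""
    base = domain.rstrip("/")
    seen = set()
    full_urls = []
    for path in page_paths:
        url = base + "/" + path.lstrip("/")
        if url not in seen:  # keep the first occurrence only
            seen.add(url)
            full_urls.append(url)
    return full_urls


# Hypothetical page paths copied from an export.
exported_paths = ["/", "/services", "/services", "blog/post-1"]
all_urls = to_full_urls("https://www.example.com", exported_paths)
for url in all_urls:
    print(url)
```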
Manually typing into Google’s search query
You can also type site:www.abc.com (with no space after the colon) into Google’s search query. Replace ‘abc’ with your domain name. You will get search results with all the URLs that Google has crawled and indexed, including images, links to mentions on other sites, and even hashtags your brand can be linked to.
What then do you do with your URL list?
At this point, you may be wondering what you need to do with your URL list. Let’s look at the available options:
Manual comparison with log data
One of the options would be to manually compare your URL list with the CMS log and identify the pages that seem to have no traffic at all, or that seem to have the highest bounce rates. You can then use a tool like ours to check for inbound and outbound links for each of the pages that you suspect to be orphan or dead end.
Another approach is to download all your URLs as a .xlsx file (Excel) and your log too. Compare them side by side (in two columns, for example) and then use the ‘remove duplicates’ option in Excel. Follow the step by step instructions. By the end of the process, you will have only orphan and dead-end pages left.
The third comparison approach is copying the two data sets, your log and your URL list, into Google Sheets. This allows you to use this formula: =VLOOKUP(A1, A:B, 2, FALSE) to look up URLs that are present in your URL list, but not in your log. The missing pages (rendered as #N/A) should be interpreted as orphan pages. Ensure that the log data is in the first (left) column.
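The same comparison can also be sketched in code: anything in your URL list that never shows up in the log is a page to investigate. This is a minimal sketch, assuming both lists hold full URLs. Trailing slashes are ignored when matching, which is an assumption you may want to change. The function names and sample data are hypothetical.

```python
def normalize(url):
    """Treat URLs that differ only by a trailing slash as the same page."""
    return url.rstrip("/")


def pages_missing_from_log(all_urls, logged_urls):
    """Return URLs from the full list that never appear in the log."""
    logged = {normalize(url) for url in logged_urls}
    return [url for url in all_urls if normalize(url) not in logged]


# Hypothetical data sets.
url_list = [
    "https://www.example.com/",
    "https://www.example.com/services",
    "https://www.example.com/easter-offer",
]
log_urls = [
    "https://www.example.com",
    "https://www.example.com/services/",
]
unvisited = pages_missing_from_log(url_list, log_urls)
print(unvisited)
```

Pages in the result are candidates only. A page with no recorded visits may still be linked from elsewhere on the site, so check its inbound links before treating it as an orphan.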
Using site crawling tools
The other option would be to load your URL list onto tools that can perform site crawls and wait for them to crawl the site. You then copy and paste your URLs onto a spreadsheet before analyzing them one by one, trying to figure out which ones are orphan or dead end.
These two options can be time-consuming, especially if you have many pages on your site, right?
Well, how about a tool that not only finds you all your URLs but also allows you to filter them and shows their status (so that you know which ones are dead end or orphan)? In other words, if you want a shortcut to finding all of your site's pages, try SEOptimer's SEO Crawl Tool.
SEOptimer's SEO Crawl Tool
This tool allows you to access all the pages of your site. You can start by going to “Website Crawls” and entering your website URL. Then hit “Crawl”.
Once the crawl is finished you can click on “View Report”.
Our crawl tool will detect all the pages of your website and list them in the “Pages Found” section of the crawl.
You can identify “404 Error” issues in the “Issues Found” section, just beneath the “Pages Found” section.
In this article we have looked at how to find all the pages on your site and why it is important. We have also explored concepts like orphan and dead-end pages, as well as hidden pages. We have differentiated each one and looked at how to identify each among your URLs. There is no better time to find out whether you are losing out due to hidden, orphan or dead-end pages.