Introduction to pentesting: Reconnaissance

Google directives

Luckily for us, Google provides “directives” that are easy to use and help us get the most out of every search. These directives are keywords that enable us to more accurately extract information from the Google Index.

Consider the following example: assume you’re looking for information about Pat Engebretson (the author of the book on which this course is based) at Dakota State University (DSU). The simplest approach is to enter the following terms (without the quotes) into a Google search box: “Patrick Engebretson dsu”. This search will yield a fair number of hits. However, of the first 50 websites returned, only four were pulled directly from the DSU website.

By utilizing Google directives, we can force the Google Index to do our bidding. In the example above we know both the target website and the keywords we want to search for. More specifically, we’re interested in forcing Google to return only results that are pulled from the target (dsu.edu) domain. In this case, our best choice is to use the “site:” directive, which forces Google to return only hits that contain our keywords and come directly from the specified website.

To properly use a Google directive, you need three things:

  1. The name of the directive you want to use
  2. A colon
  3. The term you want to use in the directive

After you’ve entered the three pieces of information above, you can search as you normally would. To utilize the “site:” directive, we need to enter the following into a Google search box:

site:domain term(s) to search

Note that there’s no space between the directive, colon and domain. In our earlier example, we wanted to conduct a search for Patrick Engebretson on the DSU website. To accomplish this, we’d enter the following command into the Google search bar:

site:dsu.edu pat engebretson

Running this search provides us with drastically different results than our initial attempt. First, we’ve trimmed the overall number of hits from 600+ to about 50. There’s little doubt that a person can sort through and gather information from 50 hits much more quickly than from 600. Second, and possibly more importantly, every single returned result comes directly from the target domain, so we can focus our attention on these pages and mine them for additional information.
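As an aside, the three-piece directive format is simple enough to generate programmatically. The following Python sketch (the helper names are our own, not part of any standard tool) assembles a “site:” query and turns it into a Google search URL:

import urllib.parse

def build_query(directive, value, *terms):
    # The directive, colon and value are joined without spaces,
    # e.g. "site:dsu.edu"; any search terms follow after a space.
    return f"{directive}:{value} " + " ".join(terms)

def google_url(query):
    # URL-encode the query so spaces and colons survive the trip.
    return "https://www.google.com/search?q=" + urllib.parse.quote_plus(query)

query = build_query("site", "dsu.edu", "pat", "engebretson")
print(query)              # site:dsu.edu pat engebretson
print(google_url(query))  # paste the printed URL into a browser

Printing the URL rather than fetching it automatically keeps this step passive; you decide which results, if any, to visit.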

Another good Google directive to use is “intitle:” or “allintitle:”. Adding either of these to your search causes only websites that have your search words in the title of the webpage to be returned. The difference between “intitle:” and “allintitle:” is straightforward. “allintitle:” will only return websites that contain all the keywords in the web page title. The “intitle:” directive will return any page whose title contains at least one of the keywords you entered.

A classic example of putting the “allintitle:” Google hack to work is to perform the following search:

allintitle:index of

Performing this search will allow us to view a list of any directories that have been indexed and are available via the webserver. This is often a great place to gather reconnaissance on your target.
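To make the difference between the two directives concrete, here’s a small hypothetical Python helper that emits both variants for the same keyword list; only the directive name changes, but the matching rules differ as described above:

def title_queries(*keywords):
    # "allintitle:" requires every keyword to appear in the page title;
    # "intitle:" only binds the keyword immediately after the directive,
    # so the remaining words are treated as ordinary search terms.
    joined = " ".join(keywords)
    return {"allintitle": "allintitle:" + joined,
            "intitle": "intitle:" + joined}

for name, query in title_queries("index", "of").items():
    print(query)
# allintitle:index of
# intitle:index of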

If we want to search for sites that contain specific words in the URL, we can use the “inurl:” directive. For example, we can issue the following search to locate potentially interesting pages on our target’s website:

inurl:admin

This search can be extremely useful in revealing administrative or configuration pages on your target’s website.
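In practice you’d rarely stop at a single keyword. A short Python sketch along these lines (the wordlist is illustrative, not exhaustive) can generate a batch of “inurl:” queries to try:

# Illustrative URL fragments that often mark administrative
# or configuration pages; extend the list to suit your target.
INTERESTING = ["admin", "login", "backup", "config", "setup"]

def inurl_queries(words):
    return ["inurl:" + word for word in words]

for query in inurl_queries(INTERESTING):
    print(query)
# inurl:admin
# inurl:login
# ...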

It can also be very valuable to search the Google cache rather than the target’s website. This process not only reduces your digital footprint on the target’s server, making you harder to detect, it also provides the occasional opportunity to view webpages and files that have been removed from the original website. The Google cache contains a stripped-down copy of each website that the Google bots have spidered. It’s important to understand that the cache contains both the code used to build the site and many of the files that were discovered during the spidering process. These files can be PDFs, MS Office documents like Word and Excel, text files and more.

It’s not uncommon today for information to be placed on the Internet by mistake. Consider the following example: suppose you’re a network administrator for a company. You use MS Excel to create a simple workbook containing all the IP addresses, computer names and locations of the PCs in your network. Rather than carrying this Excel spreadsheet around, you decide to publish it to the intranet, where it’ll be accessible only to people within your organization. However, rather than publishing this document to the intranet website, you mistakenly publish it to the company Internet website. If the Google bots spider the file before you take it down, it’s possible the document will live on in the Google cache even after you’ve removed it from your site. As a result, it’s important to search the Google cache too.

We can use the “cache:” directive to limit our search results and show only information pulled directly from the Google cache. For example, the following search will provide us with the cached version of github.com:

cache:github.com

It’s important to understand that clicking on any of the URLs in the results will bring you to the live website, not the cached version. If you want to view specific cached pages, you’ll need to modify your search.
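If you’d rather pull a cached copy directly, Google has historically served cached snapshots from a dedicated endpoint. The Python sketch below assumes that endpoint is still reachable, which may not hold by the time you read this; treat it as illustrative only:

import urllib.parse

def cached_url(page):
    # Historical Google cache endpoint; Google has changed and at times
    # retired this service, so availability is not guaranteed.
    return ("https://webcache.googleusercontent.com/search?q=cache:"
            + urllib.parse.quote(page, safe=""))

print(cached_url("github.com"))
# Fetching this URL (e.g. with urllib.request) touches Google's servers,
# not the target's -- which is the whole point of searching the cache.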

The last directive we’ll cover here is “filetype:”. We can use “filetype:” to search for specific file extensions. This is extremely useful for finding specific files on your target’s website. For example, to return only hits that contain PDF documents, you’d issue the following command:

filetype:pdf

This powerful directive is a great way to find links to specific files like .doc, .xlsx, .ppt, .txt and many more. Your options are nearly limitless.

For additional power, we can combine multiple directives in the same search. For example, if we want to find all the PDF files on the GitHub website, we’d enter the following command into the search box:

site:github.com filetype:pdf

In this case, every result that is returned is a PDF file and comes directly from the github.com website.
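Combining directives is where this technique really pays off during reconnaissance. A small hypothetical Python sweep over common document types (the extension list is our own choice) might look like this:

import urllib.parse

# Document types that frequently leak internal information.
EXTENSIONS = ["pdf", "doc", "docx", "xls", "xlsx", "ppt", "txt"]

def filetype_sweep(domain, extensions):
    # One combined query per extension: every hit is a file of that
    # type served from the target domain.
    for ext in extensions:
        query = f"site:{domain} filetype:{ext}"
        yield query, "https://www.google.com/search?q=" + urllib.parse.quote_plus(query)

for query, url in filetype_sweep("github.com", EXTENSIONS):
    print(query, "->", url)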

There are many other types of directives and Google hacks that you should become familiar with. Along with Google, it’s important that you become proficient with several other search engines as well. Oftentimes, different search engines will provide different results, even when you search for the same keywords. As a penetration tester conducting reconnaissance, you want to be as thorough as possible.

As a final warning, it should be pointed out that these searches are only passive as long as you stay within the search results. Once you make a connection with the target system (by clicking on any of the links), you’re back in active mode. Be aware that active reconnaissance without prior authorization is likely an illegal activity.

Once you’ve thoroughly reviewed the target’s webpage and conducted exhaustive searches utilizing Google and other search engines, it’s important to explore other corners of the Internet. Newsgroups and bulletin board systems like UseNet and Google Groups can be very useful for gathering information about a target. It’s not uncommon for people to use these discussion boards to post and receive help with technical issues. Unfortunately (or fortunately, depending on which side of the coin you’re looking at), employees often post very detailed questions, including sensitive and confidential information. For example, consider a network administrator who is having trouble getting his firewall properly configured. It’s not uncommon to witness discussions on public forums where these admins will post entire sections of their config files. To make matters worse, many people post using their company e-mail addresses. This information is a virtual gold mine for an attacker.

Even if our network admin is smart enough not to post detailed configuration files, it’s hard to get support from the community without inadvertently leaking some information. Reading even carefully scrubbed posts will often reveal specific software versions, hardware models, current configuration information and the like about internal systems. All this information should be filed away for future use.

Public forums are an excellent way to share information and receive technical help. However, when using these resources, be careful to use a more anonymous e-mail address, such as a Gmail or Hotmail account, rather than your corporate address.

The explosive growth of social media like Facebook, MySpace and Twitter provides us with new avenues to mine data about our targets. When performing reconnaissance, it’s a good idea to use these sites to our advantage. Consider the following fictitious example: You’re conducting a penetration test against a small company. Your reconnaissance has led you to discover that the network administrator for the company has a Twitter and Facebook account. Utilizing a little social engineering, you befriend the unsuspecting admin and follow him on both Facebook and Twitter. After a few weeks of boring posts, you strike the jackpot. He makes a post on Facebook that says “Great. Firewall died without warning today. New one being sent overnight. Looks like I’ll be pulling an all-nighter tomorrow to get things back to normal”.

Another example would be a PC tech who posts, “Problem with latest Microsoft patch, had to uninstall. Will call MS in the morning”.

Or even the following: “Just finished the annual budget process. Looks like I’m stuck with that Server 2000 for another year”. Although these examples may seem a bit over the top, you’ll be surprised at the amount of information you can collect by simply monitoring what employees post online.