Integrating Google Search Appliance with Drupal

Most anyone who's worked with Drupal for even a short period of time knows that the performance, relevancy, and scalability of Drupal's core search is very limited. As a result, and no doubt due in part to its open source DNA, the Drupal community has embraced Apache Solr as core search's de facto replacement for medium-to-large scale installations. A plethora of contributed modules have grown around the main module, and Acquia offers a very nice hosted Solr solution.

For organizations with enterprise needs, an attractive, proprietary search alternative is Google Search Appliance. Whereas Solr operates entirely via an HTTP-based API for indexing and serving results, GSA is a hardware solution hosted within an organization's network that crawls webpages and documents and indexes and ranks them in a manner similar to how google.com does the Internet. Search results can be served directly from the Appliance itself--customized via XSLT transforms--or, like Solr, can be retrieved via GET requests and parsed and served elsewhere.

Rather than debating the merits of Solr vs. GSA, this article delves into how to go about integrating Google Search Appliance with Drupal 7. This guide will focus on maximizing your or your organization's ROI, in particular by delving into user search experience improvements.

Basic Integration: Drupal GSA Module

Lucky for us, most of the work is already done! Developed in large part by Eric Paul for Portland State University, the Drupal 7 Google Search Appliance module forms the basis of the search experience by implementing a broad range of GSA features in a very Drupal-friendly way: keyword search, block search form, related searches block, date/relevancy sorting--all with a simple, intuitive configuration page! To top it off, all aspects of the search experience are easily themable via template files and preprocess functions, including a standard Drupal pager. All module configuration lives at http://example.com/admin/config/search/google_appliance/settings:

Google Search Appliance module configuration

Once you have your Drupal site configured to talk to your Appliance (and assuming the Appliance is already configured and has crawled some content), you can test it out by going to http://example.com/gsearch. If you want, you can keep a sitewide search field in your menu (or elsewhere in your header or footer) by placing the provided GSA search form block using core block, context, or however else you normally place blocks.

At this point (and perhaps with some minor theming), you have a decent replacement for core (or more likely Solr) search with virtually no effort. That being said, a little time invested in improving end-user experience can go a long way in your GSA integration. For most of the following tweaks, you'll likely want to make a small custom module (e.g. my_gsa_module) to implement some useful hooks. This guide assumes you know the basics of module authoring.

Grouping Results: Collections

GSA comes with a feature called "collections." Each Appliance has a default collection (aptly called default_collection) which serves as a catch-all for all indexed content; GSA administrators can define their own collections based on partial or full URL strings (in the Appliance admin, not within Drupal). Each collection serves as a subset of the default_collection that users can search against.

Obvious use cases include bucketing different subdomains into separate collections, but collections can also be defined by path, e.g. a blog collection for all pages matching the pattern:

http://example.com/blog*

By default, the Google Search Appliance module only supports a single collection, but luckily, the module API allows us to alter queries to the GSA before they're dispatched. Switching the collection is as simple as this:

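<?php
/**
 * Implements hook_google_appliance_query_alter().
 */
function my_gsa_module_google_appliance_query_alter(&$query) {
  // Use the "site" URL parameter, when present, as the GSA collection.
  // $query must be taken by reference or the change is lost.
  if (isset($_GET['site'])) {
    $query['gsa_query_params']['site'] = check_plain($_GET['site']);
  }
}
?>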

With the above code, you can query arbitrary collections by throwing in a "site" parameter to search result pages (e.g. http://example.com/gsearch/my+search+query?site=my_collection).

It's possible you'll utilize this hook significantly, so you should become familiar with what's available to you by dsm'ing/dumping the $query parameter passed to it, as well as looking at the Google Search Appliance documentation.

Grouping Results: Pseudo-Collections and Filtering

In practice, however, collections can become limiting. Stakeholders may wish to define collections that are simply impossible to define via simple URL strings or paths. When situations like these arise, it may be better to use the default_collection and filter the results via other properties like meta tags (most likely), file formats, languages, or even paths, making something of a "pseudo" collection. (Note that if you wish, many of these settings can be saved off as separate frontends in the GSA admin interface, but they can also be appended/altered onto each query from within Drupal.)

Filtering By Meta Tag

When the Appliance crawls your site, it stores meta tags as key/value pairs in its index with which you can filter your search results. By default, your site likely doesn't have much useful information in its meta tags, so you may wish to look into meta tag management modules like Meta Tags Quick et al. It may also be appropriate to familiarize yourself with drupal_add_html_head() and friends.

One important advantage of meta tag filtering over other filtering methods is that it accepts relatively complex boolean logic. When filtering with meta tags, the GSA unit expects all arguments in the "requiredfields" or "partialfields" parameters. More details on those are available in the GSA documentation, but I'll cover the basics here.

For a simple example, suppose you wanted to filter your searches by content type. You'd want to get your node pages to have a meta tag like this in their head:

<meta name="content_type" content="my_content_type">

Then, in your query alter hook, you'd add something like the following:

<?php
/**
 * Implements hook_google_appliance_query_alter().
 */
function my_gsa_module_google_appliance_query_alter(&$query) {
  // Placeholder conditions; replace these with your own logic.
  $some_condition = TRUE;
  $some_other_condition = FALSE;

  if ($some_condition) {
    $query['gsa_query_params']['requiredfields'] = 'content_type:my_content_type';
  }
  elseif ($some_other_condition) {
    $query['gsa_query_params']['requiredfields'] = 'content_type:page';
  }
}
?>

The syntax for GSA's meta tag logic is a little non-standard, but should be relatively easy to pick up. AND is represented as a period (.), OR with a pipe (|), and NOT as the standard (for search) dash (-). Also note that logic can be nested using parentheses. Naturally, these expressions can become very complex:

(section:about.(subsection:blog|subsection:contact))|(section:learning.(subsection:testimonials|subsection:webinars|subsection:whitepapers))|(content_type:event)

Note also that you can filter by the existence of a given meta tag rather than checking for a specific value by omitting the colon and test value. In this way, if you set up your meta tag "content_type" to only show up on node pages, you could filter out all node pages (likely returning mostly views pages) with something like this:

-content_type
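
Longer expressions like the one above are easier to maintain when they're assembled from smaller pieces. Here's a minimal sketch of that idea. The helper names are hypothetical (they are not part of the GSA module), and the helpers only join strings using the operators described above; they make no attempt to escape special characters in values.

```php
<?php
/**
 * Hypothetical helpers (not part of the GSA module) for building
 * "requiredfields" expressions from smaller pieces.
 */

// Joins terms with "." (AND), wrapping the group in parentheses.
function my_gsa_module_meta_and(array $terms) {
  return '(' . implode('.', $terms) . ')';
}

// Joins terms with "|" (OR), wrapping the group in parentheses.
function my_gsa_module_meta_or(array $terms) {
  return '(' . implode('|', $terms) . ')';
}

// Negates a term by prepending a dash.
function my_gsa_module_meta_not($term) {
  return '-' . $term;
}

$expression = my_gsa_module_meta_or(array(
  my_gsa_module_meta_and(array(
    'section:about',
    my_gsa_module_meta_or(array('subsection:blog', 'subsection:contact')),
  )),
  'content_type:event',
));
// $expression is now:
// ((section:about.(subsection:blog|subsection:contact))|content_type:event)
```

The resulting string could then be assigned to $query['gsa_query_params']['requiredfields'] in your query alter hook.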

Filtering By Pseudo Parameters: inurl

Rather than being a proper query parameter, the inurl parameter is simply appended to the query string itself. Doing so filters your search to all pages whose URLs contain the value you provide. Here, advanced logic is not possible, but some basic logic is. Prepending a dash (-) to inurl negates it, and adding multiple inurl parameters has the effect of ANDing them together.

Note that with inurl, only paths can be taken into consideration; domains and subdomains will evaluate only up to the first period.

So if, for instance, you had pathauto set up so that all of your content of type my_content_type is at http://example.com/my-content-type/my-title, you could filter your search to content of that type like this:

<?php
/**
 * Implements hook_google_appliance_query_alter().
 */
function my_gsa_module_google_appliance_query_alter(&$query) {
  // Append the inurl filter to the user's query string.
  $query['gsa_query_params']['q'] .= ' inurl:/my-content-type/';
}
?>

Note the string concatenation. This could be desirable if you didn't want to pollute your meta data too much, or could also be used in combination with meta tag filtering. Also be aware of a gotcha: the default theme implementations in the GSA module include references to the search query in at least a few places. If you append the inurl parameter for filtering purposes, you'll need to regex them out (in templates or preprocess functions) before they're printed to the page.
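
As a rough sketch of that clean-up step, something like the following could be called from a preprocess function before the query is printed. The function name is hypothetical, and which template variables actually hold the query is something you'd need to confirm by inspecting the module's templates. Note that it also strips any inurl operators the user typed themselves.

```php
<?php
/**
 * Hypothetical helper: removes inurl: operators (negated or not) from a
 * query string so that only the remaining keywords are displayed.
 */
function my_gsa_module_strip_inurl($query) {
  // Drop every "inurl:value" or "-inurl:value" token.
  $stripped = preg_replace('/(^|\s)-?inurl:\S+/', ' ', $query);
  // Collapse the leftover whitespace.
  return trim(preg_replace('/\s+/', ' ', $stripped));
}

$display = my_gsa_module_strip_inurl('drupal search inurl:/my-content-type/');
// $display is now "drupal search".
```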

Filtering By Language: Search Internationalization

If you run a multilingual website, you'll likely need to be able to provide a seamless search experience for each language. Luckily, the GSA stores language metadata associated with each crawled item. The simplest way to filter this is using the "lr" parameter; the value passed is in the form "lang_en" where "en" is the desired language code.

<?php
/**
 * Implements hook_google_appliance_query_alter().
 */
function my_gsa_module_google_appliance_query_alter(&$query) {
  global $language;
  $query['gsa_query_params']['lr'] = 'lang_' . $language->language;
}
?>

Unfortunately, depending on how your site's multilingual settings are configured, this may not be enough. The crawler determines the language of a document behind the scenes; if a page is declared to be French, but there are untranslated portions, GSA may not categorize it as expected.

If your site is set up to use language prefixes for each defined language, you may want to, in addition to using the lr parameter, filter using the inurl pseudo-parameter like so:

<?php
/**
 * Implements hook_google_appliance_query_alter().
 */
function my_gsa_module_google_appliance_query_alter(&$query) {
  global $language;
  $query['gsa_query_params']['lr'] = 'lang_' . $language->language;
  $query['gsa_query_params']['q'] .= ' inurl:' . $language->prefix . '/';
}
?>

Naturally, if your site's default language doesn't have a language prefix, more nuanced logic will need to be used; in particular, when searching in the default language, you'll need to append negated inurl parameters for all declared languages except the default.
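
Here's one way that more nuanced logic might look, as a sketch only. The helper name is hypothetical; it takes plain strings so it can be reasoned about in isolation, and it assumes the language without a prefix is represented by an empty string.

```php
<?php
/**
 * Hypothetical helper: builds the inurl filter for the language being searched.
 *
 * @param string $current_prefix
 *   Path prefix of the language being searched ('' if it has none).
 * @param array $all_prefixes
 *   Path prefixes of every enabled language ('' for languages without one).
 *
 * @return string
 *   Text to append to the "q" parameter.
 */
function my_gsa_module_language_inurl($current_prefix, array $all_prefixes) {
  if ($current_prefix !== '') {
    // Prefixed language: require its prefix in the URL.
    return ' inurl:' . $current_prefix . '/';
  }
  // Unprefixed (default) language: exclude every other language's prefix.
  $filter = '';
  foreach ($all_prefixes as $prefix) {
    if ($prefix !== '') {
      $filter .= ' -inurl:' . $prefix . '/';
    }
  }
  return $filter;
}
```

In your query alter hook, you'd pass in $language->prefix along with the prefixes of your enabled languages (which you could collect from Drupal's language_list()) and append the result to $query['gsa_query_params']['q'].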

Semantic URLs

Almost certainly, your website uses Drupal's clean URL system to provide user-friendly, RESTful URLs for all of your site content. Undoubtedly, you'll want to extend this functionality to your site search.

As shown previously, the GSA module will return results at http://example.com/gsearch/user-search-query by default, allowing GSA to run in parallel with core search or even Apache Solr if desired. Ultimately though, end-users don't care which search provider is being used, so providing results at /gsearch is not particularly intuitive.

Additionally, you may want to be able to cleanly query against different collections (or pseudo collections), using different frontends, and with different sorting methods.

While the GSA module doesn't provide a way to alter the search path (or support multiple collections or front ends) out of the box, extending its functionality using standard Drupal hooks is straightforward. In the following example, as noted in the comments, the goal is a URL of the form http://example.com/search/some-collection/some-frontend/my-query.

<?php
/**
 * Implements hook_menu().
 *
 * Allow search queries of the form:
 * - /search/COLLECTION/FRONT-END/QUERY
 *
 * @see my_gsa_module_google_appliance_query_alter()
 */
function my_gsa_module_menu() {
  $items = array();
  $items['search'] = array(
    'title' => 'Search',
    'page callback' => 'google_appliance_search_view',
    // The query is the 4th path component (index 3).
    'page arguments' => array(3),
    'access arguments' => array('access_google_appliance_content'),
    'type' => MENU_SUGGESTED_ITEM,
  );
  return $items;
}

/**
 * Implements hook_google_appliance_query_alter().
 */
function my_gsa_module_google_appliance_query_alter(&$query) {
  // Only act on our custom path, not the module's default /gsearch path.
  if (arg(0) == 'search') {
    $collection = check_plain(arg(1));
    $frontend = check_plain(arg(2));

    $query['gsa_query_params']['site'] = $collection;
    $query['gsa_query_params']['client'] = $frontend;
  }
}
?>

First, we implement hook_menu(), setting up an alternate path whose page callback is the same as the GSA module's, but with arguments passed according to our URL scheme. The callback function google_appliance_search_view takes two arguments: the query string, and optionally, a sort method. We map which argument is passed to the callback using the "page arguments" array in our hook_menu() implementation. Because the search query will be the 4th item in our URL, we pass in array(3); remember, arg()'s result indices begin at zero.

The additional logic necessary in supporting collections and frontends can be implemented using the GSA module's query alter hook. We get the value of the collection by checking against the 2nd argument in the path, and the frontend value by checking the 3rd argument in the path.

You may want to use a different base path than "search," add additional sanity checks for the actual collections and frontends defined, or add/use completely different parameters. Hopefully the skeleton provided above illustrates how simple it is to implement these features.
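
As an example of the kind of sanity check mentioned above, here's a minimal sketch that only accepts known collection and frontend names and falls back to a default otherwise. The helper name and the example values are hypothetical; you'd substitute the collections and frontends actually defined on your Appliance.

```php
<?php
/**
 * Hypothetical helper: returns $value if it is in the list of allowed
 * values, otherwise the given default.
 */
function my_gsa_module_allowed_or_default($value, array $allowed, $default) {
  return in_array($value, $allowed, TRUE) ? $value : $default;
}

// Example values only; use the collections and frontends defined on your GSA.
$collections = array('default_collection', 'blog');
$frontends = array('default_frontend', 'mobile');

$collection = my_gsa_module_allowed_or_default('blog', $collections, 'default_collection');
$frontend = my_gsa_module_allowed_or_default('bogus', $frontends, 'default_frontend');
// $collection is "blog"; $frontend falls back to "default_frontend".
```

In the query alter hook, the first argument would come from arg(1) or arg(2) rather than a hard-coded string.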

Increasing Conversion with Rich Keymatches

Think of keymatches as free adwords for your own site search used to highlight relevant results for commonly searched words and phrases. In the GSA admin interface, you can define specific keywords or phrases that, when matched in an incoming query, return additional data in the form of a title and URL.

GSA Keymatch Admin Interface

The GSA Drupal module has built-in support for these, displaying keymatches at the top of search results in the same place google.com shows AdWords.

While this functionality is great, the results returned, even when themed, can be bland. You might want to increase visibility and conversion by including an image next to the keymatch, or provide multiple CTAs. One common way to accomplish this is to overload the title field in the GSA admin interface with pipe-delimited data that's processed and themed on the server side. For example:

My Keymatch Title|image.jpg|/path/to/result1|/path/to/result2

The trouble with this overloading method is that the title field in the GSA admin interface is limited in length.
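
For reference, the server-side processing for the overloading approach might look something like this sketch. The function name and the returned array keys are hypothetical, and it assumes the field order shown in the example above (title, then image, then any number of link paths).

```php
<?php
/**
 * Hypothetical helper: splits an overloaded keymatch title of the form
 * "Title|image|link1|link2|..." into its parts.
 */
function my_gsa_module_parse_keymatch_title($raw) {
  $parts = array_map('trim', explode('|', $raw));
  return array(
    'title' => array_shift($parts),
    // NULL when no image was supplied.
    'image' => array_shift($parts),
    // Whatever remains is treated as a list of link paths.
    'links' => $parts,
  );
}

$keymatch = my_gsa_module_parse_keymatch_title('My Keymatch Title|image.jpg|/path/to/result1|/path/to/result2');
```

A plain title with no pipes still works: the image comes back as NULL and the list of links is empty.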

Rather than passing pipe-delimited data, a more attractive way to do it in Drupal is to simply pass a node ID corresponding to the content you would have linked to anyway, then load and render the node in a custom keymatch preprocess function. You could define your own "keymatch" view mode, or just use the teaser view.

Example Rich Keymatch

Here's an example of how you might implement it:

<?php
/**
 * Implements hook_theme_registry_alter().
 */
function my_gsa_module_theme_registry_alter(&$theme_registry) {
  $theme_registry['google_appliance_keymatch']['function'] = 'my_gsa_module_google_appliance_keymatch';
}

/**
 * Custom theme callback for keymatches.
 */
function my_gsa_module_google_appliance_keymatch($keymatch) {
  // If the keymatch title is numeric, we mean to load/render a node.
  if (is_numeric($keymatch['description'])) {
    $node = node_load($keymatch['description']);
    $keymatch = node_view($node, 'teaser');
    return drupal_render($keymatch);
  }
  else {
    // It may be appropriate to fall back to some other legacy rendering here.
  }
}
?>

For maximum flexibility, you may even wish to create a keymatch content type with as many custom image or link fields as you need.

Cleaning up Search Results

The GSA module provides a few settings that you can toggle to easily clean up some search results in cases where result snippets or directories are duplicated:

GSA Search Result Filter Configuration

In other situations, after maintaining GSA for a while, you may start to find that random pages show up highly ranked in search results because the keyword from the query corresponds to text found in a common part of the page source (like a menu or a footer).

According to the GSA documentation, you can exclude blocks of code from the index to varying degrees using googleon/off tags; these are essentially HTML comments that have special meaning to the GSA crawler.
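
To give a sense of what these tags look like in practice, here's a small sketch that wraps a chunk of markup in a googleoff/googleon pair. The function name is hypothetical; the scope values are the ones described in the GSA documentation.

```php
<?php
/**
 * Hypothetical helper: wraps markup in googleoff/googleon comments.
 *
 * @param string $markup
 *   The HTML to hide from the crawler.
 * @param string $scope
 *   One of the scopes described in the GSA documentation:
 *   "index", "anchor", "snippet", or "all".
 */
function my_gsa_module_googleoff($markup, $scope = 'index') {
  return '<!--googleoff: ' . $scope . '-->' . $markup . '<!--googleon: ' . $scope . '-->';
}

$footer = my_gsa_module_googleoff('<div id="footer">Site map</div>');
// $footer is now:
// <!--googleoff: index--><div id="footer">Site map</div><!--googleon: index-->
```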

While you could override a template here and there to insert these into problem areas, the most flexible way to do so comes built into the Drupal module (as of version 7.x-1.7). The module allows you to change settings on a block-by-block basis; to do so, simply visit a block's configuration page and you'll see a new "Google Appliance" tab in the visibility settings.

Google Search Appliance Block Visibility Drupal

Search Suggestions

If you regularly use google.com to search the web, and especially if you use Google Chrome, you're likely (if perhaps subconsciously) aware of Google's autosuggest functionality. Type in a small part of your query, and it provides you myriad search suggestions that Google thinks are related to what you've typed so far.

GSA has a similar feature built in which you may wish to leverage. While it comes loaded with a number of JavaScript files to ease implementation, the number and size of the files is burdensome, as they're designed to work without libraries like jQuery. In some cases, they can even conflict with existing JS within Drupal.

Example of GSA Query Suggestions

In order to simplify autosuggest integration, I developed the Google Search Appliance Query Suggestions Drupal module as a dead-simple drop-in, configurable, performant solution. Once enabled, the module is configurable on the normal Google Appliance module configuration page:

Configuring GSA Query Suggestions

OneBoxes

One commonly requested search feature is something of a multimedia "related content" sidebar. Google's embraced this on their search pages, featuring relevant media in the sidebar for notable people, places of interest, etc. The Enterprise solution to this is what Google calls a "OneBox." The architecture for this is somewhat counterintuitive, but I'll explain briefly. Consider this an advanced topic.

Example of a themed GSA onebox

Essentially, you define and enable a "OneBox" that's associated with a collection and has its own frontend; you must also provide a callback URL (a OneBox provider) corresponding to a script on your server. While the GSA processes a search request, for all OneBoxes associated with the queried collection, it makes a request in parallel to the associated provider on your server, passing among other things, the original query string. The script on your server is then supposed to send another query back to the GSA using the OneBox's frontend and the query passed to it; the GSA responds with search results, which your script must format into XML (conforming to this OneBox result schema) and return. Then finally, the GSA gathers all of the results returned from all OneBoxes, and appends their returned XML to its own results.

Due to limitations in how the GSA requests the callback URLs, each script must have a unique URL (you can't provide a single URL for all OneBoxes and differentiate by query parameter). As a result, unless you fully bootstrap Drupal for each OneBox (translating into multiple complete bootstraps for every search), a generic solution is, unfortunately, not practical.

That being said, other than the callback script and configuring your GSA properly, implementing OneBoxes in Drupal is fairly intuitive, though it may require more advanced programming.

At a high level, you'll want to parse all OneBox results using the Drupal module's result alter hook. Though you're doing the XML parsing here, you won't display any of the OneBox results inline; instead, you'll want to implement blocks for each OneBox. The blocks should then pull and render what you parsed previously in the GSA module's static cache. For maximum flexibility, you may want to templatize OneBox results (both as a whole, and individual items).

Though not a complete solution, here's some skeleton code to get you started:

<?php
/**
 * Implements hook_google_appliance_results_alter().
 */
function my_gsa_onebox_google_appliance_results_alter(&$results, &$payload) {
  // Loop through each OneBox.
  foreach ($payload->xpath('//OBRES') as $onebox) {
    if ((string) $onebox->resultCode == 'success') {
      $this_result = array();
      // dsm $onebox to determine its structure, how to parse it.
      // Once parsed, you might add them to the result like this.
      $results['onebox'][(string) $onebox->attributes()->module_name] = $this_result;
    }
  }
}

/**
 * Implements hook_theme().
 */
function my_gsa_onebox_theme() {
  $registry = array();
  // Provide a template file for a single onebox result.
  $registry['my_gsa_onebox'] = array(
    'template' => 'my-gsa-onebox',
    'path' => drupal_get_path('module', 'my_gsa_onebox') . '/theme',
    'variables' => array('result' => NULL),
  );
  return $registry;
}

/**
 * Implements hook_block_info().
 */
function my_gsa_onebox_block_info() {
  $blocks = array();
  // Define blocks for each unique OneBox.
  $blocks['video-onebox'] = array(
    'info' => t('Video OneBox'),
  );
  return $blocks;
}

/**
 * Implements hook_block_view().
 */
function my_gsa_onebox_block_view($delta = '') {
  $block = array();
  // Pull GSA response XML from the static cache.
  $results = &drupal_static('google_appliance_parse_device_response_xml');
  switch ($delta) {
    case 'video-onebox':
      $block['subject'] = t('Video Results');
      // In Drupal 7, theme() takes an array of variables keyed by name.
      $block['content'] = theme('my_gsa_onebox', array('result' => $results['onebox']['video-onebox']));
      break;
  }
  return $block;
}
?>

Additional Resources

Hopefully this guide has been useful to you. I've linked to the official Google GSA documentation several times in this article, and it's worth looking over whether you're a developer, a stakeholder, or anyone else. Also, don't forget that the Drupal GSA module comes with plenty of documentation (readme, API info). Finally, if your organization is looking to adopt or improve a Google Search Appliance integration with Drupal, don't hesitate to get in touch.

Comments

Nice Work

Excellent work, Eric. Very sensible coverage of common use cases, and even novice site builders should be able to get the most out of their GSA + Drupal search experience using these helpful tutorials. Thanks for sharing.

Great article!

Eric,

Great article! I wanted to make a couple of points that may make things easier for your readers:

1) The GSA does support chaining collection names together, so they might not need your code modifications. Additionally as of a couple of releases ago, Google introduced composite collections which will do the same thing. While marketed to customers having multiple collections, it does work for a single GSA.
2) There are limits to what you can stuff into a keymatch. There is a OneBox called custom keymatch provider which allows you to add additional attributes, effectively extending keymatch functionality.
3) Filtering by metadata can also be achieved by configuring the frontend on the GSA so this could be done to attach the filter to every query. You can also pass it in via inmeta...

It would be great if someone added the dynamic navigation functionality. This is a great feature for sites having metadata as well. Not sure if that's on the near-term road map. We've done it for C#, but I'm not aware of anyone doing it for PHP.

Keep up the great work!

Michael Cizmar
