Technical Blog

4 Posts tagged with the search tag

In Elastic Path 6.1.1, we added support for search server clustering, which greatly improved scalability and reliability. In our testing since that time, we've found that large-scale deployments that are clustering the search server can benefit from certain optimizations to index replication and search request distribution. Also, in 6.1.2, we upgraded to Solr 1.3, which brought overall performance gains, including improved indexing performance. Finally, we've updated the replication scripts to use the new snapshot "check" functionality, making the scripts more efficient by duplicating less redundant data.

 

Master Index Replication Interval

How often you run the snapshot script depends on how often the slave machines need updated indexes. You also need to consider the performance impact on the system, which depends directly on the size of your indexes and the frequency of updates. If you have large indexes, consider a longer interval between snapshots. If the indexes need to be updated frequently, you may require a shorter interval. Keep in mind that frequent index replication across slave servers can affect performance. You may want to experiment to find the interval that best balances update frequency against performance.


The snapshooter.ep script has the responsibility of taking a snapshot of the current master index. There are two ways to run it:

  • By running a cron job at a predefined interval (this is the default)
  • By using Solr's postCommit hook, which fires the script after a commit is complete.


Typically, we recommend using a cron job, unless you need to replicate your indexes more often than once a minute or you've changed how often your Quartz indexers check for new objects to index (the default is 5 seconds). With recent changes from the Solr 1.3 scripts, the snapshooter.ep script now compares the previous snapshot files with the current index files and only duplicates an index if it has changed. With this new feature, you can set a short interval without the additional overhead of duplicating unchanged indexes. For example, in a cron job, you can run the script every minute.
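To illustrate the kind of "check" described above, here is a minimal Java sketch. It is not the logic of the actual snapshooter.ep script (which is a shell script). The class name, the method names, and the choice of comparing file names, sizes, and modification times are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of a snapshot change check; not the real snapshooter.ep logic. */
final class SnapshotCheck {

    /** Maps each regular file name in a directory to "size:lastModifiedMillis". */
    static Map<String, String> fingerprint(Path dir) {
        Map<String, String> files = new TreeMap<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path file : stream) {
                if (Files.isRegularFile(file)) {
                    files.put(file.getFileName().toString(),
                            Files.size(file) + ":" + Files.getLastModifiedTime(file).toMillis());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    /** Returns true when there is no previous snapshot or the index no longer matches it. */
    static boolean snapshotNeeded(Map<String, String> currentIndex, Map<String, String> lastSnapshot) {
        return lastSnapshot == null || !currentIndex.equals(lastSnapshot);
    }
}
```

A caller would pass fingerprint(indexDir) and fingerprint(lastSnapshotDir) to snapshotNeeded and skip the copy when it returns false, which is why a one-minute interval adds little overhead when nothing has changed.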

 

The advantage of using the postCommit hooks is that there is much more control over when you take a snapshot because they're only run when an actual change is made to an index. A snapshot can be taken post-commit, post-optimize, or both. To use postCommit replication, do the following:

 

  1. Modify snapshooter.ep.start to accept an argument that specifies which index you want to replicate (instead of having them all replicate with one call). For example, you would specify snapshooter.ep.start -i product to create only a product index snapshot.
  2. Update the Solr config file for that specific index to cause replication on a product post-commit. For example, edit WEB-INF/solrHome/conf/product.config.xml to add the following example:

   

product.config.xml Example
<!-- A postCommit event is fired after every commit or optimize command -->
     <listener event="postCommit" class="solr.RunExecutableListener">
       <str name="exe">/path/to/searchserver/WEB-INF/solrHome/bin/snapshooter.ep.start</str>
       <str name="dir">/path/to/searchserver/WEB-INF/solrHome/bin</str>
       <bool name="wait">true</bool>
       <arr name="args"><str>-i</str><str>product</str></arr>
       <arr name="env"></arr>
     </listener>

 

Note: In 6.1.2, postCommit replication works properly; Solr 1.3 included a fix for an issue that caused each commit call to be run twice. Prior to the fix, using a postCommit snapshot would produce multiple snapshots, wasting CPU time and disk space.

Slave Index Installation Interval

snappuller.ep maintains a status of the last index that was pulled, and therefore, will not unnecessarily pull down indexes on the master that have already been pulled. As a result, this script can be run at fairly regular intervals. snapinstaller.ep also maintains a status of the last index installed and will not attempt to re-install indexes that have already been installed. Therefore, similar to snappuller.ep, this script can be run at fairly regular intervals with little overhead.
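The "status of the last index pulled" idea can be sketched in Java as follows. This is only an illustration, not the snappuller.ep implementation. The class and method names are hypothetical, and the sketch assumes snapshot names end in a fixed-width timestamp (for example snapshot.20090121103000) so that string order matches chronological order.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

/** Hypothetical illustration of skipping snapshots that have already been pulled. */
final class SnapshotPullDecision {

    /**
     * Returns the newest snapshot on the master if it is newer than the last one pulled,
     * or an empty Optional when there is nothing new to pull.
     */
    static Optional<String> snapshotToPull(List<String> masterSnapshots, String lastPulled) {
        if (masterSnapshots.isEmpty()) {
            return Optional.empty();
        }
        // Assumption: fixed-width timestamps make lexicographic order chronological.
        String latest = Collections.max(masterSnapshots);
        if (lastPulled != null && latest.compareTo(lastPulled) <= 0) {
            return Optional.empty();
        }
        return Optional.of(latest);
    }
}
```

Because the check is a cheap name comparison, running it often costs little when the master has produced no new snapshot.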

 

HTTP Search Request Distribution over Master/Slave(s)

Although the master search server is fully capable of handling search requests from the storefront and CM client, we typically recommend that only the slave nodes handle search requests. Keep in mind that index replication isn't instantaneous. There's a delay between the time the master search server indexes a new object and the time it becomes searchable on the slave. With requests only going through to the slaves, this delay is consistent and all objects will show at the same time on all nodes. Also, you'll want to reduce the load on the master server and let it focus on continuous indexing, if necessary.

 

To take advantage of these changes, be sure to download the 6.1.2 search server cluster/failover scripts from the downloads area at http://grep.elasticpath.com/docs/DOC-1278 (requires login).


Search Server Failover

Posted by Alan Schroder Jan 21, 2009

Every Elastic Path administrator knows: if your search server goes down, your storefront is down and so is Commerce Manager. You can make your Elastic Path deployment more resilient by setting up search server failover.

 

In a nutshell, you set up two machines to run the search server web application. One is the main search server (master). The other is a backup (slave). Both machines are running behind a load balancer. The master search server uses rsync to synchronize multiple index files to the slave machine and then commits these new indexes into the slave's indexes. This essentially creates a duplicate of the master on the slave and, if the master server goes down, the slave server will be able to handle all requests.

 

The failover configuration consists of three components:

  • The master server
  • The slave server
  • The load balancer.

The load balancer is a PC running Apache web server with the mod_proxy module. During normal operation, it receives search requests and forwards them to the master. The master server contains the search server web application, the index builders, and the indexes. The slave server also contains the search server web application, but it does not build its own indexes. Instead, the master uses rsync to replicate the search index files to the slave. Then, the data in the updated index files is committed into the slave search server's indexes.

SSFailover_Figure1.png

If the master becomes unavailable, due to a failure or planned downtime, the load balancer redirects search requests to the slave.

SSFailover_Figure2.png
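The routing rule the load balancer applies can be expressed as a small Java sketch. In the setup described here the rule is implemented by Apache mod_proxy configuration, not by Java code. The class name, the URLs in the usage note, and the availability check are hypothetical.

```java
import java.util.function.Predicate;

/** Hypothetical illustration of master-first routing with fallback to the slave. */
final class SearchFailoverRouter {
    private final String masterUrl;
    private final String slaveUrl;
    private final Predicate<String> isAvailable;

    SearchFailoverRouter(String masterUrl, String slaveUrl, Predicate<String> isAvailable) {
        this.masterUrl = masterUrl;
        this.slaveUrl = slaveUrl;
        this.isAvailable = isAvailable;
    }

    /** Returns the master while it is reachable, otherwise the slave. */
    String route() {
        return isAvailable.test(masterUrl) ? masterUrl : slaveUrl;
    }
}
```

For example, new SearchFailoverRouter("http://master:8080/searchserver", "http://slave:8080/searchserver", url -> false).route() returns the slave URL, which corresponds to the failure case in the second figure.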

To make it all work, several scripts must be run by cron jobs on the master and slave. Note that these scripts use rsync and Unix hard links, so they only work in Linux/Unix environments.

 

You also need to set the searchHost setting in Elastic Path to point to the proxy server.

 

You can download the scripts and the setup documentation from the downloads page.


       

The search server web application is responsible for providing searching and browsing functionality for both the storefront and the Commerce Manager (CM) client. At its heart is the Solr search engine, which sits on top of the Lucene indexes. If you wanted to leverage Solr's powerful search capabilities, you could use the Solr search APIs. However, these APIs require extensive knowledge of Solr syntax. Another option might be to use JPQL, which is basically an object oriented SQL. But this requires some JPQL specific knowledge to be able to retrieve data from the database.

 

In Elastic Path 6.1, a new set of APIs was created to allow developers to create complex, platform-independent search queries using a familiar syntax, but with the added benefits of an ecommerce domain-specific language. The new Elastic Path Query Language (EPQL) gives us the ability to be technology independent and makes the system more flexible.

 

The advanced search feature was introduced in Elastic Path Commerce 6.1, with the ability to handle products, categories, and catalogs initially. Its architecture, however, is extensible, and in the future, it will support searching for other types of domain objects, such as price lists, orders, and customers.  The advanced search functionality is currently available within the CM client and the import-export tool. The CM client integration supports only products at this time, while the import-export tool can use advanced search for looking up sets of products, categories, and catalogs.

 

Advanced search is provided as a library that can be used in any application or as a standalone program.

 

The syntax of the Elastic Path Query Language (EPQL) resembles the SQL language. A simple query consists of a single expression. An expression has the following form:

 

FIND Product WHERE <field> <operator> <value>

where

  • <field> is the field you are searching. For example, if you want to look for products of a specific brand, you would include the BrandCode field in your query. The supported fields are described in Supported fields further in this article.
  • <operator> is the operator you are using to perform the comparison.
  • <value> is the literal value you want to compare to the field value.

For example, the following query matches the product whose code is 10030205

 

FIND Product WHERE ProductCode = '10030205'

 

In addition to searching for field values, you can also search for attribute values. To search for a value in an attribute, the expression has the following form:

 

Attribute{<attribute_name>} <operator> <value>

where <attribute_name> is the name of a product attribute or product SKU attribute.

 

For example, the following query matches all products that have the Header / Model attribute set to MX:

 

FIND Product WHERE Attribute{Header / Model} = 'MX'

 

Here is a more complicated query that finds all Nike and Adidas products that cost less than $200 USD and belong to catalog A:

 

FIND Product WHERE Catalog = 'A' AND Price{A}[USD] < 200
AND (BrandName [en] = 'Nike' OR BrandName [en] = 'Adidas')
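If the list of brands varies, the query string above can be assembled programmatically. The following Java helper is a sketch only: the class and method names are hypothetical, it is not part of the EPQL API, and it assumes the values contain no quote characters, since this article does not describe how EPQL escapes them.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper that assembles an EPQL string like the example above. */
final class EpqlQueryBuilder {

    /** Builds a query for products in a catalog, under a price, and in any of the given brands. */
    static String productsUnderPrice(String catalog, String currency, int maxPrice, List<String> brands) {
        // Join the brand comparisons with OR and wrap them in parentheses.
        String brandClause = brands.stream()
                .map(brand -> "BrandName [en] = '" + brand + "'")
                .collect(Collectors.joining(" OR ", "(", ")"));
        return "FIND Product WHERE Catalog = '" + catalog + "'"
                + " AND Price{" + catalog + "}[" + currency + "] < " + maxPrice
                + " AND " + brandClause;
    }
}
```

Calling productsUnderPrice("A", "USD", 200, Arrays.asList("Nike", "Adidas")) yields the same query as the example, written on one line.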


You can run queries immediately using the CM client's UI. This is convenient for testing, but it doesn't allow you to schedule actions on the query results. Let's say you want to perform a daily action on products that match the criteria in the previous example. To do that, you would need to create a class that retrieves the products using EPQL and sends that information to a third party system for processing.

 

To use the EPQL search APIs, you need to add the com.elasticpath.core.jar and com.elasticpath.ql.jar files to your project's classpath.

 

Next, add the required Spring bean definitions. Create a file named serviceEPQL.xml in the project's conf/spring/service folder and add a reference to it from the Spring application-context.xml. For an example, take a look at the com.elasticpath.cmclient.core RCP plugin or the import/export tool.

 

Create a Java class that will be used to retrieve the products using EPQL. The following code shows how to execute an EPQL query:

 

public boolean doDailyTask() {
    // get a search engine instance
    EpQLSearchEngine searchEngine =
            getElasticPath().getBean("epQLSearchEngine");

    // create a query
    String searchString = "FIND Product WHERE Catalog = 'A' "
            + "AND Price{A}[USD] < 200 AND (BrandName [en] = 'Nike' "
            + "OR BrandName [en] = 'Adidas')";

    // get a parser and use it to validate the query
    EPQueryParser parser = searchEngine.getEpQueryParser();
    try {
        // verify throws an exception if the query is not valid
        parser.verify(searchString);
    } catch (EpQLParseException exception) {
        LOG.warn(exception);
        return false;
    }

    // if the query is okay, execute it and get the results
    SolrIndexSearchResult result = searchEngine.search(searchString);
    List<Long> productUids = result.getSearchResults();

    // ... Do some processing on the results
    return true;
}

 

The EpQLSearchEngine class has a search method that executes a query and returns the results. Before executing a query, you should make sure that it is syntactically valid.

 

The parser returned by getEpQueryParser provides a verify method that takes a query string as an argument. If there's an error in the query, it throws an exception that includes a detailed description to help you fix the problem. The search result set contains the list of UIDs of the objects that matched the query.


       

The search index rebuild process has undergone some substantial changes in Elastic Path 6.1, making it more robust and more usable.

 

In the past, to trigger a search index rebuild, you needed to manually modify the contents of some properties files on the search server. This is no longer supported. Now, Commerce Manager users with the proper permissions can initiate search index rebuilds.

 

In the Activity menu, choose Configuration. In the Configuration page on the left, under System Administration, click Search Indexes.

 

Select an index in the list, and click Rebuild Index. The rebuild should now be scheduled and the status column will show Rebuild is Scheduled.

When the index rebuild begins, the status column for the index being rebuilt will contain Rebuild in Progress. Note that you can rebuild multiple indexes at the same time. When the rebuild is done, the status is changed to Complete.


When the user runs an index rebuild from the Commerce Manager client, this is what happens behind the scenes:

 

  1. The TINDEXNOTIFY table in the database contains a row for each search index. When the user clicks Rebuild Index, the UPDATE_TYPE column for that search index's row is set to REBUILD.
  2. The quartz.xml file in com.elasticpath.search/WEB-INF/conf/spring/scheduling defines several Quartz jobs that check whether the search indexes need to be rebuilt. Each time one of these scheduled jobs is triggered, it calls the buildIndexJobRunner method in the AbstractIndexService and passes it a search index identifier. This method checks the TINDEXNOTIFY database table to see if the corresponding index needs to be rebuilt (if the value is REBUILD), has never been built, or if a previous rebuild was interrupted and did not complete successfully. If any of those conditions is true, a rebuild of that index is initiated.
  3. When an index rebuild is initiated, the INDEX_STATUS column in the TINDEXBUILDSTATUS table is set to REBUILD_IN_PROGRESS.
  4. After the index has finished rebuilding, the INDEX_STATUS column is set to COMPLETE.
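The check described in step 2 can be summarized in a short Java sketch. This is not the code of buildIndexJobRunner. The class name is hypothetical, and so are the enum constants NONE and MISSING, which stand in for "no pending request" and "never built". It also assumes that a status of REBUILD_IN_PROGRESS found at the start of a job means the previous rebuild was interrupted.

```java
/** Hypothetical illustration of the rebuild decision; not the AbstractIndexService code. */
final class IndexRebuildDecision {

    /** REBUILD comes from the article; NONE is a hypothetical placeholder. */
    enum UpdateType { NONE, REBUILD }

    /** REBUILD_IN_PROGRESS and COMPLETE come from the article; MISSING is hypothetical. */
    enum IndexStatus { MISSING, REBUILD_IN_PROGRESS, COMPLETE }

    /** Returns true if any of the three conditions from step 2 holds. */
    static boolean shouldRebuild(UpdateType updateType, IndexStatus status) {
        boolean requested = updateType == UpdateType.REBUILD;
        boolean neverBuilt = status == IndexStatus.MISSING;
        // Assumption: an in-progress status seen at job start means an interrupted rebuild.
        boolean interrupted = status == IndexStatus.REBUILD_IN_PROGRESS;
        return requested || neverBuilt || interrupted;
    }
}
```

Keeping the decision in one pure function makes it easy to see why an interrupted rebuild is retried at the next Quartz job without any user action.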

 

You can monitor the status of the rebuild process by checking the Status column in the search index list.

 

Finally, search index rebuilds are now more resilient to failures. If an index build is interrupted or fails to complete normally, the system will try to build it again on server startup or at the start of the next quartz job. If there is a rebuild in progress for an index and another rebuild for that index is requested by the user, the system will wait until the first rebuild is complete before processing the second request.
