What is Enterprise Search?

Definition

Wikipedia's (typically rather dry) definition of Enterprise Search is:

Enterprise search is the practice of making content from multiple enterprise-type sources, such as databases and intranets, searchable to a defined audience.

This definition is accurate, if a little restrictive in today's market, where "the enterprise" makes use of many different software products, not all of which are as neatly pigeon-holed as "intranets" and "databases".

A brief history of the enterprise search market

For almost 20 years Enterprise Search has been characterised by huge installations of expensive software that can take an army of consultants to configure.

For this you can thank Autonomy, a Cambridge, UK-based company founded by Mike Lynch in 1996. Their flagship product, IDOL, was a powerful and complex piece of software that was installed by many large corporations.

Their success led to increased market awareness and the rise of a number of competitors. Autonomy was sold to HP in 2011 in what proved to be controversial circumstances, but many of their competitors are still around, serving the same large corporate customers with the same technology.

Some of the larger competitors included Microsoft (with their FAST Search product, acquired along with FAST in 2008) and Oracle (Endeca, acquired 2011).

Google entered the market in 2002 with the Google Search Appliance, a physical server sold as a plug-in-and-go appliance, which offered a Google Web Search-like user experience and an enterprise-friendly set of features in the background. This product has recently been discontinued, leading to bursts of publicity from other vendors and a slew of blog posts about "How to migrate from GSA to [product]".

In the last few years there have been a few newer entrants to the enterprise market, such as LucidWorks, with their Fusion product based on the open-source Apache Solr search platform.
These have innovated by offering open-source-based software, but have typically followed traditional sales patterns and offered a standard model for integration (frequently involving expensive follow-on consultancy and vast amounts of planning!).

Traditional features of Enterprise Search software

So what does all this very complicated, very expensive software do?
The basic features are actually surprisingly easy to explain.

Content collection

Content collection is the process of gathering your documents from wherever they are in your company. In the "push" model of collection, a source system is integrated with the search engine so that it connects to it and pushes new content directly to the index immediately as changes are made.

By contrast, in a "pull" model, the search software gathers content from sources using "connectors" such as web crawlers or a direct database connection. These go out and grab information from the source systems on a schedule (say, once a day).
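The difference between the two models can be sketched in a few lines of Python. This is a minimal illustration only: `Index`, `SourceSystem` and `pull` are hypothetical names invented for the example, and a real pull connector would run on a scheduler and fetch only what has changed.

```python
from dataclasses import dataclass, field


@dataclass
class Index:
    """Stand-in for a search index: maps document id to text."""
    docs: dict = field(default_factory=dict)

    def add(self, doc_id: str, text: str) -> None:
        self.docs[doc_id] = text


@dataclass
class SourceSystem:
    """Hypothetical source system, e.g. a wiki or a database table."""
    docs: dict = field(default_factory=dict)
    listeners: list = field(default_factory=list)

    def save(self, doc_id: str, text: str) -> None:
        self.docs[doc_id] = text
        # Push model: the source notifies the index as soon as a change is made.
        for listener in self.listeners:
            listener(doc_id, text)


def pull(source: SourceSystem, index: Index) -> int:
    """Pull model: a connector copies content across whenever it is run."""
    for doc_id, text in source.docs.items():
        index.add(doc_id, text)
    return len(source.docs)


pushed = Index()
pulled = Index()
wiki = SourceSystem()
wiki.listeners.append(pushed.add)

wiki.save("page-1", "Holiday policy")
stale = dict(pulled.docs)  # still empty: the pull connector has not run yet
pull(wiki, pulled)         # e.g. the once-a-day scheduled run
```

After `save`, the pushed index already holds the new page, while the pulled index stays empty until its next scheduled run. That lag is the main trade-off between the two models.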

Content processing

Your content comes in many different formats and document types, such as XML, HTML, Office documents, plain text, images and video.

During content processing, the search software converts the incoming documents to plain text. It might also attempt to normalise the content, using techniques like stemming, entity extraction, part-of-speech tagging and so forth, and it pulls out all the metadata (literally "data about data") that it can find about your document.
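As a rough illustration of this step, the sketch below turns a small HTML document into plain text, normalised tokens and metadata using only the Python standard library. The `crude_stem` function is a deliberately naive stand-in for a real stemmer, and all of the names here are assumptions made for the example rather than any product's API.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML document, plus its <title>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            # The title is treated as metadata rather than body text.
            self.metadata["title"] = data.strip()
        elif data.strip():
            self.chunks.append(data.strip())


def crude_stem(word):
    """Very rough suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def process(html):
    """Return (plain text, normalised tokens, metadata) for one document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    tokens = [crude_stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())]
    return text, tokens, parser.metadata


text, tokens, metadata = process(
    "<html><head><title>Expenses</title></head>"
    "<body><h1>Claiming expenses</h1><p>Submit receipts monthly.</p></body></html>"
)
print(text)
print(tokens)
print(metadata)
```

Entity extraction and part-of-speech tagging are left out here, as they generally need a dedicated natural language processing library.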

Indexing

The final plain text is stored, alongside its metadata, in an index - a specialised data structure that's tuned for fast lookups. Think of this as a very specific kind of database.
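One common structure for this is an inverted index, which maps each term to the documents that contain it. The sketch below is a toy version with made-up documents; it assumes simple lowercase tokenisation and ignores everything a production index adds (term positions, compression, persistence).

```python
import re
from collections import defaultdict

documents = {
    "doc-1": "Holiday policy for new starters",
    "doc-2": "Expenses policy and travel booking",
}

# Inverted index: each term maps to the set of document ids that contain it.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for term in re.findall(r"[a-z0-9]+", text.lower()):
        inverted_index[term].add(doc_id)

# Metadata is kept alongside the text, keyed by the same document id.
metadata = {
    "doc-1": {"type": "wiki", "author": "hr"},
    "doc-2": {"type": "pdf", "author": "finance"},
}

print(sorted(inverted_index["policy"]))  # ['doc-1', 'doc-2']
print(sorted(inverted_index["travel"]))  # ['doc-2']
```

Looking up a term is then a single dictionary access, rather than a scan through every document.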

Query processing

Using a webpage or, sometimes, an app, your users send queries to the system.
The system has to interpret the query, and any options supplied alongside it like filters, aggregations or facets, to work out what the user is asking for, and how they'd like the answers to be presented to them.
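A minimal sketch of query interpretation might look like the following. The `field:value` filter syntax and the `ParsedQuery` structure are assumptions made for this example; real products each have their own query language.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedQuery:
    terms: list = field(default_factory=list)    # free-text words to match
    filters: dict = field(default_factory=dict)  # exact-match restrictions
    facets: list = field(default_factory=list)   # fields to summarise results by


def parse_query(raw, facets=()):
    """Split a raw query into free-text terms and field:value filters."""
    parsed = ParsedQuery(facets=list(facets))
    for token in raw.split():
        name, sep, value = token.partition(":")
        if sep and name and value:
            parsed.filters[name.lower()] = value.lower()
        else:
            parsed.terms.append(token.lower())
    return parsed


query = parse_query("Holiday policy type:pdf", facets=["author"])
print(query.terms)    # ['holiday', 'policy']
print(query.filters)  # {'type': 'pdf'}
print(query.facets)   # ['author']
```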

Matching

The processed query is applied to the index, and the system returns results (or "hits") from source documents that match.
Some systems are able to present the document as it was indexed, perhaps showing sections of text in a highlighted format.
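To make this concrete, here is a small sketch of matching and highlighting against a toy inverted index. It assumes that every query term must be present (a plain AND match) and does no relevance ranking; the asterisks stand in for whatever highlight markup a real interface would use.

```python
import re
from collections import defaultdict

documents = {
    "doc-1": "Holiday policy for new starters",
    "doc-2": "Expenses policy and travel booking",
    "doc-3": "Office opening hours over the holiday period",
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


# Build a toy inverted index: term -> ids of documents containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in tokenize(text):
        index[term].add(doc_id)


def search(query):
    """Return the ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    matching = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matching)


def highlight(text, query):
    """Wrap each whole-word occurrence of a query term in asterisks."""
    terms = sorted(set(tokenize(query)), key=len, reverse=True)
    if not terms:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    return pattern.sub(r"*\1*", text)


hits = search("holiday")
snippets = [highlight(documents[doc_id], "holiday") for doc_id in hits]
print(hits)
print(snippets)
```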

Connectors

These are the adapters that collect content for indexing from a variety of sources, such as databases and content management systems.
Think of them as funnels, or vacuums, pulling content from your source systems and pushing it into the index.
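In code, a connector is usually just a small adapter that hides the details of one source behind a common interface. The sketch below assumes a hypothetical `fetch()` method and two made-up connectors, one reading from an in-memory SQLite database and one from a dictionary standing in for a folder of files.

```python
import sqlite3


class DatabaseConnector:
    """Hypothetical connector that reads rows from a SQL table of articles."""

    def __init__(self, connection):
        self.connection = connection

    def fetch(self):
        rows = self.connection.execute("SELECT id, title, body FROM articles")
        for row_id, title, body in rows:
            yield {"id": f"db:{row_id}", "text": f"{title}. {body}", "source": "database"}


class FolderConnector:
    """Hypothetical connector over a dict of file names to contents."""

    def __init__(self, files):
        self.files = files

    def fetch(self):
        for name, contents in self.files.items():
            yield {"id": f"file:{name}", "text": contents, "source": "files"}


def run_connectors(connectors, index):
    """Pull every document from every connector into one index (a plain dict here)."""
    for connector in connectors:
        for document in connector.fetch():
            index[document["id"]] = document
    return index


# Set up an in-memory database so the sketch is self-contained.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
db.execute(
    "INSERT INTO articles (title, body) VALUES (?, ?)",
    ("Holiday policy", "Book leave early"),
)

index = run_connectors(
    [DatabaseConnector(db), FolderConnector({"rota.txt": "Weekend rota"})],
    {},
)
print(sorted(index))  # ['db:1', 'file:rota.txt']
```

Because both connectors produce documents in the same shape, the indexing code does not need to know where any of the content came from.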

Federated search

This is the practice of "multiplexing" queries, sending them out to the source systems' own built-in search facilities and attempting to aggregate the returned results into a single, ordered result set.
This has historically been a tricky problem to solve, as no two source systems share the same set of standards for relevance, so presenting a single list of results that makes sense is tough. With the rise of sensible API-based (push) solutions, the necessity for federated search may be reducing.
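The sketch below shows why merging is awkward, and one simple way to attempt it: rescaling each source's scores to the range 0 to 1 before sorting. Min-max normalisation is an assumption chosen for illustration, not the method any particular product uses, and the two search functions are hypothetical stand-ins for source systems that score on very different scales.

```python
def normalise(results):
    """Rescale one source's scores to the range 0..1 (min-max normalisation)."""
    if not results:
        return []
    scores = [score for _, score in results]
    low, high = min(scores), max(scores)
    span = high - low
    return [
        (doc, 1.0 if span == 0 else (score - low) / span)
        for doc, score in results
    ]


def federated_search(query, sources):
    """Send the query to every source and merge the results into one ordered list."""
    merged = []
    for name, search in sources.items():
        for doc, score in normalise(search(query)):
            merged.append((score, name, doc))
    # Highest normalised score first; ties broken by source name, then title.
    merged.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(name, doc) for _, name, doc in merged]


def wiki_search(query):
    # Hypothetical source whose scores run into double figures.
    return [("Holiday policy", 12.0), ("Office hours", 3.0)]


def tracker_search(query):
    # Hypothetical source that scores on a different scale.
    return [
        ("Ticket 42: holiday rota", 8.0),
        ("Ticket 7: holiday cover", 5.0),
        ("Ticket 3: rota", 2.0),
    ]


results = federated_search("holiday", {"wiki": wiki_search, "tracker": tracker_search})
for source, title in results:
    print(source, "-", title)
```

Even after rescaling, the top hit from each source ends up tied at 1.0, whether or not the two are equally relevant. That is exactly the sort of problem that makes a sensible merged list so hard to produce.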

Tagging and bookmarking

These are ways of adding metadata, automatically or manually through a user interface, so that data can be categorised and browsed as well as searched based on text queries.

Nowadays there is some movement in this area towards using Machine Learning techniques to apply tags automatically where possible (as, frankly, users never bother to apply the metadata and tags themselves!).
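As a simple illustration of combining manual and automatic tags, the sketch below uses hand-written keyword rules in place of a Machine Learning model. That substitution is deliberate, to keep the example self-contained; the rules and names are invented for the example.

```python
import re

# Hypothetical rules: a tag is applied if any of its keywords appears in the text.
AUTO_TAG_RULES = {
    "finance": {"invoice", "expenses", "budget"},
    "hr": {"holiday", "payroll", "onboarding"},
}


def auto_tags(text):
    """Return the tags whose keywords occur in the text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return {tag for tag, keywords in AUTO_TAG_RULES.items() if words & keywords}


def tag_document(document, manual_tags=()):
    """Combine tags a user applied by hand with automatically derived ones."""
    tags = set(manual_tags) | auto_tags(document["text"])
    return {**document, "tags": sorted(tags)}


tagged = tag_document(
    {"id": "doc-1", "text": "Holiday and payroll dates"},
    manual_tags=["calendar"],
)
print(tagged["tags"])  # ['calendar', 'hr']
```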

Faceted search

Faceted search (or, more generally, the application of aggregations to search results) is a technique for splitting results up into different "facets" - aspects like document type, date range and so on - so that your users can drill down within a returned result set to find exactly what they need.

On a typical website, these are the filters you see, where you can select the sizes and colours of the goods you're purchasing, in an attempt to narrow down your search.
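Under the hood, a facet is little more than a count of results grouped by a metadata field, plus a filter applied when the user picks a value. A minimal sketch, using made-up results and field names:

```python
from collections import Counter

results = [
    {"title": "Holiday policy", "type": "wiki", "year": 2016},
    {"title": "Holiday rota", "type": "spreadsheet", "year": 2017},
    {"title": "Holiday FAQ", "type": "wiki", "year": 2017},
]


def facet_counts(results, field):
    """Count how many results fall under each value of a metadata field."""
    return Counter(doc[field] for doc in results)


def drill_down(results, field, value):
    """Keep only the results matching the facet value the user selected."""
    return [doc for doc in results if doc[field] == value]


type_facet = facet_counts(results, "type")
wiki_only = drill_down(results, "type", "wiki")
print(type_facet["wiki"], type_facet["spreadsheet"])  # 2 1
print([doc["title"] for doc in wiki_only])  # ['Holiday policy', 'Holiday FAQ']
```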

The future

As you can see, in some respects Enterprise Search has been a fairly static market for years now - lots of consolidation and acquisition, but not a lot of real innovation. Vendors' airy-fairy promises notwithstanding, most of the features described above have been available in some form since Autonomy first launched IDOL.

However, things are changing.

Over the last 10 years there's been a shift in how companies do business: increasingly it's possible to operate at scale with no on-premise servers or software, no intranet and no file servers with shared drives.

These days, many companies operate entirely in the cloud, using Google Drive or its competitors for storage, Gmail for email, Slack for chat and messaging and some kind of wiki (Confluence, perhaps) for structured management of documentation and knowledge.

Traditional Enterprise Search vendors haven't kept up with this change, and are largely still focussed on crawling file-shares and interrogating on-premise installations of heavyweight corporate software.

In a following blog post, we'll investigate how a new breed of search software, focussing on cloud integrations and rapid setup, has arisen.

In the meantime, if you find yourself asking "Where's my data? Is it in Trello, Slack, Drive or GitHub?" then you should find out more about CTX, our answer to that question. CTX gives you the power of Enterprise Search, packaged in the convenience of a modern Software-as-a-Service system - it only takes ten minutes to get up and running.