Full Text Search

Full text search improves performance when searching for a specific word, term, or character string in a column.

Example

Consider the following log message:

66.249.65.159 - - [06/Nov/2014:19:10:38 +0600] "GET /news/53f8d72920ba2744fe873ebc.html HTTP/1.1" 404 177 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Based on the above separator rules, full text search finds a first major segment value of 66.249.65.159.

This major segment contains the following minor segments:

  • 66
  • 249
  • 65
  • 159

Enable

To enable full text search, set the following configuration in your transform on the column you would like to full text search:

  • set index to true
  • set index_options to the following:
{
    "name": "message",
    "datatype": {
        "type":"string",
        "index": true,
        "index_options": {
            "fulltext": true
        }
    }
}

You can use LIKE to search the full text index:

SELECT message
FROM project.table
WHERE message LIKE '%error%'
AND timestamp < now()
AND timestamp > (now() - INTERVAL 60 MINUTE)
ORDER BY timestamp DESC
LIMIT 50
SETTINGS hdx_query_debug=true

This example searches for the word error in the message column within the last hour.

You can confirm that your query uses the full text index with query debug. Check to make sure your full text indexed column name appears in the indexes_used field:

X-Hdx-Query-Stats: exec_time=107 rows_read=0 bytes_read=0 num_partitions=58 num_peers=3 query_attempts=1 memory_usage=9491822
index_stats=[{"project.table":{"columns_read":["message","timestamp"],"indexes_used":["message","timestamp"],"shard_key_values_used":[]}}]

Separators

Full text search splits text into major and minor segments. Hydrolix determines these segments using the following separators:

Major separators

  • []
  • <>
  • ()
  • {}
  • |
  • !
  • ;
  • ,
  • '
  • "
  • *
  • \n
  • \r
  • \s
  • \t
  • &
  • ?
  • +

Minor Separators

  • /
  • :
  • =
  • @
  • .
  • -
  • $
  • #
  • %
  • \
  • _

Best Practices

When field values contain punctuation symbols, follow these best practices for optimum performance:

  • Match punctuation symbols with literals, not wildcards.
  • Avoid wildcards in the beginning or middle of a term, which can reduce search performance and lead to unexpected results.