Full Text Search

Hydrolix supports full text search.
Text is first split into major segments, and each major segment is then split into minor segments. Each level has its own set of separators.
Major separators:
[ ] < > ( ) { } | ! ; , ' " * \n \r \s \t & ? +
Minor separators:
/ : = @ . - $ # % \ _

Let's take the following log message:

66.249.65.159 - - [06/Nov/2014:19:10:38 +0600] "GET /news/53f8d72920ba2744fe873ebc.html HTTP/1.1" 404 177 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The first major segment is 66.249.65.159, which is then split into the following minor segments:

  • 66
  • 249
  • 65
  • 159
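
The two-level split described above can be sketched in a few lines of Python. This is only an illustration of the splitting rule, not Hydrolix's indexer: the function names `split_on` and `segment` are hypothetical, `\s` is treated as a plain space, and dropping empty segments is an assumption.

```python
# Separator sets copied from the lists above.
# Assumption: "\s" in the major list stands for a plain space.
MAJOR_SEPARATORS = "[]<>(){}|!;,'\"*\n\r \t&?+"
MINOR_SEPARATORS = "/:=@.-$#%\\_"


def split_on(text, separators):
    """Split text on any character in separators, dropping empty pieces."""
    pieces, current = [], []
    for ch in text:
        if ch in separators:
            if current:
                pieces.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        pieces.append("".join(current))
    return pieces


def segment(text):
    """Return a list of (major_segment, minor_segments) pairs."""
    return [(major, split_on(major, MINOR_SEPARATORS))
            for major in split_on(text, MAJOR_SEPARATORS)]


if __name__ == "__main__":
    sample = "66.249.65.159 - - [06/Nov/2014:19:10:38 +0600]"
    for major, minors in segment(sample):
        print(major, minors)
```

The first pair produced for the sample is 66.249.65.159 with the minor segments 66, 249, 65 and 159, matching the list above.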

To enable full text search, define the following for the column in the transform:

{
    "name": "message",
    "datatype": {
        "type": "string",
        "index": true,
        "index_options": {
            "fulltext": true,
            "major_separators": "[ ] < > ( ) { } | ! ; , ' \" * \\n \\r \\s \\t & ? +",
            "minor_separators": "\/ : = @ . - $ # % \\ _"
        }
    }
}

In this example, the column message is a string with full text search enabled, using the default separators.
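
Because the separators are written as JSON strings, some of them carry an extra level of escaping. The Python sketch below decodes the column definition with the standard json module to show what the separator strings contain once parsed. Treating each string as a space-delimited list of tokens is an assumption based on how the lists are laid out, not a documented rule.

```python
import json

# The column definition from the transform, kept as raw text so that the
# JSON escapes (\" \\ \/) reach the JSON parser unchanged.
COLUMN_JSON = r"""
{
    "name": "message",
    "datatype": {
        "type": "string",
        "index": true,
        "index_options": {
            "fulltext": true,
            "major_separators": "[ ] < > ( ) { } | ! ; , ' \" * \\n \\r \\s \\t & ? +",
            "minor_separators": "\/ : = @ . - $ # % \\ _"
        }
    }
}
"""

column = json.loads(COLUMN_JSON)
options = column["datatype"]["index_options"]

# Assumption: tokens in each separator string are delimited by single spaces.
major = options["major_separators"].split(" ")
minor = options["minor_separators"].split(" ")

if __name__ == "__main__":
    print("major:", major)
    print("minor:", minor)
```

After decoding, `\"` becomes a double quote, `\/` becomes a slash, and `\\n` becomes the two characters backslash and n, so the decoded strings match the separator lists given at the top of this page.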

By default, Hydrolix uses LIKE to search the full text index:

SELECT message
FROM project.table
WHERE message LIKE '%error%'
AND timestamp < now()
AND timestamp > (now() - INTERVAL 60 MINUTE)
ORDER BY timestamp DESC
LIMIT 50
SETTINGS hdx_query_debug=true

In this example, we search the message column for the word error over the last hour.

With the hdx_query_debug=true setting, the query statistics in the response show that the index was used for this query:

X-Hdx-Query-Stats: exec_time=107 rows_read=0 bytes_read=0 num_partitions=58 num_peers=3 query_attempts=1 memory_usage=9491822
index_stats=[{"project.table":{"columns_read":["message","timestamp"],"indexes_used":["message","timestamp"],"shard_key_values_used":[]}}]
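
To check index usage programmatically, the statistics can be parsed on the client side. The Python sketch below assumes the header value has the shape shown above: space-separated key=value pairs followed by an index_stats field holding JSON as the last field. The helper names `parse_query_stats` and `index_used` are hypothetical and not part of any Hydrolix client.

```python
import json


def parse_query_stats(header_value):
    """Parse the value of an X-Hdx-Query-Stats header into a dict.

    Assumption: index_stats is the last field and contains valid JSON.
    """
    scalar_part, found, index_part = header_value.partition("index_stats=")
    stats = {}
    for token in scalar_part.split():
        key, _, value = token.partition("=")
        stats[key] = int(value) if value.isdigit() else value
    if found:
        stats["index_stats"] = json.loads(index_part)
    return stats


def index_used(stats, table, column):
    """Return True if column appears in indexes_used for table."""
    for entry in stats.get("index_stats", []):
        if column in entry.get(table, {}).get("indexes_used", []):
            return True
    return False


if __name__ == "__main__":
    # Header value copied from the example above, joined onto one line.
    sample = (
        'exec_time=107 rows_read=0 bytes_read=0 num_partitions=58 '
        'num_peers=3 query_attempts=1 memory_usage=9491822 '
        'index_stats=[{"project.table":{"columns_read":["message","timestamp"],'
        '"indexes_used":["message","timestamp"],"shard_key_values_used":[]}}]'
    )
    stats = parse_query_stats(sample)
    print(index_used(stats, "project.table", "message"))
```

A column that appears under indexes_used, as message does here, was served from its index rather than scanned.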

Enabling full text search makes filtering and searching for words much faster, with text split on the standard delimiters.