Skip to content

Full-Text Search

Enable Full-Text Search (FTS) to speed up text search queries on string columns. FTS indexes string data to locate text patterns without scanning entire column values.

Full-Text Search works best for:

  • Log analysis: Search application logs, system logs, or error messages for specific keywords or patterns
  • Large text fields: Query string columns with long text values like user agent strings, URLs, or message bodies
  • Pattern matching: Find records containing specific words or terms across millions of rows

Without FTS, text search queries scan every value in the column. With FTS enabled, Hydrolix uses the index to identify matching segments and improve query performance.

How Full-Text Search works⚓︎

Full-Text Search splits text into major and minor segments with predefined separators. This segmentation allows Hydrolix to locate matching patterns efficiently.

Example⚓︎

The following log message demonstrates how FTS segments some text:

Example Log Message
66.249.65.159 - - [06/Nov/2014:19:10:38 +0600] "GET /news/53f8d72920ba2744fe873ebc.html HTTP/1.1" 404 177 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Full-Text Search identifies the first major segment as 66.249.65.159, which contains these minor segments:

  • 66
  • 249
  • 65
  • 159

Hydrolix indexes these segments, allowing queries to match against the segments rather than scanning the full text.

Enable Full-Text Search on a column by setting index to true and index_options.fulltext to true in the transform configuration:

Transform Configuration
{
  "name": "message",
  "datatype": {
    "type": "string",
    "index": true,
    "index_options": {
      "fulltext": true
    }
  }
}

Hydrolix creates the FTS index during data ingestion.

Full-Text Search works with multiple query patterns. These operators use the FTS index for better performance.

Operator Description Example
LIKE Case-sensitive pattern matching message LIKE '%error%'
ILIKE Case-insensitive pattern matching message ILIKE '%ERROR%'
MATCH Regular expression matching match(message, 'error.*timeout')
= Direct string comparison message = 'Specific error text'
Search Query Example
1
2
3
4
5
6
7
8
SELECT message
FROM project.table
WHERE message LIKE '%error%'
  AND timestamp < now()
  AND timestamp > (now() - INTERVAL 60 MINUTE)
ORDER BY timestamp DESC
LIMIT 50
SETTINGS hdx_query_debug=true

This query searches for the word error in the message column within the last hour.

Additional examples⚓︎

Search user agent strings (case-sensitive):

User Agent Search with LIKE
1
2
3
4
5
6
7
SELECT user_agent, COUNT(*) as request_count
FROM web.access_logs
WHERE user_agent LIKE '%Chrome%'
  AND timestamp > now() - INTERVAL 1 DAY
GROUP BY user_agent
ORDER BY request_count DESC
LIMIT 10

Case-insensitive search:

Case-Insensitive Search with ILIKE
1
2
3
4
5
6
SELECT message
FROM app.logs
WHERE message ILIKE '%ERROR%'
  AND timestamp > now() - INTERVAL 1 HOUR
ORDER BY timestamp DESC
LIMIT 50

Find specific error patterns:

Error Pattern Search
1
2
3
4
5
SELECT timestamp, error_message
FROM app.logs
WHERE error_message LIKE '%timeout%'
  AND timestamp > now() - INTERVAL 1 HOUR
ORDER BY timestamp DESC

Verify index usage⚓︎

Confirm that the query uses an index by enabling query debug. When a column with Full-Text Search enabled appears in the indexes_used field, it indicates the FTS index was used for that column:

Query Debug Output
X-Hdx-Query-Stats: exec_time=107 rows_read=0 bytes_read=0 num_partitions=58 num_peers=3 query_attempts=1 memory_usage=9491822
index_stats=[{"project.table":{"columns_read":["message","timestamp"],"indexes_used":["message","timestamp"],"shard_key_values_used":[]}}]

Separators⚓︎

Full-Text Search uses two types of separators to segment text. These separators are predefined and can't be modified.

Major separators⚓︎

Major separators create text segments:

  • [] <> () {}
  • | ! ; ,
  • ' " *
  • \n \r \s \t
  • & ? +

Minor separators⚓︎

Minor separators create sub-segments within major segments:

  • / : = @
  • . - $ #
  • % \ _

Recommendations⚓︎

  • Match punctuation literally: Use literal punctuation symbols rather than wildcards for better performance.
  • Prefer trailing over leading wildcards when possible: Patterns like error% perform significantly better than leading or middle wildcards (%error or er%or), which can reduce performance and match unintended patterns. For example, er%or matches both "error" and "herbivore." When searching for terms that may appear anywhere in the text, such as log messages or URLs, leading wildcards (%error%) are often necessary despite the performance trade-off.

Considerations⚓︎

  • Storage overhead: FTS indexes require storage space. Enable FTS only on columns searched regularly with text pattern matching. To check storage usage, query the table catalog: SELECT sum(manifest_size + data_size + index_size) FROM project_name.\table_name#.catalog`;`
  • Ingestion-time indexing: Hydrolix creates FTS indexes during data ingestion. Adding FTS to existing data requires reingestion for indexing to work.
  • ALTER TABLE behavior: When you run ALTER TABLE on a column, Hydrolix rewrites the partition with the altered data and generates a new index file. This means FTS indexes are automatically updated when columns are altered.
  • Supported operations: FTS optimizes LIKE, ILIKE, MATCH, and equality (=) operations.
  • Index structure: FTS effectiveness depends on separator alignment with the data structure. Text with clear word boundaries and standard punctuation works best.
  • position() vs pattern matching: For columns without FTS, the position() function is more efficient than LIKE for substring searches. With FTS enabled, pattern matching queries (LIKE, ILIKE, MATCH) perform well and offer capabilities that position() doesn't support.

See also⚓︎