Read and watch files

This guide covers how to read files and monitor directories for new files using the from_file operator. Whether you process individual files, batch-process directories, or set up real-time file monitoring, from_file provides a unified approach to file-based data ingestion.

The from_file operator handles various file types and formats. Start with these fundamental patterns for reading individual files.

To read a single file, specify the path to the from_file operator:

from_file "/path/to/file.json"

The operator automatically detects the file format from the file extension. This works for all supported formats, including JSON, CSV, and Parquet.
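
For example, reading a Parquet file needs nothing beyond the path; the file name below is illustrative:

from_file "/path/to/events.parquet"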

The operator also handles compressed files automatically; no additional configuration is needed:

from_file "/path/to/file.csv.gz"

Supported compression formats include gzip, bzip2, and Zstd.
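
For example, a bzip2-compressed JSON file works the same way, with both the format and the compression inferred from the extensions (the path is illustrative):

from_file "/path/to/file.json.bz2"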

When automatic format detection doesn’t suffice, specify a custom parsing pipeline:

from_file "/path/to/file.log" {
read_syslog
}

The parsing pipeline runs on the file content and must return events.
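
For instance, if a CSV export uses a .txt extension that defeats automatic detection, you can select the parser explicitly. This sketch assumes the read_csv operator; the path is hypothetical:

from_file "/path/to/export.txt" {
  read_csv
}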

You can process multiple files efficiently using glob patterns. This section covers batch processing and recursive directory operations.

Use glob patterns to process multiple files at once:

from_file "/path/to/directory/*.csv.zst"

This example processes all Zstd-compressed CSV files in the specified directory.

You can also use glob patterns to consume files regardless of their format:

from_file "~/data/**"

This processes all files in the ~/data directory and its subdirectories, automatically detecting and parsing each file format.

Use ** to match files recursively through subdirectories:

from_file "/path/to/directory/**.csv"

When you process multiple files with custom parsing, the pipeline runs separately for each file:

from_file "/path/to/directory/*.log" {
read_lines
}

Set up real-time file processing by monitoring directories for changes. These features enable continuous data ingestion workflows.

Use the watch parameter to monitor a directory for new files:

from_file "/path/to/directory/*.csv", watch=true

This sets up continuous monitoring, processing new files as they appear in the directory.

Combine watching with automatic file removal using the remove parameter:

from_file "/path/to/directory/*.csv", watch=true, remove=true

This approach lets you implement file-based queues in which processed files are automatically cleaned up.
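
A minimal sketch of such a queue, assuming NDJSON payloads and the read_ndjson parser (the path and parser choice are illustrative):

from_file "/path/to/queue/*.ndjson", watch=true, remove=true {
  read_ndjson
}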

Access files directly from cloud storage providers using their native URLs. The operator supports major cloud platforms transparently.

Access S3 buckets directly using s3:// URLs:

from_file "s3://bucket/path/to/file.csv"

Glob patterns work with S3 as well:

from_file "s3://bucket/data/**/*.parquet"

Access GCS buckets using gs:// URLs:

from_file "gs://bucket/path/to/file.csv"

Cloud storage integration uses Apache Arrow’s filesystem APIs and supports the same glob patterns and options as local files, including recursive globbing across cloud storage hierarchies.
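
For instance, watching an S3 prefix for new objects should look just like the local case; the bucket and prefix below are placeholders:

from_file "s3://bucket/incoming/*.json", watch=true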

These examples demonstrate typical use cases that combine multiple features of the from_file operator for real-world scenarios.

Monitor a log directory and process files as they arrive:

from_file "/var/log/application/*.log", watch=true {
read_lines
parse_json
}

Process all files in a data directory:

from_file "/data/exports/**.parquet"

Process archived data and remove files after successful ingestion:

from_file "/archive/*.csv.gz", remove=true

Transitioning from legacy operators

We designed the from_file operator to replace the load_file, load_s3, and load_gcs operators. While these legacy operators remain supported, from_file provides a more unified and feature-rich approach to file ingestion.

We plan to add some advanced features from the legacy operators (such as file tailing, anonymous S3 access, and Unix domain socket support) in future releases of from_file.
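
As a rough migration sketch, a legacy pipeline that loads bytes and then parses them explicitly collapses into a single from_file invocation. The exact spelling of the legacy pipeline below is an assumption for illustration:

// Legacy: load the file's bytes, then parse them as JSON.
load_file "/path/to/file.json"
read_json

// Equivalent with from_file: format detection makes the parser implicit.
from_file "/path/to/file.json"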
