Read and watch files

This guide covers how to read files and monitor directories for new files using the from_file operator. Whether you process individual files, batch-process directories, or set up real-time file monitoring, from_file provides a unified approach to file-based data ingestion.

The from_file operator handles various file types and formats. Start with these fundamental patterns for reading individual files.

To read a single file, specify the path to the from_file operator:

from_file "/path/to/file.json"

The operator automatically detects the file format from the file extension. This works for all supported formats, including JSON, CSV, and Parquet.
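
For example, reading a Parquet file needs nothing beyond the path; the file name below is illustrative:

from_file "/path/to/events.parquet"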

The operator also handles compressed files automatically; no additional configuration is needed:

from_file "/path/to/file.csv.gz"

Supported compression formats include gzip, bzip2, and Zstd.
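
For example, a bzip2-compressed JSON file works the same way, with both the format and the compression inferred from the extensions (the path is illustrative):

from_file "/path/to/file.json.bz2"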

When automatic format detection doesn’t suffice, specify a custom parsing pipeline:

from_file "/path/to/file.log" {
read_syslog
}

The parsing pipeline runs on the file content and must return events.
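
For instance, if a CSV export uses a .txt extension that defeats automatic detection, you can select the parser explicitly. This sketch assumes the read_csv operator; the path is hypothetical:

from_file "/path/to/export.txt" {
  read_csv
}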

You can process multiple files efficiently using glob patterns. This section covers batch processing and recursive directory operations.

Use glob patterns to process multiple files at once:

from_file "/path/to/directory/*.csv.zst"

This example processes all Zstd-compressed CSV files in the specified directory.

You can also use glob patterns to consume files regardless of their format:

from_file "~/data/**"

This processes all files in the ~/data directory and its subdirectories, automatically detecting and parsing each file format.

Use ** to match files recursively through subdirectories:

from_file "/path/to/directory/**.csv"

When you process multiple files with custom parsing, the pipeline runs separately for each file:

from_file "/path/to/directory/*.log" {
read_lines
}

Set up real-time file processing by monitoring directories for changes. These features enable continuous data ingestion workflows.

Use the watch parameter to monitor a directory for new files:

from_file "/path/to/directory/*.csv", watch=true

This sets up continuous monitoring, processing new files as they appear in the directory.

Combine watching with automatic file removal using the remove parameter:

from_file "/path/to/directory/*.csv", watch=true, remove=true

This approach lets you implement file-based queues in which processed files are automatically cleaned up.
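
A minimal sketch of such a queue, assuming NDJSON payloads and the read_ndjson parser (the path and parser choice are illustrative):

from_file "/path/to/queue/*.ndjson", watch=true, remove=true {
  read_ndjson
}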

Access files directly from cloud storage providers using their native URLs. The operator supports major cloud platforms transparently.

Access S3 buckets directly using s3:// URLs:

from_file "s3://bucket/path/to/file.csv"

Glob patterns work with S3 as well:

from_file "s3://bucket/data/**/*.parquet"

Access GCS buckets using gs:// URLs:

from_file "gs://bucket/path/to/file.csv"

Cloud storage integration uses Apache Arrow’s filesystem APIs and supports the same glob patterns and options as local files, including recursive globbing across cloud storage hierarchies.
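
For instance, watching an S3 prefix for new objects should look just like the local case; the bucket and prefix below are placeholders:

from_file "s3://bucket/incoming/*.json", watch=true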

These examples demonstrate typical use cases that combine multiple features of the from_file operator for real-world scenarios.

Monitor a log directory and process files as they arrive:

from_file "/var/log/application/*.log", watch=true {
read_lines
parse_json
}

Process all files in a data directory:

from_file "/data/exports/**.parquet"

Process archived data and remove files after successful ingestion:

from_file "/archive/*.csv.gz", remove=true

Transitioning from legacy operators

We designed the from_file operator to replace the load_file, load_s3, and load_gcs operators. While these legacy operators remain supported, from_file provides a more unified and feature-rich approach to file ingestion.

We plan to add some advanced features from the legacy operators (such as file tailing, anonymous S3 access, and Unix domain socket support) in future releases of from_file.
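
As a rough migration sketch, a legacy pipeline that loads bytes and then parses them explicitly collapses into a single from_file invocation. The exact spelling of the legacy pipeline below is an assumption for illustration:

// Legacy: load the file's bytes, then parse them as JSON.
load_file "/path/to/file.json"
read_json

// Equivalent with from_file: format detection makes the parser implicit.
from_file "/path/to/file.json"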
