
Column selection

Column helpers let you project one or more fields while keeping a file label on each yielded value. They are available on tagged iterators (those returned by .with_names(), .with_paths(), or .enumerate()), not when you iterate the reader directly with for row in read_dir(...).

Which object to use?

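| Goal                  | Iterator                                        | Column methods                            |
|-----------------------|-------------------------------------------------|-------------------------------------------|
| Label = filename stem | .with_names() or .enumerate()                   | .iter_column, .select_columns             |
| Label = full path     | .with_paths()                                   | .iter_column, .select_columns             |
| Chunked               | .enumerate() / .with_paths() on chunked reader  | .iter_column_chunks, .select_columns_chunks |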

Single column

r = read_dir("/data")

for stem, value in r.with_names().iter_column("email"):
    print(stem, value)
for path, value in r.with_paths().iter_column("email"):
    print(path, value)

If the column is missing from a file and on_mismatch="error", a ValueError is raised. With on_mismatch="skip", that file is skipped.
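
The two on_mismatch modes can be pictured with a small stand-in written in plain Python. This is a sketch of the behaviour described above, not the library's implementation: iter_column_sketch and the in-memory files mapping are hypothetical, and it assumes the check is made once per file.

```python
from typing import Iterator


def iter_column_sketch(
    files: dict[str, list[dict[str, str]]],
    column: str,
    on_mismatch: str = "error",
) -> Iterator[tuple[str, str]]:
    """Yield (label, value) pairs. Hypothetical stand-in, not the real helper."""
    for label, rows in files.items():
        # Assumption: a file counts as mismatched if any row lacks the column.
        if any(column not in row for row in rows):
            if on_mismatch == "error":
                raise ValueError(f"column {column!r} missing in {label!r}")
            if on_mismatch == "skip":
                continue  # drop the whole file, not just the offending rows
            raise ValueError(f"unknown on_mismatch: {on_mismatch!r}")
        for row in rows:
            yield label, row[column]


# Illustrative data: one file has the column, the other does not.
files = {
    "users_a": [{"email": "a@example.com"}],
    "users_b": [{"name": "no email column here"}],
}
skipped = list(iter_column_sketch(files, "email", on_mismatch="skip"))
```

Because the sketch is a generator, the ValueError in "error" mode surfaces while iterating, not when the iterator is created. Whether the real helper behaves the same way is not stated here.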

Multiple columns

for stem, row in r.with_names().select_columns(["id", "name"]):
    # row only contains requested keys
    print(stem, row)

A requested key that is not present in a file follows normal dict access rules: indexing the row with it raises KeyError if the key is absent from the row dict after the read.
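
If some files may lack a requested column, standard dict tools avoid the KeyError. The sketch below uses a hand-written row as a stand-in for what select_columns might yield in that case; the data and the assumption that the key is simply absent are illustrative.

```python
# Illustrative row, assuming the source file had no "name" column.
row = {"id": "42"}

name = row.get("name")                        # None, no exception
name_or_default = row.get("name", "unknown")  # explicit fallback value

try:
    row["name"]          # plain indexing raises when the key is absent
except KeyError:
    missing = True
else:
    missing = False
```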

Chunked column values

for stem, values in read_dir("/data", chunksize=50).enumerate().iter_column_chunks("score"):
    print(stem, values)  # list[str], len <= 50
for stem, rows in read_dir("/data", chunksize=50).enumerate().select_columns_chunks(
    ["id", "score"]
):
    print(stem, rows)  # list[dict]

The chunk methods accept an optional chunk_size argument that overrides the reader's default chunksize for that call.
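
Since chunk values arrive as strings, numeric work needs a conversion step. The following sketch computes a per-file mean from (stem, values) pairs. The chunks list is a hand-written stand-in for the iterator's output, and it assumes that a file larger than the chunk size yields several chunks under the same stem.

```python
from collections import defaultdict

# Stand-in for iter_column_chunks("score") output: (stem, list[str]) pairs.
chunks = [
    ("sales_q1", ["1.5", "2.5"]),
    ("sales_q1", ["3.0"]),
    ("sales_q2", ["4.0", "6.0"]),
]

totals: dict[str, float] = defaultdict(float)
counts: dict[str, int] = defaultdict(int)
for stem, values in chunks:
    # Convert each string to float before accumulating.
    totals[stem] += sum(float(v) for v in values)
    counts[stem] += len(values)

# One mean per file, regardless of how many chunks the file produced.
means = {stem: totals[stem] / counts[stem] for stem in totals}
```

Accumulating running totals keeps memory bounded by the chunk size rather than the file size.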

Label format note

  • with_names() / enumerate() — stem without extension (sales_q1)
  • with_paths() — full path string (/data/2024/sales_q1.csv)

In some code paths exercised by the tests, enumerate-style iterators may use the filename with its extension as the label. Prefer with_names() when you need stems consistently.
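
The relationship between the two label formats matches what the standard library's pathlib calls a stem. If a label might carry an extension or a directory, it can be normalised as sketched below. to_stem is a hypothetical helper, and whether the library itself derives labels with pathlib is an assumption.

```python
from pathlib import PurePosixPath


def to_stem(label: str) -> str:
    """Reduce a path or filename label to its stem (hypothetical helper)."""
    return PurePosixPath(label).stem


from_path = to_stem("/data/2024/sales_q1.csv")  # full path label
from_name = to_stem("sales_q1.csv")             # filename with extension
from_stem = to_stem("sales_q1")                 # already a stem
```

Only the last suffix is removed, so a stem that itself contains a dot (for example report.v2) would be shortened if passed through to_stem a second time.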

Compare to manual projection

# Manual
for row in read_dir("/data"):
    print(row["email"])

# With file attribution
for stem, email in read_dir("/data").with_names().iter_column("email"):
    print(stem, email)

Use column helpers when you need per-file provenance on each value or chunk.
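
One use of per-value provenance is finding values that occur in more than one file. A minimal sketch, with a hand-written pairs list standing in for the (stem, email) tuples that the with_names().iter_column("email") loop yields:

```python
from collections import defaultdict

# Stand-in for (stem, email) pairs; the data is illustrative.
pairs = [
    ("users_a", "ann@example.com"),
    ("users_a", "bob@example.com"),
    ("users_b", "ann@example.com"),
]

# Map each email to the set of files it was seen in.
files_by_email: dict[str, set[str]] = defaultdict(set)
for stem, email in pairs:
    files_by_email[email].add(stem)

# Keep only emails that appear in more than one file.
duplicates = {
    email: sorted(stems)
    for email, stems in files_by_email.items()
    if len(stems) > 1
}
```

The manual loop cannot support this, because each row arrives without any indication of its source file.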