Headers and schema validation
csvdir can enforce that every file in a directory shares the same columns before yielding rows.
How matching works (CsvDir / CsvChunksDir)
Header comparison uses column names as a set:
- Column order does not matter across files read by the dict readers
- Extra or missing column names trigger a mismatch
# File A: id,name,age
# File B: age,id,name → OK (same names)
# File C: id,name → mismatch (missing age)
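The comparison above can be sketched with plain Python sets. This is a minimal illustration of set-based matching, not csvdir's actual implementation; the helper name `diff_headers` is hypothetical.

```python
def diff_headers(expected, actual):
    """Compare two header lists as sets, ignoring column order.

    Hypothetical helper for illustration; not part of csvdir.
    Returns (missing, extra) as sorted lists of column names.
    """
    expected_set, actual_set = set(expected), set(actual)
    missing = sorted(expected_set - actual_set)  # expected, but absent from the file
    extra = sorted(actual_set - expected_set)    # present in the file, but not expected
    return missing, extra


file_a = ["id", "name", "age"]
file_b = ["age", "id", "name"]
file_c = ["id", "name"]

print(diff_headers(file_a, file_b))  # ([], [])
print(diff_headers(file_a, file_c))  # (['age'], [])
```

An empty pair means the two files match; anything else is the missing/extra detail a mismatch report would need.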
When that kind of mismatch is detected, behavior depends on on_mismatch:
| Value | Behavior |
|---|---|
| "error" (default) | Raise ValueError with missing/extra column detail |
| "skip" | Skip the entire file (no rows yielded from it) |
Regression check: reorder vs CsvDirFile
Because matching is set-based, permuting header order between files does not count as a mismatch for read_dir.
CsvDirFile, by contrast, requires the header sequence to match; see below.
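The difference between the two rules comes down to comparing sets versus comparing sequences. The two helpers below are hypothetical and only illustrate the distinction; they are not csvdir functions.

```python
def same_column_set(a, b):
    """Order-insensitive check, as described for the dict readers."""
    return set(a) == set(b)


def same_column_sequence(a, b):
    """Order-sensitive check, as described for CsvDirFile stitching."""
    return list(a) == list(b)


first = ["id", "name", "age"]
reordered = ["age", "id", "name"]

print(same_column_set(first, reordered))       # True
print(same_column_sequence(first, reordered))  # False
```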
strict_headers
When strict_headers is True:
- Read files in sorted path order (see Discovery).
- The first file’s column set establishes the pinned schema unless expected_headers is already set.
- For each subsequent file, compare headers to that schema (order ignored).
Combine with on_mismatch="skip" to build a stream from only compatible files:
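As a rough picture of what this combination does, here is a standard-library sketch that mimics the described behavior. It is not csvdir's code: the function name `iter_compatible_rows`, the `*.csv` glob, and the error message format are all assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path


def iter_compatible_rows(directory, expected_headers=None, on_mismatch="error"):
    """Yield dict rows from each *.csv file in sorted path order.

    Hypothetical stand-in for strict header checking; not csvdir API.
    """
    # The pinned schema lives in a local variable, so nothing outside is mutated.
    pinned = set(expected_headers) if expected_headers is not None else None
    for path in sorted(Path(directory).glob("*.csv")):  # assumed discovery rule
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle)
            headers = set(reader.fieldnames or [])
            if pinned is None:
                pinned = headers  # the first file pins the schema
            elif headers != pinned:
                if on_mismatch == "skip":
                    continue  # drop the whole file, yield none of its rows
                missing = sorted(pinned - headers)
                extra = sorted(headers - pinned)
                # Message format is made up for this sketch.
                raise ValueError(f"{path.name}: missing={missing} extra={extra}")
            yield from reader


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.csv").write_text("id,name,age\n1,Ann,30\n")
    (root / "b.csv").write_text("age,id,name\n41,2,Bob\n")
    (root / "c.csv").write_text("id,name\n3,Cy\n")
    rows = list(iter_compatible_rows(root, on_mismatch="skip"))

print([row["id"] for row in rows])  # ['1', '2']
```

Here `a.csv` pins the schema, `b.csv` matches despite the reordered columns, and `c.csv` is skipped because it lacks `age`.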
Iteration does not mutate the CsvDir.expected_headers field; the pinned schema applies only inside that iteration.
expected_headers
Supply an explicit schema that every file must match:
You do not need strict_headers=True when expected_headers is provided — the list is always enforced.
Error messages
Mismatch errors look like:
CsvDirFile (pandas) — stricter stitching rules
CsvDirFile builds one physical CSV header line followed by concatenated bodies. Stitching compares headers in order (sequence-sensitive): the emitted header establishes column order, and subsequent files must match that header line exactly (after normalization). That is stronger than dict iterators, which compare sets of names.
Choose the canonical sequence this way:
- If expected_headers is set, that list defines canonical column order.
- Else if strict_headers is True, canonical order is taken from the first discovered file in sorted path order (same sorting as CsvDir). Name files so your intended baseline sorts first, e.g. aaa_main.csv, zzz_extra.csv.
- Otherwise the canonical sequence is the lexicographically smallest delimiter-joined header among scanned files.
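The three rules above can be sketched as a single selection function. This is an illustration under stated assumptions, not CsvDirFile's implementation: the name `choose_canonical_header` and the `headers_by_path` mapping are hypothetical, and "sorted path order" is assumed to mean plain string ordering of paths.

```python
def choose_canonical_header(headers_by_path, expected_headers=None,
                            strict_headers=False, delimiter=","):
    """Pick the canonical column sequence by the three rules above.

    headers_by_path maps a file path (str) to its list of column names.
    Hypothetical helper for illustration; not part of csvdir.
    """
    if expected_headers is not None:
        return list(expected_headers)          # rule 1: explicit schema wins
    if strict_headers:
        first_path = min(headers_by_path)      # rule 2: first file in sorted order
        return list(headers_by_path[first_path])
    # Rule 3: smallest header when joined by the delimiter and compared as a string.
    return list(min(headers_by_path.values(), key=delimiter.join))


headers = {
    "zzz_extra.csv": ["age", "id", "name"],
    "aaa_main.csv": ["id", "name", "age"],
}

print(choose_canonical_header(headers, strict_headers=True))
# ['id', 'name', 'age']
print(choose_canonical_header(headers))
# ['age', 'id', 'name']
```

With strict_headers the baseline comes from `aaa_main.csv` because it sorts first; without it, `age,id,name` wins because it is the smaller joined string.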
Row order: body lines appear in sorted file path order, matching traversal order used by CsvDir.
For pandas, assume column order in the stitched stream matters. For tolerant column-set matching alone, iterate with read_dir first.