[{"data":1,"prerenderedAt":1528},["ShallowReactive",2],{"blog-\u002Fblog\u002Ffixed-width-files-at-scale":3},{"id":4,"title":5,"author":6,"body":7,"coverImage":1513,"description":1514,"draft":1515,"extension":1516,"meta":1517,"navigation":405,"path":1518,"publishedAt":1519,"readingTime":534,"seo":1520,"stem":1521,"tags":1522,"__hash__":1527},"blog\u002Fblog\u002Ffixed-width-files-at-scale.md","Fixed-width files at scale","Evan Ritter",{"type":8,"value":9,"toc":1501},"minimark",[10,15,19,22,25,29,32,35,39,42,261,264,274,281,287,294,298,301,309,328,337,344,351,570,573,583,587,594,802,805,933,936,942,953,1183,1201,1205,1208,1221,1224,1251,1257,1260,1286,1301,1304,1308,1311,1319,1322,1385,1388,1391,1395,1398,1439,1446,1452,1471,1475,1478,1481,1485,1488,1497],[11,12,14],"h2",{"id":13},"how-to-ingest-hundreds-of-millions-of-rows-into-postgres-without-setting-fire-to-anything","How to ingest hundreds of millions of rows into Postgres without setting fire to anything",[16,17,18],"p",{},"There's a class of engineering work that doesn't show up on conference talks. Nobody gets invited to give a keynote about COPY-stream patterns, or how to parse a half-gigabyte text file into Postgres without your laptop fan sounding like a small aircraft. The framework hype cycle has nothing to say about character encodings written in 1989 that you still have to deal with on a Tuesday morning in 2026. Fixed-width file parsing is the kind of work that gets handed to whoever was unfortunate enough to be in the wrong meeting.",[16,20,21],{},"I've spent a lot of time on this work over the last year, and I've made every mistake worth making. 
This post is a writeup of what I'd tell myself eighteen months ago, if I could find a way of getting a message through.",[16,23,24],{},"The example throughout is a synthetic one — insurance policy records — because fixed-width formats are still surprisingly common in insurance, finance, telecoms, and a handful of other sectors where the file format outlived the mainframe it was designed for. Pick whatever boring industry you like; the patterns are the same.",[11,26,28],{"id":27},"why-fixed-width-files-still-exist","Why fixed-width files still exist",[16,30,31],{},"Every now and then someone discovers that fixed-width files exist and writes a confused tweet about it, as though anyone making them in 2026 must be doing so out of spite. The reality is gentler. Fixed-width files predate JSON, predate CSV-as-we-know-it, and were the obvious choice for the systems that originally generated them. They're trivially parseable in COBOL. They're stable. They don't have the row-length ambiguity of CSV or the schema-drift problems of JSON. If your file format spec says \"the policy number lives in characters 12 to 23 of every record, every time, forever,\" then on a long enough timeline that turns out to be a feature.",[16,33,34],{},"The downside is that they're a fiddle to work with in modern languages, and the ecosystem around them is thinner than it should be. If you're new to fixed-width files and you Google for advice, you'll find a lot of \"just use pandas,\" which is fine until your file is 12 GB and your machine has 16 GB of RAM. The grown-up answer is a streaming pipeline that lands the data in Postgres without ever holding more than a few megabytes in memory at once. 
That's what this post is about.",[11,36,38],{"id":37},"the-naive-approach-and-why-it-dies","The naive approach, and why it dies",[16,40,41],{},"The first instinct, if you've never done this before, is something like this:",[43,44,49],"pre",{"className":45,"code":46,"language":47,"meta":48,"style":48},"language-js shiki shiki-themes github-light github-dark","const lines = fs.readFileSync(path, 'utf8').split('\\n')\nfor (const line of lines) {\n  const policy = line.substring(0, 12).trim()\n  const name   = line.substring(12, 52).trim()\n  const premium = parseInt(line.substring(52, 60))\n  await pool.query(\n    'INSERT INTO policies (id, name, premium) VALUES ($1, $2, $3)',\n    [policy, name, premium]\n  )\n}\n","js","",[50,51,52,103,123,159,189,219,234,243,249,255],"code",{"__ignoreMap":48},[53,54,57,61,65,68,72,76,79,83,86,89,92,95,98,100],"span",{"class":55,"line":56},"line",1,[53,58,60],{"class":59},"szBVR","const",[53,62,64],{"class":63},"sj4cs"," lines",[53,66,67],{"class":59}," =",[53,69,71],{"class":70},"sVt8B"," fs.",[53,73,75],{"class":74},"sScJk","readFileSync",[53,77,78],{"class":70},"(path, ",[53,80,82],{"class":81},"sZZnC","'utf8'",[53,84,85],{"class":70},").",[53,87,88],{"class":74},"split",[53,90,91],{"class":70},"(",[53,93,94],{"class":81},"'",[53,96,97],{"class":63},"\\n",[53,99,94],{"class":81},[53,101,102],{"class":70},")\n",[53,104,106,109,112,114,117,120],{"class":55,"line":105},2,[53,107,108],{"class":59},"for",[53,110,111],{"class":70}," (",[53,113,60],{"class":59},[53,115,116],{"class":63}," line",[53,118,119],{"class":59}," of",[53,121,122],{"class":70}," lines) {\n",[53,124,126,129,132,134,137,140,142,145,148,151,153,156],{"class":55,"line":125},3,[53,127,128],{"class":59},"  const",[53,130,131],{"class":63}," policy",[53,133,67],{"class":59},[53,135,136],{"class":70}," line.",[53,138,139],{"class":74},"substring",[53,141,91],{"class":70},[53,143,144],{"class":63},"0",[53,146,147],{"class":70},", 
",[53,149,150],{"class":63},"12",[53,152,85],{"class":70},[53,154,155],{"class":74},"trim",[53,157,158],{"class":70},"()\n",[53,160,162,164,167,170,172,174,176,178,180,183,185,187],{"class":55,"line":161},4,[53,163,128],{"class":59},[53,165,166],{"class":63}," name",[53,168,169],{"class":59},"   =",[53,171,136],{"class":70},[53,173,139],{"class":74},[53,175,91],{"class":70},[53,177,150],{"class":63},[53,179,147],{"class":70},[53,181,182],{"class":63},"52",[53,184,85],{"class":70},[53,186,155],{"class":74},[53,188,158],{"class":70},[53,190,192,194,197,199,202,205,207,209,211,213,216],{"class":55,"line":191},5,[53,193,128],{"class":59},[53,195,196],{"class":63}," premium",[53,198,67],{"class":59},[53,200,201],{"class":74}," parseInt",[53,203,204],{"class":70},"(line.",[53,206,139],{"class":74},[53,208,91],{"class":70},[53,210,182],{"class":63},[53,212,147],{"class":70},[53,214,215],{"class":63},"60",[53,217,218],{"class":70},"))\n",[53,220,222,225,228,231],{"class":55,"line":221},6,[53,223,224],{"class":59},"  await",[53,226,227],{"class":70}," pool.",[53,229,230],{"class":74},"query",[53,232,233],{"class":70},"(\n",[53,235,237,240],{"class":55,"line":236},7,[53,238,239],{"class":81},"    'INSERT INTO policies (id, name, premium) VALUES ($1, $2, $3)'",[53,241,242],{"class":70},",\n",[53,244,246],{"class":55,"line":245},8,[53,247,248],{"class":70},"    [policy, name, premium]\n",[53,250,252],{"class":55,"line":251},9,[53,253,254],{"class":70},"  )\n",[53,256,258],{"class":55,"line":257},10,[53,259,260],{"class":70},"}\n",[16,262,263],{},"This works for a few thousand rows and then collapses in three different ways at once, each of which feels like the end of the world the first time you hit it.",[16,265,266,267,269,270,273],{},"The first is memory. ",[50,268,75],{}," loads the whole file into a single string before the script does anything else. 
For a 6 GB file, that means asking for a 6 GB string, far beyond the largest string V8 will allocate, and even if it succeeded, ",[50,271,272],{},".split('\\n')"," would produce a multi-million-element array on top of it. Node will fall over before it parses the first row.",[16,275,276,277,280],{},"The second is throughput. Each ",[50,278,279],{},"INSERT"," is a round trip to Postgres, and each round trip is a few milliseconds. At a million rows that's a few thousand seconds of pure network latency — most of an hour spent waiting on packets. The CPU is idle. The disk is idle. Postgres is idle. You're paying the round-trip tax a million times in a row.",[16,282,283,284,286],{},"The third is index updates. Every ",[50,285,279],{}," triggers an update of every index on the table. Even if each index update is fast, doing it a million times serially is enormous overhead, and it gets worse as the table grows, because each insert has to touch index pages that are less and less likely to be in cache.",[16,288,289,290,293],{},"The fix isn't a clever batched INSERT. The fix is to throw out the INSERT pattern entirely and use Postgres's bulk-load tool, ",[50,291,292],{},"COPY",".",[11,295,297],{"id":296},"the-right-architecture-stream-stage-copy","The right architecture: stream → stage → COPY",[16,299,300],{},"Here's the pipeline that actually scales. There are three stages and they pipeline cleanly:",[43,302,307],{"className":303,"code":305,"language":306},[304],"language-text","Source file  →  parse loop  →  staging file (TSV)  →  COPY into Postgres\n   (stream)      (stream)           (disk)            (single statement)\n","text",[50,308,305],{"__ignoreMap":48},[16,310,311,312,315,316,319,320,323,324,327],{},"The parse loop reads the source file line by line — ",[50,313,314],{},"readline"," over a ",[50,317,318],{},"createReadStream",", with ",[50,321,322],{},"fs.createReadStream(path, { encoding: 'latin1' })"," if you have any reason to suspect non-UTF-8 (more on that in a minute). 
It extracts fields by character offset, does any per-record validation, and writes one row per record into a tab-separated staging file in ",[50,325,326],{},"\u002Ftmp",". The staging file is the same size order as the source, so make sure you have disk for it.",[16,329,330,331,333,334,336],{},"When the parse finishes, you've got a clean TSV that Postgres can ingest in a single ",[50,332,292],{}," statement. On modern hardware, ",[50,335,292],{}," will load tens of thousands of rows per second per table — orders of magnitude faster than equivalent INSERTs, because it bypasses most of the per-statement overhead and streams data straight into the table's storage pages.",[16,338,339,340,343],{},"The reason you stage to disk rather than piping directly is partly that disk is cheap and partly that it gives you a hard separation between the parsing problem and the loading problem. If the COPY fails because of a constraint violation in row 4,392,118, you can find that row in the TSV with ",[50,341,342],{},"awk"," in two seconds. If you were piping parse output straight into COPY, you'd be re-running the entire parse to track it down.",[16,345,346,347,350],{},"In Node, the actual COPY call looks like this, using ",[50,348,349],{},"pg-copy-streams",":",[43,352,354],{"className":45,"code":353,"language":47,"meta":48,"style":48},"const copyFrom = require('pg-copy-streams').from\nconst { Client } = require('pg')\n\nconst client = new Client(\u002F* ... 
*\u002F)\nawait client.connect()\n\nconst stream = client.query(copyFrom(\n  'COPY policies (id, name, premium) FROM STDIN'\n))\nfs.createReadStream('\u002Ftmp\u002Fpolicies.tsv').pipe(stream)\n\nawait new Promise((resolve, reject) => {\n  stream.on('finish', resolve)\n  stream.on('error', reject)\n})\n",[50,355,356,376,401,407,430,443,447,467,472,476,496,501,532,549,564],{"__ignoreMap":48},[53,357,358,360,363,365,368,370,373],{"class":55,"line":56},[53,359,60],{"class":59},[53,361,362],{"class":63}," copyFrom",[53,364,67],{"class":59},[53,366,367],{"class":74}," require",[53,369,91],{"class":70},[53,371,372],{"class":81},"'pg-copy-streams'",[53,374,375],{"class":70},").from\n",[53,377,378,380,383,386,389,392,394,396,399],{"class":55,"line":105},[53,379,60],{"class":59},[53,381,382],{"class":70}," { ",[53,384,385],{"class":63},"Client",[53,387,388],{"class":70}," } ",[53,390,391],{"class":59},"=",[53,393,367],{"class":74},[53,395,91],{"class":70},[53,397,398],{"class":81},"'pg'",[53,400,102],{"class":70},[53,402,403],{"class":55,"line":125},[53,404,406],{"emptyLinePlaceholder":405},true,"\n",[53,408,409,411,414,416,419,422,424,428],{"class":55,"line":161},[53,410,60],{"class":59},[53,412,413],{"class":63}," client",[53,415,67],{"class":59},[53,417,418],{"class":59}," new",[53,420,421],{"class":74}," Client",[53,423,91],{"class":70},[53,425,427],{"class":426},"sJ8bj","\u002F* ... 
*\u002F",[53,429,102],{"class":70},[53,431,432,435,438,441],{"class":55,"line":191},[53,433,434],{"class":59},"await",[53,436,437],{"class":70}," client.",[53,439,440],{"class":74},"connect",[53,442,158],{"class":70},[53,444,445],{"class":55,"line":221},[53,446,406],{"emptyLinePlaceholder":405},[53,448,449,451,454,456,458,460,462,465],{"class":55,"line":236},[53,450,60],{"class":59},[53,452,453],{"class":63}," stream",[53,455,67],{"class":59},[53,457,437],{"class":70},[53,459,230],{"class":74},[53,461,91],{"class":70},[53,463,464],{"class":74},"copyFrom",[53,466,233],{"class":70},[53,468,469],{"class":55,"line":245},[53,470,471],{"class":81},"  'COPY policies (id, name, premium) FROM STDIN'\n",[53,473,474],{"class":55,"line":251},[53,475,218],{"class":70},[53,477,478,481,483,485,488,490,493],{"class":55,"line":257},[53,479,480],{"class":70},"fs.",[53,482,318],{"class":74},[53,484,91],{"class":70},[53,486,487],{"class":81},"'\u002Ftmp\u002Fpolicies.tsv'",[53,489,85],{"class":70},[53,491,492],{"class":74},"pipe",[53,494,495],{"class":70},"(stream)\n",[53,497,499],{"class":55,"line":498},11,[53,500,406],{"emptyLinePlaceholder":405},[53,502,504,506,508,511,514,518,520,523,526,529],{"class":55,"line":503},12,[53,505,434],{"class":59},[53,507,418],{"class":59},[53,509,510],{"class":63}," Promise",[53,512,513],{"class":70},"((",[53,515,517],{"class":516},"s4XuR","resolve",[53,519,147],{"class":70},[53,521,522],{"class":516},"reject",[53,524,525],{"class":70},") ",[53,527,528],{"class":59},"=>",[53,530,531],{"class":70}," {\n",[53,533,535,538,541,543,546],{"class":55,"line":534},13,[53,536,537],{"class":70},"  stream.",[53,539,540],{"class":74},"on",[53,542,91],{"class":70},[53,544,545],{"class":81},"'finish'",[53,547,548],{"class":70},", resolve)\n",[53,550,552,554,556,558,561],{"class":55,"line":551},14,[53,553,537],{"class":70},[53,555,540],{"class":74},[53,557,91],{"class":70},[53,559,560],{"class":81},"'error'",[53,562,563],{"class":70},", 
reject)\n",[53,565,567],{"class":55,"line":566},15,[53,568,569],{"class":70},"})\n",[16,571,572],{},"That's it. One statement, no batching logic, no progress bars to maintain. Postgres handles the throughput; you just feed it bytes.",[16,574,575,576,578,579,582],{},"If you can't or don't want to add ",[50,577,349],{}," as a dependency, the shell fallback is ",[50,580,581],{},"psql -c \"\\copy policies FROM '\u002Ftmp\u002Fpolicies.tsv'\"",", which works fine but loses you fine-grained error handling.",[11,584,586],{"id":585},"the-escape-function-and-the-gotcha-that-will-catch-you","The escape function, and the gotcha that will catch you",[16,588,589,590,593],{},"Here's where the first real lesson lives. The TSV format Postgres expects from COPY is not quite the same as a CSV. Tabs and newlines need escaping, nulls are represented as ",[50,591,592],{},"\\N",", and backslashes need doubling. So everyone writes a little helper:",[43,595,597],{"className":45,"code":596,"language":47,"meta":48,"style":48},"function escapeCopy(s) {\n  if (s === null || s === undefined) return '\\\\N'\n  if (s === '') return '\\\\N'\n  return s\n    .replace(\u002F\\\\\u002Fg, '\\\\\\\\')\n    .replace(\u002F\\t\u002Fg, '\\\\t')\n    .replace(\u002F\\n\u002Fg, '\\\\n')\n    .replace(\u002F\\r\u002Fg, '\\\\r')\n}\n",[50,598,599,615,654,675,683,715,743,770,798],{"__ignoreMap":48},[53,600,601,604,607,609,612],{"class":55,"line":56},[53,602,603],{"class":59},"function",[53,605,606],{"class":74}," escapeCopy",[53,608,91],{"class":70},[53,610,611],{"class":516},"s",[53,613,614],{"class":70},") {\n",[53,616,617,620,623,626,629,632,635,637,640,642,645,648,651],{"class":55,"line":105},[53,618,619],{"class":59},"  if",[53,621,622],{"class":70}," (s ",[53,624,625],{"class":59},"===",[53,627,628],{"class":63}," null",[53,630,631],{"class":59}," ||",[53,633,634],{"class":70}," s ",[53,636,625],{"class":59},[53,638,639],{"class":63}," 
undefined",[53,641,525],{"class":70},[53,643,644],{"class":59},"return",[53,646,647],{"class":81}," '",[53,649,650],{"class":63},"\\\\",[53,652,653],{"class":81},"N'\n",[53,655,656,658,660,662,665,667,669,671,673],{"class":55,"line":125},[53,657,619],{"class":59},[53,659,622],{"class":70},[53,661,625],{"class":59},[53,663,664],{"class":81}," ''",[53,666,525],{"class":70},[53,668,644],{"class":59},[53,670,647],{"class":81},[53,672,650],{"class":63},[53,674,653],{"class":81},[53,676,677,680],{"class":55,"line":161},[53,678,679],{"class":59},"  return",[53,681,682],{"class":70}," s\n",[53,684,685,688,691,693,696,699,701,704,706,708,711,713],{"class":55,"line":191},[53,686,687],{"class":70},"    .",[53,689,690],{"class":74},"replace",[53,692,91],{"class":70},[53,694,695],{"class":81},"\u002F",[53,697,650],{"class":698},"snhLl",[53,700,695],{"class":81},[53,702,703],{"class":59},"g",[53,705,147],{"class":70},[53,707,94],{"class":81},[53,709,710],{"class":63},"\\\\\\\\",[53,712,94],{"class":81},[53,714,102],{"class":70},[53,716,717,719,721,723,725,728,730,732,734,736,738,741],{"class":55,"line":221},[53,718,687],{"class":70},[53,720,690],{"class":74},[53,722,91],{"class":70},[53,724,695],{"class":81},[53,726,727],{"class":63},"\\t",[53,729,695],{"class":81},[53,731,703],{"class":59},[53,733,147],{"class":70},[53,735,94],{"class":81},[53,737,650],{"class":63},[53,739,740],{"class":81},"t'",[53,742,102],{"class":70},[53,744,745,747,749,751,753,755,757,759,761,763,765,768],{"class":55,"line":236},[53,746,687],{"class":70},[53,748,690],{"class":74},[53,750,91],{"class":70},[53,752,695],{"class":81},[53,754,97],{"class":63},[53,756,695],{"class":81},[53,758,703],{"class":59},[53,760,147],{"class":70},[53,762,94],{"class":81},[53,764,650],{"class":63},[53,766,767],{"class":81},"n'",[53,769,102],{"class":70},[53,771,772,774,776,778,780,783,785,787,789,791,793,796],{"class":55,"line":245},[53,773,687],{"class":70},[53,775,690],{"class":74},[53,777,91],{"class":70},[53,779,695],{"
class":81},[53,781,782],{"class":63},"\\r",[53,784,695],{"class":81},[53,786,703],{"class":59},[53,788,147],{"class":70},[53,790,94],{"class":81},[53,792,650],{"class":63},[53,794,795],{"class":81},"r'",[53,797,102],{"class":70},[53,799,800],{"class":55,"line":251},[53,801,260],{"class":70},[16,803,804],{},"Fine. Now you wire it into your parse loop:",[43,806,808],{"className":45,"code":807,"language":47,"meta":48,"style":48},"const fields = [\n  parseInt(slice(line, POLICY_ID_OFFSET, POLICY_ID_LEN)),\n  trim(slice(line, NAME_OFFSET, NAME_LEN)),\n  parseInt(slice(line, PREMIUM_OFFSET, PREMIUM_LEN)),\n]\nwriteStream.write(fields.map(escapeCopy).join('\\t') + '\\n')\n",[50,809,810,822,846,867,887,892],{"__ignoreMap":48},[53,811,812,814,817,819],{"class":55,"line":56},[53,813,60],{"class":59},[53,815,816],{"class":63}," fields",[53,818,67],{"class":59},[53,820,821],{"class":70}," [\n",[53,823,824,827,829,832,835,838,840,843],{"class":55,"line":105},[53,825,826],{"class":74},"  parseInt",[53,828,91],{"class":70},[53,830,831],{"class":74},"slice",[53,833,834],{"class":70},"(line, ",[53,836,837],{"class":63},"POLICY_ID_OFFSET",[53,839,147],{"class":70},[53,841,842],{"class":63},"POLICY_ID_LEN",[53,844,845],{"class":70},")),\n",[53,847,848,851,853,855,857,860,862,865],{"class":55,"line":125},[53,849,850],{"class":74},"  
trim",[53,852,91],{"class":70},[53,854,831],{"class":74},[53,856,834],{"class":70},[53,858,859],{"class":63},"NAME_OFFSET",[53,861,147],{"class":70},[53,863,864],{"class":63},"NAME_LEN",[53,866,845],{"class":70},[53,868,869,871,873,875,877,880,882,885],{"class":55,"line":161},[53,870,826],{"class":74},[53,872,91],{"class":70},[53,874,831],{"class":74},[53,876,834],{"class":70},[53,878,879],{"class":63},"PREMIUM_OFFSET",[53,881,147],{"class":70},[53,883,884],{"class":63},"PREMIUM_LEN",[53,886,845],{"class":70},[53,888,889],{"class":55,"line":191},[53,890,891],{"class":70},"]\n",[53,893,894,897,900,903,906,909,912,914,916,918,920,922,925,927,929,931],{"class":55,"line":221},[53,895,896],{"class":70},"writeStream.",[53,898,899],{"class":74},"write",[53,901,902],{"class":70},"(fields.",[53,904,905],{"class":74},"map",[53,907,908],{"class":70},"(escapeCopy).",[53,910,911],{"class":74},"join",[53,913,91],{"class":70},[53,915,94],{"class":81},[53,917,727],{"class":63},[53,919,94],{"class":81},[53,921,525],{"class":70},[53,923,924],{"class":59},"+",[53,926,647],{"class":81},[53,928,97],{"class":63},[53,930,94],{"class":81},[53,932,102],{"class":70},[16,934,935],{},"This will run perfectly until it hits the first row, at which point it falls over with:",[43,937,940],{"className":938,"code":939,"language":306},[304],"TypeError: s.replace is not a function\n",[50,941,939],{"__ignoreMap":48},[16,943,944,945,948,949,952],{},"The two ",[50,946,947],{},"parseInt"," calls return JavaScript numbers, and numbers don't have a ",[50,950,951],{},".replace"," method. The fix is two characters of type coercion, but the lesson is bigger: any function in a hot data path that calls string methods needs to handle the case where the input isn't a string, because the boundary between \"this came from the parser\" and \"this came from the file\" is fuzzier than you think. 
Defensive escape functions look like this:",[43,954,956],{"className":45,"code":955,"language":47,"meta":48,"style":48},"function escapeCopy(s) {\n  if (s === null || s === undefined) return '\\\\N'\n  if (typeof s === 'number') return String(s)\n  if (typeof s !== 'string') s = String(s)\n  if (s === '') return '\\\\N'\n  return s\n    .replace(\u002F\\\\\u002Fg, '\\\\\\\\')\n    .replace(\u002F\\t\u002Fg, '\\\\t')\n    .replace(\u002F\\n\u002Fg, '\\\\n')\n    .replace(\u002F\\r\u002Fg, '\\\\r')\n}\n",[50,957,958,970,998,1024,1049,1069,1075,1101,1127,1153,1179],{"__ignoreMap":48},[53,959,960,962,964,966,968],{"class":55,"line":56},[53,961,603],{"class":59},[53,963,606],{"class":74},[53,965,91],{"class":70},[53,967,611],{"class":516},[53,969,614],{"class":70},[53,971,972,974,976,978,980,982,984,986,988,990,992,994,996],{"class":55,"line":105},[53,973,619],{"class":59},[53,975,622],{"class":70},[53,977,625],{"class":59},[53,979,628],{"class":63},[53,981,631],{"class":59},[53,983,634],{"class":70},[53,985,625],{"class":59},[53,987,639],{"class":63},[53,989,525],{"class":70},[53,991,644],{"class":59},[53,993,647],{"class":81},[53,995,650],{"class":63},[53,997,653],{"class":81},[53,999,1000,1002,1004,1007,1009,1011,1014,1016,1018,1021],{"class":55,"line":125},[53,1001,619],{"class":59},[53,1003,111],{"class":70},[53,1005,1006],{"class":59},"typeof",[53,1008,634],{"class":70},[53,1010,625],{"class":59},[53,1012,1013],{"class":81}," 'number'",[53,1015,525],{"class":70},[53,1017,644],{"class":59},[53,1019,1020],{"class":74}," String",[53,1022,1023],{"class":70},"(s)\n",[53,1025,1026,1028,1030,1032,1034,1037,1040,1043,1045,1047],{"class":55,"line":161},[53,1027,619],{"class":59},[53,1029,111],{"class":70},[53,1031,1006],{"class":59},[53,1033,634],{"class":70},[53,1035,1036],{"class":59},"!==",[53,1038,1039],{"class":81}," 'string'",[53,1041,1042],{"class":70},") s 
",[53,1044,391],{"class":59},[53,1046,1020],{"class":74},[53,1048,1023],{"class":70},[53,1050,1051,1053,1055,1057,1059,1061,1063,1065,1067],{"class":55,"line":191},[53,1052,619],{"class":59},[53,1054,622],{"class":70},[53,1056,625],{"class":59},[53,1058,664],{"class":81},[53,1060,525],{"class":70},[53,1062,644],{"class":59},[53,1064,647],{"class":81},[53,1066,650],{"class":63},[53,1068,653],{"class":81},[53,1070,1071,1073],{"class":55,"line":221},[53,1072,679],{"class":59},[53,1074,682],{"class":70},[53,1076,1077,1079,1081,1083,1085,1087,1089,1091,1093,1095,1097,1099],{"class":55,"line":236},[53,1078,687],{"class":70},[53,1080,690],{"class":74},[53,1082,91],{"class":70},[53,1084,695],{"class":81},[53,1086,650],{"class":698},[53,1088,695],{"class":81},[53,1090,703],{"class":59},[53,1092,147],{"class":70},[53,1094,94],{"class":81},[53,1096,710],{"class":63},[53,1098,94],{"class":81},[53,1100,102],{"class":70},[53,1102,1103,1105,1107,1109,1111,1113,1115,1117,1119,1121,1123,1125],{"class":55,"line":245},[53,1104,687],{"class":70},[53,1106,690],{"class":74},[53,1108,91],{"class":70},[53,1110,695],{"class":81},[53,1112,727],{"class":63},[53,1114,695],{"class":81},[53,1116,703],{"class":59},[53,1118,147],{"class":70},[53,1120,94],{"class":81},[53,1122,650],{"class":63},[53,1124,740],{"class":81},[53,1126,102],{"class":70},[53,1128,1129,1131,1133,1135,1137,1139,1141,1143,1145,1147,1149,1151],{"class":55,"line":251},[53,1130,687],{"class":70},[53,1132,690],{"class":74},[53,1134,91],{"class":70},[53,1136,695],{"class":81},[53,1138,97],{"class":63},[53,1140,695],{"class":81},[53,1142,703],{"class":59},[53,1144,147],{"class":70},[53,1146,94],{"class":81},[53,1148,650],{"class":63},[53,1150,767],{"class":81},[53,1152,102],{"class":70},[53,1154,1155,1157,1159,1161,1163,1165,1167,1169,1171,1173,1175,1177],{"class":55,"line":257},[53,1156,687],{"class":70},[53,1158,690],{"class":74},[53,1160,91],{"class":70},[53,1162,695],{"class":81},[53,1164,782],{"class":63},[53,1166,695],{"clas
s":81},[53,1168,703],{"class":59},[53,1170,147],{"class":70},[53,1172,94],{"class":81},[53,1174,650],{"class":63},[53,1176,795],{"class":81},[53,1178,102],{"class":70},[53,1180,1181],{"class":55,"line":498},[53,1182,260],{"class":70},[16,1184,1185,1186,1189,1190,1192,1193,1196,1197,1200],{},"While you're there, check that your largest numeric field actually fits in a JavaScript number. ",[50,1187,1188],{},"Number.MAX_SAFE_INTEGER"," is about 9 × 10¹⁵, and any identifier longer than fifteen digits can silently lose precision through ",[50,1191,947],{},". If your fixed-width spec defines an eighteen-digit identifier, you need to keep it as a string the whole way through. Postgres's ",[50,1194,1195],{},"BIGINT"," can hold the value; JavaScript's ",[50,1198,1199],{},"Number"," cannot.",[11,1202,1204],{"id":1203},"character-encoding-the-trap-that-doesnt-announce-itself","Character encoding: the trap that doesn't announce itself",[16,1206,1207],{},"The second lesson is uglier, and it's the one most likely to leave broken data in your database for weeks before anyone notices.",[16,1209,1210,1211,1213,1214,1217,1218,293],{},"Node's ",[50,1212,318],{}," read through readline is decoded as UTF-8 by default. Most fixed-width files predate UTF-8 and use a single-byte encoding such as Latin-1 or Windows-1252. UTF-8 and those encodings agree on every byte in the ASCII range, so for English-language records everything looks fine. The moment you hit a non-ASCII character — a French accent, a German umlaut, a Welsh circumflex — Node sees an invalid UTF-8 byte sequence and replaces it with U+FFFD, the Unicode replacement character. In the database that comes back as the hex sequence ",[50,1215,1216],{},"efbfbd",", and on screen it shows up as ",[50,1219,1220],{},"�",[16,1222,1223],{},"If you're loading insurance records for English-speaking customers you may never see this. If you're loading records that contain names from anywhere else in the world, you'll see it and not know what's going on. 
The diagnostic is to look at the bytes:",[43,1225,1229],{"className":1226,"code":1227,"language":1228,"meta":48,"style":48},"language-sql shiki shiki-themes github-light github-dark","SELECT name, encode(convert_to(name, 'UTF8'), 'hex') AS hex\nFROM policies\nWHERE name ~ '[^\\x20-\\x7E]'\nLIMIT 10;\n","sql",[50,1230,1231,1236,1241,1246],{"__ignoreMap":48},[53,1232,1233],{"class":55,"line":56},[53,1234,1235],{},"SELECT name, encode(convert_to(name, 'UTF8'), 'hex') AS hex\n",[53,1237,1238],{"class":55,"line":105},[53,1239,1240],{},"FROM policies\n",[53,1242,1243],{"class":55,"line":125},[53,1244,1245],{},"WHERE name ~ '[^\\x20-\\x7E]'\n",[53,1247,1248],{"class":55,"line":161},[53,1249,1250],{},"LIMIT 10;\n",[16,1252,1253,1254,1256],{},"If the hex contains ",[50,1255,1216],{},", you've got the UTF-8 replacement character and the original bytes are gone. There's no recovery from this in the database — the source bytes were discarded at parse time. You have to re-parse the original file with the correct encoding.",[16,1258,1259],{},"The fix at parse time is one line:",[43,1261,1263],{"className":45,"code":1262,"language":47,"meta":48,"style":48},"const stream = fs.createReadStream(path, { encoding: 'latin1' })\n",[50,1264,1265],{"__ignoreMap":48},[53,1266,1267,1269,1271,1273,1275,1277,1280,1283],{"class":55,"line":56},[53,1268,60],{"class":59},[53,1270,453],{"class":63},[53,1272,67],{"class":59},[53,1274,71],{"class":70},[53,1276,318],{"class":74},[53,1278,1279],{"class":70},"(path, { encoding: ",[53,1281,1282],{"class":81},"'latin1'",[53,1284,1285],{"class":70}," })\n",[16,1287,1288,1291,1292,1296,1297,1300],{},[50,1289,1290],{},"latin1"," (also called ISO-8859-1) is the safest default for fixed-width feeds from older systems, because it's a single-byte encoding that maps every byte to a valid character. 
You won't get ",[1293,1294,1295],"em",{},"correct"," output for files that are actually UTF-8 or Windows-1252, but you won't get ",[1293,1298,1299],{},"destroyed"," output either — every byte survives the round trip, which means you can fix the encoding interpretation later if you guessed wrong.",[16,1302,1303],{},"The lesson is: always read fixed-width files in a byte-preserving encoding by default. Decide what they actually are afterwards, with evidence from the bytes themselves. The cost of getting this wrong is silent, irreversible data loss.",[11,1305,1307],{"id":1306},"drop-indexes-around-copy","Drop indexes around COPY",[16,1309,1310],{},"The third lesson is about Postgres performance and it's the one that turns an overnight job into a coffee break.",[16,1312,1313,1315,1316,1318],{},[50,1314,292],{}," is fast because it bypasses most of the overhead of ",[50,1317,279],{},". What it can't bypass is index maintenance. Every row loaded triggers an update of every index on the table, and at scale those updates dominate the runtime. On a table with five indexes and a hundred million rows, the index-maintenance time can easily be five or ten times the actual data-load time.",[16,1320,1321],{},"The fix is to drop the secondary indexes before the COPY and rebuild them afterwards. 
Keep the primary key — you want the uniqueness constraint enforced during load — but everything else comes off:",[43,1323,1325],{"className":1226,"code":1324,"language":1228,"meta":48,"style":48},"BEGIN;\nDROP INDEX idx_policies_premium;\nDROP INDEX idx_policies_created;\nDROP INDEX idx_policies_region;\n\nCOPY policies (id, name, premium, created, region)\nFROM '\u002Ftmp\u002Fpolicies.tsv';\n\nCREATE INDEX idx_policies_premium ON policies (premium);\nCREATE INDEX idx_policies_created ON policies (created);\nCREATE INDEX idx_policies_region  ON policies (region);\nCOMMIT;\n",[50,1326,1327,1332,1337,1342,1347,1351,1356,1361,1365,1370,1375,1380],{"__ignoreMap":48},[53,1328,1329],{"class":55,"line":56},[53,1330,1331],{},"BEGIN;\n",[53,1333,1334],{"class":55,"line":105},[53,1335,1336],{},"DROP INDEX idx_policies_premium;\n",[53,1338,1339],{"class":55,"line":125},[53,1340,1341],{},"DROP INDEX idx_policies_created;\n",[53,1343,1344],{"class":55,"line":161},[53,1345,1346],{},"DROP INDEX idx_policies_region;\n",[53,1348,1349],{"class":55,"line":191},[53,1350,406],{"emptyLinePlaceholder":405},[53,1352,1353],{"class":55,"line":221},[53,1354,1355],{},"COPY policies (id, name, premium, created, region)\n",[53,1357,1358],{"class":55,"line":236},[53,1359,1360],{},"FROM '\u002Ftmp\u002Fpolicies.tsv';\n",[53,1362,1363],{"class":55,"line":245},[53,1364,406],{"emptyLinePlaceholder":405},[53,1366,1367],{"class":55,"line":251},[53,1368,1369],{},"CREATE INDEX idx_policies_premium ON policies (premium);\n",[53,1371,1372],{"class":55,"line":257},[53,1373,1374],{},"CREATE INDEX idx_policies_created ON policies (created);\n",[53,1376,1377],{"class":55,"line":498},[53,1378,1379],{},"CREATE INDEX idx_policies_region  ON policies (region);\n",[53,1381,1382],{"class":55,"line":503},[53,1383,1384],{},"COMMIT;\n",[16,1386,1387],{},"Rebuilding an index in one pass against a fully-loaded table is dramatically faster than maintaining the same index row-by-row during load. 
The B-tree build can be parallelised, the sort is efficient, and there's no transactional bookkeeping per row. The same hundred-million-row load that took two hours with indexes live will take twenty minutes with indexes dropped and rebuilt.",[16,1389,1390],{},"There's an obvious caveat: DROP INDEX takes an exclusive lock on the table, and inside a single transaction that lock is held until COMMIT, so other sessions can't even read the table while the load runs. Don't do this on a live table that users are reading from. For an offline rebuild or a fresh load it's the right pattern.",[11,1392,1394],{"id":1393},"dont-let-the-shell-touch-your-data","Don't let the shell touch your data",[16,1396,1397],{},"The fourth lesson is one I learned the long way. When you're prototyping, it's tempting to glue stages together with shell:",[43,1399,1403],{"className":1400,"code":1401,"language":1402,"meta":48,"style":48},"language-bash shiki shiki-themes github-light github-dark","cat input.dat | node parse.js > \u002Ftmp\u002Fstaging.tsv\npsql -c \"\\copy policies FROM '\u002Ftmp\u002Fstaging.tsv'\"\n","bash",[50,1404,1405,1428],{"__ignoreMap":48},[53,1406,1407,1410,1413,1416,1419,1422,1425],{"class":55,"line":56},[53,1408,1409],{"class":74},"cat",[53,1411,1412],{"class":81}," input.dat",[53,1414,1415],{"class":59}," |",[53,1417,1418],{"class":74}," node",[53,1420,1421],{"class":81}," parse.js",[53,1423,1424],{"class":59}," >",[53,1426,1427],{"class":81}," \u002Ftmp\u002Fstaging.tsv\n",[53,1429,1430,1433,1436],{"class":55,"line":105},[53,1431,1432],{"class":74},"psql",[53,1434,1435],{"class":63}," -c",[53,1437,1438],{"class":81}," \"\\copy policies FROM '\u002Ftmp\u002Fstaging.tsv'\"\n",[16,1440,1441,1442,1445],{},"This works until the data contains anything the shell wants to interpret. Quotes in policy holders' names break some quoting strategies. Newlines in free-text fields confuse ",[50,1443,1444],{},"\\copy",". Backticks anywhere on the line set off command substitution if you're using the wrong kind of heredoc. 
And the failure mode is rarely a clean error — it's silently truncated rows, mysteriously missing records, or, my personal favourite, a heredoc that swallows the next three shell commands because its closing marker got eaten by a paste-buffer hiccup.",[16,1447,1448,1449,1451],{},"The discipline is: do not pipe binary or arbitrary-text data through shell. Use the libraries directly. ",[50,1450,349],{}," connects a Node Readable stream to a Postgres COPY in-process, with no shell anywhere in the loop. The bytes go from your parser straight into Postgres's wire protocol. Nothing in between gets a chance to misinterpret a backslash.",[16,1453,1454,1455,1458,1459,1462,1463,1466,1467,1470],{},"If you can't avoid shell for some reason — maybe you're stuck with ",[50,1456,1457],{},"psql -c"," for environment reasons — at minimum pipe via files on disk rather than via ",[50,1460,1461],{},"|",", quote everything aggressively, and never use unquoted heredocs (",[50,1464,1465],{},"\u003C\u003CEOF"," instead of the quoted form ",[50,1468,1469],{},"\u003C\u003C'EOF'",") when the body contains anything user-derived.",[11,1472,1474],{"id":1473},"what-this-actually-performs-like","What this actually performs like",[16,1476,1477],{},"Concrete numbers, on a single mid-range server: a 6.5 GB gzipped fixed-width file, ~15 million records, three target tables with two to five indexes each. End to end, including parse, TSV write, COPY, and index rebuild: twelve to eighteen minutes. The bottleneck is the COPY phase on the largest table; everything else is comfortably overlapped with it.",[16,1479,1480],{},"The same job done the naive way — row-by-row INSERT with indexes live — runs for somewhere between eight and twenty hours depending on the indexes, and in the worst case it never finishes at all because autovacuum can't keep up with the dead-tuple churn from constraint violations on retried inserts. 
The COPY approach isn't a 10% optimisation; it's a different category of operation.",[11,1482,1484],{"id":1483},"the-boring-layer-matters","The boring layer matters",[16,1486,1487],{},"The lesson I keep coming back to with this kind of work is that the unglamorous parts of an infrastructure are doing most of the load-bearing. Anyone can wire up a dashboard. The thing that determines whether the dashboard is showing fresh data at 8 AM or stale data at noon is the ingest pipeline, and the ingest pipeline is fixed-width files, character encodings, COPY semantics, and disk layout. It's the layer that gets the least attention and earns most of the trust.",[16,1489,1490,1491,1493,1494,1496],{},"If you're working with feeds like these, get the pipeline right early. The patterns above are not novel — Postgres has had ",[50,1492,292],{}," since the 1990s, and ",[50,1495,349],{}," is unglamorous library code — but they're under-documented in the form an engineer actually needs them, which is \"the four things that are about to bite you, in order.\" Hopefully this saves someone the eighteen months I spent learning them.",[1498,1499,1500],"style",{},"html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html.dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: 
var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html pre.shiki code .szBVR, html code.shiki .szBVR{--shiki-default:#D73A49;--shiki-dark:#F97583}html pre.shiki code .sj4cs, html code.shiki .sj4cs{--shiki-default:#005CC5;--shiki-dark:#79B8FF}html pre.shiki code .sVt8B, html code.shiki .sVt8B{--shiki-default:#24292E;--shiki-dark:#E1E4E8}html pre.shiki code .sScJk, html code.shiki .sScJk{--shiki-default:#6F42C1;--shiki-dark:#B392F0}html pre.shiki code .sZZnC, html code.shiki .sZZnC{--shiki-default:#032F62;--shiki-dark:#9ECBFF}html pre.shiki code .sJ8bj, html code.shiki .sJ8bj{--shiki-default:#6A737D;--shiki-dark:#6A737D}html pre.shiki code .s4XuR, html code.shiki .s4XuR{--shiki-default:#E36209;--shiki-dark:#FFAB70}html pre.shiki code .snhLl, html code.shiki .snhLl{--shiki-default:#22863A;--shiki-default-font-weight:bold;--shiki-dark:#85E89D;--shiki-dark-font-weight:bold}",{"title":48,"searchDepth":105,"depth":105,"links":1502},[1503,1504,1505,1506,1507,1508,1509,1510,1511,1512],{"id":13,"depth":105,"text":14},{"id":27,"depth":105,"text":28},{"id":37,"depth":105,"text":38},{"id":296,"depth":105,"text":297},{"id":585,"depth":105,"text":586},{"id":1203,"depth":105,"text":1204},{"id":1306,"depth":105,"text":1307},{"id":1393,"depth":105,"text":1394},{"id":1473,"depth":105,"text":1474},{"id":1483,"depth":105,"text":1484},"\u002Fblog\u002Ffixed-width-files-at-scale\u002Fcover.png","How to ingest hundreds of millions of rows into Postgres without setting fire to anything — streaming, COPY, encodings, and the gotchas in order.",false,"md",{},"\u002Fblog\u002Ffixed-width-files-at-scale","2026-06-06",{"title":5,"description":1514},"blog\u002Ffixed-width-files-at-scale",[1523,1524,1525,1526],"postgres","data-pipelines","nodejs","etl","p3uH_h0R4x-XKS-n5L7zo_0jq1gHw4m9y3AOxLgnyiA",1780745938947]