tabs are easier to parse than commas on [2013-09-15 Sun 07:35]

Comma delimited files have been the source of much pain over the past few days. Why can't everything just be tab delimited? The tools I work with aren't geared to handle subtleties like embedded quotes and embedded commas.

   $ cat large.txt | gawk '{gsub(/["\,]/,"",$6); print "\""$6"\"" }' FPAT='([^,]+)|("[^"]+")' > z.txt

gawk got me pretty close with its FPAT pattern handling, though it's really slow on a 12 GB file. csvquote was OK too, but it didn't take care of the embedded commas, which fread from data.table didn't support.

I ended up just punting on the problem and using SSIS (SQL Server Integration Services) since it was fast and easy. One of these days I need to write a utility that handles it. I started it in J but ran out of time.
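
For what it's worth, here is a rough sketch of what such a utility might look like. It is in Python rather than J, and it leans on the standard library's csv module to deal with the embedded quotes and commas. The function name csv_to_tsv and the sample row are made up for illustration, and it assumes the input uses ordinary double-quote quoting with doubled quotes as the escape. I haven't timed it on anything close to 12 GB.

```python
import csv
import io


def csv_to_tsv(src, dst):
    """Copy CSV rows from the text stream src to dst as tab-delimited lines.

    csv_to_tsv is a hypothetical helper name, not part of any library.
    """
    # csv.reader understands quoted fields, embedded commas and doubled quotes.
    for row in csv.reader(src):
        # A tab or line break inside a field would corrupt the output,
        # so flatten those characters to spaces (an assumption about what is wanted).
        cleaned = [
            field.replace("\t", " ").replace("\r", " ").replace("\n", " ")
            for field in row
        ]
        dst.write("\t".join(cleaned) + "\n")


# Small in-memory demo with an embedded comma and embedded quotes.
sample = 'id,name,note\n1,"Smith, John","said ""hi"""\n'
out = io.StringIO()
csv_to_tsv(io.StringIO(sample), out)
print(out.getvalue(), end="")
```

For a real file, the same function should work on handles from open("large.txt", newline="") and open("large.tsv", "w"), with both file names being placeholders. Rows are streamed one at a time, so the whole file never has to fit in memory.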



