Hi, I'm Harry.

Using grep and gsplit to filter and break up a large file

February 2022

Recently we had a challenge at work where we had a large text file of over 5,000 lines that we needed to filter by a key phrase, and then output the results into multiple, sequential 100-line files. This is the sort of thing that’s actually very quick and simple to do if you know how, so I thought I’d document it here for future reference - largely for myself!

I split this into two parts, largely so I could manually check the results at each stage of the process. First, I output the whole file using cat, piped that to grep to match the string that was found on each line of the results I wanted, and then redirected those results to a separate file. At this point, I had a new file that contained only the lines I wanted.

cat ~/Desktop/large_file | grep "string to match" > ~/Desktop/filtered_results_file
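
A quick sanity check worth doing at this point (using the same example paths as above) is to count how many lines actually matched, with wc:

wc -l ~/Desktop/filtered_results_file

That gives you the size of the filtered file before you commit to splitting it up.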

Once I had this file of matched lines (which was just over 2,000 lines long), I needed to break it into a collection of 100-line files. For this, I was able to turn to gsplit (the GNU coreutils version of split). I initially used split, but realised that I wanted to suffix the files with numbers rather than letters, as we wanted to be able to provide rough estimations of completion, and that is possible with gsplit but unfortunately not with split.

gsplit -a 2 -d -l 100 ~/Desktop/filtered_results_file ~/Desktop/result_set_

This command splits the filtered results file into 100-line files, saving each one incrementally to the output path and appending the number of the file (00, 01, 02, and so on). The options I’m passing in are -a 2, which sets the suffix length; -d, to use numeric suffixes (something that wasn’t possible with split); and -l 100, which limits each output file to 100 lines.
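
As an aside, the two steps could be combined into a single pipeline, since gsplit reads from standard input when you pass - as the input file. I kept them separate so I could check the intermediate file, but a one-liner version (using the same example paths and match string as above) would look something like this:

grep "string to match" ~/Desktop/large_file | gsplit -a 2 -d -l 100 - ~/Desktop/result_set_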

The above command ended up outputting 21 separate files, named result_set_00, result_set_01, and so on. This let us get a very rough estimation of completion when working through those files.
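
If you want to double-check the result, wc can confirm that every file except the last contains exactly 100 lines (again using my example output path):

wc -l ~/Desktop/result_set_*

This prints a per-file line count plus a total, which should match the line count of the filtered results file.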