Write a file duplicate finder. Seriously.

You will learn something about I/O handling, data structures like maps, loops, and building command-line applications, and as a bonus you will have a useful tool for cleaning up your hard drive.

My version accepts a path as an argument. While the program traverses the path, it clusters files by size. After the sizes of all files have been collected, the program calculates a checksum for every file that shares its size with at least one other file. When the checksums match, the paths of the duplicates are printed to stdout and I can decide which one to trash.
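The two passes described above can be sketched in a few lines of Python. This is only an illustration, not my actual implementation: the function names `file_checksum` and `find_duplicates`, the choice of SHA-256, and the 1 MiB read size are all assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return a list of groups; each group is a sorted list of identical files."""
    # Pass 1: cluster regular files by size while walking the tree.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and special files
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue  # file vanished or is unreadable

    # Pass 2: checksum only files that share their size with another file.
    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_digest = defaultdict(list)
        for path in paths:
            try:
                by_digest[file_checksum(path)].append(path)
            except OSError:
                continue
        duplicates.extend(
            sorted(group) for group in by_digest.values() if len(group) > 1
        )
    return duplicates
```

Calling `find_duplicates(".")` and printing each group, one path per line, gives roughly the output described above. Grouping by size first is what keeps this cheap: a file with a unique size is never opened at all.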

In the first iterations of development I got plenty of duplicates because of the Ruby gem and Node.js files scattered across my projects, so I limited the scan to files above a certain size.
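A size threshold like that is easy to wire into the directory walk and the command line. The sketch below is again hypothetical: the `--min-size` flag, its 1 MiB default, and the helper names `parse_args` and `files_at_least` are invented for the example, and it treats the threshold as inclusive.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the path argument and a hypothetical --min-size option."""
    parser = argparse.ArgumentParser(description="Find duplicate files.")
    parser.add_argument("path", help="directory to scan")
    parser.add_argument(
        "--min-size",
        type=int,
        default=1024 * 1024,  # assumed default: 1 MiB
        help="ignore files smaller than this many bytes",
    )
    return parser.parse_args(argv)


def files_at_least(root, min_size):
    """Yield (path, size) for regular files under root of at least min_size bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and special files
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable
            if size >= min_size:
                yield path, size
```

Filtering during the walk means the thousands of tiny dependency files never reach the size map, let alone the checksum step.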

This way I recovered over 8 GB of duplicated files, mostly audio, image and ebook files, on my MacBook, which has a 128 GB SSD. I can use the same tool on any of my Linux servers, with the same command-line options, to tidy them up too.

No more acrobatics with differing switches between Mac and Linux command-line tools.