The long tail

As with other endeavors across the net, I have experimented with a variety of platforms and left breadcrumbs of blogging fits and starts across several of them. I have decided to restart this one, in the hope that it grows organically and persistently, like a daily practice.

This is week four of Zipfian Academy. I’m in an amazing group of 13 fellow students and 3 (soon to be 6) teachers. It’s been a wonderful experience because it feels more like a start-up than grad school. All lessons are distributed via GitHub, so one gets used to the daily forking, branching, and code reviews. The learning is collaborative, which is also why I’m excited to be here.

Earlier in the week, we did a crawling/scraping exercise using the Wikipedia API on Zipf’s Law. My first introduction to the “long tail” on the internet came from fighting the search engine spam that was popping up around 2002. In the spam-fighting context, the “long tail” was to be feared: an endless stream of web pages exploiting rarely encountered keywords, hence the name “dictionary spam”. Perhaps it is no coincidence that George Kingsley Zipf himself was a linguist who noticed the eponymous distribution as a description of word frequencies.

In any case, I’ve been thinking about the “long tail” and this intense data science exploration. While there are whole ecosystems of technologies in this discipline, learning all the pieces needed to execute a workflow can seem a bit daunting, like slipping down the Zipfian distribution. Perhaps it’s best to think of the CDF instead: each tool or technology gets us closer to one.
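
To make the CDF picture concrete, here is a rough sketch in Python. It assumes a finite list of tools ranked by usefulness, with the tool at rank k weighted by 1/k^s and the exponent s set to 1. The function name zipf_cdf and the choice of ten tools are mine, purely for illustration.

```python
def zipf_cdf(n_items, s=1.0):
    """Cumulative Zipf probabilities for ranks 1..n_items (illustrative helper)."""
    # Unnormalized Zipf weight for each rank: 1 / rank^s.
    weights = [1.0 / rank ** s for rank in range(1, n_items + 1)]
    total = sum(weights)

    cdf = []
    running = 0.0
    for w in weights:
        running += w
        # Fraction of the total mass covered by the ranks seen so far.
        cdf.append(running / total)
    return cdf


# Ten hypothetical tools, learned in order of rank.
cdf = zipf_cdf(10)
for rank, value in enumerate(cdf, start=1):
    print(f"after {rank:2d} tools: {value:.3f}")
```

The first few ranks cover most of the mass, and every later one still adds a little, so the running total keeps creeping toward one.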