The mathematics of science’s broken reward system

Science has been peculiarly resistant to self-examination. During the ‘science wars’ of the 1990s, for instance, scientists disdained sociological studies of their culture. Yet there is now a growing trend for scientists to use the quantitative methods of data analysis and theoretical modelling to try to work out how, and how well, science works — often with depressing conclusions. Why are these kinds of studies being produced, and what is their value?

Take a study published on 10 November by psychologists Andrew Higginson of the University of Exeter and Marcus Munafò of the University of Bristol, UK. It considers how scientists can maximize their ‘fitness’, or career success, in a simplified ecosystem that allows them to invest varying amounts of time and effort into exploratory studies. The study finds that in an ecosystem that rewards a constant stream of high-profile claims, researchers will rationally opt for corner-cutting strategies, such as small sample sizes. These save on the effort required for each study, but they raise the danger that new findings will not prove robust or repeatable.

A slightly different perspective — but a similar conclusion — comes from work published on 21 September by information scientist Paul Smaldino at the University of California, Merced, and evolutionary ecologist Richard McElreath at the Max Planck Institute for Evolutionary Anthropology in Leipzig, Germany. They take an evolutionary view, imagining that laboratories are in competition for rewards, and that the most successful of them produce more ‘progeny’: new research groups that use the same techniques and strategies. There is generally a trade-off between productivity and rigour: producing more statistically secure, replicated findings takes more time and effort, but generating too many false positives will eventually take its toll on reputations. Under selection for productivity, however, less-rigorous methods spread and false discovery rates increase.

Source: The mathematics of science’s broken reward system : Nature News & Comment

 

A History of Hard Drives

2016 marks the 60th anniversary of the venerable Hard Disk Drive (HDD). While new computers increasingly turn to Solid State Drives (SSDs) for main storage, HDDs remain the champions of low-cost, high-capacity data storage. That’s a big reason why we still use them in our Storage Pods. Let’s take a spin in the Wayback Machine and look at the history of hard drives. Let’s also think about what the future might hold.
Source: A History of Hard Drives

 

The Joy of Linux Desktop Environments

I’m endlessly fascinated by Linux, to the extent that I wrote a book about it, Learn Linux in a Month of Lunches. My very favorite thing about Linux is the desktop environment concept. Desktop environments are graphical interfaces for the entire operating system, but where most operating systems, like Windows, OS X, iOS, and Android, have one common interface, Linux users can easily install and use a variety of interfaces without changing their underlying system.

This means all of your files and bookmarks are right where you left them. You can use the same programs. The only difference is how your system looks, which can be dramatically different depending upon which desktop environment you use.

It’s a tough concept to explain because there aren’t many comparables outside of Linux. If you’re using Windows, you’re pretty much stuck with the Windows interface. The menus are always going to be in the same place, and while you can do things like change your desktop image and your colors and themes, you’re still limited in how you interact with your computer.

We’re so conditioned to accept these interfaces that we have trouble understanding a model that gives us a choice. In fact, the desktop environment chapter of my book turned out to be challenging to write, because I was not only explaining some of the different desktop environments, like GNOME and KDE, but also explaining what desktop environments are. Luckily, the challenge of explaining them didn’t diminish my love for the concept.

I love that I can log in to different desktop environments, depending upon what I wish to do. I love that my computer doesn’t always have to look the same when I’m working. And I love the flexibility to choose the right desktop environment for the job.

Source: The Joy of Linux Desktop Environments

 

Mining of Massive Datasets

The 2nd edition of the book (v2.1)

The following is the second edition of the book. There are three new chapters, on mining large graphs, dimensionality reduction, and machine learning. There is also a revised Chapter 2 that treats map-reduce programming in a manner closer to how it is used in practice.

Together with each chapter there is also a set of lecture slides that we use for teaching the Stanford CS246: Mining Massive Datasets course. Note that the slides do not necessarily cover all the material covered in the corresponding chapters.

 

Each chapter below lists its book chapter (PDF), lecture slides (PDF and PPT), and numbered video parts where available.

Preface and Table of Contents: Book PDF
Chapter 1, Data Mining: Book PDF; Slides PDF, PPT
Chapter 2, Map-Reduce and the New Software Stack: Book PDF; Slides PDF, PPT; Videos 1 2 3 4 5 6 7 8
Chapter 3, Finding Similar Items: Book PDF; Slides PDF, PPT; Videos 1 2 3 4 5 6 7 8 9 10 11 12 13
Chapter 4, Mining Data Streams: Book PDF; Slides Part 1 PDF, PPT and Part 2 PDF, PPT; Videos 1 2 3 4 5
Chapter 5, Link Analysis: Book PDF; Slides Part 1 PDF, PPT and Part 2 PDF, PPT; Videos 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Chapter 6, Frequent Itemsets: Book PDF; Slides PDF, PPT; Videos 1 2 3 4
Chapter 7, Clustering: Book PDF; Slides PDF, PPT; Videos 1 2 3 4 5
Chapter 8, Advertising on the Web: Book PDF; Slides PDF, PPT; Videos 1 2 3 4
Chapter 9, Recommendation Systems: Book PDF; Slides Part 1 PDF, PPT and Part 2 PDF, PPT; Videos 1 2 3 4 5
Chapter 10, Mining Social-Network Graphs: Book PDF; Slides Part 1 PDF, PPT and Part 2 PDF, PPT; Videos 1 2 3 4 5 6 7 8 9 10 11 12
Chapter 11, Dimensionality Reduction: Book PDF; Slides PDF, PPT; Videos 1 2 3 4 5 6 7 8 9 10 11 12
Chapter 12, Large-Scale Machine Learning: Book PDF; Slides Part 1 PDF, PPT and Part 2 PDF, PPT; Videos 1 2 3 4 5 6 7 8 9 10 11 12
Index: PDF
Errata: HTML

 

Download the latest version of the book as a single big PDF file (511 pages, 3 MB).

Source: Mining of Massive Datasets

 

GNU Parallel tutorial

Input sources

GNU parallel reads input from input sources. These can be files, the command line, and stdin (standard input or a pipe).

A single input source

Input can be read from the command line:

  parallel echo ::: A B C

Output (the order may be different because the jobs are run in parallel):

  A
  B
  C

The input source can be a file:

  parallel -a abc-file echo

Output: Same as above.
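
The tutorial does not show how abc-file was created; assuming it simply holds one input value per line, a minimal way to produce it is:

  # abc-file is assumed to contain one input value per line (A, B, C)
  printf '%s\n' A B C > abc-file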

STDIN (standard input) can be the input source:

  cat abc-file | parallel echo

Output: Same as above.

Multiple input sources

GNU parallel can take multiple input sources given on the command line. GNU parallel then generates all combinations of the input sources:

  parallel echo ::: A B C ::: D E F

Output (the order may be different):

  A D
  A E
  A F
  B D
  B E
  B F
  C D
  C E
  C F
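
As a rough illustration of what "all combinations" means here, the following plain shell loop is a sequential sketch that produces the same nine pairs; GNU parallel achieves this with multiple ::: input sources and runs the resulting jobs in parallel:

  # Sequential sketch of the combination behaviour above (illustration only)
  for x in A B C; do
    for y in D E F; do
      echo "$x $y"
    done
  done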

Source: GNU Parallel tutorial

 

How We Knew It Was Time to Leave the Cloud | GitLab

At a small scale, the cloud is cheaper and sufficient for many projects. However, if you need to scale, it’s not so easy. It’s often sold as: “If you need to scale and add more machines, you can spawn them because the cloud is ‘infinite’.” What we discovered is that yes, you can keep spawning more machines, but there is a threshold, particularly when you’re adding heavy IOPS, beyond which it becomes less effective and very expensive. You’ll still have to pay for bigger machines. The nature of the cloud is time-sharing, so you still will not get the best performance. When it comes down to it, you’re paying a lot of money to get a subpar level of service while still needing more performance.

Source: How We Knew It Was Time to Leave the Cloud | GitLab