Building a Regex Search Engine for DNA

This winter, Tony interned at Benchling and built the foundation of our DNA search feature. Somak and I brought it into production shortly after his internship ended, and we’re finally writing it up!

 

We recently launched DNA search on Benchling – you can have a library of thousands of plasmids and primers, and we’ll search through them in less than 100ms end to end. It was a fun project, and through it we ended up building a regex search engine (on top of Elasticsearch) to support the various search needs of scientists. Here’s how we did it.
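
To make the starting point concrete: Elasticsearch ships with a native regexp query, which is roughly where a project like this begins. Here is a minimal sketch (the sequences index and the sequence field are hypothetical, an illustration rather than Benchling’s production setup):

% curl -s 'localhost:9200/sequences/_search' -d '
  {"query": {"regexp": {"sequence": ".*GATT[AC]CA.*"}}}'

One wrinkle worth knowing: Elasticsearch anchors regexp queries to whole terms, hence the leading and trailing .* when hunting for a motif inside a longer sequence. Limitations like that are part of why a purpose-built layer on top is attractive.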

Source: Building a Regex Search Engine for DNA

 

Row Level Security with PostgreSQL 9.5

Release 9.5 of PostgreSQL delivers many new features like upsert, new JSONB functions, new GROUPING functions, and more. While some of these, like upsert or JSONB, may be useful to many people, a number of the new features really only serve edge cases. If you have the particular edge case a feature solves, though, then that feature can be invaluable. RLS (Row Level Security) is one of these edge-case features.

RLS does just what it says: it secures rows in a table. But you do have to enable it for each table, and you need to commit to using database roles as your main security mechanism. That last part is the barrier, but it is also the reason to use such a feature.

With RLS, you use the database tier to secure the data (at least for the enabled tables). Both multi-tenant tables and analytics schemas where users have general access to the database via a query tool are solid examples of when RLS makes sense.
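
To make that concrete, here is a minimal sketch of the multi-tenant case, assuming a hypothetical accounts table with a tenant_name column and one database role per tenant:

% psql mydb
mydb=# ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;
mydb=# CREATE POLICY tenant_isolation ON accounts
mydb-#     USING (tenant_name = current_user);
mydb=# GRANT SELECT ON accounts TO tenant_a;

With the policy in place, a SELECT issued by tenant_a returns only the rows whose tenant_name matches that role. Note that table owners bypass RLS by default, so the restriction applies to the non-owner roles you hand out.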

Source: Row Level Security with PostgreSQL 9.5

 

Snappy Changes – Labix Blog

As announced last Saturday, Snappy Ubuntu Core 2.0 has just been tagged and made its way into the archives of Ubuntu 16.04, which is due for final release in the coming days. So this is a nice time to start covering interesting aspects of what is being made available in this release.

A good choice for the first post in this series is how snappy performs changes in the system, as that knowledge is useful for observing and understanding what is going on in your snappy platform. Let’s start with the first operation you will likely perform when interacting with the platform — install:

% sudo snap install ubuntu-calculator-app
120.01 MB / 120.01 MB [================================================================] 100.00 % 1.45 MB/s

This operation is traditionally done on analogous systems in an ephemeral way. That is, the software has either a local or a remote database of options to install, and once the change is requested the platform of choice will start acting on it with all state for the modification kept in memory. If something doesn’t go well, such as a reboot or even a crash, the modification is lost… in the best case. Rather than being lost entirely, it might instead be partially applied to the system, with some files spread through the filesystem and perhaps some of the involved hooks run. After the restart, the partial state remains until some manual action is taken.

Snappy instead has an engine that tracks and controls such changes in a persistent manner. All the recent changes, pending or not, may be observed via the API and the command line:

% snap changes
ID   Status  ...  Summary
1    Done    ...  Install "ubuntu-calculator-app" snap

(the spawn and ready date/time columns have been hidden for space)
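
The same information is available programmatically. As a sketch (I’m going from snapd’s documented REST API here, so treat the socket path and query parameter as assumptions), the change log can be fetched over snapd’s local UNIX socket:

% curl -s --unix-socket /run/snapd.socket 'http://localhost/v2/changes?select=all'

The response is JSON describing each change and its individual tasks, the same data that snap changes renders as a table.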

Source: Snappy Changes – Labix Blog

 

Peer review: Troubled from the start : Nature News & Comment


Pivotal moments in the history of academic refereeing have occurred at times when the public status of science was being renegotiated, explains Alex Csiszar.

Referees are overworked. The problem of bias is intractable. The referee system has broken down and become an obstacle to scientific progress. Traditional refereeing is an antiquated form that might have been good for science in the past, but it’s high time to put it out of its misery.

What is this familiar litany? It is a list of grievances aired by scientists a century ago.

If complaining about the faults of referee systems is nothing new, such systems are not as old as historical accounts often claim. Investigators of nature communicated their findings without scientific referees for centuries. Deciding whom and what to trust usually depended on personal knowledge among close-knit groups of researchers. (Many might argue it still does.)

Source: Peer review: Troubled from the start : Nature News & Comment

 

19 Tips For Everyday Git Use

I’ve been using git full time for the past 4 years, and I wanted to share the most practical tips that I’ve learned along the way. Hopefully, it will be useful to somebody out there.

If you are completely new to git, I suggest reading Git Cheat Sheet first. This article is aimed at somebody who has been using git for three months or more.

Table of Contents:

  1. Parameters for better logging
    git log --oneline --graph
  2. Log actual changes in a file
    git log -p filename
  3. Only log changes for specific lines in a file
    git log -L 1,1:some-file.txt
  4. Log changes not yet merged to the parent branch
    git log --no-merges master..
  5. Extract a file from another branch
    git show some-branch:some-file.js
  6. Some notes on rebasing
    git pull --rebase
  7. Remember the branch structure after a local merge
    git merge --no-ff
  8. Fix your previous commit, instead of making a new commit
    git commit --amend
  9. Three stages in git, and how to move between them
    git reset --hard HEAD and git status -s
  10. Revert a commit, softly
    git revert -n
  11. See diff-erence for the entire project (not just one file at a time) in a 3rd party diff tool
    git difftool -d
  12. Ignore the white space
    git diff -w
  13. Only “add” some changes from a file
    git add -p
  14. Discover and zap those old branches
    git branch -a
  15. Stash only some files
    git stash -p
  16. Good commit messages
  17. Git Auto-completion
  18. Create aliases for your most frequently used commands
  19. Quickly find a commit that broke your feature (EXTRA AWESOME; see the sketch after this list)
    git bisect
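
As a quick taste of how a couple of these combine, here is a sketch of tips 18 and 19 together. The alias name and the test script are hypothetical; git bisect run drives the binary search automatically using the script’s exit code:

    git config --global alias.lg "log --oneline --graph"
    git lg
    git bisect start
    git bisect bad                  # current HEAD is broken
    git bisect good v1.0            # last release known to work
    git bisect run ./run-tests.sh   # git walks the history for you
    git bisect reset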

Source: 19 Tips For Everyday Git Use

 

6 Lesser Known Python Data Analysis Libraries

Python offers a great environment and a rich set of libraries to developers working with data. There are tons of useful libraries out there to help novice or experienced developers and analysts process or visualize datasets. Some of these libraries are really popular and used by millions of developers, for example Pandas, NumPy, Scikit-learn, and NLTK. Others are not so well known but have turned out to be handy in my experience. This article introduces six such Python libraries for working with data. Readers might already be familiar with some of them, but I hope this article still proves useful.

Source: 6 Lesser Known Python Data Analysis Libraries

 

Sorry, You Can’t Speed Read – The New York Times


Don’t be fooled by courses or digital technologies that promise otherwise.

OUR favorite Woody Allen joke is the one about taking a speed-reading course. “I read ‘War and Peace’ in 20 minutes,” he says. “It’s about Russia.”

The promise of speed reading — to absorb text several times faster than normal, without any significant loss of comprehension — can indeed seem too good to be true. Nonetheless, it has long been an aspiration for many readers, as well as the entrepreneurs seeking to serve them. And as the production rate for new reading matter has increased, and people read on a growing array of devices, the lure of speed reading has only grown stronger.

Source: Sorry, You Can’t Speed Read – The New York Times

 

AKT: Ancestry and Kinship Toolkit | bioRxiv

Motivation: Ancestry and Kinship Toolkit (AKT) is a statistical genetics tool for analysing large cohorts of whole-genome sequenced samples. It can rapidly detect related samples, characterise sample ancestry, detect IBD segments, calculate correlation between variants, check Mendel consistency and perform data clustering. AKT brings together the functionality of many state-of-the-art methods, with a focus on speed and a unified interface. We believe it will be an invaluable tool for the curation of large WGS data-sets.

Availability: The source code is available at https://illumina.github.io/akt
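
For a flavour of the interface, and going from memory of the AKT documentation (so treat the subcommand names as assumptions rather than verified usage), kinship and ancestry analyses are each a single command over a multi-sample BCF:

% akt kin cohort.bcf > kinship.txt
% akt pca cohort.bcf > projections.txt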

Source: AKT: Ancestry and Kinship Toolkit | bioRxiv