June 28, 2018 by Raony Guimaraes on Uncategorized

#define CTO OpenAI

It’s been two years since I wrote #define CTO, in which I documented my quest for a role where I could have scalable impact by writing code. I’ve finally found that role, though not by seeking it — instead, I sought out a problem more important to me than my role within it, brought together the right people, and found that I can best make them effective by writing code.

Source: #define CTO OpenAI

June 28, 2018 by Raony Guimaraes on Uncategorized

Git Strategizing: Branch, Commit, Review, and Merge – Simple Talk

Being a pragmatist, I recognize that—like the eternal spaces vs tabs debate—one’s branching and merging styles are often held dear. If I may ask you to set all that aside for just a few minutes though, have a look and see if any of the following might make your environment more productive. The next caveat is that this article is focused on Git, which is a distributed version control system (DVCS). The conclusions may or may not be relevant to other DVCS’s but probably not to centralized version control systems, like Subversion, where branches can be expensive.

Source: Git Strategizing: Branch, Commit, Review, and Merge – Simple Talk

June 28, 2018 by Raony Guimaraes on Uncategorized

Ask HN: Teaching programming in computational biology? | Hacker News

Hi everyone, I work at a university institute of computational biology. Besides doing research, we also teach quite a few courses for biology students, many of which include an introduction to programming (mostly with R, but also Python). We have a long-standing debate as to what the best approach to this is, and I would like to hear some of your opinions on the matter.

Basically, we have two factions: the “tools-first” and the “fundamentals-first” approach. The supporters of “tools-first” argue that we are teaching biologists, not software developers. They like to teach the specific tools (languages, libraries, functions, etc.) that our students are actually going to need as quickly as possible. To cover as much ground as possible, they are willing to sacrifice a deeper understanding of programming.

Source: Ask HN: Teaching programming in computational biology? | Hacker News

June 28, 2018 by Raony Guimaraes on Uncategorized

Obey the Testing Goat! The Book

Obey the Testing Goat!

Source: The Book

June 28, 2018 by Raony Guimaraes on Uncategorized

Mathematician-M.D. solves one of the greatest open problems in the history of mathematics – USC Viterbi | School of Engineering

Athanassios Fokas, a mathematician from the Department of Applied Mathematics and Theoretical Physics of the University of Cambridge and visiting professor in the Ming Hsieh Department of Electrical Engineering at the USC Viterbi School of Engineering, has announced the solution of one of the long-standing problems in the history of mathematics, the Lindelöf Hypothesis.

The solution, first published in arXiv, has far-reaching implications for fields like quantum computing, number theory, and encryption, which forms the basis for cybersecurity.

Source: Mathematician-M.D. solves one of the greatest open problems in the history of mathematics – USC Viterbi | School of Engineering

June 26, 2018 by Raony Guimaraes on Uncategorized

GCCBOSC18 – 26/06/2018 Daily Notes

CWL Tutorial

https://docs.google.com/presentation/d/1Kf82kImNLucLvRsoafRWr_J1GEBzYosWuWmg6vRj7dg/edit

https://github.com/rabix/composer/pull/259

GATK Training

https://drive.google.com/drive/folders/1U6Zm_tYn_3yeEgrD1bdxye4SXf5OseIt

In order to make our time together as effective as possible, you’ll need to do a bit of homework before coming to the workshop session: download a data bundle and get GATK4 installed on your laptop. To be clear, you will not have time to do this at the start of the session so it’s imperative that you do this ahead of time.

1) Download the “gatk_bundle.zip” data bundle containing data that we will use in the hands-on exercises:
Direct link: https://drive.google.com/open?id=1ixEFNgWQBf79eRVqhKv9OGpXzGzcrbn2
Enclosing folder (containing additional material for further reading): https://drive.google.com/drive/folders/1U6Zm_tYn_3yeEgrD1bdxye4SXf5OseIt?usp=sharing

2) Install Docker on your laptop and download the GATK4 container image, which contains all the system dependencies needed to run GATK4. Please follow the instructions provided here:
https://gatkforums.broadinstitute.org/gatk/discussion/11090/how-to-run-gatk-in-a-docker-container/p1?new=1

If for whatever reason you are unable to follow the docker installation instructions, the recommended alternative is to use the Conda environment that we provide to manage dependencies, as described in the github repository README:
https://github.com/broadinstitute/gatk/blob/master/README.md

And if that doesn’t work for you, for the purposes of this workshop you can just get the GATK package, as long as you make sure you have Java 8 installed on your laptop: https://github.com/broadinstitute/gatk/releases/download/4.0.5.1/gatk-4.0.5.1.zip

Thank you and see you in Portland!

GATK HaplotypeCaller
GermlineCNVCaller
SVDiscovery
Mutect
ModelSegments

CNN, gCNV Germline CNV Calling
Probabilistic Graphical Models

THIS IS THE GATK WORKSHOP BUNDLE FOR MARCH 2018

The materials are now ready for download. The gatk_bundle.zip package contains the data that is used in the hands-on exercises. The “worksheets” directory contains the exercise instructions. The “dayX” directories contain all the presentation slide decks from the workshop.

== LINKS FOR SHARING ==

PDFs and gatk_bundle: http://broad.io/gatk-1803
Installation prep: https://broad.io/gatk-w-prep

PairHMM depends on the machine you are running on.

15k genomes in 2 weeks.
76k genomes WGS processing
GenomicsDB handles about 100k genomes, but still needs some work to go beyond that.

https://software.broadinstitute.org/gatk/documentation/article?id=11090

docker run -v /path/gatk_data

https://software.broadinstitute.org/gatk/blog?id=11398

Somatic Variant Analysis
Call Variants per Sample
Haplotype Caller in GVCF mode

Alibaba Cloud
AWS
Azure
Google Cloud Platform
IBM Cloud

Not only for the cloud
BIGStack* 2.0

docker run -v /home/raony/gccbosc/gatk/gatk_bundle/2-germline/:/gatk/gatk_data -it broadinstitute/gatk:4.0.5.1

gatk HaplotypeCaller -R /gatk/gatk_data/ref/ref.fasta -I /gatk/gatk_data/bams/mother.bam -O /gatk/gatk_data/sandbox/variants.vcf
Using GATK jar /gatk/build/libs/gatk-package-4.0.5.1-local.jar

gatk ValidateSamFile -I bams/mother.bam -MODE SUMMARY

gatk --java-options "-Xmx4G" MarkDuplicatesSpark -R ref/ref.fasta -I bams/mother.bam -O sandbox/mother_dedup.bam -M sandbox/metrics.txt -- --spark-master local[*]
Using GATK jar /gatk/build/libs/gatk-package-4.0.5.1-local.jar

gatk --java-options "-Xmx4G" HaplotypeCaller -R /gatk/gatk_data/ref/ref.fasta -I /gatk/gatk_data/bams/mother.bam -O /gatk/gatk_data/sandbox/mother.g.vcf -ERC GVCF

gatk --java-options "-Xmx4G" HaplotypeCaller -R /gatk/gatk_data/ref/ref.fasta -I /gatk/gatk_data/bams/father.bam -O /gatk/gatk_data/sandbox/father.g.vcf -ERC GVCF
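The two per-sample calls above differ only in the sample name, so they can be generated rather than typed. A minimal Python sketch, assuming the bundle is mounted at /gatk/gatk_data as in the docker run command earlier; the helper name `haplotypecaller_gvcf_cmd` and its parameters are made up for illustration and are not part of GATK:

```python
import shlex


def haplotypecaller_gvcf_cmd(sample, data_dir="/gatk/gatk_data", java_mem="4G"):
    """Build the argument list for one per-sample HaplotypeCaller run in GVCF mode.

    The directory layout (ref/, bams/, sandbox/) mirrors the workshop bundle
    used in these notes; adjust it for any other data set.
    """
    return [
        "gatk", "--java-options", f"-Xmx{java_mem}",
        "HaplotypeCaller",
        "-R", f"{data_dir}/ref/ref.fasta",
        "-I", f"{data_dir}/bams/{sample}.bam",
        "-O", f"{data_dir}/sandbox/{sample}.g.vcf",
        "-ERC", "GVCF",
    ]


# One command per sample, printed in a copy-pasteable form.
commands = [haplotypecaller_gvcf_cmd(sample) for sample in ("mother", "father")]
for cmd in commands:
    print(shlex.join(cmd))
```

Inside the container, each list could be passed to `subprocess.run(cmd, check=True)` instead of being printed.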

There is a difference of 10 reads between MarkDuplicates and MarkDuplicatesSpark; they are trying to explain that.

7 different levels of certification
Stringent Options Available

export GATK_GCS_STAGING=gs://gatk-jar-cache/
gatk MarkDuplicatesSpark -R gs://gatk-workshops/GCCBOSC2018/ref/ref.fasta -I gs://gatk-workshops/GCCBOSC2018/ref/ref.fasta -O mother_dedup.bam -M metrics.txt -- --spark-runner GCS --cluster aardvark-01

https://gccbosc2018.sched.com/overview/type/B.+Conference/Birds-of-a-Feather

June 25, 2018 by Raony Guimaraes on Uncategorized

We’re moving from Azure to Google Cloud Platform | GitLab

GitLab.com is migrating to Google Cloud Platform – here’s what this means for you now and in the future.

Source: We’re moving from Azure to Google Cloud Platform | GitLab

June 25, 2018 by Raony Guimaraes on Uncategorized

Galaxy Conference – Admin 25/06/2018

bit.ly/galaxyadmin
bit.ly/adminvms
bit.ly/gadminchat

Galaxy Release Schedule

3 releases per year: January, May and September

Install Galaxy using Ansible

https://github.com/galaxyproject/dagobah-training/blob/2018-gccbosc/sessions/14-ansible/ex2-galaxy-ansible.md

sudo pip install ansible
git clone https://github.com/ARTbio/GalaxyKickStart
cd GalaxyKickStart
git checkout 2018-gccbosc
ansible-galaxy install -r requirements_roles.yml -p roles --force

curl https://gist.githubusercontent.com/raonyguimaraes/e70220921504d26a0627050ade17bc24/raw/327f952683dc550613c4c60b2b74d32427dde0ea/galaxy_ansible_install.sh | sh

$ sudo su galaxy
$ vi /srv/galaxy/config/galaxy.yml
# Add the following line under galaxy: section
    admin_users: your@email.address
$ exit  # change back to ubuntu user
$ sudo supervisorctl restart galaxy:

galaxy.ansible.com/explore#

Time	Topic	Links	Instructor
09:00	Welcome and introduction	Slides	(Č)
09:15	Deployment and platform options	Slides	(Č)
09:30	Using Ansible to deploy Galaxy	Slides, Exercise	(E)(G)
10:20	Extending installation	Slides, Exercise	(G)
10:40	Defining and importing genomes, Data Managers	Slides, Exercise	(E)
11:00	Galactic Database	Slides	(M)(N)
11:15	Web Servers nginx/Apache	Slides	(M)(N)
11:30	Close Morning Session

https://github.com/galaxyproject/dagobah-training/blob/2018-gccbosc/sessions/05-reference-genomes/ex1-reference-genomes.md#exercise-3-install-a-datamanager-from-the-toolshed

https://github.com/galaxyproject/dagobah-training/blob/2018-gccbosc/sessions/14-ansible/ex2-galaxy-ansible.md

Galaxy admin -> local data: Create DBKey and Reference Genome – fetching

Install dbkey from saccer2 data_manager_fetch_genome_dbkeys_all_fasta

Install BWA data_manager_bwa_mem_index_builder

https://ephemeris.readthedocs.io/en/latest/

Admin -> create bwa index

Second Session

ubuntu@2018-gcc-training-0:~⟫ sudo vim /srv/galaxy/config/galaxy.yml

In /srv/galaxy/config/galaxy.yml, uncomment #nginx_x_accel_redirect_base: False and change it to nginx_x_accel_redirect_base: /_x_accel_redirect. Remember, this file is owned by the galaxy user so be sure to use sudo -u galaxy when editing it.

https://github.com/galaxyproject/dagobah-training/blob/2018-gccbosc/sessions/03-production-basics/ex3-nginx.md

sudo supervisorctl restart nginx galaxy:

Google’s PageSpeed Tools can identify any compression or caching improvements you can make.

If configuring SSL (out of scope for this training), out-of-the-box SSL settings are often insecure!

Use the Mozilla SSL config generator to create a default config and Qualys SSL Server Test to check it.

https://planemo.readthedocs.io/en/latest/

https://planemo.readthedocs.io/en/latest/readme.html

https://planemo.readthedocs.io/en/latest/writing_cwl_appliance.html?highlight=write%20appliance

$ planemo test --no-container --engine toil seqtk_seq.cwl

galaxy-tool-test

planemo o

#this will open the browser

https://planemo.readthedocs.io/en/latest/writing_advanced.html

Ephemeris uses Bioblend to remotely manage Galaxy instances via Galaxy’s API.
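As a rough illustration of what managing Galaxy through the API looks like, here is a small BioBlend sketch. It relies on `GalaxyInstance` and `gi.tools.get_tools()` from BioBlend; the helper names `connect` and `installed_tool_ids`, and the URL and API key in the usage note, are placeholders and not something from the training material:

```python
def installed_tool_ids(gi):
    """Return the sorted ids of the tools a Galaxy server reports.

    `gi` is expected to behave like bioblend.galaxy.GalaxyInstance.
    """
    return sorted(tool["id"] for tool in gi.tools.get_tools())


def connect(url, api_key):
    """Open a BioBlend connection to a Galaxy server."""
    # Imported lazily so installed_tool_ids can be tried without BioBlend installed.
    from bioblend.galaxy import GalaxyInstance

    return GalaxyInstance(url=url, key=api_key)
```

Usage, assuming a local instance and an admin API key: `installed_tool_ids(connect("http://localhost:8080", "<your API key>"))`.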

https://stats.galaxyproject.org/login

https://grafana.com

https://telescope.galaxyproject.org

cd /srv/galaxy/server/lib/galaxy/jobs/runners

Plugins

Correspond to job runner plugins in lib/galaxy/jobs/runners

Plugins for:

local
Slurm (DRMAA subclass)
DRMAA: SGE, PBS Pro, LSF, Torque
HTCondor
Torque: Using the pbs_python library
Pulsar: Galaxy’s own remote job management system
Command Line Interface (CLI) via SSH
Kubernetes
Go-Docker
Chronos

https://galaxyproject.github.io/dagobah-training/2018-gccbosc/15-job-conf/job_conf.html#3

https://research.cs.wisc.edu/htcondor/

Most runners need a shared file system (NFS, Ceph, etc.).

The exception is Pulsar!

sudo cat job_conf.xml.sample_basic 
<?xml version="1.0"?>
<!-- A sample job config that explicitly configures job running the way it is configured by default (if there is no explicit config). -->
<job_conf>
 <plugins>
 <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner" workers="4"/>
 </plugins>
 <destinations>
 <destination id="local" runner="local"/>
 </destinations>
</job_conf>
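To see how plugins and destinations relate in a file like this, a short Python sketch can parse it and check that every destination points at a declared runner plugin. This is only an illustration using the standard library, not how Galaxy itself loads the file; the function name `summarize_job_conf` is made up, and the XML is inlined as a copy of the sample above:

```python
import xml.etree.ElementTree as ET

# Inline copy of the basic sample job configuration shown above.
JOB_CONF = """<?xml version="1.0"?>
<job_conf>
    <plugins>
        <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner" workers="4"/>
    </plugins>
    <destinations>
        <destination id="local" runner="local"/>
    </destinations>
</job_conf>
"""


def summarize_job_conf(xml_text):
    """Return (plugins, destinations, unknown) for a job_conf.xml document.

    plugins maps plugin id -> load string, destinations maps destination id ->
    runner id, and unknown lists destinations whose runner is not a declared plugin.
    """
    root = ET.fromstring(xml_text)
    plugins = {p.get("id"): p.get("load") for p in root.findall("./plugins/plugin")}
    destinations = {
        d.get("id"): d.get("runner")
        for d in root.findall("./destinations/destination")
    }
    unknown = sorted(dest for dest, runner in destinations.items() if runner not in plugins)
    return plugins, destinations, unknown


plugins, destinations, unknown = summarize_job_conf(JOB_CONF)
print(plugins, destinations, unknown)
```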

https://docs.galaxyproject.org/en/master/admin/jobs.html#dynamic-destination-mapping

https://docs.galaxyproject.org/en/master/admin/dependency_resolvers.html

http://galaxyproject.github.io/training-material/topics/admin/tutorials/connect-to-compute-cluster/slides.html#1

https://github.com/galaxyproject/galaxy-hub/blob/master/src/galaxy-updates/2018-03/index.md

https://pulsar.incubator.apache.org/

http://galaxyproject.github.io/training-material/topics/admin/tutorials/connect-to-compute-cluster/slides.html#13

https://github.com/galaxyproject/galaxy-kubernetes

http://galaxyproject.github.io/training-material/topics/admin/tutorials/connect-to-compute-cluster/tutorial.html#section-4---statically-map-a-tool-to-a-job-destination

Edit: /srv/galaxy/config/tool_destinations.yml

---
tools:
  multi:
    rules:
      - rule_type: file_size
        lower_bound: 16
        upper_bound: Infinity
        destination: slurm-2c
    default_destination: slurm_cluster
default_destination: local_no_container
verbose: True
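The idea behind the file_size rule can be sketched in a few lines of Python. This is not Galaxy's actual implementation; it assumes sizes are compared in bytes, that the lower bound is inclusive and the upper bound exclusive, and that a tool without a matching rule falls back first to its own default and then to the global one. The function name `pick_destination` is hypothetical, and the config is written as a dict to avoid needing a YAML parser:

```python
import math

# Same structure as the YAML above, written as a Python dict.
TOOL_DESTINATIONS = {
    "tools": {
        "multi": {
            "rules": [
                {
                    "rule_type": "file_size",
                    "lower_bound": 16,
                    "upper_bound": math.inf,
                    "destination": "slurm-2c",
                },
            ],
            "default_destination": "slurm_cluster",
        },
    },
    "default_destination": "local_no_container",
}


def pick_destination(config, tool_id, total_input_size):
    """Choose a destination id for a job (illustrative logic only)."""
    tool = config["tools"].get(tool_id)
    if tool is None:
        # Tool not listed: use the global default.
        return config["default_destination"]
    for rule in tool.get("rules", []):
        in_range = rule["lower_bound"] <= total_input_size < rule["upper_bound"]
        if rule["rule_type"] == "file_size" and in_range:
            return rule["destination"]
    # No rule matched: use the tool default, then the global default.
    return tool.get("default_destination", config["default_destination"])


print(pick_destination(TOOL_DESTINATIONS, "multi", 1024))
print(pick_destination(TOOL_DESTINATIONS, "multi", 8))
print(pick_destination(TOOL_DESTINATIONS, "other_tool", 1024))
```

With these assumptions, the `multi` tool goes to `slurm-2c` for inputs of 16 bytes or more, to `slurm_cluster` otherwise, and any unlisted tool goes to `local_no_container`.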

gdb

attach 28376

https://www.gnu.org/software/gdb/

June 20, 2018 by Raony Guimaraes on Uncategorized

A Django Async Roadmap – Aeracode

I think that the time has come to start talking seriously about bringing async functionality into Django itself, and so I have been working on a draft “roadmap” for what I think this might look like. I’ve run this past a few people – some of whom were Django core members, and some who weren’t – but I’m now posting it up for public feedback (see the end for where to discuss this).

Source: A Django Async Roadmap – Aeracode

June 19, 2018 by Raony Guimaraes on Uncategorized

Machine Learning: The High Interest Credit Card of Technical Debt – Google AI

Abstract

Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is to highlight several machine learning specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.

Source: Machine Learning: The High Interest Credit Card of Technical Debt – Google AI