Alex Dobin

148 posts

@a_dobin

Director of Bioinformatics @ArcInstitute Formerly: ENCODE; PI@CSHL Developer of STAR.

Palo Alto, CA · Joined August 2012

125 Following · 767 Followers

Alex Dobin@a_dobin·22 Dec

@fulop_dan Indeed, strandedness of the libraries does not (presently) affect alignments. --soloStrand option is necessary for assigning reads to genes in the single-cell gene expression.

Dan Fulop@fulop_dan·21 Dec

Never mind, and sorry to bug you. I see now that a general answer about strandedness and the lack of a need for additional parameters is contained here: groups.google.com/g/rna-star/c/o…

Dan Fulop@fulop_dan·21 Dec

@a_dobin The --readStrand switch is gone in the STAR short read aligner. Is it now replaced by --soloStrand even for bulk RNAseq? #Bioinformatics

Alex Dobin@a_dobin·14 Oct

@AnnLorainePhD @nomad421 @anshulkundaje Good suggestions from Rob! And as Anshul pointed out, tweaking parameters could be helpful. If you have specific examples of stubbornly wrong alignments, please post them on GitHub: github.com/alexdobin/STAR…

Alex Dobin@a_dobin·21 Jul

@anshulkundaje @satijalab @nomad421 @stephaniehicks @humancellatlas @_hubmap To mitigate this issue, we can prioritize exons over introns when they overlap, as suggested in nature.com/articles/s4158… . This approach is being implemented in STARsolo (to be released soon). (2/2)

Alex Dobin@a_dobin·21 Jul

@anshulkundaje @satijalab @nomad421 @stephaniehicks @humancellatlas @_hubmap Seconding all responses, good discussion! An issue with including intronic reads is with the genes whose exons overlap introns of other genes. Reads mapping to such overlapped regions will be considered multi-gene and (typically) excluded. (1/2)

Anshul Kundaje@anshulkundaje·21 Jul

Folks - Do you incorporate intronic reads in 10X scRNA and snRNAseq when creating count matrices for downstream analyses? Why or why not? @satijalab @nomad421 @stephaniehicks @a_dobin @humancellatlas @_hubmap

Alex Dobin@a_dobin·17 May

@APredeus Indeed, we were using the "abridged" 10X annotations that exclude small non-coding RNA and pseudogenes. We checked it for the full Gencode 37 annotations, and the results were very similar.

Alex Predeus 🇺🇦@APredeus·14 May

@a_dobin Quick question: it seems like you've applied the same GTF filtering as cellRanger does (removing about 30k noncoding RNA genes) for all tested tools; is this correct? Would you expect the accuracy to change a lot if the full reference is used?

Alex Dobin@a_dobin·5 May

STARsolo preprint is out on bioRxiv: biorxiv.org/content/10.110… STAR release 2.7.9a: github.com/alexdobin/STAR… The major new feature is quantification of multi-gene (multi-mapping) reads/UMIs, which are necessary to detect expression from overlapping genes and paralogs. 1/5

Alex Dobin@a_dobin·12 May

@alexwstockinger Supertranscripts should work if you can make a set of Supertranscript sequences and a GTF describing spliced/unspliced transcripts with respect to transcripts, and give them to the STAR genome generation step.

Alex Stockinger@alexwstockinger·11 May

@a_dobin So a simple gene/transcript map is the way to go? Ad supertranscripts: to my understanding, cellranger IS splice-aware, right? And so is STARsolo? What am I missing here?

Alex Stockinger@alexwstockinger·7 May

Just two months after Kallisto was shown to often outperform STAR in mapping #scRNAseq data (biorxiv.org/content/10.110…), STAR strikes back by integrating multi-mapping data. Happy to see these tools improving, maybe @10xGenomics could consider integrating them into cellranger?

Alex Dobin@a_dobin·11 May

@nomad421 Interesting approach, and very impressive accuracy improvement! And incredibly quick turn-around time!

𝕐@nomad421·11 May

@a_dobin : you may be interested in the approach suggested here; we'd be happy to have your thoughts / feedback (we haven't gotten around to looking at the simulated data yet)!

𝕐@nomad421·10 May

Can the prediction of "expression for thousands of non-expressed genes" arising in certain approaches for #scRNAseq quantification (biorxiv.org/content/10.110…) be ameliorated while retaining their computational benefits? It seems possible; a short thread! 1/10

Alex Dobin@a_dobin·8 May

@alexwstockinger The SuperTranscripts are very cool - but they would require spliced alignments. We were actually looking into that at some point but did not get far. The redundancy is not a problem, as long as redundant transcripts are assigned to the same gene.

Alex Stockinger@alexwstockinger·8 May

@a_dobin Is there a recommended procedure for this that takes care of the innate redundancy in a de-novo assembly? I've had good results using supertranscripts with a trinity assembly in the past - would this be a valid approach with STARsolo? genomebiology.biomedcentral.com/articles/10.11…

Alex Dobin@a_dobin·7 May

@alexwstockinger This is a good point: for species without genome assembly, mapping to the transcriptome is the only option. You can do it with STARsolo by generating the genome index from transcript sequences instead of chromosomes. 3/3

Alex Dobin@a_dobin·7 May

@alexwstockinger Using simulations, we show the differences are due to Kallisto's lower accuracy, which is caused by the pseudoalignment-to-transcriptome algorithm. It forces intronic reads (abundant in single-cell data) to map to spurious genes. 2/3

Alex Dobin@a_dobin·7 May

@timtriche @manvendr7 @MollyHammell Interesting paper, thanks! It looks like they are aggregating reads over "meta" TE - they are not doing EM over individual genes.

Alex Dobin@a_dobin·7 May

@bdeonovic @BMirauta @biomonika @lpachter Sure, no disagreement here. I was thinking about a specific data type, scRNA-seq gene/cell counts: mostly 0s, many 1s, and fewer >=2 elements. But maybe Lior has something else on his mind, and I am being paranoid. twitter.com/a_dobin/status…

Alex Dobin@a_dobin

@hypercompetent @lpachter It’s getting late on the East coast, and still no blog from Lior, so I will make my presumptuous guess. I think Lior is trying to puzzle out why Kallisto to CellRanger correlation is lower in our Fig.4C biorxiv.org/content/10.110… vs. their Fig.2D nature.com/articles/s4158… 1/3

Benjamin Deonovic@bdeonovic·7 May

@a_dobin @BMirauta @biomonika @lpachter If the model is x[i] = 1-y[i] for i < k and x[i]=y[i] for i>=k then given k the association between x and y is perfect. My point above is that it is important what the underlying model is and the underlying model should inform what measures of association you use

Lior Pachter@lpachter·6 May

This is a subtweet (until I get around to writing the blog post).

Alex Dobin@a_dobin·7 May

@nomad421 @p_bourguet Indeed!

𝕐@nomad421·7 May

@a_dobin @p_bourguet One can also use salmon with the transcriptome-projected alignments from STAR; it is quite fast. It's a great pairing (and, as a bonus, you don't have to disallow indels in the projected alignments).

Alex Dobin@a_dobin·7 May

@BMirauta @bdeonovic @biomonika @lpachter And correlation coefficient does not have to be higher than the proportion of equal elements. An even simpler toy example: x=[0 0 1 1] y=[0 1 0 1] corr(x,y)=0 (obviously) while 50% of the elements agree.
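The toy example in this post can be checked in a few lines. This is a minimal sketch, not code from the thread: the helper name `pearson` and the `agreement` variable are assumptions introduced for illustration, and only the two vectors come from the post.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Vectors taken from the post above.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]

r = pearson(x, y)
# Fraction of positions where the two vectors hold the same value.
agreement = sum(a == b for a, b in zip(x, y)) / len(x)

print(f"corr(x, y) = {r}, agreement = {agreement:.0%}")
```

Half of the positions match, yet the covariance terms cancel exactly, so the correlation coefficient is zero.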

Bogdan Mirauta@BMirauta·7 May

@bdeonovic @biomonika @lpachter I fully agree Pearson is not the best correlation coef in many cases (outliers notably) and that even the concept of correlation is not the most appropriate sometimes. But, on this data I do not agree it is misleading. The r2 of 0.36 indicates the right prediction accuracy.

Alex Dobin@a_dobin·7 May

@p_bourguet Right, there are a few features in STARsolo that would be good to have for bulk (e.g., counting only reads that are concordant with transcripts). They are high on my TODO list. Though for multimappers, quantifying with RSEM is still a better (albeit slower) option.

Pierre Bourguet@p_bourguet·7 May

@a_dobin Glad to see that multimappers are getting some love! Did you implement the multimapper quantification options only for single cell or also with the bulk version?

Alex Dobin@a_dobin·7 May

@hypercompetent @lpachter The answer to “why Kallisto to CellRanger correlation is lower in our calculation” is simple. We used Spearman correlation, while they used Pearson. Pearson correlation, of course, can be inflated by various artifacts and is not a good choice for RNA-seq data. 3/3
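One way to see how the two metrics can diverge is a small sketch on made-up counts. The data below are an assumption chosen for illustration (they are not from either paper), and `pearson`, `average_ranks`, and `spearman` are hypothetical helpers: a single highly expressed gene drives Pearson close to 1, while the rank-based Spearman coefficient reflects the disagreement among the low counts.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up counts: the low values disagree, one large value agrees.
tool_a = [0, 1, 0, 1, 0, 1, 1000]
tool_b = [1, 0, 1, 0, 1, 0, 1000]

p = pearson(tool_a, tool_b)
s = spearman(tool_a, tool_b)
print(f"Pearson = {p:.4f}, Spearman = {s:.4f}")
```

In this contrived case the outlier alone accounts for nearly all of the Pearson value; which metric is appropriate for real count matrices is the question debated in the thread.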

Alex Dobin@a_dobin·7 May

@hypercompetent @lpachter I am still not sure what’s the point of Lior’s toy example. Should we not use correlation as a metric at all? Then why was it used in Kallisto paper? 2/3

Alex Dobin@a_dobin·7 May

@dna_rosenberg @ParseBio Thanks, Alex!

Alex Rosenberg@dna_rosenberg·6 May

Looks really interesting. It’s amazing to me the impact @a_dobin has had on the field, especially RNA-seq and scRNA-seq. I’ve been using STAR for years now and we rely heavily on it in @parsebio’s single cell analysis pipeline. A truly incredible tool
