Mechanical Turk
@MechMathTurk
80 posts
Joined October 2010
106 Following · 29 Followers
#CVPR2026@CVPR·
#CVPR2025 reviewers and area chairs: The review assignments have been released! Please check them for any anomalies.
Alexia Jolicoeur-Martineau@jm_alexia·
I'm finally starting to train video-game generative models! 🎮 The data processing took a long time.
Aakash Kumar Nain@A_K_Nain·
Qualcomm presents Mobile Video Diffusion. Deploys an optimized mobile-friendly UNet. Includes:
- Low-resolution finetuning
- Temporal multiscaling
- Cross-attention optimization
- Temporal block pruning
...
Mechanical Turk@MechMathTurk·
@LinBin46984 Amazing work! Many thanks for sharing it with the community! May I ask about the video generation evaluation? Was Latte-L initialized with the same PixArt-α weights even though the latent space had changed? Or were pretrained weights for the UCF and SkyTimelapse datasets used?
Bin Lin@LinBin46984·
🚀🚀🚀The ultimate VideoVAE has arrived! 💥Faster, ✨ultra-efficient, and delivering performance with Wavelet Flow VAE (WFVAE). When you're building video generation models, this is the essential component you can’t afford to miss. Welcome aboard!🚄🚄🚄 github.com/PKU-YuanGroup/…
zkjiang@jiang_zhengkai·
@_akhaliq Thanks for sharing our latest work on I2V distillation.
AK@_akhaliq·
OSV One Step is Enough for High-Quality Image to Video Generation discuss: huggingface.co/papers/2409.11… Video diffusion models have shown great potential in generating high-quality videos, making them an increasingly popular focus. However, their inherent iterative nature leads to substantial computational and time costs. While efforts have been made to accelerate video diffusion by reducing inference steps (through techniques like consistency distillation) and by GAN training, these approaches often fall short in either performance or training stability. In this work, we introduce a two-stage training framework that effectively combines consistency distillation with GAN training to address these challenges. Additionally, we propose a novel video discriminator design, which eliminates the need for decoding the video latents and improves the final performance. Our model is capable of producing high-quality videos in merely one step, with the flexibility to perform multi-step refinement for further performance enhancement. Our quantitative evaluation on the OpenWebVid-1M benchmark shows that our model significantly outperforms existing methods. Notably, our 1-step performance (FVD 171.15) exceeds the 8-step performance of the consistency-distillation-based method AnimateLCM (FVD 184.79), and approaches the 25-step performance of advanced Stable Video Diffusion (FVD 156.94).
Mechanical Turk@MechMathTurk·
@_akhaliq Am I missing something or have the authors indeed forgotten to include the assessment of visual results in their manuscript?
AK@_akhaliq·
Qihoo-T2X An Efficiency-Focused Diffusion Transformer via Proxy Tokens for Text-to-Any-Task discuss: huggingface.co/papers/2409.04… The global self-attention mechanism in diffusion transformers involves redundant computation due to the sparse and redundant nature of visual information, and the attention map of tokens within a spatial window shows significant similarity. To address this redundancy, we propose the Proxy Token Diffusion Transformer (PT-DiT), which employs sparse representative token attention (where the number of representative tokens is much smaller than the total number of tokens) to model global visual information efficiently. Specifically, in each transformer block, we randomly sample one token from each spatial-temporal window to serve as a proxy token for that region. The global semantics are captured through the self-attention of these proxy tokens and then injected into all latent tokens via cross-attention. Simultaneously, we introduce window and shift window attention to address the limitations in detail modeling caused by the sparse attention mechanism. Building on the well-designed PT-DiT, we further develop the Qihoo-T2X family, which includes a variety of models for T2I, T2V, and T2MV tasks. Experimental results show that PT-DiT achieves competitive performance while reducing the computational complexity in both image and video generation tasks (e.g., a 48% reduction compared to DiT and a 35% reduction compared to Pixart-alpha).
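The proxy-token mechanism described above (one randomly sampled token per spatial-temporal window, self-attention among proxies, cross-attention back to all tokens) can be sketched in a few lines of numpy. This is a minimal single-head, unmasked illustration of the idea under my own assumptions, not the PT-DiT implementation; all function names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention (single head, no projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def proxy_token_attention(tokens, window=4, rng=None):
    """Sketch of one proxy-token block:
    1) randomly pick one proxy token per window,
    2) self-attention among the (much smaller) proxy set,
    3) cross-attention injects the global context into every token."""
    rng = np.random.default_rng(rng)
    n, d = tokens.shape
    # 1) one random proxy index per contiguous window of `window` tokens
    idx = np.array([rng.integers(i, min(i + window, n))
                    for i in range(0, n, window)])
    proxies = tokens[idx]                      # (n/window, d)
    # 2) global self-attention over proxies only
    proxies = attention(proxies, proxies, proxies)
    # 3) every latent token cross-attends to the proxies
    return attention(tokens, proxies, proxies)

x = np.random.default_rng(0).normal(size=(16, 8))
y = proxy_token_attention(x, window=4, rng=0)
print(y.shape)  # (16, 8)
```

The quadratic cost moves from n² token pairs to (n/window)² proxy pairs plus an n×(n/window) cross-attention, which is where the claimed FLOP reduction would come from.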
Mechanical Turk@MechMathTurk·
@jm_alexia To be honest, I do not really understand what the difference between the proposed CMMD and the well-known KID is (apart from the different kernel used). The authors mention KID in the Related Work section but do not evaluate against it.
Alexia Jolicoeur-Martineau@jm_alexia·
Alternative to FID: MMD distance of CLIP embeddings. Meanwhile, FID is the Wasserstein-2 distance of Inception features under the (incorrect) assumption that the features are Gaussian. This new metric is better correlated with quality and is easily extendable to different modalities.
Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

Rethinking FID: Towards a Better Evaluation Metric for Image Generation abs: arxiv.org/abs/2401.09603 This paper from Google Research proposes the use of CLIP MMD distance (CMMD) as an alternative to FID for text-to-image generation eval. CMMD does not make an assumption of normality like FID does (which is violated by the Inception embeddings anyway and causes problems), it is an unbiased estimator, it is sample-efficient, it better matches expected trends (e.g., distorting images reliably increased CMMD, unlike with FID), and it appears to better match human perception.

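The metric discussed in this thread reduces to a squared MMD between two sets of embeddings. Below is a minimal sketch of the standard unbiased MMD² estimator with a Gaussian RBF kernel, using random arrays as stand-ins for CLIP embeddings; the bandwidth choice here is arbitrary and not the one used in the CMMD paper.

```python
import numpy as np

def mmd2_unbiased(x, y, sigma=10.0):
    """Unbiased estimator of squared MMD with a Gaussian RBF kernel.
    x: (n, d), y: (m, d) arrays of embeddings (random stand-ins here)."""
    def rbf(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    kxx, kyy, kxy = rbf(x, x), rbf(y, y), rbf(x, y)
    n, m = len(x), len(y)
    # drop the diagonals so the within-set terms are unbiased
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * kxy.mean()

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(200, 8))
b = rng.normal(0.0, 1.0, size=(200, 8))   # same distribution as a
c = rng.normal(3.0, 1.0, size=(200, 8))   # shifted distribution
print(mmd2_unbiased(a, b))  # near zero
print(mmd2_unbiased(a, c))  # clearly positive
```

KID is the same unbiased MMD² estimator computed on Inception features with a polynomial kernel, which is presumably why the thread asks how CMMD differs beyond the kernel and embedding choice.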
Amazon Help@AmazonHelp·
@MechMathTurk We're sorry to learn about that. Have you already checked the spam folder in your e-mail account for an answer? -Sandra
Mechanical Turk@MechMathTurk·
Hello @AmazonHelp, my Amazon DE account is currently on hold, and I cannot log in on the website or in the app. I received an email from account-resolution(AT)amazon(DOT)de and replied with the requested document, but after 4 days I still haven't received any response.
AK@_akhaliq·
I am also on substack and publish a daily newsletter covering trending ai research papers and more: akhaliq.substack.com
Mechanical Turk@MechMathTurk·
@k_struminsky try COLMAP instead, it has a simple GUI in addition to the powerful CLI, and you can play with different settings incl. camera models to obtain the most visually pleasant point cloud
Kirill Struminsky@k_struminsky·
So I divided the photo into multiple patches and generated depth maps for each patch individually. The tricky part was to stitch the depth estimates over multiple overlapping views. Depth maps were neither calibrated nor consistent.
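Monocular depth estimates are typically only defined up to an unknown scale and shift per patch, so one common way to stitch overlapping patches is to fit an affine correction on the overlap by least squares. This is a hypothetical sketch of that standard trick, not the author's actual pipeline; the synthetic data stands in for overlapping depth values.

```python
import numpy as np

def fit_scale_shift(src, ref):
    """Least-squares (s, t) minimizing ||s*src + t - ref||^2 over the
    overlapping pixels, so the source patch can be mapped onto the
    reference patch's depth frame."""
    A = np.stack([src, np.ones_like(src)], axis=1)  # (n, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, ref, rcond=None)
    return s, t

rng = np.random.default_rng(0)
ref_overlap = rng.uniform(1.0, 5.0, size=100)  # reference depths on the overlap
src_overlap = 0.5 * (ref_overlap - 0.2)        # same region, wrong scale/shift
s, t = fit_scale_shift(src_overlap, ref_overlap)
print(s, t)  # recovers s = 2.0, t = 0.2, i.e. ref = s*src + t
```

Chaining these pairwise fits across patches (or solving them jointly) yields a globally consistent depth map even when the per-patch predictions are uncalibrated.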
Kirill Struminsky@k_struminsky·
With all the fuss around neural fields and multi-view reconstruction, I wondered how I could reconstruct a scene from a single photo.
Mechanical Turk@MechMathTurk·
@AxSauer @sedielem @jacoblee628 What do you mean by local batches? Is it just data that is processed on the particular GPU? You seem to reject the specific kind of "minibatch discrimination" that was employed in the StyleGAN2 discriminator, am I right?
Axel Sauer@AxSauer·
@sedielem @jacoblee628 you don't have to worry at all, we use a special variant of BN without running stats or affine params, and compute stats on local "virtual" batches. This way, we get rid of test time weirdness and the need for syncing gradients during training.
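One possible reading of the "virtual batch" variant described above can be sketched in numpy: batch norm with no running statistics and no affine parameters, where each local chunk of the batch is normalized by its own mean and variance at both train and test time. This is my own interpretation for illustration, not StyleGAN-T code.

```python
import numpy as np

def virtual_batch_norm(x, vbatch=8, eps=1e-5):
    """BN without running stats or affine params: each "virtual" batch
    of `vbatch` samples is normalized independently, so no statistics
    need to be synced across devices and test time behaves like train."""
    out = np.empty_like(x)
    for i in range(0, len(x), vbatch):
        chunk = x[i:i + vbatch]
        mu = chunk.mean(axis=0, keepdims=True)
        var = chunk.var(axis=0, keepdims=True)
        out[i:i + vbatch] = (chunk - mu) / np.sqrt(var + eps)
    return out

x = np.random.default_rng(0).normal(5.0, 3.0, size=(32, 4))
y = virtual_batch_norm(x, vbatch=8)
print(y[:8].mean(axis=0))  # per-feature mean within a virtual batch is ~0
```

Because statistics are purely local, a multi-GPU setup can treat each device's slice as its own virtual batch with no gradient syncing for normalization, which matches the motivation in the tweet.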
Sander Dieleman@sedielem·
Batch normalisation appears to be falling out of favour (probably for the best IMO, so many bugs end up being batchnorm bugs😬). One area where it persists is GAN discriminators (e.g. in StyleGAN-T and VQGAN). Are there any other settings where batchnorm is still hard to avoid?
Rafael Spring@Rafael_L_Spring·
edge device. And that's a wrap! I hope you enjoyed my engineer's deep dive through today's NeRF challenges and developments. If you are interested in 3D computer vision and graphics topics, please consider giving me a follow. Much❤️ #NeRF
Rafael Spring@Rafael_L_Spring·
NeRFs are getting attention these days! However, widespread adoption is still slow. But why? It comes down to a) file size and b) rendering. An engineer's viewpoint: TLDR: NeRFs are fundamentally a bad fit for today's edge device architectures. Let's explain that in detail: 1/
Ferenc Huszár@fhuszar·
Wow, this is very cool. Too early to say how useful this will prove, but I will definitely run some tests in my reading group course. explainpaper.com
Mechanical Turk retweeted
Taras Khakhulin@t_khakhulin·
📢📢 Multi-shot view synthesis from our group. We extend our previous StereoLayers for an arbitrary number of input images with blazing fast inference and high quality. Look at SIMPLI interactive demo samsunglabs.github.io/MLI/ Seconds for a new scene and rendering in a browser💥
John Ryan@john_pryan·
@MechMathTurk A different way to view the sampling is that we sample from our prior of where the light comes from along the ray (either sigma or a uniform), then we reweigh (using whatever sampling algorithm we want) by the exponential term.
Mechanical Turk@MechMathTurk·
@john_pryan However, AFAIK, there's no similar equation for E[f(X)] with a flavor of LOTUS. Therefore, you cannot use just the survival function to compute that expectation.
Mechanical Turk@MechMathTurk·
@john_pryan if you throw away the sigma multiplier, the equation is no longer a valid expression for the expected value of the r.v. The confusion may come from an alternative equation for the expectation of a non-negative r.v., E[X] = \int P(X > t) dt, which looks similar to the NeRF equation w/o sigma.
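The point of this thread can be checked numerically with a toy constant-density ray (a sketch under my own assumptions, not NeRF code): the survival-function identity E[X] = ∫ P(X > t) dt holds, but dropping the sigma multiplier from the density gives the wrong value for E[c(X)].

```python
import numpy as np

# Toy ray with constant density sigma0: the termination distance X is
# Exp(sigma0), with survival function T(t) = P(X > t) = exp(-sigma0*t)
# and density sigma0 * T(t).
sigma0 = 2.0
t = np.linspace(0.0, 40.0, 400_001)
dt = t[1] - t[0]
T = np.exp(-sigma0 * t)          # survival (transmittance) function
c = np.cos(t)                    # stand-in "color along the ray"

# Survival-function identity: E[X] = ∫ T(t) dt = 1/sigma0 = 0.5
E_X = np.sum(T) * dt

# E[c(X)] needs the full density sigma0*T(t); analytically this is
# sigma0^2/(sigma0^2 + 1) = 0.8 for c = cos.
E_cX = np.sum(sigma0 * T * c) * dt

# Dropping sigma gives ∫ T(t) c(t) dt = sigma0/(sigma0^2 + 1) = 0.4,
# which is NOT E[c(X)] -- the confusion described in the thread.
naive = np.sum(T * c) * dt

print(E_X, E_cX, naive)  # ≈ 0.5, 0.8, 0.4
```

So the survival-function shortcut is specific to E[X]; for a general E[f(X)] the sigma factor in the NeRF weights cannot be discarded.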