Learnings from Petabytes of Video Data
We were one of the first egocentric data companies. A few months ago we pivoted out of the video selling space (research pulled us away), but we wanted to dump everything we learned — because this is the post I wish existed when we started.
Diversity and continuity is the name of the game
The single most important thing we learned: diversity and continuity matter more than anything else. Not just "some variety." Way more diversity than anyone imagines. Labs are still experimenting with what drives the biggest model lift, and diversity is how you implicitly prove two things: that you have real scale, and that you actually care about what's being fed into the model. Nail this and you'll land recurring contracts that are extremely hard for competitors to displace.
Monocular video is everywhere — and almost nobody is making it useful
Making monocular video useful enough to sell for $100-200/hour requires figuring out full-body estimation from sparse information. Doing that research yourself, or hiring someone to do it, is worth the effort, especially in an industry that is commoditizing quickly. Many customers want the ability to estimate skeletons from egocentric video and motion, but there's no existing research in this space (it took us ~5 months to figure this out). I recommend all founders hire researchers early to solve this problem, because it makes the process of selling much easier, even if it takes longer initially.
The most scalable way to collect egocentric data is through a phone
A lot of people in this space over-index on custom hardware, but our experience was that if you design the collection flow well, you can solve a surprising amount through vision and product design alone. For example, if someone's hands drift out of frame, you can prompt them in the moment. You do not necessarily need more sensors, more complicated rigs, or a huge hardware rollout to get useful data. That made the whole system much lighter operationally (relying on people to manually go out and collect SD cards, ingest, etc. was a massive cost overhead that could just be avoided altogether).
Using phones also let us index heavily on diversity, which customers really valued. Instead of being constrained to one facility, one geography, or one hardware setup, you can leverage a global network of people recording diverse tasks in diverse real-world environments. In practice, labs did not just want more footage. They wanted variation in people, settings, workflows, edge cases, and failure modes. For us, the phone was the only practical way to unlock that combination of scale and diversity.
Moreover, some of our customers are moving away entirely from general tabletop manipulation, so figuring this out becomes critical. If you want to get anywhere near a million hours of data per month, we think this is the only viable path.
Annotations deserve months of R&D to get right
Drift in occluded joints, passing videos through Gemini for text annotations, using Gemini to annotate failure and recovery segments, relying on generic models like fine-tuned MediaPipe / HaMeR for hands, failing to collect accurate metadata such as a contributor's height, strapping IMUs to arms without reconstructing the head to get an accurate upper body: all of these produce terrible data. Most data collection companies don't realize it until months later, when their data is actually ingested into a model (yes, there's a lag between a recurring data purchase and quality feedback, and that feedback is usually hard to get).
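As one concrete example, occluded-joint drift is cheap to flag automatically: a joint the estimator marks as occluded should not teleport between frames. This is a hypothetical check with an illustrative threshold, not our exact pipeline:

```python
# Hypothetical QA check: flag frames where an occluded joint's estimated
# position jumps implausibly far since the previous frame (likely drift).

import math

def drift_flags(track, max_step=0.08):
    """track: list of (x, y, occluded) tuples per frame, coords normalized.
    Returns indices of frames where an occluded joint moved more than
    `max_step` in one frame."""
    flags = []
    for i in range(1, len(track)):
        x0, y0, _ = track[i - 1]
        x1, y1, occluded = track[i]
        if occluded and math.hypot(x1 - x0, y1 - y0) > max_step:
            flags.append(i)
    return flags
```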
SLAM techniques also tend not to work well on egocentric data because of the lack of initial texture and calibration. It's easier than ever to start a data company, but harder than ever to prove the quality of your annotations, and a few cloned or fine-tuned models are unlikely to handle egocentric video well or deliver the quality and consistency of points required for a production-grade robot environment.
Most labs would kill for clip-level quality reports, but when I was in the market for egocentric data, clip-level QA was rare. All of this stuff requires an insane amount of infrastructure that most companies don't even fully realize they need to build in order to sell to a lab. The bar for consistency is much higher than many people expect.
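To make "clip-level quality report" concrete, here is a minimal sketch of the kind of per-clip artifact we wished more vendors shipped. The field names and thresholds are illustrative:

```python
# Sketch of a clip-level QA report: one structured record per clip,
# with an explicit pass/fail against stated thresholds.

import json
from dataclasses import dataclass, asdict

@dataclass
class ClipReport:
    clip_id: str
    duration_s: float
    pct_hands_visible: float   # fraction of frames with hands in view
    pct_annotated: float       # fraction of frames with accepted labels
    drift_events: int          # count of automated drift flags
    passed: bool = False

    def evaluate(self, min_visible=0.9, min_annotated=0.95, max_drift=3):
        self.passed = (self.pct_hands_visible >= min_visible
                       and self.pct_annotated >= min_annotated
                       and self.drift_events <= max_drift)
        return self.passed

report = ClipReport("clip_0001", 62.4, 0.94, 0.97, 1)
report.evaluate()
print(json.dumps(asdict(report), indent=2))
```

The point is less the specific fields than that every clip ships with machine-readable evidence of quality the lab can audit.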
Scoping tasks matters more than the actual data collection
You get much better data when you spend real time understanding exactly what the customer wants and then designing collection around that, rather than collecting a huge generic dataset and hoping it maps cleanly to demand later. The more we talked to customers, the more obvious this became. Small changes in task design, recording instructions, annotation scope, or environment constraints could make the difference between data that was useful and data that was hard to sell.
A lot of customers also wanted role-play or highly specific behavioral data, where they needed someone to perform an action in a particular way, repeatedly, and with consistency. That sounds simple until you actually try to operationalize it. Getting large numbers of people to, say, make coffee while following the same sequence over and over again is much harder than it looks (and requires someone to guide them through steps, which can only be done with an app). Standardization of formats across recordings becomes a real challenge very quickly.
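The app-guided flow above can be sketched as a simple step sequencer: the app shows one step at a time, logs when each was confirmed, and those timestamps become segment boundaries in the annotation file. The step list and confirmation mechanism here are illustrative:

```python
# Sketch of a guided-recording flow: walk a contributor through a fixed
# sequence of steps so every recording follows the same protocol.

class GuidedTask:
    def __init__(self, steps):
        self.steps = steps
        self.index = 0
        self.log = []  # (step, timestamp) pairs for the annotation file

    @property
    def current(self):
        return self.steps[self.index] if self.index < len(self.steps) else None

    def confirm(self, timestamp):
        """Contributor (or an on-device model) confirms the current step.
        Returns the next step to display, or None when the task is done."""
        if self.current is None:
            raise RuntimeError("task already complete")
        self.log.append((self.current, timestamp))
        self.index += 1
        return self.current

coffee = GuidedTask(["grind beans", "boil water", "pour over", "serve"])
while coffee.current:
    coffee.confirm(timestamp=10.0 * (coffee.index + 1))
```

Forcing every contributor through the same state machine is what makes thousands of recordings of "make coffee" actually comparable.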
Maximize data and quality without increasing costs
A mobile app gives you much more than video. You can capture audio, have people narrate what they are doing, and also use phone-native signals like IMU or LiDAR which you can now get for free. But making that experience work well is not trivial — running lightweight models on-device, keeping battery drain manageable, handling thermal issues, making uploads reliable, preventing dropped sessions, maintaining compliance over long recordings, and giving workers enough real-time feedback without making the product annoying.
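Of the engineering problems above, reliable uploads are the most mechanical to sketch: chunk the recording and retry transient failures with jittered exponential backoff. `upload_chunk` is a stand-in for whatever storage API you use, and the retry parameters are illustrative:

```python
# Sketch: upload chunks with jittered exponential backoff on transient
# failures, so flaky mobile connections don't drop whole sessions.

import random
import time

def upload_with_retry(chunks, upload_chunk, max_attempts=5, base_delay=1.0,
                      sleep=time.sleep):
    """Try each chunk up to `max_attempts` times; returns indices of
    chunks that never succeeded (to resume later)."""
    failed = []
    for i, chunk in enumerate(chunks):
        for attempt in range(max_attempts):
            try:
                upload_chunk(i, chunk)
                break
            except ConnectionError:
                if attempt == max_attempts - 1:
                    failed.append(i)
                else:
                    sleep(base_delay * 2 ** attempt + random.random() * 0.1)
    return failed
```

Returning the failed indices instead of raising is the important bit: the session stays resumable rather than being thrown away.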
The recording experience needs to be airtight
We also found that bounty systems consistently produced better data than non-bounty systems. When contributors understood exactly what was valuable and were rewarded for specific tasks or edge cases, the quality was meaningfully higher. You got better compliance, better coverage of desired behaviors, and less unusable footage. Open-ended collection sounds appealing in theory, but in practice the best results came from tightly scoped asks tied to clear incentives.
Bonus tip: get your recorders to voice-annotate what they're doing. Not only is this interesting voice data in its own right, it's effectively free and gives you real ground truth rather than Gemini-generated annotations. Also go through the process of fine-tuning your own model (Qwen!) if you care about good ground-truth annotations. Your customers do.
What we're doing now
Overall, one of our biggest takeaways is that the hard part is not just getting people to record video. The hard part is building all the infrastructure around collection, QA, annotation, packaging, and customer-specific delivery. A lot of companies do not realize how much of that stack they need until they are already deep into trying to sell to a real lab with hard quality demands and large volume needs.
We are stepping away from building this as a data business because we found the research problems more interesting. But we did build a lot of infrastructure that may be useful to other companies with operational workforces who want to collect and sell egocentric data — it was used by tens of thousands of people last year.
We are licensing the internal processing pipeline, QA systems, and app stack we built so other teams don't have to recreate all of this from scratch. We want to make it easier for folks to sell into big labs and quickly turn massive operational capacity into revenue.
Happy to share what we learned if it's useful. And if you are actively trying to operationalize a workforce for data collection, or think the infrastructure we built would save you a lot of time, feel free to reach out.