John Keiser
@jkeiser2
1.6K posts

Mad scientist / software engineer. In spare time, I design languages and unicycle with my kids. I helped start @CampQuestNW.

Seattle, WA · Joined July 2009
244 Following · 657 Followers

John Keiser retweeted
Adam Jacob@adamhjk·
System Initiative is the future of DevOps Automation. It's ready to replace Infrastructure as Code today. systeminit.com/blog-system-in… - sign up at systeminit.com!
System Initiative@thesysteminit

System Initiative is the future and it's ready for you to replace Infrastructure as Code. Read more about why on our blog, systeminit.com/blog-system-in…, and get started today: systeminit.com youtu.be/20vIl7I0R4I

John Keiser@jkeiser2·
@spacegho_st Yeah, true. Your use case seems to split up the file and process it in parallel, which implies you are copying parts of the file to other machines. If you controlled the producer you could even *generate* each machine's part locally and avoid copy cost.
John Keiser@jkeiser2·
@spacegho_st Parallel processing is harder if records split across chunks. Ensuring a clean split during JSON generation seems easiest. Surgically scanning for record boundaries when splitting could work too. Could also overlap chunks if records have max size.
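The "clean split" idea above can be sketched for the simplest case, newline-delimited JSON, where every record ends at a '\n'. This is a hypothetical helper (`split_ndjson` is not part of simdjson): each chunk boundary is advanced to the next newline so no record straddles two chunks.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Split newline-delimited JSON into roughly equal chunks without cutting a
// record in half: each tentative boundary is advanced to the next '\n'.
// Assumes n_chunks >= 1 and that every record ends with a newline.
std::vector<std::string> split_ndjson(const std::string& input, std::size_t n_chunks) {
    std::vector<std::string> chunks;
    std::size_t target = input.size() / n_chunks + 1;  // rough chunk size
    std::size_t start = 0;
    while (start < input.size()) {
        std::size_t end = std::min(start + target, input.size());
        // Scan forward to the record boundary so records never split.
        while (end < input.size() && input[end] != '\n') ++end;
        if (end < input.size()) ++end;  // include the newline itself
        chunks.push_back(input.substr(start, end - start));
        start = end;
    }
    return chunks;
}
```

Each chunk is then independently parseable, which is what makes handing them to separate workers (or machines) cheap.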
John Keiser@jkeiser2·
@spacegho_st We called it "on demand" in simdjson. It doesn't use schema explicitly, but takes advantage of the fact that most code that consumes JSON already encodes the schema (loop this array, read string with key "name" and int "age" ...).
John Keiser@jkeiser2·
@spacegho_st I suspect simdjson will beat most custom jobs up until the point where they are so big they need streaming parsing (which is your case). Parallel parsing is also an interesting case, and I've got thoughts on that but haven't had time to write them down in C++ :)
John Keiser@jkeiser2·
@justinschiebs Is it permissible to stub someone's toe as part of an art piece that inspires contentment, leading many people not to stub others' toes?
Justin Schieber@justinschiebs·
Is it possible for a weak moral reason to be permissibly overridden by a profound aesthetic good requiring it? For example, is it ok for one to predictably allow another to stub their toe so that some piece of profound and inspirational art can result?
John Keiser@jkeiser2·
@justinschiebs @GhostCoase Do you mean it might always be the less moral option, or that it depends? Your original question reads like "is it *ever* ok to prioritize aesthetic good over harm?" And if it depends, the answer to that is yes.
Justin Schieber@justinschiebs·
@GhostCoase I think that’s fair. But an act that prioritizes those mental states may nevertheless be the less moral of two options.
John Keiser@jkeiser2·
@PhilAndrew61181 @lemire Yep, if you access the keys in a different order than the json it will circle round. There isn't an explicit stack: outerObj and b are variables that keep track of their own start positions.
Phil Andrews@PhilAndrew61181·
@jkeiser2 @lemire One last question. Let's say I have `{a:1, b:{b1:11,b2:22},c:3,d:4}`. I do b["b1"], now I want to access a; will it look at c, go to the end, then circle around? Does the iterator have a stack so it can restart from the beginning of each depth? And if there's a write-up I'd look at it
Daniel Lemire@lemire·
On-demand JSON: A better way to parse documents?

JSON is a popular standard for data interchange on the Internet. Ingesting JSON documents can be a performance bottleneck. A popular parsing strategy consists in converting the input text into a tree-based data structure—sometimes called a Document Object Model or DOM. We designed and implemented a novel JSON parsing interface—called On-Demand—that appears to the programmer like a conventional DOM-based approach. However, the underlying implementation is a pointer iterating through the content, only materializing the results (objects, arrays, strings, numbers) lazily. On recent commodity processors, an implementation of our approach provides superior performance in multiple benchmarks. To ensure reproducibility, our work is freely available as open source software.

Several systems use On Demand: for example, Apache Doris, the Node.js JavaScript runtime, Milvus, and Velox.

On-demand JSON: A better way to parse documents? Software: Practice and Experience 54 (6), 2024. doi.org/10.1002/spe.33…
Software: github.com/simdjson/simdj…
John Keiser@jkeiser2·
@spacegho_st @lemire It's clever since it is still valid json. But the pretty printing step has to be serial, no? Are you going to process the pretty printed JSON more than once? Otherwise I'm not sure how you gain much ...
John Keiser@jkeiser2·
@spacegho_st @lemire Yep, your trick is essentially to store the depth of each value using the number of tabs before it. This particularly allows you to parallelize the parse since you can start anywhere and know what you are looking at (assuming your schema is simple enough).
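A minimal sketch of that property, assuming exactly one tab per nesting level: a worker dropped at an arbitrary byte can recover its nesting depth from the current line's leading tabs alone, without scanning from the start of the document. (`depth_of_line` is an illustrative helper, not simdjson code.)

```cpp
#include <cstddef>
#include <string>

// In pretty-printed JSON where each nesting level adds one leading tab,
// the tab count of a line equals its depth. A parallel worker starting at
// an arbitrary position `pos` recovers its depth locally.
std::size_t depth_of_line(const std::string& doc, std::size_t pos) {
    // Back up to the start of the line containing pos.
    std::size_t line_start = doc.rfind('\n', pos);
    line_start = (line_start == std::string::npos) ? 0 : line_start + 1;
    // Count leading tabs: that count is the nesting depth.
    std::size_t depth = 0;
    while (line_start + depth < doc.size() && doc[line_start + depth] == '\t') ++depth;
    return depth;
}
```

This is the sense in which the pretty-printed form lets a parser "start anywhere and know what it is looking at", provided the schema is simple enough to identify a value from its depth.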
John Keiser@jkeiser2·
@PhilAndrew61181 @lemire No, since you just finished reading the email, it starts looking for the "password" key starting from the comma (e.g. {"email": "...", "password": "..."})
John Keiser@jkeiser2·
@the_yamiteru @lemire When you use ondemand you get a cursor representing an object or array; jsonObj["email"] puts the cursor at the "," after the email, jsonObj["password"] scans from that point to find the password and puts the cursor at the "}" or "," after the password.
Yamiteru ☯︎@the_yamiteru·
@jkeiser2 @lemire But if I do something like get(jsonString, “email”) and later get(jsonString, “password”) then the parser has to loop the jsonString twice from the beginning, doesn’t it?
John Keiser@jkeiser2·
@PhilAndrew61181 @lemire It's not designed around a skeleton. ondemand myobject["key"] scans through the JSON (counting braces) until it finds "key" in the current JSON object, skips (and validates) the ":", and puts the parser at the start of the value so you can work with it.
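That scanning step can be illustrated with a toy, self-contained cursor. This is not simdjson's actual implementation (`find_value` is a hypothetical helper); it ignores escaped quotes, assumes well-formed input, and does not distinguish a depth-1 string *value* from a key, but it shows the brace-counting idea: nested objects are skipped because their keys sit at a deeper depth.

```cpp
#include <cstddef>
#include <string>

// Starting at the opening '{' of an object (obj_start), scan forward for
// `key` at the current depth, counting braces/brackets so that keys inside
// nested values are skipped. Returns the position of the value after the
// ':', or std::string::npos if the key is absent from this object.
std::size_t find_value(const std::string& json, std::size_t obj_start,
                       const std::string& key) {
    std::string quoted = "\"" + key + "\"";
    int depth = 0;
    bool in_string = false;
    for (std::size_t i = obj_start; i < json.size(); ++i) {
        char c = json[i];
        if (in_string) { if (c == '"') in_string = false; continue; }
        if (c == '{' || c == '[') { ++depth; continue; }
        if (c == '}' || c == ']') { if (--depth == 0) break; continue; }
        if (c == '"') {
            // Depth 1 means this string belongs to the current object.
            if (depth == 1 && json.compare(i, quoted.size(), quoted) == 0) {
                std::size_t colon = json.find(':', i + quoted.size());
                return json.find_first_not_of(" \t\n", colon + 1);
            }
            in_string = true;  // otherwise consume the string and move on
        }
    }
    return std::string::npos;
}
```

The point of the cursor design is that nothing is materialized along the way: the scan leaves you positioned at the start of the value, ready to parse just that value.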
Phil Andrews@PhilAndrew61181·
@lemire I must have read the performance doc because I did everything there. Is there anything written on the impl difference between DOM and ondemand? Does ondemand work by building a skeleton of the JSON, marking the key and a pointer to the start of the value string? (to decode later)
John Keiser@jkeiser2·
@clintonmead @lemire @etorreborre On Demand is a streaming dom, and will check the closing brace when you get to the end of the array or object you are reading. SIMD isn't really the trick here: On Demand reads whole documents faster than simdjson's SIMD DOM parser, even using the same SIMD code.
Clinton Mead@clintonmead·
Surely you have to parse the entire document at least initially to match opening and closing braces? I guess you can skip all the value parsing at this point if you just keep pointers to the opening/closing braces, and with some smarts you can use SIMD instructions to search for { and } (although you’ll presumably need to handle “ and \” also to deal with quoted braces). I guess this is quicker than a full parse though. Is this roughly what’s going on?
John Keiser@jkeiser2·
@PhilAndrew61181 @lemire In most cases dom is slower, because you have to iterate the whole document a second time (in its dom form) when you actually use it. dom can get faster when you *need* to iterate the document multiple times in a generic form.
Phil Andrews@PhilAndrew61181·
@lemire I should reread the GitHub docs, but if I need to process 100% of the data, is simdjson::dom::parser faster than the on demand variant?
John Keiser@jkeiser2·
@the_yamiteru @lemire To use a typical one-shot JSON parser, you have to iterate the document twice: first to create a generic object that can hold any JSON type, and second to retrieve "email" and "password" keys from that generic structure. On Demand lets you do it once, strictly faster.
Yamiteru ☯︎@the_yamiteru·
@lemire Let’s say I use this in a user login endpoint where I send email and password. Then I’d like to send both in a database query. Wouldn’t you say that in such a use-case getting email and password on-demand has the same performance or slower than just reading everything at once?
John Keiser@jkeiser2·
@adamhjk I'm really looking forward to seeing infrastructure management that actually cares about its user interface!
Adam Jacob@adamhjk·
It's all happening, friends. That's an early build of System Initiative from within System Initiative. That screenshot is the moral equivalent of your entire production IaC repository. We're almost ready for other people to use it for real.