
Lady Bumps and Data Dumps


Posted by 22111
Jul 22, 2022 at 08:46 AM

 

On the road again, and more about dumps - and, below, about getting wikipedia data onto your desktop (or wherever):

As said, I have NOT found a single adequate iPhone/iPad “app” for data dumps, but I did find one for Android which, on a cheap “smartphone”, gives immediate and visually pleasant “hit lists” (i.e. “filtering”) from dumps in the 1-million-char range, albeit for single search strings only; you’d therefore better put your dump lines in alphabetical order upon export, so that your “bergman, ingmar (year)” entries (searching by “bergman”; “bergm”, without the quotes, will suffice, too) follow each other in chronological order, and only afterwards do you get the movies with the actress, instead of a blunt mix.

You would have to do some scripting anyway, upon export, since diacritics, as you know, aren’t people fed up to the core after a 3-hour slide-show evening, but chars like ä or é, and searching for them on your virtual hand-held keyboard would be a nightmare; so replace ä>ae, é>e, ñ>n, ß>ss, etc., before the dump (a little sketch of this follows below) - you see here that just changing the default US keyboard to some - one specific - “national” layout would not resolve the problem…
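A minimal sketch of that pre-export folding - in Python here, not in the AutoHotkey I mention further down; the mapping table is just an illustrative subset, and the file names are made-up placeholders:

```python
# Fold diacritics to plain ASCII before exporting the dump, then sort the
# lines so that e.g. all "bergman, ingmar (year)" entries end up adjacent.
# The mapping is a deliberately incomplete, illustrative subset.
TRANSLIT = {
    "ä": "ae", "Ä": "Ae", "ö": "oe", "Ö": "Oe", "ü": "ue", "Ü": "Ue",
    "ß": "ss", "é": "e", "É": "E", "è": "e", "ê": "e", "ñ": "n", "Ñ": "N",
}

def fold_diacritics(line: str) -> str:
    for src, dst in TRANSLIT.items():
        line = line.replace(src, dst)
    return line

# "dump.txt" / "dump_ascii.txt" are placeholder names, not from the post.
with open("dump.txt", encoding="utf-8") as src, \
     open("dump_ascii.txt", "w", encoding="utf-8") as dst:
    dst.writelines(sorted(fold_diacritics(line) for line in src))
```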

Btw, I don’t know why people use 60-bucks tools for transmitting files to and from their hand-helds, or even pay for annual subscriptions to such tools - perhaps they’re “necessary” for appleware? - since for Android at least, your USB charging cable, connected to your PC, will do it - works fine in both directions - unless perhaps you charge your battery by “induction” now, or whatever they call it… All I know for sure is that proud iMobile users don’t like it, not at all, when you imply that, should they use their iTablet in the grocery store, other (grocery, not necessarily apple…) customers might take them for a clerk - well, wear a white coat then, and they’ll figure you for the manager!

Speaking of dumps here for quickly reviewing info, not for editing - for that you would need better soft- and better hardware, and such inputs, on a handheld’s virtual keyboard, are error-prone in my opinion; even for just new telephone numbers, it’s always a good idea to ring that number immediately, perhaps also in order to check whether the number’s owner (e.g. female, diverse?) committed an oral communication error (e.g. in the above-mentioned grocery situation) - but of course, you can also script the info the other way round, from the hand-held back to your (more or less “stationary”) Windows device.

As for two-way, I just read in some forum that between Scrivener and the .fdx format (Final Draft and others), it’s possible to transfer your data forth AND back, including comments (i.e. “ScriptNotes”), while I do this, from UR to FD and FI, one-way only - it obviously comes in enormously handy to have this two-way and out-of-the-box - pay attention though that most web comments re Scrivener refer to the Mac version, not to the heavily crippled Windows version, so the above probably doesn’t apply to the latter…
_______________

Now for wikipedia dumps; you might prefer to look it all up “online”, but in good ol’ Europe at least, more and more countries currently fall back to third-world standards, and governments think about heavily taxing “traffic” - not the traffic of human beings, mind you, but the electronic kind - so having “your” data at home, together with some good, heavy batteries, might come in handy for just about everybody soon. Whatever:

First, “national” dumps-without-pics are near 30 GB, and the English one is about 50 GB - “download while you can!”, hehe!

Then, you will have difficulty finding the “necessary” software to handle such data, and - speaking for the Windows club here only! - there are some “XML editors” out there, with prices near (or, incl. VAT, attaining) 4 digits, and you would prefer an “XML database” anyway, wouldn’t you?

Now there are several of those, even free ones, but then try to get the wiki data into them, let alone do the “necessary” indexing by the different “page” elements… and is that even necessary, in the end? Good luck to you; I failed, since I’m not willing to spend a fortnight on that “problem”. And yes, there is an out-of-the-box wikipedia db, called “wikitaxi”, whose developer really knows what he does: e.g. its size (after trouble-free import) is just a fraction of the original dump (about 35-40 p.c.), and the page title strings are indexed, so this specific search is instantaneous.

Unfortunately, the wikitaxi developer knows “too well” what he does, since - my assumption only - he deliberately (?) discarded any possibility to select and to copy (and there’s no comment functionality either). Dump import into your SQL or other general db then, with full-text search built up upon import, by SQLite or by the maker of the db? That’s possible in theory indeed, via the “work-flow” that follows, i.e. your general db (UR in my case) will probably offer automated import of text files within a folder (and its sub-folders), with every file (i.e. originally: wiki “page”) becoming an “item”.
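For completeness, if you went the SQLite route directly instead of a file-per-page import: a sketch of what that could look like (Python, not AutoHotkey; the table and file names are assumed placeholders, and FTS5 has to be present in your SQLite build):

```python
# Hedged sketch: store cleaned wiki "pages" in SQLite with FTS5 full-text
# search, as an alternative to importing millions of files into an outliner.
import sqlite3

con = sqlite3.connect("wiki.db")  # placeholder file name
con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(title, body)")

def add_page(title: str, body: str) -> None:
    # would be called once per extracted "page" during the split/clean pass
    con.execute("INSERT INTO pages (title, body) VALUES (?, ?)", (title, body))

add_page("Casablanca (film)", "Casablanca is a 1942 film ...")  # dummy example
con.commit()

# full-text hits are then (near-)instantaneous:
for (title,) in con.execute(
        "SELECT title FROM pages WHERE pages MATCH ? LIMIT 20", ("casablanca",)):
    print(title)
```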

On the other hand, importing multiple millions of files, together 100 or more GB, into your (even db-based) “outliner” would be an incredible “stress test” for it, and that’s not even speaking of the “answer times” after import, or of the fact that in some months you might be interested in “updating” your dump, i.e., technically, in fetching, and then processing, a brand-new dump; the maintenance of all this in a Postgres-backed “outliner” would be hassle-free, but there is no such thing, and my experience with SQLite(-based) “outliners” makes me avoid that adventure before even trying.

Hence: You will split those multi-GB dumps into single “pages” again, one file per “page”, i.e. you will get multiple millions of files, necessarily spread over a set of (just numbered) (sub-)folders, each one containing a set of perhaps 2,000 to 5,000 files (up to 5,000 each is reasonable in NTFS; modern Macs have some other file system whose characteristics I don’t know, but as said, I’m describing the Windows work-flow here anyway).

You would have, for example, d:\w for wikipedia, then d:\w\f for the French wikipedia, and in there, d:\w\f\1…d:\w\f\400, each with 5,000 files, 1.txt…5000.txt, or .xml or .w or whatever you like; you then set, in Windows, a default “app” for that suffix, for “Enter” on the file-system entry. Instead of 1…5000 in every one of the 400 folders, you might get 1…2000000 instead, according to your script or to the (free or paid) tool you use, or you might create 1,000 folders in d:\w\f, each with just 2,000 files, whatever.
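The numbering scheme is trivial to script; a sketch (Python here, with the example path and the 5,000-files-per-folder figure from above taken as given) of mapping a running page counter onto folder and file numbers:

```python
# Map a running page counter onto d:\w\f\<folder>\<file>.txt, 5,000 files
# per folder, as in the example layout above.
from pathlib import Path

ROOT = Path(r"d:\w\f")   # French wikipedia, per the example above
PER_FOLDER = 5000

def target_path(page_no: int) -> Path:
    folder = (page_no - 1) // PER_FOLDER + 1    # 1 ... 400 ...
    file_no = (page_no - 1) % PER_FOLDER + 1    # 1 ... 5000 within each folder
    sub = ROOT / str(folder)
    sub.mkdir(parents=True, exist_ok=True)
    return sub / f"{file_no}.txt"

print(target_path(1))        # d:\w\f\1\1.txt
print(target_path(12345))    # d:\w\f\3\2345.txt
```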

Now, how to split? I have not found any (even paid) tool which, instead of numbering the files, fetches the page titles and then names the files accordingly, be it with additional numbering or even without; in fact, any worthwhile numbering would be by the page IDs anyway. Btw, the page titles may contain chars not allowed in file names, so your script would have to replace them accordingly before trying to create the files. Also, none of the tools will delete the leading indentation spaces contained in the dumps, which technically are not needed for the xml construction - let alone discard unwanted metadata like redactors, revisions, etc. - but your own script could delete them easily, since that’s the “beauty” of well-formed xml: you just delete all lines between and including the opening tag and the closing tag of the unwanted element, e.g. if you want your “output” text somewhat “neater”.
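How such a “cleaning” pass might look, as a hedged sketch (Python again): leading indentation is stripped, every line from an unwanted element’s opening tag down to and including its closing tag is dropped - I use the <revision> block as an assumed example of such metadata - and page titles are made safe for Windows file names; the tag names and the forbidden-character set are my assumptions:

```python
# Strip the dumps' indentation, drop an unwanted element wholesale
# (lines between and including its opening and closing tag), and
# sanitize page titles for use as Windows file names.
import re

BAD_FILENAME_CHARS = re.compile(r'[\\/:*?"<>|]')   # forbidden in Windows names

def safe_filename(title: str) -> str:
    return BAD_FILENAME_CHARS.sub("_", title).strip()

def clean_page(lines: list[str],
               drop_open: str = "<revision>",      # assumed example element
               drop_close: str = "</revision>") -> list[str]:
    out, skipping = [], False
    for line in lines:
        stripped = line.lstrip()                   # discard leading indentation
        if stripped.startswith(drop_open):
            skipping = True
        if not skipping:
            out.append(stripped)
        if skipping and stripped.startswith(drop_close):
            skipping = False
    return out
```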

Thus, from the (paid or free) tools, you’ll get several millions of “page” files per dump, just numbered (or, in some cases, instead of 1…n, in the form a, aa, aaa, aaaa, aaab, etc.), all of them with all sorts of “content” parts you may not be interested in, and with leading spaces before the tags; you then run your own “cleaning”, and especially “meaningful-title”, script on millions of files (outer loop for the folders, intermediate loop for the files, inner loop for the lines… and then finally innermost loops for replacing within some lines, etc.) - technically, this is no problem at all, but this “work-flow” means writing millions of files (by the tool), then opening, changing, and saving again millions of files, one by one (by your script). (Some of the wiki pages being titled identically, you will need some lines of additional code, checking whether the intended file name already “exists” (in that sub-folder) as a homonym, and then adding “order numbers” (i.e. 1, 2, 3…) if necessary.)
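The homonym check from the parenthesis above amounts to a handful of lines, e.g. (Python sketch; “base” is the already-sanitized page title, while the “.txt” suffix and the numbering style are assumptions):

```python
# If a file of that name already exists in the target sub-folder,
# append a running "order number" until the name is free.
from pathlib import Path

def unique_path(folder: Path, base: str, suffix: str = ".txt") -> Path:
    candidate = folder / (base + suffix)
    n = 1
    while candidate.exists():          # homonymous page already written
        n += 1
        candidate = folder / f"{base} {n}{suffix}"
    return candidate
```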

Thus, needing your script anyway… why not do it “better”? Ideally, you could run a script on your 50 GB dump, reading line-by-line, then creating the necessary, already-“cleaned” files, and that might be possible indeed. I, using Autohotkey, cannot do that, since the smartest = fastest and most reliable ways of doing this in there don’t allow for reading into anything but variables (i.e. not files) of less than 1 GB, forcing me to begin by splitting the dumps - not in Autohotkey - into multiple such chunks.

With a paid tool, you can do just that: set a limit of less than 1 GB, then have the tool split “enough” “pages” into each chunk, in order to come as near as possible to the limit without exceeding it; for a 50 GB dump, you’ll thus get 51 chunks, and then you run your script on these 51 files, similarly to the description above; it’s just 51 source files for reading line-by-line, rather than millions.
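What that splitting amounts to, sketched in Python (assuming, as in the MediaWiki dumps, that each “page” sits between <page> and </page> lines; the chunk naming and the safety margin below 1 GB are my own choices):

```python
# Accumulate whole <page> ... </page> blocks into chunk files, starting a
# new chunk whenever the next page would push the current one past the limit.
LIMIT = 900 * 1024 * 1024        # stay safely below the 1 GB working limit

def split_dump(dump_path: str, chunk_prefix: str = "chunk") -> None:
    chunk_no, chunk_size = 1, 0
    out = open(f"{chunk_prefix}_{chunk_no:03}.xml", "w", encoding="utf-8")
    page: list[str] = []
    with open(dump_path, encoding="utf-8") as dump:
        for line in dump:
            page.append(line)
            if line.lstrip().startswith("</page>"):           # complete page collected
                size = sum(len(l.encode("utf-8")) for l in page)
                if chunk_size and chunk_size + size > LIMIT:  # would overflow: new chunk
                    out.close()
                    chunk_no, chunk_size = chunk_no + 1, 0
                    out = open(f"{chunk_prefix}_{chunk_no:03}.xml", "w", encoding="utf-8")
                out.writelines(page)
                chunk_size += size
                page = []
    out.writelines(page)         # whatever trails the final </page>
    out.close()
```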

You can do something similar with free tools, but among them I haven’t found any that will do as well, since the ones I found will either set the limit by lines (but that risks exceeding the (here, for Autohotkey: 1 GB) size limit) while fetching complete “pages” (as the paid tool above does), or will set the limit by, here, 1 GB, but then fill up the chunks with as many lines as fit, without taking care to split only after a complete “page”; thus, in the first alternative, you will have to set the line limit low enough not to exceed the (not settable) size limit (which will multiply the chunks), and in the second alternative, you will have a minimum number of chunks, but your script code gets somewhat “complicated”, since your “page” loop crosses the (chunk-)file loop.

So I had settled for the first alternative, and the free split tool split 23 GB into 55 chunks in less than 10 minutes (on hdd); then I ran my (up to now, just “cleaning, analyzing and target-file-creating”) script on one of those chunks, from which (i.e. 450 MB read into a var, then working from that var) it neatly created 110,000 correctly processed and meaningfully named files (in 55 sub-folders à 2,000 files), again in less than 10 minutes. I’ll now write the complete script for the situation where the split tool creates, for 23 GB, just 24 chunks, with “pages” overlapping two chunks, and this will then run, for processing the whole 23 GB and not counting the 10 minutes for the chunk creation, about 7 to 8 hours (the “pages” in the dumps are not alphanumerically ordered, and the “pages” within the first chunks tend to be much “longer”, i.e. those contain, at equal size, far fewer “pages”, hence the high “page” number mentioned above for a “later” chunk: 110,000 for “just one” out of in fact 52 “and a half”).

Then, the big moment - as said, I have already created 110,000 (“final”, not “dummy”) files like that, and “trialed” them: All the power of Voidtools’ “Everything” (even from the command line if need be, and incl. regex and all) will be available on these - combined or distinct - “sets”, via their file names = “page” titles, and if you buy some indexing search tool (in which case you should name or rename your files to “.txt”, or even “.xml”, in order to perhaps take advantage of the tool’s xml categorization functionality?), even the files’ contents will be indexed, i.e. available instantly.

I think I’ll be happy with “Everything”‘s power re the meaningful titles (i.e. file names), with which not only “stored searches” are possible, but also the building of your own “collections”, e.g. by automatic “renaming”, i.e. bulk-adding some “collection” code (e.g. ” .ar”) to just the currently selected (sic!) files within any search result.
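(If you’d rather do that bulk-“renaming” by script instead of inside “Everything”: a trivial sketch - Python, with the file list and the ” .ar” code as assumptions - would be:)

```python
# Append a "collection" code to each file name in a given list, e.g. a list
# of the files currently selected in an "Everything" result.
from pathlib import Path

def tag_collection(file_list: list[str], code: str = " .ar") -> None:
    for name in file_list:
        p = Path(name)
        p.rename(p.with_name(p.stem + code + p.suffix))

# tag_collection([r"d:\w\f\3\Casablanca (film).txt"])  # placeholder path
```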

Needless to say, for “people on the road” (nowadays: “active people”), any (even 500-bucks - “Everything” is that fast!) laptop or even slate will do, but Windows it must be: Applers’ mileage shall vary.