UPDATE: I found a better solution. This post is retained for historical purposes.
When Yahoo! Answers shut down, they offered me a chance to download all my personal data, which I gladly took. The meat of the data was in two JSON files: one for my questions posed, and one for my answers given. The latter one was huge, almost 50 MB. Had they simply supplied the file in XML format, I could have pulled it into a spreadsheet directly, and this whole quest would not exist.
Quest: To pull all my data into an Excel (or LibreOffice Calc) spreadsheet for easy browsing, without paying a cent.
A quick internet search suggested that Excel could import JSON directly. That turned out to be a relatively new feature only available in the subscription-based Office 365. Even the public library only had Excel 2016.
There are also numerous online converter tools – some export a spreadsheet format directly, and some export CSV or XML, which CAN be imported. Some simply stall or throw errors. They all have a size limit, though – the largest I saw was 40 MB. I get it: no company wants me to use their CPU, storage, and internet bandwidth in exchange for showing me a few ads in the sidebar. But why isn’t there some freeware or open-source tool that I could download and run on my own computer? There are Python libraries to handle JSON and XML, but writing a program seemed like too much trouble for a one-time conversion. What I do have is some Linux skills.
My plan is to split the file and process the two segments separately. Anyone who has found a fantastic bargain at a store that said, “Limit 1 per visit” should be well aware of this strategy.
To get a sense of what the file looked like, I tried pulling it into vi, but since the file was not pretty-printed, it was basically just one long 49.4 MB line, and vi is not the tool for that. gedit (the Notepad of Linux) was even worse; it simply hung. Even the head command was useless – it tried to display line 1, which is the whole file. All right, let’s go primitive, then.
$ dd bs=1 count=1500 if=answers.json
[{"id":"20210418044641AADzDlq","category":{"name":"Polls \u0026 Surveys"},"title":"On your last day will you walk out into the sun only to discover the sun on your last day?","detail":"","state":"OPEN","userAnswer":{"id":"Stll9OJPx1pUT74XN2Rugve2Asxh5vniSDnEb3MTpIrEWqqoecWVrJs\u003d","qid":"20210418044641AADzDlq","text":"Hopefully, I will have enjoyed the sun and other beautiful things long before my time comes, but even if I missed something earlier, I\u0026#39;d count it as a bonus if it came on my last day.\n\nGratitude, always.","isBestAnswer":false,"ratings":{"thumbsUp":1,"thumbsDown":0},"commentsCount":0},"starCount":0,"answerCount":4,"userQuestionRelation":{"hasAnswered":true,"hasFlagged":false,"hasVoted":false,"hasStarred":false,"hasAddedToWatchlist":false,"canAnswer":false,"canVote":false,"canStar":true,"canFlag":true,"canEditCategory":false,"isContact":false,"hasBlocked":false},"ratings":{"thumbsUp":0,"thumbsDown":0},"created":"Apr 18, 2021 4:46:41 AM"},{"id":"20210417123737AAHBFJJ","category":{"name":"Hair"},"title":"Did my dog notice when I dyed my hair?","detail":"My hair has always been very dark. I recently bleached my hair and now it is almost white. Did my dog notice? If so, does she think I am a different person or does she know that I am still me? ","state":"OPEN","userAnswer":{"id":"t8J6RxhPx1pdlkBx4i7sMm1SZIjLqVHV6PJD0UqP5Y7UodALsAny8Fk\u003d","qid":"20210417123737AAHBFJJ","text":"Your dog probably goes by scent, so still recognizes you, just as we still
I see that the above file has no schema included; it’s just raw data. Good. The file begins with a [, and a subsequent tail command showed me that it ends with a ].
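For what it’s worth, the same peek can be done in a couple of lines of Python without loading the whole file – just a sketch, not something I actually needed:

# Peek at the first and last byte of a big file without reading all of it.
with open("answers.json", "rb") as f:
    first = f.read(1)      # b'[' -- so it's one top-level JSON array
    f.seek(-1, 2)          # 2 = os.SEEK_END; jump to the last byte
    last = f.read(1)       # b']'
print(first, last)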
I want to split my file into two files of about 25 MB each, to stay under the 40 MB limit. To know where to cut, my clues are that the beginning of an object that I want to become a row looks like {"id":"20210418044641AADzDlq","category": and the end of that object looks like "created":"Apr 18, 2021 4:46:41 AM"}. The object boundary appears in the middle of the output below:
$ dd if=answers.json bs=1 skip=25000000 count=1500
16PWDNcr-oYiFnaMG-exAvHyn86px7vRI\u003d","qid":"20130223145623AA18Yr8","text":"I really doubt it, but you can no doubt find brand new 5.25 drives on eBay or other surplus sources.","isBestAnswer":false,"voteCount":0,"ratings":{"thumbsUp":1,"thumbsDown":0},"commentsCount":0},"starCount":0,"answerCount":2,"userQuestionRelation":{"hasAnswered":true,"hasFlagged":false,"hasVoted":false,"hasStarred":false,"hasAddedToWatchlist":false,"canAnswer":false,"canVote":false,"canStar":false,"canFlag":true,"canEditCategory":false,"isContact":false,"hasBlocked":false},"ratings":{"thumbsUp":0,"thumbsDown":0},"created":"Feb 23, 2013 2:56:23 PM"},{"id":"20130223082531AAKyhax","category":{"name":"Engineering"},"title":"voltage used when device is off?","detail":"name of voltage being used by a device when it is off","state":"REFERENCE","userAnswer":{"id":"xkZ5WuFPx1r5LyPgnKo1gnAcTey1wipzUvopA7jN-VAnVG0XicrgUCs\u003d","qid":"20130223082531AAKyhax","text":"You mean parasitic power or parasitic wattage?\r\n\r\nIt varies by device.\r\n\r\nAnything with a physical on/off switch and no remote control will use an unmeasurably small amount of power.\r\n\r\nModern computers, monitors, TV\u0026#39;s, and chargers use less then 0.5 watt when off. Older ones use more.\r\n\r\nSome devices, like satellite boxes, are never really turned off, even if they appear to be.","isBestAnswer":false,"voteCount":0,"ratings":{"thumbsUp":0,"thumbsDown":0},"commentsCount":0},"starCount":0,"answerCount":8,"userQuestionRelati1500+0 records in
1500+0 records out
1500 bytes (1.5 kB, 1.5 KiB) copied, 0.00709396 s, 211 kB/s
I was too lazy to count how many characters it was to the boundary, which was the comma in PM"},{"id", so I did trial and error with the value of skip= until I found the very beginning of an object:
$ dd if=answers.json bs=1 count=1000 skip=25000635
{"id":"20130223082531AAKyhax","category":{"name":"Engineering"},"title":"voltage used when device is off?","detail":"name of voltage being used by a device when it is off","state":"REFERENCE","userAnswer":{"id":"xkZ5WuFPx1r5LyPgnKo1gnAcTey1wipzUvopA7jN-VAnVG0XicrgUCs\u003d","qid":"20130223082531AAKyhax","text":"You mean parasitic power or parasitic wattage?\r\n\r\nIt varies by device.\r\n\r\nAnything with a physical on/off switch and no remote control will use an unmeasurably small amount of power.\r\n\r\nModern computers, monitors, TV\u0026#39;s, and chargers use less then 0.5 watt when off. Older ones use more.\r\n\r\nSome devices, like satellite boxes, are never really turned off, even if they appear to be.","isBestAnswer":false,"voteCount":0,"ratings":{"thumbsUp":0,"thumbsDown":0},"commentsCount":0},"starCount":0,"answerCount":8,"userQuestionRelation":{"hasAnswered":true,"hasFlagged":false,"hasVoted":false,"hasStarred":false,"hasAddedToWatchlist":false,"canAnswer":false,"canVote":1000+0 records in
1000+0 records out
1000 bytes (1.0 kB) copied, 0.00347224 s, 288 kB/s
That means I want the first 25000634 bytes of the file to be the first part (not 25000635, because I want to leave out the comma between the two sections), and the remaining bytes of the file to form the second part, starting at byte 25000636 (skip the first 25000635 bytes) – again, to omit the leading comma.
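In hindsight, the trial and error could have been skipped with a few lines of Python that hunt for the boundary near the 25 MB mark. A rough sketch, with a hypothetical find_cut helper – it assumes the byte pattern },{"id": only ever shows up between top-level objects, which appears to hold for this data but isn’t guaranteed by JSON in general:

# Find the comma between two top-level objects at or after a target offset.
def find_cut(path, target, window=1_000_000):
    with open(path, "rb") as f:
        f.seek(target)
        chunk = f.read(window)        # a 1 MB window is plenty; the objects are small
    i = chunk.find(b'},{"id":')       # boundary between two top-level objects
    if i < 0:
        raise ValueError("no object boundary found in window")
    return target + i + 1             # offset of the comma itself

print(find_cut("answers.json", 25_000_000))   # 25000634 for my file, matching the dd result above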
$ dd bs=1 if=answers.json count=25000634 of=p1   # this will be part 1
25000634+0 records in
25000634+0 records out
25000634 bytes (25 MB, 24 MiB) copied, 151.915 s, 165 kB/s
$ echo -n "]" >> p1   # append closing bracket to file
$ mv p1 part1.json
$
$ dd if=answers.json bs=1 skip=25000635 of=p2   # this will be part 2
24407040+0 records in
24407040+0 records out
24407040 bytes (24 MB, 23 MiB) copied, 147.714 s, 165 kB/s
$ echo -n "[" > part2.json   # prepend opening bracket
$ cat p2 >> part2.json
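For the record, the split itself could also be done in Python, which would avoid dd’s painfully slow bs=1 byte-at-a-time copying (151 seconds per half, above) – again just a sketch, reusing the comma offset found earlier:

# Split the array at the comma between two top-level objects (offset 25000634).
cut = 25_000_634

with open("answers.json", "rb") as src:
    part1 = src.read(cut)          # everything before the comma
    src.seek(cut + 1)              # skip the comma
    part2 = src.read()             # the rest, which already ends with ]

with open("part1.json", "wb") as out:
    out.write(part1 + b"]")        # close the first array

with open("part2.json", "wb") as out:
    out.write(b"[" + part2)        # open the second array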
Once I got the file split, it took maybe six minutes to convert each part, five minutes of which was just uploading, since my internet at home has 1 Mbps upload at best. I assumed the file had been received once network activity dropped to almost nothing; from that point, it was about a minute until the website said my output was ready. Downloading the resulting xlsx files took maybe 10 seconds each.
I then pulled each xlsx file into its own spreadsheet and cut-and-pasted all the rows from the second one onto the first one, making one long sheet. I still had to do some fixup for HTML junk that had been passed straight through in the JSON, like replacing &#39;, &quot;, &amp;, &lt;, &gt; and so forth with their ordinary characters.
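Had I felt like scripting that part too, Python’s built-in html module already knows all of those entities – a sketch on one string from my own data:

import html

# The converter passed HTML entities straight through into the cells.
raw = "I&#39;d count it as a bonus if it came on my last day."
print(html.unescape(raw))   # I'd count it as a bonus if it came on my last day.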
Note that this same technique could be used to break the JSON file into more parts if the file were bigger or the conversion site’s limit lower. This was just a one-off conversion, so it wasn’t worth writing a program to automate the process. But it would probably be a simple matter to write a program that counts the number of top-level objects in the JSON file, then offers to split at every N objects. Maybe I’ll write that, just for fun. Another approach would be a standalone program to convert the JSON to XML, which Excel 2016 or LibreOffice Calc will take as input. That would be a slightly harder program, but not too bad, if I didn’t want to pretty-print the result. If I write either of those programs, I’ll post them here. Or maybe you already wrote them? Drop me a line. I know there are commercial suites that will do all that and more, but I’m looking for free.
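For the curious, here is a rough, untested sketch of what that first program might look like (split_array is just a name I made up). It loads the whole top-level array – fine at 50 MB – reports the object count, and writes a new file for every N objects:

import json
import sys

def split_array(path, per_file):
    with open(path, encoding="utf-8") as f:
        objects = json.load(f)                      # the whole top-level array
    print(f"{len(objects)} top-level objects")
    for n, start in enumerate(range(0, len(objects), per_file), start=1):
        name = f"part{n}.json"
        with open(name, "w", encoding="utf-8") as out:
            json.dump(objects[start:start + per_file], out)
        print(f"wrote {name}")

if __name__ == "__main__":
    split_array(sys.argv[1], int(sys.argv[2]))      # e.g.  python split.py answers.json 2000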