I need to retrieve an object at a specific index from a massive JSON array. The array contains 2,000,000 objects and the file size is around 5GB.
I've experimented with various approaches using jq in combination with Python, but performance remains an issue. Here are some of the methods I've tried:
Direct indexing:
jq -c '.[100000]' Movies.json
Slurping and indexing:
jq --slurp '.[0].[100000]' Movies.json
Using nth():
jq -c 'nth(100000; .[])' Movies.json
While these methods seem to work, they are too slow for my requirements. I've also tried using streams, which significantly improves performance:
jq -cn --stream 'nth(100000; fromstream(1|truncate_stream(inputs)))' Movies.json
However, as the index increases, so does the retrieval time, presumably because the stream still has to parse every entry that precedes the target.
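Since I'm already combining jq with Python, I also sketched the same linear scan with the third-party ijson library (assuming it's available, e.g. via pip install ijson; nth_item is just a name I made up). It behaves like --stream: memory stays flat, but the cost still grows with the index.

import itertools
import ijson  # third-party: pip install ijson

def nth_item(path, index):
    # The prefix "item" matches each element of the top-level array.
    # ijson parses incrementally, so memory use stays constant,
    # but every element before `index` is still parsed and discarded.
    with open(path, "rb") as f:
        items = ijson.items(f, "item")
        return next(itertools.islice(items, index, index + 1), None)

movie = nth_item("Movies.json", 100000)

This suggests the bottleneck is the linear scan itself rather than jq in particular.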
I understand that one option is to divide the file into chunks, but I'd rather avoid creating additional files by doing so.
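One alternative I've been considering instead of chunking is a single pass that records the byte offset of each top-level object and keeps those offsets in memory, so later lookups can seek directly without writing anything to disk. A rough standard-library-only sketch of the idea (build_offsets and get_item are hypothetical names, and the 1 MiB read sizes are guesses):

import json

def build_offsets(path):
    # One linear pass over the raw bytes, tracking string/escape
    # state and bracket depth; records where each top-level object
    # (a '{' seen at depth 1) begins. Nothing is written to disk.
    offsets = []
    depth = 0
    in_string = False
    escaped = False
    pos = 0
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):
            for b in chunk:
                c = chr(b)
                if in_string:
                    if escaped:
                        escaped = False
                    elif c == "\\":
                        escaped = True
                    elif c == '"':
                        in_string = False
                elif c == '"':
                    in_string = True
                elif c in "{[":
                    if depth == 1 and c == "{":
                        offsets.append(pos)
                    depth += 1
                elif c in "}]":
                    depth -= 1
                pos += 1
    return offsets

def get_item(path, offsets, index):
    # Seek straight to the recorded offset and decode one object.
    # Assumes a single object fits in 1 MiB; raw_decode stops at
    # the end of the first complete JSON value.
    with open(path, "rb") as f:
        f.seek(offsets[index])
        text = f.read(1 << 20).decode("utf-8", errors="replace")
        return json.JSONDecoder().raw_decode(text)[0]

offsets = build_offsets("Movies.json")            # slow, but only once
movie = get_item("Movies.json", offsets, 100000)  # then near-instant

The pure-Python pass over 5GB is itself slow, but it only runs once; after that, every lookup is a seek plus one small parse.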
JSON structure example:
[ {"Item": {"Name": "Darkest Legend","Year": 1992,"Genre": ["War"],"Director": "Sherill Eal Eisenberg","Producer": "Arabella Orth","Screenplay": ["Octavia Delmer"],"Cast": ["Johanna Azar", "..."],"Runtime": 161,"Rate": "9.0","Description": "Robin Northrop Cymbre","Reviews": "Gisela Seumas" },"Similars": [ {"Name": "Smooth of Edge","Year": 1985,"Genre": ["Western"],"Director": "Vitoria Eustacia","Producer": "Auguste Jamaal Corry","Screenplay": ["Jaquenette Lance Gibe"],"Cast": ["Althea Nicole", "..."],"Runtime": 96,"Rate": "6.5","Description": "Ashlan Grobe","Reviews": "Annnora Vasquez" } ] }, ...]
How could I improve the efficiency of object retrieval from such a large array?