Channel: Active questions tagged jq - Stack Overflow

Handle mixed charsets in the same json file


Given I have the following file:

{
  "title": {
    "ger": "Komödie"    (utf8, encoded as c3 b6)
  },
  "files": [
    {
      "filename": "Kom�die"   (latin1, encoded as f6)
    }
  ]
}

(It might look different if you copy and paste it.)

This happened due to an application bug; I cannot fix the source that generates these files.

I am now trying to fix the charset of the filename field(s); there can be several of them. I first tried with jq (single field):

value="$(jq '.files[0].filename' <in.txt | iconv -f latin1 -t utf-8)"
jq --arg f "$value" '.files[0].filename = $f' <in.txt

But jq interprets the whole file as UTF-8, and this damages the single f6 byte.

I would like to find a solution in Python, but there, too, the input is interpreted as UTF-8 by default on Linux. I tried 'ascii', but that does not allow bytes >= 128.
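To illustrate the decoding behaviour described above, here is a small sketch using an in-memory byte string that mimics the file (the sample bytes and variable names are made up for the demonstration). It shows that strict UTF-8 and ASCII decoding both reject the data, whereas Latin-1 maps every byte to exactly one code point and therefore never fails and round-trips losslessly.

```python
# Sample bytes imitating the broken file: c3 b6 is UTF-8 "ö", f6 is Latin-1 "ö".
raw = b'{"title": "Kom\xc3\xb6die", "filename": "Kom\xf6die"}'

# Strict UTF-8 decoding rejects the lone 0xf6 byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ASCII rejects every byte >= 0x80.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Latin-1 assigns a code point to each of the 256 byte values,
# so decoding cannot fail and encoding back restores the original bytes.
text = raw.decode("latin1")
round_trip = text.encode("latin1") == raw

print(utf8_ok, ascii_ok, round_trip)  # False False True
```

Because the Latin-1 round trip is lossless, it can serve as a neutral container for the bytes until each field has been decoded properly.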

Now I think I have found a way, but the JSON serializer escapes all non-ASCII characters. Since I (intentionally) work with the wrong character set, the escaped sequences are garbage as well.
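The escaping comes from the `ensure_ascii` parameter of Python's `json` module, which defaults to `True`. A minimal sketch of the difference (the sample dictionary is made up for illustration):

```python
import json

doc = {"filename": "Komödie"}

# Default: every non-ASCII character is written as a \uXXXX escape.
escaped = json.dumps(doc)

# With ensure_ascii=False the characters are emitted as-is.
literal = json.dumps(doc, ensure_ascii=False)

print(escaped)  # {"filename": "Kom\u00f6die"}
print(literal)  # {"filename": "Komödie"}
```

The same parameter is accepted by `json.dump` when writing to a file.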

#!/usr/bin/python3
import sys
import io
import json

with open('in.txt', encoding='latin1') as fh:
  j = json.load(fh)

for f in j['files']:
  f['filename'] = f['filename'].encode('utf-8').decode('latin1')   # might be wrong, couldn't test

with open('out.txt', 'w', encoding='latin1') as fh:
  json.dump(j, fh)

What can I do to achieve the expected result, a clean, non-escaped UTF-8 JSON file?
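One possible approach, sketched under some assumptions: read the bytes as Latin-1 (which cannot fail), parse the JSON, and then try to reinterpret each string as UTF-8. After a Latin-1 read, the genuinely Latin-1 fields such as `filename` are already correct, while the UTF-8 fields have turned into mojibake ("KomÃ¶die"), so the repair direction is `encode('latin1').decode('utf-8')`. Strings whose bytes are not valid UTF-8 are left unchanged. The helper names `fix_text`, `fix_tree`, `repair` and `repair_file` are hypothetical, and the heuristic assumes that no real Latin-1 string happens to form a valid UTF-8 byte sequence.

```python
import json


def fix_text(s: str) -> str:
    """Reinterpret a Latin-1-decoded string as UTF-8 where that is possible."""
    try:
        return s.encode("latin1").decode("utf-8")
    except UnicodeError:
        # Either the bytes are not valid UTF-8 (the string really was Latin-1)
        # or the string holds code points above 0xff (from \uXXXX escapes).
        return s


def fix_tree(node):
    """Apply fix_text to every string (keys and values) in a parsed JSON tree."""
    if isinstance(node, str):
        return fix_text(node)
    if isinstance(node, list):
        return [fix_tree(v) for v in node]
    if isinstance(node, dict):
        return {fix_text(k): fix_tree(v) for k, v in node.items()}
    return node  # numbers, booleans, None


def repair(raw: bytes) -> str:
    """Return clean JSON text for a byte string with mixed UTF-8/Latin-1 fields."""
    doc = json.loads(raw.decode("latin1"))
    return json.dumps(fix_tree(doc), ensure_ascii=False)


def repair_file(src: str, dst: str) -> None:
    """Read src as raw bytes and write the repaired JSON to dst as UTF-8."""
    with open(src, "rb") as fh:
        fixed = repair(fh.read())
    with open(dst, "w", encoding="utf-8") as fh:
        fh.write(fixed)
```

With the file names from the question, the call would be `repair_file('in.txt', 'out.txt')`. If only specific fields are known to be Latin-1, it may be safer to apply `fix_text` to all other fields explicitly instead of relying on the try-and-fall-back heuristic.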

