I have the following file:
{"title": {"ger": "Komödie" (utf8, encoded as c3 b6) },"files": [ {"filename": "Kom�die" (latin1, encoded as f6) } ]}
(might look differently if you try to copy-paste it)
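To make the problem reproducible, here is a small snippet that writes a file with exactly these bytes (in.txt is just my working name for it):

# Writes the example byte-for-byte: c3 b6 is UTF-8 "ö", f6 is latin1 "ö".
data = (b'{"title": {"ger": "Kom\xc3\xb6die"},'
        b'"files": [{"filename": "Kom\xf6die"}]}')
with open('in.txt', 'wb') as fh:
    fh.write(data)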
This happened due to an application bug; I cannot fix the source that generates these files.
I now want to fix the charset of the filename field(s); there can be more than one. I first tried jq (for a single field):
value="$(jq '.files[0].filename'<in.txt | iconv -f latin1 -t utf-8)"jq --arg f "$value" '.files[0].filename = $f'<in.txt
But jq interprets the whole file as UTF-8, which damages the single f6 byte.
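I believe the same failure can be reproduced in Python: a lone f6 byte is not a valid UTF-8 sequence, so a strict UTF-8 decoder either rejects or replaces it:

# The stray latin1 byte is not valid UTF-8.
raw = b'Kom\xf6die'
print(raw.decode('utf-8', errors='replace'))  # -> Kom�die
# raw.decode('utf-8') raises UnicodeDecodeError instead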
I would like to find a solution in Python, but there too the input is interpreted as UTF-8 by default on Linux. I tried encoding='ascii', but that doesn't allow characters >= 128.
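For completeness, this is the failure I ran into with 'ascii' (a sketch using the example bytes from above):

raw = b'Kom\xf6die'
try:
    raw.decode('ascii')
except UnicodeDecodeError as e:
    # 'ascii' codec can't decode byte 0xf6 in position 3: ordinal not in range(128)
    print(e)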
Now I think I have found a way, but the JSON serializer escapes all non-ASCII characters. Since I am (intentionally) working with the wrong character set, the escaped sequences are garbage as well.
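A minimal demonstration of the escaping problem, using a deliberately mis-decoded string like the ones my script produces:

import json

# Decode with the wrong charset on purpose, as my script does.
s = 'Komödie'.encode('utf-8').decode('latin1')  # 'KomÃ¶die'
print(json.dumps(s))  # "Kom\u00c3\u00b6die" -- escaped garbage, not "ö"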
#!/usr/bin/python3
import sys
import io
import json

with open('in.txt', encoding='latin1') as fh:
    j = json.load(fh)

for f in j['files']:
    f['filename'] = f['filename'].encode('utf-8').decode('latin1')  # might be wrong, couldn't test

with open('out.txt', 'w', encoding='latin1') as fh:
    json.dump(j, fh)
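The reason I read and write with latin1 at all: as far as I understand, latin1 maps every byte 0x00-0xff to the code point of the same value, so it round-trips arbitrary bytes losslessly:

# latin1 is a 1:1 byte <-> code point mapping, so no byte is lost or changed.
raw = b'\xc3\xb6\xf6'  # mixed UTF-8 and latin1 bytes
assert raw.decode('latin1').encode('latin1') == raw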
What can I do to achieve the expected result, a clean non-escaped UTF-8 JSON file?