6/24/2023

Cygwin grep binary file matches

Looking at the grep manual, this seems to be because:

"If type is 'without-match', when grep discovers null input binary data it assumes that the rest of the file does not match; this is equivalent to the -I option."

However, grep also considers other data as indicating binary files:

"Non-text bytes indicate binary data; these are either output bytes that are improperly encoded for the current locale (see Environment Variables), or null input bytes when the -z (--null-data) option is not given."

So the rest of the file is skipped only when the -I / --binary-files=without-match options are given and the binary-ness is due to null bytes. However, that's not the case with the example input. The example file is treated as binary because it doesn't fit the current locale (probably some UTF locale), not because it has null bytes.

The easiest way is to just convert the text file to utf-8 and pipe that to grep:

    iconv -f utf-16 -t utf-8 file.txt | grep query

I tried to do the opposite (convert my query to utf-16), but it seems as though grep doesn't like that. I think it might have to do with endianness, but I'm not sure. It seems as though grep will convert a query that is utf-16 to utf-8/ascii. Here is what I tried:

    grep "$(echo -n query | iconv -f utf-8 -t utf-16 | sed 's/..//')" test.txt

If test.txt is a utf-16 file this won't work, but it does work if test.txt is ascii. I can only conclude that grep is converting my query to ascii. (More precisely, the shell's command substitution drops the null bytes from the utf-16 output before grep ever sees the pattern, which is why only the ascii letters are left.)

EDIT: Here's a really really crazy one that kind of works but doesn't give you very much useful info:

    hexdump -e '/1 "%02x"' test.txt | grep -P "$(echo -n Test | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "%02x"')"

How does it work? Well, it converts your file to hex (without any extra formatting that hexdump usually applies) and pipes that into grep. Grep is using a query that is constructed by echoing your query (without a newline) into iconv, which converts it to utf-16. This is then piped into sed to remove the BOM (the first two bytes of a utf-16 file, used to determine endianness). This is then piped into hexdump so that the query and the input are in the same form. Note that in a UTF-8 locale sed's . may refuse to match the BOM bytes; running that sed under LC_ALL=C, or asking iconv for utf-16le (which writes no BOM), should avoid the problem.

Unfortunately I think this will end up printing out the ENTIRE file if there is a single match, because the hex dump is one long line. Also this won't work if the utf-16 in your binary file is stored in a different endianness than your machine.

EDIT2: Got it!!!!

    grep -P "$(echo -n "Test" | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "x%02x"' | sed 's/x/\\x/g')" test.txt

This searches for the hex version of the string Test (in utf-16) in the file test.txt.

I use this one all the time after dumping the Windows registry, as its output is unicode.

    : Little-endian **UTF-16 Unicode text**, with CRLF line terminators

I needed to do this recursively, and here's what I came up with:

    find -type f | while read l; do iconv -s -f utf-16le -t utf-8 "$l" | nl -s "$l: " | cut -c7- | grep 'somestring'; done

This is absolutely horrible and very slow; I'm certain there's a better way and I hope someone can improve on it, but I was in a hurry :P

What the pieces do:

    find -type f

Gives a recursive list of filenames with paths relative to the current directory.

    while read l; do ... done

Bash loop: for each line of the list of file paths, put the path into $l and do the thing in the loop. (Why I used a shell loop instead of xargs, which would've been much faster: I need to prefix each line of the output with the name of the current file. I couldn't think of a way to do that if I was feeding multiple files at once to iconv, and since I'm going to be doing one file at a time anyway, a shell loop has easier syntax and escaping.)

    iconv -s -f utf-16le -t utf-8 "$l"

Convert the file named in $l: assume the input file is utf-16 little-endian and convert it to utf-8. The -s makes iconv shut up about any conversion errors (there will be a lot, because some files in this directory structure are not utf-16). The output from this conversion goes to stdout.

    nl -s "$l: " | cut -c7-

This is a hack: nl inserts line numbers, but it happens to have a "use this arbitrary string to separate the number from the line" parameter, so I put the filename (followed by colon and space) in that. Then I use cut to strip off the line number, leaving just the filename prefix. (Why I didn't use sed: escaping is much easier this way. If I used a sed expression, I would have to worry about regular expression characters in the filenames, of which in my case there were a lot. nl is much dumber than sed, and will just take the parameter to -s entirely literally, and the shell handles the escaping for me.)

So, by the end of this pipeline, I've converted a bunch of files into lines of utf-8, prefixed with the filename, which I then grep.
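The recursive find / iconv / nl / cut / grep pipeline in this post can probably be written more compactly. What follows is only a sketch under assumptions, not the original author's method: grep_utf16le is a made-up helper name, the files of interest are assumed to be utf-16le as in the post, and it relies on GNU grep's --label option (used with -H) to print the filename prefix instead of the nl and cut hack.

```shell
#!/bin/sh
# Sketch only: grep_utf16le is a hypothetical helper name, not from the post.
# Assumes GNU grep (for --label) and an iconv that accepts -s.
grep_utf16le() {
    pattern=$1
    dir=${2:-.}
    # find passes batches of paths straight to an inner shell as arguments,
    # so no line-based read loop has to parse the file names.
    find "$dir" -type f -exec sh -c '
        pattern=$1
        shift
        for f in "$@"; do
            # Convert one file to UTF-8 on stdout; errors from files that
            # are not UTF-16 are silenced, as in the original pipeline.
            iconv -s -f utf-16le -t utf-8 "$f" 2>/dev/null |
                # -H with --label prefixes each match with the file name.
                grep -H --label="$f" -e "$pattern"
        done
    ' sh "$pattern" {} +
}
```

Calling grep_utf16le 'somestring' . should print each matching line as path:line. It still runs one iconv and one grep per file, so it is unlikely to be dramatically faster than the loop in the post; the main gain is that file names never pass through read, nl or cut.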