Wednesday, September 5, 2018

Grep within Microsoft Word Files

How many of you are actually UNIX users who write fiction in Microsoft Word for the convenience of sharing comments with others? Wait, it's just me? Regardless of the limited audience, I need to save this cool trick for my own reference.

The issue: I want to bring back a scene that I've written and discarded. I'm sure it lives in some previous version, but it is tedious to open all the previous versions and search by hand. (Yes, this is partly because I save so many previous versions despite my inability to open and search them. Specifically, there are 229 docx files of my current work-in-progress.)

The real thanks goes to the last answer to this question on Stack Overflow: How do I grep in microsoft word files? I can't upvote it, but the theory and specifics are invaluable.

docx files are just zipped xml files. 
So all you have to do is unzip them, 
use sed to strip off the xml tags 
and you have a grep-able text file. 

(Grep is an invaluable unix command, and one of the main reasons I use a Mac is so I can open a terminal and use the tools I know. Grep returns every line of a text file that contains a given string.)
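
To see the idea on a small scale, here is a minimal sketch in plain sh (not tcsh), run on a hand-made XML fragment instead of a real document. The function name strip_xml and the file name chapter1.docx in the comment are made up for illustration; the sed expression is the same tag-stripping one used in the conversion loop.

```shell
# strip_xml: read XML on stdin, write the text between the tags to stdout.
strip_xml() {
    sed -e 's/<[^>]\{1,\}>//g'
}

# A tiny stand-in for the contents of word/document.xml
printf '<w:p><w:r><w:t>red orach</w:t></w:r></w:p>\n' | strip_xml

# On a real file (hypothetical name) the pipeline would look like:
#   unzip -p chapter1.docx word/document.xml | strip_xml | grep orach
```

Everything between a < and the next > is deleted, so only the words typed into the document survive.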

I wrapped the user's answer (in italics below) in order to convert all my docx files. I'm not going to explain all the unix commands; feel free to comment if you want to know more.


ls *.docx > ! list.all
grep -v -e "~" list.all > ! list.docx
mkdir text_files
set list = `cat list.docx`
foreach file ( $list )
    unzip -p "$file" word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'  >! text_files/$file:r.txt
    ls text_files/$file:r.txt
end

Now I should be able to search for my text string (orach, in this case) in each file.
cd text_files
grep orach *

However, the replacement (sed) command above turns the entire document into one long line, and since grep returns the whole line in which the match was found, this is too much output. I spent the rest of my morning writing time trying to get sed to insert a newline at each replacement, but it was leading me down a rabbit hole. It all comes down to newlines and how to escape them, and the answer varies with the flavor of unix/linux and the shell. (For the record, OS X on a Mac is BSD unix and my shell is tcsh.) With the expression below I can get a \n into the file, but it shows up as literal characters rather than an actual newline:
's#<[^>]\{1,\}>#\\\n#g'
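
One possible way out, offered as a sketch that has not been tried on real manuscripts: POSIX sed accepts a backslash followed by an actual line break in the replacement, which avoids the \n escape altogether. The version below is written for sh rather than tcsh (so it would live in a small script run with sh), and it assumes each paragraph in document.xml closes with a </w:p> tag. The function name xml_to_lines is made up.

```shell
# xml_to_lines: read document XML on stdin, write one paragraph per line.
# The first expression turns each paragraph-closing tag into a real newline
# (backslash + literal line break); the second strips the remaining tags.
xml_to_lines() {
    sed -e 's/<\/w:p>/\
/g' -e 's/<[^>]\{1,\}>//g'
}

# Two fake paragraphs on a single line, as they appear in a docx
printf '<w:p><w:t>one</w:t></w:p><w:p><w:t>two orach</w:t></w:p>\n' | xml_to_lines
```

The [^[:print:]] step from the original command is left out here on purpose: run after the newline substitution in the same sed call, it would probably strip the new line breaks right back out.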

So meanwhile, all I need to know is which files match; then I can open them in 'less' and pull the scene out.
foreach file ( * )
    set test = `grep orach $file`
    if ( $#test > 0 ) echo "$file matches"
    unset test
end
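
grep can also do both jobs itself. As a sketch in plain sh (the function names are invented for illustration): -l prints only the names of matching files, and -o prints only the matched text, so padding the pattern with up to 40 characters on each side gives a readable excerpt even when the file is one enormous line.

```shell
# matching_files WORD FILE...: print only the names of files containing WORD
matching_files() {
    word=$1; shift
    grep -l -- "$word" "$@"
}

# match_context WORD FILE...: print each match with up to 40 characters
# of surrounding text instead of the whole (very long) line
match_context() {
    word=$1; shift
    grep -o -- ".\{0,40\}$word.\{0,40\}" "$@"
}

# Hypothetical usage from inside text_files:
#   matching_files orach *.txt
#   match_context orach *.txt
```

The underlying grep calls do not depend on the shell, so something like grep -l orach * should work straight from a tcsh prompt as well.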




