During the latest PowerShell Oneliner Contest, Brian came up with a completely fantastic solution to Task 3: he makes very smart use of Group-Object -AsHashTable -AsString as well as of Invoke-Expression to produce an impressive 187-character solution.
Brian has kindly agreed to be my guest blogger today.
First of all, let’s have a look at his answers to my PowerShell contest:
TASK 1: MANIPULATING OUTPUT – 43 chars
gwmi Win32_Share|% N*|%{"\\$(hostname)\$_"}
TASK 2: MANDELBROT JOKE – 54 chars
"The B in $(($b=gv q* -v|% s*g 27 20)) stands for $b."
TASK 3: TEXT MINING – 187 chars
$t1,$t2|%{$_-split'\W+'|% *wer|group -ash -ass}-ov h|% K*|sort|gu|%{($d+=($1=$h[0].$_.Count)*($2=$h[1].$_.Count)),($a+=$1*$1),($b+=$2*$2)}|%{$m="[math]::Sqrt("}{}{"$d/($m$a)*$m$b))"|iex}
1. Brian, tell us a bit about yourself and about the way you got to work with PowerShell
I’m a sysadmin/SRE/Systems Engineer working mostly on Windows but going cross platform whenever I can. I got started with PowerShell relatively late; maybe about 5 years ago now. Before using PowerShell my scripting tasks were mostly done with Ruby. I really enjoyed it as a language. At one of my previous jobs PowerShell was in heavy use (particularly by the Exchange admin) so I started picking it up and then ran with it.
2. Is there any PowerShell project of yours you want to speak about?
I have a module called Idempotion that lets you easily use DSC resources directly in scripts. It’s a templated wrapper around Invoke-DscResource that gives you more natural PowerShell function syntax and some additional features (-WhatIf support, etc.).
This is also on the GitHub page, but as an example, where you would normally use a line like this:
Invoke-DscResource -Name File -ModuleName PSDesiredStateConfiguration -Method Set -Property @{ DestinationPath = 'C:\Folder\File.txt' ; Contents = 'Hello' }
Idempotion lets you do this:
Set-File -DestinationPath 'C:\Folder\File.txt' -Contents 'Hello'
3. Can you show us the way you tackled the Cosine Similarity task?
This was a pretty challenging task. I had never heard of cosine similarity before, so I first had to learn what it was, then learn how to apply it to a string (since it’s really about numbers), and then come up with an implementation that could be sufficiently golfed.
My first attempt, at 254 characters, defined functions (as ScriptBlocks in variables) for dot product and vector magnitude, and then called them once the full vectors were realized. To make a long story short, it’s much better to calculate the dot product and magnitude as you go along; it just took me a while to figure out that I could do that with this algorithm.
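For readers who have not met the metric before, here is an ungolfed sketch of the calculation: the dot product of the two word-count vectors divided by the product of their magnitudes, with all three totals accumulated in loops rather than computed from finished vectors. It is an illustration only, not Brian's code; the function names Get-WordCount and Get-CosineSimilarity are made up for this example, and it makes no attempt to handle empty input.

```powershell
# Hypothetical helper: counts lowercase words in a string.
function Get-WordCount {
    param([string]$Text)
    $counts = @{}
    foreach ($word in ($Text.ToLower() -split '\W+')) {
        # Skip the empty strings -split can produce at the edges.
        if ($word) { $counts[$word] = 1 + $counts[$word] }
    }
    $counts
}

# Hypothetical function: cosine similarity of two texts by word counts.
function Get-CosineSimilarity {
    param([string]$Text1, [string]$Text2)
    $c1 = Get-WordCount $Text1
    $c2 = Get-WordCount $Text2
    $dot = $sq1 = $sq2 = 0
    foreach ($entry in $c1.GetEnumerator()) {
        # A word missing from $c2 yields $null, which multiplies as 0.
        $dot += $entry.Value * $c2[$entry.Key]
        $sq1 += $entry.Value * $entry.Value
    }
    foreach ($entry in $c2.GetEnumerator()) {
        $sq2 += $entry.Value * $entry.Value
    }
    $dot / ([math]::Sqrt($sq1) * [math]::Sqrt($sq2))
}
```

Two texts with no words in common score 0, and a text compared with itself scores 1 (up to floating-point rounding).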
So actually I want to talk about some of the other challenges.
Unlike traditional code golfing, this is specifically a one-liner contest; so no newlines and no semicolons. This really forces you to think hard about how you can do discrete tasks (even variable assignment) without stopping for a new statement. The fact that we’re starting with two discrete variables for the source strings puts that problem right up front.
So I start by making an array of $t1 and $t2 with the comma operator and then pipe that into ForEach-Object.
Splitting the string with \W+ splits on contiguous non-word characters as needed so that we get an array of words. After that I want a lowercase version of the words, and then I want to group them into a hashtable.
To lowercase, you can call .ToLower(), but calling methods directly is painful in code golf. You need the entire name, you need the parentheses, and if the source is not a variable or literal you also have to wrap the source in parentheses.
Luckily there’s a little-known parameter set of ForEach-Object. Instead of passing a script block, you pass a member name like a property or method and then it gets retrieved/invoked for each input object. It even takes arguments for methods. Best of all it accepts wildcards (as long as the match is unambiguous). With properties, this is like using Select-Object -ExpandProperty, just much shorter than even select -exp.
So:
("I can't read words."-split'\W+').ToLower()
Can become:
"I can't read words."-split'\W+'|% *wer
I use this extensively in code golf, and I wrote about it in the Tips for Golfing PowerShell thread on Stack Exchange’s Code Golf site.
So you’ll see me use this A LOT in this task.
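As a small illustration of that parameter set (not taken from the contest entry), the sketch below shows a property lookup, a method call with an argument, and the wildcard form. The sample strings and variable names are arbitrary.

```powershell
# Property by name: each string's Length is emitted.
$lengths = 'one', 'three' | ForEach-Object Length

# Method by name, with the argument passed after the member name.
$tails = 'hello', 'world' | ForEach-Object Substring 1

# Wildcard member name with the % alias: *wer resolves to ToLower.
$lower = 'Some', 'WORDS' | % *wer
```

Here $lengths holds 3 and 5, $tails holds 'ello' and 'orld', and $lower holds 'some' and 'words'.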
Back to the pipeline: after ToLower I’m using Group-Object with -AsHashTable and -AsString. You’ll see soon why I want a hashtable. -AsString is needed to get real strings for the hashtable keys (this is annoying). The purpose of grouping is to get the counts of each unique word. Group-Object isn’t case sensitive so we don’t actually need ToLower for this; but we need it later.
So the result of this ForEach-Object is two hashtables, one for each of the input strings. The keys of each hashtable are the unique words, and the values are arrays containing each instance of the word. So if the string contained the word “really” twice, the hashtable would contain a key of “really” with a value of @('really', 'really').
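To make the shape of that hashtable concrete, here is a minimal sketch using a made-up sample string. It assumes strict mode is off, so that looking up a missing key gives $null and $null.Count evaluates to 0.

```powershell
$words = 'really really good' -split '\W+'
$table = $words | Group-Object -AsHashTable -AsString

# Each value is the collection of matching words, so Count is the word count.
$reallyCount  = $table['really'].Count
$goodCount    = $table.good.Count

# A word that never appeared: the lookup gives $null, and $null.Count is 0.
$missingCount = $table.missing.Count
```

The three counts come out as 2, 1, and 0, which is what makes the hashtable convenient for words that appear in only one of the two strings.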
I’m using the -OutVariable parameter to store the resulting array of hashtables in a variable named h, while also sending it down the pipeline.
The next part, |% K*, uses the aforementioned trick of expanding a property with ForEach-Object. K* resolves to “Keys”. This gets passed to sort and gu (Get-Unique) to get a list of unique keys. Get-Unique is case sensitive, which is why I lowercased the words previously. At this point in the pipeline, though, all we have are keys. The pipeline objects are just strings, and the original hashtables they came from are no longer in the pipeline. That’s why I put them in $h.
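The first half of the pipeline, written out with full command names and spread over several lines, might look like the sketch below. The two sample strings are invented for the example; the structure follows the golfed version.

```powershell
$t1 = 'the cat sat'
$t2 = 'the dog sat down'

$keys = $t1, $t2 |
    ForEach-Object {
        $_ -split '\W+' | ForEach-Object ToLower | Group-Object -AsHashTable -AsString
    } -OutVariable h |        # $h keeps both hashtables for later lookups
    ForEach-Object Keys |     # only the key strings continue down the pipeline
    Sort-Object |
    Get-Unique
```

Afterwards $h contains the two hashtables and $keys contains cat, dog, down, sat, the.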
The next ForEach-Object does a lot of the “work” here. For each key I’m sending in, I need to retrieve the count of that key from the first hashtable ($h[0], which is the words in $t1) and the second hashtable ($h[1], the words in $t2). So $h[0].$_.Count does that for $t1, where $_ is the current key. These are the “pairs” of each vector. Doing it this way, with hashtables, ensures I’ll get 0 for words that are in one string but not in another. Originally I was just using groups and missing words because of that. I’m going to need each of these values 3 times so it makes sense to store them in variables. I chose $1 and $2.
Small aside: PowerShell has a neat little quirk whereby you can do an assignment inside of a substatement (parentheses), and it does the assignment while also returning the value that was assigned. This also works with += and -=. This is really critical here.
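A minimal sketch of that behaviour, with arbitrary variable names and values:

```powershell
# Each parenthesised assignment stores the value and also returns it.
$product = ($x = 3) * ($y = 4)

# The same works for compound assignment.
$total = 10
$echoed = ($total += 5)
```

After this runs, $x is 3, $y is 4, $product is 12, and both $total and $echoed are 15. Without the parentheses, an assignment statement produces no output.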
To calculate the dot product, I need to multiply $1 and $2, and keep adding those up as I go along. $d holds my dot product. So:
$d+=($1=$h[0].$_.Count)*($2=$h[1].$_.Count)
Keeps accumulating $d with my dot product as I go along, while assigning $1 and $2 for what comes next in the current iteration.
Within this iteration I also have to accumulate the values for magnitude for each vector (or at least for the squares of the current vector value). $a and $b hold the accumulating pre-square-root magnitude values for $t1 and $t2 respectively.
So I need to do 3 assignments here all within this 1 iteration, without semi-colons and newlines, so what to do? Let’s just make an array of all three while assigning them, with prodigious use of parentheses and 2 commas!
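To see the three accumulators at work in isolation, here is a sketch that feeds a hand-built $h and a fixed key list into the same expression. The two literal hashtables are stand-ins for what Group-Object would produce (values are arrays of repeated words), and strict mode is assumed to be off so that a missing key counts as 0.

```powershell
# Stand-in for the -OutVariable result: word vectors (2,1,0) and (1,2,1).
$h = @{ a = 'a', 'a'; b = , 'b' }, @{ a = , 'a'; b = 'b', 'b'; c = , 'c' }

$d = $a = $b = 0
$emitted = 'a', 'b', 'c' | ForEach-Object {
    # One expression, two commas: a three-element array built from assignments.
    ($d += ($1 = $h[0].$_.Count) * ($2 = $h[1].$_.Count)), ($a += $1 * $1), ($b += $2 * $2)
}
```

When the loop finishes, $d is 4, $a is 5, and $b is 6. $emitted holds the nine running totals that were sent down the pipeline, three per key.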
Now I’m accumulating all the right values at the right time. Problem is, I don’t need this array! I could nullify it, but that’s a problem too; if I don’t return anything to the pipeline, the next element won’t run its process block.
Instead, the next ForEach-Object uses all 3 blocks, and the Process block is empty, because at this point I don’t care about what’s in the pipeline, I just want to finish the work.
The Begin block was necessary to get to the End block, so I kind of got it “for free”. What a perfect place to do another assignment! I’m setting up $m to contain a string that looks an awful lot like a piece of PowerShell code that calls [math]::Sqrt(.
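For anyone who has not used the three-block form before: when ForEach-Object is given three positional script blocks, they act as Begin, Process, and End. The sketch below uses made-up values to show both a normal use and the "empty Process block" variant.

```powershell
# Begin initialises, Process runs once per input, End runs once at the finish.
$result = 1..4 | ForEach-Object { $sum = 0 } { $sum += $_ } { "sum=$sum" }

# An empty Process block discards the input; End still sees Begin's variable.
$tail = 'x', 'y' | ForEach-Object { $prefix = 'finished: ' } { } { "${prefix}end" }
```

$result is the single string 'sum=10', and $tail is the single string 'finished: end', emitted once no matter how many items came in.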
In the End block, we bring it all together. What I need to do here is divide the dot product ($d) by the product of the Sqrt of $a and the Sqrt of $b. I do this by generating a string which ends up containing something like “123/([math]::Sqrt(4)*[math]::Sqrt(5))”, and then I pipe that into Invoke-Expression (iex).
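The string-building step can be tried on its own. The sketch below plugs in sample totals (4, 5, and 6, chosen arbitrarily) and compares the Invoke-Expression result with the same formula written directly.

```powershell
$d = 4; $a = 5; $b = 6            # sample totals, not contest data
$m = "[math]::Sqrt("

# Interpolation yields the text: 4/([math]::Sqrt(5)*[math]::Sqrt(6))
$expr = "$d/($m$a)*$m$b))"

$golfed = $expr | Invoke-Expression
$direct = $d / ([math]::Sqrt($a) * [math]::Sqrt($b))
```

Both $golfed and $direct evaluate the same formula. Since Invoke-Expression runs whatever string it is handed, the trick is reasonable here only because the interpolated values are plain numeric counters.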
In an intermediate solution I had assigned [math]::Sqrt (the actual method itself) to the variable $m, so that I could call $m.Invoke($a) through the shorter $m|% I* $a, but the stringification with iex is actually way shorter. This kind of thing comes in handy a lot in golfing.
So that’s it! You can find more of my golfing on Stack Exchange’s Code Golf site (I pretty much only use PowerShell there, even though the site is open to any language).