Windows dir command, Unicode file names and Perl

The other day I was trying to use an old Perl script on Windows 7.
In this case I used ActiveState Perl for Windows 5.14.2 (x64).
The script lists the contents of a directory using the MS-DOS command “dir”, and uses each line for its calculations.
True, it should rather use Perl’s opendir function… but that’s not the point.
Oops, post-scriptum correction: opendir does not seem to handle Unicode Windows names; it returns the old “___~1.TXT” MS-DOS short names for them:

opendir $fList, '.'
    or die "can't open current dir: $!";
open fOut, '>', 'out.txt'
    or die "can't open out.txt: $!";
while (readdir $fList) {
    print fOut $_, "\n";
}
close fOut;
closedir $fList;

The relevant thing was that I had file names in the directory with non-English characters; in this case, Japanese (both kanji and kana).
As MS-DOS has the bad habit of not knowing about Unicode, this can’t work well:
@list=split("\n", `dir`);
You have to use this instead:
@list=split("\n", `cmd /U /C dir`);
Anyway, this is not the correct way of doing things either… and I don’t know exactly why not (probably split does not correctly handle Unicode chars? Why should it, anyway).
I mean, it’s true that iterating through @list needs some massaging via decode():

use Encode qw/decode/;
@list = split "\n", `cmd /U /C dir`;
open fOut, '>', 'out.txt'
    or die "can't open out.txt: $!";
foreach (@list) {
    $line = decode("UTF-16BE", $_);
    utf8::encode($line); # to get readable UTF-8 output; raw UTF-16 is ugly
    print fOut $line;
}
close fOut;

Ok, now there are a lot of problems here…
If I’m not mistaken, “cmd /U” produces UTF-16LE output; but using decode with either of these options is incorrect:

$line = decode("UTF-16", $_);
$line = decode("UTF-16LE", $_);

The first is obviously incorrect, as there is no BOM in the output; but the latter should work… yet it produces only a correct first line, and mangled garbage after that. It looks as if cmd’s output were converted on the fly from little endian to big endian. Which wouldn’t make much sense, if I hadn’t seen it with my own eyes.
So the remaining option, UTF-16BE, should work. And it seems to… but it doesn’t.
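As an aside, the BOM issue can be sketched portably with nothing but the Encode module (no Windows needed): plain “UTF-16” relies on a leading byte-order mark to pick the endianness, and cmd’s output has none, so here a BOM is added by hand just to show the mechanism.

```perl
use strict;
use warnings;
use Encode qw/encode decode/;

# "UTF-16" (no LE/BE suffix) needs a byte-order mark to determine the
# endianness. cmd /U emits no BOM, so we prepend one by hand here.
my $le_bytes = encode("UTF-16LE", "abc");   # 61 00 62 00 63 00, no BOM
my $with_bom = "\xFF\xFE" . $le_bytes;      # FF FE marks little-endian

print decode("UTF-16", $with_bom), "\n";    # abc
print decode("UTF-16LE", $le_bytes), "\n";  # abc, endianness stated explicitly
```

Without the BOM, decode("UTF-16", …) has no way of knowing which byte order the stream uses, which is why naming the endianness explicitly (UTF-16LE or UTF-16BE) is the only sensible option for cmd’s output.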

For the rest of the post you’ll need Unicode fonts on your system. Nowadays you shouldn’t need to do anything: systems come already configured for it.
If you have problems displaying the files, even when their binary content is correct, see this great article about Windows, BOM, and Notepad.

If the file is named: 漢字, 片仮名, カタカナ, かたかな.txt
the code produces an output like:

 䔀氀 瘀漀氀甀洀攀渀 搀攀 氀愀 甀渀椀搀愀搀 䐀 渀漀 琀椀攀渀攀 攀琀椀焀甀攀琀愀⸀ഀ
15/09/2012 10:31 715 out.txt
15/09/2012 10:32 187
15/09/2012 10:29 0 "潗嬬 G狮不听 «タカナ〬 Kたかな〮txt

Which is kind of curious… the first line is garbage from outer space, in the place where the line from dir really says:
Volume in drive D has no label
If UTF-16LE is used, the result is reversed, as previously stated: the first line is correct, but the rest of the output is garbage.
In this case, with UTF-16BE, everything after that mangled first line seems fine.
But it isn’t.
The beginning of the kana words has been changed, if one pays attention… and the kanji is totally wrong.
Now, why and how did this happen? No idea. Maybe the problem is using split, expecting that lines (newline characters) can be recognized in such a hazardous UTF-16 byte stream…
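Here is a plausible explanation, sketched portably (no Windows needed): in UTF-16LE a newline is the two-byte pair 0A 00, but split on "\n" matches only the single byte 0x0A. The leftover 0x00 stays glued to the front of the next chunk, shifting its byte alignment by one, so every chunk after the first reads as if it were big-endian.

```perl
use strict;
use warnings;
use Encode qw/encode decode/;

# What cmd /U emits: UTF-16LE bytes, where "\n" is the pair 0A 00.
my $bytes = encode("UTF-16LE", "line1\nline2");

# split matches only the single byte 0x0A, so the 0x00 half of the
# newline is left glued to the front of the next chunk.
my @chunks = split /\n/, $bytes;

# The first chunk is still byte-aligned: UTF-16LE decodes it fine.
print decode("UTF-16LE", $chunks[0]), "\n";                   # line1

# The second chunk starts with the stray 0x00, so its byte pairs now
# read as big-endian (00 6C = "l", and so on), plus one odd trailing
# byte, which we trim before decoding.
print decode("UTF-16BE", substr($chunks[1], 0, -1)), "\n";    # line2
```

That matches the symptoms exactly: with UTF-16LE only the first line decodes, with UTF-16BE everything but the first line does, and the odd leftover bytes at each chunk boundary are what chews up the start of the kana words and the kanji.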

The solution is to use the correct encoding from the beginning, via open:

open fList, '-|:encoding(UTF-16LE)', 'cmd /U /C dir'
    or die "can't run dir: $!";
open fOut, '>:encoding(UTF-8)', 'out.txt'
    or die "can't open out.txt: $!";
foreach (<fList>) {
    print fOut $_;
}
close fOut;
close fList;
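A portable way to convince yourself that the :encoding layer is what saves the day, using an in-memory handle as a stand-in for the cmd pipe (the file names here are made up for the demo):

```perl
use strict;
use warnings;
use Encode qw/encode/;

# Stand-in for the cmd /U pipe: UTF-16LE bytes in an in-memory handle.
my $utf16 = encode("UTF-16LE", "file1.txt\nfile2.txt\n");
open my $fList, '<:encoding(UTF-16LE)', \$utf16
    or die "can't open in-memory handle: $!";

# The PerlIO layer decodes the bytes before readline looks for "\n",
# so each two-byte UTF-16 newline is handled as a unit: no manual
# decode(), no byte-alignment problems.
while (my $line = <$fList>) {
    print $line;
}
close $fList;
```

Because decoding happens below readline, line splitting operates on characters instead of raw bytes, which is exactly what the split-on-"\n" approach got wrong.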

And to obtain just the file names, use dir’s /W switch:

open fList, '-|:encoding(UTF-16LE)', 'cmd /U /C dir /W'
    or die "can't run dir: $!";
open fOut, '>:encoding(UTF-8)', 'out.txt'
    or die "can't open out.txt: $!";
foreach (<fList>) {
    print fOut $_;
}
close fOut;
close fList;

