One more time on dutree ...

Tim Peters (tim@ksr.com)
Mon, 27 Jan 92 23:08:08 EST

Posted the following to comp.lang.perl; figure there's enough interest
in the Python world to justify sending it to python-list too; feel free
to scream if you disagree.

tersely y'rs - tim

Newsgroups: comp.lang.perl
Subject: Re: The problems of Perl (Re: Question (silly?))
Summary: use the right tool for the job <grin>
Message-ID: <9170@ksr.com>
Date: 27 Jan 92 22:41:16 EST
References: <=#Hues+4@cs.psu.edu> <1992Jan17.053115.4220@convex.com> <5230@charon.cwi.nl> <1992Jan27.020608.5983@convex.com>

In article <1992Jan27.020608.5983@convex.com> tchrist@convex.COM (Tom Christiansen) writes:
>...
>Larry mentioned to me offline last week that all it takes for this task
>is getting the right sort criteria, and that you needn't use recursion
>or string-munging trickery (the two approaches we've seen) to do it.

By assigning a unique (but otherwise arbitrary) integer ID to each group
of siblings, and using that as the primary sort key (and the du-reported
size as a secondary sort key), only one sort step is needed. However,
it leaves the records in a bad order for printing (because it makes
siblings adjacent even if they in turn have children). With more
trickery, a second (stable) sort could fix that.

The parent-child relationships can be captured without recursion and
without string comparisons by noting that:

1) The output of "du dir" is a postorder traversal of the subtree rooted
at "dir".

2) The number of components in a directory's pathname reliably
identifies the depth of the directory.

Those clues can be used to build up the tree in a single non-recursive
scan of du's output, just comparing the number of components from one
line to the next.

>Perhaps someone would care to post a solution using that method.

Will attach one that uses recursion only in the output routine & that
restricts itself to "flat" (non-recursive) data types. Even so, I wrote
it in Python because it's still clumsier in Perl. And I think this was
Felix Lee's original point: while it's amusing to think of ways to twist
the problem to fit comfortably into Perl's repertoire, the *obvious* way
to attack a tree-based algorithm is to use recursive data types, so that
trying to use a language that doesn't support them changes the focus of
the effort from solving the problem to wrestling with the language. For
this reason, as an old-time Fortran programmer I am sometimes overcome
by deja vu when using Perl <0.7 grin>.

BTW, this is not intended as a knock at Perl! I use Perl dozens of
times each day & believe it's superbly suited to many real-world tasks.
There are, however, many real-world tasks (like this one) for which Perl
is about the last tool I'd choose.

Some notes on the code below:

- This inserts a '.' entry as needed to suck up otherwise-unaccounted-
for space.

- The paths at the top level are printed out in full.

- It works correctly if more than one directory is given on the
command line (Tom's original program printed only the last one in
this case, and Guido's sometimes got the sorting wrong at (just) the
topmost level in this case).

- When two or more siblings have the same size, this version prints
them in alphabetical order. This was my intent, but you won't see
any code specifically arranging for that to happen -- it happens by
magic because of the way Python defines ordering on tuples.

- Unlike the other programs, this one does *not* try to vary the
indentation depending on the length of a component name. I simply
dislike the way the output looks when that's done, and much prefer a
fixed indentation increment.

- The sorting is done as it goes along (rather than using the trick
mentioned above to do it all at once) just because that's the most
natural way to do it; it would be clumsier to push the sorting off
to the end.

- The primary data structure is the associative array 'kids', which
has type (making up a type-decl notation here ...)

array[string] of list of triple(int,string,string)

I know it isn't hard to fake such a thing in Perl, but it does take
some tedious trickery not needed in Python. That's the "deep"
reason I think Python (or Icon) is a better tool for this problem:
it consumes less of *my* time to get the job done (& I prefer Perl
for problems where Perl is less hassle ...).

if-i-wanted-speed-i'd-write-it-in-fortran<grin>-ly y'rs - tim

Tim Peters Kendall Square Research Corp
tim@ksr.com, ksr!tim@uunet.uu.net

#!/usr/local/bin/python
import string, posix, sys

leadering = ' |'

def suck_up_du( dir ):
# 'kids' maps a pathname to a flat list of the pathname's children;
# each entry in the list is a triple:
# size of child (negated so it sorts in decreasing order)
# last component of child's pathname (cached for output)
# child's full pathname (used to index back into 'kids')
global kids

a, total, leftover, oldncomps = [], 0, {}, 0
for line in posix.popen( 'du ' + dir, 'r' ).readlines():
[size, path] = string.split( line )
size = eval(size) # convert to int
comps = string.splitfields( path, '/' )
ncomps = len(comps)
if ncomps < oldncomps: # line is parent
if total < size: a.append( (total-size, '.', path+'/.') )
a.sort()
kids[path] = a
oldncomps = ncomps
try: a, total = leftover[`ncomps`]
except KeyError: a, total = [], 0
leftover[`ncomps`] = [], 0
elif ncomps > oldncomps: # line starts new subtree
leftover[`oldncomps`] = a, total
a, total, oldncomps = [], 0, ncomps
a.append( (-size, comps[-1], path) ) # negate size for sort
total = total + size
return a[0]

def dump( prefix, list ):
for size, last, path in list:
print prefix + string.rjust(`-size`,lenlead), last
if kids.has_key(path): dump( prefix + leadering, kids[path] )

def dudump( dirlist ):
global kids, lenlead
kids, a, lenlead = {}, [], len(leadering)
for dir in dirlist:
size, junk, path = suck_up_du( dir )
a.append( (size,path,path) ) # display full path at top level
a.sort()
dump( '', a )

dudump( sys.argv[1:] or [''] )

>>> END OF MSG