[svlug] can a linux webbrowser make sites avail offline?

Mark S Bilk mark at cosmicpenguin.com
Tue Jan 8 14:22:02 PST 2002


In-Reply-To: <20020108192528.GF27948 at zork.net>; from 
schoen at loyalty.org on Tue, Jan 08, 2002 at 11:25:28AM -0800

On Tue, Jan 08, 2002 at 11:25:28AM -0800, Seth David Schoen wrote:
>J C Lawrence writes:
>> On Tue, 8 Jan 2002 09:40:47 -0800 (PST), Gordon Vrololjak
>> <gvrdolja at nature.Berkeley.EDU> wrote:
>> > Hello, I was wondering if anyone knew of a feature available in
>> > Internet Explorer for a browser for Linux.  I want to download
>> > some computer programming class web pages for offline use on my
>> > laptop on BART.  Internet Explorer has an option for making
>> > pages available offline, following links up to 3 levels deep.
>> > Is there something comparable in Linux through one of the
>> > browsers or another application?
>> 
>> man wget.
>
>wget doesn't seem to have some of the features of the IE browse
>offline command -- it only follows links, and downloads inline images,
>but doesn't seem to download stylesheets, etc.  IE's browse offline
>also seems to rewrite absolute links into relative links so that they
>will work properly without an Internet connection.
>
>Is there a way to make wget do this?  I tried to write a script which
>uses wget to preserve snapshots of web sites, but the snapshots
>produced with wget -m were often fairly incomplete.
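
(Depending on the wget version you have, -p/--page-requisites
and -k/--convert-links are supposed to cover exactly those two
things -- grabbing stylesheets and rewriting absolute links
for local viewing.  Something along these lines, with the URL
and depth made up for illustration:

  # 3 levels deep, with page requisites, converting links:
  wget -r -l 3 -p -k http://www.example.com/classes/

I haven't had much luck with wget's recursive mode myself,
though, for the reasons below.)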

There are (at least) two Perl programs that do this kind 
of thing -- webcopy and bew.  

Webcopy used to work fine on my Netcom shell account, 
but doesn't seem compatible with the version of perl 
that I have now, or maybe some libraries or something
are missing (I know almost nothing of perl).  

The description is at the URL below; the files are no longer 
on the ftp site, but I have them if you want them (the latest 
version, apparently -- v0.98b7 96/06/08).  Any Perl jockey 
should be able to get them working again, since they did 
work before.  Making the program pull down additional types 
of files, like style sheets, shouldn't be too hard, either.

http://www.inf.utfsm.cl/~vparada/webcopy.html

sed (with xargs) can relativize links in a collection 
of HTML files, at least on an ad hoc basis.  Just look 
in them, see what needs to be changed, and write a 
sed 's/../../' command to do it.
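
For instance (just a sketch -- the URL prefix here is made 
up, and the in-place -i flag needs GNU sed; otherwise write 
to a temp file and move it back):

  # strip the site's absolute prefix so links become relative
  find . -name '*.html' -print \
      | xargs sed -i 's|http://www\.example\.com/classes/||g'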

IIRC, webcopy only downloads each page once (probably
refraining if it's already in the destination tree), 
whereas wget pulls each page again every time it scans a 
link to it on another page -- very annoying, and could 
make owners (and payers) of websites hate you, especially
if you're doing your websucking via a high-speed connection.  
Rape and pillage responsibly!  

For that reason I've given up using wget in its recursive
mode, and only fetch single pages, or the pages whose 
URLs are in a file (the -i option).  To create such a 
file you can get a list of all the URLs linked to on a 
web page, using lynx's l (list) command; just print the 
list to a file and clean it up with an editor or with 
sed and grep.
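
Non-interactively, lynx -dump prints that same reference list 
at the end of its output (and -listonly drops everything but 
the list), so the whole thing can be scripted -- roughly like 
this, with the URL made up and the sed pattern depending on 
exactly how your lynx numbers the links:

  # list the links on a page, strip lynx's "  1. " numbering,
  # and hand the result to wget
  lynx -dump -listonly http://www.example.com/classes/index.html \
      | grep 'http://' \
      | sed 's/^ *[0-9]*\. //' \
      | sort -u > urls.txt
  wget -i urls.txt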

bew (web mirror, ha ha) uses lynx to fetch pages, and
looks simpler than webcopy.  I've never tried it.

http://MarginalHacks.com/Hacks/bew

I think this kind of functionality belongs in a separate 
program, not in browsers, which are already complicated
enough.  Modularity is the Unix way.  




