[svlug] can a linux webbrowser make sites avail offline?
Mark S Bilk
mark at cosmicpenguin.com
Tue Jan 8 14:22:02 PST 2002
In-Reply-To: <20020108192528.GF27948 at zork.net>; from
schoen at loyalty.org on Tue, Jan 08, 2002 at 11:25:28AM -0800
On Tue, Jan 08, 2002 at 11:25:28AM -0800, Seth David Schoen wrote:
>J C Lawrence writes:
>> On Tue, 8 Jan 2002 09:40:47 -0800 (PST)
>> Gordon Vrololjak <gvrdolja at nature.Berkeley.EDU> wrote:
>> > Hello, I was wondering if anyone knew of a Linux browser with
>> > a feature that is available in Internet Explorer. I want to
>> > download some class computer programming web pages for offline
>> > use on my laptop on BART. Internet Explorer has an option for
>> > making pages available offline with up to 3 links in depth. Is
>> > there something comparable in Linux through one of the browsers or
>> > another application?
>>
>> man wget.
>
>wget doesn't seem to have some of the features of the IE browse
>offline command -- it only follows links, and downloads inline images,
>but doesn't seem to download stylesheets, etc. IE's browse offline
>also seems to rewrite absolute links into relative links so that they
>will work properly without an Internet connection.
>
>Is there a way to make wget do this? I tried to write a script which
>uses wget to preserve snapshots of web sites, but the snapshots
>produced with wget -m were often fairly incomplete.

There are (at least) two Perl programs that do this kind
of thing -- webcopy, and bew (which I've never used).

Webcopy used to work fine on my Netcom shell account,
but it doesn't seem to be compatible with the version of
Perl that I have now, or maybe some libraries are missing
(I know almost nothing about Perl).

The description is at the URL below. The files are no
longer on the FTP site, but I have them if you want them
(the latest version, apparently -- v0.98b7 96/06/08). Any
Perl jockey should be able to get them to work, since they
did work before. Making the program pull down additional
types of files, like style sheets, shouldn't be too hard
either.

http://www.inf.utfsm.cl/~vparada/webcopy.html

sed (with xargs) can relativize links in a collection
of HTML files, at least on an ad hoc basis. Just look
in them, see what needs to be changed, and write a
sed 's/../../' command to do it.
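
For anyone who would rather not hand-craft a sed expression per
site, the same ad hoc rewrite can be sketched in Python. This is
only an illustration, not what IE or webcopy actually does: the
function name relativize, the example host www.example.com, and
the choice to treat an empty path as index.html are all
assumptions made for the example.

```python
import posixpath
import re


def relativize(html, site_root, page_path):
    """Rewrite absolute links under site_root as relative links.

    site_root: prefix to strip, e.g. "http://www.example.com/" (hypothetical).
    page_path: where this page lives inside the local mirror,
               e.g. "docs/index.html".
    Only double-quoted href/src attributes are handled; links to
    other hosts are left untouched.
    """
    page_dir = posixpath.dirname(page_path) or "."
    pattern = re.compile(
        r'((?:href|src)\s*=\s*")' + re.escape(site_root) + r'([^"]*)"',
        re.IGNORECASE,
    )

    def repl(match):
        # Assumption: a bare site root maps to index.html in the mirror.
        target = match.group(2) or "index.html"
        return match.group(1) + posixpath.relpath(target, page_dir) + '"'

    return pattern.sub(repl, html)
```

Run it over each saved page, passing that page's own path inside
the mirror. A trailing slash on a directory link is lost in the
path normalization, so those links still need checking by hand.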

IIRC, webcopy only downloads each page once (probably
skipping it if it's already in the destination tree),
whereas wget pulls each page again every time it scans a
link to it on another page. That is very annoying, and could
make owners (and payers) of websites hate you, especially
if you're doing your websucking via a high-speed connection.
Rape and pillage responsibly!
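
The fetch-each-page-once idea is easy to sketch. The code below
is not taken from webcopy or wget; it is a minimal breadth-first
walk that remembers which URLs it has already queued. The names
crawl_once, fetch, and extract are hypothetical, and the default
depth of 3 just echoes the IE setting mentioned in the original
question.

```python
from collections import deque


def crawl_once(start_url, fetch, extract, max_depth=3):
    """Visit every reachable page at most once, up to max_depth links away.

    fetch(url) -> page text (caller-supplied, e.g. a wrapper around urllib).
    extract(html, url) -> iterable of absolute URLs found in that page.
    Returns the URLs in the order they were fetched.
    """
    seen = {start_url}                 # URLs already queued or fetched
    queue = deque([(start_url, 0)])
    fetched = []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        fetched.append(url)
        if depth >= max_depth:
            continue                   # do not follow links past the limit
        for link in extract(html, url):
            if link not in seen:       # skip anything we already know about
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched
```

Because a URL is added to the seen set at the moment it is
queued, a page linked from fifty other pages is still requested
only once.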

For that reason I've given up using wget in its recursive
mode, and only fetch single pages, or the pages whose
URLs are listed in a file (the -i option). To create such
a file, you can get a list of all the URLs linked from a
web page using lynx's l (list) command; just print the
list to a file and clean it up with an editor or with
sed and grep.
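
If lynx isn't handy, the list-and-clean-up step can be roughed
out with Python's standard library. This is a sketch under
stated assumptions, not a replacement for lynx: the names
LinkCollector and extract_links are made up for the example, and
it only looks at href attributes on <a> tags.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute http(s) URLs linked from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []        # unique URLs in order of first appearance
        self._seen = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                url, _fragment = urldefrag(urljoin(self.base_url, value))
                if url.startswith(("http://", "https://")) and url not in self._seen:
                    self._seen.add(url)
                    self.links.append(url)


def extract_links(html, base_url):
    """Return the de-duplicated list of links found in html."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Write the returned list to a file, one URL per line, and hand
that file to wget with the -i option.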

bew (web mirror, ha ha) uses lynx to fetch pages, and
looks simpler than webcopy. I've never tried it.

http://MarginalHacks.com/Hacks/bew

I think this kind of functionality belongs in a separate
program, not in browsers, which are already complicated
enough. Modularity is the Unix way.