[svlug] Someone to Bounce thoughts off of

bruce bedouglas at earthlink.net
Thu Jan 15 10:39:37 PST 2009


Hi...

I'm working on a project, developing a basic framework for a distributed
crawler. I've looked at BOINC/Grub/Heritrix/Hadoop...

I'm looking for someone, or a few people, that I could talk with about how
to architect/implement the following app/process. Hopefully I can find a
couple of people to exchange emails/phone calls with, to bounce thoughts
off of. Even though this might not be the "right" email list, maybe someone
could point me to a better list/IRC channel, or some actual live group.

--------------------------------------------

The goal is to create an application that will allow me to have a central
process that interfaces with client apps running on a client server. The
overall process should support multiple clients on multiple machines...

The app will be used to parse/crawl over a targeted set of websites to
extract information. The client apps will be parsing scripts, written for
each site the app targets. Each client system will have the complete set of
parsing scripts resident on the system/image.

I envision a master Server/System that interfaces with client apps, with
each client app performing uploads/downloads of files/data with the master
server.

The client apps run on the client server, with the server running multiple
copies of the client. The client app fetches file(s) from the master, with
each file containing a list of actions/urls to parse. The action defines
which parsing script to call/run, as well as the url/data for the parsing
script to use.
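
To make the action file idea concrete, here is a rough sketch of how a
client might read such a file and call the matching parsing script. The
file format (one "action url" pair per line), the PARSERS registry and the
"example_site" script are all assumptions made up for illustration, not a
settled design.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry mapping an action name to a parsing function.
PARSERS: Dict[str, Callable[[str], dict]] = {}


def register(name: str):
    """Decorator that adds a parsing script to the registry."""
    def wrap(fn):
        PARSERS[name] = fn
        return fn
    return wrap


@register("example_site")
def parse_example_site(url: str) -> dict:
    # Placeholder: a real script would fetch the page and extract data.
    return {"site": "example_site", "url": url}


def read_actions(text: str) -> List[Tuple[str, str]]:
    """Parse an action file; assumed format is 'action url' per line."""
    actions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip malformed lines rather than abort the file
        actions.append((parts[0], parts[1].strip()))
    return actions


def run_actions(text: str) -> List[dict]:
    """Call the parsing script named by each action, collecting results."""
    results = []
    for name, url in read_actions(text):
        parser = PARSERS.get(name)
        if parser is None:
            results.append({"error": "unknown action", "action": name, "url": url})
            continue
        results.append(parser(url))
    return results
```

Since every client image carries the full set of parsing scripts, a plain
name lookup like this would be enough to pick the right one.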

So in the end, the overall process would be a system that can efficiently
parse 5000-10000 websites in a short amount of time, running over 50-100
servers/images...


Basic Needs:

Server System:
 -Connects with Client apps
 -Registers/tracks Client apps/servers
 -Allows client apps to download information packets/files
 -Accepts information packets/uploaded files from client apps
 -Web Based/Server Based
 -Tracks which files have been uploaded from the Client apps
 -Tracks which files have been downloaded to the Client apps
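
The register/download/upload tracking above could start out as something
like the following in-memory sketch. The WorkTracker class and its method
names are invented for illustration; a real server would sit behind a web
interface and keep this state in a database.

```python
import time
from collections import deque


class WorkTracker:
    """Tracks registered clients and which work files are out or returned."""

    def __init__(self, work_files):
        self.queue = deque(work_files)   # files not yet handed out
        self.clients = {}                # client id -> last contact time
        self.downloaded = {}             # file name -> client holding it
        self.uploaded = {}               # file name -> client that returned it

    def register(self, client_id):
        self.clients[client_id] = time.time()

    def checkout(self, client_id):
        """Hand the next work file to a registered client, or None if empty."""
        if client_id not in self.clients:
            raise KeyError("unregistered client: %s" % client_id)
        self.clients[client_id] = time.time()
        if not self.queue:
            return None
        name = self.queue.popleft()
        self.downloaded[name] = client_id
        return name

    def record_upload(self, client_id, name):
        """Mark a file as returned by the client that checked it out."""
        if self.downloaded.get(name) != client_id:
            raise ValueError("%s was not checked out by %s" % (name, client_id))
        del self.downloaded[name]
        self.uploaded[name] = client_id

    def outstanding(self):
        """Files that were downloaded but have not come back yet."""
        return sorted(self.downloaded)
```

The outstanding() list is what the server would use to spot work that a
dead client never returned.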

Client System/App:
 -Client app runs on client server
 -Client server accepts/allows multiple clients to run simultaneously
 -Client app is spawned/run by the Client Managing App
 -Client app pings the Server App to get the file(s) to process
 -Client app notifies the Server app of overall status
 -Client app performs internal processing (within the app)
 -Client app copies output file/data back to server
 -Client app dies when it completes (cleans up after itself)
 -Multiple/Distributed Client Servers
 -Client server is secure, within a "cloud" system
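
The client life cycle listed above (get work, report status, process, copy
output back, clean up, exit) might look roughly like this. The fetch,
process, upload and report callables are stand-ins for whatever transport
ends up being used; they are assumptions, not an existing API.

```python
import tempfile
from pathlib import Path


def run_client(fetch, process, upload, report):
    """Run one client pass and return an exit code (0 = ok, 1 = failed).

    fetch()       -> work text, or None when the server has nothing to do
    process(work) -> output text
    upload(text)  -> sends the output back to the server
    report(msg)   -> sends a status message to the server
    """
    report("started")
    # The temporary directory is removed on exit from the 'with' block,
    # so the client cleans up after itself even on failure.
    with tempfile.TemporaryDirectory() as workdir:
        work = fetch()
        if work is None:
            report("idle")
            return 0
        out_path = Path(workdir) / "output.txt"
        try:
            out_path.write_text(process(work))
            upload(out_path.read_text())
        except Exception as exc:
            report("failed: %s" % exc)
            return 1
    report("done")
    return 0
```

Returning an exit code lets the managing app tell a clean finish from a
failure when the client process ends.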


Client Managing App:
 -Communicates with Master Server app
 -Manages the Client Apps
 -Spawns/Creates/Starts Client Apps
 -Monitors the ongoing Client Apps, kills them if they halt/stop/etc.
  (eliminates potential zombie apps)
 -Spawns additional copies of Client app if the total number
  of clients running falls below a certain number
 -Runs continuously
 -Has a cron process running to ensure the app always runs
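
A rough sketch of the managing app's spawn/monitor loop, using Python's
standard subprocess module. The ClientManager class, the max_runtime rule
for deciding that a client has hung, and EXAMPLE_COMMAND are assumptions
for illustration only; the real client command would replace the stand-in.

```python
import subprocess
import sys
import time

# Stand-in for the real client command: a process that just sleeps.
EXAMPLE_COMMAND = [sys.executable, "-c", "import time; time.sleep(30)"]


class ClientManager:
    """Keeps at least min_clients client processes running."""

    def __init__(self, command, min_clients, max_runtime):
        self.command = command
        self.min_clients = min_clients
        self.max_runtime = max_runtime   # seconds before a client is killed
        self.running = []                # list of (Popen, start time)

    def tick(self):
        """One monitoring pass; returns the number of live clients."""
        now = time.monotonic()
        alive = []
        for proc, started in self.running:
            if proc.poll() is not None:
                continue  # exited; poll() has already reaped it
            if now - started > self.max_runtime:
                proc.kill()   # assumed hung: kill it and
                proc.wait()   # reap it so no zombie is left behind
                continue
            alive.append((proc, started))
        # Top up when the count has fallen below the minimum.
        while len(alive) < self.min_clients:
            alive.append((subprocess.Popen(self.command), time.monotonic()))
        self.running = alive
        return len(alive)

    def shutdown(self):
        """Kill and reap every client that is still running."""
        for proc, _ in self.running:
            if proc.poll() is None:
                proc.kill()
            proc.wait()
        self.running = []
```

The manager would call tick() in a loop with a short sleep between passes,
and the cron job only has to restart the manager itself if it is not
running.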

Thanks

-bruce
bedouglas at earthlink.net





