[flud-devel] DHT Performance? Design?

Bill Broadley bill at broadley.org
Sat Oct 27 15:32:03 PDT 2007


Greetings all,

I've been tinkering with the idea of writing a p2p backup client with the 
usual goals:
* Don't trust your peers more than you have to (encrypt blocks before you
     send)
* Donate storage to get storage
* Find redundancy when feasible
* Use DHT for finding peers/contracts, possibly wait for version 2
* open source
* Challenge peers to verify they stored what they claimed
* Public keys and SHA256 checksums are very useful to insure who you are
   talking to, identifying duplicate files, etc.
* at least for version 1 I was going to use simple replication and
   recognize duplicates instead of erasure codes.
* only private and public key should be necessary for disaster recovery

While a DHT is a very cool technology and very aesthetically pleasing I
wasn't sure that it was up to the task of a query per file for a large number 
of peers distributed over the internet.  Does anyone have numbers handy? 
Things like azueus use it for finding peers, but that's relatively little
info that is just a single query (who's on this torrent) that seems like
it often takes minutes.

Flud looks awesome, I've seen many related papers (idibs and myriadstore among 
others), but precious few that actually publish source.  Kudos to Alen.  Of 
course I have some questions.  I've tried to read as much on the website, 
p2p-hackers, and the flud mailing list as I could find.  RTFM is welcome, 
please include a URL ;-).

As an example my desktop uses 150GB, 50k directories, 345k files, average file 
size of 478KB.

Questions:
#1 Does every backup require 150GB to be read and encrypted then 345k queries
    of the DHT?  I found a mention of only filename and storage keys
    stored locally.
#2 Is this a new DHT?  If not which is it?  Does it related to an existing
    dht?
#3 how big is a block?
#4 for 1000 peers how many nodes (on average) have to be contacted to query
    the DHT for sK? log n hops? Log n neighbors stored on each peer?
#5 Say you have 1000 flud users, each with 345k files, how long would it take
    for 1000 peers to make 345k queries, answer 345k queries from
    it's peers, and handle DHT routing (number of files * log n?) ?

My planned approach was more along the lines of:
for <directories to be backed up>
    for files in directory
        # query local metadata
        if (replicated(file)<desired_replication(directory))
           queue_for_encrypt_and_send(file)
        elseif (replicated(file)>desired_replication(directory))
           delete_from_peer(file)

Then a second thread/process:
for <queue_for_encryption_and_send>
     if (policy.security(file)==secure) #less storage efficient, more secure)
         #file duplicate will only be within this key, i.e. the sysadmin
         #responsible for a group of machines will benefit from duplicates
         #within his machines
         efile=pkencrypt_file(private_key,file)
     elseif (policy.security(file)==less_secure) #more storage efficient
         # less secure this allows a peer you have a contract with to find out 

         # if you have a given file by sha256'ing it and see if it matches
	efile=symmetric_encrypt_file(file.sha256,file)
     peer=find_peer(file) # find a peer with a contract for enough space
              # and isn't already storing a previous replication of said
              # file
     queue_send_peer(efile,peer)
     #track file, stat, sha256sum, and which peers locally
     update_local_metadata(file,peer)

Then:
<backup host>   I need this files stores <list of sha256s>
<selected peer> I do not have this list <potentially shorter list>
<backup host>   here's the file contents of <potentially shorter list>
<backup host>   I'll update the metadata listing you as the storage for
                 those files, and update our contract.
<selected peer> I'll update my metadata listing you as the owner (or partial
                 owner and add the storage totals (or partial storage totals)
                 to our contract.

After a backup is finished a copy of all metadata (filename and stat related 
info) will be sent to each peer for all files that are stored on that peer.

Then for reputation:
    foreach peer
       list=find_small_subset_of_files(peer)
       foreach(list_of_files)
          min=rand%file.size
          max=rand%file.size
	 if (max<min)
             swap(max,min)
          sha256=checksum_range(file,min,max)
          peer_sha256=challenge_peer(peer,file,min,max)
          if (sha256 == peer_sha256)
             increase_reputation(peer)
	 else
             decrease_reputation(peer)

So some differences from flud(from what I can tell), the p2p cloud will have 
traffic for the following:
* filesystem churn resulting in peer <-> peer file transfers
* DHT traffic for folks looking for new contracts
* challenge/requests for managing reputation
* occasional restores
* in general very network efficienct

Disadvantages vs Flud:
* lower storage efficiency, replicated files are found only among
   select peers not the entire dht
* lower storage efficiency, replication is less efficient than erasure codes

Disaster recovery instead of 345k DHT queries would just wait for a challenge 
from ones peers.  Then you could ask them for all the files they have of yours.

I've left out some details, maybe some of the ideas listed will be helpful.




More information about the flud-devel mailing list