directory isn't rendered at all sometimes #463

zooko · 2008-06-14T01:03:44Z

zooko commented

2008-06-14 01:03:44 +00:00

Justin wasn't connected to the introducer or to any servers, and when he looked at a directory, the boilerplate at the top rendered, but then no directory contents were rendered -- it just waited indefinitely. Brian said he thinks that if there are no storage servers at all then instead of giving an error about failing to download the SSK, it hangs.

Just now I saw the same thing. It looked like I did have many servers connected (on the Test Grid), but I wasn't sure whether the welcome page with the stats was stale, having been loaded earlier when I was connected to a different wireless network. I reloaded the status page and it showed the same status (as far as I noticed), and then I reloaded the directory and it loaded normally.

zooko added the

labels 2008-06-14 01:03:44 +00:00

zooko added this to the undecided milestone 2008-06-14 01:03:44 +00:00

zooko commented

2008-06-20 17:02:57 +00:00

This just happened to me again, and reloading the directory, even after the storage servers are connected, doesn't help -- it still fails to render the directory contents in the same way. Restarting the tahoe node, and waiting until the servers are connected before loading the directory, causes it to load normally.

zooko commented

2008-07-02 23:08:12 +00:00

This just happened again. Even though the node had been running for a long time and had many storage servers connected, the fact that I attempted to load the directory earlier, when too few servers were connected, appears to prevent it from ever loading until I restart my node. I guess this could have to do with our caching of the DirNode object.

warner commented

2008-07-07 06:56:02 +00:00

Hm, we keep the dirnode object around, but we don't really cache the results of the read (each time you do dirnode.read(), it will contact all the servers again).

Is it fairly reproducible? I'll see if I can trigger it under closer observation, maybe by starting a node on my laptop with the network disconnected, trying (and failing) to read the directory, then connecting the network, allowing servers to connect, and trying to read the directory again.

warner commented

2008-07-07 07:09:16 +00:00

Ok, so I am able to reproduce this locally. The second read fails because of our serialization strategy: the second read is not allowed to proceed until the first has finished, and the first one never finishes. Interrupting the GET doesn't cause the read to stop (although it probably should; the API doesn't lend itself to that, though).

I'll look more closely at what happens when there are no servers to be asked; that case is probably not handled correctly.
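The serialization effect described above can be illustrated with a minimal sketch. It uses `asyncio` rather than Tahoe's actual Twisted-based code, and `SerializedNode`, `hung_read`, and `healthy_read` are hypothetical names invented for the illustration. It only shows why a healthy second read can appear to hang when it is queued behind a first read that never completes.

```python
import asyncio


class SerializedNode:
    """Hypothetical stand-in for a node that runs one read at a time."""

    def __init__(self):
        self._lock = asyncio.Lock()  # only one read may hold this at a time

    async def read(self, operation):
        # Every read waits for the previous one to release the lock.
        async with self._lock:
            return await operation()


async def hung_read():
    # Models the first read: it waits on a result that never arrives.
    await asyncio.Future()


async def healthy_read():
    # Models a later read that would succeed if it were allowed to run.
    return "directory contents"


async def demo():
    node = SerializedNode()
    first = asyncio.ensure_future(node.read(hung_read))
    await asyncio.sleep(0)  # let the first read take the lock
    try:
        # The second read is healthy but is queued behind the hung one.
        await asyncio.wait_for(node.read(healthy_read), timeout=0.1)
        outcome = "second read finished"
    except asyncio.TimeoutError:
        outcome = "second read blocked"
    first.cancel()  # clean up the read that would otherwise wait forever
    return outcome


result = asyncio.run(demo())
```

In this sketch the only way to unblock later reads is to make the first read terminate, which matches the observation that restarting the node was the workaround.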

warner commented

2008-07-07 07:20:37 +00:00

Yup, it was never entering the state machine: the operation would just hang forever. Fixed (by changeset:91c7e0f6897827fe), although the new behavior is to emit a "no recoverable versions" error message, whereas if we aren't connected to any servers it might be more useful to say something like "I'm not connected to any servers".
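One way such a hang can arise is sketched below. This is an assumption about the general shape of the bug, not a reproduction of the referenced changeset: the `Retrieve` class, its `check_after_sending` flag, and the callable "servers" are all hypothetical. If progress is evaluated only when a response arrives, then with zero servers nothing ever evaluates it; checking once after the queries are sent lets the operation finish with an error instead.

```python
class Retrieve:
    """Hypothetical retrieval loop; not Tahoe's actual classes."""

    def __init__(self, servers, check_after_sending):
        self.servers = servers
        self.check_after_sending = check_after_sending
        self.answered = 0
        self.shares = []
        self.outcome = None  # stays None while the operation is "hung"

    def start(self):
        for server in self.servers:
            # Each server replies by calling back into the state machine.
            server(self._got_response)
        if self.check_after_sending:
            # The guard: evaluate progress even if no query was sent.
            self._check_for_done()

    def _got_response(self, share):
        self.answered += 1
        if share is not None:
            self.shares.append(share)
        self._check_for_done()

    def _check_for_done(self):
        if self.outcome is not None:
            return  # already finished
        if self.shares:
            self.outcome = "recovered"
        elif self.answered == len(self.servers):
            # Everyone we asked has answered and nothing usable came back.
            self.outcome = "error: no recoverable versions"


# With no servers and no guard, nothing ever calls _check_for_done().
hung = Retrieve([], check_after_sending=False)
hung.start()

# With the guard, the empty-server case ends in an error.
fixed = Retrieve([], check_after_sending=True)
fixed.start()

# A server that answers with a share still works as before.
normal = Retrieve([lambda reply: reply("share")], check_after_sending=True)
normal.start()
```

A friendlier variant could test `len(self.servers) == 0` separately and report that no servers are connected, as suggested in the comment above.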

warner commented

2008-07-07 07:21:06 +00:00

Leaving this open for a while longer, because it needs a unit test.

warner self-assigned this 2008-07-07 07:21:06 +00:00

warner commented

2008-07-07 19:23:17 +00:00

changeset:2074c92dd13abb23 adds the unit tests. They don't exercise exactly the same situation that Justin saw (a webapi GET of a dirnode while no servers are connected), but they should cover the same underlying problems.

I think it's safe to close this one now.

warner added the

fixed

label 2008-07-07 19:23:17 +00:00

warner closed this issue

2008-07-07 19:23:17 +00:00
