Hate List part 4

So, I came into the Computer Science building to make sure that my program would still work while I’m in the building, and rediscovered how patchy the wireless network is.

After about 20 minutes of fiddling with my laptop, I got it to connect to the wireless network and ran my program, which worked really well for the whole 10 minutes I could stay connected. After that, no matter what I tried, I couldn’t reconnect to the network. 🙁 Bah.

Screenshot

Ahead of my presentation on Monday, here’s a screenshot of the program I’ve written in action:
[Image: screenshot of the program]

When you click on one of the topics in the table on the left, a window pops up showing the page that the topic represents. The window doesn’t show the actual layout of the page or apply the CSS that formats it, but it’s far more useful than a tree showing how all the links were found.

Being Synchronous and OutOfMemory errors

So, I started to wonder why I only ever got 4 pages of urls whenever I looked at blogs with a lot of links on them.

Looking at the API docs for ArrayList reveals that it isn’t synchronized, so it can go a bit funny when several threads modify it at once. You can either wrap the ArrayList using Collections.synchronizedList to deal with it, or synchronize the threads’ access to it yourself.

Synchronizing the threads myself gives an OutOfMemoryError.

So it looks like I’m going to have a stab at wrapping the ArrayList with Collections.synchronizedList.
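For reference, here is a minimal sketch of the wrapping approach. The class name, the helper method and the example.com URLs are all made up for illustration, and it assumes a current Java version (lambdas and the diamond operator) rather than the older one the spider targets. Only `Collections.synchronizedList` itself comes from the standard library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SharedUrlList {
    // Hypothetical helper: starts `threads` workers that each add
    // `perThread` made-up URLs to the same shared list, waits for them
    // all to finish, and returns the final size of the list.
    static int fillFromThreads(List<String> urls, int threads, int perThread)
            throws InterruptedException {
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread worker = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    urls.add("http://example.com/" + id + "/" + i);
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        return urls.size();
    }

    public static void main(String[] args) throws InterruptedException {
        // The wrapper makes each individual call (add, get, size, ...) atomic.
        List<String> urls = Collections.synchronizedList(new ArrayList<String>());
        System.out.println(fillFromThreads(urls, 4, 1000));
    }
}
```

One thing to watch: the wrapper only protects single calls. Iterating over the list still needs a manual `synchronized (urls) { ... }` block around the loop, as the Collections documentation points out.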

Proxy Server and graphing

For outgoing HTTP connections you must use the proxy server webcache port 3128.

See instructions

Otherwise program may not be able to run within SoCS!
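A rough sketch of one way to do this from inside the program is to set the standard Java networking properties `http.proxyHost` and `http.proxyPort` before opening any connections. The class and method names here are made up, and the bare host name "webcache" is just what the note above says; the full host name may be different.

```java
public class ProxySetup {
    // Points outgoing HTTP connections made through java.net
    // (URL, HttpURLConnection) at the departmental proxy.
    // Assumption: the proxy host resolves as plain "webcache".
    static void useDepartmentProxy() {
        System.setProperty("http.proxyHost", "webcache");
        System.setProperty("http.proxyPort", "3128");
    }

    public static void main(String[] args) {
        useDepartmentProxy();
        System.out.println(System.getProperty("http.proxyHost")
                + ":" + System.getProperty("http.proxyPort"));
    }
}
```

The same two properties can also be passed on the command line with `-Dhttp.proxyHost=webcache -Dhttp.proxyPort=3128`, which avoids hard-coding them.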

Also, look at JGraph.

Figure out way to use previous version of the JDK on departmental machines. Departmental machines all use Java 1.5blahblahblah. Will have to find laptop-carrying minion or just carry the infernal machine myself. 🙁

Rollback

After hours and hours hacking away at the spider code to try and make it work under the latest version of Java, it looks like it’ll be easier just to revert to the previous version of Java.

Which is slightly annoying, but at least it should all compile correctly without arguing about enum being a keyword.

To recurse or not to recurse?

At the moment, I’m working on a smaller version of my initial idea that should be far more manageable and less completely daunting. Anyway, the spider part of the project is still important and I’ve been thinking about whether it should be recursive or not.

If it was programmed recursively, the spider would look at a webpage and all the urls in it and process the first url…calling itself to do so and then looking at the links on that page and then calling itself on the first url on that page…and so on and on and on and then my computer would explode because there would be so many calls on the stack just waiting for something to finish searching the internet, which would probably send it into a loop anyway and all the memory would get used up and I would then be a sad bunny.

So, recursion isn’t the way, as great as it is – it’s only good if I have maybe a handful of things to look at. I need to plan for a heck load of urls in a page.

Therefore, a method that would not use recursion is needed. This is pretty much just going through the page and keeping a list of the urls I have to visit. I’ll probably stick some processing in there to check whether something is a blog or not from the address and then maybe some stuff to check if it’s a blog from the actual page. Any non-blog pages can be discarded.
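The non-recursive idea might look something like the sketch below. It is not the real spider: `linksOnPage` is a stand-in for downloading a page and pulling out its links, `looksLikeBlog` is a deliberately naive placeholder for the address check, and the example URLs are invented. It also assumes a current Java version.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class WorklistSpider {
    // Placeholder check based only on the address; a real check would
    // need better rules and probably a look at the page itself.
    static boolean looksLikeBlog(String url) {
        return url.contains("blog");
    }

    // Visits pages using a list of urls still to visit instead of recursion.
    // `linksOnPage` maps a url to the links found on that page.
    static List<String> crawl(String start, Map<String, List<String>> linksOnPage) {
        Deque<String> toVisit = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();      // stops the spider going round in loops
        List<String> blogs = new ArrayList<>();
        toVisit.add(start);
        seen.add(start);
        while (!toVisit.isEmpty()) {
            String url = toVisit.remove();
            if (!looksLikeBlog(url)) {
                continue;                        // non-blog pages are discarded
            }
            blogs.add(url);
            List<String> links = linksOnPage.getOrDefault(url, Collections.<String>emptyList());
            for (String link : links) {
                if (seen.add(link)) {            // add() returns false if already seen
                    toVisit.add(link);
                }
            }
        }
        return blogs;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<>();
        web.put("http://alice-blog.example", Arrays.asList("http://bob-blog.example", "http://news.example"));
        web.put("http://bob-blog.example", Arrays.asList("http://alice-blog.example"));
        System.out.println(crawl("http://alice-blog.example", web));
    }
}
```

Because the pending urls live in an ordinary collection rather than on the call stack, a page with a heck load of links just makes the list longer instead of making the stack deeper.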

Another thing that occurred to me was that I could implement some kind of nice shiny graphical display of the related blogs by using a library that I used for my Team Java project. Should find out what the stance is on using libraries and stuff and the extent of library using allowed.

Search methods and things

Breadth-first search would result in computers exploding due to the size of the database before I could get anywhere with results.

Depth-first search would work better, but I may still need to impose a limit on how many links deep it goes.
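A depth limit could be sketched roughly like this, using an explicit stack so the search stays non-recursive. Everything here is illustrative: the class names, the `linksOnPage` map standing in for real page fetching, and the choice of `maxDepth` are all assumptions, and it assumes a current Java version.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DepthLimitedSearch {
    // A url paired with how many links were followed to reach it.
    static final class Node {
        final String url;
        final int depth;
        Node(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    // Depth-first search that stops following links once a path
    // is `maxDepth` links long. The start page is at depth 0.
    static List<String> search(String start, Map<String, List<String>> linksOnPage, int maxDepth) {
        Deque<Node> stack = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();
        stack.push(new Node(start, 0));
        seen.add(start);
        while (!stack.isEmpty()) {
            Node node = stack.pop();             // most recently found link first
            visited.add(node.url);
            if (node.depth == maxDepth) {
                continue;                        // at the limit: don't go any deeper
            }
            List<String> links = linksOnPage.getOrDefault(node.url, Collections.<String>emptyList());
            for (String link : links) {
                if (seen.add(link)) {
                    stack.push(new Node(link, node.depth + 1));
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<>();
        web.put("a", Arrays.asList("b"));
        web.put("b", Arrays.asList("c"));
        web.put("c", Arrays.asList("d"));
        System.out.println(search("a", web, 2));
    }
}
```

One caveat with this simple version: a page is marked as seen the first time it is found, so if it is first reached along a long path its own links may be cut off by the limit even though a shorter path to it exists.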

In other news, it occurred to me that podcasts have a very similar format to blogs (in fact, they are blogs, just with the addition of sound files in their feed), so I may need to find some way of dealing with this.

Also, Talkr is a service that turns text-based blogs into podcasts. Looks fairly interesting, but I doubt I’ll ever use it. I read all the blogs I subscribe to quicker than I could listen to them and I already have a hefty menu of podcasts that I listen to regularly.