Hate List part 4
So, I came into the Computer Science building to make sure that my program would still work there, and rediscovered how patchy the wireless network is.
After about 20 minutes of fiddling with my laptop, I got it to connect to the wireless network and ran my program, which worked really well for all of the ten minutes I could stay connected. After that, no matter what I tried, I couldn’t reconnect to the network. Bah.
Screenshot
Ahead of my presentation on Monday, here’s a screenshot of the program I’ve written in action:
When you click on one of the topics in the table on the left, a window pops up showing the page that the topic represents. The window doesn’t show the actual layout of the page or display the css that formats it, but it’s far more useful than a tree showing how all the links were found.
Current project status
At the moment the coding part is going OK. I’m a bit behind my schedule, but that’s mostly been down to feeling ropey for the larger part of last week. Hopefully I’ll be able to catch up over the next week, so it shouldn’t be too much of a problem.
Synchronization and OutOfMemory errors
So, I started to wonder why I only ever got 4 pages of urls whenever I looked at blogs with a lot of links on them.
Looking at the API for ArrayList reveals that it isn’t thread-safe: it can go a bit funny when accessed by multiple threads without synchronization. You can either wrap the ArrayList using Collections.synchronizedList to deal with it, or make the threads run synchronously.
Making the threads run synchronously gives an OutOfMemory error.
So it looks like I’m going to have a stab at wrapping the ArrayList with Collections.synchronizedList.
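A minimal sketch of the wrapping approach, assuming the crawler threads all share one list of found urls (the class and method names here are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class UrlStore {

    // Collections.synchronizedList wraps the plain ArrayList so that
    // concurrent adds from several crawler threads can't corrupt its
    // internal state.
    private final List<String> urls =
            Collections.synchronizedList(new ArrayList<String>());

    public void add(String url) {
        urls.add(url);
    }

    public int size() {
        return urls.size();
    }

    public static void main(String[] args) throws InterruptedException {
        final UrlStore store = new UrlStore();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < 1000; j++) {
                        store.add("http://example.com/" + j);
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        // With the synchronized wrapper, no adds are lost.
        System.out.println(store.size());
    }
}
```

One caveat from the Javadoc: iterating over the wrapped list still needs a manual synchronized block on it; only the individual method calls are made safe.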
Proxy Server and graphing
For outgoing HTTP connections you must use the proxy server webcache on port 3128 (see the instructions). Otherwise the program may not be able to run within SoCS!
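The standard JVM system properties can set this from inside the program rather than on the command line; a sketch, with the host and port taken from the note above:

```java
public class ProxyConfig {
    public static void configure() {
        // Route all java.net HTTP connections through the
        // departmental proxy named in the SoCS instructions.
        System.setProperty("http.proxyHost", "webcache");
        System.setProperty("http.proxyPort", "3128");
    }
}
```

The same effect can be had with -Dhttp.proxyHost=webcache -Dhttp.proxyPort=3128 on the java command line, which keeps the proxy details out of the code.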
Also, look at JGraph.
Figure out a way to use the previous version of the Java JDK on the departmental machines. The departmental machines all use Java 1.5-something. Will have to find a laptop-carrying minion or just carry the infernal machine myself.
Rollback
After hours and hours of hacking away at the spider code to try and make it work under the latest version of Java, it looks like it’ll be easier just to revert to the previous version.
Which is slightly annoying, but at least everything should compile without arguing about enum being a keyword.
Manageability – Open Source Web Crawlers Written in Java
A list of various open source spiders written in Java – something to look at.
To recurse or not to recurse?
At the moment, I’m working on a smaller version of my initial idea that should be far more manageable and less completely daunting. Anyway, the spider part of the project is still important and I’ve been thinking about whether it should be recursive or not.
If it was programmed recursively, the spider would look at a webpage, collect all the urls in it, and call itself to process the first one; then it would call itself again on the first url of that page, and so on and on. Before long there would be a huge pile of calls sitting on the stack, each waiting for something further down to finish searching the internet (which would probably end up in a loop anyway), all the memory would get used up, my computer would explode, and I would be a sad bunny.
So, recursion isn’t the way, as great as it is – it’s only good if I have maybe a handful of things to look at. I need to plan for a heck load of urls in a page.
Therefore, a method that would not use recursion is needed. This is pretty much just going through the page and keeping a list of the urls I have to visit. I’ll probably stick some processing in there to check whether something is a blog or not from the address and then maybe some stuff to check if it’s a blog from the actual page. Any non-blog pages can be discarded.
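A rough sketch of the non-recursive approach: keep a queue of urls still to visit and a set of urls already seen, and loop until the queue is empty. The links map below stands in for the real page download and link extraction (the class and method names are made up for illustration):

```java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class Spider {

    /**
     * Visit every page reachable from start without recursion: an
     * explicit queue replaces the call stack, so a page with a heck
     * load of urls can't blow the stack, and the visited set stops
     * link loops from running forever. The links map stands in for
     * fetching a page and extracting its urls.
     */
    public static Set<String> crawl(String start,
                                    Map<String, List<String>> links) {
        Set<String> visited = new HashSet<String>();
        Queue<String> toVisit = new LinkedList<String>();
        toVisit.add(start);
        while (!toVisit.isEmpty()) {
            String url = toVisit.remove();
            if (!visited.add(url)) {
                continue; // already seen this page, skip it
            }
            // A real spider would discard non-blog pages here,
            // first by address and then by page content.
            List<String> found = links.get(url);
            if (found != null) {
                toVisit.addAll(found);
            }
        }
        return visited;
    }
}
```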
Another thing that occurred to me is that I could implement a nice shiny graphical display of the related blogs using a library I used for my Team Java project. I should find out what the stance is on using libraries, and how much library use is allowed.
Search methods and things
Breadth-first search would result in computers exploding: the list of unvisited links grows far too quickly before I could get anywhere with results.
Depth-first search would work better, but I may still need to impose a limit on the number of links down it goes.
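A sketch of the depth limit, again with a map standing in for the real fetch (names are made up for illustration): each stacked entry carries the depth it was found at, and pages at the limit are recorded but not expanded.

```java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DepthLimitedSpider {

    /** Depth-first visit that stops following links beyond maxDepth. */
    public static Set<String> crawl(String start, int maxDepth,
                                    Map<String, List<String>> links) {
        Set<String> visited = new HashSet<String>();
        // The stack holds urls paired with the depth they were found at.
        LinkedList<Object[]> stack = new LinkedList<Object[]>();
        stack.addFirst(new Object[] { start, Integer.valueOf(0) });
        while (!stack.isEmpty()) {
            Object[] entry = stack.removeFirst();
            String url = (String) entry[0];
            int depth = ((Integer) entry[1]).intValue();
            if (!visited.add(url)) {
                continue; // already seen this page
            }
            if (depth >= maxDepth) {
                continue; // at the limit: keep the page, go no deeper
            }
            List<String> found = links.get(url);
            if (found != null) {
                for (String next : found) {
                    stack.addFirst(new Object[] { next,
                            Integer.valueOf(depth + 1) });
                }
            }
        }
        return visited;
    }
}
```

Pushing onto the front of the LinkedList and popping from the front is what makes this depth-first; swapping both for queue operations would turn it back into breadth-first search.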
In other news, it occurred to me that podcasts have a very similar format to blogs (in fact, they are blogs, just with the addition of sound files in their feed), so I may need to find some way of dealing with this.
Also, Talkr is a service that turns text-based blogs into podcasts. It looks fairly interesting, but I doubt I’ll ever use it: I read all the blogs I subscribe to quicker than I could listen to them, and I already have a hefty menu of podcasts that I listen to regularly.