Wednesday, December 16, 2015

Web Image Downloader - WebsiteConstructor

Some exciting stuff coming up today. It's finally time for the Website and WebsiteImage models. And like a cherry on top of a big whipped-cream, pancake, ice-cream pie, there will be the Constructor, which I'll create!
So, first of all: models. Nothing more is needed right now.
Now that I have some containers, it'll be good to know what to expect from them. I no longer need to simulate any web operations; right now I care only about returning valid models from my WebsiteConstructor and WebsiteImageConstructor. I was wondering what the input for such a constructor should be and decided to go with my HtmlParser, since it already takes page content (plus the url as a second parameter) and returns its elements, so I don't have to fiddle with modifying the model it returns. That results in a need for one more stub (I've already tested my parser, so these tests don't have to go through that again). I extracted an interface for its vital methods and created a stub that returns the elements I want.
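The extracted interface and stub could look something along these lines (the exact member names and stubbed values here are illustrative, not copied from the repo):

```csharp
using System.Collections.Generic;

// Only the parser members the constructors actually need.
public interface IHtmlParser
{
    string Url { get; }
    string GetTitle();
    IEnumerable<string> GetImageLinks();
}

// Returns fixed elements so the constructor tests are fully deterministic.
public class HtmlParserStub : IHtmlParser
{
    public const string StubTitle = "Stubbed page title";
    public static readonly string[] StubImageLinks =
    {
        "http://example.com/images/first.png",
        "http://example.com/images/second.jpg"
    };

    public string Url { get { return "http://example.com"; } }

    public string GetTitle()
    {
        return StubTitle;
    }

    public IEnumerable<string> GetImageLinks()
    {
        return StubImageLinks;
    }
}
```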
At this point I know exactly what result I want. Now I just have to make sure that WebsiteConstructor returns it.
Url and Name are directly available from IHtmlParser, but it doesn't return any WebsiteImage, just the string urls it can dig out of the content. That's good; it's not supposed to do that. For that purpose I'll use another constructor that returns the classes I need. Let's take a second to test that first. That WebsiteImageConstructor could be made static, but for now I don't need it anywhere else, so I'll stick to instantiating it. It'll take just the string url the parser spits out; as a matter of fact, I could take the entire collection, so that's what I'll go for. HtmlParser already throws ImagesNotFoundException when no images are found, so I wasn't sure whether I should do the same when the passed collection contains no elements. Probably yes, because this class is still public. It could be private, but then its tests would depend on WebsiteConstructor, and I don't want that. Let's throw some exceptions instead. I had the idea of just returning null, but that wouldn't be meaningful enough. Maybe I'm overdoing it, but it seems to work so far, so let's stick with this idea for now.
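A sketch of what I mean, with a minimal WebsiteImage model; ImagesNotFoundException is the parser's own exception, and the exact shapes here are approximate:

```csharp
using System.Collections.Generic;
using System.Linq;

public class WebsiteImage
{
    public string Url { get; private set; }

    public WebsiteImage(string url)
    {
        Url = url;
    }
}

public class WebsiteImageConstructor
{
    // Takes the whole collection of urls the parser spat out.
    public IEnumerable<WebsiteImage> Create(IEnumerable<string> imageUrls)
    {
        // Mirror the parser's behaviour: an empty input is an error,
        // not a silent null.
        if (imageUrls == null || !imageUrls.Any())
            throw new ImagesNotFoundException();

        return imageUrls.Select(url => new WebsiteImage(url)).ToList();
    }
}
```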
The final step for today's session is to make the test for WebsiteConstructor pass. I still have doubts about the exceptions I keep throwing, but here's how I see it: to create a website I use IHtmlParser, which means the webpage exists and I was able to get content from it; the parser gives me the title and the images. When it doesn't find any images, an exception is thrown and I know I can discard the page from further steps (namely, downloading images), but if some other implementation of the parser were used, I'd lose that behaviour, and I don't want that. There's one thing I noticed just now: in my parser I also throw an exception if no title is found, and it's time to handle that exception here. It just feels right for WebsiteConstructor to take care of assigning a default name for an unknown website, not the IHtmlParser. So, one more case to test.
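Roughly how I picture the finished WebsiteConstructor, leaning on the pieces sketched above (the DefaultName constant and method shapes are illustrative):

```csharp
using System.Collections.Generic;

// Minimal stand-in for the Website model created at the start of this post.
public class Website
{
    public string Name { get; private set; }
    public string Url { get; private set; }
    public IEnumerable<WebsiteImage> Images { get; private set; }

    public Website(string name, string url, IEnumerable<WebsiteImage> images)
    {
        Name = name;
        Url = url;
        Images = images;
    }
}

public class WebsiteConstructor
{
    public const string DefaultName = "Unknown website";

    private readonly WebsiteImageConstructor _imageConstructor = new WebsiteImageConstructor();

    public Website Create(IHtmlParser parser)
    {
        string name;
        try
        {
            name = parser.GetTitle();
        }
        catch (TitleNotFoundException)
        {
            // The constructor, not the parser, decides the fallback name.
            name = DefaultName;
        }

        // ImagesNotFoundException deliberately bubbles up: a page without
        // images gets discarded from further processing.
        var images = _imageConstructor.Create(parser.GetImageLinks());

        return new Website(name, parser.Url, images);
    }
}
```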

Tuesday, December 15, 2015

Web Image Downloader - HtmlParser

Before I started today, I decided to go back for a moment and look at my tests so far. Everything seems fine except for the WebRetriever class I extracted yesterday: its tests were part of HeaderRetrieverTest, and one case was also missing. Let me quickly show you the changes.
The first thing was to merge WebRequestCreatorForContentStub with WebRequestCreatorForHeaderStub. You can see that I return three stub requests. The last two of them are there to test my actual implementations of HeaderRetriever and ContentRetriever, which I've already covered. The one with an error simulates the situation where the response comes back with an error code, let's say NotFound; that was the case I was missing. The other test just makes sure that if the StatusCode is alright, the abstract method gets to return.
To complete those two tests, a couple more stubs were required.
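The one for the error case could be as simple as this (names invented; the interfaces are the facade ones from the Header Retriever post further down):

```csharp
using System.IO;
using System.Net;

// A request stub that always answers with an error response,
// so the NotFound scenario needs no network at all.
public class ErrorWebRequestStub : IHttpWebRequest
{
    public string Method { get; set; }
    public int Timeout { get; set; }

    public IHttpWebResponse GetResponse()
    {
        return new ErrorResponseStub(HttpStatusCode.NotFound);
    }
}

public class ErrorResponseStub : IHttpWebResponse
{
    private readonly HttpStatusCode _statusCode;

    public ErrorResponseStub(HttpStatusCode statusCode)
    {
        _statusCode = statusCode;
    }

    public HttpStatusCode StatusCode { get { return _statusCode; } }
    public WebHeaderCollection Headers { get { return new WebHeaderCollection(); } }
    public Stream Stream { get { return Stream.Null; } }

    public void Dispose() { }
}
```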
Alright, enough enhancements for now. Let's get back to the main quest. Today I'll take care of the following step, or rather its second half, as the first one is already complete: content is being read from the website and strings matching the img pattern extracted.
To accomplish this I decided to take the easy way and use the HtmlAgilityPack library instead of scratching anything up on my own.
There are just four cases I want to check for now, so here they are:
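In outline, something like this (NUnit-style; the exact names and assertions are approximate):

```csharp
using NUnit.Framework;

[TestFixture]
public class HtmlParserTest
{
    private const string _url = "http://example.com";
    private const string _title = "Test page";
    private const string _link1 = "http://example.com/images/a.png";
    private const string _link2 = "/images/b.png"; // relative on purpose

    private const string _content =
        "<html><head><title>" + _title + "</title></head>" +
        "<body><img src=\"" + _link1 + "\"/><img src=\"" + _link2 + "\"/></body></html>";

    private const string _emptyContent = "<html><body><p>nothing here</p></body></html>";

    [Test]
    public void ReturnsTitle()
    {
        Assert.AreEqual(_title, new HtmlParser(_content, _url).GetTitle());
    }

    [Test]
    public void ThrowsWhenTitleIsMissing()
    {
        Assert.Throws<TitleNotFoundException>(() => new HtmlParser(_emptyContent, _url).GetTitle());
    }

    [Test]
    public void ReturnsImageLinks()
    {
        CollectionAssert.AreEquivalent(
            new[] { _link1, _link2 },
            new HtmlParser(_content, _url).GetImageLinks());
    }

    [Test]
    public void ThrowsWhenNoImagesAreFound()
    {
        Assert.Throws<ImagesNotFoundException>(() => new HtmlParser(_emptyContent, _url).GetImageLinks());
    }
}
```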
Making these pass will assure me that I'm really close to starting on the models. There will be one more step before that, though. You can notice that _link2 is relative: some webpages provide full urls, while others show just relative paths. To download anything I need an absolute path to the resource, but that's nothing to worry about this time; I just have to make sure that's what comes back from my parser.
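Making a link absolute is nearly a one-liner with Uri; a helper along these lines (name and placement invented) will do:

```csharp
using System;

public static class UrlAbsolutizer
{
    public static string MakeAbsolute(string pageUrl, string link)
    {
        Uri absolute;

        // A link that already parses as absolute is kept as-is.
        if (Uri.TryCreate(link, UriKind.Absolute, out absolute))
            return absolute.ToString();

        // Otherwise resolve the relative path against the page it came from.
        return new Uri(new Uri(pageUrl), link).ToString();
    }
}
```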
Thanks to HtmlAgilityPack, implementing this class is a piece of cake. Both SelectSingleNode and SelectNodes take an XPath expression to look for the nodes; you can read about it here. It was helpful to take a look at the comments on those methods: I discovered that if no nodes are found they simply return null, which made the check in GetImageLinks() easier, as I didn't have to verify whether the collection was empty. Handling those exceptions will later allow me to apply a default title for a website, or just discard it from further processing while notifying the user that there's nothing to download there... And that made me realise it will come in handy to derive from HtmlWebException and make two new exceptions that tell me directly what's missing, so I can avoid any further checks when this occurs.
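The two exceptions and the parser itself could come together roughly like this (the XPath expressions and member shapes are my best reading of the description above):

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Derived from HtmlWebException so they tell me directly what's missing.
public class TitleNotFoundException : HtmlWebException
{
    public TitleNotFoundException() : base("No <title> node found.") { }
}

public class ImagesNotFoundException : HtmlWebException
{
    public ImagesNotFoundException() : base("No <img> nodes found.") { }
}

public class HtmlParser : IHtmlParser
{
    private readonly HtmlDocument _document = new HtmlDocument();

    public string Url { get; private set; }

    public HtmlParser(string pageContent, string url)
    {
        _document.LoadHtml(pageContent);
        Url = url;
    }

    public string GetTitle()
    {
        // SelectSingleNode returns null when nothing matches the XPath.
        var titleNode = _document.DocumentNode.SelectSingleNode("//title");
        if (titleNode == null)
            throw new TitleNotFoundException();

        return titleNode.InnerText;
    }

    public IEnumerable<string> GetImageLinks()
    {
        // SelectNodes likewise returns null rather than an empty collection,
        // which spares me a separate emptiness check.
        var imageNodes = _document.DocumentNode.SelectNodes("//img[@src]");
        if (imageNodes == null)
            throw new ImagesNotFoundException();

        return imageNodes.Select(node => node.GetAttributeValue("src", "")).ToList();
    }
}
```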
With those two wonderful exceptions and four more green lights on my path, I can say that today's goal is accomplished.

Monday, December 14, 2015

Web Image Downloader - ContentRetriever

At this point I'm already able to determine whether a website exists, so I finally have a need to get the page content and read the elements I'm interested in. That's the task for today. First of all, I'll be working with urls that have already been checked, so I believe I can skip the tests that check for url-related exceptions. The class I'll be testing is ContentRetriever, which I'll introduce in a moment. There's only one test I want to complete here, as the class has only one responsibility: return the page content as a string. That means I have to deal with a stream; to be more specific, the Stream that will come from my CustomWebResponse.

Here's the test to accomplish my goal. You may remember, or can look up, the previous post about HeaderRetriever. I decided to refactor it, extracting a base class so it can be reused the way I need. The base class WebRetriever consists almost entirely of the class I extracted it from. So right now all I need to do is derive from it, pass my method string and an instance of IWebRequestCreator to the constructor, and override the abstract method HandleResponse, which simply lets me pick and return whichever part of the response I want.
To make the test go green again, I simply did the steps I just mentioned and, as a result, got a quite slim, even easier to read HeaderRetriever:
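Reconstructed from the description, the base class and the slimmed-down HeaderRetriever could look like this (the generic result type and the exception thrown on a bad status are my guesses; the facade interfaces are the ones from the Header Retriever post below):

```csharp
using System.Net;

// Shared request/response flow; derived classes pick the HTTP method
// and decide which part of the response they care about.
public abstract class WebRetriever<TResult>
{
    private readonly string _method;
    private readonly IWebRequestCreator _requestCreator;

    protected WebRetriever(string method, IWebRequestCreator requestCreator)
    {
        _method = method;
        _requestCreator = requestCreator;
    }

    public TResult Retrieve(string url)
    {
        var request = _requestCreator.Create(url);
        request.Method = _method;

        using (var response = request.GetResponse())
        {
            // Only an OK status lets the abstract method return.
            if (response.StatusCode != HttpStatusCode.OK)
                throw new WebException("Response came back with status " + response.StatusCode);

            return HandleResponse(response);
        }
    }

    protected abstract TResult HandleResponse(IHttpWebResponse response);
}

public class HeaderRetriever : WebRetriever<WebHeaderCollection>
{
    public HeaderRetriever(IWebRequestCreator requestCreator)
        : base("HEAD", requestCreator) { }

    protected override WebHeaderCollection HandleResponse(IHttpWebResponse response)
    {
        return response.Headers;
    }
}
```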
At this point ContentRetriever was already a piece of cake. The one thing I had to do in addition was add the Stream property to my IHttpWebResponse.
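Under the same assumptions, ContentRetriever then shrinks to this:

```csharp
using System.IO;

// GET the page and hand the body back as a string; Stream is the
// property freshly added to IHttpWebResponse.
public class ContentRetriever : WebRetriever<string>
{
    public ContentRetriever(IWebRequestCreator requestCreator)
        : base("GET", requestCreator) { }

    protected override string HandleResponse(IHttpWebResponse response)
    {
        // Dispose the reader (and with it the stream) as soon as the
        // content is captured, so no stream outlives this call.
        using (var reader = new StreamReader(response.Stream))
        {
            return reader.ReadToEnd();
        }
    }
}
```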
And that's pretty much it for the production code. The test would pass, but again, I don't want to rely on an internet connection while testing. There's a functional test to check that it also works with a real connection, but otherwise I need to mock it again. You've probably already noticed that in the test I used some static members of the WebRequestCreatorForContentStub class. That will probably need more refactoring, as it can easily be merged with my creator for headers, but for now I just created another stub for IWebRequestCreator. The most important reason here was that I wanted some constant strings to rely on.
The last thing to make it work was an IHttpWebRequest stub that allows me to get some made-up stream. Notice that it's enough for the StreamWriter to flush its buffer; disposing it would cause us to lose that precious stream.
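The stub could be put together like this (names and the canned content are invented):

```csharp
using System.IO;
using System.Net;

public class FakeContentWebRequestStub : IHttpWebRequest
{
    public const string FakeContent = "<html><head><title>stub</title></head><body></body></html>";

    public string Method { get; set; }
    public int Timeout { get; set; }

    public IHttpWebResponse GetResponse()
    {
        var stream = new MemoryStream();
        var writer = new StreamWriter(stream);
        writer.Write(FakeContent);
        writer.Flush();      // enough to push the buffered chars into the stream...
        // ...but no Dispose(): that would close the MemoryStream with it.
        stream.Position = 0; // rewind so the reader starts from the top
        return new FakeContentResponseStub(stream);
    }
}

public class FakeContentResponseStub : IHttpWebResponse
{
    private readonly Stream _stream;

    public FakeContentResponseStub(Stream stream)
    {
        _stream = stream;
    }

    public HttpStatusCode StatusCode { get { return HttpStatusCode.OK; } }
    public WebHeaderCollection Headers { get { return new WebHeaderCollection(); } }
    public Stream Stream { get { return _stream; } }

    public void Dispose() { _stream.Dispose(); }
}
```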
Now the test passes without access to the internet and another step is complete. There's always some effort to put into these tests, but I cannot overstate the feeling of having all the code under control just by looking at the list of green lights. Furthermore, this implementation of ContentRetriever is easy on memory, as it never keeps any stream alive longer than necessary to read it and return the content as a string for the next step to process.

Sunday, December 13, 2015

Web Image Downloader - Header Retriever

Some time has passed, so I decided to add some more content to this marvelous blog. This time the idea is to create software that can download images from websites. There are probably tons of programs for doing that, but let me play around for a while :]

That's pretty much how it's supposed to work:

  • user enters a website url into a field and clicks the download button
  • the program checks whether the page exists and, if so, places it on the download list
  • content is being read from the website and strings matching the img pattern extracted
  • each extracted url is checked to see whether it exists and, if so, how big the file is
  • files that pass are displayed to the user, so he knows what's being downloaded
  • downloaded files go to a folder specified by the user

As I've already traveled the route of extreme overengineering here, I decided to calm down and keep things as simple as I can, without skipping any unit testing and while mocking the hell out of it.
The first step is easy and clear: an entered url either passes or it doesn't.
To make those tests work I simply decided to check Uri.Scheme: if the scheme matches the desired ones, the url passes; if the scheme doesn't match, or the url is null, whitespace, empty, or simply cannot be created by Uri.TryCreate, then it doesn't. Simple as that. I thought about some url fixer, but that'll come later on.
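In code that boils down to something like this (the class name is made up):

```csharp
using System;

public class UrlValidator
{
    public bool Validate(string url)
    {
        // Null, empty or whitespace never passes.
        if (string.IsNullOrWhiteSpace(url))
            return false;

        // Neither does anything Uri.TryCreate cannot make sense of.
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
            return false;

        // Only the desired web schemes pass.
        return uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps;
    }
}
```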
Now the sweet thing from the title: HeaderRetriever. To get information about a file's size without actually downloading it, I decided to go for the headers. Part of the response also provides me with an HttpStatusCode, so I can first determine whether I can work with the given url at all. To test it, I decided to go for a couple of interfaces and some facades. The classes required for this to work are WebRequest and HttpWebResponse (as plain WebResponse doesn't expose the HttpStatusCode I want).
Here are the tests:
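In spirit they come down to these two cases (the stub's constructor and the names here are invented):

```csharp
using System.Net;
using NUnit.Framework;

[TestFixture]
public class HeaderRetrieverTest
{
    [Test]
    public void ReturnsHeadersWhenStatusCodeIsOk()
    {
        var retriever = new HeaderRetriever(new WebRequestCreatorForHeaderStub(HttpStatusCode.OK));

        var headers = retriever.Retrieve("http://example.com/picture.png");

        Assert.IsNotNull(headers);
    }

    [Test]
    public void ThrowsWhenStatusCodeIndicatesError()
    {
        var retriever = new HeaderRetriever(new WebRequestCreatorForHeaderStub(HttpStatusCode.NotFound));

        Assert.Throws<WebException>(() => retriever.Retrieve("http://example.com/missing.png"));
    }
}
```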

Now the fun part. As you may or may not know, you can get that WebRequest using its static Create method, but I don't like it static; it's bad for testing and your teeth. So the first step was to be able to inject an IWebRequestCreator into my retriever. You may notice that in System.Net there's already an IWebRequestCreate interface. Unfortunately, it forced me to return WebRequest, and I wanted the simplest way, so I created my own interface and decided to return another one, IHttpWebRequest. That one exposes just two properties, Timeout and Method, and allows me to get an IHttpWebResponse, which I'll show in a moment. As a matter of fact, it'll probably be easier to show them now and describe what you see.
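Sketched out, the interfaces and their facade implementations could look like this (reconstructed from the description above; the Stream property only arrives later, in the ContentRetriever post):

```csharp
using System;
using System.Net;

public interface IWebRequestCreator
{
    IHttpWebRequest Create(string url);
}

public interface IHttpWebRequest
{
    int Timeout { get; set; }
    string Method { get; set; } // a small step into the future, for page content
    IHttpWebResponse GetResponse();
}

public interface IHttpWebResponse : IDisposable
{
    HttpStatusCode StatusCode { get; }
    WebHeaderCollection Headers { get; } // the part HeaderRetriever is after
}

// Facades: just pass the original request/response in and delegate.
public class CustomWebRequest : IHttpWebRequest
{
    private readonly HttpWebRequest _request;

    public CustomWebRequest(HttpWebRequest request)
    {
        _request = request;
    }

    public int Timeout
    {
        get { return _request.Timeout; }
        set { _request.Timeout = value; }
    }

    public string Method
    {
        get { return _request.Method; }
        set { _request.Method = value; }
    }

    public IHttpWebResponse GetResponse()
    {
        return new CustomWebResponse((HttpWebResponse)_request.GetResponse());
    }
}

public class CustomWebResponse : IHttpWebResponse
{
    private readonly HttpWebResponse _response;

    public CustomWebResponse(HttpWebResponse response)
    {
        _response = response;
    }

    public HttpStatusCode StatusCode { get { return _response.StatusCode; } }
    public WebHeaderCollection Headers { get { return _response.Headers; } }

    public void Dispose() { _response.Close(); }
}
```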

Just looking at the interfaces seems like a great way of understanding how they'll work. I expose only what I need at the moment (the Method property is a small step into the future, as I'll use it soon for getting page content, but let it be). The concrete implementations are also quite simple: all I did here was pass the original Request and Response into my custom ones so they can work on them (that's a kind of facade pattern, if I'm right).
At this point testing was easy: I created stubs for each of those interfaces, so I could fully control the results and also decouple this part from the internet connection. Here is the complete HeaderRetriever:
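A reconstruction of it, under all the assumptions above (the exception choice included):

```csharp
using System.Net;

public class HeaderRetriever
{
    private readonly IWebRequestCreator _requestCreator;

    public HeaderRetriever(IWebRequestCreator requestCreator)
    {
        _requestCreator = requestCreator;
    }

    public WebHeaderCollection Retrieve(string url)
    {
        var request = _requestCreator.Create(url);
        request.Method = "HEAD"; // headers only, no body gets downloaded

        using (var response = request.GetResponse())
        {
            // A bad status means I cannot work with this url at all.
            if (response.StatusCode != HttpStatusCode.OK)
                throw new WebException("Request failed with status " + response.StatusCode);

            return response.Headers;
        }
    }
}
```

This is the version that later gets refactored into the WebRetriever base class, as described in the posts above.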