#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2009
    Posts
    11
    Rep Power
    0

    Web caching policy and best bidding site?


    Hi group,
    I want to ask about web site caching policies.

    I have built a website in C from scratch, and have got to the point where I need a fuzzy search facility. I want to put a job for this on Rent-a-coder, Find-a-guru or script lance.
    The database contains a hierarchical set of categories, products and artists / makers.
    The search would need to search the product / artists names from the database. I do not need to search text in pre-written html pages here.


    It occurs to me I really need to sort out a policy on caching first. There would be little point in trying to do fuzzy searching through built in MySQL facilities for example, if that’s possible, only to end up with all or most of the database contents cached in main memory where it would be far faster to search it there. (my data will be fairly small for quite some time).

    I believe that caching web pages, semi or fully formed, rather than caching the raw relational data is more usual.
    Also I believe a simple expiry policy is often used rather than directly tracking dependencies between raw data and web pages but that doesn’t sound like an acceptable solution due to inconsistencies in the generated web pages.

    Does anyone want a discussion about caching policies or the pros/cons of the above three contractor websites?

    Are there any third party products I can use if I dispense with the caching and just run on the database? How much will MySQL do for me?

    If I use a cache and assume that eventually not all data will fit in the cache then I need to build some cache coherence logic, perhaps involving all the relational data having an indication of which cache pages are dependent on them.
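
    To make the dependency idea above concrete, here is a minimal sketch in C. It assumes each database row carries a bitmask of the cached pages built from it; the names `page_cache`, `row_deps`, `note_dependency` and `invalidate_dependents`, and the 64-page limit, are all hypothetical and only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PAGES 64   /* hypothetical limit: one bit per cached page */

/* Hypothetical cache directory: valid[i] says whether page i is usable. */
typedef struct {
    bool valid[MAX_PAGES];
} page_cache;

/* Hypothetical wrapper for a database row: bit i of deps is set when
 * cached page i was generated from this row. */
typedef struct {
    int      id;
    uint64_t deps;
} row_deps;

/* Record that cached page `page` was generated from `row`. */
void note_dependency(row_deps *row, int page)
{
    assert(page >= 0 && page < MAX_PAGES);
    row->deps |= (uint64_t)1 << page;
}

/* Call after `row` changes in the database: every page built from it is
 * marked stale. Returns how many valid pages were invalidated. */
int invalidate_dependents(page_cache *cache, row_deps *row)
{
    int count = 0;
    for (int page = 0; page < MAX_PAGES; page++) {
        if (row->deps & ((uint64_t)1 << page)) {
            if (cache->valid[page])
                count++;
            cache->valid[page] = false;
        }
    }
    row->deps = 0;  /* pages register again when they are regenerated */
    return count;
}
```

    A stale page would be rebuilt on its next request and would call `note_dependency` again for each row it reads.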


    Very best wishes,
    David
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jul 2005
    Location
    Bay Area, California
    Posts
    841
    Rep Power
    1682
    I tend to design with capabilities in mind, but don't add them until necessary. That way my implementation is flexible enough to mature elegantly, but I haven't wasted any time on wrong assumptions. Your assumption here is that you're going to have enough requests that the load will impact the database / response time. This sounds premature.

    In these types of scenarios, you would perform a query to retrieve the item ids that match a search and then retrieve a set of items as the user paginates through them. Since the data is fairly static, I would expect MySQL's caching to be sufficient if you enable query caching. When you begin seeing too much load, you can either add read-only replicas or use memcached for query/item caching. In your scenario you don't need local caching, so a global remote cache will maintain cache coherency.
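
    As a rough sketch of the "fetch the matching ids once, then page through them" idea, assuming the ids are held in an array, a helper like this would work out which slice belongs to a page. The name `page_bounds` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Computes the half-open index range [*first, *last) into an array of
 * `total` matched ids for the 0-based `page`, with `per_page` ids per page.
 * Returns the number of ids on that page (0 if the page is past the end). */
size_t page_bounds(size_t total, size_t per_page, size_t page,
                   size_t *first, size_t *last)
{
    assert(per_page > 0);
    /* Clamp before multiplying so page * per_page cannot overflow. */
    size_t start = (page > total / per_page) ? total : page * per_page;
    size_t end = (total - start < per_page) ? total : start + per_page;
    *first = start;
    *last = end;
    return end - start;
}
```

    Only the items whose ids fall in that range then need to be fetched for the page being shown.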

    The most important aspect to build in up-front is monitoring and exposing statistics. By understanding the data you can perform capacity planning and mature the system with an understanding of what the common cases are.
    Core design principles when developing software systems.
    See my open-source project as an example of professional code.
    ---
    The opinions expressed do not represent those of my employer.
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2009
    Posts
    11
    Rep Power
    0
    Hi NovaX,
    Many thanks for your response. That's much appreciated.

    I agree with not building logic before there is reason to. I’m at the stage of considering the implications on my code of later introduction of caching logic and considering the caching options. This is about the right time in the project for caching considerations… 

    Thanks for the advice on MySQL caching, local / remote. That’s something for me to read up on.

    I’m familiar with memcached but wasn’t convinced by the idea of cluster in-memory storage. I would have thought local hard drive storage would be faster.

    Here's a quick brain dump of my thinking in case it’s of any interest to anyone.



    There are two main options for caching – cache web pages, formed from SQL queries, or cache raw relational data and use it to generate pages for each request.

    Caching web pages is straightforward for most pages. They can be indexed by one or more URL parameters. The fully or partly formed html is cached. If partly formed, the places where custom html needs to be inserted for each request are marked.
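
    A minimal sketch of filling one marked slot in partly formed cached html, assuming the slot is marked with an html comment such as `<!--USER-->`. The function name `fill_slot` and the marker format are hypothetical, not something the site already has.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copies `cached` into `out`, replacing the first occurrence of `marker`
 * with `value`. Returns 0 on success, -1 if the marker is missing or the
 * result does not fit in `out`. */
int fill_slot(const char *cached, const char *marker, const char *value,
              char *out, size_t outsz)
{
    assert(cached && marker && value && out);

    const char *hit = strstr(cached, marker);
    if (hit == NULL)
        return -1;

    size_t head = (size_t)(hit - cached);      /* bytes before the marker */
    const char *tail = hit + strlen(marker);   /* text after the marker */
    int n = snprintf(out, outsz, "%.*s%s%s", (int)head, cached, value, tail);
    return (n >= 0 && (size_t)n < outsz) ? 0 : -1;
}
```

    For example, filling `<!--USER-->` in `<p>Hello, <!--USER-->!</p>` with `David` gives `<p>Hello, David!</p>`.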

    Caching serves two purposes – pre-processed consolidation of data needed for a webpage in one block of conveniently available data, and/or rapid localised data access.

    Cached data can be stored in local shared memory, remotely in clustering caching software like memcached or in the database. In the database and to a lesser extent in memcached (?) you do not get the advantage of rapid local access. Local storage of raw data doesn’t give the consolidation.

    For most pages the web caching approach is easier and less disruptive. The cached pages can be associated with one or more URL parameters.

    In my own web site there are two interesting cases that are not so straightforward.

    First, a ‘people’ screen (e.g. artists or manufacturers) which has a left hand side (LHS) panel with the list of people, an A-Z index across the top and Next | Prev at the bottom. By combining the A-Z index with the Next | Prev controls, the people list can be made to start from any arbitrary point. Also a marker (*) marks the selected person (selected by clicking). The middle panel lists items associated with that person. Multiple people can have the same name. The contents of the middle panel are therefore independent of the LHS.

    For these two reasons, the pages cannot efficiently be cached in their entirety using the web-page caching approach. Instead we cache the LHS and middle panel contents separately. The LHS list would be cached as a full list of names, not in html format. When it is used to generate a page, the required section of the data is printed, adding the '*' marker to the selected person. The middle panel is stored as html.
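
    A rough sketch of printing one window of the cached name list with the '*' marker, assuming the cache holds a plain array of names. The function `render_people` and the `<li>` output format are hypothetical. Selection is by index rather than by name because names can repeat.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Writes up to `count` names starting at index `start` into `out` as <li>
 * items, prefixing the entry at index `selected` with '*'.
 * Returns the number of entries written, or -1 if `out` is too small. */
int render_people(const char *const *names, size_t total,
                  size_t start, size_t count, size_t selected,
                  char *out, size_t outsz)
{
    size_t used = 0;
    int written = 0;

    assert(outsz > 0);
    out[0] = '\0';
    for (size_t i = start; i < total && i - start < count; i++) {
        int n = snprintf(out + used, outsz - used, "<li>%s%s</li>\n",
                         i == selected ? "*" : "", names[i]);
        if (n < 0 || (size_t)n >= outsz - used)
            return -1;  /* output was truncated */
        used += (size_t)n;
        written++;
    }
    return written;
}
```

    The A-Z index and the Next | Prev controls would then only change the `start` value passed in; the cached list itself stays the same.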

    The structure of the existing code is disrupted a little. The code sections that produce the LHS and middle panels must both be made conditional so they run only if their output is not already in the cache. The html template itself can be cached separately (most likely), compiled into the code itself, or merged in with the middle panel html in the cache.

    The html template insertion logic is extended to allow printed data to be pulled back and inserted into the cache – either the full html page or specific sections (e.g. middle panel).

    If we used local caching of raw data (in relational or pointer form) then we would probably want to retrieve the data from the in-memory structures and re-present them to look like the SQL retrieved results table so the existing output loops would not be disturbed.

    The other interesting case is the search.
    To search rapidly with cached data, in shared memory or through memcached, you would store the raw data, presumably in some pre-processed form to facilitate fuzzy searching.
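
    As one possible way to do a fuzzy match over names held in memory, here is a sketch using Levenshtein edit distance. This is just one common technique, not something decided in this thread; `edit_distance`, `fuzzy_find` and the `MAX_NAME` limit are hypothetical names for illustration.

```c
#include <assert.h>
#include <string.h>

#define MAX_NAME 64  /* hypothetical upper bound on name length */

/* Levenshtein edit distance between a and b (insert, delete and substitute
 * each cost 1). Both strings must be shorter than MAX_NAME. */
int edit_distance(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    int row[MAX_NAME];

    assert(la < MAX_NAME && lb < MAX_NAME);
    for (size_t j = 0; j <= lb; j++)
        row[j] = (int)j;
    for (size_t i = 1; i <= la; i++) {
        int diag = row[0];              /* D[i-1][j-1] */
        row[0] = (int)i;
        for (size_t j = 1; j <= lb; j++) {
            int up = row[j];            /* D[i-1][j] */
            int best = diag + (a[i - 1] == b[j - 1] ? 0 : 1);
            if (up + 1 < best)
                best = up + 1;
            if (row[j - 1] + 1 < best)
                best = row[j - 1] + 1;
            row[j] = best;
            diag = up;
        }
    }
    return row[lb];
}

/* Returns the index of the first name within max_dist edits of query,
 * or -1 if none qualifies. */
int fuzzy_find(const char *const *names, int n, const char *query, int max_dist)
{
    for (int i = 0; i < n; i++)
        if (edit_distance(names[i], query) <= max_dist)
            return i;
    return -1;
}
```

    A linear scan like this is only reasonable while the data is small, which matches the assumption in the first post. The comparison here is case sensitive.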
    To help with fast output of results the same stored data may as well be used.
    If the search does take place in the database, blocks of html for product name / description / people etc could still be cached locally to allow fast output.
    A local memory search on people names needs to be referenced back to their products so the data needs to be connected in memory (relational or pointers).

    If search logic is implemented with MySQL queries then a certain amount of reimplementation is needed when a raw data cache searching technique is added. Having this backup available if we want to turn off caching during debugging could be useful mind.

    Temporary data like search results and session data that will be written frequently can be write cached.

    I won’t go into detail about performance pros/cons of web-page caching versus raw data caching, implications of limited cache space and mechanics of cache coherence / update policy… because I need to give that more thought.
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jul 2005
    Location
    Bay Area, California
    Posts
    841
    Rep Power
    1682
    A hard drive has moving parts and thrashing can reduce its throughput. Often benchmarks don't account for file I/O caching, which uses local memory, so valid benchmarks are hard to come by. The primary benefit of disk is that it's cheaper and persistent, allowing for the potential of cache reuse across restarts. The advent of SSDs may diminish some of the physical limitations of spindle drives. The main reason to use this approach is cost, imho.

    The benefit of memcached in your case is primarily that it's fast enough, scalable, well supported, and global. If you have local caching then you need to determine a valid cache coherency protocol, which can be tricky for complex applications. The common approaches are TTL, an invalidation message, scope (request/session), or versioning. I personally prefer scopes and versioning; they are the more complex approaches to put in place, but they offer nice advantages.
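
    To illustrate two of those approaches, here is a minimal sketch of a TTL check and a version check on a cache entry. The `cache_entry` struct and both function names are hypothetical, and the current version number is assumed to come from wherever the source data is kept.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical metadata stored alongside each cached value. */
typedef struct {
    time_t   stored_at;   /* when the entry was cached */
    unsigned version;     /* version of the source data it was built from */
} cache_entry;

/* TTL: the entry is usable while its age is under ttl_seconds. */
bool fresh_by_ttl(const cache_entry *e, time_t now, double ttl_seconds)
{
    assert(e != NULL);
    return difftime(now, e->stored_at) < ttl_seconds;
}

/* Versioning: the entry is usable only while the source data still has
 * the version it was built from. */
bool fresh_by_version(const cache_entry *e, unsigned current_version)
{
    assert(e != NULL);
    return e->version == current_version;
}
```

    With TTL alone a reader can see stale data until the entry expires; with versioning a change to the data makes the old entry unusable immediately, at the cost of having to look up the current version.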

    To pick the best strategy, you need to understand the real constraints of the problem and what you can get away with. Often an expiration is good enough for search results since the indexes are static and updated by a batch process. If so, then local memory/disk is cheap and efficient enough for your scale. If the cached data may affect transactions, like a business policy, then scoping or versioning is preferred so that the data doesn't change unexpectedly. The best approach depends on the problem, but often the simplest valid approach is good enough.

    From what I've seen before, your implementation details sound like you're on the right track, but I won't dig into them unless you have a specific question.
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2009
    Posts
    11
    Rep Power
    0
    Many thanks for this. That's very helpful.
    Do you have any hints on what to google for to find caching policy information? 'web caching' throws up a lot of info about proxy server design, which is a whole different matter.
    I won't go any further with implementing anything right now. Knowing where I'm heading with it will be very helpful though.
    Very best wishes,
    David
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jul 2005
    Location
    Bay Area, California
    Posts
    841
    Rep Power
    1682
    What do you mean by "caching policy"? That might be the eviction policy, the coherence protocol, etc. It's too vague. There are a lot of interesting details in regards to caching, like the 4Cs of a cache miss and concurrent algorithms. Then there are web-specific issues such as page fragment caching, reverse proxies, etc. You need to be more specific...

    From my experience, I would focus on three things. The first is to take the simplest, cheapest, scalable approach for now until the application is mature enough that a more complex approach is warranted. The second is to design / implement for scale, which means avoiding bad I/O operations like complex SQL queries. The third is to adopt versioning early, because it's an untapped gold-mine of value (auditing, cache coherency, roll-backs, etc). If you do these, then you can mature the caching layer as the success of your application drives the need for better scaling (a great problem to have).
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2010
    Posts
    2
    Rep Power
    0
    Hi David, this is Nicole from Rent a Coder.

    I can point out a few important differences between our service and services like scriptlance.

    Moneyback Guarantee:

    Scriptlance workers can't guarantee a pay-for-deliverable project with a forfeitable deposit. In addition, Scriptlance doesn't refund you if your worker misses a required status report the way Rentacoder does.

    Unlike other services, we guarantee your pay-for-deliverable projects with free arbitration, refund, and in some cases, an Expert Guarantee.

    Disputes/Arbitration:

    Unfortunately, 10-20% of projects fail (and on some sites this # is higher). If your worker is a bum, it's important the site offers escrowing and arbitration so you are guaranteed to get your money back. However, some sites charge so much for arbitration or make it so time consuming that it becomes impractical.

    Scriptlance doesn’t guarantee that if you complete the contract, you’ll be paid the full amount, so if the buyer doesn’t want to pay you, you can end up doing the work for free.

    At Rentacoder, we offer arbitration on all projects free of charge and we test your deliverables to make sure they meet requirements. We also prevent abusive buyers from stalling an arbitration's start. In fact, 45% of our arbitrations are completed under a day and 75% of them are completed under a week. Even more, we show the public how our arbitrators make their decisions.

    Verified Work Performance:

    Most of these types of sites let you pay a worker you have employed before by the hour, which is the most convenient and cheapest way. However, Scriptlance doesn't verify the worker's timecard is accurate. On Rent a Coder, workers must punch in and out of a timeclock, and you can see a continuous record of their webcam and desktop, so you know the time is accurate.

    There are other differences as well. I invite you to compare the 7 major services through our site to learn even more.

    If you have any questions, please let me know. You can also call in to talk to a facilitator 7 days a week, or email us.

    Nicole
