[python] Neopets Web scraper



randysavage
06-17-2015, 05:38 PM
Hello. I am having trouble writing a web scraper for Neopets. I would like a program that takes a list of account names, looks up each user's profile (their "user lookup"), and collects the account's avatar, stamp, CC, site theme, and kad counts. These stats could then be saved to a CSV file or something similar.

I only have experience with purely math-based programming (MATLAB and Mathematica) and I am struggling with this task.

Neopets doesn't put the avatar count (or any of these stats) in an element with a specific class, so I don't know how to find it. I have been trying to use BeautifulSoup to search for the strings that matter:

<b>Site Themes:</b><br />
## </td>
Is there a way to select the ## from that statement?

I have also considered searching for the image that comes before the counter:

<img src="[Only registered and activated users can see links]" width="75" height="50" alt="" border="0">
and then somehow selecting the text that comes, say, 15 or 16 characters after that image.

As you can see, I am pretty lost, so any help is appreciated. Thanks.

Zachafer
06-17-2015, 07:24 PM
I would personally use regex (regular expressions)... but that's a fairly advanced kind of string parsing.

Here's another way to look at it:
Parse the string between <b>Site Themes:</b><br /> and </td>. Then trim the whitespace (.strip()) and you have the ## you are looking for!

Check out this Stack Overflow post about parsing between two strings in Python: [Only registered and activated users can see links]
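
To make both suggestions concrete, here is a minimal sketch of the "between two strings" approach next to the regex one. The function names text_between and stat_by_regex are made up for illustration, and the sample HTML (including the value 12) is invented to match the snippet in the first post; the real lookup page may use different spacing or markup, so treat the patterns as a starting point.

```python
import re


def text_between(html, start, end):
    """Return the stripped text between the first `start` marker and the
    next `end` marker after it, or None if either marker is missing."""
    i = html.find(start)
    if i == -1:
        return None
    i += len(start)
    j = html.find(end, i)
    if j == -1:
        return None
    return html[i:j].strip()


def stat_by_regex(html, label):
    """Return the integer that follows '<b>label:</b><br />', or None.

    Assumes the counter is plain digits right after the <br /> tag.
    """
    pattern = r"<b>" + re.escape(label) + r":</b>\s*<br\s*/?>\s*(\d+)"
    match = re.search(pattern, html)
    return int(match.group(1)) if match else None


# Invented sample shaped like the snippet in the first post.
sample = "<td><b>Site Themes:</b><br />\n 12 </td>"

print(text_between(sample, "<b>Site Themes:</b><br />", "</td>"))
print(stat_by_regex(sample, "Site Themes"))
```

The string version returns text, so wrap it in int() if you need a number. The regex version tolerates small differences in whitespace and only matches when digits are actually present.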

randysavage
07-05-2015, 10:30 PM
I ended up giving up on this for a while, but then I read Ghost's thread here ([Only registered and activated users can see links]). Seeing how he was able to extract the numbers motivated me to continue.

I then decided to switch from Python to VB.NET so that I could write a GUI for it more easily. I have since "completed" the program, but it is still really rough and breaks quite frequently. I am using the login and [Only registered and activated users can see links] from DarkByte's tutorial here ([Only registered and activated users can see links](Basic-login)). Every now and then the program isn't able to collect the HTML of the page as a string, and that messes up the stats I was trying to collect. I assume this is because the web page didn't load properly within a certain time frame, but I am not sure. I could probably write a retry condition for when it fails, but I just haven't gotten around to it. If anyone wants to take a look at my code, feel free to PM me.
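
For reference, the retry condition mentioned above could look roughly like the sketch below. It is written in Python (the language this thread started in) rather than VB.NET, but the loop carries over directly. Everything here is hypothetical: fetch stands for whatever function downloads the page, and is_complete is any check that the expected text actually arrived.

```python
import time


def fetch_with_retry(fetch, is_complete, attempts=3, delay=1.0):
    """Call fetch() until is_complete(result) is true or attempts run out.

    `fetch` and `is_complete` are placeholders supplied by the caller;
    nothing here is specific to Neopets or to any HTTP library.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            html = fetch()
            if is_complete(html):
                return html
            last_error = ValueError("page incomplete")
        except OSError as exc:  # network-level failures
            last_error = exc
        if attempt < attempts:
            time.sleep(delay)  # wait before trying again
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

A simple completeness check is to confirm that one of the labels being scraped, such as "Site Themes:", is present in the returned HTML before parsing it.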

I do have another problem that I can't solve. I want to find a way to collect how many Battledome wins an account has. The information is shown on [Only registered and activated users can see links], but the value doesn't appear in the HTML of the page that I am retrieving. Is there a way to extract that value from the web page as an integer? Any help would be appreciated.

Thank you,
Randy