I can’t get Google to crawl my site

I apologize, but yesterday I exceeded the message quota allowed for new users.

I see what Google says. Weird, though: my Search Console says the exact opposite (screenshot: passaporto-di-sanita-1630, hosted at ImgBB).

About the sitemap: /sitemap1a.xml

About robots.txt:
User-agent: *
Allow: /
Disallow: /administrator/
Disallow: /cache/
Disallow: /cli/
Disallow: /components/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /logs/
Disallow: /modules/
Disallow: /plugins/
Disallow: /tmp/
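As a sanity check, rules like the ones above can be tested with Python's standard-library robots.txt parser. This is just a sketch: the domain is a placeholder, and only a subset of the Disallow lines is included.

```python
from urllib.robotparser import RobotFileParser

# Parse a subset of the robots.txt rules quoted above and ask whether
# a crawler may fetch a given URL. "example.com" is a placeholder.
rules = [
    "User-agent: *",
    "Disallow: /administrator/",
    "Disallow: /cache/",
    "Disallow: /tmp/",
]

rp = RobotFileParser()
rp.parse(rules)

# Ordinary pages are allowed; the listed directories are not.
print(rp.can_fetch("Googlebot", "https://example.com/index.php"))        # True
print(rp.can_fetch("Googlebot", "https://example.com/administrator/x"))  # False
```

Note that the `Allow: /` line is redundant: anything not matched by a `Disallow` rule is crawlable by default, so nothing in this robots.txt blocks Google from the public pages.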


You should always start with a sitemap generator built into your website software. The point of a sitemap is to tell search engines about parts of your site’s structure that they can’t figure out on their own. Your website software knows that structure; external crawlers do not. And if an external crawler could discover everything by crawling, you wouldn’t need a sitemap at all, because Google could just crawl your site itself and learn the same things.

No, it doesn’t. A good crawler, like those used by search engines, behaves like a regular web browser and will have no problems with this. Only dumb scrapers, like those used by sitemap generators, have this issue.



I changed some things in your sitemap to make it valid,

so download it and re-upload it:

sitemap.xml (4.8 KB)

and then visit this link, or do it via GSC (Sitemaps section).
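For reference, a minimal valid sitemap only needs the standard urlset namespace and one `<url>`/`<loc>` entry per page; the URL and date below are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2021-01-01</lastmod>
  </url>
</urlset>
```

Common things that make a generated sitemap invalid are an unescaped `&` inside a `<loc>` URL, a wrong or missing namespace, or a malformed `<lastmod>` date.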

Here is what “Excluded” actually means:

Excluded The page is not indexed, but we think that was your intention. (For example, you might have deliberately excluded it by a noindex directive, or it might be a duplicate of a canonical page that we’ve already indexed on your site.)

src - Page Indexing report - Search Console Help

I don’t know why this is happening!


Thank you. The reason I didn’t start with an internal sitemap is that the plugin I normally use only has a commercial (paid) version, and the customer said he didn’t want it. So I was forced to use external tools, and in fact every online crawler failed.

The question of whether the sitemap can be generated with external tools is only incidental, though. My question is, and remains: why, after almost three weeks, has the Google crawler (certainly an excellent crawler) still not managed to do its usual work on the site?

Is it just “late”? I have never experienced delays of weeks, not even on subdomains. Hence my question: are there technical reasons on the domain/hosting side that prevent crawling by Google?

Thanks so much. I immediately uploaded your modified sitemap; sorry mine wasn’t valid. It was generated with a sitemap plugin, so I don’t know how that could happen.

The link you provided says all is OK: “Sitemap Notification Received - Your Sitemap has been successfully added”.

Anyway, “my” Search Console has gone mad :see_no_evil:: error 404 on sitemap.xml (screenshot: img3, hosted at ImgBB).

Sometimes it says “Couldn’t fetch” on first upload, but you just refresh the page to get the “Success” message. Now it doesn’t… and I have never seen a 404 while the file is in place. Especially since I clicked on “OPEN SITEMAP” and it showed the sitemap properly.

Yes, I have read what “Excluded” means: I have no “noindex” directive at all, let alone duplicates of already-indexed pages… there being zero indexed pages at the moment :cry:

Well, I took too much of your time. Let’s give Google a few days to see if it notices the new sitemap, and if it can do anything with it.

For now, thanks very much to everyone.


One of the reasons may be that your pages are invalid (all of them).

Up at the top you have a piece of code that doesn’t belong there.

That part should be in the head section.


As far as my knowledge goes, I do not think that InfinityFree delays or prevents Google’s crawler. Have you already added the domain to Google Search Console and requested that it index your site? Google will prioritize more popular sites, as well as sites that requested indexing before yours, so it may take some time before the sitemap is noticed.

We are here to help you, you haven’t wasted our time at all :slight_smile:


God. Where did that come from? I can’t believe it… I slipped on such a banana peel.

This is embarrassing… I was contemplating grand theories and failed to notice such a trivial thing. Terrible.

I can only apologize twice over… I’m going to fix the mistake, and then take a vacation… it’s about time.

Sorry again, thanks for everything


GSC often has problems even when everything is fine on your part

and this has been going on for quite some time (since they introduced the new version)

Sometimes it just reports “Couldn’t fetch”, which actually means that the bot hasn’t come to index yet and plans to come in the next two weeks or a few months.

Sometimes, if your domain is quite fresh, they don’t have DNS data yet,
so it may also throw out problems for that reason.

It might also help if you insert this piece of code into the head section
(of each page you want the bots to index):

<meta name="robots" content="index,follow">
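If you want to confirm that the tag actually ends up in the served HTML (templates sometimes drop it), a small check with Python's standard-library HTML parser could look like this. The sample markup is made up; in practice you would feed it the downloaded page source.

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collects the content of any <meta name="robots"> tags in a page."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            self.directives.append(attrs.get("content", ""))

# Stand-in for a fetched page; replace with the real HTML source.
html = """<html><head>
<meta name="robots" content="index,follow">
</head><body></body></html>"""

finder = RobotsMetaFinder()
finder.feed(html)
print(finder.directives)  # ['index,follow']
```

An empty list would mean the directive never made it into the markup, which is worth checking before blaming the crawler.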


By the way, this forum supports images, so just drag and drop (no need for ImgBB).


Interesting, that’s something I wasn’t aware of. Good to know for the future.

OK, the page is fixed and the meta robots directive is in place now.

Yes, I did, but as a new user I have several limitations: I was not allowed to insert images (at least until a couple of hours ago), hence ImgBB. I am also prohibited from inserting more than one link per post.

However, now that we have fixed some things (ahem), let’s wait and see if Google wakes up. I will let you know within a few days. Thank you.


You should be able to now.

Please let us know if this works!


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.